Advancing General-Purpose Reasoning Models with Modular Gradient Surgery

Published 2 Feb 2026 in cs.CL, cs.AI, and cs.LG | (2602.02301v1)

Abstract: Reinforcement learning (RL) has played a central role in recent advances in large reasoning models (LRMs), yielding strong gains in verifiable and open-ended reasoning. However, training a single general-purpose LRM across diverse domains remains challenging due to pronounced domain heterogeneity. Through a systematic study of two widely used strategies, Sequential RL and Mixed RL, we find that both incur substantial cross-domain interference at the behavioral and gradient levels, resulting in limited overall gains. To address these challenges, we introduce Modular Gradient Surgery (MGS), which resolves gradient conflicts at the module level within the transformer. When applied to Llama and Qwen models, MGS achieves average improvements of 4.3 (16.6%) and 4.5 (11.1%) points, respectively, over standard multi-task RL across three representative domains (math, general chat, and instruction following). Further analysis demonstrates that MGS remains effective under prolonged training. Overall, our study clarifies the sources of interference in multi-domain RL and presents an effective solution for training general-purpose LRMs.

Summary

  • The paper introduces Modular Gradient Surgery, a novel method that locally resolves gradient conflicts in transformers for multi-domain RL.
  • It demonstrates superior performance improvements across math, chat, and instruction tasks compared to conventional training methods.
  • Empirical results reveal consistent gains with up to 4.5 point improvements, emphasizing the method’s scalability and module-specific benefits.

Modular Gradient Surgery for Multi-Domain Reasoning Model Optimization

Introduction

Recent advances in Large Reasoning Models (LRMs) have been driven by Reinforcement Learning (RL) techniques, particularly RL with Verifiable Rewards (RLVR), which has substantially improved open-ended and verifiable reasoning abilities across domains such as mathematics, code synthesis, chat, and instruction following. However, extending these gains to a general-purpose LRM that consistently performs well on heterogeneous domains remains nontrivial due to domain-specific reward structures and conflicting optimization objectives. This paper systematically analyzes sequential and mixed RL training paradigms, identifies quantifiable forms of cross-domain interference (mode interference, catastrophic forgetting, and gradient conflicts), and introduces Modular Gradient Surgery (MGS), a gradient-level conflict resolution algorithm that operates at the modular granularity of transformers.

Figure 1: Modular Gradient Surgery (MGS) outperforms naive multi-domain RL by resolving local gradient conflicts, achieving state-of-the-art balanced capability across math, chat, and instruction tasks.

Experimental Framework and Failure Modes in Multi-Domain RL

Experimental Setup

Experiments utilize Qwen-2.5-7B and Llama-3.1-8B backbones over three representative domains: mathematics, open-ended chat, and instruction following. Each domain incorporates domain-specific RLVR or model-based reward signals, and evaluation is conducted across in-domain and generalization benchmarks.

Sequential RL Training: Forgetting and Rigidity

Sequential RL, where domains are optimized in succession, is shown to suffer from two detrimental behaviors:

  • Forgetting: Training on a second domain degrades previously acquired competency (catastrophic interference).
  • Rigidity: Prior optimization on one domain impedes learning efficiency on subsequent tasks, often due to reduced policy entropy inhibiting exploration.

Figure 2: Sequential RL demonstrates severe forgetting and rigidity effects, limiting cross-domain transfer.

Figure 3: Entropy dynamics illustrate rigidity, with Math-first training constraining response diversity in chat tasks.

Empirical results show pronounced asymmetric degradation: Chat→Math preserves math skill better than Math→Chat preserves chat ability. The order in which domains are trained shapes the resulting Pareto frontier, and entropy analysis indicates that initial training on high-entropy tasks best supports subsequent structured learning.
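The entropy argument can be made concrete with a small sketch. The snippet below is an illustration only, not the paper's measurement code: it computes the mean Shannon entropy of next-token distributions, and the two example distributions (`peaked`, `diverse`) are made-up values chosen to show how a sharply peaked policy scores lower than a diverse one.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_token_entropy(step_distributions):
    """Average entropy over the decoding steps of a response."""
    return sum(entropy(p) for p in step_distributions) / len(step_distributions)

# Hypothetical distributions over a 4-token vocabulary (illustrative values).
peaked = [[0.97, 0.01, 0.01, 0.01]] * 3   # low-entropy, "rigid" policy
diverse = [[0.25, 0.25, 0.25, 0.25]] * 3  # high-entropy, exploratory policy

h_peaked = mean_token_entropy(peaked)
h_diverse = mean_token_entropy(diverse)
print(f"peaked: {h_peaked:.3f} nats, diverse: {h_diverse:.3f} nats")
```

A policy whose entropy has collapsed in this sense samples fewer distinct responses, which is one plausible reading of why a Math-first model explores less during later chat training.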

Mixed RL Training: Gradient Conflicts and Performance Trade-offs

Mixed RL integrates batches from multiple domains, applying gradients from all tasks in each step, which induces persistent gradient conflicts.

Figure 4: Data mixing ratios reveal Math performance scaling monotonically with math data, but chat performance exhibits complex non-monotonic correlation.

Figure 5: Gradient cosine similarity and norm ratios highlight pronounced conflicts between math and chat gradients during training.

Adjusting mixing ratios yields limited improvements. Even heavily task-skewed batches do not match single-domain expert models, substantiating that mere batch proportionality cannot eliminate interference. The negative cosine similarity between gradients for different domains confirms destructive gradient interactions and motivates algorithmic interventions.
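The two diagnostics named in Figure 5 are straightforward to compute once per-domain gradients are flattened into vectors. The following is a minimal sketch in plain Python; the gradient values `g_math` and `g_chat` are invented for illustration and do not come from the paper.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v, eps=1e-12):
    """Cosine of the angle between two gradients; negative means conflict."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)) + eps)

def norm_ratio(u, v, eps=1e-12):
    """Ratio of L2 norms; far from 1 means one domain dominates the update."""
    return math.sqrt(dot(u, u)) / (math.sqrt(dot(v, v)) + eps)

# Hypothetical flattened gradients for two domains (illustrative values).
g_math = [1.0, 0.0, -2.0]
g_chat = [-1.0, 1.0, 1.0]

cos = cosine_similarity(g_math, g_chat)
ratio = norm_ratio(g_math, g_chat)
print(f"cosine similarity: {cos:.3f}, norm ratio: {ratio:.3f}")
```

In this toy case the cosine similarity is negative, so a step along the summed gradient partially undoes progress on one of the two domains.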

Modular Gradient Surgery: Algorithmic Advances

Motivation and Mechanism

Transformer models are highly modular; distinct layers and blocks often specialize in different computational functions. Standard global gradient manipulation may be overly conservative or ineffective due to non-uniform conflict localization. MGS builds upon PCGrad (Yu et al., 2020) but applies projection operations independently within each detected module (attention, MLP, layer normalization).
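For reference, the PCGrad projection rule, restricted to a single module, can be written as follows. The per-module superscript $(m)$ is notation introduced here to match the description above, not the paper's own; $g_i^{(m)}$ denotes the gradient of task $i$ with respect to the parameters of module $m$.

```latex
\[
g_i^{(m)} \;\leftarrow\; g_i^{(m)}
  - \frac{\langle g_i^{(m)},\, g_j^{(m)} \rangle}{\lVert g_j^{(m)} \rVert^{2}}\, g_j^{(m)}
\qquad \text{if } \langle g_i^{(m)},\, g_j^{(m)} \rangle < 0 ,
\]
```

Modules in which the inner product is non-negative are left untouched, so their task gradients are simply summed.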

By partitioning model parameters and resolving conflicts locally where negative gradient interactions are detected, MGS preserves beneficial updates while suppressing incompatible directions only in affected subspaces.

Figure 6: Modular gradient cosine similarity analysis reveals module-specific conflict patterns, underscoring the necessity of local interventions.
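The module-wise projection can be sketched in plain Python as below. This is a simplified illustration under stated assumptions, not the authors' implementation: gradients are represented as flat lists keyed by hypothetical module names (`attn`, `mlp`), tasks are visited in a fixed order (PCGrad itself randomizes the order), and all numeric values are invented.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def project_conflicts(grads):
    """PCGrad-style surgery on a list of per-task gradients for ONE module.

    Each task gradient is projected onto the normal plane of any other
    task gradient it conflicts with (negative inner product); the
    projected gradients are then summed.
    """
    projected = []
    for i, g_i in enumerate(grads):
        g = list(g_i)
        for j, g_j in enumerate(grads):
            if i == j:
                continue
            d = dot(g, g_j)
            sq = dot(g_j, g_j)
            if d < 0 and sq > 0:
                g = [a - (d / sq) * b for a, b in zip(g, g_j)]
        projected.append(g)
    return [sum(col) for col in zip(*projected)]

def modular_gradient_surgery(task_grads):
    """Apply the projection independently inside each module.

    task_grads maps task name -> {module name -> flat gradient list}.
    """
    tasks = list(task_grads)
    modules = list(task_grads[tasks[0]])
    return {
        m: project_conflicts([task_grads[t][m] for t in tasks])
        for m in modules
    }

# Hypothetical gradients: the tasks conflict in "attn" but agree in "mlp".
task_grads = {
    "math": {"attn": [1.0, 0.0], "mlp": [1.0, 1.0]},
    "chat": {"attn": [-1.0, 1.0], "mlp": [0.5, 2.0]},
}
update = modular_gradient_surgery(task_grads)
print(update)
```

In this toy example the concatenated gradients have a positive inner product overall (1.5), so a single global conflict check would apply no projection at all and the conflict inside `attn` would go unresolved. The module-wise version projects only within `attn` and leaves the `mlp` gradients as a plain sum.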

Empirical Validation

MGS consistently outperforms four baselines (model merging, advantage normalization, global gradient surgery (GGS), and naive mixing) across all evaluated multi-domain reasoning metrics.

Figure 7: MGS delivers the best averaged performance in both balanced math and chat scores, outperforming global surgery and naive approaches, even surpassing single-task baselines on chat.

Key numerical results include average improvements over standard multi-task RL of 4.3 points (16.6%) for Llama and 4.5 points (11.1%) for Qwen, with peak scores achieved in all principal domains. Notably, global gradient surgery is less effective than modular surgery: when conflicts exist only in specialized modules, a global projection unnecessarily constrains optimization of the entire parameter space.
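As a rough sanity check on how the absolute and relative gains reported in the abstract (4.3 points, 16.6%; 4.5 points, 11.1%) relate, one can back out the implied baseline averages. This assumes the percentages are relative improvements over the standard multi-task RL average, which the summary does not state explicitly; the symbols $b_{\text{Llama}}$ and $b_{\text{Qwen}}$ are introduced here for illustration.

```latex
\[
b_{\text{Llama}} \approx \frac{4.3}{0.166} \approx 25.9,
\qquad
b_{\text{Qwen}} \approx \frac{4.5}{0.111} \approx 40.5 .
\]
```

Under that assumption, the similar absolute gains correspond to a larger relative gain for Llama because its baseline average is lower.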

Scalability and Mechanistic Analysis

Multi-Task and Prolonged Training Scalability

The benefits of MGS grow as more domains are included and training budgets lengthen: gains remain robust in settings with three or more domains and scale positively with additional epochs.

Figure 8: Scaling training steps further enhances MGS performance while sequential methods plateau or regress due to compounding interference.

Ablation and Module Sensitivity

Ablation studies reveal differential module importance (Figure 9).

Figure 9: Exclusion of LayerNorm, Attention, or MLP modules from MGS impacts capabilities variably; LayerNorm is critical for both math and chat, while MLP exclusion degrades chat disproportionately.

Applying MGS only to top-norm modules affects chat but is insufficient for math, which requires distributed coherence across model modules.
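One way to picture a top-norm restriction is to rank modules by gradient norm and apply surgery only to the highest-ranked ones. The sketch below is hypothetical: the function name `top_norm_modules`, the choice of the L2 norm, the module names, and the gradient values are all assumptions for illustration, and the summary does not specify how the paper computes or thresholds these norms.

```python
import math

def top_norm_modules(module_grads, k):
    """Return the names of the k modules with the largest gradient L2 norm.

    module_grads maps module name -> flat gradient list (an assumed layout).
    """
    norms = {
        name: math.sqrt(sum(x * x for x in grad))
        for name, grad in module_grads.items()
    }
    return sorted(norms, key=norms.get, reverse=True)[:k]

# Hypothetical per-module gradients (illustrative values only).
module_grads = {
    "layer0.attn": [3.0, 4.0],   # norm 5.0
    "layer0.mlp": [1.0, 0.0],    # norm 1.0
    "layer0.ln": [0.0, 0.5],     # norm 0.5
}
selected = top_norm_modules(module_grads, k=2)
print(selected)
```

Modules outside the selected set would receive the plain summed gradient, which is consistent with the observation above that such a restriction can leave distributed conflicts unresolved.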

Practical and Theoretical Implications

MGS establishes a principled framework for efficient multi-domain RL post-training of LRMs, leveraging the modular architecture of transformers for fine-grained conflict resolution. The approach eliminates the necessity for complex domain scheduling or reward engineering pipelines, enabling parallelized, compute-efficient multi-task training through FSDP and similar high-performance infrastructure. The clarity of modular interference loci expands interpretability of RL dynamics, informing future research on targeted adaptation, unlearning, LLM safety, and agentic system construction.

Future Directions

Potential extensions include exploring alternative heuristics for module selection (beyond gradient norm magnitude), refining task-to-module attribution, and adapting modular gradient manipulation for broader settings (vision-language, multimodal, supervised, or offline RL). Integrating MGS with advanced reward modeling and distillation protocols could further propel general-purpose reasoning models towards higher reliability and adaptability.

Conclusion

This work rigorously diagnoses multi-domain RL interference in language reasoning models and introduces Modular Gradient Surgery—a scalable, transformer-aware gradient conflict resolution method. MGS surpasses global projection and conventional mixing strategies, advancing the empirical and conceptual frontier for general-purpose reasoning RL post-training. These results motivate continued development of modular optimization schemas for dynamically heterogeneous learning objectives.
