- The paper introduces a mid-training phase that leverages self-generated diverse solutions to enhance RL performance in language models.
- It employs heuristic-driven data generation with few-shot exemplars and rule-based filtering to curate multiple valid solution pathways per problem.
- Empirical results reveal significant gains in pass@k scores and improved multi-heuristic composition across math, code, and narrative reasoning tasks.
Mid-Training with Self-Generated Data Improves RL in LLMs: A Technical Analysis
Motivation and Problem Statement
Reinforcement learning (RL) is increasingly adopted for eliciting advanced reasoning behaviors in large language models (LLMs), particularly in mathematical problem solving and related complex domains. Despite these successes, RL often only sharpens the modes already present in the base model, has limited capacity to induce genuinely new abilities, and can even degrade sample efficiency in certain settings. The diversity and structure of the data a model sees before RL (its "priors") critically influence how effective RL post-training ultimately is.
The central question addressed is: Can systematically increasing the diversity of problem-solving strategies within the mid-training phase amplify RL’s impact and facilitate the learning of richer, more compositional reasoning behaviors?
Methodology: Heuristic-Guided Self-Generated Mid-Training
The approach introduces a mid-training phase, inserted between standard supervised fine-tuning (SFT) and RL, leveraging self-generated, diverse solution trajectories. The core innovations are:
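Heuristic-Driven Data Generation: Drawing on Pólya’s taxonomy of mathematical problem-solving heuristics, the model is conditioned via prompts on explicit, interpretable heuristics, each accompanied by few-shot exemplars. For every problem/heuristic pair, the model generates candidate solutions, and rule-based filtering retains only the valid ones, yielding multiple valid solution pathways per problem.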
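The generate-and-filter loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's pipeline: the heuristic names, exemplar text, prompt wording, the `Answer:` convention, and the exact-match filter are all hypothetical, and `generate` stands in for whatever model call is actually used.

```python
from typing import Callable, Dict

# Hypothetical heuristics and exemplars; the paper's actual prompts are not reproduced here.
HEURISTICS: Dict[str, str] = {
    "work_backwards": "Example: start from the target value and undo each step.",
    "solve_simpler_case": "Example: try small cases first, then generalize.",
}


def build_prompt(problem: str, heuristic: str, exemplars: str) -> str:
    """Condition the model on one named heuristic plus its few-shot exemplars."""
    return (
        f"Heuristic: {heuristic}\n{exemplars}\n\n"
        f"Problem: {problem}\n"
        "Solve using the heuristic above. End with 'Answer: <value>'."
    )


def extract_answer(solution: str) -> str:
    """Rule-based extraction: text after the last 'Answer:' marker, or '' if absent."""
    marker = "Answer:"
    if marker not in solution:
        return ""
    return solution.rsplit(marker, 1)[1].strip()


def collect_trajectories(
    problem: str,
    reference_answer: str,
    generate: Callable[[str], str],
    samples_per_heuristic: int = 4,
) -> Dict[str, str]:
    """Keep at most one correct solution per heuristic for a single problem."""
    kept: Dict[str, str] = {}
    for name, exemplars in HEURISTICS.items():
        prompt = build_prompt(problem, name, exemplars)
        for _ in range(samples_per_heuristic):
            candidate = generate(prompt)
            # Assumed filter: exact match against a known reference answer.
            if extract_answer(candidate) == reference_answer:
                kept[name] = candidate
                break
    return kept


def toy_generate(prompt: str) -> str:
    """Stand-in for a real model call; only one heuristic yields the right answer."""
    if "Heuristic: work_backwards" in prompt:
        return "Undo the multiplication step by step. Answer: 42"
    return "Guess from a small case. Answer: 7"


if __name__ == "__main__":
    print(sorted(collect_trajectories("What is 6 * 7?", "42", toy_generate)))
```

Keeping one verified solution per heuristic, rather than many per heuristic, is a design choice of this sketch; it makes the number of retained strategies per problem equal to the number of heuristics that produced a correct answer.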
Mid-Training Objective: The model is then fine-tuned on this expanded dataset—exposing it to n distinct, correct strategies per question (rather than the conventional single answer), adapting the likelihood objective to encourage multi-modal next-token distributions.
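One way to write such an objective is the standard negative log-likelihood summed over all n retained solutions of each question. This is an assumed sketch, not the paper's exact formulation; the symbols $\mathcal{D}$ (the problem set), $q$ (a question), $y^{(i)}$ (its i-th retained solution), and $\pi_\theta$ (the model) are introduced here for illustration.

```latex
\mathcal{L}_{\text{mid}}(\theta)
  = -\sum_{q \in \mathcal{D}} \; \sum_{i=1}^{n} \; \sum_{t=1}^{|y^{(i)}|}
    \log \pi_\theta\!\left( y^{(i)}_{t} \,\middle|\, q,\; y^{(i)}_{<t} \right)
```

Because the same prefix $q$ is paired with n different continuations, minimizing this loss pushes the model to spread probability mass across all of them instead of concentrating it on a single solution.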
Theoretical Analysis: The paper provides formal justification for this pipeline—demonstrating, under policy gradient updates, that mid-training on diverse strategies produces N-modal probability distributions over continuations, which results in more conservative, balanced updates during RL. This mitigates mode collapse and facilitates the composition of multiple strategies within single reasoning traces.
Empirical Results
Extensive experiments are conducted with Llama-3.2-3B-Instruct on a spectrum of mathematics benchmarks (including Math-500, AIME, HMMT, AMC, OlympiadBench) and out-of-domain tasks (code and narrative reasoning).
Main Empirical Findings
- Diversity Amplifies RL Gains: When mid-training with n heuristic-guided trajectories per question, increasing n results in systematically higher pass@k scores after RL—especially for larger k (e.g., pass@64), indicating improved robustness and solution diversity.
- Quantitative Improvements: At pass@64, improvements over vanilla RL baselines reach 2.85% (Math-500), 5.7% (AIME 2024), 6.55% (AIME 2025), and similar margins on other datasets.
- Benefits Beyond Mathematics: Gains transfer to code generation (HumanEval) and narrative reasoning (MuSR), with mid-trained models outscoring vanilla RL in all measured settings.
- Importance of Correctness: Only mid-training with correct, diverse approaches improves RL; exposure to heuristic-guided incorrect chains degrades performance.
- Multi-Strategy vs Multi-Problem: For a fixed mid-training budget, learning multiple strategies per problem yields superior RL performance compared to learning one strategy for a larger set of distinct problems.
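For readers unfamiliar with the metric, pass@k is the probability that at least one of k sampled solutions is correct. The summary does not say how it is estimated; the sketch below assumes the widely used unbiased estimator introduced alongside HumanEval, 1 - C(n - c, k) / C(n, k), where n samples are drawn per problem and c of them are correct. The function and parameter names are illustrative.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(num_samples: int, num_correct: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem from num_samples draws."""
    if not 0 <= num_correct <= num_samples:
        raise ValueError("num_correct must lie in [0, num_samples]")
    if not 1 <= k <= num_samples:
        raise ValueError("k must lie in [1, num_samples]")
    # comb(a, k) is 0 when a < k, so the estimate is 1.0 once failures < k.
    return 1.0 - comb(num_samples - num_correct, k) / comb(num_samples, k)


def benchmark_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over (num_samples, num_correct) pairs, one per problem."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    return sum(scores) / len(scores)
```

Under this estimator, gains at large k (such as pass@64) reflect problems where correct solutions are rare but present among the samples, which is why the metric is sensitive to solution diversity.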
(Figure 2)
Figure 2: Pass@k curves for RL-trained models as a function of the number of mid-training heuristics per problem, demonstrating consistent improvements with greater diversity.
Analysis of Generated Reasoning Traces
By classifying model-generated chains with an LLM-based judge, the paper empirically verifies that RL-trained models compose multiple distinct heuristics within single responses at substantially higher rates than models exposed to only a single solution pattern per problem. At n=16 heuristics, 56.7% of RL traces exhibit multi-heuristic composition (vs. 23.3% pre-RL).
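A composition rate of this kind can be computed from the judge's output as sketched below. The sketch assumes the judge returns a set of heuristic labels per trace and that "multi-heuristic composition" means at least two distinct labels; both are assumptions made here, since the summary does not specify the judge's output format or the threshold.

```python
from typing import Iterable, Set


def composition_rate(labels_per_trace: Iterable[Set[str]], min_heuristics: int = 2) -> float:
    """Fraction of traces whose label set contains at least min_heuristics entries."""
    traces = list(labels_per_trace)
    if not traces:
        return 0.0
    multi = sum(1 for labels in traces if len(labels) >= min_heuristics)
    return multi / len(traces)
```

For example, four traces labeled `{"a", "b"}`, `{"a"}`, `set()`, and `{"b", "c", "d"}` give a rate of 0.5.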
(Figure 3)
Figure 3: Post-RL models exhibit a higher rate of composing multiple problem-solving heuristics within a single reasoning trace.
Comparison to Distillation and Baseline Self-Improvement
Distillation from larger teacher models—while conceptually similar—yields less diverse data (as measured by the Vendi Score) and produces less effective RL post-training than the heuristic-guided self-generation strategy. This suggests that compositionally rich behavior cannot be trivially imported from stronger teachers, but is better fostered via targeted mid-training on diversified self-generated solutions.
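For reference, the Vendi Score is defined as the exponential of the Shannon entropy of the eigenvalues of a normalized similarity matrix. The notation below is introduced here for illustration ($m$ samples, kernel matrix $K$); how the paper chooses the similarity function is not stated in this summary.

```latex
\mathrm{VS}(K) = \exp\!\left( -\sum_{i=1}^{m} \lambda_i \log \lambda_i \right),
\qquad
\lambda_1, \dots, \lambda_m \ \text{the eigenvalues of } K/m,
\quad K \succeq 0,\; K_{ii} = 1,
```

with the convention $0 \log 0 = 0$. The score ranges from 1 (all samples identical) to $m$ (all samples mutually dissimilar), so it can be read as an effective number of distinct samples.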
Theoretical Contributions
The policy gradient analysis shows that when an LLM’s next-token distribution is N-modal (due to mid-training with N diverse, correct strategies), single-step RL updates induce gradual redistribution of probability mass among high-probability modes, rather than quickly collapsing onto a single solution. This not only stabilizes RL learning but also maintains behavioral plasticity, promoting the emergence of compositional reasoning strategies.
Propositions and proofs detail both the mechanisms of probability mass redistribution (favoring high-probability continuations) and the efficient incentive structures for composition of learned reasoning patterns.
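The intuition can be illustrated with a toy softmax policy over a handful of discrete continuations. This is a simplified sketch and not the paper's proof: it assumes a tabular policy with one logit per continuation, binary rewards, and the exact expected policy gradient, for which the update to logit i is proportional to p_i (r_i - E[r]). All names are illustrative.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_step(logits: List[float], rewards: List[float], lr: float = 1.0) -> List[float]:
    """One exact gradient-ascent step on expected reward for a softmax policy."""
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, rewards))
    # d E[r] / d logit_i = p_i * (r_i - E[r])
    return [x + lr * p * (r - expected) for x, p, r in zip(logits, probs, rewards)]


def logits_from_probs(probs: List[float]) -> List[float]:
    """Logits whose softmax reproduces the given distribution."""
    return [math.log(p) for p in probs]


if __name__ == "__main__":
    rewards = [1.0, 1.0, 1.0, 1.0, 0.0]  # four correct strategies, one wrong continuation

    balanced = [0.2, 0.2, 0.2, 0.2, 0.2]     # multi-modal prior over correct strategies
    skewed = [0.7, 0.05, 0.05, 0.05, 0.15]   # prior dominated by one strategy

    for name, prior in (("balanced", balanced), ("skewed", skewed)):
        after = softmax(policy_gradient_step(logits_from_probs(prior), rewards))
        print(name, [round(p, 4) for p in after])
```

In the balanced case every correct strategy receives the same update, so mass moves away from the wrong continuation while the correct modes stay equally likely. In the skewed case the dominant strategy receives the largest update, so its lead over the other correct strategies grows, which is the rich-get-richer dynamic that leads toward collapse.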
Practical and Theoretical Implications
For practice, this work substantiates a general methodology: systematic enrichment of training data with diverse, correct solution strategies—guided by human-interpretable heuristics—significantly improves RL-based post-training for LLMs. The resulting models display higher sample efficiency, greater robustness to evaluation regime, and qualitatively richer multi-step reasoning behaviors.
Theoretically, it clarifies that RL does not inherently produce new capabilities ex nihilo, but is most powerful when leveraged to compose and amplify diverse competencies already present in the model’s prior. This highlights the criticality of explicitly engineering diversity into the training pipeline.
The findings also motivate further exploration toward domain-general heuristics and more sophisticated self-generated data pipelines for enhancing LLM reasoning, as well as clarifying the trade-offs between reasoning pattern diversity and problem coverage under limited supervision.
Conclusion
This paper demonstrates—both theoretically and empirically—that mid-training LLMs on self-generated, heuristic-guided diverse solutions is a highly effective strategy for amplifying the benefits of RL post-training. The approach is robust, model-agnostic, and yields strong gains across mathematical, code, and narrative reasoning domains. By exposing models to multiple correct approaches per problem, RL is empowered to compose these behaviors into more flexible and effective multi-step solutions, revealing a path toward models with deeper, more human-like problem-solving fluency.
References: (2605.08472)