
Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models

Published 8 May 2026 in cs.AI | (2605.08472v1)

Abstract: The effectiveness of Reinforcement Learning (RL) in LLMs depends on the nature and diversity of the data used before and during RL. In particular, reasoning problems can often be approached in multiple ways that rely on different forms of reasoning, and exposure to only a limited range of such approaches in the training data may limit the effectiveness of RL. Motivated by this, we investigate using diverse self-generated data during mid-training as an intermediate step before RL training. Specifically, we adopt a bootstrapped data-generation framework guided by George Polya's problem-solving approaches for generating multiple variants of correct answers for each question in the training data, and then perform fine-tuning. We first provide a theoretical perspective on how mid-training on such data improves RL and explain how policy-gradient updates can incentivize combining multiple approaches. We then empirically demonstrate that RL-trained models initialized with our mid-training data achieve consistent improvements across various mathematical reasoning benchmarks and other OOD tasks like code generation and narrative reasoning. Overall, our investigative study shows that an LLM learning multiple problem-solving approaches through self-generated data helps subsequent RL.

Summary

  • The paper introduces a mid-training phase that leverages self-generated diverse solutions to enhance RL performance in language models.
  • It employs heuristic-driven data generation with few-shot exemplars and rule-based filtering to curate multiple valid solution pathways per problem.
  • Empirical results reveal significant gains in pass@k scores and improved multi-heuristic composition across math, code, and narrative reasoning tasks.

Mid-Training with Self-Generated Data Improves RL in LLMs: A Technical Analysis

Motivation and Problem Statement

Reinforcement Learning (RL) is increasingly adopted for eliciting advanced reasoning behaviors in LLMs, particularly in mathematical problem solving and related complex domains. Despite successes, RL often only sharpens the modes already present in base models, with limited capacity to induce genuinely new abilities, and can even degrade sample efficiency in certain settings. The diversity and structure of data provided prior to RL—the so-called "priors"—critically influence the ultimate effectiveness of RL post-training.

The central question addressed is: Can systematically increasing the diversity of problem-solving strategies within the mid-training phase amplify RL’s impact and facilitate the learning of richer, more compositional reasoning behaviors?

Methodology: Heuristic-Guided Self-Generated Mid-Training

The approach introduces a mid-training phase, inserted between standard supervised fine-tuning (SFT) and RL, leveraging self-generated, diverse solution trajectories. The core innovations are:

Heuristic-Driven Data Generation: Drawing on Pólya’s taxonomy of mathematical problem-solving heuristics, the model is conditioned via prompts on explicit, interpretable heuristics—each accompanied by few-shot exemplars. For every problem/heuristic pair:

  • The base model generates multiple candidate rationales conditioned on the heuristic.
  • These outputs are filtered for correctness (using rule-based verifiers).
  • A reward model scores remaining chains for heuristic adherence.
  • The highest-scoring chain per pair is selected, yielding a dataset with multiple diverse, correct solutions per underlying problem.

(Figure 1)

    Figure 1: Structured pipeline for heuristic-guided mid-training—diverse solution trajectories generated, verified, and curated for fine-tuning.
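The generate, filter, score, and select steps listed above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function `build_midtraining_set` and the callables `generate`, `is_correct`, and `adherence_score` are hypothetical stand-ins for the base model, the rule-based verifier, and the reward model, and the toy problem and stubs exist only so the sketch runs.

```python
def build_midtraining_set(problems, heuristics, generate, is_correct,
                          adherence_score, num_candidates=4):
    """Keep one correct, best-adhering solution per (problem, heuristic) pair.

    All three callables are hypothetical placeholders:
      generate(problem, heuristic, k)   -> list of candidate rationales
      is_correct(problem, rationale)    -> bool (rule-based verification)
      adherence_score(rationale, heur.) -> float (reward-model score)
    """
    dataset = []
    for problem in problems:
        for heuristic in heuristics:
            candidates = generate(problem, heuristic, num_candidates)
            # Step 1: discard rationales whose final answer is wrong.
            correct = [c for c in candidates if is_correct(problem, c)]
            if not correct:
                continue  # no usable solution for this pair
            # Step 2: keep the rationale that best follows the heuristic.
            best = max(correct, key=lambda c: adherence_score(c, heuristic))
            dataset.append({"problem": problem,
                            "heuristic": heuristic,
                            "solution": best})
    return dataset


# Toy stubs (assumptions for illustration only).
ANSWERS = {"2+2": "4"}

def toy_generate(problem, heuristic, k):
    pool = [f"[{heuristic}] answer: 4", f"[{heuristic}] answer: 5", "answer: 4"]
    return pool[:k]

def toy_is_correct(problem, rationale):
    return rationale.strip().endswith(ANSWERS[problem])

def toy_adherence(rationale, heuristic):
    return 1.0 if heuristic in rationale else 0.0

demo = build_midtraining_set(["2+2"], ["work backwards", "draw a figure"],
                             toy_generate, toy_is_correct, toy_adherence,
                             num_candidates=3)
for row in demo:
    print(row)
```

In this sketch a pair with no correct candidate is simply skipped, so some problems may end up with fewer solutions than there are heuristics; how the paper handles that case is not stated in this summary.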

Mid-Training Objective: The model is then fine-tuned on this expanded dataset, which exposes it to $n$ distinct, correct strategies per question (rather than the conventional single answer) and adapts the likelihood objective to encourage multi-modal next-token distributions.
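One plausible way to write such an objective, assuming a standard token-level negative log-likelihood summed over all retained solutions, is shown below. The notation is ours and may differ from the paper's: $\mathcal{D}$ is the set of training questions, $y^{(i)}$ is the $i$-th retained solution for question $q$, and $\pi_\theta$ is the model being fine-tuned.

```latex
\mathcal{L}_{\text{mid}}(\theta)
  = -\sum_{q \in \mathcal{D}} \sum_{i=1}^{n} \sum_{t=1}^{|y^{(i)}|}
    \log \pi_\theta\!\left( y^{(i)}_{t} \,\middle|\, q,\; y^{(i)}_{<t} \right)
```

Because the same question $q$ appears with $n$ different targets, minimizing this loss pushes probability mass toward several continuations at once rather than toward a single one.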

Theoretical Analysis: The paper provides formal justification for this pipeline: mid-training on diverse strategies produces $N$-modal probability distributions over continuations, which, under policy-gradient updates, result in more conservative, balanced updates during RL. This mitigates mode collapse and facilitates the composition of multiple strategies within single reasoning traces.

Empirical Results

Extensive experiments are conducted with Llama-3.2-3B-Instruct on a spectrum of mathematics benchmarks (including Math-500, AIME, HMMT, AMC, OlympiadBench) and out-of-domain tasks (code and narrative reasoning).

Main Empirical Findings

  • Diversity Amplifies RL Gains: When mid-training with $n$ heuristic-guided trajectories per question, increasing $n$ yields systematically higher pass@$k$ scores after RL, especially for larger $k$ (e.g., pass@64), indicating improved robustness and solution diversity.
  • Quantitative Improvements: At pass@64, improvements over vanilla RL baselines reach 2.85% (Math-500), 5.7% (AIME 2024), 6.55% (AIME 2025), and similar margins on other datasets.
  • Benefits Beyond Mathematics: Gains transfer to code generation (HumanEval) and narrative reasoning (MuSR), with mid-trained models outscoring vanilla RL in all measured settings.
  • Importance of Correctness: Only mid-training with correct, diverse approaches improves RL; exposure to heuristic-guided incorrect chains degrades performance.
  • Multi-Strategy vs Multi-Problem: For a fixed mid-training budget, learning multiple strategies per problem yields superior RL performance compared to learning one strategy for a larger set of distinct problems.

(Figure 2)

Figure 2: Pass@$k$ curves for RL-trained models as a function of the number of mid-training heuristics per problem, demonstrating consistent improvements with greater diversity.
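For readers unfamiliar with the metric, pass@$k$ is commonly computed with the unbiased estimator introduced alongside the HumanEval benchmark: draw $n$ samples per problem, count the $c$ correct ones, and estimate the probability that at least one of $k$ samples is correct. Whether the paper uses exactly this estimator is an assumption; the sketch below only illustrates the metric itself.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of which are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random size-k subset of the n samples contains no correct sample.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 4 samples, 1 correct.
print(pass_at_k(4, 1, 1))
print(pass_at_k(4, 1, 2))
```

The per-problem estimates are then averaged over the benchmark. Larger $k$ rewards a model for having at least one working approach among many samples, which is why it is sensitive to solution diversity.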

Analysis of Generated Reasoning Traces

By classifying model-generated chains with an LLM-based judge, the paper empirically verifies that RL-trained models compose multiple distinct heuristics within single responses at substantially higher rates than models exposed to only a single solution pattern per problem. At $n=16$ heuristics, 56.7% of RL traces exhibit multi-heuristic composition (vs. 23.3% pre-RL).

(Figure 3)

Figure 3: Post-RL models exhibit a higher rate of composing multiple problem-solving heuristics within a single reasoning trace.

Comparison to Distillation and Baseline Self-Improvement

Distillation from larger teacher models—while conceptually similar—yields less diverse data (as measured by the Vendi Score) and produces less effective RL post-training than the heuristic-guided self-generation strategy. This suggests that compositionally rich behavior cannot be trivially imported from stronger teachers, but is better fostered via targeted mid-training on diversified self-generated solutions.
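As background on the diversity metric, the Vendi Score of $n$ items is the exponential of the Shannon entropy of the eigenvalues of $K/n$, where $K$ is a positive semi-definite similarity matrix with ones on the diagonal. It equals 1 when all items are identical and $n$ when they are fully dissimilar. The sketch below is a dependency-free illustration; the similarity matrices are made up, and how the paper builds its similarity matrix (embedding model, kernel) is not stated in this summary. The helper `symmetric_eigenvalues` is our own small Jacobi-rotation routine, used only to avoid external libraries.

```python
import math


def symmetric_eigenvalues(matrix, sweeps=50, tol=1e-24):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations."""
    a = [list(map(float, row)) for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[p][q] ** 2 for p in range(n) for q in range(p + 1, n))
        if off < tol:
            break  # off-diagonal mass is negligible: diagonal holds eigenvalues
        for p in range(n):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                theta = (a[q][q] - a[p][p]) / (2.0 * a[p][q])
                t = math.copysign(1.0, theta) / (
                    abs(theta) + math.sqrt(theta * theta + 1.0))
                c = 1.0 / math.sqrt(t * t + 1.0)
                s = t * c
                for k in range(n):  # update columns p and q (A <- A J)
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):  # update rows p and q (A <- J^T A)
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return [a[i][i] for i in range(n)]


def vendi_score(similarity):
    """exp(Shannon entropy) of the eigenvalues of similarity / n."""
    n = len(similarity)
    scaled = [[value / n for value in row] for row in similarity]
    eigenvalues = symmetric_eigenvalues(scaled)
    # Zero (or numerically tiny) eigenvalues contribute nothing to the entropy.
    entropy = -sum(lam * math.log(lam) for lam in eigenvalues if lam > 1e-12)
    return math.exp(entropy)


print(vendi_score([[1, 0, 0], [0, 1, 0], [0, 0, 1]]))  # three unrelated items
print(vendi_score([[1, 1], [1, 1]]))                   # two identical items
print(vendi_score([[1, 0.5], [0.5, 1]]))               # two partly similar items
```

The score can be read as an "effective number of distinct items", which is what makes it a natural way to compare the diversity of distilled versus self-generated solution sets.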

Theoretical Contributions

The policy gradient analysis shows that when an LLM's next-token distribution is $N$-modal (due to mid-training with $N$ diverse, correct strategies), single-step RL updates induce gradual redistribution of probability mass among high-probability modes, rather than quickly collapsing onto a single solution. This not only stabilizes RL learning, but also maintains behavioral plasticity, promoting the emergence of compositional reasoning strategies.
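A toy calculation can make this redistribution concrete. The sketch below is not the paper's proof; it uses the standard identity for a softmax policy with logits $z$, where the exact gradient of expected reward is $\partial J / \partial z_i = p_i (r_i - \mathbb{E}_p[r])$. The four "continuations", their probabilities, and their rewards are invented for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_pg_step(logits, rewards, lr=1.0):
    """One exact (expected) policy-gradient ascent step on the logits.

    Uses dJ/dz_i = p_i * (r_i - sum_j p_j r_j) for a softmax policy.
    """
    probs = softmax(logits)
    baseline = sum(p * r for p, r in zip(probs, rewards))
    grads = [p * (r - baseline) for p, r in zip(probs, rewards)]
    return [z + lr * g for z, g in zip(logits, grads)]


# Hypothetical setup: three correct strategies (reward 1) that mid-training
# made equally likely, plus one incorrect continuation (reward 0).
rewards = [1.0, 1.0, 1.0, 0.0]
logits = [math.log(0.3), math.log(0.3), math.log(0.3), math.log(0.1)]

before = softmax(logits)
after = softmax(expected_pg_step(logits, rewards))
print([round(p, 4) for p in before])
print([round(p, 4) for p in after])
```

In this toy case the update takes mass from the incorrect continuation and spreads it evenly over the three correct modes, because each logit moves in proportion to its own probability times its advantage. The step is also small when most of the mass already sits on rewarded modes, since the advantage $r_i - \mathbb{E}_p[r]$ is then close to zero.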

Propositions and proofs detail both the mechanisms of probability mass redistribution (favoring high-probability continuations) and the efficient incentive structures for composition of learned reasoning patterns.

Practical and Theoretical Implications

For practice, this work substantiates a general methodology: systematic enrichment of training data with diverse, correct solution strategies—guided by human-interpretable heuristics—significantly improves RL-based post-training for LLMs. The resulting models display higher sample efficiency, greater robustness to evaluation regime, and qualitatively richer multi-step reasoning behaviors.

Theoretically, it clarifies that RL does not inherently produce new capabilities ex nihilo, but is most powerful when leveraged to compose and amplify diverse competencies already present in the model’s prior. This highlights the criticality of explicitly engineering diversity into the training pipeline.

The findings also motivate further exploration toward domain-general heuristics and more sophisticated self-generated data pipelines for enhancing LLM reasoning, as well as clarifying the trade-offs between reasoning pattern diversity and problem coverage under limited supervision.

Conclusion

This paper demonstrates—both theoretically and empirically—that mid-training LLMs on self-generated, heuristic-guided diverse solutions is a highly effective strategy for amplifying the benefits of RL post-training. The approach is robust, model-agnostic, and yields strong gains across mathematical, code, and narrative reasoning domains. By exposing models to multiple correct approaches per problem, RL is empowered to compose these behaviors into more flexible and effective multi-step solutions, revealing a path toward models with deeper, more human-like problem-solving fluency.


References: (2605.08472)
