Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

Published 26 Jan 2026 in cs.LG and cs.CL | (2601.18734v1)

Abstract: Knowledge distillation improves LLM reasoning by compressing the knowledge of a teacher LLM to train smaller LLMs. On-policy distillation advances this approach by having the student sample its own trajectories while a teacher LLM provides dense token-level supervision, addressing the distribution mismatch between training and inference in off-policy distillation methods. However, on-policy distillation typically requires a separate, often larger, teacher LLM and does not explicitly leverage ground-truth solutions available in reasoning datasets. Inspired by the intuition that a sufficiently capable LLM can rationalize external privileged reasoning traces and teach its weaker self (i.e., the version without access to privileged information), we introduce On-Policy Self-Distillation (OPSD), a framework where a single model acts as both teacher and student by conditioning on different contexts. The teacher policy conditions on privileged information (e.g., verified reasoning traces) while the student policy sees only the question; training minimizes the per-token divergence between these distributions over the student's own rollouts. We demonstrate the efficacy of our method on multiple mathematical reasoning benchmarks, achieving 4-8x token efficiency compared to reinforcement learning methods such as GRPO and superior performance over off-policy distillation methods.


Authors (7)

Summary

The paper presents On-Policy Self-Distillation (OPSD) that leverages ground-truth solutions for dense, token-level supervision.
It demonstrates significant improvements in accuracy and sample efficiency compared to traditional supervised fine-tuning and RL methods on math benchmarks.
The method eliminates the need for separate teacher models and scales effectively with increasing model size, highlighting its practical impact.

On-Policy Self-Distillation in LLMs for Mathematical Reasoning

Introduction

The paper "Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models" (2601.18734) introduces On-Policy Self-Distillation (OPSD), a fine-tuning paradigm in which a single LLM acts as both teacher and student for mathematical reasoning tasks. OPSD uses ground-truth solutions as privileged information during post-training, enabling dense token-level supervision without a separate teacher LLM. This approach directly addresses limitations of reinforcement learning with verifiable rewards (RLVR), supervised fine-tuning (SFT), and traditional knowledge distillation, particularly distribution shift and sample inefficiency.

Motivation and Conceptual Framework

OPSD is motivated by the observation that for LLMs, rationalizing a solution when provided with privileged information is easier than generating an answer from scratch. The method instantiates two conditional policies from the same model parameters: the student, which sees only the problem prompt, and the teacher, which conditions on both the problem and the verified solution. Training minimizes the per-token divergence (typically with Jensen-Shannon divergence) between these distributions across student-generated trajectories, with gradients propagated only through the student’s logits.
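The objective described above can be written out as follows. This is a sketch reconstructed from the description in this summary, not a formula copied from the paper: the symbol $\hat{y}$ for the student rollout and the normalization by rollout length are assumptions, and $\mathrm{JSD}_\beta$ is the usual generalized Jensen-Shannon divergence with mixture weight $\beta$. The teacher term is treated as a constant, so gradients reach only the student's logits.

```latex
\begin{align*}
\mathcal{L}(\theta) &=
  \mathbb{E}_{(x,\, y^*)}\;
  \mathbb{E}_{\hat{y} \sim p_S(\cdot \mid x)}
  \left[ \frac{1}{|\hat{y}|} \sum_{t=1}^{|\hat{y}|}
  \mathrm{JSD}_\beta\!\left( p_T(\cdot \mid x, y^*, \hat{y}_{<t}) \,\big\|\, p_S(\cdot \mid x, \hat{y}_{<t}) \right) \right] \\
\mathrm{JSD}_\beta(p_T \,\|\, p_S) &=
  \beta\, D_{\mathrm{KL}}(p_T \,\|\, m) + (1-\beta)\, D_{\mathrm{KL}}(p_S \,\|\, m),
  \qquad m = \beta\, p_T + (1-\beta)\, p_S
\end{align*}
```

Both policies share the parameters $\theta$; they differ only in whether $y^*$ appears in the context.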

This framework combines four key advantages:

On-policy supervision, matching inference-time distribution;
Dense, per-token feedback enabling more fine-grained optimization;
Utilization of ground-truth solutions for richer teacher signals;
Elimination of requirements for a separate, larger teacher model.

Relation to Prior Approaches

Traditional knowledge distillation [Hinton et al., 2015] for LLMs requires off-policy data and suffers from exposure bias, resulting in compounding errors at inference time. RLVR methods such as GRPO [Shao et al., 2024] optimize outcome-based objectives using binary rewards, but are computationally expensive and only provide sequence-level feedback. On-policy distillation [Agarwal et al., 2024; Lu & Lab, 2025] improves sample efficiency but typically requires an external teacher model. OPSD bridges these paradigms, transferring knowledge within the same model by exploiting privileged ground-truth context for self-distillation.

Experimental Design

The authors evaluate OPSD on four competitive mathematical reasoning benchmarks (AIME24, AIME25, HMMT25, AMO-Bench) using Qwen3 models at scales ranging from 1.7B to 8B parameters. The experimental setting includes:

Baselines: supervised fine-tuning (SFT) and GRPO.
Dataset: up to 30,000 math problem/solution pairs from OpenThoughts [Guha et al., 2025].
Training: single student rollout (OPSD) vs. 8 rollouts (GRPO), with full-vocabulary logit divergence minimized in OPSD.

Empirical Results

OPSD improves both accuracy and token efficiency:

Accuracy: OPSD consistently outperforms SFT across model scales, and exceeds or matches GRPO in 4B/8B models, with average accuracies up to 52.2% (Qwen3-8B).
Token Efficiency: OPSD achieves comparable or higher accuracy using 4-8x fewer generated tokens than GRPO, directly attributable to dense token-level feedback.
Scalability: The benefit of OPSD increases with model size; higher-parameter models are more capable of leveraging privileged reasoning traces for effective self-distillation.
Divergence Objective: Ablations show that distilling over the full vocabulary (logit distillation) yields superior results to sampled-token policy-gradient objectives, with increases in pass@K accuracy in all tested tasks.
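To make the rollout budget concrete, the sketch below counts generated tokens per pass over the data for the two sampling setups listed under Experimental Design (1 rollout for OPSD, 8 for GRPO). The prompt count and the 2,048-token rollout length are illustrative assumptions, and the helper name `generated_tokens` is hypothetical. The 4-8x figure reported in the paper is an empirical measurement at comparable accuracy, which this simple arithmetic does not reproduce.

```python
def generated_tokens(num_prompts: int, rollouts_per_prompt: int, tokens_per_rollout: int) -> int:
    """Upper bound on tokens sampled in one pass over the prompts."""
    return num_prompts * rollouts_per_prompt * tokens_per_rollout


# Illustrative numbers only (assumptions, not taken from the paper).
num_prompts = 1000
tokens_per_rollout = 2048

opsd_tokens = generated_tokens(num_prompts, rollouts_per_prompt=1, tokens_per_rollout=tokens_per_rollout)
grpo_tokens = generated_tokens(num_prompts, rollouts_per_prompt=8, tokens_per_rollout=tokens_per_rollout)

# Ratio of sampling budgets when both methods see the same prompts
# and use rollouts of the same maximum length.
budget_ratio = grpo_tokens / opsd_tokens
```

With equal rollout lengths the per-pass budget differs by exactly the rollout count; the measured efficiency gap also depends on how many training steps each method needs.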

Practical and Theoretical Implications

OPSD advances post-training techniques for LLMs in reasoning-intensive domains, notably in mathematical problem solving. The practical impact is substantial: models require only their own initial parameters and ground-truth solutions for effective distillation, dramatically reducing computational cost relative to RL-based methods and circumventing the need for process reward models.

Theoretically, OPSD generalizes the idea of self-improvement in LLMs, demonstrating that dense token-level distribution matching conditioned on privileged context enables significant learning without external teachers. The sample efficiency gains highlight the importance of dense feedback in autoregressive generation, and the dependence on model scale points to a key bottleneck: a less capable model cannot rationalize the privileged solution well enough to teach itself.

Limitations and Future Directions

Experiments are currently limited to models up to 8B parameters; whether the observed trends persist for much larger (e.g., 70B) LLMs remains an open question. OPSD does not leverage explicit correctness verification as a learning signal—future work may incorporate multi-objective optimization combining distribution matching with answer correctness. Additionally, the effectiveness of OPSD is modulated by both problem difficulty and model capacity, suggesting curriculum learning strategies as promising avenues to maintain optimization at the edge of current model ability.

Conclusion

On-Policy Self-Distillation (OPSD) presents an efficient, scalable method for enhancing LLM reasoning abilities using ground-truth privileged context for token-level supervision within a single model. The empirical gains in accuracy and sample efficiency, especially for larger models, position OPSD as a compelling alternative to both RLVR and traditional knowledge distillation. Future extensions could leverage verification and curriculum learning to further push the limits of self-improving LLMs in complex reasoning domains.

Explain it Like I'm 14

What is this paper about?

This paper introduces a new way to train LLMs to be better at step-by-step reasoning, especially in math. The method is called On-Policy Self-Distillation (OPSD). In simple terms, the model learns from itself: it pretends to be both a “teacher” and a “student,” and uses correct solutions (like answer keys) to guide its own practice on problems.

The big questions the researchers asked

Can a single model teach itself to reason better without needing a bigger, separate teacher model?
Can we give the model detailed, step-by-step feedback (not just “right/wrong”) while it practices on its own answers?
Does this self-teaching approach beat common training methods in accuracy and efficiency?
How big does a model need to be for this to work well?

How did they do it?

The “same model, two hats” idea

Think of the model wearing two hats:

Student hat: The model sees only the problem, like a regular test question.
Teacher hat: The same model also gets “privileged” information—like the correct answer or a correct chain of thought—as if it’s holding an answer key.

These are not two separate models; it’s the same model with different inputs.
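Here is a small sketch of the "two hats" idea: one training record is turned into two different inputs for the same model. The prompt wording and the function names `student_prompt` and `teacher_prompt` are made up for illustration; the paper's actual templates are not reproduced here.

```python
def student_prompt(problem: str) -> str:
    """Student hat: the model sees only the problem."""
    return f"Problem:\n{problem}\n\nSolve the problem step by step."


def teacher_prompt(problem: str, reference_solution: str) -> str:
    """Teacher hat: same problem, plus the privileged reference solution."""
    return (
        f"Problem:\n{problem}\n\n"
        f"Reference solution:\n{reference_solution}\n\n"
        "Solve the problem step by step."
    )


# One hypothetical dataset record.
record = {"problem": "Solve 2x = 4.", "solution": "Divide both sides by two."}

s_prompt = student_prompt(record["problem"])
t_prompt = teacher_prompt(record["problem"], record["solution"])
```

During training, the student's own written answer would be appended to both prompts, so the two hats score the same text and differ only in whether the answer key is visible.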

On-policy training (practice on your own work)

“On-policy” means the student practices on the answers it actually writes. The student generates its own step-by-step solution. Then the teacher—who knows the correct solution—looks at the student’s work and computes guidance for each next step.

This avoids a common problem in training called “distribution mismatch,” where a model is trained on perfect examples but then gets confused when it has to write its own imperfect answers during real use.

Token-by-token guidance (dense feedback)

LLMs write text one token (roughly, one word or symbol) at a time. Instead of just saying “your final answer is wrong,” the teacher gives detailed hints for every next token in the student’s solution. The student then adjusts its behavior to be more like the teacher’s suggestions at each step. This gives rich, fine-grained feedback, not just a single score at the end.

In math terms, they measure “how different” the student’s and teacher’s next-token predictions are and push the student to reduce that difference. You can think of this like the teacher guiding the student at each word: “The next thing you should write is probably this term, not that term.”
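The "how different" measurement can be sketched in a few lines of plain Python. This is a toy version under stated assumptions: the distributions are tiny hand-written lists over a 3-token vocabulary, `beta` is the mixture weight of the Jensen-Shannon divergence (assumed strictly between 0 and 1), and the function names are hypothetical. A real implementation would compute this on model logits with an autograd library and block gradients through the teacher.

```python
import math


def kl(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def jsd(p_teacher, p_student, beta=0.5):
    """Generalized Jensen-Shannon divergence with mixture weight beta in (0, 1)."""
    m = [beta * t + (1.0 - beta) * s for t, s in zip(p_teacher, p_student)]
    return beta * kl(p_teacher, m) + (1.0 - beta) * kl(p_student, m)


def rollout_loss(teacher_dists, student_dists, beta=0.5):
    """Average per-token divergence over one student rollout."""
    per_token = [jsd(t, s, beta) for t, s in zip(teacher_dists, student_dists)]
    return sum(per_token) / len(per_token)


# Two positions of a toy rollout; the numbers are invented for illustration.
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
student = [[0.4, 0.4, 0.2], [0.1, 0.8, 0.1]]

loss = rollout_loss(teacher, student)
```

At the second position the two hats already agree, so that token contributes nothing; all of the loss comes from the first position, where the teacher is more confident in the first token than the student is.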

A second, lighter variant (sampled-token objective)

They also try a simpler version where the teacher only critiques the specific tokens the student actually wrote, rather than looking at the whole set of possible next tokens. This is faster but gives less complete feedback.
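One plausible form of this lighter variant is sketched below: for each token the student actually wrote, compare how likely the teacher and the student each thought that token was. The log-ratio signal used here is borrowed from common on-policy distillation practice and is an assumption; the paper's exact sampled-token objective may differ. The function name and the probabilities are hypothetical.

```python
import math


def sampled_token_signals(teacher_probs, student_probs):
    """Per-token signal log p_T(y_t) - log p_S(y_t) on the sampled tokens only."""
    return [math.log(t) - math.log(s) for t, s in zip(teacher_probs, student_probs)]


# Probability each hat assigned to the two tokens the student wrote (invented numbers).
teacher_probs = [0.7, 0.05]
student_probs = [0.4, 0.5]

signals = sampled_token_signals(teacher_probs, student_probs)
```

A positive signal means the teacher liked the token more than the student did, so the student should make it more likely; a negative signal means the opposite. Only one number per position is needed, which is why this variant is cheaper than comparing the full distributions.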

What did they find?

The team tested OPSD on math competition-style benchmarks (AIME 2024/2025, HMMT 2025, AMO-Bench) using Qwen3 models of different sizes (1.7B, 4B, 8B parameters). They compared OPSD to:

SFT (Supervised Fine-Tuning): training on full solutions from a dataset.
GRPO (a kind of reinforcement learning): sampling multiple answers per problem and rewarding correct ones.
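For contrast, here is a minimal sketch of the kind of feedback GRPO works with: each sampled answer gets one right/wrong reward, and rewards are normalized within the group of answers to the same problem. The use of the population standard deviation and the small `eps` constant are assumptions for illustration, not details taken from the paper.

```python
import statistics


def group_normalized_advantages(rewards, eps=1e-6):
    """Subtract the group mean and divide by the group standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Eight sampled answers to one problem: 1.0 = correct, 0.0 = wrong.
mixed_group = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
all_wrong_group = [0.0] * 8

mixed_adv = group_normalized_advantages(mixed_group)
all_wrong_adv = group_normalized_advantages(all_wrong_group)
```

When every answer in a group gets the same reward, every advantage is zero and that problem provides no learning signal. Each answer also gets a single number for the whole sequence, rather than guidance at each token.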

Here are the key results:

OPSD consistently beat SFT and improved over the base model across sizes.
OPSD matched or exceeded GRPO on mid-sized and larger models (4B and 8B), and was comparable at 1.7B.
OPSD was 4–8× more token-efficient than GRPO. In plain terms: it reached similar or better performance while generating far fewer tokens, and it needed only 1 answer per problem (GRPO needed 8).
Longer student generations (e.g., 2,048 or 4,096 tokens) helped because the teacher could provide more token-level guidance.
Using the full “probability distribution” over possible next tokens (full-vocabulary distillation) worked better than only critiquing the specific token the student wrote (sampled-token approach).
Self-distillation works best when the model is reasonably capable; bigger models benefited more.

Why this is important: OPSD cuts training costs and time while boosting performance, and it removes the need for a separate, often more expensive teacher model.

Why does it matter?

It makes training smarter: The model learns from its own attempts and gets guidance at each step, not just a pass/fail.
It saves compute: Fewer tokens and fewer sampled answers per problem mean lower training cost.
It uses existing data better: It leverages ground-truth solutions (like answer keys) without building extra reward models or relying on bigger teachers.
It’s practical: You can improve reasoning without complex reinforcement learning setups.

Limitations and future directions

The experiments used models up to 8B parameters; larger models might show even bigger gains, but that’s still to be tested.
Full-vocabulary feedback uses more memory than the lighter variant.
If problems are too hard, even the “teacher hat” (with the answer key) may struggle to give good guidance.
Adding answer verification and using a smart curriculum (start easier, get harder) could make training even more effective.

Bottom line

OPSD is like having the model grade and tutor itself using the answer key, step by step, on the work it actually produces. This self-teaching approach improves accuracy, needs less compute, and avoids the hassle of training with a separate teacher or complicated reward systems—making it a promising recipe for building better reasoning models.


Knowledge Gaps

Unresolved Knowledge Gaps, Limitations, and Open Questions

Below is a concise, actionable list of what remains missing, uncertain, or unexplored in the paper, intended to guide future research:

Scaling beyond 8B parameters: Does OPSD continue to improve with larger models (e.g., 14B, 32B, 70B+), and is there a capability threshold after which gains saturate or diminish?
Teacher policy choice: What is the optimal strategy for the teacher policy (fixed initial weights vs. periodically refreshed vs. EMA/slow-moving teacher), and how does this affect stability, convergence, and final performance?
Privileged information type: How do different forms of privileged context (final answer only, verified step-by-step CoT, partial solutions, hints) impact supervision quality and outcomes? Which is most cost-effective per token?
Prompting sensitivity: How sensitive is OPSD to the teacher prompt design and rationalization style, and can standardized or learned prompts reduce variance across tasks or datasets?
Applicability beyond mathematics: Does OPSD generalize to other verifiable domains (coding, formal proofs, data analysis) and non-verifiable tasks (commonsense reasoning, open-ended QA), and what adaptations are required when ground-truth solutions are absent or ambiguous?
Handling label noise: How robust is OPSD to noisy, incomplete, or incorrect ground-truth solutions in training datasets, and can automated verification/filtering improve resilience?
Integration with verifiers: What is the benefit of incorporating verifiable rewards (checkers, PRMs, LLM-as-a-judge) alongside OPSD’s divergence objective, and how can this be done without reintroducing sparse/expensive signals?
Curriculum design: Which curriculum strategies (difficulty ramping, adaptive selection, frontier maintenance) most effectively enable OPSD on problems near a model’s competence boundary?
Token-level weighting: Are early tokens truly more important for learning in OPSD? Can position-dependent weighting (e.g., prefix-focused or decay schedules) improve efficiency and accuracy?
Divergence objective selection: How do different divergences (forward KL, reverse KL, α-divergences, tempered/label-smoothed objectives) and mixture weights (varying $\beta$ in JSD) affect performance, stability, and sample efficiency?
Temperature/softmax shaping: Does temperature scaling of teacher or student distributions during distillation improve gradient signal quality or reduce mode collapse?
Memory/compute trade-offs: What are the precise memory and wall-clock costs of full-vocabulary logit distillation vs. sampled-token objectives, and can techniques like vocabulary pruning, sampled softmax, or low-rank logit compression preserve performance while reducing peak memory?
Sample efficiency vs. wall-clock: Beyond token counts, how does OPSD compare to GRPO in actual training time per accuracy gain across different hardware setups and batch sizes?
Number of student rollouts: Does increasing the number of on-policy student rollouts (e.g., >1 per prompt) yield further gains, or do diminishing returns occur relative to token budget?
Hyperparameter sensitivity: How sensitive are results to learning rate, LoRA ranks/targets, divergence coefficients, generation lengths beyond 4k, and batch size? Are there robust default settings across model scales?
Decoding robustness: Do OPSD gains persist across decoding settings (temperature, top-p/k, min-p, presence/frequency penalties) and under constrained generation budgets typical of real deployments?
Statistical reliability: What is the variance across random seeds and runs, and are improvements statistically significant on each benchmark?
Data contamination: Are evaluation sets (AIME24/25, HMMT25, AMO-Bench) free from training-set leakage for Qwen3 and OpenThoughts? How does OPSD perform on rigorously decontaminated, held-out test distributions?
Out-of-domain generalization: Does OPSD-trained reasoning transfer to novel problem distributions and unseen formats, compared to RLVR baselines known to generalize better?
Failure modes analysis: When the student’s trajectory is highly off-target, does the teacher distribution (conditioned on $y^*$) provide meaningful guidance, or can it become uninformative/degenerate? How can OPSD detect and correct such cases?
Theoretical guarantees: Under what conditions does minimizing token-level divergence along student rollouts improve success rates or reduce exposure bias, and how does OPSD relate formally to imitation learning (e.g., DAgger) and on-policy KL regularization?
Combining OPSD with RLVR: What is the best way to hybridize OPSD with GRPO/PPO (e.g., alternating phases, multi-objective training, per-token vs. sequence-level rewards), and does this deliver additive gains?
Teacher–student drift: As the student improves while the teacher is fixed (initial policy), does the mismatch limit ceiling performance? Would a scheduled teacher refresh avoid over-regularization while preserving stability?
CoT availability: How does OPSD perform when only final answers (no CoT) are available? Can the teacher reliably reconstruct plausible rationales, and is that reconstruction necessary or beneficial?
Evaluation breadth: Beyond average@16 and pass@K, do OPSD improvements hold under single-sample accuracy, shorter contexts, or adversarially perturbed prompts?
Safety and bias: Without external teachers, can OPSD amplify model biases or spurious reasoning patterns present in the initial policy or dataset? What diagnostics and mitigations are effective?
Practical deployment constraints: How does OPSD interact with context length limits, inference-time latency requirements, and memory/compute budgets typical in production settings?
Process-level supervision: Can lightweight, automatically generated step-level signals (heuristic checks, weak PRMs) be integrated to provide finer-grained correction without incurring prohibitive labeling cost?
Benchmarks and datasets: Does OPSD benefit from curated small datasets vs. large noisy corpora (akin to LIMO/LIMA findings), and what data recipes maximize OPSD’s gains?
Multimodal extensions: Can OPSD be extended to multimodal reasoning (e.g., charts, diagrams) where privileged information might include visual annotations or verified intermediate states?


Practical Applications

Immediate Applications

These applications can be deployed now using the OPSD method and accompanying insights, provided the listed dependencies are satisfied.

Post-training pipelines for reasoning LLMs (software/AI labs)
- What: Replace or complement GRPO-style RLVR with OPSD to improve math/coding reasoning using dense, token-level supervision from answer keys or verified CoT traces.
- Why: 4–8x token efficiency vs. GRPO with comparable or better accuracy; only 1 rollout per prompt needed.
- How: Add an OPSD stage to existing LoRA-based post-training: fix the teacher as the initial policy, sample one student rollout per prompt (2k–4k tokens), compute full-vocabulary JSD per token, backprop through the student only.
- Dependencies/assumptions: Access to datasets with ground-truth answers or reference CoT; models at least moderate capacity (≥4B preferred); sufficient memory for full-vocabulary logit distillation; careful prompt design for teacher rationalization.
Cost reduction for verifiable domains (software, education, coding)
- What: Swap RLVR (e.g., GRPO) for OPSD on tasks with verifiable outcomes (math, coding with tests).
- Why: Lower sampling cost and training time without sacrificing performance.
- Tools/workflows: “OPSD Trainer” as a drop-in training job; hyperparameters from paper (JSD with β=0.5; 2k–4k token generation length).
- Dependencies: Stable evaluation harness (answer checkers/unit tests); structured datasets.
Unit-test–guided self-distillation for coding assistants (software/DevEx)
- What: Use unit tests and expected outputs as privileged information to condition the teacher policy in OPSD.
- Why: Improves step-by-step code reasoning and reduces flakiness compared to sparse pass/fail rewards.
- Products: CI/CD plugin that periodically runs OPSD on error logs and failing tests to update a domain model.
- Dependencies: High-quality tests that serve as reliable verifiers; guardrails to prevent test leakage at inference.
Answer-key fine-tuning for tutors and study tools (education/daily life)
- What: Fine-tune tutoring LLMs on problem sets with answer keys or solutions as privileged teacher context.
- Why: Better chain-of-thought quality and pass@K in math/logic tutoring with limited compute.
- Products: “Answer-Key FT” edtech toolkit that ingests past exams/homework with solutions for OPSD.
- Dependencies: Problems within model comprehension; data licensing for answer keys; evaluation to prevent rote memorization.
Enterprise Q/A and compliance assistants (enterprise, policy)
- What: Train internal assistants to reason over policy/SOP questions using canonical answers as privileged info during training.
- Why: Improves adherence to internal policies with token-efficient training.
- Workflows: Curate Q/A sets with approved “gold” answers; run periodic OPSD refreshes; monitor reasoning fidelity.
- Dependencies: High-quality gold answers; content governance; documented change management when policies update.
Dataset packaging and MLOps support (data/ML tooling)
- What: Package datasets with explicit privileged fields (answer, CoT) for OPSD-ready training.
- Why: Simplifies adoption and standardizes supervision quality.
- Products: Dataset schemas and ingestion scripts; training configs for JSD vs sampled-token objectives.
- Dependencies: Clear licensing for distributing solutions; consistent formatting (prompt templates akin to Figure 1).
On-device and small-model domain boosters (edge/consumer)
- What: Use OPSD to boost 1.7B–4B domain models for personal or departmental assistants on modest hardware.
- Why: Gains over SFT with limited compute; LoRA-friendly.
- Workflows: Local LoRA OPSD on small curated datasets; inference with the student-only prompt.
- Dependencies: Enough VRAM for full-vocabulary logits (or fall back to sampled-token objective with some performance trade-off).
Academic benchmarking and method studies (academia)
- What: Reproduce OPSD vs SFT/GRPO studies; analyze effects of generation length, model scale, divergence choices.
- Why: Immediate research utility; low barrier to extension (single-model, no external teacher).
- Dependencies: Access to open benchmarks (AIME, HMMT, AMO-Bench) and reasoning subsets (e.g., OpenThoughts).

Long-Term Applications

These applications require further research, scaling, or engineering before broad deployment.

Frontier-scale OPSD as an RLVR alternative (software/AI labs)
- What: Apply OPSD to 70B+ models as a primary post-training method for reasoning.
- Why: Potentially large cost and energy savings vs multi-rollout RL; avoids sparse rewards.
- Dependencies: Memory-efficient full-vocab distillation (e.g., sharded logits), distributed training, robust stability when teacher/student share weights at scale.
Beyond verifiable domains with hybrid verifiers (law, healthcare, safety-critical)
- What: Combine OPSD with process reward models or LLM judges to emulate privileged “answers” in domains lacking hard ground truth (legal reasoning, clinical guidelines).
- Why: Extends dense token-level guidance to subjective or multi-criterion tasks.
- Dependencies: Reliable, bias-checked verifiers; safety and regulatory compliance; human-in-the-loop auditing.
Multimodal and tool-using agents (agents/robotics)
- What: Generalize OPSD to sequences of tool calls, code execution traces, action plans, or multimodal steps using privileged outcomes/plans as teacher context.
- Why: Dense, on-policy shaping for complex planning beyond text-only reasoning.
- Dependencies: Mappings from privileged outcomes to teacher prompts; divergence over action spaces; evaluation of long-horizon credit assignment.
Continual self-improvement from product logs (SaaS/contact centers)
- What: Periodically fine-tune agents with OPSD using historical interactions where outcomes are known (e.g., resolved tickets, confirmed correct responses).
- Why: Steady improvements without large-scale RL infrastructure.
- Dependencies: Robust data governance; drift detection; safeguards against overfitting to recent data; privacy controls.
Federated or on-prem OPSD for sensitive data (finance, healthcare, government)
- What: Local OPSD fine-tuning with private answer-key corpora; no external teacher required.
- Why: Privacy-preserving improvement; lower compute cost than RLVR.
- Dependencies: Secure training stacks; memory optimization; policy-compliant logging and audits.
Curriculum-driven OPSD trainers (edtech, training platforms)
- What: Automated schedulers that maintain problem difficulty at the model’s frontier to maximize learning signal.
- Why: Addresses failure modes when problems exceed model comprehension.
- Dependencies: Difficulty estimators; adaptive generation-length and sampling policies; dynamic data selection.
Energy- and cost-aware training policy (policy/sustainability)
- What: Encourage token-efficient post-training methods like OPSD in procurement and reporting standards.
- Why: Lower carbon footprint versus multi-sample RLVR; explicit accounting for generated tokens.
- Dependencies: Industry benchmarks for token efficiency; standardized environmental reporting for model training.
Test-generating + OPSD loops for software quality (software engineering)
- What: Agents generate tests (oracles), verify behavior, then apply OPSD with those oracles as privileged info to refine reasoning about edge cases.
- Why: Improves reliability of code assistants over time.
- Dependencies: High-quality test generation; safeguards to prevent overfitting to test artifacts; continuous integration hooks.
Enterprise knowledge management with canonical answers (enterprise)
- What: Build OPSD-ready knowledge bases where canonical, audited answers are maintained and used as privileged training signals.
- Why: More faithful, consistent reasoning over evolving internal knowledge.
- Dependencies: Editorial workflows; versioning; correctness verification; traceability for audits.
General-purpose model compression/distillation ecosystem (ML tooling)
- What: Offer OPSD as a standard self-distillation mode in distillation frameworks (next to teacher-student and off-policy modes).
- Why: Eliminates need for a separate teacher while retaining dense feedback.
- Dependencies: API abstractions for privileged context; memory-optimized full-vocab divergence; evaluation suites across tasks.

Cross-cutting assumptions and dependencies

Privileged information availability: OPSD critically depends on access to ground-truth answers or reliable surrogates (reference CoT, unit tests, verifiers) during training.
Model capacity: Self-rationalization benefits require moderate capability; gains grow with scale (≥4B showed clearer improvements than 1.7B).
Computation/memory: Full-vocabulary logit distillation delivers best performance but has higher memory footprint; sampled-token objectives trade performance for efficiency.
Prompting and stability: Teacher prompts must encourage rationalization before evaluation; fixing the teacher to the initial policy helps stabilize training.
Domain suitability: Tasks far beyond model comprehension will yield weak supervision even with answers; curriculum strategies can mitigate this.
Safety, privacy, and licensing: Ensure lawful use of answer keys/reference solutions; prevent leakage of privileged content into inference prompts; adopt auditing and evaluation to avoid memorization and bias amplification.


Glossary

Advantage: A baseline-adjusted reward signal used to weight policy updates in reinforcement learning. "the advantages become zero"
Autoregressive LLMs: Models that generate sequences token-by-token, conditioning each next token on the previous context. "For auto-regressive LLMs"
Chain-of-thought (CoT): Explicit step-by-step reasoning traces accompanying an answer. "chain-of-thought reasoning"
Clipped surrogate loss: A PPO-style objective that limits the magnitude of policy updates for stability. "incorporates a clipped surrogate loss to moderate policy updates"
Curriculum learning: Training strategy that presents tasks in increasing difficulty to match a model’s current capacity. "curriculum learning strategies"
DAgger: An on-policy imitation learning algorithm where a teacher provides corrections on states visited by the learner. "DAgger (Ross et al., 2011)"
Exposure bias: A training–inference mismatch where a model is trained on ground-truth sequences but must generate on its own at test time. "Supervised fine-tuning suffers from exposure bias"
f-divergence: A family of divergence measures used to quantify differences between probability distributions. "yielding a proper token-level f-divergence"
Group Relative Policy Optimization (GRPO): An RL algorithm for LLMs that uses group-wise normalized rewards to compute advantages. "Group Relative Policy Optimization (GRPO)"
Group-normalized reward: A reward normalized across a sampled group to compute advantages for each response. "using a group-normalized reward"
Imitation learning: Learning to mimic an expert policy by training on its guidance or demonstrations. "imitation learning (Ross et al., 2011)"
Importance ratio: The ratio used in off-policy correction to reweight updates based on the likelihood under old vs. new policies. "is the importance ratio"
Jensen–Shannon divergence ($\mathrm{JSD}_\beta$): A divergence built from KL divergences to a mixture of the two distributions, with mixture weight $\beta$; it is symmetric when $\beta = 0.5$. "$\mathrm{JSD}_\beta(p_T \,\|\, p_S)$"
KL divergence: A measure of how one probability distribution diverges from a reference distribution. "$D_{\mathrm{KL}}(p_T \,\|\, m)$"
Knowledge distillation: Training a student model to mimic a teacher’s soft outputs to transfer knowledge. "Knowledge distillation"
LoRA: Low-Rank Adaptation; a parameter-efficient fine-tuning method that injects trainable low-rank matrices. "LoRA (Hu et al., 2022)"
Monte Carlo estimate: A sampling-based approximation of an expected value or function. "serves as a G-sample Monte Carlo estimate"
Off-policy distillation: Distillation that supervises the student on fixed (teacher or dataset) trajectories rather than its own samples. "off-policy distillation methods"
On-policy distillation: Distillation that trains the student on its own generated trajectories with teacher guidance. "on-policy distillation"
On-Policy Self-Distillation (OPSD): A framework where a single model acts as both teacher (with privileged context) and student (without), minimizing divergence on the student’s rollouts. "On-Policy Self-Distillation (OPSD)"
Pass@K: A metric reporting whether at least one of K sampled generations solves a problem. "Pass@K performance"
Per-token divergence: A discrepancy measure computed at each position to align next-token distributions. "per-token divergence"
Policy gradient: An RL method that updates policies in the direction of expected return gradients using sampled actions. "policy-gradient-style objective"
Privileged information: Extra context (e.g., ground-truth solution) available to the teacher but not the student at inference. "privileged information"
Process Reward Model (PRM): A model that provides dense, step-level feedback signals for reasoning trajectories. "process reward model (PRM)"
Proximal Policy Optimization (PPO): A popular RL algorithm that stabilizes training via clipped objective updates. "Proximal Policy Optimization (PPO)"
Rationalization: Explaining or validating a provided correct answer to derive consistent reasoning steps. "rationalization—explaining a given correct answer"
Reinforcement Learning with Verifiable Rewards (RLVR): RL approach for tasks with outcomes that can be automatically checked for correctness. "Reinforcement learning with verifiable rewards (RLVR)"
Reverse KL penalty: A regularizer discouraging excessive deviation from a reference policy by penalizing reverse KL. "reverse KL penalty"
State-action value: The expected return of taking an action in a state under a policy. "state-action value $Q(x, o_i)$"
Student policy: The policy that generates on-policy trajectories without privileged context and receives gradients. "student policy $p_S(\cdot \mid x)$"
Supervised fine-tuning (SFT): Post-training using labeled trajectories to improve task performance. "Supervised fine-tuning (SFT)"
Teacher policy: The policy that conditions on privileged information to produce guidance distributions. "teacher policy $p_T(\cdot \mid x, y^*)$"
Token-level supervision: Dense feedback applied to each generated token rather than only sequence-level rewards. "token-level supervision"
Value function: The expected return from a state under a policy, used for advantage estimation. "value function $V(x)$"
Visitation distribution: The distribution over states (or prefixes) induced by a policy during its rollouts. "optimizing directly on the student's visitation distribution"
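As an illustration of the Pass@K entry above, the sketch below uses the unbiased estimator that is standard in the code-generation evaluation literature: given `n` sampled generations of which `c` are correct, it estimates the chance that at least one of `k` randomly chosen samples is correct. Whether the paper computes Pass@K with this exact estimator is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n samples with c correct (requires k <= n)."""
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# 16 samples for one problem, 4 of them correct (illustrative numbers).
p1 = pass_at_k(n=16, c=4, k=1)
p8 = pass_at_k(n=16, c=4, k=8)
```

For `k = 1` the estimate reduces to the fraction of correct samples, which matches an average-accuracy metric over the same samples.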
