Self-Distilled Agentic Reinforcement Learning
Abstract: Reinforcement learning (RL) has emerged as a central paradigm for post-training LLM agents, yet its trajectory-level reward signal provides only coarse supervision for long-horizon interaction. On-Policy Self-Distillation (OPSD) complements RL by introducing dense token-level guidance from a teacher branch augmented with privileged context. However, transferring OPSD to multi-turn agents proves problematic: compounding multi-turn instability destabilizes supervision, while skill-conditioned privileged guidance requires asymmetric treatment, because negative teacher rejections may arise from imperfect skill retrieval or utilization. We introduce SDAR (Self-Distilled Agentic Reinforcement Learning), which treats OPSD as a gated auxiliary objective while keeping RL as the primary optimization backbone. SDAR maps detached token-level signals into a sigmoid gate, strengthening distillation on teacher-endorsed positive-gap tokens and softly attenuating negative teacher rejections. Across the Qwen2.5 and Qwen3 families on ALFWorld, WebShop, and Search-QA, SDAR substantially improves over GRPO (+9.4% on ALFWorld, +7.0% on Search-QA, +10.2% on WebShop-Acc), avoids the instability of naive GRPO+OPSD, and consistently outperforms hybrid RL-OPSD baselines across model scales.
Explain it Like I'm 14
What is this paper about?
This paper introduces a training method called SDAR (Self‑Distilled Agentic Reinforcement Learning) for teaching LLM “agents” to do multi‑step tasks, like searching the web, playing text games, or shopping online. The goal is to help these agents make better decisions over many steps without getting confused or “drifting” off track.
What questions are the researchers trying to answer?
The paper focuses on two simple questions:
- How can we combine two types of training—trial‑and‑error learning (RL) and learning from a “teacher” with extra hints (self‑distillation)—so an agent stays stable and improves over long, multi‑turn tasks?
- When the teacher sometimes gives bad or uncertain advice, how can we trust the good parts and ignore the harmful parts?
How does their method work?
Think of training an LLM agent like coaching a sports team:
- Reinforcement Learning (RL): This is like playing full games and getting a score at the end. If the team wins, great; if not, learn from it. It’s strong but only gives a “big picture” reward after many actions, so feedback is not very detailed.
- On‑Policy Self‑Distillation (OPSD): This is like having the same player look at the play again but with a “cheat sheet” of tips (called privileged context, such as retrieved “skills”). It gives advice at each small step (each token the model writes). This advice can be super helpful—but sometimes the cheat sheet is wrong or irrelevant.
The problem: In multi‑turn tasks, the agent’s path can drift from the teacher’s expectations. Then the teacher’s token‑by‑token advice can turn unstable and push the agent the wrong way.
The solution (SDAR): Keep RL as the main coach, and add the teacher’s advice carefully with a smart “gate.”
- The “gate” is like a dimmer switch that decides, for every token the agent writes, how much to trust the teacher’s advice:
  - If the teacher strongly supports what the agent wrote (a positive sign), turn the gate up and learn from it.
  - If the teacher disagrees or seems uncertain (a negative sign), turn the gate down and learn less from it.
- In everyday terms: SDAR listens more to clear, helpful hints and politely ignores doubtful ones, all while keeping RL in charge.
The gate uses simple signals to decide:
- How unsure the student is (if the model is uncertain, guidance helps).
- How much the teacher agrees with the exact token the student chose (focus on tokens the teacher endorses).
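The dimmer-switch idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the gate form sigmoid(beta * gap), the function names, and the example probabilities are all hypothetical, and only the gap signal is shown (the entropy signal is left out). The default beta = 5 follows the value quoted in the feasibility notes of this summary.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gap_gate(teacher_logp: float, student_logp: float, beta: float = 5.0) -> float:
    """Gate in (0, 1) for one student-sampled token.

    ASSUMPTION: the gate is sigmoid(beta * gap), where gap is the teacher
    log-probability minus the student log-probability on that token.
    In training the gate would be treated as a constant (no gradient).
    """
    gap = teacher_logp - student_logp
    return sigmoid(beta * gap)


# Hypothetical probabilities that each branch assigns to the sampled token.
endorsed = gap_gate(math.log(0.6), math.log(0.3))  # teacher likes the token more
neutral = gap_gate(math.log(0.3), math.log(0.3))   # teacher and student agree
rejected = gap_gate(math.log(0.1), math.log(0.3))  # teacher likes the token less

print(f"endorsed={endorsed:.3f} neutral={neutral:.3f} rejected={rejected:.3f}")
```

With this form, a positive gap pushes the gate toward 1 and a negative gap pushes it toward 0 without ever reaching 0, which matches the "soft attenuation" described above.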
They also tested different ways to fetch “skills” (short, structured tips) to use as privileged context:
- Smart retrieval (UCB), keyword matching, full retrieval, and even random skills.
- The gate filters noisy hints, so even random skills gave small improvements.
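For readers unfamiliar with UCB, the following is a minimal sketch of the textbook UCB1 rule applied to choosing among skill files. The summary does not give the paper's exact scoring formula, so the bonus term, the exploration constant `c`, the class name, and the skill file names are assumptions made for illustration.

```python
import math


class UCBSkillSelector:
    """Textbook UCB1 over a fixed set of skill files (hypothetical helper)."""

    def __init__(self, skill_ids, c: float = 1.0):
        self.c = c  # exploration strength (assumed value)
        self.counts = {s: 0 for s in skill_ids}
        self.mean_reward = {s: 0.0 for s in skill_ids}
        self.total = 0

    def select(self):
        # Try every skill once before trusting any estimate.
        for skill, n in self.counts.items():
            if n == 0:
                return skill

        def score(skill):
            bonus = self.c * math.sqrt(2.0 * math.log(self.total) / self.counts[skill])
            return self.mean_reward[skill] + bonus

        return max(self.counts, key=score)

    def update(self, skill, reward: float) -> None:
        # Incremental running mean of the rewards observed for this skill.
        self.total += 1
        self.counts[skill] += 1
        n = self.counts[skill]
        self.mean_reward[skill] += (reward - self.mean_reward[skill]) / n


# Toy simulation with made-up, deterministic rewards per skill file.
true_reward = {"skills/search.md": 1.0, "skills/cart.md": 0.0, "skills/filter.md": 0.0}
selector = UCBSkillSelector(true_reward)
for _ in range(200):
    arm = selector.select()
    selector.update(arm, true_reward[arm])

print(selector.counts)
```

In this toy run the useful skill ends up selected most of the time, while the others are still revisited occasionally because their uncertainty bonus grows as they go unsampled.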
What did they find?
Across three types of tasks and several model sizes (Qwen2.5 and Qwen3 families), SDAR:
- Beat pure RL (GRPO) by a clear margin:
  - ALFWorld (a text game): about +9.4% improvement
  - Search‑QA (web search questions): about +7.0% improvement
  - WebShop (online shopping): about +10.2% accuracy improvement (for the 7B model)
- Stayed stable and avoided the crashes seen when you naively mix RL and OPSD.
- Internalized the underlying “skills”, so the model didn’t need external skill hints at test time, yet it still performed better than methods that do rely on those hints.
- Worked even when skill retrieval was weak; the gate filtered out bad advice and kept the good parts.
Why this matters: SDAR shows that letting each token “choose” how much guidance to accept makes training smoother, safer, and more effective over long sequences of actions.
What’s the impact?
SDAR makes it easier to train LLM agents that:
- Act reliably across many steps (multi‑turn), which is common in real applications like browsing, shopping, tools, and games.
- Learn from hints without getting misled when hints are noisy or partially wrong.
- Generalize better—handling new tasks without needing crutches like extra skill prompts.
In simple terms: SDAR teaches AI agents to be careful listeners. They take advice when it clearly helps and ignore it when it seems off, while still learning from trial and error. This balanced approach can lead to stronger, more trustworthy AI systems that handle complex, real‑world tasks more confidently.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a concise list of what remains missing, uncertain, or unexplored in the paper; each point is phrased to be directly actionable for follow-up research.
- Generality across backbones: SDAR is only tested on Qwen2.5/Qwen3 (1.7B–7B). It’s unknown whether the method transfers to other architectures (e.g., Llama, Mistral, GPT-style, mixture-of-experts) and to substantially larger models.
- RL backbone dependence: The approach is evaluated solely with GRPO. How SDAR interacts with other RL algorithms (e.g., PPO variants, AWR/IQL, RLHF/RLAIF, DPO/IPO, off-policy methods) is not assessed.
- Hyperparameter interactions: Only the SDAR-specific λ and β are ablated. The joint sensitivity with core RL hyperparameters (e.g., GRPO clipping ε, KL coefficient to reference, rollout group size G, sampling temperature/top-p) remains unexplored.
- Token-level vs action-level credit assignment: The method applies token-level gates uniformly across reasoning and action tokens. Whether separate gating/weighting for action tokens, thought tokens, or end-of-turn decisions improves learning is unknown.
- Single-sample reverse-KL estimator: The teacher–student gap uses a single sampled token for an RKL estimate. The variance/bias trade-offs, stability across seeds, and potential benefits of multi-sample or low-rank approximations are not analyzed.
- Gate design space: Gating is a fixed sigmoid on detached signals (gap/entropy/soft-OR). Open questions include learned gating networks, curriculum-conditioned gates, per-turn gates, adaptive/annealed β, or confidence-calibrated gates (e.g., temperature scaling or margin-based gating).
- Gradient flow through gates: The gate is detached (no gradient). It is unclear whether allowing controlled gradient flow (with regularization) could improve adaptability or harm stability.
- Distillation objective breadth: Only forward KL and JSD are compared against the default reverse KL. Other f-divergences (α-divergences, Rényi/Tsallis) and temperature-scaled teachers are not investigated.
- Interaction with policy KL-to-reference: The paper does not ablate how the reference-policy KL penalty interacts with SDAR’s auxiliary loss, nor whether rebalancing them alters stability or final performance.
- Privileged-context types: Experiments focus on “skills” as privileged context. It is open whether SDAR’s gating remains effective with other privileged signals (e.g., gold rationales, reference answers, verified tool traces, planner outputs, frontier states, or memory).
- Retrieval robustness boundary: While Random/Full/KM/UCB are tested, adversarially misleading or contradictory skills (worst-case retrieval) are not studied. Limits of robustness and failure modes under targeted distractors remain unknown.
- Retrieval policy learning: The UCB-based retrieval is simple and per-file. It’s unclear whether jointly learning retrieval under the SDAR objective (e.g., policy-gradient retrieval, contextual bandits with features, or learned retrievers) yields better outcomes or avoids cold-start issues.
- Long-horizon stability metrics: Instability is shown in pre-studies, but comprehensive metrics over horizon (e.g., per-turn KL, compounding error, recovery rates across environments) and how SDAR scales to much longer tasks are not provided.
- Benchmarks and ecological validity: Evaluation is limited to ALFWorld, WebShop (128 fixed tasks), and Search-QA. Generalization to more realistic, stochastic, or dynamic environments (real web, GUIs with drift, embodied robotics) is untested.
- Out-of-domain breadth: For Search-QA, only one in-domain (NQ/Hotpot) and several out-of-domain datasets are used. It remains unclear how SDAR fares on substantially different domains (code agents, SWE-bench, tool-use agents, program synthesis, safety-critical tasks).
- Sample efficiency and compute cost: SDAR requires on-policy rollouts plus teacher passes with privileged context. The paper lacks a thorough compute/memory/time profile, environment-interaction accounting, and comparisons in sample efficiency vs. baselines at equal budgets.
- Statistical robustness: Results lack confidence intervals and multi-seed variance analysis. Stability across random seeds, data shuffles, and skill-bank permutations is not quantified.
- Reward design interplay: How SDAR behaves under sparse, noisy, delayed, or shaped rewards (and with different verifiers) is not assessed; the method’s sensitivity to reward definitions is unclear.
- Negative guidance attenuation trade-offs: The gate soft-attenuates negative teacher signals; when the teacher is actually correct, this may slow learning. Criteria for when to trust negative guidance more (e.g., teacher certainty, cross-checks) are not explored.
- Curriculum at finer granularity: The paper relies on token-level gating rather than explicit curricula. Whether combining SDAR with adaptive, per-turn/per-skill curricula (or restart strategies) yields further gains remains open.
- Separation of training/inference distributions: SDAR removes skills at inference, but the extent to which learned policies rely on implicit patterns from training-only context (hidden overfitting) is not measured (e.g., probing under systematic distribution shifts).
- Multi-modal/tool-augmented agents: Applicability to agents using images, GUIs, speech, or complex tool chains is untested; how to define token-level gates for multi-modal action spaces is an open question.
- Safety and alignment impacts: The approach optimizes task performance without addressing safety constraints, hallucination control, or harmful behavior propagation via privileged guidance.
- SkillBank dependence and reproducibility: Results depend on a SkillBank from prior work; the paper does not detail how portable SDAR is to new domains without high-quality skill banks, nor provide guidelines for constructing such banks.
- Per-position dynamics: The paper shows average gap by relative position but does not exploit this to tailor gates across token positions/turn indices; position-aware gating or turn-aware weights may yield further stability.
- Inference-time calibration: No analysis of whether SDAR-trained policies require temperature/decoding calibration at test-time for best performance, or whether SDAR shifts calibration relative to GRPO.
- Combination with offline data: It is unknown whether pretraining a distillation signal offline (e.g., offline OPSD or supervised signals) before on-policy SDAR improves stability or sample efficiency.
- Theoretical guarantees: Beyond an appendix, there is no formal characterization of convergence/stability conditions for gated OPSD with on-policy RL in multi-turn settings; developing bounds or sufficient conditions remains open.
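The single-sample reverse-KL estimator mentioned in the list above can be made concrete with a small experiment. The sketch below assumes toy three-token distributions and hypothetical function names; it relies only on the identity that the expectation of log π_θ(y) − log π_T(y) under y ~ π_θ equals the reverse KL divergence, and it shows why the variance of the one-sample version is a fair concern.

```python
import math
import random


def reverse_kl(student, teacher):
    """Exact D_KL(student || teacher) over a small vocabulary."""
    return sum(p * (math.log(p) - math.log(q)) for p, q in zip(student, teacher) if p > 0)


def single_sample_estimate(student, teacher, rng):
    """One-sample estimate: sample a token from the student, then compare log-probs."""
    idx = rng.choices(range(len(student)), weights=student, k=1)[0]
    return math.log(student[idx]) - math.log(teacher[idx])


# Toy next-token distributions (made-up numbers).
student = [0.7, 0.2, 0.1]
teacher = [0.5, 0.4, 0.1]

rng = random.Random(0)
exact = reverse_kl(student, teacher)
samples = [single_sample_estimate(student, teacher, rng) for _ in range(20000)]
mean_est = sum(samples) / len(samples)

print(f"exact={exact:.4f} mean_of_samples={mean_est:.4f}")
print(f"min={min(samples):.4f} max={max(samples):.4f}")
```

Individual estimates can be negative even though the true divergence is non-negative; the negation of each estimate is the teacher-student log-probability gap described in the glossary.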
Practical Applications
Immediate Applications
Below are applications that can be built or upgraded now by incorporating SDAR’s training recipe (RL backbone + token-level gated self-distillation) with modest engineering effort. Each item names sectors, suggests concrete tools/workflows, and lists key assumptions or dependencies.
- Web shopping and product discovery assistants — [Sector: E-commerce, Software]
- Train browser-based agents to navigate product catalogs, filter by specs, compare options, and complete purchases more reliably (WebShop analogue), with fewer failures and reduced prompt length at inference (via skill internalization).
- Potential tools/products/workflows: “SDAR Shopper” fine-tuning kit for retail sites; internal SkillBank seeded from merchandising SOPs; verifier-driven task evaluators (e.g., product-matching rules).
- Assumptions/dependencies: Access to a scripted or instrumented shopping environment and simple verifiers; an initial SOP/skill corpus; model fine-tuning capacity.
- Search-augmented research copilot — [Sector: Enterprise knowledge work, Education, Media]
- Train agents that plan multi-step web searches, read sources, and synthesize answers (Search-QA), with SDAR reducing instability from drift and noisy retrieval while internalizing query patterns to shorten contexts.
- Potential tools/products/workflows: “SDAR Search” training pipeline integrated with enterprise search; gate-metrics dashboard (gap statistics, gate-activation ratio) for model monitoring; query planners distilled from privileged context.
- Assumptions/dependencies: Verifiers or automatic graders for answer correctness; retriever (e.g., E5 or in-house); curated query templates/skills.
- Customer support and IT helpdesk triage — [Sector: Customer Support, IT]
- Multi-turn agents that collect required information, navigate KBs/tools, and recommend next actions; SDAR helps internalize SOPs so the agent can operate with smaller prompts and fewer KB reads at inference.
- Potential tools/products/workflows: Fine-tune “SDAR Triage” on ticket logs with tool feedback as reward; SkillBank built from troubleshooting playbooks; plug-in verifiers (checklist completion, escalation rules).
- Assumptions/dependencies: Access to historical logs, tool APIs, and success verifiers; governance for PII handling; domain-specific skill curation.
- Robotic process automation (RPA) for web GUIs — [Sector: Operations, Finance, HR]
- Train agents to perform repetitive, multi-step form-filling, reconciliation, and report downloads across web portals. SDAR’s gating mitigates brittleness when instructions or portals vary.
- Potential tools/products/workflows: “SDAR-RPA” wrapper for browser automation frameworks (Playwright/Selenium); SkillBank from existing runbooks; verifiers for successful submission or record matching.
- Assumptions/dependencies: Stable DOM selectors or robust UI instrumentation; success criteria available as programmatic checks.
- Guided data collection and labeling workflows — [Sector: Data/ML Ops]
- Use SDAR to train label-assist agents that search, verify, and propose labels with on-policy feedback, while gating prevents overfitting to noisy teacher signals.
- Potential tools/products/workflows: Distillation gates as quality filters; bandit-based skill retrieval (UCB) to improve labeling prompts; “gate-on-noise” heuristics in annotation UIs.
- Assumptions/dependencies: Lightweight verifiers (agreement rules, spot audits); ability to log token-level signals during training.
- Tool-use and API orchestration agents — [Sector: Software, Cloud Platforms]
- Improve reliability of agents that call APIs (calendar, CRM, internal microservices) over multiple turns. SDAR reduces compounding errors and internalizes tool calling conventions.
- Potential tools/products/workflows: “SDAR ToolKit” training layer atop GRPO; SkillBank generated from OpenAPI specs and working examples; verifier checks on API call correctness and side effects.
- Assumptions/dependencies: Clear success metrics (e.g., API response validation); curated tool-use skills; access to sandboxed environments.
- Cost and latency reduction via skill internalization — [Sector: Any LLM-deploying org]
- Replace large inference-time skill prompts with parameters learned via SDAR, cutting token costs and latency while preserving performance (shown by Skill-GRPO vs SDAR).
- Potential tools/products/workflows: “Skill Internalizer” service that ingests SOPs/wikis and emits a fine-tuned model; prompt-length vs quality dashboards.
- Assumptions/dependencies: Adequate fine-tuning budget; stable SOPs; guardrails for drift monitoring post-internalization.
- Academic agent benchmarks and reproducible research — [Sector: Academia]
- Apply SDAR to standard agentic benchmarks (ALFWorld, WebShop, Search-QA) and to new domains (e.g., SWE-bench-like code agents, mobile agents) with more stable training and clearer diagnostics (gate ratios, gap trends).
- Potential tools/products/workflows: Open-source SDAR training harness; token-level gating ablation suite; standard reporting of gate-activation ratio and mean gap.
- Assumptions/dependencies: Access to evaluation environments and verifiers; compatible open-weight base models.
- Retrieval-robust training where KB quality is uneven — [Sector: Enterprises with legacy knowledge bases]
- SDAR’s gate filters noisy or irrelevant skills (even random retrieval yields gains), making it practical to start before perfect KB cleanup.
- Potential tools/products/workflows: “Retrieval Robustifier” that wraps existing retrieval with UCB and SDAR gates; incremental KB improvement based on skill win-rates.
- Assumptions/dependencies: Minimal KB indexing; logging pipeline to track skill contributions and rewards.
- MLOps monitoring and safety overlays — [Sector: ML Platform, Policy/Compliance]
- Use gate statistics (e.g., fraction of tokens with g_t > 0.5, mean Δ_t) as health indicators; alert on spikes that correlate with instability or distribution shift.
- Potential tools/products/workflows: “Gated Distillation Monitor” exporting time-series of KL, gap, gate activation; policy hooks to down-weight SDAR when negative gaps dominate.
- Assumptions/dependencies: Training-time telemetry capture; thresholds tuned per domain; human-in-the-loop review for safety-sensitive changes.
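The monitoring idea in the last item above can be sketched as a small helper that turns per-token gaps and gates into the health indicators it names. This is an illustrative sketch only: the class and function names, the 0.5 alert threshold on the negative-gap fraction, and the example batch are assumptions, and the gates are taken as inputs so that no particular gate formula is implied.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class GateStats:
    mean_gap: float               # mean teacher-student gap over tokens
    gate_activation_ratio: float  # fraction of tokens with gate above threshold
    negative_gap_fraction: float  # fraction of tokens the teacher "rejects"


def gate_stats(gaps: Sequence[float], gates: Sequence[float],
               threshold: float = 0.5) -> GateStats:
    """Summarize logged per-token gaps and gates for one batch."""
    if len(gaps) != len(gates):
        raise ValueError("gaps and gates must have the same length")
    if not gaps:
        raise ValueError("need at least one token")
    n = len(gaps)
    return GateStats(
        mean_gap=sum(gaps) / n,
        gate_activation_ratio=sum(1 for g in gates if g > threshold) / n,
        negative_gap_fraction=sum(1 for d in gaps if d < 0) / n,
    )


def negative_gaps_dominate(stats: GateStats, max_negative_fraction: float = 0.5) -> bool:
    """Hypothetical alert rule: flag batches where most tokens have a negative gap."""
    return stats.negative_gap_fraction > max_negative_fraction


# Made-up telemetry for a four-token batch.
batch = gate_stats(gaps=[0.5, -0.25, 0.25, -1.0], gates=[0.9, 0.2, 0.8, 0.01])
print(batch, negative_gaps_dominate(batch))
```

A training loop could export these three numbers as time series and down-weight the distillation term whenever the alert fires; the right threshold would need tuning per domain.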
Long-Term Applications
These opportunities likely require additional research, domain validation, safety layers, or larger-scale deployment infrastructure before broad rollout.
- Home and service robotics with language interfaces — [Sector: Robotics, Consumer]
- Extend SDAR-trained agents from ALFWorld-like simulations to real robots for household tasks (pick/place, clean/heat/cool workflows), using privileged training contexts (maps, affordances) that are removed at inference.
- Potential tools/products/workflows: Sim2real curricula with SDAR gating; safety verifiers for manipulation; on-device distilled policies with shorter prompts.
- Assumptions/dependencies: High-fidelity simulators and verifiers; robust perception-action loops; strong safety certification.
- Clinical search and decision support — [Sector: Healthcare]
- Multi-step agents that search literature/guidelines and structure recommendations; SDAR may internalize clinical reasoning patterns while filtering noisy retrieval.
- Potential tools/products/workflows: Clinically validated verifiers (checklists, guideline concordance); SkillBank built from care pathways; audit logs of gate/gap metrics for regulators.
- Assumptions/dependencies: Rigorous clinical oversight; de-identification; liability and regulatory approvals; gold-standard verifiers.
- Regulatory and compliance assistants — [Sector: Finance, Legal, Public Policy]
- Agents that navigate statutes, filings, and policies over long horizons to prepare memos or compliance checks; SDAR stabilizes training with imperfect policy KBs.
- Potential tools/products/workflows: Verifiers tied to rule coverage and citation accuracy; “Compliance SkillBank” seeded from internal policies; attestation reports with gate diagnostics.
- Assumptions/dependencies: Up-to-date, authoritative sources; clear success metrics; human review loops.
- Autonomous scientific discovery and lab automation — [Sector: R&D, Biotech, Materials]
- Plan experiments, search literature/protocols, and operate instrument APIs. SDAR’s gated distillation could internalize lab SOPs and prioritize reliable steps.
- Potential tools/products/workflows: Verifiers based on experimental outcomes or simulator checks; UCB for protocol retrieval; audit trails of gate decisions for reproducibility.
- Assumptions/dependencies: Safe sandboxing for instrument control; robust simulators or quick surrogate verifiers; data governance.
- Enterprise-wide OS-level personal assistants — [Sector: Productivity, Platforms]
- Cross-application multi-turn agents (email → calendar → CRM → docs) with SDAR to manage drift and internalize app-specific skills while minimizing context sprawl.
- Potential tools/products/workflows: Unified SkillBank distilled from app usage logs; OS orchestration APIs; cost-aware prompt budgeting boosted by internalization.
- Assumptions/dependencies: Deep tool integration; privacy and access control; comprehensive verifiers for multi-app workflows.
- Standardized safety certification for agent training — [Sector: Standards, Policy]
- Use SDAR’s token-level metrics as part of certification (e.g., bounding negative-gap exposure, stability under noisy retrieval) to approve agent deployments.
- Potential tools/products/workflows: “Agent Stability Report” templates; stress tests with randomized skills; policy to throttle distillation when risk flags trip.
- Assumptions/dependencies: Consensus on metrics; third-party audits; domain-specific failure taxonomies.
- On-device and edge deployment of agentic models — [Sector: Mobile, IoT]
- Internalization reduces context size and dependency on live retrieval, enabling lighter, faster on-device assistants for constrained environments.
- Potential tools/products/workflows: SDAR fine-tunes targeting small/quantized models; periodic server-side refresh of internalized skills; hybrid on-device verification.
- Assumptions/dependencies: Efficient base models; battery/compute budgets; offline-capable verifiers.
- Continual skill acquisition with bandit-driven retrieval — [Sector: Any domain with evolving SOPs]
- Couple SDAR with UCB skill selection in a continual-learning loop to discover and internalize high-value skills over time without bloating prompts.
- Potential tools/products/workflows: “Skill Uplift” service tracking skill win-rates; automatic retirement/refresh of low-utility skills; governance for concept drift.
- Assumptions/dependencies: Stable reward signals; careful mitigation of catastrophic forgetting; lifecycle management for skills.
- Code agents for maintenance and QA — [Sector: Software Engineering]
- Multi-turn agents that read repos, run tests, and propose fixes; SDAR to filter noisy teacher hints and internalize repo-specific patterns, lowering context costs.
- Potential tools/products/workflows: Verifiers via unit/integration tests; SkillBank from past fixes and code review comments; RL+SDAR adapters in SWE-bench-like setups.
- Assumptions/dependencies: High-quality test coverage; sandboxed execution; secure handling of proprietary code.
- Complex, multi-party workflow orchestrators — [Sector: Supply Chain, Government, Large Enterprises]
- Agents coordinating tasks across stakeholders and systems over long horizons; SDAR helps maintain stability and internalize procedural knowledge while minimizing constant KB lookups.
- Potential tools/products/workflows: Verifiers tied to milestone completion; cross-system SkillBank; monitoring of gate activity as an early-warning signal for drift.
- Assumptions/dependencies: Reliable instrumentation and reward shaping; robust identity/permission management; change management processes.
Notes on feasibility and general dependencies common across applications:
- Access to verifiers or reward functions is critical; tasks without clear success signals will require surrogate metrics or human-in-the-loop feedback.
- A SkillBank (SOPs, examples, templates) accelerates training; SDAR is robust to imperfect retrieval but benefits from higher-quality skills.
- Compute and data budgets are needed for RL-style fine-tuning; hyperparameters (e.g., λ_SDAR ≈ 0.01, β ≈ 5) require validation per domain.
- Safety, privacy, and regulatory constraints must be addressed for sensitive domains; SDAR’s telemetry (gap and gate statistics) can support audits and monitoring.
- Integration with tool APIs, browsers, or simulators is necessary to collect rewards and enable multi-turn interaction during training.
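To show how small the auxiliary weight mentioned in the notes above is in practice, here is a rough sketch of combining an RL loss with a gated distillation term. The function name, the averaging over tokens, and the example numbers are assumptions; the per-token distillation losses are passed in as plain numbers so that no specific surrogate is implied. Only the default weight of 0.01 comes from this summary.

```python
from typing import Sequence


def combined_loss(rl_loss: float,
                  token_distill_losses: Sequence[float],
                  gates: Sequence[float],
                  lam: float = 0.01) -> float:
    """RL loss plus lam times the gate-weighted mean of per-token distillation losses.

    ASSUMPTION: the auxiliary term is a simple token average; the RL loss
    itself is left untouched, as the glossary entry on auxiliary objectives describes.
    """
    if len(token_distill_losses) != len(gates):
        raise ValueError("one gate per token is required")
    if not token_distill_losses:
        return rl_loss
    aux = sum(g * l for g, l in zip(gates, token_distill_losses)) / len(gates)
    return rl_loss + lam * aux


# Made-up values: two tokens, the second one half-attenuated by its gate.
total = combined_loss(rl_loss=1.0, token_distill_losses=[2.0, 4.0], gates=[1.0, 0.5])
print(total)

# With every gate closed, the objective reduces to the RL loss alone.
rl_only = combined_loss(rl_loss=1.0, token_distill_losses=[2.0, 4.0], gates=[0.0, 0.0])
print(rl_only)
```

Because the weight is small, the distillation term acts as a nudge on top of RL rather than a competing objective, which is consistent with keeping RL as the primary backbone.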
Glossary
- Advantage: A scalar signal estimating how much better a taken action (or sequence) is than a baseline, used to weight policy updates in RL. "and computes a sequence-level advantage from environment rewards."
- Auxiliary objective: A secondary loss added to the main training objective to provide additional guidance without altering the primary optimization target. "the OPSD loss is treated as a direct, auxiliary optimization objective, leaving the verifier-driven RL policy loss untouched"
- Entropy (student entropy): A measure of uncertainty in the model’s output distribution; higher entropy indicates greater uncertainty. "denote the student entropy at position $t$."
- Entropy gating: A gating strategy that increases distillation strength on tokens where the student is most uncertain. "Entropy gating: "
- Exploration–exploitation trade-off: The balance between trying new options (exploration) and leveraging known good options (exploitation) in decision-making. "and controls the exploration–exploitation trade-off."
- Forward KL: The Kullback–Leibler divergence in the direction $D_{\mathrm{KL}}(\pi_T\|\pi_{\theta})$, encouraging mode covering by matching the teacher distribution broadly. "the mode-covering nature of forward KL"
- Full Retrieval: A retrieval strategy that supplies the complete set of available skills or context, regardless of specificity. "We implement four retrieval strategies ... (3) Full Retrieval, and (4) Random Retrieval."
- Gap gating: A gating strategy that weights distillation by the teacher–student log-probability gap, strengthening positive endorsements and attenuating negative ones. "Gap gating: "
- GRPO: Group Relative Policy Optimization, a policy-gradient RL method with clipping and KL regularization tailored for LLMs. "Compared to GRPO, it delivers substantial gains"
- Importance sampling ratio: The ratio between current and behavior policy probabilities for a sampled action, used to correct policy-gradient estimates. "where $r_t^{(i)}=\pi_{\theta}(y_t^{(i)} \mid s_t^{(i)}) / \pi_{\theta_{\mathrm{old}}}(y_t^{(i)} \mid s_t^{(i)})$ is the importance sampling ratio."
- Jensen–Shannon divergence (JSD): A symmetrized and smoothed version of KL divergence measuring similarity between two distributions. "Jensen–Shannon divergence (JSD)"
- Keyword Matching (KM): A retrieval heuristic that selects skills by matching task keywords to predefined categories. "Keyword Matching bypasses the bandit formulation and instead identifies the task scenario by matching keywords"
- KL divergence: A measure of how one probability distribution diverges from another; used here to quantify teacher–student mismatch. "This compounding error leads to surging per-turn KL divergence"
- Logistic sigmoid: The squashing function $\sigma(x) = 1/(1+e^{-x})$ that maps real values to (0,1), used here to define smooth gates. "We compose each raw score with the logistic sigmoid"
- Mode-covering: A behavior that pushes the student to cover all modes of the teacher distribution, potentially spreading probability mass too broadly. "the mode-covering nature of forward KL"
- Mode-seeking: A behavior that pushes the student to concentrate on peaks (modes) of the teacher distribution, focusing probability mass. "the reverse direction $D_{\mathrm{KL}}(\pi_{\theta}\|\pi_T)$ is inherently mode-seeking"
- Multi-armed bandit: A sequential decision framework for balancing exploration and exploitation over multiple choices (arms). "Skill retrieval is cast as a multi-armed bandit problem"
- On-Policy Distillation (OPD): Distilling knowledge using data generated by the current policy to avoid distribution shift. "On-Policy Distillation (OPD) ... provide dense token-level guidance"
- On-Policy Self-Distillation (OPSD): Distillation where the teacher is a variant of the same model with privileged context, providing token-level guidance on-policy. "On-Policy Self-Distillation (OPSD) complements RL by introducing dense token-level guidance from a teacher branch augmented with privileged context."
- Privileged context: Training-only information (e.g., retrieved skills or references) available to the teacher but not at test time. "where denotes privileged training-only context available only to the teacher branch"
- Random Retrieval: A retrieval strategy that selects skills uniformly at random without task awareness. "and (4) Random Retrieval."
- Reference policy: A fixed or slowly moving policy used to regularize updates via KL penalties in policy optimization. "Using a reference policy $\pi_{\mathrm{ref}}$, the GRPO objective can be written as"
- Reverse KL divergence: The divergence $D_{\mathrm{KL}}(\pi_{\theta}\|\pi_T)$, often mode-seeking and used as the per-token distillation objective. "The per-token reverse KL divergence is defined as:"
- RLSD: A hybrid method that re-weights RL updates using self-divergence signals from the policy itself. "RLSD~\citep{yang2026rlsd} directly uses self-divergence to re-weight token-level RL advantages"
- Self-divergence: A measure of discrepancy between different forms or contexts of the same policy, used to modulate learning signals. "directly uses self-divergence to re-weight token-level RL advantages"
- Self-paced curriculum: An adaptive training schedule where the difficulty or intensity of supervision adjusts automatically based on signals from the learner. "This yields a dynamic, self-paced curriculum operating at the finest possible granularity: the individual token level."
- Sigmoid gate: A smooth gating weight in (0,1) produced by a sigmoid, used to modulate distillation strength per token. "SDAR maps detached token-level signals into a sigmoid gate"
- Skill-conditioned (privileged guidance): Guidance whose quality depends on retrieved skills, leading to asymmetric trust in teacher signals. "skill-conditioned privileged guidance requires asymmetric treatment"
- Skill-GRPO: A variant of GRPO that injects retrieved skills into prompts during training (and optionally inference). "Skill-GRPO augments GRPO by retrieving skills via KM and injecting them into the training prompt"
- Skill-SD: A hybrid distillation method that conditions on skills, typically with hand-crafted schedules. "such as Skill-SD~\citep{wang2026skillsd} and HDPO~\citep{ding2026hdpo}"
- Soft-OR gating: A gating strategy that combines multiple signals (e.g., entropy and gap) in a soft logical-OR fashion. "Soft-OR gating: "
- Stop-gradient (sg): An operator that prevents gradients from flowing through a quantity, treating it as a constant during backpropagation. "the gate is detached via $\mathrm{sg}(\cdot)$, so gradients flow exclusively through the student log-probability."
- Teacher branch: The teacher model path (often the same architecture) that has access to privileged context and provides guidance. "from a teacher branch augmented with privileged context"
- Teacher-Student log-probability gap: The difference between teacher and student log-probabilities on the sampled token, used as an importance signal. "The negation of this estimate directly yields the Teacher-Student log-probability gap $\Delta_t$:"
- TIP (Token Importance): A method that uses token-level signals to prioritize supervision or weighting during distillation. "Inspired by TIP~\citep{xu2026tip}"
- Token-level gating: Modulating distillation at the granularity of individual tokens via gates that depend on token-specific signals. "We introduce a token-level gate that modulates the OPSD signal on each student-sampled token"
- Token-level surrogate: A sampled-token approximation used to estimate otherwise expensive token-level divergences or losses. "and apply it to a sampled-token surrogate"
- UCB (Upper Confidence Bound): A bandit algorithm that selects actions to maximize an optimism-adjusted reward estimate. "according to the Upper Confidence Bound (UCB) criterion:"
- UCB Retrieval: A retrieval strategy that chooses skills using the UCB rule based on past rewards and selection counts. "We implement four retrieval strategies ... (1) UCB Retrieval, (2) Keyword Matching (KM), (3) Full Retrieval, and (4) Random Retrieval."
- Verifier-driven RL: An RL setup where the reward or feedback comes from an external verifier checking solution correctness. "leaving the verifier-driven RL policy loss untouched"
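Two of the glossary entries above, Advantage and Importance sampling ratio, can be tied together with a short sketch of the standard GRPO recipe. It assumes the commonly used form (rewards normalized by the group mean and population standard deviation, and a PPO-style clipped term); the function names, the clip range of 0.2, and the example rewards are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Sequence-level advantages: each reward normalized within its rollout group."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / g)  # population std (assumed)
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_term(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """PPO-style clipped surrogate for one token (to be maximized)."""
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)


# A group of four rollouts for the same task: two succeed, two fail.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advantages])

# If every rollout gets the same reward, the group carries no learning signal.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))

# Clipping caps how much a large ratio can amplify a positive advantage.
print(clipped_term(ratio=1.5, advantage=1.0))
```

Every token in a rollout shares that rollout's advantage, which is the coarse trajectory-level signal that the token-level distillation term is meant to complement.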