Papers
Topics
Authors
Recent
Search
2000 character limit reached

Self-Policy Distillation via Capability-Selective Subspace Projection

Published 21 May 2026 in cs.CL | (2605.22675v1)

Abstract: Self-distillation bootstraps LLMs by training on their own generations. However, existing methods either rely on external signals to curate self-generated outputs (e.g., correctness filtering, execution feedback, and reward search), which are costly and unavailable for the best-performing frontier models, or skip curation entirely and train on all raw outputs, an approach that is often domain-specific and hard to generalize. Both also share a deeper weakness that self-generated outputs entangle task-relevant capability with others, such as stylistic patterns, formatting artifacts, and model-specific errors, diluting the signal for the specific capability one aims to improve. In this paper, we propose Self-Policy Distillation (SPD), which achieves generalizable, capability selective without any external signal. Specifically, SPD extracts a low-rank capability subspace from the model's own gradients on correctness-defining tokens, projects key-value (KV) activations into this subspace during self-generation, and fine-tunes on the resulting raw outputs with standard next-token prediction loss. Through extensive experiments across code generation, mathematical reasoning, and multiple-choice QA, we show that SPD achieves up to 13% improvement over state-of-the-art self-distillation methods without external signals and up to 16% improvement over pre-trained baselines. Notably, SPD demonstrates superior generalizability, achieving 15% better performance under out-of-domain generalization settings.

Summary

  • The paper introduces Self-Policy Distillation (SPD), a teacher-free method that extracts low-rank subspaces using correctness-aligned gradients.
  • It applies projection hooks during generation to focus output on target capabilities, yielding more accurate and less noisy completions.
  • Empirical evaluations show up to 16% improvement over baseline models across multiple domains and LLM backbones, highlighting robust cross-domain generalization.

Self-Policy Distillation via Capability-Selective Subspace Projection

Overview and Motivation

Self-distillation for LLMs aims to improve models by leveraging their own generations as supervision, addressing the bottlenecks of external teacher dependency and the inefficiency of on-policy or off-policy teacher–student paradigms. Existing self-distillation frameworks either employ costly external signals to curate generations or use unfiltered raw outputs, which both suffer from confounding between task-relevant signal and irrelevant artifacts, reducing transferability and diluting gains. This work introduces Self-Policy Distillation (SPD), a capability-selective, domain-general self-distillation paradigm requiring no external signal. SPD exploits internal model gradients on correctness-defining tokens to extract low-rank subspaces in the key-value (KV) activation space, steering model generation toward the target capability prior to standard self-distillation. SPD thereby selectively preserves capability-relevant information and suppresses model-specific errors and stylistic artifacts. Figure 1

Figure 1: SPD comparisons and performance summary, including comparison with existing self-distillation methods and aggregated performance improvements across five backbone models and multiple domains.

Method: Capability-Selective Subspace Induction

SPD operates in two phases:

Phase 1: Capability Subspace Extraction.

Given a small calibration set with gold pairs, SPD computes gradients at specific layers with respect to the loss over only those tokens that directly determine correctness (e.g., answer tokens, assertion outputs). The procedure is as follows:

  • For each calibration example and each target layer, compute token-level gradients of the correctness-aligned loss with respect to K and V activations.
  • Aggregate these gradients to form matrices GKG_K and GVG_V per layer.
  • Apply SVD, extracting the top rr singular directions to obtain orthogonal projection matrices (PK,PV)(P_K, P_V)—the capability subspace.

Phase 2: Policy Distillation via Subspace Projection.

During self-generation over the large unlabeled prompt corpus, insert "projection hooks" at target layers that project KV activations onto the extracted subspaces. Model parameters remain static; only intermediate activations are filtered. The resulting generations are shaped toward the target capability, suppressing noise and task-irrelevant artifacts. After generation, standard LoRA fine-tuning is performed on these (prompt, SPD-completion) pairs using standard next-token loss, with hooks removed during optimization. Figure 2

Figure 2: Overview of SPD—a two-phase process that extracts low-rank capability subspaces from gradients and applies them as projection hooks during self-generation before fine-tuning on shaped outputs.

Empirical Evaluation

Experiments employ five LLM backbones (Qwen2.5-0.5B/7B/14B, Qwen3-4B, Llama-3.1-8B) across three domains (code, mathematics, QA) under both in-domain (domain transfer within capability) and out-of-domain generalization (domain transfer across capability). SPD is compared against baseline self-distillation (SSD) and plain self-retraining (PSR, finetuning on raw completions).

Key results:

  • Domain-specific accuracy: SPD achieves up to 13% improvement over the best non-external-signal self-distillation baseline and up to 16% over pretrained base models within domain. For out-of-domain (e.g., QA calibration applied to Math and Code), gains reach up to 15%.
  • Generalization: SPD's improvements are robust: on 13/15 in-domain transfer benchmarks, SPD outperforms the base model (versus 7/15 for PSR and 8/15 for SSD), indicating better learned transfer rather than overfitting to calibration data.
  • SPD-generated data is more focused and less noisy, as shown both qualitatively (more direct, less verbose outputs with fewer artifacts) and quantitatively (better pre-finetuning accuracy on correctness metrics).

Qualitative and Diagnostic Analyses

To elucidate the impact of SPD's projection hooks, output comparisons indicate:

  • Base model completions often contain verbose, off-task explanations and mixed formatting, diluting supervision.
  • SSD completions are less verbose but may preserve incidental errors and imitative artifacts.
  • SPD completions are more compact, task-focused, and better aligned with the correctness span defined by the subspace extraction phase. Figure 3

    Figure 3: Sample self-generated outputs under different self-distillation regimes, highlighting SPD’s superior focus on task-critical information and suppression of irrelevant or error-prone content.

Ablative results confirm that full-sequence loss (on all tokens) for gradient extraction is clearly inferior to correctness-aligned loss, with differences exceeding 10% absolute on several benchmarks.

Calibrating the subspace with only 20–500 gold pairs suffices for strong generalization, indicating SPD's label efficiency.

Theoretical and Practical Implications

SPD demonstrates that internal transformation of the policy via subspace-based activation projection is effective for selective, domain-transferable self-distillation—without requiring any external verification signals, reward models, or task-specific curation. The method confirms that the latent KV spaces in LLMs encode capability-relevant directions separable from spurious ones and that direct manipulation of internal representations can meaningfully control model output. This aligns with prior findings in latent-space steering, yet SPD operates with no teacher, leveraging the model's own gradients and selectivity.

Practically, this enables iterative self-improvement of proprietary or frontier LLMs where external signals (e.g., verifiers, strong teachers) are unavailable or unreliable, while also reducing reinforcement of confounding stylistic or reasoning artifacts. SPD is architecture-agnostic, domain-general, and efficient (requiring only modest calibration data and cheap online hooks), hence directly applicable in scalable, closed-loop model training systems.

Theoretically, these findings highlight the importance of fine-grained supervised signal at the representation level for robust skill acquisition and transfer, suggesting future directions for subspace-guided, capability-specific continual learning and modular capability bootstrapping. Open problems include more principled subspace construction, automatic identification of correctness spans, and deeper theoretical connection between KV subspaces and semantic/model behavior.

Conclusion

SPD presents a compelling, empirically validated refinement of self-distillation for LLMs, leveraging capability-selective subspace projection of internal activations to steer self-generated data toward target competencies. The combination of correctness-aligned calibration, subspace projection hooks, and teacher-free updating offers an efficient and general recipe for internal self-improvement and enhanced cross-domain generalization. This paradigm has practical impact for distillation in resource- or supervision-constrained model development, and sets a foundation for further research into internal controllability and modularity in large neural systems.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

Explain it Like I'm 14

Overview

This paper introduces a new way to help LLMs improve themselves, called Self-Policy Distillation (SPD). Instead of needing extra tools or teachers to judge the model’s answers, SPD teaches the model using its own outputs while gently steering those outputs to focus on the most important parts of the task. The idea is to keep the helpful skills (like accurate reasoning) and avoid learning bad habits (like wordy explanations or formatting quirks).

Key Objectives and Questions

The paper asks a simple question: How can an LLM improve by training on its own answers without any outside help, while still learning useful skills that work across different kinds of tasks?

To do this, the authors aim to:

  • Select the parts of the model’s “thinking” that most affect getting the right answer.
  • Use those parts to guide the model’s self-generated training data.
  • Show that this makes the model better not just on the original task (like coding), but also on other tasks (like math and multiple-choice questions).

Methods Explained Simply

Think of training a model like practicing for a test. You can learn from your own practice answers, but you should focus on the parts that actually decide whether your answer is correct—not the extra fluff. SPD does exactly that in two phases.

Here are the main ideas behind SPD, explained with everyday language:

  • Correctness-defining tokens: These are the exact pieces of the model’s output that make the answer right or wrong. For example:
    • In math, the final number or the key steps.
    • In code, the function body that solves the task.
    • In multiple-choice, the chosen letter.
    • SPD looks closely at these (instead of every single token) to understand what matters for correctness.
  • Gradients: A gradient shows how much each part of the model changes when it tries to fix its mistakes. You can imagine gradients as arrows pointing toward “what to adjust” to be more correct.
  • Key and Value activations (“KV activations”): In a transformer (the type of model LLMs use), attention works like a smart “lookup system.” Keys are like labels for information, and Values are the information itself. SPD focuses on the model’s Keys and Values—its internal notes about “what to pay attention to.”
  • SVD (Singular Value Decomposition): This is a math tool that finds the most important directions in a big set of arrows (the gradients). Imagine you have a bunch of arrows pointing in many directions; SVD finds the few main directions that matter the most. This gives a “capability subspace,” a small, focused area that captures the skill we want (like solving math accurately).
  • Projection hooks: These are temporary “filters” placed inside the model. When the model generates text, the hooks nudge its internal notes (Keys and Values) to stay within the useful capability subspace. The model’s weights aren’t changed; it’s more like wearing glasses that sharpen your focus while you write your answer. After generating training data with these glasses, the hooks are removed.

Putting it together, SPD works in two phases:

  • Phase 1: Build the focus filter
    • Use a small labeled set (calibration set).
    • Compute gradients only on correctness-defining tokens.
    • Use SVD to extract the main “skill directions” and build projection matrices (the filter).
  • Phase 2: Generate and learn
    • Attach the filters (projection hooks) to steer the model’s own generations toward correctness-focused content.
    • Collect this self-generated data.
    • Fine-tune the original model on that data using normal next-token prediction.

Main Findings and Why They Matter

The authors test SPD on three types of tasks:

  • Code generation (MBPP, CodeAlpaca)
  • Math reasoning (GSM8K, SVAMP)
  • Multiple-choice QA (MMLU, BBH)

Key results:

  • SPD improves performance without any external judging or reward models.
  • It beats other self-distillation methods that don’t use external signals, with up to 13% better results.
  • It beats the starting/base models by up to 16%.
  • It generalizes well:
    • In-domain: Gains transfer to other datasets within the same skill area (e.g., from one math dataset to another).
    • Out-of-domain: Gains also show up in different areas (e.g., calibrating with QA still helps on math and code), sometimes by as much as 15%.

Why this matters:

  • Training on your own outputs often risks learning your own mistakes or style habits. SPD reduces that risk by focusing the training data on the “skill core” (correctness parts), not the fluff.
  • This makes self-improvement more reliable, cheaper, and more general.

Implications and Potential Impact

  • No external signals needed: SPD doesn’t rely on teachers, verifiers, reward models, or special tools. That’s helpful when such tools are costly or unavailable, especially for the strongest frontier models.
  • Better generalization: By steering generation to correctness-focused content, the improvements carry over across tasks, not just within one domain.
  • Data-efficient: SPD uses a small calibration set to build its focus filters. You don’t need tons of labeled examples.
  • Practical training pipeline: SPD creates better self-generated training data and then uses standard fine-tuning, making it straightforward to apply.

A note on limitations:

  • While tested across code, math, and QA with multiple model sizes, SPD should be validated further on more diverse and high-stakes tasks to confirm robustness.

In short, SPD is like teaching yourself by watching your own performances but only paying attention to the key moments that decide whether you win or lose. That focus helps you improve faster, more reliably, and across different types of challenges.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise, actionable list of what remains missing, uncertain, or unexplored in the paper, to guide future research.

  • Dependence on correctness-defining spans:
    • Requires task-specific rules and ground-truth answers to mark correctness spans; unclear how to generalize to open-ended tasks (summarization, dialogue, translation) where “correctness” is ambiguous.
    • No analysis of sensitivity to mis-specified or noisy correctness-span annotations.
    • Lack of an automated, domain-agnostic procedure to infer correctness spans without labels.
  • Calibration data requirements and domain shift:
    • Although small calibration sets worked, sensitivity to label noise, domain mismatch, and sampling bias is untested.
    • No investigation of cross-lingual or cross-genre calibration (e.g., calibrate on QA in one language, transfer to another).
    • Missing guidelines for selecting calibration prompts to maximize transfer.
  • Subspace extraction design choices:
    • Fixed layer choice (middle and last) is not justified; no systematic study of layer selection, per-head vs. aggregated subspaces, or including additional components (queries, residual stream, MLPs).
    • Rank r selection is ad hoc; no criterion or trade-off analysis between preserving capability and over-pruning useful variation.
    • Alternatives to SVD on gradients (e.g., PCA on activations, Fisher/Hessian eigenvectors, CCA/CKA, randomized SVD) are not compared.
  • Computational scalability:
    • Gradient collection over KV activations across tokens and calibration examples may be costly for large models and long contexts; no profiling or memory/runtime analysis.
    • SVD on large gradient matrices may be prohibitive; no discussion of streaming/approximate methods or distributed implementations.
  • Hooking strategy and stability:
    • Only K/V are projected; impact of also transforming Q or other submodules is unexplored.
    • Distillation trains without hooks on data produced with hooks, creating a distribution mismatch; stability and retention of hooked behavior post-finetune are unquantified.
    • No assessment of regressions in non-target capabilities (e.g., fluency, formatting, safety) caused by KV projection.
  • Data generation policy:
    • One completion per prompt is used; no exploration of multi-sample generation, temperature/nucleus sweeps, or sampling diversity to improve the self-generated corpus.
    • Interaction between decoding settings and projection (e.g., temperature effects on capability selectivity) is unstudied.
  • Iterative and multi-capability extensions:
    • No experiments with multi-round SPD (recompute subspace after fine-tuning) to assess convergence, compounding gains, or collapse.
    • Combining multiple subspaces (e.g., code + math) or gating/conditioning subspace selection per prompt remains open; potential interference/conflict between capabilities is unexamined.
  • Generalization scope:
    • OOD evaluation is limited to a few datasets; transfer to harder or broader distributions (e.g., MATH, HumanEval+, ARC-DA, DROP, CSQA, GSM-hard) is not tested.
    • Only English benchmarks and text-only tasks are evaluated; multilingual and multimodal (vision-language) generalization is unaddressed.
    • Long-context tasks and retrieval-augmented generation (where KV dynamics are crucial) are not evaluated.
  • Comparisons and baselines:
    • Baselines omit stronger teacher-free or representation-control methods (e.g., CoDI/KV-edits variants) and strong external-signal methods (verifier-/reward-based) under matched compute.
    • No ablation against simpler projections (e.g., random low-rank, PCA on activations) to isolate gradient-specific benefits.
  • Interpretability and analysis of the subspace:
    • No qualitative/quantitative interpretation of learned directions (e.g., association with heads/neurons, attention patterns) or whether subspaces align across domains/backbones.
    • Lack of diagnostics to verify that the subspace filters style/formatting rather than inadvertently dropping subtle task-relevant cues.
  • Safety, bias, and alignment:
    • No measurements of toxicity, bias, factuality/hallucination, or calibration; potential amplification of model-specific biases via self-generated data is unassessed.
    • Effects on chain-of-thought quality and faithfulness (e.g., hiding reasoning vs. improving it) are not analyzed.
  • Applicability to rationale-centric tasks:
    • SPD appears to de-emphasize non-essential text; impact on tasks that require explicit, high-quality rationales or verifiable proofs is unknown.
  • Robustness and reproducibility:
    • No report of variance across random seeds, statistical significance, or sensitivity to hyperparameters (LoRA rank, learning rate, projection rank, layer set).
    • Implementation details on hook placement and KV-cache interactions (e.g., for long sequences) are sparse.
  • Training dynamics and error propagation:
    • Although projection filters capabilities, self-generated outputs still contain errors; the risk of error accumulation/drift across distillation rounds is not quantified.
    • No safeguards or lightweight correctness checks to mitigate propagation of systematic mistakes.
  • Parameter adaptation choices:
    • Only LoRA is used; the effect of other adaptation schemes (full fine-tuning, adapters, prefix-tuning, LoRA ranks) on absorbing the hooked policy is untested.
  • Practical deployment:
    • Real-time use of hooks at inference (without fine-tuning) to steer outputs is not evaluated.
    • Potential efficiency gains (or slowdowns) from reduced effective KV rank during decoding are unmeasured.
  • Task coverage and evaluation breadth:
    • Limited to code, math, and multiple-choice QA; absence of instruction-following, multi-turn dialogue, information extraction, reasoning with tools/executors, and safety-critical domains.
  • Subspace lifecycle:
    • Subspace may become stale after fine-tuning; mechanisms for online/adaptive subspace updates or prompt-conditional subspaces are not explored.
  • Failure modes and negative transfer:
    • Cases where calibration domain harms unrelated domains (negative transfer) are not systematically sought or measured.
    • Criteria for detecting and mitigating harmful subspaces are not provided.

Practical Applications

Immediate Applications

These applications can be built and deployed today by teams that have access to model weights (for gradient/backprop) and can insert inference-time hooks into attention KV activations.

  • Targeted self-distillation for open-source or on-prem LLMs
    • Sector: software/ML infrastructure
    • What: Use SPD to upgrade a base model’s specific capability (e.g., code correctness, math reasoning, factual QA) without verifiers, reward models, or RL. Pipeline: small calibration set → correctness-aligned gradient collection on chosen layers → SVD to obtain low-rank KV subspaces → hook projections during self-generation → LoRA fine-tune on generated corpus.
    • Tools/workflow: Hugging Face Transformers hooks; PyTorch autograd; SVD (e.g., PyTorch/NumPy); LoRA adapters; one-sample-per-prompt data generation; scheduled MLOps job.
    • Assumptions/dependencies: Access to weights and KV activations; ability to identify correctness spans (e.g., final numeric answer, option letter, code output); modest GPU for gradient passes and LoRA; decoding config that matches domain.
  • Capability-focused code assistant refinement
    • Sector: software engineering/devtools
    • What: Calibrate on unit tests or canonical function solutions to extract a “functional core” subspace; generate code with hooks to reduce boilerplate and stylistic detours; distill into a concise, correctness-oriented code assistant.
    • Tools/products: IDE plugin or CI job that runs SPD monthly on a repository’s tests; “Capability Subspace Packs” versioned per repo/component; LoRA deltas checked into the monorepo.
    • Assumptions/dependencies: Tests/ground truths that define correctness spans; access to model internals; guardrails to prevent catastrophic forgetting of style/conventions.
  • Concise and accurate math tutor (reasoning over verbosity)
    • Sector: education/edtech
    • What: Use small sets of worked solutions with final answers marked as correctness spans to extract a reasoning subspace that steers self-generated rationales away from extraneous text and toward steps that determine correctness.
    • Tools/products: LMS plugin to regularly generate improved synthetic solutions/explanations; teacher-facing dashboard comparing pre-/post-SPD error rates.
    • Assumptions/dependencies: Clearly identified answer tokens; safeguards to avoid suppressing legitimate intermediate steps needed for pedagogy.
  • Domain QA improvement without external verifiers
    • Sector: customer support/enterprise search
    • What: Calibrate on a small, curated set of Q&A pairs (correct answer letters or gold spans) to extract a “correctness” subspace; generate self-corpus under projection; LoRA fine-tune to get succinct, high-precision answers with fewer tangents.
    • Tools/products: CRM/KM integration that feeds calibration examples; nightly SPD job producing updated LoRA; A/B rollout.
    • Assumptions/dependencies: Annotation of correctness spans; access to internal LLM weights; monitored coverage to avoid domain drift.
  • Privacy-preserving capability tuning in air‑gapped environments
    • Sector: regulated industries (finance, government, defense)
    • What: Improve models using only internal labeled snippets (20–500 examples suffice) and no external verifiers or human-in-the-loop feedback pipelines.
    • Tools/products: Air-gapped SPD toolkit; policy-compliant LoRA deltas; reproducible calibration manifests.
    • Assumptions/dependencies: Compute on-prem; robust tokenization and span marking for compliance-critical answers.
  • Higher-yield synthetic data generation
    • Sector: data engineering/ML
    • What: Use projection hooks to bias self-generation toward capability-relevant content before any training, yielding cleaner, more targeted synthetic datasets for downstream fine-tuning (even for other models).
    • Tools/products: “SPD-Generate” job that emits labeled synthetic corpora; quality gates comparing pre-/post-hook sample utility metrics (e.g., pass@1, EM).
    • Assumptions/dependencies: Meaningful correctness spans; careful rank/layer selection to avoid over-pruning useful variance.
  • Lightweight interpretability and capability probing
    • Sector: academia/research; ML platform teams
    • What: Use correctness-aligned gradients to identify low-rank capability subspaces; characterize how capability signal localizes across layers; compare subspaces across domains (code/math/QA).
    • Tools/products: Subspace explorer notebook; alignment diagnostics (singular value spectra, cross-domain transfer plots).
    • Assumptions/dependencies: Labeled calibration examples and layer-wise KV access; reproducible seeds/decoding for comparison.
  • Efficient continual learning for productized LLMs
    • Sector: software/consumer apps
    • What: Periodically recalibrate targeted capabilities (e.g., new API versions, product updates) with small labeled sets and distill improvements without reintroducing a reward model pipeline.
    • Tools/products: MLOps job template (calibration → hooks generation → data → LoRA → canary rollout); drift detectors to trigger recalibration.
    • Assumptions/dependencies: Stable span definitions across versions; CI/CD for model artifacts; monitoring for regressions in non-target capabilities.

Long-Term Applications

These require further research, platform support, or broader validation (e.g., closed-API access, safety/regulatory evidence, multimodal generalization).

  • Runtime “capability toggles” via dynamic subspace hooks
    • Sector: software/product
    • What: Switch capability modes at inference (e.g., “concise-math,” “robust-facts,” “core-code”) by loading different projection matrices without retraining; per-task routing that composes subspaces.
    • Potential products: User-selectable modes in chat/IDE; router that picks subspace based on detected task.
    • Assumptions/dependencies: Stable, composable subspaces; framework support for safe, low-latency KV hooking at scale; conflict resolution between overlapping subspaces.
  • Safety and hallucination mitigation subspaces
    • Sector: safety/alignment; healthcare; finance; policy
    • What: Define “correctness spans” as verifiable facts, citations, or structured fields; extract subspaces that de-emphasize unsupported claims during generation; distill into safer base behavior.
    • Potential products: “Safety SPD” packs for toxicity/hallucination reduction; safety-tuned LoRA additives.
    • Assumptions/dependencies: High-quality, domain-specific span annotations; rigorous external validation; risk controls to avoid suppressing necessary context.
  • Multimodal SPD (vision-language, speech-language)
    • Sector: robotics, autonomous systems, media
    • What: Generalize correctness-aligned gradient collection and KV projections to multimodal attention blocks to steer model generations toward capability-relevant cross-modal evidence.
    • Potential products: VQA systems that privilege evidence-bearing regions; instruction-following agents with reliable perception-to-action grounding.
    • Assumptions/dependencies: Access to multimodal KV streams; span definitions for targets (e.g., bounding boxes, timestamps); compute for larger models.
  • Subspace “patch” ecosystems and compliance distribution
    • Sector: enterprise software/platform marketplaces
    • What: Distribute capability improvements as small projection matrices and/or LoRA deltas (“capability patches”) rather than full model weights for modular upgrades (e.g., sector-specific compliance, jargon).
    • Potential products: Capability Subspace Hub (catalog of audited subspaces); patch management in MLOps.
    • Assumptions/dependencies: IP/licensing clarity; APIs for safe patch loading; standards for auditing and versioning subspaces.
  • Continual, self-healing LLMs in production
    • Sector: ML operations
    • What: Online SPD cycles that detect capability drift, recalibrate subspaces with fresh labeled trickle data, and auto-distill improvements with minimal supervision.
    • Potential products: Drift monitors; auto-calibration schedulers; rollback-aware LoRA managers.
    • Assumptions/dependencies: Reliable drift signals; small, continuously labeled datasets; safeguards against feedback loops and catastrophic forgetting.
  • Foundation model training curricula with subspace awareness
    • Sector: foundation model research
    • What: Incorporate capability subspace extraction into pretraining/finetuning curricula to encourage disentangled, steerable representations; train models to expose stable subspaces for key skills.
    • Potential outcomes: Models with native “capability interfaces” for safer steering and faster adaptation.
    • Assumptions/dependencies: Scalable training studies; metrics for disentanglement; impact on overall capacity and emergent behaviors.
  • Regulation-ready tuning pipelines
    • Sector: policy/government; regulated industries
    • What: SPD-based tuning that operates entirely within jurisdictional and privacy constraints; audit trails showing that only correctness-aligned tokens influenced capability updates.
    • Potential products: Compliance reports with span-level provenance; certifiable tuning playbooks.
    • Assumptions/dependencies: Standardized documentation of span selection; third-party audits; sector-specific validation benchmarks.
  • Agentic systems with skill-specific SPD stages
    • Sector: robotics/software agents
    • What: Apply SPD to distinct agent skills (planning, tool-use, verification) to obtain compact, high-utility self-generated corpora per skill; distill into more reliable tool-using agents.
    • Potential products: Toolformer-like pipelines augmented with SPD; agent orchestrators that load subspace patches for each subtask.
    • Assumptions/dependencies: Clear, verifiable correctness spans per skill; orchestration that avoids interference between skills.
  • Personal/model-local customization
    • Sector: consumer apps/daily life
    • What: Hobbyists using local models calibrate concise answer styles, domain lexicons, or problem types with a handful of examples, then run SPD to personalize without cloud verifiers.
    • Potential products: Desktop SPD app with “teach by examples” UI that extracts spans automatically.
    • Assumptions/dependencies: Local GPUs/accelerators; easy span-selection UX; safe defaults for rank/layer choices.

Cross-cutting assumptions and risks

  • Access: SPD needs weight-level access for gradient computation and KV-hooking; not applicable to black-box API models without special provider support.
  • Correctness span definition: Feasible, low-cost span identification is task-dependent; poor span choices can misalign the extracted subspace.
  • Interference: Over-aggressive projection (rank too low, layers poorly chosen) may suppress useful variance or degrade non-target capabilities; requires validation and guardrails.
  • Compute/ops: Though data-efficient (20–500 calibration examples), SPD still needs GPU cycles for backward passes, SVD, self-generation, and LoRA fine-tuning; operationalization benefits from MLOps investment.
  • Generalization claims: While the paper reports up to 16% over baselines (in-domain) and up to 15% out-of-domain, real-world gains depend on domain, prompts, decoding, and span quality; evaluate with A/B tests and holdouts before wide rollout.

Glossary

  • Ablation: An experimental analysis technique that removes or varies components to assess their impact on performance. Example: "Ablation on loss functions"
  • Answer-letter accuracy: An evaluation metric where the model must select the correct option letter (e.g., A/B/C/D) for multiple-choice questions. Example: "answer-letter accuracy on MMLU"
  • Autoregressive generation: A decoding process where each next token is generated conditioned on all previous tokens. Example: "During autoregressive generation, we insert projection hooks at each target layer L\ell \in \mathcal{L}."
  • Calibration set: A small labeled dataset used to estimate task-relevant directions (e.g., gradient-based) for steering generation. Example: "uses a small calibration set to extract a low-rank capability subspace"
  • Capability-selective: Designed to target and preserve signals pertinent to a specific capability while filtering irrelevant patterns. Example: "a generalizable and capability-selective framework"
  • Capability subspace: A low-dimensional subspace in the model’s activation space that captures directions most relevant to a target capability. Example: "extracts a low-rank capability subspace"
  • Consistency-driven rationale evaluation: A curation method that selects explanations or rationales based on their internal agreement or consistency. Example: "consistency-driven rationale evaluation~\citep{lee2025self}"
  • Correctness-aligned loss: A loss computed only on token positions that directly determine task correctness, focusing gradients on capability-relevant tokens. Example: "we define a correctness-aligned loss"
  • Correctness-defining token positions: Token indices whose correct prediction directly determines task success. Example: "correctness-defining token positions"
  • Execution-based feedback: Supervision obtained by executing model-generated solutions (e.g., code) to assess correctness. Example: "consistency or execution-based feedback"
  • Key-value (KV) activations: The key and value matrices used in the transformer attention mechanism, representing internal attention features. Example: "projects key-value (KV) activations into this subspace"
  • KV-space representations: Latent representations specifically in the key/value activation space of attention layers. Example: "latent or KV-space representations rather than only at the output level"
  • LoRA: Low-Rank Adaptation; a parameter-efficient fine-tuning technique that adds trainable low-rank updates to weight matrices. Example: "The last step applies LoRA~\citep{hu2022lora} to fine-tune the model"
  • Negative log-likelihood (NLL): A probabilistic loss measuring how unlikely the true data is under the model; lower is better. Example: "along with negative log-likelihood on CodeAlpaca."
  • Next-token prediction loss: The standard autoregressive training objective that maximizes the likelihood of each next token given previous tokens. Example: "with standard next-token prediction loss."
  • Normalized exact-match: An evaluation metric that measures exact-match accuracy with normalization (e.g., formatting), often used for reasoning benchmarks. Example: "and normalized exact-match on BBH"
  • Off-policy distillation: Distillation using trajectories or data not generated by the current student policy (e.g., from a teacher or offline set). Example: "Early off-policy distillation methods mainly focus on transferring information from a stronger teacher to a student model"
  • On-policy distillation (OPD): Distillation where training trajectories are sampled from the current student policy while receiving teacher guidance. Example: "on-policy distillation (OPD)~\citep{agarwal2024policy, boizard2024towards}"
  • Out-of-domain generalization: The ability of learned improvements to transfer to tasks or datasets outside the trained domain. Example: "out-of-domain generalization settings."
  • Pass@1: A code-generation metric indicating the success rate with a single sample attempt. Example: "We report pass@1 on MBPP"
  • Per-token supervision: Teacher guidance provided at each decoding step rather than only at the sequence level. Example: "per-token supervision from the teacher."
  • Projection hooks: Runtime modules that project internal activations (e.g., KV) onto a selected subspace during generation without altering model parameters. Example: "Projection hooks are applied to the K and V activations:"
  • Rank‑r orthogonal projection matrix: A projection operator onto an r-dimensional subspace, constructed from top singular vectors. Example: "The rank-rr orthogonal projection matrix is then defined as"
  • Reward-guided search: A data-generation or selection procedure that uses a reward signal to guide which outputs are chosen. Example: "reward-guided search~\citep{zhang2024rest}"
  • Reward model: A learned model that assigns scalar rewards to outputs to guide selection or training. Example: "without a verifier, reward model, or RL."
  • Right-singular vectors: Vectors from SVD that span directions in feature space; used here to define the capability subspace. Example: "We retain the top-rr right-singular vectors,"
  • Rollout policy: The policy used to sample trajectories (token sequences) during distillation or training. Example: "trajectories sampled from a rollout policy"
  • Self-distillation: Training a model on its own outputs instead of relying on an external teacher. Example: "Self-distillation has emerged as a powerful paradigm for improving LLMs by training them on their own generations."
  • Self-generated corpus: A dataset of prompt–completion pairs produced by the model itself for subsequent fine-tuning. Example: "we obtain a self-generated corpus"
  • Self-policy: An internally transformed generation policy derived from the model itself (e.g., via projection hooks). Example: "it first induces a self-policy through an internal transformation:"
  • Self-Policy Distillation (SPD): The proposed framework that steers self-generation via projection into a capability subspace and then distills that behavior. Example: "we propose Self-Policy Distillation (SPD), a generalizable and capability-selective framework that requires no external signals."
  • Singular value decomposition (SVD): A matrix factorization used to identify dominant directions (e.g., capability-relevant) in gradients. Example: "we perform singular value decomposition (SVD) on these gradients"
  • Truncation-based decoding: A decoding strategy that limits outputs (e.g., by cutting off at a length or after certain tokens), used to regularize generations. Example: "produced with truncation-based decoding."
  • Verifier-based selection: Selecting self-generated outputs using an external verifier to filter or evaluate correctness. Example: "verifier-based selection~\citep{hosseini2024v}"

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Collections

Sign up for free to add this paper to one or more collections.

Tweets

Sign up for free to view the 2 tweets with 265 likes about this paper.