Underlying cause of prompt-masking degradation in K-FAC for CrispEdit

Determine why masking prompt tokens during the Kronecker-Factored Approximate Curvature (K-FAC) calculation degrades performance in CrispEdit’s low-curvature projection for large language model editing, and establish whether the relaxed assumption of token independence in the K-FAC computation is responsible for this degradation.

Background

CrispEdit enforces capability preservation by projecting edit updates onto low-curvature subspaces of the Gauss–Newton Hessian, approximated efficiently via K-FAC. In practice, the authors considered whether to mask prompt tokens during K-FAC statistics computation to mirror fine-tuning setups.
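To make the two options concrete, the following is a minimal NumPy sketch of K-FAC factor accumulation for a single linear layer, with and without a prompt mask, followed by a projection of an update onto low-curvature directions. It is an illustration under stated assumptions, not the authors' implementation: the function names `kfac_factors` and `project_low_curvature`, the `keep_fraction` quantile rule, and the choice to apply the mask directly to the per-token statistics are all hypothetical. The paper may instead mask only the loss, in which case prompt positions would still carry nonzero backpropagated gradients.

```python
import numpy as np


def kfac_factors(acts, grads, mask=None):
    """Accumulate K-FAC factors for one linear layer (hypothetical helper).

    acts:  (T, d_in)  layer inputs, one row per token
    grads: (T, d_out) loss gradients w.r.t. layer outputs, one row per token
    mask:  optional (T,) boolean array; True means the token contributes.
           Passing a target-only mask is one possible reading of
           "masking prompt tokens" (an assumption, see the lead-in).
    """
    acts = np.asarray(acts, dtype=float)
    grads = np.asarray(grads, dtype=float)
    if mask is None:
        mask = np.ones(acts.shape[0], dtype=bool)
    mask = np.asarray(mask, dtype=bool)
    a, g = acts[mask], grads[mask]
    n = max(a.shape[0], 1)
    # Each token is treated as an independent sample: no cross-token terms.
    A = a.T @ a / n  # (d_in, d_in) input covariance factor
    G = g.T @ g / n  # (d_out, d_out) output-gradient covariance factor
    return A, G


def project_low_curvature(update, A, G, keep_fraction=0.5):
    """Project a (d_out, d_in) update onto low-curvature K-FAC directions.

    The eigenvalues of the Kronecker product A (x) G are products of the
    factor eigenvalues, so the update is rotated into the joint eigenbasis
    and coefficients whose curvature exceeds a quantile threshold are zeroed.
    The quantile rule is an assumption made for this sketch.
    """
    lam_a, U_a = np.linalg.eigh(A)
    lam_g, U_g = np.linalg.eigh(G)
    curvature = np.outer(lam_g, lam_a)   # curvature of each eigen-direction
    coeffs = U_g.T @ update @ U_a        # update in the eigenbasis
    keep = curvature <= np.quantile(curvature, keep_fraction)
    return U_g @ (coeffs * keep) @ U_a.T


# Toy data: 12 tokens, the first 8 play the role of the prompt.
rng = np.random.default_rng(0)
acts = rng.normal(size=(12, 4))
grads = rng.normal(size=(12, 3))
target_only = np.arange(12) >= 8

A_full, G_full = kfac_factors(acts, grads)                    # whole sequence
A_mask, G_mask = kfac_factors(acts, grads, mask=target_only)  # targets only
update = rng.normal(size=(3, 4))
projected = project_low_curvature(update, A_full, G_full)
```

With a mask, the factors are estimated from far fewer token positions, so the estimated low-curvature subspace can differ from the full-sequence one. Whether that difference explains the reported degradation is exactly what the problem statement leaves open.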

They observed that masking prompt tokens led to suboptimal performance, even with a larger number of tokens, and therefore opted to compute the next-token prediction loss over the entire prompt–target sequence for K-FAC. However, they state that they are unsure why masking degrades performance and suspect that the relaxed assumption of token independence in their K-FAC calculation is the cause. The mechanism behind this behavior remains an open question.
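One common way to read "token independence" in K-FAC for sequence models is sketched below. The notation is assumed here and is not taken from the paper: for a linear layer with weight $W$, let $a_t$ be the layer input and $g_t$ the loss gradient with respect to the layer output at token position $t$, and let $S$ be the set of positions that contribute to the statistics (all positions, or target positions only under masking).

```latex
\begin{align}
\nabla_W \mathcal{L} &= \sum_{t=1}^{T} g_t a_t^\top, \\
F_W &= \mathbb{E}\!\left[\operatorname{vec}(\nabla_W \mathcal{L})\,
        \operatorname{vec}(\nabla_W \mathcal{L})^\top\right]
     = \mathbb{E}\!\left[\sum_{t=1}^{T}\sum_{t'=1}^{T}
        \left(a_t a_{t'}^\top\right) \otimes \left(g_t g_{t'}^\top\right)\right], \\
F_W &\;\propto\; A \otimes G
     \quad\text{(approximately, after dropping the } t \neq t' \text{ terms)}, \\
A &= \frac{1}{|S|}\sum_{t \in S} a_t a_t^\top,
\qquad
G = \frac{1}{|S|}\sum_{t \in S} g_t g_t^\top .
\end{align}
```

The second line uses $\operatorname{vec}(g a^\top) = a \otimes g$ under column stacking. The third line combines two approximations: the cross-token terms with $t \neq t'$ are dropped, and inputs and output gradients are treated as statistically independent so that the expectation factorizes. Under this reading, masking changes $S$ and therefore which per-token statistics enter $A$ and $G$, while the discarded cross-token terms are ignored in both settings. Whether those terms account for the observed gap is not established by the source.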

References

We found that masking prompt tokens for K-FAC calculation (mirroring the fine-tuning setup) yielded suboptimal performance, even with a larger number of tokens (\cref{tab:prompt_masking_ablation}). Instead, in our K-FAC calculation for edit samples, we calculate the next token prediction loss over the entire prompt–target sequence. While we are not sure about the underlying cause of this behavior, we suspect that it arises from our relaxed assumption of token independence during K-FAC calculation.

CrispEdit: Low-Curvature Projections for Scalable Non-Destructive LLM Editing (2602.15823 - Ikram et al., 17 Feb 2026) in Appendix, Additional details on LLM experiments (Non-trivial K-FAC implementation for CrispEdit)