- The paper introduces a novel prefill-time intervention that proactively adjusts the initial key-value cache to mitigate hallucinations.
- It leverages contrastive object-background signals and modality-specific steering to preempt error cascades before autoregressive decoding.
- Empirical evaluations show a significant reduction in hallucination frequency and severity across multiple LVLMs without the need for model retraining.
Prefill-Time Intervention for Mitigating Hallucination in Large Vision-Language Models
Abstract and Motivation
This work addresses a fundamental failure mode of Large Vision-Language Models (LVLMs): the generation of hallucinated visual content, that is, textual descriptions that are not grounded in the visual input. Existing methods predominantly steer representations during autoregressive decoding using fixed steering vectors, but this continuous, reactive strategy has been shown to exacerbate the severity of residual hallucinations. The authors identify key weaknesses in the prevailing Decoding-Time Intervention (DTI) paradigm, namely its lack of modality awareness, its coarse granularity, and, most critically, its late intervention, which allows errors to snowball. As an alternative, the authors introduce Prefill-Time Intervention (PTI), a proactive, modality-aware framework that directly manipulates the initial key-value (KV) cache during the prefill stage, thereby correcting representation errors before autoregressive decoding begins.
Methodology
PTI fundamentally shifts when and how intervention occurs by targeting the formation of the initial KV cache within the Transformer decoder. Rather than persistent, position-invariant steering throughout text generation, PTI applies a one-time, fine-grained perturbation based on contrastive object-background signals. The intervention is explicitly decoupled by modality:
- Visual Directions: Extracted by contrasting the model's KV cache between object-isolated and masked-background images (utilizing segmentation labels from datasets like MSCOCO). The visual steering vectors are computed as the difference between pooled KV representations for object-only and background-only scenarios, then denoised via PCA.
- Textual Directions: Constructed by contrasting representations when the object-matching words are retained versus masked in the reference caption. These directions steer the last textual token toward proper object grounding.
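As a rough illustration of the direction-extraction step described above, the sketch below contrasts pooled representations from object-only and background-only inputs and keeps the dominant shared direction. It is not the authors' implementation: the function name `contrast_direction`, the array shapes, and the use of an uncentered SVD as the PCA-style denoising step are all assumptions made for this example. The same routine could in principle be applied to the retained-versus-masked caption contrast for textual directions.

```python
import numpy as np

def contrast_direction(positive_feats, negative_feats):
    """Return a unit steering direction from paired contrastive features.

    positive_feats, negative_feats: arrays of shape (n_samples, dim). Each row
    is a pooled key or value representation for one sample (hypothetical
    inputs, e.g. object-only vs. background-only images).
    """
    diffs = np.asarray(positive_feats, dtype=float) - np.asarray(negative_feats, dtype=float)
    # Assumed denoising step: the top right singular vector of the raw
    # (uncentered) difference matrix is the direction most samples share.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    direction = vt[0]
    # Singular vectors have an arbitrary sign; orient along the mean difference.
    if direction @ diffs.mean(axis=0) < 0:
        direction = -direction
    return direction / np.linalg.norm(direction)

# Synthetic demo: a shared offset along axis 0 plus small per-sample noise.
rng = np.random.default_rng(0)
background = rng.normal(size=(32, 8))
shared = np.eye(8)[0]
objects = background + 2.0 * shared + 0.1 * rng.normal(size=(32, 8))
print(contrast_direction(objects, background).round(2))
```

Taking the leading singular vector rather than the plain mean difference is one way to suppress sample-specific noise; whether the paper centers the data or keeps more than one component is not stated in this summary.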
PTI injects these learned directions as weighted shifts in the initial KV tensors at the corresponding visual and textual token positions (with normalization for stability) before decoding commences. Crucially, the intervention is plug-and-play, model-agnostic (requiring no retraining), and computationally negligible relative to baseline inference.
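The injection step can be sketched in the same spirit. The example below shifts selected rows of a single key or value tensor along a steering direction and then rescales each edited row to its original norm. The names `inject_direction` and `alpha`, the single-tensor `(seq_len, dim)` layout, and the norm-preserving rescale are assumptions for illustration; the summary only states that the shift is weighted and normalized for stability.

```python
import numpy as np

def inject_direction(cache, direction, positions, alpha):
    """Return a copy of `cache` with rows at `positions` steered along `direction`.

    cache: (seq_len, dim) key or value tensor produced by the prefill pass
           (hypothetical layout: one layer, one head).
    direction: (dim,) steering vector.
    positions: token indices to edit (e.g. visual tokens, or the last text token).
    alpha: intervention strength (hypothetical hyperparameter).
    """
    out = np.array(cache, dtype=float)  # copy so the input is left untouched
    idx = np.asarray(positions, dtype=int)
    unit = np.asarray(direction, dtype=float)
    unit = unit / np.linalg.norm(unit)
    rows = out[idx]
    original_norm = np.linalg.norm(rows, axis=-1, keepdims=True)
    shifted = rows + alpha * unit
    shifted_norm = np.linalg.norm(shifted, axis=-1, keepdims=True)
    # Assumed stabilization: keep each edited row at its pre-shift magnitude.
    out[idx] = shifted * original_norm / np.maximum(shifted_norm, 1e-12)
    return out

# Demo: steer tokens 1..3 (standing in for visual tokens) once, before decoding.
rng = np.random.default_rng(0)
values = rng.normal(size=(6, 8))
steered = inject_direction(values, np.eye(8)[0], positions=[1, 2, 3], alpha=0.5)
print(np.abs(steered - values).sum(axis=1).round(3))
```

Because the edit is applied once to the prefill cache, decoding afterwards proceeds unchanged, which is consistent with the low overhead reported for the method.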
Empirical Evaluation
The evaluation encompasses three representative LVLMs: LLaVA-1.5, Qwen-VL-Chat, and DeepSeek-VL-Chat, spanning a diverse range of vision connector architectures. PTI is tested across object hallucination benchmarks (CHAIR [33], POPE [21], AMBER [46]), comprehensive multimodal hallucination sets (MMHAL [35], MME [57]), and various decoding strategies (greedy, beam search, nucleus sampling). Quantitative results demonstrate that:
- Hallucination frequency is substantially reduced (e.g., on LLaVA-1.5, CHAIRs drops from 47.4% to 15.4% in the greedy setting).
- Snowball hallucination severity is reduced compared to DTI paradigms, confirming PTI's ability to preempt error cascades.
- PTI exhibits strong generalization across models, input distributions, and decoding strategies.
- PTI is orthogonal to DTI-based steering: when combined, further performance gains are observed, suggesting complementary mechanisms.
- Efficiency: Unlike contrastive decoding methods (e.g., VCD, PAI), PTI incurs a <2% increase in inference time, with nearly identical throughput.
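For readers unfamiliar with the CHAIR numbers quoted above, the following sketch shows how the two standard variants are usually defined: CHAIRs is the fraction of captions containing at least one hallucinated object, and CHAIRi is the fraction of mentioned objects that are hallucinated. The function name and input format are assumptions for this example, and object extraction and synonym mapping are assumed to have been done beforehand.

```python
def chair_scores(mentioned_per_caption, truth_per_image):
    """Compute (CHAIRs, CHAIRi) from pre-extracted object mentions.

    mentioned_per_caption: list of lists, objects named in each caption.
    truth_per_image: list of sets, objects actually present in each image.
    """
    total_mentions = 0
    hallucinated_mentions = 0
    captions_with_hallucination = 0
    for mentioned, truth in zip(mentioned_per_caption, truth_per_image):
        hallucinated = [obj for obj in mentioned if obj not in truth]
        total_mentions += len(mentioned)
        hallucinated_mentions += len(hallucinated)
        captions_with_hallucination += 1 if hallucinated else 0
    num_captions = len(mentioned_per_caption)
    chair_s = captions_with_hallucination / num_captions if num_captions else 0.0
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    return chair_s, chair_i

# Toy example with three captions; "chair" and "car" are not in their images.
captions = [["dog", "frisbee"], ["cat", "table", "chair"], ["car"]]
truths = [{"dog", "frisbee"}, {"cat", "table"}, {"bus"}]
print(chair_scores(captions, truths))
```

Lower is better for both scores, so a drop in CHAIRs means fewer captions contain any hallucinated object at all.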
Ablation studies confirm that the visual value intervention exerts the largest effect on hallucination mitigation, but optimal performance is obtained by combining both modality-specific directions.
Analysis and Internal Insights
Interpretability analyses are performed to elucidate how PTI aligns attention mechanisms:
- Global Visual Attention: PTI counteracts the well-documented decay of visual token attention during text generation, leading to sustained and globally improved visual grounding.
- Local Object-Centric Attention: Intervention consistently and specifically shifts attention distributions toward ground-truth object regions, as verified with attention heatmaps and quantitative metrics.
- Perturbations using non-object-specific (e.g. random mask-based) directions are significantly less effective, establishing the necessity of precise object-background contrasts.
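One simple way to quantify the attention decay mentioned above is to track, for each generated token, the share of attention mass that falls on visual token positions. The sketch below is an assumed diagnostic, not the paper's exact metric; `visual_attention_share` and its input layout (one row of attention weights per decoding step, padded to a common context length) are hypothetical.

```python
import numpy as np

def visual_attention_share(attn, visual_positions):
    """Per-step fraction of attention mass placed on visual tokens.

    attn: (num_steps, context_len) attention weights, one row per generated
          token (e.g. averaged over heads); rows need not be normalized.
    visual_positions: indices of the visual tokens within the context.
    """
    attn = np.asarray(attn, dtype=float)
    mask = np.zeros(attn.shape[1], dtype=bool)
    mask[np.asarray(visual_positions, dtype=int)] = True
    return attn[:, mask].sum(axis=1) / attn.sum(axis=1)

# Toy example: attention drifts away from the two visual tokens over two steps.
weights = [[0.5, 0.3, 0.2, 0.0],
           [0.2, 0.2, 0.3, 0.3]]
print(visual_attention_share(weights, visual_positions=[0, 1]))
```

Comparing this curve with and without the intervention would show whether visual attention is sustained over the course of generation.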
Further, qualitative case studies reveal that PTI suppresses instances of object insertion, attribute errors, and context misclassification that remain prevalent in vanilla models and DTI approaches.
Implications and Future Directions
PTI clarifies the criticality of when, how, and what is steered in the context of autoregressive LVLM decoding. The results indicate that representation errors are best addressed in the initial state, and that modality/position specificity is essential for robust factual alignment, particularly in settings where visual signals are weak or heavily confounded by linguistic priors. Given PTI's plug-and-play nature, future work may extend analogous interventions to additional modalities (audio, video) or explore more powerful methods for automatic direction extraction that integrate richer semantics or counterfactual data. PTI's computational efficiency also recommends it for real-world, latency-sensitive interactive applications across domains, including autonomous systems and digital assistance where factual accuracy is paramount.
Conclusion
The paper introduces a paradigm shift for hallucination mitigation in LVLMs via Prefill-Time Intervention, demonstrating that one-time, modality-specific steering of the initial KV cache robustly reduces both the frequency and severity of visual hallucinations with negligible inference overhead (under 2%) and without model retraining. The approach outperforms prior SOTA baselines across both coarse- and fine-grained semantic metrics, and further establishes new principles for intervention design in large autoregressive multimodal models (2604.25642).