Prefill-Time Intervention for Mitigating Hallucination in Large Vision-Language Models

Published 28 Apr 2026 in cs.CV and cs.AI | (2604.25642v1)

Abstract: Large Vision-Language Models (LVLMs) have achieved remarkable progress in visual-textual understanding, yet their reliability is critically undermined by hallucinations, i.e., the generation of factually incorrect or inconsistent responses. While recent studies using steering vectors demonstrated promise in reducing hallucinations, a notable challenge remains: they inadvertently amplify the severity of residual hallucinations. We attribute this to their exclusive focus on the decoding stage, where errors accumulate autoregressively and progressively worsen subsequent hallucinatory outputs. To address this, we propose Prefill-Time Intervention (PTI), a novel steering paradigm that intervenes only once during the prefill stage, enhancing the initial Key-Value (KV) cache before error accumulation occurs. Specifically, PTI is modality-aware, deriving distinct directions for visual and textual representations. This intervention is decoupled to steer keys toward visually-grounded objects and values to filter background noise, correcting hallucination-prone representations at their source. Extensive experiments demonstrate PTI's significant performance in mitigating hallucinations and its generalizability across diverse decoding strategies, LVLMs, and benchmarks. Moreover, PTI is orthogonal to existing decoding-stage methods, enabling plug-and-play integration and further boosting performance. Code is available at: https://github.com/huaiyi66/PTI.


Authors (5)

Summary

The paper introduces a novel prefill-time intervention that proactively adjusts the initial key-value cache to mitigate hallucinations.
It leverages contrastive object-background signals and modality-specific steering to preempt error cascades before autoregressive decoding.
Empirical evaluations show a significant reduction in hallucination frequency and severity across multiple LVLMs without the need for model retraining.


Abstract and Motivation

This work addresses a fundamental failure mode of Large Vision-Language Models (LVLMs): the generation of hallucinated visual content, that is, textual descriptions that are not grounded in the visual input. Existing methods predominantly steer representations during autoregressive decoding using fixed steering vectors, but this continuous, reactive strategy has been shown to exacerbate the severity of residual hallucinations. The authors identify key weaknesses in the prevailing Decoding-Time Intervention (DTI) paradigm, namely its lack of modality awareness, its coarse granularity, and, most critically, its late intervention, which allows errors to snowball. As an alternative, the authors introduce Prefill-Time Intervention (PTI), a proactive, modality-aware framework that directly manipulates the initial key-value (KV) cache during the prefill stage, thereby correcting representation errors before autoregressive decoding begins.

Methodology

PTI fundamentally shifts when and how intervention occurs by targeting the formation of the initial KV cache within the Transformer decoder. Rather than persistent, position-invariant steering throughout text generation, PTI applies a one-time, fine-grained perturbation based on contrastive object-background signals. The intervention is explicitly decoupled by modality:

Visual Directions: Extracted by contrasting the model’s KV cache between object-isolated and masked-background images (utilizing segmentation labels from datasets like MSCOCO). The visual steering vectors are computed as the difference between pooled KV representations for object-only and background-only scenarios, then denoised via PCA.
Textual Directions: Constructed by contrasting representations when the object-matching words are retained versus masked in the reference caption. These directions steer the last textual token toward proper object grounding.
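
The direction extraction described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the pooled key or value representations are already available as arrays of shape (n_samples, d), and it assumes that "denoised via PCA" means keeping the dominant direction of the object-minus-background differences (computed here with an uncentered SVD). The function name `pca_steering_direction` and the synthetic data are hypothetical.

```python
import numpy as np

def pca_steering_direction(obj_feats: np.ndarray, bg_feats: np.ndarray) -> np.ndarray:
    """Return a unit direction from paired object/background features.

    obj_feats, bg_feats: (n_samples, d) pooled representations
    (assumed to be precomputed from the model's KV cache).
    """
    diffs = obj_feats - bg_feats  # per-sample contrast, shape (n_samples, d)
    # Top right-singular vector of the difference matrix: the direction
    # shared by most contrasts (assumed reading of "denoised via PCA").
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    direction = vt[0]
    # The sign of a singular vector is arbitrary; orient it so that it
    # points from background toward object on average.
    if direction @ diffs.mean(axis=0) < 0:
        direction = -direction
    return direction / np.linalg.norm(direction)

# Synthetic check: object features differ from background features
# along one known axis, plus a little noise.
rng = np.random.default_rng(0)
d = 16
true_dir = np.zeros(d)
true_dir[0] = 1.0
bg = rng.normal(size=(200, d))
obj = bg + 3.0 * true_dir + 0.1 * rng.normal(size=(200, d))
v = pca_steering_direction(obj, bg)
```

Under these assumptions the same routine would be run separately for keys and values, and separately for the visual and textual contrasts, yielding one direction per combination.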

PTI injects these learned directions as weighted shifts in the initial KV tensors at the corresponding visual and textual token positions (with normalization for stability) before decoding commences. Crucially, the intervention is plug-and-play, model-agnostic (requiring no retraining), and computationally negligible relative to baseline inference.
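
As a rough sketch of the injection step just described, the snippet below adds a weighted direction to selected rows of a cached key or value matrix. It is an illustration under stated assumptions rather than the paper's code: the cache is modeled as a single (seq_len, d) array for one layer and head, the normalization is assumed to be a rescale back to each row's original L2 norm, and `shift_cache`, `alpha`, and the token positions are hypothetical names and values.

```python
import numpy as np

def shift_cache(cache, direction, positions, alpha, eps=1e-8):
    """Return a copy of `cache` with `alpha * direction` added at `positions`.

    cache: (seq_len, d) keys or values for one layer/head.
    Shifted rows are rescaled to their original L2 norm, which is an
    assumed form of the normalization mentioned in the text.
    """
    out = np.array(cache, dtype=float, copy=True)
    rows = out[positions]
    norms = np.linalg.norm(rows, axis=-1, keepdims=True)
    shifted = rows + alpha * direction
    shifted *= norms / (np.linalg.norm(shifted, axis=-1, keepdims=True) + eps)
    out[positions] = shifted
    return out

def unit(vec):
    return vec / np.linalg.norm(vec)

rng = np.random.default_rng(0)
seq_len, d = 10, 8
keys = rng.normal(size=(seq_len, d))
values = rng.normal(size=(seq_len, d))
visual_pos = np.arange(1, 7)             # hypothetical image-token positions
last_text_pos = np.array([seq_len - 1])  # last textual token

# Placeholder directions; in practice these would come from the
# object/background contrasts described earlier.
vis_key_dir, vis_val_dir, txt_key_dir, txt_val_dir = (
    unit(rng.normal(size=d)) for _ in range(4)
)

# One-time edit of the prefill cache, before any token is decoded.
new_keys = shift_cache(keys, vis_key_dir, visual_pos, alpha=0.4)
new_keys = shift_cache(new_keys, txt_key_dir, last_text_pos, alpha=0.4)
new_values = shift_cache(values, vis_val_dir, visual_pos, alpha=0.4)
new_values = shift_cache(new_values, txt_val_dir, last_text_pos, alpha=0.4)
```

Because the edit happens once on the prefill cache, decoding afterwards proceeds unchanged, which is consistent with the low overhead the summary reports.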

Empirical Evaluation

The evaluation encompasses three representative LVLMs: LLaVA-1.5, Qwen-VL-Chat, and DeepSeek-VL-Chat, spanning a diverse range of vision connector architectures. PTI is tested across object hallucination benchmarks (CHAIR [33], POPE [21], AMBER [46]), comprehensive multimodal hallucination sets (MMHAL [35], MME [57]), and various decoding strategies (greedy, beam search, nucleus sampling). Quantitative results demonstrate that:

Hallucination frequency is substantially reduced (e.g., on LLaVA-1.5, CHAIRs drops from 47.4% to 15.4% in the greedy setting).
Snowball hallucination severity is reduced compared to DTI paradigms, confirming PTI’s ability to preempt error cascades.
PTI exhibits strong generalization across models, input distributions, and decoding strategies.
PTI is orthogonal to DTI-based steering: when combined, further performance gains are observed, suggesting complementary mechanisms.
Efficiency: Unlike contrastive decoding methods (e.g., VCD, PAI), PTI incurs a <2% increase in inference time, with nearly identical throughput.
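
For readers unfamiliar with the CHAIR figures quoted in the list above, the sketch below shows how the two standard CHAIR scores are defined: the instance-level score is the share of mentioned objects that are not in the image, and the sentence-level score (CHAIRs) is the share of captions containing at least one such object. It assumes object mentions have already been extracted and mapped to a fixed vocabulary, which the real benchmark does with MSCOCO categories and synonym lists; the function name and toy data are hypothetical.

```python
def chair_scores(mentioned_per_caption, truth_per_caption):
    """Compute (CHAIR_i, CHAIR_s) from per-caption object sets.

    mentioned_per_caption: list of sets of objects named in each caption.
    truth_per_caption: list of sets of objects actually in each image.
    """
    hallucinated_mentions = 0
    total_mentions = 0
    captions_with_hallucination = 0
    for mentioned, truth in zip(mentioned_per_caption, truth_per_caption):
        hallucinated = mentioned - truth  # objects named but not present
        hallucinated_mentions += len(hallucinated)
        total_mentions += len(mentioned)
        captions_with_hallucination += bool(hallucinated)
    n_captions = len(mentioned_per_caption)
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    chair_s = captions_with_hallucination / n_captions if n_captions else 0.0
    return chair_i, chair_s

# Toy example: the first caption invents a frisbee, the second is clean.
mentioned = [{"dog", "frisbee"}, {"cat"}]
truth = [{"dog"}, {"cat", "sofa"}]
chair_i, chair_s = chair_scores(mentioned, truth)

# Relative reduction implied by the reported CHAIRs drop on LLaVA-1.5.
relative_drop = (47.4 - 15.4) / 47.4
```

The reported change from 47.4% to 15.4% is a drop of 32.0 percentage points, or roughly two thirds in relative terms.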

Ablation studies confirm that the visual value intervention exerts the largest effect on hallucination mitigation, but optimal performance is obtained by combining both modality-specific directions.

Analysis and Internal Insights

Interpretability analyses are performed to elucidate how PTI aligns attention mechanisms:

Global Visual Attention: PTI counteracts the well-documented decay of visual token attention during text generation, leading to sustained and globally improved visual grounding.
Local Object-Centric Attention: Intervention consistently and specifically shifts attention distributions toward ground-truth object regions, as verified with attention heatmaps and quantitative metrics.
Perturbations using non-object-specific (e.g. random mask-based) directions are significantly less effective, establishing the necessity of precise object-background contrasts.
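
One simple way to quantify the visual attention decay mentioned above is to track, for each generated token, the fraction of attention mass that lands on image-token positions. The sketch below is only an assumed proxy for that kind of analysis, not necessarily the metric used in the paper; `visual_attention_share` and the toy attention rows are hypothetical.

```python
import numpy as np

def visual_attention_share(attn, visual_mask):
    """Fraction of each row's attention that falls on visual tokens.

    attn: (n_generated_tokens, seq_len) non-negative attention weights.
    visual_mask: boolean array of length seq_len, True at image tokens.
    """
    attn = np.asarray(attn, dtype=float)
    visual_mask = np.asarray(visual_mask, dtype=bool)
    attn = attn / attn.sum(axis=-1, keepdims=True)  # renormalize each row
    return attn[:, visual_mask].sum(axis=-1)

# Two decoding steps over a three-token context whose first two
# positions are image tokens.
attn = [[0.5, 0.3, 0.2],
        [0.2, 0.2, 0.6]]
mask = [True, True, False]
share = visual_attention_share(attn, mask)
```

A share that falls as decoding proceeds would correspond to the decay described in the text, and a flatter curve after intervention would correspond to sustained visual grounding.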

Further, qualitative case studies reveal that PTI suppresses instances of object insertion, attribute errors, and context misclassification that remain prevalent in vanilla models and DTI approaches.

Implications and Future Directions

PTI clarifies the importance of when, how, and what to steer in autoregressive LVLM decoding. The results indicate that representation errors are best addressed in the initial state, and that modality and position specificity is essential for robust factual alignment, particularly in settings where visual signals are weak or heavily confounded by linguistic priors. Given PTI’s plug-and-play nature, future work may extend analogous interventions to additional modalities (audio, video) or explore more powerful methods for automatic direction extraction that integrate richer semantics or counterfactual data. PTI’s computational efficiency also makes it a candidate for real-world, latency-sensitive interactive applications, including autonomous systems and digital assistants, where factual accuracy is paramount.

Conclusion

The paper introduces Prefill-Time Intervention as a new paradigm for hallucination mitigation in LVLMs, demonstrating that one-time, modality-specific steering of the initial KV cache reduces both the frequency and severity of visual hallucinations with negligible inference overhead (under 2% added inference time) and no model retraining. The approach is reported to outperform prior state-of-the-art baselines across both coarse- and fine-grained semantic metrics, and it suggests new principles for intervention design in large autoregressive multimodal models (2604.25642).
