- The paper presents a novel benchmark dataset of 9,576 annotated news images for binary sensational image detection.
- It uses a cross-modal similarity framework built on adapted VLMs such as CLIP and SigLIP; the similarity-based baseline exceeds 80% accuracy, and fine-tuned MLLMs reach up to 90.0% on the full set.
- The study highlights prompt sensitivity and label ambiguity, guiding future developments in trustworthy disinformation detection and model calibration.
Sens-VisualNews: A Benchmark Dataset for Sensational Image Detection
Introduction and Motivation
"Sens-VisualNews: A Benchmark Dataset for Sensational Image Detection" (2605.10394) introduces a novel dataset and task targeting the automated identification of sensational visual content in news imagery. Existing disinformation detection pipelines predominantly attend to text-based sensationalism, neglecting the crucial role of emotionally charged images in accelerating inattentive content sharing and manipulating user perception. The need for robust visual sensationalism detectors is underscored by the increasing sophistication of multimodal disinformation, especially in the context of viral news cycles and fact-checking workflows.
Dataset Construction and Curation
The Sens-VisualNews dataset comprises 9,576 annotated images drawn from VisualNews, a large-scale news image captioning dataset. Central to the curation process is a taxonomy of 194 sensational visual concepts and events, compiled in collaboration with fact-checkers and journalists across major geographic regions. The concepts include war, conflict, refugee crises, environmental disasters, racism, and religion, which keeps the scope relevant to the news domain and to fact-checking.
Image selection relies on a cross-modal similarity estimation framework. Using adaptations of pre-trained VLMs such as CLIP and SigLIP, images are scored against the sensational concepts, which enables semantic filtering and balancing across under-represented topics. This step addresses an inherent topic imbalance: the initial automatic selections over-represented war and migrant themes.
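The filtering and balancing step described above can be illustrated with a minimal sketch. It assumes that image and concept embeddings have already been produced by a VLM such as CLIP or SigLIP and live in the same vector space. The toy vectors, the `threshold`, the `per_concept_cap`, and the rule of assigning each image to its single closest concept are all hypothetical choices made for illustration, not the paper's confirmed procedure.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def best_concept(image_vec, concept_vecs):
    """Return (concept_name, similarity) for the concept closest to the image."""
    scored = ((name, cosine(image_vec, vec)) for name, vec in concept_vecs.items())
    return max(scored, key=lambda pair: pair[1])

def select_balanced(image_vecs, concept_vecs, threshold, per_concept_cap):
    """Keep images scoring at least `threshold`, at most `per_concept_cap` per concept."""
    buckets = defaultdict(list)
    for image_id, vec in image_vecs.items():
        concept, score = best_concept(vec, concept_vecs)
        if score >= threshold:
            buckets[concept].append((score, image_id))
    selected = {}
    for concept, scored in buckets.items():
        scored.sort(reverse=True)  # highest similarity first
        selected[concept] = [image_id for _, image_id in scored[:per_concept_cap]]
    return selected

# Toy 3-d embeddings (hypothetical; real ones would come from the VLM encoders).
concepts = {"war": [1.0, 0.0, 0.0], "flood": [0.0, 1.0, 0.0]}
images = {
    "img_a": [0.9, 0.1, 0.0],
    "img_b": [0.8, 0.2, 0.1],
    "img_c": [0.1, 0.9, 0.0],
    "img_d": [0.0, 0.1, 1.0],  # far from every concept, so it is filtered out
}
selection = select_balanced(images, concepts, threshold=0.5, per_concept_cap=1)
```

The per-concept cap is one simple way to keep a dominant topic (here "war", which matches two images) from crowding out the others; other balancing rules would fit the same description.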
Each image is labeled independently by three annotators, and the final label is set by majority vote; ambiguous cases are removed to improve label reliability. The dataset is released in two versions: the full set and a "strict" subset that contains only unanimously annotated samples from both the sensational and non-sensational classes. The strict subset supports evaluation under reduced label ambiguity, at the cost of a less challenging task.
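As a minimal sketch of the aggregation described above, the following derives majority-vote labels and a unanimous-only subset from three votes per image. The image IDs, label strings, and the `aggregate` helper are hypothetical; the criteria for discarding ambiguous cases are not spelled out here, so that step is not modeled.

```python
from collections import Counter

def aggregate(votes):
    """Return (majority_label, is_unanimous) for one image's annotator votes."""
    counts = Counter(votes)
    label, top_count = counts.most_common(1)[0]
    return label, top_count == len(votes)

# Hypothetical annotations: three independent votes per image.
annotations = {
    "img_1": ["sensational", "sensational", "sensational"],
    "img_2": ["sensational", "non-sensational", "sensational"],
    "img_3": ["non-sensational", "non-sensational", "non-sensational"],
}

# Full set: every image keeps its majority label.
full = {image_id: aggregate(votes)[0] for image_id, votes in annotations.items()}

# Strict subset: only images on which all annotators agree.
strict = {
    image_id: aggregate(votes)[0]
    for image_id, votes in annotations.items()
    if aggregate(votes)[1]
}
```

With three annotators and two classes a majority always exists, so ties cannot occur in this setup.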
Task Definition and Distinction
Sensational image detection is formally defined as the binary classification of whether an image contains visually encoded features likely to elicit strong emotional responses (e.g., shock, anger, disgust, anxiety). The task is distinguished from:
- Sensational content (textual) detection: Focuses on provocative linguistic cues.
- NSFW content detection: Emphasizes general or policy-violating visual inappropriateness.
- Visual sentiment analysis: Assesses overall emotional polarity or mood rather than intent to provoke.
This distinction mandates models capable of nuanced, intention-conditioned interpretation of visual stimuli, not merely surface-level sentiment or harm assessment.
Experimental Protocol
Performance is benchmarked using leading open-source MLLMs: Qwen3 VL, LLaVA One Vision (and 1.5), InternVL 3.5, and SmolVLM2, spanning parameter scales from 0.5B to 8B. Evaluation uses top-1 accuracy in both zero-shot and fine-tuned (via LoRA) settings, and prompt sensitivity is investigated with three distinct, semantically targeted prompts per model. Both prompt placement (prepended or appended) and prompt formulation affect model outputs, which reflects the subjective and context-dependent nature of the task.
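The evaluation described above can be sketched as top-1 accuracy per prompt, followed by the mean and spread across prompts. The predictions and prompt names below are made up, and the use of the population standard deviation (`pstdev`) is an assumption; a sample standard deviation would be an equally plausible choice.

```python
from statistics import mean, pstdev

def accuracy(preds, gold):
    """Top-1 accuracy: fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)

# Hypothetical gold labels (1 = sensational, 0 = non-sensational).
gold = [1, 0, 1, 1, 0]

# Hypothetical predictions from one model under three different prompts.
preds_by_prompt = {
    "prompt_1": [1, 0, 1, 1, 0],
    "prompt_2": [1, 0, 0, 1, 0],
    "prompt_3": [1, 1, 0, 1, 0],
}

accs = {name: accuracy(preds, gold) for name, preds in preds_by_prompt.items()}
values = list(accs.values())

mean_acc = mean(values)   # average accuracy across prompts
spread = pstdev(values)   # prompt sensitivity as a standard deviation
```

Multiplying `spread` by 100 expresses it in percentage points, the unit commonly used when reporting prompt sensitivity.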
Key Results
- Baseline: The image similarity-based baseline achieves >80% accuracy, reflecting the strength of cross-modal embedding retrieval for this binary task.
- MLLM Zero-shot performance: Larger MLLMs consistently outperform both the baseline and smaller counterparts. Importantly, LLaVA OV 1.5 (8B) achieves the highest zero-shot accuracy (87.6% on full, 93.9% on strict), closely followed by Qwen3 VL (8B).
- Fine-tuning: Substantial performance boosts are observed post-fine-tuning, with Qwen3 VL (8B) reaching 90.0% (full) and 95.5% (strict), marginally surpassing LLaVA OV 1.5 (8B).
- Prompt Sensitivity: All MLLMs exhibit non-negligible prompt sensitivity, especially at smaller scales, with standard deviations up to 15.8 percentage points.
- Label Ambiguity Effect: Accuracy is systematically higher on the strict subset, directly highlighting the challenge introduced by subjectivity in real-world annotation.
Strong claims: Fine-tuned large-scale MLLMs achieve near human-level labeling agreement for the strictest cases, and most evaluated MLLMs surpass the best cross-modal retrieval baseline by a significant margin on both subsets.
Implications and Future Directions
Practically, Sens-VisualNews provides a high-quality resource for developing and benchmarking multimodal models intended for disinformation detection workflows in journalistic and automated fact-checking applications. The results indicate that current open-source MLLMs are highly effective for this specialized visual reasoning task yet sensitive to prompt formulation, so prompt engineering or adaptation is likely to be needed during deployment. The strict subset methodology further encourages exploration of subjective and adversarial edge cases.
Theoretically, the dataset challenges the boundaries of intention-conditioned visual understanding, emphasizing the need for models to differentiate between emotional polarity and intent to provoke. The subjectivity intrinsic to the concept of "sensationalism" also opens new research avenues in modeling annotator disagreement and epistemic uncertainty.
For future developments, the dataset's release together with code and annotation tools enables further research on:
- Modeling and mitigating label subjectivity with probabilistic or ensemble annotation frameworks.
- Extending sensationalism detection beyond the news domain (e.g., social media, memes).
- Robust multimodal fine-tuning strategies sensitive to ethical deployment in fact-checking scenarios.
- Cross-lingual and cross-cultural adaptation, testing model generalization on sensationalism cues that are culturally grounded.
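To make the first direction concrete, one simple probabilistic treatment of label subjectivity is to keep the fraction of annotators who chose the sensational class as a soft label and to measure disagreement with binary entropy. This is an illustrative sketch only: the helper names and the vote data are hypothetical, and nothing here is presented as part of the released dataset or code.

```python
import math

def soft_label(votes, positive="sensational"):
    """Fraction of annotators who chose the positive class."""
    return sum(v == positive for v in votes) / len(votes)

def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) label; 0.0 when annotators fully agree."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Hypothetical votes for one image with a 2-to-1 split.
votes = ["sensational", "sensational", "non-sensational"]
p = soft_label(votes)          # soft target instead of a hard majority label
disagreement = binary_entropy(p)
```

A model trained on such soft targets can be penalized less for uncertainty on images where the annotators themselves disagreed.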
Conclusion
"Sens-VisualNews: A Benchmark Dataset for Sensational Image Detection" (2605.10394) addresses a critical gap in multimodal disinformation detection by systematically defining, curating, and evaluating the detection of sensational imagery in news. Its rigorous methodology, strong empirical results for fine-tuned MLLMs, and explicit treatment of subjective ambiguity make it a foundational resource for the continued advancement of trustworthy AI in news and media integrity.