Sens-VisualNews: A Benchmark Dataset for Sensational Image Detection

Published 11 May 2026 in cs.CV | (2605.10394v1)

Abstract: The detection of sensational content in media items can be a critical filtering mechanism for identifying check-worthy content and flagging potential disinformation, since such content triggers physiological arousal that often bypasses critical evaluation and accelerates viral sharing. In this paper we introduce the task of sensational image detection, which aims to determine whether an image contains shocking, provocative, or emotionally charged features to grab attention and trigger strong emotional responses. To support research on this task, we create a new benchmark dataset (called Sens-VisualNews) that contains 9,576 images from news items, annotated based on the (in-)existence of various sensational concepts and events in their visual content. Finally, using Sens-VisualNews, we study the prompt sensitivity, performance and robustness of a wide range of open SotA Multimodal LLMs, across both zero-shot and fine-tuned settings.


Authors (4)

Summary

The paper presents a novel benchmark dataset of 9,576 annotated news images for binary sensational image detection.
Images are selected with a cross-modal similarity framework built on adapted VLMs such as CLIP and SigLIP; a similarity-based baseline exceeds 80% accuracy, and fine-tuned MLLMs reach the highest scores.
The study highlights prompt sensitivity and label ambiguity, guiding future developments in trustworthy disinformation detection and model calibration.


Introduction and Motivation

"Sens-VisualNews: A Benchmark Dataset for Sensational Image Detection" (2605.10394) introduces a novel dataset and task targeting the automated identification of sensational visual content in news imagery. Existing disinformation detection pipelines predominantly attend to text-based sensationalism, neglecting the crucial role of emotionally charged images in accelerating inattentive content sharing and manipulating user perception. The need for robust visual sensationalism detectors is underscored by the increasing sophistication of multimodal disinformation, especially in the context of viral news cycles and fact-checking workflows.

Dataset Construction and Curation

The Sens-VisualNews dataset comprises 9,576 annotated images drawn from the large-scale VisualNews news image captioning dataset. Central to the curation process is a taxonomy of 194 sensational visual concepts and events, compiled in collaboration with fact-checkers and journalists across major geographic regions. The concepts include war, conflict, refugee crises, environmental disasters, racism, and religion, which keeps the scope domain-relevant and oriented toward fact-checking.

Image selection relies on a cross-modal similarity estimation framework. Using adaptations of pre-trained VLMs such as CLIP and SigLIP, images are scored against the sensational concepts, which enables semantic filtering and balancing across under-represented topics. This step addresses an inherent topic imbalance: war and migrant themes were over-represented in the initial automatic selections.
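The selection step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes image and concept embeddings have already been computed by some VLM, and the function name `select_balanced`, the `threshold` and `per_concept` parameters, and the toy 2-D vectors are all hypothetical.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_balanced(image_embs, concept_embs, threshold=0.5, per_concept=2):
    """Assign each image to its best-matching concept, then cap each concept.

    The cap is one simple (assumed) way to limit over-represented topics.
    """
    buckets = defaultdict(list)
    for image_id, img_vec in image_embs.items():
        best_concept, best_score = max(
            ((c, cosine(img_vec, c_vec)) for c, c_vec in concept_embs.items()),
            key=lambda pair: pair[1],
        )
        if best_score >= threshold:  # semantic filtering
            buckets[best_concept].append((best_score, image_id))

    selected = {}
    for concept, scored in buckets.items():
        scored.sort(reverse=True)  # highest similarity first
        selected[concept] = [image_id for _, image_id in scored[:per_concept]]
    return selected


# Toy embeddings standing in for VLM outputs (hypothetical values).
concept_embs = {"war": [1.0, 0.0], "flood": [0.0, 1.0]}
image_embs = {
    "img_a": [0.9, 0.1],
    "img_b": [0.8, 0.2],
    "img_c": [0.95, 0.05],
    "img_d": [0.1, 0.9],
    "img_e": [0.5, 0.5],
}
selected = select_balanced(image_embs, concept_embs, threshold=0.8, per_concept=2)
print(selected)
```

In this toy run the three "war" candidates are capped at two and the image that matches neither concept strongly is filtered out. The paper's actual balancing rule may differ.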

Each image is labeled independently by three annotators and the final label is set by majority vote; ambiguous cases are removed to ensure label reliability. The dataset is released as a full set and a "strict" subset, the latter containing only unanimously annotated samples from both the sensational and non-sensational classes. The strict subset supports evaluation under reduced label ambiguity, at the cost of an easier task.
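A small sketch of how such an aggregation could work is given below. It is an assumption-laden illustration: the label strings, the hypothetical "unsure" option, and the rule that an image is ambiguous when no label wins a strict majority are not taken from the paper.

```python
from collections import Counter


def build_subsets(annotations):
    """Split per-image votes into a full set and a strict (unanimous) subset.

    annotations maps image_id -> list of three annotator labels.
    Images without a strict majority are dropped as ambiguous (assumed rule).
    """
    full, strict = {}, {}
    for image_id, votes in annotations.items():
        label, top = Counter(votes).most_common(1)[0]
        if top * 2 <= len(votes):
            continue  # no strict majority: treat as ambiguous and skip
        full[image_id] = label
        if top == len(votes):  # all annotators agree
            strict[image_id] = label
    return full, strict


# Toy annotations; "unsure" is a hypothetical third option.
annotations = {
    "img_1": ["sensational", "sensational", "sensational"],
    "img_2": ["sensational", "sensational", "non-sensational"],
    "img_3": ["non-sensational", "non-sensational", "non-sensational"],
    "img_4": ["sensational", "non-sensational", "unsure"],
}
full, strict = build_subsets(annotations)
print(full)
print(strict)
```

Here `img_2` stays in the full set with its majority label but is excluded from the strict subset, and `img_4` is dropped from both.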

Task Definition and Distinction

Sensational image detection is formally defined as the binary classification of whether an image contains visually encoded features likely to elicit strong emotional responses (e.g., shock, anger, disgust, anxiety). The task is distinguished from:

Sensational content (textual) detection: Focuses on provocative linguistic cues.
NSFW content detection: Targets visually inappropriate or policy-violating content in general.
Visual sentiment analysis: Assesses overall emotional polarity or mood rather than intent to provoke.

This distinction mandates models capable of nuanced, intention-conditioned interpretation of visual stimuli, not merely surface-level sentiment or harm assessment.

Experimental Protocol

Performance is benchmarked using leading open-source MLLMs (Qwen3 VL, LLaVA OneVision and its 1.5 version, InternVL 3.5, and SmolVLM2) across parameter scales from 0.5B to 8B. Evaluation uses top-1 accuracy in both zero-shot and fine-tuned (via LoRA) settings, and prompt sensitivity is investigated through three distinct, semantically targeted prompts per model. Both the placement of the prompt (prepend/append) and its formulation affect model outputs, which reflects the subjective and context-dependent nature of the task.
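The evaluation grid over prompts and placements can be sketched roughly as below. Everything here is illustrative: the prompt wordings are invented placeholders rather than the paper's prompts, the `<image>` token and the reading of "prepend/append" as text before or after the image are assumptions, and `stub_predict` stands in for a real MLLM call.

```python
from itertools import product

# Hypothetical prompt wordings; the paper's actual prompts are not reproduced.
PROMPTS = [
    "Is this image sensational? Answer yes or no.",
    "Does this image contain shocking or provocative content? Answer yes or no.",
    "Would this image trigger a strong emotional response? Answer yes or no.",
]


def build_input(prompt, placement, image_token="<image>"):
    """Place the instruction before or after the image placeholder."""
    if placement == "prepend":
        return f"{prompt}\n{image_token}"
    if placement == "append":
        return f"{image_token}\n{prompt}"
    raise ValueError(f"unknown placement: {placement}")


def top1_accuracy(predictions, labels):
    """Fraction of predictions that exactly match the reference labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def evaluate_grid(predict, images, labels, prompts, placements=("prepend", "append")):
    """Return accuracy for every (prompt, placement) combination."""
    results = {}
    for prompt, placement in product(prompts, placements):
        text = build_input(prompt, placement)
        preds = [predict(image, text) for image in images]
        results[(prompt, placement)] = top1_accuracy(preds, labels)
    return results


def stub_predict(image, text):
    """Stand-in for a model call; answers from the file name only."""
    return "yes" if "riot" in image else "no"


images = ["riot.jpg", "park.jpg", "riot2.jpg", "cat.jpg"]
labels = ["yes", "no", "yes", "yes"]
results = evaluate_grid(stub_predict, images, labels, PROMPTS)
for (prompt, placement), acc in results.items():
    print(f"{placement:8s} {acc:.2f}  {prompt[:40]}")
```

Because the stub ignores the prompt, all six cells report the same accuracy; with a real model the spread across cells is what the prompt sensitivity analysis measures.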

Key Results

Baseline: The image similarity-based baseline achieves >80% accuracy, reflecting the strength of cross-modal embedding retrieval for this binary task.
MLLM Zero-shot performance: Larger MLLMs consistently outperform both the baseline and smaller counterparts. Importantly, LLaVA OV 1.5 (8B) achieves the highest zero-shot accuracy (87.6% on full, 93.9% on strict), closely followed by Qwen3 VL (8B).
Fine-tuning: Substantial performance boosts are observed post-fine-tuning, with Qwen3 VL (8B) reaching 90.0% (full) and 95.5% (strict), marginally surpassing LLaVA OV 1.5 (8B).
Prompt Sensitivity: All MLLMs exhibit non-negligible prompt sensitivity, especially at smaller scales, with standard deviations up to 15.8 percentage points.
Label Ambiguity Effect: Accuracy is systematically higher on the strict subset, directly highlighting the challenge introduced by subjectivity in real-world annotation.

Strong claims: Fine-tuned large-scale MLLMs achieve near human-level labeling agreement for the strictest cases, and most evaluated MLLMs surpass the best cross-modal retrieval baseline by a significant margin on both subsets.
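To make the prompt sensitivity figure concrete, the sketch below computes the mean and standard deviation of accuracy across prompts, in percentage points. The accuracies are toy values, not results from the paper, and the use of the population standard deviation is an assumption; the paper may use the sample estimator instead.

```python
from statistics import mean, pstdev


def prompt_sensitivity(acc_by_prompt):
    """Return (mean, std) of accuracy across prompts, in percentage points."""
    values = [100.0 * acc for acc in acc_by_prompt.values()]
    return mean(values), pstdev(values)


# Toy accuracies for one hypothetical model under three prompts.
toy_acc = {"prompt_1": 0.80, "prompt_2": 0.70, "prompt_3": 0.90}
avg, spread = prompt_sensitivity(toy_acc)
print(f"mean={avg:.1f} pp, std={spread:.1f} pp")
```

A model whose accuracy barely moves across prompts would show a spread close to zero, while the smaller models discussed above would show a much larger one.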

Implications and Future Directions

Practically, Sens-VisualNews provides a high-quality resource for developing and benchmarking multimodal models intended for disinformation detection workflows in journalistic and automated fact-checking applications. The results indicate that current open-source MLLMs are effective for this specialized visual reasoning task but also sensitive to prompt formulation, so prompt engineering or adaptation is likely to be required during deployment. The strict subset methodology further encourages exploration of subjective and adversarial edge cases.

Theoretically, the dataset challenges the boundaries of intention-conditioned visual understanding, emphasizing the need for models to differentiate between emotional polarity and intent to provoke. The subjectivity intrinsic to the concept of "sensationalism" also opens new research avenues in modeling annotator disagreement and epistemic uncertainty.

For future developments, the dataset's release together with code and annotation tools enables further research on:

Modeling and mitigating label subjectivity with probabilistic or ensemble annotation frameworks.
Extending sensationalism detection beyond the news domain (e.g., social media, memes).
Robust multimodal fine-tuning strategies sensitive to ethical deployment in fact-checking scenarios.
Cross-lingual and cross-cultural adaptation, testing model generalization on sensationalism cues that are culturally grounded.
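As one possible starting point for the first direction, annotator votes can be turned into soft labels with an associated disagreement score instead of a single majority label. The sketch below is an assumption about how this might be done, not something the paper describes; the function names and the 0/1 vote encoding are hypothetical.

```python
import math


def soft_label(votes):
    """Fraction of annotators who voted 'sensational' (votes are 0 or 1)."""
    return sum(votes) / len(votes)


def binary_entropy(p):
    """Disagreement in bits: 0 for unanimous votes, 1 for an even split."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


# Toy votes from three annotators per image (1 = sensational).
votes_by_image = {"img_1": [1, 1, 1], "img_2": [1, 1, 0], "img_3": [0, 0, 0]}
for image_id, votes in votes_by_image.items():
    p = soft_label(votes)
    print(f"{image_id}: p(sensational)={p:.2f}, entropy={binary_entropy(p):.3f} bits")
```

Soft labels of this kind could serve as training targets or as per-sample weights, so that images with split votes contribute less certainty than unanimous ones.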

Conclusion

"Sens-VisualNews: A Benchmark Dataset for Sensational Image Detection" (2605.10394) addresses a critical gap in multimodal disinformation detection by systematically defining, curating, and evaluating the detection of sensational imagery in news. Its rigorous methodology, strong empirical results for fine-tuned MLLMs, and explicit treatment of subjective ambiguity make it a foundational resource for the continued advancement of trustworthy AI in news and media integrity.
