- The paper presents VERA, a framework that applies verbalized learning to optimize the guiding questions given to vision-language models (VLMs) for video anomaly detection.
- It decomposes complex reasoning into guiding questions learned through verbal interactions between VLMs, keeping computation efficient, and refines segment-level anomaly scores into frame-level scores.
- VERA achieves state-of-the-art performance on benchmarks like UCF-Crime, demonstrating enhanced explainability and reduced computational cost.
VERA: Explainable Video Anomaly Detection via Verbalized Learning of Vision-Language Models
Introduction and Motivation
The paper "VERA: Explainable Video Anomaly Detection via Verbalized Learning of Vision-Language Models" introduces an approach to video anomaly detection (VAD) using vision-language models (VLMs). VLMs are increasingly applied to VAD because they combine visual understanding with language reasoning. However, existing methods often modify the VLM (for example through instruction tuning) or add external reasoning modules, both of which raise computational cost. VERA avoids these costs with a verbalized learning framework that lets a VLM perform VAD without any change to its parameters. It decomposes the complex reasoning behind VAD into simpler guiding questions and optimizes those questions through verbal interactions between a learner VLM and an optimizer VLM, using coarsely labeled data.
Methodology
The VERA Framework
VERA treats the guiding questions as learnable parameters and optimizes them through verbal interactions between VLMs, while the model weights stay fixed. During training, VERA refines the questions so that they capture distinct abnormal patterns; at inference, the learned questions guide the VLM in generating segment-level anomaly scores. These scores are then refined into frame-level scores by incorporating scene and temporal context. Because no instruction tuning or additional reasoning module is needed, computational expense is significantly lower.
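To make the idea of questions as parameters concrete, here is a minimal sketch of how a set of guiding questions might be embedded into a prompt. The `GuidingQuestions` class, the `build_prompt` function, the prompt wording, and the two example questions are all illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class GuidingQuestions:
    """The only 'learnable parameters' in this sketch: plain-text questions."""
    questions: list[str] = field(default_factory=list)


def build_prompt(params: GuidingQuestions) -> str:
    """Embed the current guiding questions into a hypothetical VLM instruction."""
    numbered = "\n".join(
        f"{i}. {q}" for i, q in enumerate(params.questions, start=1)
    )
    return (
        "Watch the video segment and answer the questions below.\n"
        f"{numbered}\n"
        "Then reply with 1 if the segment is anomalous, otherwise 0, "
        "and briefly explain your answer."
    )


# Hypothetical questions; in VERA these would be the output of training.
params = GuidingQuestions([
    "Is anyone acting in a way that could harm people or property?",
    "Does the activity fit what is normally expected in this scene?",
])
prompt = build_prompt(params)
print(prompt)
```

Updating the model in this setting means replacing the strings in `params.questions`; nothing in the VLM itself changes.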
Training Process
VERA learns the guiding questions from data, optimizing them iteratively through interactions between a learner and an optimizer. The learner VLM generates anomaly predictions using the current questions, and the optimizer VLM assesses those predictions to propose refined questions. Training relies on coarsely labeled data: only video-level labels are used to guide the reasoning.
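The learner and optimizer loop can be sketched as follows. This is a toy illustration under stated assumptions: in VERA both roles are played by VLMs, whereas here `toy_learner` and `toy_optimizer` are hypothetical keyword-matching functions, videos are replaced by short text descriptions, and the update rule is invented purely so the example runs end to end.

```python
def verbalized_training(questions, batches, learner, optimizer):
    """Refine guiding questions over batches of (video, video_level_label) pairs."""
    for batch in batches:
        # Learner: predict a video-level label using the current questions.
        feedback = [
            (video, label, learner(video, questions)) for video, label in batch
        ]
        # Optimizer: compare predictions with labels and propose new questions.
        questions = optimizer(questions, feedback)
    return questions


def toy_learner(video, questions):
    # Assumption: flag a video if a question's last word occurs in its description.
    return int(any(q.split()[-1].rstrip("?") in video for q in questions))


def toy_optimizer(questions, feedback):
    # Assumption: add one question for every anomalous video the learner missed.
    updated = list(questions)
    for video, label, prediction in feedback:
        if label == 1 and prediction == 0:
            candidate = f"Is there any {video.split()[-1]}?"
            if candidate not in updated:
                updated.append(candidate)
    return updated


batches = [
    [("two people fighting", 1), ("a person walking", 0)],
    [("a car exploding", 1), ("a dog sleeping", 0)],
]
learned = verbalized_training(
    ["Is there any fighting?"], batches, toy_learner, toy_optimizer
)
print(learned)
```

The structure mirrors gradient-based training: the learner's pass plays the role of the forward pass, and the optimizer's rewrite of the questions plays the role of the parameter update, except that both happen in natural language.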
Inference Mechanism
Inference in VERA follows a coarse-to-fine strategy. Initial segment-level anomaly scores are generated by embedding learned questions into the VLM prompts. These scores are then refined using context from surrounding scenes and temporal progression, ensuring frame-level anomaly scores accurately capture the evolution of anomalous events.
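A rough sketch of the coarse-to-fine idea is shown below. The specific operations are assumptions chosen for illustration and are not the paper's formulas: neighbor averaging stands in for scene context, and Gaussian smoothing over frames stands in for temporal context. The function names and the sample scores are hypothetical.

```python
import math


def neighbor_average(scores, radius=1):
    """Blend each segment score with its neighbors (stand-in for scene context)."""
    out = []
    for i in range(len(scores)):
        window = scores[max(0, i - radius): i + radius + 1]
        out.append(sum(window) / len(window))
    return out


def to_frames(segment_scores, frames_per_segment):
    """Copy each segment score to every frame of that segment."""
    return [s for s in segment_scores for _ in range(frames_per_segment)]


def gaussian_smooth(values, sigma=2.0):
    """Smooth frame scores over time with a Gaussian kernel."""
    radius = int(3 * sigma)
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-(k * k) / (2 * sigma * sigma)) for k in offsets]
    smoothed = []
    for i in range(len(values)):
        num = den = 0.0
        for k, w in zip(offsets, kernel):
            j = i + k
            if 0 <= j < len(values):  # renormalize the kernel at the borders
                num += w * values[j]
                den += w
        smoothed.append(num / den)
    return smoothed


# Hypothetical binary segment-level scores from the VLM (1 = anomalous).
segment_scores = [0, 0, 1, 1, 0, 0]
frame_scores = gaussian_smooth(
    to_frames(neighbor_average(segment_scores), frames_per_segment=4)
)
```

With these inputs the frame-level curve rises gradually toward the middle of the clip and falls off again, rather than jumping between 0 and 1 at segment boundaries.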
Experimental Results
The experimental evaluation demonstrates VERA's efficacy on benchmarks such as UCF-Crime and XD-Violence. VERA achieves state-of-the-art performance in explainable VAD, outperforming methods that require model parameter modifications or external modules. The results underline VERA's capability to provide robust detection and intelligible explanations grounded in learned verbal interactions.
Implications and Future Directions
VERA's use of verbalized learning shifts the adaptation of VLMs for VAD from updating model weights to optimizing language-based parameters, which improves explainability and detection performance while keeping computational cost low. The same approach could carry over to other AI domains, potentially improving reasoning on a range of tasks without extensive retraining. Future work could extend VERA to other types of anomaly detection and test how well it adapts and generalizes across diverse datasets and applications.
Conclusion
VERA addresses video anomaly detection by using verbalized learning to optimize the questions a VLM is asked, improving both detection accuracy and the depth of its explanations. The framework combines visual reasoning with language interaction, offering a scalable and efficient way to address the limitations of current VAD approaches. By integrating learned guiding questions, VERA enables VLMs to give comprehensible, human-oriented predictions in video anomaly detection.