VERA: Explainable Video Anomaly Detection via Verbalized Learning of Vision-Language Models

Published 2 Dec 2024 in cs.AI, cs.CV, and cs.LG | (2412.01095v3)

Abstract: The rapid advancement of vision-language models (VLMs) has established a new paradigm in video anomaly detection (VAD): leveraging VLMs to simultaneously detect anomalies and provide comprehendible explanations for the decisions. Existing work in this direction often assumes the complex reasoning required for VAD exceeds the capabilities of pretrained VLMs. Consequently, these approaches either incorporate specialized reasoning modules during inference or rely on instruction tuning datasets through additional training to adapt VLMs for VAD. However, such strategies often incur substantial computational costs or data annotation overhead. To address these challenges in explainable VAD, we introduce a verbalized learning framework named VERA that enables VLMs to perform VAD without model parameter modifications. Specifically, VERA automatically decomposes the complex reasoning required for VAD into reflections on simpler, more focused guiding questions capturing distinct abnormal patterns. It treats these reflective questions as learnable parameters and optimizes them through data-driven verbal interactions between learner and optimizer VLMs, using coarsely labeled training data. During inference, VERA embeds the learned questions into model prompts to guide VLMs in generating segment-level anomaly scores, which are then refined into frame-level scores via the fusion of scene and temporal contexts. Experimental results on challenging benchmarks demonstrate that the learned questions of VERA are highly adaptable, significantly improving both detection performance and explainability of VLMs for VAD.

Summary

The paper presents VERA, a framework that uses verbalized learning to adapt vision-language models (VLMs) to video anomaly detection by optimizing guiding questions rather than model parameters.
It decomposes the complex reasoning that VAD requires into simpler guiding questions, learned through verbal interactions between a learner VLM and an optimizer VLM, and produces segment-level anomaly scores that are then refined into frame-level scores.
VERA achieves state-of-the-art performance on benchmarks like UCF-Crime, demonstrating enhanced explainability and reduced computational cost.

Introduction and Motivation

The paper "VERA: Explainable Video Anomaly Detection via Verbalized Learning of Vision-Language Models" introduces an approach to video anomaly detection (VAD) built on vision-language models (VLMs). VLMs are increasingly applied to VAD because they combine visual understanding with language reasoning, so they can both flag an anomaly and explain the decision. However, existing methods either add specialized reasoning modules at inference time or adapt the VLM through instruction tuning, which incurs substantial computational cost or data annotation overhead. VERA avoids both with a verbalized learning framework that lets VLMs perform VAD without any change to model parameters. It decomposes the complex reasoning VAD requires into simpler guiding questions and optimizes those questions through verbal interactions between a learner VLM and an optimizer VLM, using only coarsely labeled data.

Methodology

The VERA Framework

VERA treats the guiding questions as learnable parameters and optimizes them through verbal interactions between a learner VLM and an optimizer VLM. During training, VERA refines the questions so that each captures a distinct abnormal pattern. During inference, the learned questions guide the VLM in generating segment-level anomaly scores, which are then refined into frame-level scores by incorporating scene and temporal contexts. Because only the questions change, the approach needs neither instruction tuning nor additional reasoning modules, and so avoids the computational and annotation costs those strategies incur.
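To make "questions as parameters" concrete, the following minimal sketch shows one way a list of guiding questions could be embedded into a prompt and a score read back from the reply. The prompt wording, the "Score:" reply format, the [0, 1] score range, and the names `build_prompt` and `parse_score` are assumptions made for illustration; the summary above does not specify the paper's actual prompt template.

```python
import re
from typing import List, Optional


def build_prompt(questions: List[str]) -> str:
    """Embed guiding questions into a prompt (hypothetical template)."""
    numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(questions, start=1))
    return (
        "Reflect on the following questions about the video segment:\n"
        f"{numbered}\n"
        'Answer in the form "Score: <number between 0 and 1>" '
        "followed by a brief explanation."
    )


def parse_score(reply: str) -> Optional[float]:
    """Extract the anomaly score from a reply, clamped to [0, 1] (assumed range)."""
    match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", reply)
    if match is None:
        return None  # the reply did not follow the requested format
    return min(1.0, max(0.0, float(match.group(1))))


# The questions are the only thing that changes between training steps;
# the VLM that would receive this prompt stays frozen.
prompt = build_prompt(["Is anyone damaging property?", "Is anyone fighting?"])
score = parse_score("Score: 0.8. A person is breaking a window.")
```

In this framing, swapping in a different list of questions plays the role that updating weights plays in conventional training.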

Training Process

VERA utilizes a data-driven framework where guiding questions are optimized iteratively through interactions between a learner and an optimizer. The learner VLM generates anomaly predictions, which are assessed by an optimizer VLM to refine the guiding questions. This process leverages coarsely labeled data, focusing solely on video-level labels to guide reasoning.
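The learner/optimizer loop described above can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's implementation: `verbalized_training`, `toy_learner`, and `toy_optimizer` are hypothetical names, videos are represented by short text descriptions, and the two "VLMs" are simple keyword-matching stand-ins so the example runs without any model.

```python
from typing import Callable, List, Tuple

Batch = List[Tuple[str, int]]           # (video, video-level label: 1 = anomalous)
Feedback = List[Tuple[str, int, int]]   # (video, label, learner prediction)


def verbalized_training(
    questions: List[str],
    batches: List[Batch],
    learner: Callable[[str, List[str]], int],
    optimizer: Callable[[List[str], Feedback], List[str]],
) -> List[str]:
    """Iteratively refine guiding questions; no model weights are updated."""
    for batch in batches:
        # The learner predicts a video-level label using the current questions.
        feedback = [(v, y, learner(v, questions)) for v, y in batch]
        # The optimizer reads predictions against labels and rewrites the questions.
        questions = optimizer(questions, feedback)
    return questions


def toy_learner(video: str, questions: List[str]) -> int:
    # Stand-in for a learner VLM: flag the video if the last word of any
    # question appears in the video's text description.
    keywords = {q.rstrip("?").split()[-1].lower() for q in questions}
    return int(bool(keywords & set(video.lower().split())))


def toy_optimizer(questions: List[str], feedback: Feedback) -> List[str]:
    # Stand-in for an optimizer VLM: for every missed anomaly, add a question
    # about the first word of that video's description.
    updated = list(questions)
    for video, label, pred in feedback:
        if label == 1 and pred == 0:
            new_q = f"Is there any {video.split()[0]}?"
            if new_q not in updated:
                updated.append(new_q)
    return updated


batches = [
    [("fighting in a lobby", 1), ("walking in a park", 0)],
    [("arson near a car", 1)],
]
learned = verbalized_training(
    ["Is there any anomaly?"], batches, toy_learner, toy_optimizer
)
```

Note that the loop only ever consumes video-level labels, matching the coarse supervision described above; a real optimizer VLM would produce its revised questions in natural language rather than by keyword rules.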

Inference Mechanism

Inference in VERA follows a coarse-to-fine strategy. Initial segment-level anomaly scores are generated by embedding learned questions into the VLM prompts. These scores are then refined using context from surrounding scenes and temporal progression, ensuring frame-level anomaly scores accurately capture the evolution of anomalous events.
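The coarse-to-fine idea can be sketched in a few lines. The specific operations here are assumptions chosen for illustration: scene context is approximated by averaging each segment's score with those of its most similar segments (cosine similarity over hypothetical feature vectors), temporal context by Gaussian-weighted smoothing, and frame-level scores by repeating each segment score over its frames. The function names and parameters (`top_k`, `sigma`, `frames_per_segment`) are not taken from the paper.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def scene_refine(scores: List[float], features: List[List[float]], top_k: int = 2) -> List[float]:
    """Average each segment's score over its top_k most similar segments (itself included)."""
    refined = []
    for f in features:
        ranked = sorted(range(len(scores)), key=lambda j: cosine(f, features[j]), reverse=True)
        nearest = ranked[:top_k]
        refined.append(sum(scores[j] for j in nearest) / len(nearest))
    return refined


def temporal_smooth(scores: List[float], sigma: float = 1.0) -> List[float]:
    """Gaussian-weighted average of each segment score with its temporal neighbours."""
    smoothed = []
    for i in range(len(scores)):
        weights = [math.exp(-((i - j) ** 2) / (2 * sigma ** 2)) for j in range(len(scores))]
        smoothed.append(sum(w * s for w, s in zip(weights, scores)) / sum(weights))
    return smoothed


def to_frame_scores(segment_scores: List[float], frames_per_segment: int) -> List[float]:
    """Expand segment-level scores to frame level by repetition."""
    return [s for s in segment_scores for _ in range(frames_per_segment)]


segment_scores = [1.0, 0.0, 0.5]                     # e.g. parsed from VLM replies
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]      # hypothetical segment embeddings
scene_scores = scene_refine(segment_scores, features)
frame_scores = to_frame_scores(temporal_smooth(scene_scores), frames_per_segment=2)
```

The ordering (scene refinement, then temporal smoothing, then expansion to frames) is one plausible arrangement; the summary states only that scene and temporal contexts are fused to obtain frame-level scores.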

Experimental Results

The experimental evaluation demonstrates VERA's efficacy on benchmarks such as UCF-Crime and XD-Violence. VERA achieves state-of-the-art performance in explainable VAD, outperforming methods that require model parameter modifications or external modules. The results underline VERA's capability to provide robust detection and intelligible explanations grounded in learned verbal interactions.

Implications and Future Directions

VERA's novel use of verbalized learning in VAD represents an important shift in leveraging VLMs for complex reasoning tasks. By optimizing language-based parameters, VERA enhances the explainability and performance of VAD models while minimizing computational costs. This framework opens new avenues for integrating verbalized learning in other AI domains, potentially improving reasoning across various tasks without extensive retraining. Future work could explore expanding VERA to different types of anomaly detection, enhancing its adaptability and generalizability across diverse datasets and applications.

Conclusion

VERA provides a compelling solution to video anomaly detection, using verbalized learning to improve both the detection accuracy and the explanatory depth of VLMs without modifying their parameters. The framework combines visual reasoning with language interaction, offering a scalable, efficient way to address the limitations of current VAD approaches. By integrating learned guiding questions into model prompts, VERA advances the ability of VLMs to provide comprehensible, human-oriented predictions in video anomaly detection.
