Harnessing Large Language Models for Training-free Video Anomaly Detection

Published 1 Apr 2024 in cs.CV | (2404.01014v1)

Abstract: Video anomaly detection (VAD) aims to temporally locate abnormal events in a video. Existing works mostly rely on training deep models to learn the distribution of normality with either video-level supervision, one-class supervision, or in an unsupervised setting. Training-based methods are prone to be domain-specific, thus being costly for practical deployment as any domain change will involve data collection and model training. In this paper, we radically depart from previous efforts and propose LAnguage-based VAD (LAVAD), a method tackling VAD in a novel, training-free paradigm, exploiting the capabilities of pre-trained large language models (LLMs) and existing vision-language models (VLMs). We leverage VLM-based captioning models to generate textual descriptions for each frame of any test video. With the textual scene description, we then devise a prompting mechanism to unlock the capability of LLMs in terms of temporal aggregation and anomaly score estimation, turning LLMs into an effective video anomaly detector. We further leverage modality-aligned VLMs and propose effective techniques based on cross-modal similarity for cleaning noisy captions and refining the LLM-based anomaly scores. We evaluate LAVAD on two large datasets featuring real-world surveillance scenarios (UCF-Crime and XD-Violence), showing that it outperforms both unsupervised and one-class methods without requiring any training or data collection.

Summary

The paper introduces LAVAD, a training-free method using pre-trained vision-language and large language models for video anomaly detection.
It cleans noisy frame captions, aggregates them over a temporal window with an LLM, and refines the resulting anomaly scores, outperforming unsupervised and one-class methods.
Evaluations on UCF-Crime and XD-Violence datasets demonstrate improved AUC performance, highlighting its potential in privacy-constrained environments.

Introduction to Training-free Video Anomaly Detection

The paper "Harnessing Large Language Models for Training-free Video Anomaly Detection" (2404.01014) introduces an approach to video anomaly detection (VAD) that requires no model training. Training-based methods tend to be domain-specific and require costly data collection whenever the domain changes. The authors propose LAVAD, a training-free method that leverages modality-aligned vision-language models (VLMs) and pre-trained large language models (LLMs). Because it skips the training phase entirely, the method suits applications with privacy constraints where data collection is problematic.

Figure 1: A training-free VAD method diverging from state-of-the-art methods relying on training-based techniques.

Methodology: LAVAD Framework

LAVAD employs a series of components designed to efficiently process and evaluate video inputs for anomaly detection:

Caption Generation and Cleaning: Using VLM-based captioning models like BLIP-2, LAVAD generates textual descriptions for each video frame. Noisy and incorrect captions are refined using cross-modal similarity metrics to ensure that the most semantically aligned captions are selected.
Temporal Aggregation and Anomaly Scoring: An LLM summarizes the cleaned captions within a temporal window around each frame. The LLM is then queried with a structured prompt to estimate an anomaly score from that summary.
Score Refinement: The anomaly scores are further refined by aggregating scores from semantically similar frames within the video, leveraging video-text similarity assessments to enhance precision.

Figure 2: LAVAD architecture for training-free VAD, including components for capturing scene context and dynamics.
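
The caption cleaning and score refinement steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes image and text embeddings have already been computed by some modality-aligned encoder, and the names `clean_captions` and `refine_scores`, the toy 2-D embeddings, the top-k neighbourhood, and the softmax-style weighting are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def clean_captions(frame_embs, caption_embs, captions):
    """For each frame, pick the caption from the whole video whose text
    embedding is most similar to that frame's image embedding."""
    cleaned = []
    for f in frame_embs:
        best = max(range(len(captions)), key=lambda j: cosine(f, caption_embs[j]))
        cleaned.append(captions[best])
    return cleaned


def refine_scores(scores, query_embs, key_embs, k=2):
    """Replace each score with a weighted average of the scores of the k most
    similar items. The exp(similarity) weighting is an illustrative choice."""
    refined = []
    for q in query_embs:
        sims = [cosine(q, key) for key in key_embs]
        top = sorted(range(len(sims)), key=lambda j: sims[j], reverse=True)[:k]
        weights = [math.exp(sims[j]) for j in top]
        total = sum(weights)
        refined.append(sum(w * scores[j] for w, j in zip(weights, top)) / total)
    return refined


# Toy data (hypothetical): three frames, three raw captions, 2-D embeddings.
frame_embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
captions = ["a man walks down a street", "a blurry image", "a car crashes into a wall"]
caption_embs = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]

# The noisy caption of frame 1 is replaced by the better-aligned caption 0.
cleaned = clean_captions(frame_embs, caption_embs, captions)

# Hypothetical raw LLM scores and text embeddings of each frame's summary.
raw_scores = [0.1, 0.9, 0.8]
summary_embs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
refined = refine_scores(raw_scores, frame_embs, summary_embs, k=2)
```

In this toy run, frame 1 looks almost identical to frame 0, so its isolated high score is averaged with frame 0's low score. A real pipeline would swap the toy lists for encoder outputs and would need to tune `k`.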

Evaluation and Results

LAVAD was evaluated on two benchmark datasets of real-world surveillance footage, UCF-Crime and XD-Violence, where it surpassed unsupervised and one-class VAD methods without any training or data collection.

LAVAD achieved higher AUC scores than the unsupervised and one-class baselines, which suggests that LLMs reasoning over textual scene descriptions can capture video context and dynamics without task-specific training.

Figure 3: VAD performance compared using different captioning models and LLMs on the UCF-Crime test set.
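
For readers unfamiliar with the metric, the sketch below shows one way an AUC can be computed from per-frame anomaly scores and binary ground-truth labels. It assumes the reported AUC is a frame-level ROC AUC; the function name `roc_auc` and the toy labels and scores are hypothetical and are not taken from the paper.

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a randomly chosen anomalous frame is
    scored higher than a randomly chosen normal frame (ties count as half).
    Quadratic in the number of frames, so suitable only for small examples."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one normal and one anomalous frame")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (made-up values): 1 marks an anomalous frame, 0 a normal one.
labels = [0, 0, 1, 1, 0]
scores = [0.1, 0.4, 0.8, 0.35, 0.2]
auc = roc_auc(labels, scores)
```

For full-length videos, a library routine such as `sklearn.metrics.roc_auc_score(y_true, y_score)` computes the same quantity far more efficiently.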

Implications and Future Directions

The training-free paradigm introduced by LAVAD is promising for applications where constraints on data collection and privacy are critical. By incorporating pre-trained foundation models, the methodology benefits from broader generalization and adaptability to previously unseen domains.

The research invites further exploration of foundation models for anomaly detection in other modalities, such as audio and text. Future work might make caption cleaning and anomaly scoring more robust, add real-time capabilities, and integrate multimodal cues more deeply.

Conclusion

The study is a step toward efficient, training-free video anomaly detection that harnesses the capabilities of VLMs and LLMs. LAVAD achieves competitive results and makes it possible to deploy VAD systems in environments where extensive data annotation and training are impractical. The approach shows how foundation models can be repurposed for a specific application, pointing to a shift in anomaly detection strategies for surveillance and safety-related tasks.