Harnessing Large Language Models for Training-free Video Anomaly Detection

Published 1 Apr 2024 in cs.CV | (2404.01014v1)

Abstract: Video anomaly detection (VAD) aims to temporally locate abnormal events in a video. Existing works mostly rely on training deep models to learn the distribution of normality with either video-level supervision, one-class supervision, or in an unsupervised setting. Training-based methods are prone to be domain-specific, thus being costly for practical deployment as any domain change will involve data collection and model training. In this paper, we radically depart from previous efforts and propose LAnguage-based VAD (LAVAD), a method tackling VAD in a novel, training-free paradigm, exploiting the capabilities of pre-trained large language models (LLMs) and existing vision-language models (VLMs). We leverage VLM-based captioning models to generate textual descriptions for each frame of any test video. With the textual scene description, we then devise a prompting mechanism to unlock the capability of LLMs in terms of temporal aggregation and anomaly score estimation, turning LLMs into an effective video anomaly detector. We further leverage modality-aligned VLMs and propose effective techniques based on cross-modal similarity for cleaning noisy captions and refining the LLM-based anomaly scores. We evaluate LAVAD on two large datasets featuring real-world surveillance scenarios (UCF-Crime and XD-Violence), showing that it outperforms both unsupervised and one-class methods without requiring any training or data collection.


Summary

  • The paper introduces LAVAD, a training-free method using pre-trained vision-language and large language models for video anomaly detection.
  • It refines generated captions and aggregates temporal data to compute robust anomaly scores that outperform traditional unsupervised methods.
  • Evaluations on UCF-Crime and XD-Violence datasets demonstrate improved AUC performance, highlighting its potential in privacy-constrained environments.

Harnessing LLMs for Training-free Video Anomaly Detection

Introduction to Training-free Video Anomaly Detection

The paper "Harnessing Large Language Models for Training-free Video Anomaly Detection" (2404.01014) proposes LAVAD, a video anomaly detection (VAD) method that requires no training. Training-based VAD models tend to be domain-specific, so any domain change involves new data collection and model training. LAVAD instead combines pre-trained large language models (LLMs) with existing vision-language models (VLMs), including modality-aligned ones. Because it needs no training data, it suits applications where privacy constraints make data collection problematic.

Figure 1: A training-free VAD method diverging from state-of-the-art methods relying on training-based techniques.

Methodology: LAVAD Framework

LAVAD processes a test video in three stages:

  1. Caption Generation and Cleaning: A VLM-based captioning model such as BLIP-2 generates a textual description for each video frame. Because these captions can be noisy or incorrect, cross-modal similarity is used to keep the caption that is most semantically aligned with each frame.
  2. Temporal Aggregation and Anomaly Scoring: An LLM aggregates the cleaned captions within a temporal window around each frame into a text summary. The LLM is then queried with a structured prompt to estimate an anomaly score for that summary.
  3. Score Refinement: Each frame's anomaly score is refined by aggregating the scores of semantically similar frames in the same video, identified through video-text similarity.

    Figure 2: LAVAD architecture for training-free VAD, including components for capturing scene context and dynamics.
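
The three stages above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the prompt wording, the softmax weighting, and the `k` and `temperature` parameters are all assumptions made for illustration, and toy 2-D vectors stand in for the embeddings a modality-aligned encoder would produce.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def clean_captions(frame_embs, caption_embs, captions):
    """Stage 1: give each frame the caption, drawn from the whole video's
    caption pool, whose text embedding best matches the frame embedding."""
    cleaned = []
    for frame in frame_embs:
        best = max(range(len(captions)), key=lambda j: cosine(frame, caption_embs[j]))
        cleaned.append(captions[best])
    return cleaned


def build_prompt(summary):
    """Stage 2 (hypothetical wording): ask an LLM to rate a temporal summary."""
    return (
        "Rate how anomalous the following scene is on a scale from 0 to 1, "
        "where 0 is a normal scene and 1 is clearly suspicious. "
        "Reply with a single number.\nScene: " + summary
    )


def refine_scores(scores, video_embs, summary_embs, k=2, temperature=0.1):
    """Stage 3: replace each frame's score with a weighted mean of the scores
    of the k frames whose summaries best match this frame's video embedding.
    The softmax weighting and the temperature are assumptions of this sketch."""
    refined = []
    for video in video_embs:
        sims = [cosine(video, s) for s in summary_embs]
        top = sorted(range(len(sims)), key=lambda j: sims[j], reverse=True)[:k]
        weights = [math.exp(sims[j] / temperature) for j in top]
        total = sum(weights)
        refined.append(sum(w * scores[j] for w, j in zip(weights, top)) / total)
    return refined


# Toy 2-D embeddings standing in for real image, text, and video encoder outputs.
frame_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]
captions = ["a man walks", "a car burns", "a car is on fire"]
caption_embs = [[0.9, 0.1], [0.1, 0.9], [0.0, 1.0]]
raw_scores = [0.1, 0.9, 0.8]  # in practice, parsed from the LLM replies

print(clean_captions(frame_embs, caption_embs, captions))
print(build_prompt("A car is on fire in a parking lot."))
print(refine_scores(raw_scores, frame_embs, caption_embs, k=2))
```

In this sketch the LLM call itself is left out: `build_prompt` only produces the query text, and `raw_scores` plays the role of the parsed replies. Picking captions from the whole video's pool, rather than only from the frame's own caption, is what lets a clean caption from a neighbouring frame replace a noisy one.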

Evaluation and Results

LAVAD was evaluated on two large datasets of real-world surveillance footage, UCF-Crime and XD-Violence. It outperforms both unsupervised and one-class VAD methods without requiring any training or data collection.

  • LAVAD achieves higher AUC scores than the unsupervised methods it is compared with, which suggests that LLMs reasoning over frame captions can capture scene context and dynamics well enough for anomaly detection.

    Figure 3: VAD performance compared using different captioning models and LLMs on the UCF-Crime test set.
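
For readers unfamiliar with the metric, the sketch below computes a ROC AUC from per-frame labels and anomaly scores. It assumes the AUC is measured at frame level, with label 1 for anomalous frames and 0 for normal ones; the function name and the toy data are illustrative and are not taken from the paper.

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a randomly chosen anomalous frame
    scores higher than a randomly chosen normal frame (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one normal and one anomalous frame")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Four frames: two normal (0) and two anomalous (1).
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 0.75
```

The pairwise comparison is quadratic in the number of frames, which is fine for a small illustration; a rank-based computation would be the usual choice for full test sets.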

Implications and Future Directions

The training-free paradigm introduced by LAVAD is promising for applications where data collection is constrained and privacy is critical. Because it relies on pre-trained foundation models rather than domain-specific training, the method avoids the data collection and model training that a domain change would otherwise require.

The work invites further exploration of foundation models for anomaly detection in other modalities, such as audio and text. Future work might make the caption cleaning and anomaly scoring steps more robust, add real-time capability, and integrate multimodal cues more deeply.

Conclusion

The study is a step toward training-free video anomaly detection built on the capabilities of VLMs and LLMs. LAVAD achieves competitive results and opens new perspectives for deploying VAD systems in environments where extensive data annotation and training are impractical. The approach shows how foundation models can be repurposed for a specific application, and it points to a shift in anomaly detection strategies for surveillance and safety-related tasks.
