Emergent Mind

Abstract

A self-supervised multi-task learning (SSMTL) framework for video anomaly detection was recently introduced in literature. Due to its highly accurate results, the method attracted the attention of many researchers. In this work, we revisit the self-supervised multi-task learning framework, proposing several updates to the original method. First, we study various detection methods, e.g. based on detecting high-motion regions using optical flow or background subtraction, since we believe the currently used pre-trained YOLOv3 is suboptimal, e.g. objects in motion or objects from unknown classes are never detected. Second, we modernize the 3D convolutional backbone by introducing multi-head self-attention modules, inspired by the recent success of vision transformers. As such, we alternatively introduce both 2D and 3D convolutional vision transformer (CvT) blocks. Third, in our attempt to further improve the model, we study additional self-supervised learning tasks, such as predicting segmentation maps through knowledge distillation, solving jigsaw puzzles, estimating body pose through knowledge distillation, predicting masked regions (inpainting), and adversarial learning with pseudo-anomalies. We conduct experiments to assess the performance impact of the introduced changes. Upon finding more promising configurations of the framework, dubbed SSMTL++v1 and SSMTL++v2, we extend our preliminary experiments to more data sets, demonstrating that our performance gains are consistent across all data sets. In most cases, our results on Avenue, ShanghaiTech and UBnormal raise the state-of-the-art performance bar to a new level.

Overview

  • SSMTL++ is a revised approach to self-supervised multi-task learning that significantly enhances video anomaly detection capabilities through several noteworthy updates.

  • Object detection now combines the pre-trained YOLOv3 detector with motion-based detection from optical flow, and the 3D convolutional backbone gains multi-head self-attention through 2D and 3D convolutional vision transformer (CvT) blocks.

  • New proxy tasks such as adversarial training on pseudo-anomalies and patch inpainting are introduced to enrich the model's learning base and improve anomaly detection performance.

  • Experiments on Avenue, ShanghaiTech and UBnormal show that SSMTL++ outperforms the original SSMTL while keeping running times competitive.

Revising Self-Supervised Multi-Task Learning for Enhanced Video Anomaly Detection

Introduction to SSMTL++

Self-supervised multi-task learning (SSMTL) frameworks have driven notable progress in video anomaly detection. These frameworks train jointly on multiple proxy tasks to improve anomaly detection accuracy without the need for labeled anomaly data. SSMTL++ is a revised version of the SSMTL approach that introduces several updates aimed at raising state-of-the-art performance in detecting anomalies in video sequences.

Enhancements in Detection and Architecture

One of the foundational improvements in SSMTL++ concerns object detection. The original framework relies on a pre-trained YOLOv3 detector alone, which the authors consider suboptimal because it can miss objects in motion and never detects objects from classes it was not trained on. SSMTL++ therefore complements YOLOv3 with motion-based detection that uses optical flow to find high-motion regions (the authors also study background subtraction). The combination identifies a larger set of objects within video frames and so widens the range of anomalies the model can detect.
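The idea of complementing a class-based detector with motion regions can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names (`motion_boxes`, `merge_detections`), the thresholds, and the rule of keeping a motion box only when it does not overlap an existing detection are all assumptions made for the example. The flow field is taken as a precomputed `(H, W, 2)` array, and the detector boxes stand in for YOLOv3 output.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def motion_boxes(flow, mag_thresh=1.0, min_area=4):
    """Bounding boxes of 4-connected regions whose flow magnitude exceeds
    mag_thresh. Both thresholds are hypothetical values for illustration."""
    mask = np.linalg.norm(flow, axis=-1) > mag_thresh
    seen = np.zeros(mask.shape, dtype=bool)
    h, w = mask.shape
    boxes = []
    for sy, sx in zip(*np.nonzero(mask)):
        if seen[sy, sx]:
            continue
        # Flood fill one connected high-motion region.
        stack, ys, xs = [(int(sy), int(sx))], [], []
        seen[sy, sx] = True
        while stack:
            y, x = stack.pop()
            ys.append(y)
            xs.append(x)
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                    seen[ny, nx] = True
                    stack.append((ny, nx))
        if len(ys) >= min_area:
            boxes.append((min(xs), min(ys), max(xs) + 1, max(ys) + 1))
    return boxes

def merge_detections(detector_boxes, flow_boxes, iou_thresh=0.5):
    """Keep every detector box, and add a motion box only if it does not
    already overlap a detector box (assumed merging rule)."""
    merged = list(detector_boxes)
    for fb in flow_boxes:
        if all(iou(fb, db) < iou_thresh for db in detector_boxes):
            merged.append(fb)
    return merged

# Usage: two moving regions, only one of which the detector found.
flow = np.zeros((20, 20, 2))
flow[2:6, 3:8] = (2.0, 0.0)       # moving object of an unknown class
flow[12:16, 12:16] = (0.0, 3.0)   # moving object the detector also found
detections = merge_detections([(12, 12, 16, 16)], motion_boxes(flow))
```

In this toy case the region missed by the detector is added as an extra box, while the region that both sources agree on is kept only once.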

This work also modernizes the backbone of the model by adding multi-head self-attention modules to the 3D convolutional architecture, in the form of 2D and 3D convolutional vision transformer (CvT) blocks. The change is inspired by the success of vision transformers (ViTs) and departs from the plain 3D CNN used in the original SSMTL framework. The new backbone is intended to increase the learning capacity of the framework by letting each location in the feature map attend to all other spatial and temporal locations.
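To make the attention step concrete, the sketch below applies multi-head self-attention to a spatio-temporal feature map by treating every (time, row, column) position as a token. It is a simplified illustration under stated assumptions: CvT blocks use convolutional projections for queries, keys and values, whereas this example uses plain linear projections, and the weight matrices `wq`, `wk`, `wv`, `wo` are hypothetical parameters rather than anything taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(feat, wq, wk, wv, wo, num_heads):
    """Self-attention over a (T, H, W, C) feature map with a residual connection.

    Linear projections are an assumption made for brevity; a CvT block
    would compute q, k and v with convolutions instead.
    """
    t, h, w, c = feat.shape
    if c % num_heads != 0:
        raise ValueError("channel count must be divisible by num_heads")
    d = c // num_heads
    tokens = feat.reshape(t * h * w, c)              # (N, C), one token per position

    def split_heads(x):                              # (N, C) -> (heads, N, d)
        return x.reshape(-1, num_heads, d).transpose(1, 0, 2)

    q = split_heads(tokens @ wq)
    k = split_heads(tokens @ wk)
    v = split_heads(tokens @ wv)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))   # (heads, N, N)
    out = (attn @ v).transpose(1, 0, 2).reshape(-1, c)      # concatenate heads
    return (tokens + out @ wo).reshape(t, h, w, c)          # residual, original shape

# Usage on a small random feature map (2 frames, 3x3 positions, 8 channels).
rng = np.random.default_rng(0)
feat = rng.normal(size=(2, 3, 3, 8))
wq, wk, wv, wo = (rng.normal(size=(8, 8)) * 0.1 for _ in range(4))
attended = multi_head_self_attention(feat, wq, wk, wv, wo, num_heads=2)
```

Because tokens span both time and space, each output position mixes information from every frame in the clip, which is the property a purely local 3D convolution lacks.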

New Proxy Tasks for Enhanced Performance

SSMTL++ experiments with additional proxy tasks, including adversarial training on pseudo-anomalies and patch inpainting, alongside segmentation map prediction, jigsaw puzzle solving and body pose estimation. Adversarial training on pseudo-anomalies optimizes the network so that it becomes deliberately worse at representing pseudo-anomaly patterns, which should widen the gap between how the model handles normal and abnormal inputs. Patch inpainting is a self-supervised proxy task in which the model predicts missing portions of the input. Because training uses only normal data, the model learns what normal content looks like and is expected to fill in abnormal regions less accurately.
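The two tasks above can be illustrated with a toy sketch. Everything here is an assumption made for the example rather than the authors' implementation: the zero-filled square mask, the mean squared error restricted to hidden pixels, the scalar reconstruction model `w * x`, and the choice to realize the adversarial objective by flipping the sign of the gradient step for pseudo-anomalies.

```python
import numpy as np

def mask_patch(frame, top, left, size):
    """Return a copy of the frame with a square patch zeroed out, plus a
    boolean array marking the hidden pixels."""
    hidden = np.zeros(frame.shape, dtype=bool)
    hidden[top:top + size, left:left + size] = True
    masked = frame.copy()
    masked[hidden] = 0.0
    return masked, hidden

def inpainting_loss(prediction, frame, hidden):
    """Mean squared error computed over the hidden pixels only."""
    return float(np.mean((prediction[hidden] - frame[hidden]) ** 2))

def recon_loss_and_grad(w, x):
    """Loss and gradient of a toy scalar reconstruction model w * x ~ x."""
    err = w * x - x
    return err ** 2, 2.0 * err * x

def sgd_step(w, x, is_pseudo_anomaly, lr=0.1):
    """Gradient descent on normal samples, gradient ascent on
    pseudo-anomalies (assumed form of the adversarial objective)."""
    _, grad = recon_loss_and_grad(w, x)
    sign = -1.0 if is_pseudo_anomaly else 1.0
    return w - lr * sign * grad

# Usage: hide a 3x3 patch, then compare the two kinds of update.
frame = np.ones((8, 8))
masked, hidden = mask_patch(frame, top=2, left=2, size=3)
loss_if_unfilled = inpainting_loss(masked, frame, hidden)

w0, x = 0.5, 2.0
loss_before, _ = recon_loss_and_grad(w0, x)
loss_normal, _ = recon_loss_and_grad(sgd_step(w0, x, False), x)
loss_pseudo, _ = recon_loss_and_grad(sgd_step(w0, x, True), x)
```

After one step, the reconstruction loss drops for the sample treated as normal and rises for the sample treated as a pseudo-anomaly, which is the behavior the adversarial task is meant to encourage.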

Evaluation and Results

Experiments on the widely used Avenue, ShanghaiTech and UBnormal datasets show that both SSMTL++ variants (SSMTL++v1 and SSMTL++v2) surpass the original SSMTL, with gains that the authors report as consistent across all datasets. In most cases, the results also set a new state of the art. The improvements are attributed to the combined upgrades in object detection, backbone architecture, and proxy tasks.

Running Time Considerations

The optical flow computation used for object detection and the deeper transformer-based backbone both add computational overhead, which increases the model's running time. The paper nevertheless reports that SSMTL++ maintains competitive running times while improving anomaly detection performance, a balance that supports its practical use in real-world anomaly detection applications.

Conclusion

SSMTL++ shows how an existing video anomaly detection framework can be improved by revisiting its individual components. Through updates to the detection process, the backbone architecture, and the set of learning tasks, this work improves on the accuracy of the original SSMTL in identifying anomalies within video sequences. These findings are likely to inform future work on video anomaly detection.
