Self-Supervised Representation Learning for Visual Anomaly Detection (2006.09654v1)

Published 17 Jun 2020 in cs.CV, cs.LG, and eess.IV

Abstract: Self-supervised learning allows for better utilization of unlabelled data. The feature representation obtained by self-supervision can be used in downstream tasks such as classification, object detection, segmentation, and anomaly detection. While classification, object detection, and segmentation have been investigated with self-supervised learning, anomaly detection needs more attention. We consider the problem of anomaly detection in images and videos, and present a new visual anomaly detection technique for videos. Numerous seminal and state-of-the-art self-supervised methods are evaluated for anomaly detection on a variety of image datasets. The best performing image-based self-supervised representation learning method is then used for video anomaly detection to see the importance of spatial features in visual anomaly detection in videos. We also propose a simple self-supervision approach for learning temporal coherence across video frames without the use of any optical flow information. At its core, our method identifies the frame indices of a jumbled video sequence allowing it to learn the spatiotemporal features of the video. This intuitive approach shows superior performance of visual anomaly detection compared to numerous methods for images and videos on UCF101 and ILSVRC2015 video datasets.

Summary

The paper presents a novel approach that leverages a frame permutation prediction task to learn robust spatiotemporal features for anomaly detection.
It demonstrates that self-supervised techniques, especially rotation prediction, can outperform traditional methods across diverse datasets like CIFAR-10, Fashion-MNIST, UCF101, and ILSVRC2015.
The study highlights the potential of SSL for real-world applications where labeled data is scarce, such as surveillance and autonomous systems.

Self-Supervised Representation Learning for Visual Anomaly Detection

This essay critically explores the paper "Self-Supervised Representation Learning for Visual Anomaly Detection" (2006.09654), which presents a novel approach to leveraging self-supervised learning (SSL) for anomaly detection in both images and videos. The work is motivated by the need to use unlabeled data effectively to capture features relevant to challenging downstream tasks such as anomaly detection.

Anomaly Detection and Self-Supervised Learning

Anomaly detection, often viewed as a form of one-class classification, aims to distinguish in-distribution (normal) data from out-of-distribution (anomalous) instances. In this paper, the authors harness SSL to mitigate the reliance on labeled data, which is expensive and labor-intensive to acquire. Self-supervised learning involves formulating a pretext task from unlabeled data, allowing networks to learn meaningful representations that can be transferred to more complex tasks.
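To make the one-class setup concrete, the sketch below measures how well a set of anomaly scores separates normal from anomalous samples using the area under the ROC curve. AUROC is an assumption here: this summary does not name the paper's evaluation metric, and the function name `auroc` and the toy scores are purely illustrative.

```python
from typing import Sequence


def auroc(scores: Sequence[float], is_anomaly: Sequence[bool]) -> float:
    """Probability that a random anomaly receives a higher score than a
    random normal sample (ties count as 0.5)."""
    anomalous = [s for s, a in zip(scores, is_anomaly) if a]
    normal = [s for s, a in zip(scores, is_anomaly) if not a]
    if not anomalous or not normal:
        raise ValueError("need at least one normal and one anomalous sample")

    wins = 0.0
    for a in anomalous:
        for n in normal:
            if a > n:
                wins += 1.0
            elif a == n:
                wins += 0.5
    return wins / (len(anomalous) * len(normal))


# Toy example (made-up scores): higher score means "more anomalous".
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [False, False, True, True]
toy_auroc = auroc(toy_scores, toy_labels)
```

A value of 1.0 means every anomaly is ranked above every normal sample, while 0.5 corresponds to random ranking.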

Methodology: Learning Spatiotemporal Features

The paper primarily focuses on deriving spatiotemporal features for videos without relying on optical flow information. The innovative aspect lies in the use of a "frame permutation prediction task," which involves permuting video frames and training a neural network to predict their correct order. By solving this pretext task, the network learns both low- and high-level features, which are crucial for identifying anomalies in videos.
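The data side of the frame permutation task can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `make_permutation_sample` and `restore_order` are hypothetical, and the frames are stand-in strings rather than image tensors. The target for each jumbled position is the original index of the frame placed there.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")


def make_permutation_sample(
    frames: Sequence[Frame], rng: random.Random
) -> Tuple[List[Frame], List[int]]:
    """Jumble a short frame sequence and return (jumbled_frames, target).

    target[k] is the original index of the frame now at position k,
    which is what a network would be trained to predict.
    """
    order = list(range(len(frames)))
    rng.shuffle(order)
    jumbled = [frames[i] for i in order]
    return jumbled, order


def restore_order(jumbled: Sequence[Frame], target: Sequence[int]) -> List[Frame]:
    """Undo the jumbling using the (predicted or true) original indices."""
    restored: List[Frame] = [jumbled[0]] * len(jumbled)  # placeholder fill
    for position, original_index in enumerate(target):
        restored[original_index] = jumbled[position]
    return restored


# Stand-in frames; in practice these would be image arrays.
clip = ["frame0", "frame1", "frame2", "frame3"]
jumbled_clip, target_indices = make_permutation_sample(clip, random.Random(0))
```

Because the labels come from the shuffle itself, no manual annotation is needed, which is the point of a pretext task.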

Anomaly Detection in Images

The research compares self-supervised techniques such as jigsaw puzzles, rotation prediction, and colorization against existing anomaly detection methods. Results indicate that SSL methods are competitive on image anomaly detection tasks across datasets like CIFAR-10, CIFAR-100, and Fashion-MNIST. Among these, the network trained on rotation prediction consistently outperformed other approaches, demonstrating the utility of spatial features in detecting anomalies.
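The rotation prediction pretext task can be illustrated with a small sketch that builds the four rotated copies of an image together with their labels. The scoring function at the end is an assumption: this summary does not describe how the paper converts rotation predictions into an anomaly score, so `anomaly_score` shows only one plausible rule (low confidence on the correct rotation means more anomalous). All names are hypothetical, and images are plain nested lists.

```python
from typing import List, Sequence, Tuple

Image = List[List[int]]


def rotate90(image: Image) -> Image:
    """Rotate a 2D image 90 degrees counterclockwise."""
    return [list(row) for row in zip(*image)][::-1]


def rotation_pretext_samples(image: Image) -> List[Tuple[Image, int]]:
    """Return the image rotated by 0, 90, 180, and 270 degrees,
    each paired with its rotation label 0..3."""
    samples = []
    current = [list(row) for row in image]
    for label in range(4):
        samples.append((current, label))
        current = rotate90(current)
    return samples


def anomaly_score(probs_per_rotation: Sequence[Sequence[float]]) -> float:
    """Assumed scoring rule: one minus the mean probability assigned to
    the correct rotation. probs_per_rotation[k] is a classifier's
    4-way output for the copy rotated k times."""
    correct = [probs[k] for k, probs in enumerate(probs_per_rotation)]
    return 1.0 - sum(correct) / len(correct)


samples = rotation_pretext_samples([[1, 2], [3, 4]])
```

Under this assumed rule, a classifier trained only on normal images should predict rotations confidently for familiar inputs (score near 0) and less confidently for unfamiliar ones.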

Anomaly Detection in Videos

In the context of video data, the paper evaluates anomaly detection on datasets such as UCF101 and ILSVRC2015. The novel frame permutation task is benchmarked against existing self-supervised video representations, such as tracking and video colorization, and outperforms them. The findings underscore the importance of learning both spatial and temporal features, highlighting the inadequacy of methods that focus solely on temporal order.

Empirical Evaluation

The empirical evaluation identifies which configurations of the self-supervised tasks work best. Crucially, the permutation prediction task requires careful tuning of hyperparameters such as the number of frames per segment and the amount of frame skipping. Selecting frames with deliberate gaps between them yielded the highest anomaly detection accuracy, which suggests that the task must be made sufficiently hard for the network to learn useful features.
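The two hyperparameters mentioned above can be made concrete with a small sampling sketch. The parameter names `frames_per_segment` and `frame_skip`, and the choice to cut a video into non-overlapping segments, are assumptions for illustration; the summary does not give the paper's exact sampling scheme or values.

```python
from typing import List


def segment_indices(start: int, frames_per_segment: int, frame_skip: int) -> List[int]:
    """Indices of one segment: frames_per_segment frames, with
    frame_skip frames left out between consecutive selected frames."""
    if frames_per_segment < 1 or frame_skip < 0:
        raise ValueError("frames_per_segment must be >= 1 and frame_skip >= 0")
    stride = frame_skip + 1
    return [start + i * stride for i in range(frames_per_segment)]


def all_segments(num_frames: int, frames_per_segment: int, frame_skip: int) -> List[List[int]]:
    """Cut a video of num_frames frames into non-overlapping segments."""
    # Number of raw frames covered by one segment, first to last selected frame.
    span = (frames_per_segment - 1) * (frame_skip + 1) + 1
    return [
        segment_indices(start, frames_per_segment, frame_skip)
        for start in range(0, num_frames - span + 1, span)
    ]


# Example: a 20-frame clip, 4 frames per segment, skipping 2 frames between picks.
example_segments = all_segments(num_frames=20, frames_per_segment=4, frame_skip=2)
```

With `frame_skip=0` the selected frames are adjacent and nearly identical, so larger gaps give the network more visible motion to reason about when predicting order.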

Implications and Future Directions

The strong performance of self-supervised learning in both image and video anomaly detection suggests broader applicability in settings where a scarcity of labeled data impedes conventional supervised approaches. Future work could extend these methodologies to incorporate multi-modal data or explore hybrid models integrating weak supervision. Improvements in computational efficiency and real-time deployment in anomaly-critical domains like surveillance and autonomous vehicles could also be pursued.

Conclusion

The approach presented in "Self-Supervised Representation Learning for Visual Anomaly Detection" establishes a compelling framework for unsupervised anomaly detection built on self-supervised learning. The findings support the hypothesis that SSL-derived representations are not only viable but may outperform traditional approaches in detecting visual anomalies. The paper contributes to the ongoing discussion of SSL applications, particularly for scenarios where little annotated data is available.