Frozen CLIP Models are Efficient Video Learners

Published 6 Aug 2022 in cs.CV | (2208.03550v1)

Abstract: Video recognition has been dominated by the end-to-end learning paradigm -- first initializing a video recognition model with weights of a pretrained image model and then conducting end-to-end training on videos. This enables the video network to benefit from the pretrained image model. However, this requires substantial computation and memory resources for finetuning on videos and the alternative of directly using pretrained image features without finetuning the image backbone leads to subpar results. Fortunately, recent advances in Contrastive Vision-Language Pre-training (CLIP) pave the way for a new route for visual recognition tasks. Pretrained on large open-vocabulary image-text pair data, these models learn powerful visual representations with rich semantics. In this paper, we present Efficient Video Learning (EVL) -- an efficient framework for directly training high-quality video recognition models with frozen CLIP features. Specifically, we employ a lightweight Transformer decoder and learn a query token to dynamically collect frame-level spatial features from the CLIP image encoder. Furthermore, we adopt a local temporal module in each decoder layer to discover temporal clues from adjacent frames and their attention maps. We show that despite being efficient to train with a frozen backbone, our models learn high quality video representations on a variety of video recognition datasets. Code is available at https://github.com/OpenGVLab/efficient-video-recognition.


Citations (167)


Summary

The paper demonstrates that freezing CLIP image encoders combined with lightweight temporal modules can achieve high video recognition accuracy while significantly cutting compute time.
The EVL framework employs a Transformer decoder and local temporal module to effectively integrate spatial and temporal features from video frames.
It achieves competitive results, including an 82.9% top-1 accuracy on Kinetics-400, with substantially lower GPU-hour requirements compared to traditional fine-tuning methods.

An Analysis of Video Recognition with Frozen CLIP Models

The paper "Frozen CLIP Models are Efficient Video Learners" addresses the cost of the standard recipe for video recognition, in which a video model is initialized from a pretrained image model and then trained end-to-end on video data. The authors propose to use the visual features of Contrastive Vision-Language Pre-training (CLIP) models directly, which reduces the computational demand of video learning while maintaining high accuracy.

Framework Overview

The presented method, Efficient Video Learning (EVL), uses frozen CLIP features instead of the computationally intensive practice of fine-tuning a pretrained image backbone. EVL employs a lightweight Transformer decoder with a learned query token that collects frame-level spatial features from the CLIP image encoder. In addition, each decoder layer includes a local temporal module that extracts temporal clues from adjacent frames and their attention maps. Together, these two components handle the spatial and temporal aspects of video recognition.
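The query-token mechanism described above can be sketched as a single cross-attention step. This is a minimal single-head NumPy illustration, not the authors' implementation: the function name `query_cross_attention`, the projection matrices, the random stand-in features, and the shapes (8 frames, 196 patch tokens, 64 channels) are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def query_cross_attention(query, frame_feats, w_q, w_k, w_v):
    """One learned query attends over frozen per-frame features.

    query:       (D,)      learned query token (trainable in a real model)
    frame_feats: (T, N, D) frozen backbone features: T frames, N tokens each
    Returns the pooled video feature (D,) and the attention weights (T*N,).
    """
    tokens = frame_feats.reshape(-1, frame_feats.shape[-1])  # (T*N, D)
    q = query @ w_q                                          # (D,)
    k = tokens @ w_k                                         # (T*N, D)
    v = tokens @ w_v                                         # (T*N, D)
    scores = k @ q / np.sqrt(q.shape[0])                     # (T*N,)
    attn = softmax(scores)
    return attn @ v, attn

# Hypothetical shapes; random values stand in for frozen CLIP features.
rng = np.random.default_rng(0)
T, N, D = 8, 196, 64
frame_feats = rng.standard_normal((T, N, D))
query = rng.standard_normal(D)
w_q, w_k, w_v = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))

video_feat, attn = query_cross_attention(query, frame_feats, w_q, w_k, w_v)
```

In a trained model only the query and the projection matrices would receive gradient updates, while `frame_feats` would come from the frozen image encoder.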

Results and Performance

The paper reports results on several datasets, most notably Kinetics-400 and Something-Something-v2. On Kinetics-400, an EVL configuration built on the CLIP ViT-B/16 backbone achieves a top-1 accuracy of 82.9% with only 60 GPU-hours of training, substantially less than end-to-end fine-tuning requires. This efficiency comes from keeping the CLIP image encoder frozen: only the lightweight decoder is trained, so no gradients need to be backpropagated through the backbone. Comparative experiments across settings show that EVL is efficient in both training and inference without major accuracy trade-offs.

The paper compares EVL with a range of state-of-the-art methods and highlights how lightweight it is relative to models that combine multiple streams or pretext tasks, such as 3D CNN architectures and Transformers trained with full fine-tuning. On both datasets, the ablations show that using multiple intermediate features and adding the temporal modules each yield incremental gains, which supports EVL as an efficient and effective alternative to traditional methods.
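One common way to mix information from adjacent frames is a depthwise temporal convolution added back to the features as a residual. The sketch below illustrates that idea under stated assumptions; it is not taken from the paper's code, and the names `temporal_depthwise_conv` and `add_temporal_info`, the kernel size of 3, and the zero padding are all hypothetical choices.

```python
import numpy as np

def temporal_depthwise_conv(feats, kernel):
    """Depthwise 1D convolution along the time axis.

    feats:  (T, N, D) per-frame features (T frames, N tokens, D channels)
    kernel: (K, D)    one K-tap filter per channel, K odd
    Zero padding keeps the output at T frames.
    """
    K = kernel.shape[0]
    pad = K // 2
    padded = np.pad(feats, ((pad, pad), (0, 0), (0, 0)))
    out = np.zeros_like(feats)
    for k in range(K):
        # Tap k reads the frame at temporal offset (k - pad).
        out += padded[k:k + feats.shape[0]] * kernel[k]
    return out

def add_temporal_info(feats, kernel):
    # Residual connection: original features plus local temporal context.
    return feats + temporal_depthwise_conv(feats, kernel)

# Hypothetical shapes: 8 frames, 4 tokens per frame, 3 channels.
rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 4, 3))

# A kernel with weight only on the first tap copies the previous frame.
shift_kernel = np.zeros((3, 3))
shift_kernel[0] = 1.0
shifted = temporal_depthwise_conv(feats, shift_kernel)
mixed = add_temporal_info(feats, shift_kernel)
```

With `shift_kernel`, each output frame equals the previous input frame and the first frame is zero, which makes it easy to check that the convolution really moves information between neighbouring frames.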

Theoretical and Practical Implications

From a theoretical perspective, the paper demonstrates that robust video recognition models can be built by reusing pretrained multimodal features efficiently. By decoupling video representation learning from computationally expensive fine-tuning, it positions CLIP not only as a strong model for image-text tasks but also as a versatile backbone for a wider range of applications.

Practically, EVL presents significant implications for resource-limited institutions and real-world applications, allowing for the deployment of high-performing video recognition models without extraneous computational overhead. This shift could democratize access to advanced video analytics, enabling broad communities of practitioners to harness the power of CLIP models.

Speculation and Future Trajectory

The study raises the question of whether the EVL approach can be extended to other modalities and to tasks beyond video recognition. Building on CLIP's vision-language pre-training might inform later methods in domains such as robotics, human-computer interaction, and language-driven video editing. As larger, more semantically diverse datasets become available and training paradigms evolve, this approach may be refined further, with large pretrained multimodal models serving as foundational layers in cross-domain architectures.

In summary, the paper presents a well-supported, forward-looking approach to video recognition that balances efficiency with performance and shows the value of reusing existing large-scale pretrained models as computational constraints evolve.
