FROSTER: Frozen CLIP Is A Strong Teacher for Open-Vocabulary Action Recognition (2402.03241v1)

Published 5 Feb 2024 in cs.CV and cs.LG

Abstract: In this paper, we introduce FROSTER, an effective framework for open-vocabulary action recognition. The CLIP model has achieved remarkable success in a range of image-based tasks, benefiting from its strong generalization capability stemming from pretraining on massive image-text pairs. However, applying CLIP directly to the open-vocabulary action recognition task is challenging due to the absence of temporal information in CLIP's pretraining. Further, fine-tuning CLIP on action recognition datasets may lead to overfitting and hinder its generalizability, resulting in unsatisfactory results when dealing with unseen actions. To address these issues, FROSTER employs a residual feature distillation approach to ensure that CLIP retains its generalization capability while effectively adapting to the action recognition task. Specifically, the residual feature distillation treats the frozen CLIP model as a teacher to maintain the generalizability exhibited by the original CLIP and supervises the feature learning for the extraction of video-specific features to bridge the gap between images and videos. Meanwhile, it uses a residual sub-network for feature distillation to reach a balance between the two distinct objectives of learning generalizable and video-specific features. We extensively evaluate FROSTER on open-vocabulary action recognition benchmarks under both base-to-novel and cross-dataset settings. FROSTER consistently achieves state-of-the-art performance on all datasets across the board. Project page: https://visual-ai.github.io/froster.

Summary

The paper presents a method that uses residual feature distillation from a frozen CLIP model to extract task-specific video features while preserving generalizability.
The approach achieves state-of-the-art results on benchmarks like Kinetics-400, UCF-101, and HMDB-51 through rigorous base-to-novel and cross-dataset evaluations.
The framework works with a range of model architectures, including adapter-based ones, and its low computational cost makes real-time deployment plausible.

An Overview of FROSTER: Leveraging CLIP for Open-Vocabulary Action Recognition

The paper "FROSTER: Frozen CLIP is a Strong Teacher for Open-Vocabulary Action Recognition" presents an approach to open-vocabulary action recognition that builds on the strengths of the CLIP model. Despite CLIP's success in various vision-language tasks, applying it directly to action recognition is non-trivial: its pretraining contains no temporal information, and fine-tuning it on action datasets risks overfitting to the seen classes. FROSTER addresses both problems with a residual feature distillation technique that preserves CLIP's generalization ability while adapting the model to video-specific tasks.
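To make the temporal limitation concrete, here is a minimal sketch of how an image-text model can be applied to video without any temporal modeling: per-frame features are averaged and compared with text embeddings of the class names by cosine similarity. This is an illustration under stated assumptions, not FROSTER's pipeline. The feature vectors are made-up toy numbers standing in for real CLIP outputs, and the helper names (`mean_pool`, `classify`) are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def mean_pool(frame_features):
    """Average per-frame features into a single video feature."""
    count = len(frame_features)
    return [sum(column) / count for column in zip(*frame_features)]


def classify(frame_features, text_embeddings):
    """Pick the class whose text embedding is most similar to the video feature."""
    video = normalize(mean_pool(frame_features))
    scores = {
        label: sum(a * b for a, b in zip(video, normalize(embedding)))
        for label, embedding in text_embeddings.items()
    }
    return max(scores, key=scores.get), scores


# Toy stand-ins for text embeddings of class names (assumed values).
text_embeddings = {
    "opening a door": [1.0, 0.0, 0.2],
    "closing a door": [0.9, 0.1, 0.3],
}

# Toy stand-ins for per-frame image features of one clip (assumed values).
frames = [
    [1.0, 0.0, 0.0],
    [0.8, 0.1, 0.3],
    [0.6, 0.2, 0.5],
]

label_forward, _ = classify(frames, text_embeddings)
label_reversed, _ = classify(frames[::-1], text_embeddings)
print(label_forward, "|", label_reversed)
```

Because mean pooling ignores frame order, the clip and its time-reversed copy receive the same prediction, which is one way to see why actions that differ only in temporal direction are hard for a purely image-based model. Since classes are defined by text, new action names can be added at test time simply by embedding them.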

Key Contributions

Residual Feature Distillation: The paper treats the pre-trained, frozen CLIP model as a teacher. Residual feature distillation lets the fine-tuned model retain CLIP's generalizability while still extracting video-specific features. A residual sub-network carries out the distillation, so the learned features can be pulled toward the teacher's without being forced to match them exactly.
Experimental Rigor: FROSTER is evaluated on standard open-vocabulary action recognition benchmarks under both base-to-novel and cross-dataset settings. The framework consistently outperforms prior methods across all datasets, substantiating its effectiveness.
Compatibility with Different Model Architectures: Unlike methods tied to a particular network architecture, FROSTER's distillation process can be applied to a variety of architectures, including adapter-based methods such as AdaptFormer and AIM, making it a versatile solution.
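The residual feature distillation idea can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes the residual sub-network is a two-layer MLP with a GELU activation, that its output is scaled by a balancing coefficient `alpha` and added to the student feature, and that the result is compared with the frozen teacher feature using mean squared error. The names `w1`, `w2`, `alpha`, the zero initialization of `w2`, and the toy feature values are all assumptions made for the example.

```python
import math
import random


def gelu(value):
    """Exact GELU activation for a scalar."""
    return 0.5 * value * (1.0 + math.erf(value / math.sqrt(2.0)))


def matvec(weights, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def residual_transform(student_feat, w1, w2, alpha):
    """Return student_feat + alpha * MLP(student_feat)."""
    hidden = [gelu(h) for h in matvec(w1, student_feat)]
    delta = matvec(w2, hidden)
    return [s + alpha * d for s, d in zip(student_feat, delta)]


def distillation_loss(student_feat, teacher_feat, w1, w2, alpha=0.1):
    """MSE between the transformed student feature and the frozen teacher feature."""
    transformed = residual_transform(student_feat, w1, w2, alpha)
    squared = [(a - b) ** 2 for a, b in zip(transformed, teacher_feat)]
    return sum(squared) / len(squared)


random.seed(0)
dim = 4
# Assumed initialization: small random first layer, zero second layer,
# so the residual branch starts out as an identity mapping.
w1 = [[random.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(dim)]
w2 = [[0.0] * dim for _ in range(dim)]

teacher_feat = [0.5, -0.2, 0.1, 0.7]  # toy feature from the frozen teacher
student_feat = [0.4, -0.1, 0.3, 0.6]  # toy feature from the tuned student

loss = distillation_loss(student_feat, teacher_feat, w1, w2)
print(f"distillation loss: {loss:.4f}")
```

In a training setup, only the student and the residual sub-network would be updated while the teacher stays frozen. The residual branch is what gives the student some slack: the loss constrains the transformed feature rather than the raw student feature, so the student can keep video-specific information that the teacher does not encode.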

Results and Implications

FROSTER achieves state-of-the-art results in open-vocabulary tasks on multiple datasets, such as Kinetics-400, UCF-101, and HMDB-51, demonstrating enhanced recognition capabilities across both seen and unseen action categories. Notably, the ensemble models show significant performance improvements, evidence of the preserved and enhanced generalizability when using FROSTER.

The work has both theoretical and practical implications. Theoretically, it suggests an adaptable way to balance task-specific learning against pretrained generalizability, which matters when models are deployed in diverse and changing real-world scenarios. Practically, the reduced computational cost with minimal impact on performance suggests that the approach is feasible for real-time applications.

Future Directions

FROSTER opens avenues for further research, including exploring the integration of other foundational models as teachers and investigating the application potential across broader video understanding tasks beyond action recognition. Additionally, the balance between maintaining generalizability and learning video-specific nuances could be studied further, potentially leading to even more efficient distillation techniques.

In summary, FROSTER offers a compelling approach to leveraging the CLIP model's strengths for open-vocabulary action recognition, demonstrating significant improvements in performance while maintaining computational efficiency. This work lays a foundational framework that could be influential in the ongoing development of robust, adaptable AI systems for complex vision-language tasks.
