MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge

Published 15 Mar 2023 in cs.CV | (2303.08914v2)

Abstract: Large scale Vision-Language (VL) models have shown tremendous success in aligning representations between visual and text modalities. This enables remarkable progress in zero-shot recognition, image generation & editing, and many other exciting tasks. However, VL models tend to over-represent objects while paying much less attention to verbs, and require additional tuning on video data for best zero-shot action recognition performance. While previous work relied on large-scale, fully-annotated data, in this work we propose an unsupervised approach. We adapt a VL model for zero-shot and few-shot action recognition using a collection of unlabeled videos and an unpaired action dictionary. Based on that, we leverage LLMs and VL models to build a text bag for each unlabeled video via matching, text expansion and captioning. We use those bags in a Multiple Instance Learning setup to adapt an image-text backbone to video data. Although finetuned on unlabeled video data, our resulting models demonstrate high transferability to numerous unseen zero-shot downstream tasks, improving the base VL model performance by up to 14%, and even comparing favorably to fully-supervised baselines in both zero-shot and few-shot video recognition transfer. The code will be released later at https://github.com/wlin-at/MAXI.




Summary

The paper presents MAXI, an unsupervised finetuning strategy that improves a base VL model's zero-shot action recognition by up to 14% on benchmark datasets.
It uses large language models to expand action texts drawn from an unpaired action dictionary and matched to unlabeled videos, removing the need for manual video labels.
The method finetunes the model in a Multiple Instance Learning setup on the resulting text bags, reaching performance that compares favorably with fully supervised approaches.

An Expert Analysis of "MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge"

The paper "MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge" introduces an unsupervised approach that improves the zero-shot and few-shot action recognition of vision-language (VL) models by leveraging large language models (LLMs), unlabeled videos, and an unpaired action dictionary. It departs from prior methods that rely on large, fully annotated datasets and instead finetunes without any manual action labels.

Overview of Methodology

The authors propose a strategy named "MAtch, eXpand and Improve" (MAXI), which builds on existing VL models like CLIP by incorporating unlabeled video data and an unpaired action dictionary to refine action recognition without requiring explicit action labels. The method progresses through three main stages:

Match: Each unlabeled video is first paired with candidate action descriptions from a predefined action dictionary, using the zero-shot matching ability of the VL model. This step grounds the pipeline in the model's existing ability to associate text with visual input.
Expand: The matched action text is expanded with an LLM such as GPT-3, which generates verb-rich descriptions that capture meanings and associations beyond the original dictionary entry. In parallel, video frames are translated into text by a captioning model such as BLIP, which diversifies and enlarges the set of texts associated with each video.
Improve: The resulting "bags" of texts are used to finetune the VL model in a Multiple Instance Learning setup. Because the bags are built without supervision, some of their texts may not describe the video; the MIL objective is designed to tolerate this noise while the model learns general, transferable concepts.
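The three stages can be illustrated with a minimal sketch in plain Python. Everything here is an assumption made for illustration: the toy three-dimensional embeddings stand in for real VL model features, the expansion and caption strings stand in for LLM and captioner outputs, and the loss is a generic MIL-NCE-style contrastive objective, which may differ from the exact loss used in the paper. The function names are hypothetical and are not taken from the MAXI code base.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def match_top_k(video_emb, dictionary_embs, k=1):
    """Match: rank dictionary entries by similarity to the video embedding."""
    ranked = sorted(
        dictionary_embs,
        key=lambda name: cosine(video_emb, dictionary_embs[name]),
        reverse=True,
    )
    return ranked[:k]

def build_text_bag(matched, expansions, captions):
    """Expand: merge matched texts, LLM expansions and frame captions,
    dropping duplicates while keeping the original order."""
    bag = []
    for text in [*matched, *expansions, *captions]:
        if text not in bag:
            bag.append(text)
    return bag

def mil_nce_loss(video_emb, bag_embs, negative_embs, temperature=0.07):
    """Improve: MIL-NCE-style loss (an assumed form). The whole bag counts
    as the positive, so a single noisy text cannot dominate the objective."""
    pos = sum(math.exp(cosine(video_emb, t) / temperature) for t in bag_embs)
    neg = sum(math.exp(cosine(video_emb, t) / temperature) for t in negative_embs)
    return -math.log(pos / (pos + neg))

# Toy data (hypothetical embeddings, not real model outputs).
video = [1.0, 0.2, 0.0]
dictionary = {
    "playing guitar": [0.9, 0.3, 0.1],
    "swimming": [0.0, 0.1, 1.0],
    "running": [0.1, 1.0, 0.0],
}

matched = match_top_k(video, dictionary, k=1)
bag = build_text_bag(
    matched,
    expansions=["strumming the strings of a guitar"],  # stand-in for LLM output
    captions=["a man holding a guitar", "playing guitar"],  # stand-in for captions
)

toy_text_embs = {
    "playing guitar": [0.9, 0.3, 0.1],
    "strumming the strings of a guitar": [1.0, 0.1, 0.0],
    "a man holding a guitar": [0.7, 0.2, 0.4],
}
bag_embs = [toy_text_embs[t] for t in bag]
negatives = [dictionary["swimming"], dictionary["running"]]

loss_good = mil_nce_loss(video, bag_embs, negatives)
loss_bad = mil_nce_loss(
    video,
    [dictionary["swimming"]],
    [dictionary["playing guitar"], dictionary["running"]],
)
print(matched, bag, loss_good, loss_bad)
```

In this toy setup the loss is lower when the bag contains texts that agree with the video than when it contains an unrelated text, which is the signal a finetuning loop would minimize. A real pipeline would replace the toy vectors with encoder outputs and draw negatives from the other videos in a batch.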

Key Findings and Results

Experiments on well-known benchmarks, including UCF101 and HMDB51, show significant gains, with up to a 14% improvement over the base VL model in zero-shot tasks. Notably, even without supervised labels, MAXI can sometimes surpass existing supervised methods, which underlines the effectiveness of combining unlabeled video with language expansion for action recognition.

The study argues that foundation VL models tend to over-represent objects while paying much less attention to verbs, so they often underperform on action recognition, which depends on recognizing verbs. Adapting these models with unsupervised finetuning that emphasizes verb phrases and actions substantially reduces this shortcoming.

Implications and Future Directions

The implications of this research are considerable. It points toward less annotation-intensive practice, in which the burden of collecting task-specific labels is eased by unsupervised learning built on LLMs and existing vision models. This shift could broaden access to capable recognition systems in domains where annotated data is scarce or impractically costly to acquire.

Looking forward, this approach opens new avenues for more nuanced uses of LLMs and suggests potential integrations with cross-modal architectures that could further refine action recognition where labeled data is not abundant. Future work may explore which combinations of LLMs and captioning models work best, or extend the approach to other domains that require classifying dynamic content.

Overall, this paper enriches the ongoing conversation about unsupervised learning in AI, presenting a compelling case for combining LLMs with unlabeled video data to tackle tasks that have traditionally depended on extensive supervision.
