Distilling Vision-Language Models on Millions of Videos (2401.06129v2)
Abstract: The recent advance in vision-language models is largely attributed to the abundance of image-text data. We aim to replicate this success for video-language models, but there simply is not enough human-curated video-text data available. We thus resort to fine-tuning a video-language model from a strong image-language baseline with synthesized instructional data. The resulting video-language model, produced by video instruction tuning (VIIT), is then used to auto-label millions of videos and generate high-quality captions. We show that the adapted video-language model performs well on a wide range of video-language benchmarks; for instance, it surpasses the best prior result on open-ended NExT-QA by 2.8%. In addition, our model generates detailed descriptions for previously unseen videos, which provide better textual supervision than existing methods. Experiments show that a video-language dual-encoder model contrastively trained on these auto-generated captions is 3.8% better than the strongest baseline that also leverages vision-language models. Our best model outperforms state-of-the-art methods on MSR-VTT zero-shot text-to-video retrieval by 6%. As a by-product, we generate the largest video caption dataset to date.
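The dual-encoder stage described above pairs each unlabeled video with the caption produced by the adapted video-language model and trains the two encoders contrastively. The sketch below illustrates a standard symmetric InfoNCE objective for such video-text pairs; it is a minimal illustration under stated assumptions (the encoder outputs, the temperature value, and the `contrastive_loss` helper are hypothetical names of ours), not the paper's actual implementation.

```python
# Minimal sketch of a symmetric video<->text contrastive (InfoNCE) loss,
# as commonly used to train dual-encoder retrieval models on paired
# (video, caption) data. All names here are illustrative assumptions.
import torch
import torch.nn.functional as F

def contrastive_loss(video_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired video/text embeddings."""
    # L2-normalize so the dot products below are cosine similarities.
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Pairwise similarity logits between every video and every caption.
    logits = video_emb @ text_emb.t() / temperature          # shape (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    # (video_i, caption_i) on the diagonal are positives; the rest negatives.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2t + loss_t2v)

if __name__ == "__main__":
    # Dummy usage: a batch of 8 pooled video features and 8 caption features.
    v = torch.randn(8, 512)   # from a video encoder
    t = torch.randn(8, 512)   # from a text encoder over auto-generated captions
    print(contrastive_loss(v, t).item())
```

With auto-generated captions standing in for human annotations, this loss is what lets the dual encoder scale to millions of videos without curated video-text pairs.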