Distilling Vision-Language Models on Millions of Videos (2401.06129v2)
Abstract: The recent advance in vision-language models is largely attributed to the abundance of image-text data. We aim to replicate this success for video-language models, but there simply is not enough human-curated video-text data available. We thus resort to fine-tuning a video-language model from a strong image-language baseline with synthesized instructional data. The resulting video model, obtained by video-instruction-tuning (VIIT), is then used to auto-label millions of videos and generate high-quality captions. We show that the adapted video-language model performs well on a wide range of video-language benchmarks; for instance, it surpasses the best prior result on open-ended NExT-QA by 2.8%. Moreover, our model generates detailed descriptions for previously unseen videos, which provide better textual supervision than existing methods. Experiments show that a video-language dual-encoder model contrastively trained on these auto-generated captions is 3.8% better than the strongest baseline that also leverages vision-language models. Our best model outperforms state-of-the-art methods on MSR-VTT zero-shot text-to-video retrieval by 6%. As a by-product, we generate the largest video caption dataset to date.
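The second stage the abstract describes, contrastively training a video-language dual encoder on auto-generated captions, is typically built on a symmetric InfoNCE objective (cf. the contrastive predictive coding entry in the references below). The following is a minimal sketch of that objective, not the paper's exact recipe: the random tensors stand in for hypothetical video and text towers, and the fixed temperature of 0.07 is a common initialization rather than a reported setting.

```python
# Minimal sketch of the symmetric InfoNCE loss for a video-text dual encoder
# trained on paired (video, auto-generated caption) batches. The embeddings
# here are random stand-ins for encoder outputs; the temperature is an
# assumed value, not the paper's configuration.
import torch
import torch.nn.functional as F

def infonce_loss(video_emb: torch.Tensor,
                 text_emb: torch.Tensor,
                 temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of matched video-text pairs."""
    # L2-normalize so dot products are cosine similarities.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                      # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)  # matched pairs on diagonal
    # Average the video-to-text and text-to-video cross-entropy terms.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

# Toy usage: 8 pairs of 512-d embeddings from stand-in encoders.
loss = infonce_loss(torch.randn(8, 512), torch.randn(8, 512))
print(f"contrastive loss: {loss.item():.4f}")
```

In practice the temperature is often a learned parameter and batches are gathered across devices, but the diagonal-target structure above is the essence of dual-encoder contrastive training.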
- Flamingo: A visual language model for few-shot learning. In NeurIPS, 2022.
- Frozen in time: A joint video and image encoder for end-to-end retrieval. In ICCV, 2021.
- MixMatch: A holistic approach to semi-supervised learning. In NeurIPS, 2019.
- Language models are few-shot learners. In NeurIPS, 2020.
- Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017.
- A short note about Kinetics-600. arXiv preprint arXiv:1808.01340, 2018.
- ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012, 2015.
- PaLI-X: On scaling up a multilingual vision and language model. arXiv preprint arXiv:2305.18565, 2023a.
- PaLI-3 vision language models: Smaller, faster, stronger. arXiv preprint arXiv:2310.09199, 2023b.
- PaLI: A jointly-scaled multilingual language-image model. In ICLR, 2023c.
- Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, 2023.
- EPIC-KITCHENS VISOR benchmark: Video segmentations and object relations. In NeurIPS D&B, 2022.
- Next-generation deep learning based on simulators and synthetic data. Trends in Cognitive Sciences, 2022.
- FlowNet: Learning optical flow with convolutional networks. In ICCV, 2015.
- An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
- Omni-sourced webly-supervised learning for video recognition. In ECCV, 2020.
- Oops! Predicting unintentional action in video. In CVPR, 2020.
- Large-scale weakly-supervised pre-training for video action recognition. In CVPR, 2019.
- Google. PaLM 2 technical report, 2023.
- Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR, 2017.
- Ego4D: Around the world in 3,000 hours of egocentric video. In CVPR, 2022.
- Temporal alignment networks for long-term video. In CVPR, 2022.
- Deep residual learning for image recognition. In CVPR, 2016.
- The curious case of neural text degeneration. In ICLR, 2020.
- Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021.
- Segment anything. In ICCV, 2023.
- Dense-captioning events in videos. In ICCV, 2017.
- SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In EMNLP, 2018.
- The power of scale for parameter-efficient prompt tuning. In EMNLP, 2021.
- Fast inference from transformers via speculative decoding. In ICML, 2023.
- BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023a.
- VideoChat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023b.
- DeCap: Decoding CLIP latents for zero-shot captioning via text-only training. In ICLR, 2023c.
- Microsoft COCO: Common objects in context. In ECCV, 2014.
- Visual instruction tuning. In NeurIPS, 2023.
- CLIP4Clip: An empirical study of CLIP for end-to-end video clip retrieval and captioning. Neurocomputing, 2022.
- Video-ChatGPT: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023.
- Generating training data with language models: Towards zero-shot language understanding. In NeurIPS, 2022.
- HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips. In ICCV, 2019.
- Spoken moments: Learning joint audio-visual representations from video descriptions. In CVPR, 2021.
- Learning audio-video modalities from image captions. In ECCV, 2022.
- Improving multimodal datasets with image captioning. In NeurIPS D&B, 2023.
- Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
- OpenAI. GPT-4V(ision) system card, 2023.
- Occluded video instance segmentation: A benchmark. IJCV, 2022.
- Language models are unsupervised multitask learners. OpenAI blog, 2019.
- Learning transferable visual models from natural language supervision. In ICML, 2021.
- Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 2020.
- Playing for data: Ground truth from computer games. In ECCV, 2016.
- LAION-5B: An open large-scale dataset for training next generation image-text models. In NeurIPS D&B, 2022.
- Neural machine translation of rare words with subword units. In ACL, 2016.
- Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, 2018.
- FixMatch: Simplifying semi-supervised learning with consistency and confidence. In NeurIPS, 2020.
- UL2: Unifying language learning paradigms. In ICLR, 2023.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- Connecting vision and language with video localized narratives. In CVPR, 2023.
- Unidentified video objects: A benchmark for dense, open-world segmentation. In ICCV, 2021.
- Image as a foreign language: BEiT pretraining for vision and vision-language tasks. In CVPR, 2023a.
- VATEX: A large-scale, high-quality multilingual dataset for video-and-language research. In ICCV, 2019.
- InternVideo: General video foundation models via generative and discriminative learning. arXiv preprint arXiv:2212.03191, 2022.
- InternVid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942, 2023b.
- Self-Instruct: Aligning language models with self-generated instructions. In ACL, 2023c.
- Finetuned language models are zero-shot learners. In ICLR, 2022.
- NExT-QA: Next phase of question-answering to explaining temporal actions. In CVPR, 2021.
- Video question answering via gradually refined attention over appearance and motion. In ACM MM, 2017.
- VideoCLIP: Contrastive pre-training for zero-shot video-text understanding. In EMNLP, 2021.
- MSR-VTT: A large video description dataset for bridging video and language. In CVPR, 2016.
- Video-text modeling with zero-shot transfer from contrastive captioners. arXiv preprint arXiv:2212.04979, 2022.
- Zero-shot video question answering via frozen bidirectional language models. In NeurIPS, 2022.
- AIM: Adapting image models for efficient video action recognition. In ICLR, 2023a.
- Mitigating spurious correlations in multi-modal models during fine-tuning. In ICML, 2023b.
- CoCa: Contrastive captioners are image-text foundation models. TMLR, 2022.
- ActivityNet-QA: A dataset for understanding complex web videos via question answering. In AAAI, 2019.
- Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021.
- MERLOT Reserve: Neural script knowledge through vision and language and sound. In CVPR, 2022.
- Scaling vision transformers. In CVPR, 2022a.
- LiT: Zero-shot transfer with locked-image text tuning. In CVPR, 2022b.
- Training a large video model on a single machine in a day. arXiv preprint arXiv:2309.16669, 2023.
- Learning video representations from large language models. In CVPR, 2023.
- Detecting twenty-thousand classes using image-level supervision. In ECCV, 2022.