InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

Published 13 Jul 2023 in cs.CV | (2307.06942v2)

Abstract: This paper introduces InternVid, a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations for multimodal understanding and generation. The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words. Our core contribution is to develop a scalable approach to autonomously build a high-quality video-text dataset with large language models (LLMs), thereby showcasing its efficacy in learning video-language representation at scale. Specifically, we utilize a multi-scale approach to generate video-related descriptions. Furthermore, we introduce ViCLIP, a video-text representation learning model based on ViT-L. Learned on InternVid via contrastive learning, this model demonstrates leading zero-shot action recognition and competitive video retrieval performance. Beyond basic video understanding tasks like recognition and retrieval, our dataset and model have broad applications. They are particularly beneficial for generating interleaved video-text data for learning a video-centric dialogue system, advancing video-to-text and text-to-video generation research. These proposed resources provide a tool for researchers and practitioners interested in multimodal video understanding and generation.

Authors (16)

Citations (162)

Summary

The paper introduces InternVid, a large-scale video-text dataset of over 7 million videos and 234 million captioned clips, built to improve multimodal model training.
It uses an LLM-powered annotation method to produce well-aligned video-text pairs at scale, addressing limitations of previous datasets.
With the ViCLIP model, the paper demonstrates leading zero-shot action recognition and competitive video retrieval, and points to applications in video-centric dialogue systems.

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

The paper presents InternVid, a video-centric multimodal dataset designed to support the development of robust video-text representation models. As demand for models that jointly process video and natural language has grown, so has the need for large-scale, high-quality datasets to train them. InternVid addresses this gap with over 7 million videos, split into around 234 million video clips, each annotated with a textual description generated primarily by LLMs.
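The scale figures become easier to picture as per-clip averages. The sketch below is a back-of-the-envelope calculation using only the rounded totals quoted in the abstract (760K hours, 234M clips, 4.1B words); the function names are illustrative, and it assumes the clips account for all of the quoted footage, so the results are approximations rather than statistics reported by the authors.

```python
# Rounded totals as quoted in the abstract.
TOTAL_HOURS = 760_000          # "nearly 760K hours" of video
TOTAL_CLIPS = 234_000_000      # "234M video clips"
TOTAL_WORDS = 4_100_000_000    # "4.1B words" of descriptions


def average_clip_seconds(total_hours, total_clips):
    """Mean clip length in seconds, assuming clips cover all the footage."""
    return total_hours * 3600 / total_clips


def average_words_per_clip(total_words, total_clips):
    """Mean description length in words per clip."""
    return total_words / total_clips


if __name__ == "__main__":
    secs = average_clip_seconds(TOTAL_HOURS, TOTAL_CLIPS)
    words = average_words_per_clip(TOTAL_WORDS, TOTAL_CLIPS)
    print(f"average clip length: {secs:.1f} s")
    print(f"average description length: {words:.1f} words")
```

Under these assumptions, a typical clip runs on the order of ten seconds and carries a description of a couple of dozen words at most.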

Key Contributions

Dataset Composition and Scale: InternVid stands out for its scale and its detailed textual descriptions, which total 4.1 billion words across varied content types. Previous datasets fell short either in scale, as with WebVid10M, or in the quality of video-text alignment, as with ASR-captioned datasets such as HowTo100M. InternVid targets both problems.
Annotation Methodology: The dataset is built with a multi-scale, LLM-driven approach that generates video descriptions automatically, aiming for high-quality video-text alignment at scale. This matters because the ASR-generated text used in many existing datasets is often only loosely related to what appears on screen.
Introduction of the ViCLIP Model: The paper introduces ViCLIP, a video-text representation learning model built on the ViT-L Vision Transformer. Trained with contrastive learning on InternVid, it achieves leading zero-shot action recognition and competitive video retrieval performance.
Practical Applications: Beyond standard tasks like video retrieval and recognition, InternVid and ViCLIP can be used to generate interleaved video-text data for training video-centric dialogue systems, and to advance video-to-text and text-to-video generation research.
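The contrastive learning mentioned above can be illustrated with a minimal sketch. The code below implements a generic symmetric InfoNCE-style loss of the kind used by CLIP-like models, in NumPy; it is not taken from the ViCLIP code, and the function names, the temperature value of 0.07, and the toy embeddings are all assumptions made for illustration. Matching video-text pairs sit on the diagonal of the similarity matrix, and the loss pulls them together while pushing mismatched pairs apart.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Generic CLIP-style loss; row i of each input is assumed to be a matched pair."""
    v = l2_normalize(np.asarray(video_emb, dtype=np.float64))
    t = l2_normalize(np.asarray(text_emb, dtype=np.float64))
    logits = v @ t.T / temperature           # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])         # positives lie on the diagonal
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # video -> text
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> video
    return 0.5 * (loss_v2t + loss_t2v)


if __name__ == "__main__":
    emb = np.eye(4)                          # toy embeddings, one per pair
    aligned = symmetric_contrastive_loss(emb, emb)
    shuffled = symmetric_contrastive_loss(emb, np.roll(emb, 1, axis=0))
    print(f"aligned pairs:  {aligned:.4f}")
    print(f"shuffled pairs: {shuffled:.4f}")
```

In the toy run, correctly paired embeddings give a loss near zero, while shuffling the text rows so that no pair matches gives a much larger loss. A real training setup would compute the embeddings with video and text encoders and backpropagate through this loss in a deep learning framework.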

Numerical Outcomes and Performance

Trained on InternVid, ViCLIP scores 75.7%, 73.5%, and 66.4% on zero-shot action recognition on Kinetics-400, Kinetics-600, and Kinetics-700 (K400, K600, and K700), respectively. The paper reports these figures as the average of top-1 and top-5 accuracy rather than as top-1 accuracy alone. The results indicate stronger generalization than other video CLIP variants on these benchmarks.
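For readers unfamiliar with how zero-shot recognition is scored, the following is a minimal sketch, not the paper's evaluation code. It assumes each video has already been compared against a text embedding of every class name, giving a similarity matrix; the function names and the toy numbers are illustrative. A prediction counts as a top-k hit when the true class is among the k most similar class names, and some benchmarks report the mean of two such scores.

```python
import numpy as np


def topk_accuracy(similarity, labels, k):
    """Fraction of videos whose true class is among the k highest-scoring classes.

    similarity: (num_videos, num_classes) video-to-class-text scores.
    labels:     (num_videos,) integer index of the true class per video.
    """
    similarity = np.asarray(similarity)
    labels = np.asarray(labels)
    topk = np.argsort(-similarity, axis=1)[:, :k]   # best k classes per video
    hits = (topk == labels[:, None]).any(axis=1)
    return float(hits.mean())


def mean_of_two_topk(similarity, labels, k_small, k_large):
    """Average of two top-k accuracies (for example top-1 and top-5)."""
    return 0.5 * (topk_accuracy(similarity, labels, k_small)
                  + topk_accuracy(similarity, labels, k_large))


if __name__ == "__main__":
    # Toy scores for 4 videos over 3 classes (made-up numbers).
    sim = [[0.9, 0.1, 0.0],
           [0.2, 0.5, 0.3],
           [0.6, 0.3, 0.1],
           [0.1, 0.2, 0.7]]
    labels = [0, 2, 1, 2]
    print("top-1:", topk_accuracy(sim, labels, 1))       # 0.5
    print("top-2:", topk_accuracy(sim, labels, 2))       # 1.0
    print("mean :", mean_of_two_topk(sim, labels, 1, 2)) # 0.75
```

The toy example uses top-2 instead of top-5 only because it has three classes; with the Kinetics label sets the same function would be called with k=1 and k=5.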

Implications and Future Directions

The implications of InternVid extend beyond academic research to applied domains such as human-computer interaction, autonomous driving, and intelligent surveillance, where video understanding has clear practical value. The dataset also supports work on multimodal dialogue systems, which must understand and generate content that spans video and text.

Moreover, InternVid's construction points to a direction for future work: systems that generate plausible multimodal narratives. Its pairing of visual data with language offers a template for future datasets aimed at more intuitive and context-aware models.

In conclusion, InternVid is a significant resource for the AI research community, showing how large-scale data and suitable learning models together can advance video-text comprehension and generation.