Abstract

In light of recent advances in multimodal LLMs, there is increasing attention to scaling them from image-text data to more informative real-world videos. Compared to static images, video poses unique challenges for effective large-scale pre-training because its spatiotemporal dynamics must be modeled. In this paper, we address these challenges in video-language pre-training with an efficient video decomposition that represents each video as keyframes and temporal motions. These are then adapted to an LLM using well-designed tokenizers that discretize visual and temporal information into a few tokens, thus enabling unified generative pre-training of videos, images, and text. At inference, the tokens generated by the LLM are carefully mapped back to the original continuous pixel space to create various video content. Our proposed framework is capable of both comprehending and generating image and video content, as demonstrated by its competitive performance across 13 multimodal benchmarks in image and video understanding and generation. Our code and models will be available at https://video-lavit.github.io.

The Video-LaVIT approach decomposes videos into keyframes and temporal motions for unified generative pre-training with text, enabling multimodal sequence generation.

Overview

  • Introduces Video-LaVIT, a new framework for video-language pre-training that integrates spatiotemporal dynamics into LLMs.

  • Presents efficient video representation by disentangling videos into keyframes and temporal motions with significant token savings.

  • Enables generative pre-training for videos using an autoregressive model that handles multimodal content including images, video, and text.

  • Reports competitive and, in many cases, state-of-the-art performance on 13 multimodal benchmarks, with notable results in zero-shot video question answering and text-to-video generation.

Introduction

The emergence of multimodal LLMs has accelerated progress towards AI systems that can understand and generate content across text and visual domains. Existing multimodal LLMs have been highly effective at image understanding; however, adapting them to video, a dynamic medium with a significant temporal dimension, presents additional challenges. We propose Video-LaVIT (Language-VIsion Transformer), which advances video-language pre-training by efficiently integrating spatiotemporal dynamics into LLMs.

Architecture

Video-LaVIT introduces an efficient video representation that disentangles videos into keyframes and temporal motions, each tokenized into a small number of tokens while preserving the essential visual and motion content. This design exploits the inherent redundancy in video data: keyframes carry the main visual semantics, while the temporal motions compactly capture how the scene evolves. Specifically, keyframes are handled by a visual tokenizer that reuses knowledge from a pre-trained image LLM, whereas the temporal motions are quantized by a novel spatiotemporal motion encoder, yielding substantial token savings (over 90% of tokens saved for a 2.2 s clip). Video-LaVIT comprises two main components: a tokenizer that converts the video modality into discrete tokens and a detokenizer that efficiently reconstructs the original video pixels.
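To make the decomposition concrete, below is a minimal NumPy sketch of the general idea: a clip is reduced to one keyframe plus coarse block-wise motion vectors, mimicking the motion information a video codec already stores. The block size, search window, and function name are illustrative assumptions, not the paper's actual tokenizer.

```python
import numpy as np

def decompose_clip(frames: np.ndarray, block: int = 16):
    """Split a clip into one keyframe plus coarse block-wise motion vectors.

    frames: (T, H, W, 3) uint8 array; the first frame is treated as the keyframe.
    Returns the keyframe and a (T-1, H//block, W//block, 2) int8 array of (dy, dx)
    offsets found by a small exhaustive search, standing in for the motion vectors
    that a compressed video stream already provides.
    """
    key = frames[0]
    T, H, W, _ = frames.shape
    gh, gw = H // block, W // block
    motions = np.zeros((T - 1, gh, gw, 2), dtype=np.int8)
    prev = key.astype(np.int32)
    for t in range(1, T):
        cur = frames[t].astype(np.int32)
        for i in range(gh):
            for j in range(gw):
                ref = cur[i * block:(i + 1) * block, j * block:(j + 1) * block]
                best, best_err = (0, 0), np.inf
                for dy in range(-4, 5, 2):      # tiny search window for brevity
                    for dx in range(-4, 5, 2):
                        y, x = i * block + dy, j * block + dx
                        if 0 <= y <= H - block and 0 <= x <= W - block:
                            err = np.abs(ref - prev[y:y + block, x:x + block]).mean()
                            if err < best_err:
                                best, best_err = (dy, dx), err
                motions[t - 1, i, j] = best
        prev = cur
    return key, motions
```

For example, calling `decompose_clip(np.zeros((8, 64, 64, 3), dtype=np.uint8))` returns one 64x64 keyframe and a (7, 4, 4, 2) motion grid; in the actual framework, learned tokenizers would then discretize both parts into a short token sequence.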

Unified Generative Modeling

The framework delivers unified generative pre-training through an autoregressive model that extends beyond the image modality to video. It ingests paired keyframe-motion token sequences alongside images and text, and is optimized under a single next-token prediction objective. The result is a system that internalizes the sequential relationships between video clips, enhancing the model's capacity to understand and generate long sequences of multimodal content.
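The unified objective itself is ordinary next-token prediction over one interleaved token stream. The sketch below shows the general mechanics under stated assumptions: the boundary tokens (BOI/EOI, BOM/EOM), the token ids, and the vocabulary size are hypothetical placeholders, and the random logits stand in for the output of an actual autoregressive LLM.

```python
import numpy as np

# Hypothetical special tokens marking modality boundaries (ids are placeholders).
BOI, EOI = 32000, 32001   # begin / end of keyframe (image) tokens
BOM, EOM = 32002, 32003   # begin / end of motion tokens

def interleave(text_ids, keyframe_ids, motion_ids):
    """Concatenate text, keyframe, and motion tokens into one training sequence."""
    return np.concatenate([
        np.asarray(text_ids),
        [BOI], np.asarray(keyframe_ids), [EOI],
        [BOM], np.asarray(motion_ids), [EOM],
    ]).astype(np.int64)

def next_token_loss(logits, tokens):
    """Average cross-entropy of predicting token t+1 from positions up to t.

    logits: (L, V) scores from any autoregressive model; tokens: (L,) ids.
    """
    logits, targets = logits[:-1], tokens[1:]
    m = logits.max(axis=-1, keepdims=True)
    logp = logits - m - np.log(np.exp(logits - m).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

seq = interleave(text_ids=[5, 17, 42], keyframe_ids=[101, 102, 103], motion_ids=[7, 8])
dummy_logits = np.random.randn(len(seq), 32004)   # stand-in for LLM output scores
print(next_token_loss(dummy_logits, seq))
```

Because every modality is reduced to discrete tokens in one sequence, the same loss covers text, images, and video, which is what makes the pre-training "unified".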

Experimental Validation

Video-LaVIT demonstrates competitive and, in many cases, state-of-the-art performance across 13 multimodal benchmarks covering image and video understanding and generation tasks. Experiments on zero-shot video question answering show clear numerical advantages: on MSVD-QA, for instance, Video-LaVIT achieves an accuracy of 73.5%, outperforming previous methods. In the demanding task of zero-shot text-to-video generation, the model also surpasses established baselines on the Fréchet video distance (FVD, where lower is better; 169.51 on MSR-VTT) and other metrics reflective of video quality.
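For reference, the Fréchet video distance compares the feature statistics of generated and real clips, conventionally using features from a pretrained I3D network; lower values indicate closer distributions. Below is a minimal sketch of the distance computation, assuming clip-level features have already been extracted (the feature extractor and the paper's exact evaluation protocol are not reproduced here).

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two (N, D) feature sets."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):   # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```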

Qualitatively, comparing generated content with that of contemporaneous models reveals Video-LaVIT's strengths in producing cohesive and contextually appropriate images and videos. In both text-to-image and image-to-video generation, the model shows a strong ability to grasp and visually depict intricate textual descriptions.

Conclusion

In conclusion, Video-LaVIT makes significant advances in video-language pre-training. By efficiently tokenizing video into keyframes and temporal motions, it enables LLMs to handle the video modality effectively. The combination of quantitative results, qualitative demonstrations, and innovations in model design establishes Video-LaVIT as a meaningful step towards holistic, multimodal AI that can seamlessly traverse text, images, and videos. Further details on the architecture, along with ablation studies and limitations, can be found in the supplementary materials accompanying the paper.
