Abstract

In light of recent advances in multimodal LLMs, there is increasing attention to scaling them from image-text data to more informative real-world videos. Compared to static images, video poses unique challenges for effective large-scale pre-training because its spatiotemporal dynamics must be modeled. In this paper, we address these challenges in video-language pre-training with an efficient video decomposition that represents each video as keyframes and temporal motions. These are then adapted to an LLM using well-designed tokenizers that discretize visual and temporal information into a few tokens, thus enabling unified generative pre-training of videos, images, and text. At inference, the tokens generated by the LLM are carefully mapped back to the original continuous pixel space to create various video content. Our proposed framework is capable of both comprehending and generating image and video content, as demonstrated by its competitive performance across 13 multimodal benchmarks in image and video understanding and generation. Our code and models will be available at https://video-lavit.github.io.

The Video-LaVIT approach decomposes videos into keyframes and temporal motions for unified generative pre-training with text, enabling multimodal sequence generation.

Overview

  • Introduces Video-LaVIT, a new framework for video-language pre-training that integrates spatiotemporal dynamics into LLMs.

  • Presents efficient video representation by disentangling videos into keyframes and temporal motions with significant token savings.

  • Enables generative pre-training for videos using an autoregressive model that handles multimodal content including images, video, and text.

  • Reports competitive and, in many cases, state-of-the-art performance on 13 multimodal benchmarks, with notable results in zero-shot video question answering and text-to-video generation.

Introduction

The emergence of multimodal LLMs has accelerated progress towards AI systems that can understand and generate content across text and visual domains. Existing multimodal LLMs have been highly effective at image understanding; however, adapting them to video, a dynamic medium with a significant temporal dimension, presents additional challenges. We propose Video-LaVIT (Language-VIsion Transformer), which advances video-language pre-training by efficiently integrating spatiotemporal dynamics into LLMs.

Architecture

Video-LaVIT introduces an efficient video representation that disentangles videos into keyframes and temporal motions, each tokenized into a small number of tokens while preserving the essential visual and motion content. This design exploits the inherent redundancy in video data: keyframes carry the main visual semantics, while the temporal motions compactly capture how the scene evolves. Specifically, keyframes are handled by a visual tokenizer that reuses knowledge from a pre-trained image LLM, whereas the temporal motions are quantized by a novel spatiotemporal motion encoder, yielding substantial token savings (over 90% of tokens saved for a 2.2 s clip). Video-LaVIT comprises two main components: a tokenizer that converts the video modality into discrete tokens and a detokenizer that efficiently reconstructs the original video pixels.
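To make the decomposition concrete, below is a minimal NumPy sketch of the general idea: a clip is reduced to one keyframe plus coarse block-wise motion vectors, mimicking the motion information a video codec already stores. The block size, search window, and function name are illustrative assumptions, not the paper's actual tokenizer.

```python
import numpy as np

def decompose_clip(frames: np.ndarray, block: int = 16):
    """Split a clip into one keyframe plus coarse block-wise motion vectors.

    frames: (T, H, W, 3) uint8 array; the first frame is treated as the keyframe.
    Returns the keyframe and a (T-1, H//block, W//block, 2) int8 array of (dy, dx)
    offsets found by a small exhaustive search, standing in for the motion vectors
    that a compressed video stream already provides.
    """
    key = frames[0]
    T, H, W, _ = frames.shape
    gh, gw = H // block, W // block
    motions = np.zeros((T - 1, gh, gw, 2), dtype=np.int8)
    prev = key.astype(np.int32)
    for t in range(1, T):
        cur = frames[t].astype(np.int32)
        for i in range(gh):
            for j in range(gw):
                ref = cur[i * block:(i + 1) * block, j * block:(j + 1) * block]
                best, best_err = (0, 0), np.inf
                for dy in range(-4, 5, 2):      # tiny search window for brevity
                    for dx in range(-4, 5, 2):
                        y, x = i * block + dy, j * block + dx
                        if 0 <= y <= H - block and 0 <= x <= W - block:
                            err = np.abs(ref - prev[y:y + block, x:x + block]).mean()
                            if err < best_err:
                                best, best_err = (dy, dx), err
                motions[t - 1, i, j] = best
        prev = cur
    return key, motions
```

For example, calling `decompose_clip(np.zeros((8, 64, 64, 3), dtype=np.uint8))` returns one 64x64 keyframe and a (7, 4, 4, 2) motion grid; in the actual framework, learned tokenizers would then discretize both parts into a short token sequence.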

Unified Generative Modeling

The framework delivers unified generative pre-training through an autoregressive model that extends beyond the image modality to video. It ingests paired keyframe-motion token sequences alongside images and text, and is optimized under a single next-token prediction objective. The result is a system that internalizes the sequential relationships between video clips, enhancing the model's capacity to understand and generate long sequences of multimodal content.
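The unified objective itself is ordinary next-token prediction over one interleaved token stream. The sketch below shows the general mechanics under stated assumptions: the boundary tokens (BOI/EOI, BOM/EOM), the token ids, and the vocabulary size are hypothetical placeholders, and the random logits stand in for the output of an actual autoregressive LLM.

```python
import numpy as np

# Hypothetical special tokens marking modality boundaries (ids are placeholders).
BOI, EOI = 32000, 32001   # begin / end of keyframe (image) tokens
BOM, EOM = 32002, 32003   # begin / end of motion tokens

def interleave(text_ids, keyframe_ids, motion_ids):
    """Concatenate text, keyframe, and motion tokens into one training sequence."""
    return np.concatenate([
        np.asarray(text_ids),
        [BOI], np.asarray(keyframe_ids), [EOI],
        [BOM], np.asarray(motion_ids), [EOM],
    ]).astype(np.int64)

def next_token_loss(logits, tokens):
    """Average cross-entropy of predicting token t+1 from positions up to t.

    logits: (L, V) scores from any autoregressive model; tokens: (L,) ids.
    """
    logits, targets = logits[:-1], tokens[1:]
    m = logits.max(axis=-1, keepdims=True)
    logp = logits - m - np.log(np.exp(logits - m).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

seq = interleave(text_ids=[5, 17, 42], keyframe_ids=[101, 102, 103], motion_ids=[7, 8])
dummy_logits = np.random.randn(len(seq), 32004)   # stand-in for LLM output scores
print(next_token_loss(dummy_logits, seq))
```

Because every modality is reduced to discrete tokens in one sequence, the same loss covers text, images, and video, which is what makes the pre-training "unified".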

Experimental Validation

Video-LaVIT demonstrates competitive and, in many cases, state-of-the-art performance across 13 multimodal benchmarks covering image and video understanding and generation tasks. Experiments on zero-shot video question answering show clear numerical advantages: on MSVD-QA, for instance, Video-LaVIT achieves an accuracy of 73.5%, outperforming previous methods. In the demanding task of zero-shot text-to-video generation, the model also surpasses established baselines on the Fréchet video distance (FVD, where lower is better; 169.51 on MSR-VTT) and other metrics reflective of video quality.
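For reference, the Fréchet video distance compares the feature statistics of generated and real clips, conventionally using features from a pretrained I3D network; lower values indicate closer distributions. Below is a minimal sketch of the distance computation, assuming clip-level features have already been extracted (the feature extractor and the paper's exact evaluation protocol are not reproduced here).

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two (N, D) feature sets."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):   # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```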

Qualitatively, comparing generated content with that of contemporaneous models reveals Video-LaVIT's strengths in producing cohesive and contextually appropriate images and videos. In both text-to-image and image-to-video generation, the model shows a strong ability to grasp and visually depict intricate textual descriptions.

Conclusion

In conclusion, Video-LaVIT makes significant advances in video-language pre-training. By efficiently tokenizing video into keyframes and temporal motions, it enables LLMs to handle the video modality effectively. The combination of quantitative results, qualitative demonstrations, and innovations in model design establishes Video-LaVIT as a meaningful step towards holistic, multimodal AI that can seamlessly traverse text, images, and videos. Further details on the architecture, along with ablation studies and limitations, can be found in the supplementary materials accompanying the paper.
