LongVILA: Scaling Long-Context Visual Language Models for Long Videos

(2408.10188)
Published Aug 19, 2024 in cs.CV and cs.CL

Abstract

Long-context capability is critical for multi-modal foundation models, especially for long video understanding. We introduce LongVILA, a full-stack solution for long-context visual-language models by co-designing the algorithm and system. For model training, we upgrade existing VLMs to support long video understanding by incorporating two additional stages, i.e., long context extension and long supervised fine-tuning. However, training on long video is computationally and memory intensive. We introduce the long-context Multi-Modal Sequence Parallelism (MM-SP) system that efficiently parallelizes long video training and inference, enabling 2M context length training on 256 GPUs without any gradient checkpointing. LongVILA efficiently extends the number of video frames of VILA from 8 to 1024, improving the long video captioning score from 2.00 to 3.26 (out of 5), achieving 99.5% accuracy in 1400-frame (274k context length) video needle-in-a-haystack. LongVILA-8B demonstrates consistent accuracy improvements on long videos in the VideoMME benchmark as the number of frames increases. Besides, MM-SP is 2.1x - 5.7x faster than ring sequence parallelism and 1.1x - 1.4x faster than Megatron with context parallelism + tensor parallelism. Moreover, it seamlessly integrates with Hugging Face Transformers.

Figure: Training pipeline for LongVILA, which extends VILA-1.5 VLMs to long video understanding.

Overview

  • The paper presents LongVILA, a full-stack solution designed to address the challenges of training and inference for long-context visual language models (VLMs), with advancements at the system, model, and dataset levels.

  • A key innovation is Multi-Modal Sequence Parallelism (MM-SP), which improves efficiency and speed for long-context training and inference, supporting context lengths of up to 2 million tokens on 256 GPUs.

  • LongVILA employs a five-stage training pipeline and curated large-scale datasets, achieving notable performance gains, including 99.5% accuracy on a 1,400-frame needle-in-a-haystack test and a substantially higher long video captioning score.

LongVILA: Scaling Long-Context Visual Language Models for Long Videos

The paper "LongVILA: Scaling Long-Context Visual Language Models for Long Videos" presents a comprehensive solution designed to address the challenges associated with training and inference of long-context visual language models (VLMs). The authors propose advancements at the system, model, and dataset levels, presenting both practical and theoretical contributions to the field of AI.

Multi-Modal Sequence Parallelism (MM-SP)

The cornerstone of this research is the introduction of Multi-Modal Sequence Parallelism (MM-SP), a system specifically developed to support long-context training and inference for VLMs. MM-SP supports context lengths of up to 2 million tokens when training on 256 GPUs, without gradient checkpointing. Compared with existing approaches, it is 2.1x to 5.7x faster than ring-style sequence parallelism and 1.1x to 1.4x faster than Megatron-LM with context parallelism plus tensor parallelism, while supporting much longer context lengths without sacrificing performance. It also integrates seamlessly with Hugging Face Transformers.
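To make the memory argument concrete, the following plain-Python sketch illustrates the core idea of sequence parallelism: one very long multi-modal token sequence is split into contiguous slices so that each GPU holds only roughly 1/world_size of the activations. This is an illustrative sketch, not the authors' MM-SP implementation, and the ~196 tokens per frame is an approximation back-solved from the paper's 274k-context / 1,400-frame figure.

```python
# Illustrative sketch (not the authors' MM-SP code): sequence parallelism splits
# one long multi-modal token sequence across ranks so each GPU holds only a
# 1/world_size slice of the activations. Token counts are example assumptions.

def shard_sequence(num_tokens: int, world_size: int) -> list[range]:
    """Return the contiguous slice of token positions owned by each rank."""
    base, rem = divmod(num_tokens, world_size)
    slices, start = [], 0
    for rank in range(world_size):
        length = base + (1 if rank < rem else 0)  # spread the remainder evenly
        slices.append(range(start, start + length))
        start += length
    return slices

if __name__ == "__main__":
    # e.g. ~274k visual tokens from 1,400 frames plus a short text prompt, on 8 GPUs
    frame_tokens, text_tokens, world_size = 1400 * 196, 256, 8
    for rank, sl in enumerate(shard_sequence(frame_tokens + text_tokens, world_size)):
        print(f"rank {rank}: tokens [{sl.start}, {sl.stop}) -> {len(sl)} tokens")
```

MM-SP additionally balances image and text tokens across ranks to handle modality heterogeneity, which this naive contiguous split does not attempt.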

Training and Data Pipeline

The training pipeline for LongVILA consists of five distinct stages (summarized as a configuration sketch after the list):

  1. Multi-Modal Alignment: Initializing the multi-modal capabilities by training only the multi-modal projector while freezing other components.
  2. Large-Scale Pre-Training: Utilizing high-quality datasets to conduct extensive pre-training. This involves relabeling datasets like COYO-25M to refine the data quality.
  3. Context Extension for the LLM: Extending the LLM's context length to 262,144 tokens through continued pre-training on text datasets such as SlimPajama.
  4. Short Supervised Fine-Tuning: Enhancing instruction-following abilities using a combination of short and long video datasets.
  5. Long Supervised Fine-Tuning: Tailoring the model specifically for long videos using a specially constructed dataset derived from long video content.
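The sketch below expresses the five-stage schedule above as data. Only the stage-1 trainable/frozen split is stated in the summary, so anything not stated there is marked "unspecified" rather than guessed.

```python
# A minimal sketch of the five-stage schedule described above, expressed as data.
# Only stage 1's trainable/frozen split is given in the summary; other fields
# not stated there are left "unspecified".

LONGVILA_STAGES = [
    {"stage": 1, "name": "multi-modal alignment",
     "trainable": ["multi_modal_projector"], "data": "image-text alignment data"},
    {"stage": 2, "name": "large-scale pre-training",
     "trainable": "unspecified", "data": "relabeled corpora such as COYO-25M"},
    {"stage": 3, "name": "LLM context extension",
     "trainable": "unspecified", "data": "text corpora such as SlimPajama",
     "target_context": 262_144},
    {"stage": 4, "name": "short supervised fine-tuning",
     "trainable": "unspecified", "data": "short and long video instruction data"},
    {"stage": 5, "name": "long supervised fine-tuning",
     "trainable": "unspecified", "data": "long-video instruction-following dataset"},
]

for s in LONGVILA_STAGES:
    ctx = f", target context {s['target_context']:,} tokens" if "target_context" in s else ""
    print(f"Stage {s['stage']}: {s['name']} on {s['data']}{ctx}")
```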

Datasets

A critical component supporting this research is the curation of large-scale visual language pre-training datasets together with a dedicated long video instruction-following dataset. The latter comprises 15,292 videos spanning diverse categories, with segment-level annotations that facilitate detailed long video understanding.
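As an illustration of what a segment-annotated training sample might look like, here is a hypothetical record layout; the field names and example values are assumptions for exposition, not the paper's actual schema.

```python
# Hypothetical record layout for the long-video instruction-following data
# described above; field names and values are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class SegmentAnnotation:
    start_sec: float        # segment boundaries within the long video
    end_sec: float
    caption: str            # per-segment description used to build Q&A pairs

@dataclass
class LongVideoSample:
    video_path: str
    category: str                      # the corpus spans diverse categories
    segments: list[SegmentAnnotation]  # segment-level annotations
    question: str                      # instruction grounded in the full video
    answer: str

sample = LongVideoSample(
    video_path="videos/example.mp4",   # placeholder path
    category="sports",
    segments=[SegmentAnnotation(0.0, 60.0, "warm-up on the field"),
              SegmentAnnotation(60.0, 180.0, "first-half highlights")],
    question="Summarize the key events across the whole match.",
    answer="The video shows a warm-up followed by first-half highlights ...",
)
print(len(sample.segments), "segments for", sample.video_path)
```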

Performance and Evaluation

The research presents empirical results showcasing the performance of LongVILA. Notably, the model extends the feasible number of frames from 8 to 1,024 and substantially improves the long video captioning score (from 2.00 to 3.26 out of 5). It achieves 99.5% accuracy on a 1,400-frame video (274k context length) in the needle-in-a-haystack test. Additionally, LongVILA-8B demonstrates consistent accuracy improvements on the VideoMME benchmark as the number of video frames increases.
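A quick back-of-the-envelope check ties these numbers together; this is an approximation derived from the figures quoted above, not an official tokenizer count.

```python
# 1,400 frames at a 274k context implies roughly 196 visual tokens per frame,
# so 1,024 frames would occupy on the order of 200k tokens before any text.

context_len, frames = 274_000, 1_400
tokens_per_frame = context_len / frames
print(f"~{tokens_per_frame:.0f} tokens per frame")
print(f"~{int(tokens_per_frame * 1024):,} tokens for 1,024 frames")
```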

System-Level Contributions

The system-level contributions of this research are significant. MM-SP incorporates a two-dimensional attention mechanism that optimizes training throughput by leveraging both intra-node All-to-All (A2A) and inter-node Point-to-Point (P2P) communication. This design effectively addresses the challenges posed by network heterogeneity and modality heterogeneity, achieving balanced load distribution and efficient computation.
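The sketch below illustrates the 2D layout idea in plain Python: ranks on the same node form the intra-node (A2A) dimension over fast links, while the same local slot across nodes forms the inter-node (P2P) dimension. The group sizes are arbitrary example values, and this is not the authors' MM-SP code.

```python
# Illustrative 2D layout: ranks within a node form the "A2A" (all-to-all)
# dimension, and the same local slot across nodes forms the "P2P" (ring-style)
# dimension. Sizes are example assumptions, not the paper's configuration.

def two_d_groups(num_nodes: int, gpus_per_node: int):
    a2a_groups = [[node * gpus_per_node + g for g in range(gpus_per_node)]
                  for node in range(num_nodes)]          # fast intra-node links
    p2p_groups = [[node * gpus_per_node + g for node in range(num_nodes)]
                  for g in range(gpus_per_node)]         # slower inter-node links
    return a2a_groups, p2p_groups

a2a, p2p = two_d_groups(num_nodes=4, gpus_per_node=8)    # 32 GPUs total
print("intra-node A2A group for node 0:", a2a[0])
print("inter-node P2P ring for local GPU 0:", p2p[0])
```

Keeping the all-to-all exchange inside a node exploits the faster intra-node interconnect, while the ring-style point-to-point traffic is confined to the slower inter-node links.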

Implications and Future Directions

This research has several important implications for the development of more advanced AI systems. Practically, the solution offers a scalable and efficient framework for training long-context VLMs, enhancing capabilities in video understanding and multimodal interaction. Theoretically, it underscores the importance of full-stack design in AI systems, elucidating how integrated solutions spanning hardware, algorithms, and data can unlock new potentials in AI research.

Looking forward, future work might explore further optimization of MM-SP, possibly through porting to more efficient languages like C++ or integrating with other advanced hardware configurations. The approach could be extended to other modalities and more complex multi-modal scenarios, potentially pushing the boundaries of what current AI systems can achieve in terms of context length and multi-modal integration.

In summary, the paper "LongVILA: Scaling Long-Context Visual Language Models for Long Videos" offers innovative contributions to the field of long-context visual language models through its detailed system design, comprehensive training pipeline, and large-scale data curation. This work paves the way for future advancements in multi-modal AI capable of understanding and processing large-scale, diverse datasets.
