LongVILA: Scaling Long-Context Visual Language Models for Long Videos

(2408.10188)
Published Aug 19, 2024 in cs.CV and cs.CL

Abstract

Long-context capability is critical for multi-modal foundation models, especially for long video understanding. We introduce LongVILA, a full-stack solution for long-context visual-language models by co-designing the algorithm and system. For model training, we upgrade existing VLMs to support long video understanding by incorporating two additional stages, i.e., long context extension and long supervised fine-tuning. However, training on long video is computationally and memory intensive. We introduce the long-context Multi-Modal Sequence Parallelism (MM-SP) system that efficiently parallelizes long video training and inference, enabling 2M context length training on 256 GPUs without any gradient checkpointing. LongVILA efficiently extends the number of video frames of VILA from 8 to 1024, improving the long video captioning score from 2.00 to 3.26 (out of 5), achieving 99.5% accuracy in 1400-frame (274k context length) video needle-in-a-haystack. LongVILA-8B demonstrates consistent accuracy improvements on long videos in the VideoMME benchmark as the number of frames increases. Besides, MM-SP is 2.1x - 5.7x faster than ring sequence parallelism and 1.1x - 1.4x faster than Megatron with context parallelism + tensor parallelism. Moreover, it seamlessly integrates with Hugging Face Transformers.

Figure: Training pipeline for LongVILA, which extends VILA-1.5 VLMs to long video understanding.

Overview

  • The paper presents LongVILA, a full-stack solution designed to address the challenges of training and inference for long-context visual language models (VLMs), with advancements at the system, model, and dataset levels.

  • A key innovation is Multi-Modal Sequence Parallelism (MM-SP), which improves efficiency and speed for long-context training and inference, supporting context lengths of up to 2 million tokens on 256 GPUs.

  • LongVILA employs a five-stage training pipeline and curated large-scale datasets, achieving notable performance gains, including 99.5% accuracy on a 1,400-frame needle-in-a-haystack test and a substantially higher long video captioning score.

LongVILA: Scaling Long-Context Visual Language Models for Long Videos

The paper "LongVILA: Scaling Long-Context Visual Language Models for Long Videos" presents a comprehensive solution designed to address the challenges associated with training and inference of long-context visual language models (VLMs). The authors propose advancements at the system, model, and dataset levels, presenting both practical and theoretical contributions to the field of AI.

Multi-Modal Sequence Parallelism (MM-SP)

The cornerstone of this research is the introduction of Multi-Modal Sequence Parallelism (MM-SP), a system specifically developed to support long-context training and inference for VLMs. MM-SP supports context lengths of up to 2 million tokens when training on 256 GPUs, without gradient checkpointing. Compared with existing approaches, it is 2.1x to 5.7x faster than ring-style sequence parallelism and 1.1x to 1.4x faster than Megatron-LM with context parallelism plus tensor parallelism, while supporting much longer context lengths without sacrificing performance. It also integrates seamlessly with Hugging Face Transformers.
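To make the memory argument concrete, the following plain-Python sketch illustrates the core idea of sequence parallelism: one very long multi-modal token sequence is split into contiguous slices so that each GPU holds only roughly 1/world_size of the activations. This is an illustrative sketch, not the authors' MM-SP implementation, and the ~196 tokens per frame is an approximation back-solved from the paper's 274k-context / 1,400-frame figure.

```python
# Illustrative sketch (not the authors' MM-SP code): sequence parallelism splits
# one long multi-modal token sequence across ranks so each GPU holds only a
# 1/world_size slice of the activations. Token counts are example assumptions.

def shard_sequence(num_tokens: int, world_size: int) -> list[range]:
    """Return the contiguous slice of token positions owned by each rank."""
    base, rem = divmod(num_tokens, world_size)
    slices, start = [], 0
    for rank in range(world_size):
        length = base + (1 if rank < rem else 0)  # spread the remainder evenly
        slices.append(range(start, start + length))
        start += length
    return slices

if __name__ == "__main__":
    # e.g. ~274k visual tokens from 1,400 frames plus a short text prompt, on 8 GPUs
    frame_tokens, text_tokens, world_size = 1400 * 196, 256, 8
    for rank, sl in enumerate(shard_sequence(frame_tokens + text_tokens, world_size)):
        print(f"rank {rank}: tokens [{sl.start}, {sl.stop}) -> {len(sl)} tokens")
```

MM-SP additionally balances image and text tokens across ranks to handle modality heterogeneity, which this naive contiguous split does not attempt.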

Training and Data Pipeline

The training pipeline for LongVILA consists of five distinct stages (summarized as a configuration sketch after the list):

  1. Multi-Modal Alignment: Initializing the multi-modal capabilities by training only the multi-modal projector while freezing other components.
  2. Large-Scale Pre-Training: Utilizing high-quality datasets to conduct extensive pre-training. This involves relabeling datasets like COYO-25M to refine the data quality.
  3. Context Extension for the LLM: Extending the LLM's context length to 262,144 tokens through continued pre-training on text datasets such as SlimPajama.
  4. Short Supervised Fine-Tuning: Enhancing instruction-following abilities using a combination of short and long video datasets.
  5. Long Supervised Fine-Tuning: Tailoring the model specifically for long videos using a specially constructed dataset derived from long video content.
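The sketch below expresses the five-stage schedule above as data. Only the stage-1 trainable/frozen split is stated in the summary, so anything not stated there is marked "unspecified" rather than guessed.

```python
# A minimal sketch of the five-stage schedule described above, expressed as data.
# Only stage 1's trainable/frozen split is given in the summary; other fields
# not stated there are left "unspecified".

LONGVILA_STAGES = [
    {"stage": 1, "name": "multi-modal alignment",
     "trainable": ["multi_modal_projector"], "data": "image-text alignment data"},
    {"stage": 2, "name": "large-scale pre-training",
     "trainable": "unspecified", "data": "relabeled corpora such as COYO-25M"},
    {"stage": 3, "name": "LLM context extension",
     "trainable": "unspecified", "data": "text corpora such as SlimPajama",
     "target_context": 262_144},
    {"stage": 4, "name": "short supervised fine-tuning",
     "trainable": "unspecified", "data": "short and long video instruction data"},
    {"stage": 5, "name": "long supervised fine-tuning",
     "trainable": "unspecified", "data": "long-video instruction-following dataset"},
]

for s in LONGVILA_STAGES:
    ctx = f", target context {s['target_context']:,} tokens" if "target_context" in s else ""
    print(f"Stage {s['stage']}: {s['name']} on {s['data']}{ctx}")
```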

Datasets

A critical component supporting this research is the curation of large-scale visual language pre-training datasets together with a dedicated long video instruction-following dataset. The latter comprises 15,292 videos spanning diverse categories, with segment-level annotations that facilitate detailed long video understanding.
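As an illustration of what a segment-annotated training sample might look like, here is a hypothetical record layout; the field names and example values are assumptions for exposition, not the paper's actual schema.

```python
# Hypothetical record layout for the long-video instruction-following data
# described above; field names and values are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class SegmentAnnotation:
    start_sec: float        # segment boundaries within the long video
    end_sec: float
    caption: str            # per-segment description used to build Q&A pairs

@dataclass
class LongVideoSample:
    video_path: str
    category: str                      # the corpus spans diverse categories
    segments: list[SegmentAnnotation]  # segment-level annotations
    question: str                      # instruction grounded in the full video
    answer: str

sample = LongVideoSample(
    video_path="videos/example.mp4",   # placeholder path
    category="sports",
    segments=[SegmentAnnotation(0.0, 60.0, "warm-up on the field"),
              SegmentAnnotation(60.0, 180.0, "first-half highlights")],
    question="Summarize the key events across the whole match.",
    answer="The video shows a warm-up followed by first-half highlights ...",
)
print(len(sample.segments), "segments for", sample.video_path)
```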

Performance and Evaluation

The research presents empirical results showcasing the performance of LongVILA. Notably, the model extends the feasible number of frames from 8 to 1,024 and substantially improves the long video captioning score (from 2.00 to 3.26 out of 5). It achieves 99.5% accuracy on a 1,400-frame video (274k context length) in the needle-in-a-haystack test. Additionally, LongVILA-8B demonstrates consistent accuracy improvements on the VideoMME benchmark as the number of video frames increases.
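A quick back-of-the-envelope check ties these numbers together; this is an approximation derived from the figures quoted above, not an official tokenizer count.

```python
# 1,400 frames at a 274k context implies roughly 196 visual tokens per frame,
# so 1,024 frames would occupy on the order of 200k tokens before any text.

context_len, frames = 274_000, 1_400
tokens_per_frame = context_len / frames
print(f"~{tokens_per_frame:.0f} tokens per frame")
print(f"~{int(tokens_per_frame * 1024):,} tokens for 1,024 frames")
```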

System-Level Contributions

The system-level contributions of this research are significant. MM-SP incorporates a two-dimensional attention mechanism that optimizes training throughput by leveraging both intra-node All-to-All (A2A) and inter-node Point-to-Point (P2P) communication. This design effectively addresses the challenges posed by network heterogeneity and modality heterogeneity, achieving balanced load distribution and efficient computation.
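The sketch below illustrates the 2D layout idea in plain Python: ranks on the same node form the intra-node (A2A) dimension over fast links, while the same local slot across nodes forms the inter-node (P2P) dimension. The group sizes are arbitrary example values, and this is not the authors' MM-SP code.

```python
# Illustrative 2D layout: ranks within a node form the "A2A" (all-to-all)
# dimension, and the same local slot across nodes forms the "P2P" (ring-style)
# dimension. Sizes are example assumptions, not the paper's configuration.

def two_d_groups(num_nodes: int, gpus_per_node: int):
    a2a_groups = [[node * gpus_per_node + g for g in range(gpus_per_node)]
                  for node in range(num_nodes)]          # fast intra-node links
    p2p_groups = [[node * gpus_per_node + g for node in range(num_nodes)]
                  for g in range(gpus_per_node)]         # slower inter-node links
    return a2a_groups, p2p_groups

a2a, p2p = two_d_groups(num_nodes=4, gpus_per_node=8)    # 32 GPUs total
print("intra-node A2A group for node 0:", a2a[0])
print("inter-node P2P ring for local GPU 0:", p2p[0])
```

Keeping the all-to-all exchange inside a node exploits the faster intra-node interconnect, while the ring-style point-to-point traffic is confined to the slower inter-node links.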

Implications and Future Directions

This research has several important implications for the development of more advanced AI systems. Practically, the solution offers a scalable and efficient framework for training long-context VLMs, enhancing capabilities in video understanding and multimodal interaction. Theoretically, it underscores the importance of full-stack design in AI systems, elucidating how integrated solutions spanning hardware, algorithms, and data can unlock new potentials in AI research.

Looking forward, future work might explore further optimization of MM-SP, possibly through porting to more efficient languages like C++ or integrating with other advanced hardware configurations. The approach could be extended to other modalities and more complex multi-modal scenarios, potentially pushing the boundaries of what current AI systems can achieve in terms of context length and multi-modal integration.

In summary, the paper "LongVILA: Scaling Long-Context Visual Language Models for Long Videos" offers innovative contributions to the field of long-context visual language models through its detailed system design, comprehensive training pipeline, and large-scale data curation. This work paves the way for future advancements in multi-modal AI capable of understanding and processing large-scale, diverse datasets.
