Long Context Transfer from Language to Vision

(2406.16852)
Published Jun 24, 2024 in cs.CV

Abstract

Video sequences offer valuable temporal information, but existing large multimodal models (LMMs) fall short in understanding extremely long videos. Many works address this by reducing the number of visual tokens using visual resamplers. Alternatively, in this paper, we approach this problem from the perspective of the language model. By simply extrapolating the context length of the language backbone, we enable LMMs to comprehend orders of magnitude more visual tokens without any video training. We call this phenomenon long context transfer and carefully ablate its properties. To effectively measure LMMs' ability to generalize to long contexts in the vision modality, we develop V-NIAH (Visual Needle-In-A-Haystack), a purely synthetic long vision benchmark inspired by the language model's NIAH test. Our proposed Long Video Assistant (LongVA) can process 2000 frames or over 200K visual tokens without additional complexities. With its extended context length, LongVA achieves state-of-the-art performance on Video-MME among 7B-scale models by densely sampling more input frames. Our work is open-sourced at https://github.com/EvolvingLMMs-Lab/LongVA.

Figure: Comparing traditional visual resamplers with LongVA's language model-focused approach for processing long-context visual data.

Overview

  • The paper introduces a novel approach called LongVA for processing extensive visual data sequences by transferring long context capabilities from language models to large multimodal models (LMMs).

  • LongVA utilizes long context training, advanced attention mechanisms, and a unified encoding scheme, UniRes, to handle massive visual token sets without the need for video-specific training.

  • The proposed method demonstrates superior performance in benchmarks like Visual Needle-In-A-Haystack (V-NIAH) and Video-MME, significantly outperforming existing models in both video and image tasks.

Long Context Transfer from Language to Vision: A Comprehensive Analysis

The paper "Long Context Transfer from Language to Vision" introduces a novel approach to processing extensive visual data sequences by transferring the context length capabilities from extended language models to large multimodal models (LMMs). This strategy addresses a significant challenge in the domain: the current inability of LMMs to effectively process extremely long video sequences.

Abstract and Introduction

The central premise of the paper is that while LMMs excel at tasks involving single images or short video clips, they falter on long video sequences because of the overwhelming number of visual tokens generated. Previous solutions primarily focused on reducing these tokens via visual resamplers. Instead, this paper proposes "long context transfer," wherein the context length of the language model backbone is extrapolated so that the model can handle massive visual token sets without any video-specific training. The resulting model, the Long Video Assistant (LongVA), pairs an LLM extended through long context training with visual tokens aligned via a unified encoding scheme called UniRes.

Methodology

Long Context Training

A cornerstone of this research is long context training applied to Qwen2-7B-Instruct: the model undergoes continued pretraining with a context length of 224K tokens. Using FlashAttention-2, Ring Attention, and memory-efficient optimization strategies, the training completes in 1,000 steps on 8 A100 GPUs, keeping the context extension computationally tractable. The resulting long-context model achieves near-perfect results on the Needle-in-a-Haystack (NIAH) benchmark, significantly surpassing its shorter-context counterpart.
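As a rough illustration of the context-extension step, the sketch below loads the Qwen2-7B-Instruct backbone with FlashAttention-2 and overrides its RoPE base and maximum position settings before continued pretraining. The specific RoPE base value and exact token count are assumptions, and the paper's full recipe additionally shards long sequences across GPUs with Ring Attention, which this single-process sketch omits.

```python
# Minimal sketch of extending a language backbone's context window before
# continued pretraining. Hyperparameter values are illustrative assumptions,
# not the authors' exact settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_MODEL = "Qwen/Qwen2-7B-Instruct"
TARGET_CONTEXT = 224 * 1024     # "224K" context length; exact count is an assumption
ROPE_THETA = 1_000_000_000      # enlarged RoPE base frequency (assumed value)

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # FlashAttention-2 kernels
    rope_theta=ROPE_THETA,                    # config override: larger rotary base
    max_position_embeddings=TARGET_CONTEXT,   # config override: longer positions
)

# Continued pretraining on long text would follow here. The paper's recipe
# additionally relies on Ring Attention to shard each long sequence across
# GPUs and on memory-efficient optimizers; both are omitted in this sketch.
```

Enlarging the RoPE base stretches the rotary positional encoding so attention remains well-behaved at positions far beyond the backbone's original training length.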

Vision and Language Alignment

The core innovation enabling long context transfer to vision is UniRes, a unified encoding scheme that represents videos as extended image grids, allowing the language model to consume visual inputs the same way it consumes long text. Specifically, images and video frames are divided into 336x336-pixel grids, encoded with CLIP-ViT-L-336px, and projected into the language model's embedding space by a 2-layer MLP. A distinctive aspect of UniRes is its pooling of visual tokens, which keeps the representation consistent between image and video inputs, a decisive factor for performance on long-context visual tasks.
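To make the encoding concrete, here is a minimal sketch of a UniRes-style visual encoder built from off-the-shelf components: it encodes each 336x336 grid with CLIP-ViT-L-336px, applies 2x2 average pooling over the patch grid, and projects the pooled tokens with a 2-layer MLP. The pooling factor, CLS-token handling, and the projector's output width are assumptions for illustration rather than the authors' exact implementation.

```python
# Sketch of a UniRes-style encoder: grid -> ViT patches -> 2x2 pooling -> MLP projector.
import torch
import torch.nn as nn
from transformers import CLIPVisionModel

class UniResSketch(nn.Module):
    def __init__(self, llm_dim: int = 3584):  # 3584 assumed (Qwen2-7B hidden size)
        super().__init__()
        self.vit = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")
        vit_dim = self.vit.config.hidden_size           # 1024 for ViT-L
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)  # 24x24 -> 12x12 patches
        self.proj = nn.Sequential(                       # 2-layer MLP projector
            nn.Linear(vit_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, grids: torch.Tensor) -> torch.Tensor:
        """grids: (N, 3, 336, 336) image tiles or video frames."""
        feats = self.vit(pixel_values=grids).last_hidden_state[:, 1:]  # drop CLS -> (N, 576, 1024)
        n, p, d = feats.shape
        side = int(p ** 0.5)                                    # 24 patches per side
        feats = feats.transpose(1, 2).reshape(n, d, side, side)
        feats = self.pool(feats).flatten(2).transpose(1, 2)     # (N, 144, 1024)
        return self.proj(feats)                                 # (N, 144, llm_dim)

# Flattening the per-grid tokens in order yields one long visual sequence that
# the extended-context LLM consumes just like text.
```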

Benchmarking and Results

V-NIAH

The paper introduces the Visual Needle-In-A-Haystack (V-NIAH) benchmark, a purely synthetic test of an LMM's ability to locate and interpret specific frames embedded in hours-long video sequences. LongVA demonstrated a robust ability to process up to 3,000 frames, leveraging the extended context of its language model backbone. By contrast, LMMs without long context transfer degraded sharply once inputs exceeded their predefined context lengths, underscoring the effectiveness of LongVA's approach.
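The construction of a V-NIAH-style sample can be sketched as follows: a needle image with an associated question is spliced into a long sequence of distractor frames at a controlled depth, and accuracy is scored over a grid of (total frames, needle depth) settings. The dummy frames, needle, and question below are hypothetical placeholders, not assets from the benchmark.

```python
import random
from PIL import Image

def build_vniah_sample(haystack_frames, needle_frame, depth: float):
    """Insert the needle frame at relative position `depth` in [0, 1]."""
    assert 0.0 <= depth <= 1.0
    idx = int(round(depth * len(haystack_frames)))
    return haystack_frames[:idx] + [needle_frame] + haystack_frames[idx:]

# Dummy stand-ins: in the real benchmark the haystack comes from hours-long
# videos and each needle is an image paired with a question only it can answer.
haystack = [Image.new("RGB", (336, 336), (i % 256, 0, 0)) for i in range(200)]
needle = Image.new("RGB", (336, 336), (0, 255, 0))
question = "What word is written on the sign in the inserted frame?"  # hypothetical query

frames = build_vniah_sample(haystack, needle, depth=random.random())
# `frames` is encoded (e.g., with a UniRes-style encoder) and fed to the model
# together with `question`; scoring over a grid of (frame count, depth)
# settings produces the V-NIAH heatmap.
```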

Video-MME Performance

LongVA's zero-shot performance on Video-MME, a comprehensive video question-answering benchmark, validates the practical applicability of long context transfer. LongVA outperformed other models, including some at larger scales, achieving state-of-the-art results among 7B-scale models. The gains held across the benchmark's short, medium, and long video subsets, reflecting the model's ability to exploit densely sampled frame inputs.
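Dense frame sampling itself is straightforward; the sketch below uniformly samples a fixed frame budget from a video using the decord library. The budget of 128 frames is an assumed value chosen to stay within the extended token limit, not the paper's evaluation setting.

```python
import numpy as np
from decord import VideoReader  # third-party video-decoding library

def sample_frames(video_path: str, num_frames: int = 128) -> np.ndarray:
    """Uniformly sample `num_frames` frames across the whole video."""
    vr = VideoReader(video_path)
    indices = np.linspace(0, len(vr) - 1, num_frames).round().astype(int)
    return vr.get_batch(indices).asnumpy()  # (num_frames, H, W, 3) uint8 array
```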

Image Benchmark Insights

Despite being optimized for long video contexts, LongVA delivers competitive results on multiple image benchmarks. Its UniRes encoding outperforms existing models on high-resolution image datasets, particularly InfoVQA, illustrating the scheme's robustness and transferability.

Implications and Future Directions

This research presents a significant stride in the domain of multimodal AI models, particularly in processing and understanding long video sequences. The methodology not only demonstrates practical feasibility but also lays the groundwork for future explorations in the alignment of extended language models with diverse modalities. Given the effectiveness of training long context LMMs on text and subsequently adapting them to visual data, this technique could potentially extend to other complex multimodal tasks.

Conclusion

In summary, the proposed long context transfer from language to vision in LongVA exemplifies a significant advancement in overcoming the limitations of contemporary LMMs. This paper effectively bridges the gap between long context language processing and comprehensive video understanding, suggesting a promising trajectory for future research and applications in AI.
