Long Context Transfer from Language to Vision

(2406.16852)
Published Jun 24, 2024 in cs.CV

Abstract

Video sequences offer valuable temporal information, but existing large multimodal models (LMMs) fall short in understanding extremely long videos. Many works address this by reducing the number of visual tokens using visual resamplers. Alternatively, in this paper, we approach this problem from the perspective of the language model. By simply extrapolating the context length of the language backbone, we enable LMMs to comprehend orders of magnitude more visual tokens without any video training. We call this phenomenon long context transfer and carefully ablate its properties. To effectively measure LMMs' ability to generalize to long contexts in the vision modality, we develop V-NIAH (Visual Needle-In-A-Haystack), a purely synthetic long vision benchmark inspired by the language model's NIAH test. Our proposed Long Video Assistant (LongVA) can process 2000 frames or over 200K visual tokens without additional complexities. With its extended context length, LongVA achieves state-of-the-art performance on Video-MME among 7B-scale models by densely sampling more input frames. Our work is open-sourced at https://github.com/EvolvingLMMs-Lab/LongVA.

Figure: Comparing traditional visual resamplers with LongVA's language model-focused approach for processing long-context visual data.

Overview

  • The paper introduces a novel approach called LongVA for processing extensive visual data sequences by transferring long context capabilities from language models to large multimodal models (LMMs).

  • LongVA utilizes long context training, advanced attention mechanisms, and a unified encoding scheme, UniRes, to handle massive visual token sets without the need for video-specific training.

  • The proposed method demonstrates superior performance in benchmarks like Visual Needle-In-A-Haystack (V-NIAH) and Video-MME, significantly outperforming existing models in both video and image tasks.

Long Context Transfer from Language to Vision: A Comprehensive Analysis

The paper "Long Context Transfer from Language to Vision" introduces a novel approach to processing extensive visual data sequences by transferring the context length capabilities from extended language models to large multimodal models (LMMs). This strategy addresses a significant challenge in the domain: the current inability of LMMs to effectively process extremely long video sequences.

Abstract and Introduction

The central premise of the paper is that while LMMs excel at tasks involving single images or short video clips, they falter on long video sequences because of the overwhelming number of visual tokens generated. Previous solutions primarily focused on reducing these tokens via visual resamplers. Instead, this paper proposes "long context transfer," wherein the context length of the language model backbone is extrapolated so that the model can handle massive visual token sets without any video-specific training. The resulting model, the Long Video Assistant (LongVA), pairs an LLM extended through long context training with visual tokens aligned via a unified encoding scheme called UniRes.

Methodology

Long Context Training

A cornerstone of this research is long context training applied to Qwen2-7B-Instruct: the model undergoes continued pretraining with a context length of 224K tokens. Using FlashAttention-2, Ring Attention, and memory-efficient optimization strategies, the training completes in 1,000 steps on 8 A100 GPUs, keeping the context extension computationally tractable. The resulting long-context model achieves near-perfect results on the Needle-in-a-Haystack (NIAH) benchmark, significantly surpassing its shorter-context counterpart.
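As a rough illustration of the context-extension step, the sketch below loads the Qwen2-7B-Instruct backbone with FlashAttention-2 and overrides its RoPE base and maximum position settings before continued pretraining. The specific RoPE base value and exact token count are assumptions, and the paper's full recipe additionally shards long sequences across GPUs with Ring Attention, which this single-process sketch omits.

```python
# Minimal sketch of extending a language backbone's context window before
# continued pretraining. Hyperparameter values are illustrative assumptions,
# not the authors' exact settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_MODEL = "Qwen/Qwen2-7B-Instruct"
TARGET_CONTEXT = 224 * 1024     # "224K" context length; exact count is an assumption
ROPE_THETA = 1_000_000_000      # enlarged RoPE base frequency (assumed value)

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # FlashAttention-2 kernels
    rope_theta=ROPE_THETA,                    # config override: larger rotary base
    max_position_embeddings=TARGET_CONTEXT,   # config override: longer positions
)

# Continued pretraining on long text would follow here. The paper's recipe
# additionally relies on Ring Attention to shard each long sequence across
# GPUs and on memory-efficient optimizers; both are omitted in this sketch.
```

Enlarging the RoPE base stretches the rotary positional encoding so attention remains well-behaved at positions far beyond the backbone's original training length.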

Vision and Language Alignment

The core innovation enabling long context transfer to vision is UniRes, a unified encoding scheme that represents videos as extended image grids, allowing the language model to consume visual inputs the same way it consumes long text. Specifically, images and video frames are divided into 336x336-pixel grids, encoded with CLIP-ViT-L-336px, and projected into the language model's embedding space by a 2-layer MLP. A distinctive aspect of UniRes is its pooling of visual tokens, which keeps the representation consistent between image and video inputs, a decisive factor for performance on long-context visual tasks.
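To make the encoding concrete, here is a minimal sketch of a UniRes-style visual encoder built from off-the-shelf components: it encodes each 336x336 grid with CLIP-ViT-L-336px, applies 2x2 average pooling over the patch grid, and projects the pooled tokens with a 2-layer MLP. The pooling factor, CLS-token handling, and the projector's output width are assumptions for illustration rather than the authors' exact implementation.

```python
# Sketch of a UniRes-style encoder: grid -> ViT patches -> 2x2 pooling -> MLP projector.
import torch
import torch.nn as nn
from transformers import CLIPVisionModel

class UniResSketch(nn.Module):
    def __init__(self, llm_dim: int = 3584):  # 3584 assumed (Qwen2-7B hidden size)
        super().__init__()
        self.vit = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")
        vit_dim = self.vit.config.hidden_size           # 1024 for ViT-L
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)  # 24x24 -> 12x12 patches
        self.proj = nn.Sequential(                       # 2-layer MLP projector
            nn.Linear(vit_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, grids: torch.Tensor) -> torch.Tensor:
        """grids: (N, 3, 336, 336) image tiles or video frames."""
        feats = self.vit(pixel_values=grids).last_hidden_state[:, 1:]  # drop CLS -> (N, 576, 1024)
        n, p, d = feats.shape
        side = int(p ** 0.5)                                    # 24 patches per side
        feats = feats.transpose(1, 2).reshape(n, d, side, side)
        feats = self.pool(feats).flatten(2).transpose(1, 2)     # (N, 144, 1024)
        return self.proj(feats)                                 # (N, 144, llm_dim)

# Flattening the per-grid tokens in order yields one long visual sequence that
# the extended-context LLM consumes just like text.
```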

Benchmarking and Results

V-NIAH

The paper introduces the Visual Needle-In-A-Haystack (V-NIAH) benchmark, a purely synthetic test of an LMM's ability to locate and interpret specific frames embedded in hours-long video sequences. LongVA demonstrated a robust ability to process up to 3,000 frames, leveraging the extended context of its language model backbone. By contrast, LMMs without long context transfer degraded sharply once inputs exceeded their predefined context lengths, underscoring the effectiveness of LongVA's approach.
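The construction of a V-NIAH-style sample can be sketched as follows: a needle image with an associated question is spliced into a long sequence of distractor frames at a controlled depth, and accuracy is scored over a grid of (total frames, needle depth) settings. The dummy frames, needle, and question below are hypothetical placeholders, not assets from the benchmark.

```python
import random
from PIL import Image

def build_vniah_sample(haystack_frames, needle_frame, depth: float):
    """Insert the needle frame at relative position `depth` in [0, 1]."""
    assert 0.0 <= depth <= 1.0
    idx = int(round(depth * len(haystack_frames)))
    return haystack_frames[:idx] + [needle_frame] + haystack_frames[idx:]

# Dummy stand-ins: in the real benchmark the haystack comes from hours-long
# videos and each needle is an image paired with a question only it can answer.
haystack = [Image.new("RGB", (336, 336), (i % 256, 0, 0)) for i in range(200)]
needle = Image.new("RGB", (336, 336), (0, 255, 0))
question = "What word is written on the sign in the inserted frame?"  # hypothetical query

frames = build_vniah_sample(haystack, needle, depth=random.random())
# `frames` is encoded (e.g., with a UniRes-style encoder) and fed to the model
# together with `question`; scoring over a grid of (frame count, depth)
# settings produces the V-NIAH heatmap.
```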

Video-MME Performance

LongVA's zero-shot performance on Video-MME, a comprehensive video question-answering benchmark, validates the practical applicability of long context transfer. LongVA outperformed other models, including some at larger scales, achieving state-of-the-art results among 7B-scale models. The gains held across the benchmark's short, medium, and long video subsets, reflecting the model's ability to exploit densely sampled frame inputs.
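Dense frame sampling itself is straightforward; the sketch below uniformly samples a fixed frame budget from a video using the decord library. The budget of 128 frames is an assumed value chosen to stay within the extended token limit, not the paper's evaluation setting.

```python
import numpy as np
from decord import VideoReader  # third-party video-decoding library

def sample_frames(video_path: str, num_frames: int = 128) -> np.ndarray:
    """Uniformly sample `num_frames` frames across the whole video."""
    vr = VideoReader(video_path)
    indices = np.linspace(0, len(vr) - 1, num_frames).round().astype(int)
    return vr.get_batch(indices).asnumpy()  # (num_frames, H, W, 3) uint8 array
```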

Image Benchmark Insights

Despite being optimized for long video contexts, LongVA delivers competitive results on multiple image benchmarks. Its UniRes encoding outperforms existing models on high-resolution image datasets, particularly InfoVQA, illustrating the scheme's robustness and transferability.

Implications and Future Directions

This research presents a significant stride in the domain of multimodal AI models, particularly in processing and understanding long video sequences. The methodology not only demonstrates practical feasibility but also lays the groundwork for future explorations in the alignment of extended language models with diverse modalities. Given the effectiveness of training long context LMMs on text and subsequently adapting them to visual data, this technique could potentially extend to other complex multimodal tasks.

Conclusion

In summary, the proposed long context transfer from language to vision in LongVA exemplifies a significant advancement in overcoming the limitations of contemporary LMMs. This paper effectively bridges the gap between long context language processing and comprehensive video understanding, suggesting a promising trajectory for future research and applications in AI.
