Abstract

Current language models fall short in understanding aspects of the world not easily described in words, and struggle with complex, long-form tasks. Video sequences offer valuable temporal information absent in language and static images, making them attractive for joint modeling with language. Such models could develop an understanding of both human textual knowledge and the physical world, enabling broader AI capabilities for assisting humans. However, learning from millions of tokens of video and language sequences poses challenges due to memory constraints, computational complexity, and limited datasets. To address these challenges, we curate a large dataset of diverse videos and books, utilize the Blockwise RingAttention technique to scalably train on long sequences, and gradually increase context size from 4K to 1M tokens. This paper makes the following contributions: (a) Largest context size neural network: We train one of the largest context size transformers on long video and language sequences, setting new benchmarks in difficult retrieval tasks and long video understanding. (b) Solutions for overcoming vision-language training challenges, including masked sequence packing for mixing different sequence lengths, loss weighting to balance language and vision, and a model-generated QA dataset for long-sequence chat. (c) A highly optimized implementation with RingAttention, Blockwise Transformers, masked sequence packing, and other key features for training on million-length multimodal sequences. (d) A fully open-sourced family of 7B parameter models capable of processing long text documents (LWM-Text, LWM-Text-Chat) and videos (LWM, LWM-Chat) of over 1M tokens. This work paves the way for training on massive datasets of long video and language to develop understanding of both human knowledge and the multimodal world, and to enable broader AI capabilities.

Figure: Multimodal training of LWM, expanding context size and incorporating diverse visual/video content, plus interactive capabilities.

Overview

  • The Large World Model (LWM) introduces a method to process long video sequences and textual data together, managing up to 1 million tokens.

  • Utilizes the RingAttention technique for scalable training over long sequences, avoiding the memory bottlenecks that standard attention hits at this scale.

  • Addresses vision-language training challenges, including efficient sequence training and generating a model-driven QA dataset for enhanced chat capabilities.

  • Demonstrates notable results in long video understanding and fact retrieval, and shows the model's adaptability across various task contexts.

Large World Model on Million-Length Video and Language with RingAttention

Introduction

The research presented explores the development and implementation of a model capable of jointly processing long video sequences and textual data, referred to as the Large World Model (LWM). This model is distinctive for its ability to handle context sizes of up to 1 million tokens, setting a new precedent in the field for modeling long-sequence data. The paper outlines the use of the RingAttention technique, an approach for training efficiently on very long sequences without ever materializing the full attention matrix, and thus without the memory compromises typically associated with such ambitious scale.
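RingAttention builds on blockwise computation of attention: each device keeps a block of queries while key/value blocks are passed around a ring of devices, and exact softmax statistics are accumulated online so memory stays proportional to the block size rather than the full sequence. Below is a minimal single-host sketch of that blockwise inner loop in plain NumPy; the actual device ring (exchanging KV blocks with collective permutes) and causal masking are omitted, and all names are illustrative rather than the paper's API.

    import numpy as np

    def blockwise_attention(q, k, v, block_size=512):
        """Compute exact attention block by block. Queries stay put
        while KV blocks 'rotate' past them (simulated here by a loop);
        softmax statistics are accumulated online, so the full
        seq_len x seq_len score matrix is never materialized."""
        seq_len, dim = q.shape
        scale = 1.0 / np.sqrt(dim)
        out = np.zeros_like(q)
        for qs in range(0, seq_len, block_size):
            qb = q[qs:qs + block_size]
            m = np.full(qb.shape[0], -np.inf)   # running row max
            l = np.zeros(qb.shape[0])           # running normalizer
            acc = np.zeros_like(qb)             # running weighted sum
            for ks in range(0, seq_len, block_size):  # one "ring" pass
                kb, vb = k[ks:ks + block_size], v[ks:ks + block_size]
                s = qb @ kb.T * scale           # one block of logits
                m_new = np.maximum(m, s.max(axis=-1))
                p = np.exp(s - m_new[:, None])
                correction = np.exp(m - m_new)  # rescale old statistics
                l = l * correction + p.sum(axis=-1)
                acc = acc * correction[:, None] + p @ vb
                m = m_new
            out[qs:qs + block_size] = acc / l[:, None]
        return out

    # Sanity check against naive attention on a small input.
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((256, 64)) for _ in range(3))
    s = q @ k.T / np.sqrt(64)
    p = np.exp(s - s.max(-1, keepdims=True))
    naive = (p / p.sum(-1, keepdims=True)) @ v
    assert np.allclose(blockwise_attention(q, k, v, block_size=64), naive)

Because the online rescaling makes each block's contribution exact, the result matches standard attention to numerical precision while peak memory depends only on the block size.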

Extending Context and Training Approach

The foundation of LWM's capability to manage extensive sequences lies in its training methodology and the strategic extension of context size. Principally, this includes:

  • Scalable Training and Progressive Context Extension: The implementation of RingAttention facilitates scalable training over long documents, essential for handling the memory and computational challenges inherent in processing sequences of up to 1 million tokens. The model adopts a progressive training strategy, starting with shorter sequences and gradually extending to the target length, which keeps most of the training computation at cheaper, shorter contexts.
  • Positional Encoding and Training Steps: Modifications to the positional encoding mechanism and a staged training recipe across context sizes form the core of LWM's strategy. The context size increases methodically from 4K to 1M tokens, each phase initializing from the last, to maintain stability and consistent performance (see the sketch after this list).
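The positional-encoding change amounts to growing the RoPE base frequency θ alongside the context window, so that rotation phases still distinguish positions over the longer range. A minimal sketch assuming a standard interleaved RoPE formulation; the stage schedule shown is illustrative, not the paper's exact table.

    import numpy as np

    def rope_frequencies(dim, theta):
        """Per-pair rotation frequencies for rotary position embeddings.
        Raising theta slows the rotation, stretching the encoding to
        cover longer contexts before phases wrap around."""
        return 1.0 / theta ** (np.arange(0, dim, 2) / dim)

    def apply_rope(x, theta=10_000.0):
        """Apply RoPE to x of shape (seq_len, dim); dim must be even."""
        seq_len, dim = x.shape
        angles = np.outer(np.arange(seq_len), rope_frequencies(dim, theta))
        cos, sin = np.cos(angles), np.sin(angles)
        x1, x2 = x[:, 0::2], x[:, 1::2]
        out = np.empty_like(x)
        out[:, 0::2] = x1 * cos - x2 * sin
        out[:, 1::2] = x1 * sin + x2 * cos
        return out

    # Progressive context extension: each stage resumes from the previous
    # weights, trains at a longer context, and uses a larger theta
    # (illustrative schedule, not the paper's exact numbers).
    stages = [(4_096, 1e4), (32_768, 1e6), (262_144, 1e7), (1_048_576, 5e7)]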

Solving Vision-Language Training Challenges

A critical segment of the paper discusses overcoming the hurdles associated with vision-language training. This includes masked sequence packing for efficient training across diverse sequence lengths and loss weighting to balance the contributions of the language and vision components (both sketched below). A notable addition is a model-generated QA dataset for enhancing chat capabilities over long sequences, showcasing the model's practical versatility.
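The core of masked sequence packing is a block-diagonal causal attention mask: several shorter examples are concatenated into one long sequence, but each token may attend only to earlier tokens of its own example. A minimal sketch, with an equally minimal per-modality loss weighting; the function names and the 0.5 vision weight are illustrative assumptions, not the paper's exact values.

    import numpy as np

    def packing_attention_mask(segment_ids):
        """Masked sequence packing: token i may attend to token j only
        when both belong to the same packed example (same segment id)
        and j <= i, giving a block-diagonal causal mask."""
        seg = np.asarray(segment_ids)
        same_example = seg[:, None] == seg[None, :]
        causal = np.tril(np.ones((seg.size, seg.size), dtype=bool))
        return same_example & causal

    def balanced_loss(token_loss, is_vision, vision_weight=0.5):
        """Loss weighting across modalities: down-weight vision tokens
        relative to text so neither dominates the gradient
        (the weight here is an illustrative hyperparameter)."""
        w = np.where(is_vision, vision_weight, 1.0)
        return float((token_loss * w).sum() / w.sum())

    # Two examples (lengths 3 and 2) packed into one length-5 sequence.
    mask = packing_attention_mask([0, 0, 0, 1, 1])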

Empirical Evaluation and Results

The LWM demonstrates strong empirical results across a variety of tasks, particularly in:

  • Long Video Understanding and Fact Retrieval: The model shows promising results in understanding long videos and retrieving facts from extensive contexts, significantly outperforming existing approaches in the scale of context it can search (a minimal sketch of such a retrieval probe follows this list).
  • Generalization Across Tasks: Evaluation across a range of tasks demonstrates that extending context size does not compromise the model's performance on short-context tasks, underscoring its adaptability.
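The fact-retrieval evaluations follow the common needle-in-a-haystack setup: a single target fact is buried at varying depths inside distractor text of varying total length, and retrieval accuracy is measured over the (depth, length) grid. A minimal sketch of how such a probe is constructed; the filler, needle, and question strings are illustrative placeholders, not the paper's exact prompts.

    def needle_prompt(filler, needle, depth_frac, n_fillers):
        """Bury one 'needle' fact at a chosen relative depth inside
        repeated distractor text, then ask the model to retrieve it."""
        parts = [filler] * n_fillers
        parts.insert(int(depth_frac * n_fillers), needle)
        return " ".join(parts) + "\n\nWhat is the magic number mentioned above?"

    prompt = needle_prompt(
        filler="The grass is green and the sky is blue.",
        needle="The magic number is 482913.",
        depth_frac=0.25,   # sweep depth and n_fillers to build the accuracy grid
        n_fillers=1000,
    )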

Future Directions and Implications

The research paves the way for future advancements in AI, highlighting potential areas for further exploration such as improved video tokenization, expansion into additional modalities like audio, and the enrichment of video datasets. The practical implications of this work are vast, offering insights into developing more sophisticated AI systems capable of understanding and interacting within the complex multimodal world.

Conclusion

This paper represents a significant stride towards understanding multimodal world interactions through AI, establishing a new benchmark for processing extensive video and language sequences. The introduction of the RingAttention technique and a comprehensive training framework enables the Large World Model to effectively handle previously unattainable context sizes, showcasing the potential of AI to comprehend and reason over the vast and intricate tapestry of human knowledge and activity.
