Abstract

Current language models fall short in understanding aspects of the world not easily described in words, and struggle with complex, long-form tasks. Video sequences offer valuable temporal information absent in language and static images, making them attractive for joint modeling with language. Such models could develop an understanding of both human textual knowledge and the physical world, enabling broader AI capabilities for assisting humans. However, learning from millions of tokens of video and language sequences poses challenges due to memory constraints, computational complexity, and limited datasets. To address these challenges, we curate a large dataset of diverse videos and books, utilize the Blockwise RingAttention technique to scalably train on long sequences, and gradually increase context size from 4K to 1M tokens. This paper makes the following contributions: (a) Largest context size neural network: We train one of the largest context size transformers on long video and language sequences, setting new benchmarks in difficult retrieval tasks and long video understanding. (b) Solutions for overcoming vision-language training challenges, including masked sequence packing for mixing different sequence lengths, loss weighting to balance language and vision, and a model-generated QA dataset for long-sequence chat. (c) A highly optimized implementation with RingAttention, Blockwise Transformers, masked sequence packing, and other key features for training on million-length multimodal sequences. (d) A fully open-sourced family of 7B parameter models capable of processing long text documents (LWM-Text, LWM-Text-Chat) and videos (LWM, LWM-Chat) of over 1M tokens. This work paves the way for training on massive datasets of long video and language to develop understanding of both human knowledge and the multimodal world, and to enable broader AI capabilities.

Figure: Multimodal training of LWM, expanding context size and incorporating diverse visual/video content, plus interactive capabilities.

Overview

  • The Large World Model (LWM) introduces a method to process long video sequences and textual data together, managing up to 1 million tokens.

  • Utilizes the RingAttention technique for scalable training over long sequences, avoiding the memory bottlenecks that standard attention hits at this scale.

  • Addresses vision-language training challenges, including efficient sequence training and generating a model-driven QA dataset for enhanced chat capabilities.

  • Demonstrates notable results in long video understanding and fact retrieval, and shows the model's adaptability across various task contexts.

Large World Model on Million-Length Video and Language with RingAttention

Introduction

The research presented explores the development and implementation of a model capable of jointly processing long video sequences and textual data, referred to as the Large World Model (LWM). This model is distinctive for its ability to handle context sizes of up to 1 million tokens, setting a new precedent in the field for modeling long-sequence data. The paper outlines the use of the RingAttention technique, an approach for training efficiently on very long sequences without ever materializing the full attention matrix, and thus without the memory compromises typically associated with such ambitious scale.
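RingAttention builds on blockwise computation of attention: each device keeps a block of queries while key/value blocks are passed around a ring of devices, and exact softmax statistics are accumulated online so memory stays proportional to the block size rather than the full sequence. Below is a minimal single-host sketch of that blockwise inner loop in plain NumPy; the actual device ring (exchanging KV blocks with collective permutes) and causal masking are omitted, and all names are illustrative rather than the paper's API.

    import numpy as np

    def blockwise_attention(q, k, v, block_size=512):
        """Compute exact attention block by block. Queries stay put
        while KV blocks 'rotate' past them (simulated here by a loop);
        softmax statistics are accumulated online, so the full
        seq_len x seq_len score matrix is never materialized."""
        seq_len, dim = q.shape
        scale = 1.0 / np.sqrt(dim)
        out = np.zeros_like(q)
        for qs in range(0, seq_len, block_size):
            qb = q[qs:qs + block_size]
            m = np.full(qb.shape[0], -np.inf)   # running row max
            l = np.zeros(qb.shape[0])           # running normalizer
            acc = np.zeros_like(qb)             # running weighted sum
            for ks in range(0, seq_len, block_size):  # one "ring" pass
                kb, vb = k[ks:ks + block_size], v[ks:ks + block_size]
                s = qb @ kb.T * scale           # one block of logits
                m_new = np.maximum(m, s.max(axis=-1))
                p = np.exp(s - m_new[:, None])
                correction = np.exp(m - m_new)  # rescale old statistics
                l = l * correction + p.sum(axis=-1)
                acc = acc * correction[:, None] + p @ vb
                m = m_new
            out[qs:qs + block_size] = acc / l[:, None]
        return out

    # Sanity check against naive attention on a small input.
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((256, 64)) for _ in range(3))
    s = q @ k.T / np.sqrt(64)
    p = np.exp(s - s.max(-1, keepdims=True))
    naive = (p / p.sum(-1, keepdims=True)) @ v
    assert np.allclose(blockwise_attention(q, k, v, block_size=64), naive)

Because the online rescaling makes each block's contribution exact, the result matches standard attention to numerical precision while peak memory depends only on the block size.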

Extending Context and Training Approach

The foundation of LWM's capability to manage extensive sequences lies in its training methodology and the strategic extension of context size. Principally, this includes:

  • Scalable Training and Progressive Context Extension: The implementation of RingAttention facilitates scalable training over long documents, essential for handling the memory and computational challenges inherent in processing sequences of up to 1 million tokens. The model adopts a progressive training strategy, starting with shorter sequences and gradually extending to the target length, which keeps most of the training computation at cheaper, shorter contexts.
  • Positional Encoding and Training Steps: Modifications to the positional encoding mechanism and a staged training recipe across context sizes form the core of LWM's strategy. The context size increases methodically from 4K to 1M tokens, each phase initializing from the last, to maintain stability and consistent performance (see the sketch after this list).
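The positional-encoding change amounts to growing the RoPE base frequency θ alongside the context window, so that rotation phases still distinguish positions over the longer range. A minimal sketch assuming a standard interleaved RoPE formulation; the stage schedule shown is illustrative, not the paper's exact table.

    import numpy as np

    def rope_frequencies(dim, theta):
        """Per-pair rotation frequencies for rotary position embeddings.
        Raising theta slows the rotation, stretching the encoding to
        cover longer contexts before phases wrap around."""
        return 1.0 / theta ** (np.arange(0, dim, 2) / dim)

    def apply_rope(x, theta=10_000.0):
        """Apply RoPE to x of shape (seq_len, dim); dim must be even."""
        seq_len, dim = x.shape
        angles = np.outer(np.arange(seq_len), rope_frequencies(dim, theta))
        cos, sin = np.cos(angles), np.sin(angles)
        x1, x2 = x[:, 0::2], x[:, 1::2]
        out = np.empty_like(x)
        out[:, 0::2] = x1 * cos - x2 * sin
        out[:, 1::2] = x1 * sin + x2 * cos
        return out

    # Progressive context extension: each stage resumes from the previous
    # weights, trains at a longer context, and uses a larger theta
    # (illustrative schedule, not the paper's exact numbers).
    stages = [(4_096, 1e4), (32_768, 1e6), (262_144, 1e7), (1_048_576, 5e7)]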

Solving Vision-Language Training Challenges

A critical segment of the paper discusses overcoming the hurdles associated with vision-language training. This includes masked sequence packing for efficient training across diverse sequence lengths and loss weighting to balance the contributions of the language and vision components (both sketched below). A notable addition is a model-generated QA dataset for enhancing chat capabilities over long sequences, showcasing the model's practical versatility.
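The core of masked sequence packing is a block-diagonal causal attention mask: several shorter examples are concatenated into one long sequence, but each token may attend only to earlier tokens of its own example. A minimal sketch, with an equally minimal per-modality loss weighting; the function names and the 0.5 vision weight are illustrative assumptions, not the paper's exact values.

    import numpy as np

    def packing_attention_mask(segment_ids):
        """Masked sequence packing: token i may attend to token j only
        when both belong to the same packed example (same segment id)
        and j <= i, giving a block-diagonal causal mask."""
        seg = np.asarray(segment_ids)
        same_example = seg[:, None] == seg[None, :]
        causal = np.tril(np.ones((seg.size, seg.size), dtype=bool))
        return same_example & causal

    def balanced_loss(token_loss, is_vision, vision_weight=0.5):
        """Loss weighting across modalities: down-weight vision tokens
        relative to text so neither dominates the gradient
        (the weight here is an illustrative hyperparameter)."""
        w = np.where(is_vision, vision_weight, 1.0)
        return float((token_loss * w).sum() / w.sum())

    # Two examples (lengths 3 and 2) packed into one length-5 sequence.
    mask = packing_attention_mask([0, 0, 0, 1, 1])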

Empirical Evaluation and Results

The LWM demonstrates strong empirical results across a variety of tasks, particularly in:

  • Long Video Understanding and Fact Retrieval: The model shows promising results in understanding long videos and retrieving facts from extensive contexts, significantly outperforming existing approaches in the scale of context it can search (a minimal sketch of such a retrieval probe follows this list).
  • Generalization Across Tasks: Evaluation across a range of tasks demonstrates that extending context size does not compromise the model's performance on short-context tasks, underscoring its adaptability.
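The fact-retrieval evaluations follow the common needle-in-a-haystack setup: a single target fact is buried at varying depths inside distractor text of varying total length, and retrieval accuracy is measured over the (depth, length) grid. A minimal sketch of how such a probe is constructed; the filler, needle, and question strings are illustrative placeholders, not the paper's exact prompts.

    def needle_prompt(filler, needle, depth_frac, n_fillers):
        """Bury one 'needle' fact at a chosen relative depth inside
        repeated distractor text, then ask the model to retrieve it."""
        parts = [filler] * n_fillers
        parts.insert(int(depth_frac * n_fillers), needle)
        return " ".join(parts) + "\n\nWhat is the magic number mentioned above?"

    prompt = needle_prompt(
        filler="The grass is green and the sky is blue.",
        needle="The magic number is 482913.",
        depth_frac=0.25,   # sweep depth and n_fillers to build the accuracy grid
        n_fillers=1000,
    )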

Future Directions and Implications

The research paves the way for future advancements in AI, highlighting potential areas for further exploration such as improved video tokenization, expansion into additional modalities like audio, and the enrichment of video datasets. The practical implications of this work are vast, offering insights into developing more sophisticated AI systems capable of understanding and interacting within the complex multimodal world.

Conclusion

This paper represents a significant stride towards understanding multimodal world interactions through AI, establishing a new benchmark for processing extensive video and language sequences. The introduction of the RingAttention technique and a comprehensive training framework enables the Large World Model to effectively handle previously unattainable context sizes, showcasing the potential of AI to comprehend and reason over the vast and intricate tapestry of human knowledge and activity.
