Data Engineering for Scaling Language Models to 128K Context

(2402.10171)
Published Feb 15, 2024 in cs.CL and cs.AI

Abstract

We study the continual pretraining recipe for scaling language models' context lengths to 128K, with a focus on data engineering. We hypothesize that long context modeling, in particular the ability to utilize information at arbitrary input locations, is a capability that is mostly already acquired through large-scale pretraining, and that this capability can be readily extended to contexts substantially longer than seen during training (e.g., 4K to 128K) through lightweight continual pretraining on an appropriate data mixture. We investigate the quantity and quality of the data for continual pretraining: (1) for quantity, we show that 500 million to 5 billion tokens are enough to enable the model to retrieve information anywhere within the 128K context; (2) for quality, our results equally emphasize domain balance and length upsampling. Concretely, we find that naively upsampling longer data on certain domains like books, a common practice of existing work, gives suboptimal performance, and that a balanced domain mixture is important. We demonstrate that continual pretraining of the full model on 1B-5B tokens of such data is an effective and affordable strategy for scaling the context length of language models to 128K. Our recipe outperforms strong open-source long-context models and closes the gap to frontier models like GPT-4 128K.

Figure: the model's retrieval performance plateaus at around 5B tokens of continual pretraining, suggesting an inherent retrieval capability that is largely acquired during the original large-scale pretraining.

Overview

  • The paper presents a continual-pretraining recipe, centered on data engineering, that extends language models to context windows of up to 128K tokens.

  • It emphasizes the importance of not only the quantity but also the quality of data for effectively scaling language models.

  • Keeping the domain mixture balanced while upsampling long documents within each domain improves performance on extended-context tasks.

  • The resulting models narrow the performance gap to proprietary models like GPT-4 128K and broaden the range of practical long-context applications in AI research and practice.

Enhancing Language Models for Extended Context Understanding through Data Engineering Strategies

Introduction to Extended Context Capacity in Language Models

Language models have steadily evolved, displaying remarkable capabilities in generating coherent and contextually relevant text. Recent advancements have pushed the boundaries further by expanding the context window of these models to an impressive 128K tokens. Such an expansion enables the models to delve into applications that were previously infeasible, including multi-document comprehension, in-depth code analysis, and comprehensive dialog systems. Central to this progression is not just the advancement in model architecture, but significantly, the meticulous engineering of the data that feeds these models.

Data Engineering: The Core of Scaling Context

The ability of language models to parse and leverage information scattered across vastly extended contexts is fundamental to their usefulness at this scale. The challenge is not merely extending the model's capacity to ingest longer inputs but ensuring it can effectively utilize the expanded horizon. This hinges on the careful selection, allocation, and engineering of the continual-pretraining data, a process that is crucial yet delicate given the scale at which these models already operate.

Quantitative and Qualitative Data Considerations

For scaling language models to parse and understand extended context lengths effectively, both the quantity and quality of data become pivotal. From a quantitative perspective, research indicates that a range of 500 million to 5 billion tokens suffices for these models to harness long contexts effectively. Qualitatively, the balance of domains within the training data and a methodical approach to upsampling data lengths emerge as critical factors. Notably, naive upsampling of longer texts from specific domains, a common practice, results in subpar model performance. Instead, maintaining a balanced domain mixture while upsampling long sequences within each domain is recommended. This approach helps in preserving the integrity of domain diversity, which is imperative for the model's general applicability.
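
This per-domain length upsampling can be made concrete with a small sampling sketch. The code below is a minimal, illustrative implementation, assuming documents carry a text and a domain field, using a crude whitespace token count, and picking an arbitrary long-document threshold and upsampling factor; none of these specifics come from the paper. The key property is that the relative weight of each domain stays fixed, while long documents are favored only within their own domain.

```python
import random
from collections import defaultdict

# Illustrative constants, not values from the paper.
LONG_DOC_TOKENS = 4096      # threshold separating "long" from "short" documents
LONG_UPSAMPLE_FACTOR = 5.0  # how much more often long documents are drawn

def doc_length(doc):
    # Crude whitespace token count; a real pipeline would use the model's tokenizer.
    return len(doc["text"].split())

def build_sampler(corpus, domain_weights):
    """corpus: list of {'text': str, 'domain': str}; domain_weights: dict domain -> weight."""
    by_domain = defaultdict(list)
    for doc in corpus:
        by_domain[doc["domain"]].append(doc)

    def sample():
        # Step 1: pick a domain according to the original (balanced) mixture,
        # so the domain proportions are unchanged by the upsampling below.
        domains = list(domain_weights)
        weights = [domain_weights[d] for d in domains]
        domain = random.choices(domains, weights=weights, k=1)[0]

        # Step 2: within the chosen domain, favor long documents.
        docs = by_domain[domain]
        doc_weights = [
            LONG_UPSAMPLE_FACTOR if doc_length(d) >= LONG_DOC_TOKENS else 1.0
            for d in docs
        ]
        return random.choices(docs, weights=doc_weights, k=1)[0]

    return sample
```

Documents drawn this way would then be concatenated into long training sequences to supply the roughly 1B-5B tokens the paper uses for lightweight continual pretraining of the full model.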

Experimental Insights and Achievements

The advocated data engineering strategy significantly narrows the performance gap between open-source models and state-of-the-art proprietary models like GPT-4 128K. By carefully tailoring the continual-pretraining data, the researchers not only preserve the model's existing capabilities but markedly improve its performance on extended-context tasks. The evaluations range from synthetic retrieval probes, such as the Needle-in-a-Haystack test, in which a single planted fact must be recovered from an arbitrary position in the long input, to real-world tasks such as BookQA, demonstrating the model's accuracy and versatility across the full 128K window.
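
To illustrate what such a retrieval test measures, the sketch below builds a toy needle-in-a-haystack probe; the needle text, filler sentence, token-count heuristic, and the model_generate callable are all hypothetical stand-ins rather than the paper's exact setup. A single fact is planted at a chosen depth inside a roughly 128K-token context, and the model's answer is checked by substring match.

```python
# Hypothetical needle-in-a-haystack probe; names and texts are illustrative.
NEEDLE = "The secret passphrase is 'blue-harbor-42'."
QUESTION = "What is the secret passphrase mentioned in the text?"
FILLER = "The sky was clear and the market stayed quiet all afternoon. "

def build_prompt(context_tokens=128_000, depth=0.5, tokens_per_sentence=12):
    """Bury the needle at a fractional depth inside a long filler context."""
    n_sentences = context_tokens // tokens_per_sentence  # rough token budget
    sentences = [FILLER] * n_sentences
    sentences.insert(int(n_sentences * depth), NEEDLE + " ")
    context = "".join(sentences)
    return f"{context}\n\nQuestion: {QUESTION}\nAnswer:"

def is_correct(model_answer: str) -> bool:
    # The probe counts as passed if the planted passphrase is reproduced.
    return "blue-harbor-42" in model_answer

# Usage, assuming model_generate maps a prompt string to a completion:
# prompt = build_prompt(depth=0.9)          # needle near the end of the context
# print(is_correct(model_generate(prompt)))
```

Sweeping the depth from 0.0 to 1.0 and varying the context length yields the familiar retrieval heat map used to visualize whether information can be recovered from anywhere in the 128K window.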

The Future of Extended Context in AI

The implications of these findings are far-reaching, especially concerning the theoretical and practical applications of AI. By extending the language models' understanding to contexts well beyond the traditional limits, new vistas in AI research and application are unveiled. This includes enhanced multi-document comprehension and more profound insights across vast datasets, potentially revolutionizing how information is processed, understood, and generated by AI.

Conclusion

The journey to scale language models to comprehend extended contexts up to 128K tokens has underscored the significance of data engineering. Through a sophisticated blend of quantitative adequacy and qualitative balance, the venture has not only bridged the gap to the leading frontier models but also set a new precedent for future explorations in AI. As the field continues to progress, the focus on refining data engineering techniques will remain at the forefront, paving the way for even more capable and versatile language models.
