Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum (2405.13226v2)
Abstract: LLMs are commonly trained on datasets of fixed-length token sequences. These datasets are created by randomly concatenating documents of various lengths and then chunking them into sequences of a predetermined target length (concat-and-chunk). Recent attention implementations mask cross-document attention, reducing the effective length of a chunk of tokens. Additionally, training on long sequences becomes computationally prohibitive due to the quadratic cost of attention. In this study, we introduce dataset decomposition, a novel variable sequence length training technique, to tackle these challenges. We decompose a dataset into a union of buckets, each containing sequences of the same size extracted from a unique document. During training, we use variable sequence lengths and batch sizes, sampling simultaneously from all buckets with a curriculum. In contrast to the concat-and-chunk baseline, which incurs a fixed attention cost at every step of training, our proposed method incurs a computational cost proportional to the actual document lengths at each step, resulting in significant savings in training time. We train an 8k context-length 1B model at the same cost as a 2k context-length model trained with the baseline approach. Experiments on a web-scale corpus demonstrate that our approach significantly enhances performance on standard language evaluations and long-context benchmarks, reaching target accuracy up to 6x faster than the baseline. Our method not only enables efficient pretraining on long sequences but also scales effectively with dataset size. Lastly, we shed light on a critical yet understudied aspect of training LLMs: the distribution and curriculum of sequence lengths, which results in a non-negligible difference in performance.
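To make the idea concrete, below is a minimal Python sketch of the decomposition described in the abstract: each tokenized document is split into single-document chunks (here, of power-of-two lengths, an assumption made for illustration), chunks are grouped into buckets by length, and batches are drawn from one bucket at a time with the batch size scaled so that the token count per optimization step stays roughly constant. The function names (`decompose_document`, `build_buckets`, `sample_batches`) and the simple short-to-long ordering are illustrative choices, not the paper's exact recipe, which samples from all buckets under a length curriculum.

```python
import random
from collections import defaultdict


def decompose_document(tokens, max_len=8192):
    """Split one document's token list into power-of-two-length chunks (illustrative)."""
    chunks, start, remaining = [], 0, len(tokens)
    while remaining > 0:
        # Largest power of two that fits in both the remainder and max_len.
        size = min(1 << (remaining.bit_length() - 1), max_len)
        chunks.append(tokens[start:start + size])
        start += size
        remaining -= size
    return chunks


def build_buckets(documents, max_len=8192):
    """Group chunks from all documents into buckets keyed by sequence length."""
    buckets = defaultdict(list)
    for doc in documents:
        for chunk in decompose_document(doc, max_len):
            buckets[len(chunk)].append(chunk)
    return buckets


def sample_batches(buckets, tokens_per_step=8192, seed=0):
    """Yield batches of equal-length sequences with a roughly constant token count per step."""
    rng = random.Random(seed)
    for seq_len in sorted(buckets):  # simplified short-to-long ordering, not the paper's curriculum
        seqs = list(buckets[seq_len])
        rng.shuffle(seqs)
        batch_size = max(1, tokens_per_step // seq_len)
        for i in range(0, len(seqs) - batch_size + 1, batch_size):
            yield seqs[i:i + batch_size]  # every sequence in this batch shares one length


# Example usage on toy "documents" of token ids.
docs = [list(range(n)) for n in (3, 17, 100, 5000, 9000)]
buckets = build_buckets(docs, max_len=8192)
for batch in sample_batches(buckets, tokens_per_step=64):
    print(len(batch), "sequences of length", len(batch[0]))
```

Because every batch holds same-length sequences drawn from single documents, no cross-document attention masking is needed, and with a constant token budget per step the batch size shrinks as the sequence length grows, so the per-step attention cost tracks the sequence length actually being trained on rather than a fixed chunk length.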