Emergent Mind

Fewer Truncations Improve Language Modeling

(2404.10830)
Published Apr 16, 2024 in cs.CL , cs.AI , and cs.LG

Abstract

In large language model training, input documents are typically concatenated together and then split into sequences of equal length to avoid padding tokens. Despite its efficiency, the concatenation approach compromises data integrity -- it inevitably breaks many documents into incomplete pieces, leading to excessive truncations that hinder the model from learning to compose logically coherent and factually consistent content that is grounded on the complete context. To address the issue, we propose Best-fit Packing, a scalable and efficient method that packs documents into training sequences through length-aware combinatorial optimization. Our method completely eliminates unnecessary truncations while retaining the same training efficiency as concatenation. Empirical results from both text and code pre-training show that our method achieves superior performance (e.g., relatively +4.7% on reading comprehension; +16.8% in context following; and +9.2% on program synthesis), and reduces closed-domain hallucination effectively by up to 58.3%.

Figure: Comparison of Best-fit Packing and baseline concatenation techniques in document tokenization, showing efficiency in sequence grouping.

Overview

  • Fewer Truncations Improve Language Modeling introduces a new approach called Best-fit Packing to mitigate issues caused by excessive truncation in LLMs, enhancing coherence and factual consistency in model outputs.

  • Best-fit Packing reframes sequence packing as a combinatorial optimization problem: over-long documents are segmented and the resulting chunks are packed with an optimized Best-Fit Decreasing algorithm, preserving context and improving runtime efficiency by 60% at the billion-document scale.

  • Empirical validations demonstrate notable performance improvements and significant reductions in hallucination across tasks like reading comprehension and program synthesis, ensuring better model accuracy and effectiveness.

  • The research provides both empirical evidence and theoretical insights into the adverse effects of truncations on LLMs and suggests future directions for optimizing LLM training methodologies.

Fewer Truncations Improve Language Modeling: Introducing Best-fit Packing

Introduction to Best-fit Packing and Truncation Issues

The prevalent training approach for LLMs concatenates input documents and then splits the stream into fixed-length sequences. Although efficient, this convention truncates documents excessively, fragmenting content that is integral to coherence and factual consistency. To counteract this, the authors propose "Best-fit Packing", which reframes sequence packing as a combinatorial optimization problem that remains efficient and scalable while substantially reducing unnecessary truncations. The results indicate superior performance and reduced hallucination across various pre-training scenarios.
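To make the baseline concrete, here is a minimal illustrative sketch (not the paper's code) of concatenate-then-split over tokenized documents, together with a count of how many documents get cut across sequence boundaries. The function names and the token-list representation are assumptions for illustration only.

```python
from typing import List

def concat_and_split(docs: List[List[int]], max_len: int) -> List[List[int]]:
    """Baseline: concatenate all tokenized documents into one stream, then cut
    it into fixed-length training sequences. Documents that straddle a cut
    boundary are broken into incomplete pieces."""
    stream = [tok for doc in docs for tok in doc]
    return [stream[i:i + max_len] for i in range(0, len(stream), max_len)]

def count_truncated(docs: List[List[int]], max_len: int) -> int:
    """Count documents that end up split across sequence boundaries."""
    truncated, offset = 0, 0
    for doc in docs:
        start, end = offset, offset + len(doc)
        # a document is truncated if a sequence boundary falls strictly inside it
        if start // max_len != (end - 1) // max_len:
            truncated += 1
        offset = end
    return truncated
```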

Best-fit Packing: A Methodological Advancement

Best-fit Packing begins by segmenting documents longer than the model's maximum sequence length into shorter chunks. These chunks are then packed into training sequences so that no chunk is split further, maximizing the context preserved in each sequence. The packing is cast as a bin-packing problem and solved with an optimized Best-Fit Decreasing algorithm that is scalable and retains training efficiency comparable to the concatenation method. Remarkably, the method achieves a 60% runtime improvement at the billion-document scale while reaching compactness on par with traditional techniques.
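As a rough illustration of the packing step, the following sketch applies textbook Best-Fit Decreasing: each chunk goes into the fullest training sequence that can still hold it, and a new sequence is opened when none fits. This is a generic version for clarity, not the paper's optimized, billion-document-scale implementation; the function name and data layout are assumptions.

```python
import bisect
from typing import List, Tuple

def best_fit_decreasing(chunk_lens: List[int], max_len: int) -> List[List[int]]:
    """Pack chunk lengths (each <= max_len) into sequences of capacity max_len.
    Chunks are considered longest-first; each is placed into the sequence with
    the smallest remaining capacity that still fits it (the 'best fit')."""
    remaining: List[Tuple[int, int]] = []  # (remaining_capacity, bin_index), sorted
    bins: List[List[int]] = []
    for length in sorted(chunk_lens, reverse=True):
        # find the tightest-fitting open sequence with capacity >= length
        i = bisect.bisect_left(remaining, (length, -1))
        if i < len(remaining):
            cap, b = remaining.pop(i)
            bins[b].append(length)
            new_cap = cap - length
        else:
            # no open sequence fits: start a new one
            bins.append([length])
            b, new_cap = len(bins) - 1, max_len - length
        bisect.insort(remaining, (new_cap, b))
    return bins
```

For example, with max_len=10 and chunk lengths [7, 5, 4, 3], this sketch returns [[7, 3], [5, 4]]: two fully used sequences with no chunk split further.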

Empirical Validation and Performance Metrics

The empirical validation involved pre-training models on textual as well as code datasets, evaluating them across a spectrum of tasks including reading comprehension, natural language inference, context following, and program synthesis. Key findings are:

  • Performance Improvement: Relative improvements of up to +16.8% in context following tasks and +15.0% in program synthesis, validating that fewer truncations correlate with better model performance.
  • Reduction in Hallucination: Effective reduction in closed-domain hallucination by up to 58.3%, crucial for tasks like program synthesis where factual accuracy is paramount.
  • Scalability and Efficiency: Demonstrated scalability to billions of documents while maintaining compactness and computational efficiency similar to the concatenation approach.

Theoretical Insights and Analytical Validation

The paper also explores a simplified analytical model to demonstrate the adverse effects of truncation on model accuracy. This stochastic model analytically substantiates the empirical observation that training on truncated sequences leads to inferior learning outcomes, even when data availability is not a constraint.
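One way to see why more data cannot compensate (an illustrative information-theoretic view, not the paper's exact model): if the next token Y depends on context C that truncation removes, the best achievable cross-entropy rises from the conditional entropy to the marginal entropy, and the gap is the mutual information I(C; Y).

```latex
\begin{aligned}
\min_{q(\cdot\mid C)} \; \mathbb{E}\!\left[-\log q(Y \mid C)\right] &= H(Y \mid C),\\
\min_{q(\cdot)} \; \mathbb{E}\!\left[-\log q(Y)\right] &= H(Y) = H(Y \mid C) + I(C; Y) \;\ge\; H(Y \mid C).
\end{aligned}
```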

Future Directions in LLM Training

Best-fit Packing potentially sets a precedent for future LLM training methodologies that prioritize data integrity without compromising efficiency. It opens avenues for exploring additional data packing strategies and their integration into standard LLM training pipelines. Additionally, this approach could enhance not only base model pre-training but also task-specific fine-tuning phases.

Conclusion: Towards More Coherent and Less Hallucinatory LLMs

In summary, Best-fit Packing addresses a critical flaw in the traditional LLM training regimen by mitigating excessive document truncation, thus enhancing logical coherence and factual consistency across model outputs. This method not only supports existing findings regarding the importance of comprehensive context in model training but also pioneers an efficient, scalable solution to a previously overlooked but significant problem.
