Emergent Mind

Fewer Truncations Improve Language Modeling

(2404.10830)
Published Apr 16, 2024 in cs.CL , cs.AI , and cs.LG

Abstract

In large language model training, input documents are typically concatenated together and then split into sequences of equal length to avoid padding tokens. Despite its efficiency, the concatenation approach compromises data integrity -- it inevitably breaks many documents into incomplete pieces, leading to excessive truncations that hinder the model from learning to compose logically coherent and factually consistent content that is grounded on the complete context. To address the issue, we propose Best-fit Packing, a scalable and efficient method that packs documents into training sequences through length-aware combinatorial optimization. Our method completely eliminates unnecessary truncations while retaining the same training efficiency as concatenation. Empirical results from both text and code pre-training show that our method achieves superior performance (e.g., relatively +4.7% on reading comprehension; +16.8% in context following; and +9.2% on program synthesis), and reduces closed-domain hallucination effectively by up to 58.3%.

Figure: Comparison of Best-fit Packing and baseline concatenation techniques in document tokenization, showing efficiency in sequence grouping.

Overview

  • Fewer Truncations Improve Language Modeling introduces a new approach called Best-fit Packing to mitigate issues caused by excessive truncation in LLMs, enhancing coherence and factual consistency in model outputs.

  • Best-fit Packing reframes sequence packing as a combinatorial optimization problem: over-long documents are segmented and the resulting chunks are packed with an optimized Best-Fit Decreasing algorithm, preserving context and improving runtime efficiency by 60% at the billion-document scale.

  • Empirical validations demonstrate notable performance improvements and significant reductions in hallucination across tasks like reading comprehension and program synthesis, ensuring better model accuracy and effectiveness.

  • The research provides both empirical evidence and theoretical insights into the adverse effects of truncations on LLMs and suggests future directions for optimizing LLM training methodologies.

Fewer Truncations Improve Language Modeling: Introducing Best-fit Packing

Introduction to Best-fit Packing and Truncation Issues

The prevalent training approach for LLMs concatenates input documents and then splits the stream into fixed-length sequences. Although efficient, this convention truncates documents excessively, fragmenting content that is integral to coherence and factual consistency. To counteract this, the authors propose "Best-fit Packing", which reframes sequence packing as a combinatorial optimization problem that remains efficient and scalable while substantially reducing unnecessary truncations. The results indicate superior performance and reduced hallucination across various pre-training scenarios.
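To make the baseline concrete, here is a minimal illustrative sketch (not the paper's code) of concatenate-then-split over tokenized documents, together with a count of how many documents get cut across sequence boundaries. The function names and the token-list representation are assumptions for illustration only.

```python
from typing import List

def concat_and_split(docs: List[List[int]], max_len: int) -> List[List[int]]:
    """Baseline: concatenate all tokenized documents into one stream, then cut
    it into fixed-length training sequences. Documents that straddle a cut
    boundary are broken into incomplete pieces."""
    stream = [tok for doc in docs for tok in doc]
    return [stream[i:i + max_len] for i in range(0, len(stream), max_len)]

def count_truncated(docs: List[List[int]], max_len: int) -> int:
    """Count documents that end up split across sequence boundaries."""
    truncated, offset = 0, 0
    for doc in docs:
        start, end = offset, offset + len(doc)
        # a document is truncated if a sequence boundary falls strictly inside it
        if start // max_len != (end - 1) // max_len:
            truncated += 1
        offset = end
    return truncated
```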

Best-fit Packing: A Methodological Advancement

Best-fit Packing begins by segmenting documents longer than the model's maximum sequence length into shorter chunks. These chunks are then packed into training sequences so that no chunk is split further, maximizing the context preserved in each sequence. The packing is cast as a bin-packing problem and solved with an optimized Best-Fit Decreasing algorithm that is scalable and retains training efficiency comparable to the concatenation method. Remarkably, the method achieves a 60% runtime improvement at the billion-document scale while reaching compactness on par with traditional techniques.
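As a rough illustration of the packing step, the following sketch applies textbook Best-Fit Decreasing: each chunk goes into the fullest training sequence that can still hold it, and a new sequence is opened when none fits. This is a generic version for clarity, not the paper's optimized, billion-document-scale implementation; the function name and data layout are assumptions.

```python
import bisect
from typing import List, Tuple

def best_fit_decreasing(chunk_lens: List[int], max_len: int) -> List[List[int]]:
    """Pack chunk lengths (each <= max_len) into sequences of capacity max_len.
    Chunks are considered longest-first; each is placed into the sequence with
    the smallest remaining capacity that still fits it (the 'best fit')."""
    remaining: List[Tuple[int, int]] = []  # (remaining_capacity, bin_index), sorted
    bins: List[List[int]] = []
    for length in sorted(chunk_lens, reverse=True):
        # find the tightest-fitting open sequence with capacity >= length
        i = bisect.bisect_left(remaining, (length, -1))
        if i < len(remaining):
            cap, b = remaining.pop(i)
            bins[b].append(length)
            new_cap = cap - length
        else:
            # no open sequence fits: start a new one
            bins.append([length])
            b, new_cap = len(bins) - 1, max_len - length
        bisect.insort(remaining, (new_cap, b))
    return bins
```

For example, with max_len=10 and chunk lengths [7, 5, 4, 3], this sketch returns [[7, 3], [5, 4]]: two fully used sequences with no chunk split further.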

Empirical Validation and Performance Metrics

The empirical validation involved pre-training models on textual as well as code datasets, evaluating them across a spectrum of tasks including reading comprehension, natural language inference, context following, and program synthesis. Key findings are:

  • Performance Improvement: Relative improvements of up to +16.8% in context following tasks and +15.0% in program synthesis, validating that fewer truncations correlate with better model performance.
  • Reduction in Hallucination: Effective reduction in closed-domain hallucination by up to 58.3%, crucial for tasks like program synthesis where factual accuracy is paramount.
  • Scalability and Efficiency: Demonstrated scalability to billions of documents while maintaining compactness and computational efficiency similar to the concatenation approach.

Theoretical Insights and Analytical Validation

The paper also explores a simplified analytical model to demonstrate the adverse effects of truncation on model accuracy. This stochastic model analytically substantiates the empirical observation that training on truncated sequences leads to inferior learning outcomes, even when data availability is not a constraint.
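One way to see why more data cannot compensate (an illustrative information-theoretic view, not the paper's exact model): if the next token Y depends on context C that truncation removes, the best achievable cross-entropy rises from the conditional entropy to the marginal entropy, and the gap is the mutual information I(C; Y).

```latex
\begin{aligned}
\min_{q(\cdot\mid C)} \; \mathbb{E}\!\left[-\log q(Y \mid C)\right] &= H(Y \mid C),\\
\min_{q(\cdot)} \; \mathbb{E}\!\left[-\log q(Y)\right] &= H(Y) = H(Y \mid C) + I(C; Y) \;\ge\; H(Y \mid C).
\end{aligned}
```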

Future Directions in LLM Training

Best-fit Packing potentially sets a precedent for future LLM training methodologies that prioritize data integrity without compromising efficiency. It opens avenues for exploring additional data packing strategies and their integration into standard LLM training pipelines. Additionally, this approach could enhance not only base model pre-training but also task-specific fine-tuning phases.

Conclusion: Towards More Coherent and Less Hallucinatory LLMs

In summary, Best-fit Packing addresses a critical flaw in the traditional LLM training regimen by mitigating excessive document truncation, thus enhancing logical coherence and factual consistency across model outputs. This method not only supports existing findings regarding the importance of comprehensive context in model training but also pioneers an efficient, scalable solution to a previously overlooked but significant problem.
