
Abstract

Data curation is commonly considered a "secret sauce" for LLM training, with higher-quality data usually leading to better LLM performance. Given the scale of internet-scraped corpora, data pruning has become an increasingly important focus. Specifically, many have shown that de-duplicating data, or sub-selecting higher-quality data, can lead to efficiency or performance improvements. Generally, three types of methods are used to filter internet-scale corpora: embedding-based, heuristic-based, and classifier-based. In this work, we contrast the former two in the domain of finetuning LLMs for code generation. We find that embedding-based methods are often confounded by length, and that a simple heuristic--pruning long files--outperforms other methods in compute-limited regimes. Our method can yield up to a 2x efficiency benefit in training (while matching performance) or a 3.5% absolute performance improvement on HumanEval (while matching compute). However, we find that perplexity on held-out long files can increase, raising the question of whether optimizing data mixtures for common coding benchmarks (HumanEval, MBPP) actually best serves downstream use cases. Overall, we hope our work builds useful intuitions about code data (specifically, the low quality of extremely long code files), provides a compelling heuristic-based method for data pruning, and brings to light questions in how we evaluate code generation models.

Figure: Validation perplexity evolution across length bins, highlighting overfitting in aggressively pruned settings.

Overview

  • The paper introduces a heuristic-based data pruning method that removes the longest files from training datasets, significantly improving training efficiency and model performance for code generation tasks in LLMs.

  • The authors conducted a detailed analysis of the Python subset of The Stack dataset, finding that extremely long files often contain low-quality content, and demonstrated that their pruning method enhances performance, particularly in compute-limited scenarios.

  • Comparative evaluations show that the heuristic method outperforms embedding-based methods like SCIP in constrained computational settings while maintaining comparable performance in high-compute scenarios.

Brevity is the Soul of Wit: Pruning Long Files for Code Generation

The paper "Brevity is the Soul of Wit: Pruning Long Files for Code Generation" by Aaditya K. Singh et al. explore optimizing data pruning methodologies specifically for LLMs fine-tuned for code generation tasks. This study provides a comparative analysis of two predominant approaches to data pruning: embedding-based and heuristic-based methods. It introduces a novel, heuristic-based approach which strategically prunes long files, demonstrating notable improvements over existing methods, particularly in computationally constrained settings.

Key Contributions

The primary contributions of the paper can be summarized as follows:

  1. Identification of Long Files as Low-Quality Data: The authors conducted an in-depth analysis of the Python subset of The Stack dataset. Their findings illustrate that extremely long files are typically of low quality, often consisting of repetitive or irrelevant content such as large data arrays or poor-quality code, commonly referred to as "spaghetti code."
  2. Heuristic-based Pruning Method: The paper introduces a simple yet effective heuristic for data pruning—removing the longest files from the training dataset (a minimal sketch of this heuristic appears after this list). This approach is shown to yield a 2x efficiency improvement in training or a 3.5% absolute performance improvement on the HumanEval benchmark compared to baseline methods.
  3. Evaluation of Pruning Methods: The study extensively evaluates the proposed heuristic in comparison with embedding-based methods like SCIP. Results indicate that while embedding-based methods struggle in compute-limited regimes, the heuristic-based method consistently maintains or improves performance on standard benchmarks such as HumanEval and MBPP.
  4. Impact on Downstream Benchmarks: The authors highlight that the improvements associated with pruning long files are particularly pronounced in compute-limited regimes, where training efficiency is critical. However, they also caution that this method can lead to increased perplexity on longer, held-out files, suggesting a tradeoff in optimizing for commonly used benchmarks.
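To make the heuristic concrete, here is a minimal sketch of length-based pruning: sort files by length and drop the longest ones until a target fraction of tokens has been removed. The function name, the `"tokens"` field, and the 50% pruning fraction are illustrative assumptions, not the authors' exact implementation.

```python
from typing import List, Dict


def prune_longest_files(files: List[Dict], prune_fraction: float = 0.5) -> List[Dict]:
    """Drop the longest files until roughly `prune_fraction` of tokens is removed.

    Each file is a dict with a "tokens" count (an assumed schema, not the
    paper's exact data format). Returns the surviving files.
    """
    total_tokens = sum(f["tokens"] for f in files)
    budget_to_remove = prune_fraction * total_tokens

    # Sort by length, longest first, so the heuristic removes long files first.
    by_length = sorted(files, key=lambda f: f["tokens"], reverse=True)

    removed, kept = 0, []
    for f in by_length:
        if removed < budget_to_remove:
            removed += f["tokens"]  # prune this (long) file
        else:
            kept.append(f)
    return kept


# Toy example: prune half of all tokens; the single very long file is dropped.
corpus = [{"path": "a.py", "tokens": 120_000},
          {"path": "b.py", "tokens": 3_000},
          {"path": "c.py", "tokens": 1_500}]
print([f["path"] for f in prune_longest_files(corpus, prune_fraction=0.5)])
```

Because the token distribution of code corpora is heavy-tailed, removing a small number of the longest files can account for a large share of pruned tokens.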

Methodology

The authors' methodology involves fine-tuning Llama2 7B models on various pruned subsets of The Stack's Python data. They conduct bootstrapped experiments on random 50% subsets, applying different pruning strategies and evaluating their impact on downstream performance metrics. This rigorous experimental setup not only assesses the heuristic's efficacy but also quantifies the inherent noise in performance due to dataset variations.
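The bootstrapping step can be pictured with a short sketch: repeatedly sample a random 50% subset of the corpus, apply a pruning strategy, then fine-tune and evaluate, so that the spread of scores across subsets estimates the noise attributable to dataset variation. The `finetune_and_eval` callable below is a placeholder for the actual Llama2 7B training and HumanEval/MBPP evaluation pipeline, which the summary does not specify.

```python
import random
import statistics
from typing import Callable, Dict, List


def bootstrap_pruning_runs(
    files: List[Dict],
    prune_fn: Callable[[List[Dict]], List[Dict]],
    finetune_and_eval: Callable[[List[Dict]], float],  # placeholder for training + benchmark eval
    n_runs: int = 3,
    subset_fraction: float = 0.5,
    seed: int = 0,
) -> Dict[str, float]:
    """Estimate the noise in benchmark scores induced by dataset variation.

    For each run: draw a random `subset_fraction` of files, prune the subset
    with `prune_fn`, fine-tune, and evaluate. Returns mean and stdev of scores.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_runs):
        subset = rng.sample(files, k=int(subset_fraction * len(files)))
        pruned = prune_fn(subset)
        scores.append(finetune_and_eval(pruned))
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
    }
```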

Results

Training Efficiency

By pruning the longest files, which together account for 50% of tokens, the authors achieve performance parity with a baseline trained on the full dataset, effectively doubling training efficiency. This is significant for low-resource or academic settings where computational resources are limited.

Performance Improvement

When considering performance at a fixed computational budget (8k steps), the heuristic method demonstrates a 3.5% absolute improvement on HumanEval and a more modest 1.5% improvement on MBPP. These gains affirm the utility of the heuristic for producing better-performing models under constrained compute.

Contrast with Embedding-Based Methods

Embedding-based methods, specifically SCIP, show suboptimal performance in compute-limited conditions but perform comparably in high-compute scenarios. Interestingly, even in these larger compute regimes, the heuristic-based pruning method matches the performance of SCIP, underscoring its robustness.

Implications and Future Directions

The paper's findings have several practical and theoretical implications:

  • Practical Implications: The heuristic-based pruning of long files provides a straightforward and computationally efficient method for improving model training in code generation tasks. It encourages practitioners to consider simpler heuristics in their data curation pipelines, potentially reducing the complexity and computational overhead of more sophisticated embedding-based methods.
  • Theoretical Implications: The correlation between document length and data quality invites further exploration into the characteristics of high-quality training data. The paper also raises questions about the balance between optimizing for common benchmarks and serving broader downstream use cases, particularly those involving long-context code (see the evaluation sketch below).
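One way to probe the long-file tradeoff noted above is to report held-out loss separately per length bin, as in the validation-perplexity figure. The sketch below avoids committing to any model API: it takes precomputed per-file mean negative log-likelihoods and aggregates them into token-weighted per-bin perplexities. The bin edges are illustrative choices, not the authors' exact setup.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Illustrative length-bin edges (in tokens); the paper's exact bins are not given here.
BIN_EDGES = [0, 1_000, 4_000, 16_000, float("inf")]


def bin_for(length: int) -> str:
    """Return a label for the length bin containing `length`."""
    for lo, hi in zip(BIN_EDGES, BIN_EDGES[1:]):
        if lo <= length < hi:
            return f"[{lo}, {hi})"
    return "unknown"


def perplexity_by_length_bin(
    files: Iterable[Tuple[int, float]],  # (token_count, mean NLL per token) for each held-out file
) -> Dict[str, float]:
    """Aggregate token-weighted NLL within each length bin and report perplexity."""
    nll_sum: Dict[str, float] = defaultdict(float)
    tok_sum: Dict[str, int] = defaultdict(int)
    for length, nll in files:
        b = bin_for(length)
        nll_sum[b] += nll * length
        tok_sum[b] += length
    return {b: math.exp(nll_sum[b] / tok_sum[b]) for b in nll_sum}


# Toy example: a very long held-out file shows higher loss under an aggressively pruned model.
print(perplexity_by_length_bin([(500, 1.2), (2_000, 1.3), (30_000, 1.8)]))
```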

Conclusion

Overall, this paper contributes valuable insights into data pruning for LLMs in the domain of code generation. By introducing a heuristic-based method that prunes long files, the authors provide a compelling approach that enhances training efficiency and performance in compute-limited settings. Future research may investigate applying this heuristic to other domains and explore ways to mitigate its drawbacks, such as increased perplexity on long held-out files. The discussion this work opens on data quality and evaluation methods is likely to influence ongoing efforts to optimize LLM training pipelines.
