Emergent Mind

Decoding Data Quality via Synthetic Corruptions: Embedding-guided Pruning of Code Data

(2312.02418)
Published Dec 5, 2023 in cs.CL , cs.AI , and cs.LG

Abstract

Code datasets, often collected from diverse and uncontrolled sources such as GitHub, potentially suffer from quality issues, thereby affecting the performance and training efficiency of LLMs optimized for code generation. Previous studies demonstrated the benefit of using embedding spaces for data pruning, but they mainly focused on duplicate removal or increasing variety, and in other modalities, such as images. Our work focuses on using embeddings to identify and remove "low-quality" code data. First, we explore features of "low-quality" code in embedding space, through the use of synthetic corruptions. Armed with this knowledge, we devise novel pruning metrics that operate in embedding space to identify and remove low-quality entries in the Stack dataset. We demonstrate the benefits of this synthetic corruption informed pruning (SCIP) approach on the well-established HumanEval and MBPP benchmarks, outperforming existing embedding-based methods. Importantly, we achieve up to a 3% performance improvement over no pruning, thereby showing the promise of insights from synthetic corruptions for data pruning.

Overview

  • The paper addresses improving training data quality for LLMs optimized for code generation.

  • Introduces Synthetic Corruption Informed Pruning (SCIP) to remove low-quality code snippets from datasets.

  • SCIP works by introducing controlled errors into code and using embedding spaces to distinguish data quality.

  • Pruned datasets result in better performance and efficiency on code generation benchmarks like HumanEval and MBPP.

  • The method may be transferable to other AI domains, highlighting the importance of data curation for AI models.

The research presented in this paper tackles the challenge of improving the quality of training data for LLMs, specifically those optimized for code generation. LLMs' capacity to generate code has garnered significant attention due to its potential to revolutionize software development. However, the performance and efficiency of LLMs heavily depend on the quality of their training data. Datasets compiled from public sources, such as GitHub, often contain inconsistencies, errors, or low-quality snippets of code which can negatively impact the training of these models.

To address this, the study introduces a novel approach known as Synthetic Corruption Informed Pruning (SCIP), which improves dataset quality by removing low-quality code. The key idea is to synthetically corrupt known-good code and observe where the corrupted snippets land in an embedding space produced by a pre-trained model, StarEncoder. The corruptions introduce either syntax errors, such as removing closing brackets, or content errors, such as altering array indices, creating a controlled contrast between high- and low-quality code. In the embedding space, the corrupted code tends to fall into smaller clusters or lie farther from cluster centroids.
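As a rough illustration of the corruption step, the sketch below applies two toy corruptions of the kinds the paper describes: deleting a closing bracket (syntax error) and shifting an array index (content error). The exact corruption operators, and the helper names here, are assumptions for illustration, not the paper's implementation.

```python
import random

def corrupt_syntax(code: str) -> str:
    """Syntax corruption: delete one randomly chosen closing bracket."""
    closers = [i for i, ch in enumerate(code) if ch in ")]}"]
    if not closers:
        return code
    i = random.choice(closers)
    return code[:i] + code[i + 1:]

def corrupt_content(code: str) -> str:
    """Content corruption: shift the first single-digit array index found."""
    chars = list(code)
    for i, ch in enumerate(chars):
        if ch.isdigit() and i > 0 and chars[i - 1] == "[":
            chars[i] = str((int(ch) + 1) % 10)  # e.g. xs[0] -> xs[1]
            break
    return "".join(chars)

snippet = "def first(xs):\n    return xs[0]\n"
print(corrupt_syntax(snippet))   # syntactically broken variant
print(corrupt_content(snippet))  # subtly wrong but parseable variant
```

Embedding both the original and corrupted variants with the same encoder is what reveals where "low-quality" code concentrates in the embedding space.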

The SCIP method operates by examining the size of clusters and the distance of data points to cluster centroids within the embedding space, targeting data that resemble the synthetically corrupted code in terms of their spatial properties. The study shows that by pruning code snippets that fall into smaller clusters or are farther away from cluster centroids, the resulting cleaned datasets produce enhanced performance of LLMs on widely recognized code generation benchmarks, namely HumanEval and MBPP.
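The pruning step described above can be sketched as follows, assuming snippet embeddings (e.g. from StarEncoder) have already been computed. The cluster count, pruning fractions, and use of k-means here are illustrative assumptions; the paper's actual metrics and hyperparameters may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

def prune_by_embedding(embeddings: np.ndarray, k: int = 4,
                       small_frac: float = 0.25, far_frac: float = 0.1):
    """Return indices to KEEP: drop points in the smallest clusters
    and points farthest from their cluster centroid."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)
    labels, centroids = km.labels_, km.cluster_centers_
    # Distance of each point to its own cluster centroid.
    dists = np.linalg.norm(embeddings - centroids[labels], axis=1)

    # Metric 1: drop every point belonging to the smallest clusters.
    sizes = np.bincount(labels, minlength=k)
    n_small = max(1, int(small_frac * k))
    small_clusters = set(np.argsort(sizes)[:n_small])

    # Metric 2: drop the points farthest from their centroids.
    n_far = int(far_frac * len(embeddings))
    far_points = set(np.argsort(dists)[-n_far:]) if n_far else set()

    keep = [i for i in range(len(embeddings))
            if labels[i] not in small_clusters and i not in far_points]
    return np.array(keep)

rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 8))  # stand-in for StarEncoder embeddings
kept = prune_by_embedding(emb)
print(f"kept {len(kept)} of {len(emb)} snippets")
```

Training then proceeds only on the kept subset, which is what yields the benchmark and efficiency gains reported below.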

Results from this pruning strategy show that it not only improves performance on benchmark evaluations but also yields better training efficiency, with models requiring fewer training steps to reach baseline performance levels. The method also surpasses existing embedding-based pruning methods in both performance and training efficiency.

The implications of this research extend beyond code datasets. It illustrates the importance of rigorously examining and curating training data for AI models. The idea of using synthetically corrupted data as a signal for pruning could be applicable to a broader range of datasets, including those used for natural language processing tasks. This work opens the door for future studies to develop improved data pruning techniques that utilize synthetic corruption insights for various types of AI models, potentially leading to more accurate, reliable, and effective AI applications across different domains.
