Emergent Mind

Decoding Data Quality via Synthetic Corruptions: Embedding-guided Pruning of Code Data

(2312.02418)
Published Dec 5, 2023 in cs.CL , cs.AI , and cs.LG

Abstract

Code datasets, often collected from diverse and uncontrolled sources such as GitHub, potentially suffer from quality issues, thereby affecting the performance and training efficiency of LLMs optimized for code generation. Previous studies demonstrated the benefit of using embedding spaces for data pruning, but they mainly focused on duplicate removal or increasing variety, and in other modalities, such as images. Our work focuses on using embeddings to identify and remove "low-quality" code data. First, we explore features of "low-quality" code in embedding space, through the use of synthetic corruptions. Armed with this knowledge, we devise novel pruning metrics that operate in embedding space to identify and remove low-quality entries in the Stack dataset. We demonstrate the benefits of this synthetic corruption informed pruning (SCIP) approach on the well-established HumanEval and MBPP benchmarks, outperforming existing embedding-based methods. Importantly, we achieve up to a 3% performance improvement over no pruning, thereby showing the promise of insights from synthetic corruptions for data pruning.

Overview

  • The paper addresses improving training data quality for LLMs optimized for code generation.

  • Introduces Synthetic Corruption Informed Pruning (SCIP) to remove low-quality code snippets from datasets.

  • SCIP works by introducing controlled errors into code and using embedding spaces to distinguish data quality.

  • Pruned datasets result in better performance and efficiency on code generation benchmarks like HumanEval and MBPP.

  • The method may be transferable to other AI domains, highlighting the importance of data curation for AI models.

The research presented in this paper tackles the challenge of improving the quality of training data for LLMs, specifically those optimized for code generation. LLMs' capacity to generate code has garnered significant attention due to its potential to revolutionize software development. However, the performance and efficiency of LLMs heavily depend on the quality of their training data. Datasets compiled from public sources, such as GitHub, often contain inconsistencies, errors, or low-quality snippets of code which can negatively impact the training of these models.

To address this, the study introduces a novel approach known as Synthetic Corruption Informed Pruning (SCIP), which improves dataset quality by removing low-quality code. The key idea is to synthetically corrupt known-good code and observe where the corrupted snippets land in an embedding space produced by a pre-trained model, StarEncoder. The corruptions introduce either syntax errors, such as removing closing brackets, or content errors, such as altering array indices, creating a controlled contrast between high- and low-quality code. In the embedding space, the corrupted code tends to fall into smaller clusters or lie farther from cluster centroids.
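As a rough illustration of the corruption step, the sketch below applies two toy corruptions of the kinds the paper describes: deleting a closing bracket (syntax error) and shifting an array index (content error). The exact corruption operators, and the helper names here, are assumptions for illustration, not the paper's implementation.

```python
import random

def corrupt_syntax(code: str) -> str:
    """Syntax corruption: delete one randomly chosen closing bracket."""
    closers = [i for i, ch in enumerate(code) if ch in ")]}"]
    if not closers:
        return code
    i = random.choice(closers)
    return code[:i] + code[i + 1:]

def corrupt_content(code: str) -> str:
    """Content corruption: shift the first single-digit array index found."""
    chars = list(code)
    for i, ch in enumerate(chars):
        if ch.isdigit() and i > 0 and chars[i - 1] == "[":
            chars[i] = str((int(ch) + 1) % 10)  # e.g. xs[0] -> xs[1]
            break
    return "".join(chars)

snippet = "def first(xs):\n    return xs[0]\n"
print(corrupt_syntax(snippet))   # syntactically broken variant
print(corrupt_content(snippet))  # subtly wrong but parseable variant
```

Embedding both the original and corrupted variants with the same encoder is what reveals where "low-quality" code concentrates in the embedding space.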

The SCIP method operates by examining the size of clusters and the distance of data points to cluster centroids within the embedding space, targeting data that resemble the synthetically corrupted code in terms of their spatial properties. The study shows that by pruning code snippets that fall into smaller clusters or are farther away from cluster centroids, the resulting cleaned datasets produce enhanced performance of LLMs on widely recognized code generation benchmarks, namely HumanEval and MBPP.
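The pruning step described above can be sketched as follows, assuming snippet embeddings (e.g. from StarEncoder) have already been computed. The cluster count, pruning fractions, and use of k-means here are illustrative assumptions; the paper's actual metrics and hyperparameters may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

def prune_by_embedding(embeddings: np.ndarray, k: int = 4,
                       small_frac: float = 0.25, far_frac: float = 0.1):
    """Return indices to KEEP: drop points in the smallest clusters
    and points farthest from their cluster centroid."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)
    labels, centroids = km.labels_, km.cluster_centers_
    # Distance of each point to its own cluster centroid.
    dists = np.linalg.norm(embeddings - centroids[labels], axis=1)

    # Metric 1: drop every point belonging to the smallest clusters.
    sizes = np.bincount(labels, minlength=k)
    n_small = max(1, int(small_frac * k))
    small_clusters = set(np.argsort(sizes)[:n_small])

    # Metric 2: drop the points farthest from their centroids.
    n_far = int(far_frac * len(embeddings))
    far_points = set(np.argsort(dists)[-n_far:]) if n_far else set()

    keep = [i for i in range(len(embeddings))
            if labels[i] not in small_clusters and i not in far_points]
    return np.array(keep)

rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 8))  # stand-in for StarEncoder embeddings
kept = prune_by_embedding(emb)
print(f"kept {len(kept)} of {len(emb)} snippets")
```

Training then proceeds only on the kept subset, which is what yields the benchmark and efficiency gains reported below.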

Results from this pruning strategy show that it not only improves performance on benchmark evaluations but also yields better training efficiency, with models requiring fewer training steps to reach baseline performance levels. The method also surpasses existing embedding-based pruning methods in both performance and training efficiency.

The implications of this research extend beyond code datasets. It illustrates the importance of rigorously examining and curating training data for AI models. The idea of using synthetically corrupted data as a signal for pruning could be applicable to a broader range of datasets, including those used for natural language processing tasks. This work opens the door for future studies to develop improved data pruning techniques that utilize synthetic corruption insights for various types of AI models, potentially leading to more accurate, reliable, and effective AI applications across different domains.
