RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems

Published 5 Jun 2023 in cs.CL, cs.AI, and cs.SE | (2306.03091v2)

Abstract: LLMs have greatly advanced code auto-completion systems, with a potential for substantial productivity enhancements for developers. However, current benchmarks mainly focus on single-file tasks, leaving an assessment gap for more complex, real-world, multi-file programming scenarios. To fill this gap, we introduce RepoBench, a new benchmark specifically designed for evaluating repository-level code auto-completion systems. RepoBench supports both Python and Java and consists of three interconnected evaluation tasks: RepoBench-R (Retrieval), RepoBench-C (Code Completion), and RepoBench-P (Pipeline). Each task respectively measures the system's ability to retrieve the most relevant code snippets from other files as cross-file context, predict the next line of code with cross-file and in-file context, and handle complex tasks that require a combination of both retrieval and next-line prediction. RepoBench aims to facilitate a more complete comparison of performance and encourage continuous improvement in auto-completion systems. RepoBench is publicly available at https://github.com/Leolty/repobench.

Summary

The paper introduces RepoBench, a novel benchmark designed to evaluate auto-completion in multi-file repository contexts.
It defines three evaluation tasks—retrieval (RepoBench-R), code completion (RepoBench-C), and end-to-end pipeline (RepoBench-P)—to assess practical code generation challenges.
Experimental results indicate that including longer cross-file context and arranging the retrieved snippets carefully both improve completion performance.

An Overview of RepoBench: Advancing Repository-Level Code Auto-Completion Evaluation

In recent years, LLMs such as Codex and StarCoder have markedly improved the domain of code auto-completion, promising significant productivity gains for developers. These models, however, are predominantly evaluated on single-file tasks, which do not accurately reflect the complexities of real-world programming involving multi-file projects. Addressing this gap, the paper introduces RepoBench, a benchmark designed specifically for repository-level code auto-completion systems, which acknowledges the critical need for multi-file context in code generation tasks.

The centerpiece of RepoBench is its tripartite evaluation suite comprising three interconnected tasks—RepoBench-R for code retrieval, RepoBench-C for code completion, and RepoBench-P for the end-to-end completion pipeline—each addressing unique challenges of repository-level systems. These tasks offer a comprehensive framework for assessing models' abilities to manage extensive code contexts, an essential competency for practical application across real-world environments.

Contribution of RepoBench

RepoBench introduces several key innovations with implications for both practical development and future research in the field. It supports evaluations in Python and Java, with the tasks designed to reflect typical software development scenarios:

RepoBench-R (Retrieval): This task evaluates how well a system retrieves relevant code snippets from other files within a repository, which requires an understanding of multi-file dependencies. Performance is reported with metrics such as Accuracy@k, which reflect the system's ability to rank the relevant snippet near the top among candidates drawn from an extensive codebase.
RepoBench-C (Code Completion): This task focuses on predicting the next line of code and provides two settings (2k and 8k) to accommodate models with different context length capabilities. The RepoBench-C results show how widely existing LLMs vary in performance when conditioned on both in-file and cross-file context, and they establish a baseline for future work.
RepoBench-P (Pipeline): This task simulates a full code auto-completion pipeline by combining retrieval and completion, and it assesses how robustly the pipeline handles complex, multi-step code generation scenarios. It underscores the importance of effective retrieval for completion accuracy, and the findings suggest that where retrieved snippets are placed in the input also matters.
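
As an illustration of the Accuracy@k metric mentioned for RepoBench-R, the following is a minimal sketch rather than the benchmark's own evaluation code. It assumes each example comes with a ranked list of candidate snippet indices and exactly one gold index; the function name `accuracy_at_k` and this data layout are assumptions made for the example.

```python
from typing import Sequence


def accuracy_at_k(
    rankings: Sequence[Sequence[int]],
    gold_indices: Sequence[int],
    k: int,
) -> float:
    """Fraction of examples whose gold snippet appears in the top-k candidates.

    rankings[i] lists candidate snippet indices for example i, best first.
    gold_indices[i] is the index of the single correct snippet (an assumed
    layout for this sketch, not necessarily the benchmark's own format).
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    if len(rankings) != len(gold_indices):
        raise ValueError("rankings and gold_indices must have the same length")
    if not gold_indices:
        return 0.0  # no examples: report 0.0 rather than divide by zero

    hits = sum(
        1
        for ranking, gold in zip(rankings, gold_indices)
        if gold in ranking[:k]
    )
    return hits / len(gold_indices)
```

For example, with rankings `[[2, 0, 1], [1, 2, 0], [0, 1, 2]]` and gold indices `[2, 0, 1]`, only the first example is a hit at k=1 (accuracy 1/3), while every example is a hit at k=3 (accuracy 1.0).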

Insights from Experiments

The experimental results provide numerous insights into the strengths and limitations of current auto-completion systems:

Retrieval Efficacy: Among the retrieval methods tested, UniXcoder performed best, suggesting that semantic retrieval has an advantage over lexical methods. The results also showed a performance gap between the Python and Java tasks, attributed to inherent differences in language complexity that future benchmarks may need to account for.
Completion Performance: StarCoder and Codex showed a distinct performance discrepancy across input lengths, possibly because of differences in the length distribution of their training data. This points to a need for training strategies that improve length generalization.
Pipeline Realization: Including extended cross-file context markedly benefited completion performance, affirming the value of thorough retrieval. The order in which retrieved snippets are placed in the input remains a critical consideration, as the differing results across ordering strategies demonstrate.
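
To make the lexical retrieval and snippet-ordering points concrete, here is a small sketch of a retrieve-then-complete prompt builder. It is not the paper's implementation: Jaccard token overlap stands in as one simple lexical similarity measure, the `# cross-file snippet` header is a made-up prompt format, and the names `rank_snippets` and `build_prompt` are hypothetical. The `most_relevant_last` flag only exposes the ordering choice; which order works better is an empirical question.

```python
import re
from typing import List, Set


def tokenize(code: str) -> Set[str]:
    """Split code into a set of identifier and number tokens."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+", code))


def jaccard(a: str, b: str) -> float:
    """Jaccard overlap of token sets (an assumed lexical similarity)."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def rank_snippets(query: str, snippets: List[str]) -> List[int]:
    """Return snippet indices sorted by similarity to the query, best first."""
    scores = [jaccard(query, s) for s in snippets]
    return sorted(range(len(snippets)), key=lambda i: scores[i], reverse=True)


def build_prompt(
    in_file_context: str,
    snippets: List[str],
    ranking: List[int],
    top_k: int = 2,
    most_relevant_last: bool = True,
) -> str:
    """Place the top-k retrieved snippets before the in-file context.

    With most_relevant_last=True, the best-ranked snippet sits closest to
    the in-file context (and so closest to the line being predicted).
    """
    chosen = [snippets[i] for i in ranking[:top_k]]
    if most_relevant_last:
        chosen.reverse()
    cross_file = "\n\n".join(f"# cross-file snippet\n{s}" for s in chosen)
    return f"{cross_file}\n\n{in_file_context}"
```

A semantic retriever would replace `jaccard` with similarity between learned embeddings of the query and each snippet, leaving the ranking and prompt assembly unchanged.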

Implications for the Future

RepoBench is a crucial step towards realistic and effective evaluation of code auto-completion. By reflecting real-world programming challenges, it offers not only a measure of current models but also a framework for guiding future model development and optimization. It encourages the research community to prioritize extensibility and adaptability in model design, thus enhancing practical applicability in professional software development scenarios.

Continued development of repository-level benchmarks like RepoBench is essential for advancing AI-driven code completion tools. By accounting for the complexities inherent in extensive code repositories, future LLMs can be expected to perform better and to support developers across diverse programming ecosystems.
