Repoformer: Selective Retrieval for Repository-Level Code Completion

(arXiv:2403.10059)
Published Mar 15, 2024 in cs.SE and cs.CL

Abstract

Recent advances in retrieval-augmented generation (RAG) have initiated a new era in repository-level code completion. However, the invariable use of retrieval in existing methods exposes issues in both efficiency and robustness, with a large proportion of the retrieved contexts proving unhelpful or harmful to code language models (code LMs). To tackle the challenges, this paper proposes a selective RAG framework where retrieval is avoided when unnecessary. To power this framework, we design a self-supervised learning approach that enables a code LM to accurately self-evaluate whether retrieval can improve its output quality and robustly leverage the potentially noisy retrieved contexts. Using this LM as both the selective retrieval policy and the generation model, our framework consistently outperforms the state-of-the-art prompting with an invariable retrieval approach on diverse benchmarks including RepoEval, CrossCodeEval, and a new benchmark. Meanwhile, our selective retrieval strategy results in strong efficiency improvements by as much as 70% inference speedup without harming the performance. We demonstrate that our framework effectively accommodates different generation models, retrievers, and programming languages. These advancements position our framework as an important step towards more accurate and efficient repository-level code completion.

A selective RAG framework that improves accuracy and reduces latency by retrieving repository context only when it is likely to help.

Overview

  • Repoformer introduces a selective retrieval framework to optimize efficiency and robustness in repository-level code completion, moving beyond the traditional constant use of retrieval.

  • Empirical evidence shows that invariable retrieval often introduces inefficiencies: across the studied tasks, retrieved contexts improve performance in only about 20% of instances.

  • Repoformer's methodology enhances model performance by intelligently determining when retrieval is beneficial, achieving up to 70% inference speedups without sacrificing performance.

  • The model is rigorously evaluated, consistently outperforming state-of-the-art methods, and demonstrating flexibility across various languages, retrievers, and generative models.

Repoformer: Advancing Efficiency in Repository-Level Code Completion with Selective Retrieval

Introduction to Selective Retrieval in RAG

In the realm of code completion, particularly at the repository level, the integration of retrieval-augmented generation (RAG) techniques has been instrumental. These methods leverage contextually relevant code snippets or documentation from the same repository to enhance the predictive accuracy of code language models (code LMs). Traditional RAG-based approaches invariably utilize retrieval, assuming it always contributes positively to the completion task. This paper introduces a paradigm shift by questioning and subsequently disproving this assumption. The proposed framework, centered around Repoformer, employs selective retrieval, invoking it only when deemed beneficial, thus optimizing both robustness and efficiency in repository-level code completion.
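
To make the framework concrete, the following minimal sketch shows the selective RAG control flow. It is an illustration rather than the paper's implementation: `lm`, `repo_index`, and their methods (`retrieval_benefit_probability`, `search`, `generate`) are hypothetical stand-ins for the self-evaluating code LM and a repository-level retriever.

```python
def complete(code_context: str, repo_index, lm, threshold: float = 0.5) -> str:
    """Selective RAG: retrieve repository context only when the LM
    predicts that retrieval will improve its completion."""
    # Step 1: the code LM self-assesses whether retrieval would help.
    p_retrieve = lm.retrieval_benefit_probability(code_context)  # hypothetical API

    if p_retrieve >= threshold:
        # Step 2a: fetch cross-file context (e.g., similar code chunks)
        # from the current repository and prepend it to the prompt.
        snippets = repo_index.search(code_context, top_k=5)
        prompt = "\n".join(snippets) + "\n" + code_context
    else:
        # Step 2b: skip retrieval entirely, avoiding retriever latency
        # and the cost of decoding with a longer prompt.
        prompt = code_context

    # Step 3: generate the completion with the same LM.
    return lm.generate(prompt)
```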

The Dilemma of Invariable Retrieval

Empirical evidence suggests a substantial portion of retrievals in existing methods does not enhance, and can even degrade, the performance of code LMs. Analysis across diverse repository-level code completion tasks reveals that retrieval improves code LM performance in only 20% or fewer of the instances. Notably, a significant number of retrievals introduce inefficiencies or irrelevant information detrimental to the task at hand. These findings underline the inefficacy of the 'invariable retrieval' strategy, demanding a more discerning approach to leveraging retrieved contexts.
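
This kind of audit can be reproduced with a simple paired comparison: generate each completion with and without the retrieved context, then score both against the reference. A minimal sketch under assumed interfaces (the `lm`, `repo_index`, and dataset fields are hypothetical, and exact match stands in for the paper's metrics):

```python
def audit_retrieval(dataset, lm, repo_index):
    """Estimate how often retrieval helps, hurts, or makes no difference."""
    helped = hurt = unchanged = 0
    for ex in dataset:  # each example has .code_context and .reference
        baseline = lm.generate(ex.code_context)
        snippets = repo_index.search(ex.code_context, top_k=5)
        augmented = lm.generate("\n".join(snippets) + "\n" + ex.code_context)

        base_ok = baseline.strip() == ex.reference.strip()
        aug_ok = augmented.strip() == ex.reference.strip()
        if aug_ok and not base_ok:
            helped += 1      # retrieval turned a failure into a success
        elif base_ok and not aug_ok:
            hurt += 1        # retrieval degraded a correct completion
        else:
            unchanged += 1
    total = len(dataset)
    return helped / total, hurt / total, unchanged / total
```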

Repoformer: A Solution to the Invariable Retrieval Issue

Repoformer takes a novel approach to circumventing unnecessary retrievals. By self-evaluating whether retrieval is likely to improve a specific completion, Repoformer engages the retrieval mechanism only on a need basis. This selective methodology not only improves performance across various benchmarks but also yields marked efficiency gains, achieving up to 70% inference speedups without compromising output quality.

Three core principles underpin Repoformer:

  • Performance-oriented self-evaluation: determining the need for retrieval based not only on whether the model already possesses the requisite knowledge for the completion, but also on the relevance and utility of the additional context retrieval would supply (a minimal sketch of this decision rule follows this list).
  • Robustness to retrieved contexts: an enhanced ability to leverage meaningful context when available and disregard it when not, minimizing potential performance degradation from unhelpful retrievals.
  • Generalizability: the ability to operate across different programming languages, retrievers, and generation models, allowing it to serve as a plug-and-play selective-retrieval policy for existing code LMs.
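
As noted in the first principle above, the model expresses the retrieval judgment itself. The sketch below assumes, hypothetically, that this judgment is exposed as the next-token probability of a special `<retrieve>` token; the checkpoint name and token string are placeholders, not the paper's released artifacts.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint name; assumed to include a special token whose
# next-token probability encodes "retrieval will help this completion".
tokenizer = AutoTokenizer.from_pretrained("repoformer-checkpoint")
model = AutoModelForCausalLM.from_pretrained("repoformer-checkpoint")
RETRIEVE_TOKEN_ID = tokenizer.convert_tokens_to_ids("<retrieve>")  # assumed token

@torch.no_grad()
def retrieval_probability(code_context: str) -> float:
    """Score used as the selective-retrieval policy: the probability the
    model assigns to the 'retrieval helps' token as its next token."""
    inputs = tokenizer(code_context, return_tensors="pt")
    logits = model(**inputs).logits[0, -1]        # next-token logits
    probs = torch.softmax(logits, dim=-1)
    return probs[RETRIEVE_TOKEN_ID].item()

# Decision rule: retrieve only when the score clears a threshold tau.
tau = 0.5
do_retrieve = retrieval_probability("def parse_config(path):") >= tau
```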

Empirical Validation and Analysis

Repoformer's effectiveness is rigorously evaluated through comprehensive benchmarks, including RepoEval, CrossCodeEval, and a newly introduced large-scale benchmark. The model consistently outperforms state-of-the-art methods, demonstrating superior accuracy and efficiency. Additional analyses reveal Repoformer's calibrated decision-making in selective retrieval, its enhanced robustness to retrieved context, and its flexibility in accommodating various threshold settings for optimal performance-latency trade-offs. Moreover, Repoformer can equip existing code LMs with selective RAG as a plug-in policy, underscoring its utility for efficient repository-level code completion.
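
The performance-latency trade-off mentioned above can be traced by sweeping the selection threshold: a higher threshold triggers retrieval less often, trading accuracy headroom for speed. A hedged sketch, reusing the hypothetical `retrieval_probability`, `lm`, `repo_index`, and dataset interfaces from the earlier snippets:

```python
def sweep_thresholds(dataset, lm, repo_index, thresholds=(0.2, 0.4, 0.6, 0.8)):
    """Trace the accuracy/latency frontier induced by the retrieval threshold."""
    for tau in thresholds:
        correct = retrievals = 0
        for ex in dataset:
            if retrieval_probability(ex.code_context) >= tau:
                retrievals += 1
                snippets = repo_index.search(ex.code_context, top_k=5)
                out = lm.generate("\n".join(snippets) + "\n" + ex.code_context)
            else:
                out = lm.generate(ex.code_context)  # retrieval skipped
            correct += out.strip() == ex.reference.strip()
        # Retrieval rate is a proxy for latency: fewer retrievals, faster inference.
        print(f"tau={tau:.1f}  acc={correct / len(dataset):.3f}  "
              f"retrieval_rate={retrievals / len(dataset):.3f}")
```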

Concluding Remarks

The contributions of Repoformer extend beyond the immediate enhancements in repository-level code completion. By challenging the conventional wisdom of invariable retrieval, it lays the groundwork for more discerning, efficiency-oriented approaches to augmenting code LMs. The advancements presented hold promise for refining programming environments, fostering more sustainable coding practices, and facilitating continual improvement in automated code completion technologies.
