
RepoHyper: Search-Expand-Refine on Semantic Graphs for Repository-Level Code Completion (2403.06095v4)

Published 10 Mar 2024 in cs.SE and cs.AI

Abstract: Code LLMs (CodeLLMs) have demonstrated impressive proficiency in code completion tasks. However, they often fall short of fully understanding the extensive context of a project repository, such as the intricacies of relevant files and class hierarchies, which can result in less precise completions. To overcome these limitations, we present RepoHyper, a multifaceted framework designed to address the complex challenges associated with repository-level code completion. Central to RepoHyper is the Repo-level Semantic Graph (RSG), a novel semantic graph structure that encapsulates the vast context of code repositories. Furthermore, RepoHyper leverages an Expand-and-Refine retrieval method, combining graph expansion with a link-prediction algorithm applied to the RSG, enabling the effective retrieval and prioritization of relevant code snippets. Our evaluations show that RepoHyper markedly outperforms existing techniques in repository-level code completion, showcasing enhanced accuracy across various datasets when compared to several strong baselines. Our implementation of RepoHyper can be found at https://github.com/FSoft-AI4Code/RepoHyper.

References (35)
  1. Guiding language models of code with global context using monitors. arXiv preprint arXiv:2306.10763.
  2. Program synthesis with large language models. arXiv preprint arXiv:2108.07732.
  3. Codeplan: Repository-level coding using llms and planning. arXiv preprint arXiv:2309.12499.
  4. Codetf: One-stop transformer library for state-of-the-art code llm. arXiv preprint arXiv:2306.00029.
  5. Evaluating large language models trained on code.
  6. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595.
  7. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113.
  8. Tri Dao. 2023. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691.
  9. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359.
  10. Crosscodeeval: A diverse and multilingual benchmark for cross-file code completion. arXiv preprint arXiv:2310.11248.
  11. Cocomic: Code completion by jointly modeling in-file and cross-file context. arXiv preprint arXiv:2212.10007.
  12. Unixcoder: Unified cross-modal pre-training for code representation. arXiv preprint arXiv:2203.03850.
  13. Deepseek-coder: When the large language model meets programming – the rise of code intelligence.
  14. Inductive representation learning on large graphs. In NIPS.
  15. Measuring coding challenge competence with apps. arXiv preprint arXiv:2105.09938.
  16. Starcoder: may the source be with you!
  17. Context-aware code generation framework for code repositories: Local, global, and third-party library awareness. arXiv preprint arXiv:2312.05772.
  18. Repobench: Benchmarking repository-level code auto-completion systems.
  19. Repocoder: Repository-level code completion through cross-file context retrieval. arXiv preprint arXiv:2303.12570.
  20. Wizardcoder: Empowering code large language models with evol-instruct. arXiv preprint arXiv:2306.08568.
  21. OpenDialKG: Explainable conversational reasoning with attention-based walks over knowledge graphs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 845–854, Florence, Italy. Association for Computational Linguistics.
  22. Codegen2: Lessons for training llms on programming and natural languages. arXiv preprint arXiv:2305.02309.
  23. Codegen: An open large language model for code with multi-turn program synthesis.
  24. Carbon emissions and large neural network training.
  25. Train short, test long: Attention with linear biases enables input length extrapolation. arXiv preprint arXiv:2108.12409.
  26. Codebleu: a method for automatic evaluation of code synthesis. arXiv preprint arXiv:2009.10297.
  27. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950.
  28. Repofusion: Training code models to understand your repository.
  29. Repository-level prompt generation for large language models of code. In ICML 2022 Workshop on Knowledge Retrieval and Language Models.
  30. Repository-level prompt generation for large language models of code. In International Conference on Machine Learning, pages 31693–31715. PMLR.
  31. CodeT5+: Open code large language models for code understanding and generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1069–1088, Singapore. Association for Computational Linguistics.
  32. Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. arXiv preprint arXiv:2109.00859.
  33. Rlpg: A reinforcement learning based code completion system with graph-based context representation. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2021, page 1119–1129, New York, NY, USA. Association for Computing Machinery.
  34. Codeagent: Enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges. arXiv preprint arXiv:2401.07339.
  35. Codegeex: A pre-trained model for code generation with multilingual evaluations on humaneval-x.

Summary

  • The paper presents RepoHyper's search-expand-refine methodology that boosts context retrieval by leveraging repository-level semantic graphs.
  • It implements a two-stage process: an initial semantic similarity search followed by graph-based link prediction to refine context ranking.
  • Evaluation on RepoBench shows significant improvements in context retrieval and code completion metrics, with notable gains in Exact Match and CodeBLEU scores.

RepoHyper: Search-Expand-Refine on Semantic Graphs for Repository-Level Code Completion

Research in code completion has increasingly focused on leveraging the context of entire code repositories rather than isolated files. The paper "RepoHyper: Search-Expand-Refine on Semantic Graphs for Repository-Level Code Completion" tackles the complexities of repository-level completion with a novel framework, RepoHyper.

Introduction to RepoHyper

RepoHyper enhances context retrieval in large codebases through a Repo-level Semantic Graph (RSG). This graph encapsulates the semantic relations and dependencies within a repository, aiming to surpass traditional similarity-based search techniques, which often miss relevant context that is not locally or lexically similar to the query.

Key Components:

  1. Repo-level Semantic Graph (RSG): A structured representation that captures a repository's global context, with nodes for functions, classes, and script segments, and edges denoting import, invocation, ownership, encapsulation, and inheritance relationships (a representational sketch follows this list).
  2. Expand and Refine Strategy: A two-stage retrieval process: Search-then-Expand first broadens the scope beyond the initially similar contexts, and Link Prediction then re-ranks and refines the expanded contexts so that LLM-based code completion pipelines receive the most relevant information (Figure 1).

    Figure 1: Illustration of graph-based semantic search versus similarity-based search.
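
To make the RSG concrete, here is a minimal representational sketch in Python, assuming networkx is available. The node and edge taxonomy follows the paper's description, but the attribute names, edge-kind strings, and graph library choice are our illustrative assumptions, not the authors' implementation.

```python
import networkx as nx

# Node and edge kinds mirroring the taxonomy described above (names are ours).
NODE_KINDS = {"function", "class", "script_segment"}
EDGE_KINDS = {"import", "invoke", "own", "encapsulate", "inherit"}

def make_rsg() -> nx.MultiDiGraph:
    g = nx.MultiDiGraph()
    # Each node carries its source snippet plus a slot for a retrieval embedding.
    g.add_node("pkg.utils.parse", kind="function", code="def parse(s): ...", emb=None)
    g.add_node("pkg.models.Config", kind="class", code="class Config: ...", emb=None)
    g.add_node("pkg.main#0", kind="script_segment", code="cfg = Config()", emb=None)
    # Typed edges encode the repository-level relations.
    g.add_edge("pkg.main#0", "pkg.models.Config", kind="invoke")
    g.add_edge("pkg.main#0", "pkg.utils.parse", kind="import")
    return g
```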

Methodology

Repo-level Semantic Graph Construction

The RSG is the pivotal structure of the approach: it captures not only direct code-level semantics but also broadens contextual retrieval to related classes, functions, and their dependencies, thereby deepening repository-level understanding.

Search-then-Expand Strategy

The strategy begins with a semantic similarity search, using k-nearest-neighbor (kNN) retrieval to find initial context "anchors" within the graph. It then applies Exhaustive Search and Pattern Search methods to expand outward from these anchors, capturing additional context without overwhelming the model with noise; a minimal sketch of this stage follows.
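
As a rough illustration of this stage, the sketch below performs a kNN search over precomputed node embeddings and then a bounded breadth-first expansion along RSG edges. It assumes the networkx-style graph from the earlier sketch; the paper's Exhaustive Search and Pattern Search add structure-aware filtering on top of this basic idea, and all function names here are hypothetical.

```python
import numpy as np

def knn_anchors(g, query_emb, k=5):
    """Stage 1: kNN over node embeddings picks the initial anchor nodes."""
    nodes = [n for n, d in g.nodes(data=True) if d.get("emb") is not None]
    sims = np.array([float(np.dot(query_emb, g.nodes[n]["emb"])) for n in nodes])
    return [nodes[i] for i in np.argsort(-sims)[:k]]

def expand(g, anchors, hops=2):
    """Stage 2: bounded expansion along RSG edges, in both directions."""
    seen, frontier = set(anchors), set(anchors)
    for _ in range(hops):
        nxt = set()
        for n in frontier:
            nxt.update(g.successors(n))
            nxt.update(g.predecessors(n))
        frontier = nxt - seen
        seen.update(frontier)
    return seen
```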

The retrieval process then integrates a link-prediction step based on a graph neural network (GNN), which ranks the expanded contexts by their relevance to the query. This step is crucial for filtering out less relevant information, allowing the model to focus on the most informative parts of the codebase.
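The summary does not pin down the exact architecture, so the following is a hedged sketch of what GNN-based re-ranking could look like: one round of mean-neighbor aggregation (GraphSAGE-style) followed by a bilinear relevance score between the query embedding and each candidate node. Layer sizes and names are illustrative, not the authors' model.

```python
import torch
import torch.nn as nn

class LinkScorer(nn.Module):
    """Scores expanded RSG candidates against a query embedding."""
    def __init__(self, dim=256):
        super().__init__()
        self.agg = nn.Linear(2 * dim, dim)     # combines self + neighbor mean
        self.score = nn.Bilinear(dim, dim, 1)  # query-vs-candidate relevance

    def forward(self, x, adj, query):
        # x: (N, dim) node embeddings; adj: (N, N) 0/1 adjacency; query: (dim,)
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        neigh = adj @ x / deg                                   # mean over neighbors
        h = torch.relu(self.agg(torch.cat([x, neigh], dim=1)))  # one GNN round
        return self.score(h, query.expand_as(h)).squeeze(-1)    # higher = keep

# Usage: rank candidates and keep the top-scoring contexts for the prompt.
# order = LinkScorer()(x, adj, query).argsort(descending=True)
```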

Evaluation

The paper conducts an extensive evaluation on the RepoBench benchmark, showcasing RepoHyper's effectiveness over existing state-of-the-art methods.

  • In Context Retrieval (CR) tasks, RepoHyper significantly outperformed similarity-based methods, with improvements ranging from roughly 26% to 72% depending on the encoder used.
  • In End-to-End Code Completion (EECC), using LLMs such as GPT-3.5-Turbo and DeepSeek-Coder, RepoHyper achieved notable gains of +4.1 Exact Match (EM, illustrated below) and +5.6 CodeBLEU, reflecting its robustness in handling extensive context.
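
For reference, Exact Match is the stricter of the two metrics: a completion scores only when it matches the ground truth verbatim. A minimal illustration follows; the whitespace normalization here is our assumption, not necessarily RepoBench's exact scoring rule.

```python
def exact_match(predictions, references):
    """Percentage of completions identical to the reference
    after collapsing whitespace (normalization choice is ours)."""
    norm = lambda s: " ".join(s.split())
    hits = sum(norm(p) == norm(r) for p, r in zip(predictions, references))
    return 100.0 * hits / max(len(references), 1)
```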

Implementation and Results

The implementation uses Tree-sitter for parsing and PyCG for call-graph generation to construct accurate RSGs (a tooling sketch follows Figure 2). The evaluations indicate that the Pattern Search method is particularly adept at filtering irrelevant nodes, resulting in efficient graph traversal and node selection.

Figure 2: Overall architecture of RepoHyper. Here we use K = 1.
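
To ground the tooling, here is a minimal sketch of driving both components from Python. It assumes the classic py-tree-sitter bindings (the pre-0.22 Language.build_library / Parser API) and PyCG's command-line interface, and that PyCG prints its JSON call graph to stdout when no output file is given; the paths are placeholders, and this is not the authors' actual build script.

```python
import json
import subprocess
from tree_sitter import Language, Parser

# Compile and load a Python grammar (assumes a local tree-sitter-python checkout).
Language.build_library("build/langs.so", ["vendor/tree-sitter-python"])
PY_LANGUAGE = Language("build/langs.so", "python")

def top_level_defs(source: bytes):
    """Parse a file and return its top-level function/class definition nodes."""
    parser = Parser()
    parser.set_language(PY_LANGUAGE)
    tree = parser.parse(source)
    return [node for node in tree.root_node.children
            if node.type in ("function_definition", "class_definition")]

def call_graph(entry_point: str, package: str) -> dict:
    """Run PyCG on an entry point and parse its call-graph JSON."""
    result = subprocess.run(["pycg", entry_point, "--package", package],
                            capture_output=True, text=True, check=True)
    return json.loads(result.stdout)
```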

Figure 3: Sample ID 1430 in repository secdev/scapy.

Conclusion

RepoHyper marks a significant stride toward improving the precision of repository-level code completion. By encapsulating a comprehensive view of a repository's semantic structure in graph form and refining context retrieval through targeted expansion and link prediction, it opens new pathways for enhancing LLM-based code generators. Future work could explore adaptive expansion methods and cross-language adaptations, further broadening the framework's applicability across diverse programming environments.
