
RepoHyper: Search-Expand-Refine on Semantic Graphs for Repository-Level Code Completion (2403.06095v4)

Published 10 Mar 2024 in cs.SE and cs.AI

Abstract: Code LLMs (CodeLLMs) have demonstrated impressive proficiency in code completion tasks. However, they often fall short of fully understanding the extensive context of a project repository, such as the intricacies of relevant files and class hierarchies, which can result in less precise completions. To overcome these limitations, we present RepoHyper, a multifaceted framework designed to address the complex challenges associated with repository-level code completion. Central to RepoHyper is the Repo-level Semantic Graph (RSG), a novel semantic graph structure that encapsulates the vast context of code repositories. Furthermore, RepoHyper leverages an Expand-and-Refine retrieval method, combining graph expansion with a link-prediction algorithm applied to the RSG, enabling the effective retrieval and prioritization of relevant code snippets. Our evaluations show that RepoHyper markedly outperforms existing techniques in repository-level code completion, showcasing enhanced accuracy across various datasets when compared to several strong baselines. Our implementation of RepoHyper can be found at https://github.com/FSoft-AI4Code/RepoHyper.

References (35)
  1. Guiding language models of code with global context using monitors. arXiv preprint arXiv:2306.10763.
  2. Program synthesis with large language models. arXiv preprint arXiv:2108.07732.
  3. Codeplan: Repository-level coding using llms and planning. arXiv preprint arXiv:2309.12499.
  4. Codetf: One-stop transformer library for state-of-the-art code llm. arXiv preprint arXiv:2306.00029.
  5. Evaluating large language models trained on code.
  6. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595.
  7. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113.
  8. Tri Dao. 2023. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691.
  9. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359.
  10. Crosscodeeval: A diverse and multilingual benchmark for cross-file code completion. arXiv preprint arXiv:2310.11248.
  11. Cocomic: Code completion by jointly modeling in-file and cross-file context. arXiv preprint arXiv:2212.10007.
  12. Unixcoder: Unified cross-modal pre-training for code representation. arXiv preprint arXiv:2203.03850.
  13. Deepseek-coder: When the large language model meets programming – the rise of code intelligence.
  14. Inductive representation learning on large graphs. In NIPS.
  15. Measuring coding challenge competence with apps. arXiv preprint arXiv:2105.09938.
  16. Starcoder: may the source be with you!
  17. Context-aware code generation framework for code repositories: Local, global, and third-party library awareness. arXiv preprint arXiv:2312.05772.
  18. Repobench: Benchmarking repository-level code auto-completion systems.
  19. Repocoder: Repository-level code completion through cross-file context retrieval. arXiv preprint arXiv:2303.12570.
  20. Wizardcoder: Empowering code large language models with evol-instruct. arXiv preprint arXiv:2306.08568.
  21. OpenDialKG: Explainable conversational reasoning with attention-based walks over knowledge graphs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 845–854, Florence, Italy. Association for Computational Linguistics.
  22. Codegen2: Lessons for training llms on programming and natural languages. arXiv preprint arXiv:2305.02309.
  23. Codegen: An open large language model for code with multi-turn program synthesis.
  24. Carbon emissions and large neural network training.
  25. Train short, test long: Attention with linear biases enables input length extrapolation. arXiv preprint arXiv:2108.12409.
  26. Codebleu: a method for automatic evaluation of code synthesis. arXiv preprint arXiv:2009.10297.
  27. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950.
  28. Repofusion: Training code models to understand your repository.
  29. Repository-level prompt generation for large language models of code. In ICML 2022 Workshop on Knowledge Retrieval and Language Models.
  30. Repository-level prompt generation for large language models of code. In International Conference on Machine Learning, pages 31693–31715. PMLR.
  31. CodeT5+: Open code large language models for code understanding and generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1069–1088, Singapore. Association for Computational Linguistics.
  32. Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. arXiv preprint arXiv:2109.00859.
  33. Rlpg: A reinforcement learning based code completion system with graph-based context representation. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2021, page 1119–1129, New York, NY, USA. Association for Computing Machinery.
  34. Codeagent: Enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges. arXiv preprint arXiv:2401.07339.
  35. Codegeex: A pre-trained model for code generation with multilingual evaluations on humaneval-x.

Summary

  • The paper presents RepoHyper's search-expand-refine methodology that boosts context retrieval by leveraging repository-level semantic graphs.
  • It implements a two-stage process: an initial semantic similarity search followed by graph-based link prediction to refine context ranking.
  • Evaluation on RepoBench shows significant improvements in context retrieval and code completion metrics, with notable gains in Exact Match and CodeBLEU scores.

RepoHyper: Search-Expand-Refine on Semantic Graphs for Repository-Level Code Completion

Research in code completion has increasingly focused on leveraging the context of entire code repositories rather than isolated files. The paper "RepoHyper: Search-Expand-Refine on Semantic Graphs for Repository-Level Code Completion" tackles the complexities of repository-level completion with a novel framework, RepoHyper.

Introduction to RepoHyper

RepoHyper enhances context retrieval in large codebases through a Repo-level Semantic Graph (RSG). This graph encapsulates the semantic relations and dependencies within a repository, aiming to surpass traditional similarity-based search techniques, which often miss relevant context that is not locally or lexically similar to the query.

Key Components:

  1. Repo-level Semantic Graph (RSG): A structured representation that captures a repository's global context, with nodes for functions, classes, and script segments, and edges denoting import, invocation, ownership, encapsulation, and inheritance relationships (a representational sketch follows this list).
  2. Expand and Refine Strategy: A two-stage retrieval process: Search-then-Expand first broadens the scope beyond the initially similar contexts, and Link Prediction then re-ranks and refines the expanded contexts so that LLM-based code completion pipelines receive the most relevant information (Figure 1).

    Figure 1: Illustration of graph-based semantic search versus similarity-based search.
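
To make the RSG concrete, here is a minimal representational sketch in Python, assuming networkx is available. The node and edge taxonomy follows the paper's description, but the attribute names, edge-kind strings, and graph library choice are our illustrative assumptions, not the authors' implementation.

```python
import networkx as nx

# Node and edge kinds mirroring the taxonomy described above (names are ours).
NODE_KINDS = {"function", "class", "script_segment"}
EDGE_KINDS = {"import", "invoke", "own", "encapsulate", "inherit"}

def make_rsg() -> nx.MultiDiGraph:
    g = nx.MultiDiGraph()
    # Each node carries its source snippet plus a slot for a retrieval embedding.
    g.add_node("pkg.utils.parse", kind="function", code="def parse(s): ...", emb=None)
    g.add_node("pkg.models.Config", kind="class", code="class Config: ...", emb=None)
    g.add_node("pkg.main#0", kind="script_segment", code="cfg = Config()", emb=None)
    # Typed edges encode the repository-level relations.
    g.add_edge("pkg.main#0", "pkg.models.Config", kind="invoke")
    g.add_edge("pkg.main#0", "pkg.utils.parse", kind="import")
    return g
```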

Methodology

Repo-level Semantic Graph Construction

The RSG is the pivotal structure of the approach: it captures not only direct code-level semantics but also broadens contextual retrieval to related classes, functions, and their dependencies, thereby deepening repository-level understanding.

Search-then-Expand Strategy

The strategy begins with a semantic similarity search, using k-nearest-neighbor (kNN) retrieval to find initial context "anchors" within the graph. It then applies Exhaustive Search and Pattern Search methods to expand outward from these anchors, capturing additional context without overwhelming the model with noise; a minimal sketch of this stage follows.
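
As a rough illustration of this stage, the sketch below performs a kNN search over precomputed node embeddings and then a bounded breadth-first expansion along RSG edges. It assumes the networkx-style graph from the earlier sketch; the paper's Exhaustive Search and Pattern Search add structure-aware filtering on top of this basic idea, and all function names here are hypothetical.

```python
import numpy as np

def knn_anchors(g, query_emb, k=5):
    """Stage 1: kNN over node embeddings picks the initial anchor nodes."""
    nodes = [n for n, d in g.nodes(data=True) if d.get("emb") is not None]
    sims = np.array([float(np.dot(query_emb, g.nodes[n]["emb"])) for n in nodes])
    return [nodes[i] for i in np.argsort(-sims)[:k]]

def expand(g, anchors, hops=2):
    """Stage 2: bounded expansion along RSG edges, in both directions."""
    seen, frontier = set(anchors), set(anchors)
    for _ in range(hops):
        nxt = set()
        for n in frontier:
            nxt.update(g.successors(n))
            nxt.update(g.predecessors(n))
        frontier = nxt - seen
        seen.update(frontier)
    return seen
```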

The retrieval process then integrates a link-prediction step based on a graph neural network (GNN), which ranks the expanded contexts by their relevance to the query. This step is crucial for filtering out less relevant information, allowing the model to focus on the most informative parts of the codebase.
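The summary does not pin down the exact architecture, so the following is a hedged sketch of what GNN-based re-ranking could look like: one round of mean-neighbor aggregation (GraphSAGE-style) followed by a bilinear relevance score between the query embedding and each candidate node. Layer sizes and names are illustrative, not the authors' model.

```python
import torch
import torch.nn as nn

class LinkScorer(nn.Module):
    """Scores expanded RSG candidates against a query embedding."""
    def __init__(self, dim=256):
        super().__init__()
        self.agg = nn.Linear(2 * dim, dim)     # combines self + neighbor mean
        self.score = nn.Bilinear(dim, dim, 1)  # query-vs-candidate relevance

    def forward(self, x, adj, query):
        # x: (N, dim) node embeddings; adj: (N, N) 0/1 adjacency; query: (dim,)
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        neigh = adj @ x / deg                                   # mean over neighbors
        h = torch.relu(self.agg(torch.cat([x, neigh], dim=1)))  # one GNN round
        return self.score(h, query.expand_as(h)).squeeze(-1)    # higher = keep

# Usage: rank candidates and keep the top-scoring contexts for the prompt.
# order = LinkScorer()(x, adj, query).argsort(descending=True)
```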

Evaluation

The paper conducts an extensive evaluation on the RepoBench benchmark, showcasing RepoHyper's effectiveness over existing state-of-the-art methods.

  • In Context Retrieval (CR) tasks, RepoHyper significantly outperformed similarity-based methods, with improvements ranging from roughly 26% to 72% depending on the encoder used.
  • In End-to-End Code Completion (EECC), using LLMs such as GPT-3.5-Turbo and DeepSeek-Coder, RepoHyper achieved notable gains of +4.1 Exact Match (EM, illustrated below) and +5.6 CodeBLEU, reflecting its robustness in handling extensive context.
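
For reference, Exact Match is the stricter of the two metrics: a completion scores only when it matches the ground truth verbatim. A minimal illustration follows; the whitespace normalization here is our assumption, not necessarily RepoBench's exact scoring rule.

```python
def exact_match(predictions, references):
    """Percentage of completions identical to the reference
    after collapsing whitespace (normalization choice is ours)."""
    norm = lambda s: " ".join(s.split())
    hits = sum(norm(p) == norm(r) for p, r in zip(predictions, references))
    return 100.0 * hits / max(len(references), 1)
```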

Implementation and Results

The implementation uses Tree-sitter for parsing and PyCG for call-graph generation to construct accurate RSGs (a tooling sketch follows Figure 2). The evaluations indicate that the Pattern Search method is particularly adept at filtering irrelevant nodes, resulting in efficient graph traversal and node selection.

Figure 2: Overall architecture of RepoHyper. Here we use K = 1.
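
To ground the tooling, here is a minimal sketch of driving both components from Python. It assumes the classic py-tree-sitter bindings (the pre-0.22 Language.build_library / Parser API) and PyCG's command-line interface, and that PyCG prints its JSON call graph to stdout when no output file is given; the paths are placeholders, and this is not the authors' actual build script.

```python
import json
import subprocess
from tree_sitter import Language, Parser

# Compile and load a Python grammar (assumes a local tree-sitter-python checkout).
Language.build_library("build/langs.so", ["vendor/tree-sitter-python"])
PY_LANGUAGE = Language("build/langs.so", "python")

def top_level_defs(source: bytes):
    """Parse a file and return its top-level function/class definition nodes."""
    parser = Parser()
    parser.set_language(PY_LANGUAGE)
    tree = parser.parse(source)
    return [node for node in tree.root_node.children
            if node.type in ("function_definition", "class_definition")]

def call_graph(entry_point: str, package: str) -> dict:
    """Run PyCG on an entry point and parse its call-graph JSON."""
    result = subprocess.run(["pycg", entry_point, "--package", package],
                            capture_output=True, text=True, check=True)
    return json.loads(result.stdout)
```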

Figure 3: Sample ID 1430 in repository secdev/scapy.

Conclusion

RepoHyper marks a significant stride toward improving the precision of repository-level code completion. By encapsulating a comprehensive view of a repository's semantic structure in graph form and refining context retrieval through targeted expansion and link prediction, it opens new pathways for enhancing LLM-based code generators. Future work could explore adaptive expansion methods and cross-language adaptations, further broadening the framework's applicability across diverse programming environments.
