RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation (2303.12570v3)
Abstract: Repository-level code completion is the task of continuing to write unfinished code based on the broader context of a repository. However, it is difficult for automated code completion tools to utilize the useful information scattered across different files. We propose RepoCoder, a simple, generic, and effective framework to address this challenge. It streamlines the repository-level code completion process by combining a similarity-based retriever and a pre-trained code LLM in an iterative retrieval-generation pipeline. RepoCoder makes effective use of repository-level information for code completion and can generate code at various levels of granularity. Moreover, we propose RepoEval, a new benchmark consisting of the latest high-quality real-world repositories and covering line, API invocation, and function body completion scenarios. Experimental results show that RepoCoder significantly improves the In-File completion baseline by over 10% in all settings and consistently outperforms the vanilla retrieval-augmented code completion approach. Furthermore, we validate the effectiveness of RepoCoder through comprehensive analysis, providing valuable insights for future research. Our source code and benchmark are publicly available: https://github.com/microsoft/CodeT/tree/main/RepoCoder
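
To make the pipeline the abstract describes concrete, here is a minimal Python sketch of an iterative retrieval-generation loop: repository files are chunked into fixed-size sliding windows, candidate snippets are ranked by Jaccard similarity against the unfinished code, and each round's generated draft is appended to the retrieval query for the next round. The window and stride sizes, the `generate` callback, and all helper names are illustrative assumptions, not the reference implementation.

```python
# Minimal sketch of an iterative retrieval-generation loop in the spirit of
# RepoCoder. All names and parameters here (generate, window/stride sizes,
# prompt layout) are illustrative assumptions, not the official code.

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two token sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def tokenize(code: str) -> set:
    """Crude whitespace tokenizer; a real retriever would use a code tokenizer."""
    return set(code.split())

def sliding_windows(lines: list, window: int = 20, stride: int = 10):
    """Yield fixed-size, overlapping windows of source lines."""
    for start in range(0, max(len(lines) - window, 0) + 1, stride):
        yield "\n".join(lines[start:start + window])

def retrieve(query: str, repo_files: dict, k: int = 3):
    """Rank all code windows in the repository by similarity to the query."""
    q = tokenize(query)
    candidates = []
    for path, text in repo_files.items():
        for win in sliding_windows(text.splitlines()):
            candidates.append((jaccard(q, tokenize(win)), path, win))
    candidates.sort(key=lambda c: c[0], reverse=True)
    return candidates[:k]

def repocoder_complete(unfinished_code: str, repo_files: dict,
                       generate, iterations: int = 2) -> str:
    """Iterate retrieval and generation; each draft refines the next query."""
    query = unfinished_code
    completion = ""
    for _ in range(iterations):
        snippets = retrieve(query, repo_files)
        prompt = "\n".join(f"# from {path}\n{win}" for _, path, win in snippets)
        prompt += "\n" + unfinished_code
        completion = generate(prompt)  # any pre-trained code LLM call
        # Augment the retrieval query with the draft so the next round can
        # find snippets related to identifiers the model just predicted.
        query = unfinished_code + "\n" + completion
    return completion
```

The second iteration is where this sketch earns its keep: the first draft may mention an identifier (e.g., an API defined in another file) that the raw unfinished code never names, so the augmented query can retrieve that definition and ground the final completion in it.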