Contextual API Completion for Unseen Repositories Using LLMs (2405.04600v3)
Abstract: LLMs have made substantial progress on diverse code-related tasks, but their adoption is hindered by inconsistent outputs caused by a lack of real-world, domain-specific information, such as intra-repository API calls in unseen software projects. We introduce a novel technique that mitigates hallucinations by leveraging global and local contextual information within a code repository for API completion tasks. Our approach refines code completion with a focus on local API completions. During API completion, we examine the relevant import statements to derive insights into local APIs from their method signatures. For API token completion, we analyze inline variables and correlate them with the appropriate imported modules, allowing our approach to rank the most contextually relevant suggestions from the available local APIs. For conversational API completion, we gather the APIs most relevant to the developer query through a retrieval-based search across the project. We evaluate our tool, LANCE, on our proposed benchmark, APIEval, which covers two programming languages. Our evaluation yields an average accuracy of 82.6% for API token completion and 76.9% for conversational API completion. On average, LANCE surpasses Copilot by 143% and 142% for API token completion and conversational API completion, respectively. Our findings have substantial implications for developers: our lightweight context analysis applies to multilingual environments without language-specific training or fine-tuning, enabling efficient implementation with minimal examples and effort.
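The abstract only sketches the pipeline, so the following is a minimal, hypothetical Python illustration of the import-analysis idea it describes: collect intra-repository imports, harvest method signatures from the imported local modules, and surface those signatures as ranked completion candidates. None of the function names here (`local_imports`, `method_signatures`, `rank_candidates`) come from LANCE itself, and the naive alias lookup stands in for the paper's richer variable-to-module correlation.

```python
import ast
from pathlib import Path


def local_imports(source: str) -> dict[str, str]:
    """Map each imported name/alias in a file to the module it comes from."""
    aliases: dict[str, str] = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom) and node.module:
            for name in node.names:
                aliases[name.asname or name.name] = node.module
        elif isinstance(node, ast.Import):
            for name in node.names:
                aliases[name.asname or name.name] = name.name
    return aliases


def method_signatures(module_path: Path) -> list[str]:
    """Harvest 'name(arg, ...)' signatures defined in a local module."""
    sigs = []
    for node in ast.walk(ast.parse(module_path.read_text())):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            sigs.append(f"{node.name}({args})")
    return sigs


def rank_candidates(token: str, source: str, repo_root: Path) -> list[str]:
    """Suggest local-API signatures for the token being completed.

    Naive stand-in for the paper's variable-to-module correlation: we
    assume the token under the cursor is itself an imported alias and
    return the signatures of the module it refers to.
    """
    module = local_imports(source).get(token)
    if module is None:
        return []
    # Resolve a dotted module name to a source file inside the repository.
    path = repo_root / (module.replace(".", "/") + ".py")
    return method_signatures(path) if path.exists() else []
```

In a completion setting, such ranked signatures would be injected into the model's prompt so that suggestions are grounded in APIs that actually exist in the repository rather than hallucinated ones.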
- Noor Nashid
- Taha Shabani
- Parsa Alian
- Ali Mesbah