Can It Edit? Evaluating the Ability of Large Language Models to Follow Code Editing Instructions (2312.12450v6)
Abstract: A significant amount of research is focused on developing and evaluating LLMs for a variety of code synthesis tasks. These include synthesizing code from natural language, synthesizing tests from code, and synthesizing explanations of code. In contrast, the behavior of instructional code editing with LLMs is understudied. These are tasks in which the model is given a block of code and an instruction to modify it. The editing instruction may ask for a feature to be added or removed, describe a bug and ask for a fix, or ask for a different kind of solution. We introduce a carefully crafted benchmark of code editing tasks and use it to evaluate several cutting-edge LLMs. Our evaluation exposes a significant gap between the capabilities of state-of-the-art open and closed models. For example, even GPT-3.5-Turbo is better than the best open model at code editing tasks. We also introduce a new, carefully curated, permissively licensed training dataset of code editing tasks coupled with natural language instructions. Using this training dataset, we show that we can fine-tune open Code LLMs to significantly improve their code editing capabilities, closing the gap between open and closed models. All code, data, and models are available at https://github.com/nuprl/CanItEdit.
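To make the task format concrete, here is a minimal illustrative sketch of an instructional code editing task. The function, instruction, and tests below are hypothetical, not drawn from CanItEdit: the model receives a program and a natural language instruction, and must produce the revised program, which is then checked against hidden tests.

```python
# Hypothetical instructional code editing task (illustrative only; not an
# actual CanItEdit problem).

# --- Code given to the model ---
def mean(numbers):
    return sum(numbers) / len(numbers)

# --- Editing instruction given to the model ---
# "mean crashes with ZeroDivisionError on an empty list.
#  Fix it to return 0.0 in that case."

# --- One possible edited program the model might produce ---
def mean_fixed(numbers):
    if not numbers:  # guard against the empty-list case named in the instruction
        return 0.0
    return sum(numbers) / len(numbers)

# --- Hidden tests, as a benchmark of this kind might check correctness ---
assert mean_fixed([]) == 0.0
assert mean_fixed([2, 4, 6]) == 4.0
```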