OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement (2402.14658v2)
Abstract: The introduction of LLMs has significantly advanced code generation. However, open-source models often lack the execution capabilities and iterative refinement of advanced systems like the GPT-4 Code Interpreter. To address this, we introduce OpenCodeInterpreter, a family of open-source code systems designed for generating, executing, and iteratively refining code. Supported by Code-Feedback, a dataset featuring 68K multi-turn interactions, OpenCodeInterpreter integrates execution and human feedback for dynamic code refinement. Our comprehensive evaluation of OpenCodeInterpreter across key benchmarks such as HumanEval, MBPP, and their enhanced versions from EvalPlus reveals its exceptional performance. Notably, OpenCodeInterpreter-33B achieves an average accuracy of 83.2 (76.4 on the plus versions) across HumanEval and MBPP, closely rivaling GPT-4's 84.2 (76.2), and further improves to 91.6 (84.6) with synthesized human feedback from GPT-4. OpenCodeInterpreter bridges the gap between open-source code generation models and proprietary systems like the GPT-4 Code Interpreter.
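The abstract describes a generate, execute, and refine loop in which execution results are fed back to the model as additional conversational turns. The Python sketch below is a rough illustration of that pattern only, not code from the paper: the `generate` callable, the chat-history format, and the turn limit are illustrative assumptions, with `generate` standing in for a call to an OpenCodeInterpreter-style model behind whatever serving stack you use.

```python
import os
import subprocess
import sys
import tempfile
from typing import Callable


def execute(code: str, timeout: int = 10) -> tuple[bool, str]:
    """Run candidate Python code in a subprocess and return (passed, diagnostics)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, text=True, timeout=timeout
        )
        return result.returncode == 0, result.stdout + result.stderr
    except subprocess.TimeoutExpired:
        return False, "execution timed out"
    finally:
        os.unlink(path)


def refine_loop(task: str, generate: Callable[[list[dict]], str], max_turns: int = 3) -> str:
    """Generate code, run it, and feed execution diagnostics back as a new user turn.

    `generate` is a hypothetical placeholder: it maps a chat history to a code string,
    e.g. by calling a code LLM. The loop stops once execution succeeds or the turn
    budget is exhausted.
    """
    history = [{"role": "user", "content": task}]
    code = generate(history)
    for _ in range(max_turns):
        passed, feedback = execute(code)
        if passed:
            break
        history += [
            {"role": "assistant", "content": code},
            {"role": "user", "content": f"Execution feedback:\n{feedback}\nPlease fix the code."},
        ]
        code = generate(history)
    return code
```

In practice the feedback channel would also carry test results or human comments, as in the multi-turn Code-Feedback interactions described in the abstract.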
Authors: Tianyu Zheng, Ge Zhang, Tianhao Shen, Xueling Liu, Bill Yuchen Lin, Jie Fu, Wenhu Chen, Xiang Yue