The introduction of LLMs has significantly advanced code generation. However, open-source models often lack the execution capabilities and iterative refinement of advanced systems such as the GPT-4 Code Interpreter. To address this, we introduce OpenCodeInterpreter, a family of open-source code systems designed for generating, executing, and iteratively refining code. Supported by Code-Feedback, a dataset featuring 68K multi-turn interactions, OpenCodeInterpreter integrates execution and human feedback for dynamic code refinement. Our comprehensive evaluation of OpenCodeInterpreter across key benchmarks such as HumanEval, MBPP, and their enhanced versions from EvalPlus reveals its exceptional performance. Notably, OpenCodeInterpreter-33B achieves an average accuracy of 83.2 on HumanEval and MBPP (76.4 on their plus versions), closely rivaling GPT-4's 84.2 (76.2); with synthesized human feedback from GPT-4, accuracy further rises to 91.6 (84.6). OpenCodeInterpreter bridges the gap between open-source code generation models and proprietary systems like the GPT-4 Code Interpreter.
OpenCodeInterpreter is an open-source code generation system that integrates code generation with execution and iterative refinement, significantly advancing the capabilities of open-source models to match proprietary systems like GPT-4 Code Interpreter.
The system leverages a richly constructed dataset called Code-Feedback, comprising 68,000 multi-turn interactions that include execution feedback and human feedback, to train effective and adaptable LLMs for coding tasks.
Experimental evaluations show that OpenCodeInterpreter achieves competitive results with proprietary models, particularly in multi-turn code generation tasks, underscoring the importance of iterative feedback and execution in improving code accuracy and reliability.
The paper introduces OpenCodeInterpreter, an open-source code generation system designed to integrate code generation with execution and iterative refinement. A significant challenge in code generation has been the disparity between proprietary systems, such as the GPT-4 Code Interpreter, and open-source models, which generally lack the same level of execution capabilities and dynamic refinement through feedback. OpenCodeInterpreter targets this gap by building on a richly constructed dataset called Code-Feedback, consisting of 68,000 multi-turn interactions that include both execution feedback and human feedback.
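The generate-execute-refine mechanism described above can be sketched as a simple loop: run a candidate program, and if it fails, feed the diagnostics back into the next prompt. This is a minimal illustration, not the paper's actual implementation; the `generate` callable and the prompt wording are hypothetical stand-ins for the model.

```python
import subprocess
import sys


def run_code(code: str, timeout: int = 10):
    """Execute a code snippet in a subprocess; return (success, output-or-error)."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=timeout,
    )
    ok = proc.returncode == 0
    return ok, proc.stdout if ok else proc.stderr


def refine_loop(generate, task: str, max_turns: int = 3):
    """Generate code, execute it, and loop diagnostics back until it runs cleanly.

    `generate` is a stand-in for the model: it maps a prompt string to code.
    """
    prompt = task
    code = ""
    for _ in range(max_turns):
        code = generate(prompt)
        ok, feedback = run_code(code)
        if ok:
            return code
        # Append execution diagnostics so the next turn can repair the code.
        prompt = f"{task}\n\nPrevious attempt failed with:\n{feedback}\nPlease fix it."
    return code
```

In the real system the refinement signal can also come from a human (or GPT-4 simulating one); the loop structure is the same, with natural-language critiques taking the place of the traceback.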
The design of OpenCodeInterpreter hinges on three core components: generating candidate code, executing that code, and iteratively refining it in response to execution and human feedback.
The Code-Feedback dataset is critical to OpenCodeInterpreter's success. It comprises a mix of queries sourced from open-source datasets and coding challenges from LeetCode, processed to ensure a diverse and challenging collection of tasks. Noteworthy elements include multi-turn interactions enriched with execution feedback and with human feedback, whether genuine or simulated.
These methods ensure that Code-Feedback not only covers a wide array of coding challenges but also fosters robust engagement with execution and human feedback.
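To make the data format concrete, a Code-Feedback-style training sample can be pictured as a conversation that interleaves model turns with feedback turns. The record layout below is a hypothetical sketch (field names and role labels are illustrative, not the dataset's actual schema):

```python
# A hypothetical multi-turn record in the spirit of Code-Feedback:
# an initial buggy attempt, execution feedback, a fix, then human feedback.
record = {
    "source": "leetcode",  # or an open-source query pool
    "messages": [
        {"role": "user",
         "content": "Write a function that reverses a string."},
        {"role": "assistant",
         "content": "def rev(s):\n    return s[::-1"},  # unclosed bracket
        {"role": "execution_feedback",
         "content": "SyntaxError: '[' was never closed"},
        {"role": "assistant",
         "content": "def rev(s):\n    return s[::-1]"},
        {"role": "human_feedback",
         "content": "Works, but please add a docstring."},
        {"role": "assistant",
         "content": 'def rev(s):\n    """Return s reversed."""\n    return s[::-1]'},
    ],
}
```

Training on trajectories of this shape is what teaches the model to treat tracebacks and critiques as signals to revise, rather than as terminal failures.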
The paper presents a thorough evaluation of OpenCodeInterpreter against established benchmarks such as HumanEval, MBPP, and their enhanced versions from EvalPlus, demonstrating its performance across model scales (7B, 13B, 34B, and 70B parameters). Remarkably, OpenCodeInterpreter-33B achieves an average accuracy of 83.2 on HumanEval and MBPP (76.4 on the plus versions), closely matching GPT-4's 84.2 (76.2). With synthesized human feedback from GPT-4, this performance rises further to 91.6 (84.6), showcasing the efficacy of iterative refinement and feedback integration.
Single-Turn Code Generation: OpenCodeInterpreter significantly outperforms other open-source models, achieving results on par with or surpassing proprietary models. This is evident across different scales and configurations, highlighting its robustness.
Multi-Turn Code Generation: When evaluated on multi-turn tasks involving execution feedback and synthetic human feedback, OpenCodeInterpreter demonstrates superior refinement capabilities. It consistently performs better than leading models in incorporating iterative feedback to correct and enhance code functionality.
The research highlights several key implications, most notably that execution feedback and iterative, multi-turn refinement are central to closing the gap between open-source and proprietary code generation systems.
Future progress in AI-driven code generation could build on this feedback-driven approach, with richer feedback sources and deeper integration of execution into training.
OpenCodeInterpreter marks a significant advancement in integrating code generation with execution and refinement. By effectively leveraging multi-turn interactions through execution and human feedback, OpenCodeInterpreter narrows the gap between open-source and proprietary systems. This approach not only elevates performance standards but also sets a new precedent for future research and development in the field of automated code generation.