OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement

(2402.14658)
Published Feb 22, 2024 in cs.SE, cs.AI, and cs.CL

Abstract

The introduction of LLMs has significantly advanced code generation. However, open-source models often lack the execution capabilities and iterative refinement of advanced systems like the GPT-4 Code Interpreter. To address this, we introduce OpenCodeInterpreter, a family of open-source code systems designed for generating, executing, and iteratively refining code. Supported by Code-Feedback, a dataset featuring 68K multi-turn interactions, OpenCodeInterpreter integrates execution and human feedback for dynamic code refinement. Our comprehensive evaluation of OpenCodeInterpreter across key benchmarks such as HumanEval, MBPP, and their enhanced versions from EvalPlus reveals its exceptional performance. Notably, OpenCodeInterpreter-33B achieves an accuracy of 83.2 (76.4) on the average (and plus versions) of HumanEval and MBPP, closely rivaling GPT-4's 84.2 (76.2), and further elevates to 91.6 (84.6) with synthesized human feedback from GPT-4. OpenCodeInterpreter bridges the gap between open-source code generation models and proprietary systems like GPT-4 Code Interpreter.

Figure: OpenCodeInterpreter's pass@1 accuracy on HumanEval, comparable to GPT-4 with feedback.

Overview

  • OpenCodeInterpreter is an open-source code generation system that integrates code generation with execution and iterative refinement, significantly advancing the capabilities of open-source models to match proprietary systems like GPT-4 Code Interpreter.

  • The system leverages a richly constructed dataset called Code-Feedback, comprising 68,000 multi-turn interactions that include execution feedback and human feedback, to train effective and adaptable LLMs for coding tasks.

  • Experimental evaluations show that OpenCodeInterpreter achieves competitive results with proprietary models, particularly in multi-turn code generation tasks, underscoring the importance of iterative feedback and execution in improving code accuracy and reliability.

Introduction

The paper introduces OpenCodeInterpreter, an open-source code generation system designed to integrate code generation with execution and iterative refinement. A significant challenge within code generation has been the disparity between proprietary systems, such as the GPT-4 Code Interpreter, and open-source models, which generally lack the same level of execution capabilities and dynamic refinement through feedback. OpenCodeInterpreter targets this gap by building on a richly constructed dataset called Code-Feedback, consisting of 68,000 multi-turn interactions which include elements of both execution feedback and human feedback.

Methodology

The design of OpenCodeInterpreter hinges on three core components (a minimal code sketch of the resulting loop follows the list):

  1. Code Generation: Utilizing pre-trained LLMs on extensive code-centric datasets.
  2. Execution: Directly running generated code to gather diagnostic feedback.
  3. Iterative Refinement: Incorporating multi-turn interactions to refine code iteratively based on feedback.
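
Taken together, these components form a generate-execute-refine loop. The sketch below illustrates one plausible realization of such a loop; the `model.generate` interface, the function names, and the turn limit are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of a generate-execute-refine loop in the spirit of
# OpenCodeInterpreter. The `model.generate` interface is a hypothetical
# placeholder, not the paper's actual API.
import subprocess
import sys
import tempfile

def run_sandboxed(code: str, timeout: int = 10) -> tuple[bool, str]:
    """Execute candidate code in a subprocess and capture diagnostics."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, text=True, timeout=timeout
        )
        return result.returncode == 0, result.stdout + result.stderr
    except subprocess.TimeoutExpired:
        return False, "execution timed out"

def generate_and_refine(task: str, model, max_turns: int = 3) -> str:
    """Generate code, execute it, and feed diagnostics back for revision."""
    code = model.generate(task)  # hypothetical model interface
    for _ in range(max_turns):
        ok, feedback = run_sandboxed(code)
        if ok:
            break
        # Append execution feedback as a new dialogue turn and ask for a fix.
        code = model.generate(
            f"{task}\n\nPrevious attempt:\n{code}\n\n"
            f"Execution feedback:\n{feedback}\n\nPlease fix the code."
        )
    return code
```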

Dataset Construction: Code-Feedback

The Code-Feedback dataset is critical to OpenCodeInterpreter's success. It comprises a mix of queries sourced from open-source datasets and coding challenges from LeetCode, processed to ensure a diverse and challenging collection of tasks. Noteworthy elements include:

  • Multi-Turn Dialogues: Structured interactions that simulate real-world coding scenarios, focusing on iterative feedback.
  • Diverse Data Collection Methods: Encompassing single-turn packing (sketched in code below), interaction simulation, code correction, LeetCode similar problems, and LeetCode follow-up questions.

These methods ensure that Code-Feedback not only covers a wide array of coding challenges but also fosters robust engagement with execution and human feedback.
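
To make the first of these concrete, the sketch below shows one plausible form of single-turn packing: merging similar single-turn (query, response) pairs into multi-turn dialogues. The Jaccard word-overlap measure and the threshold are stand-ins for illustration; the paper's actual similarity criterion may differ.

```python
# Illustrative sketch of "single-turn packing": greedily grouping similar
# single-turn (query, response) pairs into multi-turn dialogues. Jaccard
# word overlap stands in for a semantic-similarity measure, and the
# threshold is an assumption, not the paper's setting.

def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two queries."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / max(len(sa | sb), 1)

def pack_single_turns(pairs, threshold=0.3, max_turns=4):
    """Pack (query, response) pairs into dialogues of similar queries."""
    dialogues = []
    for query, response in pairs:
        for dlg in dialogues:
            # Attach to an existing dialogue whose first query is similar.
            if len(dlg) < max_turns and jaccard(dlg[0][0], query) >= threshold:
                dlg.append((query, response))
                break
        else:
            # No similar dialogue found; start a new one.
            dialogues.append([(query, response)])
    return dialogues
```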

Experimental Evaluation

The paper presents a thorough evaluation of OpenCodeInterpreter against established benchmarks such as HumanEval, MBPP, and their enhanced versions from EvalPlus, demonstrating its performance across model scales (7B, 13B, 33B, 34B, and 70B parameters). Remarkably, OpenCodeInterpreter-33B achieves an average accuracy of 83.2 (76.4 for the plus versions) on HumanEval and MBPP, closely matching GPT-4's 84.2 (76.2). With synthesized human feedback from GPT-4, this performance rises further to 91.6 (84.6), showcasing the efficacy of iterative refinement and feedback integration.
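
For context, HumanEval, MBPP, and EvalPlus all score models by functional correctness, typically reported as pass@1: the fraction of problems whose first generated solution passes the benchmark's tests. The sketch below is a simplified stand-in for such a harness, not the official EvalPlus evaluator; running untrusted generations with `exec` is only acceptable inside a sandbox.

```python
# Simplified stand-in for pass@1 scoring in the style of HumanEval/EvalPlus.
# A completion passes if the benchmark's test code runs without raising.
# This is not the official EvalPlus harness; sandbox untrusted code.
def pass_at_1(problems, generate) -> float:
    """problems: iterable of (prompt, test_code); generate: prompt -> completion."""
    problems = list(problems)
    passed = 0
    for prompt, test_code in problems:
        program = prompt + generate(prompt) + "\n\n" + test_code
        try:
            exec(program, {"__name__": "__main__"})  # runs the tests in-process
            passed += 1
        except Exception:
            pass  # any failure (syntax error, assertion, crash) counts as a miss
    return passed / len(problems)
```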

Results

Single-Turn Code Generation: OpenCodeInterpreter significantly outperforms other open-source models, achieving results on par with or surpassing proprietary models. This is evident across different scales and configurations, highlighting its robustness.

Multi-Turn Code Generation: When evaluated on multi-turn tasks involving execution feedback and synthetic human feedback, OpenCodeInterpreter demonstrates superior refinement capabilities. It consistently performs better than leading models in incorporating iterative feedback to correct and enhance code functionality.
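
The "synthesized human feedback" setting uses GPT-4 to stand in for a human reviewer. The sketch below shows one way such feedback might be generated with the OpenAI chat API; the prompt wording is an illustrative assumption rather than the paper's exact prompt.

```python
# Hedged sketch of "synthesized human feedback": an oracle model (GPT-4)
# critiques a failing attempt in natural language, and the critique is fed
# back to the code model as if it were a human turn. The prompt wording is
# an illustrative assumption; the paper's exact prompts may differ.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def synthesize_feedback(task: str, code: str, error: str) -> str:
    """Ask the oracle model for a concise, human-style critique of an attempt."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                f"Task:\n{task}\n\nAttempted solution:\n{code}\n\n"
                f"Execution error:\n{error}\n\n"
                "As a human reviewer, explain in two or three sentences what "
                "is wrong and how to fix it, without writing the corrected code."
            ),
        }],
    )
    return response.choices[0].message.content
```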

Discussion

The research highlights several key implications:

  • Integrated Execution and Feedback: The dynamic incorporation of execution results and human feedback into the code generation process bridges a notable gap between open-source and proprietary models.
  • Model Adaptability: Iterative refinement allows for better handling of complex and ambiguous user intents, improving overall code accuracy and reliability.
  • Scalability of Training Data: The construction and utilization of Code-Feedback emphasize the importance of diverse, multi-turn interactions to train more effective code generation models.

Future Developments

Future progress in AI-driven code generation could see:

  • Enhanced Feedback Mechanisms: Incorporating more sophisticated diagnostic tools and nuanced human feedback to further boost model performance.
  • Cross-Domain Applications: Expanding code generation capabilities into other domains such as data science or low-level programming, leveraging the adaptability of models like OpenCodeInterpreter.
  • Community Contributions: Engaging the open-source community to expand and refine datasets, continually improving the robustness and applicability of models.

Conclusion

OpenCodeInterpreter marks a significant advancement in integrating code generation with execution and refinement. By effectively leveraging multi-turn interactions through execution and human feedback, OpenCodeInterpreter narrows the gap between open-source and proprietary systems. This approach not only elevates performance standards but also sets a new precedent for future research and development in the field of automated code generation.
