The introduction of LLMs has significantly advanced code generation. However, open-source models often lack the execution capabilities and iterative refinement of advanced systems such as the GPT-4 Code Interpreter. To address this, we introduce OpenCodeInterpreter, a family of open-source code systems designed for generating, executing, and iteratively refining code. Supported by Code-Feedback, a dataset featuring 68K multi-turn interactions, OpenCodeInterpreter integrates execution and human feedback for dynamic code refinement. Our comprehensive evaluation of OpenCodeInterpreter across key benchmarks such as HumanEval, MBPP, and their enhanced versions from EvalPlus reveals its exceptional performance. Notably, OpenCodeInterpreter-33B achieves an average accuracy of 83.2 on HumanEval and MBPP (76.4 on their plus versions), closely rivaling GPT-4's 84.2 (76.2); with synthesized human feedback from GPT-4, accuracy further rises to 91.6 (84.6). OpenCodeInterpreter bridges the gap between open-source code generation models and proprietary systems like the GPT-4 Code Interpreter.
OpenCodeInterpreter is an open-source code generation system that integrates code generation with execution and iterative refinement, significantly advancing the capabilities of open-source models to match proprietary systems like GPT-4 Code Interpreter.
The system leverages a richly constructed dataset called Code-Feedback, comprising 68,000 multi-turn interactions that include execution feedback and human feedback, to train effective and adaptable LLMs for coding tasks.
Experimental evaluations show that OpenCodeInterpreter achieves competitive results with proprietary models, particularly in multi-turn code generation tasks, underscoring the importance of iterative feedback and execution in improving code accuracy and reliability.
The paper introduces OpenCodeInterpreter, an open-source code generation system designed to integrate code generation with execution and iterative refinement. A significant challenge in code generation has been the disparity between proprietary systems, such as the GPT-4 Code Interpreter, and open-source models, which generally lack the same level of execution capabilities and dynamic refinement through feedback. OpenCodeInterpreter targets this gap by building on a richly constructed dataset called Code-Feedback, consisting of 68,000 multi-turn interactions that include both execution feedback and human feedback.
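The generate-execute-refine mechanism described above can be sketched as a simple loop: run a candidate program, and if it fails, feed the diagnostics back into the next prompt. This is a minimal illustration, not the paper's actual implementation; the `generate` callable and the prompt wording are hypothetical stand-ins for the model.

```python
import subprocess
import sys


def run_code(code: str, timeout: int = 10):
    """Execute a code snippet in a subprocess; return (success, output-or-error)."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=timeout,
    )
    ok = proc.returncode == 0
    return ok, proc.stdout if ok else proc.stderr


def refine_loop(generate, task: str, max_turns: int = 3):
    """Generate code, execute it, and loop diagnostics back until it runs cleanly.

    `generate` is a stand-in for the model: it maps a prompt string to code.
    """
    prompt = task
    code = ""
    for _ in range(max_turns):
        code = generate(prompt)
        ok, feedback = run_code(code)
        if ok:
            return code
        # Append execution diagnostics so the next turn can repair the code.
        prompt = f"{task}\n\nPrevious attempt failed with:\n{feedback}\nPlease fix it."
    return code
```

In the real system the refinement signal can also come from a human (or GPT-4 simulating one); the loop structure is the same, with natural-language critiques taking the place of the traceback.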
The design of OpenCodeInterpreter hinges on three core components: generating candidate code, executing that code, and iteratively refining it in response to execution and human feedback.
The Code-Feedback dataset is critical to OpenCodeInterpreter's success. It comprises a mix of queries sourced from open-source datasets and coding challenges from LeetCode, processed to ensure a diverse and challenging collection of tasks. Noteworthy elements include multi-turn interactions enriched with execution feedback and with human feedback, whether genuine or simulated.
These methods ensure that Code-Feedback not only covers a wide array of coding challenges but also fosters robust engagement with execution and human feedback.
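To make the data format concrete, a Code-Feedback-style training sample can be pictured as a conversation that interleaves model turns with feedback turns. The record layout below is a hypothetical sketch (field names and role labels are illustrative, not the dataset's actual schema):

```python
# A hypothetical multi-turn record in the spirit of Code-Feedback:
# an initial buggy attempt, execution feedback, a fix, then human feedback.
record = {
    "source": "leetcode",  # or an open-source query pool
    "messages": [
        {"role": "user",
         "content": "Write a function that reverses a string."},
        {"role": "assistant",
         "content": "def rev(s):\n    return s[::-1"},  # unclosed bracket
        {"role": "execution_feedback",
         "content": "SyntaxError: '[' was never closed"},
        {"role": "assistant",
         "content": "def rev(s):\n    return s[::-1]"},
        {"role": "human_feedback",
         "content": "Works, but please add a docstring."},
        {"role": "assistant",
         "content": 'def rev(s):\n    """Return s reversed."""\n    return s[::-1]'},
    ],
}
```

Training on trajectories of this shape is what teaches the model to treat tracebacks and critiques as signals to revise, rather than as terminal failures.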
The paper presents a thorough evaluation of OpenCodeInterpreter against established benchmarks such as HumanEval, MBPP, and their enhanced versions from EvalPlus, demonstrating its performance across model scales (7B, 13B, 34B, and 70B parameters). Remarkably, OpenCodeInterpreter-33B achieves an average accuracy of 83.2 on HumanEval and MBPP (76.4 on the plus versions), closely matching GPT-4's 84.2 (76.2). With synthesized human feedback from GPT-4, this performance rises further to 91.6 (84.6), showcasing the efficacy of iterative refinement and feedback integration.
Single-Turn Code Generation: OpenCodeInterpreter significantly outperforms other open-source models, achieving results on par with or surpassing proprietary models. This is evident across different scales and configurations, highlighting its robustness.
Multi-Turn Code Generation: When evaluated on multi-turn tasks involving execution feedback and synthetic human feedback, OpenCodeInterpreter demonstrates superior refinement capabilities. It consistently performs better than leading models in incorporating iterative feedback to correct and enhance code functionality.
The research highlights several key implications, most notably that execution feedback and iterative, multi-turn refinement are central to closing the gap between open-source and proprietary code generation systems.
Future progress in AI-driven code generation could build on this feedback-driven approach, with richer feedback sources and deeper integration of execution into training.
OpenCodeInterpreter marks a significant advancement in integrating code generation with execution and refinement. By effectively leveraging multi-turn interactions through execution and human feedback, OpenCodeInterpreter narrows the gap between open-source and proprietary systems. This approach not only elevates performance standards but also sets a new precedent for future research and development in the field of automated code generation.