
AutoCoder: Enhancing Code Large Language Model with AIEV-Instruct

(2405.14906)
Published May 23, 2024 in cs.SE and cs.AI

Abstract

We introduce AutoCoder, the first Large Language Model to surpass GPT-4 Turbo (April 2024) and GPT-4o in pass@1 on the HumanEval benchmark (90.9% vs. 90.2%). In addition, AutoCoder offers a more versatile code interpreter than GPT-4 Turbo and GPT-4o: its code interpreter can install external packages instead of being limited to built-in ones. AutoCoder's training data is a multi-turn dialogue dataset created by a system combining agent interaction and external code execution verification, a method we term AIEV-Instruct (Instruction Tuning with Agent-Interaction and Execution-Verified). Compared to previous large-scale code dataset generation methods, AIEV-Instruct reduces dependence on proprietary large models and provides an execution-validated code dataset. The code and a demo video are available at https://github.com/bin123apple/AutoCoder.

Overview

  • The paper introduces AutoCoder, a novel LLM for code generation, which is trained using a new dataset annotation method called AIEV-Instruct.

  • The core method, AIEV-Instruct, involves a two-stage process: a Teaching Stage where a proprietary model generates verified code and tests, and a Self-Learning Stage where AutoCoder improves autonomously.

  • AutoCoder outperforms current state-of-the-art models in several benchmarks, including HumanEval and MBPP, showing strong capability in generating accurate code across multiple programming languages.

AutoCoder: Enhancing Code Large Language Model with AIEV-Instruct

The paper titled "AutoCoder: Enhancing Code Large Language Model with AIEV-Instruct" introduces AutoCoder, a novel LLM for code generation, trained using an innovative dataset annotation method named AIEV-Instruct. This paper aims to address two prominent challenges in the existing methodologies for code generation with LLMs: the reliance on expensive proprietary models for annotation and the transmission of incorrect knowledge during the distillation process from teacher models to student models.

Motivation

Code generation is becoming an increasingly critical tool in modern software development, as it can significantly increase productivity, reduce errors, and support the development of complex systems. Various closed-source and open-source models have shown promise in this domain, but key challenges remain in achieving high accuracy without excessive costs and dependencies on closed-source models.

Methodology: AIEV-Instruct

AIEV-Instruct, short for Instruction Tuning with Agent-Interaction and Execution-Verified, is the core methodological innovation presented in this paper. It builds high-quality code instruction datasets through simulated multi-turn dialogues between two agents: a questioner and a programmer.

The AIEV-Instruct process is separated into two primary stages:

  1. Teaching Stage: During this stage, a proprietary model is used as the "teacher" to generate code snippets and corresponding unit tests from open-source fragments. The dialogue between the agents ensures diverse and multi-turn interaction, capturing a comprehensive set of programming scenarios. The proprietary model generates an initial code and unit test, which are then verified by execution. Errors are fed back into the system to iteratively improve the generated code until it passes all unit tests.
  2. Self-Learning Stage: Once AutoCoder surpasses proprietary models in accuracy on the test sets, it transitions to the Self-Learning Stage, where AutoCoder itself assumes both the questioner and programmer roles. This self-sustaining loop significantly reduces the dependence on expensive proprietary models and enables autonomous learning and dataset generation.
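The iterative execute-and-repair loop described above can be sketched roughly as follows. This is a hypothetical illustration, not the authors' implementation: the `generate` callback stands in for a call to the teacher model (or, in the Self-Learning Stage, to AutoCoder itself), and the round cap and file handling are assumptions.

```python
import subprocess
import sys
import tempfile

MAX_ROUNDS = 3  # assumed cap on repair iterations

def execution_verified_loop(generate, problem, unit_tests):
    """Repeatedly ask the model for code until the unit tests pass.

    On each round, `generate` receives the problem plus any error
    feedback from the previous failed execution.
    """
    feedback = ""
    for _ in range(MAX_ROUNDS):
        code = generate(problem, feedback)
        # Run the candidate code together with its unit tests in a
        # separate interpreter process and capture any traceback.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code + "\n" + unit_tests)
            path = f.name
        result = subprocess.run([sys.executable, path],
                                capture_output=True, text=True)
        if result.returncode == 0:
            return code, True          # all tests passed: keep the sample
        feedback = result.stderr       # feed the error back to the model
    return code, False                 # give up after MAX_ROUNDS

# Toy "model": returns buggy code first, then a fixed version once it
# sees error feedback.
def toy_model(problem, feedback):
    if not feedback:
        return "def add(a, b):\n    return a - b"   # deliberate bug
    return "def add(a, b):\n    return a + b"

code, ok = execution_verified_loop(
    toy_model,
    "Write add(a, b) returning the sum.",
    "assert add(2, 3) == 5",
)
print(ok)  # True after one round of error feedback
```

The key design point is that only execution-verified samples enter the dataset, which is what distinguishes AIEV-Instruct from distillation methods that accept a teacher model's output unchecked.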

AutoCoder Training and Evaluation

Using the AIEV-Instruct dataset, AutoCoder was trained in two configurations: AutoCoder (33B parameters) and AutoCoder-S (6.7B parameters), both with Deepseek-Coder as the base model. AutoCoder introduces a new feature in its code interpreter environment, which allows it to execute bash commands for external package installations, addressing a significant limitation in existing models like GPT-4 Turbo and GPT-4o.
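A minimal sketch of what such an interpreter step might look like, assuming a wrapper that installs declared packages before executing a snippet; the function name and interface here are illustrative, not from the paper.

```python
import subprocess
import sys

def run_with_packages(code, packages):
    """Install any required external packages, then execute the code.

    Hypothetical sketch of an interpreter that, like AutoCoder's, is
    not restricted to built-in packages.
    """
    for pkg in packages:
        # Equivalent to running `pip install <pkg>` via a bash command.
        subprocess.run([sys.executable, "-m", "pip", "install", pkg],
                       check=True, capture_output=True)
    result = subprocess.run([sys.executable, "-c", code],
                            capture_output=True, text=True, check=True)
    return result.stdout

# With no extra packages this simply executes the snippet.
out = run_with_packages("print(2 + 2)", [])
print(out.strip())  # 4
```

Sandboxing and version pinning are omitted here for brevity; a production interpreter would need both.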

Experimental Results

The experimental results presented in the paper are notable for several reasons:

  1. HumanEval Performance: AutoCoder achieved a Pass@1 score of 90.9%, surpassing the performance of GPT-4 Turbo (90.2%) and all current state-of-the-art models.
  2. Benchmarks Across Diverse Datasets: AutoCoder was evaluated on multiple datasets including HumanEval+, MBPP, MBPP+, MultiPL-E, and DS-1000, consistently ranking among the top models:
  • HumanEval+: 78% Pass@1, behind only GPT-4 Turbo and CodeQwen1.5-Chat
  • MBPP and MBPP+: 82.5% and 70.6% Pass@1 respectively
  • Multilingual Performance: AutoCoder showed robust capabilities across several programming languages, achieving 68.9% Pass@1 in both C++ and JavaScript.
  • Data Science Code Generation (DS-1000): AutoCoder's performance was second only to GPT-4 Turbo with an overall Pass@1 exceeding 45%.
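For reference, Pass@1 on these benchmarks is conventionally computed with the unbiased pass@k estimator introduced with HumanEval, evaluated at $k=1$, where $n$ samples are drawn per problem and $c$ of them pass all unit tests:

```latex
\text{pass@}k = \mathbb{E}_{\text{problems}}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right],
\qquad
\text{pass@}1 = \mathbb{E}_{\text{problems}}\left[\frac{c}{n}\right]
```

That is, at $k=1$ the score reduces to the average fraction of sampled solutions that pass, so a 90.9% Pass@1 means roughly nine in ten first attempts are functionally correct.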

Implications and Future Directions

The implications of this research extend both practically and theoretically. Practically, AutoCoder offers a more versatile and autonomous approach to high-quality code generation, reducing costs and dependence on closed-source models. Theoretically, the AIEV-Instruct method presents a robust framework for iterative learning and validation in dataset generation, potentially applicable to other domains beyond code generation.

The paper's results suggest several avenues for future work:

  • Extension to Other Programming Languages: Future work could explore the application of AIEV-Instruct in generating datasets for languages beyond those tested.
  • Integration with IDEs: Further integration of AutoCoder within Integrated Development Environments (IDEs) to provide real-time assistance and corrections to developers.
  • Expansion of the Dialogue-Based Annotation: Expanding the annotation methodology to more complex, multi-step programming tasks.

In summary, this paper makes significant contributions to the field of code generation with LLMs by presenting AutoCoder and the AIEV-Instruct methodology. It demonstrates improved performance and versatility over existing models, suggesting a path towards more autonomous and cost-effective code generation solutions.
