
Magicoder: Source Code Is All You Need

(2312.02120)
Published Dec 4, 2023 in cs.CL, cs.AI, and cs.SE

Abstract

We introduce Magicoder, a series of fully open-source (code, weights, and data) LLMs for code that significantly closes the gap with top code models while having no more than 7B parameters. Magicoder models are trained on 75K synthetic instruction data using OSS-Instruct, a novel approach to enlightening LLMs with open-source code snippets to generate high-quality instruction data for code. Our main motivation is to mitigate the inherent bias of the synthetic data generated by LLMs by empowering them with a wealth of open-source references for the production of more diverse, realistic, and controllable data. The orthogonality of OSS-Instruct and other data generation methods like Evol-Instruct further enables us to build an enhanced MagicoderS. Both Magicoder and MagicoderS substantially outperform state-of-the-art code models with similar or even larger sizes on a wide range of coding benchmarks, including Python text-to-code generation, multilingual coding, and data-science program completion. Notably, MagicoderS-CL-7B based on CodeLlama even surpasses the prominent ChatGPT on HumanEval+ (66.5 vs. 65.9 in pass@1). Overall, OSS-Instruct opens a new direction for low-bias and high-quality instruction tuning using abundant open-source references.

Figure: Examples of OSS-Instruct generating problems and solutions from code snippets (full details omitted for brevity).

Overview

  • Magicoder is a series of fully open-source LLMs for code generation, each with no more than 7 billion parameters.

  • Introduces OSS-Instruct, a training-data generation method that uses open-source code snippets to produce diverse, realistic instruction data for code models.

  • Combining OSS-Instruct with complementary methods like Evol-Instruct yields the even stronger MagicoderS variants.

  • MagicoderS-CL-7B surpasses ChatGPT on HumanEval+ (66.5 vs. 65.9 pass@1) despite its far smaller size.

  • Magicoder’s training data, model weights, and source code are made open-source for community collaboration.


Magicoder presents a series of fully open-source LLMs designed specifically for code generation. These models, with no more than 7 billion parameters, show significant performance improvements over other leading code models across a wide range of coding benchmarks. They are trained with an original method, OSS-Instruct, which draws on open-source code snippets to generate diverse and realistic instruction data for code.

Methodology

OSS-Instruct is the key innovation: it taps the rich diversity of open-source code to create synthetic instruction data. By grounding generation in real code, the process mitigates the bias inherent in purely LLM-generated synthetic data and yields more diverse, realistic, and controllable examples. Concretely, a powerful teacher LLM is prompted to draw inspiration from random segments of open-source code and to write a coding problem together with its solution. The resulting 75K synthetic instruction-response pairs then serve as the training data for the Magicoder series; fine-tuning a base model such as CodeLlama-Python-7B on them produces Magicoder-CL. Because OSS-Instruct is orthogonal to other data generation methods like Evol-Instruct, combining the two yields even more robust variants such as MagicoderS.
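To make the pipeline concrete, here is a minimal sketch of the OSS-Instruct generation loop in Python. The prompt wording, the `generate` stub, the toy `corpus`, and the `[Solution]` parsing marker are all illustrative assumptions, not the authors' exact implementation; in the paper, seed snippets come from real open-source projects and a strong teacher LLM produces the problem-solution pairs.

```python
import random

# Hypothetical stand-in for a corpus of open-source code documents
# (the paper mines snippets from real open-source repositories).
corpus = [
    "def quicksort(arr):\n    if len(arr) <= 1:\n        return arr\n    ...",
    "class LRUCache:\n    def __init__(self, capacity):\n        ...",
]

def generate(prompt: str) -> str:
    """Placeholder for a call to a strong teacher LLM.

    Swap in any chat-completion client here; this stub only
    illustrates the data-generation loop, not a specific API.
    """
    raise NotImplementedError

def oss_instruct_sample(max_lines: int = 15) -> dict:
    # 1. Draw a random document and take a short contiguous slice;
    #    the snippet acts as the "seed" that grounds the task.
    lines = random.choice(corpus).splitlines()
    start = random.randrange(max(1, len(lines) - max_lines + 1))
    snippet = "\n".join(lines[start:start + max_lines])

    # 2. Ask the teacher LLM to invent a self-contained coding problem
    #    *inspired by* (not copied from) the snippet, plus a solution.
    prompt = (
        "Gain inspiration from the following code snippet and create a "
        "high-quality, self-contained programming problem and its "
        f"correct solution.\n\n[Code Snippet]\n{snippet}"
    )
    response = generate(prompt)

    # 3. Split the response into an instruction/response pair for
    #    supervised fine-tuning (real parsing is more careful; the
    #    "[Solution]" delimiter here is an assumed convention).
    problem, solution = response.split("[Solution]", 1)
    return {"instruction": problem.strip(), "response": solution.strip()}
```

Repeating this loop over many random snippets yields the 75K-example instruction dataset; grounding each problem in a different real-world snippet is what drives the diversity and realism the paper emphasizes.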

Performance and Comparisons

Extensive evaluations were carried out across a variety of coding tasks, including Python text-to-code generation, multilingual code completion, and data-science program completion. The results show that Magicoder models substantially improve upon their base LLMs. Notably, the MagicoderS-CL variant even exceeds ChatGPT on the rigorous HumanEval+ benchmark, showing that a comparatively small 7-billion-parameter model can generate robust code. Additional experiments with the stronger DeepSeek-Coder series demonstrate that the OSS-Instruct approach transfers effectively across different base models.
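For reference, the pass@1 scores quoted above follow the standard unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021): generate n samples per problem, count the c that pass all unit tests, and estimate the probability that at least one of k attempts succeeds. The sample counts below (n=200, c=133) are illustrative arithmetic chosen to reproduce a 66.5% score, not the paper's actual decoding setup.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated per problem
    c: number of samples that pass all unit tests
    k: budget of attempts being scored
    """
    if n - c < k:
        return 1.0  # every size-k draw contains a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative: 133 of 200 samples passing gives pass@1 = 0.665 (66.5%)
print(pass_at_k(200, 133, 1))  # 0.665
```

For k=1 the formula reduces to c/n, the fraction of samples that pass, which is why pass@1 is often reported directly from greedy or single-sample decoding.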

Conclusions and Open-Source Contributions

The paper concludes by underscoring that OSS-Instruct advances the field by producing higher-quality, lower-bias instruction data for LLMs. The success of the Magicoder series points the way for future work on LLMs for code generation. The authors have released the model weights, training data, and source code on GitHub to encourage collaboration and further advances in the domain.
