
Magicoder: Source Code Is All You Need

(2312.02120)
Published Dec 4, 2023 in cs.CL, cs.AI, and cs.SE

Abstract

We introduce Magicoder, a series of fully open-source (code, weights, and data) LLMs for code that significantly closes the gap with top code models while having no more than 7B parameters. Magicoder models are trained on 75K synthetic instruction data using OSS-Instruct, a novel approach to enlightening LLMs with open-source code snippets to generate high-quality instruction data for code. Our main motivation is to mitigate the inherent bias of the synthetic data generated by LLMs by empowering them with a wealth of open-source references for the production of more diverse, realistic, and controllable data. The orthogonality of OSS-Instruct and other data generation methods like Evol-Instruct further enables us to build an enhanced MagicoderS. Both Magicoder and MagicoderS substantially outperform state-of-the-art code models with similar or even larger sizes on a wide range of coding benchmarks, including Python text-to-code generation, multilingual coding, and data-science program completion. Notably, MagicoderS-CL-7B based on CodeLlama even surpasses the prominent ChatGPT on HumanEval+ (66.5 vs. 65.9 in pass@1). Overall, OSS-Instruct opens a new direction for low-bias and high-quality instruction tuning using abundant open-source references.

Figure: Examples of OSS-Instruct generating problems and solutions from code snippets (full details omitted for brevity).

Overview

  • Magicoder is a series of fully open-source LLMs for code generation, each with no more than 7 billion parameters.

  • Introduces OSS-Instruct, a training-data generation method that uses open-source code snippets to produce diverse, realistic instruction data for code models.

  • Combining OSS-Instruct with complementary methods like Evol-Instruct yields the even stronger MagicoderS variants.

  • MagicoderS-CL-7B surpasses ChatGPT on HumanEval+ (66.5 vs. 65.9 pass@1) despite its far smaller size.

  • Magicoder’s training data, model weights, and source code are made open-source for community collaboration.


Magicoder presents a series of fully open-source LLMs designed specifically for code generation. These models, with no more than 7 billion parameters, show significant performance improvements over other leading code models across a wide range of coding benchmarks. They are trained with an original method, OSS-Instruct, which draws on open-source code snippets to generate diverse and realistic instruction data for code.

Methodology

OSS-Instruct is the key innovation: it taps the rich diversity of open-source code to create synthetic instruction data. By grounding generation in real code, the process mitigates the bias inherent in purely LLM-generated synthetic data and yields more diverse, realistic, and controllable examples. Concretely, a powerful teacher LLM is prompted to draw inspiration from random segments of open-source code and to write a coding problem together with its solution. The resulting 75K synthetic instruction-response pairs then serve as the training data for the Magicoder series; fine-tuning a base model such as CodeLlama-Python-7B on them produces Magicoder-CL. Because OSS-Instruct is orthogonal to other data generation methods like Evol-Instruct, combining the two yields even more robust variants such as MagicoderS.
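To make the pipeline concrete, here is a minimal sketch of the OSS-Instruct generation loop in Python. The prompt wording, the `generate` stub, the toy `corpus`, and the `[Solution]` parsing marker are all illustrative assumptions, not the authors' exact implementation; in the paper, seed snippets come from real open-source projects and a strong teacher LLM produces the problem-solution pairs.

```python
import random

# Hypothetical stand-in for a corpus of open-source code documents
# (the paper mines snippets from real open-source repositories).
corpus = [
    "def quicksort(arr):\n    if len(arr) <= 1:\n        return arr\n    ...",
    "class LRUCache:\n    def __init__(self, capacity):\n        ...",
]

def generate(prompt: str) -> str:
    """Placeholder for a call to a strong teacher LLM.

    Swap in any chat-completion client here; this stub only
    illustrates the data-generation loop, not a specific API.
    """
    raise NotImplementedError

def oss_instruct_sample(max_lines: int = 15) -> dict:
    # 1. Draw a random document and take a short contiguous slice;
    #    the snippet acts as the "seed" that grounds the task.
    lines = random.choice(corpus).splitlines()
    start = random.randrange(max(1, len(lines) - max_lines + 1))
    snippet = "\n".join(lines[start:start + max_lines])

    # 2. Ask the teacher LLM to invent a self-contained coding problem
    #    *inspired by* (not copied from) the snippet, plus a solution.
    prompt = (
        "Gain inspiration from the following code snippet and create a "
        "high-quality, self-contained programming problem and its "
        f"correct solution.\n\n[Code Snippet]\n{snippet}"
    )
    response = generate(prompt)

    # 3. Split the response into an instruction/response pair for
    #    supervised fine-tuning (real parsing is more careful; the
    #    "[Solution]" delimiter here is an assumed convention).
    problem, solution = response.split("[Solution]", 1)
    return {"instruction": problem.strip(), "response": solution.strip()}
```

Repeating this loop over many random snippets yields the 75K-example instruction dataset; grounding each problem in a different real-world snippet is what drives the diversity and realism the paper emphasizes.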

Performance and Comparisons

Extensive evaluations were carried out across a variety of coding tasks, including Python text-to-code generation, multilingual code completion, and data-science program completion. The results show that Magicoder models substantially improve upon their base LLMs. Notably, the MagicoderS-CL variant even exceeds ChatGPT on the rigorous HumanEval+ benchmark, showing that a comparatively small 7-billion-parameter model can generate robust code. Additional experiments with the stronger DeepSeek-Coder series demonstrate that the OSS-Instruct approach transfers effectively across different base models.
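For reference, the pass@1 scores quoted above follow the standard unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021): generate n samples per problem, count the c that pass all unit tests, and estimate the probability that at least one of k attempts succeeds. The sample counts below (n=200, c=133) are illustrative arithmetic chosen to reproduce a 66.5% score, not the paper's actual decoding setup.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated per problem
    c: number of samples that pass all unit tests
    k: budget of attempts being scored
    """
    if n - c < k:
        return 1.0  # every size-k draw contains a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative: 133 of 200 samples passing gives pass@1 = 0.665 (66.5%)
print(pass_at_k(200, 133, 1))  # 0.665
```

For k=1 the formula reduces to c/n, the fraction of samples that pass, which is why pass@1 is often reported directly from greedy or single-sample decoding.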

Conclusions and Open-Source Contributions

The paper concludes by underscoring that OSS-Instruct advances the field by producing higher-quality, lower-bias instruction data for LLMs. The success of the Magicoder series points the way for future work on LLMs for code generation. The authors have released the model weights, training data, and source code on GitHub to encourage collaboration and further advances in the domain.
