Code Large Language Models (Code LLMs), such as StarCoder, have demonstrated exceptional performance in code-related tasks. However, most existing models are solely pre-trained on extensive raw code data without instruction fine-tuning. In this paper, we introduce WizardCoder, which empowers Code LLMs with complex instruction fine-tuning by adapting the Evol-Instruct method to the domain of code. Through comprehensive experiments on four prominent code generation benchmarks, namely HumanEval, HumanEval+, MBPP, and DS-1000, we unveil the exceptional capabilities of our model. It surpasses all other open-source Code LLMs by a substantial margin. Moreover, our model even outperforms the largest closed LLMs, Anthropic's Claude and Google's Bard, on HumanEval and HumanEval+. Our code, model weights, and data are publicly available at https://github.com/nlpxucan/WizardLM
The paper introduces WizardCoder, which builds on the open-source Code LLM StarCoder by integrating instruction fine-tuning.
WizardCoder is based on the Evol-Instruct method from WizardLM, which evolves instruction data for enhanced model training.
The paper shows that WizardCoder surpasses both open-source and some closed-source LLMs in code generation benchmarks.
The research confirms the importance of instruction fine-tuning, especially given the added challenges of code-related tasks.
The researchers recognize the ethical implications of Code LLMs and stress the need for responsible research and use.
The landscape of Code Large Language Models (Code LLMs) has evolved dramatically with the introduction of various pre-trained models demonstrating proficiency in coding tasks. Open-source options like StarCoder have received significant acclaim. Yet most of these models have been trained on code data alone, without the benefits of instruction fine-tuning. Building on recent developments in general-domain fine-tuning and the Evol-Instruct method introduced by WizardLM, this paper presents WizardCoder, an enhancement to StarCoder that integrates complex instruction fine-tuning specific to coding tasks.
In contextualizing WizardCoder, this research builds upon two primary foundations: open-source Code LLMs pre-trained on extensive code datasets, and the methodology of instruction fine-tuning that has been largely explored in NLP tasks. Earlier models, such as OpenAI's InstructGPT, demonstrated the value of human-annotated instructions. Recent contributions like Alpaca and Vicuna further explored the potential of instruction fine-tuning, albeit in the general domain. WizardLM's Evol-Instruct method distinguished itself by evolving existing instruction data, and its promise for the code domain led to the inception of WizardCoder.
WizardCoder employs an adapted Evol-Instruct method designed to evolve code instructions, starting from the Code Alpaca dataset. This enables fine-tuning of StarCoder on an evolved set of code instruction-following training data. The researchers introduced evolution operations unique to the programming domain, including code debugging and time-space complexity constraints. The evolutionary prompts progressively increase the difficulty of the programming tasks, and the empirical success of WizardCoder on several benchmarks is attributed to this nuanced approach to instruction fine-tuning.
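The summary does not reproduce the authors' actual evolution prompts, but the pipeline can be sketched as follows. This is a minimal, hypothetical illustration: the template wording and the `evolve_instruction` / `evolve_dataset` helpers are assumptions, and in the real method each prompt would be sent to an LLM that rewrites the instruction, rather than merely wrapping it.

```python
import random

# Code-specific evolution heuristics mentioned in the paper summary
# (debugging, complexity constraints, added requirements). The exact
# wording below is illustrative, not the authors' actual templates.
EVOLUTION_TEMPLATES = [
    "Add new constraints and requirements to the following programming task:\n{instruction}",
    "Provide a piece of erroneous code as a reference, and ask for a debugged solution:\n{instruction}",
    "Require the solution to satisfy a specific time or space complexity:\n{instruction}",
    "Rewrite the task so that it requires several additional reasoning steps:\n{instruction}",
]

def evolve_instruction(instruction: str, rng: random.Random) -> str:
    """Build one evolution prompt for a seed instruction.

    In the real pipeline this prompt would be given to an LLM, whose
    response is the evolved, harder instruction; here we only construct
    the prompt text.
    """
    template = rng.choice(EVOLUTION_TEMPLATES)
    return template.format(instruction=instruction)

def evolve_dataset(seed_instructions, rounds: int, seed: int = 0):
    """Run several evolution rounds, keeping every generation in the pool."""
    rng = random.Random(seed)
    pool = list(seed_instructions)
    current = list(seed_instructions)
    for _ in range(rounds):
        current = [evolve_instruction(ins, rng) for ins in current]
        pool.extend(current)
    return pool
```

The fine-tuning set then consists of the accumulated pool of original and evolved instructions, paired with LLM-generated responses.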
A rigorous experimentation framework was established using multiple code generation benchmarks. WizardCoder outperforms all open-source Code LLMs on these benchmarks, including its precursor, StarCoder. Notably, on prominent benchmarks such as HumanEval, it surpasses even some of the top closed-source LLMs, a remarkable feat for an open-source model of its size. The paper provides a detailed comparative analysis, placing WizardCoder in the upper echelons of Code LLM performance. Furthermore, an ablation study examines the effect of the number of data evolution rounds, providing insights into the fine-tuning methodology.
The paper concludes with WizardCoder positioned as a state-of-the-art model that advances the field of code generation through instruction fine-tuning. It successfully applies the Evol-Instruct method, previously proven in the general domain, to the specific challenges of coding tasks. Looking ahead, the researchers point out the potential enhancements to WizardCoder and the need for continual improvement to meet and exceed the benchmarks set by models like GPT-4. Reflecting on the broader impact, the authors acknowledge the ethical considerations paralleling those of other LLMs and emphasize the necessity of research towards responsible use and deployment.
Stanford Alpaca: An Instruction-Following LLaMA Model. https://github.com/tatsu-lab/stanford_alpaca
Sahil Chaudhary. Code Alpaca: An Instruction-Following LLaMA Model for Code Generation. https://github.com/sahil280114/codealpaca
GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax, May 2021.
Microsoft. Azure OpenAI Service Models. https://learn.microsoft.com/en-us/azure/cognitive-services/openai/concepts/models
LLM HumanEval Benchmarks. https://github.com/my-other-github-account/llm-humaneval-benchmarks