Emergent Mind

WizardCoder: Empowering Code Large Language Models with Evol-Instruct

(2306.08568)
Published Jun 14, 2023 in cs.CL and cs.AI

Abstract

Code Large Language Models (Code LLMs), such as StarCoder, have demonstrated exceptional performance in code-related tasks. However, most existing models are solely pre-trained on extensive raw code data without instruction fine-tuning. In this paper, we introduce WizardCoder, which empowers Code LLMs with complex instruction fine-tuning, by adapting the Evol-Instruct method to the domain of code. Through comprehensive experiments on four prominent code generation benchmarks, namely HumanEval, HumanEval+, MBPP, and DS-1000, we unveil the exceptional capabilities of our model. It surpasses all other open-source Code LLMs by a substantial margin. Moreover, our model even outperforms the largest closed LLMs, Anthropic's Claude and Google's Bard, on HumanEval and HumanEval+. Our code, model weights, and data are public at https://github.com/nlpxucan/WizardLM

Overview

  • The paper introduces WizardCoder, which builds on the previous Code LLM, StarCoder, by integrating instruction fine-tuning.

  • WizardCoder is based on the Evol-Instruct method from WizardLM, which evolves instruction data for enhanced model training.

  • The paper shows that WizardCoder surpasses both open-source and some closed-source LLMs in code generation benchmarks.

  • The research confirms the importance of instruction fine-tuning, especially with the added challenges of code-related tasks.

  • The researchers recognize the ethical implications of Code LLMs and stress the need for responsible research and use.

Introduction

The landscape of Code Large Language Models (Code LLMs) has evolved dramatically with the introduction of various pre-trained models demonstrating proficiency in coding tasks. Open-source options like StarCoder have received significant acclaim. Yet most of these models have been trained on code data alone, without the benefits of instruction fine-tuning. Building on recent developments in general-domain fine-tuning and the Evol-Instruct method introduced by WizardLM, this paper presents WizardCoder, an enhancement to StarCoder that integrates complex instruction fine-tuning specific to coding tasks.

Related Work

In contextualizing WizardCoder, this research builds upon two primary foundations: open-source Code LLMs pre-trained on extensive code datasets, and the methodology of instruction fine-tuning that has been explored largely in general NLP tasks. Earlier models such as InstructGPT by OpenAI demonstrated the value of human-annotated instruction data. Recent contributions like Alpaca and Vicuna further explored the potential of instruction fine-tuning, albeit in the general domain. WizardLM's Evol-Instruct method distinguished itself by evolving existing instruction data, signaling its potential for application to the code domain and leading to the inception of WizardCoder.

Approach

WizardCoder employs an adapted Evol-Instruct method designed to evolve the code instructions in the Code Alpaca dataset, enabling StarCoder to be fine-tuned on an evolved set of code instruction-following training data. The researchers introduced evolution heuristics unique to the programming domain, such as requiring code debugging and imposing time-space complexity constraints. These evolutionary prompts progressively increase the difficulty of the programming tasks. The empirical success of WizardCoder on several benchmarks is attributed to this nuanced approach to instruction fine-tuning.
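The evolution loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the template wordings and the `evolve` helper are hypothetical stand-ins for the paper's code-specific heuristics (added constraints, erroneous-code misdirection, complexity requirements), and in the real pipeline each evolved prompt would be sent to an LLM to produce a harder instruction and its answer.

```python
import random

# Hypothetical evolution templates, modeled loosely on the heuristics the
# paper describes: each round rewrites an instruction to be more difficult.
EVOLVE_TEMPLATES = [
    "Add new constraints and requirements to the following problem: {inst}",
    "Provide a piece of erroneous code as misdirection for the following problem: {inst}",
    "Require a solution with a specific time or space complexity for: {inst}",
    "Replace a common requirement with a less common one in: {inst}",
]

def evolve(instruction: str, rounds: int = 1, seed: int = 0) -> list[str]:
    """Return the chain of progressively evolved instructions.

    In a full Evol-Instruct pipeline each element of the chain would be
    given to an LLM, which writes the harder instruction and its solution;
    here we only build the evolution prompts themselves.
    """
    rng = random.Random(seed)
    chain = [instruction]
    for _ in range(rounds):
        template = rng.choice(EVOLVE_TEMPLATES)
        chain.append(template.format(inst=chain[-1]))
    return chain

prompts = evolve("Write a function that sorts a list of integers.", rounds=2)
```

Each round wraps the previous instruction in a new difficulty-raising directive, which is why the paper's ablation over the number of evolution rounds matters: too few rounds leave the data easy, too many can drift from answerable tasks.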

Experimentation and Results

A rigorous experimentation framework was established using multiple code generation benchmarks. WizardCoder outperforms all open-source Code LLMs on these benchmarks, including its precursor, StarCoder. Notably, on prominent benchmarks such as HumanEval, it surpasses even leading closed-source LLMs such as Anthropic's Claude and Google's Bard, a remarkable feat for an open-source model of its size. The paper provides a detailed comparative analysis that places WizardCoder in the upper echelons of Code LLM performance. Furthermore, an ablation study examines the effect of the number of data-evolution rounds, offering insight into how much evolution is beneficial during fine-tuning.
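Benchmarks like HumanEval and MBPP are typically scored with the pass@k metric from "Evaluating Large Language Models Trained on Code" (the HumanEval paper, cited below): generate n samples per problem, count the c samples that pass all unit tests, and compute the unbiased estimate of the probability that at least one of k random samples is correct. A small sketch of that standard estimator:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n = samples generated per problem,
    c = samples passing all unit tests, k = evaluation budget.
    Returns 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k failing samples exist, so any k-subset must
        # contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k=1 the estimator reduces to the fraction of correct samples:
print(pass_at_k(200, 50, 1))  # 0.25
```

WizardCoder's headline numbers (e.g., its HumanEval score) are pass@1 results, i.e., the expected success rate when the model gets a single attempt per problem.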

Conclusion and Implications

The paper concludes with WizardCoder positioned as a state-of-the-art model that advances the field of code generation through instruction fine-tuning. It successfully applies the Evol-Instruct method, previously proven in the general domain, to the specific challenges of coding tasks. Looking ahead, the researchers point to potential enhancements to WizardCoder and the need for continual improvement to meet and exceed the benchmarks set by models like GPT-4. Reflecting on the broader impact, the authors acknowledge ethical considerations paralleling those of other LLMs and emphasize the need for research into responsible use and deployment.


References
  1. Language models are few-shot learners. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual
  2. GPT-4 Technical Report
  3. PaLM: Scaling Language Modeling with Pathways
  4. PaLM 2 Technical Report
  5. Training Compute-Optimal Large Language Models
  6. Scaling Language Models: Methods, Analysis & Insights from Training Gopher
  7. GLM-130B: An Open Bilingual Pre-trained Model
  8. LLaMA: Open and Efficient Foundation Language Models
  9. OPT: Open Pre-trained Transformer Language Models
  10. Training language models to follow instructions with human feedback. In NeurIPS
  11. StarCoder: may the source be with you!
  12. Competition-Level Code Generation with AlphaCode
  13. CodeGen: An open large language model for code with multi-turn program synthesis. In The Eleventh International Conference on Learning Representations
  14. CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X
  15. InCoder: A Generative Model for Code Infilling and Synthesis
  16. Evaluating Large Language Models Trained on Code
  17. CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 8696–8708. Association for Computational Linguistics
  18. CodeT5+: Open Code Large Language Models for Code Understanding and Generation
  19. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67
  20. Finetuned language models are zero-shot learners. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net
  21. Scaling Instruction-Finetuned Language Models
  22. ExT5: Towards extreme multi-task scaling for transfer learning. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net
  23. Multitask prompted training enables zero-shot task generalization. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net
  24. ZeroPrompt: Scaling prompt-based pretraining to 1,000 tasks improves zero-shot generalization. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pages 4235–4252. Association for Computational Linguistics
  25. UnifiedQA: Crossing format boundaries with a single QA system. In Trevor Cohn, Yulan He, and Yang Liu, editors, Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020, volume EMNLP 2020 of Findings of ACL, pages 1896–1907. Association for Computational Linguistics
  26. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca
  27. Self-Instruct: Aligning Language Models with Self-Generated Instructions
  28. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, March 2023
  29. WizardLM: Empowering Large Language Models to Follow Complex Instructions
  30. Sahil Chaudhary. Code Alpaca: An instruction-following LLaMA model for code generation. https://github.com/sahil280114/codealpaca
  31. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation
  32. Program Synthesis with Large Language Models
  33. DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation
  34. GPT-NeoX-20B: An Open-Source Autoregressive Language Model
  35. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax, May 2021
  36. UL2: Unifying Language Learning Paradigms
  37. Microsoft. Azure OpenAI Service models. https://learn.microsoft.com/en-us/azure/cognitive-services/openai/concepts/models
  38. LLM HumanEval benchmarks. https://github.com/my-other-github-account/llm-humaneval-benchmarks
  39. LaMDA: Language Models for Dialog Applications
