WizardCoder: Empowering Code Large Language Models with Evol-Instruct (2306.08568v1)
Abstract: Code Large Language Models (Code LLMs), such as StarCoder, have demonstrated exceptional performance on code-related tasks. However, most existing models are solely pre-trained on extensive raw code data without instruction fine-tuning. In this paper, we introduce WizardCoder, which empowers Code LLMs with complex instruction fine-tuning by adapting the Evol-Instruct method to the domain of code. Through comprehensive experiments on four prominent code generation benchmarks, namely HumanEval, HumanEval+, MBPP, and DS-1000, we unveil the exceptional capabilities of our model. It surpasses all other open-source Code LLMs by a substantial margin. Moreover, our model even outperforms the largest closed LLMs, Anthropic's Claude and Google's Bard, on HumanEval and HumanEval+. Our code, model weights, and data are public at https://github.com/nlpxucan/WizardLM
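The abstract names the core recipe (evolving coding instructions into harder variants and fine-tuning on them) but does not reproduce the evolution prompts. The sketch below is a minimal, hedged illustration of an Evol-Instruct-style loop for coding instructions; the heuristic wordings, function names, and the `generate` callable are illustrative assumptions, not the paper's actual templates or pipeline.

```python
import random

# Illustrative evolution heuristics in the spirit of Evol-Instruct for code.
# The exact prompts used by WizardCoder are not given in this excerpt.
EVOLUTION_HEURISTICS = [
    "Add new constraints and requirements to the original problem.",
    "Replace a common requirement with a less common, more specific one.",
    "Ask for a solution with stricter time or space complexity requirements.",
    "Provide a piece of erroneous code as a reference to increase misdirection.",
]

def build_evolution_prompt(instruction: str) -> str:
    """Wrap a seed coding instruction in a prompt asking a teacher LLM to harden it."""
    heuristic = random.choice(EVOLUTION_HEURISTICS)
    return (
        "Please increase the difficulty of the given programming question slightly.\n"
        f"Method to use: {heuristic}\n\n"
        f"Original question:\n{instruction}\n\n"
        "Rewritten question:"
    )

def evolve_dataset(seed_instructions, generate, rounds: int = 1):
    """Iteratively evolve instructions; `generate` is any text-completion callable
    (in a real pipeline, a call to a teacher LLM)."""
    pool = list(seed_instructions)
    for _ in range(rounds):
        pool = [generate(build_evolution_prompt(inst)) for inst in pool]
    return pool

if __name__ == "__main__":
    # Placeholder generator that just echoes the seed question back,
    # so the sketch runs without any model access.
    echo = lambda prompt: prompt.split("Original question:\n")[1].split("\n\n")[0]
    print(evolve_dataset(["Write a function that reverses a string."], echo, rounds=1))
```

The evolved instruction-response pairs would then serve as the fine-tuning corpus for the base code model; the stub generator above exists only so the example is self-contained.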
- Language models are few-shot learners. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
- OpenAI. GPT-4 technical report. CoRR, abs/2303.08774, 2023.
- PaLM: Scaling language modeling with pathways. CoRR, abs/2204.02311, 2022.
- PaLM 2 technical report. CoRR, abs/2305.10403, 2023.
- Training compute-optimal large language models. CoRR, abs/2203.15556, 2022.
- Scaling language models: Methods, analysis & insights from training gopher. CoRR, abs/2112.11446, 2021.
- GLM-130B: an open bilingual pre-trained model. CoRR, abs/2210.02414, 2022.
- LLaMA: Open and efficient foundation language models. CoRR, abs/2302.13971, 2023.
- OPT: Open pre-trained transformer language models. CoRR, abs/2205.01068, 2022.
- Training language models to follow instructions with human feedback. In NeurIPS, 2022.
- StarCoder: may the source be with you! arXiv preprint arXiv:2305.06161, 2023.
- Competition-level code generation with AlphaCode. CoRR, abs/2203.07814, 2022.
- CodeGen: An open large language model for code with multi-turn program synthesis. In The Eleventh International Conference on Learning Representations, 2023.
- CodeGeeX: A pre-trained model for code generation with multilingual evaluations on HumanEval-X. CoRR, abs/2303.17568, 2023.
- InCoder: A generative model for code infilling and synthesis. CoRR, abs/2204.05999, 2022.
- Evaluating large language models trained on code. CoRR, abs/2107.03374, 2021.
- CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 8696–8708. Association for Computational Linguistics, 2021.
- CodeT5+: Open code large language models for code understanding and generation. CoRR, abs/2305.07922, 2023.
- Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67, 2020.
- Finetuned language models are zero-shot learners. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022.
- Scaling instruction-finetuned language models. CoRR, abs/2210.11416, 2022.
- ExT5: Towards extreme multi-task scaling for transfer learning. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022.
- Multitask prompted training enables zero-shot task generalization. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022.
- ZeroPrompt: Scaling prompt-based pretraining to 1,000 tasks improves zero-shot generalization. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pages 4235–4252. Association for Computational Linguistics, 2022.
- UnifiedQA: Crossing format boundaries with a single QA system. In Trevor Cohn, Yulan He, and Yang Liu, editors, Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020, volume EMNLP 2020 of Findings of ACL, pages 1896–1907. Association for Computational Linguistics, 2020.
- Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
- Self-Instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560, 2022.
- Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, March 2023.
- WizardLM: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023.
- Sahil Chaudhary. Code Alpaca: An instruction-following LLaMA model for code generation. https://github.com/sahil280114/codealpaca, 2023.
- Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. CoRR, abs/2305.01210, 2023.
- Program synthesis with large language models. CoRR, abs/2108.07732, 2021.
- DS-1000: A natural and reliable benchmark for data science code generation. CoRR, abs/2211.11501, 2022.
- GPT-NeoX-20B: An open-source autoregressive language model. CoRR, abs/2204.06745, 2022.
- GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax, May 2021.
- Unifying language learning paradigms. CoRR, abs/2205.05131, 2022.
- Microsoft. Azure OpenAI service models. https://learn.microsoft.com/en-us/azure/cognitive-services/openai/concepts/models, 2023.
- LLM HumanEval benchmarks. https://github.com/my-other-github-account/llm-humaneval-benchmarks, 2023.
- LaMDA: Language models for dialog applications. CoRR, abs/2201.08239, 2022.