WaveCoder: Widespread And Versatile Enhancement For Code Large Language Models By Instruction Tuning (2312.14187v5)
Abstract: Recent work demonstrates that, after instruction tuning, Code Large Language Models (Code LLMs) can obtain impressive capabilities to address a wide range of code-related tasks. However, current instruction tuning methods for Code LLMs mainly focus on the traditional code generation task, resulting in poor performance in complex multi-task scenarios. In this paper, we concentrate on multiple code-related tasks and present WaveCoder, a series of Code LLMs trained with Widespread And Versatile Enhanced instruction data. To enable the models to tackle complex code-related tasks, we propose a method to stably generate diverse, high-quality instruction data from an open-source code dataset in multi-task scenarios, and we obtain CodeSeaXDataset, a dataset comprising 19,915 instruction instances across 4 code-related tasks, aimed at improving the generalization ability of Code LLMs. Our experiments demonstrate that WaveCoder models significantly outperform other open-source models in terms of generalization ability across different code-related tasks. Moreover, WaveCoder-Ultra-6.7B achieves state-of-the-art generalization ability on a wide range of code-related tasks.
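The abstract's central idea is converting raw open-source code into instruction data that spans several code-related tasks. Below is a minimal sketch of what such a generate-then-filter loop could look like; the task names, prompt templates, and the `call_llm` / `is_high_quality` hooks are illustrative assumptions for this sketch, not the paper's actual pipeline.

```python
"""Sketch: turning raw open-source code snippets into multi-task instruction data.
All prompts, task names, and hooks below are illustrative assumptions."""

from dataclasses import dataclass
from typing import Callable, Iterable, List

# Illustrative task prompts; the paper covers four code-related tasks,
# but the wording here is invented for the sketch.
TASKS = {
    "code_summarization": "Write an instruction asking for a summary of the code, then answer it.",
    "code_generation": "Write an instruction describing what the code should do, then provide the code as the answer.",
    "code_translation": "Write an instruction asking to translate the code into another language, then answer it.",
    "code_repair": "Describe a plausible bug in the code and ask for a fix, then answer it.",
}

@dataclass
class InstructionExample:
    task: str
    instruction: str
    response: str
    source_code: str

def generate_examples(
    snippets: Iterable[str],
    call_llm: Callable[[str], str],               # hypothetical hook to any instruction-following LLM
    is_high_quality: Callable[[str, str], bool],  # hypothetical discriminator-style quality filter
) -> List[InstructionExample]:
    """Draft one instruction-response pair per task for each raw snippet,
    keeping only pairs that pass the quality filter."""
    kept: List[InstructionExample] = []
    for code in snippets:
        for task, task_prompt in TASKS.items():
            prompt = (
                f"{task_prompt}\n\n"
                f"Source code:\n{code}\n\n"
                "Return the instruction on the first line and the response after it."
            )
            raw = call_llm(prompt)
            instruction, _, response = raw.partition("\n")
            if instruction and response and is_high_quality(instruction, response):
                kept.append(InstructionExample(task, instruction.strip(), response.strip(), code))
    return kept
```

The separate quality hook reflects the "stably generate diverse, high-quality instruction data" goal in the abstract: generation and filtering are kept as distinct stages so low-quality pairs can be rejected before they reach the training set.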