COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning (2403.18058v2)
Abstract: Remarkable progress on English instruction tuning has facilitated the efficacy and reliability of LLMs. However, there remains a noticeable gap in instruction tuning for Chinese, where complex linguistic features pose significant challenges. Existing datasets, generally distilled from English-centric LLMs, are not well-aligned with Chinese users' interaction patterns. To bridge this gap, we introduce COIG-CQIA, a new Chinese instruction tuning dataset derived from various real-world resources and subjected to rigorous human verification. We conduct extensive experiments on COIG-CQIA and compare models trained on it against strong baseline models and datasets. The experimental results show that models trained on COIG-CQIA achieve highly competitive performance across diverse benchmarks. Additionally, our findings offer several insights for designing effective Chinese instruction-tuning datasets and data-mixing strategies. Our dataset is available at https://huggingface.co/datasets/m-a-p/COIG-CQIA.
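For reference, below is a minimal sketch of how the released dataset might be loaded with the Hugging Face `datasets` library. The subset name ("ruozhiba") and the record schema are assumptions for illustration; the dataset card at the URL above documents the actual configurations and fields.

```python
# Minimal sketch: loading COIG-CQIA from the Hugging Face Hub.
# The subset name "ruozhiba" is an assumption; check the dataset card
# at https://huggingface.co/datasets/m-a-p/COIG-CQIA for the real configs.
from datasets import load_dataset

dataset = load_dataset("m-a-p/COIG-CQIA", "ruozhiba", split="train")

# Each record is expected to hold an instruction-style example;
# print the first one to inspect the actual schema.
print(dataset[0])
```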