COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning (2403.18058v2)
Abstract: Remarkable progress on English instruction tuning has facilitated the efficacy and reliability of LLMs. However, there remains a noticeable gap in instruction tuning for Chinese, where complex linguistic features pose significant challenges. Existing datasets, generally distilled from English-centric LLMs, are not well aligned with Chinese users' interaction patterns. To bridge this gap, we introduce COIG-CQIA, a new Chinese instruction-tuning dataset derived from various real-world resources and subjected to rigorous human verification. We conduct extensive experiments on COIG-CQIA and compare models trained on it against strong baseline models and datasets. The experimental results show that models trained on COIG-CQIA achieve highly competitive performance across diverse benchmarks. Additionally, our findings offer several insights for designing effective Chinese instruction-tuning datasets and data-mixing strategies. Our dataset is available at https://huggingface.co/datasets/m-a-p/COIG-CQIA.
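Since the abstract only gives the Hugging Face URL, here is a minimal sketch of how the released dataset could be inspected and loaded with the `datasets` library. The configuration names, split name, and field layout are assumptions about the Hub repository rather than details confirmed in the text, so the code lists and prints them instead of hard-coding a schema.

```python
# Minimal sketch: inspect and load COIG-CQIA from the Hugging Face Hub.
# Assumes the `datasets` library is installed and that the repository
# exposes named configurations (e.g., one per data source); the split
# name and example fields are printed rather than assumed.
from datasets import get_dataset_config_names, load_dataset

REPO = "m-a-p/COIG-CQIA"

configs = get_dataset_config_names(REPO)   # discover available configs
print("available configs:", configs)

ds = load_dataset(REPO, configs[0], split="train")  # "train" split assumed
print(ds)       # schema and number of rows for this config
print(ds[0])    # one instruction-tuning example
```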