Kun: Answer Polishment for Chinese Self-Alignment with Instruction Back-Translation (2401.06477v4)
Abstract: In this paper, we introduce Kun, a novel approach for creating high-quality instruction-tuning datasets for LLMs without relying on manual annotations. Adapting a self-training algorithm based on instruction back-translation and answer polishment, Kun leverages unlabelled data from diverse sources such as Wudao, Wanjuan, and SkyPile to generate a substantial dataset of over a million Chinese instructional data points. This approach significantly deviates from traditional methods by using a self-curation process to refine and select the most effective instruction-output pairs. Our experiments with the 6B-parameter Yi model across various benchmarks demonstrate Kun's robustness and scalability. Our method's core contributions lie in its algorithmic advancement, which enhances data retention and clarity, and its innovative data generation approach that substantially reduces the reliance on costly and time-consuming manual annotations. This methodology presents a scalable and efficient solution for improving the instruction-following capabilities of LLMs, with significant implications for their application across diverse fields. The code and dataset can be found at https://github.com/Zheng0428/COIG-Kun
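The abstract outlines a three-step pipeline: instruction back-translation over unlabelled text, answer polishment, and self-curation of the resulting instruction-output pairs. The sketch below illustrates that flow under assumptions; the `generate()` helper, prompt wording, and scoring threshold are illustrative placeholders, not the authors' implementation.

```python
# Illustrative sketch of a Kun-style pipeline as described in the abstract:
# (1) instruction back-translation: infer an instruction for a raw text,
# (2) answer polishment: refine the raw text into a well-formed response,
# (3) self-curation: score candidate pairs and keep only the highest-rated.
# The generate() callable and all prompts are assumptions for illustration.

from typing import Callable, List, Tuple


def back_translate_instruction(text: str, generate: Callable[[str], str]) -> str:
    """Ask the model to write an instruction that the raw text would answer."""
    prompt = f"Write an instruction for which the following text is a good answer:\n{text}"
    return generate(prompt).strip()


def polish_answer(instruction: str, text: str, generate: Callable[[str], str]) -> str:
    """Rewrite the raw text into a clean response aligned with the instruction."""
    prompt = (
        f"Instruction: {instruction}\n"
        f"Draft answer: {text}\n"
        "Rewrite the draft so it directly and clearly answers the instruction."
    )
    return generate(prompt).strip()


def curate(pairs: List[Tuple[str, str]],
           generate: Callable[[str], str],
           threshold: int = 4) -> List[Tuple[str, str]]:
    """Keep only pairs the model itself rates highly (self-curation)."""
    kept = []
    for instruction, answer in pairs:
        prompt = (
            "Rate from 1 to 5 how well the answer follows the instruction.\n"
            f"Instruction: {instruction}\nAnswer: {answer}\nScore:"
        )
        try:
            score = int(generate(prompt).strip()[0])
        except (ValueError, IndexError):
            continue  # skip unparsable ratings
        if score >= threshold:
            kept.append((instruction, answer))
    return kept


def build_dataset(raw_texts: List[str],
                  generate: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Run back-translation, polishment, and curation over unlabelled texts."""
    pairs = []
    for text in raw_texts:
        instruction = back_translate_instruction(text, generate)
        answer = polish_answer(instruction, text, generate)
        pairs.append((instruction, answer))
    return curate(pairs, generate)
```

In this reading, the polishment step is what distinguishes the approach from plain instruction back-translation: the raw web text is rewritten into a response that actually answers the generated instruction before the pair is scored and retained.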