What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning (2312.15685v2)
Abstract: Instruction tuning is a standard technique employed to align LLMs to end tasks and user preferences after the initial pretraining phase. Recent research highlights the critical role of data engineering in instruction tuning: when appropriately selected, only a limited amount of data is necessary to achieve superior performance. However, we still lack a principled understanding of what makes good instruction tuning data for alignment, and of how to select data automatically and effectively. In this work, we delve deeply into automatic data selection strategies for alignment. We begin with controlled studies that measure data along three dimensions: complexity, quality, and diversity; along each dimension we examine existing methods and introduce novel techniques for enhanced data measurement. We then propose a simple strategy to select data samples based on these measurements. We present deita (short for Data-Efficient Instruction Tuning for Alignment), a series of models fine-tuned from LLaMA and Mistral models using data samples automatically selected with our proposed approach. Empirically, deita performs better than or on par with state-of-the-art open-source alignment models while using only 6K SFT training samples, over 10x less data than the baselines use. When further trained with direct preference optimization (DPO), deita-Mistral-7B + DPO, trained with 6K SFT and 10K DPO samples, achieves a 7.55 MT-Bench score and a 90.06% AlpacaEval score. We anticipate this work will provide tools for automatic data selection, facilitating data-efficient alignment. We release our models as well as the selected datasets so that future research can align models more effectively and efficiently.
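The selection strategy the abstract describes (score each sample along complexity and quality, then pick a small but diverse subset) can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the paper's exact recipe: the multiplicative combined score, the cosine-similarity cap `tau`, the precomputed `complexity`/`quality` fields, and the unit-normalized embeddings are all assumptions made for exposition.

```python
import numpy as np

def select_data(samples, embeddings, k=6000, tau=0.9):
    """Hypothetical score-first, diversity-aware selection sketch.

    samples:    list of dicts with precomputed 'complexity' and 'quality'
                scores (assumed fields, e.g. produced by a scorer model)
    embeddings: (N, d) array of unit-normalized sample embeddings, so a
                dot product equals cosine similarity
    k:          target subset size (6K in the paper's SFT setting)
    tau:        illustrative similarity cap between a candidate and the
                already-selected pool
    """
    # Rank candidates by a combined score; multiplying complexity and
    # quality is one simple way to fold both dimensions into one number.
    order = sorted(range(len(samples)),
                   key=lambda i: samples[i]["complexity"] * samples[i]["quality"],
                   reverse=True)

    selected = []
    for i in order:
        if len(selected) >= k:
            break
        # Diversity check: skip a candidate whose embedding is too close
        # (cosine similarity above tau) to anything already selected.
        if selected:
            sims = embeddings[selected] @ embeddings[i]
            if sims.max() > tau:
                continue
        selected.append(i)
    return [samples[i] for i in selected]
```

A greedy threshold keeps the diversity check cheap; at the few-thousand-sample scale the abstract targets, the scan over the selected pool is affordable without any approximate nearest-neighbor index.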
- SemDeDup: Data-efficient learning at web-scale through semantic deduplication. In ICLR 2023 Workshop on Multimodal Representation Learning: Perks and Pitfalls, 2023. URL https://openreview.net/forum?id=u96ZBg_Shna.
- Open LLM Leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, 2023.
- TART: A plug-and-play transformer module for task-agnostic reasoning. ArXiv preprint, abs/2306.07536, 2023. URL https://arxiv.org/abs/2306.07536.
- Instruction mining: High-quality instruction data selection for large language models. ArXiv preprint, abs/2307.06290, 2023. URL https://arxiv.org/abs/2307.06290.
- AlpaGasus: Training a better Alpaca with fewer data. ArXiv preprint, abs/2307.08701, 2023. URL https://arxiv.org/abs/2307.08701.
- Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2023.
- Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge, 2018.
- Free Dolly: Introducing the world's first truly open instruction-tuned LLM, 2023.
- UltraFeedback: Boosting language models with high-quality feedback, 2023.
- Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. ArXiv preprint, abs/2307.08691, 2023. URL https://arxiv.org/abs/2307.08691.
- Enhancing chat language models by scaling high-quality instructional conversations. ArXiv preprint, abs/2305.14233, 2023. URL https://arxiv.org/abs/2305.14233.
- Measuring massive multitask language understanding. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=d7KBjmI3GmQ.
- Mistral 7B, 2023.
- OpenAssistant Conversations: Democratizing large language model alignment. ArXiv preprint, abs/2304.07327, 2023. URL https://arxiv.org/abs/2304.07327.
- From quantity to quality: Boosting LLM performance with self-guided data selection for instruction tuning. ArXiv preprint, abs/2308.12032, 2023a. URL https://arxiv.org/abs/2308.12032.
- Self-alignment with instruction backtranslation. ArXiv preprint, abs/2308.06259, 2023b. URL https://arxiv.org/abs/2308.06259.
- AlpacaEval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 2023c.
- TruthfulQA: Measuring how models mimic human falsehoods. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 3214–3252, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.229. URL https://aclanthology.org/2022.acl-long.229.
- The Flan Collection: Designing data and methods for effective instruction tuning. ArXiv preprint, abs/2301.13688, 2023. URL https://arxiv.org/abs/2301.13688.
- #InsTag: Instruction tagging for analyzing supervised fine-tuning of large language models. ArXiv preprint, 2023.
- MTEB: Massive text embedding benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pp. 2014–2037, Dubrovnik, Croatia, 2023. Association for Computational Linguistics. URL https://aclanthology.org/2023.eacl-main.148.
- OpenAI. ChatGPT: Optimizing language models for dialogue. OpenAI Blog, 2022. URL https://openai.com/blog/chatgpt/.
- OpenAI. GPT-4 technical report, 2023.
- Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, 2022.
- Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems, 2023.
- ZeRO-Offload: Democratizing Billion-Scale model training. In 2021 USENIX Annual Technical Conference (USENIX ATC 21), pp. 551–564, 2021.
- Principle-driven self-alignment of language models from scratch with minimal human supervision. ArXiv preprint, abs/2305.03047, 2023. URL https://arxiv.org/abs/2305.03047.
- Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
- LLaMA: Open and efficient foundation language models. ArXiv preprint, abs/2302.13971, 2023a. URL https://arxiv.org/abs/2302.13971.
- Llama 2: Open foundation and fine-tuned chat models. ArXiv preprint, abs/2307.09288, 2023b. URL https://arxiv.org/abs/2307.09288.
- Zephyr: Direct distillation of LM alignment, 2023.
- Text embeddings by weakly-supervised contrastive pre-training. ArXiv preprint, abs/2212.03533, 2022. URL https://arxiv.org/abs/2212.03533.
- Self-Instruct: Aligning language models with self-generated instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 13484–13508, Toronto, Canada, 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.754. URL https://aclanthology.org/2023.acl-long.754.
- WizardLM: Empowering large language models to follow complex instructions. ArXiv preprint, abs/2304.12244, 2023. URL https://arxiv.org/abs/2304.12244.
- HellaSwag: Can a machine really finish your sentence? In Anna Korhonen, David Traum, and Lluís Màrquez (eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4791–4800, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1472. URL https://aclanthology.org/P19-1472.
- A preliminary study of the intrinsic relationship between complexity and alignment. ArXiv preprint, abs/2308.05696, 2023. URL https://arxiv.org/abs/2308.05696.
- Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. ArXiv preprint, abs/2306.05685, 2023. URL https://arxiv.org/abs/2306.05685.
- LIMA: Less is more for alignment. In Advances in Neural Information Processing Systems, 2023.