What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning

Abstract

Instruction tuning is a standard technique employed to align large language models (LLMs) to end tasks and user preferences after the initial pretraining phase. Recent research indicates the critical role of data engineering in instruction tuning: when appropriately selected, only limited data is necessary to achieve superior performance. However, we still lack a principled understanding of what makes good instruction tuning data for alignment, and how we should select data automatically and effectively. In this work, we delve deeply into automatic data selection strategies for alignment. We start with controlled studies to measure data across three dimensions: complexity, quality, and diversity, along which we examine existing methods and introduce novel techniques for enhanced data measurement. Subsequently, we propose a simple strategy to select data samples based on these measurements. We present DEITA (short for Data-Efficient Instruction Tuning for Alignment), a series of models fine-tuned from LLaMA and Mistral models using data samples automatically selected with our proposed approach. Empirically, DEITA performs better than or on par with state-of-the-art open-source alignment models using only 6K SFT training samples -- over 10x less than the data used in the baselines. When further trained with direct preference optimization (DPO), DEITA-Mistral-7B + DPO, trained with 6K SFT and 10K DPO samples, achieves a 7.55 MT-Bench score and a 90.06% AlpacaEval score. We anticipate this work will provide tools for automatic data selection, facilitating data-efficient alignment. We release our models as well as the selected datasets for future research on efficient and effective model alignment.

Figure: The data selection approach, which measures complexity, quality, and diversity using an evolution-based method for sample collection.

Overview

  • The paper explores data selection techniques for fine-tuning LLMs and introduces new methods to measure data quality for effective instruction tuning.

  • Controlled experiments are conducted to assess the impact of data complexity and quality on the performance of LLMs, and new metrics are established using these results.

  • New methods, 'Evol-Complexity' and 'Evol-Quality', are presented: they iteratively evolve instruction samples to enable fine-grained complexity measurement and to enhance response quality.

  • The 'Repr Filter' is proposed to maintain dataset diversity; it is shown to improve model alignment by prioritizing diverse samples in the training set.

  • The Data-Efficient Instruction Tuning (DEITA) approach achieves high alignment performance with fewer samples, demonstrating the importance of sophisticated data selection.

Data Selection for Model Alignment

Introduction

The process of improving LLMs often involves instruction tuning, where the model is fine-tuned on specific datasets after pretraining. A pivotal facet of instruction tuning efficacy is the selection of appropriate data. Although the significance of data engineering is acknowledged, a systematic method for identifying optimal instruction tuning data remains undefined. This paper explores data selection techniques, conducting controlled experiments and developing new metrics for appraising data, with the aim of enhancing instruction tuning performance.

Measuring Data for Alignment

In the context of data selection, the research measures data properties through controlled experiments along three dimensions: complexity, quality, and diversity. Samples are scored on each dimension, and data selection strategies are formed from these scores. Benchmarks are established using multiple datasets, revealing how variance in complexity and quality affects model performance.

Complexity and Data Selection

Higher complexity is often associated with better instruction tuning outcomes. The 'Evol-Complexity' method is introduced: it leverages an LLM such as ChatGPT to evolve a set of instruction samples into progressively more complex variants and then score them, capturing fine-grained complexity differences. Results show the method is robust across diverse datasets, with superior performance in both high- and low-quality data settings.
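
Below is a minimal Python sketch of the Evol-Complexity idea: evolve an instruction into progressively more complex variants, then ask an LLM judge to assign each variant a fine-grained complexity score. The prompt wordings, model choice, and the chat() helper are illustrative assumptions rather than the paper's exact prompts or pipeline (the paper additionally distills such scores into a small trained scorer).

```python
# Sketch of Evol-Complexity scoring; prompts are paraphrased assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def chat(prompt: str, model: str = "gpt-3.5-turbo") -> str:
    """Send a single-turn prompt to the evolver/judge model."""
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

EVOLVE = (
    "Rewrite the instruction below to make it more complex, e.g. by adding "
    "constraints or deepening the reasoning required. Keep it answerable.\n\n"
    "Instruction: {instruction}"
)
RANK = (
    "The instructions below share a topic but differ in complexity. Give each "
    "a complexity score from 1 (trivial) to 10 (very complex), one "
    "'index: score' pair per line.\n\n{variants}"
)

def evol_complexity(instruction: str, iterations: int = 4) -> str:
    variants = [instruction]
    for _ in range(iterations):          # each pass yields a harder variant
        variants.append(chat(EVOLVE.format(instruction=variants[-1])))
    listing = "\n".join(f"{i}. {v}" for i, v in enumerate(variants))
    return chat(RANK.format(variants=listing))  # raw 'index: score' text
```

Scoring all variants of one instruction in a single ranking prompt, rather than one at a time, is what surfaces the fine-grained differences the section describes.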

Quality Assessment

Data quality is crucial, especially when the available data pool exhibits considerable variance in sample quality. 'Evol-Quality' is developed: it prompts ChatGPT to enhance response quality iteratively and then score the resulting versions. Like Evol-Complexity, the method relies on nuanced scoring, yielding consistent improvements in alignment performance, particularly on datasets with a high incidence of low-quality examples.
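
A complementary sketch of the Evol-Quality idea, under the same assumptions as above (paraphrased prompts, illustrative model name): iteratively rewrite a response to be more helpful and detailed, then have the judge score each version.

```python
# Sketch of Evol-Quality scoring; prompts are paraphrased assumptions.
from openai import OpenAI

client = OpenAI()

def chat(prompt: str, model: str = "gpt-3.5-turbo") -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

ENHANCE = (
    "Improve the response below so it is more helpful, accurate, and "
    "detailed, without changing which question it answers.\n\n"
    "Instruction: {instruction}\nResponse: {response}"
)
SCORE = (
    "For the instruction below, score each candidate response for quality "
    "from 1 (poor) to 10 (excellent), one 'index: score' pair per line.\n\n"
    "Instruction: {instruction}\nResponses:\n{responses}"
)

def evol_quality(instruction: str, response: str, iterations: int = 4) -> str:
    versions = [response]
    for _ in range(iterations):        # each pass yields a richer response
        versions.append(chat(ENHANCE.format(
            instruction=instruction, response=versions[-1])))
    listing = "\n".join(f"{i}. {v}" for i, v in enumerate(versions))
    return chat(SCORE.format(instruction=instruction, responses=listing))
```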

Diversity of Data

Acknowledging that a proficient LLM should handle diverse requests, a selection strategy is formulated to ensure dataset diversity while maintaining complexity and quality. An iterative, embedding-based strategy, the 'Repr Filter', is proposed: it adds a candidate to the training set only if that candidate is sufficiently distinct from the samples already selected. This method outperforms alternative strategies, yielding superior model alignment.
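
Here is a minimal sketch of a Repr-Filter-style diversity sweep, assuming candidates have been embedded with a sentence encoder and pre-sorted by data score; the cosine-distance threshold tau and the 6K budget are illustrative values, not necessarily the paper's settings.

```python
# Sketch of a Repr-Filter-style diversity sweep over scored candidates.
import numpy as np

def repr_filter(embeddings: np.ndarray, tau: float = 0.1,
                budget: int = 6000) -> list[int]:
    """embeddings: (n, d) array of candidate embeddings, pre-sorted by
    descending data score. Returns indices of the selected samples."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    selected: list[int] = []
    for i in range(len(unit)):
        if selected:
            # cosine similarity to the nearest already-selected sample
            nearest_sim = float(np.max(unit[selected] @ unit[i]))
            if 1.0 - nearest_sim <= tau:   # too similar: redundant, skip
                continue
        selected.append(i)
        if len(selected) >= budget:        # stop once the data budget is met
            break
    return selected
```

Because the sweep visits candidates in score order, the filter trades off only redundancy, never complexity or quality: a near-duplicate is skipped in favor of the next-best distinct sample.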

Data-Efficient Instruction Tuning (DEITA)

DEITA encompasses models fine-tuned with data selected for complexity, quality, and diversity. The proposed score-first, diversity-aware selection strategy significantly reduces the number of samples required for effective alignment: candidates are ranked by their combined score and then swept with the diversity filter until the data budget is met. DEITA models, built on preexisting LLMs, reach or surpass the alignment performance of models trained on much larger datasets, underlining the efficacy of careful data selection and the reduced computational cost it enables.
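
Putting the pieces together, a minimal end-to-end sketch of score-first, diversity-aware selection under the assumptions above; combining complexity and quality as a simple product is one plausible reading of the combined score.

```python
# Sketch of the full score-first, diversity-aware selection pipeline.
import numpy as np

def select_deita_style(complexity: np.ndarray, quality: np.ndarray,
                       embeddings: np.ndarray, budget: int = 6000,
                       tau: float = 0.1) -> list[int]:
    """complexity, quality: (n,) per-sample scores; embeddings: (n, d)."""
    combined = complexity * quality         # single per-sample data score
    order = np.argsort(-combined)           # score-first: best samples first
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    selected: list[int] = []
    for i in order:
        if selected:
            nearest_sim = float(np.max(unit[selected] @ unit[i]))
            if 1.0 - nearest_sim <= tau:    # fails the diversity check
                continue
        selected.append(int(i))
        if len(selected) >= budget:
            break
    return selected
```

With per-sample scores and embeddings in hand, a call like select_deita_style(c, q, E) would yield a roughly 6K-sample pool of the kind used to fine-tune the DEITA models.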

Experimentation and Findings

Extensive experimentation with models based on different LLM architectures demonstrates that DEITA models outperform other instruction-tuned models aligned solely with supervised fine-tuning. Even when compared with models trained with reinforcement learning from human feedback, DEITA shows commendable performance, particularly when direct preference optimization is applied after supervised fine-tuning.

Conclusion

This work establishes clear methodologies for ascertaining what constitutes "good data" for model alignment through instruction tuning. The creation of DEITA and its associated models is a step toward more data-efficient alignment. The models and selected datasets have been released to support further research on efficient alignment.

