Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning (2402.10110v2)
Abstract: Instruction tuning is critical for LLMs to achieve better instruction-following and task-adaptation capabilities, but its success heavily relies on the quality of the training data. Many recent methods focus on improving data quality but often overlook its compatibility with the student model being finetuned. This paper introduces Selective Reflection-Tuning, a novel paradigm that synergizes a teacher LLM's reflection and introspection for improving existing data quality with the data selection capability of the student LLM, to automatically refine existing instruction-tuning data. This teacher-student collaboration produces high-quality, student-compatible instruction-response pairs, resulting in sample-efficient instruction tuning and LLMs of superior performance. Selective Reflection-Tuning is a data augmentation and synthesis method that generally improves LLM finetuning and self-improvement without collecting brand-new data. We apply the method to Alpaca and WizardLM data and obtain much stronger, top-tier 7B and 13B LLMs.
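The abstract describes a two-party pipeline: a teacher LLM reflects on each instruction-response pair, and the student LLM decides whether the reflected version suits it better than the original. Below is a minimal sketch of that recycling loop, with loudly labeled assumptions: the student's selection signal is taken to be an IFD-style conditional-perplexity ratio (and its reverse), in the spirit of the authors' related self-guided data-selection work cited below; the checkpoint name and the `reflect_instruction` / `reflect_response` callables are hypothetical stand-ins for prompting the teacher, not an API from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

STUDENT = "meta-llama/Llama-2-7b-hf"  # hypothetical student checkpoint
tok = AutoTokenizer.from_pretrained(STUDENT)
student = AutoModelForCausalLM.from_pretrained(STUDENT).eval()

@torch.no_grad()
def avg_nll(text: str, context: str = "") -> float:
    """Approximate per-token NLL of `text` given `context` under the student."""
    enc = tok(context + text, return_tensors="pt")
    n_ctx = len(tok(context).input_ids) if context else 0
    labels = enc.input_ids.clone()
    labels[:, :n_ctx] = -100              # mask the context: score only `text`
    return student(**enc, labels=labels).loss.item()

def ifd(instr: str, resp: str) -> float:
    # Instruction-Following Difficulty: higher means the instruction gives
    # the student less of a shortcut, i.e., the pair is more informative.
    return avg_nll(resp, context=instr) / avg_nll(resp)

def r_ifd(instr: str, resp: str) -> float:
    # Reversed ratio: lower means the student can more easily recover the
    # instruction from the response, i.e., the pair is student-compatible.
    return avg_nll(instr, context=resp) / avg_nll(instr)

def recycle(dataset, reflect_instruction, reflect_response):
    """dataset: iterable of (instruction, response) pairs; the reflect_*
    callables prompt the teacher LLM and return rewritten text (assumed)."""
    refined = []
    for instr, resp in dataset:
        # Teacher reflects on the instruction; the student keeps the
        # reflected pair only if it is more informative for itself.
        cand_i, cand_r = reflect_instruction(instr, resp)
        if ifd(cand_i, cand_r) > ifd(instr, resp):
            instr, resp = cand_i, cand_r
        # Teacher reflects on the response; the student keeps the version
        # whose instruction is easier to infer back from it.
        cand_r = reflect_response(instr, resp)
        if r_ifd(instr, cand_r) < r_ifd(instr, resp):
            resp = cand_r
        refined.append((instr, resp))
    return refined
```

The student-side gate is what makes the pipeline "selective": the same teacher output can be accepted for one student and rejected for another, which is how the method targets student-compatible rather than merely high-quality data.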
- Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
- Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.
- AlpaGasus: Training a better Alpaca with fewer data.
- Claude2-Alpaca: Instruction tuning datasets distilled from Claude. https://github.com/Lichang-Chen/claude2-alpaca.
- Cheng-Han Chiang and Hung-yi Lee. 2023. Can large language models be an alternative to human evaluations? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15607–15631, Toronto, Canada. Association for Computational Linguistics.
- Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality.
- Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
- Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge.
- Free Dolly: Introducing the world’s first truly open instruction-tuned LLM.
- FlashAttention: Fast and memory-efficient exact attention with IO-awareness.
- QLoRA: Efficient finetuning of quantized LLMs. arXiv preprint arXiv:2305.14314.
- Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233.
- GLM: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320–335, Dublin, Ireland. Association for Computational Linguistics.
- Alpacafarm: A simulation framework for methods that learn from human feedback.
- A framework for few-shot language model evaluation.
- Measuring massive multitask language understanding. In International Conference on Learning Representations.
- Large language models can self-improve. arXiv preprint arXiv:2210.11610.
- UNIFIEDQA: Crossing format boundaries with a single QA system. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1896–1907, Online. Association for Computational Linguistics.
- Diederik P. Kingma and Jimmy Ba. 2017. Adam: A method for stochastic optimization.
- Look at the first sentence: Position bias in question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1109–1121, Online. Association for Computational Linguistics.
- RLAIF: Scaling reinforcement learning from human feedback with AI feedback. arXiv preprint arXiv:2309.00267.
- Generative judge for evaluating alignment.
- Reflection-tuning: Recycling data for better instruction-tuning. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following.
- Superfiltering: Weak-to-strong data filtering for fast instruction-tuning. arXiv preprint arXiv:2402.00530.
- From quantity to quality: Boosting LLM performance with self-guided data selection for instruction tuning. arXiv preprint arXiv:2308.12032.
- Self-alignment with instruction backtranslation. arXiv preprint arXiv:2308.06259.
- AlpacaEval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval.
- TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252, Dublin, Ireland. Association for Computational Linguistics.
- What makes good data for alignment? A comprehensive study of automatic data selection in instruction tuning.
- G-Eval: NLG evaluation using GPT-4 with better human alignment.
- The Flan Collection: Designing data and methods for effective instruction tuning. arXiv preprint arXiv:2301.13688.
- Cross-task generalization via natural language crowdsourcing instructions. arXiv preprint arXiv:2104.08773.
- Orca 2: Teaching small language models how to reason.
- OpenAI. 2023. GPT-4 technical report.
- Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, volume 35, pages 27730–27744. Curran Associates, Inc.
- Automatically correcting large language models: Surveying the landscape of diverse self-correction strategies.
- Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277.
- Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.
- Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca.
- Xwin-LM Team. 2023. Xwin-LM.
- Llama 2: Open foundation and fine-tuned chat models.
- Zephyr: Direct distillation of LM alignment.
- Koala: An index for quantifying overlaps with pre-training corpora.
- OpenChat: Advancing open-source language models with mixed-quality data. arXiv preprint arXiv:2309.11235.
- Large language models are not fair evaluators.
- Shepherd: A critic for language model generation.
- Self-instruct: Aligning language models with self-generated instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13484–13508, Toronto, Canada. Association for Computational Linguistics.
- Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5085–5109, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Finetuned language models are zero-shot learners. In International Conference on Learning Representations.
- Chain-of-thought prompting elicits reasoning in large language models.
- LaMini-LM: A diverse herd of distilled models from large-scale instructions.
- WizardLM: Empowering large language models to follow complex instructions.
- Rethinking the instruction quality: LIFT is what you need.
- Tree of thoughts: Deliberate problem solving with large language models.
- CrossFit: A few-shot learning challenge for cross-task generalization in NLP. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7163–7189, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- SelFee: Iterative self-revising LLM empowered by self-feedback generation. Blog post.
- HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, Florence, Italy. Association for Computational Linguistics.
- Instruction tuning for large language models: A survey.
- Judging LLM-as-a-judge with MT-Bench and Chatbot Arena.
- LIMA: Less is more for alignment.