Self-Refine Instruction-Tuning for Aligning Reasoning in Language Models (2405.00402v1)
Abstract: The alignment of reasoning abilities between smaller and larger LLMs is largely conducted via Supervised Fine-Tuning (SFT) on demonstrations generated by robust LLMs. Although these approaches yield more performant models, they do not show sufficiently strong generalization, since training relies only on the provided demonstrations. In this paper, we propose Self-refine Instruction-tuning, a method that elicits smaller LLMs to self-refine their abilities. Our approach is a two-stage process: reasoning abilities are first transferred from LLMs to Small LLMs (SLMs) via instruction tuning on demonstrations provided by the LLMs, and the instructed models then self-refine their abilities through preference optimization. In particular, the second phase applies refinement heuristics based on the Direct Preference Optimization (DPO) algorithm: the SLMs are elicited to deliver a series of reasoning paths by automatically sampling their generated responses, which are rewarded against ground truths from the LLMs. Results on commonsense and math reasoning tasks show that this approach significantly outperforms instruction tuning in both in-domain and out-of-domain scenarios, aligning the reasoning abilities of smaller and larger LLMs.
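The self-refinement phase described above can be sketched in code: sampled reasoning paths are split into preferred/rejected pairs by checking their final answers against the teacher's ground truth, and each pair contributes a standard DPO loss term. This is a minimal illustrative sketch, not the paper's implementation; the function names, the dictionary layout of a sampled path, and the `beta=0.1` default are assumptions for illustration.

```python
import math


def dpo_loss(logp_pol_chosen, logp_pol_rejected,
             logp_ref_chosen, logp_ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    logp_pol_* are sequence log-probabilities under the policy (the SLM
    being refined); logp_ref_* are under the frozen reference model
    (the instruction-tuned checkpoint). beta scales the implicit reward.
    """
    margin = beta * ((logp_pol_chosen - logp_ref_chosen)
                     - (logp_pol_rejected - logp_ref_rejected))
    # -log sigmoid(margin): small when the policy prefers the chosen path
    return -math.log(1.0 / (1.0 + math.exp(-margin)))


def build_preference_pairs(sampled_paths, gold_answer):
    """Label sampled reasoning paths using the teacher's ground truth.

    Paths whose final answer matches the gold answer become 'chosen';
    the rest become 'rejected'. Every chosen/rejected combination
    yields one preference pair for DPO training.
    """
    chosen = [p for p in sampled_paths if p["answer"] == gold_answer]
    rejected = [p for p in sampled_paths if p["answer"] != gold_answer]
    return [(c, r) for c in chosen for r in rejected]
```

For example, three sampled paths with answers `"4"`, `"5"`, `"4"` against gold answer `"4"` produce two preference pairs, and a pair where policy and reference agree exactly gives the neutral loss `log 2`.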
Authors: Leonardo Ranaldi, Andrè Freitas