Improving Language Model Reasoning with Self-motivated Learning (2404.07017v3)
Abstract: Large-scale, high-quality training data is important for improving model performance. After being trained on data that contains rationales (reasoning steps), models gain reasoning capability. However, datasets with high-quality rationales are relatively scarce due to the high cost of annotation. To address this issue, we propose the \textit{Self-motivated Learning} framework, which motivates the model itself to automatically generate rationales on existing datasets. Based on the inherent ranking by correctness across multiple rationales, the model learns to generate better rationales, leading to stronger reasoning capability. Specifically, we train a reward model with this ranking to evaluate the quality of rationales, and we improve reasoning performance through reinforcement learning. Experimental results with Llama2 7B on multiple reasoning datasets show that our method significantly improves the reasoning ability of models, even outperforming text-davinci-002 on some datasets.
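As a rough illustration of the pipeline the abstract describes, the sketch below shows one way the correctness-based ranking data could be assembled before reward-model training: sample several rationales per question from the current model, split them by whether their final answer matches the gold answer, and form (chosen, rejected) pairs. All names here (`collect_ranked_rationales`, `generate_rationales`, `samples_per_question`) are hypothetical placeholders rather than the authors' actual code, and the pairwise format is an assumption modeled on standard reward-model training; the reward-model fitting and the subsequent RL fine-tuning step are not shown.

```python
# Minimal sketch, assuming a sampling function supplied by the caller.
# Hypothetical helper names; not the paper's implementation.
from typing import Callable, List, Tuple

def collect_ranked_rationales(
    questions: List[str],
    gold_answers: List[str],
    generate_rationales: Callable[[str, int], List[Tuple[str, str]]],
    samples_per_question: int = 8,
) -> List[Tuple[str, str, str]]:
    """For each question, sample (rationale, answer) pairs from the current
    model and label each rationale by whether its answer is correct. The
    correct/incorrect split provides the inherent rank described in the
    abstract; here it is materialized as (question, chosen, rejected) pairs."""
    preference_pairs: List[Tuple[str, str, str]] = []
    for question, gold in zip(questions, gold_answers):
        samples = generate_rationales(question, samples_per_question)
        correct = [r for r, a in samples if a.strip() == gold.strip()]
        incorrect = [r for r, a in samples if a.strip() != gold.strip()]
        # Pair each correct rationale with each incorrect one so a reward
        # model can be trained to score the former above the latter.
        for chosen in correct:
            for rejected in incorrect:
                preference_pairs.append((question, chosen, rejected))
    return preference_pairs
```

Under this reading, the resulting pairs would feed a pairwise ranking loss for the reward model, whose scores then serve as the reward signal during reinforcement-learning fine-tuning of the rationale generator; the exact loss and RL algorithm are details left to the paper itself.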