
Teaching Broad Reasoning Skills for Multi-Step QA by Generating Hard Contexts (2205.12496v2)

Published 25 May 2022 in cs.CL and cs.AI

Abstract: Question-answering datasets require a broad set of reasoning skills. We show how to use question decompositions to teach LLMs these broad reasoning skills in a robust fashion. Specifically, we use widely available QDMR representations to programmatically create hard-to-cheat synthetic contexts for real questions in six multi-step reasoning datasets. These contexts are carefully designed to avoid reasoning shortcuts prevalent in real contexts that prevent models from learning the right skills. This results in a pretraining dataset, named TeaBReaC, containing 525K multi-step questions (with associated formal programs) covering about 900 reasoning patterns. We show that pretraining standard LMs on TeaBReaC before fine-tuning them on target datasets improves their performance by up to 13 F1 points across 4 multi-step QA datasets, with up to 21 point gain on more complex questions. The resulting models also demonstrate higher robustness, with a 5-8 F1 point improvement on two contrast sets. Furthermore, TeaBReaC pretraining substantially improves model performance and robustness even when starting with numerate LMs pretrained using recent methods (e.g., PReasM, POET). Our work thus shows how to effectively use decomposition-guided contexts to robustly teach multi-step reasoning.

Summary

  • The paper demonstrates that using synthetic contexts via TeaBReaC significantly boosts language model performance, yielding up to 13 F1 point improvements on multi-step QA tasks.
  • It leverages QDMR annotations to systematically generate and balance synthetic contexts, effectively minimizing reasoning shortcuts.
  • Experimental results indicate that pretraining on TeaBReaC enhances robustness across in-distribution and complex question evaluations for various language models.

"Teaching Broad Reasoning Skills for Multi-Step QA by Generating Hard Contexts"

Introduction

The paper "Teaching Broad Reasoning Skills for Multi-Step QA by Generating Hard Contexts" (2205.12496) addresses the challenge of enhancing LLMs' capability in multi-step question answering (QA) tasks. Multi-step QA requires models to integrate various reasoning skills such as reading comprehension, numerical reasoning, and information synthesis. Existing datasets are limited in that they often allow models to exploit reasoning shortcuts, preventing them from developing the required reasoning skills. The paper proposes a method for generating synthetic contexts, referred to as TeaBReaC, to improve the robustness and performance of LLMs on multi-step reasoning tasks. Figure 1

Figure 1: TeaBReaC Dataset Construction: We use widely available question decomposition annotations (QDMRs) for real questions from a broad range of datasets to carefully construct synthetic contexts such that answering the resulting question requires proper multi-step reasoning. These questions are further re-balanced to help teach a broad set of reasoning skills.

Methodology

The TeaBReaC dataset is constructed by leveraging Question Decomposition Meaning Representation (QDMR) annotations to systematically generate synthetic contexts for questions sourced from six diverse datasets. The construction process controls for reasoning shortcuts by balancing the distribution of reasoning patterns and ensuring that questions reflect a wide range of reasoning skills (Figure 2).

Figure 2: A simplified example of a QA instance in TeaBReaC, with a (simplified) real question from the DROP dataset and the synthetic context we construct for it using the question's 3-step decomposition. Statements in red, yellow, and green form the synthetic context. The instance satisfies desirable properties P1, P2, and P3, and thus helps robustly teach multi-step reasoning skills.
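
The re-balancing step mentioned above can be pictured as capping how often any single reasoning pattern (i.e., program shape) appears in the pretraining data. Below is a minimal sketch of such a cap, assuming each instance carries its program as a list of operator steps; the field names and the cap value are illustrative, not the paper's exact procedure.

```python
import random
from collections import defaultdict

def balance_by_pattern(instances, max_per_pattern=1000, seed=0):
    """Downsample QA instances so no single reasoning pattern dominates.

    Assumes each instance is a dict with a "program" field whose steps each
    have an "op" name -- an illustrative schema, not the paper's exact format.
    """
    rng = random.Random(seed)
    by_pattern = defaultdict(list)
    for inst in instances:
        # The pattern key is the sequence of operator names in the program.
        pattern = tuple(step["op"] for step in inst["program"])
        by_pattern[pattern].append(inst)

    balanced = []
    for pattern, group in by_pattern.items():
        rng.shuffle(group)
        balanced.extend(group[:max_per_pattern])  # cap over-frequent patterns
    rng.shuffle(balanced)
    return balanced
```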

The process involves converting QDMRs into precise typed programs that define the steps needed to answer a question. These programs are then executed to construct synthetic contexts in which the answer can only be obtained by carrying out every reasoning step, rather than by following irrelevant shortcuts. The contexts use grounded entities so that individual reasoning steps cannot be bypassed.
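
To make this concrete, the sketch below executes a toy three-step program (select, filter, count) over a small synthetic context of grounded facts. The operator names, the fact format, and the example itself are simplified assumptions for exposition, not the paper's actual program language or context templates.

```python
# Toy illustration: executing a typed program over a synthetic context.
synthetic_context = [
    {"entity": "touchdown_1", "type": "touchdown", "scored_by": "Panthers"},
    {"entity": "touchdown_2", "type": "touchdown", "scored_by": "Falcons"},
    {"entity": "touchdown_3", "type": "touchdown", "scored_by": "Panthers"},
    {"entity": "field_goal_1", "type": "field goal", "scored_by": "Falcons"},
]

program = [
    ("SELECT", {"type": "touchdown"}),       # step 1: all touchdowns
    ("FILTER", {"scored_by": "Panthers"}),   # step 2: ...scored by the Panthers
    ("COUNT", {}),                           # step 3: how many
]

def execute(program, context):
    """Run each program step in order; the answer is the final state."""
    state = context
    for op, args in program:
        if op in ("SELECT", "FILTER"):
            state = [fact for fact in state
                     if all(fact.get(k) == v for k, v in args.items())]
        elif op == "COUNT":
            state = len(state)
    return state

print(execute(program, synthetic_context))  # -> 2
```

Because each fact is a distinct grounded entity, skipping any step (e.g., counting without filtering) yields a different, wrong answer, which is exactly the "hard-to-cheat" property the contexts are designed for.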

Experimental Setup

The effectiveness of TeaBReaC is assessed by pretraining LLMs on the dataset before fine-tuning them on target QA datasets. The research evaluates various LLMs, both plain (e.g., T5-Large) and numerate ones (e.g., NT5), to ascertain the performance and robustness gains from TeaBReaC pretraining (Figure 3).

Figure 3: F1 scores for plain and numerate LMs with and without TeaBReaC pretraining on DROP across varying numbers of steps, as determined by our programs. TeaBReaC pretraining helps more on more complex questions. The effect is more prominent on plain LMs like T5-L than on numerate LMs like PReasM-L.
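
Concretely, the recipe is a two-stage one: pretrain on TeaBReaC, then fine-tune on the target dataset. The sketch below shows one way to implement this for a seq2seq LM with HuggingFace transformers; the inline example pairs, hyperparameters, and training loop are simplified placeholders, not the authors' actual training code.

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

def train_stage(model, tokenizer, examples, epochs=1, lr=1e-4):
    """One training stage over (input_text, answer_text) string pairs.
    `examples` stands in for a real data loader over TeaBReaC or a target
    QA dataset (context + question serialized into the input string)."""
    optim = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for input_text, answer_text in examples:
            enc = tokenizer(input_text, return_tensors="pt", truncation=True)
            labels = tokenizer(answer_text, return_tensors="pt",
                               truncation=True).input_ids
            loss = model(**enc, labels=labels).loss
            loss.backward()
            optim.step()
            optim.zero_grad()
    return model

# T5-Large as in the paper; a smaller checkpoint can be substituted for a quick test.
tokenizer = AutoTokenizer.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained("t5-large")

# Placeholder data illustrating the serialized (input, answer) format.
teabreac_examples = [("question: How many touchdowns did the Panthers score? "
                      "context: <synthetic facts>", "2")]
drop_examples = [("question: ... context: <real DROP passage>", "14")]

# Stage 1: pretrain on synthetic TeaBReaC instances.
model = train_stage(model, tokenizer, teabreac_examples, epochs=1)
# Stage 2: fine-tune on the target dataset, e.g., DROP.
model = train_stage(model, tokenizer, drop_examples, epochs=3)
```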

Results and Discussion

TeaBReaC pretraining significantly boosts the performance of LMs on multiple QA datasets, achieving up to 13 F1 point improvements (with up to 21 points on complex questions). These performance gains are observed across in-distribution evaluations and robustness tests, indicating that models pretrained on TeaBReaC learn more reliable multi-step reasoning skills (Figure 4).

Figure 4: TeaBReaC Ablations: All aspects of the TeaBReaC pretraining data contribute to the overall performance: (i) primitive QA instances, (ii) multi-step QA instances, and (iii) balancing of the QDMR distribution.

The dataset notably enhances performance on complex questions requiring multiple reasoning steps, which are underrepresented in natural datasets. The ablation study reveals that every component of the TeaBReaC design, including the construction of QA instances and the QDMR balancing, is crucial for achieving these improvements.
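
For reference, the F1 figures reported above are token-overlap answer F1 scores, the standard metric in this line of QA evaluation. The sketch below shows the core of that metric using whitespace tokenization and lowercasing only; the official DROP/SQuAD evaluation scripts add further answer normalization (articles, punctuation, number handling) that is omitted here.

```python
from collections import Counter

def answer_f1(prediction: str, gold: str) -> float:
    """Bag-of-tokens F1 between a predicted and a gold answer string."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(answer_f1("14 field goals", "14"))  # partial credit: 0.5
```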

Conclusion

This research demonstrates that synthetic context generation is a powerful method for imparting robust multi-step reasoning skills to LLMs. By addressing the limitations of existing datasets and providing varied reasoning challenges, TeaBReaC represents a significant step forward in preparing models to handle complex QA tasks. The dataset's broad applicability offers numerous avenues for further research into teaching reasoning patterns and measuring reasoning robustness in LLMs.
