
Abstract

We equip a smaller Language Model to generalise to answering challenging compositional questions that have not been seen in training. To do so we propose a combination of multitask supervised pretraining on up to 93 tasks designed to instill diverse reasoning abilities, and a dense retrieval system that aims to retrieve a set of evidential paragraph fragments. Recent progress in question-answering has been achieved either through prompting methods against very large pretrained Language Models in zero or few-shot fashion, or by fine-tuning smaller models, sometimes in conjunction with information retrieval. We focus on the less explored question of the extent to which zero-shot generalisation can be enabled in smaller models with retrieval against a corpus within which sufficient information to answer a particular question may not exist. We establish strong baselines in this setting for diverse evaluation datasets (StrategyQA, CommonsenseQA, IIRC, DROP, Musique and ARC-DA), and show that performance can be significantly improved by adding retrieval-augmented training datasets which are designed to expose our models to a variety of heuristic reasoning strategies such as weighing partial evidence or ignoring an irrelevant context.

Figure: The Iterator and QA Model process an initial query, scoring and selecting the top-ranked sentences at each step.

Overview

  • Investigates the ability of smaller language models enhanced by multitask pretraining and dense retrieval systems to generalize to complex, unseen compositional questions.

  • Advances multitask pretraining by incorporating a broader range of tasks designed to encourage versatile reasoning strategies, offering an alternative to relying on very large pretrained language models alone.

  • Demonstrates through comparative analyses that models using retrieval-augmented training datasets significantly outperform baselines, especially in multi-hop question answering contexts.

  • Highlights the potential for future research in refining multitask pretraining and retrieval systems to improve the context relevance and evidence scoring for efficient and accurate AI problem-solving.

Enhancing Smaller Language Models with Multitask Pretraining and Dense Retrieval for Compositional Question Answering

Introduction

Recent advances have shown the effectiveness of large pretrained language models (LLMs) in question answering, including on compositional questions never seen during training. However, the applicability of these models can be limited by practical considerations such as latency, cost, and energy efficiency. This paper explores the extent to which smaller language models, enhanced by multitask pretraining and dense retrieval, can generalize to answering complex, unseen questions. Specifically, it investigates a model pretrained on 93 diverse reasoning tasks and augmented with a dense retrieval system, focusing on compositional questions whose answers may not be directly inferable from a given corpus.
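To make the setup concrete, the sketch below shows how a retrieval-augmented reader of this kind can be wired together: a question is concatenated with retrieved evidence and passed to a smaller sequence-to-sequence model. The model choice (facebook/bart-large) and the prompt template are illustrative assumptions, not the authors' published configuration.

```python
# Minimal sketch of a retrieval-augmented reader (assumed setup, not the
# authors' exact code): a smaller seq2seq model answers a question
# conditioned on retrieved context.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "facebook/bart-large"  # illustrative smaller model; in practice
                                    # this would be the multitask-pretrained checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def answer(question: str, context: str, max_new_tokens: int = 64) -> str:
    # Concatenating retrieved evidence turns the task into reading
    # comprehension rather than recall from parameters alone.
    prompt = f"question: {question} context: {context}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```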

System Components and Related Work

The study builds upon and extends previous methodologies in retrieval-augmented question answering. It broadens the multitask pretraining approach to a wider range of tasks designed to instill versatile reasoning strategies. Rather than relying solely on knowledge encoded in an LLM's parameters, this work retrieves relevant information from an external corpus, transforming question answering into a reading-comprehension problem. Moreover, it moves beyond two-hop retrieval limits, enabling the capture of more complex reasoning paths. Comparative analyses show that the iterative retrieval, reranking, and evidence-scoring system (the Iterator) produces reasoning paths that align well with human reasoning patterns, especially in multi-hop question answering.
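As a rough illustration of the iterative retrieval idea (a sketch only; the encoder, similarity scoring, stopping rule, and hop count are assumptions rather than the paper's Iterator implementation), each hop embeds the current query, scores corpus fragments by dense similarity, keeps the top-scoring new evidence, and folds it back into the query:

```python
# Illustrative iterative ("multi-hop") dense retrieval loop. Encoder choice and
# query-expansion scheme are assumptions, not the authors' implementation.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")            # assumed encoder
corpus = ["...paragraph fragment 1...", "...paragraph fragment 2..."]  # toy corpus
corpus_emb = encoder.encode(corpus, normalize_embeddings=True)

def iterative_retrieve(question: str, hops: int = 3, k: int = 2) -> list[str]:
    query, evidence = question, []
    for _ in range(hops):                          # not limited to two hops
        q_emb = encoder.encode([query], normalize_embeddings=True)
        scores = corpus_emb @ q_emb[0]             # dense similarity scores
        top = np.argsort(-scores)[:k]
        new = [corpus[i] for i in top if corpus[i] not in evidence]
        if not new:                                # stop when nothing new is retrieved
            break
        evidence.extend(new)
        query = question + " " + " ".join(evidence)  # expand the query with evidence
    return evidence
```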

Experiments and Results

The research evaluates its hypothesis on six diverse datasets chosen to test textual and numerical reasoning. Notably, models trained with retrieval-augmented training datasets (RATD) significantly outperformed the baselines, demonstrating that smaller models can generalize effectively from observed compositional reasoning to unseen problems. However, the study also uncovers challenges in the models' numerical literacy and in handling unanswerable questions, especially where retrieval may introduce plausible but misleading information.

In detailed experiments comparing baseline models without RATD datasets to those augmented with them, findings consistently show an improvement in performance across various datasets when models are equipped with heuristic reasoning strategies acquired through RATD. For instance, on the StrategyQA dataset, augmented models demonstrated superior generalization abilities, even approaching the performance levels of much larger language models in certain contexts.
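The sketch below illustrates, under assumed field names and formatting, how a retrieval-augmented training (RATD) sample might be assembled: the question is paired with retrieved fragments that can be partial or irrelevant, while the target remains the gold answer, so the reader must acquire heuristic strategies such as weighing partial evidence or ignoring an unhelpful context.

```python
# Hedged sketch of assembling a RATD-style training sample. Field names,
# formatting, and the example question are illustrative assumptions.
def build_ratd_sample(question: str, retrieved: list[str], answer: str) -> dict:
    # Retrieved fragments are kept as-is, even when they only partially
    # support the answer; the supervision signal is still the gold answer.
    context = " ".join(retrieved)
    return {
        "input": f"question: {question} context: {context}",
        "target": answer,
    }

sample = build_ratd_sample(
    "Would a pear sink in water?",   # StrategyQA-style question (illustrative)
    ["Pears have a density slightly below that of water.",
     "Water has a density of about 1 g/cm3."],
    "No",
)
```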

Discussion and Future Directions

The paper's findings underline the potential of combining multitask pretraining with sophisticated retrieval mechanisms to enhance smaller models' performance on complex question-answering tasks. This approach not only contributes to the development of more accessible and versatile AI tools but also offers insights into the mechanics of knowledge application and reasoning in AI systems. The extension of multitask pretraining to incorporate a more extensive array of reasoning strategies and the refinement of retrieval systems for better context relevance and evidence scoring are outlined as promising areas for future research. Additionally, the study hints at exploring the balance between encoded knowledge in model parameters and dynamically retrieved information for efficient and accurate problem-solving in AI.

Conclusion

This paper makes significant strides in advancing the capabilities of smaller language models for question answering, bridging the gap between the advantages of dense retrieval systems and the generalizability afforded by multitask pretraining. Its contributions not only enhance the understanding of how AI can mimic complex human reasoning patterns but also set a new benchmark for future work in making AI both more efficient and effective across a broader range of applications.
