- The paper demonstrates that combining multitask pretraining with dense retrieval enhances small language model generalization on unseen compositional questions.
- It introduces a query transformation strategy that reframes complex question answering as reading comprehension over retrieved evidence, moving beyond conventional two-hop retrieval.
- Experimental results, including on StrategyQA, reveal that retrieval-augmented training significantly narrows the performance gap with larger models.
Enhancing Smaller LLMs with Multitask Pretraining and Dense Retrieval for Compositional Question Answering
Introduction
Recent advances have shown that large pretrained language models can answer compositional questions never seen during training. However, their applicability can be limited by practical considerations such as latency, cost, and energy efficiency. This paper explores the extent to which smaller LLMs, enhanced by multitask pretraining and dense retrieval, can generalize to answering complex, unseen questions. Specifically, it investigates a model pretrained on 93 diverse reasoning tasks and augmented with a dense retrieval system, focusing on compositional questions whose answers may not be directly inferable from a given corpus.
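To make the setup concrete, here is a minimal sketch of that pipeline, assuming an off-the-shelf dense retriever and a small open seq2seq reader in place of the paper's own trained components; the model names, toy corpus, and prompt format are illustrative assumptions, not the paper's released code.

```python
# Sketch: a small seq2seq "reasoning" model answers a question by reading
# passages returned from a dense retriever over an external corpus.
# All component choices below are illustrative stand-ins.
from sentence_transformers import SentenceTransformer, util
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

corpus = [
    "The Eiffel Tower was completed in 1889 and stands in Paris.",
    "Gustave Eiffel's company designed and built the tower.",
    "Paris is the capital city of France.",
]

# Dense retriever: embed the question and every passage, keep the top-k matches.
retriever = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
corpus_emb = retriever.encode(corpus, convert_to_tensor=True)

def retrieve(question: str, k: int = 2) -> list[str]:
    q_emb = retriever.encode(question, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, corpus_emb)[0]
    top = scores.topk(k).indices.tolist()
    return [corpus[i] for i in top]

# Smaller reader: retrieved passages turn the question into reading comprehension.
tok = AutoTokenizer.from_pretrained("google/flan-t5-small")
reader = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

def answer(question: str) -> str:
    context = " ".join(retrieve(question))
    prompt = f"question: {question} context: {context}"
    inputs = tok(prompt, return_tensors="pt")
    out = reader.generate(**inputs, max_new_tokens=32)
    return tok.decode(out[0], skip_special_tokens=True)

print(answer("Who designed the tower completed in Paris in 1889?"))
```

Keeping retrieval and reading separate in this way lets a small reader handle questions whose answers are not stored in its parameters, which is the premise the paper tests.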
The paper builds upon and extends previous methodologies in retrieval-augmented question answering. It broadens the multitask pretraining approach with a wider range of tasks designed to instill versatile reasoning strategies. Rather than relying solely on knowledge encoded in model parameters, it applies a query transformation strategy that turns question answering into reading comprehension over evidence retrieved from an external corpus. It also moves beyond the two-hop retrieval limit, capturing longer and more complex reasoning paths. Comparative analyses show that the iterative retrieval, reranking, and evidence-scoring pipeline aligns promisingly with human reasoning patterns, especially in multi-hop question answering.
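A rough sketch of such an iterative retrieve-rerank-score loop that is not capped at two hops might look like the following; the TF-IDF scorer, stopping threshold, and one-passage-per-hop rule are simple stand-ins for the paper's trained retriever, reranker, and evidence-scoring components, not its actual implementation.

```python
# Illustrative multi-hop retrieval with query transformation: at each hop the
# query is extended with the evidence kept so far, remaining passages are
# re-scored against the extended query, and the best new passage is appended.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "Marie Curie was born in Warsaw.",
    "Warsaw is the capital of Poland.",
    "Poland joined the European Union in 2004.",
    "The Vistula is the longest river in Poland.",
]

vectorizer = TfidfVectorizer().fit(corpus)
corpus_vecs = vectorizer.transform(corpus)

def iterative_retrieve(question: str, max_hops: int = 3, min_score: float = 0.05):
    evidence: list[str] = []
    used: set[int] = set()
    for _ in range(max_hops):
        # Query transformation: the question plus everything retrieved so far.
        query = " ".join([question] + evidence)
        scores = cosine_similarity(vectorizer.transform([query]), corpus_vecs)[0]
        # Rerank the passages not yet used and keep the best one for this hop.
        ranked = sorted(
            (i for i in range(len(corpus)) if i not in used),
            key=lambda i: scores[i],
            reverse=True,
        )
        if not ranked or scores[ranked[0]] < min_score:
            break  # evidence-scoring stand-in: stop when support is too weak
        used.add(ranked[0])
        evidence.append(corpus[ranked[0]])
    return evidence

print(iterative_retrieve("Which country's capital was Marie Curie born in?"))
```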
Experiments and Results
The research evaluates its hypothesis on six diverse datasets designed to test textual and numerical reasoning. Notably, models trained with retrieval-augmented training datasets (RATD) significantly outperformed the baselines, demonstrating that smaller models can generalize effectively from observed compositional reasoning to unseen problems. However, the paper also uncovers weaknesses in numerical literacy and in handling unanswerable questions, especially where the retrieval system introduces plausible but misleading information.
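As an illustration of how a retrieval-augmented training (RATD) sample might be assembled, the sketch below pairs a question with retrieved passages, some only partially relevant, so the model must learn heuristic strategies for reading noisy context; the field names and prompt template are assumptions for illustration, not the paper's exact data format.

```python
# Hedged sketch of assembling one RATD-style training sample: the question is
# concatenated with retrieved (possibly noisy) passages as the input, with the
# gold answer as the target.
def build_ratd_sample(question: str, retrieved: list[str], answer: str) -> dict:
    context = " ".join(f"[{i}] {p}" for i, p in enumerate(retrieved, start=1))
    return {
        "input": f"question: {question} context: {context}",
        "target": answer,
    }

sample = build_ratd_sample(
    question="Would a pet tortoise outlive a pet hamster?",
    retrieved=[
        "Tortoises commonly live for more than 50 years.",
        "Hamsters have an average lifespan of about two to three years.",
        "Some tortoises are kept in outdoor enclosures.",  # partially relevant passage
    ],
    answer="yes",
)
print(sample["input"])
```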
In detailed experiments comparing baselines trained without RATD data to models trained with it, performance consistently improves across datasets when models acquire heuristic reasoning strategies through RATD. For instance, on the StrategyQA dataset, the augmented models show markedly better generalization, in certain settings approaching the performance of much larger LLMs.
Discussion and Future Directions
The paper's findings underline the potential of combining multitask pretraining with sophisticated retrieval mechanisms to enhance smaller models' performance on complex question-answering tasks. This approach not only contributes to the development of more accessible and versatile AI tools but also offers insights into the mechanics of knowledge application and reasoning in AI systems. The extension of multitask pretraining to incorporate a more extensive array of reasoning strategies and the refinement of retrieval systems for better context relevance and evidence scoring are outlined as promising areas for future research. Additionally, the paper hints at exploring the balance between encoded knowledge in model parameters and dynamically retrieved information for efficient and accurate problem-solving in AI.
Conclusion
This paper makes significant strides in advancing smaller LLMs for question answering, combining the strengths of dense retrieval with the generalization afforded by multitask pretraining. Its contributions deepen our understanding of how AI systems can approximate complex human reasoning patterns and set a benchmark for future work on making such systems both more efficient and more broadly effective.