InPars: Data Augmentation for Information Retrieval using Large Language Models

Published 10 Feb 2022 in cs.CL | (2202.05144v1)

Abstract: The information retrieval community has recently witnessed a revolution due to large pretrained transformer models. Another key ingredient for this revolution was the MS MARCO dataset, whose scale and diversity have enabled zero-shot transfer learning to various tasks. However, not all IR tasks and domains can benefit from one single dataset equally. Extensive research in various NLP tasks has shown that using domain-specific training data, as opposed to a general-purpose one, improves the performance of neural models. In this work, we harness the few-shot capabilities of large pretrained language models as synthetic data generators for IR tasks. We show that models finetuned solely on our unsupervised dataset outperform strong baselines such as BM25 as well as recently proposed self-supervised dense retrieval methods. Furthermore, retrievers finetuned on both supervised and our synthetic data achieve better zero-shot transfer than models finetuned only on supervised data. Code, models, and data are available at https://github.com/zetaalphavector/inpars.

Authors (4)

Summary

The paper introduces InPars, a novel method that uses few-shot LLMs to generate synthetic training data for information retrieval.
It demonstrates that retrievers finetuned solely on InPars-generated data outperform BM25 and recent self-supervised dense retrieval methods, and that adding the synthetic data to supervised data improves zero-shot transfer over supervised data alone.
The study validates the approach across diverse datasets, underscoring its cost-effectiveness, scalability, and improved domain-specific performance.

InPars: Leveraging LLMs for Data Augmentation in Information Retrieval

The paper entitled "InPars: Data Augmentation for Information Retrieval using Large Language Models" addresses the challenge of obtaining domain-specific training data for information retrieval (IR) tasks by leveraging the few-shot capabilities of large pretrained language models (LLMs). It presents a method called InPars, which uses these models as synthetic data generators, and reports that retrievers trained on the generated data improve on BM25 and on contemporary self-supervised dense retrieval techniques.

Overview and Methodology

Recent advances in IR largely stem from the availability of large-scale datasets like MS MARCO and the use of pretrained transformer models. However, a single general-purpose dataset does not benefit all IR tasks and domains equally. To address this, InPars uses LLMs to generate synthetic training data in an unsupervised fashion, and retrievers finetuned on this data surpass existing strong baselines.

The InPars method builds on the few-shot abilities of LLMs such as GPT-3, FLAN, Gopher, and T0++: given a handful of examples in the prompt, the model generates questions for unlabeled documents, producing a labeled dataset without manual annotation. This synthetic data can be used on its own or combined with supervised data. A distinctive result is that retrievers finetuned on both supervised and InPars-generated data achieve better zero-shot transfer than those finetuned on supervised data alone, which suggests the method is useful across a range of datasets.
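The few-shot setup described above can be illustrated with a minimal prompt builder. This is a sketch only: the function name `build_few_shot_prompt`, the "Document" / "Relevant Query" template, and the example texts are assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], document: str) -> str:
    """Format (document, question) examples, then leave the last question open.

    The template wording is hypothetical; the LLM is expected to continue the
    text after the final "Relevant Query:" with a question for `document`.
    """
    parts = []
    for i, (doc, question) in enumerate(examples, start=1):
        parts.append(f"Example {i}:\nDocument: {doc}\nRelevant Query: {question}\n")
    # The target document goes last, with its query left for the model to fill in.
    parts.append(f"Example {len(examples) + 1}:\nDocument: {document}\nRelevant Query:")
    return "\n".join(parts)


# Illustrative (made-up) few-shot examples.
examples = [
    ("The Eiffel Tower was completed in 1889 for the World's Fair in Paris.",
     "when was the eiffel tower built"),
    ("Aspirin can reduce fever and relieve mild pain.",
     "what is aspirin used for"),
]
prompt = build_few_shot_prompt(
    examples, "BM25 is a ranking function based on term frequency."
)
print(prompt)
```

The resulting string would be sent to whichever LLM is used as the generator; the text the model produces after the final "Relevant Query:" is taken as the synthetic question for that document.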

The authors propose using an LLM to generate a question for each document sampled from an unlabeled collection, yielding question-document pairs, and then keeping only the top pairs according to a probability criterion. The retained pairs serve as training data for finetuning retrievers. The paper reports that retrievers finetuned only on this synthetic data outperform BM25 and recently proposed self-supervised dense retrieval methods.
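The filtering step can be sketched as a simple top-k selection. The text above only says that pairs are filtered by "a probability criterion", so scoring each pair by the mean log probability of its generated question tokens is an assumption here, as are the names `SyntheticPair`, `mean_logprob`, and `select_top_pairs`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SyntheticPair:
    question: str
    document: str
    # Log probability the LLM assigned to each generated question token.
    token_logprobs: List[float]


def mean_logprob(pair: SyntheticPair) -> float:
    """Score a pair by the average token log probability (an assumed criterion)."""
    if not pair.token_logprobs:
        return float("-inf")  # empty generations sort last
    return sum(pair.token_logprobs) / len(pair.token_logprobs)


def select_top_pairs(pairs: List[SyntheticPair], k: int) -> List[SyntheticPair]:
    """Keep the k pairs whose questions the LLM generated most confidently."""
    return sorted(pairs, key=mean_logprob, reverse=True)[:k]


# Made-up pairs for illustration only.
pairs = [
    SyntheticPair("q_a", "doc 1", [-0.1, -0.2]),
    SyntheticPair("q_b", "doc 2", [-2.0, -1.0, -3.0]),
    SyntheticPair("q_c", "doc 3", [-0.5]),
    SyntheticPair("q_d", "doc 4", []),
]
kept = select_top_pairs(pairs, k=2)
print([p.question for p in kept])
```

Averaging rather than summing the log probabilities avoids penalizing longer questions; whether that matches the criterion used in the paper should be checked against the released code.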

Experimental Analysis

The researchers conducted experiments on multiple datasets, including MS MARCO, TREC-DL, Robust04, Natural Questions (NQ), and TREC-COVID. On key measures such as Mean Reciprocal Rank (MRR) and normalized Discounted Cumulative Gain (nDCG), models finetuned with InPars data outperformed both traditional baselines and recent self-supervised methods. Notably, the retrievers saw substantial gains in domains less aligned with MS MARCO, underscoring the advantage of generating domain-specific synthetic data.
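For readers unfamiliar with the two metrics named above, the following is a minimal sketch of how they are commonly computed. Two simplifications are assumptions of this sketch rather than details from the paper: the gain is linear in the relevance grade, and the ideal ranking for nDCG is built only from the retrieved list instead of from all judged documents.

```python
import math
from typing import Sequence


def reciprocal_rank(relevances: Sequence[int], k: int = 10) -> float:
    """1 / rank of the first relevant result within the top k, else 0."""
    for rank, rel in enumerate(relevances[:k], start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[Sequence[int]], k: int = 10) -> float:
    """Average the reciprocal rank over all queries."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, k) for r in runs) / len(runs)


def dcg(relevances: Sequence[int], k: int) -> float:
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg(relevances: Sequence[int], k: int = 10) -> float:
    """DCG divided by the DCG of the same grades sorted in the ideal order."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


# Each inner list holds the relevance grades of one query's ranked results.
print(mean_reciprocal_rank([[1, 0], [0, 1]]))
print(ndcg([0, 1]))
```

In practice, published numbers are usually produced with standard evaluation tooling rather than hand-written functions, so this sketch is meant only to make the definitions concrete.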

The study also investigated the effects of prompt design and of the size of the LLM used to generate synthetic data, finding that larger models such as GPT-3 Curie improved performance, although the gains from increasing model size were marginal. Furthermore, the filtering step, which keeps only the highest-probability question-document pairs, significantly enhanced retrieval effectiveness.

Implications and Future Directions

The findings have notable implications for practical applications in IR, particularly in scenarios lacking extensive labeled data. Because it needs only a few labeled examples for prompting, InPars offers a cost-effective way to adapt retrieval models to new domains. Its ability to generate synthetic data from large unlabeled corpora could extend neural retrieval to a wider range of IR tasks with less manual annotation effort.

Looking forward, several avenues remain open for exploration. Enhancements could include finetuning dense retrievers on InPars-generated data, using low-quality generated questions as negative examples, expanding the synthetic dataset size, and refining pair selection methods. These developments might further improve the adaptability of IR systems that rely on LLMs.

In conclusion, this paper offers a significant contribution to IR, presenting a methodology that efficiently uses LLMs to generate synthetic data, thus improving model transfer capabilities to diverse and under-resourced domains.
