Distilling Knowledge from Reader to Retriever for Question Answering (2012.04584v2)

Published 8 Dec 2020 in cs.CL and cs.LG

Abstract: The task of information retrieval is an important component of many natural language processing systems, such as open domain question answering. While traditional methods were based on hand-crafted features, continuous representations based on neural networks recently obtained competitive results. A challenge of using such methods is to obtain supervised data to train the retriever model, corresponding to pairs of query and support documents. In this paper, we propose a technique to learn retriever models for downstream tasks, inspired by knowledge distillation, and which does not require annotated pairs of query and documents. Our approach leverages attention scores of a reader model, used to solve the task based on retrieved documents, to obtain synthetic labels for the retriever. We evaluate our method on question answering, obtaining state-of-the-art results.

Summary

The paper introduces a method to distill attention scores from a reader model as synthetic labels for training retriever models without direct supervision.
It employs a BERT-based bi-encoder refined iteratively to significantly enhance retrieval accuracy on benchmarks like NaturalQuestions and TriviaQA.
The approach reduces reliance on manual query-document annotations, enabling broader applicability across diverse, real-world QA tasks.

Distillation of Knowledge for Improved Information Retrieval in Question Answering

The paper by Izacard and Grave presents an approach to improving information retrieval in open-domain question answering systems. It addresses the challenge of training retriever models without direct supervision in the form of query-document pairs. Instead, it draws on knowledge distillation: relevance signals from a "reader" model, which answers the question from the retrieved documents, are transferred to a "retriever" model.

Methodological Framework

The core innovation presented in the paper is the distillation of attention scores from a sequence-to-sequence reader model. These scores serve as synthetic labels for the retriever model, so it can be trained without traditional annotated data. The method relies on the reader's cross-attention, which highlights how relevant specific document segments are to the task at hand. The need for explicit query-document annotations is thus replaced by a signal inferred from the question-answering process itself.
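The distillation step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the nested-list layout of `cross_attention`, the plain mean over layers, heads and tokens, and the use of a KL-divergence objective are all assumptions made for this example, and the function names are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_attention(cross_attention):
    """Reduce reader cross-attention to one relevance score per passage.

    Assumed layout (hypothetical): cross_attention[p][l][h][t] is the
    attention weight for passage p, layer l, head h, passage token t.
    Here every weight of a passage is simply averaged.
    """
    scores = []
    for passage in cross_attention:
        flat = [w for layer in passage for head in layer for w in head]
        scores.append(sum(flat) / len(flat))
    return scores

def kl_divergence(target, predicted):
    """KL(target || predicted) for two discrete distributions."""
    return sum(t * math.log(t / p) for t, p in zip(target, predicted) if t > 0)

def distillation_loss(reader_scores, retriever_scores):
    """Synthetic-label loss: the retriever's distribution over passages
    is pushed toward the distribution implied by the reader's scores."""
    target = softmax(reader_scores)
    predicted = softmax(retriever_scores)
    return kl_divergence(target, predicted)
```

The loss is zero when the retriever ranks passages exactly as the reader's attention does, and grows as the two distributions diverge, so no human-annotated query-document pair is needed at any point.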

Implementation and Results

The retriever uses dense representations produced by a BERT-based bi-encoder. Unlike prior work, the retrieval function is refined iteratively, using synthetic labels derived from the reader's attention scores. This iterative procedure improves retrieval accuracy significantly and yields state-of-the-art results on the NaturalQuestions and TriviaQA benchmarks.
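To make the bi-encoder idea concrete, the sketch below ranks passages by the similarity between a question vector and independently computed passage vectors. It is an illustration under stated assumptions: the vectors are hand-written stand-ins for encoder outputs, the dot product is assumed as the similarity function, and `retrieve_top_k` is a hypothetical helper name.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def retrieve_top_k(question_vec, passage_vecs, k):
    """Return (passage_index, score) pairs for the k highest-scoring passages.

    In a bi-encoder, questions and passages are embedded separately, so
    passage vectors can be precomputed once and reused for every question.
    """
    scored = [(i, dot(question_vec, p)) for i, p in enumerate(passage_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy vectors standing in for encoder outputs (assumed values).
question = [1.0, 0.0, 1.0]
passages = [
    [0.5, 0.5, 0.0],  # score 0.5
    [1.0, 0.0, 1.0],  # score 2.0
    [0.0, 1.0, 0.0],  # score 0.0
]
top = retrieve_top_k(question, passages, k=2)
```

Because the two encoders never see each other's input, retrieval reduces to a nearest-neighbor search over precomputed passage vectors, which is what makes this kind of retriever cheaper than the reader it learns from.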

Empirical evaluations support the approach. Whether the initial documents are retrieved with BM25 or with DPR (Dense Passage Retrieval), the iterative training improves retrieval accuracy. Notably, the Fusion-in-Decoder reader achieves competitive end-to-end performance, which shows the value of distilling relevance patterns from a complex reader model into a simpler retriever.
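The iterative procedure can be outlined as a loop that alternates between training the reader on the current retrievals and training the retriever on the reader's scores. The sketch below only fixes the control flow; every callable passed in (`initial_retrieve`, `train_reader`, `score_with_reader`, `train_retriever`) is a hypothetical placeholder, and the number of iterations is left to the caller.

```python
def iterative_distillation(questions, initial_retrieve, train_reader,
                           score_with_reader, train_retriever, num_iterations):
    """Alternate reader and retriever training for a fixed number of rounds.

    initial_retrieve(question) -> passages, e.g. from BM25 or DPR.
    train_reader(retrieved) -> reader trained on the current passages.
    score_with_reader(reader, question, passages) -> one score per passage.
    train_retriever(retrieved, labels) -> new retrieve(question) function.
    """
    retrieve = initial_retrieve
    history = []
    for _ in range(num_iterations):
        # 1. Retrieve passages with the current retriever.
        retrieved = {q: retrieve(q) for q in questions}
        # 2. Train the reader on those passages.
        reader = train_reader(retrieved)
        # 3. Use the reader to produce synthetic relevance labels.
        labels = {q: score_with_reader(reader, q, passages)
                  for q, passages in retrieved.items()}
        # 4. Train a new retriever on the synthetic labels.
        retrieve = train_retriever(retrieved, labels)
        history.append(retrieved)
    return retrieve, history
```

Each round starts from the passages returned by the previous round's retriever, so improvements in retrieval feed back into the labels the reader produces next.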

Critical Insights and Implications

The implications of this work are multifaceted:

Supervision Independence: The method removes the labor-intensive step of producing query-document annotations, which lowers the cost of building high-performing retrieval systems.
Model Flexibility: Because training no longer depends on strong supervision, the approach accommodates a diverse range of downstream tasks. This is further shown by its extension to the NarrativeQA dataset, which involves non-standard, extended-length answers.
Retrieval Accuracy: The observed improvements in retrieval accuracy show that the retriever can generalize from distilled attention scores, which supports its use in real-world scenarios where content and context vary in complexity.

Future Prospects

Looking forward, the paper suggests several avenues for further exploration. Refined pre-training strategies could yield larger gains in retrieval accuracy. In addition, exploring other ways of aggregating attention scores might capture relevance more precisely and improve the mapping from reader-derived signals to retriever outputs.

In conclusion, Izacard and Grave's paper offers a substantial contribution to the ongoing evolution of question answering systems by introducing an innovative method for model training that prioritizes performance while minimizing data annotation overhead. This balance holds promise for wider applicability and scalability in various information retrieval scenarios.

GitHub

GitHub - facebookresearch/FiD: Fusion-in-Decoder (574 stars)