Open-Domain Question Answering Goes Conversational via Question Rewriting

Published 10 Oct 2020 in cs.IR and cs.CL | (2010.04898v3)

Abstract: We introduce a new dataset for Question Rewriting in Conversational Context (QReCC), which contains 14K conversations with 80K question-answer pairs. The task in QReCC is to find answers to conversational questions within a collection of 10M web pages (split into 54M passages). Answers to questions in the same conversation may be distributed across several web pages. QReCC provides annotations that allow us to train and evaluate individual subtasks of question rewriting, passage retrieval and reading comprehension required for the end-to-end conversational question answering (QA) task. We report the effectiveness of a strong baseline approach that combines the state-of-the-art model for question rewriting, and competitive models for open-domain QA. Our results set the first baseline for the QReCC dataset with F1 of 19.10, compared to the human upper bound of 75.45, indicating the difficulty of the setup and a large room for improvement.


Authors (6)

Citations (147)


Summary

The paper introduces QReCC as a dataset with 14,000 conversations and 80,000 QA pairs, emphasizing question rewriting to handle context-dependent queries.
It breaks the task into three subtasks (question rewriting, passage retrieval, and reading comprehension) and reports a baseline in which Transformer++ is the strongest of the question rewriting models compared.
The evaluation reveals a baseline F1 score of 19.10 versus a human upper bound of 75.45, underscoring significant challenges and future research opportunities.

Open-Domain Question Answering Goes Conversational via Question Rewriting: An Overview

The paper "Open-Domain Question Answering Goes Conversational via Question Rewriting" introduces the QReCC dataset for open-domain conversational question answering (QA). QReCC, which stands for Question Rewriting in Conversational Context, comprises 14,000 conversations and 80,000 question-answer pairs. The task is to answer conversational questions from a collection of 10 million web pages, split into 54 million passages. The authors target a difficulty specific to this setting: answers to questions in the same conversation may be spread across several documents, a challenge previously neglected by widely used datasets like QuAC and CoQA.

Methodology and Dataset Design

The QReCC dataset provides annotations that decompose the QA task into three interdependent subtasks: question rewriting (QR), passage retrieval, and reading comprehension. This breakdown lets researchers train and evaluate each subtask separately and target conversational phenomena such as ellipsis and coreference, which often complicate conversational QA systems. QR is the pivotal step: it transforms a context-dependent question into a self-contained one that existing retrieval and reading comprehension models can process. For instance, a follow-up such as "Where was he born?" would be rewritten to name the person explicitly.
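The three-subtask decomposition described above can be sketched as a small pipeline. This is only an illustration, not the authors' implementation: the function names, the toy passage list, the pronoun-substitution rewriter, and the word-overlap retriever are all hypothetical stand-ins for the trained models a real system would use.

```python
from typing import Callable, List, Tuple

# Hypothetical stage signatures; none of these names come from the paper.
Rewriter = Callable[[List[str], str], str]   # (history, question) -> standalone question
Retriever = Callable[[str, int], List[str]]  # (query, k) -> top-k passages
Reader = Callable[[str, List[str]], str]     # (question, passages) -> answer

# Toy passage collection standing in for the 54M-passage corpus.
PASSAGES = [
    "QReCC contains 14K conversations with 80K question-answer pairs.",
    "The Wayback Machine archives web pages.",
]


def answer_turn(history: List[str], question: str, rewrite: Rewriter,
                retrieve: Retriever, read: Reader,
                k: int = 10) -> Tuple[str, List[str], str]:
    """Run one conversational turn through the three subtasks in order."""
    standalone = rewrite(history, question)  # 1. question rewriting
    passages = retrieve(standalone, k)       # 2. passage retrieval
    answer = read(standalone, passages)      # 3. reading comprehension
    return standalone, passages, answer


def make_toy_rewriter(topic: str) -> Rewriter:
    """Toy rewriter: replaces the pronoun "it" with a fixed topic string."""
    def rewrite(history: List[str], question: str) -> str:
        words = [topic if w.lower() == "it" else w for w in question.split()]
        return " ".join(words)
    return rewrite


def toy_retrieve(query: str, k: int) -> List[str]:
    """Toy retriever: ranks passages by the number of shared lowercase tokens."""
    q = set(query.lower().split())
    ranked = sorted(PASSAGES,
                    key=lambda p: len(q & set(p.lower().split())),
                    reverse=True)
    return ranked[:k]


def toy_read(question: str, passages: List[str]) -> str:
    """Toy reader: returns the top passage verbatim, or an empty string."""
    return passages[0] if passages else ""


history = ["What is QReCC?", "A dataset for question rewriting."]
standalone, passages, answer = answer_turn(
    history, "How many conversations does it contain?",
    make_toy_rewriter("QReCC"), toy_retrieve, toy_read, k=1)
print(standalone)
print(answer)
```

Because each stage is passed in as a plain callable, any one of them can be swapped or evaluated in isolation, which mirrors how the per-subtask annotations are meant to be used.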

Dataset collection proceeds in two phases. Dialogue collection relies on professional annotators to produce high-quality conversational data. Document collection retrieves and segments relevant web pages from the Wayback Machine and Common Crawl. Both phases are designed to mimic realistic information-seeking behaviour in interactive settings, which makes the dataset more useful for real-world applications.

Baseline Approach and QR Models Evaluation

For the empirical evaluation, the paper establishes a strong baseline for QReCC by combining a QR model with the BERTserini open-domain QA architecture. The authors compare several QR models, including PointerGenerator, GECOR, and Transformer-based architectures. Transformer++ performs best on metrics such as ROUGE-1 recall (ROUGE-1 R) and Recall@10.

Retrieval-based metrics such as Recall@10 correlate better with human judgements than traditional text-overlap metrics such as BLEU, which underlines the importance of retrieval effectiveness when evaluating conversational query reformulation.
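As a rough illustration of the two metrics named in this section, the sketch below computes Recall@k over ranked passage identifiers and ROUGE-1 recall over whitespace tokens. It assumes simple lowercased whitespace tokenization with no stemming; the paper's exact evaluation setup may differ, and the function names and example identifiers are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def recall_at_k(ranked_ids: List[str], relevant_ids: Iterable[str],
                k: int = 10) -> float:
    """Fraction of the relevant passages that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def rouge1_recall(candidate: str, reference: str) -> float:
    """Share of reference unigrams that also occur in the candidate.

    Assumes lowercased whitespace tokens; counts are clipped, so a repeated
    candidate word cannot match more reference words than exist.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    if not ref:
        return 0.0
    overlap = sum((cand & ref).values())
    return overlap / sum(ref.values())


# One of two relevant passages is retrieved in the top 2.
print(recall_at_k(["p3", "p1", "p7"], {"p1", "p9"}, k=2))
# Three of four reference words are present in the unresolved question.
print(rouge1_recall("where was he born", "where was einstein born"))
```

The second call shows why overlap metrics can be forgiving: a rewrite that leaves the pronoun unresolved still scores 0.75, even though it misses the one word a retriever would need.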

Implications and Future Directions

The end-to-end system achieves a baseline F1 score of 19.10, far below the human upper bound of 75.45, which leaves considerable room for improvement. The gap reflects the difficulty of the setup and points to future work on more sophisticated, possibly abstractive, methods rather than purely extractive ones.
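To make the reported scores more concrete, here is a minimal sketch of a token-overlap F1 between a predicted and a reference answer. It assumes the F1 is computed over lowercased whitespace tokens, in the style commonly used for extractive QA; this summary does not specify the paper's exact formulation, and the function name is hypothetical. The function returns a value in [0, 1], whereas the reported figures appear to be on a 0 to 100 scale.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall between two answers."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Two empty answers count as a match; one empty answer does not.
        return float(pred == ref)
    # Clipped count of tokens shared by the prediction and the reference.
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)


# A prediction that contains the reference plus two extra tokens:
# precision 0.5, recall 1.0.
print(token_f1("born in ulm germany", "ulm germany"))
```

A metric of this kind penalizes long extracted spans that contain the answer alongside extra words, which is one reason purely extractive readers can score low against short reference answers.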

As a benchmark, QReCC supports the development and evaluation of systems that must handle multi-turn dialogue. In practice, it gives researchers data for modelling realistic user interactions.

In conclusion, by presenting QReCC and its baselines, the authors provide a resource that aligns conversational QA research more closely with the practical demands of interactive information retrieval. Future work could improve how conversational context is integrated, raising the precision and relevance of QA systems in open-domain settings.
