HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering (1809.09600v1)

Published 25 Sep 2018 in cs.CL

Abstract: Existing question answering (QA) datasets fail to train QA systems to perform complex reasoning and provide explanations for answers. We introduce HotpotQA, a new dataset with 113k Wikipedia-based question-answer pairs with four key features: (1) the questions require finding and reasoning over multiple supporting documents to answer; (2) the questions are diverse and not constrained to any pre-existing knowledge bases or knowledge schemas; (3) we provide sentence-level supporting facts required for reasoning, allowing QA systems to reason with strong supervision and explain the predictions; (4) we offer a new type of factoid comparison questions to test QA systems' ability to extract relevant facts and perform necessary comparison. We show that HotpotQA is challenging for the latest QA systems, and the supporting facts enable models to improve performance and make explainable predictions.

Citations (2,104)

Summary

The paper presents a dataset that requires multi-hop reasoning over diverse Wikipedia content, compelling models to integrate evidence from multiple documents.
Its data collection pipeline curates question-answer pairs together with sentence-level supporting facts, so that QA systems can be trained to explain their answers.
Baseline evaluations show a large performance drop in the full wiki setting, leaving room for better retrieval and reasoning techniques.

HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

HotpotQA is a dataset designed to improve question answering (QA) systems, with a focus on multi-hop reasoning and explainability. It comprises about 113,000 question-answer pairs derived from Wikipedia articles and is built to address several limitations of existing QA datasets.

Key Features of the Dataset

Multi-hop Reasoning: Unlike many existing datasets where questions can be answered from a single paragraph, HotpotQA necessitates multi-hop reasoning. This means that systems need to integrate information from multiple documents to derive an answer.
Diverse Questions: The questions in HotpotQA are not limited to specific knowledge schemas. They are designed to be broad and cover a wide range of topics, thereby avoiding biases inherent in knowledge-base-specific question datasets.
Explainable Predictions: HotpotQA provides sentence-level supporting facts that are essential for deriving the answer. This helps in training QA systems that can not only provide the correct answers but also explain the reasoning behind them.
Factoid Comparison: Unique to HotpotQA is the inclusion of comparison questions, which require systems to compare and reason about different entities. This adds a layer of complexity, testing a system's ability to handle more intricate forms of reasoning.
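
To make the idea of sentence-level supporting facts concrete, the following minimal sketch shows one way a single example could be represented and how its supporting facts could be resolved to sentence text. The record content is made up, and the field names (question, answer, type, context, supporting_facts) as well as the helper resolve_supporting_facts are assumptions for illustration, not a statement of the released file format.

```python
from typing import Dict, List

# Toy record with made-up content. The field names are assumptions for illustration.
record = {
    "question": "In which country is the city where the Alpha Bridge stands?",
    "answer": "Freedonia",
    "type": "bridge",
    # Each context entry is [title, list of sentences].
    "context": [
        ["Alpha Bridge", ["The Alpha Bridge is a footbridge.", "It stands in Beta City."]],
        ["Beta City", ["Beta City is a port in Freedonia.", "It was founded long ago."]],
    ],
    # Each supporting fact is [title, sentence index within that paragraph].
    "supporting_facts": [["Alpha Bridge", 1], ["Beta City", 0]],
}


def resolve_supporting_facts(rec: dict) -> List[str]:
    """Map each (title, sentence index) pointer to the sentence it refers to."""
    paragraphs: Dict[str, List[str]] = {title: sents for title, sents in rec["context"]}
    resolved = []
    for title, sent_idx in rec["supporting_facts"]:
        sentences = paragraphs.get(title, [])
        if 0 <= sent_idx < len(sentences):  # skip pointers outside the paragraph
            resolved.append(sentences[sent_idx])
    return resolved


evidence = resolve_supporting_facts(record)
print(evidence)
```

Neither supporting sentence contains the full answer path on its own: the first links the bridge to the city and the second links the city to the country, which is what makes the question multi-hop.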

Data Collection Strategy

To generate high-quality multi-hop questions, the authors designed a data collection pipeline that exploits the structure of Wikipedia. They built a hyperlink graph from Wikipedia articles and curated candidate paragraph pairs, so that a question written about a pair requires information from both paragraphs. The dataset also includes comparison questions, collected by sampling pairs of related entities, which broadens the kinds of reasoning required.
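
The pair-selection idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' pipeline: the link data is made up, and the function names candidate_pairs and comparison_pairs and the bridge_entities filter are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

# Toy hyperlink data (made up): article title -> titles it links to.
links: Dict[str, Set[str]] = {
    "Alpha Bridge": {"Beta City"},
    "Beta City": {"Freedonia"},
    "Freedonia": set(),
}


def candidate_pairs(
    links: Dict[str, Set[str]], bridge_entities: Set[str]
) -> List[Tuple[str, str]]:
    """Return (source, target) pairs for each edge whose target is an allowed bridge entity."""
    pairs = []
    for source in sorted(links):
        for target in sorted(links[source]):
            # Keep the edge only if the target is curated and has its own article.
            if target in bridge_entities and target in links:
                pairs.append((source, target))
    return pairs


def comparison_pairs(similar_entities: List[str]) -> List[Tuple[str, str]]:
    """Return every unordered pair from a list of related entities."""
    return list(combinations(similar_entities, 2))


bridge_pairs = candidate_pairs(links, bridge_entities={"Beta City"})
compare_pairs = comparison_pairs(["Band A", "Band B", "Band C"])
print(bridge_pairs, compare_pairs)
```

Each pair produced this way would then be shown to an annotator, who writes a question that needs both paragraphs to answer.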

Benchmark Settings

The dataset is evaluated under two primary settings:

Distractor Setting: In this setting, each question is accompanied by eight distractor paragraphs alongside the two gold paragraphs. This setup challenges the model to identify the relevant supporting facts amidst irrelevant information.
Full Wiki Setting: Here, models are tested on their ability to retrieve the relevant information from the entirety of Wikipedia, which significantly heightens the difficulty due to the massive search space.
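
As a rough sketch of the distractor setting, the code below assembles a ten-paragraph context from two gold paragraphs and eight distractors. The word-overlap score is a simple stand-in for a real retriever, and the corpus, the function names, and the seed parameter are all assumptions made for illustration.

```python
import random
from typing import Dict, List


def overlap(question: str, text: str) -> int:
    """Count distinct lowercase words shared by the question and a paragraph."""
    return len(set(question.lower().split()) & set(text.lower().split()))


def build_distractor_context(
    question: str,
    gold_titles: List[str],
    corpus: Dict[str, str],
    n_distractors: int = 8,
    seed: int = 0,
) -> List[str]:
    """Mix the gold paragraphs with the highest-scoring non-gold paragraphs."""
    ranked = sorted(
        (title for title in corpus if title not in gold_titles),
        key=lambda title: (-overlap(question, corpus[title]), title),
    )
    titles = list(gold_titles) + ranked[:n_distractors]
    # Shuffle so a model cannot rely on the gold paragraphs coming first.
    random.Random(seed).shuffle(titles)
    return titles


# Made-up corpus: twelve filler paragraphs plus the two gold paragraphs.
corpus = {f"Filler {i}": f"Unrelated paragraph number {i}." for i in range(12)}
corpus["Alpha Bridge"] = "The Alpha Bridge stands in Beta City."
corpus["Beta City"] = "Beta City is a port in Freedonia."

question = "In which country is the city where the Alpha Bridge stands?"
context_titles = build_distractor_context(question, ["Alpha Bridge", "Beta City"], corpus)
print(context_titles)
```

In the full wiki setting there are no gold paragraphs handed to the model, so the retrieval step itself has to surface them from the whole corpus.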

Model Architecture and Evaluation

The baseline model for HotpotQA combines character-level models, self-attention, and bi-attention layers, in line with QA architectures that were state of the art when the paper was published. Training is framed as a multi-task learning problem in which the model learns to answer questions and to identify supporting facts at the same time. This strong supervision over supporting facts benefits both answer accuracy and explainability.
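
One plausible way to write such a multi-task objective is to add a supporting-fact classification loss to the answer-span loss. The sketch below is an assumption about the general shape of the objective, not the authors' exact formulation; the weight sup_weight and all function names are hypothetical.

```python
import math
from typing import List

EPS = 1e-12  # guards against log(0)


def cross_entropy(probs: List[float], gold_index: int) -> float:
    """Negative log-likelihood of the gold position under a probability distribution."""
    return -math.log(max(probs[gold_index], EPS))


def binary_cross_entropy(probs: List[float], labels: List[int]) -> float:
    """Mean binary cross-entropy over sentences (label 1 = supporting fact)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, EPS), 1.0 - EPS)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


def multitask_loss(
    start_probs: List[float],
    end_probs: List[float],
    gold_start: int,
    gold_end: int,
    sup_probs: List[float],
    sup_labels: List[int],
    sup_weight: float = 1.0,  # assumed trade-off weight between the two tasks
) -> float:
    """Answer-span loss plus a weighted supporting-fact loss."""
    answer_loss = cross_entropy(start_probs, gold_start) + cross_entropy(end_probs, gold_end)
    support_loss = binary_cross_entropy(sup_probs, sup_labels)
    return answer_loss + sup_weight * support_loss


loss = multitask_loss([0.5, 0.5], [0.5, 0.5], 0, 1, [0.5, 0.5], [1, 0])
print(loss)
```

Because both terms share the same encoder in a real model, gradients from the supporting-fact term push the representation toward sentences that actually justify the answer.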

Results

The baseline results on HotpotQA demonstrate the challenge posed by multi-hop reasoning and the necessity for explainable predictions. While the model achieved reasonable performance in the distractor setting (F1 of 58.28 for answers and 66.66 for supporting facts), there was a considerable drop in the full wiki setting (F1 of 34.36 for answers and 40.98 for supporting facts). This indicates substantial room for improvement, especially in large-context retrieval scenarios.
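
For readers unfamiliar with how such F1 scores are computed, here is a simplified sketch of a token-level answer F1 and a set-level supporting-fact F1. It is an illustration only: it lowercases and splits on whitespace, whereas an official evaluation script may normalize text differently, and the function names are assumptions.

```python
from collections import Counter
from typing import Iterable, Tuple


def answer_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted and a gold answer string."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # Multiset intersection counts each shared token at most as often as it occurs in both.
    num_same = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def supporting_fact_f1(
    predicted: Iterable[Tuple[str, int]], gold: Iterable[Tuple[str, int]]
) -> float:
    """Set-level F1 over (title, sentence index) pairs."""
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


ans_score = answer_f1("the Beta City", "Beta City")
sup_score = supporting_fact_f1([("A", 0), ("B", 1)], [("A", 0), ("B", 0)])
print(ans_score, sup_score)
```

In this toy case the extra token "the" lowers answer precision to 2/3, and one of two predicted supporting sentences is wrong, so both precision and recall for supporting facts are 1/2.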

Implications and Future Work

HotpotQA is a significant addition to the QA dataset landscape, emphasizing multi-hop reasoning and explainability.

Practical Implications:

It encourages the development of more sophisticated QA models that can handle complex reasoning processes and provide transparent answers.
The dataset's structure also incentivizes advancements in natural language understanding and information retrieval techniques.

Theoretical Implications:

HotpotQA tests the boundaries of current architecture capabilities, pushing the research community toward innovative solutions for multi-hop reasoning and explainable AI.

Future Developments:

Better retrieval algorithms for the full wiki setting, where the supporting evidence must be found across all of Wikipedia.
Integration of more advanced LLMs capable of deeper reasoning and better handling of factoid comparisons.
Improvement of models' ability to explain their reasoning process by effectively utilizing the strong supervision data.

Together, the multi-hop questions, the sentence-level supporting facts, and the two benchmark settings make HotpotQA a useful testbed for future work on QA systems.
