MuSiQue: Multihop Questions via Single-hop Question Composition (2108.00573v3)

Published 2 Aug 2021 in cs.CL and cs.AI

Abstract: Multihop reasoning remains an elusive goal as existing multihop benchmarks are known to be largely solvable via shortcuts. Can we create a question answering (QA) dataset that, by construction, requires proper multihop reasoning? To this end, we introduce a bottom-up approach that systematically selects composable pairs of single-hop questions that are connected, i.e., where one reasoning step critically relies on information from another. This bottom-up methodology lets us explore a vast space of questions and add stringent filters as well as other mechanisms targeting connected reasoning. It provides fine-grained control over the construction process and the properties of the resulting $k$-hop questions. We use this methodology to create MuSiQue-Ans, a new multihop QA dataset with 25K 2-4 hop questions. Relative to existing datasets, MuSiQue-Ans is more difficult overall (3x increase in human-machine gap), and harder to cheat via disconnected reasoning (e.g., a single-hop model has a 30 point drop in F1). We further add unanswerable contrast questions to produce a more stringent dataset, MuSiQue-Full. We hope our datasets will help the NLP community develop models that perform genuine multihop reasoning.

Citations (166)

Summary

  • The paper introduces a novel bottom-up approach that composes single-hop questions into robust multihop queries, mitigating reasoning shortcuts.
  • It details a systematic pipeline pairing single-hop questions via shared entities and enforcing a condition that prevents shortcut exploitation.
  • Results reveal a wider gap between human and machine performance, underscoring the dataset’s value for evaluating true multihop reasoning.

Overview of MuSiQue: Multihop Questions via Single-hop Question Composition

The paper "MuSiQue: Multihop Questions via Single-hop Question Composition" introduces a bottom-up method for constructing question answering (QA) datasets that require genuine multihop reasoning. Using this method, the authors build MuSiQue-Ans, a dataset of 25K 2-4 hop questions composed from connected single-hop questions, and MuSiQue-Full, a more stringent variant that adds unanswerable contrast questions.

Problem Statement

Existing multihop QA benchmarks are known to be largely solvable via shortcuts, so a model can score well without performing true multihop reasoning. MuSiQue addresses this by systematically constructing questions whose reasoning steps are connected, i.e., one step critically relies on information from another, which makes high scores much harder to reach through shortcuts.

Methodology

The authors propose a bottom-up strategy to create multihop questions through the composition of single-hop questions. This involves:

  1. Composable Pair Identification: Single-hop questions are paired through shared entities so that one question depends on information from the other. Chaining such pairs yields multihop questions whose reasoning steps form a DAG (directed acyclic graph).
  2. Ensuring Connected Reasoning: A filtering step keeps only compositions in which no link between questions can be bypassed, a requirement the paper terms the MuSiQue condition.
  3. Dataset Construction Pipeline:
    • Filtering of single-hop questions based on various criteria.
    • Composition of these into multihop questions with 2-4 hops.
    • Reduction of train-test leakage to prevent models from simply memorizing answers.
    • Addition of distractor contexts to make evidence selection harder.
    • Human-aided refinement and validation via a crowdsourced approach.
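
The pairing, composition, and connectedness steps above can be illustrated with a toy sketch. It is not the authors' pipeline: the `SingleHop` type, the substring-based bridge detection, the bracketed nesting format, and the `answer_fn` model stub are all assumptions made for illustration, and the check in `is_connected` is only a simplified stand-in for the MuSiQue condition.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class SingleHop:
    """Hypothetical container for one single-hop question and its answer."""
    question: str
    answer: str


def find_bridge(head: SingleHop, tail: SingleHop) -> int:
    """Index where head's answer is mentioned in tail's question, or -1."""
    return tail.question.lower().find(head.answer.lower())


def composable_pairs(questions: List[SingleHop]) -> List[Tuple[SingleHop, SingleHop]]:
    """Pair questions where the answer to one appears in the other (assumed rule)."""
    pairs = []
    for head in questions:
        for tail in questions:
            if head == tail or head.answer.lower() == tail.answer.lower():
                continue
            if find_bridge(head, tail) >= 0:
                pairs.append((head, tail))
    return pairs


def replace_bridge(head: SingleHop, tail: SingleHop, filler: str) -> str:
    """Replace the bridge entity in tail's question with the given filler."""
    start = find_bridge(head, tail)
    end = start + len(head.answer)
    return tail.question[:start] + filler + tail.question[end:]


def compose(head: SingleHop, tail: SingleHop) -> SingleHop:
    """Build a 2-hop question by nesting head's question in place of the bridge."""
    nested = "[" + head.question.rstrip("?") + "]"
    return SingleHop(replace_bridge(head, tail, nested), tail.answer)


def is_connected(head: SingleHop, tail: SingleHop,
                 answer_fn: Callable[[str], str]) -> bool:
    """Simplified check: the tail must not be answerable with the bridge masked."""
    masked = replace_bridge(head, tail, "[MASK]")
    return answer_fn(masked) != tail.answer


questions = [
    SingleHop("Who wrote The Hobbit?", "J. R. R. Tolkien"),
    SingleHop("Where was J. R. R. Tolkien born?", "Bloemfontein"),
    SingleHop("What is the capital of France?", "Paris"),
]

for head, tail in composable_pairs(questions):
    print(compose(head, tail).question)
```

On this toy input only the first two questions are composable, and the composed question reads "Where was [Who wrote The Hobbit] born?". A stub model that returns the tail answer even when the bridge is masked would make `is_connected` return `False`, which is the kind of pair a shortcut filter is meant to discard.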

Results

Relative to existing multihop datasets, MuSiQue differs in three respects:

  • Increased Difficulty: The gap between human and machine performance on MuSiQue-Ans is about 3x larger than on existing datasets.
  • Reduced Cheatability: The dataset is harder to solve via disconnected reasoning, as evidenced by lower scores from partial-input models (a single-hop model drops 30 points in F1) and lower DiRe scores.
  • Challenge Dataset: MuSiQue-Full adds unanswerable contrast questions, producing a more stringent test of whether models actually reason over the evidence.
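
The F1 figures mentioned above are answer-level scores. As a rough illustration, the sketch below computes a SQuAD-style token-overlap F1 between a predicted and a gold answer. It assumes that style of normalization (lowercasing, stripping punctuation and English articles); the paper's own evaluation script may differ, and `normalize` and `token_f1` are names chosen here for illustration.

```python
import re
import string
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, drop punctuation and articles, and split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall between two answers."""
    pred, ref = normalize(prediction), normalize(gold)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(token_f1("the city of Bloemfontein", "Bloemfontein"))
```

Here the prediction has three tokens after normalization and one of them matches the single gold token, so precision is 1/3, recall is 1, and the F1 is 0.5. Averaging such scores over a dataset for humans and for a model gives the two numbers whose difference is the human-machine gap.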

Implications and Future Directions

MuSiQue gives the community a benchmark on which high scores are harder to obtain without connected multihop reasoning, which makes it a more reliable yardstick for multihop QA systems. The same bottom-up construction could plausibly be extended to other settings, such as open-domain QA or multimodal datasets.

The paper also invites further work on decomposition-based models, which may help in building systems that handle multi-step, real-world questions through explicit reasoning.
