- The paper advances explainable conversational question answering by constructing a heterogeneous evidence graph and applying iterative graph neural network reduction.
- It employs structured representation parsing and SR-attention mechanisms to accurately fuse knowledge from text, tables, and knowledge bases.
- Empirical results demonstrate significant gains in P@1, MRR, and Hit@5 over strong baselines, and a crowdsourced user study shows that the accompanying explanations support user trust.
Explainable Conversational QA over Heterogeneous Sources via Iterative Graph Neural Networks
The paper "Explainable Conversational Question Answering over Heterogeneous Sources via Iterative Graph Neural Networks" (Explaignn) (2305.01548) addresses Conversational Question Answering (ConvQA) in settings where answers may be distributed over disparate information sources, including knowledge bases (KBs), text corpora, web tables, and infoboxes. The focus of the work is twofold: (i) to advance answer accuracy and coverage in ConvQA by integrating evidence from these heterogeneous sources, and (ii) to provide user-comprehensible, model-aware explanations for the predicted answers. This is operationalized through an iterative graph neural network (GNN) architecture applied to a jointly constructed, heterogeneous graph of entities and evidences.
ConvQA systems must interpret multi-turn dialogs in which follow-up questions often omit overt context, relying on prior conversation turns. The shortcomings of prior work are clear: most systems operate over a single information modality (KB, text, or table), lack mechanisms to integrate the strengths of multiple sources, and—when leveraging powerful sequence-to-sequence neural models—tend to produce opaque, non-explainable outputs and ignore the structured relationships among pieces of evidence.
By contrast, Explaignn models the ConvQA process as the construction and progressive reduction of a heterogeneous graph. Nodes in this graph represent both entities and retrieved evidences (verbalized KB facts, sentences, table records, infobox entries); edges encode entity mentions within evidences. This enables the model to leverage inter-evidence connections, crucial for sifting relevant information from noise, and provides a natural substrate for explanation.
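To make this graph structure concrete, the following is a minimal sketch of how such an entity-evidence graph could be assembled; the `Evidence` container and its field names are illustrative assumptions, not the authors' data model.

```python
# Minimal sketch of heterogeneous graph construction (illustrative, not the
# authors' implementation). Each evidence node links to the KB entities it mentions.
from dataclasses import dataclass, field

@dataclass
class Evidence:
    text: str                                   # verbalized KB fact, sentence, table record, or infobox entry
    source: str                                 # "kb" | "text" | "table" | "infobox"
    entity_mentions: list = field(default_factory=list)  # ids of KB entities mentioned in the text

def build_entity_evidence_graph(evidences):
    """Return entity ids and bipartite edges (evidence_idx, entity_idx)."""
    entity_ids = sorted({eid for ev in evidences for eid in ev.entity_mentions})
    entity_index = {eid: i for i, eid in enumerate(entity_ids)}
    edges = [(ev_idx, entity_index[eid])
             for ev_idx, ev in enumerate(evidences)
             for eid in ev.entity_mentions]
    return entity_ids, edges
```

Because evidences from different sources share entity nodes, the resulting graph is what lets the model exploit inter-evidence connections rather than scoring each evidence in isolation.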
Explaignn Architecture
The system pipeline comprises three principal stages:
- Structured Representation Parsing: A sequence-to-sequence (BART-based) model, fine-tuned on conversational data, rewrites (potentially incomplete) user questions into intent-explicit structured representations (SRs) specifying a context entity, question entity, relation, and expected answer type. To curb hallucination, generated SRs are promoted only when their vocabulary appears in the given conversational context, with exceptions for explicit answer-type hints (a hedged generation sketch follows this list).
- Evidence Retrieval: Candidate evidences are drawn from all available modalities, anchored to KB entities. CLOCQ is used for KB retrieval and entity disambiguation, textual sources are mapped via Wikipedia entity linkage, and web tables and infobox records are parsed accordingly. Importantly, retrieval is restricted to entities occurring in the dedicated entity slots (context or question entity) of the SR, filtering out noise from the relation and answer-type slots.
- Graph Construction and Iterative GNN Reduction:
- A heterogeneous entity-evidence bipartite graph is constructed, where nodes are initialized with cross-encodings (evidence text or entity label, concatenated with the SR) from a shared, fine-tuned language model (DistilRoBERTa).
- Entity type information is appended to entity encodings, providing critical type-awareness for downstream disambiguation and ranking.
- Message passing is performed using a question-aware attention mechanism ("SR-attention") that weights neighborhood updates according to relevance to the SR, avoiding the pitfalls of indiscriminate information propagation.
- The core innovation is the iterative application of GNNs: after each inference pass, the most relevant evidences (by evidence relevance score) and their associated entities are retained, defining a progressively smaller, denser subgraph for the next pass. This process repeats until a compact graph, composed of a small explanatory set of evidences, remains for answer selection (a minimal sketch of the attention-weighted message passing and reduction loop follows this list).
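For the structured representation parsing step referenced above, a hedged sketch with a generic sequence-to-sequence setup is given below; the checkpoint name, input delimiter, and SR slot format are assumptions rather than the authors' released configuration.

```python
# Hedged sketch of structured representation (SR) parsing with a BART-style
# seq2seq model, assumed to be fine-tuned on (conversation, SR) pairs.
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")            # assumed checkpoint
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

def generate_sr(history_turns, current_question):
    # Concatenate the conversational history with the current (possibly incomplete) question;
    # the fine-tuned model emits "context entity | question entity | relation | answer type".
    source = " ||| ".join(history_turns + [current_question])
    inputs = tokenizer(source, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_length=64, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

A simple post-hoc check in the spirit of the paper's hallucination filter would discard generated slot tokens that never appear in the concatenated conversation, except for answer-type hints.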
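For the iterative GNN reduction referenced above, the sketch below illustrates SR-attention message passing followed by graph pruning; the layer shapes, the exact attention form, and the keep-k schedule are assumptions for illustration only.

```python
# Hedged sketch of question-aware (SR-attention) message passing and iterative
# graph reduction. Not the paper's exact architecture; shapes and schedule are assumed.
import torch
import torch.nn as nn

class SRAttentionLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Linear(dim, dim)       # message transformation
        self.q = nn.Linear(dim, dim)         # projects the SR encoding for attention
        self.ev_score = nn.Linear(dim, 1)    # evidence relevance head
        self.ans_score = nn.Linear(dim, 1)   # entity (answer) head

    def attention(self, neighbor_h, sr_h):
        # Weight a neighbor's message by its relevance to the structured representation.
        return torch.sigmoid((self.q(sr_h) * neighbor_h).sum())

    def forward(self, ev_h, ent_h, sr_h, edges):
        new_ev, new_ent = ev_h.clone(), ent_h.clone()
        for ev_i, ent_i in edges:            # bipartite edges: evidence <-> mentioned entity
            new_ent[ent_i] = new_ent[ent_i] + self.attention(ev_h[ev_i], sr_h) * self.msg(ev_h[ev_i])
            new_ev[ev_i] = new_ev[ev_i] + self.attention(ent_h[ent_i], sr_h) * self.msg(ent_h[ent_i])
        return new_ev, new_ent

def iterative_reduction(layer, ev_h, ent_h, sr_h, edges, keep_schedule=(20, 5)):
    """Run message passing, keep the top-k scored evidences per round, shrink the graph."""
    active = list(range(ev_h.size(0)))
    for k in keep_schedule:
        ev_h, ent_h = layer(ev_h, ent_h, sr_h, edges)
        scores = layer.ev_score(ev_h).squeeze(-1)
        active = sorted(active, key=lambda i: scores[i].item(), reverse=True)[:k]
        kept = set(active)
        edges = [(e, n) for e, n in edges if e in kept]    # prune edges of dropped evidences
    answer_scores = layer.ans_score(ent_h).squeeze(-1)      # rank candidate entities
    return answer_scores, active                            # answer ranking + explanatory evidences
```

The evidences that survive the final round are exactly what can be shown to the user as the explanation for the predicted answer.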
The model is trained under a multi-task learning paradigm, jointly optimizing answer prediction (entity classification) and evidence relevance scoring. This dual objective ensures not only accurate answer selection but also the model's capacity to highlight its own logical chain of evidence.
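A minimal sketch of such a joint objective is given below; the loss weighting and the binary treatment of evidence relevance are assumptions rather than the paper's reported setup.

```python
# Hedged sketch of multi-task training: answer (entity) classification plus
# evidence relevance scoring. The weighting alpha is an assumed hyperparameter.
import torch
import torch.nn.functional as F

def multitask_loss(answer_logits, gold_entity_idx, evidence_logits, evidence_labels, alpha=0.5):
    # answer_logits: (num_entities,) scores over candidate entities
    # evidence_labels: (num_evidences,) float tensor, 1.0 if an evidence supports the gold answer else 0.0
    ans_loss = F.cross_entropy(answer_logits.unsqueeze(0), torch.tensor([gold_entity_idx]))
    ev_loss = F.binary_cross_entropy_with_logits(evidence_logits, evidence_labels)
    return ans_loss + alpha * ev_loss
```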
Empirical Results
Extensive experiments are conducted on ConvMix, a benchmark specifically designed for ConvQA over heterogeneous sources. Comparative baselines include strong systems such as Convinse (Fusion-in-Decoder model), question completion and rewriting pipelines, and sequence-to-sequence answerers.
Main quantitative results demonstrate:
- Explaignn surpasses all baselines by a clear margin on all tested metrics under both gold answer and predicted answer settings, with P@1 improving from 0.343 (best Convinse variant) to 0.406, MRR from 0.378 to 0.471, and Hit@5 from 0.431 to 0.561 (Table 1); the ranking metrics are defined in the sketch after this list.
- Using predicted answers in conversation history, Explaignn remains robust, with a P@1 of 0.339 compared to 0.279 for the strongest baseline.
- Explaignn benefits notably from integrating all evidence sources; removing modalities leads to substantial declines in answer coverage and accuracy.
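For reference, the ranking metrics reported above can be computed as follows; this is a generic definition sketch, not the benchmark's evaluation code.

```python
# Standard ranking metrics used above (generic definitions).
def rank_metrics(ranked_answers, gold_answers, k=5):
    """ranked_answers: predictions, best first; gold_answers: set of acceptable answers."""
    hit_ranks = [rank for rank, ans in enumerate(ranked_answers, start=1) if ans in gold_answers]
    first = hit_ranks[0] if hit_ranks else None
    return {
        "P@1": 1.0 if first == 1 else 0.0,                              # top answer is correct
        "MRR": 1.0 / first if first else 0.0,                           # reciprocal rank of first correct answer
        f"Hit@{k}": 1.0 if first is not None and first <= k else 0.0,   # correct answer within top-k
    }
```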
Ablation analysis confirms that SR-attention, cross-encoding with the SR, and entity type features are each essential—removing SR-attention, for example, reduces P@1 from 0.442 to 0.062.
The iterative GNN procedure keeps inference efficient by operating on compact graphs (in most cases, five evidences suffice to both explain and support the answer), enabling a tunable trade-off between explanation size and performance without substantial loss of accuracy.
Zero-shot transfer: When evaluated out-of-the-box (no fine-tuning) on ConvQuestions, Explaignn achieves state-of-the-art MRR, and accuracy is further improved when heterogeneous sources are included, attesting to the generality of the approach.
Explainability and User Study
A key contribution is the explicit evaluation of end-user explainability through an extensive crowdsourced user study. For sampled test cases, crowd workers are shown the predicted answer alongside the supporting evidences output by Explaignn and are asked to judge the answer's correctness and state their certainty. The results are compelling:
- Users accurately decide answer correctness in 76.1% of cases and are certain in 79.8% of assessments.
- When users are certain, accuracy increases to 79.2%.
- Explanations are informative enough that users can usually judge correctness with confidence, demonstrating the practical viability of the explanation mechanism.
Qualitative analyses reveal that incorrect answers are usually traceable to missing evidence in the constructed graph, not to failures of the reduction or explanation procedure per se.
Theoretical and Practical Implications
Explaignn provides an effective template for explainable, multi-hop ConvQA over heterogeneous sources. By embedding model explainability directly into the iterative inference architecture, the system makes the provenance of answers accessible to end-users—a property not achievable with black-box sequence-to-sequence generation approaches. The iterative reduction mechanism offers a natural alignment with human reasoning, where answers are distilled from interactions among a small, highly relevant set of facts and relations, transparent to the system operator or end-user.
Key practical implications include:
- Deployability in High-Stakes Applications: The traceable, evidence-backed answer predictions are particularly valuable for domains (e.g., biomedical, financial QA) where trust and transparency are imperative.
- Robustness and Error Calibration: Iterative graph reduction improves robustness to noisy or incomplete retrieval, and the system can signal uncertainty to users when supporting evidence is insufficient.
- Extensibility to Other Multi-source Reasoning Tasks: The architecture's modularity allows for extension to recommendation, fact-checking, and more ambitious knowledge integration tasks.
Speculation on Future Developments
Several avenues of research naturally follow:
- Extension of the iterative, explainable GNN paradigm to incorporate reasoning over visual or temporal information sources.
- Improved retrieval and entity linking modules to address the remaining cases where answers are absent from the evidence graph.
- Enrichment of explanation interfaces with interactive capabilities for user-driven error correction or evidence exploration.
- Integration with LLMs as retrieval-augmented generation components, leveraging the controllable explainability framework of Explaignn.
In summary, this work provides a novel and practically grounded framework for explainable ConvQA, establishing a benchmark for future systems that must balance accuracy, efficiency, and end-user trust.