Evaluating Open-Domain Question Answering in the Era of Large Language Models (2305.06984v3)

Published 11 May 2023 in cs.CL

Abstract: Lexical matching remains the de facto evaluation method for open-domain question answering (QA). Unfortunately, lexical matching fails completely when a plausible candidate answer does not appear in the list of gold answers, which is increasingly the case as we shift from extractive to generative models. The recent success of LLMs for QA aggravates lexical matching failures since candidate answers become longer, thereby making matching with the gold answers even more challenging. Without accurate evaluation, the true progress in open-domain QA remains unknown. In this paper, we conduct a thorough analysis of various open-domain QA models, including LLMs, by manually evaluating their answers on a subset of NQ-open, a popular benchmark. Our assessments reveal that while the true performance of all models is significantly underestimated, the performance of the InstructGPT (zero-shot) LLM increases by nearly +60%, making it on par with existing top models, and the InstructGPT (few-shot) model actually achieves a new state-of-the-art on NQ-open. We also find that more than 50% of lexical matching failures are attributed to semantically equivalent answers. We further demonstrate that regex matching ranks QA models consistent with human judgments, although still suffering from unnecessary strictness. Finally, we demonstrate that automated evaluation models are a reasonable surrogate for lexical matching in some circumstances, but not for long-form answers generated by LLMs. The automated models struggle in detecting hallucinations in LLM answers and are thus unable to evaluate LLMs. At this time, there appears to be no substitute for human evaluation.

Summary

The paper shows that standard lexical matching significantly underestimates QA performance: under human evaluation, the score of InstructGPT (zero-shot) rises by nearly +60%.
It finds that semantic similarity models and regex matching partially bridge the gap but remain stricter than human judgment and still underestimate model performance.
The study emphasizes the need for robust, human-centered evaluation strategies to accurately assess long-form answers generated by large language models.

This paper investigates the efficacy of lexical matching as an evaluation metric for open-domain question answering (QA) systems, particularly in the context of LLMs. The authors argue that lexical matching, which is the standard evaluation method, fails to accurately assess model performance because it requires an exact match between the predicted answer and the gold answer. This is problematic as the set of gold answers is often incomplete, and LLMs frequently generate plausible, yet non-identical answers. The authors conduct a manual evaluation of several open-domain QA models, including LLMs, on a subset of the NQ-open benchmark dataset and compare the results with lexical matching, a semantic similarity model (BEM), and a zero-shot evaluation method using InstructGPT.

The paper's primary contributions and findings are as follows:

Limitations of Lexical Matching: Lexical matching significantly underestimates the true performance of open-domain QA models. The authors observe a large performance gap between lexical matching and human evaluation, with the performance of InstructGPT (zero-shot) increasing by nearly +60% when evaluated by humans.
Semantic Equivalence: The majority of lexical matching failures are due to semantic equivalence, where the model's answer is semantically similar to a correct answer but not lexically identical. This includes synonymous answers, elaborations, and tokenization mismatches.
Human Evaluation: Human evaluation is essential for accurately assessing open-domain QA models, particularly LLMs, due to their ability to generate long-form, plausible but sometimes incorrect answers.
Automated Evaluation Models: Semantic similarity models like BEM show some improvement over lexical matching, particularly in cases where answers are semantically equivalent but not lexically identical. However, BEM still underestimates the performance of models.
LLM Evaluation: The authors explore using LLMs to evaluate QA models via a zero-shot prompting method (InstructGPT-eval). The results are promising and show good agreement with human evaluation, but the method is prone to misjudging hallucinated long answers generated by LLMs. GPT4-eval is also tested and shows error patterns similar to those of InstructGPT-eval, with only marginal improvements.
Regex Matching: Regular expression matching, which is used to evaluate models on the CuratedTREC dataset, is more robust than exact match, but still suffers from unnecessary strictness.
CuratedTREC 2002 Analysis: The authors also run experiments on the CuratedTREC 2002 dataset. The results indicate that regex matching, BEM, and InstructGPT-eval are mostly consistent with human judgments, although they still underestimate true model performance. In addition, LLMs surpass the best traditional statistical NLP systems of that time only when their answers are judged by human evaluation.
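
As a rough illustration of the regex matching finding above, the sketch below marks a candidate answer correct if any gold pattern is found in it. The function name regex_match, the example pattern, and the example answers are hypothetical and are not taken from the paper or from CuratedTREC; case-insensitive search is also an assumption.

```python
import re


def regex_match(candidate: str, gold_patterns: list[str]) -> bool:
    """Return True if any gold regex pattern is found in the candidate answer."""
    # Assumption: case-insensitive search anywhere in the candidate string.
    return any(
        re.search(pattern, candidate, flags=re.IGNORECASE) is not None
        for pattern in gold_patterns
    )


# Hypothetical gold pattern for a question about the world's highest mountain.
patterns = [r"(Mount|Mt\.?) Everest"]

# A longer answer still matches because the pattern is searched, not compared.
long_answer_ok = regex_match("It is Mt. Everest in the Himalayas", patterns)  # True

# A plausible shorter answer is rejected: the pattern is stricter than a human.
short_answer_ok = regex_match("Everest", patterns)  # False
```

The second call shows the kind of unnecessary strictness the summary describes: a human judge would likely accept "Everest", but the pattern does not.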

The models used in the paper were divided into retriever-reader models (DPR, FiD, ANCE, Contriever, RocketQAv2, FiD-KD, GAR, and R2-D2), end-to-end models (EMDR² and EviGen), and closed-book models (InstructGPT zero-shot and few-shot). The evaluation datasets included a subset of NQ-open (301 questions randomly sampled from the 3,610 test questions) and the CuratedTREC 2002 dataset.

The evaluation strategies consisted of:

Lexical Matching: Exact match (EM) and F1 score.
Supervised Evaluation via Semantic Similarity: Using BEM to classify whether candidate answers are semantically equivalent to the gold answers.
Zero-shot Evaluation via Prompting: Using InstructGPT and GPT-4 to evaluate answers by prompting the LLMs to determine if a candidate answer is correct given the question and gold answer.
Human Evaluation: Two human annotators independently judge the correctness of the generated answers, with a third annotator resolving disagreements.
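
The lexical matching strategy listed above can be sketched as follows. This is a minimal version that assumes the commonly used SQuAD-style normalization (lowercasing, removing punctuation and English articles, collapsing whitespace); the paper's own scripts may normalize differently, and the function names and the example question are illustrative only.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(candidate: str, gold_answers: list[str]) -> bool:
    """True if the normalized candidate equals any normalized gold answer."""
    cand = normalize_answer(candidate)
    return any(cand == normalize_answer(gold) for gold in gold_answers)


def f1_score(candidate: str, gold_answers: list[str]) -> float:
    """Best token-level F1 between the candidate and any gold answer."""
    cand_tokens = normalize_answer(candidate).split()
    best = 0.0
    for gold in gold_answers:
        gold_tokens = normalize_answer(gold).split()
        # Multiset intersection counts tokens shared by both answers.
        overlap = sum((Counter(cand_tokens) & Counter(gold_tokens)).values())
        if overlap == 0:
            continue
        precision = overlap / len(cand_tokens)
        recall = overlap / len(gold_tokens)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Illustrative example (not from NQ-open): a correct but longer answer.
gold = ["Barack Obama"]
candidate = "Barack Hussein Obama II"
em = exact_match(candidate, gold)   # False: the strings differ after normalization
f1 = f1_score(candidate, gold)      # 2/3: precision 0.5, recall 1.0
```

The example mirrors the failure mode the paper describes: the candidate is a semantically equivalent elaboration of the gold answer, yet EM scores it as wrong and F1 gives only partial credit.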

The paper also provides a detailed linguistic analysis of the discrepancies between lexical matching and human judgment, categorizing the failure modes of lexical matching into semantic equivalence, symbolic equivalence, intrinsic ambiguity in questions, granularity discrepancies, list-style questions, and incorrect gold answers.

The paper concludes that while automated evaluation methods, such as BEM and LLM-based evaluation, can serve as a reasonable surrogate for lexical matching in some circumstances, they still fall short of the accuracy of human evaluation, particularly for long-form answers generated by LLMs. The authors emphasize the need for more robust evaluation techniques for open-domain QA, especially with the increasing prominence of LLMs.

GitHub

GitHub - ehsk/OpenQA-eval: ACL 2023: Evaluating Open-Domain Question Answering in the Era of Large Language Models (35 stars)
