ExpertQA: Expert-Curated Questions and Attributed Answers (2309.07852v2)

Published 14 Sep 2023 in cs.CL and cs.AI

Abstract: As LLMs are adopted by a more sophisticated and diverse set of users, the importance of guaranteeing that they provide factually correct information supported by verifiable sources is critical across fields of study. This is especially the case for high-stakes fields, such as medicine and law, where the risk of propagating false information is high and can lead to undesirable societal consequences. Previous work studying attribution and factuality has not focused on analyzing these characteristics of LLM outputs in domain-specific scenarios. In this work, we conduct human evaluation of responses from a few representative systems along various axes of attribution and factuality, by bringing domain experts in the loop. Specifically, we collect expert-curated questions from 484 participants across 32 fields of study, and then ask the same experts to evaluate generated responses to their own questions. In addition, we ask experts to improve upon responses from LLMs. The output of our analysis is ExpertQA, a high-quality long-form QA dataset with 2177 questions spanning 32 fields, along with verified answers and attributions for claims in the answers.

Citations (37)

Summary

  • The paper introduces ExpertQA, a novel dataset of 2,177 expert-curated, domain-specific questions with rigorously validated answers.
  • It implements a comprehensive evaluation framework measuring factuality, informativeness, and evidence reliability across various language model systems.
  • The study reveals that retrieve-and-read systems deliver more complete attributions than vanilla models, highlighting the need for improved automatic factuality estimation.

An Expert-Curated Approach to Evaluating LLM Factuality: Insights from ExpertQA

The paper "ExpertQA: Expert-Curated Questions and Attributed Answers" conducts a thorough analysis of the factuality and attribution capabilities of LLMs (LMs) within domain-specific contexts. It introduces a novel dataset, ExpertQA, which reflects the questions and information needs originating from 32 varied fields, encompassing domains such as medicine, law, and engineering. This work recognizes the deficiency of previous studies in addressing the nuanced requirements of domain-specific users and aims to fill this gap by involving experts directly in the evaluation process.

Key Contributions

The paper makes several noteworthy contributions:

  1. Expert-Curated Dataset: The central contribution is the ExpertQA dataset, comprising 2,177 questions across 32 fields of study, generated and vetted by domain experts. The dataset is deliberately curated to avoid vague queries and to emphasize real-world professional information needs. Notably, each answer is critically evaluated for factual accuracy and supporting evidence, yielding a reliable benchmark for assessing LLMs (a hypothetical record layout is sketched after this list).
  2. Comprehensive Evaluation Metrics: The analysis breaks down the assessment of LLM responses into several attributes, including the factuality, informativeness, and cite-worthiness of claims, as well as the reliability of the cited evidence. This multifaceted evaluation framework provides a detailed picture of how LLMs perform along the axes that matter to domain experts.
  3. Detailed System Evaluation: By evaluating a variety of systems, including vanilla LLMs, retrieve-and-read models, and post-hoc retrieval systems, the paper illustrates their distinct strengths and weaknesses. It finds that retrieve-and-read systems often provide more complete attributions, and it highlights how the choice of retrieval source significantly affects the quality of retrieval-augmented responses.
  4. Analysis of Automatic Estimators: The paper also investigates how well existing automatic methods for attribution and factuality estimation correlate with expert judgments. The findings indicate that while these methods achieve high precision, there is considerable room for improvement in recall, underscoring the need for further refinement (a toy precision/recall comparison is sketched below).
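
To make the dataset's structure concrete, the sketch below shows what a single ExpertQA-style record might look like as a Python data structure. The field names, label values, and example content are illustrative assumptions, not the released schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical layout of an ExpertQA-style example.
# Field names and label values are assumptions for illustration only.

@dataclass
class Claim:
    text: str                               # an individual claim extracted from the answer
    evidence_urls: List[str]                # sources cited as support for the claim
    support_label: Optional[str] = None     # expert judgment, e.g. "complete" / "partial" / "missing"
    factuality_label: Optional[str] = None  # expert judgment on correctness

@dataclass
class ExpertQAExample:
    field_of_study: str            # one of the 32 fields, e.g. "Medicine"
    question: str                  # expert-written, domain-specific question
    model_answer: str              # long-form answer produced by a system
    revised_answer: Optional[str]  # expert-improved version of the answer
    claims: List[Claim] = field(default_factory=list)

# Illustrative usage with made-up content:
example = ExpertQAExample(
    field_of_study="Law",
    question="Under what conditions can a contract be voided for duress?",
    model_answer="A contract may be voidable if one party was coerced ...",
    revised_answer="A contract is generally voidable for duress when ...",
    claims=[Claim(
        text="Duress renders a contract voidable rather than void.",
        evidence_urls=["https://example.org/contract-law"],
        support_label="partial",
        factuality_label="correct",
    )],
)
```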

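To illustrate how automatic attribution estimators can be scored against expert judgments at the claim level, the following sketch computes claim-level precision and recall. It assumes binary "supported" labels per claim, which is a simplification of the paper's finer-grained annotation scheme.

```python
from typing import List, Tuple

def attribution_precision_recall(predicted: List[bool], expert: List[bool]) -> Tuple[float, float]:
    """Score automatic 'this claim is supported by its evidence' predictions
    against expert judgments. Binary labels are a simplifying assumption."""
    assert len(predicted) == len(expert)
    tp = sum(p and e for p, e in zip(predicted, expert))
    fp = sum(p and not e for p, e in zip(predicted, expert))
    fn = sum(not p and e for p, e in zip(predicted, expert))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy example: a conservative estimator yields high precision but low recall,
# mirroring the qualitative finding reported in the paper.
predicted = [True, False, False, True, False]
expert =    [True, True,  True,  True, False]
p, r = attribution_precision_recall(predicted, expert)
print(f"precision={p:.2f}, recall={r:.2f}")  # precision=1.00, recall=0.50
```
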
Implications and Future Directions

The implications of this work are both practical and theoretical. Practically, it underscores the importance of trustworthy, verifiable information in high-stakes domains where the cost of misinformation can be substantial. Theoretically, the proposed dataset and evaluation framework can serve as crucial tools for the research community to improve LLMs' ability to generate factually correct and well-attributed information.

Future work may explore enhanced algorithms for factuality assessment and attribution, potentially integrating more robust retrieval mechanisms and improved natural language inference (NLI) models. Further investigation into how domain-specific context can be incorporated into retrieval sources would also help reduce the gaps in source reliability highlighted in this work.
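
As one concrete direction, an entailment-based check of whether a cited passage actually supports a claim could be built from an off-the-shelf NLI model. The sketch below is a minimal illustration assuming the Hugging Face transformers library and the microsoft/deberta-large-mnli checkpoint; the model choice, threshold, and function name are assumptions, not components of the paper's system.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Model choice is an assumption for illustration; any off-the-shelf NLI checkpoint would do.
MODEL_NAME = "microsoft/deberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def evidence_supports_claim(evidence: str, claim: str, threshold: float = 0.5) -> bool:
    """Treat the evidence passage as the premise and the claim as the hypothesis;
    return True if the entailment probability clears the (assumed) threshold."""
    inputs = tokenizer(evidence, claim, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
    # Locate the entailment class from the model config rather than hard-coding an index.
    entail_idx = next(i for i, label in model.config.id2label.items() if "entail" in label.lower())
    return probs[entail_idx].item() >= threshold

# Illustrative call with made-up text:
print(evidence_supports_claim(
    "The trial enrolled 500 patients and reported a 12% reduction in relapse.",
    "The study found that the treatment reduced relapse rates.",
))
```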

Furthermore, extending the framework to capture a broader range of expert viewpoints could yield even richer insights into the capabilities and limitations of LLMs. The authors also highlight the need to automate attribution and factuality assessment effectively, pointing to potential advances in metric development and dataset augmentation.

In summary, the ExpertQA paper makes a valuable contribution to understanding and improving the interaction between domain experts and LLMs. It lays the groundwork for more precise and reliable AI tools tailored to the needs of professional and academic users across diverse fields. By prioritizing factual accuracy and authoritative sourcing, this work charts a course for responsible AI integration in specialized domains.