
Abstract

Single document news summarization has seen substantial progress on faithfulness in recent years, driven by research on the evaluation of factual consistency, or hallucinations. We ask whether these advances carry over to other text summarization domains. We propose a new evaluation benchmark on topic-focused dialogue summarization, generated by LLMs of varying sizes. We provide binary sentence-level human annotations of the factual consistency of these summaries along with detailed explanations of factually inconsistent sentences. Our analysis shows that existing LLMs hallucinate significant amounts of factual errors in the dialogue domain, regardless of the model's size. On the other hand, when LLMs, including GPT-4, serve as binary factual evaluators, they perform poorly and can be outperformed by prevailing state-of-the-art specialized factuality evaluation metrics. Finally, we conducted an analysis of hallucination types with a curated error taxonomy. We find that there are diverse errors and error distributions in model-generated summaries and that non-LLM based metrics can capture all error types better than LLM-based evaluators.

TofuEval comprises 1.5K LLM-generated, topic-focused dialogue summaries annotated by expert linguists for factual consistency, completeness, and relevance.

Overview

  • Introduces TofuEval, a dataset for assessing the factual consistency of LLM-generated summaries in dialogue summarization.

  • Finds significant factual inaccuracies (hallucinations) in summaries by LLMs of all sizes, challenging the assumption that larger models are more accurate.

  • TofuEval differentiates from other benchmarks by focusing on dialogue summarization and providing expert-annotated factual consistency labels with explanations.

  • Develops an error taxonomy for dialogue summarization, identifying areas where non-LLM metrics outperform LLM evaluators, and outlines future directions for enhancing model and metric accuracy.

Evaluation of LLMs on Topic-Focused Dialogue Summarization: A Study on Hallucinations

Introduction to TofuEval

Research on LLMs for text summarization has grown rapidly, but most of it centers on news articles; dialogue summarization, a significant but less explored area, has received comparatively little attention. This study introduces TofuEval, a benchmark for assessing the factual consistency of LLM-generated, topic-focused dialogue summaries. TofuEval collects summaries produced by LLMs of varying sizes and provides sentence-level binary human annotations of factual consistency, together with written explanations for every sentence judged factually inconsistent.
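To make the benchmark's structure concrete, the sketch below shows one way a single annotated summary sentence could be represented. The field names and example values are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SentenceAnnotation:
    """Hypothetical record for one summary sentence in a TofuEval-style benchmark."""
    document_id: str          # source dialogue (e.g., an interview or meeting transcript)
    topic: str                # topic the summary is focused on
    model: str                # LLM that generated the summary
    sentence: str             # one sentence of the generated summary
    is_consistent: bool       # binary factual-consistency label from expert annotators
    explanation: Optional[str] = None  # free-text rationale, given when is_consistent is False

# Illustrative example (values are made up):
example = SentenceAnnotation(
    document_id="meeting_042",
    topic="budget approval",
    model="gpt-4",
    sentence="The committee approved the budget unanimously.",
    is_consistent=False,
    explanation="The transcript shows two members abstained, so the vote was not unanimous.",
)
```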

Hallucinations in LLM-generated Summaries

The study's central finding is that hallucinations, statements in a summary that are not supported by the source dialogue, are pervasive in dialogue summarization. LLMs of all sizes introduce a substantial number of factual errors into their summaries, contradicting the common assumption that larger models are inherently more factually consistent. The analysis further shows that LLMs, including GPT-4, perform poorly when used as binary factual-consistency evaluators: specialized state-of-the-art factuality metrics outperform them in both accuracy and computational cost.
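As a rough illustration of the evaluator setup the paper critiques, the snippet below sketches how an LLM could be prompted for a binary factual-consistency judgment on one summary sentence. The prompt wording and the `ask_llm` callable are placeholders; the paper's actual prompts and models may differ.

```python
def build_consistency_prompt(dialogue: str, sentence: str) -> str:
    """Compose a yes/no factual-consistency question for one summary sentence."""
    return (
        "You are checking a summary of a dialogue for factual consistency.\n\n"
        f"Dialogue:\n{dialogue}\n\n"
        f"Summary sentence:\n{sentence}\n\n"
        "Is every claim in the summary sentence supported by the dialogue? "
        "Answer with exactly one word: Yes or No."
    )

def judge_sentence(dialogue: str, sentence: str, ask_llm) -> bool:
    """Return True if the LLM judges the sentence factually consistent.

    `ask_llm` is a placeholder for whatever completion API is available;
    it takes a prompt string and returns the model's text response.
    """
    response = ask_llm(build_consistency_prompt(dialogue, sentence))
    return response.strip().lower().startswith("yes")
```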

Comparative Analysis with Existing Benchmarks

Unlike most existing benchmarks, which focus on news summarization, TofuEval evaluates dialogue summarization across dialogic settings such as interviews and meetings. This focus is motivated by the practical value of dialogue summarization in real-world scenarios, for example condensing customer-service interactions or making meetings more efficient. The paper situates TofuEval within the landscape of existing benchmarks and highlights its distinctive contributions, in particular expert-annotated factual-consistency labels accompanied by written explanations, which together provide a thorough framework for assessing summary factuality in dialogue.

Error Taxonomy and Analysis

A further contribution of the study is a detailed error taxonomy tailored to dialogue summarization. The taxonomy enables a fine-grained analysis of the types and distributions of errors in model-generated summaries, revealing diverse patterns of factual inaccuracy. Using it, the study identifies where non-LLM-based metrics capture error types more effectively than LLM-based evaluators, offering concrete directions for improving both evaluators and summarization models.
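The per-error-type comparison described above can be sketched as follows: given annotated error categories and an evaluator's binary predictions, compute the fraction of erroneous sentences of each type that the evaluator flags. The category names and data layout here are illustrative assumptions, not the paper's taxonomy.

```python
from collections import defaultdict

def detection_rate_by_error_type(annotations, predictions):
    """Per error type, the fraction of factually inconsistent sentences that an
    evaluator correctly flags as inconsistent (i.e., recall on errors).

    annotations: list of dicts with keys "id", "is_consistent", "error_type"
                 (error_type is None for consistent sentences).
    predictions: dict mapping "id" -> bool (True = evaluator says consistent).
    """
    caught = defaultdict(int)
    total = defaultdict(int)
    for ann in annotations:
        if ann["is_consistent"]:
            continue  # only inconsistent sentences carry an error type
        etype = ann["error_type"]
        total[etype] += 1
        if not predictions[ann["id"]]:  # evaluator flagged it as inconsistent
            caught[etype] += 1
    return {etype: caught[etype] / total[etype] for etype in total}

# Illustrative usage with made-up error types:
anns = [
    {"id": "s1", "is_consistent": False, "error_type": "wrong_reference"},
    {"id": "s2", "is_consistent": False, "error_type": "unsupported_claim"},
    {"id": "s3", "is_consistent": True, "error_type": None},
]
preds = {"s1": False, "s2": True, "s3": True}
print(detection_rate_by_error_type(anns, preds))
# {'wrong_reference': 1.0, 'unsupported_claim': 0.0}
```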

Implications and Future Directions

The implications of this research are twofold. Practically, it underscores the need to improve the factual consistency of LLM-generated dialogue summaries and to develop models and evaluation metrics that better handle the complexities of dialogic text. Theoretically, it deepens our understanding of the limitations and capabilities of LLMs across different summarization settings, challenging the notion that proficiency in one domain transfers wholesale to another. Looking ahead, the study encourages the research community to explore specialized models and metrics tailored to dialogue summarization, which could extend AI's applicability to more nuanced, context-rich summarization tasks.

Conclusion

TofuEval is an important step in examining LLM performance on dialogue summarization, highlighting both the prevalence of hallucinations in model-generated summaries and the current inadequacy of LLMs as reliable evaluators of factual consistency. By providing a robust benchmark and a detailed analysis of the errors found in summaries, the study lays groundwork for future advances toward more accurate, efficient, and context-aware summarization and evaluation models.
