Summarization is (Almost) Dead

(2309.09558)
Published Sep 18, 2023 in cs.CL

Abstract

How well can LLMs generate summaries? We develop new datasets and conduct human evaluation experiments to evaluate the zero-shot generation capability of LLMs across five distinct summarization tasks. Our findings indicate a clear preference among human evaluators for LLM-generated summaries over human-written summaries and summaries generated by fine-tuned models. Specifically, LLM-generated summaries exhibit better factual consistency and fewer instances of extrinsic hallucinations. Due to the satisfactory performance of LLMs in summarization tasks (even surpassing the benchmark of reference summaries), we believe that most conventional works in the field of text summarization are no longer necessary in the era of LLMs. However, we recognize that there are still some directions worth exploring, such as the creation of novel datasets with higher quality and more reliable evaluation methods.

Figure: High human preference for LLM-generated summaries over other systems across five tasks.

Overview

  • The paper evaluates the summarization performance of LLMs such as GPT-3, GPT-3.5, and GPT-4, comparing them against human-written summaries and summaries from fine-tuned models across different scenarios.

  • It details the creation of specialized datasets for human evaluation in summarization tasks, including news, dialogue, code, and cross-lingual text summarization.

  • Findings indicate LLM-generated summaries are preferred for their fluency, coherence, and factual consistency, revealing limitations in human-generated summaries.

  • The study suggests focusing on high-quality datasets, application-oriented approaches, and advanced evaluation metrics for future text summarization research.

Evaluating the Performance of LLMs in Summarization Tasks

Introduction to Summarization Capabilities of LLMs

The advent of LLMs such as GPT-3, GPT-3.5, and GPT-4 has shifted attention toward their remarkable zero-shot generation capabilities across a variety of tasks, including text summarization. This paper undertakes a comprehensive analysis to evaluate the performance of LLMs against human-written summaries and summaries from models fine-tuned for specific summarization tasks. Using newly developed datasets for human evaluation, the study presents a series of experiments comparing these summaries across five distinct summarization scenarios: single-news, multi-news, dialogue, code, and cross-lingual summarization.
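
As an illustration of the zero-shot setup, the sketch below requests a single-document summary from a chat-style LLM API with no in-context examples. The prompt wording, model name, and use of the `openai` Python client are assumptions for illustration; the paper does not publish its exact prompts or API configuration.

```python
# Minimal sketch of zero-shot summarization with a chat-style LLM API.
# Assumptions: the `openai` Python client (v1+) is installed and OPENAI_API_KEY is set;
# the prompt wording and model name are illustrative, not the paper's exact setup.
from openai import OpenAI

client = OpenAI()

def zero_shot_summary(document: str, model: str = "gpt-4") -> str:
    """Ask the model for a short summary of `document` without any examples."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "You are a helpful assistant that writes concise, factual summaries."},
            {"role": "user",
             "content": f"Summarize the following article in 2-3 sentences:\n\n{document}"},
        ],
        temperature=0,  # deterministic output makes pairwise comparison easier
    )
    return response.choices[0].message.content.strip()
```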

Detailed Overview of Experimental Framework

Datasets and Models

Specialized datasets were built to ensure the LLMs had not been exposed to the data during training. Each dataset comprised 50 samples per task, following the construction methodology of established benchmarks such as CNN/DailyMail for news and adopting analogous procedures for dialogue and code summarization. For cross-lingual summarization, reference summaries were produced through translation followed by post-editing to strengthen the dataset's quality.
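
One simple way to organize such samples is sketched below. The field names and task labels are hypothetical and only illustrate the kind of record each of the five 50-sample sets would contain; the paper's actual data format may differ.

```python
# Hypothetical record layout for the human-evaluation datasets described above.
# Field names and task labels are illustrative, not the paper's actual schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SummarizationSample:
    task: str                      # "single-news", "multi-news", "dialogue", "code", or "cross-lingual"
    source: str                    # input: article(s), dialogue transcript, or source code
    reference_summary: str         # human-written (or translated and post-edited) reference
    target_language: Optional[str] = None  # only used for the cross-lingual task

# Example: one of the 50 cross-lingual samples
sample = SummarizationSample(
    task="cross-lingual",
    source="<source-language document>",
    reference_summary="<post-edited English summary>",
    target_language="en",
)
```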

Experimental Process

A rigorous human evaluation process was adopted, involving graduate students and, where necessary, domain experts (for example, for code summarization). Each evaluator performed pairwise comparisons of summaries, providing a broad assessment across systems including GPT-3, GPT-3.5, GPT-4, BART, T5, Pegasus, mT5, and CodeT5.
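
A minimal sketch of how such pairwise judgments can be turned into per-system win rates is given below. The counting scheme (ties split evenly between the two systems) is an assumption for illustration, not necessarily the paper's exact tallying protocol.

```python
# Aggregate pairwise human judgments into per-system win rates.
# Assumption: ties contribute half a win to each side; the paper's exact
# scheme may differ.
from collections import defaultdict

def win_rates(judgments):
    """judgments: iterable of (system_a, system_b, verdict),
    where verdict is "a", "b", or "tie"."""
    wins = defaultdict(float)
    comparisons = defaultdict(int)
    for system_a, system_b, verdict in judgments:
        comparisons[system_a] += 1
        comparisons[system_b] += 1
        if verdict == "a":
            wins[system_a] += 1
        elif verdict == "b":
            wins[system_b] += 1
        else:  # tie
            wins[system_a] += 0.5
            wins[system_b] += 0.5
    return {system: wins[system] / comparisons[system] for system in comparisons}

# Example: three judgments comparing an LLM against the human reference
print(win_rates([("gpt-4", "human", "a"),
                 ("gpt-4", "human", "tie"),
                 ("gpt-4", "human", "b")]))
# {'gpt-4': 0.5, 'human': 0.5}
```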

Insightful Findings from the Human Evaluation

LLM-generated summaries were consistently preferred over both human-written summaries and those produced by fine-tuned models. Evaluators attributed this preference to the greater fluency and coherence of LLM summaries and, in some tasks, to better factual consistency. Notably, where human-written summaries showed weaker factual consistency, LLM summaries fared better, underlining the limitations of human summarization in certain contexts.

Furthermore, the study categorized factual errors as intrinsic hallucinations (content that contradicts the source) or extrinsic hallucinations (content not supported by the source), and found that extrinsic hallucinations accounted for most of the factual inconsistencies observed in human-written summaries.

Implications and Future Directions

Given the compelling performance of LLMs in generating coherent, fluent, and factually consistent summaries, the study suggests a paradigm shift in the development and refinement of text summarization models. It underlines the need for:

  • High-Quality Reference Datasets: Future work should focus on constructing high-quality datasets with expert-annotated reference summaries to further challenge and evaluate LLMs' summarization capabilities.
  • Application-Oriented Approaches: There's a ripe opportunity to explore LLMs in application-specific summarization tasks, potentially offering more personalized and contextually relevant summaries.
  • Advanced Evaluation Metrics: Moving beyond traditional lexical-overlap metrics such as ROUGE, there is a pressing need for more nuanced and practical evaluation methodologies that better reflect the capabilities of advanced LLMs (a minimal ROUGE baseline sketch follows this list).
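
For context on the baseline being criticized, the sketch below computes ROUGE scores with Google's `rouge_score` package. This is one common implementation, not something the paper prescribes, and the example texts are made up.

```python
# Baseline lexical-overlap evaluation with ROUGE, the kind of metric the paper
# argues is insufficient on its own. Uses the `rouge_score` package
# (pip install rouge-score); example texts are illustrative.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "The central bank raised interest rates by a quarter point on Tuesday."
candidate = "On Tuesday the central bank increased rates by 0.25 percentage points."

scores = scorer.score(reference, candidate)
for name, result in scores.items():
    print(f"{name}: precision={result.precision:.2f} "
          f"recall={result.recall:.2f} f1={result.fmeasure:.2f}")
```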

Conclusion

The study's findings underscore the impressive summarization capabilities of LLMs, raising critical questions about the continued development of traditional summarization models. Despite the success, the paper does not discount the importance of ongoing research, especially in creating superior datasets, exploring novel application-oriented summarization tasks, and developing more relevant evaluation metrics. The future of text summarization appears to be on the cusp of a significant transformation, driven by the advancements in LLM technologies.
