Abstract

We evaluate recent Large Language Models (LLMs) on the challenging task of summarizing short stories, which can be lengthy, and include nuanced subtext or scrambled timelines. Importantly, we work directly with authors to ensure that the stories have not been shared online (and therefore are unseen by the models), and to obtain informed evaluations of summary quality using judgments from the authors themselves. Through quantitative and qualitative analysis grounded in narrative theory, we compare GPT-4, Claude-2.1, and LLama-2-70B. We find that all three models make faithfulness mistakes in over 50% of summaries and struggle to interpret difficult subtext. However, at their best, the models can provide thoughtful thematic analysis of stories. We additionally demonstrate that LLM judgments of summary quality do not match the feedback from the writers.

Overview

  • The paper analyzes LLMs like GPT-4, Claude-2.1, and LLama-2-70B, exploring their ability to summarize complex, unpublished short stories involving nuanced subtext and non-linear timelines.

  • It evaluates the models with a novel approach that incorporates feedback from experienced writers, focusing on coherence, faithfulness, coverage, and analysis (thematic understanding).

  • Findings indicate that while the LLMs can produce thoughtful thematic analysis, they struggle with faithfulness and with interpreting complex narrative subtext; GPT-4 performs best among the three.

  • The study suggests future research should focus on improving LLMs' narrative comprehension and advocates for more human-centered evaluation methods in narrative summarization tasks.

Evaluating LLMs on the Subtle Task of Short Story Summarization: A Study with Unseen Data and Expert Writers

Introduction

Short story summarization presents a unique challenge for LLMs due to the inherent complexity of narrative structures, which can include nuanced subtext, non-linear timelines, and a mix of abstract and concrete details. Recognizing this, the study "Reading Subtext: Evaluating LLMs on Short Story Summarization with Writers" seeks to understand how well current LLMs—specifically GPT-4, Claude-2.1, and LLama-2-70B—perform in summarizing short stories that are complex and have not been previously shared online, ensuring these texts are unseen by the models prior to evaluation.

Methodology

The authors' approach revolves around collaborating directly with experienced writers and using their unpublished short stories as test cases, thereby ensuring the stories are not in the models' training data. This maintains the originality of the data and leverages expert human judgment for evaluation. The study combines quantitative and qualitative assessments, examining model performance across coherence, faithfulness, coverage, and analysis, with analysis being a novel inclusion that highlights the importance of thematic understanding in summarization tasks. Additionally, the paper checks conventional LLM-based judgments of summary quality against the writers' own assessments, providing a pointed critique of current automatic evaluation methodologies.
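To make the four-dimension rubric concrete, below is a minimal sketch (in Python) of how a writer's rating of a single machine-generated summary could be recorded and aggregated. The class name, field names, and 1-4 scale are illustrative assumptions made for this summary, not the paper's exact instrument.

    # Hypothetical record of one writer's rating of one model-generated summary.
    # The 1-4 scale and equal weighting are assumptions for illustration only.
    from dataclasses import dataclass

    @dataclass
    class WriterRating:
        story_id: str        # identifier for the unpublished story
        model: str           # e.g. "GPT-4", "Claude-2.1", "LLama-2-70B"
        coherence: int       # does the summary read as a well-formed whole?
        faithfulness: int    # does it avoid misstating events or details?
        coverage: int        # does it capture the story's key plot points?
        analysis: int        # does it identify theme and subtext?

        def mean_score(self) -> float:
            """Average of the four dimensions (assumed equal weighting)."""
            return (self.coherence + self.faithfulness
                    + self.coverage + self.analysis) / 4

    rating = WriterRating("story_017", "GPT-4", 4, 2, 3, 4)
    print(rating.mean_score())  # 3.25

A structure like this makes it straightforward to compare models per dimension (for example, averaging faithfulness scores across stories) rather than relying on a single overall quality number.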

Key Findings

The findings reveal mixed performance by the evaluated LLMs. All three models make faithfulness errors in over 50% of their summaries and struggle to interpret complex subtext, yet at their best they can produce insightful thematic analysis. GPT-4 emerges as the most capable, followed closely by Claude-2.1, with LLama-2-70B lagging in its ability to summarize effectively, particularly for longer stories. The evaluation also highlights a significant disparity between LLM-generated judgments of summary quality and those provided by the writers, underscoring that LLMs cannot yet replace human expertise in nuanced tasks like narrative summarization.
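As a hedged illustration of how such a disparity can be quantified, the sketch below computes a rank correlation between scores from the two sources. The numbers are invented for illustration and are not data from the paper; the paper's own analysis may use a different agreement measure.

    # Compare hypothetical writer ratings with hypothetical LLM-judge ratings
    # of the same summaries using Spearman rank correlation.
    from scipy.stats import spearmanr

    writer_scores = [4, 2, 3, 1, 4, 2]   # made-up writer ratings per summary
    llm_scores    = [3, 4, 4, 3, 4, 4]   # made-up LLM ratings of the same summaries

    rho, p_value = spearmanr(writer_scores, llm_scores)
    print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")
    # A low or negative rho would indicate that the LLM judge does not
    # track the writers' assessments of summary quality.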

Implications and Future Directions

This study underlines several critical areas for future research, especially in improving LLMs’ understanding of narrative structures and subtext. The pronounced difficulty in summarizing stories with complex narratives, unreliable narrators, or detailed subplots suggests a need for models that can better grasp the subtleties of human storytelling. Furthermore, the mismatch between LLM and human evaluations of summaries prompts a reevaluation of current summary quality metrics, advocating for more human-centered approaches in assessing narrative understanding.

Moreover, the research methodology adopted here, specifically the direct engagement with creative communities and the use of unpublished stories, offers a valuable template for future studies aiming to challenge LLMs with genuinely unseen data. Such collaborations not only enrich the dataset diversity but also ensure a more contextually informed evaluation of LLM performance, a critical step toward models that can genuinely understand and generate human-like narrative content.

Conclusion

In conclusion, "Reading Subtext: Evaluating LLMs on Short Story Summarization with Writers" provides an insightful exploration into the current capabilities and limitations of LLMs in the complex task of narrative summarization. By leveraging expert human judgments and ensuring the use of unseen, nuanced narrative texts, the study presents an instructive foray into understanding the depth of narrative comprehension achievable by current LLM technologies. As the field progresses, bridging the identified gaps and continuing to refine models’ narrative understanding will be crucial for advancements in AI-generated narrative content.
