Reading Subtext: Evaluating Large Language Models on Short Story Summarization with Writers (2403.01061v3)

Published 2 Mar 2024 in cs.CL

Abstract: We evaluate recent LLMs on the challenging task of summarizing short stories, which can be lengthy, and include nuanced subtext or scrambled timelines. Importantly, we work directly with authors to ensure that the stories have not been shared online (and therefore are unseen by the models), and to obtain informed evaluations of summary quality using judgments from the authors themselves. Through quantitative and qualitative analysis grounded in narrative theory, we compare GPT-4, Claude-2.1, and LLama-2-70B. We find that all three models make faithfulness mistakes in over 50% of summaries and struggle with specificity and interpretation of difficult subtext. We additionally demonstrate that LLM ratings and other automatic metrics for summary quality do not correlate well with the quality ratings from the writers.

References (47)

Citations (8)

View on Semantic Scholar

Summary

The paper demonstrates that LLMs struggle with narrative subtext, with over 50% faithfulness errors observed in summary outputs.
The study employs a novel methodology using unpublished short stories and direct expert writer evaluations to ensure data originality and contextual accuracy.
Findings reveal GPT-4 outperforms peers, highlighting significant gaps in current models' capacity for nuanced thematic analysis and overall summarization.

Evaluating LLMs on the Subtle Task of Short Story Summarization: A Study with Unseen Data and Expert Writers

Introduction

Short story summarization presents a unique challenge for LLMs due to the inherent complexity of narrative structures, which can include nuanced subtext, non-linear timelines, and a mix of abstract and concrete details. Recognizing this, the paper "Reading Subtext: Evaluating LLMs on Short Story Summarization with Writers" seeks to understand how well current LLMs—specifically GPT-4, Claude-2.1, and LLama-2-70B—perform in summarizing short stories that are complex and have not been previously shared online, ensuring these texts are unseen by the models prior to evaluation.

Methodology

The authors' approach revolves around collaborating directly with experienced writers to use unpublished short stories as test cases, thereby ensuring the stories are not in the models' training data. This approach not only maintains the integrity and originality of the data but also leverages expert human judgments for evaluation. The paper involves quantitative and qualitative assessments, examining models' performances across coherence, faithfulness, coverage, and analysis—a novel inclusion highlighting the importance of thematic understanding in summarization tasks. Additionally, the paper innovates in its methodological framework by replacing conventional LLM judgments with skilled human evaluations to assess summary quality, providing a robust critique of current automatic evaluation methodologies.

Key Findings

The findings reveal a mixed performance by the evaluated LLMs. While all models exhibit a tendency to make over 50% faithfulness errors and struggle with interpreting complex subtext, they also show the capacity for insightful thematic analysis at their best. Specifically, GPT-4 emerges as the most capable, followed closely by Claude-2.1, with LLama-2-70B lagging in its ability to summarize effectively, particularly for longer stories. The evaluation highlights the significant disparity between LLM-generated judgments of summary quality and those provided by the writers, underscoring the inadequacy of LLMs in replacing human expertise in nuanced tasks like narrative summarization.

Implications and Future Directions

This paper underlines several critical areas for future research, especially in improving LLMs’ understanding of narrative structures and subtext. The pronounced difficulty in summarizing stories with complex narratives, unreliable narrators, or detailed subplots suggests a need for models that can better grasp the subtleties of human storytelling. Furthermore, the mismatch between LLM and human evaluations of summaries prompts a reevaluation of current summary quality metrics, advocating for more human-centered approaches in assessing narrative understanding.

Moreover, the research methodology adopted here, specifically the direct engagement with creative communities and the use of unpublished stories, offers a valuable template for future studies aiming to challenge LLMs with genuinely unseen data. Such collaborations not only enrich the dataset diversity but also ensure a more contextually informed evaluation of LLM performance, a critical step toward models that can genuinely understand and generate human-like narrative content.

Conclusion

In conclusion, "Reading Subtext: Evaluating LLMs on Short Story Summarization with Writers" provides an insightful exploration into the current capabilities and limitations of LLMs in the complex task of narrative summarization. By leveraging expert human judgments and ensuring the use of unseen, nuanced narrative texts, the paper presents an instructive foray into understanding the depth of narrative comprehension achievable by current LLM technologies. As the field progresses, bridging the identified gaps and continuing to refine models’ narrative understanding will be crucial for advancements in AI-generated narrative content.