A Confederacy of Models: a Comprehensive Evaluation of LLMs on Creative Writing (2310.08433v1)

Published 12 Oct 2023 in cs.CL and cs.CY

Abstract: We evaluate a range of recent LLMs on English creative writing, a challenging and complex task that requires imagination, coherence, and style. We use a difficult, open-ended scenario chosen to avoid training data reuse: an epic narration of a single combat between Ignatius J. Reilly, the protagonist of the Pulitzer Prize-winning novel A Confederacy of Dunces (1980), and a pterodactyl, a prehistoric flying reptile. We ask several LLMs and humans to write such a story and conduct a human evaluation involving various criteria such as fluency, coherence, originality, humor, and style. Our results show that some state-of-the-art commercial LLMs match or slightly outperform our writers in most dimensions; whereas open-source LLMs lag behind. Humans retain an edge in creativity, while humor shows a binary divide between LLMs that can handle it comparably to humans and those that fail at it. We discuss the implications and limitations of our study and suggest directions for future research.

Citations (43)

Summary

The paper presents a comprehensive evaluation of LLMs on creative writing, comparing commercial and open-source models with human performance.
It employs rigorous human assessments measuring fluency, coherence, originality, humor, and style to benchmark model outputs.
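Findings indicate that some commercial LLMs match or slightly outperform the human writers in most dimensions, while open-source LLMs lag behind; humans retain an edge in creativity, and humor splits the models into those that handle it comparably to humans and those that fail at it.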

The paper "A Confederacy of Models: a Comprehensive Evaluation of LLMs on Creative Writing" presents an in-depth evaluation of several recent LLMs with a focus on creative writing tasks. The authors specifically selected a complex and imaginative scenario to avoid potential training data contamination: an epic narration involving a single combat between Ignatius J. Reilly, the central character from the novel A Confederacy of Dunces, and a pterodactyl.

The authors asked a range of commercial and open-source LLMs, as well as human writers, to write a story based on this scenario. The evaluation covered multiple dimensions of writing quality, including:

Fluency
Coherence
Originality
Humor
Style
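As a rough illustration of how ratings on these criteria could be compared across writers, the sketch below averages scores per writer and criterion. It is a minimal sketch under stated assumptions, not the authors' procedure: the function name `mean_scores`, the writer labels, the numeric scale, and the sample ratings are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from statistics import mean

# The five criteria named in the abstract.
CRITERIA = ["fluency", "coherence", "originality", "humor", "style"]


def mean_scores(ratings):
    """Average the scores for each (writer, criterion) pair.

    ratings: iterable of (writer, criterion, score) tuples.
    Returns a dict mapping (writer, criterion) to the mean score.
    """
    buckets = defaultdict(list)
    for writer, criterion, score in ratings:
        if criterion not in CRITERIA:
            raise ValueError(f"unknown criterion: {criterion}")
        buckets[(writer, criterion)].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}


# Placeholder ratings invented for illustration only (not data from the paper).
example_ratings = [
    ("writer_a", "humor", 7),
    ("writer_a", "humor", 9),
    ("writer_b", "humor", 3),
    ("writer_b", "fluency", 8),
]

averages = mean_scores(example_ratings)
```

Comparing per-criterion averages in this way keeps the dimensions separate, so a writer who scores well on fluency but poorly on humor is not hidden behind a single overall number.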

Human evaluators assessed the stories on these criteria, which allowed the LLMs and the human writers to be compared directly. The key findings from the paper are as follows:

Commercial vs. Open-Source LLMs: Some state-of-the-art commercial LLMs matched or slightly outperformed the human writers in most of the evaluated dimensions. In contrast, open-source LLMs lagged behind.
Creativity: Humans retained an edge in creativity. While LLMs could generate coherent and fluent text, the imagination and uniqueness shown by the human writers remained more pronounced.
Humor: The results showed a binary distribution in the handling of humor. Some advanced LLMs managed to incorporate humor comparably to human writers, while others were notably ineffective at this task. This variability indicates that the capability to generate humor remains a challenge for many models.
Implications and Limitations: The paper highlights both the advances and the current limitations of LLMs in creative writing. While certain models are approaching human performance in technical writing aspects, the nuanced and subjective elements of creative expression, such as genuine creativity and humor, still pose challenges.

The authors also discuss potential future research directions, such as enhancing the creative faculties of LLMs and developing more sophisticated evaluation metrics that can better capture the essence of creative writing traits. These insights constitute an important contribution to understanding the capabilities and limitations of LLMs in creative fields.
