- The paper presents a comprehensive evaluation of LLMs on creative writing, comparing commercial and open-source models with human performance.
- It employs rigorous human assessments measuring fluency, coherence, originality, humor, and style to benchmark model outputs.
- Findings reveal that some commercial LLMs match or slightly exceed human writers on technical aspects of writing such as fluency and coherence, while human writers retain an edge in creativity and humor divides the models sharply.
The paper "A Confederacy of Models: a Comprehensive Evaluation of LLMs on Creative Writing" presents an in-depth evaluation of several recent LLMs on a creative writing task. To reduce the risk of training data contamination, the authors chose a deliberately unusual and imaginative prompt: an epic narration of single combat between Ignatius J. Reilly, the protagonist of the novel A Confederacy of Dunces, and a pterodactyl.
The study asked a range of commercial and open-source LLMs, alongside human writers, to produce stories based on this scenario. The evaluation covered multiple dimensions of writing quality, including:
- Fluency
- Coherence
- Originality
- Humor
- Style
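As a rough illustration of how ratings along these dimensions might be summarized, the sketch below averages scores per writer group and dimension. It is not the authors' analysis code: the function name `mean_scores_by_group`, the group labels, the numeric scale, and every score in `example_ratings` are hypothetical placeholders chosen only for the example.

```python
from collections import defaultdict
from statistics import mean

# The five dimensions named in the summary above.
DIMENSIONS = ("fluency", "coherence", "originality", "humor", "style")


def mean_scores_by_group(ratings):
    """Average the scores for each (group, dimension) pair.

    `ratings` is an iterable of (group, dimension, score) tuples.
    Returns a dict mapping group -> {dimension: mean score}.
    """
    buckets = defaultdict(list)
    for group, dimension, score in ratings:
        if dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dimension}")
        buckets[(group, dimension)].append(score)

    result = defaultdict(dict)
    for (group, dimension), scores in buckets.items():
        result[group][dimension] = mean(scores)
    return dict(result)


# Made-up ratings for illustration only; these are NOT results from the paper.
example_ratings = [
    ("human", "humor", 8), ("human", "humor", 6),
    ("commercial", "humor", 7), ("commercial", "humor", 3),
    ("open_source", "humor", 2), ("open_source", "humor", 4),
    ("human", "fluency", 8), ("commercial", "fluency", 9),
]

summary = mean_scores_by_group(example_ratings)
for group, scores in summary.items():
    print(group, scores)
```

A per-group mean is only one possible summary. A dimension with a sharp split between models, as the summary reports for humor, would be hidden by averaging across models, so per-model scores would be the more informative view in that case.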
Human evaluators rated the stories on these criteria, allowing a direct comparison between LLMs and human writers. The key findings from the paper are as follows:
- Commercial vs. Open-Source LLMs: Some state-of-the-art commercial LLMs matched or slightly exceeded the human writers in most technical aspects, such as fluency and coherence. In contrast, open-source LLMs lagged behind both their commercial counterparts and the human writers in these areas.
- Creativity: Humans retained an edge in creativity. While LLMs could generate coherent and fluent text, the depth of imagination and originality shown by human writers remained more pronounced.
- Humor: The results showed a binary divide in the handling of humor. Some advanced LLMs handled humor comparably to human writers, while others failed at it. This split indicates that generating humor remains a challenge for many models.
- Implications and Limitations: The paper highlights both the advances and the current limitations of LLMs in creative writing. While certain models are approaching human performance in the technical aspects of writing, the more subjective elements of creative expression, such as creativity and humor, still pose challenges.
The authors also discuss potential future research directions, such as improving the creative abilities of LLMs and developing evaluation metrics that better capture the qualities of creative writing. These findings contribute to understanding the capabilities and limitations of LLMs in creative fields.