Emergent Mind

Abstract

LLMs can now generate and recognize text in a wide range of styles and genres, including highly specialized, creative genres like poetry. But what do LLMs really know about poetry? What can they know about poetry? We develop a task to evaluate how well LLMs recognize a specific aspect of poetry, poetic form, for more than 20 forms and formal elements in the English language. Poetic form captures many different poetic features, including rhyme scheme, meter, and word or line repetition. We use this task to reflect on LLMs' current poetic capabilities, as well as the challenges and pitfalls of creating NLP benchmarks for poetry and for other creative tasks. In particular, we use this task to audit and reflect on the poems included in popular pretraining datasets. Our findings have implications for NLP researchers interested in model evaluation, digital humanities and cultural analytics scholars, and cultural heritage professionals.

Evaluating LLMs' ability to identify over 20 English poetic forms and elements.

Overview

  • The paper investigates the ability of various LLMs to recognize and classify different poetic forms, analyzing their performance on a dataset of 4,197 poems tagged by human experts and drawn from reputable sources.

  • The study found that LLMs like GPT-4 and GPT-4o excel at identifying common fixed forms such as sonnets and haikus, and even at repetition-based forms like sestinas and pantoums, while all models struggle with unfixed forms defined by topic or visual layout.

  • The research highlights the potential biases introduced by pretraining data, stressing the need for better benchmarks and interdisciplinary collaboration to enhance the capabilities and applications of LLMs in digital humanities.

Evaluation of Poetic Forms by LLMs

The paper "Sonnet or Not, Bot? Poetry Evaluation for Large Models and Datasets" by Melanie Walsh, Anna Preus, and Maria Antoniak provides a rigorous examination of how well contemporary LLMs can identify various poetic forms. The authors introduce a benchmark designed to evaluate LLMs' capabilities in recognizing more than 20 fixed and unfixed poetic forms in the English language and analyze the implications for NLP, digital humanities, and cultural heritage.

Summary of the Study

The study focuses on assessing LLMs' abilities to categorize poems by form, a task requiring understanding complex features such as rhyme schemes, meter, and repetition. The poetic forms considered include common ones like sonnets and haikus, as well as more intricate forms like sestinas and pantoums.

Methodology

The researchers used a diverse set of poems sourced from reputable institutions such as the Poetry Foundation and the Academy of American Poets. Additionally, they manually digitized a selection of poetry books. The resulting dataset comprises 4,197 poems tagged by human experts.

They evaluated multiple LLMs, including GPT-3.5 Turbo, GPT-4, GPT-4o, Claude 3 Sonnet, Llama 3, and Mixtral 8x22B, using different zero-shot prompt types (e.g., poem text only, title and author only, first line only). The models' performance was measured against human expert annotations.
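The prompt variants differ only in which fields of the poem record they expose. A minimal sketch of how such variants might be assembled is below; the wording and the candidate-form list are illustrative assumptions, not the paper's actual prompts.

```python
# Illustrative subset of forms; the paper covers more than 20.
CANDIDATE_FORMS = ["sonnet", "haiku", "villanelle", "sestina", "pantoum", "limerick"]

def build_prompt(poem_text=None, title=None, author=None, first_line=None):
    """Assemble a zero-shot classification prompt from whichever fields
    a given prompt variant supplies (text only, title+author only, etc.)."""
    parts = ["Identify the poetic form. Answer with one of: "
             + ", ".join(CANDIDATE_FORMS) + "."]
    if title:
        parts.append(f"Title: {title}")
    if author:
        parts.append(f"Author: {author}")
    if first_line:
        parts.append(f"First line: {first_line}")
    if poem_text:
        parts.append("Poem:\n" + poem_text)
    return "\n\n".join(parts)
```

Comparing variants this way isolates what the model infers from the text itself versus what it retrieves from metadata it may have memorized (e.g., a famous title and author).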

Findings

The results indicate that the LLMs generally perform well on common fixed forms like sonnets and haikus, achieving high F1 scores (near or above 0.9 for most models) when provided with the text of the poem. However, performance is weaker and more variable on less common forms and on forms without strict structural cues.

  • Fixed Forms: GPT-4 and GPT-4o showed particular strength in detecting forms built on precise repetition, such as sestinas (F1=0.87; 0.73) and pantoums (F1=0.81; 0.82).
  • Unfixed Forms: The models struggled with forms based on topic (e.g., elegies, ars poetica) and visual features (e.g., concrete poetry, prose poems).
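Per-form F1 scores like those above can be computed directly from the gold and predicted labels. The sketch below is a self-contained, stdlib-only version of the standard per-class F1 calculation (equivalent to what a library like scikit-learn would report), shown here only to make the metric concrete.

```python
def per_class_f1(gold, pred):
    """Per-class F1 from parallel lists of gold and predicted form labels."""
    scores = {}
    for label in set(gold) | set(pred):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(p == label and g != label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores[label] = (2 * precision * recall / (precision + recall)
                         if precision + recall else 0.0)
    return scores
```

Because F1 balances precision and recall per class, it penalizes a model that over-predicts a frequent form like "sonnet" as much as one that misses rare forms entirely.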

Analysis of Pretraining Data

The authors also investigated the presence of the benchmark poems in popular pretraining datasets (e.g., Dolma) and found evidence that GPT-4 has memorized many of them, revealing the potential biases introduced by pretraining data. In particular, the Common Crawl and C4 datasets contain a substantial share of the poems used in the task, which complicates the construction of uncontaminated benchmarks.
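One simple way to screen a benchmark poem against a pretraining corpus is verbatim line matching. The function below is a crude, hypothetical proxy for such contamination checks (the paper's actual procedure may differ): it reports the fraction of a poem's longer lines that appear verbatim in a corpus string.

```python
def overlap_fraction(poem, corpus, min_words=5):
    """Fraction of poem lines with at least min_words words that occur
    verbatim in the corpus text -- a rough contamination signal."""
    lines = [l.strip() for l in poem.splitlines()
             if len(l.split()) >= min_words]
    if not lines:
        return 0.0
    hits = sum(line in corpus for line in lines)
    return hits / len(lines)
```

Short lines are excluded because they match by chance; at web scale one would hash n-grams into an index rather than scan raw text, but the signal is the same.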

Implications and Future Directions

NLP and Model Evaluation

This study underscores the need for nuanced benchmarks that account for the complexities of creative genres like poetry. The observed performance differences between poetic forms highlight the varying capabilities of modern LLMs and their reliance on the structure and frequency of training data.

Digital Humanities and Cultural Analytics

For digital humanities scholars, the research shows the potential and current limitations of using LLMs for literary analysis. Automated form detection could notably enhance the discoverability of poetic texts in digital archives, aiding research and education.

Cultural Heritage Collections

For libraries and cultural institutions, these findings suggest that integrating LLM-based tools could facilitate the cataloging of large poetry collections. However, careful attention to the limitations and biases of such models is crucial.

Conclusion

The paper offers a thorough examination of how well LLMs understand and categorize English poetic forms. While the results are promising, especially for commonly studied forms, they also highlight significant gaps in the models' capabilities for less frequent and more complex forms. Future research should explore multi-label classification and include a broader range of poetic traditions and languages to build more comprehensive evaluation frameworks. This research bridges the fields of NLP and digital humanities, opening avenues for enhanced literary analysis and text categorization with advanced AI tools.

In addition to their technical contributions, the authors call for more interdisciplinary collaboration between computer scientists and literary scholars to develop nuanced evaluation tools that respect the diversity and complexity inherent in poetic forms.
