One Thousand and One Pairs: A "novel" challenge for long-context language models

(2406.16264)
Published Jun 24, 2024 in cs.CL and cs.AI

Abstract

Synthetic long-context LLM benchmarks (e.g., "needle-in-the-haystack") test only surface-level retrieval capabilities, but how well can long-context LLMs retrieve, synthesize, and reason over information across book-length inputs? We address this question by creating NoCha, a dataset of 1,001 minimally different pairs of true and false claims about 67 recently-published English fictional books, written by human readers of those books. In contrast to existing long-context benchmarks, our annotators confirm that the largest share of pairs in NoCha require global reasoning over the entire book to verify. Our experiments show that while human readers easily perform this task, it is enormously challenging for all ten long-context LLMs that we evaluate: no open-weight model performs above random chance (despite their strong performance on synthetic benchmarks), while GPT-4o achieves the highest accuracy at 55.8%. Further analysis reveals that (1) on average, models perform much better on pairs that require only sentence-level retrieval vs. global reasoning; (2) model-generated explanations for their decisions are often inaccurate even for correctly-labeled claims; and (3) models perform substantially worse on speculative fiction books that contain extensive world-building. The methodology proposed in NoCha allows for the evolution of the benchmark dataset and the easy analysis of future models.

Figure: Data collection and evaluation pipeline for claim verification using books published between 2023 and 2024.

Overview

  • The paper introduces NoCha ("a novel challenge"), a dataset aimed at evaluating LLMs on their ability to verify claims over book-length inputs, unlike existing benchmarks that focus on surface-level retrieval.

  • The dataset consists of 1,001 minimally different pairs of true and false claims about 67 recently published English fictional books, with the largest share of pairs requiring global reasoning over the entire book for accurate verification.

  • Evaluation of ten long-context LLMs revealed significant challenges in synthesis and reasoning: no open-weight model performed above random chance, and the best closed-source model (GPT-4o) reached only 55.8% accuracy, highlighting the need for more complex and realistic benchmarks.

Overview of "One Thousand and One Pairs: A 'novel' challenge for long-context language models"

Introduction

The advent of LLMs capable of processing long contexts has dramatically altered the landscape of NLP. However, existing benchmarks, such as "needle-in-the-haystack" (NIAH), primarily assess surface-level retrieval from texts rather than the more challenging operations of synthesis and reasoning over large narrative structures. The paper "One Thousand and One Pairs: A 'novel' challenge for long-context language models" addresses this gap by presenting a carefully designed dataset, NoCha, aimed at evaluating LLMs' ability to verify claims over book-length inputs.

Dataset Construction

The dataset comprises 1,001 minimally different pairs of true and false claims about 67 recently published English fictional books. The pairs are designed so that, for the largest share of them, global reasoning over the entire book is necessary for accurate verification. This setup contrasts with current benchmarks, which often rely on synthetic tasks that can be solved with only superficial contextual understanding.

Data were collected from annotators who wrote true/false pairs based on their own reading of recently published novels. This approach mitigates the data contamination issues inherent in existing datasets and grounds the claims in real narratives rather than synthetic templates. Each false claim in a pair is minimally different from its true counterpart, isolating a single narrative element. This design also aids quality control, since each false claim can be validated directly against its true counterpart.
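
To illustrate the pair format, here is a minimal sketch in Python of how a single minimally different claim pair might be represented. The field names, scope labels, and example claims below are hypothetical and are not taken from the actual dataset release.

```python
from dataclasses import dataclass

@dataclass
class ClaimPair:
    """One minimally different true/false claim pair about a single novel.

    The field names and scope labels here are illustrative, not the
    dataset's actual schema.
    """
    book_title: str
    true_claim: str   # verified against the full book by the annotator
    false_claim: str  # differs from the true claim in a single narrative element
    scope: str        # e.g. "sentence", "passage", or "global" reasoning needed

# An invented example in the spirit of the dataset (not an actual NoCha item):
pair = ClaimPair(
    book_title="A Recently Published Novel (2024)",
    true_claim="The narrator hides the letter before leaving the city.",
    false_claim="The narrator burns the letter before leaving the city.",
    scope="global",
)
```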

Evaluation Methodology

The paper evaluates ten LLMs, encompassing both open-weight and closed-source models, on the NoCha dataset. The results indicate significant challenges: no open-weight model performed above random chance, and the best-performing model achieved only 55.8% accuracy. Analysis revealed that model performance dropped significantly on claim pairs requiring synthesis and reasoning over the entire book, compared to pairs solvable by sentence-level retrieval.
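
To make the evaluation setup concrete, below is a minimal sketch of such a verification loop in Python, building on the ClaimPair sketch above. The prompt wording, the `model.generate` client, and the strict pair-level scoring shown here (a pair counts only if both of its claims are labeled correctly) are assumptions for illustration, not the authors' exact implementation.

```python
def build_prompt(book_text: str, claim: str) -> str:
    # Illustrative prompt; the paper's actual instructions differ in wording.
    return (
        f"{book_text}\n\n"
        "Based only on the book above, decide whether the following claim is True or False.\n"
        f"Claim: {claim}\n"
        "Answer with a single word: True or False."
    )

def verify_claim(model, book_text: str, claim: str) -> bool:
    """Ask a long-context model to label one claim; True means the claim is judged true."""
    answer = model.generate(build_prompt(book_text, claim))  # hypothetical client method
    return answer.strip().lower().startswith("true")

def pair_accuracy(model, books: dict[str, str], pairs: list[ClaimPair]) -> float:
    """Strict pair-level accuracy: a pair counts only if both claims are labeled correctly."""
    correct = 0
    for p in pairs:
        text = books[p.book_title]
        true_ok = verify_claim(model, text, p.true_claim)        # should come back True
        false_ok = not verify_claim(model, text, p.false_claim)  # should come back False
        correct += int(true_ok and false_ok)
    return correct / len(pairs)
```

Under this kind of scoring, a model that guesses randomly on each claim would land well below 50% at the pair level, which is one reason minimal pairs make the benchmark harder to game than single-claim verification.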

Key Findings

  1. Global Reasoning Difficulty: The core task in NoCha requires models to perform global reasoning over the whole book, a task at which all evaluated models struggled: human readers achieved an accuracy of 96.9%, while the best model reached only 55.8%.
  2. Strong Retrieval, Weak Synthesis: While models performed well on simple sentence-level retrieval, as reflected in their strong results on NIAH and RULER-style synthetic tasks, they faltered when required to integrate and reason over information spread across an entire book (a small scope-breakdown sketch follows this list).
  3. Speculative Fiction Challenges: Notably, models performed substantially worse on speculative fiction books that contain extensive world-building; this genre appears to introduce additional layers of complexity that challenge current state-of-the-art LLMs.
  4. Explanation Accuracy: Model-generated explanations for decisions were often inaccurate, revealing flawed or incomplete reasoning even when the model arrived at the correct verdict.
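
As referenced in the list above, the following is a small illustrative sketch of how accuracy could be broken down by the scope of reasoning each pair requires. The record format and scope labels are hypothetical, not the paper's actual evaluation format.

```python
from collections import defaultdict

def accuracy_by_scope(results: list[dict]) -> dict[str, float]:
    """Aggregate per-pair outcomes into accuracy per reasoning scope.

    Each result is an illustrative record such as
    {"scope": "global", "correct": False}.
    """
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # scope -> [correct, total]
    for r in results:
        totals[r["scope"]][0] += int(r["correct"])
        totals[r["scope"]][1] += 1
    return {scope: c / t for scope, (c, t) in totals.items()}

# Example: a model that handles retrieval-style pairs but misses many global ones.
print(accuracy_by_scope([
    {"scope": "sentence", "correct": True},
    {"scope": "sentence", "correct": True},
    {"scope": "global", "correct": False},
    {"scope": "global", "correct": True},
]))  # {'sentence': 1.0, 'global': 0.5}
```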

Practical and Theoretical Implications

The practical implications of these findings are significant. For one, they challenge the suitability of existing long-context benchmarks and emphasize the need for more complex, realistic datasets like NoCha. Such datasets better mimic real-world applications, where users expect models not only to retrieve information but also to synthesize and reason across extensive contexts.

From a theoretical perspective, this study opens avenues for improving the architectural and training methodologies of LLMs to handle extensive contexts efficiently. The gap between human performance and current model capabilities suggests significant room for progress in better understanding and integrating long narratives.

Future Developments

Future research may focus on several key areas:

  • Enhancing model architectures to ensure better reasoning across extended narratives, potentially by incorporating memory mechanisms or improved attention strategies capable of handling multiple information threads.
  • Adapting training regimes that focus on real-world dataset complexity rather than overly simplified synthetic tasks.
  • Expanding benchmarks to domains beyond English fiction, thereby reducing language and genre bias and making evaluations more broadly applicable.

Conclusion

The paper "One Thousand and One Pairs: A 'novel' challenge for long-context language models" provides a valuable critique of current evaluation methods and sets a new standard for long-context LLM benchmarks. By highlighting the challenges LLMs face in synthesizing and reasoning over large texts, it underscores critical areas for future improvement in model design and training strategies. This contribution is pivotal for advancing the development of LLMs that can more accurately understand and process extensive narrative content.
