One Thousand and One Pairs: A "novel" challenge for long-context language models

(2406.16264)
Published Jun 24, 2024 in cs.CL and cs.AI

Abstract

Synthetic long-context LLM benchmarks (e.g., "needle-in-the-haystack") test only surface-level retrieval capabilities, but how well can long-context LLMs retrieve, synthesize, and reason over information across book-length inputs? We address this question by creating NoCha, a dataset of 1,001 minimally different pairs of true and false claims about 67 recently-published English fictional books, written by human readers of those books. In contrast to existing long-context benchmarks, our annotators confirm that the largest share of pairs in NoCha require global reasoning over the entire book to verify. Our experiments show that while human readers easily perform this task, it is enormously challenging for all ten long-context LLMs that we evaluate: no open-weight model performs above random chance (despite their strong performance on synthetic benchmarks), while GPT-4o achieves the highest accuracy at 55.8%. Further analysis reveals that (1) on average, models perform much better on pairs that require only sentence-level retrieval vs. global reasoning; (2) model-generated explanations for their decisions are often inaccurate even for correctly-labeled claims; and (3) models perform substantially worse on speculative fiction books that contain extensive world-building. The methodology proposed in NoCha allows for the evolution of the benchmark dataset and the easy analysis of future models.

Figure: Data collection and evaluation pipeline for claim verification using books published between 2023 and 2024.

Overview

  • The paper introduces NoCha ("a novel challenge"), a dataset aimed at evaluating LLMs on their ability to verify claims over book-length inputs, unlike existing benchmarks that focus on surface-level retrieval.

  • The dataset consists of 1,001 minimally different pairs of true and false claims about 67 recently published English fictional books, with the largest share of pairs requiring global reasoning over the entire book for accurate verification.

  • Evaluation of ten long-context LLMs revealed significant challenges in synthesis and reasoning: no open-weight model performed above random chance, and the best closed-source model (GPT-4o) reached only 55.8% accuracy, highlighting the need for more complex and realistic benchmarks.

Overview of "One Thousand and One Pairs: A 'novel' challenge for long-context language models"

Introduction

The advent of LLMs capable of processing long contexts has dramatically altered the landscape of NLP. However, existing benchmarks, such as "needle-in-the-haystack" (NIAH), primarily assess surface-level retrieval from texts rather than the more challenging operations of synthesis and reasoning over large narrative structures. The paper "One Thousand and One Pairs: A 'novel' challenge for long-context language models" addresses this gap by presenting a carefully designed dataset, NoCha, aimed at evaluating LLMs' ability to verify claims over book-length inputs.

Dataset Construction

The dataset comprises 1,001 minimally different pairs of true and false claims about 67 recently published English fictional books. The pairs are designed so that, for the largest share of them, global reasoning over the entire book is necessary for accurate verification. This setup contrasts with current benchmarks, which often rely on synthetic tasks that can be solved with only superficial contextual understanding.

Data were collected from annotators who wrote true/false pairs based on their own reading of recently published novels. This approach mitigates the data contamination issues inherent in existing datasets and grounds the claims in real narratives rather than synthetic templates. Each false claim in a pair is minimally different from its true counterpart, isolating a single narrative element. This design also aids quality control, since each false claim can be validated directly against its true counterpart.
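
To illustrate the pair format, here is a minimal sketch in Python of how a single minimally different claim pair might be represented. The field names, scope labels, and example claims below are hypothetical and are not taken from the actual dataset release.

```python
from dataclasses import dataclass

@dataclass
class ClaimPair:
    """One minimally different true/false claim pair about a single novel.

    The field names and scope labels here are illustrative, not the
    dataset's actual schema.
    """
    book_title: str
    true_claim: str   # verified against the full book by the annotator
    false_claim: str  # differs from the true claim in a single narrative element
    scope: str        # e.g. "sentence", "passage", or "global" reasoning needed

# An invented example in the spirit of the dataset (not an actual NoCha item):
pair = ClaimPair(
    book_title="A Recently Published Novel (2024)",
    true_claim="The narrator hides the letter before leaving the city.",
    false_claim="The narrator burns the letter before leaving the city.",
    scope="global",
)
```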

Evaluation Methodology

The paper evaluates ten LLMs, encompassing both open-weight and closed-source models, on the NoCha dataset. The results indicate significant challenges: no open-weight model performed above random chance, and the best-performing model achieved only 55.8% accuracy. Analysis revealed that model performance dropped significantly on claim pairs requiring synthesis and reasoning over the entire book, compared to pairs solvable by sentence-level retrieval.
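
To make the evaluation setup concrete, below is a minimal sketch of such a verification loop in Python, building on the ClaimPair sketch above. The prompt wording, the `model.generate` client, and the strict pair-level scoring shown here (a pair counts only if both of its claims are labeled correctly) are assumptions for illustration, not the authors' exact implementation.

```python
def build_prompt(book_text: str, claim: str) -> str:
    # Illustrative prompt; the paper's actual instructions differ in wording.
    return (
        f"{book_text}\n\n"
        "Based only on the book above, decide whether the following claim is True or False.\n"
        f"Claim: {claim}\n"
        "Answer with a single word: True or False."
    )

def verify_claim(model, book_text: str, claim: str) -> bool:
    """Ask a long-context model to label one claim; True means the claim is judged true."""
    answer = model.generate(build_prompt(book_text, claim))  # hypothetical client method
    return answer.strip().lower().startswith("true")

def pair_accuracy(model, books: dict[str, str], pairs: list[ClaimPair]) -> float:
    """Strict pair-level accuracy: a pair counts only if both claims are labeled correctly."""
    correct = 0
    for p in pairs:
        text = books[p.book_title]
        true_ok = verify_claim(model, text, p.true_claim)        # should come back True
        false_ok = not verify_claim(model, text, p.false_claim)  # should come back False
        correct += int(true_ok and false_ok)
    return correct / len(pairs)
```

Under this kind of scoring, a model that guesses randomly on each claim would land well below 50% at the pair level, which is one reason minimal pairs make the benchmark harder to game than single-claim verification.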

Key Findings

  1. Global Reasoning Difficulty: The core task in NoCha requires models to perform global reasoning over the whole book, a task at which all evaluated models struggled: human readers achieved an accuracy of 96.9%, while the best model reached only 55.8%.
  2. Strong Retrieval, Weak Synthesis: While models performed well on simple sentence-level retrieval, as reflected in their strong results on NIAH and RULER-style synthetic tasks, they faltered when required to integrate and reason over information spread across an entire book (a small scope-breakdown sketch follows this list).
  3. Speculative Fiction Challenges: Notably, models performed substantially worse on speculative fiction books that contain extensive world-building; this genre appears to introduce additional layers of complexity that challenge current state-of-the-art LLMs.
  4. Explanation Accuracy: Model-generated explanations for decisions were often inaccurate, revealing flawed or incomplete reasoning even when the model arrived at the correct verdict.
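
As referenced in the list above, the following is a small illustrative sketch of how accuracy could be broken down by the scope of reasoning each pair requires. The record format and scope labels are hypothetical, not the paper's actual evaluation format.

```python
from collections import defaultdict

def accuracy_by_scope(results: list[dict]) -> dict[str, float]:
    """Aggregate per-pair outcomes into accuracy per reasoning scope.

    Each result is an illustrative record such as
    {"scope": "global", "correct": False}.
    """
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # scope -> [correct, total]
    for r in results:
        totals[r["scope"]][0] += int(r["correct"])
        totals[r["scope"]][1] += 1
    return {scope: c / t for scope, (c, t) in totals.items()}

# Example: a model that handles retrieval-style pairs but misses many global ones.
print(accuracy_by_scope([
    {"scope": "sentence", "correct": True},
    {"scope": "sentence", "correct": True},
    {"scope": "global", "correct": False},
    {"scope": "global", "correct": True},
]))  # {'sentence': 1.0, 'global': 0.5}
```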

Practical and Theoretical Implications

The practical implications of these findings are significant. For one, they challenge the suitability of existing long-context benchmarks and emphasize the need for more complex, realistic datasets like NoCha. Such datasets better mimic real-world applications, where users expect models not only to retrieve information but also to synthesize and reason across extensive contexts.

From a theoretical perspective, this study opens avenues for improving the architectural and training methodologies of LLMs to handle extensive contexts efficiently. The gap between human performance and current model capabilities suggests significant room for progress in better understanding and integrating long narratives.

Future Developments

Future research may focus on several key areas:

  • Enhancing model architectures to ensure better reasoning across extended narratives, potentially by incorporating memory mechanisms or improved attention strategies capable of handling multiple information threads.
  • Adapting training regimes that focus on real-world dataset complexity rather than overly simplified synthetic tasks.
  • Expanding benchmarks to domains beyond English fiction, thereby reducing language and genre bias and making evaluations more broadly applicable.

Conclusion

The paper "One Thousand and One Pairs: A 'novel' challenge for long-context language models" provides a valuable critique of current evaluation methods and sets a new standard for long-context LLM benchmarks. By highlighting the challenges LLMs face in synthesizing and reasoning over large texts, it underscores critical areas for future improvement in model design and training strategies. This contribution is pivotal for advancing the development of LLMs that can more accurately understand and process extensive narrative content.
