LiveBench: A Challenging, Contamination-Limited LLM Benchmark

Published 27 Jun 2024 in cs.CL, cs.AI, and cs.LG | (2406.19314v2)

Abstract: Test set contamination, wherein test data from a benchmark ends up in a newer model's training set, is a well-documented obstacle for fair LLM evaluation and can quickly render benchmarks obsolete. To mitigate this, many recent benchmarks crowdsource new prompts and evaluations from human or LLM judges; however, these can introduce significant biases, and break down when scoring hard questions. In this work, we introduce a new benchmark for LLMs designed to be resistant to both test set contamination and the pitfalls of LLM judging and human crowdsourcing. We release LiveBench, the first benchmark that (1) contains frequently-updated questions from recent information sources, (2) scores answers automatically according to objective ground-truth values, and (3) contains a wide variety of challenging tasks, spanning math, coding, reasoning, language, instruction following, and data analysis. To achieve this, LiveBench contains questions that are based on recently-released math competitions, arXiv papers, news articles, and datasets, and it contains harder, contamination-limited versions of tasks from previous benchmarks such as Big-Bench Hard, AMPS, and IFEval. We evaluate many prominent closed-source models, as well as dozens of open-source models ranging from 0.5B to 405B in size. LiveBench is difficult, with top models achieving below 70% accuracy. We release all questions, code, and model answers. Questions are added and updated on a monthly basis, and we release new tasks and harder versions of tasks over time so that LiveBench can distinguish between the capabilities of LLMs as they improve in the future. We welcome community engagement and collaboration for expanding the benchmark tasks and models.

Authors (18)

Citations (25)

Summary

The paper introduces LiveBench, a benchmark that limits test set contamination by using regularly updated questions and scores answers automatically against objective ground-truth values.
LiveBench evaluates LLMs across six categories (math, coding, reasoning, language, instruction following, and data analysis), with top models achieving below 70% accuracy.
The benchmark replaces human and LLM-based judgements with ground-truth scoring, avoiding the biases those judges introduce and supporting more reliable assessments of model performance.

An Academic Overview of LiveBench: A Contamination-Limited LLM Benchmark

The paper under review introduces "LiveBench," a benchmark designed to address two limitations of existing evaluation frameworks for LLMs: test set contamination, and the biases introduced when human crowdsourcing or LLM judges are used for evaluation. Both make accurate, fair assessment difficult. LiveBench, as described by White et al., responds by regularly updating its questions from recent information sources and by scoring answers automatically against objective ground-truth values, which removes the dependence on human or LLM judges and the biases they bring.

Key Characteristics of LiveBench

The authors identify three defining characteristics of LiveBench:

Regularly Updated Questions: Questions are drawn from recent sources, such as recently released math competitions, arXiv papers, news articles, and datasets, so the benchmark stays current and is less likely to overlap with a model's training data.
Objective Scoring: LiveBench involves no LLM or human judges. Answers are scored automatically against objective ground-truth values, which makes the evaluation more reliable.
Diverse Task Range: The benchmark spans six broad categories (math, coding, reasoning, language, instruction following, and data analysis), so it can evaluate a wide range of LLM capabilities.
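The objective-scoring idea above can be illustrated with a minimal sketch. This is not the LiveBench implementation: the exact-match comparison, the normalization step, the function names, and the choice to average category scores equally are all assumptions made for illustration, and the real benchmark uses task-specific scoring functions.

```python
from collections import defaultdict


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization rule)."""
    return " ".join(answer.strip().lower().split())


def score_answer(model_answer: str, ground_truth: str) -> float:
    """Return 1.0 on an exact match after normalization, otherwise 0.0."""
    return 1.0 if normalize(model_answer) == normalize(ground_truth) else 0.0


def category_scores(records):
    """Average the per-question scores within each category.

    `records` is an iterable of (category, model_answer, ground_truth) tuples;
    this record layout is hypothetical.
    """
    scores = defaultdict(list)
    for category, model_answer, ground_truth in records:
        scores[category].append(score_answer(model_answer, ground_truth))
    return {cat: sum(vals) / len(vals) for cat, vals in scores.items()}


def overall_score(records) -> float:
    """Average the category scores, weighting each category equally (assumed)."""
    per_category = category_scores(records)
    return sum(per_category.values()) / len(per_category)
```

Because every score comes from a deterministic comparison with a stored answer, rerunning the evaluation gives the same result, and no judge model or human rater is involved.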

Evaluation and Insights

The authors evaluate many prominent closed-source models, as well as dozens of open-source models ranging from 0.5B to 405B parameters. The results show how difficult LiveBench is: even the top models achieve below 70% accuracy. The benchmark therefore challenges LLMs across its full breadth of regularly updated questions while remaining resistant to contamination.

A significant insight from the paper is that LLMs are unreliable judges for difficult tasks. The authors report that GPT-4-Turbo's pass/fail judgements have an error rate as high as 46% on complex reasoning tasks, which motivates scoring against objective ground-truth answers instead.
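To make the notion of a judge error rate concrete, here is a minimal sketch of how such a figure could be computed. It assumes each answer has both a judge verdict and a ground-truth verdict available as booleans; the function name and data layout are hypothetical, and the paper's own measurement procedure may differ.

```python
def judge_error_rate(judge_verdicts, ground_truth_verdicts) -> float:
    """Fraction of answers where the judge's pass/fail verdict is wrong.

    Both arguments are equal-length sequences of booleans (True = pass).
    """
    if len(judge_verdicts) != len(ground_truth_verdicts):
        raise ValueError("verdict lists must have the same length")
    if not judge_verdicts:
        raise ValueError("at least one verdict is required")
    disagreements = sum(
        judge != truth
        for judge, truth in zip(judge_verdicts, ground_truth_verdicts)
    )
    return disagreements / len(judge_verdicts)
```

An error rate near 0.5 on a pass/fail task means the judge is barely better than a coin flip, which is why a rate of up to 46% undermines LLM judging for hard questions.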

Implications and Future Prospects

Practical Implications: For practitioners and developers of LLMs, LiveBench offers a contamination-resistant framework, free of judge bias, for evaluating model improvements. This supports more accurate and meaningful assessments of model capabilities in application settings, encouraging the development of models better suited to dynamic, real-world challenges.

Theoretical Implications: From a research perspective, LiveBench prompts reconsideration of existing evaluation metrics and methodologies. This scrutiny may lead to broader adoption of automated, contamination-limited benchmarks, and in turn to a better understanding of LLM capabilities and limitations.

Future Developments: Looking forward, the adaptability of LiveBench is a notable benefit. Its structure supports continuous evolution: questions are added and updated on a monthly basis, and new tasks and harder versions of tasks are released over time. As models improve, LiveBench can adapt to remain challenging and continue to differentiate meaningfully between models.
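One way to see why regularly refreshed questions limit contamination is to restrict evaluation to questions released after a model's training cutoff. The sketch below is only an illustration: the `release_date` field, the record layout, and the function name are hypothetical and are not taken from the LiveBench codebase.

```python
from datetime import date


def questions_after_cutoff(questions, training_cutoff: date):
    """Keep only questions released strictly after the model's training cutoff.

    Each question is a dict with a hypothetical `release_date` field.
    """
    return [q for q in questions if q["release_date"] > training_cutoff]


# Example usage with made-up question records.
example_questions = [
    {"id": "q1", "release_date": date(2024, 3, 1)},
    {"id": "q2", "release_date": date(2024, 6, 15)},
    {"id": "q3", "release_date": date(2024, 7, 1)},
]
fresh = questions_after_cutoff(example_questions, date(2024, 6, 1))
```

A question published after the cutoff cannot have appeared in that model's training data, so a steady supply of new questions keeps a usable evaluation set available as newer models are released.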

Conclusion

LiveBench represents a concerted effort to address the chronic problem of test data contamination and biased evaluations in LLM benchmarking. It sets a new standard in the evaluation landscape by emphasizing updated, diverse question sets and objective, automated scoring mechanisms. This approach promises enhanced reliability in benchmarking outcomes, contributing significantly to both the applied and theoretical realms of LLM research and development. Future engagements and expansions of LiveBench will likely bolster its utility and efficacy, reinforcing its role as a pivotal benchmark in LLM assessments.
