Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement

(arXiv:2407.18370)
Published Jul 25, 2024 in cs.LG and cs.CL

Abstract

We present a principled approach to provide LLM-based evaluation with a rigorous guarantee of human agreement. We first propose that a reliable evaluation method should not uncritically rely on model preferences for pairwise evaluation, but rather assess the confidence of judge models and selectively decide when to trust their judgement. We then show that under this selective evaluation framework, human agreement can be provably guaranteed -- such that the model evaluation aligns with that of humans to a user-specified agreement level. As part of our framework, we also introduce Simulated Annotators, a novel confidence estimation method that significantly improves judge calibration and thus enables high coverage of evaluated instances. Finally, we propose Cascaded Selective Evaluation, where we use cheaper models as initial judges and escalate to stronger models only when necessary -- again, while still providing a provable guarantee of human agreement. Experimental results show that Cascaded Selective Evaluation guarantees strong alignment with humans, far beyond what LLM judges could achieve without selective evaluation. For example, on a subset of Chatbot Arena where GPT-4 almost never achieves 80% human agreement, our method, even while employing substantially cost-effective models such as Mistral-7B, guarantees over 80% human agreement with almost 80% test coverage.

Figure: Cascaded system using a cost-effective model first, escalating to a stronger model based on confidence.

Overview

  • The paper introduces Cascaded Selective Evaluation, a framework that makes LLM-based evaluation more reliable by guaranteeing a user-specified level of agreement with human judgment.

  • The methodology leverages Simulated Annotators for better confidence estimation and employs a hierarchical approach, escalating from cheaper to costlier models based on confidence thresholds.

  • Experimental results demonstrate high human agreement across various datasets and practical conditions, and the framework retains robustness even under distribution shifts.

Introduction

The paper presents a methodology to enhance the reliability of LLM-based evaluation through a framework that offers a rigorous guarantee of human agreement. The central premise is that LLMs, while potentially scalable substitutes for manual annotation, suffer from systematic biases and over-confidence that undermine their reliability. The proposed framework, termed Cascaded Selective Evaluation, aims not only to elicit the model's judgment but also to appraise the confidence of that judgment, selectively deciding when the model can be trusted. This selective process is designed to ensure that alignment with human preferences meets a user-specified agreement level.
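Concretely, the guarantee the framework targets can be stated roughly as follows. The notation below is ours rather than quoted from the paper: λ̂ is the confidence threshold chosen on a human-labeled calibration set, α is the user-specified risk tolerance, δ bounds the probability that calibration itself fails, and agreement is measured only on instances the judge chooses to trust.

```latex
% Sketch of the selective-evaluation guarantee (notation assumed, not the paper's verbatim statement).
% The outer probability is over the random draw of the calibration set used to choose \hat{\lambda}.
\Pr\Big(
  \underbrace{\mathbb{E}\big[\mathbf{1}\{\text{judge pred.} = \text{human pred.}\}
    \;\big|\; \text{confidence} \ge \hat{\lambda}\big]}_{\text{human agreement on trusted instances}}
  \;\ge\; 1 - \alpha
\Big) \;\ge\; 1 - \delta
```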

Methodology

The proposed framework employs a selective evaluation mechanism. Instead of uncritically relying on model outputs, it introduces a confidence measure to evaluate whether an LLM's prediction aligns with human judgment:

  1. Selective Evaluation: The judge model produces a prediction together with a confidence estimate, and only predictions whose confidence meets a threshold λ are trusted. The threshold is calibrated via a fixed-sequence testing procedure on a human-labeled calibration set, ensuring that disagreement with human judgments stays below a prescribed risk level α.
  2. Simulated Annotators: A novel confidence estimation method that generates multiple simulated annotations via in-context learning and measures confidence as the consensus among them. This significantly improves calibration, reducing over-confidence and making the confidence estimates of LLM judges far more reliable, even for cheaper models like Mistral-7B.
  3. Cascaded Selective Evaluation: A hierarchical approach in which evaluation starts with a cheaper model and escalates to stronger, costlier models only when the current model's confidence is insufficient. Each model in the cascade is calibrated so that the specified human agreement guarantee is maintained; a minimal sketch of the full pipeline follows this list.
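To make the three steps concrete, here is a minimal Python sketch of the pipeline under a few explicit assumptions: a hypothetical `judge(instance, annotator_id)` callable that prompts the judge LLM with a different in-context annotator profile per id, a binomial-tail test standing in for the fixed-sequence testing step, and per-judge thresholds calibrated independently. The paper's actual prompts, test statistic, and cascade-level calibration may differ; this only illustrates the control flow.

```python
"""Sketch of selective evaluation with Simulated Annotators and a judge cascade."""
from collections import Counter

import numpy as np
from scipy.stats import binom


def simulated_annotators(judge, instance, n_annotators=5):
    """Estimate confidence as the consensus among simulated annotators.

    `judge(instance, annotator_id)` is a hypothetical callable that prompts the
    judge LLM with a different in-context annotator profile per id and returns
    a preference label ("A" or "B").
    """
    votes = [judge(instance, annotator_id=i) for i in range(n_annotators)]
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)                   # (prediction, confidence)


def calibrate_threshold(confidences, agrees_with_human, alpha=0.2, delta=0.05,
                        candidate_lams=np.linspace(1.0, 0.5, 51)):
    """Pick a confidence threshold via fixed-sequence testing.

    Walk thresholds from most to least conservative; certify each one whose
    binomial-tail p-value for H0 "disagreement rate > alpha" falls below delta,
    and stop at the first failure. Returns the loosest certified threshold, or
    None if nothing is certified (i.e., always escalate / abstain).
    """
    confidences = np.asarray(confidences)
    agrees_with_human = np.asarray(agrees_with_human, dtype=bool)
    certified = None
    for lam in candidate_lams:
        trusted = confidences >= lam
        n = int(trusted.sum())
        k = int((~agrees_with_human[trusted]).sum())   # disagreements among trusted
        p_value = binom.cdf(k, n, alpha) if n > 0 else 1.0
        if p_value > delta:                            # cannot certify: stop the sequence
            break
        certified = lam                                # certified; try a looser threshold
    return certified


def cascaded_evaluate(instance, judges, thresholds, n_annotators=5):
    """Query judges cheapest-first and trust the first sufficiently confident one.

    Each judge carries its own calibrated threshold. If no judge in the cascade
    is confident enough, return None (abstain / defer to a human annotator).
    """
    for judge, lam in zip(judges, thresholds):
        pred, conf = simulated_annotators(judge, instance, n_annotators)
        if lam is not None and conf >= lam:
            return pred
    return None
```

The fixed-sequence ordering (from the most to the least conservative threshold) is what keeps the overall chance of a failed calibration at δ, so returning the loosest certified λ maximizes coverage without voiding the α-level agreement guarantee.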

Experimental Results

The framework was tested across multiple datasets, including summarization tasks (TL;DR) and real-world user interactions (Chatbot Arena and Auto-J). Key findings include:

  • TL;DR Dataset: Cascaded Selective Evaluation guarantees high human agreement across different levels of risk tolerance (α). Notably, it ensures over 80% human agreement while relying on GPT-4 for less than 43.5% of judgments, with more cost-effective models handling the rest (agreement and coverage are measured as in the snippet after this list).
  • Chatbot Arena: Under practical conditions involving varied LLM outputs, the system maintains its guarantee of high human agreement while significantly reducing API costs.
  • Robustness Under Distribution Shift: The approach retains high guarantee success rates even when calibration and test data have different distributions, underscoring its practicality in real-world settings.
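For reference, here is a minimal way to compute the two quantities reported above, assuming the natural definitions rather than quoting the paper's protocol: coverage is the fraction of test instances on which the cascade commits to a judgment rather than abstaining, and agreement is the fraction of those committed judgments that match the human preference.

```python
def coverage_and_agreement(predictions, human_labels):
    """predictions: one label per instance, or None where the cascade abstained."""
    committed = [(p, h) for p, h in zip(predictions, human_labels) if p is not None]
    coverage = len(committed) / len(predictions)
    agreement = sum(p == h for p, h in committed) / len(committed) if committed else float("nan")
    return coverage, agreement
```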

Implications

The implications of this research are twofold:

  1. Practical Implications: Cascaded Selective Evaluation offers a cost-effective method for deploying LLMs in large-scale evaluations. It demonstrates that weaker, cheaper models can contribute substantially to reliable evaluation when coupled with robust confidence estimation.
  2. Theoretical Implications: The introduction of Simulated Annotators offers a new dimension in confidence estimation, leveraging synthetic agreement to better align model outputs with human judgment. This method could be adapted to other tasks requiring reliable confidence measures.

Future Directions

Potential future developments include:

  • Extension to Other Evaluation Metrics: Extending the current framework to incorporate other evaluation metrics beyond human agreement, such as factual correctness for knowledge-intensive tasks.
  • Hybrid Models: Exploring hybrid techniques that combine Simulated Annotators with other uncertainty quantification methods to further enhance reliability.
  • Domain Adaptation: Adapting the calibration procedures to various domains and further testing under different distribution shifts to generalize the framework's applicability.

Conclusion

This paper presents Cascaded Selective Evaluation as a principled approach for reliable LLM-based evaluation, combining the efficiency of weaker models with the robustness required for high-stakes evaluation tasks. With rigorous theoretical backing and promising experimental results, this framework enhances our ability to trust automated model evaluations, significantly reducing dependency on the most advanced and expensive LLMs. This development marks a vital step towards scalable, yet reliable, AI-driven assessments.
