
Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences (2404.12272v1)

Published 18 Apr 2024 in cs.HC and cs.AI

Abstract: Due to the cumbersome nature of human evaluation and limitations of code-based evaluation, LLMs are increasingly being used to assist humans in evaluating LLM outputs. Yet LLM-generated evaluators simply inherit all the problems of the LLMs they evaluate, requiring further human validation. We present a mixed-initiative approach to "validate the validators" -- aligning LLM-generated evaluation functions (be it prompts or code) with human requirements. Our interface, EvalGen, provides automated assistance to users in generating evaluation criteria and implementing assertions. While generating candidate implementations (Python functions, LLM grader prompts), EvalGen asks humans to grade a subset of LLM outputs; this feedback is used to select implementations that better align with user grades. A qualitative study finds overall support for EvalGen but underscores the subjectivity and iterative process of alignment. In particular, we identify a phenomenon we dub criteria drift: users need criteria to grade outputs, but grading outputs helps users define criteria. What is more, some criteria appears dependent on the specific LLM outputs observed (rather than independent criteria that can be defined a priori), raising serious questions for approaches that assume the independence of evaluation from observation of model outputs. We present our interface and implementation details, a comparison of our algorithm with a baseline approach, and implications for the design of future LLM evaluation assistants.


Summary

  • The paper introduces EvalGen, a framework that uses LLMs to generate evaluation criteria, synthesize candidate assertion implementations, and iteratively adjust both to align more closely with human preferences.
  • It demonstrates that fewer, well-aligned assertions can achieve robust evaluation in diverse contexts such as medical records and e-commerce descriptions.
  • User studies reveal that continuous human feedback integration improves evaluation accuracy while highlighting challenges like criteria drift.

Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences

This paper introduces a unique approach to aligning LLM-generated evaluations with human preferences through mixed-initiative methods. The proposed system, EvalGen, aims to address the inherent challenges in effectively validating LLM outputs, especially when human feedback is costly and LLM evaluations themselves could be flawed. It presents a workflow to help users systematically tune LLM-assisted evaluations to reflect human evaluation criteria more closely.

Introduction to the Problem

Validating outputs from LLMs poses significant challenges due to their limitations in accuracy and consistency. Traditional methods heavily rely on human evaluation, but this approach is time-consuming and not scalable. The paper highlights the growing reliance on LLMs to evaluate other LLMs, a circular validation loop that necessitates solutions ensuring alignment with human evaluators. EvalGen is introduced as a solution—facilitating the creation, evaluation, and alignment of LLM-output assessment criteria.

The EvalGen Workflow

EvalGen is embedded within the ChainForge tool for prompt engineering, designed to aid users in developing evaluation criteria and corresponding assertions (either through code or via further LLM-generated prompts). Users engage with EvalGen through the following workflow:

  1. Criteria Generation: Using LLMs to generate potential evaluation criteria from user prompts. Users can accept these criteria or modify them to better suit their needs.
  2. Assertion Synthesis: EvalGen generates multiple implementations of each chosen criterion, allowing for both code-based and LLM-based assertions. The choice between code and LLM-based evaluation methods is available based on user preference.
  3. Grading Outputs: Users grade a sample of LLM outputs with a simple thumbs-up/thumbs-down rating that captures their evaluation preferences; this feedback drives which assertion implementations are selected for each criterion (see Figure 1).
  4. Alignment and Adjustment: EvalGen ranks the candidate assertions by how well they agree with the user's grades, prioritizing assertions that distinguish good outputs from bad ones as the user sees them (a minimal sketch of this selection step follows the figure below).

Figure 1: Typical Evaluation Pipeline
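Concretely, the alignment step boils down to scoring each candidate implementation of a criterion against the user's thumbs-up/thumbs-down grades and keeping the best-scoring one. The Python sketch below illustrates the idea under simplifying assumptions; the example criterion, the call_llm placeholder, and the graded outputs are invented for illustration and do not reproduce the paper's implementation.

```python
# Minimal sketch (not the paper's code): selecting, for one criterion, the
# candidate assertion that best agrees with a handful of human grades.

from typing import Callable, Dict, List

# --- Candidate implementations of an example criterion: "no markdown headers" ---

def assert_no_headers_regex(output: str) -> bool:
    """Code-based candidate: fail if any line starts with a markdown header."""
    return not any(line.lstrip().startswith("#") for line in output.splitlines())

def assert_no_headers_llm(output: str) -> bool:
    """LLM-based candidate: ask a grader model a yes/no question about the output."""
    prompt = (
        "Answer yes or no only. Does the following text avoid markdown headers?\n\n"
        f"{output}"
    )
    return call_llm(prompt).strip().lower().startswith("yes")

def call_llm(prompt: str) -> str:
    # Placeholder for an LLM API call; a real system would query a grader model here.
    return "yes"

# --- Human grades collected while the user spot-checks outputs (True = thumbs up) ---

graded_outputs: List[Dict] = [
    {"output": "Plain summary of the record.", "human_grade": True},
    {"output": "# Patient Summary\nDetails...", "human_grade": False},
    {"output": "Another acceptable answer.", "human_grade": True},
]

def agreement(assertion: Callable[[str], bool], grades: List[Dict]) -> float:
    """Fraction of graded outputs where the assertion's pass/fail matches the human grade."""
    hits = sum(assertion(g["output"]) == g["human_grade"] for g in grades)
    return hits / len(grades)

candidates = [assert_no_headers_regex, assert_no_headers_llm]
best = max(candidates, key=lambda fn: agreement(fn, graded_outputs))
print(f"Selected {best.__name__} "
      f"(agreement = {agreement(best, graded_outputs):.2f})")
```

In practice EvalGen handles many criteria at once and uses real LLM grader prompts rather than a stubbed call, but the selection principle is the same: pick the implementation whose pass/fail decisions most often match the human's grades.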

Methodology and Evaluation

EvalGen's efficacy was assessed on datasets for two distinct LLM pipeline tasks: processing medical records and generating e-commerce product descriptions. Using ground-truth annotations of output quality, the paper shows that EvalGen can reduce the number of assertions while maintaining (or even improving) alignment with human preferences, compared with existing approaches such as SPADE.
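This comparison turns on how alignment is measured. One simplified formulation, assumed here for illustration rather than taken from the paper's exact definitions, scores a selected assertion set by its coverage of outputs the human graded as bad and its false failure rate on outputs graded as good:

```python
# Illustrative sketch of set-level alignment metrics for a chosen assertion set.
# The assertion set and grades below are invented toy data.

from typing import Callable, Dict, List

def fails_any(assertions: List[Callable[[str], bool]], output: str) -> bool:
    """An output 'fails' the set if at least one assertion rejects it."""
    return any(not a(output) for a in assertions)

def coverage(assertions: List[Callable[[str], bool]], graded: List[Dict]) -> float:
    """Fraction of human-rejected (thumbs-down) outputs that the set also fails."""
    bad = [g for g in graded if not g["human_grade"]]
    return sum(fails_any(assertions, g["output"]) for g in bad) / max(len(bad), 1)

def false_failure_rate(assertions: List[Callable[[str], bool]], graded: List[Dict]) -> float:
    """Fraction of human-approved (thumbs-up) outputs that the set wrongly fails."""
    good = [g for g in graded if g["human_grade"]]
    return sum(fails_any(assertions, g["output"]) for g in good) / max(len(good), 1)

# Toy example: a length check and a placeholder-text check as the selected set.
selected = [
    lambda out: len(out) < 200,              # concise enough
    lambda out: "lorem" not in out.lower(),  # no placeholder text
]
graded = [
    {"output": "Short, clean product description.", "human_grade": True},
    {"output": "Lorem ipsum filler text...", "human_grade": False},
    {"output": "x" * 500, "human_grade": False},
]
print("coverage:", coverage(selected, graded))
print("false failure rate:", false_failure_rate(selected, graded))
```

Under such a formulation, an assertion set that fails every output would score perfect coverage but also a high false failure rate, which is why both quantities are needed when comparing assertion sets.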

Results of the System Evaluation

EvalGen demonstrated its effectiveness by providing better-aligned evaluations with fewer assertions, highlighting its strength in leveraging human-specified criteria. Notably, users could achieve greater coverage and alignment with their preferences, addressing a significant challenge in LLM-based evaluation pipelines. However, the paper highlights the challenge posed by criteria drift—where the definition of evaluation criteria evolves as users become more familiar with the outputs, necessitating iterative refinement.

User Study Insights

A user study involving industry practitioners revealed EvalGen's advantages and areas for improvement:

  • Initial Criteria Setting: Users appreciated the LLM-suggested criteria, though they often needed adjustment based on their unique task requirements.
  • Iterative Alignment: EvalGen's interface facilitated ongoing adjustments to criteria as users interacted with more LLM-generated outputs.
  • Diverse Implementation Approaches: The system's flexibility in handling both code-based and LLM-based assertion implementations suited different evaluation contexts and user strengths.

Figure 2: The workflow of our EvalGen prototype, from initial criteria generation to final assertion evaluation alignment.

Challenges and Future Directions

Despite its effectiveness, the paper identifies challenges with the alignment process, particularly criteria drift and the subjective nature of what constitutes "alignment." Future work should explore ways of handling evolving criteria definitions and more robust integration of human feedback cycles to continuously refine LLM evaluators.

Figure 3: The Table View, showing inputs, LLM outputs, and evaluation results per criterion for the NER task.

Conclusion

The paper's findings underscore the dynamic and iterative nature of aligning LLM-based evaluations with human judgment. EvalGen emerges as an early attempt to create tighter synergy between human evaluators and LLM-based evaluation systems. As LLM applications continue to expand, developing dependable, human-aligned evaluation methodologies will be crucial for their broader adoption and trustworthiness.

This work lays foundational principles for future research on mixed-initiative evaluation systems and offers a framework for examining human-in-the-loop evaluation and its application in AI.
