- The paper introduces EvalGen, a framework that uses LLMs to generate evaluation criteria, synthesize candidate assertions, and iteratively align them with human judgments.
- It demonstrates that a smaller set of well-aligned assertions can evaluate outputs robustly in diverse contexts such as medical records and e-commerce product descriptions.
- A user study shows that integrating continuous human feedback improves evaluation alignment while surfacing challenges such as criteria drift.
Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences
This paper introduces a mixed-initiative approach to aligning LLM-generated evaluations with human preferences. The proposed system, EvalGen, addresses the inherent difficulty of validating LLM outputs when human feedback is costly and LLM-based evaluations may themselves be flawed. It presents a workflow that helps users systematically tune LLM-assisted evaluations to reflect human evaluation criteria more closely.
Introduction to the Problem
Validating outputs from LLMs poses significant challenges because of their limitations in accuracy and consistency. Traditional methods rely heavily on human evaluation, which is time-consuming and does not scale. The paper highlights the growing reliance on LLMs to evaluate other LLMs, a circular validation loop that demands assurances of alignment with human evaluators. EvalGen is introduced as a solution that facilitates the creation, evaluation, and alignment of criteria for assessing LLM outputs.
The EvalGen Workflow
EvalGen is embedded within the ChainForge prompt-engineering tool and is designed to help users develop evaluation criteria and corresponding assertions, implemented either as code or as LLM-based evaluator prompts. Users engage with EvalGen through the following workflow:
- Criteria Generation: An LLM proposes candidate evaluation criteria based on the user's prompt. Users can accept these criteria or modify them to better suit their needs.
- Assertion Synthesis: EvalGen generates multiple candidate implementations of each chosen criterion, as either code-based or LLM-based assertions, and users can state a preference between the two evaluation methods.
- Grading Outputs: Users grade a sample of LLM outputs with a simple rating to capture their evaluation preferences. This feedback directly influences which assertion implementation is selected for each criterion.

- Alignment and Adjustment: EvalGen ranks the candidate assertions by how well they agree with the user's grades, prioritizing implementations that reliably distinguish good outputs from bad ones (a rough sketch of this synthesis-and-ranking loop follows below).

Figure 1: Typical Evaluation Pipeline
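To make the workflow concrete, below is a minimal Python sketch of steps 2 and 4: synthesizing candidate assertion implementations for a single criterion (one LLM-based, one code-based) and ranking them by agreement with a handful of human grades. The helper names (`call_llm`, `make_llm_assertion`, `rank_by_alignment`), the prompt wording, and the word-count check are hypothetical illustrations, not EvalGen's actual API.

```python
from typing import Callable, List, Tuple

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call; swap in a real client here."""
    raise NotImplementedError

def make_llm_assertion(criterion: str) -> Callable[[str], bool]:
    """LLM-based assertion: ask a grader model whether an output meets the criterion."""
    def check(output: str) -> bool:
        verdict = call_llm(
            f"Criterion: {criterion}\nOutput: {output}\n"
            "Does the output satisfy the criterion? Answer yes or no."
        )
        return verdict.strip().lower().startswith("yes")
    return check

def make_length_assertion(max_words: int = 100) -> Callable[[str], bool]:
    """Code-based assertion: a cheap deterministic check, here a word-count limit."""
    return lambda output: len(output.split()) <= max_words

def rank_by_alignment(
    candidates: List[Callable[[str], bool]],
    graded: List[Tuple[str, bool]],
) -> List[Tuple[float, Callable[[str], bool]]]:
    """Rank candidate assertions by agreement with human good/bad grades.

    `graded` holds (output, is_good) pairs collected from the user's ratings."""
    scored = []
    for assertion in candidates:
        agreement = sum(assertion(out) == is_good for out, is_good in graded)
        scored.append((agreement / len(graded), assertion))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

In practice EvalGen generates these candidate implementations automatically and only asks the user to grade outputs; the ranking idea is the same.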
Methodology and Evaluation
EvalGen's efficacy was assessed on datasets for two distinct LLM pipeline tasks: processing medical records and generating e-commerce product descriptions. Using ground-truth annotations for the outputs, the paper shows that EvalGen can reduce the number of assertions while maintaining, or even improving, alignment with human preferences compared to existing approaches such as SPADE.
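The alignment comparison favors assertion sets that flag outputs graded as bad while rarely failing outputs graded as good. Here is a minimal sketch of two such quantities, assuming straightforward "coverage" and "false failure rate" definitions; the paper's exact formulas and selection procedure may differ.

```python
def coverage_and_ffr(flagged: list, is_good: list) -> tuple:
    """Coverage: fraction of bad outputs that at least one selected assertion flagged.
    False failure rate (FFR): fraction of good outputs that were flagged anyway.

    `flagged[i]` is True if any selected assertion failed output i;
    `is_good[i]` is the human grade for output i (True = good)."""
    bad_flags = [f for f, g in zip(flagged, is_good) if not g]
    good_flags = [f for f, g in zip(flagged, is_good) if g]
    coverage = sum(bad_flags) / len(bad_flags) if bad_flags else 1.0
    ffr = sum(good_flags) / len(good_flags) if good_flags else 0.0
    return coverage, ffr

# Example: 5 graded outputs; assertions flagged outputs 0, 2, and 4.
flagged = [True, False, True, False, True]
is_good = [False, True, False, True, True]
print(coverage_and_ffr(flagged, is_good))  # -> (1.0, ~0.33)
```

Selecting assertions to maximize coverage while keeping the false failure rate low tracks the user's grades more closely than optimizing pass rates alone.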
Results of the System Evaluation
EvalGen provided better-aligned evaluations with fewer assertions, highlighting the value of grounding evaluators in human-specified criteria. Notably, users could achieve greater coverage and closer alignment with their preferences, addressing a significant challenge in LLM-based evaluation pipelines. The paper also highlights the challenge of criteria drift: users' definitions of their evaluation criteria evolve as they become more familiar with the outputs, necessitating iterative refinement.
User Study Insights
A user study involving industry practitioners revealed both EvalGen's advantages and areas for improvement.
Challenges and Future Directions
Despite its effectiveness, the paper identifies challenges with the alignment process, especially criteria drift and the subjective nature of what constitutes "alignment." Future work should explore ways to handle evolving criteria definitions and to integrate human feedback cycles more robustly so that LLM evaluators can be refined continuously.
Figure 3: The Table View, showing inputs, LLM outputs, and evaluation results per criterion for the NER task.
Conclusion
The paper's findings underscore the dynamic and iterative nature of aligning LLM-based evaluations with human judgments. EvalGen emerges as an early attempt at creating closer synergy between human evaluators and LLM-based systems. As LLM applications continue to expand, developing dependable, human-aligned evaluation methodologies will be crucial for their broader adoption and trustworthiness.
This work lays foundational principles for future research into mixed-initiative evaluation systems, offering a framework for examining human-in-the-loop evaluation systems and their application in AI.