
Abstract

Due to the cumbersome nature of human evaluation and the limitations of code-based evaluation, LLMs are increasingly being used to assist humans in evaluating LLM outputs. Yet LLM-generated evaluators simply inherit all the problems of the LLMs they evaluate, requiring further human validation. We present a mixed-initiative approach to "validate the validators": aligning LLM-generated evaluation functions (be they prompts or code) with human requirements. Our interface, EvalGen, provides automated assistance to users in generating evaluation criteria and implementing assertions. While generating candidate implementations (Python functions, LLM grader prompts), EvalGen asks humans to grade a subset of LLM outputs; this feedback is used to select implementations that better align with user grades. A qualitative study finds overall support for EvalGen but underscores the subjectivity and iterative nature of alignment. In particular, we identify a phenomenon we dub criteria drift: users need criteria to grade outputs, but grading outputs helps users define criteria. Moreover, some criteria appear dependent on the specific LLM outputs observed, rather than being independent criteria that can be defined a priori, raising serious questions for approaches that assume evaluation can be defined independently of observing model outputs. We present our interface and implementation details, a comparison of our algorithm with a baseline approach, and implications for the design of future LLM evaluation assistants.

Figure: Workflow of the EvalGen prototype, showing data inputs, algorithm steps, and output generation.

Overview

  • The EvalGen Interface, part of the open-source ChainForge system, employs a mixed-initiative approach to align LLM-generated evaluation functions with human preferences in evaluating LLM outputs.

  • EvalGen uses GPT-4 to suggest evaluation criteria and generate candidate assertions, which are then tested against LLM outputs, with feedback mechanisms allowing for iterative refinement based on user grading.

  • Compared to SPADE, an existing assertion-generation system, EvalGen achieved equal or better alignment with human grades while using fewer assertions, indicating the benefit of engaging users directly in shaping the evaluation criteria and functions.

  • Key user study insights reveal the value of human integration in the evaluation process, preference for control over evaluations, and support for iterative refinement, guiding future design of LLM evaluation tools.

Aligning LLM-Assisted Evaluation with Human Preferences: The EvalGen Interface

Overview

The EvalGen system presents a mixed-initiative approach to address the challenge of aligning LLM-generated evaluation functions with human preferences during the evaluation of LLM outputs. Built into the open-source ChainForge interface, EvalGen automates the creation and adjustment of evaluation criteria and associated evaluation functions, engaging users actively in this process to ensure the generated evaluators align closely with user-defined expectations.

System Design and Implementation

EvalGen integrates into the existing ChainForge system, expanding its functionalities to support interactive and automated evaluation of LLM outputs. The fundamental components of EvalGen include:

  • Criteria Suggestion: Using GPT-4, EvalGen proposes binary evaluation criteria based on the context provided by user inputs and the associated prompt.
  • Candidate Assertion Synthesis and Execution: For each criterion, EvalGen employs GPT-4 to generate multiple candidate assertions, which are executed asynchronously against LLM outputs. Candidates take both code-based and LLM-based forms (a sketch of both appears after this list).
  • Grading Sampler: This module samples LLM outputs for the user to grade, collecting binary (good/bad) feedback that updates running estimates of how well each candidate assertion aligns with the user's judgments.
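The paper describes candidate implementations as either Python functions or LLM grader prompts. The sketch below illustrates both forms for a hypothetical criterion ("the response must be valid JSON"); the function name, criterion, and prompt wording are illustrative assumptions, not artifacts produced by EvalGen itself.

```python
import json

# Hypothetical code-based assertion for the criterion "response must be valid JSON".
def assert_is_valid_json(llm_output: str) -> bool:
    """Return True if the LLM output parses as JSON."""
    try:
        json.loads(llm_output)
        return True
    except json.JSONDecodeError:
        return False

# Hypothetical LLM-based assertion: a grader prompt sent to a judge model
# (e.g., GPT-4) asking for a binary verdict on the same criterion.
GRADER_PROMPT = """You are grading an LLM response against one criterion.
Criterion: the response must be valid JSON with no surrounding prose.
Response:
{llm_output}
Answer with exactly one word: pass or fail."""
```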

Through this design, EvalGen facilitates a granular and user-engaged evaluation process, allowing for ongoing refinement of criteria and assertions based on real-time feedback.
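As a simplified illustration of how the grading sampler's feedback could drive selection, the following sketch scores each candidate assertion by its agreement with the user's binary grades on the sampled outputs and keeps the best-scoring candidate for a criterion. This is a minimal approximation of the idea; EvalGen's actual sampling and selection logic may differ.

```python
from typing import Callable, Dict, List

def select_assertion(
    candidates: Dict[str, Callable[[str], bool]],  # candidate name -> assertion function
    graded_outputs: List[str],                     # LLM outputs the user has graded
    human_grades: List[bool],                      # True = user marked the output "good"
) -> str:
    """Return the candidate whose pass/fail verdicts agree most with the user's grades."""
    best_name, best_agreement = "", -1.0
    for name, assertion in candidates.items():
        verdicts = [assertion(out) for out in graded_outputs]
        agreement = sum(v == g for v, g in zip(verdicts, human_grades)) / len(human_grades)
        if agreement > best_agreement:
            best_name, best_agreement = name, agreement
    return best_name
```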

Evaluation of the EvalGen System

Algorithm Performance

EvalGen's selection algorithm was benchmarked against SPADE, an existing system that automatically generates and selects assertions for LLM pipelines. EvalGen achieved equal or better alignment with human grades while selecting fewer assertions, underscoring the benefit of involving users directly in the criteria selection process, which enables more accurate tailoring of assertions to the nuances of the evaluation task.
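The exact alignment measure used in this comparison is not reproduced here. As a hedged sketch, one plausible way to score a selected set of assertions against human grades is to balance coverage (bad outputs that fail at least one assertion) against false failures (good outputs that fail), combining the two F1-style; the harmonic-mean formulation below is an illustrative assumption, not necessarily the paper's exact metric.

```python
from typing import Callable, List

def alignment_score(
    assertions: List[Callable[[str], bool]],  # each returns True if the output passes
    outputs: List[str],
    human_grades: List[bool],                 # True = user graded the output as good
) -> float:
    """Harmonic mean of coverage of bad outputs and non-failure of good outputs (assumed metric)."""
    fails = [any(not a(out) for a in assertions) for out in outputs]
    bad_caught = [f for f, good in zip(fails, human_grades) if not good]
    good_failed = [f for f, good in zip(fails, human_grades) if good]
    coverage = sum(bad_caught) / len(bad_caught) if bad_caught else 1.0
    false_failure_rate = sum(good_failed) / len(good_failed) if good_failed else 0.0
    ok_on_good = 1.0 - false_failure_rate
    return 0.0 if coverage + ok_on_good == 0 else 2 * coverage * ok_on_good / (coverage + ok_on_good)
```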

User Study Insights

A qualitative study with industry practitioners revealed strong support for EvalGen, highlighting its utility in easing the development of evaluation metrics. Key findings include:

  1. Iterative Refinement: Users appreciated being able to iteratively refine evaluation criteria and assertions. This was crucial as grading more outputs often led to new insights and adjustments in evaluation standards, a phenomenon termed "criteria drift."
  2. Integration of Human Judgment: The system’s mixed-initiative approach, which integrates human judgments directly into the evaluation loop, was particularly valued. Users felt involved in the process and believed this led to better alignment of automatic evaluations with their expectations.
  3. Preference for Control: Users expressed a preference for maintaining control over the evaluation process, particularly when defining or editing criteria and selecting from generated assertions.

Implications and Future Directions

The findings from implementing and studying EvalGen have several implications for the design of future LLM evaluation assistants:

  • Support for Iterative Processes: Evaluation systems should support iterative interactions, allowing users to refine criteria and re-evaluate assertions as new information becomes available or as their understanding of the task evolves.
  • Diverse Evaluation Needs: There is a need to balance automated evaluation with opportunities for user control and customization. This includes allowing users to select between different types of evaluations (code-based vs. LLM-based) and to directly influence the generation of assertions.
  • Scalability and Operationalization: Future systems should consider how assertions can be operationalized and scaled within production environments, addressing the dynamic nature of LLM applications and the continuous evolution of models and outputs.

In conclusion, EvalGen demonstrates a promising mixed-initiative approach to aligning LLM-generated evaluations with human preferences, offering valuable insights for the ongoing development of evaluation tools for generative AI.
