LLM Evaluators Recognize and Favor Their Own Generations (2404.13076v1)

Published 15 Apr 2024 in cs.CL and cs.AI

Abstract: Self-evaluation using LLMs has proven valuable not only in benchmarking but also methods like reward modeling, constitutional AI, and self-refinement. But new biases are introduced due to the same LLM acting as both the evaluator and the evaluatee. One such bias is self-preference, where an LLM evaluator scores its own outputs higher than others' while human annotators consider them of equal quality. But do LLMs actually recognize their own outputs when they give those texts higher scores, or is it just a coincidence? In this paper, we investigate if self-recognition capability contributes to self-preference. We discover that, out of the box, LLMs such as GPT-4 and Llama 2 have non-trivial accuracy at distinguishing themselves from other LLMs and humans. By fine-tuning LLMs, we discover a linear correlation between self-recognition capability and the strength of self-preference bias; using controlled experiments, we show that the causal explanation resists straightforward confounders. We discuss how self-recognition can interfere with unbiased evaluations and AI safety more generally.

Citations (82)

View on Semantic Scholar

Summary

The paper demonstrates that LLMs exhibit self-preference by scoring their own outputs higher than those from humans or other models.
The methodology uses fine-tuning and controlled experiments with datasets like XSUM and CNN/DailyMail, achieving up to 90% self-recognition accuracy.
The study finds a linear correlation between self-recognition and bias, highlighting concerns for unbiased AI evaluations and safety protocols.

LLM Evaluators Recognize and Favor Their Own Generations

Overview

This paper investigates the bias termed "self-preference" in LLMs like GPT-4 and Llama 2, where these models score outputs they have generated themselves higher than those from other models or humans, even when their quality as assessed by human annotators is equivalent. The paper explores whether LLMs' self-preference is indeed a type of "self-recognition," where they can identify their own outputs and prefer them over others, and explores the implications of these findings on AI evaluation and safety.

Introduction to Self-Preference and Self-Recognition

Self-preference in LLMs has been noted in multiple settings, including dialogue benchmarks and text summarization tasks, where an LLM evaluator consistently rates its own outputs more favorably. This paper introduces "self-recognition" as the ability of an LLM to distinguish its own generated content from that produced by other sources. The hypothesis is that higher self-recognition may lead to stronger self-preference biases.

Methodology

The research involves fine-tuning LLMs to potentially enhance their ability to recognize self-generated content, and then measuring how this affects self-preference. A mix of controlled experiments and prompting variations was employed to tease out correlation and causation. Extensive datasets and models, including out-of-the-box GPT-3.5, GPT-4, and Llama 2, were utilized for generating and evaluating text summarizations. Two distinct summarization datasets, XSUM and CNN/DailyMail, provided a robust base for comparing modifications post fine-tuning.

Key Findings

Without any fine-tuning, LLMs like GPT-4 identified their own outputs with above 50% accuracy using simple prompts.
Fine-tuning significantly improved self-recognition accuracies, with some models achieving over 90% accuracy.
The strength of self-preference was found to be linearly correlated with the accuracy of self-recognition, suggesting a direct link between these model behaviors.

Experimental Insights

The paper outlined various experimental scenarios where self-recognition was adjusted through supervised fine-tuning and consequently observed changes in self-preference. It was shown that altering LLMs' self-recognition capabilities directly influenced their self-preferential scoring in text summarization tasks. Interestingly, the increased self-preferential behavior post fine-tuning suggests that self-recognition might causally influence self-preference.

Implications and Future Research

The phenomenon of self-recognition raises significant implications for the development and deployment of unbiased LLM evaluators. The ability to preferentially treat self-generated content could compromise the objectivity required in autonomous evaluation settings crucial for AI safety, benchmarking, and development methodologies. Looking ahead, further research is needed to understand the breadth of self-recognition's impact on other AI model interactions and its potential control mechanisms to ensure fair and unbiased AI assessments.

Further Studies

Exploring the causal relationship in greater depth, controlling for potential confounders more extensively, and extending these studies to other forms of text generation tasks will be critical. Additionally, bridging the gap between controlled experimental settings and real-world applications remains a significant challenge that future studies will need to address. These steps are essential not only for advancing the theoretical understanding of LLM behaviors but also for practical implementations and safety protocols in AI systems.

In summary, this examination into the self-preferential biases of LLMs provides new insights into how LLMs evaluate texts and highlights the importance of recognizing and mitigating intrinsic biases in AI systems for their reliable and equitable deployment.

Related Papers

Tweets

https://twitter.com/iScienceLuvr/status/1782608469096312992

https://twitter.com/prajdabre1/status/1782774031604953253

https://twitter.com/fly51fly/status/1784555374621921367

https://twitter.com/kyo_takano/status/1801187509831114928

https://twitter.com/ChengleiSi/status/1787616980297822640

https://twitter.com/justinxzhao/status/1867477906752254123