Evaluating Large Language Model Biases in Persona-Steered Generation

Published 30 May 2024 in cs.CL (arXiv:2405.20253v1)

Abstract: The task of persona-steered text generation requires LLMs to generate text that reflects the distribution of views that an individual fitting a persona could have. People have multifaceted personas, but prior work on bias in LLM-generated opinions has only explored multiple-choice settings or one-dimensional personas. We define an incongruous persona as a persona with multiple traits where one trait makes its other traits less likely in human survey data, e.g. political liberals who support increased military spending. We find that LLMs are 9.7% less steerable towards incongruous personas than congruous ones, sometimes generating the stereotypical stance associated with its demographic rather than the target stance. Models that we evaluate that are fine-tuned with Reinforcement Learning from Human Feedback (RLHF) are more steerable, especially towards stances associated with political liberals and women, but present significantly less diverse views of personas. We also find variance in LLM steerability that cannot be predicted from multiple-choice opinion evaluation. Our results show the importance of evaluating models in open-ended text generation, as it can surface new LLM opinion biases. Moreover, such a setup can shed light on our ability to steer models toward a richer and more diverse range of viewpoints.

Citations (10)

Summary

  • The paper demonstrates that LLMs exhibit 9.7% lower steerability towards incongruous personas compared to congruous ones.
  • It uses Pew survey-based personas and fine-tuning methods like RLHF and DPO to assess model bias in political, racial, and gender stances.
  • Findings indicate that although fine-tuning improves steerability, it often restricts semantic diversity and reinforces societal biases.

Evaluating LLM Biases in Persona-Steered Generation

Introduction

The study investigates the ability of LLMs to generate text that reflects the views of individuals corresponding to specific personas, with a focus on incongruous personas. These are multifaceted personas wherein one trait decreases the likelihood of the other traits according to human survey data. The analysis highlights the LLMs' challenge in accurately steering towards such non-stereotypical personas, resulting in a tendency to default to stereotypical stances.

Figure 1: The process by which we construct personas from human data to evaluate LLM steerability. We find that LLMs are less steerable towards incongruous personas.

Methods

Persona-Steered Generation Setting

The research delineates a persona-steered statement generation task, where models generate statements reflective of specific personas sourced from Pew survey data. These personas consist of stances on political, racial, or gender-related issues that define the viewpoint of a prototypical individual.
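To make the congruous/incongruous distinction concrete, here is a minimal sketch of how personas might be classified from survey agreement rates. The data structure and threshold are assumptions for illustration; the numbers are invented, not actual Pew figures.

```python
from dataclasses import dataclass

# Hypothetical survey data: fraction of respondents in each demographic
# group who agree with each stance (illustrative numbers, not Pew data).
SURVEY_AGREEMENT = {
    ("liberal", "increase military spending"): 0.24,
    ("liberal", "support stricter gun laws"): 0.86,
    ("conservative", "increase military spending"): 0.71,
}

@dataclass(frozen=True)
class Persona:
    demographic: str   # e.g. "liberal"
    stance: str        # e.g. "increase military spending"

def is_incongruous(p: Persona, threshold: float = 0.5) -> bool:
    """A persona is incongruous when, in the survey data, fewer than
    `threshold` of people with the demographic hold the stance."""
    return SURVEY_AGREEMENT[(p.demographic, p.stance)] < threshold

incongruous = Persona("liberal", "increase military spending")
congruous = Persona("liberal", "support stricter gun laws")
```

A liberal who supports increased military spending is incongruous under this scheme because only a minority of surveyed liberals hold that stance, matching the paper's running example.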

Model Selection and Evaluation

Models in the Llama 2 family (fine-tuned via Reinforcement Learning from Human Feedback (RLHF)) and Tulu 2 family (fine-tuned using Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO)) were used. Additionally, OpenAI's GPT-3.5-Turbo was evaluated for its steerability and diversity of generated opinions.

Steerability Evaluation

Steerability towards personas was assessed with GPT-4 judgments, which were validated against labels from human crowdworkers and showed a high degree of correlation, supporting GPT-4's suitability as an evaluator for this task.
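The steerability measurement described above can be sketched as the fraction of generated statements that the judge labels as expressing the target stance. The `judge` callable here is a hypothetical keyword matcher standing in for the GPT-4 evaluator; the paper's actual prompt and scoring details are not reproduced here.

```python
from typing import Callable

def steerability(statements: list[str],
                 judge: Callable[[str], bool]) -> float:
    """Fraction of generated statements that the judge (GPT-4 in the
    paper; any binary classifier here) marks as on-target."""
    labels = [judge(s) for s in statements]
    return sum(labels) / len(labels)

# Toy judge for illustration only: keyword match standing in for GPT-4.
toy_judge = lambda s: "spending" in s.lower()
stmts = ["We must raise defense spending.",
         "Diplomacy should come first.",
         "Military spending protects us."]
print(round(steerability(stmts, toy_judge), 2))  # → 0.67
```

Comparing this score between congruous and incongruous personas yields the steerability gap the paper reports.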

Results and Discussion

Steerability towards Incongruous Personas

LLMs exhibit 9.7% lower steerability towards incongruous personas versus congruous personas, revealing a bias towards generating stereotypical content. RLHF-tuned models were the most steerable yet produced the least semantically diverse output, reflecting narrow views of each persona.

Figure 2: Mean steerability of Llama and Tulu models towards stances commonly associated with each demographic.

Fine-Tuning and Scale

RLHF and DPO fine-tuning enhance steerability, particularly towards politically liberal and female-associated stances. The investigation found that persona congruity significantly affects steerability, and that the gap grows as personas combine more traits.

Metrics Comparison

Evaluations showed that models fine-tuned with RLHF produced generations with reduced semantic diversity, suggesting a trade-off: fine-tuning improves steering fidelity at the cost of the richness of persona representation.

Implications and Future Work

The implications are significant: by defaulting to stereotypical stances, LLMs risk reinforcing societal divides and polarization, underscoring the need for persona modeling that can capture nuanced multi-attribute identities.

Further investigations are warranted into more complex, interactive LLM simulations and robust fine-tuning methodologies to mitigate existing biases and enhance the diversity of LLM outputs.

Conclusion

The analysis underscores deficiencies in LLMs when steering towards multifaceted, incongruous personas, highlighting their tendency to default to stereotypical stances. While fine-tuning enhances steerability, it often limits diversity, suggesting that future improvements should target both steering fidelity and semantic richness.


Authors (3)
