- The paper demonstrates that LLMs exhibit 9.7% lower steerability toward incongruous personas than toward congruous, stereotypical ones.
- It uses personas grounded in Pew survey data and compares models fine-tuned with RLHF, DPO, and SFT to assess model bias in political, racial, and gender stances.
- Findings indicate that although fine-tuning improves steerability, it often restricts semantic diversity and reinforces societal biases.
Evaluating LLM Biases in Persona-Steered Generation
Introduction
The paper investigates the ability of LLMs to generate text reflecting the views of individuals described by specific personas, with a focus on incongruous personas: multifaceted personas in which, according to human survey data, one trait makes the other trait less likely. The analysis shows that LLMs struggle to steer accurately toward such non-stereotypical personas and tend to default to stereotypical stances instead.
Figure 1: The process by which we construct personas from human data to evaluate LLM steerability. We find that LLMs are less steerable towards incongruous personas.
Methods
Persona-Steered Generation Setting
The research defines a persona-steered statement generation task, in which models generate statements reflecting specific personas sourced from Pew survey data. Each persona combines a demographic group with a stance on a political, racial, or gender-related issue, together defining the viewpoint of a prototypical individual.
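To make the setup concrete, here is a minimal sketch of how such a persona prompt might be constructed. The template wording and the `build_prompt` helper are illustrative assumptions, not the paper's exact prompt.

```python
# Hypothetical persona-steered generation prompt; the wording here is an
# assumption for illustration, not the prompt used in the paper.
PROMPT_TEMPLATE = (
    "Adopt the following persona: a {demographic} who believes that "
    "{stance}. Write a short statement (1-2 sentences) expressing "
    "your view on this issue."
)

def build_prompt(demographic: str, stance: str) -> str:
    """Fill the persona template with a demographic and a survey-derived stance."""
    return PROMPT_TEMPLATE.format(demographic=demographic, stance=stance)

# Example: an incongruous persona pairs a demographic with a stance that
# is statistically uncommon for that group in the survey data.
print(build_prompt(
    demographic="politically conservative person",
    stance="the government should do more to fight climate change",
))
```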
Model Selection and Evaluation
Models from the Llama 2 family, fine-tuned via Reinforcement Learning from Human Feedback (RLHF), and the Tulu 2 family, fine-tuned with Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO), were used. OpenAI's GPT-3.5-Turbo was also evaluated for steerability and diversity of generated opinions.
Steerability Evaluation
Steerability towards personas was assessed with GPT-4 as an automated evaluator. Its judgments were validated against labels from human crowdworkers and correlated strongly with them, supporting GPT-4's suitability for this evaluation task.
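A minimal sketch of this LLM-as-judge setup using the OpenAI Python client follows; the `judge_steerability` helper and its yes/no rubric are assumptions for illustration, not the paper's exact evaluation prompt.

```python
# Sketch of GPT-4-as-evaluator for stance agreement, using the OpenAI
# Python client; the judging rubric below is an assumption, not the
# paper's exact prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_steerability(statement: str, stance: str) -> bool:
    """Ask GPT-4 whether a generated statement expresses the target stance."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                f"Statement: {statement}\n"
                f"Target stance: {stance}\n"
                "Does the statement express the target stance? Answer YES or NO."
            ),
        }],
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")
```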
Results and Discussion
Steerability towards Incongruous Personas
LLMs exhibit 9.7% lower steerability towards incongruous personas than towards congruous ones, revealing a bias towards generating stereotypical content. RLHF-tuned models demonstrated the highest steerability but lacked semantic diversity, reflecting narrow views of each persona.
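Given per-statement judgments like those produced by the evaluator above, a persona's steerability can be aggregated as the fraction of statements judged to express the target stance. The sketch below shows this aggregation on toy data; the numbers are illustrative, not the paper's.

```python
# Toy aggregation of per-statement binary judgments into a steerability
# score; the judgment lists here are made-up examples, not real results.
def mean_steerability(judgments: list[bool]) -> float:
    """Fraction of generated statements judged to express the target stance."""
    return sum(judgments) / len(judgments)

congruous = [True, True, False, True, True]      # toy judgments
incongruous = [True, False, False, True, False]  # toy judgments
gap = mean_steerability(congruous) - mean_steerability(incongruous)
print(f"steerability gap: {gap:.1%}")
```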
Figure 2: Mean steerability of Llama and Tulu models towards stances commonly associated with each demographic.
Fine-Tuning and Scale
RLHF and DPO fine-tuning enhance steerability, particularly towards politically liberal and female-associated stances. The investigation found that persona congruity significantly affects steerability, and that more complex, multifaceted persona configurations accentuate the challenge.
Metrics Comparison
Evaluations showed that models fine-tuned with RLHF produced generations with reduced semantic diversity, suggesting these models trade richness of persona representation for consistency.
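One common way to quantify semantic diversity is one minus the mean pairwise cosine similarity of sentence embeddings. The sketch below uses the `sentence-transformers` library with an assumed embedding model; the paper's exact metric and model choice may differ.

```python
# Sketch of a semantic-diversity measure: one minus the mean pairwise
# cosine similarity of sentence embeddings. The embedding model choice
# is an assumption for illustration.
import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_diversity(statements: list[str]) -> float:
    """Higher values mean the set of statements is semantically more varied."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(statements, normalize_embeddings=True)
    sim = emb @ emb.T                     # cosine similarities (unit vectors)
    n = len(statements)
    upper = sim[np.triu_indices(n, k=1)]  # pairwise similarities, i < j
    return 1.0 - float(upper.mean())
```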
Implications and Future Work
The implications are significant: by defaulting to stereotypical stances, LLMs risk reinforcing societal divides and polarization, underscoring the need for persona modeling that captures nuanced, multi-attribute identities.
Further investigation is warranted into more complex, interactive LLM simulations and into fine-tuning methods that mitigate existing biases and increase the diversity of LLM outputs.
Conclusion
The analysis underscores deficiencies in LLMs when steering towards multifaceted, incongruous personas, highlighting a preference for generating stereotypical personas. While fine-tuning enhances steerability, it often limits diversity, suggesting future work should aim to maximize both semantic richness and steering fidelity.