Are Large Language Models Consistent over Value-laden Questions?

(arXiv:2407.02996)
Published Jul 3, 2024 in cs.CL and cs.AI

Abstract

LLMs appear to bias their survey answers toward certain values. Nonetheless, some argue that LLMs are too inconsistent to simulate particular values. Are they? To answer, we first define value consistency as the similarity of answers across (1) paraphrases of one question, (2) related questions under one topic, (3) multiple-choice and open-ended use-cases of one question, and (4) multilingual translations of a question to English, Chinese, German, and Japanese. We apply these measures to a few large (≥34B-parameter), open LLMs including llama-3, as well as gpt-4o, using eight thousand questions spanning more than 300 topics. Unlike prior work, we find that models are relatively consistent across paraphrases, use-cases, translations, and within a topic. Still, some inconsistencies remain. Models are more consistent on uncontroversial topics (e.g., in the U.S., "Thanksgiving") than on controversial ones ("euthanasia"). Base models are both more consistent than fine-tuned models and uniform in their consistency across topics, while fine-tuned models are more inconsistent about some topics ("euthanasia") than others ("women's rights"), like our human subjects (n=165).

Chat models' consistency varies by topic: they are more consistent on topics such as women's rights and income inequality than on more controversial ones.

Overview

  • The paper evaluates the consistency of LLMs when responding to value-laden questions, finding that they perform relatively consistently across different paraphrases, topics, translations, and use-cases.

  • The study highlights that LLMs show higher consistency for uncontroversial topics, while controversial topics like 'euthanasia' result in more variability. Fine-tuned models also tend to be less consistent than base models.

  • Future research directions include refining fine-tuning techniques, particularly for controversial issues, exploring fine-tuning for underrepresented languages and values, and improving the steerability of LLMs to specific value sets.

Are LLMs Consistent over Value-laden Questions?

The paper "Are LLMs Consistent over Value-laden Questions?" addresses the consistency of LLMs responding to questions that are influenced by values. Past literature has raised concerns around LLMs' biases and consistency; this paper evaluates the assumption that LLMs exhibit consistent value-laden responses.

Summary of Results

Consistency Across Measures:

  • Contrary to earlier studies citing inconsistency, the paper finds that LLMs are relatively consistent across paraphrases, topics, translations, and different use-cases.
  • Evaluations using the Jensen-Shannon divergence and a new d-dimensional divergence measure confirm that LLMs match or exceed the consistency of human subjects in responding to questions.
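
The paper's implementation is not reproduced in this summary, but the sketch below illustrates the kind of divergence-based consistency score described above: each paraphrase (or translation, or use-case) of a question is assumed to yield a distribution over answer options, e.g. estimated from sampled completions, and consistency is one minus the mean pairwise Jensen-Shannon divergence. The function names, example distributions, and use of scipy are illustrative assumptions, not the authors' code.

```python
# A minimal sketch (not the authors' released code): consistency as
# 1 minus the mean pairwise Jensen-Shannon divergence between a model's
# answer distributions for paraphrases (or translations, or use-cases)
# of the same value-laden question.
from itertools import combinations
import numpy as np
from scipy.stats import entropy  # entropy(p, q) computes KL(p || q)

def js_divergence(p, q, base=2):
    """Jensen-Shannon divergence between two discrete distributions (in [0, 1] with base 2)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    return 0.5 * entropy(p, m, base=base) + 0.5 * entropy(q, m, base=base)

def consistency(answer_dists):
    """1.0 means identical answer distributions; lower means less consistent."""
    pairwise = [js_divergence(p, q) for p, q in combinations(answer_dists, 2)]
    return 1.0 - float(np.mean(pairwise))

# Hypothetical answer distributions over (support, oppose, neutral)
# for three paraphrases of one question, e.g. estimated from samples.
paraphrase_dists = [[0.70, 0.20, 0.10],
                    [0.65, 0.25, 0.10],
                    [0.60, 0.30, 0.10]]
print(round(consistency(paraphrase_dists), 3))  # close to 1.0 -> consistent
```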

Influence of Controversial Topics:

  • Consistency is notably higher for uncontroversial topics compared to controversial ones, suggesting that the clarity and neutrality of a topic play a crucial role in model behavior.
  • Examples from the study show much lower inconsistency for topics like "women’s rights" compared to polarizing topics such as "euthanasia."

Base Models vs. Fine-tuned Models:

  • Fine-tuned models show more inconsistency on specific topics compared to their base model counterparts.
  • The authors report that these inconsistencies align with human subject results, where similar variability across topics is observed.

Use-case Consistency:

  • Models exhibit slightly less consistency in open-ended tasks than in multiple-choice tasks, but the effect is small, reinforcing the idea that LLMs are robust across formats.
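
As a rough sketch of how the two formats could be compared (the paper's own procedure for mapping open-ended text to answer options may differ), each free-text completion can be mapped onto its closest answer option, and the resulting empirical distribution scored against the multiple-choice distribution with the js_divergence helper from the earlier sketch. The keyword matcher and example texts below are purely hypothetical.

```python
from collections import Counter

def map_to_option(completion, option_keywords):
    """Crude stance mapping: pick the option whose keywords occur most often
    in the free-text completion (illustrative only; a trained classifier or
    an LLM judge would be more realistic)."""
    text = completion.lower()
    scores = {opt: sum(text.count(kw) for kw in kws)
              for opt, kws in option_keywords.items()}
    return max(scores, key=scores.get)

def open_ended_distribution(completions, option_keywords):
    """Empirical distribution over options implied by sampled completions."""
    options = list(option_keywords)
    counts = Counter(map_to_option(c, option_keywords) for c in completions)
    return [counts[o] / len(completions) for o in options]

# Hypothetical keyword lists for a support/oppose/neutral question.
keywords = {"support": ["agree", "should be allowed"],
            "oppose": ["disagree", "should not"],
            "neutral": ["depends", "unsure"]}
open_dist = open_ended_distribution(
    ["I agree it should be allowed.", "It really depends on the situation."],
    keywords)
mc_dist = [0.6, 0.2, 0.2]  # hypothetical multiple-choice distribution
# js_divergence comes from the earlier sketch; small values = consistent formats.
print(js_divergence(open_dist, mc_dist))
```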

Multilingual Consistency:

  • Across different languages, models show relatively consistent behavior, further substantiating the robustness of LLMs in handling multilingual tasks. English prompts yielded the most consistent responses, especially for U.S.-based topics.

Implications and Future Directions

Theoretical Insights:

  • Given the notable consistency of LLMs, the research solidifies the notion that LLMs can be used in applications requiring reliable value-based simulations and assessments.
  • However, the variability displayed in specific contexts, particularly on polarizing issues, suggests that there is still a gap to bridge for perfect alignment with human values.

Model Steerability:

  • A surprising finding is that LLMs are not effectively steerable to predefined value-sets like Schwartz’s values. Despite the large parameterization and supposed flexibility of these models, they often fail to significantly alter responses when prompted with specific value indicators.
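
The paper's steering prompts are not reproduced here; as a hedged illustration of the setup, one could prepend a short persona describing a Schwartz value to each question and check how far the answer distribution moves from the unsteered baseline (again using js_divergence from the first sketch). The persona wording and example numbers below are assumptions, not the authors' prompts or results.

```python
# Illustrative value personas (wording is hypothetical, not the paper's prompts).
SCHWARTZ_PERSONAS = {
    "benevolence": "You deeply value helping and caring for the people close to you.",
    "tradition": "You deeply value customs, religion, and long-standing traditions.",
    "self-direction": "You deeply value independent thought and choosing your own goals.",
}

def steered_prompt(value, question):
    """Prepend a value persona to the question before querying the model."""
    return f"{SCHWARTZ_PERSONAS[value]}\n\nQuestion: {question}\nAnswer:"

def steering_effect(unsteered_dist, steered_dist):
    """How far steering moves the answer distribution (js_divergence from the
    first sketch); a value near 0 means the persona was effectively ignored,
    which matches the paper's negative steerability finding."""
    return js_divergence(unsteered_dist, steered_dist)

print(steered_prompt("tradition", "Should euthanasia be legal?"))
# Hypothetical distributions over (support, oppose, neutral):
print(steering_effect([0.55, 0.35, 0.10], [0.57, 0.33, 0.10]))  # small shift
```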

Practical Implications:

  • The consistency observed suggests that LLMs can be of utility in structured environments where value-laden decision-making is pivotal, such as social surveys or automated essay scoring.
  • However, the limited steerability is a cautionary note against deploying LLMs in real-world, value-sensitive applications without rigorous scrutiny.

Future Research Directions:

  • The authors highlight the need for a granular approach in fine-tuning to mitigate identified inconsistencies, particularly focusing on controversial topics.
  • The results also call for exploration into fine-tuning specific datasets that cater to underrepresented languages and values, ensuring broader applicability and greater alignment across diverse global contexts.
  • Further investigation into how alignment data influences model behavior during training may pave the way for more sophisticated tuning strategies that enhance model consistency and steerability.

Concluding Remarks

The investigation into LLMs' consistency in value-laden domains offers a nuanced picture of their reliability. While base models exhibit higher and more uniform consistency, fine-tuned models, like the human subjects, vary in consistency from topic to topic. Despite these strengths, the steerability results highlight the difficulty of directing LLMs toward specific values. The study invites further research to refine alignment methodologies and expand consistency evaluations, so that LLMs can be deployed safely and reliably in value-sensitive tasks.
