Are Large Language Models Consistent over Value-laden Questions?

(arXiv:2407.02996)
Published Jul 3, 2024 in cs.CL and cs.AI

Abstract

LLMs appear to bias their survey answers toward certain values. Nonetheless, some argue that LLMs are too inconsistent to simulate particular values. Are they? To answer, we first define value consistency as the similarity of answers across (1) paraphrases of one question, (2) related questions under one topic, (3) multiple-choice and open-ended use-cases of one question, and (4) multilingual translations of a question to English, Chinese, German, and Japanese. We apply these measures to a few large (≥34B-parameter), open LLMs including llama-3, as well as gpt-4o, using eight thousand questions spanning more than 300 topics. Unlike prior work, we find that models are relatively consistent across paraphrases, use-cases, translations, and within a topic. Still, some inconsistencies remain. Models are more consistent on uncontroversial topics (e.g., in the U.S., "Thanksgiving") than on controversial ones ("euthanasia"). Base models are both more consistent than fine-tuned models and uniform in their consistency across topics, while fine-tuned models are more inconsistent about some topics ("euthanasia") than others ("women's rights"), like our human subjects (n=165).

Chat models' consistency varies by topic: they are more consistent on topics such as women's rights and income inequality than on more controversial ones.

Overview

  • The paper evaluates the consistency of LLMs when responding to value-laden questions, finding that they perform relatively consistently across different paraphrases, topics, translations, and use-cases.

  • The study highlights that LLMs show higher consistency for uncontroversial topics, while controversial topics like 'euthanasia' result in more variability. Fine-tuned models also tend to be less consistent than base models.

  • Future research directions include refining fine-tuning techniques, particularly for controversial issues, exploring fine-tuning for underrepresented languages and values, and improving the steerability of LLMs to specific value sets.

Are LLMs Consistent over Value-laden Questions?

The paper "Are LLMs Consistent over Value-laden Questions?" addresses the consistency of LLMs responding to questions that are influenced by values. Past literature has raised concerns around LLMs' biases and consistency; this paper evaluates the assumption that LLMs exhibit consistent value-laden responses.

Summary of Results

Consistency Across Measures:

  • Contrary to earlier studies citing inconsistency, the paper finds that LLMs are relatively consistent across paraphrases, topics, translations, and different use-cases.
  • Evaluations using the Jensen-Shannon divergence and a new d-dimensional divergence measure confirm that LLMs match or exceed the consistency of human subjects in responding to questions.
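
The paper's implementation is not reproduced in this summary, but the sketch below illustrates the kind of divergence-based consistency score described above: each paraphrase (or translation, or use-case) of a question is assumed to yield a distribution over answer options, e.g. estimated from sampled completions, and consistency is one minus the mean pairwise Jensen-Shannon divergence. The function names, example distributions, and use of scipy are illustrative assumptions, not the authors' code.

```python
# A minimal sketch (not the authors' released code): consistency as
# 1 minus the mean pairwise Jensen-Shannon divergence between a model's
# answer distributions for paraphrases (or translations, or use-cases)
# of the same value-laden question.
from itertools import combinations
import numpy as np
from scipy.stats import entropy  # entropy(p, q) computes KL(p || q)

def js_divergence(p, q, base=2):
    """Jensen-Shannon divergence between two discrete distributions (in [0, 1] with base 2)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    return 0.5 * entropy(p, m, base=base) + 0.5 * entropy(q, m, base=base)

def consistency(answer_dists):
    """1.0 means identical answer distributions; lower means less consistent."""
    pairwise = [js_divergence(p, q) for p, q in combinations(answer_dists, 2)]
    return 1.0 - float(np.mean(pairwise))

# Hypothetical answer distributions over (support, oppose, neutral)
# for three paraphrases of one question, e.g. estimated from samples.
paraphrase_dists = [[0.70, 0.20, 0.10],
                    [0.65, 0.25, 0.10],
                    [0.60, 0.30, 0.10]]
print(round(consistency(paraphrase_dists), 3))  # close to 1.0 -> consistent
```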

Influence of Controversial Topics:

  • Consistency is notably higher for uncontroversial topics compared to controversial ones, suggesting that the clarity and neutrality of a topic play a crucial role in model behavior.
  • Examples from the study show much lower inconsistency for topics like "women’s rights" compared to polarizing topics such as "euthanasia."

Base Models vs. Fine-tuned Models:

  • Fine-tuned models show more inconsistency on specific topics compared to their base model counterparts.
  • The authors report that these inconsistencies align with human subject results, where similar variability across topics is observed.

Use-case Consistency:

  • Models exhibit slightly less consistency in open-ended tasks than in multiple-choice tasks, but the effect is small, reinforcing the idea that LLMs are robust across formats.
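
As a rough sketch of how the two formats could be compared (the paper's own procedure for mapping open-ended text to answer options may differ), each free-text completion can be mapped onto its closest answer option, and the resulting empirical distribution scored against the multiple-choice distribution with the js_divergence helper from the earlier sketch. The keyword matcher and example texts below are purely hypothetical.

```python
from collections import Counter

def map_to_option(completion, option_keywords):
    """Crude stance mapping: pick the option whose keywords occur most often
    in the free-text completion (illustrative only; a trained classifier or
    an LLM judge would be more realistic)."""
    text = completion.lower()
    scores = {opt: sum(text.count(kw) for kw in kws)
              for opt, kws in option_keywords.items()}
    return max(scores, key=scores.get)

def open_ended_distribution(completions, option_keywords):
    """Empirical distribution over options implied by sampled completions."""
    options = list(option_keywords)
    counts = Counter(map_to_option(c, option_keywords) for c in completions)
    return [counts[o] / len(completions) for o in options]

# Hypothetical keyword lists for a support/oppose/neutral question.
keywords = {"support": ["agree", "should be allowed"],
            "oppose": ["disagree", "should not"],
            "neutral": ["depends", "unsure"]}
open_dist = open_ended_distribution(
    ["I agree it should be allowed.", "It really depends on the situation."],
    keywords)
mc_dist = [0.6, 0.2, 0.2]  # hypothetical multiple-choice distribution
# js_divergence comes from the earlier sketch; small values = consistent formats.
print(js_divergence(open_dist, mc_dist))
```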

Multilingual Consistency:

  • Across different languages, models show relatively consistent behavior, further substantiating the robustness of LLMs in handling multilingual tasks. English prompts yielded the most consistent responses, especially for U.S.-based topics.

Implications and Future Directions

Theoretical Insights:

  • Given the notable consistency of LLMs, the research solidifies the notion that LLMs can be used in applications requiring reliable value-based simulations and assessments.
  • However, the variability displayed in specific contexts, particularly on polarizing issues, suggests that there is still a gap to bridge for perfect alignment with human values.

Model Steerability:

  • A surprising finding is that LLMs are not effectively steerable to predefined value-sets like Schwartz’s values. Despite the large parameterization and supposed flexibility of these models, they often fail to significantly alter responses when prompted with specific value indicators.
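
The paper's steering prompts are not reproduced here; as a hedged illustration of the setup, one could prepend a short persona describing a Schwartz value to each question and check how far the answer distribution moves from the unsteered baseline (again using js_divergence from the first sketch). The persona wording and example numbers below are assumptions, not the authors' prompts or results.

```python
# Illustrative value personas (wording is hypothetical, not the paper's prompts).
SCHWARTZ_PERSONAS = {
    "benevolence": "You deeply value helping and caring for the people close to you.",
    "tradition": "You deeply value customs, religion, and long-standing traditions.",
    "self-direction": "You deeply value independent thought and choosing your own goals.",
}

def steered_prompt(value, question):
    """Prepend a value persona to the question before querying the model."""
    return f"{SCHWARTZ_PERSONAS[value]}\n\nQuestion: {question}\nAnswer:"

def steering_effect(unsteered_dist, steered_dist):
    """How far steering moves the answer distribution (js_divergence from the
    first sketch); a value near 0 means the persona was effectively ignored,
    which matches the paper's negative steerability finding."""
    return js_divergence(unsteered_dist, steered_dist)

print(steered_prompt("tradition", "Should euthanasia be legal?"))
# Hypothetical distributions over (support, oppose, neutral):
print(steering_effect([0.55, 0.35, 0.10], [0.57, 0.33, 0.10]))  # small shift
```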

Practical Implications:

  • The consistency observed suggests that LLMs can be of utility in structured environments where value-laden decision-making is pivotal, such as social surveys or automated essay scoring.
  • However, the limited steerability is a cautionary note against deploying LLMs in real-world, value-sensitive applications without rigorous scrutiny.

Future Research Directions:

  • The authors highlight the need for a granular approach in fine-tuning to mitigate identified inconsistencies, particularly focusing on controversial topics.
  • The results also call for exploration into fine-tuning specific datasets that cater to underrepresented languages and values, ensuring broader applicability and greater alignment across diverse global contexts.
  • Further investigation into how alignment data influences model behavior during training may pave the way for more sophisticated tuning strategies that enhance model consistency and steerability.

Concluding Remarks

The investigation into LLMs' consistency in value-laden domains offers a nuanced picture of their reliability. While base models exhibit higher and more uniform consistency, fine-tuned models, like the human subjects, vary in consistency from topic to topic. Despite these strengths, the steerability results highlight the difficulty of directing LLMs toward specific values. The study invites further research to refine alignment methodologies and expand consistency evaluations, so that LLMs can be deployed safely and reliably in value-sensitive tasks.
