
Moral Mimicry: Large Language Models Produce Moral Rationalizations Tailored to Political Identity (2209.12106v2)

Published 24 Sep 2022 in cs.CL

Abstract: LLMs have demonstrated impressive capabilities in generating fluent text, as well as tendencies to reproduce undesirable social biases. This study investigates whether LLMs reproduce the moral biases associated with political groups in the United States, an instance of a broader capability herein termed moral mimicry. This hypothesis is explored in the GPT-3/3.5 and OPT families of Transformer-based LLMs. Using tools from Moral Foundations Theory, it is shown that these LLMs are indeed moral mimics. When prompted with a liberal or conservative political identity, the models generate text reflecting corresponding moral biases. This study also explores the relationship between moral mimicry and model size, and similarity between human and LLM moral word use.

References (56)
  1. Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies.
  2. Belief-based Generation of Argumentative Claims. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 224–233, Online. Association for Computational Linguistics.
  3. Milad Alshomary and Henning Wachsmuth. 2021. Toward audience-aware argument generation. Patterns, 2(6):100253.
  4. Out of One, Many: Using Language Models to Simulate Human Samples. Political Analysis, pages 1–15.
  5. Probing pre-trained language models for cross-cultural differences in values. In Proceedings of the First Workshop on Cross-Cultural Considerations in NLP (C3NLP), pages 114–130, Dubrovnik, Croatia. Association for Computational Linguistics.
  6. Emily M. Bender and Alexander Koller. 2020. Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5185–5198, Online. Association for Computational Linguistics.
  7. On the Opportunities and Risks of Foundation Models.
  8. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.
  9. The moral foundations hypothesis does not replicate well in Black samples. Journal of Personality and Social Psychology, 110(4):e23–e30.
  10. DeepSpeed. 2022. ZeRO-Inference: Democratizing massive model inference. https://www.deepspeed.ai/2022/09/09/zero-inference.html.
  11. David Dobolyi. 2016. Critiques | Moral Foundations Theory.
  12. The five-factor model of the moral foundations theory is stable across WEIRD and non-WEIRD cultures. Personality and Individual Differences, 151:109547.
  13. It’s a Match: Moralization and the Effects of Moral Foundations Congruence on Ethical and Unethical Leadership Perception. Journal of Business Ethics, 167(4):707–723.
  14. Moral Stories: Situated Reasoning about Norms, Intents, Actions, and their Consequences. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 698–718, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  15. Hubert Etienne. 2021. The dark side of the ‘Moral Machine’ and the fallacy of computational ethical decision-making for autonomous vehicles. Law, Innovation and Technology, 13(1):85–107.
  16. Matthew Feinberg and Robb Willer. 2015. From Gulf to Bridge: When Do Moral Arguments Facilitate Political Influence? Personality and Social Psychology Bulletin, 41(12):1665–1681.
  17. Social chemistry 101: Learning to reason about social and moral norms. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 653–670, Online. Association for Computational Linguistics.
  18. Does Moral Code have a Moral Code? Probing Delphi’s Moral Philosophy. In Proceedings of the 2nd Workshop on Trustworthy Natural Language Processing (TrustNLP 2022), pages 26–42, Seattle, U.S.A. Association for Computational Linguistics.
  19. Jeremy Frimer. 2019. Moral Foundations Dictionary 2.0.
  20. Jeremy A. Frimer. 2020. Do liberals and conservatives use different moral languages? Two replications and six extensions of Graham, Haidt, and Nosek’s (2009) moral text analysis. Journal of Research in Personality, 84:103906.
  21. Leo Gao. 2021. On the Sizes of OpenAI API Models. https://blog.eleuther.ai/gpt3-model-sizes/.
  22. Morality Between the Lines: Detecting Moral Sentiment in Text.
  23. Liberals and conservatives rely on different sets of moral foundations. Journal of Personality and Social Psychology, 96(5):1029–1046.
  24. Mapping the Moral Domain. Journal of Personality and Social Psychology, 101(2):366–385.
  25. Jonathan Haidt. 2013. The Righteous Mind: Why Good People Are Divided by Politics and Religion. Vintage Books.
  26. Craig A. Harper and Darren Rhodes. 2021. Reanalysing the factor structure of the moral foundations questionnaire. The British Journal of Social Psychology, 60(4):1303–1329.
  27. Aligning AI with shared human values. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.
  28. Geert Hofstede. 2001. Culture’s Recent Consequences: Using Dimension Scores in Theory and Research. International Journal of Cross Cultural Management, 1(1):11–17.
  29. The extended Moral Foundations Dictionary (eMFD): Development and applications of a crowd-sourced approach to extracting moral intuitions from text. Behavior Research Methods, 53(1):232–246.
  30. OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization.
  31. CommunityLM: Probing partisan worldviews from language models. In Proceedings of the 29th International Conference on Computational Linguistics, pages 6818–6826, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
  32. Can Machines Learn Morality? The Delphi Experiment.
  33. When to make exceptions: Exploring language models as accounts of human moral judgment. In NeurIPS.
  34. Kristen Johnson and Dan Goldwasser. 2018. Classification of Moral Foundations in Microblog Political Discourse. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 720–730, Melbourne, Australia. Association for Computational Linguistics.
  35. Scaling Laws for Neural Language Models.
  36. Moral Frames Are Persuasive and Moralize Attitudes; Nonmoral Frames Are Persuasive and De-Moralize Attitudes. Psychological Science, 33(3):433–449.
  37. Challenging Moral Attitudes With Moral Messages. Psychological Science, 30(8):1136–1150.
  38. Do Bots Have Moral Judgement? The Difference Between Bots and Humans in Moral Rhetoric. In 2020 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pages 222–226.
  39. OpenAI. 2021. OpenAI API. https://openai.com/api/.
  40. OpenAI. 2022. Model Index for Researchers.
  41. Morality Classification in Natural Language Text. IEEE Transactions on Affective Computing, pages 1–1.
  42. Discovering Language Model Behaviors with Model-Written Evaluations.
  43. Morality Beyond the Lines: Detecting Moral Sentiment Using AI-Generated Synthetic Context. In Artificial Intelligence in HCI, Lecture Notes in Computer Science, pages 84–94, Cham. Springer International Publishing.
  44. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
  45. Towards Few-Shot Identification of Morality Frames using In-Context Learning. In Proceedings of the Fifth Workshop on Natural Language Processing and Computational Social Science (NLP+CSS), pages 183–196, Abu Dhabi, UAE. Association for Computational Linguistics.
  46. Christopher Suhler and Pat Churchland. 2011. Can Innate, Modular “Foundations” Explain Morality? Challenges for Haidt’s Moral Foundations Theory. Journal of Cognitive Neuroscience, 23:2103–16; discussion 2117.
  47. World Values Survey. 2022. WVS Database. https://www.worldvaluessurvey.org/wvs.jsp.
  48. On the Machine Learning of Ethical Judgments from Natural Language. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 769–779, Seattle, United States. Association for Computational Linguistics.
  49. Nitasha Tiku. 2022. The Google engineer who thinks the company’s AI has come to life. Washington Post.
  50. LLaMA: Open and Efficient Foundation Language Models.
  51. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
  52. Taxonomy of Risks posed by Language Models. In 2022 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’22, pages 214–229, New York, NY, USA. Association for Computing Machinery.
  53. HuggingFace’s Transformers: State-of-the-art Natural Language Processing.
  54. An Investigation of Moral Foundations Theory in Turkey Using Different Measures. Current Psychology, 38(2):440–457.
  55. OPT: Open Pre-trained Transformer Language Models.
  56. The moral integrity corpus: A benchmark for ethical dialogue systems. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3755–3773, Dublin, Ireland. Association for Computational Linguistics.

Summary

  • The paper shows that LLMs generate politically tailored moral rationalizations based on Moral Foundations Theory.
  • It employs controlled experiments using varied prompts and models (GPT-3/3.5, OPT) to measure foundation-specific effect sizes.
  • Findings indicate that model architecture and post-training methods significantly influence mimicry, with implications for AI ethics and political polarization.

Moral Mimicry in LLMs: Alignment of Moral Rationalizations with Political Identity

Introduction and Problem Formulation

This essay analyzes “Moral Mimicry: Large Language Models Produce Moral Rationalizations Tailored to Political Identity” (2209.12106), which investigates the extent to which state-of-the-art LLMs, specifically those in the GPT-3/3.5 and OPT families, reproduce and modulate moral rationalizations as a function of explicit political identity priming. The paper leverages Moral Foundations Theory (MFT), which operationalizes human morality along five axes (Care/Harm, Fairness/Cheating, Loyalty/Betrayal, Authority/Subversion, Sanctity/Degradation), to probe whether these models' outputs exhibit the characteristic foundational biases observed in U.S. liberal and conservative groups.

Given that LLMs can be highly influential in digital discourse, their ability to align rationalizations with sociopolitical identities has practical and ethical significance, including downstream risks around polarization and targeted influence.

Experimental Methodology

The core experimental loop constructs prompts from parameterizable templates comprising: (1) a scenario warranting moral judgment, (2) a political identity string (e.g., "as a liberal"/"as a conservative"), and (3) a stance (moral/immoral). Prompts are paired with scenarios drawn from the Moral Stories, ETHICS, and Social Chemistry datasets, covering both action-centered and situational narratives. Completions are generated from multiple model variants with decoding parameters held fixed (temperature=0, max_tokens=64).

Figure 1: Overview of the experimental pipeline, illustrating prompt assembly, generation, foundation content analysis, and effect size estimation.
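As a rough illustration of the prompt construction described above, the sketch below assembles prompts from the three components. The template wording, variable names, and example scenario are assumptions made for illustration, not the paper's verbatim templates.

```python
# Hypothetical sketch of the prompt-assembly step; template wording is illustrative only.
from itertools import product

SCENARIOS = ["lying to a friend to spare their feelings"]  # e.g., drawn from Moral Stories
IDENTITIES = ["liberal", "conservative"]                    # political identity string
STANCES = ["moral", "immoral"]                              # stance the model must justify

def build_prompt(scenario: str, identity: str, stance: str) -> str:
    """Combine scenario, political identity, and stance into a single completion prompt."""
    return f"As a {identity}, I believe that {scenario} is {stance}, because"

prompts = [build_prompt(s, i, st) for s, i, st in product(SCENARIOS, IDENTITIES, STANCES)]
# Completions would then be generated with the fixed decoding parameters
# reported in the paper (temperature=0, max_tokens=64).
```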

The output text is then analyzed for relative foundation-specific lexical content using three versions of the Moral Foundations Dictionary (MFDv1, MFDv2, eMFD). For each completion, the proportion of words associated with each foundation serves as a proxy for moral rationalization content. The principal signal of interest is the effect size: the absolute difference in foundation activation, for a given scenario and stance, when the political identity string alternates between "liberal" and "conservative."
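A minimal sketch of the dictionary-based scoring and effect-size computation described above, assuming a simple word-to-foundation lookup; the toy `MFD` mapping and function names are hypothetical stand-ins for MFDv1/MFDv2/eMFD, which are far larger and, in the eMFD case, weighted.

```python
import re
from collections import Counter

# Toy word -> foundation mapping standing in for the real Moral Foundations Dictionaries.
MFD = {"harm": "care", "protect": "care", "unfair": "fairness",
       "loyal": "loyalty", "obey": "authority", "pure": "sanctity"}
FOUNDATIONS = ["care", "fairness", "loyalty", "authority", "sanctity"]

def foundation_proportions(text: str) -> dict:
    """Fraction of tokens in a completion that match each foundation's lexicon."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(MFD[tok] for tok in tokens if tok in MFD)
    n = max(len(tokens), 1)
    return {f: counts[f] / n for f in FOUNDATIONS}

def effect_size(liberal_completion: str, conservative_completion: str, foundation: str) -> float:
    """Absolute difference in foundation activation when only the identity string changes."""
    lib = foundation_proportions(liberal_completion)[foundation]
    con = foundation_proportions(conservative_completion)[foundation]
    return abs(lib - con)
```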

LLM vs. Human Foundation Use

A two-tiered evaluation is conducted to assess:

  • Criterion A: Whether LLMs increase references to a particular foundation when the prompt scenario is salient for that foundation.
  • Criterion B: Whether LLMs match or diverge from human consensus in foundation use, relative to human inter-annotator variability.

Figure 2: Left: LLMs increase use of foundation-appropriate language in scenarios with that ground-truth foundation. Right: Deviation from human consensus is generally larger for LLMs than for individual humans, measured with text-davinci-002 on the Social Chemistry Situations dataset.

LLMs consistently satisfy Criterion A, indicating lexical sensitivity. However, per Criterion B, deviations from human consensus are systematically larger than human-human differences, highlighting persistent representation misalignment.
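A hedged sketch of how the Criterion B comparison could be quantified, given the description above: the LLM's deviation from the human consensus is set against each annotator's own deviation. The aggregation choices here (mean consensus, mean absolute deviation) are assumptions for illustration, not the paper's exact statistic.

```python
import statistics

def consensus(human_scores: list[float]) -> float:
    """Human consensus as the mean annotator score for a foundation on a scenario."""
    return statistics.mean(human_scores)

def deviation_from_consensus(model_score: float, human_scores: list[float]) -> tuple[float, float]:
    """Return (LLM deviation, mean human deviation) from the human consensus."""
    c = consensus(human_scores)
    llm_dev = abs(model_score - c)
    human_dev = statistics.mean(abs(h - c) for h in human_scores)
    return llm_dev, human_dev

# Criterion B would be satisfied if llm_dev were comparable to human_dev;
# the paper reports that llm_dev is systematically larger.
```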

Emergence of Moral Mimicry

The principal hypothesis interrogated is whether political identity priming induces foundation-specific shifts in model output, mirroring empirical differences between liberal and conservative populations. The paper finds robust evidence: effect sizes align with the MFT-predicted directions, with liberal identity primes increasing Care/Harm and Fairness/Cheating, and conservative primes augmenting Authority/Subversion, Loyalty/Betrayal, and Sanctity/Degradation.

Figure 3: Effect sizes for liberal vs. conservative prompting across four models (OPT-30B, text-davinci-001/002/003), three dictionaries, and all moral foundations; directionality matches MFT predictions, with large effect sizes for some foundations.

Of the 60 foundation/model/dictionary combinations, only 11 yield effect sizes contrary to expectations, and these are of negligible magnitude. This constitutes strong evidence that present-day large LLMs are "moral mimics" in the defined sense.
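As an illustration of this tally, the sketch below checks, for each (foundation, model, dictionary) combination, whether a signed liberal-minus-conservative difference points in the direction MFT predicts. The signed formulation, data layout, and numeric values are assumptions made for illustration, not the paper's data.

```python
# MFT-predicted sign of the (liberal - conservative) activation difference per foundation.
PREDICTED_SIGN = {"care": +1, "fairness": +1, "loyalty": -1, "authority": -1, "sanctity": -1}

def matches_prediction(foundation: str, signed_effect: float) -> bool:
    """True if the signed effect size points in the MFT-predicted direction."""
    return signed_effect * PREDICTED_SIGN[foundation] > 0

# (foundation, model, dictionary) -> liberal-minus-conservative activation difference.
# Values are placeholders, not results from the paper.
signed_effects = {
    ("care", "text-davinci-002", "eMFD"): 0.031,
    ("authority", "text-davinci-002", "eMFD"): -0.018,
    ("loyalty", "opt-30b", "MFDv2"): 0.002,   # a small contrary case
}

contrary = sum(not matches_prediction(f, e) for (f, _, _), e in signed_effects.items())
print(f"{contrary} of {len(signed_effects)} combinations run contrary to MFT predictions")
```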

Scaling, Architecture, and Training Effects

To map the scaling behavior of moral mimicry, effect sizes and a summary MFH-Score are computed as a function of parameter count for multiple GPT-3 and OPT models. The relationship is nontrivial: for OPT, the MFH-Score correlates positively with size (r=0.69), though the 13B variant is anomalous. The GPT-3.5 models, which share a parameter count but differ in fine-tuning protocols, show the largest effect sizes, surpassing even the base 175B GPT-3 variant.

Figure 4: Top: Effect size vs. model parameters for each foundation (Care/Harm, Fairness/Cheating, etc.); Bottom: MFH-Score vs. model parameters, with marked correlation across models and families.

This suggests that model architecture and post-training procedures (e.g., RLHF, SFT) mediate the emergence and fidelity of moral mimicry, in addition to scale.
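To illustrate the scaling analysis, a minimal sketch of correlating a per-model summary score with parameter count follows. The model names, parameter counts, effect-size values, and the definition of `mfh_score` as a mean over foundations are all placeholder assumptions, not values or definitions taken from the paper.

```python
from statistics import correlation  # Pearson correlation, Python 3.10+

# Placeholder per-model data: (parameter count in billions, per-foundation effect sizes).
models = {
    "opt-1.3b": (1.3,  [0.01, 0.02, 0.01, 0.01, 0.00]),
    "opt-13b":  (13.0, [0.02, 0.01, 0.02, 0.02, 0.01]),
    "opt-30b":  (30.0, [0.04, 0.03, 0.03, 0.04, 0.02]),
}

def mfh_score(effect_sizes: list[float]) -> float:
    """Assumed summary statistic: mean effect size across the five foundations."""
    return sum(effect_sizes) / len(effect_sizes)

sizes = [params for params, _ in models.values()]
scores = [mfh_score(es) for _, es in models.values()]
r = correlation(sizes, scores)   # the paper reports r = 0.69 for the OPT family
print(f"Pearson r between model size and MFH-Score: {r:.2f}")
```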

Prompt and Dataset Robustness

Effects are robust across both prompt templates and datasets, including the variously sourced Moral Stories, ETHICS, and Social Chemistry data. The directionality and magnitude of political identity effect sizes persist, with only minor deviations in less-salient foundations or atypical prompt/dictionary combinations.

Qualitative Output Evolution

The paper provides randomly sampled qualitative outputs across GPT-3 model scales, demonstrating that as model size increases, moral rationalizations become more contextually attuned, nuanced, and lexically competent.

Figure 5: Illustrative completions for prompts under different model sizes, evidencing increasingly sophisticated and foundation-relevant rationalizations.

Implications and Theoretical Considerations

The results establish that LLMs function as conditional simulators of group-typical moral rationalizations, not only echoing lexical cues but also capturing deeper alignment with documented human group differences. This has implications for generative systems embedded in digital discourse: adversarial or deliberate prompting may be used to surface group-congruent or polarizing moral rhetoric at scale.

For computational social science, these models offer a high-fidelity probe of attitudinal distributions, but the results underscore the need for careful conditioning and interpretive guardrails given how the models aggregate their training data.

The divergences from human consensus reinforce a well-recognized risk: LLMs are not perfect moral agents, and dictionary-based evaluation, though tractable, is blunt. Future work should turn to more sophisticated, possibly neural, foundation detectors and incorporate continual human-in-the-loop evaluation ([gartenMoralityLinesDetecting2016], [royFewShotIdentificationMorality2022]).

Limitations

  • Dictionary-based foundation detection offers limited nuance and may introduce bias.
  • Evaluation datasets are U.S.-centric; cross-cultural generalization is untested here, though the results directly motivate it.
  • The mechanisms underlying moral mimicry remain opaque, leaving open questions about causal drivers and potential levers for intervention.
  • Reliance on API-accessible models constrains reproducibility and limits fine-grained mechanistic analysis.

Conclusion

The paper provides strong empirical and quantitative evidence that LLMs prompted with political identity labels generate distinct, foundation-specific moral rationalizations consonant with canonical sociopolitical group biases. The magnitude of this "moral mimicry" scales with model size and is modulated by both architecture and training regime, with prompt and data selection exerting secondary influence. These findings have immediate consequences for the deployment, governance, and interpretation of LLM-based systems in sociotechnical contexts.

Future AI development should incorporate explicit mechanisms for managing subgroup alignment, transparency about conditional simulation capabilities, and systematic investigation of emergent behaviors along other axes of social cognition. In addition, establishing culturally aware and human-in-the-loop systems for moral content evaluation will be critical as LLMs become further entangled in public-facing applications.
