
Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies

(2208.10264)
Published Aug 18, 2022 in cs.CL, cs.AI, and cs.LG

Abstract

We introduce a new type of test, called a Turing Experiment (TE), for evaluating the extent to which a given language model, such as a GPT model, can simulate different aspects of human behavior. A TE can also reveal consistent distortions in a language model's simulation of a specific human behavior. Unlike the Turing Test, which involves simulating a single arbitrary individual, a TE requires simulating a representative sample of participants in human subject research. We carry out TEs that attempt to replicate well-established findings from prior studies. We design a methodology for simulating TEs and illustrate its use to compare how well different language models reproduce classic economic, psycholinguistic, and social psychology experiments: the Ultimatum Game, Garden Path Sentences, the Milgram Shock Experiment, and the Wisdom of Crowds. In the first three TEs, the existing findings were replicated using recent models, while the last TE reveals a "hyper-accuracy distortion" present in some language models (including ChatGPT and GPT-4), which could affect downstream applications in education and the arts.
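To make the methodology concrete, here is a minimal sketch of how one TE (the Ultimatum Game) might be simulated. It is illustrative only: the prompt wording, the name list, and `query_model` (a stand-in for a real LLM API call) are assumptions, not the paper's exact protocol. The key idea it demonstrates is that a TE varies the simulated participant (here, by name) and aggregates the model's responses, rather than simulating a single individual.

```python
# Illustrative sketch of a Turing Experiment (TE) for the Ultimatum Game.
# Assumptions (not from the abstract): the prompt wording, the name list,
# and query_model, a placeholder for any LLM completion API.

def query_model(prompt: str) -> str:
    """Stand-in for an LLM completion call; replace with a real API request.

    This dummy accepts every offer so the script runs end to end.
    """
    return "yes"

# Varying the simulated participant's name is what distinguishes a TE from
# a single-individual Turing Test: it approximates a sample of participants.
NAMES = ["Carlos", "Emily", "Priya", "James", "Mei", "Aisha"]

def acceptance_rate(offer: int, total: int = 10) -> float:
    """Fraction of simulated responders who accept `offer` out of `total`."""
    accepts = 0
    for name in NAMES:
        prompt = (
            f"{name} is the responder in an ultimatum game. The proposer "
            f"offers {name} ${offer} out of ${total} and keeps "
            f"${total - offer}. If {name} rejects, both players get nothing. "
            f"Does {name} accept? Answer yes or no.\nAnswer:"
        )
        reply = query_model(prompt).strip().lower()
        accepts += reply.startswith("yes")
    return accepts / len(NAMES)

if __name__ == "__main__":
    # Sweep offer sizes and report the simulated acceptance rate for each,
    # which can then be compared against human-subject results.
    for offer in range(0, 11):
        print(f"offer=${offer}: acceptance rate {acceptance_rate(offer):.2f}")
```

In this framing, replication means comparing the aggregate curve (e.g., acceptance rate as a function of offer size) against the distribution reported in the human-subject literature, rather than judging any single simulated response.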

