When LLMs Play the Telephone Game: Cultural Attractors as Conceptual Tools to Evaluate LLMs in Multi-turn Settings (2407.04503v3)

Published 5 Jul 2024 in physics.soc-ph, cs.AI, and cs.MA

Abstract: As LLMs start interacting with each other and generating an increasing amount of text online, it becomes crucial to better understand how information is transformed as it passes from one LLM to the next. While significant research has examined individual LLM behaviors, existing studies have largely overlooked the collective behaviors and information distortions arising from iterated LLM interactions. Small biases, negligible at the single output level, risk being amplified in iterated interactions, potentially leading the content to evolve towards attractor states. In a series of telephone game experiments, we apply a transmission chain design borrowed from the human cultural evolution literature: LLM agents iteratively receive, produce, and transmit texts from the previous to the next agent in the chain. By tracking the evolution of text toxicity, positivity, difficulty, and length across transmission chains, we uncover the existence of biases and attractors, and study their dependence on the initial text, the instructions, LLM, and model size. For instance, we find that more open-ended instructions lead to stronger attraction effects compared to more constrained tasks. We also find that different text properties display different sensitivity to attraction effects, with toxicity leading to stronger attractors than length. These findings highlight the importance of accounting for multi-step transmission dynamics and represent a first step towards a more comprehensive understanding of LLM cultural dynamics.


Summary

  • The paper introduces a transmission chain framework to analyze cumulative text transformations, revealing attractor states in LLM outputs.
  • The paper finds that open-ended tasks yield stronger attractor effects than constrained rephrasing, influencing text convergence.
  • The paper compares different LLMs to show how initial inputs and model choices drive varying semantic drift and bias evolution.

Cumulative Changes and Attractors in Iterated Cultural Transmissions of LLMs

The paper "When LLMs Play the Telephone Game: Cumulative Changes and Attractors in Iterated Cultural Transmissions" explores the understanding of how information evolves when transmitted repeatedly through LLMs. Specifically, the paper adopts a transmission chain design from human cultural evolution literature, aiming to uncover how LLM-generated texts change across multiple iterations and whether these changes reveal certain attractor states.

Research Objectives

The primary objective of this paper is to examine the collective behaviors and information distortions that arise from iterated interactions between LLMs. The work is motivated by the growing presence of LLM-generated content online, which makes it necessary to understand how content is transformed, and which biases emerge, over multi-turn interactions. The paper systematically tracks how properties such as toxicity, positivity, difficulty, and length evolve as texts pass through these iterated transmissions.

Methodology

The paper employs a transmission chain design in which LLM agents iteratively receive, produce, and transmit texts across a sequence of agents. This setup mimics the "telephone game": each agent's output becomes the next agent's input. The research tracks the following key metrics across multiple generations (a code sketch of how they might be computed follows the list):

  • Toxicity: Assessed using the Detoxify classifier, measuring the probability of a text being rude or harmful.
  • Positivity: Measured via sentiment analysis, providing a score from highly negative (-1.0) to highly positive (1.0).
  • Difficulty: Quantified using the Gunning-Fog index, indicating the years of formal education required to understand the text.
  • Length: Evaluated by the character count of generated texts.
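
For concreteness, here is a minimal sketch of how these four measures might be computed, assuming the `detoxify`, `vaderSentiment`, and `textstat` Python packages. The paper names Detoxify and the Gunning-Fog index; the specific sentiment tool and this exact pipeline are assumptions.

```python
# A minimal sketch of the four tracked properties. The paper names Detoxify
# and the Gunning-Fog index; the sentiment tool used here is an assumption.
from detoxify import Detoxify
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import textstat

toxicity_model = Detoxify("original")
sentiment = SentimentIntensityAnalyzer()

def text_properties(text: str) -> dict:
    """Compute the four properties tracked along a transmission chain."""
    return {
        "toxicity": toxicity_model.predict(text)["toxicity"],       # probability of rude/harmful content
        "positivity": sentiment.polarity_scores(text)["compound"],  # -1.0 (negative) to 1.0 (positive)
        "difficulty": textstat.gunning_fog(text),                   # years of formal education needed
        "length": len(text),                                        # character count
    }
```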

Experimental Setup

Initial human-generated texts from various sources, including scientific abstracts, news articles, and social media posts, served as the starting point for transmission chains. The tasks assigned to the LLMs were:

  1. Rephrase: Paraphrasing the text without changing its meaning.
  2. Take Inspiration: Creating a new text inspired by the input.
  3. Continue: Continuing the provided text.

Five different LLMs (GPT-3.5-turbo-0125, Llama3-8B-Instruct, Mistral-7B-Instruct-v0.2, Llama3-70B-Instruct, and Mixtral-8x7B-Instruct-v0.1) were used to observe how model choice and model size influence the evolution of text properties.
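
The chain itself reduces to a simple loop in which each model output becomes the next prompt's input. Below is a hedged sketch assuming an OpenAI-compatible chat API; the prompt wordings and chain length are illustrative, not the paper's exact instructions.

```python
# A hedged sketch of one transmission chain ("telephone game"), assuming an
# OpenAI-compatible chat API. Prompts are illustrative, not the paper's own.
from openai import OpenAI

client = OpenAI()

TASK_PROMPTS = {
    "rephrase": "Rephrase the following text without changing its meaning:",
    "take_inspiration": "Write a new text, taking inspiration from the following one:",
    "continue": "Continue the following text:",
}

def run_chain(seed_text: str, task: str, generations: int = 50,
              model: str = "gpt-3.5-turbo-0125") -> list[str]:
    """Pass a seed text through `generations` successive LLM agents."""
    chain = [seed_text]
    for _ in range(generations):
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user",
                       "content": f"{TASK_PROMPTS[task]}\n\n{chain[-1]}"}],
        )
        chain.append(response.choices[0].message.content)  # output feeds the next agent
    return chain
```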

Key Findings

Property Evolution Beyond Single-Turn Transmissions

The findings indicate that textual properties continue to evolve well beyond the first transmission step, particularly in less constrained tasks such as "Take Inspiration" and "Continue." This underscores that single-turn interaction studies cannot capture the full extent of LLM dynamics.

Strength and Position of Attractors

Using linear regressions, the paper identifies attractors: equilibrium points toward which text properties tend to converge (see the sketch after this list). It was found that:

  • Attractors for different properties vary in strength and position.
  • Toxicity converges strongly towards low values (close to zero).
  • Positivity and difficulty exhibit task and model-dependent attractor positions and strengths.
  • More open-ended tasks (e.g., "Continue") result in stronger attractors than constrained tasks (e.g., "Rephrase").
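
As a sketch of the regression logic, one common way to locate an attractor is to regress the per-generation change of a property on its current value: a negative slope indicates attraction, and the fixed point where the predicted change is zero gives the attractor's position. The specification below is an illustration of this idea, not necessarily the paper's exact model.

```python
# A sketch of attractor estimation via linear regression: regress the change
# in a property between generations on its current value. This specification
# is an assumption; the paper's exact regression model may differ.
import numpy as np

def estimate_attractor(values: np.ndarray) -> tuple[float, float]:
    """values: one property measured along a chain, shape (generations,).

    Returns (position, strength): the fixed point where the expected change
    is zero, and the magnitude of the pull toward it.
    """
    x = values[:-1]       # property at generation t
    dy = np.diff(values)  # change from generation t to t+1
    slope, intercept = np.polyfit(x, dy, 1)
    strength = -slope     # > 0 means values are pulled toward the fixed point
    position = -intercept / slope if slope != 0 else float("nan")
    return position, strength
```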

Convergence and Divergence in Chains

The paper shows that the degree of semantic similarity among the texts in different chains varies: some models push chains to converge toward common semantic content, while others let them diverge, with the outcome depending heavily on the initial text and the specific task.
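
One way to quantify such convergence is to embed the texts produced by different chains at a given generation and track their average pairwise cosine similarity over generations. The sketch below assumes the `sentence-transformers` library and a particular embedding model; neither is confirmed as the paper's exact choice.

```python
# A hedged sketch of measuring semantic convergence across chains with
# sentence embeddings; the library and model choice here are assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def mean_pairwise_similarity(texts: list[str]) -> float:
    """Average cosine similarity between all pairs of texts (e.g., the texts
    produced by different chains at the same generation)."""
    emb = encoder.encode(texts, normalize_embeddings=True)  # unit-norm vectors
    sims = emb @ emb.T                                      # cosine similarity matrix
    iu = np.triu_indices(len(texts), k=1)                   # pairs above the diagonal
    return float(sims[iu].mean())
```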

Implications and Future Directions

The implications of these findings are multifaceted. Theoretically, they contribute to our understanding of LLM cultural dynamics and offer insights into how biases and attractors emerge in multi-turn interactions. Practically, they can inform the development of more robust and bias-aware LLMs, especially in settings that involve multi-agent interaction or iterated content generation. Future research may explore more complex network interactions, heterogeneous agent populations, and hybrid networks of humans and LLMs to simulate more realistic interaction scenarios.

Conclusion

This paper represents a significant step towards understanding iterated cultural transmission among LLMs. By leveraging a robust methodological framework from cultural evolution studies, the researchers provide a nuanced analysis of how textual properties evolve and of the associated attractor states. These insights are valuable for designing and regulating LLMs in real-world applications that involve extensive iterative interactions.
