Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy (2402.19379v6)
Abstract: Human forecasting accuracy in practice relies on the 'wisdom of the crowd' effect, in which predictions about future events are significantly improved by aggregating across a crowd of individual forecasters. Past work on the forecasting ability of LLMs suggests that frontier LLMs, as individual forecasters, underperform compared to the gold standard of a human crowd forecasting tournament aggregate. In Study 1, we expand this research by using an LLM ensemble approach consisting of a crowd of twelve LLMs. We compare the aggregated LLM predictions on 31 binary questions to those of a crowd of 925 human forecasters from a three-month forecasting tournament. Our preregistered main analysis shows that the LLM crowd outperforms a simple no-information benchmark and is not statistically different from the human crowd. In exploratory analyses, we find that these two approaches are equivalent with respect to medium-effect-size equivalence bounds. We also observe an acquiescence effect, with mean model predictions being significantly above 50%, despite an almost even split of positive and negative resolutions. Moreover, in Study 2, we test whether LLM predictions (of GPT-4 and Claude 2) can be improved by drawing on human cognitive output. We find that both models' forecasting accuracy benefits from exposure to the median human prediction as information, improving accuracy by between 17% and 28%, though this still leads to less accurate predictions than simply averaging human and machine forecasts. Our results suggest that LLMs can achieve forecasting accuracy rivaling that of human crowd forecasting tournaments via the simple, practically applicable method of forecast aggregation. This replicates the 'wisdom of the crowd' effect for LLMs and opens up their use for a variety of applications throughout society.
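The forecast-aggregation step described in the abstract is simple to sketch. The snippet below is a minimal illustration, not the paper's actual pipeline: it assumes the ensemble forecast is the per-question median of twelve models' probabilities, that accuracy is scored with the Brier score (mean squared error between probability and binary outcome, a standard metric for such forecasts), and all data, variable names, and the 50/50 human-machine averaging weight are placeholders chosen for illustration.

```python
import numpy as np

# Illustrative sketch of crowd aggregation and scoring; the forecasts and
# outcomes below are random placeholders, not the paper's data.

rng = np.random.default_rng(0)
n_models, n_questions = 12, 31

# forecasts[i, j]: probability model i assigns to question j resolving "yes"
forecasts = rng.uniform(0.05, 0.95, size=(n_models, n_questions))

# outcomes[j]: 1 if question j resolved "yes", 0 otherwise (placeholder values)
outcomes = rng.integers(0, 2, size=n_questions)

def brier(prob, outcome):
    """Mean Brier score: squared error between forecast probability and outcome."""
    return np.mean((prob - outcome) ** 2)

# "Silicon crowd" aggregate: per-question median across the model ensemble
llm_crowd = np.median(forecasts, axis=0)

# Compare the aggregate against the average individual model
individual_scores = [brier(forecasts[i], outcomes) for i in range(n_models)]
print(f"mean individual Brier score:      {np.mean(individual_scores):.3f}")
print(f"aggregated (median) Brier score:  {brier(llm_crowd, outcomes):.3f}")

# Study 2-style hybrid: average a machine forecast with a human crowd forecast
human_crowd = rng.uniform(0.05, 0.95, size=n_questions)  # placeholder human medians
hybrid = 0.5 * (llm_crowd + human_crowd)
print(f"hybrid (human/machine) Brier score: {brier(hybrid, outcomes):.3f}")
```

With random placeholder inputs the aggregate will not reliably beat the individual models; the sketch only shows the mechanics of median aggregation, Brier scoring, and human-machine averaging under the assumptions stated above.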