
An Evaluation of Estimative Uncertainty in Large Language Models (2405.15185v1)

Published 24 May 2024 in cs.CL, cs.AI, and cs.HC

Abstract: Words of estimative probability (WEPs), such as "maybe" or "probably not", are ubiquitous in natural language for communicating estimative uncertainty, in contrast to direct statements involving numerical probability. Human estimative uncertainty, and its calibration with numerical estimates, has long been an area of study, including by intelligence agencies like the CIA. This study compares estimative uncertainty in commonly used LLMs, such as GPT-4 and ERNIE-4, to that of humans, and to each other. Here we show that LLMs like GPT-3.5 and GPT-4 align with human estimates for some, but not all, WEPs presented in English. Divergence is also observed when the LLMs are presented with gendered roles and Chinese contexts. Further study shows that an advanced LLM like GPT-4 can consistently map between statistical and estimative uncertainty, but a significant performance gap remains. The results contribute to a growing body of research on human-LLM alignment.
