Abstract

This paper proposes a framework for quantitatively evaluating interactive LLMs such as ChatGPT using publicly available data sets. We carry out an extensive technical evaluation of ChatGPT using 23 data sets covering 8 different common NLP application tasks. We evaluate the multitask, multilingual, and multimodal aspects of ChatGPT based on these data sets and a newly designed multimodal data set. We find that ChatGPT outperforms LLMs with zero-shot learning on most tasks and even outperforms fine-tuned models on some tasks. We find that it is better at understanding non-Latin script languages than generating them. It is able to generate multimodal content from textual prompts via an intermediate code generation step. Moreover, we find that ChatGPT is 63.41% accurate on average in 10 different reasoning categories under logical reasoning, non-textual reasoning, and commonsense reasoning, making it an unreliable reasoner. It is, for example, better at deductive than inductive reasoning. ChatGPT suffers from hallucination problems like other LLMs, and it generates more extrinsic hallucinations from its parametric memory because it does not have access to an external knowledge base. Finally, the interactive feature of ChatGPT enables human collaboration with the underlying LLM to improve its performance, i.e., by 8% ROUGE-1 on summarization and 2% ChrF++ on machine translation, in a multi-turn "prompt engineering" fashion. We also release the codebase for evaluation set extraction.
