
A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets

(2305.18486)
Published May 29, 2023 in cs.CL, cs.AI, and cs.LG

Abstract

The development of LLMs such as ChatGPT has attracted considerable attention recently. However, their evaluation on academic benchmark datasets remains under-explored, owing to the difficulty of scoring the generative outputs these models produce against the ground truth. In this paper, we present a thorough evaluation of ChatGPT's performance on diverse academic datasets, covering tasks such as question answering, text summarization, code generation, commonsense reasoning, mathematical problem solving, machine translation, bias detection, and ethical considerations. Specifically, we evaluate ChatGPT across 140 tasks and analyze the 255K responses it generates on these datasets, making this the largest evaluation of ChatGPT on NLP benchmarks to date. In short, our study aims to identify the strengths and weaknesses of ChatGPT across a variety of tasks and to provide insights for future research using LLMs. We also report a newly observed emergent ability to follow multi-query instructions, which we found mostly in ChatGPT and other instruction-tuned models. Our extensive evaluation shows that although ChatGPT can perform a wide variety of tasks and obtains impressive performance on several benchmark datasets, it is still far from reliably solving many challenging tasks. By providing a thorough assessment of ChatGPT's performance across diverse NLP tasks, this paper sets the stage for the targeted deployment of ChatGPT-like LLMs in real-world applications.
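To make the evaluation difficulty concrete: a generative model typically returns a free-form sentence, while a benchmark stores a short gold span, so naive string comparison under-credits correct answers. Below is a minimal sketch (not the authors' pipeline; the example response and gold answer are hypothetical) of the exact-match and token-level F1 scoring conventions used by extractive QA benchmarks such as SQuAD:

import re
import string
from collections import Counter

def normalize(text: str) -> str:
    # Lowercase, strip punctuation and English articles, collapse whitespace.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> bool:
    # Strict equality after normalization; penalizes verbose answers.
    return normalize(prediction) == normalize(reference)

def token_f1(prediction: str, reference: str) -> float:
    # Harmonic mean of token precision and recall; gives partial credit.
    pred = normalize(prediction).split()
    ref = normalize(reference).split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# Hypothetical generative response vs. a benchmark's short gold answer:
response = "The capital of France is Paris."
gold = "Paris"
print(exact_match(response, gold))        # False: correct but verbose
print(round(token_f1(response, gold), 2)) # 0.33: only partial credit

Even the partial-credit F1 metric heavily under-scores this correct but verbose response, which illustrates why evaluating generative outputs against benchmark ground truth is non-trivial at the scale the paper reports.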

