Emergent Mind

Abstract

ChatGPT is a powerful LLM that covers knowledge resources such as Wikipedia and supports natural language question answering using its own knowledge. Therefore, there is growing interest in exploring whether ChatGPT can replace traditional knowledge-based question answering (KBQA) models. Although there have been some works analyzing the question answering performance of ChatGPT, there is still a lack of large-scale, comprehensive testing of various types of complex questions to analyze the limitations of the model. In this paper, we present a framework that follows the black-box testing specifications of CheckList proposed by Ribeiro et al. We evaluate ChatGPT and its family of LLMs on eight real-world KB-based complex question answering datasets, which include six English datasets and two multilingual datasets. The total number of test cases is approximately 190,000. In addition to the GPT family of LLMs, we also evaluate the well-known FLAN-T5 to identify commonalities between the GPT family and other LLMs. The dataset and code are available at https://github.com/tan92hl/Complex-Question-Answering-Evaluation-of-GPT-family.git

Figure: variation in error rates and QA accuracy for the GPT and T5 family LLMs as the threshold changes.

Overview

  • The paper investigates ChatGPT's potential to surpass traditional KBQA systems in answering complex questions using a newly proposed evaluation framework.

  • The framework includes new labeling strategies for questions and a more nuanced version of the exact match (EM) method for assessing LLM QA performance (a sketch of this relaxed matching appears after this list).

  • The study tests six English and two multilingual complex question answering (CQA) datasets, totaling around 190,000 test cases, to evaluate the QA abilities of LLMs.

  • Despite ChatGPT's strengths, it is not universally superior to state-of-the-art models and shows a plateau in multilingual capabilities.

  • Improvements such as chain-of-thought prompting are suggested, and further exploration in diverse domains and models is recommended for advancing AI-driven QA systems.
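
As a rough illustration of the relaxed exact-match idea mentioned above, the Python sketch below is a simplified assumption rather than the authors' actual scoring code: it treats a prediction as correct when any normalized gold alias appears inside the model's free-form answer.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop articles and punctuation, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def relaxed_exact_match(prediction: str, gold_aliases: list[str]) -> bool:
    """Count a free-form LLM answer as correct when any normalized gold
    alias is contained in the normalized prediction (looser than strict EM)."""
    pred = normalize(prediction)
    return any(normalize(alias) in pred for alias in gold_aliases if alias)


# A verbose ChatGPT-style answer still counts as a match for the gold alias.
print(relaxed_exact_match("The capital of France is Paris, of course.", ["Paris"]))  # True
```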

Introduction

In the realm of AI and NLP, ChatGPT, a powerful LLM, has sparked interest due to its potential to leverage extensive knowledge from resources like Wikipedia for question answering (QA) tasks. This prompts an intriguing exploration into whether it could supersede traditional Knowledge-Based Question Answering (KBQA) systems.

Evaluation Framework and Methodology

The study proposes an evaluation framework inspired by previous methodologies, including CheckList, to assess LLMs' QA capabilities, with a particular focus on complex questions. The framework labels questions from the compiled datasets so that their features can be analyzed uniformly, and it refines the exact match (EM) method for more nuanced evaluation. Three test types probe different aspects of model behavior: the minimal functionality test (MFT) measures basic ability on individual reasoning skills, the invariance test (INV) checks stability under answer-preserving perturbations of the input, and the directional expectation test (DIR) examines how outputs change when inputs are modified in a targeted way.
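
To make these tests more concrete, the sketch below shows what an INV-style check could look like. It is an illustrative approximation, not the paper's harness: `query_llm` is a hypothetical stand-in for whatever call sends a prompt to the model, and the perturbation is a simple adjacent-character swap that should leave the correct answer unchanged.

```python
import random


def inject_typo(question: str, seed: int = 0) -> str:
    """INV-style perturbation: swap two adjacent characters inside one word,
    a change that should not alter the expected answer."""
    rng = random.Random(seed)
    words = question.split()
    candidates = [i for i, w in enumerate(words) if len(w) > 3]
    if not candidates:
        return question
    i = rng.choice(candidates)
    chars = list(words[i])
    j = rng.randrange(len(chars) - 1)
    chars[j], chars[j + 1] = chars[j + 1], chars[j]
    words[i] = "".join(chars)
    return " ".join(words)


def invariance_test(question: str, query_llm) -> bool:
    """INV check: the model's answer to a lightly perturbed question should
    agree with its answer to the original question."""
    original = query_llm(question).strip().lower()
    perturbed = query_llm(inject_typo(question)).strip().lower()
    return original == perturbed


# Toy example with a model stub that always gives the same answer, so the test passes.
print(invariance_test("Who wrote The Old Man and the Sea?", lambda q: "Ernest Hemingway"))
```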

Experimental Findings

The study's experiments cover six English and two multilingual CQA datasets with about 190,000 test cases. The findings show that although LLMs in the GPT family excel in some areas, they are not universally superior to state-of-the-art models, particularly on newer datasets. The GPT family's multilingual capabilities also appear to be plateauing, suggesting a possible limit of its current learning strategy.

Concluding Insights

The comprehensive performance analysis of ChatGPT across various QA tasks shows notable improvements with each model iteration, closely rivaling traditional KBQA models. Limitations remain, particularly for specific reasoning skills, but enhancements such as chain-of-thought prompting can improve performance on certain question types. Finally, the study recommends extending the evaluation to other domains and model types to generalize these findings and build stronger AI-driven QA systems.
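
As a rough sketch of the chain-of-thought prompting mentioned above, the snippet below prepends a worked reasoning demonstration to the target question. The demonstration text and `query_llm` are illustrative placeholders, not the prompts used in the paper; any LLM completion endpoint could be substituted.

```python
# Chain-of-thought (CoT) prompting sketch for a complex KBQA question.
COT_DEMO = (
    "Q: Which country is the birthplace of the author of 'The Old Man and the Sea'?\n"
    "A: Let's think step by step. 'The Old Man and the Sea' was written by "
    "Ernest Hemingway. Hemingway was born in the United States. "
    "So the answer is: United States.\n\n"
)


def build_cot_prompt(question: str) -> str:
    """Prepend a worked reasoning demonstration and cue stepwise reasoning."""
    return COT_DEMO + f"Q: {question}\nA: Let's think step by step."


def answer_with_cot(question: str, query_llm) -> str:
    """Send the CoT prompt to a (hypothetical) LLM endpoint and return its reply."""
    return query_llm(build_cot_prompt(question))
```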
