Emergent Mind

Abstract

ChatGPT is a powerful LLM that covers knowledge resources such as Wikipedia and supports natural language question answering using its own knowledge. Therefore, there is growing interest in exploring whether ChatGPT can replace traditional knowledge-based question answering (KBQA) models. Although there have been some works analyzing the question answering performance of ChatGPT, there is still a lack of large-scale, comprehensive testing of various types of complex questions to analyze the limitations of the model. In this paper, we present a framework that follows the black-box testing specifications of CheckList proposed by Ribeiro et al. We evaluate ChatGPT and its family of LLMs on eight real-world KB-based complex question answering datasets, which include six English datasets and two multilingual datasets. The total number of test cases is approximately 190,000. In addition to the GPT family of LLMs, we also evaluate the well-known FLAN-T5 to identify commonalities between the GPT family and other LLMs. The dataset and code are available at https://github.com/tan92hl/Complex-Question-Answering-Evaluation-of-GPT-family.git

Figure: variation in error rates and QA accuracy for the GPT and T5 family LLMs as the threshold changes.

Overview

  • The paper investigates ChatGPT's potential to surpass traditional KBQA systems in answering complex questions using a newly proposed evaluation framework.

  • The framework includes new labeling strategies for questions and a more nuanced version of the exact match (EM) method for assessing LLM QA performance (a sketch of this relaxed matching appears after this list).

  • The study tests six English and two multilingual complex question answering (CQA) datasets, totaling around 190,000 test cases, to evaluate the QA abilities of LLMs.

  • Despite ChatGPT's strengths, it is not universally superior to state-of-the-art models and shows a plateau in multilingual capabilities.

  • Improvements such as chain-of-thought prompting are suggested, and further exploration in diverse domains and models is recommended for advancing AI-driven QA systems.
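
As a rough illustration of the relaxed exact-match idea mentioned above, the Python sketch below is a simplified assumption rather than the authors' actual scoring code: it treats a prediction as correct when any normalized gold alias appears inside the model's free-form answer.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop articles and punctuation, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def relaxed_exact_match(prediction: str, gold_aliases: list[str]) -> bool:
    """Count a free-form LLM answer as correct when any normalized gold
    alias is contained in the normalized prediction (looser than strict EM)."""
    pred = normalize(prediction)
    return any(normalize(alias) in pred for alias in gold_aliases if alias)


# A verbose ChatGPT-style answer still counts as a match for the gold alias.
print(relaxed_exact_match("The capital of France is Paris, of course.", ["Paris"]))  # True
```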

Introduction

In the realm of AI and NLP, ChatGPT, a powerful LLM, has sparked interest due to its potential to leverage extensive knowledge from resources like Wikipedia for question answering (QA) tasks. This prompts an intriguing exploration into whether it could supersede traditional Knowledge-Based Question Answering (KBQA) systems.

Evaluation Framework and Methodology

The study proposes an evaluation framework inspired by previous methodologies, including CheckList, to assess LLMs' QA capabilities, with a particular focus on complex questions. The framework labels questions from the compiled datasets so that their features can be analyzed uniformly, and it refines the exact match (EM) method for more nuanced evaluation. Three test types probe different aspects of model behavior: the minimal functionality test (MFT) measures basic ability on individual reasoning skills, the invariance test (INV) checks stability under answer-preserving perturbations of the input, and the directional expectation test (DIR) examines how outputs change when inputs are modified in a targeted way.
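
To make these tests more concrete, the sketch below shows what an INV-style check could look like. It is an illustrative approximation, not the paper's harness: `query_llm` is a hypothetical stand-in for whatever call sends a prompt to the model, and the perturbation is a simple adjacent-character swap that should leave the correct answer unchanged.

```python
import random


def inject_typo(question: str, seed: int = 0) -> str:
    """INV-style perturbation: swap two adjacent characters inside one word,
    a change that should not alter the expected answer."""
    rng = random.Random(seed)
    words = question.split()
    candidates = [i for i, w in enumerate(words) if len(w) > 3]
    if not candidates:
        return question
    i = rng.choice(candidates)
    chars = list(words[i])
    j = rng.randrange(len(chars) - 1)
    chars[j], chars[j + 1] = chars[j + 1], chars[j]
    words[i] = "".join(chars)
    return " ".join(words)


def invariance_test(question: str, query_llm) -> bool:
    """INV check: the model's answer to a lightly perturbed question should
    agree with its answer to the original question."""
    original = query_llm(question).strip().lower()
    perturbed = query_llm(inject_typo(question)).strip().lower()
    return original == perturbed


# Toy example with a model stub that always gives the same answer, so the test passes.
print(invariance_test("Who wrote The Old Man and the Sea?", lambda q: "Ernest Hemingway"))
```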

Experimental Findings

The study's experiments cover six English and two multilingual CQA datasets with about 190,000 test cases. The findings show that although LLMs in the GPT family excel in some areas, they are not universally superior to state-of-the-art models, particularly on newer datasets. The GPT family's multilingual capabilities also appear to be plateauing, suggesting a possible limit of its current learning strategy.

Concluding Insights

The comprehensive performance analysis of ChatGPT across various QA tasks shows notable improvements with each model iteration, closely rivaling traditional KBQA models. Limitations remain, particularly for specific reasoning skills, but enhancements such as chain-of-thought prompting can improve performance on certain question types. Finally, the study recommends extending the evaluation to other domains and model types to generalize these findings and build stronger AI-driven QA systems.
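
As a rough sketch of the chain-of-thought prompting mentioned above, the snippet below prepends a worked reasoning demonstration to the target question. The demonstration text and `query_llm` are illustrative placeholders, not the prompts used in the paper; any LLM completion endpoint could be substituted.

```python
# Chain-of-thought (CoT) prompting sketch for a complex KBQA question.
COT_DEMO = (
    "Q: Which country is the birthplace of the author of 'The Old Man and the Sea'?\n"
    "A: Let's think step by step. 'The Old Man and the Sea' was written by "
    "Ernest Hemingway. Hemingway was born in the United States. "
    "So the answer is: United States.\n\n"
)


def build_cot_prompt(question: str) -> str:
    """Prepend a worked reasoning demonstration and cue stepwise reasoning."""
    return COT_DEMO + f"Q: {question}\nA: Let's think step by step."


def answer_with_cot(question: str, query_llm) -> str:
    """Send the CoT prompt to a (hypothetical) LLM endpoint and return its reply."""
    return query_llm(build_cot_prompt(question))
```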
