- The paper introduces the M3KE benchmark, which comprises 20,477 multiple-choice questions over 71 tasks covering a wide range of Chinese educational topics.
- It shows that most evaluated Chinese LLMs fall well short of GPT-3.5's roughly 48% average accuracy on the benchmark, with smaller and less extensively instruction-tuned models performing close to random.
- The findings highlight the benefits of advanced instruction tuning and model scaling to enhance cross-task generalization in Chinese language models.
Evaluation of LLMs with the M3KE Benchmark
The paper presents a comprehensive evaluation of Chinese LLMs using a newly developed benchmark named M3KE: a Massive Multi-Level Multi-Subject Knowledge Evaluation benchmark. The research focuses on measuring the breadth and depth of knowledge acquired by various Chinese LLMs across multiple domains and educational levels, under both zero-shot and few-shot settings.
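As an illustration of the zero-shot setting, the sketch below shows how a single multiple-choice item could be turned into a prompt and scored. The item fields, the prompt template, and the `query_model` callable are assumptions for illustration, not the paper's actual evaluation code.

```python
# Minimal sketch of zero-shot multiple-choice scoring.
# The item fields ("question", "choices", "answer"), the prompt template,
# and the query_model callable are illustrative assumptions, not the
# paper's actual evaluation pipeline.

def build_zero_shot_prompt(item: dict) -> str:
    """Format one multiple-choice item as a zero-shot prompt."""
    lines = [item["question"]]
    for label, choice in zip("ABCD", item["choices"]):
        lines.append(f"{label}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)

def score_item(item: dict, query_model) -> bool:
    """Return True if the first A-D letter in the model output matches the gold label."""
    prediction = query_model(build_zero_shot_prompt(item))
    predicted_label = next((c for c in prediction if c in "ABCD"), None)
    return predicted_label == item["answer"]
```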
M3KE: Benchmark Characteristics
M3KE comprises 20,477 multiple-choice questions drawn from 71 distinct tasks. The tasks span the major levels of the Chinese education system, from primary school through college and beyond, and cover subject areas including the arts, humanities, social sciences, and natural sciences, as well as specialized topics such as ancient Chinese language and traditional Chinese medicine. This breadth provides a standardized basis for comparing the competencies of Chinese LLMs across domains and educational levels.
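One plausible way to organize such a benchmark is as a flat collection of question records annotated with task, subject area, and education level. The schema below is a hypothetical sketch, not the released M3KE data format.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical record layout for one benchmark item; the released M3KE data
# may use different field names and file formats.
@dataclass
class M3KEItem:
    task: str             # one of the 71 tasks, e.g. "Ancient Chinese Language"
    subject_area: str     # arts, humanities, social sciences, natural sciences, ...
    education_level: str  # primary school, junior/senior high school, college, ...
    question: str
    choices: List[str]    # the four answer options
    answer: str           # gold label, one of "A"-"D"
```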
Performance Analysis
The evaluation covers several pre-trained and instruction-tuned Chinese models with parameter counts ranging from 335M to 130B. A clear performance gap emerged between these models and GPT-3.5, which achieved an average accuracy of roughly 48% on M3KE. Smaller models and those with fewer instruction-tuning epochs generally performed near the random baseline, especially on primary-school-level tasks, revealing significant room for improvement for Chinese LLMs relative to leading English LLMs.
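To make the "near-random" observation concrete: with four answer options per question, random guessing yields about 25% accuracy, so per-level averages close to that value indicate little usable knowledge. The sketch below shows one way such per-level accuracies could be aggregated; the input format is an assumption, not the paper's evaluation script.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def accuracy_by_level(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Aggregate (education_level, is_correct) pairs into mean accuracy per level."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for level, is_correct in results:
        totals[level] += 1
        correct[level] += int(is_correct)
    return {level: correct[level] / totals[level] for level in totals}

# With four answer options per question, random guessing scores ~0.25,
# so per-level accuracies near 0.25 correspond to "near-random" performance.
```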
Instruction Tuning and Model Size
Among the tested models, those that had undergone supervised fine-tuning on instruction data showed varying degrees of success. Models such as ChatGLM-6B and BELLE-7B produced promising results, indicating that instruction tuning can improve generalization across tasks. The variation in performance as a function of model size and instruction-data volume further underlines the impact of training strategy on the final capabilities of LLMs.
Implications and Future Directions
The findings underscore the need for more sophisticated training strategies to strengthen the cross-task generalization of Chinese LLMs. The performance gap between GPT-3.5 and the open-source models evaluated suggests that substantial gains may come from further work on architectures, training paradigms, and linguistic data processing tailored to non-English languages. The insights gained through the M3KE benchmark can guide future research on multilingual, cross-disciplinary models.
Overall, this research contributes to the ongoing discourse on LLM evaluation and provides a valuable foundation for extending similar benchmarks to other high-resource languages. Such steps are important for advancing AI capabilities in a global educational context, where model applicability spans diverse cultural and academic settings.