- The paper introduces the M3KE benchmark, which comprises 20,477 multiple-choice questions over 71 tasks covering a wide range of Chinese educational topics.
- It shows that most evaluated Chinese LLMs fall well short of GPT-3.5's roughly 48% average accuracy on the benchmark, with smaller and less extensively instruction-tuned models performing close to random.
- The findings highlight the benefits of advanced instruction tuning and model scaling to enhance cross-task generalization in Chinese language models.
Evaluation of LLMs with the M3KE Benchmark
The paper presents a comprehensive evaluation of Chinese LLMs using a newly developed benchmark named M3KE: a Massive Multi-Level Multi-Subject Knowledge Evaluation benchmark. The research focuses on measuring the breadth and depth of knowledge acquired by various Chinese LLMs across multiple domains and educational levels, under both zero-shot and few-shot settings.
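As an illustration of the zero-shot setting, the sketch below shows how a single multiple-choice item could be turned into a prompt and scored. The item fields, the prompt template, and the `query_model` callable are assumptions for illustration, not the paper's actual evaluation code.

```python
# Minimal sketch of zero-shot multiple-choice scoring.
# The item fields ("question", "choices", "answer"), the prompt template,
# and the query_model callable are illustrative assumptions, not the
# paper's actual evaluation pipeline.

def build_zero_shot_prompt(item: dict) -> str:
    """Format one multiple-choice item as a zero-shot prompt."""
    lines = [item["question"]]
    for label, choice in zip("ABCD", item["choices"]):
        lines.append(f"{label}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)

def score_item(item: dict, query_model) -> bool:
    """Return True if the first A-D letter in the model output matches the gold label."""
    prediction = query_model(build_zero_shot_prompt(item))
    predicted_label = next((c for c in prediction if c in "ABCD"), None)
    return predicted_label == item["answer"]
```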
M3KE: Benchmark Characteristics
M3KE comprises 20,477 multiple-choice questions drawn from 71 distinct tasks. The tasks span the major levels of the Chinese education system, from primary school through college and beyond, and cover subject areas including the arts, humanities, social sciences, and natural sciences, as well as specialized topics such as ancient Chinese language and traditional Chinese medicine. This breadth provides a standardized basis for comparing the competencies of Chinese LLMs across domains and educational levels.
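One plausible way to organize such a benchmark is as a flat collection of question records annotated with task, subject area, and education level. The schema below is a hypothetical sketch, not the released M3KE data format.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical record layout for one benchmark item; the released M3KE data
# may use different field names and file formats.
@dataclass
class M3KEItem:
    task: str             # one of the 71 tasks, e.g. "Ancient Chinese Language"
    subject_area: str     # arts, humanities, social sciences, natural sciences, ...
    education_level: str  # primary school, junior/senior high school, college, ...
    question: str
    choices: List[str]    # the four answer options
    answer: str           # gold label, one of "A"-"D"
```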
Performance Analysis
The evaluation covers several pre-trained and instruction-tuned Chinese models with parameter counts ranging from 335M to 130B. A clear performance gap emerged between these models and GPT-3.5, which achieved an average accuracy of roughly 48% on M3KE. Smaller models and those with fewer instruction-tuning epochs generally performed near the random baseline, especially on primary-school-level tasks, revealing significant room for improvement for Chinese LLMs relative to leading English LLMs.
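To make the "near-random" observation concrete: with four answer options per question, random guessing yields about 25% accuracy, so per-level averages close to that value indicate little usable knowledge. The sketch below shows one way such per-level accuracies could be aggregated; the input format is an assumption, not the paper's evaluation script.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def accuracy_by_level(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Aggregate (education_level, is_correct) pairs into mean accuracy per level."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for level, is_correct in results:
        totals[level] += 1
        correct[level] += int(is_correct)
    return {level: correct[level] / totals[level] for level in totals}

# With four answer options per question, random guessing scores ~0.25,
# so per-level accuracies near 0.25 correspond to "near-random" performance.
```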
Instruction Tuning and Model Size
Among the tested models, those that had undergone supervised fine-tuning on instruction data showed varying degrees of success. Models such as ChatGLM-6B and BELLE-7B produced promising results, indicating that instruction tuning can improve generalization across tasks. The variation in performance as a function of model size and instruction-data volume further underlines the impact of training strategy on the final capabilities of LLMs.
Implications and Future Directions
The findings underscore the need for more sophisticated training strategies to strengthen the cross-task generalization of Chinese LLMs. The performance gap between GPT-3.5 and the open-source models evaluated suggests that substantial gains may come from further work on architectures, training paradigms, and linguistic data processing tailored to non-English languages. The insights gained through the M3KE benchmark can guide future research on multilingual, cross-disciplinary models.
Overall, this research contributes to the ongoing discourse on LLM evaluation and provides a valuable foundation for extending similar benchmarks to other high-resource languages. Such steps are important for advancing AI capabilities in a global educational context, where model applicability spans diverse cultural and academic settings.