Advancing the Evaluation of Traditional Chinese Language Models: Towards a Comprehensive Benchmark Suite

Published 15 Sep 2023 in cs.CL | (2309.08448v2)

Abstract: The evaluation of LLMs is an essential task in the field of language understanding and generation. As LLMs continue to advance, the need for effective benchmarks to assess their performance has become imperative. In the context of Traditional Chinese, there is a scarcity of comprehensive and diverse benchmarks to evaluate the capabilities of LLMs, despite the existence of certain benchmarks such as DRCD, TTQA, CMDQA, and FGC dataset. To address this gap, we propose a novel set of benchmarks that leverage existing English datasets and are tailored to evaluate LLMs in Traditional Chinese. These benchmarks encompass a wide range of tasks, including contextual question-answering, summarization, classification, and table understanding. The proposed benchmarks offer a comprehensive evaluation framework, enabling the assessment of LLMs' capabilities across different tasks. In this paper, we evaluate the performance of GPT-3.5, Taiwan-LLaMa-v1.0, and Model 7-C, our proprietary model, on these benchmarks. The evaluation results highlight that our model, Model 7-C, achieves performance comparable to GPT-3.5 with respect to a part of the evaluated capabilities. In an effort to advance the evaluation of LLMs in Traditional Chinese and stimulate further research in this field, we have open-sourced our benchmark and opened the model for trial.


Authors (6)

Citations (10)


Summary

The paper introduces a comprehensive benchmark suite designed to evaluate LLMs in Traditional Chinese, filling a critical gap in NLP research.
It adapts and translates established English datasets and introduces novel resources like TMMLU to assess a range of tasks including QA, summarization, and classification.
Evaluations of models such as GPT-3.5 and Model 7-C highlight both performance strengths and challenges, especially in table understanding and output consistency.

Evaluation of Traditional Chinese LLMs: A Comprehensive Benchmark Approach

The authors address the need for robust benchmarks for evaluating LLMs in Traditional Chinese. English LLMs can be assessed against numerous benchmarks, but Traditional Chinese lacks comparably comprehensive evaluation frameworks. To bridge this gap, the study introduces a benchmark suite tailored to Traditional Chinese that assesses a range of LLM capabilities, including contextual question answering, world knowledge, summarization, classification, and table understanding.

Benchmark Design and Implementation

The proposed benchmarks are largely adapted from existing English datasets, translated into Traditional Chinese where necessary. Existing Traditional Chinese datasets are used as well: the Delta Reading Comprehension Dataset (DRCD) for contextual QA and Taiwanese Trivia Question Answering (TTQA) for world knowledge. A new dataset, Taiwan Massive Multitask Language Understanding (TMMLU), evaluates a model's competency across 55 subjects using educational exams from Taiwan. The classification and summarization tasks use translations of English benchmarks such as IMDB and XSum.
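To make the multi-subject setup concrete, the following is a minimal sketch of how a TMMLU-style benchmark could be scored. It assumes a multiple-choice format with one gold answer letter per item, and the names `per_subject_accuracy`, `macro_average`, and the `subject`/`answer` fields are hypothetical. It is not the authors' evaluation code, and the summary does not state how the per-subject results are aggregated.

```python
from collections import defaultdict


def per_subject_accuracy(items, predictions):
    """Return accuracy per subject.

    items: list of dicts with hypothetical keys 'subject' and 'answer'
           (the gold answer letter, assumed to be upper case).
    predictions: list of model answer letters, aligned with items.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, pred in zip(items, predictions):
        total[item["subject"]] += 1
        # Normalize whitespace and case before comparing letters.
        if pred.strip().upper() == item["answer"]:
            correct[item["subject"]] += 1
    return {subject: correct[subject] / total[subject] for subject in total}


def macro_average(accuracy_by_subject):
    """Average the per-subject accuracies so each subject counts equally."""
    return sum(accuracy_by_subject.values()) / len(accuracy_by_subject)


# Toy data, invented for illustration only.
items = [
    {"subject": "physics", "answer": "A"},
    {"subject": "physics", "answer": "C"},
    {"subject": "history", "answer": "B"},
]
predictions = ["A", "B", "b "]

scores = per_subject_accuracy(items, predictions)
print(scores)
print(macro_average(scores))
```

Macro-averaging is one reasonable choice when subjects have very different numbers of questions; a micro-average over all items would weight large subjects more heavily.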

Evaluation and Numerical Insights

Several models were evaluated on the proposed benchmarks: GPT-3.5, Taiwan-LLaMa-v1.0, and Model 7-C, the authors' proprietary model series. GPT-3.5 performed best across the evaluated tasks overall, setting a high standard for Traditional Chinese models. Model 7-C performed comparably to GPT-3.5 on specific benchmarks such as DRCD and XSum-TC, showing competitive capabilities in contextual question answering and summarization.

The evaluations expose two weaknesses. First, open-source models handle table understanding poorly and hallucinate substantially on these tasks. Second, they consistently underperform in summarization, which the authors attribute to outputs whose structure deviates from that of the target summaries.
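The effect of output structure on summarization scores can be illustrated with an overlap metric. The sketch below uses a character-level, ROUGE-L-style F-measure based on the longest common subsequence. This metric is an assumption chosen for illustration, since the summary does not say which metric the paper uses, and the function names and example strings are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def char_lcs_f1(candidate, reference):
    """Character-level LCS F-measure (ROUGE-L-style, an illustrative choice)."""
    if not candidate or not reference:
        return 0.0
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


# Invented example: the same summary, once bare and once with a preamble.
reference = "台北今日降雨"
concise = "台北今日降雨"
wrapped = "以下是摘要：台北今日降雨"

print(char_lcs_f1(concise, reference))
print(char_lcs_f1(wrapped, reference))
```

In this toy case the wrapped output contains the full reference, so recall is unchanged, but the added preamble halves precision and lowers the F-measure. A model can therefore score worse purely because of how it frames its output.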

Open-Ended Generation and Model Helpfulness

The TAIDE-14 tasks served as the benchmark for assessing how helpful the models' open-ended responses are. Model 7-C-Chat, the chat-optimized variant of Model 7-C, performed well and occasionally surpassed GPT-3.5 in helpfulness, demonstrating proficiency in Traditional Chinese text generation across diverse domains.

Implications and Future Outlook

Developing and open-sourcing these benchmarks is an important step for research on Traditional Chinese LLMs. By providing a foundation for broad evaluation, the study both benchmarks current models and identifies areas where LLMs need improvement. The work also invites academic and industry researchers to measure and refine their models against a shared standard.

This initiative underscores the importance of culturally and linguistically appropriate evaluation frameworks, offering a pathway for equitable progress across languages. The open-source release of these resources is expected to facilitate collaboration and foster innovation, driving the development of more sophisticated and inclusive AI systems in the future. This research could, therefore, serve as a catalyst for nuanced advancements in the field of Traditional Chinese NLP and beyond.
