KMMLU: Measuring Massive Multitask Language Understanding in Korean

Published 18 Feb 2024 in cs.CL | (2402.11548v2)

Abstract: We propose KMMLU, a new Korean benchmark with 35,030 expert-level multiple-choice questions across 45 subjects ranging from humanities to STEM. While prior Korean benchmarks are translated from existing English benchmarks, KMMLU is collected from original Korean exams, capturing linguistic and cultural aspects of the Korean language. We test 27 public and proprietary LLMs and observe the best public model to score 50.5%, leaving significant room for improvement. This model was primarily trained for English and Chinese, not Korean. Current LLMs tailored to Korean, such as Polyglot-Ko, perform far worse. Surprisingly, even the most capable proprietary LLMs, e.g., GPT-4 and HyperCLOVA X do not exceed 60%. This suggests that further work is needed to improve LLMs for Korean, and we believe KMMLU offers the appropriate tool to track this progress. We make our dataset publicly available on the Hugging Face Hub and integrate the benchmark into EleutherAI's LLM Evaluation Harness.

Authors (9)

Citations (21)

Summary

The paper introduces KMMLU, a benchmark using original Korean exam questions across 45 subjects to provide an authentic assessment tool.
It reveals significant performance gaps, with leading models like GPT-4 scoring 59.95% versus an average human score of 62.6%.
The study emphasizes the need for culturally and linguistically informed training and refined prompting strategies for improved LLM performance.

Analysis of KMMLU: Measuring Massive Multitask Language Understanding in Korean

This paper introduces KMMLU, a novel benchmark designed to evaluate the capabilities of LLMs specifically in the Korean language. Unlike previous benchmarks that rely on translated content from English, KMMLU consists of original Korean multiple-choice questions sourced from Korean exams, providing a culturally and linguistically authentic assessment tool. The benchmark spans 35,030 questions across 45 diverse subjects including humanities, STEM, applied sciences, and others.
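To make the setup concrete, here is a minimal sketch of how a multi-subject, multiple-choice benchmark of this kind could be scored. The `Item` fields, the four-option format, and the toy data are assumptions made for illustration; they are not the dataset's actual schema, and the macro average shown is one possible aggregation rather than the paper's stated method.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Item:
    # Hypothetical schema: field names are illustrative, not KMMLU's own.
    subject: str
    question: str
    options: Tuple[str, ...]
    answer: int  # index of the correct option


def accuracy_by_subject(items, predictions):
    """Return {subject: fraction of that subject's items answered correctly}."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, pred in zip(items, predictions):
        total[item.subject] += 1
        correct[item.subject] += int(pred == item.answer)
    return {subject: correct[subject] / total[subject] for subject in total}


def macro_average(per_subject):
    """Unweighted mean over subjects, so small subjects count equally."""
    return sum(per_subject.values()) / len(per_subject)


# Toy data, made up for the example.
items = [
    Item("law", "Q1", ("a", "b", "c", "d"), 0),
    Item("law", "Q2", ("a", "b", "c", "d"), 2),
    Item("math", "Q3", ("a", "b", "c", "d"), 1),
]
predictions = [0, 1, 1]  # predicted option index for each item

scores = accuracy_by_subject(items, predictions)
print(scores)
print(macro_average(scores))
```

Reporting per-subject accuracy alongside an aggregate makes it possible to see where a model's errors concentrate, which matters for the subject-level comparisons discussed later.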

Key Findings

The study tested 27 publicly available and proprietary LLMs and found significant gaps relative to human performance. The best publicly available model scored 50.54%, against an average human test-taker score of 62.6%. Even leading proprietary models fell short of that baseline: GPT-4 scored 59.95% and HyperCLOVA X scored 53.40%, which underlines how challenging the benchmark is.
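The size of these gaps is simple arithmetic on the figures quoted above. The short sketch below computes each model's shortfall from the human average in percentage points; the dictionary labels and function name are illustrative choices, not taken from the paper.

```python
# Accuracy figures (in percent) as quoted in the text above.
HUMAN_AVERAGE = 62.6
MODEL_SCORES = {
    "best public model": 50.54,
    "GPT-4": 59.95,
    "HyperCLOVA X": 53.40,
}


def gap_to_human(scores, human=HUMAN_AVERAGE):
    """Percentage-point shortfall of each model relative to the human average."""
    return {name: round(human - score, 2) for name, score in scores.items()}


gaps = gap_to_human(MODEL_SCORES)
for name, gap in gaps.items():
    print(f"{name}: {gap:.2f} points below the human average")
```

On these numbers, GPT-4 trails the human average by under three points, while the best public model trails by about twelve.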

Implications for Model Performance

A per-discipline breakdown shows that GPT-4 outperformed other models in most subjects, particularly those that do not require Korea-specific knowledge, such as marketing and IT. HyperCLOVA X, however, was competitive in Korean history and law, suggesting that culturally specific content may still require training that is closely aligned with the language and its cultural context.

The paper highlights a notable trend: larger models with greater pretraining budgets tend to perform better, a scaling effect in which more resources improve performance on complex tasks. Yet size alone is not uniformly beneficial, since the gain varies by subject and by prompting method, such as Direct versus Chain-of-Thought (CoT) prompting.
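The difference between the two prompting methods can be sketched as follows. This is only an illustration: the prompt wording, the four-option `ABCD` format, and all function names are assumptions, not the templates used in the paper or in any evaluation harness.

```python
LETTERS = "ABCD"  # assumes four answer options per question


def format_question(question, options):
    """Render a question and its lettered options as plain text."""
    if len(options) != len(LETTERS):
        raise ValueError(f"expected {len(LETTERS)} options")
    lines = [question]
    lines += [f"{letter}. {text}" for letter, text in zip(LETTERS, options)]
    return "\n".join(lines)


def direct_prompt(question, options):
    """Direct prompting: ask for the answer letter immediately."""
    return format_question(question, options) + "\nAnswer:"


def cot_prompt(question, options):
    """CoT prompting: ask for step-by-step reasoning before the final letter."""
    return (
        format_question(question, options)
        + "\nLet's think step by step, then finish with 'Answer: <letter>'."
    )


def extract_answer(completion):
    """Return the letter after the last 'Answer:' marker, or None if absent."""
    marker = "Answer:"
    pos = completion.rfind(marker)
    if pos == -1:
        return None
    tail = completion[pos + len(marker):].strip()
    return tail[0] if tail and tail[0] in LETTERS else None


# Toy usage with a made-up question and a made-up model completion.
options = ["3", "4", "5", "6"]
print(direct_prompt("What is 2 + 2?", options))
print(cot_prompt("What is 2 + 2?", options))
print(extract_answer("2 + 2 equals 4, which is option B. Answer: B"))
```

With Direct prompting the model's next token is read as the answer, whereas with CoT the final letter has to be parsed out of a longer completion, which is why an extraction step such as `extract_answer` is needed.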

Implications for Future Research

The findings encourage language-specific training efforts and underscore the importance of linguistically and culturally informed benchmarks for improving LLMs in non-majority languages. The mixed results for CoT prompting, from which HyperCLOVA X benefited more than other models, suggest that how LLMs acquire reasoning ability is not yet well understood and warrants further study, especially on culturally specific tests.

KMMLU lays a foundation for future work on evaluating nuanced understanding and cross-lingual representations in multilingual models. The paper also provides empirical data that challenges the notion of a "curse of multilinguality," suggesting that model scaling mitigates the dilution of competence across a model's languages.

Conclusion

By providing a demanding evaluation tool, KMMLU supports the Korean NLP community's aim of critically assessing and improving the proficiency of LLMs in Korean. The benchmark opens avenues for language-specific model refinement and highlights the importance of culturally and linguistically native datasets for developing truly multilingual AI systems. The implications extend both to the theoretical understanding of linguistic representations in machine learning and to the practical development of AI with more accurate, culturally aligned interaction. As AI continues to evolve, such benchmarks will be important in guiding future multilingual model improvements.
