
KLUE: Korean Language Understanding Evaluation (2105.09680v4)

Published 20 May 2021 in cs.CL

Abstract: We introduce Korean Language Understanding Evaluation (KLUE) benchmark. KLUE is a collection of 8 Korean natural language understanding (NLU) tasks, including Topic Classification, Semantic Textual Similarity, Natural Language Inference, Named Entity Recognition, Relation Extraction, Dependency Parsing, Machine Reading Comprehension, and Dialogue State Tracking. We build all of the tasks from scratch from diverse source corpora while respecting copyrights, to ensure accessibility for anyone without any restrictions. With ethical considerations in mind, we carefully design annotation protocols. Along with the benchmark tasks and data, we provide suitable evaluation metrics and fine-tuning recipes for pretrained language models for each task. We furthermore release the pretrained language models (PLMs), KLUE-BERT and KLUE-RoBERTa, to help reproduce baseline models on KLUE and thereby facilitate future research. We make a few interesting observations from the preliminary experiments using the proposed KLUE benchmark suite, already demonstrating the usefulness of this new benchmark suite. First, we find KLUE-RoBERTa-large outperforms other baselines, including multilingual PLMs and existing open-source Korean PLMs. Second, we see minimal degradation in performance even when we replace personally identifiable information from the pretraining corpus, suggesting that privacy and NLU capability are not at odds with each other. Lastly, we find that using BPE tokenization in combination with morpheme-level pre-tokenization is effective in tasks involving morpheme-level tagging, detection and generation. In addition to accelerating Korean NLP research, our comprehensive documentation on creating KLUE will facilitate creating similar resources for other languages in the future. KLUE is available at https://klue-benchmark.com.

Citations (179)

Summary

  • The paper presents KLUE, a benchmark comprising eight tailored tasks that evaluate Korean NLP with a focus on ethical annotation and accessibility.
  • The methodology integrates domain-diverse data, rigorous PII removal, and innovative pre-tokenization techniques to enhance performance.
  • Baseline models KLUE-BERT and KLUE-RoBERTa outperform existing multilingual and open-source Korean PLMs, providing strong starting points for research in Korean language technologies.

Overview of KLUE: Korean Language Understanding Evaluation

The KLUE benchmark aims to facilitate research in Korean NLP by providing a comprehensive evaluation framework for various Korean language understanding tasks. This benchmark encompasses eight distinct tasks, each developed from scratch to ensure accessibility and minimize copyright-related issues while promoting ethical annotation practices. The tasks included are Topic Classification (TC), Semantic Textual Similarity (STS), Natural Language Inference (NLI), Named Entity Recognition (NER), Relation Extraction (RE), Dependency Parsing (DP), Machine Reading Comprehension (MRC), and Dialogue State Tracking (DST).
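
To make the task structure concrete, below is a minimal sketch of loading one KLUE task with the Hugging Face datasets library. The "klue" dataset identifier and the "ynat" (Topic Classification) configuration are assumptions about how the benchmark is commonly distributed on the Hugging Face Hub, not details stated in the paper.

```python
from datasets import load_dataset

# Assumed Hub identifiers: dataset "klue", config "ynat" (Topic Classification).
tc = load_dataset("klue", "ynat")

print(tc)                                    # available splits (e.g. train/validation)
print(tc["train"][0])                        # one news headline with its topic label
print(tc["train"].features["label"].names)   # the topic class names
```

The other seven tasks are exposed the same way under their own configuration names, so the same loading pattern applies across the benchmark.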

Task Suite and Methodology

The development of KLUE involved the creation of a diverse set of resources, drawing from domains such as news, encyclopedias, reviews, and dialogues. Ethical considerations were central to the design: personally identifiable information (PII) was removed from the pretraining corpus, and the annotation protocols were designed to mitigate ethical risks. The released pretrained language models, KLUE-BERT and KLUE-RoBERTa, serve as baselines and outperform existing multilingual and Korean-specific models in the preliminary experiments.
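
As an illustration of how the released baselines and fine-tuning recipes can be used, the sketch below fine-tunes a KLUE checkpoint on Topic Classification with the Hugging Face transformers library. The checkpoint name klue/roberta-base, the dataset/config identifiers, the "title" text field, and the seven-class label count are assumptions made for illustration, not the paper's exact recipe.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Assumed public checkpoint and dataset identifiers on the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("klue/roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "klue/roberta-base", num_labels=7)  # assumed: seven topic classes in KLUE-TC

dataset = load_dataset("klue", "ynat")

def encode(batch):
    # KLUE-TC classifies news headlines; "title" is the assumed text field.
    return tokenizer(batch["title"], truncation=True,
                     padding="max_length", max_length=128)

encoded = dataset.map(encode, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="klue-tc-roberta",
                           num_train_epochs=3,
                           per_device_train_batch_size=32),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
)
trainer.train()
```

The same pattern, with a task-appropriate model head and metric, carries over to the other KLUE tasks.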

Observations and Results

Several noteworthy findings emerged from the preliminary experiments:

  • KLUE-RoBERTa-large outperformed all other baselines, including multilingual PLMs and existing open-source Korean PLMs.
  • Privacy-preserving measures, like the removal of PII, did not compromise the natural language understanding capabilities of the models.
  • Combining BPE tokenization with morpheme-level pre-tokenization improved performance on tasks involving morpheme-level tagging, detection, and generation (a tokenization sketch follows this list).
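
To illustrate the tokenization finding, below is a minimal sketch of morpheme-aware BPE: text is first segmented into morphemes (here with the konlpy Mecab wrapper, which assumes a MeCab-ko installation) and a BPE vocabulary is then learned over the pre-segmented text so that subword merges do not cross morpheme boundaries. The libraries, file names, and hyperparameters are assumptions for illustration, not the paper's exact recipe.

```python
from konlpy.tag import Mecab                       # assumes MeCab-ko is installed
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

mecab = Mecab()

def morpheme_pretokenize(text: str) -> str:
    # Insert spaces at morpheme boundaries so BPE merges stay within morphemes.
    return " ".join(mecab.morphs(text))

# Train a small BPE vocabulary over morpheme-segmented text (toy settings).
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(
    vocab_size=32000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"])

with open("corpus.txt", encoding="utf-8") as f:    # hypothetical corpus file
    corpus = (morpheme_pretokenize(line.strip()) for line in f)
    tokenizer.train_from_iterator(corpus, trainer)

print(tokenizer.encode(morpheme_pretokenize("한국어 자연어 이해 평가")).tokens)
```

The same pre-segmentation step is applied at encoding time, which is why the example wraps both training and encoding with morpheme_pretokenize.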

Practical and Theoretical Implications

KLUE is anticipated to accelerate Korean NLP research by offering a standardized evaluation suite that addresses the linguistic and domain-specific characteristics of Korean. The release of KLUE-BERT and KLUE-RoBERTa removes the need for researchers to pretrain their own models, promoting experimental replication and facilitating progress in model architectures and learning algorithms for Korean. The documentation of KLUE's creation process also serves as a valuable guide for constructing similar benchmarks in other languages.

Future Directions in AI

KLUE sets the stage for future work on the scalability and efficacy of language models tailored to Korean. Opportunities for future research include refining these models to mitigate embedded social biases and leveraging KLUE as a foundation for cross-lingual and multilingual studies.

Overall, KLUE represents a significant advancement for Korean NLP, both as a rigorous benchmark suite and as a catalyst for ongoing research and development in the field.
