CLIcK: A Benchmark Dataset of Cultural and Linguistic Intelligence in Korean (2403.06412v4)
Abstract: Despite the rapid development of LLMs for Korean, there remains a clear lack of benchmark datasets that test the requisite Korean cultural and linguistic knowledge. Because many existing Korean benchmarks are derived from English counterparts through translation, they often overlook differences in cultural context. The few benchmarks sourced from Korean data that do capture cultural knowledge cover only narrow tasks such as bias and hate speech detection. To address this gap, we introduce a benchmark of Cultural and Linguistic Intelligence in Korean (CLIcK), a dataset of 1,995 QA pairs. CLIcK sources its data from official Korean exams and textbooks, partitioning the questions into eleven categories under the two main categories of language and culture. For each instance, we provide fine-grained annotation of the cultural and linguistic knowledge required to answer the question correctly. Using CLIcK, we evaluate 13 LLMs, uncovering insights into their performance across the categories as well as the diverse factors affecting their comprehension. CLIcK offers the first large-scale, comprehensive, Korean-centric analysis of LLMs' proficiency in Korean culture and language.
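As a concrete illustration of the benchmark's structure, the sketch below shows how a CLIcK-style multiple-choice instance might be represented and scored per category. The field names (`category`, `question`, `choices`, `answer`), the `dummy_model` stand-in, and the two example items are hypothetical assumptions for illustration, not the dataset's actual schema.

```python
# Minimal sketch of a CLIcK-style QA instance and per-category scoring.
# Schema and examples are hypothetical; the paper's actual field names
# and category labels may differ.
from collections import defaultdict

# Made-up multiple-choice QA pairs, each tagged with one of the
# eleven categories under the language/culture split.
instances = [
    {
        "category": "Korean Grammar",
        "question": "Which sentence uses the honorific form correctly?",
        "choices": ["A", "B", "C", "D"],
        "answer": "B",
    },
    {
        "category": "Korean History",
        "question": "In which year was Hangul promulgated?",
        "choices": ["1443", "1446", "1592", "1910"],
        "answer": "1446",
    },
]

def dummy_model(question: str, choices: list[str]) -> str:
    """Stand-in for an LLM under evaluation: always picks the first choice."""
    return choices[0]

# Per-category accuracy, the kind of breakdown the paper reports
# across its 13 evaluated LLMs.
correct: defaultdict[str, int] = defaultdict(int)
total: defaultdict[str, int] = defaultdict(int)
for ex in instances:
    pred = dummy_model(ex["question"], ex["choices"])
    total[ex["category"]] += 1
    correct[ex["category"]] += int(pred == ex["answer"])

for cat in total:
    print(f"{cat}: {correct[cat] / total[cat]:.2f}")
```

Reporting accuracy per category, rather than a single aggregate score, is what allows the evaluation to distinguish a model's grasp of Korean linguistic knowledge from its grasp of cultural knowledge.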