Abstract

Evaluating LLMs is challenging because of their generative nature, which demands precise evaluation methodologies. Additionally, non-English LLM evaluation lags behind English, leaving LLMs for many languages absent or underdeveloped. In response, we introduce the Khayyam Challenge (also known as PersianMMLU), a meticulously curated collection of 20,192 four-choice questions drawn from 38 diverse tasks extracted from Persian examinations, spanning a wide spectrum of subjects, complexities, and ages. The primary objective of the Khayyam Challenge is to facilitate the rigorous evaluation of LLMs that support the Persian language. Its distinctive features are (i) comprehensive coverage of topics, including literary comprehension, mathematics, sciences, logic, and intelligence testing, aimed at assessing different facets of LLMs, such as language comprehension, reasoning, and information retrieval, across educational stages from lower primary school to upper secondary school; (ii) rich metadata such as human response rates, difficulty levels, and descriptive answers; (iii) new data that avoids the contamination issues prevalent in existing frameworks; (iv) original, non-translated data tailored for Persian speakers, keeping the framework free of translation challenges and errors while capturing cultural nuances; and (v) inherent scalability for future data updates and evaluations without special human effort. Previous works lacked an evaluation framework combining all of these features into a single comprehensive benchmark. Furthermore, we evaluate a wide range of existing LLMs that support the Persian language, with statistical analyses and interpretations of their outputs.

Figure: Comparison of model and human accuracy over three difficulty levels by educational stage.

Overview

  • The Khayyam Challenge, or PersianMMLU, is introduced as a benchmark to evaluate LLMs' understanding of Persian, covering a wide range of subjects and complexities.

  • It utilizes a dataset consisting of 20,192 questions from 38 subjects, sourced from Iran's educational materials, aimed at providing rich, high-quality content for assessing LLMs.

  • A detailed evaluation methodology compares state-of-the-art LLMs such as GPT-3.5 and GPT-4, with a focus on different answer-extraction techniques alongside traditional performance metrics.

  • Findings highlight performance gaps between LLMs and human benchmarks, particularly in tasks requiring advanced reasoning, suggesting areas for model improvement and future research directions.

Evaluation and Insights from the Khayyam Challenge: A Benchmark for Persian Language Understanding in LLMs

Introduction

The landscape of LLM evaluation has been enriched by the introduction of the Khayyam Challenge, also known as PersianMMLU. This comprehensive benchmark aims to rigorously assess LLMs' understanding of the Persian language through a diverse array of subjects and complexities. The challenge is named after Omar Khayyam, reflecting the multidisciplinary nature of the tasks it comprises. Unique in its construction, the Khayyam Challenge draws its questions from the Iranian educational context, extending from lower primary to upper secondary education levels. This initiative addresses critical gaps in non-English LLM evaluation and sets the stage for future advances in Persian language processing.

Data Characteristics

The dataset, derived from Iran's "Pellekan Yadgiri" website and the Kanoon Farhangi Amoozesh educational institute, spans 38 subjects with a total of 20,192 four-choice questions. These subjects range widely from mathematics and science to the humanities, each requiring a mix of language comprehension, reasoning, and knowledge retrieval. Notable for its high-quality, expert-validated content, the Khayyam Challenge stands out by including the following (a hypothetical record layout is sketched after the list):

  • Rich Metadata: Information on difficulty levels, educational stages, and detailed explanations for each question.
  • Original, Non-Translated Content: Specifically tailored for Persian, avoiding the common pitfalls of translated data.
  • Comprehensive Coverage and Scalability: From literary comprehension to logical reasoning across various educational stages.
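
To make the listed metadata concrete, here is a minimal sketch of what a single benchmark record might look like. The field names and values are illustrative assumptions based on the features described above, not the benchmark's published schema.

```python
# Hypothetical layout of one Khayyam Challenge item. Field names are
# illustrative assumptions, not the benchmark's published schema.
from dataclasses import dataclass

@dataclass
class KhayyamItem:
    question: str               # Persian question text
    choices: list[str]          # the four answer options
    answer: int                 # index (0-3) of the correct choice
    subject: str                # one of the 38 subjects, e.g. "mathematics"
    stage: str                  # educational stage, e.g. "upper secondary"
    difficulty: str             # e.g. "easy", "medium", or "hard"
    human_response_rate: float  # fraction of test-takers answering correctly
    explanation: str            # descriptive answer / worked solution

# Example instance; the Persian text fields are elided.
item = KhayyamItem(
    question="...",
    choices=["...", "...", "...", "..."],
    answer=2,
    subject="mathematics",
    stage="upper secondary",
    difficulty="hard",
    human_response_rate=0.34,
    explanation="...",
)
```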

Evaluation Methodology

The paper describes a meticulous evaluation of several state-of-the-art LLMs, including GPT-3.5 and GPT-4, on this comprehensive dataset. A significant part of the study is the use of different answer-extraction methods, namely regex-based and probability-based approaches, alongside traditional performance metrics. Notably, the analysis includes a detailed comparison of LLM performance against human benchmarks, shedding light on current models' limitations and the areas requiring improvement.
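
As a rough illustration of the two extraction strategies named above, the sketch below pairs a simple regex that pulls a choice label out of free-form model text with a probability-style selection that compares per-label scores. The pattern and the `label_logprob` hook are hypothetical stand-ins, not the paper's implementation.

```python
import re

# Regex approach: pull the first standalone choice label (1-4) out of the
# model's free-form answer. The pattern is a simplified illustration; real
# Persian outputs may also use Persian digits, which would need handling.
CHOICE_PATTERN = re.compile(r"\b([1-4])\b")

def extract_by_regex(model_output: str) -> int | None:
    match = CHOICE_PATTERN.search(model_output)
    return int(match.group(1)) if match else None

# Probability approach: score each candidate label and pick the most
# likely one. `label_logprob` is a hypothetical hook standing in for any
# API that returns the log-probability of `label` continuing `prompt`.
def extract_by_probability(prompt: str, label_logprob) -> int:
    labels = ["1", "2", "3", "4"]
    scores = {label: label_logprob(prompt, label) for label in labels}
    return int(max(scores, key=scores.get))

print(extract_by_regex("The correct answer is option 3."))  # -> 3
```

The probability route avoids parsing failures on verbose or off-format generations, at the cost of requiring access to token-level likelihoods.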

Observations and Insights

The results presented underscore several important findings:

  • Performance Gaps: LLMs, including GPT-4, showed promising results but still lagged behind human benchmarks. The discrepancy was particularly noticeable in tasks requiring advanced reasoning, such as those in the mathematics and natural sciences categories.
  • Model Comparisons: Among the evaluated models, GPT-4 performed best, yet it still falls short of human-like understanding and reasoning in Persian.
  • Rich Metadata Utilization: Analyzing metadata such as question difficulty and the presence of trap options provided deeper insight into how the models behave; a sketch of this kind of analysis follows the list.
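
To illustrate how metadata-driven analysis of this kind might look, the snippet below groups per-item correctness by difficulty level. The inputs are hypothetical and reuse the difficulty tags assumed in the earlier record sketch.

```python
from collections import defaultdict

# Group per-item correctness by difficulty tag and report accuracy per
# bucket. `results` pairs each item's difficulty with whether the model
# answered it correctly; both are hypothetical inputs for illustration.
def accuracy_by_difficulty(results: list[tuple[str, bool]]) -> dict[str, float]:
    buckets: dict[str, list[bool]] = defaultdict(list)
    for difficulty, correct in results:
        buckets[difficulty].append(correct)
    return {d: sum(v) / len(v) for d, v in buckets.items()}

results = [("easy", True), ("easy", True), ("hard", False), ("hard", True)]
print(accuracy_by_difficulty(results))  # {'easy': 1.0, 'hard': 0.5}
```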

Implications for Future Research

The Khayyam Challenge not only marks a significant advancement in evaluating Persian language understanding in LLMs but also opens several avenues for future research. The detailed insights into model performances and the comprehensive nature of the dataset pave the way for targeted improvements in model architectures and training methodologies. Moreover, the scalable framework of the Khayyam Challenge allows for easy updates and expansions, ensuring its relevance and utility in the fast-evolving field of AI and language understanding.

Concluding Remarks

In summary, the Khayyam Challenge represents a pivotal step towards a deeper and more nuanced understanding of Persian language processing in LLMs. By offering a rigorous, varied, and scalable benchmark, it provides a valuable resource for researchers aiming to push the boundaries of AI language capabilities. The insights gained from this challenge highlight the existing gaps in LLMs' understanding and reasoning in Persian, offering clear directions for future advancements in the field.
