Abstract

Evaluating LLMs is challenging because of their generative nature, which demands precise evaluation methodologies. Additionally, non-English LLM evaluation lags behind English, leaving LLMs for many languages absent or underdeveloped. In response, we introduce the Khayyam Challenge (also known as PersianMMLU), a meticulously curated collection of 20,192 four-choice questions drawn from 38 diverse tasks extracted from Persian examinations, spanning a wide spectrum of subjects, complexities, and ages. The primary objective of the Khayyam Challenge is to facilitate the rigorous evaluation of LLMs that support the Persian language. Its distinctive features are (i) comprehensive coverage of topics, including literary comprehension, mathematics, sciences, logic, and intelligence testing, aimed at assessing different facets of LLMs, such as language comprehension, reasoning, and information retrieval, across educational stages from lower primary school to upper secondary school; (ii) rich metadata such as human response rates, difficulty levels, and descriptive answers; (iii) new data that avoids the contamination issues prevalent in existing frameworks; (iv) original, non-translated data tailored for Persian speakers, keeping the framework free of translation challenges and errors while capturing cultural nuances; and (v) inherent scalability for future data updates and evaluations without special human effort. Previous works lacked an evaluation framework combining all of these features into a single comprehensive benchmark. Furthermore, we evaluate a wide range of existing LLMs that support the Persian language, with statistical analyses and interpretations of their outputs.

Figure: Comparison of model and human accuracy over three difficulty levels by educational stage.

Overview

  • The Khayyam Challenge, or PersianMMLU, is introduced as a benchmark to evaluate LLMs' understanding of Persian, covering a wide range of subjects and complexities.

  • It utilizes a dataset consisting of 20,192 questions from 38 subjects, sourced from Iran's educational materials, aimed at providing rich, high-quality content for assessing LLMs.

  • A detailed evaluation methodology compares state-of-the-art LLMs such as GPT-3.5 and GPT-4, with a focus on different answer-extraction techniques alongside traditional performance metrics.

  • Findings highlight performance gaps between LLMs and human benchmarks, particularly in tasks requiring advanced reasoning, suggesting areas for model improvement and future research directions.

Evaluation and Insights from the Khayyam Challenge: A Benchmark for Persian Language Understanding in LLMs

Introduction

The landscape of LLM evaluation has been enriched by the introduction of the Khayyam Challenge, also known as PersianMMLU. This comprehensive benchmark aims to rigorously assess LLMs' understanding of the Persian language through a diverse array of subjects and complexities. The challenge is named after Omar Khayyam, reflecting the multidisciplinary nature of the tasks it comprises. Unique in its construction, the Khayyam Challenge draws its questions from the Iranian educational context, extending from lower primary to upper secondary education levels. This initiative addresses critical gaps in non-English LLM evaluation and sets the stage for future advances in Persian language processing.

Data Characteristics

The dataset, derived from Iran's "Pellekan Yadgiri" website and the Kanoon Farhangi Amoozesh educational institute, spans 38 subjects with a total of 20,192 four-choice questions. These subjects range widely from mathematics and science to the humanities, each requiring a mix of language comprehension, reasoning, and knowledge retrieval. Notable for its high-quality, expert-validated content, the Khayyam Challenge stands out by including the following (a hypothetical record layout is sketched after the list):

  • Rich Metadata: Information on difficulty levels, educational stages, and detailed explanations for each question.
  • Original, Non-Translated Content: Specifically tailored for Persian, avoiding the common pitfalls of translated data.
  • Comprehensive Coverage and Scalability: From literary comprehension to logical reasoning across various educational stages.
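
To make the listed metadata concrete, here is a minimal sketch of what a single benchmark record might look like. The field names and values are illustrative assumptions based on the features described above, not the benchmark's published schema.

```python
# Hypothetical layout of one Khayyam Challenge item. Field names are
# illustrative assumptions, not the benchmark's published schema.
from dataclasses import dataclass

@dataclass
class KhayyamItem:
    question: str               # Persian question text
    choices: list[str]          # the four answer options
    answer: int                 # index (0-3) of the correct choice
    subject: str                # one of the 38 subjects, e.g. "mathematics"
    stage: str                  # educational stage, e.g. "upper secondary"
    difficulty: str             # e.g. "easy", "medium", or "hard"
    human_response_rate: float  # fraction of test-takers answering correctly
    explanation: str            # descriptive answer / worked solution

# Example instance; the Persian text fields are elided.
item = KhayyamItem(
    question="...",
    choices=["...", "...", "...", "..."],
    answer=2,
    subject="mathematics",
    stage="upper secondary",
    difficulty="hard",
    human_response_rate=0.34,
    explanation="...",
)
```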

Evaluation Methodology

The paper describes a meticulous evaluation of several state-of-the-art LLMs, including GPT-3.5 and GPT-4, on this comprehensive dataset. A significant part of the study is the use of different answer-extraction methods, namely regex-based and probability-based approaches, alongside traditional performance metrics. Notably, the analysis includes a detailed comparison of LLM performance against human benchmarks, shedding light on current models' limitations and the areas requiring improvement.
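
As a rough illustration of the two extraction strategies named above, the sketch below pairs a simple regex that pulls a choice label out of free-form model text with a probability-style selection that compares per-label scores. The pattern and the `label_logprob` hook are hypothetical stand-ins, not the paper's implementation.

```python
import re

# Regex approach: pull the first standalone choice label (1-4) out of the
# model's free-form answer. The pattern is a simplified illustration; real
# Persian outputs may also use Persian digits, which would need handling.
CHOICE_PATTERN = re.compile(r"\b([1-4])\b")

def extract_by_regex(model_output: str) -> int | None:
    match = CHOICE_PATTERN.search(model_output)
    return int(match.group(1)) if match else None

# Probability approach: score each candidate label and pick the most
# likely one. `label_logprob` is a hypothetical hook standing in for any
# API that returns the log-probability of `label` continuing `prompt`.
def extract_by_probability(prompt: str, label_logprob) -> int:
    labels = ["1", "2", "3", "4"]
    scores = {label: label_logprob(prompt, label) for label in labels}
    return int(max(scores, key=scores.get))

print(extract_by_regex("The correct answer is option 3."))  # -> 3
```

The probability route avoids parsing failures on verbose or off-format generations, at the cost of requiring access to token-level likelihoods.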

Observations and Insights

The results presented underscore several important findings:

  • Performance Gaps: LLMs, including GPT-4, showed promising results but still lagged behind human benchmarks. The discrepancy was particularly noticeable in tasks requiring advanced reasoning, such as those in the mathematics and natural sciences categories.
  • Model Comparisons: Among the evaluated models, GPT-4 performed best, yet it still falls short of human-like understanding and reasoning in Persian.
  • Rich Metadata Utilization: Analyzing metadata such as question difficulty and the presence of trap options provided deeper insight into how the models behave; a sketch of this kind of analysis follows the list.
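
To illustrate how metadata-driven analysis of this kind might look, the snippet below groups per-item correctness by difficulty level. The inputs are hypothetical and reuse the difficulty tags assumed in the earlier record sketch.

```python
from collections import defaultdict

# Group per-item correctness by difficulty tag and report accuracy per
# bucket. `results` pairs each item's difficulty with whether the model
# answered it correctly; both are hypothetical inputs for illustration.
def accuracy_by_difficulty(results: list[tuple[str, bool]]) -> dict[str, float]:
    buckets: dict[str, list[bool]] = defaultdict(list)
    for difficulty, correct in results:
        buckets[difficulty].append(correct)
    return {d: sum(v) / len(v) for d, v in buckets.items()}

results = [("easy", True), ("easy", True), ("hard", False), ("hard", True)]
print(accuracy_by_difficulty(results))  # {'easy': 1.0, 'hard': 0.5}
```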

Implications for Future Research

The Khayyam Challenge not only marks a significant advancement in evaluating Persian language understanding in LLMs but also opens several avenues for future research. The detailed insights into model performances and the comprehensive nature of the dataset pave the way for targeted improvements in model architectures and training methodologies. Moreover, the scalable framework of the Khayyam Challenge allows for easy updates and expansions, ensuring its relevance and utility in the fast-evolving field of AI and language understanding.

Concluding Remarks

In summary, the Khayyam Challenge represents a pivotal step towards a deeper and more nuanced understanding of Persian language processing in LLMs. By offering a rigorous, varied, and scalable benchmark, it provides a valuable resource for researchers aiming to push the boundaries of AI language capabilities. The insights gained from this challenge highlight the existing gaps in LLMs' understanding and reasoning in Persian, offering clear directions for future advancements in the field.
