Khayyam Challenge (PersianMMLU): Is Your LLM Truly Wise to The Persian Language? (2404.06644v1)
Abstract: Evaluating LLMs is challenging due to their generative nature, necessitating precise evaluation methodologies. Additionally, non-English LLM evaluation lags behind English, leaving LLMs for many languages either absent or weak. In response to this need, we introduce the Khayyam Challenge (also known as PersianMMLU), a meticulously curated collection of 20,192 four-choice questions sourced from 38 diverse tasks extracted from Persian examinations, spanning a wide spectrum of subjects, complexities, and age groups. The primary objective of the Khayyam Challenge is to facilitate the rigorous evaluation of LLMs that support the Persian language. Distinctive features of the Khayyam Challenge are (i) its comprehensive coverage of topics, including literary comprehension, mathematics, sciences, logic, and intelligence testing, aimed at assessing different facets of LLMs such as language comprehension, reasoning, and information retrieval across educational stages from lower primary school to upper secondary school; (ii) its inclusion of rich metadata such as human response rates, difficulty levels, and descriptive answers; (iii) its use of new data to avoid the data contamination issues prevalent in existing frameworks; (iv) its use of original, non-translated data tailored for Persian speakers, keeping the framework free from translation errors while capturing cultural nuances; and (v) its inherent scalability, allowing future data updates and evaluations without special human effort. Previous works lacked an evaluation framework that combined all of these features in a single comprehensive benchmark. Furthermore, we evaluate a wide range of existing LLMs that support the Persian language, with statistical analyses and interpretations of their outputs.
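To make the evaluation setup concrete, the sketch below shows how accuracy on a four-choice benchmark with per-item metadata (as the abstract describes) might be computed, broken down by difficulty level. This is a minimal illustration only: the field names (`question`, `choices`, `answer_index`, `difficulty`) and the `ask_model` stub are assumptions for this example, not the benchmark's actual schema or evaluation code.

```python
# Minimal sketch of a four-choice evaluation loop in the spirit of the
# Khayyam Challenge. Field names and ask_model() are illustrative
# assumptions, not the paper's actual schema or API.
from collections import defaultdict

LETTERS = ["A", "B", "C", "D"]

def ask_model(prompt: str) -> str:
    """Placeholder for an LLM call; expected to return one of 'A'-'D'."""
    raise NotImplementedError

def evaluate(items):
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        # Render the question and its four options as a single prompt.
        prompt = item["question"] + "\n" + "\n".join(
            f"{letter}) {choice}"
            for letter, choice in zip(LETTERS, item["choices"])
        )
        prediction = ask_model(prompt).strip()[:1].upper()
        level = item.get("difficulty", "unknown")
        total[level] += 1
        if prediction == LETTERS[item["answer_index"]]:
            correct[level] += 1
    # Per-difficulty accuracy, which metadata such as difficulty levels
    # and human response rates makes possible to report.
    return {level: correct[level] / total[level] for level in total}
```

A usage pass would simply call `evaluate` on the list of question records and compare the per-level accuracies against the human response rates included in the metadata.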