Abstract

This paper explores the efficacy of LLMs for Persian. While ChatGPT and subsequent LLMs have shown remarkable performance in English, their efficacy in lower-resource languages remains an open question. We present the first comprehensive benchmarking study of LLMs across diverse Persian language tasks. Our primary focus is on GPT-3.5-turbo, but we also include GPT-4 and OpenChat-3.5 to provide a more holistic evaluation. Our assessment encompasses a diverse set of tasks categorized into classic, reasoning, and knowledge-based domains. To enable a thorough comparison, we evaluate LLMs against existing task-specific fine-tuned models. Given the limited availability of Persian datasets for reasoning tasks, we introduce two new benchmarks: one based on elementary-school math questions and another derived from the entrance exams for 7th and 10th grades. Our findings reveal that while LLMs, especially GPT-4, excel in tasks requiring reasoning abilities and a broad understanding of general knowledge, they often lag behind smaller pre-trained models fine-tuned for specific tasks. Additionally, we observe improved performance when test sets are translated into English before being fed to GPT-3.5. These results highlight the significant potential for enhancing LLM performance in Persian, which is particularly noteworthy given the language's distinct alphabet and writing styles.

Overview

  • The paper evaluates LLMs such as GPT-3.5-turbo, GPT-4, and OpenChat-3.5, specifically for the Persian language, using various benchmarks including classic NLP tasks, reasoning tasks, and knowledge-based tasks.

  • The study shows that GPT-4 generally outperforms GPT-3.5 and OpenChat-3.5, with better performance in sentiment analysis, named entity recognition, and machine translation, though it has room for improvement in emotion recognition and domain-specific knowledge like literature.

  • Results indicate that English prompts yield better performance, even in Persian-targeted tasks, suggesting the need for advanced multilingual training and prompt engineering to better support low-resource languages.

Benchmarking LLMs for Persian: A Preliminary Study Focusing on ChatGPT

Overview

The paper "Benchmarking LLMs for Persian: A Preliminary Study Focusing on ChatGPT" provides a detailed evaluation of LLMs specifically in the context of the Persian language. The study primarily focuses on OpenAI's GPT-3.5-turbo, GPT-4, and an open-source model, OpenChat-3.5. Various benchmarks are established, categorized into classic NLP tasks, reasoning tasks, and knowledge-based tasks. The authors ensure thoroughness by employing both Persian and English prompts in zero-shot, one-shot, and few-shot configurations to obtain a comprehensive understanding of the efficacy of these LLMs.

Key Findings

  1. Classic NLP Tasks:

    • Sentiment Analysis: GPT-4 achieved a peak macro F1-score of 0.906 in the three-shot setting with English prompts, surpassing the fine-tuned mt5-base model (F1 0.891); see the metric sketch after this list. However, GPT-3.5's performance plateaued as more demonstrations were added.
    • Emotion Recognition: The highest F1-score for GPT-4 was 0.621, lower than the fine-tuned ParsBERT model (F1 0.699). GPT models showed only modest performance, indicating room for improvement in this domain.
    • Named Entity Recognition (NER): GPT-4 achieved a top F1-score of 0.712, compared to the SOTA F1 of 0.988, revealing challenges in recognizing named entities accurately.
    • Machine Translation (MT): GPT-4 performed best on English-to-Persian translation, reaching a BLEU score of 8.7 in the three-shot setting with English prompts, whereas on Persian-to-English translation the SOTA model (BLEU 11.7) outperformed the LLMs.
    • Reading Comprehension: GPT-4 achieved an F1-score of 0.687, slightly below the SOTA score of 0.691. The models showed a marked improvement with few-shot prompts compared to zero-shot settings.
  2. Reasoning Tasks:

    • Textual Entailment: Using the ParsiNLU dataset, GPT-4 achieved an F1-score of 0.636 in a three-shot English prompt setting, while SOTA was 0.690. For the ConjNLI dataset, GPT-4 attained an F1-score of 0.512, short of the SOTA (0.524).
    • Multiple-choice QA (Math & Logic): GPT-4 displayed strong reasoning capabilities with an accuracy of 0.725 in a three-shot English setting, outperforming SOTA (0.395).
    • Elementary School QA: GPT-4 achieved an accuracy of 0.740, indicating it is proficient in simpler reasoning tasks.
    • Math Problems: GPT-4 led with an accuracy of 0.564 on math problems in the three-shot English prompt setting, indicating robust mathematical reasoning capabilities.
  3. Knowledge-based Tasks:

    • Literature Knowledge: GPT-4 exhibited an accuracy of 0.485, but GPT-3.5's performance (0.310) highlighted limitations in domain-specific knowledge.
    • Common Knowledge: GPT-4’s accuracy of 0.635 outperformed both GPT-3.5 and SOTA models, showcasing its general knowledge proficiency.
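
For reference, the classification scores above are macro-averaged F1 and the translation scores are BLEU. The toy sketch below shows how such numbers are typically computed with scikit-learn and sacrebleu; the gold labels, predictions, and translations are invented solely to make the snippet runnable.

```python
# Toy illustration of the metrics reported above: macro F1 for the
# classification tasks and corpus BLEU for machine translation.
from sklearn.metrics import f1_score
import sacrebleu

# Hypothetical gold and predicted sentiment labels.
gold = ["positive", "negative", "neutral", "negative"]
pred = ["positive", "negative", "negative", "negative"]
print("macro F1:", f1_score(gold, pred, average="macro"))

# Hypothetical system outputs scored against single references.
hypotheses = ["the weather is nice today"]
references = [["the weather is lovely today"]]  # one reference stream
print("BLEU:", sacrebleu.corpus_bleu(hypotheses, references).score)
```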

Implications and Future Directions

The evaluation underscores the variance in capabilities of LLMs like GPT-4, GPT-3.5, and OpenChat-3.5 across different tasks in Persian NLP. GPT-4 consistently outperformed GPT-3.5 and OpenChat-3.5, indicating its superior generalization ability and robustness. OpenChat-3.5, although an open-source model with a smaller parameter count, showed competitiveness in certain tasks, suggesting potential for further improvements through more targeted optimizations.

The study highlights specific areas where LLMs underperform, particularly in tasks like Named Entity Recognition and domain-specific knowledge tasks like literature. These results suggest an opportunity for fine-tuning models with more Persian-specific data to bridge these gaps.
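
As a starting point for that direction, the sketch below fine-tunes a Persian BERT checkpoint for token classification (NER) with the HuggingFace Trainer. The checkpoint ID, label count, and one-sentence toy dataset are assumptions made only so the snippet runs end to end; a real experiment would substitute a Persian NER corpus and proper label alignment.

```python
# Minimal sketch of fine-tuning a Persian BERT checkpoint for NER with
# HuggingFace Transformers. Checkpoint ID, label count, and the toy
# dataset are illustrative assumptions, not the paper's exact setup.
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "HooshvareLab/bert-fa-base-uncased"  # assumed ParsBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    checkpoint, num_labels=7)  # e.g. BIO tags for PER/LOC/ORG plus O

class ToyNERDataset(Dataset):
    """One-sentence stand-in for a real tokenized Persian NER corpus."""
    def __init__(self):
        enc = tokenizer(["تهران پایتخت ایران است"], truncation=True,
                        padding="max_length", max_length=16,
                        return_tensors="pt")
        self.input_ids = enc["input_ids"]
        self.attention_mask = enc["attention_mask"]
        # Label every position O (index 0); real code would set special
        # and padding tokens to -100 so the loss ignores them.
        self.labels = torch.zeros_like(self.input_ids)

    def __len__(self):
        return self.input_ids.size(0)

    def __getitem__(self, i):
        return {"input_ids": self.input_ids[i],
                "attention_mask": self.attention_mask[i],
                "labels": self.labels[i]}

args = TrainingArguments(output_dir="parsbert-ner", learning_rate=3e-5,
                         per_device_train_batch_size=16, num_train_epochs=3)
trainer = Trainer(model=model, args=args, train_dataset=ToyNERDataset())
trainer.train()
```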

The findings also revealed that performance was generally better when tasks used English prompts, even when test data were in Persian. This insight invites further investigation into the mechanics of multilingual model training and might encourage the development of more sophisticated prompt engineering techniques tailored to low-resource languages.
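
One simple way to probe this effect is to compare a direct Persian query against a translate-then-prompt pipeline. The sketch below contrasts the two conditions; `translate_to_english` is a hypothetical helper (the paper does not specify its translation tooling), here backed by the same chat model purely for brevity.

```python
# Sketch contrasting a direct Persian query with a translate-then-prompt
# pipeline, motivated by the finding that English inputs tend to help.
from openai import OpenAI

client = OpenAI()

def chat(messages: list[dict]) -> str:
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo", messages=messages, temperature=0)
    return reply.choices[0].message.content

def translate_to_english(persian_text: str) -> str:
    """Hypothetical helper: substitute the MT system of your choice."""
    return chat([{"role": "user",
                  "content": f"Translate to English: {persian_text}"}])

def classify(text: str, translate_first: bool) -> str:
    if translate_first:
        text = translate_to_english(text)
    return chat([{"role": "system",
                  "content": "Answer with one label: positive, negative, or neutral."},
                 {"role": "user", "content": text}])

sample = "کیفیت محصول بد بود"  # hypothetical test sentence ("the product quality was bad")
print(classify(sample, translate_first=False))  # native-Persian condition
print(classify(sample, translate_first=True))   # translate-then-prompt condition
```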

Conclusion

This preliminary benchmarking of LLMs for Persian reveals both the promise and limitations of current models like ChatGPT and OpenChat-3.5. The study provides essential insights into the models' performance across various linguistic tasks and sets the stage for future research to enhance LLM performance for low-resource languages. Continued development and evaluation will likely focus on addressing identified weaknesses and extending these models to be more effective and accurate in non-English contexts.

These efforts will be instrumental in fostering broader, more inclusive AI development, where LLMs can serve diverse linguistic communities with equal proficiency. Future work could also explore integration with other LLM advancements, emphasizing fine-tuning and prompt engineering, specifically catering to low-resource languages like Persian.
