- The paper introduces a 548K-sentence English-Azerbaijani (Arabic Script) parallel corpus aimed at advancing neural machine translation for under-resourced languages.
- It combines automated translation with human verification to ensure linguistic fidelity and employs a transformer-based model for accurate translation.
- Evaluation with GLEU, ChrF, and NIST metrics highlights the corpus's potential to enhance MT systems and promote culturally inclusive language education.
Enhancing Language Learning through Technology: Introducing a New English-Azerbaijani (Arabic Script) Parallel Corpus
This paper presents the creation of an English-Azerbaijani (Arabic Script) parallel corpus, an important contribution to machine translation (MT) for under-resourced languages. Azerbaijani, particularly in its Arabic-script form, has been underserved by neural machine translation (NMT) despite its cultural significance and substantial speaker base. The paper addresses this gap by offering a comprehensive dataset suited to MT system development and to inclusive language learning technologies.
Dataset Summary
The introduced corpus consists of 548,000 parallel sentences, comprising approximately 9 million words per language. It draws on diverse sources, including news articles and religious texts, with the aim of supporting NLP applications and educational technology. The dataset was built through a meticulous process of script conversion from Azerbaijani (Latin script) to Azerbaijani (Arabic script) using the automated tool Mirze, followed by human verification to ensure linguistic fidelity.
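The paper names Mirze as the conversion tool but this summary does not describe its internals, so the following is only a minimal, hypothetical sketch of rule-based Latin-to-Arabic-script conversion with a simple flag for human verification; the character table is illustrative and deliberately incomplete (vowels and context-dependent rules are omitted), and it does not reproduce Mirze's actual implementation.

```python
# Hypothetical sketch: rule-based Latin -> Arabic-script conversion for
# Azerbaijani, plus a crude "needs human review" check. NOT the Mirze tool;
# the mapping below covers only a few consonants for illustration.

LATIN_TO_ARABIC = {
    "b": "ب", "p": "پ", "t": "ت", "c": "ج", "ç": "چ", "x": "خ",
    "d": "د", "r": "ر", "z": "ز", "j": "ژ", "s": "س", "ş": "ش",
    "f": "ف", "q": "ق", "k": "ک", "g": "گ", "l": "ل", "m": "م",
    "n": "ن", "h": "ه", "y": "ی",
}

def convert_token(token: str) -> str:
    """Convert one token character by character, leaving unmapped characters as-is."""
    return "".join(LATIN_TO_ARABIC.get(ch, ch) for ch in token.lower())

def convert_sentence(sentence: str) -> str:
    """Convert a whitespace-tokenized sentence; a real converter needs far richer rules."""
    return " ".join(convert_token(tok) for tok in sentence.split())

def needs_review(converted: str) -> bool:
    """Flag sentences where Latin letters survived conversion, as a stand-in
    for routing uncertain outputs to the human-verification step."""
    return any(ch.isascii() and ch.isalpha() for ch in converted)

if __name__ == "__main__":
    src = "salam dünya"
    out = convert_sentence(src)
    print(out, "| review needed:", needs_review(out))
```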
Methodology and System Architecture
This paper implements a transformer-based NMT system that translates English into Azerbaijani (Arabic script). The architecture follows a conventional transformer encoder-decoder design: the encoder turns input sequences into semantic representations through tokenization, positional encoding, self-attention, and feed-forward layers, while the decoder operates similarly with additional cross-attention over the encoded representations. The model is trained with a cross-entropy loss and optimized with AdaGrad under specified hyperparameters, converging within 50 epochs.
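Since the paper's training code is not reproduced here, the following is a minimal PyTorch sketch of the setup as described: an encoder-decoder transformer trained with cross-entropy loss and AdaGrad for 50 epochs. The vocabulary sizes, model dimensions, learning rate, and the toy batch are placeholder assumptions, and learned positional embeddings stand in for whatever positional encoding the authors used.

```python
# Minimal sketch of an encoder-decoder transformer NMT trainer
# (cross-entropy loss, Adagrad, 50 epochs). All sizes and the random
# batch are placeholders, not the paper's actual configuration.
import torch
import torch.nn as nn

PAD, SRC_VOCAB, TGT_VOCAB, MAX_LEN = 0, 8000, 8000, 128  # assumed sizes

class NMTTransformer(nn.Module):
    def __init__(self, d_model=256, nhead=8, layers=4, ff=1024):
        super().__init__()
        self.src_emb = nn.Embedding(SRC_VOCAB, d_model, padding_idx=PAD)
        self.tgt_emb = nn.Embedding(TGT_VOCAB, d_model, padding_idx=PAD)
        # Learned positional embeddings (a simplification of sinusoidal encoding).
        self.pos_emb = nn.Embedding(MAX_LEN, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead, num_encoder_layers=layers,
            num_decoder_layers=layers, dim_feedforward=ff, batch_first=True)
        self.out = nn.Linear(d_model, TGT_VOCAB)

    def forward(self, src, tgt):
        pos_s = torch.arange(src.size(1), device=src.device)
        pos_t = torch.arange(tgt.size(1), device=tgt.device)
        src_x = self.src_emb(src) + self.pos_emb(pos_s)
        tgt_x = self.tgt_emb(tgt) + self.pos_emb(pos_t)
        # Causal mask so each target position attends only to earlier positions.
        causal = self.transformer.generate_square_subsequent_mask(
            tgt.size(1)).to(src.device)
        hidden = self.transformer(
            src_x, tgt_x, tgt_mask=causal,
            src_key_padding_mask=(src == PAD),
            tgt_key_padding_mask=(tgt == PAD))
        return self.out(hidden)  # (batch, tgt_len, TGT_VOCAB)

model = NMTTransformer()
criterion = nn.CrossEntropyLoss(ignore_index=PAD)
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01)  # assumed lr

# Toy batch standing in for real English -> Azerbaijani (Arabic script) pairs.
src = torch.randint(1, SRC_VOCAB, (32, 20))
tgt = torch.randint(1, TGT_VOCAB, (32, 22))

for epoch in range(50):  # the paper reports convergence within 50 epochs
    optimizer.zero_grad()
    logits = model(src, tgt[:, :-1])               # teacher forcing: shifted target
    loss = criterion(logits.reshape(-1, TGT_VOCAB), tgt[:, 1:].reshape(-1))
    loss.backward()
    optimizer.step()
```

In a real run the random tensors would be replaced by batches from the 548K-sentence corpus, produced by whatever tokenizer and vocabulary the authors trained.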
Evaluation of NMT Performance
The NMT system is evaluated with GLEU, ChrF, and NIST to assess translation quality, and its performance is compared against GPT-4. On the KartalOlv1.0 test set, the model achieves comparable NIST results but lower ChrF and GLEU scores than GPT-4. On the UNv1.0 test set, by contrast, the model lags behind on all metrics, suggesting areas for further refinement. These evaluations highlight the corpus's potential to enhance MT systems, address the linguistic needs of Azerbaijani speakers, and promote global multilingual communication.
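As an illustration of how these three metrics can be computed, the sketch below uses NLTK's GLEU and NIST implementations together with sacrebleu's ChrF. The hypothesis and reference sentences are invented placeholders, and the paper's exact evaluation scripts and tokenization may differ.

```python
# Sketch of scoring system output with GLEU, NIST (NLTK) and ChrF (sacrebleu).
# Sentences are placeholders; real evaluation would use the test-set files.
from nltk.translate.gleu_score import corpus_gleu
from nltk.translate.nist_score import corpus_nist
import sacrebleu

hypotheses = ["this is a test translation", "another system output"]
references = [["this is a test translation"], ["another reference sentence"]]

# NLTK metrics expect token lists, with one list of references per hypothesis.
hyp_tok = [h.split() for h in hypotheses]
ref_tok = [[r.split() for r in refs] for refs in references]

gleu = corpus_gleu(ref_tok, hyp_tok)
nist = corpus_nist(ref_tok, hyp_tok, n=5)

# sacrebleu works on raw strings; references are passed as parallel streams.
chrf = sacrebleu.corpus_chrf(hypotheses, [[refs[0] for refs in references]])

print(f"GLEU={gleu:.4f}  NIST={nist:.4f}  ChrF={chrf.score:.2f}")
```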
Implications and Future Directions
The implications of developing this corpus extend beyond NMT systems; it carries significant cultural and educational value. Strengthening resources for Azerbaijani helps preserve linguistic heritage and mitigate linguistic discrimination, as illustrated by the challenges faced by Azerbaijani speakers in Iran. The paper advocates more inclusive language education frameworks in which accurate translation improves educational outcomes and fosters bilingualism.
In both theoretical and practical terms, this work sets a precedent for similar efforts in other under-resourced languages, underscoring the need for tailored linguistic resources to democratize technological advances across diverse linguistic populations. Future work includes expanding the corpus with the United Nations dataset, which offers approximately 23 million sentences, to further enrich MT systems for Azerbaijani.
The paper exemplifies the tangible benefits of targeted resource development in low-resource language contexts, asserting the relevance of data-driven linguistic contributions in the broader landscape of NLP.