- The paper introduces a 548K-sentence English-Azerbaijani (Arabic Script) parallel corpus aimed at advancing neural machine translation for under-resourced languages.
- It combines automated translation with human verification to ensure linguistic fidelity and employs a transformer-based model for accurate translation.
- Evaluation with GLEU, ChrF, and NIST metrics highlights the corpus's potential to enhance MT systems and promote culturally inclusive language education.
Enhancing Language Learning through Technology: Introducing a New English-Azerbaijani (Arabic Script) Parallel Corpus
This paper presents the creation of an English-Azerbaijani (Arabic Script) parallel corpus, an important contribution to machine translation (MT) for under-resourced languages. Azerbaijani, particularly in its Arabic-script form, has been underserved by neural machine translation (NMT) despite its cultural significance and substantial speaker base. The paper addresses this gap by offering a comprehensive dataset suited to MT system development and to inclusive language learning technologies.
Dataset Summary
The introduced corpus consists of 548,000 parallel sentences, comprising approximately 9 million words per language. It draws on diverse sources, including news articles and religious texts, with the aim of supporting NLP applications and educational technology. The dataset was built through a meticulous process of script conversion from Azerbaijani (Latin script) to Azerbaijani (Arabic script) using the automated tool Mirze, followed by human verification to ensure linguistic fidelity.
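The paper names Mirze as the conversion tool but this summary does not describe its internals, so the following is only a minimal, hypothetical sketch of rule-based Latin-to-Arabic-script conversion with a simple flag for human verification; the character table is illustrative and deliberately incomplete (vowels and context-dependent rules are omitted), and it does not reproduce Mirze's actual implementation.

```python
# Hypothetical sketch: rule-based Latin -> Arabic-script conversion for
# Azerbaijani, plus a crude "needs human review" check. NOT the Mirze tool;
# the mapping below covers only a few consonants for illustration.

LATIN_TO_ARABIC = {
    "b": "ب", "p": "پ", "t": "ت", "c": "ج", "ç": "چ", "x": "خ",
    "d": "د", "r": "ر", "z": "ز", "j": "ژ", "s": "س", "ş": "ش",
    "f": "ف", "q": "ق", "k": "ک", "g": "گ", "l": "ل", "m": "م",
    "n": "ن", "h": "ه", "y": "ی",
}

def convert_token(token: str) -> str:
    """Convert one token character by character, leaving unmapped characters as-is."""
    return "".join(LATIN_TO_ARABIC.get(ch, ch) for ch in token.lower())

def convert_sentence(sentence: str) -> str:
    """Convert a whitespace-tokenized sentence; a real converter needs far richer rules."""
    return " ".join(convert_token(tok) for tok in sentence.split())

def needs_review(converted: str) -> bool:
    """Flag sentences where Latin letters survived conversion, as a stand-in
    for routing uncertain outputs to the human-verification step."""
    return any(ch.isascii() and ch.isalpha() for ch in converted)

if __name__ == "__main__":
    src = "salam dünya"
    out = convert_sentence(src)
    print(out, "| review needed:", needs_review(out))
```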
Methodology and System Architecture
This paper implements a transformer-based NMT system that translates English into Azerbaijani (Arabic script). The architecture follows a conventional transformer encoder-decoder design: the encoder turns input sequences into semantic representations through tokenization, positional encoding, self-attention, and feed-forward layers, while the decoder operates similarly with additional cross-attention over the encoded representations. The model is trained with a cross-entropy loss and optimized with AdaGrad under specified hyperparameters, converging within 50 epochs.
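Since the paper's training code is not reproduced here, the following is a minimal PyTorch sketch of the setup as described: an encoder-decoder transformer trained with cross-entropy loss and AdaGrad for 50 epochs. The vocabulary sizes, model dimensions, learning rate, and the toy batch are placeholder assumptions, and learned positional embeddings stand in for whatever positional encoding the authors used.

```python
# Minimal sketch of an encoder-decoder transformer NMT trainer
# (cross-entropy loss, Adagrad, 50 epochs). All sizes and the random
# batch are placeholders, not the paper's actual configuration.
import torch
import torch.nn as nn

PAD, SRC_VOCAB, TGT_VOCAB, MAX_LEN = 0, 8000, 8000, 128  # assumed sizes

class NMTTransformer(nn.Module):
    def __init__(self, d_model=256, nhead=8, layers=4, ff=1024):
        super().__init__()
        self.src_emb = nn.Embedding(SRC_VOCAB, d_model, padding_idx=PAD)
        self.tgt_emb = nn.Embedding(TGT_VOCAB, d_model, padding_idx=PAD)
        # Learned positional embeddings (a simplification of sinusoidal encoding).
        self.pos_emb = nn.Embedding(MAX_LEN, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead, num_encoder_layers=layers,
            num_decoder_layers=layers, dim_feedforward=ff, batch_first=True)
        self.out = nn.Linear(d_model, TGT_VOCAB)

    def forward(self, src, tgt):
        pos_s = torch.arange(src.size(1), device=src.device)
        pos_t = torch.arange(tgt.size(1), device=tgt.device)
        src_x = self.src_emb(src) + self.pos_emb(pos_s)
        tgt_x = self.tgt_emb(tgt) + self.pos_emb(pos_t)
        # Causal mask so each target position attends only to earlier positions.
        causal = self.transformer.generate_square_subsequent_mask(
            tgt.size(1)).to(src.device)
        hidden = self.transformer(
            src_x, tgt_x, tgt_mask=causal,
            src_key_padding_mask=(src == PAD),
            tgt_key_padding_mask=(tgt == PAD))
        return self.out(hidden)  # (batch, tgt_len, TGT_VOCAB)

model = NMTTransformer()
criterion = nn.CrossEntropyLoss(ignore_index=PAD)
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01)  # assumed lr

# Toy batch standing in for real English -> Azerbaijani (Arabic script) pairs.
src = torch.randint(1, SRC_VOCAB, (32, 20))
tgt = torch.randint(1, TGT_VOCAB, (32, 22))

for epoch in range(50):  # the paper reports convergence within 50 epochs
    optimizer.zero_grad()
    logits = model(src, tgt[:, :-1])               # teacher forcing: shifted target
    loss = criterion(logits.reshape(-1, TGT_VOCAB), tgt[:, 1:].reshape(-1))
    loss.backward()
    optimizer.step()
```

In a real run the random tensors would be replaced by batches from the 548K-sentence corpus, produced by whatever tokenizer and vocabulary the authors trained.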
Evaluation of NMT Performance
The NMT system is evaluated with GLEU, ChrF, and NIST to assess translation quality, and its performance is compared against GPT-4. On the KartalOlv1.0 test set, the model achieves comparable NIST results but lower ChrF and GLEU scores than GPT-4. On the UNv1.0 test set, by contrast, the model lags behind on all metrics, suggesting areas for further refinement. These evaluations highlight the corpus's potential to enhance MT systems, address the linguistic needs of Azerbaijani speakers, and promote global multilingual communication.
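As an illustration of how these three metrics can be computed, the sketch below uses NLTK's GLEU and NIST implementations together with sacrebleu's ChrF. The hypothesis and reference sentences are invented placeholders, and the paper's exact evaluation scripts and tokenization may differ.

```python
# Sketch of scoring system output with GLEU, NIST (NLTK) and ChrF (sacrebleu).
# Sentences are placeholders; real evaluation would use the test-set files.
from nltk.translate.gleu_score import corpus_gleu
from nltk.translate.nist_score import corpus_nist
import sacrebleu

hypotheses = ["this is a test translation", "another system output"]
references = [["this is a test translation"], ["another reference sentence"]]

# NLTK metrics expect token lists, with one list of references per hypothesis.
hyp_tok = [h.split() for h in hypotheses]
ref_tok = [[r.split() for r in refs] for refs in references]

gleu = corpus_gleu(ref_tok, hyp_tok)
nist = corpus_nist(ref_tok, hyp_tok, n=5)

# sacrebleu works on raw strings; references are passed as parallel streams.
chrf = sacrebleu.corpus_chrf(hypotheses, [[refs[0] for refs in references]])

print(f"GLEU={gleu:.4f}  NIST={nist:.4f}  ChrF={chrf.score:.2f}")
```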
Implications and Future Directions
The implications of developing this corpus extend beyond NMT systems; it carries significant cultural and educational value. Strengthening resources for Azerbaijani helps preserve linguistic heritage and mitigate linguistic discrimination, as illustrated by the challenges faced by Azerbaijani speakers in Iran. The paper advocates more inclusive language education frameworks in which accurate translation improves educational outcomes and fosters bilingualism.
In both theoretical and practical terms, this work sets a precedent for similar efforts in other under-resourced languages, underscoring the need for tailored linguistic resources to democratize technological advances across diverse linguistic populations. Future work includes expanding the corpus with the United Nations dataset, which offers approximately 23 million sentences, to further enrich MT systems for Azerbaijani.
The paper exemplifies the tangible benefits of targeted resource development in low-resource language contexts, asserting the relevance of data-driven linguistic contributions in the broader landscape of NLP.