MTet: Multi-domain Translation for English and Vietnamese (2210.05610v2)

Published 11 Oct 2022 in cs.CL and cs.AI

Abstract: We introduce MTet, the largest publicly available parallel corpus for English-Vietnamese translation. MTet consists of 4.2M high-quality training sentence pairs and a multi-domain test set refined by the Vietnamese research community. Combining with previous works on English-Vietnamese translation, we grow the existing parallel dataset to 6.2M sentence pairs. We also release the first pretrained model EnViT5 for English and Vietnamese languages. Combining both resources, our model significantly outperforms previous state-of-the-art results by up to 2 points in translation BLEU score, while being 1.6 times smaller.

Citations (7)

Summary

  • The paper introduces MTet, a multi-domain English-Vietnamese corpus of 4.2M curated sentence pairs (6.2M when combined with prior datasets) that raises translation benchmarks.
  • Its curation combines machine-learning-assisted scoring and filtering with expert review across diverse domains, including law and biomedicine.
  • EnViT5, the first pretrained model for English-Vietnamese tasks, achieves state-of-the-art BLEU while being 1.6 times smaller than prior models.

Evaluation of MTet: A Multi-domain Translation Resource for English-Vietnamese

The paper "MTet: Multi-domain Translation for English and Vietnamese" introduces MTet, a substantial contribution to machine translation for the English-Vietnamese language pair. The corpus comprises 4.2 million high-quality sentence pairs, making it the largest publicly available parallel resource for this pair; combined with prior datasets, it grows to 6.2 million pairs. Complementing the corpus is the release of EnViT5, the first pretrained model tailored specifically to English and Vietnamese translation tasks.
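
As a concrete illustration of how such a model is used, the sketch below loads a T5-style checkpoint through the Hugging Face transformers API. The Hub id "VietAI/envit5-translation" and the "en:"/"vi:" source-language prefixes are assumptions based on common EnViT5 usage, not details guaranteed by the paper.

```python
# Minimal sketch: translating with a T5-style seq2seq checkpoint via
# Hugging Face transformers. The checkpoint id and the "en:"/"vi:"
# source-language prefixes are assumptions, not confirmed by the paper.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "VietAI/envit5-translation"  # assumed Hub id for EnViT5
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# The prefix marks the source language; the model emits the other language.
batch = tokenizer(["en: MTet is a multi-domain parallel corpus."],
                  return_tensors="pt", padding=True)
outputs = model.generate(**batch, max_length=128)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```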

MTet takes a deliberate approach to the scarcity of resources for the English-Vietnamese pair. Unlike earlier, predominantly single-domain datasets, MTet spans multiple domains, incorporating previously underrepresented technical fields such as law and biomedical studies. A refined, expert-vetted multi-domain test set further strengthens the dataset's utility and reliability, since it cuts across diverse textual structures and language styles.

Quantitative assessment shows significant improvements on both English-to-Vietnamese (En-Vi) and Vietnamese-to-English (Vi-En) benchmarks, with models trained on MTet outperforming existing systems by up to two BLEU points. The EnViT5 model, pretrained on a broad corpus and fine-tuned on MTet and additional datasets, surpasses multilingual models such as mBART, achieving state-of-the-art results with a 1.6 times smaller parameter count and thus greater computational efficiency.
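
These BLEU comparisons can be reproduced with a standard corpus-level scorer. A minimal sketch using the sacrebleu library follows; the file names are illustrative placeholders, not artifacts released with the paper.

```python
# Sketch of a corpus-level BLEU evaluation with sacrebleu; the file
# names below are illustrative placeholders.
import sacrebleu

with open("hyps.vi") as f:   # system translations, one sentence per line
    hypotheses = [line.strip() for line in f]
with open("refs.vi") as f:   # reference translations, one sentence per line
    references = [line.strip() for line in f]

# sacrebleu expects a list of reference streams, hence the outer list.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")
```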

A systematic approach to collection and curation underpins MTet's robustness: combing through existing repositories, using machine learning models to score and filter candidate pairs, aligning related texts with dynamic programming, and manually curating data for high-impact domains (a sketch of the alignment step appears below). This iterative enhancement process, together with rigorous quality control, secures the dataset's integrity and applicability across diverse translation scenarios.
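
As a rough illustration of the alignment step, the sketch below pairs sentences from two related texts by monotone dynamic programming. The length-ratio similarity is a deliberately crude placeholder; the paper's pipeline scores candidates with trained models rather than this heuristic.

```python
# Sketch: monotone sentence alignment via dynamic programming. The
# similarity function is a crude length-ratio placeholder; a real
# pipeline would plug in a learned scoring model here.

def similarity(src: str, tgt: str) -> float:
    a, b = len(src), len(tgt)
    return min(a, b) / max(a, b) if max(a, b) else 0.0

def align(src_sents, tgt_sents, skip_penalty=-0.5):
    n, m = len(src_sents), len(tgt_sents)
    # dp[i][j]: best score aligning the first i source and j target sentences.
    dp = [[float("-inf")] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if dp[i][j] == float("-inf"):
                continue
            if i < n and j < m:  # match src[i] with tgt[j]
                s = dp[i][j] + similarity(src_sents[i], tgt_sents[j])
                if s > dp[i + 1][j + 1]:
                    dp[i + 1][j + 1], back[i + 1][j + 1] = s, (i, j, "match")
            if i < n:            # skip an unmatched source sentence
                s = dp[i][j] + skip_penalty
                if s > dp[i + 1][j]:
                    dp[i + 1][j], back[i + 1][j] = s, (i, j, "skip")
            if j < m:            # skip an unmatched target sentence
                s = dp[i][j] + skip_penalty
                if s > dp[i][j + 1]:
                    dp[i][j + 1], back[i][j + 1] = s, (i, j, "skip")
    # Trace back the matched sentence pairs.
    pairs, i, j = [], n, m
    while back[i][j] is not None:
        pi, pj, op = back[i][j]
        if op == "match":
            pairs.append((src_sents[pi], tgt_sents[pj]))
        i, j = pi, pj
    return list(reversed(pairs))
```

Replacing the placeholder with a model-based score, such as the cosine similarity of multilingual sentence embeddings, turns the same dynamic program into the kind of machine-assisted aligner the paper describes.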

Experiments beyond the baseline evaluations yield a further insight: a model trained on the multi-domain dataset generalizes across text domains better than models trained on domain-specific data. This points toward multi-domain training as a means of improving model versatility, with immediate practical implications for applied translation services and a contribution to data-centric training methodology.

The release of MTet and EnViT5 opens new avenues for research and application development by providing a comprehensive resource tailored to low-resource translation challenges. Anticipated future work includes extending the approach to other low-resource languages, training larger EnViT5 models, and continuing to refine bilingual dataset quality through advanced data collection techniques, with the broader aim of efficient, accurate machine translation for underrepresented languages and greater linguistic inclusivity in technology.
