BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese (2109.09701v3)

Published 20 Sep 2021 in cs.CL

Abstract: We present BARTpho with two versions, BARTpho-syllable and BARTpho-word, which are the first public large-scale monolingual sequence-to-sequence models pre-trained for Vietnamese. BARTpho uses the "large" architecture and the pre-training scheme of the sequence-to-sequence denoising autoencoder BART, thus it is especially suitable for generative NLP tasks. We conduct experiments to compare our BARTpho with its competitor mBART on a downstream task of Vietnamese text summarization and show that: in both automatic and human evaluations, BARTpho outperforms the strong baseline mBART and improves the state-of-the-art. We further evaluate and compare BARTpho and mBART on the Vietnamese capitalization and punctuation restoration tasks and also find that BARTpho is more effective than mBART on these two tasks. We publicly release BARTpho to facilitate future research and applications of generative Vietnamese NLP tasks. Our BARTpho models are available at https://github.com/VinAIResearch/BARTpho

Summary

The paper presents BARTpho, the first large-scale monolingual pre-trained sequence-to-sequence model for Vietnamese, achieving superior ROUGE scores on text summarization compared to mBART.
It uses the BART "large" architecture, with 12 encoder and 12 decoder layers, and is pre-trained on extensive Vietnamese corpora with language-specific preprocessing.
The models show significant improvements in practical tasks such as capitalization (F1 92.41%) and punctuation restoration, highlighting their real-world applicability.

Insights into BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese

The paper presents BARTpho, the first public large-scale monolingual sequence-to-sequence models pre-trained specifically for Vietnamese. It introduces two variants: BARTpho-syllable, which operates on syllable-level input, and BARTpho-word, which operates on word-segmented input. Both use the "large" architecture and the denoising pre-training scheme of BART, which makes them well suited to generative NLP tasks.
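To make the syllable/word distinction concrete, here is a minimal sketch in Python. It assumes the common convention in Vietnamese NLP corpora that the syllables of a multi-syllable word are joined with underscores in word-segmented text; the helper name `word_to_syllable_text` is hypothetical, and a real detokenization pipeline may do more than this (for example, reattaching punctuation).

```python
def word_to_syllable_text(segmented: str) -> str:
    """Turn word-segmented text back into syllable-level text.

    Assumption: syllables of one word are joined by "_" in the input.
    """
    return segmented.replace("_", " ")


# Word-segmented sentence: "Chúng_tôi" and "nghiên_cứu_viên" are single words.
word_level = "Chúng_tôi là những nghiên_cứu_viên ."
syllable_level = word_to_syllable_text(word_level)

word_tokens = word_level.split()          # 5 whitespace-separated units
syllable_tokens = syllable_level.split()  # 8 whitespace-separated units
```

The same sentence is therefore a shorter sequence at the word level than at the syllable level, which is one reason the two variants can behave differently downstream.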

Evaluative Highlights

BARTpho's efficacy is rigorously evaluated against mBART, a strong multilingual BART model, across several Vietnamese-specific NLP tasks, notably text summarization, capitalization, and punctuation restoration. The findings reveal significant advancements:

Vietnamese Text Summarization:
- On the downstream text summarization task, BARTpho shows superior performance compared to mBART, as measured using ROUGE scores. Specifically, BARTpho-syllable and BARTpho-word improve upon mBART with ROUGE-1 scores of 60.89% and 61.10%, respectively.
Capitalization and Punctuation Restoration:
- The models' strengths are further corroborated in tasks requiring Vietnamese text capitalization and punctuation restoration. BARTpho-word achieved an F1 score of 92.41% in capitalization tasks, illustrating the benefits of word-level encoding over syllable-level in certain contexts.
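For readers unfamiliar with the ROUGE scores quoted above, the following is a minimal sketch of ROUGE-N as clipped n-gram overlap between a candidate summary and a reference. It is an illustration only: the function name `rouge_n` and the toy sentences are assumptions, whitespace splitting stands in for real tokenization, and the paper's reported numbers come from its own evaluation setup, not from this code.

```python
from collections import Counter


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, f1) of n-gram overlap between two token lists."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Counter intersection clips each n-gram count to the smaller of the two sides.
    overlap = sum((cand & ref).values())
    cand_total, ref_total = sum(cand.values()), sum(ref.values())
    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


# Toy example: same syllables, different order.
reference = "hà nội là thủ đô của việt nam".split()
candidate = "thủ đô của việt nam là hà nội".split()

rouge1 = rouge_n(candidate, reference, n=1)  # every unigram matches
rouge2 = rouge_n(candidate, reference, n=2)  # 5 of 7 bigrams match
```

ROUGE-1 ignores word order, so the reordered candidate scores perfectly on unigrams, while ROUGE-2 penalizes the broken bigrams. This is why summarization results are usually reported with several ROUGE variants side by side.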

Structural and Technical Framework

Both BARTpho variants follow the BART "large" architecture, with 12 encoder and 12 decoder layers, and are pre-trained as denoising autoencoders. They differ in their input units and data preparation. BARTpho-word is pre-trained on a 20GB corpus derived from PhoBERT's pre-training dataset. BARTpho-syllable is pre-trained on detokenized Vietnamese text, segmented with the SentencePiece tokenization associated with XLM-RoBERTa and mBART.
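The denoising objective can be sketched as follows. This is a simplified illustration of the two noising operations described in the original BART work, sentence permutation and text infilling, and not the authors' training code. The mask ratio of 0.3 and the Poisson mean of 3.5 are the values reported for BART and are assumed here; the function names and the `<mask>` string are hypothetical.

```python
import math
import random

MASK = "<mask>"  # placeholder mask symbol (assumption)


def sample_poisson(rng, lam):
    """Draw a Poisson-distributed integer using Knuth's method."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def text_infilling(tokens, rng, mask_ratio=0.3, lam=3.5, max_attempts=100):
    """Replace random spans with a single MASK; a 0-length span inserts a MASK."""
    out = list(tokens)
    to_mask = round(len(out) * mask_ratio)
    masked = attempts = 0
    while masked < to_mask and attempts < max_attempts:
        attempts += 1
        span = min(sample_poisson(rng, lam), to_mask - masked)
        start = rng.randrange(len(out) + 1 - span)
        if MASK in out[start:start + span]:
            continue  # skip spans that would swallow an earlier mask
        out[start:start + span] = [MASK]
        masked += span
    return out


def add_noise(sentences, rng):
    """Build a corrupted encoder input; the decoder target is the original text."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)  # sentence permutation
    tokens = [tok for sent in shuffled for tok in sent.split()]
    return text_infilling(tokens, rng)


rng = random.Random(0)
document = ["chúng tôi là sinh viên .", "chúng tôi học tiếng việt ."]
noisy = add_noise(document, rng)
```

During pre-training, the model receives the corrupted sequence as encoder input and is trained to reconstruct the original document on the decoder side, which is what makes the resulting model a natural fit for generation tasks.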

Pre-training was computationally demanding and was run on 8 A100 GPUs. The results suggest that this investment in language-specific pre-training yields models that are effective in their target language.

Implications and Future Directions

The paper underscores several practical and theoretical implications. Practically, BARTpho's deployment lays the groundwork for its application in real-world Vietnamese NLP tasks, potentially enhancing automated processes within ASR systems and beyond. Theoretically, the research reiterates the premise that language-specific models, when properly optimized, can surpass multilingual counterparts, particularly when dealing with nuanced linguistic features such as Vietnamese syllables and word segmentation.

Future work might consider expanding BARTpho's scope, potentially incorporating more diverse datasets or integrating LLMs that accommodate additional Vietnamese linguistic features. Given the increasing focus on cross-lingual and low-resource language research, BARTpho could serve as a foundational benchmark for further advances in Southeast Asian languages and similar linguistic terrains.

By presenting BARTpho, the authors contribute a robust framework for enhancing Vietnamese NLP, setting a new bar for language-specific pre-training methodologies, and reinforcing the importance of tailored model architectures within the multilingual NLP landscape.

GitHub

GitHub - VinAIResearch/BARTpho: BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese (INTERSPEECH 2022) (98 stars)