Abstract

In recent years, LLMs have demonstrated exceptional proficiency across a broad spectrum of NLP tasks, including Machine Translation. However, previous methods predominantly relied on iterative processes such as instruction fine-tuning or continual pre-training, leaving unexplored the challenges of training LLMs solely on parallel data. In this work, we introduce PLUME (Parallel Language Model), a collection of three 2B LLMs featuring varying vocabulary sizes (32k, 128k, and 256k) trained exclusively on Catalan-centric parallel examples. These models perform comparably to previous encoder-decoder architectures on 16 supervised translation directions and 56 zero-shot ones. Utilizing this set of models, we conduct a thorough investigation into the translation capabilities of LLMs, probing their performance, the impact of the different elements of the prompt, and their cross-lingual representation space.

Figure: UMAP projection of token embeddings at each layer, grouped by source language (Plume 128k).

Overview

  • The paper investigates the translation capabilities of decoder-only LLMs trained exclusively on parallel corpora for machine translation.

  • Three variations of the Plume model, each with a different vocabulary size (32k, 128k, and 256k), were trained on a Catalan-centric dataset involving translations between Catalan and eight other languages.

  • The study found that larger vocabulary sizes enhance translation quality, particularly in zero-shot scenarios, and proposes effective pruning strategies for model optimization without significant performance loss.

Investigating the Translation Capabilities of LLMs Trained on Parallel Data Only

The paper "Investigating the translation capabilities of LLMs trained on parallel data only" provides a comprehensive analysis of the capabilities of decoder-only LLMs when solely trained on parallel corpora for machine translation tasks. This research introduces the Plume\ model, comprising three distinct LLMs, each leveraging different vocabulary sizes (32k, 128k, and 256k) trained exclusively on Catalan-centric parallel data.

Introduction

Contrasting with traditional Neural Machine Translation (NMT) approaches that predominantly use encoder-decoder architectures, this study explores the potential of decoder-only architectures. Conventional encoder-decoder models, in which an encoder processes the source sentence and a decoder generates the target sentence, have proven effective in multilingual scenarios. However, recent trends indicate a shift towards decoder-only architectures, which are prompted with the source sentence. This paper investigates decoder-only LLMs trained solely on parallel data, a setting that isolates translation behavior from biases inherited from general-purpose pre-training.
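As an illustration of this setup, a decoder-only model can be trained on parallel examples formatted as a single sequence in which language tags delimit the source and target segments and a causal LM loss is applied. The tag notation and special tokens below are assumptions for illustration, not the paper's verbatim template:

```python
# Illustrative only: the tag notation and special tokens are assumptions, not
# the paper's exact format. The point is that source and target are concatenated
# into one sequence that a decoder-only model learns with a causal LM objective.
def build_example(src_lang, tgt_lang, src_text, tgt_text):
    return f"<s> [{src_lang}] {src_text} [{tgt_lang}] {tgt_text} </s>"

print(build_example("cat", "eng", "Bon dia a tothom.", "Good morning, everyone."))
```

At inference time, the model is prompted with everything up to and including the target-language tag and generates the translation as the continuation.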

Methodology

Dataset and Tokenization

The study utilizes a Catalan-centric dataset encompassing translations between Catalan and eight other languages: Spanish, French, Italian, Portuguese, Galician, German, English, and Basque. The dataset comprises 783.6M sentences and 30.9 billion words. Data preprocessing involved filtering, deduplication, and normalization to ensure high-quality corpora.

For tokenization, three tokenizers were trained using Byte Pair Encoding (BPE) with varying vocabulary sizes (32k, 128k, and 256k). The training aimed to balance language representation to improve performance consistency across languages.
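As a rough sketch of this step, the example below trains byte-level BPE tokenizers at the three vocabulary sizes with the Hugging Face `tokenizers` library; the corpus file name, special tokens, and pre-tokenization settings are placeholders rather than the paper's exact configuration:

```python
# Hedged sketch: trains one BPE tokenizer per vocabulary size studied in the
# paper. File names, special tokens, and pre-tokenization are assumptions.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def train_bpe_tokenizer(files, vocab_size):
    tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=["<s>", "</s>", "<unk>", "<pad>"],
    )
    tokenizer.train(files, trainer)
    return tokenizer

for size in (32_000, 128_000, 256_000):
    tok = train_bpe_tokenizer(["balanced_parallel_corpus.txt"], size)  # placeholder corpus
    tok.save(f"plume_bpe_{size}.json")
```

Balancing here would amount to sampling the training text so that no single language dominates the merges, which is what keeps per-language tokenization quality roughly consistent.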

Model Architecture and Training

The Plume models are 2-billion-parameter, transformer-based, decoder-only LLMs. Training used a sequence length of 2048 tokens, the Adam optimizer, and a causal language modeling objective; one model was trained per tokenizer.
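A minimal sketch of such a configuration using the Hugging Face `transformers` Llama classes is shown below. Only the 2048-token context window and the causal LM objective come from the paper; the layer count, hidden size, and head count are illustrative assumptions chosen to land near 2B parameters:

```python
# Hedged sketch of a ~2B-parameter decoder-only configuration. Architectural
# hyperparameters are assumptions; only the 2048-token context is from the paper.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=128_000,            # 32k / 128k / 256k depending on the tokenizer
    hidden_size=2048,              # assumed
    intermediate_size=5632,        # assumed
    num_hidden_layers=24,          # assumed
    num_attention_heads=16,        # assumed
    max_position_embeddings=2048,  # sequence length reported in the paper
)
model = LlamaForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters")
```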

Evaluation

The models were evaluated with the COMET-22 and BLEU metrics on the Flores-200 and NTREX-101 datasets, covering both supervised and zero-shot translation directions. Beam search with a beam size of 5 was used during inference.
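The snippet below sketches this evaluation loop for a single direction: beam-search decoding with a beam size of 5, followed by corpus BLEU via sacreBLEU (COMET-22 would be computed analogously with the `unbabel-comet` package). The checkpoint path and prompt shape are placeholders, not the paper's artifacts:

```python
# Hedged sketch: beam-search decoding plus corpus BLEU. Checkpoint path and
# prompt format are placeholders.
import sacrebleu
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/plume-checkpoint")        # placeholder
model = AutoModelForCausalLM.from_pretrained("path/to/plume-checkpoint")

def translate(prompt, max_new_tokens=256):
    inputs = tok(prompt, return_tensors="pt")
    # Beam search with beam size 5, as reported in the paper.
    outputs = model.generate(**inputs, num_beams=5, max_new_tokens=max_new_tokens)
    # Decode only the generated continuation, not the prompt.
    return tok.decode(outputs[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)

hypotheses = [translate("<s> [cat] Bon dia. [eng]")]   # assumed prompt shape
references = [["Good morning."]]                        # one reference stream
print(sacrebleu.corpus_bleu(hypotheses, references).score)
```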

Results

Supervised and Zero-Shot Evaluation

The Plume models performed comparably to encoder-decoder architectures on supervised translation directions, with similar BLEU and COMET scores. Notably, the drop in zero-shot directions was more pronounced in BLEU than in COMET; since BLEU measures surface n-gram overlap while COMET estimates translation adequacy, this suggests that overall translation quality remained robust.

Vocabulary Size Impact

A key finding was that larger vocabularies consistently improved translation quality, with the gains most pronounced in zero-shot directions.

Analysis and Insights

The study performed an attention analysis to understand how the models use contextual information. Attention to the source-language tag varied substantially across heads, and removing the tag degraded translation quality. The analysis also identified "sink heads" that concentrate most of their attention on the BOS token, suggesting avenues for optimization such as head pruning.
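One simple way to surface such sink heads, sketched below, is to measure the fraction of attention mass each head assigns to the BOS position. The checkpoint path and prompt format are placeholders, and this is not the paper's exact analysis code:

```python
# Hedged sketch: per-layer, per-head attention mass on the BOS position.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/plume-checkpoint")   # placeholder
model = AutoModelForCausalLM.from_pretrained(
    "path/to/plume-checkpoint", output_attentions=True
)

inputs = tok("<s> [cat] Bon dia. [eng]", return_tensors="pt")     # assumed prompt shape
with torch.no_grad():
    attentions = model(**inputs).attentions  # tuple of (1, heads, seq, seq), one per layer

for layer, attn in enumerate(attentions):
    # Average over query positions: attention mass sent to position 0 (BOS) by each head.
    bos_mass = attn[0, :, :, 0].mean(dim=-1)
    print(f"layer {layer:2d}  max head->BOS mass: {bos_mass.max():.2f}")
```

Heads whose mass on the BOS token approaches 1.0 are candidates for the sink heads described above.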

Attention Head Pruning

Leveraging coverage metrics, the study proposes a strategy for masking the least significant attention heads with minimal performance loss, reducing computational cost while retaining translation quality.
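A minimal sketch of the selection step is shown below: given a per-head coverage score (however it is computed), keep the highest-scoring heads and mask the rest. The scoring itself and the mechanism for applying the mask inside the model are left abstract and are not the paper's implementation:

```python
# Hedged sketch: turn per-head coverage scores into a keep/mask decision.
import torch

def head_mask_from_coverage(coverage, keep_ratio=0.75):
    """coverage: tensor of shape (num_layers, num_heads); returns a 0/1 mask."""
    num_layers, num_heads = coverage.shape
    k = int(keep_ratio * num_layers * num_heads)
    # Threshold at the k-th largest coverage score; heads below it are masked.
    threshold = coverage.flatten().topk(k).values.min()
    return (coverage >= threshold).float()

coverage = torch.rand(24, 16)                 # placeholder scores, one per head
mask = head_mask_from_coverage(coverage)
print(f"masked heads: {int((mask == 0).sum())} of {mask.numel()}")
```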

Cross-Lingual Representation

The models progressively align language representations across layers: distances between language subspaces shrink as embeddings move through the model, although they rise again in the final layer, where tokens cluster by source language.
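One way to quantify this layer-wise alignment, sketched below, is to mean-pool the hidden states of translations of the same sentence in different languages and track the distance between the per-language centroids at each layer. The checkpoint path and the Euclidean distance are assumptions, not the paper's exact methodology:

```python
# Hedged sketch: layer-wise distance between language centroids for one
# sentence pair. Checkpoint path and distance metric are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/plume-checkpoint")   # placeholder
model = AutoModelForCausalLM.from_pretrained(
    "path/to/plume-checkpoint", output_hidden_states=True
)

sentences = {"cat": "Bon dia a tothom.", "eng": "Good morning, everyone."}
centroids = {}
with torch.no_grad():
    for lang, text in sentences.items():
        hidden = model(**tok(text, return_tensors="pt")).hidden_states
        # Mean-pool tokens at each layer -> (num_layers + 1, hidden_size)
        centroids[lang] = torch.stack([h[0].mean(dim=0) for h in hidden])

dists = torch.norm(centroids["cat"] - centroids["eng"], dim=-1)
for layer, d in enumerate(dists):
    print(f"layer {layer:2d}  cat-eng centroid distance: {d:.2f}")
```

With more sentences and languages, the same per-layer centroid distances would trace the shrink-then-rise pattern described above.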

Conclusion and Future Work

This research elucidates the potential of training LLMs on parallel data alone, with results on par with traditional encoder-decoder systems on supervised tasks. The paper highlights the benefits of larger vocabularies and introduces effective pruning strategies for model optimization. Future work could further explore the scalability of these architectures, the effects of model and data scale, and how to choose vocabulary size relative to model size.

This study contributes significantly to understanding how LLMs can be effectively used for machine translation, particularly in zero-shot scenarios, providing a foundation for future advancements in multilingual NMT.
