Abstract

In recent years, LLMs have demonstrated exceptional proficiency across a broad spectrum of NLP tasks, including Machine Translation. However, previous methods predominantly relied on iterative processes such as instruction fine-tuning or continual pre-training, leaving unexplored the challenges of training LLMs solely on parallel data. In this work, we introduce PLUME (Parallel Language Model), a collection of three 2B LLMs featuring varying vocabulary sizes (32k, 128k, and 256k) trained exclusively on Catalan-centric parallel examples. These models perform comparably to previous encoder-decoder architectures on 16 supervised translation directions and 56 zero-shot ones. Utilizing this set of models, we conduct a thorough investigation into the translation capabilities of LLMs, probing their performance, the impact of the different elements of the prompt, and their cross-lingual representation space.

Figure: UMAP projection of token embeddings at each layer, grouped by source language (Plume 128k).

Overview

  • The paper investigates the translation capabilities of decoder-only LLMs trained exclusively on parallel corpora for machine translation.

  • Three variations of the Plume model, each with a different vocabulary size (32k, 128k, and 256k), were trained on a Catalan-centric dataset involving translations between Catalan and eight other languages.

  • The study found that larger vocabulary sizes enhance translation quality, particularly in zero-shot scenarios, and proposes effective pruning strategies for model optimization without significant performance loss.

Investigating the Translation Capabilities of LLMs Trained on Parallel Data Only

The paper "Investigating the translation capabilities of LLMs trained on parallel data only" provides a comprehensive analysis of the capabilities of decoder-only LLMs when solely trained on parallel corpora for machine translation tasks. This research introduces the Plume\ model, comprising three distinct LLMs, each leveraging different vocabulary sizes (32k, 128k, and 256k) trained exclusively on Catalan-centric parallel data.

Introduction

Contrasting with traditional Neural Machine Translation (NMT) approaches that predominantly use encoder-decoder architectures, this study explores the potential of decoder-only architectures. Conventional encoder-decoder models, in which an encoder processes the source sentence and a decoder generates the target sentence, have proven effective in multilingual scenarios. However, recent trends indicate a shift towards decoder-only architectures, which are prompted with the source sentence. This paper investigates decoder-only LLMs trained solely on parallel data, a setting that isolates translation behavior from biases inherited from general-purpose pre-training.
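As an illustration of this setup, a decoder-only model can be trained on parallel examples formatted as a single sequence in which language tags delimit the source and target segments and a causal LM loss is applied. The tag notation and special tokens below are assumptions for illustration, not the paper's verbatim template:

```python
# Illustrative only: the tag notation and special tokens are assumptions, not
# the paper's exact format. The point is that source and target are concatenated
# into one sequence that a decoder-only model learns with a causal LM objective.
def build_example(src_lang, tgt_lang, src_text, tgt_text):
    return f"<s> [{src_lang}] {src_text} [{tgt_lang}] {tgt_text} </s>"

print(build_example("cat", "eng", "Bon dia a tothom.", "Good morning, everyone."))
```

At inference time, the model is prompted with everything up to and including the target-language tag and generates the translation as the continuation.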

Methodology

Dataset and Tokenization

The study utilizes a Catalan-centric dataset encompassing translations between Catalan and eight other languages: Spanish, French, Italian, Portuguese, Galician, German, English, and Basque. The dataset comprises 783.6M sentences and 30.9 billion words. Data preprocessing involved filtering, deduplication, and normalization to ensure high-quality corpora.

For tokenization, three tokenizers were trained using Byte Pair Encoding (BPE) with varying vocabulary sizes (32k, 128k, and 256k). The training aimed to balance language representation to improve performance consistency across languages.
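As a rough sketch of this step, the example below trains byte-level BPE tokenizers at the three vocabulary sizes with the Hugging Face `tokenizers` library; the corpus file name, special tokens, and pre-tokenization settings are placeholders rather than the paper's exact configuration:

```python
# Hedged sketch: trains one BPE tokenizer per vocabulary size studied in the
# paper. File names, special tokens, and pre-tokenization are assumptions.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def train_bpe_tokenizer(files, vocab_size):
    tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=["<s>", "</s>", "<unk>", "<pad>"],
    )
    tokenizer.train(files, trainer)
    return tokenizer

for size in (32_000, 128_000, 256_000):
    tok = train_bpe_tokenizer(["balanced_parallel_corpus.txt"], size)  # placeholder corpus
    tok.save(f"plume_bpe_{size}.json")
```

Balancing here would amount to sampling the training text so that no single language dominates the merges, which is what keeps per-language tokenization quality roughly consistent.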

Model Architecture and Training

The Plume models are 2-billion-parameter, transformer-based, decoder-only LLMs. Training used a sequence length of 2048 tokens, the Adam optimizer, and a causal language modeling objective; one model was trained per tokenizer.
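A minimal sketch of such a configuration using the Hugging Face `transformers` Llama classes is shown below. Only the 2048-token context window and the causal LM objective come from the paper; the layer count, hidden size, and head count are illustrative assumptions chosen to land near 2B parameters:

```python
# Hedged sketch of a ~2B-parameter decoder-only configuration. Architectural
# hyperparameters are assumptions; only the 2048-token context is from the paper.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=128_000,            # 32k / 128k / 256k depending on the tokenizer
    hidden_size=2048,              # assumed
    intermediate_size=5632,        # assumed
    num_hidden_layers=24,          # assumed
    num_attention_heads=16,        # assumed
    max_position_embeddings=2048,  # sequence length reported in the paper
)
model = LlamaForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters")
```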

Evaluation

The models were evaluated with the COMET-22 and BLEU metrics on the Flores-200 and NTREX-101 datasets, covering both supervised and zero-shot translation directions. Beam search with a beam size of 5 was used during inference.
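The snippet below sketches this evaluation loop for a single direction: beam-search decoding with a beam size of 5, followed by corpus BLEU via sacreBLEU (COMET-22 would be computed analogously with the `unbabel-comet` package). The checkpoint path and prompt shape are placeholders, not the paper's artifacts:

```python
# Hedged sketch: beam-search decoding plus corpus BLEU. Checkpoint path and
# prompt format are placeholders.
import sacrebleu
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/plume-checkpoint")        # placeholder
model = AutoModelForCausalLM.from_pretrained("path/to/plume-checkpoint")

def translate(prompt, max_new_tokens=256):
    inputs = tok(prompt, return_tensors="pt")
    # Beam search with beam size 5, as reported in the paper.
    outputs = model.generate(**inputs, num_beams=5, max_new_tokens=max_new_tokens)
    # Decode only the generated continuation, not the prompt.
    return tok.decode(outputs[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)

hypotheses = [translate("<s> [cat] Bon dia. [eng]")]   # assumed prompt shape
references = [["Good morning."]]                        # one reference stream
print(sacrebleu.corpus_bleu(hypotheses, references).score)
```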

Results

Supervised and Zero-Shot Evaluation

The Plume models performed comparably to encoder-decoder architectures on supervised translation directions, with similar BLEU and COMET scores. Notably, the drop in zero-shot directions was more pronounced in BLEU than in COMET; since BLEU measures surface n-gram overlap while COMET estimates translation adequacy, this suggests that overall translation quality remained robust.

Vocabulary Size Impact

A key finding was that larger vocabularies consistently improved translation quality, with the gains most pronounced in zero-shot directions.

Analysis and Insights

The study performed an attention analysis to understand how the models use contextual information. Attention to the source-language tag varied substantially across heads, and removing the tag degraded translation quality. The analysis also identified "sink heads" that concentrate most of their attention on the BOS token, suggesting avenues for optimization such as head pruning.
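One simple way to surface such sink heads, sketched below, is to measure the fraction of attention mass each head assigns to the BOS position. The checkpoint path and prompt format are placeholders, and this is not the paper's exact analysis code:

```python
# Hedged sketch: per-layer, per-head attention mass on the BOS position.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/plume-checkpoint")   # placeholder
model = AutoModelForCausalLM.from_pretrained(
    "path/to/plume-checkpoint", output_attentions=True
)

inputs = tok("<s> [cat] Bon dia. [eng]", return_tensors="pt")     # assumed prompt shape
with torch.no_grad():
    attentions = model(**inputs).attentions  # tuple of (1, heads, seq, seq), one per layer

for layer, attn in enumerate(attentions):
    # Average over query positions: attention mass sent to position 0 (BOS) by each head.
    bos_mass = attn[0, :, :, 0].mean(dim=-1)
    print(f"layer {layer:2d}  max head->BOS mass: {bos_mass.max():.2f}")
```

Heads whose mass on the BOS token approaches 1.0 are candidates for the sink heads described above.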

Attention Head Pruning

Leveraging coverage metrics, the study proposes a strategy for masking the least significant attention heads with minimal performance loss, reducing computational cost while retaining translation quality.
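A minimal sketch of the selection step is shown below: given a per-head coverage score (however it is computed), keep the highest-scoring heads and mask the rest. The scoring itself and the mechanism for applying the mask inside the model are left abstract and are not the paper's implementation:

```python
# Hedged sketch: turn per-head coverage scores into a keep/mask decision.
import torch

def head_mask_from_coverage(coverage, keep_ratio=0.75):
    """coverage: tensor of shape (num_layers, num_heads); returns a 0/1 mask."""
    num_layers, num_heads = coverage.shape
    k = int(keep_ratio * num_layers * num_heads)
    # Threshold at the k-th largest coverage score; heads below it are masked.
    threshold = coverage.flatten().topk(k).values.min()
    return (coverage >= threshold).float()

coverage = torch.rand(24, 16)                 # placeholder scores, one per head
mask = head_mask_from_coverage(coverage)
print(f"masked heads: {int((mask == 0).sum())} of {mask.numel()}")
```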

Cross-Lingual Representation

The models progressively align language representations across layers: distances between language subspaces shrink as embeddings move through the model, although they rise again in the final layer, where tokens cluster by source language.
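One way to quantify this layer-wise alignment, sketched below, is to mean-pool the hidden states of translations of the same sentence in different languages and track the distance between the per-language centroids at each layer. The checkpoint path and the Euclidean distance are assumptions, not the paper's exact methodology:

```python
# Hedged sketch: layer-wise distance between language centroids for one
# sentence pair. Checkpoint path and distance metric are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/plume-checkpoint")   # placeholder
model = AutoModelForCausalLM.from_pretrained(
    "path/to/plume-checkpoint", output_hidden_states=True
)

sentences = {"cat": "Bon dia a tothom.", "eng": "Good morning, everyone."}
centroids = {}
with torch.no_grad():
    for lang, text in sentences.items():
        hidden = model(**tok(text, return_tensors="pt")).hidden_states
        # Mean-pool tokens at each layer -> (num_layers + 1, hidden_size)
        centroids[lang] = torch.stack([h[0].mean(dim=0) for h in hidden])

dists = torch.norm(centroids["cat"] - centroids["eng"], dim=-1)
for layer, d in enumerate(dists):
    print(f"layer {layer:2d}  cat-eng centroid distance: {d:.2f}")
```

With more sentences and languages, the same per-layer centroid distances would trace the shrink-then-rise pattern described above.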

Conclusion and Future Work

This research elucidates the potential of training LLMs on parallel data alone, with results on par with traditional encoder-decoder systems on supervised tasks. The paper highlights the benefits of larger vocabularies and introduces effective pruning strategies for model optimization. Future work could further explore the scalability of these architectures, the effects of model and data scale, and how to choose vocabulary size relative to model size.

This study contributes significantly to understanding how LLMs can be effectively used for machine translation, particularly in zero-shot scenarios, providing a foundation for future advancements in multilingual NMT.
