Abstract

This study addresses the challenge of extending LLMs to non-English languages using non-Roman scripts. We propose an approach that utilizes the romanized form of text as an interface for LLMs, hypothesizing that its frequent informal use and shared tokens with English enhance cross-lingual alignment. Our approach involves the continual pretraining of an English LLM like Llama 2 on romanized text of non-English, non-Roman script languages, followed by instruction tuning on romanized data. The results indicate that romanized text not only reduces token fertility by 2x-4x but also matches or outperforms native script representation across various NLU, NLG, and MT tasks. Moreover, the embeddings computed on romanized text exhibit closer alignment with their English translations than those from the native script. Our approach presents a promising direction for leveraging the power of English LLMs in languages traditionally underrepresented in NLP.

Figure: Zero-shot English to Hindi translation with romanized and native script outputs.

Overview

  • The paper introduces a method for making LLMs more multilingual by using romanized text for non-English, non-Roman script languages, showcasing the efficiency and performance gains this approach can provide.

  • The methodology involves continual pretraining of an English LLM on romanized text, followed by instruction tuning on romanized data, leading to better cross-lingual alignment and stronger performance than native-script baselines across Natural Language Understanding and Generation tasks.

  • Empirical evidence from experiments on five Indic languages shows that models using romanized text match or outperform those using native scripts, indicating that romanization can significantly enhance the multilingual capabilities of LLMs.

RomanSetu: Efficiently Unlocking Multilingual Capabilities of LLMs Via Romanization

The paper "RomanSetu: Efficiently unlocking multilingual capabilities of LLMs via Romanization" presents an innovative approach aimed at extending the capabilities of LLMs to non-English languages that utilize non-Roman scripts. Authored by Jaavid Aktar Husain and his collaborators, the paper makes a compelling case for using romanized text as an interface to leverage the multilingual potential of English-centric LLMs.

Hypothesis and Methodology

The primary hypothesis driving this research is that the frequent informal use of romanized forms and their shared tokens with English can enhance cross-lingual alignment. The paper proposes a methodology with two stages: continual pretraining of an English LLM (such as Llama 2) on romanized text of non-English, non-Roman script languages, followed by instruction tuning on romanized instruction data.
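
To make the romanization interface concrete, the minimal sketch below converts a Devanagari sentence to a Roman-script form using the open-source indic_transliteration package with the ITRANS scheme; the paper's actual transliteration tooling and scheme are not specified here and may differ, and the example sentence is illustrative.

```python
# Minimal sketch of the romanization step, assuming the open-source
# `indic_transliteration` package and the ITRANS scheme; the paper's exact
# transliteration tooling may differ.
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

native = "भारत एक विशाल देश है"  # Hindi in Devanagari script (illustrative)

# Map the Devanagari characters to a Roman-script representation.
romanized = transliterate(native, sanscript.DEVANAGARI, sanscript.ITRANS)
print(romanized)  # roughly: "bhArata eka vishAla desha hai"
```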

Key Contributions

  1. Efficiency Gains: The study demonstrates that romanized text is significantly more efficient than native script text. Token fertility, the average number of tokens generated per word, is 2x-4x lower for romanized text than for native scripts, leading to faster inference and the ability to process longer sequences (illustrated in the first sketch after this list).
  2. Cross-lingual Alignment: Embeddings computed on romanized text align more closely with their English translations than embeddings computed on the native script. This improved alignment facilitates more effective cross-lingual transfer in LLMs (see the second sketch after this list).
  3. Empirical Results: The research shows that romanized text not only matches but often outperforms native script representation across various Natural Language Understanding (NLU), Natural Language Generation (NLG), and Machine Translation (MT) tasks. For example, in machine translation tasks from and to English, the models using romanized text consistently outperformed those using the native script.
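
The fertility comparison in item 1 can be measured as in the sketch below, which counts subword tokens per whitespace-separated word with the Llama 2 tokenizer; the checkpoint name and example sentences are assumptions for illustration, not the paper's exact setup.

```python
# Hedged sketch of a token-fertility measurement with the Llama 2 tokenizer
# (gated checkpoint; any SentencePiece-based LLM tokenizer works similarly).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

def fertility(text: str) -> float:
    """Average number of subword tokens produced per whitespace word."""
    return len(tokenizer.tokenize(text)) / len(text.split())

native = "भारत एक विशाल देश है"       # Hindi in Devanagari (illustrative)
roman = "bharat ek vishal desh hai"    # informal romanized form

print(f"native-script fertility: {fertility(native):.2f}")
print(f"romanized fertility:     {fertility(roman):.2f}")
# Devanagari text falls back to many byte-level tokens in Llama 2's
# vocabulary, so its fertility is typically several times higher.
```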

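The alignment comparison in item 2 can be probed along the lines of the sketch below: mean-pool the model's last hidden states for a sentence in native script, its romanized form, and its English translation, then compare cosine similarities. The pooling choice, layer, checkpoint, and sentences are assumptions; the paper's exact probing setup may differ.

```python
# Hedged sketch: compare how closely romanized vs. native-script embeddings
# align with the English translation, using mean-pooled hidden states.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # gated; any causal LM can stand in
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

def embed(text: str) -> torch.Tensor:
    """Mean-pooled last-hidden-state embedding for a single sentence."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)

english = "India is a vast country"  # illustrative sentences
native = "भारत एक विशाल देश है"
roman = "bharat ek vishal desh hai"

cos = torch.nn.functional.cosine_similarity
print("romanized vs. English:", cos(embed(roman), embed(english), dim=0).item())
print("native vs. English:   ", cos(embed(native), embed(english), dim=0).item())
```
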
Experimental Setup and Results

The authors conducted comprehensive experiments spanning multiple languages, benchmarks, and models. They experimented with five Indic languages (Hindi, Marathi, Gujarati, Tamil, and Malayalam), covering two language families and four different scripts. The Llama 2 model served as the base for all experiments.

Datasets and Metrics

  • Pretraining Data: Approximately 500 million words of document-level data sourced from web-crawled corpora for each language were used, along with English data.
  • Instruction Tuning Data: Roughly 120k examples per language were created by translating high-quality English supervised instruction-tuning datasets into the target languages.
  • Evaluation Metrics: The primary metrics were chrF for MT, ROUGE-L for summarization and headline generation, and accuracy/F1 scores for the various NLU tasks (see the sketch after this list).
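
The sketch below shows how these metrics are typically computed with the sacrebleu and rouge-score packages; the hypothesis and reference strings are placeholders, not drawn from the paper's evaluation data.

```python
# Hedged sketch of the evaluation metrics named above, assuming the
# sacrebleu and rouge-score packages; strings are illustrative placeholders.
from sacrebleu.metrics import CHRF
from rouge_score import rouge_scorer

hypotheses = ["bhaarat ek vishaal desh hai"]
references = [["bharat ek vishal desh hai"]]  # one reference stream

# chrF (character n-gram F-score) for machine translation.
chrf = CHRF()
print("chrF:", chrf.corpus_score(hypotheses, references).score)

# ROUGE-L for summarization and headline generation.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)
score = scorer.score(references[0][0], hypotheses[0])["rougeL"]
print("ROUGE-L F1:", score.fmeasure)
```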

Implications and Speculations

The implications of this research are multi-faceted:

  1. Practical Benefits: The approach is computationally less demanding than extending an LLM's tokenizer vocabulary and performing extensive pretraining on native script data. This has immediate practical benefits, as it allows existing English-heavy LLMs to be leveraged more effectively.
  2. Extended Capabilities: The approach enables cross-lingual transfer in decoder-only English-centric LLMs, a challenging yet crucial scenario given the dominance of English in LLM training corpora.
  3. Task Performance: The research also shows that romanization can enhance performance in generation tasks, which has been largely unexplored compared to understanding tasks.
  4. Future Directions: The study opens avenues for exploring deterministic and reversible transliteration schemes to mitigate the potential lossiness in transliterated outputs. Additionally, experiments with larger models and datasets could yield further insights into scaling the approach.

Conclusion

The paper provides a robust, empirically backed argument for romanization as a method to unlock the multilingual potential of English-centric LLMs efficiently. By demonstrating superior performance in both NLU and NLG tasks, the authors show that romanization can serve as an effective bridge between English and other languages. This research paves the way for future work in extending LLM capabilities to underrepresented languages using non-Roman scripts, thereby broadening the inclusivity of NLP technologies.
