Abstract

This study addresses the challenge of extending LLMs to non-English languages using non-Roman scripts. We propose an approach that utilizes the romanized form of text as an interface for LLMs, hypothesizing that its frequent informal use and shared tokens with English enhance cross-lingual alignment. Our approach involves the continual pretraining of an English LLM like Llama 2 on romanized text of non-English, non-Roman script languages, followed by instruction tuning on romanized data. The results indicate that romanized text not only reduces token fertility by 2x-4x but also matches or outperforms native script representation across various NLU, NLG, and MT tasks. Moreover, the embeddings computed on romanized text exhibit closer alignment with their English translations than those from the native script. Our approach presents a promising direction for leveraging the power of English LLMs in languages traditionally underrepresented in NLP.

Figure: Zero-shot English to Hindi translation with romanized and native script outputs.

Overview

  • The paper introduces a method for making LLMs more multilingual by using romanized text for non-English, non-Roman script languages, showcasing the efficiency and performance gains this approach can provide.

  • The methodology involves continual pretraining of an English LLM on romanized text, followed by instruction tuning on romanized data, leading to better cross-lingual alignment and stronger performance than native-script baselines across Natural Language Understanding and Generation tasks.

  • Empirical evidence from experiments on five Indic languages shows that models using romanized text match or outperform those using native scripts, indicating that romanization can significantly enhance the multilingual capabilities of LLMs.

RomanSetu: Efficiently Unlocking Multilingual Capabilities of LLMs Via Romanization

The paper "RomanSetu: Efficiently unlocking multilingual capabilities of LLMs via Romanization" presents an innovative approach aimed at extending the capabilities of LLMs to non-English languages that utilize non-Roman scripts. Authored by Jaavid Aktar Husain and his collaborators, the paper makes a compelling case for using romanized text as an interface to leverage the multilingual potential of English-centric LLMs.

Hypothesis and Methodology

The primary hypothesis driving this research is that the frequent informal use of romanized forms and their shared tokens with English can enhance cross-lingual alignment. The paper proposes a methodology with two stages: continual pretraining of an English LLM (such as Llama 2) on romanized text of non-English, non-Roman script languages, followed by instruction tuning on romanized instruction data.
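
To make the romanization interface concrete, the minimal sketch below converts a Devanagari sentence to a Roman-script form using the open-source indic_transliteration package with the ITRANS scheme; the paper's actual transliteration tooling and scheme are not specified here and may differ, and the example sentence is illustrative.

```python
# Minimal sketch of the romanization step, assuming the open-source
# `indic_transliteration` package and the ITRANS scheme; the paper's exact
# transliteration tooling may differ.
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

native = "भारत एक विशाल देश है"  # Hindi in Devanagari script (illustrative)

# Map the Devanagari characters to a Roman-script representation.
romanized = transliterate(native, sanscript.DEVANAGARI, sanscript.ITRANS)
print(romanized)  # roughly: "bhArata eka vishAla desha hai"
```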

Key Contributions

  1. Efficiency Gains: The study demonstrates that romanized text is significantly more efficient than native script text. Token fertility, the average number of tokens generated per word, is 2x-4x lower for romanized text than for native scripts, leading to faster inference and the ability to process longer sequences (illustrated in the first sketch after this list).
  2. Cross-lingual Alignment: Embeddings computed on romanized text align more closely with their English translations than embeddings computed on the native script. This improved alignment facilitates more effective cross-lingual transfer in LLMs (see the second sketch after this list).
  3. Empirical Results: The research shows that romanized text not only matches but often outperforms native script representation across various Natural Language Understanding (NLU), Natural Language Generation (NLG), and Machine Translation (MT) tasks. For example, in machine translation tasks from and to English, the models using romanized text consistently outperformed those using the native script.
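
The fertility comparison in item 1 can be measured as in the sketch below, which counts subword tokens per whitespace-separated word with the Llama 2 tokenizer; the checkpoint name and example sentences are assumptions for illustration, not the paper's exact setup.

```python
# Hedged sketch of a token-fertility measurement with the Llama 2 tokenizer
# (gated checkpoint; any SentencePiece-based LLM tokenizer works similarly).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

def fertility(text: str) -> float:
    """Average number of subword tokens produced per whitespace word."""
    return len(tokenizer.tokenize(text)) / len(text.split())

native = "भारत एक विशाल देश है"       # Hindi in Devanagari (illustrative)
roman = "bharat ek vishal desh hai"    # informal romanized form

print(f"native-script fertility: {fertility(native):.2f}")
print(f"romanized fertility:     {fertility(roman):.2f}")
# Devanagari text falls back to many byte-level tokens in Llama 2's
# vocabulary, so its fertility is typically several times higher.
```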

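The alignment comparison in item 2 can be probed along the lines of the sketch below: mean-pool the model's last hidden states for a sentence in native script, its romanized form, and its English translation, then compare cosine similarities. The pooling choice, layer, checkpoint, and sentences are assumptions; the paper's exact probing setup may differ.

```python
# Hedged sketch: compare how closely romanized vs. native-script embeddings
# align with the English translation, using mean-pooled hidden states.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # gated; any causal LM can stand in
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

def embed(text: str) -> torch.Tensor:
    """Mean-pooled last-hidden-state embedding for a single sentence."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)

english = "India is a vast country"  # illustrative sentences
native = "भारत एक विशाल देश है"
roman = "bharat ek vishal desh hai"

cos = torch.nn.functional.cosine_similarity
print("romanized vs. English:", cos(embed(roman), embed(english), dim=0).item())
print("native vs. English:   ", cos(embed(native), embed(english), dim=0).item())
```
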
Experimental Setup and Results

The authors conducted comprehensive experiments spanning multiple languages, benchmarks, and models. They experimented with five Indic languages (Hindi, Marathi, Gujarati, Tamil, and Malayalam), covering two language families and four different scripts. The Llama 2 model served as the base for all experiments.

Datasets and Metrics

  • Pretraining Data: Approximately 500 million words of document-level data sourced from web-crawled corpora for each language were used, along with English data.
  • Instruction Tuning Data: Roughly 120k examples per language were created by translating high-quality English supervised instruction-tuning datasets into the target languages.
  • Evaluation Metrics: The primary metrics were chrF for MT, ROUGE-L for summarization and headline generation, and accuracy/F1 scores for the various NLU tasks (see the sketch after this list).
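
The sketch below shows how these metrics are typically computed with the sacrebleu and rouge-score packages; the hypothesis and reference strings are placeholders, not drawn from the paper's evaluation data.

```python
# Hedged sketch of the evaluation metrics named above, assuming the
# sacrebleu and rouge-score packages; strings are illustrative placeholders.
from sacrebleu.metrics import CHRF
from rouge_score import rouge_scorer

hypotheses = ["bhaarat ek vishaal desh hai"]
references = [["bharat ek vishal desh hai"]]  # one reference stream

# chrF (character n-gram F-score) for machine translation.
chrf = CHRF()
print("chrF:", chrf.corpus_score(hypotheses, references).score)

# ROUGE-L for summarization and headline generation.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)
score = scorer.score(references[0][0], hypotheses[0])["rougeL"]
print("ROUGE-L F1:", score.fmeasure)
```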

Implications and Speculations

The implications of this research are multi-faceted:

  1. Practical Benefits: The approach is computationally less demanding than extending an LLM's tokenizer vocabulary and performing extensive pretraining on native script data. This has immediate practical benefits, as it allows existing English-heavy LLMs to be leveraged more effectively.
  2. Extended Capabilities: The approach enables cross-lingual transfer in decoder-only English-centric LLMs, a challenging yet crucial scenario given the dominance of English in LLM training corpora.
  3. Task Performance: The research also shows that romanization can enhance performance in generation tasks, which has been largely unexplored compared to understanding tasks.
  4. Future Directions: The study opens avenues for exploring deterministic and reversible transliteration schemes to mitigate the potential lossiness in transliterated outputs. Additionally, experiments with larger models and datasets could yield further insights into scaling the approach.

Conclusion

The paper provides a robust, empirically backed argument for romanization as a method to unlock the multilingual potential of English-centric LLMs efficiently. By demonstrating superior performance in both NLU and NLG tasks, the authors show that romanization can serve as an effective bridge between English and other languages. This research paves the way for future work in extending LLM capabilities to underrepresented languages using non-Roman scripts, thereby broadening the inclusivity of NLP technologies.
