- The paper demonstrates that romanized text reduces token usage by 2-4x and improves cross-lingual alignment compared to native scripts.
- The paper introduces a two-stage method combining continual pretraining and instruction tuning on romanized data to enhance multilingual capabilities.
- The paper’s empirical results show that romanized inputs match or outperform native-script inputs on NLU, NLG, and MT tasks, supporting the efficiency of the romanization approach.
RomanSetu: Efficiently Unlocking Multilingual Capabilities of LLMs Via Romanization
The paper "RomanSetu: Efficiently unlocking multilingual capabilities of LLMs via Romanization" presents an innovative approach aimed at extending the capabilities of LLMs to non-English languages that utilize non-Roman scripts. Authored by Jaavid Aktar Husain and his collaborators, the paper makes a compelling case for using romanized text as an interface to leverage the multilingual potential of English-centric LLMs.
Overview
The primary hypothesis driving this research is that romanized text, which is already common in informal usage and shares many subword tokens with English, can improve cross-lingual alignment. The paper proposes a two-stage methodology: continual pretraining of an English-centric LLM (such as Llama 2) on romanized text of non-English, non-Roman-script languages, followed by instruction tuning on romanized data.
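To make the pipeline concrete, the snippet below sketches the romanization step. The `indic_transliteration` package and the ITRANS scheme are illustrative stand-ins; the paper's own transliteration tooling and scheme are not reproduced here.

```python
# Minimal sketch of the romanization step, assuming the open-source
# `indic_transliteration` package; the paper's actual transliteration
# pipeline may differ.
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

def romanize(text: str, script: str = sanscript.DEVANAGARI) -> str:
    """Convert native-script text into a Roman (ITRANS) representation."""
    return transliterate(text, script, sanscript.ITRANS)

# Stage 1: romanize the continual-pretraining corpus (document level).
native_docs = ["भारत एक विशाल देश है।"]
romanized_docs = [romanize(doc) for doc in native_docs]

# Stage 2: romanize instruction-tuning examples the same way.
example = {"instruction": "इस वाक्य का अनुवाद कीजिए।", "output": "यह एक उदाहरण है।"}
romanized_example = {k: romanize(v) for k, v in example.items()}

print(romanized_docs[0])
print(romanized_example["instruction"])
```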
Key Contributions
- Efficiency Gains: The paper demonstrates that romanized text is significantly more efficient than native-script text. Token fertility, the average number of subword tokens produced per word, is 2x-4x lower for romanized text than for native scripts, which translates into faster inference and the ability to process longer sequences (a measurement sketch follows this list).
- Cross-lingual Alignment: Embeddings computed on romanized text align more closely with their English translations than embeddings computed on native-script text. This improved alignment facilitates more effective cross-lingual transfer in LLMs.
- Empirical Results: The research shows that romanized text not only matches but often outperforms native-script representations across Natural Language Understanding (NLU), Natural Language Generation (NLG), and Machine Translation (MT) tasks. For example, in translation both into and out of English, models using romanized text consistently outperformed those using the native script.
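As a concrete illustration of the efficiency claim, the sketch below estimates token fertility for a native-script sentence and its romanized counterpart using the Hugging Face `transformers` tokenizer for Llama 2. The sample strings, the model identifier, and the whitespace-based word count are illustrative assumptions, not the paper's measurement protocol.

```python
# Hedged sketch: measuring token fertility (subword tokens per word) for
# native-script vs. romanized text with a Llama 2 tokenizer. The sample
# sentences and model ID are illustrative only.
from transformers import AutoTokenizer

def fertility(tokenizer, text: str) -> float:
    """Average number of subword tokens per whitespace-separated word."""
    n_words = max(len(text.split()), 1)
    n_tokens = len(tokenizer.tokenize(text))
    return n_tokens / n_words

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # gated; requires access

native = "भारत एक विशाल देश है।"
romanized = "bhArata eka vishAla desha hai."

print(f"native fertility:    {fertility(tok, native):.2f}")
print(f"romanized fertility: {fertility(tok, romanized):.2f}")
```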
Experimental Setup and Results
The authors conducted comprehensive experiments spanning multiple languages, benchmarks, and models. They experimented with five Indic languages (Hindi, Marathi, Gujarati, Tamil, and Malayalam), covering two language families and four different scripts. Llama 2 served as the base model for all experiments.
Datasets and Metrics
- Pretraining Data: Approximately 500 million words of document-level text per language, sourced from web-crawled corpora, were used alongside English data.
- Instruction Tuning Data: 120k examples per language were created by translating high-quality English instruction-tuning datasets.
- Evaluation Metrics: The primary metrics used were chrF for MT, ROUGE-L for summarization and headline generation, and accuracy/F1 scores for the various NLU tasks.
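For readers who want to reproduce this kind of evaluation, the snippet below computes chrF and ROUGE-L with standard open-source implementations (`sacrebleu` and `rouge-score`). The example hypotheses and references are placeholders; the paper does not specify its exact evaluation scripts, so these libraries are only a plausible choice.

```python
# Illustrative metric computation with standard libraries; treat this as a
# plausible setup rather than the paper's own evaluation pipeline.
import sacrebleu
from rouge_score import rouge_scorer

# chrF for machine translation (one hypothesis, one reference stream).
hypotheses = ["The cat sits on the mat."]
references = [["The cat is sitting on the mat."]]
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"chrF: {chrf.score:.2f}")

# ROUGE-L for summarization / headline generation.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
scores = scorer.score(
    target="The cat is sitting on the mat.",   # reference
    prediction="The cat sits on the mat.",     # system output
)
print(f"ROUGE-L F1: {scores['rougeL'].fmeasure:.3f}")
```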
Implications and Speculations
The implications of this research are multi-faceted:
- Practical Benefits: The approach is computationally less demanding than extending an LLM's tokenizer vocabulary and performing extensive pretraining on native-script data. This has immediate practical benefits, as it allows existing English-heavy LLMs to be leveraged more effectively.
- Extended Capabilities: The approach enables cross-lingual transfer in decoder-only English-centric LLMs, a challenging yet crucial scenario given the dominance of English in LLM training corpora.
- Task Performance: The research also shows that romanization can enhance performance in generation tasks, which has been largely unexplored compared to understanding tasks.
- Future Directions: The paper opens avenues for exploring deterministic and reversible transliteration schemes to mitigate the potential lossiness of transliterated outputs (a toy round-trip check is sketched below). Additionally, experiments with larger models and datasets could yield further insights into scaling the approach.
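The round-trip sketch below assumes a (near-)lossless scheme such as IAST and again uses the `indic_transliteration` package for illustration; schemes that are convenient for LLM input are not necessarily reversible, which is exactly the concern raised above.

```python
# Toy round-trip check, assuming an IAST-style mapping via the open-source
# `indic_transliteration` package; many practical romanization schemes do
# not round-trip exactly, which is the lossiness concern discussed above.
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

original = "भारत एक विशाल देश है"
roman = transliterate(original, sanscript.DEVANAGARI, sanscript.IAST)
restored = transliterate(roman, sanscript.IAST, sanscript.DEVANAGARI)

print(roman)                  # e.g. "bhārata eka viśāla deśa hai"
print(restored == original)   # True only if the mapping is lossless for this text
```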
Conclusion
The paper provides a robust, empirically backed argument for romanization as a way to unlock the multilingual potential of English-centric LLMs efficiently. By demonstrating competitive or superior performance across NLU, NLG, and MT tasks, the authors show that romanization can serve as an effective bridge between English and other languages. This research paves the way for future work on extending LLM capabilities to underrepresented languages written in non-Roman scripts, thereby broadening the inclusivity of NLP technologies.