XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models

Abstract

Large multilingual language models typically rely on a single vocabulary shared across 100+ languages. As these models have increased in parameter count and depth, vocabulary size has remained largely unchanged. This "vocabulary bottleneck" limits the representational capabilities of multilingual models like XLM-R. In this paper, we introduce a new approach for scaling to very large multilingual vocabularies by de-emphasizing token sharing between languages with little lexical overlap and assigning vocabulary capacity to achieve sufficient coverage for each individual language. Tokenizations using our vocabulary are typically more semantically meaningful and shorter than those produced by XLM-R. Leveraging this improved vocabulary, we train XLM-V, a multilingual language model with a one-million-token vocabulary. XLM-V outperforms XLM-R on every task we tested, ranging from natural language inference (XNLI) and question answering (MLQA, XQuAD, TyDiQA) to named entity recognition (WikiAnn). XLM-V is particularly effective on low-resource language tasks, outperforming XLM-R by 11.2% and 5.8% absolute on MasakhaNER and Americas NLI, respectively.
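The approach described in the abstract lends itself to a short illustration. The Python sketch below shows one way such a vocabulary could be assembled: train a small SentencePiece "probe" vocabulary for each language, cluster languages by the lexical overlap of those probe vocabularies so that languages sharing few tokens do not compete for the same vocabulary slots, give each cluster a slice of the one-million-token budget, and take the union of the per-cluster vocabularies. The corpus paths, the Jaccard overlap metric, the proportional-to-corpus-size capacity rule, and every size constant are illustrative assumptions; the abstract does not spell out the paper's exact clustering or allocation procedure.

```python
# Minimal sketch of lexical-overlap-aware vocabulary construction, assuming
# one monolingual text file per language. The probe-vocabulary trick, the
# Jaccard-overlap clustering, the proportional-to-corpus-size capacity rule,
# and every path and size constant here are illustrative assumptions, not
# the paper's exact recipe.
import os

import numpy as np
import sentencepiece as spm
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

CORPORA = {"en": "data/en.txt", "sw": "data/sw.txt", "zh": "data/zh.txt"}  # hypothetical paths
TOTAL_VOCAB = 1_000_000   # target size of the merged vocabulary
PROBE_VOCAB = 30_000      # size of each per-language probe vocabulary
N_CLUSTERS = 2            # illustrative; a real run covers far more languages


def train_spm(input_path, prefix, vocab_size):
    """Train a SentencePiece model and return its token set."""
    spm.SentencePieceTrainer.train(
        input=input_path, model_prefix=prefix,
        vocab_size=vocab_size, character_coverage=0.9995)
    # SentencePiece writes "<prefix>.vocab" with one "token<TAB>score" per line.
    with open(f"{prefix}.vocab", encoding="utf-8") as f:
        return {line.split("\t")[0] for line in f}


# 1. Train a small probe vocabulary for every language in isolation.
probe = {lang: train_spm(path, f"probe_{lang}", PROBE_VOCAB)
         for lang, path in CORPORA.items()}

# 2. Cluster languages by lexical overlap (Jaccard distance between probe
#    vocabularies) so that languages sharing few tokens land in different
#    clusters and never compete for the same vocabulary slots.
langs = list(CORPORA)
dist = np.zeros((len(langs), len(langs)))
for i, a in enumerate(langs):
    for j, b in enumerate(langs):
        overlap = len(probe[a] & probe[b]) / len(probe[a] | probe[b])
        dist[i, j] = 1.0 - overlap
labels = fcluster(linkage(squareform(dist, checks=False), method="average"),
                  t=N_CLUSTERS, criterion="maxclust")

# 3. Give each cluster a share of the one-million-token budget (here simply
#    proportional to corpus size) and train one SentencePiece model per cluster.
total_bytes = sum(os.path.getsize(p) for p in CORPORA.values())
final_vocab = set()
for c in sorted(set(labels)):
    members = [lang for lang, lab in zip(langs, labels) if lab == c]
    cluster_bytes = sum(os.path.getsize(CORPORA[lang]) for lang in members)
    budget = max(PROBE_VOCAB, int(TOTAL_VOCAB * cluster_bytes / total_bytes))
    merged_corpus = f"cluster_{c}.txt"
    with open(merged_corpus, "w", encoding="utf-8") as out:
        for lang in members:
            with open(CORPORA[lang], encoding="utf-8") as src:
                out.write(src.read())
    final_vocab |= train_spm(merged_corpus, f"cluster_{c}", budget)

print(f"merged vocabulary size: {len(final_vocab)}")
```

In this sketch, lexically similar languages within a cluster can still share tokens, while unrelated languages draw on separate budgets, which is what allows the merged vocabulary to grow toward one million entries without leaving any single language under-covered.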


References
  1. MasakhaNER: Named Entity Recognition for African Languages. Transactions of the Association for Computational Linguistics, 9:1116–1131.
  2. On the Cross-lingual Transferability of Monolingual Representations
  3. Adaptive Input Representations for Neural Language Modeling
  4. Conditional Computation in Neural Networks for faster models
  5. Improving Multilingual Models with Language-Clustered Vocabularies. EMNLP.
  6. TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages. Transactions of the Association for Computational Linguistics, 8:454–470.
  7. CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation. Transactions of the Association for Computational Linguistics, 10:73–91.
  8. Unsupervised Cross-lingual Representation Learning at Scale
  9. XNLI: Evaluating Cross-lingual Sentence Representations
  10. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  11. AmericasNLI: Evaluating Zero-shot Natural Language Understanding of Pretrained Multilingual Models in Truly Low-resource Languages
  12. Larger-Scale Transformers for Multilingual Masked Language Modeling
  13. The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation. Transactions of the Association for Computational Linguistics, 10:522–538.
  14. XTREME: A Massively Multilingual Multi-Task Benchmark for Evaluating Cross-Lingual Generalisation. In International Conference on Machine Learning, pages 4411–4421. PMLR.
  15. Efficient Softmax Approximation for GPUs. In International Conference on Machine Learning, pages 1302–1310. PMLR.
  16. Adam: A Method for Stochastic Optimization
  17. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
  18. Cross-lingual Language Model Pretraining
  19. MLQA: Evaluating Cross-lingual Extractive Question Answering
  20. Few-shot Learning with Multilingual Language Models
  21. Multilingual Denoising Pre-training for Neural Machine Translation. Transactions of the Association for Computational Linguistics, 8:726–742.
  22. Decoupled Weight Decay Regularization
  23. fairseq: A Fast, Extensible Toolkit for Sequence Modeling
  24. Cross-lingual Name Tagging and Linking for 282 Languages. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1946–1958.
  25. Massively Multilingual Transfer for NER
  26. SQuAD: 100,000+ Questions for Machine Comprehension of Text
  27. How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models
  28. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
  29. Andrew Viterbi. 1967. Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm. IEEE Transactions on Information Theory, 13(2):260–269.
  30. Improving Pre-Trained Multilingual Models with Vocabulary Expansion
  31. Wikipedia. 2023. Table of General Standard Chinese Characters. Wikipedia, The Free Encyclopedia. http://en.wikipedia.org/w/index.php?title=Table%20of%20General%20Standard%20Chinese%20Characters&oldid=1123968033. [Online; accessed 05-January-2023].
  32. The Multi-Genre NLI Corpus
  33. ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models. Transactions of the Association for Computational Linguistics, 10:291–306.
  34. mT5: A massively multilingual pre-trained text-to-text transformer
  35. Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training. EMNLP.
