Accelerating Multilingual Language Model for Excessively Tokenized Languages (2401.10660v2)
Abstract: Recent advancements in large language models (LLMs) have remarkably enhanced performance on a variety of tasks in multiple languages. However, tokenizers in LLMs trained primarily on English-centric corpora often overly fragment text in non-Roman alphabetic languages into character- or Unicode-level tokens, leading to inefficient text generation. We introduce a simple yet effective framework to accelerate text generation in such languages. Our approach attaches a new language model head, with a vocabulary tailored to a specific target language, to a pre-trained LLM, and then fine-tunes the new head while incorporating a verification step to ensure the model's performance is preserved. We show that this targeted fine-tuning, with all other model parameters frozen, effectively reduces token fragmentation for the target language. Extensive experiments demonstrate that the proposed framework increases generation speed by a factor of 1.7 while maintaining the performance of pre-trained multilingual models on target monolingual tasks.
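The core idea is straightforward to sketch: attach a smaller, language-specific output head to a frozen pre-trained model and train only that head. The snippet below is a minimal illustration of that setup, not the authors' implementation; it assumes a PyTorch / Hugging Face Transformers environment, and the model name, target vocabulary size, and learning rate are placeholder assumptions (the construction of the target-language vocabulary and the verification step are out of scope here).

```python
# Minimal sketch (not the paper's code): replace the output head with one sized
# for a target-language vocabulary and fine-tune only that head, keeping every
# pre-trained parameter frozen. Model name, vocab size, and lr are placeholders.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Freeze all pre-trained parameters so the base model is left untouched.
for p in model.parameters():
    p.requires_grad = False

# New output head over the target-language vocabulary (size is hypothetical).
hidden_size = model.config.hidden_size
target_vocab_size = 32_000
model.lm_head = nn.Linear(hidden_size, target_vocab_size, bias=False)

# Only the new head's parameters receive gradients and enter the optimizer.
optimizer = torch.optim.AdamW(model.lm_head.parameters(), lr=1e-4)
```

Because only the head is updated, the fine-tuning cost stays small, and pairing it with the verification step described in the abstract limits the risk of degrading the base model's behavior.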