Accelerating Multilingual Language Model for Excessively Tokenized Languages (2401.10660v2)

Published 19 Jan 2024 in cs.CL and cs.AI

Abstract: Recent advancements in LLMs have remarkably improved performance on a variety of tasks in multiple languages. However, tokenizers in LLMs trained primarily on English-centric corpora often overly fragment text in non-Roman alphabetic languages into character- or Unicode-level tokens, leading to inefficient text generation. We introduce a simple yet effective framework to accelerate text generation in such languages. Our approach involves employing a new LLM head with a vocabulary set tailored to a specific target language for a pre-trained LLM. This is followed by fine-tuning the new head while incorporating a verification step to ensure the model's performance is preserved. We show that this targeted fine-tuning, while freezing other model parameters, effectively reduces token fragmentation for the target language. Our extensive experiments demonstrate that the proposed framework increases the generation speed by a factor of 1.7 while maintaining the performance of pre-trained multilingual models on target monolingual tasks.
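
The recipe described in the abstract (a new LM head over a target-language vocabulary, trained on top of a frozen pre-trained backbone) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the model id, the vocabulary size, and the label alignment are placeholder assumptions, and the verification step that preserves the original model's behavior is omitted.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM

# Load a pre-trained multilingual backbone and freeze all of its parameters.
# "meta-llama/Llama-2-7b-hf" is a placeholder model id, not necessarily the paper's choice.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
for p in base.parameters():
    p.requires_grad = False
base.eval()

hidden_size = base.config.hidden_size
target_vocab_size = 32_000  # hypothetical size of the target-language vocabulary

# New LM head projecting hidden states onto the target-language vocabulary.
new_head = nn.Linear(hidden_size, target_vocab_size, bias=False)
optimizer = torch.optim.AdamW(new_head.parameters(), lr=1e-4)

def training_step(input_ids: torch.LongTensor, target_labels: torch.LongTensor) -> float:
    """One update in which only `new_head` receives gradients.

    `target_labels` are ids in the new target-language vocabulary, assumed here to be
    already aligned with the input positions; the paper's alignment and verification
    machinery is omitted from this sketch.
    """
    with torch.no_grad():  # frozen backbone: no gradient bookkeeping needed
        hidden = base(input_ids=input_ids, output_hidden_states=True).hidden_states[-1]
    logits = new_head(hidden)  # [batch, seq_len, target_vocab_size]
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, target_vocab_size), target_labels.reshape(-1)
    )
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Because only the new head receives gradients, the fine-tuning stage stays cheap relative to full fine-tuning, consistent with the abstract's statement that all other model parameters remain frozen.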
