Poro 34B and the Blessing of Multilinguality (2404.01856v3)
Abstract: The pretraining of state-of-the-art LLMs now requires trillions of words of text, which is orders of magnitude more than available for the vast majority of languages. While including text in more than one language is an obvious way to acquire more pretraining data, multilinguality is often seen as a curse, and most model training efforts continue to focus near-exclusively on individual large languages. We believe that multilinguality can be a blessing: when the lack of training data is a constraint for effectively training larger models for a target language, augmenting the dataset with other languages can offer a way to improve over the capabilities of monolingual models for that language. In this study, we introduce Poro 34B, a 34 billion parameter model trained for 1 trillion tokens of Finnish, English, and programming languages, and demonstrate that a multilingual training approach can produce a model that substantially advances over the capabilities of existing models for Finnish and excels in translation, while also achieving competitive performance in its class for English and programming languages. We release the model parameters, scripts, and data under open licenses at https://huggingface.co/LumiOpen/Poro-34B.
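The released checkpoint referenced above is distributed through the Hugging Face Hub. As an illustration only, the following is a minimal sketch of loading and prompting it with the standard Hugging Face transformers API; the model identifier comes from the release URL, while the dtype and device settings are assumptions for fitting a 34B-parameter model, not recommendations from the paper.

```python
# Minimal sketch (assumed standard transformers usage, not an official script)
# for trying the released Poro 34B checkpoint: https://huggingface.co/LumiOpen/Poro-34B
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LumiOpen/Poro-34B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumption: bf16 to reduce memory footprint
    device_map="auto",           # assumption: shard layers across available GPUs (requires accelerate)
)

prompt = "Suomen pääkaupunki on"  # Finnish: "The capital of Finland is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```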