Poro 34B and the Blessing of Multilinguality (2404.01856v3)
Abstract: The pretraining of state-of-the-art LLMs now requires trillions of words of text, which is orders of magnitude more than available for the vast majority of languages. While including text in more than one language is an obvious way to acquire more pretraining data, multilinguality is often seen as a curse, and most model training efforts continue to focus near-exclusively on individual large languages. We believe that multilinguality can be a blessing: when the lack of training data is a constraint for effectively training larger models for a target language, augmenting the dataset with other languages can offer a way to improve over the capabilities of monolingual models for that language. In this study, we introduce Poro 34B, a 34 billion parameter model trained for 1 trillion tokens of Finnish, English, and programming languages, and demonstrate that a multilingual training approach can produce a model that substantially advances over the capabilities of existing models for Finnish and excels in translation, while also achieving competitive performance in its class for English and programming languages. We release the model parameters, scripts, and data under open licenses at https://huggingface.co/LumiOpen/Poro-34B.
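The released checkpoint referenced above is distributed through the Hugging Face Hub. As an illustration only, the following is a minimal sketch of loading and prompting it with the standard Hugging Face transformers API; the model identifier comes from the release URL, while the dtype and device settings are assumptions for fitting a 34B-parameter model, not recommendations from the paper.

```python
# Minimal sketch (assumed standard transformers usage, not an official script)
# for trying the released Poro 34B checkpoint: https://huggingface.co/LumiOpen/Poro-34B
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LumiOpen/Poro-34B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumption: bf16 to reduce memory footprint
    device_map="auto",           # assumption: shard layers across available GPUs (requires accelerate)
)

prompt = "Suomen pääkaupunki on"  # Finnish: "The capital of Finland is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```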