
Abstract

Recent breakthroughs in LLMs have centered around a handful of data-rich languages. What does it take to broaden access to breakthroughs beyond first-class citizen languages? Our work introduces Aya, a massively multilingual generative language model that follows instructions in 101 languages, of which over 50% are considered lower-resourced. Aya outperforms mT0 and BLOOMZ on the majority of tasks while covering double the number of languages. We introduce extensive new evaluation suites that broaden the state-of-the-art for multilingual evaluation across 99 languages -- including discriminative and generative tasks, human evaluation, and simulated win rates that cover both held-out tasks and in-distribution performance. Furthermore, we conduct detailed investigations on the optimal finetuning mixture composition, data pruning, as well as the toxicity, bias, and safety of our models. We open-source our instruction datasets and our model at https://hf.co/CohereForAI/aya-101

Aya pairs instruction finetuning (IFT) of a 13B-parameter mT5 model with careful dataset weighting and extensive performance evaluation.

Overview

  • Aya is a novel, open-source multilingual LLM covering 101 languages, aiming to close the performance gap for non-dominant languages.

  • Built on the 13B-parameter mT5 model and trained on extensive instruction datasets, it is instruction-finetuned to perform well across a wide spectrum of languages.

  • An evaluation suite provides insights into the model's capabilities in various linguistic tasks, highlighting its performance in seen and unseen scenarios.

  • Addressing biases, risks, and limitations is a core aspect of development, with the model's open-source nature inviting global collaboration for ongoing improvements.

Multilingual Instruction: Advancing the State-of-the-Art with Aya Model

Introduction

LLMs have primarily benefited a handful of languages, leaving a wide gap in performance and accessibility for most of the world's languages. Aya aims to bridge this gap: it is an open-source, instruction-finetuned, massively multilingual LLM covering an unprecedented 101 languages.

Training Data and Process

Datasets

The Aya Model is built upon extensive datasets including xP3x, Aya Collection, Aya Dataset, and the Data Provenance collection, among others. It significantly expands language coverage and incorporates detailed pruning processes to ensure data quality. The training phase employs a mixture of these datasets, focusing on diversity in language, task, and complexity.
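
The exact mixture proportions used for Aya are detailed in the paper rather than here; the sketch below only illustrates the general mechanics of weighted sampling across several instruction datasets, with the dataset names taken from the list above and the weights chosen purely for illustration.

```python
import random

# Hypothetical per-dataset sampling weights; these are NOT the proportions
# used to train Aya, only placeholders for illustration.
MIXTURE = {
    "xP3x": 0.40,
    "aya_collection": 0.35,
    "aya_dataset": 0.15,
    "data_provenance": 0.10,
}

def sample_batch(datasets, mixture, batch_size, seed=0):
    """Pick a dataset according to its weight, then draw one example
    uniformly from it, repeating until the batch is full."""
    rng = random.Random(seed)
    names = list(mixture)
    weights = [mixture[n] for n in names]
    batch = []
    for _ in range(batch_size):
        name = rng.choices(names, weights=weights, k=1)[0]
        batch.append(datasets[name][rng.randrange(len(datasets[name]))])
    return batch

# Toy usage with placeholder examples.
datasets = {name: [f"{name} example {i}" for i in range(100)] for name in MIXTURE}
print(sample_batch(datasets, MIXTURE, batch_size=4))
```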

Training Details

The model leverages the 13B parameter mT5 model as its foundation, benefiting from mT5's robust pretraining on multilingual data. With a training budget of 25M samples, the instruction-finetuning process focuses on maximizing coverage and performance across the included languages, facilitated by sophisticated data sampling and weighting strategies.
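
The paper's actual training setup (the 25M-sample budget, sampling and weighting strategies) is not reproduced here; as a rough illustration of instruction-finetuning an mT5-style encoder-decoder model, the sketch below runs a single gradient step on a placeholder example, substituting the small mT5 checkpoint so it fits on modest hardware.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Scaled-down sketch of seq2seq instruction finetuning. Aya finetunes the
# 13B mT5-XXL checkpoint; "google/mt5-small" is substituted here only so the
# sketch runs on modest hardware. Data and hyperparameters are placeholders.
model_name = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

examples = [("Translate to French: Good morning.", "Bonjour.")]  # placeholder pair

model.train()
for prompt, target in examples:
    inputs = tokenizer(prompt, return_tensors="pt")
    labels = tokenizer(text_target=target, return_tensors="pt").input_ids
    loss = model(**inputs, labels=labels).loss  # cross-entropy on the target tokens
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```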

Evaluation Suite

A comprehensive evaluation suite has been developed to test the model's capabilities across various dimensions. This suite includes unseen discriminative tasks, generative tasks, and novel benchmarks like Multilingual MMLU, providing a thorough overview of the model's performance in both seen and unseen linguistic scenarios. Additionally, human and LLM preference evaluations offer insights into the model's qualitative performance and relative standing against existing models.
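
Preference-based comparisons reduce to a simple win-rate statistic: for each prompt, a judge (human or LLM) picks the preferred of two models' responses, and the win rate is the fraction of comparisons a model wins, with ties counted as half. A minimal sketch, with the judgment labels as placeholders:

```python
from collections import Counter

def win_rate(judgments):
    """Model A's win rate over model B from pairwise preference judgments.

    Each judgment is 'A', 'B', or 'tie'; a tie counts as half a win for each side.
    """
    counts = Counter(judgments)
    total = sum(counts.values())
    if total == 0:
        return float("nan")
    return (counts["A"] + 0.5 * counts["tie"]) / total

# Toy example: 6 wins, 3 losses, 1 tie -> 0.65 win rate for model A.
print(win_rate(["A"] * 6 + ["B"] * 3 + ["tie"]))
```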

Bias, Risks, and Limitation Analysis

Critical to the development of the Aya Model is a conscientious approach to addressing the biases, risks, and limitations inherent in multilingual LLMs. Through targeted safety mitigation and a detailed examination of toxicity and bias across languages and contexts, the project highlights the importance of ethical considerations in LLM development. Despite these efforts, open challenges remain, such as capturing sociolinguistic nuance and aligning model values and behavior across diverse languages, underscoring the complexity of creating truly inclusive and fair language models.
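
The paper's toxicity analysis relies on external classifiers that are not shown here; the sketch below only illustrates the bookkeeping of such an analysis, computing the fraction of generations per language that a hypothetical scorer flags as toxic.

```python
def toxicity_rates(generations_by_language, score_toxicity, threshold=0.5):
    """Fraction of generations per language flagged as toxic.

    `score_toxicity` is a hypothetical callable standing in for an external
    toxicity classifier; it is not part of the Aya release.
    """
    rates = {}
    for lang, texts in generations_by_language.items():
        if not texts:
            continue
        flagged = sum(score_toxicity(text) > threshold for text in texts)
        rates[lang] = flagged / len(texts)
    return rates

# Toy usage with a trivially fake scorer that flags texts containing "BAD".
fake_scorer = lambda text: 1.0 if "BAD" in text else 0.0
print(toxicity_rates({"en": ["fine", "BAD output"], "fr": ["ça va"]}, fake_scorer))
```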

Model Version and Maintenance

The Aya Model is actively maintained, with its initial release in February 2024. The project team commits to regular updates and improvements, reflecting ongoing research and feedback from the broader community. The open-source nature of the model invites collaboration and contributions, setting a new standard for transparency and inclusivity in the field of language model research.
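
Since the released checkpoint (linked in the abstract) is an mT5-based encoder-decoder model, it can be loaded as a standard seq2seq model with the transformers library; the prompt and generation settings below are illustrative only.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Aya is an instruction-finetuned mT5 (encoder-decoder), so the released
# checkpoint loads as a standard seq2seq model.
checkpoint = "CohereForAI/aya-101"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# The prompt and generation settings are placeholders, not recommendations.
inputs = tokenizer("Translate to Turkish: How are you today?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```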

Conclusion

The Aya Model represents a significant advancement in the effort to democratize access to state-of-the-art language technologies. By substantially increasing the number of languages covered and incorporating ethical considerations throughout its development process, the Aya Model paves the way for more equitable advancements in NLP. Its open-source release not only facilitates immediate access and utility but also encourages ongoing collaboration and innovation within the global research community.
