
Abstract

Recent breakthroughs in LLMs have centered around a handful of data-rich languages. What does it take to broaden access to breakthroughs beyond first-class citizen languages? Our work introduces Aya, a massively multilingual generative language model that follows instructions in 101 languages, of which over 50% are considered lower-resourced. Aya outperforms mT0 and BLOOMZ on the majority of tasks while covering double the number of languages. We introduce extensive new evaluation suites that broaden the state-of-the-art for multilingual evaluation across 99 languages -- including discriminative and generative tasks, human evaluation, and simulated win rates that cover both held-out tasks and in-distribution performance. Furthermore, we conduct detailed investigations on the optimal finetuning mixture composition, data pruning, as well as the toxicity, bias, and safety of our models. We open-source our instruction datasets and our model at https://hf.co/CohereForAI/aya-101

Aya pairs instruction finetuning (IFT) of a 13B-parameter mT5 model with careful dataset weighting and extensive performance evaluation.

Overview

  • Aya is a novel, open-source multilingual LLM covering 101 languages, aiming to close the performance gap for non-dominant languages.

  • Built on the 13B-parameter mT5 model and trained on extensive instruction datasets, it is instruction-finetuned to perform well across a wide spectrum of languages.

  • An evaluation suite provides insights into the model's capabilities in various linguistic tasks, highlighting its performance in seen and unseen scenarios.

  • Addressing biases, risks, and limitations is a core aspect of development, with the model's open-source nature inviting global collaboration for ongoing improvements.

Multilingual Instruction: Advancing the State-of-the-Art with Aya Model

Introduction

LLMs have primarily benefited a handful of languages, leaving a wide gap in performance and accessibility for most of the world's languages. Aya aims to bridge this gap: it is an open-source, instruction-finetuned, massively multilingual LLM covering an unprecedented 101 languages.

Training Data and Process

Datasets

The Aya Model is built upon extensive datasets including xP3x, Aya Collection, Aya Dataset, and the Data Provenance collection, among others. It significantly expands language coverage and incorporates detailed pruning processes to ensure data quality. The training phase employs a mixture of these datasets, focusing on diversity in language, task, and complexity.
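
The exact mixture proportions used for Aya are detailed in the paper rather than here; the sketch below only illustrates the general mechanics of weighted sampling across several instruction datasets, with the dataset names taken from the list above and the weights chosen purely for illustration.

```python
import random

# Hypothetical per-dataset sampling weights; these are NOT the proportions
# used to train Aya, only placeholders for illustration.
MIXTURE = {
    "xP3x": 0.40,
    "aya_collection": 0.35,
    "aya_dataset": 0.15,
    "data_provenance": 0.10,
}

def sample_batch(datasets, mixture, batch_size, seed=0):
    """Pick a dataset according to its weight, then draw one example
    uniformly from it, repeating until the batch is full."""
    rng = random.Random(seed)
    names = list(mixture)
    weights = [mixture[n] for n in names]
    batch = []
    for _ in range(batch_size):
        name = rng.choices(names, weights=weights, k=1)[0]
        batch.append(datasets[name][rng.randrange(len(datasets[name]))])
    return batch

# Toy usage with placeholder examples.
datasets = {name: [f"{name} example {i}" for i in range(100)] for name in MIXTURE}
print(sample_batch(datasets, MIXTURE, batch_size=4))
```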

Training Details

The model leverages the 13B parameter mT5 model as its foundation, benefiting from mT5's robust pretraining on multilingual data. With a training budget of 25M samples, the instruction-finetuning process focuses on maximizing coverage and performance across the included languages, facilitated by sophisticated data sampling and weighting strategies.
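
The paper's actual training setup (the 25M-sample budget, sampling and weighting strategies) is not reproduced here; as a rough illustration of instruction-finetuning an mT5-style encoder-decoder model, the sketch below runs a single gradient step on a placeholder example, substituting the small mT5 checkpoint so it fits on modest hardware.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Scaled-down sketch of seq2seq instruction finetuning. Aya finetunes the
# 13B mT5-XXL checkpoint; "google/mt5-small" is substituted here only so the
# sketch runs on modest hardware. Data and hyperparameters are placeholders.
model_name = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

examples = [("Translate to French: Good morning.", "Bonjour.")]  # placeholder pair

model.train()
for prompt, target in examples:
    inputs = tokenizer(prompt, return_tensors="pt")
    labels = tokenizer(text_target=target, return_tensors="pt").input_ids
    loss = model(**inputs, labels=labels).loss  # cross-entropy on the target tokens
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```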

Evaluation Suite

A comprehensive evaluation suite has been developed to test the model's capabilities across various dimensions. This suite includes unseen discriminative tasks, generative tasks, and novel benchmarks like Multilingual MMLU, providing a thorough overview of the model's performance in both seen and unseen linguistic scenarios. Additionally, human and LLM preference evaluations offer insights into the model's qualitative performance and relative standing against existing models.
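
Preference-based comparisons reduce to a simple win-rate statistic: for each prompt, a judge (human or LLM) picks the preferred of two models' responses, and the win rate is the fraction of comparisons a model wins, with ties counted as half. A minimal sketch, with the judgment labels as placeholders:

```python
from collections import Counter

def win_rate(judgments):
    """Model A's win rate over model B from pairwise preference judgments.

    Each judgment is 'A', 'B', or 'tie'; a tie counts as half a win for each side.
    """
    counts = Counter(judgments)
    total = sum(counts.values())
    if total == 0:
        return float("nan")
    return (counts["A"] + 0.5 * counts["tie"]) / total

# Toy example: 6 wins, 3 losses, 1 tie -> 0.65 win rate for model A.
print(win_rate(["A"] * 6 + ["B"] * 3 + ["tie"]))
```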

Bias, Risks, and Limitation Analysis

Critical to the development of the Aya Model is a conscientious approach to addressing the biases, risks, and limitations inherent in multilingual LLMs. Through targeted safety mitigation and a detailed examination of toxicity and bias across languages and contexts, the project highlights the importance of ethical considerations in LLM development. Despite these efforts, open challenges remain, such as capturing sociolinguistic nuance and aligning model values and behavior across diverse languages, underscoring the complexity of creating truly inclusive and fair language models.
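
The paper's toxicity analysis relies on external classifiers that are not shown here; the sketch below only illustrates the bookkeeping of such an analysis, computing the fraction of generations per language that a hypothetical scorer flags as toxic.

```python
def toxicity_rates(generations_by_language, score_toxicity, threshold=0.5):
    """Fraction of generations per language flagged as toxic.

    `score_toxicity` is a hypothetical callable standing in for an external
    toxicity classifier; it is not part of the Aya release.
    """
    rates = {}
    for lang, texts in generations_by_language.items():
        if not texts:
            continue
        flagged = sum(score_toxicity(text) > threshold for text in texts)
        rates[lang] = flagged / len(texts)
    return rates

# Toy usage with a trivially fake scorer that flags texts containing "BAD".
fake_scorer = lambda text: 1.0 if "BAD" in text else 0.0
print(toxicity_rates({"en": ["fine", "BAD output"], "fr": ["ça va"]}, fake_scorer))
```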

Model Version and Maintenance

The Aya Model is actively maintained, with its initial release in February 2024. The project team commits to regular updates and improvements, reflecting ongoing research and feedback from the broader community. The open-source nature of the model invites collaboration and contributions, setting a new standard for transparency and inclusivity in the field of language model research.
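
Since the released checkpoint (linked in the abstract) is an mT5-based encoder-decoder model, it can be loaded as a standard seq2seq model with the transformers library; the prompt and generation settings below are illustrative only.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Aya is an instruction-finetuned mT5 (encoder-decoder), so the released
# checkpoint loads as a standard seq2seq model.
checkpoint = "CohereForAI/aya-101"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# The prompt and generation settings are placeholders, not recommendations.
inputs = tokenizer("Translate to Turkish: How are you today?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```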

Conclusion

The Aya Model represents a significant advancement in the effort to democratize access to state-of-the-art language technologies. By substantially increasing the number of languages covered and incorporating ethical considerations throughout its development process, the Aya Model paves the way for more equitable advancements in NLP. Its open-source release not only facilitates immediate access and utility but also encourages ongoing collaboration and innovation within the global research community.
