
Aya 23: Open Weight Releases to Further Multilingual Progress

(2405.15032)
Published May 23, 2024 in cs.CL

Abstract

This technical report introduces Aya 23, a family of multilingual language models. Aya 23 builds on the recent release of the Aya model (Üstün et al., 2024), focusing on pairing a highly performant pre-trained model with the recently released Aya collection (Singh et al., 2024). The result is a powerful multilingual large language model serving 23 languages, expanding state-of-art language modeling capabilities to approximately half of the world's population. The Aya model covered 101 languages whereas Aya 23 is an experiment in depth vs breadth, exploring the impact of allocating more capacity to fewer languages that are included during pre-training. Aya 23 outperforms both previous massively multilingual models like Aya 101 for the languages it covers, as well as widely used models like Gemma, Mistral and Mixtral on an extensive range of discriminative and generative tasks. We release the open weights for both the 8B and 35B models as part of our continued commitment for expanding access to multilingual progress.

Figure: Multilingual benchmark results from 8 datasets for Aya 23 models against similar-sized multilingual models.

Overview

  • Aya 23 represents a significant advancement in multilingual natural language processing, spanning 23 languages and leveraging Cohere's Command model architecture.

  • The paper identifies two major bottlenecks, the lack of robust multilingual pre-trained models and the scarcity of language-diverse instruction-style training data, and addresses them by pairing a strong pre-trained base model with multilingual instruction tuning.

  • Evaluation results indicate Aya 23's strong multilingual performance across discriminative tasks, language understanding, mathematical reasoning, and generative tasks.

An Analysis of Aya 23: Multilingual Instruction-Tuned Language Models

The introduction of the Aya 23 family marks a significant advancement in multilingual NLP. Unlike previous models that are predominantly English-centric, Aya 23 spans 23 languages and aims to address performance disparities across languages by leveraging Cohere's Command model architecture. This paper undertakes a comprehensive evaluation of the Aya 23 models' capacity for handling multilingual tasks using a multi-faceted benchmark approach.

The paper identifies two major bottlenecks in the development of robust multilingual language models: the lack of robust multilingual pre-trained models and the scarcity of language-diverse instruction-style training data. The Aya initiative itself, leading to Aya 101 and subsequently to Aya 23, was predicated on mitigating these issues by offering a robust multilingual instruction-style dataset and leveraging the relatively up-to-date Command R model.

Aya 23 marks a departure from the Aya 101 approach, concentrating resources on 23 languages rather than the 101 covered by Aya 101. This consolidation was intended to counteract the so-called "curse of multilinguality," which posits that increasing language breadth often reduces per-language performance because model capacity is spread across more languages.

Model Architecture and Training

Aya 23 employs a state-of-the-art decoder-only transformer architecture, building on recent advances in model design. Noteworthy architectural features include the following (a minimal sketch combining them appears after the list):

  • Parallel attention and FFN layers for enhanced training efficiency.
  • SwiGLU activation, which demonstrated superior downstream performance.
  • Rotary positional embeddings (RoPE) for improved long-context understanding and extrapolation capabilities.
  • Grouped-query attention (GQA), which reduces the inference-time memory footprint in the 8B model configuration.
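
To make these components concrete, the following is a minimal NumPy sketch of a single decoder layer combining a parallel attention/FFN layout, a SwiGLU feed-forward block, rotary position embeddings, and grouped-query attention. All dimensions, the normalization, and the random weights are illustrative placeholders, not the Aya 23 configuration.

```python
# Minimal sketch of one decoder layer with parallel attention + SwiGLU FFN,
# RoPE, and grouped-query attention. Sizes are toy placeholders, not Aya 23's.
import numpy as np

D_MODEL, N_Q_HEADS, N_KV_HEADS, D_HEAD, D_FFN = 64, 8, 2, 8, 256
rng = np.random.default_rng(0)

def rope(x):
    # x: (seq, heads, d_head); rotate dimension pairs by position-dependent angles.
    seq, _, d = x.shape
    half = d // 2
    freqs = 1.0 / (10000 ** (np.arange(half) / half))           # (half,)
    angles = np.arange(seq)[:, None] * freqs[None, :]           # (seq, half)
    cos, sin = np.cos(angles)[:, None, :], np.sin(angles)[:, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def swiglu(x, w_gate, w_up, w_down):
    silu = lambda z: z / (1.0 + np.exp(-z))                     # SiLU activation
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def gqa(x, wq, wk, wv, wo):
    seq = x.shape[0]
    q = rope((x @ wq).reshape(seq, N_Q_HEADS, D_HEAD))
    k = rope((x @ wk).reshape(seq, N_KV_HEADS, D_HEAD))
    v = (x @ wv).reshape(seq, N_KV_HEADS, D_HEAD)
    # Each group of query heads shares one KV head (the memory saving of GQA).
    k = np.repeat(k, N_Q_HEADS // N_KV_HEADS, axis=1)
    v = np.repeat(v, N_Q_HEADS // N_KV_HEADS, axis=1)
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(D_HEAD)
    scores += np.triu(np.full((seq, seq), -1e9), k=1)           # causal mask
    probs = np.exp(scores - scores.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    out = np.einsum("hqk,khd->qhd", probs, v).reshape(seq, -1)
    return out @ wo

def parallel_block(x, params):
    # Parallel layout: attention and FFN both read the same normalized input,
    # and their outputs are added to the residual stream together.
    h = (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)
    return x + gqa(h, *params["attn"]) + swiglu(h, *params["ffn"])

params = {
    "attn": [rng.normal(0, 0.02, s) for s in
             [(D_MODEL, N_Q_HEADS * D_HEAD), (D_MODEL, N_KV_HEADS * D_HEAD),
              (D_MODEL, N_KV_HEADS * D_HEAD), (N_Q_HEADS * D_HEAD, D_MODEL)]],
    "ffn": [rng.normal(0, 0.02, s) for s in
            [(D_MODEL, D_FFN), (D_MODEL, D_FFN), (D_FFN, D_MODEL)]],
}
tokens = rng.normal(size=(16, D_MODEL))                         # 16-token toy input
print(parallel_block(tokens, params).shape)                     # (16, 64)
```

In this layout, every group of query heads shares a single key/value head, which is what shrinks the key/value cache at inference time, and the attention and feed-forward branches read the same normalized input so their outputs are added to the residual stream in one step.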

The models are trained on TPU v4 infrastructure using a distributed JAX-based framework, enabling high-throughput, efficient training.

Instruction Fine-Tuning

The instruction fine-tuning phase employs a diverse mixture of multilingual data sources, encompassing structured templates from datasets like xP3x, human annotations, translated subsets, and synthetic data generated via machine translation and Cohere's models. This varied mixture exposes the Aya 23 models to a broad range of instruction styles and languages, preparing them for the complexities of multilingual text processing.
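
As a toy illustration of how such a mixture might be assembled, the Python sketch below samples training examples from several sources according to per-source weights. The source names, weights, and placeholder examples are hypothetical and are not the actual Aya 23 data recipe.

```python
# Hypothetical sketch of blending instruction-tuning sources with per-source
# sampling weights; names, weights, and examples are placeholders, not the
# actual Aya 23 mixture.
import random

SOURCES = {
    "xp3x_templates":     {"weight": 0.4, "data": ["templated instruction ..."]},
    "human_annotations":  {"weight": 0.2, "data": ["human-written instruction ..."]},
    "translated_subsets": {"weight": 0.2, "data": ["translated instruction ..."]},
    "synthetic_mt":       {"weight": 0.2, "data": ["model-generated instruction ..."]},
}

def sample_batch(batch_size, seed=0):
    """Pick a source by weight, then draw one example from it, batch_size times."""
    rng = random.Random(seed)
    names = list(SOURCES)
    weights = [SOURCES[n]["weight"] for n in names]
    batch = []
    for _ in range(batch_size):
        name = rng.choices(names, weights=weights, k=1)[0]
        batch.append((name, rng.choice(SOURCES[name]["data"])))
    return batch

print(sample_batch(4))
```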

Evaluation and Results

The paper uses a multi-layered evaluation framework, assessing the models on discriminative tasks, language understanding, mathematical reasoning, and generative tasks, and contrasts Aya 23 with baseline models throughout the results.

  • Discriminative Tasks: Aya-23-35B outperforms all baselines in accuracy, averaging 70.8% across tasks such as XCOPA, XStoryCloze, and XWinoGrad.
  • Multilingual MMLU: The Aya models exhibit superior performance, with Aya-23-35B achieving 58.2% accuracy and outstripping similarly sized models on languages like Arabic, Hindi, and Vietnamese.
  • Mathematical Reasoning: Aya models markedly outperform baselines in solving math problems under native-language settings, with Aya-23-35B achieving the highest scores.
  • Generative Tasks: Aya 23 models excel in machine translation and summarization, with Aya-23-35B leading at 43.0 spBLEU on translation tasks (a scoring sketch follows this list).
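
For context on the translation metric, spBLEU is BLEU computed over SentencePiece-tokenized text. The sketch below shows one way such a score is typically computed with the sacrebleu library; it assumes a recent sacrebleu release that ships the "flores200" SentencePiece tokenizer, and the sentences are toy placeholders rather than actual model outputs.

```python
# Toy spBLEU-style scoring with sacrebleu. Assumes a recent sacrebleu version
# that provides the "flores200" SentencePiece tokenizer; the sentences below
# are placeholders, not Aya 23 outputs.
import sacrebleu

hypotheses = ["Das ist ein kleiner Test."]        # system translations
references = [["Dies ist ein kleiner Test."]]     # one reference stream

score = sacrebleu.corpus_bleu(hypotheses, references, tokenize="flores200")
print(f"spBLEU: {score.score:.1f}")
```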

The models also perform impressively well in GPT-4 simulated win-rate tests, consistently edging out competing models across a wide range of languages.

Implications and Future Directions

The Aya 23 models underscore the importance of both selective multilingual pre-training and robust instruction fine-tuning in creating high-performance language models. The Aya family sets a precedent for future work aiming to balance linguistic breadth with depth, avoiding the pitfalls of overextended language distribution.

The Aya initiative's direction highlights several avenues for future work. One crucial aspect is expanding coverage to underrepresented languages, particularly those prevalent in Asia and Africa. Addressing this imbalance aligns with broader goals of equitable technological advancement. Moreover, improving model safety, reducing biases in generated text, and addressing cultural sensitivities in language can form pillars for subsequent research.

Conclusion

Aya 23 exemplifies a significant step towards overcoming historical linguistic biases in NLP systems by ensuring high performance across a focused set of 23 languages. By releasing model weights and comprehensive evaluation frameworks, the paper envisions facilitating future research and practical applications, enriching the landscape of multilingual AI and fostering broader linguistic inclusivity.
