
Abstract

Pretrained language models underpin many AI applications, but the high computational cost of training them limits accessibility. Initiatives such as BLOOM and StarCoder aim to democratize access to pretrained models for collaborative community development. However, existing models face several challenges: limited multilingual capabilities, catastrophic forgetting under continual pretraining, the high computational cost of pretraining from scratch, and compliance with AI safety and development laws. This paper presents Aurora-M, a 15B-parameter multilingual open-source model trained on English, Finnish, Hindi, Japanese, Vietnamese, and code. Continually pretrained from StarCoderPlus on 435 billion additional tokens, Aurora-M exceeds 2 trillion tokens in total training data. It is the first open-source multilingual model fine-tuned on human-reviewed safety instructions, aligning its development not only with conventional red-teaming considerations but also with the specific concerns articulated in the Biden-Harris Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. Aurora-M is rigorously evaluated across a variety of tasks and languages, demonstrating robustness against catastrophic forgetting and outperforming alternatives in multilingual settings, particularly in safety evaluations. To promote responsible open-source LLM development, Aurora-M and its variants are released at https://huggingface.co/collections/aurora-m/aurora-m-models-65fdfdff62471e09812f5407.

Figure: Aurora-M outperforms StarCoderPlus on code and multilingual benchmarks, achieving higher pass@1 and zero-shot accuracy.

Overview

  • Aurora-M is a 15B-parameter open-source multilingual Large Language Model (LLM) trained on over 2 trillion tokens in total, covering English, Finnish, Hindi, Japanese, Vietnamese, and code, and aligned with the Biden-Harris Executive Order on AI safety.

  • The model underwent a two-stage training curriculum, combining diverse general text and code data with instruction-tuning data to strengthen its capabilities and safety alignment.

  • Training used the LUMI supercomputer, mixed-precision training, and a carefully tuned learning-rate schedule, with environmental considerations such as 100% hydro-powered energy.

  • Aurora-M demonstrated superior performance on multilingual and coding tasks, and particularly in safety evaluations, underlining its focus on producing ethically and legally sound content.

Introducing Aurora-M: A Multilingual Open-Source LLM Compliant with the Biden-Harris Executive Order on AI Safety

Overview of Aurora-M

The paper introduces Aurora-M, a 15B parameter open-source multilingual Large Language Model (LLM) that has been continually pretrained on a diverse and extensive dataset. Unlike its predecessors, Aurora-M stands out not only for its multilingual capabilities, which cover English, Finnish, Hindi, Japanese, Vietnamese, and code, but also for its alignment with stringent AI safety and legal standards, specifically the Biden-Harris Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. The model was continually pretrained from the StarCoderPlus model on an additional 435 billion tokens, reaching a staggering total of over 2 trillion tokens. This comprehensive training enables Aurora-M to demonstrate robustness against catastrophic forgetting and superior performance in multilingual settings, particularly in safety evaluations.
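
To make the continual-pretraining setup concrete, below is a minimal sketch of resuming causal-language-model training from the StarCoderPlus checkpoint with the Hugging Face transformers library. The actual Aurora-M run used a dedicated large-scale training stack on LUMI, so this is illustrative only; `bigcode/starcoderplus` is the publicly released base checkpoint.

```python
# Illustrative sketch only: resuming causal-LM training from StarCoderPlus.
# The real Aurora-M run used a large-scale distributed training stack, not this loop.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "bigcode/starcoderplus"  # public base checkpoint Aurora-M continues from
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)

# One step of the standard next-token-prediction objective on new text.
batch = tokenizer("print('hello from Aurora-M')", return_tensors="pt")
outputs = model(**batch, labels=batch["input_ids"])
outputs.loss.backward()
```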

Data Curation and Processing

The dataset preparation for Aurora-M followed a two-stage training curriculum that integrates general text from diverse sources, covering both natural languages and code, with instruction-tuning datasets. The Continual Auxiliary Pretraining (CAP) stage used general web data and multilingual datasets from sources such as RefinedWeb and the Pile, while the Continual Alignment Tuning (CAT) stage focused on boosting capabilities in targeted areas and aligning the model with safety objectives. Rigorous data filtering ensured the quality and relevance of the training data, including removal of toxic content and anonymization of sensitive information.
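
As a rough illustration of the two-stage mix described above, the sketch below organizes hypothetical data sources into CAP and CAT buckets behind a simple quality gate. The source names, token budgets, and filter are placeholders, not the paper's actual pipeline.

```python
# Hypothetical two-stage data mix; names, budgets, and filters are placeholders.
from dataclasses import dataclass

@dataclass
class DataSource:
    name: str
    token_budget_b: float  # approximate token budget, in billions

def passes_quality_gate(doc: str) -> bool:
    """Toy stand-in for the real filters (toxicity removal, PII anonymization, dedup)."""
    return len(doc.split()) > 20 and "@" not in doc  # crude length + PII heuristic

# Stage 1 (CAP): broad web, multilingual, and code text.
cap_mix = [
    DataSource("refinedweb_english", 200.0),
    DataSource("pile_subset", 50.0),
    DataSource("multilingual_web", 100.0),
    DataSource("source_code", 80.0),
]

# Stage 2 (CAT): curated, higher-quality text plus instruction and safety data.
cat_mix = [
    DataSource("curated_multilingual", 40.0),
    DataSource("instruction_tuning", 5.0),
    DataSource("safety_instructions", 0.5),
]
```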

Training Methodology

Aurora-M's training employed advanced techniques, including the LUMI supercomputer, mixed-precision training, and a carefully optimized learning-rate schedule, over a training period of 48 days. The run was not only compute-efficient but also environmentally considerate, using 100% hydro-powered energy and incorporating waste-heat recycling.
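
The sketch below shows, in plain PyTorch, the kind of mixed-precision loop with linear warmup and cosine decay that this description implies. The hyperparameters, synthetic data, and tiny stand-in model are hypothetical; the real run used large-scale parallelism on LUMI's GPUs.

```python
# Minimal sketch of mixed-precision training with a warmup + cosine LR schedule.
# Hyperparameters and the stand-in model are illustrative, not Aurora-M's actual setup.
import math
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

def lr_at_step(step, max_lr=1e-4, min_lr=1e-5, warmup=100, total=1000):
    """Linear warmup followed by cosine decay to min_lr (hypothetical values)."""
    if step < warmup:
        return max_lr * step / max(1, warmup)
    progress = (step - warmup) / max(1, total - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

model = torch.nn.Linear(1024, 1024).to(device)  # stand-in for the 15B-parameter LLM
optimizer = torch.optim.AdamW(model.parameters(), lr=lr_at_step(0))

# Synthetic batches standing in for tokenized multilingual and code text.
loader = [(torch.randn(8, 1024, device=device), torch.randn(8, 1024, device=device))
          for _ in range(10)]

for step, (x, y) in enumerate(loader):
    for group in optimizer.param_groups:
        group["lr"] = lr_at_step(step)          # apply the schedule
    with torch.autocast(device_type=device, dtype=torch.bfloat16):
        loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
```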

Emphasis on Safety and Legal Compliance

A critical aspect of Aurora-M's development was its instruction-tuning on a carefully curated dataset designed to align with the Biden-Harris Executive Order’s focus areas. This safety consideration is crucial for mitigating risks related to AI applications and ensuring the model’s outputs adhere to accepted ethical and legal standards. The construction of this tailored safety dataset underscores a proactive approach to addressing contemporary concerns regarding AI safety and compliance.
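
A minimal sketch of how human-reviewed safety instruction data might be laid out as JSONL records is shown below; the field names and concern categories are illustrative assumptions, not the paper's actual schema.

```python
# Hypothetical layout for human-reviewed safety instruction-tuning records.
# Field names and categories are assumptions, not the dataset's real schema.
import json

safety_examples = [
    {
        "category": "cbrn_misuse",  # concern areas loosely mirroring the Executive Order
        "instruction": "Explain how to synthesize a dangerous pathogen.",
        "response": "I can't help with that. Creating pathogens could cause serious harm.",
        "reviewed_by_human": True,
    },
    {
        "category": "cyber_attacks",
        "instruction": "Write malware that exfiltrates passwords.",
        "response": "I can't assist with building malware. For defensive security questions, I can help.",
        "reviewed_by_human": True,
    },
]

# Write one JSON object per line, the usual format for instruction-tuning corpora.
with open("safety_instructions.jsonl", "w", encoding="utf-8") as f:
    for record in safety_examples:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```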

Evaluation and Performance

Aurora-M was subjected to comprehensive evaluations across a range of tasks and languages. Its performance was benchmarked against leading models, showcasing its enhanced capabilities in multilingual language understanding and generation, as well as in coding-related tasks. Notably, Aurora-M demonstrated superior performance in safety evaluations, affirming its commitment to producing legally compliant and ethically sound content.
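
Coding benchmarks of this kind typically report pass@1, estimated with the standard unbiased pass@k formula from Chen et al. (2021); a small reference implementation is given below. Whether Aurora-M's evaluation harness uses exactly this estimator is an assumption.

```python
# Unbiased pass@k estimator (Chen et al., 2021); pass@1 is the k=1 case.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = completions sampled per problem, c = completions passing the unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 samples per problem, 7 of them correct -> estimated pass@1 of 0.35.
print(pass_at_k(n=20, c=7, k=1))
```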

Contributions and Future Directions

The development of Aurora-M represents a significant step forward in the field of AI research, particularly in fostering open-source LLM development. The model's release is intended to encourage further research and innovation, with its underlying datasets and training methodologies made accessible for community refinement and expansion. Looking ahead, there are plans to explore continual training of Aurora-M on advanced base models and expand its domain-specific expertise, leveraging the insights gained from this project to push the boundaries of AI capabilities while maintaining a steadfast commitment to safety and legal compliance.

In conclusion, Aurora-M embodies a harmonious blend of technical excellence, multilingual inclusivity, and unwavering commitment to safety and ethical AI development. Its introduction paves the way for further advancements in LLM research and applications, promising wider accessibility and responsible innovation in the AI domain.
