Aya 23: Open Weight Releases to Further Multilingual Progress (2405.15032v2)
Abstract: This technical report introduces Aya 23, a family of multilingual LLMs. Aya 23 builds on the recent release of the Aya model (Üstün et al., 2024), focusing on pairing a highly performant pre-trained model with the recently released Aya Collection (Singh et al., 2024). The result is a powerful multilingual LLM serving 23 languages, expanding state-of-the-art language modeling capabilities to approximately half of the world's population. The Aya model covered 101 languages, whereas Aya 23 is an experiment in depth vs. breadth, exploring the impact of allocating more capacity to fewer languages that are included during pre-training. Aya 23 outperforms both previous massively multilingual models like Aya 101 for the languages it covers and widely used models like Gemma, Mistral, and Mixtral on an extensive range of discriminative and generative tasks. We release the open weights for both the 8B and 35B models as part of our continued commitment to expanding access to multilingual progress.
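Since the report releases open weights for the 8B and 35B variants, the following minimal sketch shows how one might load the 8B checkpoint and run a single generation with Hugging Face transformers; the repository ID, chat-template usage, and example prompt are assumptions for illustration rather than details taken from the report.

```python
# Minimal sketch of loading the released Aya 23 8B weights with Hugging Face
# transformers. The Hub ID below is an assumption and may differ from the
# official release path.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CohereForAI/aya-23-8B"  # assumed Hub ID for the 8B checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# Build a single-turn prompt with the tokenizer's chat template and generate a reply.
messages = [{"role": "user", "content": "Translate to Turkish: The weather is nice today."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)

output = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```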
- Ethnologue. https://www.ethnologue.com/insights/how-many-languages/, 2023. Accessed: 2023-06-17.
- Breaking the unwritten language barrier: The BULB project. Procedia Computer Science, 81:8–14, 2016. ISSN 1877-0509. https://doi.org/10.1016/j.procs.2016.04.023. URL https://www.sciencedirect.com/science/article/pii/S1877050916300370. SLTU-2016 5th Workshop on Spoken Language Technologies for Under-resourced languages 09-12 May 2016 Yogyakarta, Indonesia.
- Do all languages cost the same? Tokenization in the era of commercial language models, 2023.
- GQA: Training generalized multi-query transformer models from multi-head checkpoints, 2023.
- PaLM 2 technical report. arXiv, abs/2305.10403, 2023.
- Massively multilingual neural machine translation in the wild: Findings and challenges. arXiv preprint arXiv:1907.05019, 2019.
- Open LLM Leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, 2023.
- Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2475–2485, October-November 2018. 10.18653/v1/D18-1269. URL https://aclanthology.org/D18-1269.
- Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8440–8451, July 2020. 10.18653/v1/2020.acl-main.747. URL https://aclanthology.org/2020.acl-main.747.
- Free Dolly: Introducing the world’s first truly open instruction-tuned LLM. Databricks, 2023a.
- Free Dolly: Introducing the world’s first truly open instruction-tuned LLM, 2023b. URL https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm.
- Okapi: Instruction-tuned large language models in multiple languages with reinforcement learning from human feedback. arXiv e-prints, 2023.
- Multilingual jailbreak challenges in large language models. arXiv preprint arXiv:2310.06474, 2023.
- AlpacaFarm: A simulation framework for methods that learn from human feedback. arXiv preprint arXiv:2305.14387, 2023.
- Towards measuring the representation of subjective global opinions in language models. arXiv, abs/2306.16388, 2023.
- A framework for few-shot language model evaluation. December 2023. 10.5281/zenodo.10256836. URL https://zenodo.org/records/10256836.
- Gemini: A family of highly capable multimodal models, 2024.
- Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295, 2024.
- Gemma-Team. Gemma: Open models based on Gemini research and technology, 2024.
- The FLORES-101 evaluation benchmark for low-resource and multilingual machine translation. arXiv, abs/2106.03193, 2021.
- XL-Sum: Large-scale multilingual abstractive summarization for 44 languages. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 4693–4703, August 2021. 10.18653/v1/2021.findings-acl.413. URL https://aclanthology.org/2021.findings-acl.413.
- A material lens on coloniality in NLP. arXiv, abs/2311.08391, 2023.
- Measuring massive multitask language understanding. In International Conference on Learning Representations, 2020.
- Mistral 7B, 2023.
- Mixtral of experts. arXiv, abs/2401.04088, 2024.
- TPU v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings, 2023.
- Casteist but not racist? Quantifying disparities in large language model bias between India and the West. arXiv, abs/2309.08573, 2023. URL https://api.semanticscholar.org/CorpusID:262013517.
- GPTAraEval: A comprehensive evaluation of ChatGPT on Arabic NLP. arXiv, abs/2305.14976, 2023.
- Prometheus: Inducing fine-grained evaluation capability in language models. arXiv preprint arXiv:2310.08491, 2023.
- Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Gender bias and stereotypes in large language models. Proceedings of The ACM Collective Intelligence Conference, 2023. URL https://api.semanticscholar.org/CorpusID:261276445.
- Bactrian-X: Multilingual replicable instruction-following models with low-rank adaptation. arXiv, abs/2305.15011, 2023a.
- Privacy in large language models: Attacks, defenses and future directions. ArXiv, abs/2310.10383, 2023b. URL https://api.semanticscholar.org/CorpusID:264145758.
- Few-shot learning with multilingual language models. arXiv, abs/2112.10668, 2021.
- The Flan Collection: Designing data and methods for effective instruction tuning. arXiv, abs/2301.13688, 2023a.
- The Data Provenance Initiative: A large scale audit of dataset licensing & attribution in AI. arXiv preprint arXiv:2310.16787, 2023b.
- Analyzing leakage of personally identifiable information in language models. 2023 IEEE Symposium on Security and Privacy (SP), pp. 346–363, 2023. URL https://api.semanticscholar.org/CorpusID:256459554.
- Crosslingual generalization through multitask finetuning. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 15991–16111, Toronto, Canada, July 2023. Association for Computational Linguistics. 10.18653/v1/2023.acl-long.891. URL https://aclanthology.org/2023.acl-long.891.
- Scalable extraction of training data from (production) language models. arXiv, abs/2311.17035, 2023.
- Lost in translation: Large language models in non-English content analysis. arXiv, abs/2306.07377, 2023.
- No language left behind: Scaling human-centered machine translation. 2022.
- How good are large language models on African languages? arXiv, abs/2311.07978, 2023.
- Lifting the curse of multilinguality by pre-training modular transformers. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 3479–3495, Seattle, United States, July 2022. Association for Computational Linguistics. 10.18653/v1/2022.naacl-main.255. URL https://aclanthology.org/2022.naacl-main.255.
- XCOPA: A multilingual dataset for causal commonsense reasoning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2362–2376, November 2020. 10.18653/v1/2020.emnlp-main.185. URL https://aclanthology.org/2020.emnlp-main.185.
- Train short, test long: Attention with linear biases enables input length extrapolation. CoRR, abs/2108.12409, 2021. URL https://arxiv.org/abs/2108.12409.
- Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023.
- Towards a standard for identifying and managing bias in artificial intelligence. NIST Special Publication 1270, 2022.
- Noam Shazeer. GLU variants improve transformer. CoRR, abs/2002.05202, 2020. URL https://arxiv.org/abs/2002.05202.
- Language models are multilingual chain-of-thought reasoners. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=fR3wGCk-IXp.
- Aya dataset: An open-access collection for multilingual instruction tuning. arXiv preprint arXiv:2402.06619, 2024.
- RoFormer: Enhanced transformer with rotary position embedding. CoRR, abs/2104.09864, 2021. URL https://arxiv.org/abs/2104.09864.
- Stanford Alpaca: An instruction-following LLaMA model, 2023.
- Llama: Open and efficient foundation language models. arXiv, abs/2302.13971, 2023a.
- Llama 2: Open foundation and fine-tuned chat models. arXiv, abs/2307.09288, 2023b.
- On evaluating and mitigating gender biases in multilingual settings. arXiv, abs/2307.01503, 2023.
- mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 483–498, June 2021. 10.18653/v1/2021.naacl-main.41. URL https://aclanthology.org/2021.naacl-main.41.
- HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2369–2380, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. 10.18653/v1/D18-1259. URL https://aclanthology.org/D18-1259.
- Low-resource languages jailbreak GPT-4. arXiv, abs/2310.02446, 2023a.
- BLOOM+1: Adding language support to BLOOM for zero-shot prompting. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 11682–11703, Toronto, Canada, July 2023b. Association for Computational Linguistics. 10.18653/v1/2023.acl-long.653. URL https://aclanthology.org/2023.acl-long.653.
- Scalable training of language models using JAX pjit and TPUv4, 2022.
- Llama beyond English: An empirical study on language capability transfer. arXiv, abs/2401.01055, 2024.
- Aya model: An instruction finetuned open-access multilingual language model, 2024.