Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built through a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.
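Since the models and code are publicly released (see the Hugging Face model entry in the references), the checkpoint can be loaded with standard tooling. The following is a minimal sketch, not taken from the paper, of few-shot prompting with the Hugging Face transformers library; the bigscience/bloom-560m variant used here is an assumption chosen so the example runs on modest hardware, with the full bigscience/bloom checkpoint a drop-in replacement given sufficient memory.

```python
# Minimal sketch of few-shot prompting with a released BLOOM checkpoint.
# NOTE: bigscience/bloom-560m is an assumed smaller variant for practicality;
# the full 176B model at bigscience/bloom requires substantial memory.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Few-shot prompting: the task is specified entirely in the prompt text,
# since BLOOM is a decoder-only causal language model.
prompt = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "cheese =>"
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because the model is autoregressive, no task-specific head or finetuning is needed for this kind of zero- or few-shot use; the demonstration pairs in the prompt alone steer the generation.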
Ungoliant: An optimized pipeline for the generation of a very large-scale multilingual web corpus. In Harald Lüngen, Marc Kupietz, Piotr Bański, Adrien Barbaresi, Simon Clematide, and Ines Pisetta, editors, Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-9), pages 1–9, Limerick, Ireland, 2021. Leibniz-Institut für Deutsche Sprache. doi: 10.14618/ids-pub-10468. https://nbn-resolving.org/urn:nbn:de:bsz:mh39-104688.
Judit Ács. Exploring BERT's vocabulary, 2019. http://juditacs.github.io/2019/02/19/bert-tokenization-stats.html.
PromptSource: An integrated development environment and repository for natural language prompts. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 93–104, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-demo.9. https://aclanthology.org/2022.acl-demo.9.
Evaluating the carbon footprint of NLP methods: a survey and analysis of existing tools. In Proceedings of the Second Workshop on Simple and Efficient Natural Language Processing, pages 11–21, Virtual, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.sustainlp-1.2. https://aclanthology.org/2021.sustainlp-1.2.
DiaBLa: A Corpus of Bilingual Spontaneous Written Dialogues for Machine Translation. Language Resources and Evaluation, pages 635–660, 2020. doi: 10.1007/s10579-020-09514-4. https://doi.org/10.1007/s10579-020-09514-4.
Yonatan Belinkov. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):207–219, March 2022. doi: 10.1162/coli_a_00422. https://aclanthology.org/2022.cl-1.7.
Analysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics, 7:49–72, March 2019. doi: 10.1162/tacl_a_00254. https://www.aclweb.org/anthology/Q19-1004.
What do neural machine translation models learn about morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 861–872, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1080. https://www.aclweb.org/anthology/P17-1080.
BigScience Workshop. BLOOM (revision 4ab0472), 2022. https://huggingface.co/bigscience/bloom.
GPT-Neo: Large scale autoregressive language modeling with Mesh-Tensorflow, March 2021. https://doi.org/10.5281/zenodo.5297715.
Stereotyping Norwegian salmon: An inventory of pitfalls in fairness benchmark datasets. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1004–1015, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.81. https://aclanthology.org/2021.acl-long.81.
Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 12–58, Baltimore, Maryland, USA, June 2014. Association for Computational Linguistics. doi: 10.3115/v1/W14-3302. https://aclanthology.org/W14-3302.
What to expect when you're expecting robots: Futures, expectations, and pseudo-artificial general intelligence in UK news. Journalism, 23(1):22–38, 2022. doi: 10.1177/1464884920947535. https://doi.org/10.1177/1464884920947535.
The grammar-learning trajectories of neural language models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8281–8297, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.568. https://aclanthology.org/2022.acl-long.568.
What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2126–2136, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-1198. https://aclanthology.org/P18-1198.
Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.747. https://aclanthology.org/2020.acl-main.747.
Entities, dates, and languages: Zero-shot on historical texts with t0. In Proceedings of BigScience Episode #5 – Workshop on Challenges & Perspectives in Creating LLMs, pages 75–83, virtual+Dublin, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.bigscience-1.7. https://aclanthology.org/2022.bigscience-1.7.
Probing for semantic evidence of composition by means of simple classification tasks. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pages 134–139, Berlin, Germany, August 2016. Association for Computational Linguistics. doi: 10.18653/v1/W16-2524. https://www.aclweb.org/anthology/W16-2524.
Beyond English-Centric multilingual machine translation. Journal of Machine Learning Research, 22(107):1–48, 2021. http://jmlr.org/papers/v22/20-1307.html.
Dataset debt in biomedical language modeling. In Challenges & Perspectives in Creating LLMs, 2022a. https://openreview.net/forum?id=HRfzInfr8Z9.
BigBio: A framework for data-centric biomedical natural language processing. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022b. https://openreview.net/forum?id=8lQDn9zTQlW.
Hungry hungry hippos: Towards language modeling with state space models. In The Eleventh International Conference on Learning Representations, 2023. https://openreview.net/forum?id=COZDy0WYGg.
A framework for few-shot language model evaluation, September 2021. https://doi.org/10.5281/zenodo.5371628.
The Flores-101 evaluation benchmark for low-resource and multilingual machine translation. Transactions of the Association for Computational Linguistics, 10:522–538, 2022. doi: 10.1162/tacl_a_00474. https://aclanthology.org/2022.tacl-1.30.
Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2733–2743, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1275. https://aclanthology.org/D19-1275.
SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium, November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-2012. https://aclanthology.org/D18-2012.
WikiLingua: A new benchmark dataset for cross-lingual abstractive summarization. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4034–4048, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.360. https://aclanthology.org/2020.findings-emnlp.360.
The BigScience ROOTS corpus: A 1.6TB composite multilingual dataset. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022. https://openreview.net/forum?id=UoEw6KigkUn.
What language model to train if you have one million GPU hours? In Challenges & Perspectives in Creating LLMs, 2022. https://openreview.net/forum?id=rI7BL3fHIZq.
Datasets: A community library for natural language processing. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 175–184, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-demo.21. https://aclanthology.org/2021.emnlp-demo.21.
Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics. https://aclanthology.org/W04-1013.
CamemBERT: a tasty French language model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7203–7219, Online, July 2020. Association for Computational Linguistics. https://www.aclweb.org/anthology/2020.acl-main.645.
Mixed precision training. In International Conference on Learning Representations, 2018. https://openreview.net/forum?id=r1gs9JgRZ.
Hugging Face Tokenizers library. https://github.com/huggingface/tokenizers.
CrowS-pairs: A challenge dataset for measuring social biases in masked language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1953–1967, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.154. https://aclanthology.org/2020.emnlp-main.154.
French CrowS-pairs: Extending a challenge dataset for measuring social bias in masked language models to a language other than English. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8521–8531, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.583. https://aclanthology.org/2022.acl-long.583.
Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 1659–1666, Portorož, Slovenia, May 2016. European Language Resources Association (ELRA). https://aclanthology.org/L16-1262.
Universal Dependencies. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Tutorial Abstracts, Valencia, Spain, April 2017. Association for Computational Linguistics. https://aclanthology.org/E17-5001.
Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures. In Piotr Bański, Adrien Barbaresi, Hanno Biber, Evelyn Breiteneder, Simon Clematide, Marc Kupietz, Harald Lüngen, and Caroline Iliadi, editors, Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7), pages 9–16, Cardiff, UK, 2019. Leibniz-Institut für Deutsche Sprache. doi: 10.14618/ids-pub-9021. http://nbn-resolving.de/urn:nbn:de:bsz:mh39-90215.
Matt Post. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium, October 2018. Association for Computational Linguistics. doi: 10.18653/v1/W18-6319. https://aclanthology.org/W18-6319.
AI and the everything in the whole wide world benchmark. In J. Vanschoren and S. Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, volume 1, 2021. https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/file/084b6fbb10729ed4da8c3d3f5a3ae7c9-Paper-round2.pdf.
How good is your tokenizer? On the monolingual performance of multilingual language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3118–3135, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.243. https://aclanthology.org/2021.acl-long.243.
KUISAIL at SemEval-2020 task 12: BERT-CNN for offensive speech identification in social media. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, pages 2054–2059, Barcelona (online), December 2020. International Committee for Computational Linguistics. https://www.aclweb.org/anthology/2020.semeval-1.271.
Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations, 2022. https://openreview.net/forum?id=9Vrb9D0WI4.
Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations, 2017. https://openreview.net/forum?id=B1ckMDqlg.
Un modèle Transformer Génératif Pré-entrainé pour le ______ français. In Pascal Denis, Natalia Grabar, Amel Fraisse, Rémi Cardon, Bernard Jacquemin, Eric Kergosien, and Antonio Balvet, editors, Traitement Automatique des Langues Naturelles, pages 246–255, Lille, France, 2021. ATALA. https://hal.archives-ouvertes.fr/hal-03265900.
You reap what you sow: On the challenges of bias evaluation under multilingual settings. In Challenges & Perspectives in Creating LLMs, 2022. https://openreview.net/forum?id=rK-7NhfSIW5.
Emergent structures and training dynamics in LLMs. In Proceedings of BigScience Episode #5 – Workshop on Challenges & Perspectives in Creating LLMs, pages 146–159, virtual+Dublin, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.bigscience-1.11. https://aclanthology.org/2022.bigscience-1.11.
SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. https://proceedings.neurips.cc/paper/2019/file/4496bf24afe7fab6f046bf4923da8de6-Paper.pdf.
BFloat16: The secret to high performance on Cloud TPUs, 2019. https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus.
What language model architecture and pretraining objective works best for zero-shot generalization? In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 22964–22984. PMLR, 17–23 Jul 2022a. https://proceedings.mlr.press/v162/wang22u.html.
mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.41. https://aclanthology.org/2021.naacl-main.41.
When do you need billions of words of pretraining data? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1112–1125, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.90. https://aclanthology.org/2021.acl-long.90.