Language models (LMs) have become ubiquitous in both NLP research and commercial product offerings. As their commercial importance has surged, the most powerful models have become closed off, gated behind proprietary interfaces, with important details of their training data, architectures, and development undisclosed. Given the importance of these details for scientifically studying these models, including their biases and potential risks, we believe it is essential for the research community to have access to powerful, truly open LMs. To this end, this technical report details the first release of OLMo, a state-of-the-art, truly Open Language Model, together with its framework for building and studying the science of language modeling. Unlike most prior efforts, which have released only model weights and inference code, we release OLMo along with the whole framework, including the training data and the training and evaluation code. We hope this release will empower and strengthen the open research community and inspire a new wave of innovation.
OLMo provides a comprehensive framework for LLMs, enhancing open access by including training data, logs, model checkpoints, and evaluation tools.
The architecture of OLMo features a decoder-only transformer optimized for resource utilization and stability, with variants at 1B and 7B scales including state-of-the-art enhancements.
OLMo's pretraining data, called Dolma, is a meticulously curated dataset aimed at promoting transparent and high-quality language model development.
The evaluation framework of OLMo includes both continuous assessment during training and detailed offline benchmarking, complete with rich metadata.
The project emphasizes training efficiency and carbon footprint transparency, thoroughly documenting power usage and emissions for environmental awareness.
OLMo represents an essential contribution to the open-access landscape of LLMs by providing a comprehensive framework that includes not only the models but also the vital components enabling their development and evaluation. Unlike preceding efforts that have limited openness by sharing only model weights or parts of the pipeline, OLMo distinguishes itself by offering the complete suite, from the training data and logs to the model checkpoints and evaluation tools. This unprecedented degree of access is poised to democratize LLM research, providing a holistic resource for deeper understanding and advancement of the science of language modeling.
The OLMo models use a decoder-only transformer architecture optimized for efficient use of computational resources and for training stability. The paper presents variants at the 1B and 7B scales, incorporating enhancements such as the removal of bias terms, non-parametric layer normalization, and the SwiGLU activation function. These modifications parallel those adopted in other state-of-the-art models, and comparisons against them show that OLMo's structural design is at the cutting edge.
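To make two of these enhancements concrete, the following is a minimal NumPy sketch of a non-parametric layer norm (no learnable gain or bias) and a bias-free SwiGLU feed-forward block. The dimensions and random weights here are purely illustrative, not OLMo's actual configuration.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Non-parametric layer norm: normalize the last axis with
    no learnable gain or bias terms."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def swiglu_ffn(x, w_gate, w_up, w_down):
    """Bias-free SwiGLU feed-forward: down(silu(x @ w_gate) * (x @ w_up))."""
    silu = lambda z: z / (1.0 + np.exp(-z))  # SiLU / swish activation
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

# Toy dimensions for illustration only (not OLMo's real sizes).
d_model, d_ff = 8, 16
rng = np.random.default_rng(0)
x = rng.standard_normal((4, d_model))

h = layer_norm(x)
y = swiglu_ffn(h,
               rng.standard_normal((d_model, d_ff)),
               rng.standard_normal((d_model, d_ff)),
               rng.standard_normal((d_ff, d_model)))
print(y.shape)  # (4, 8)
```

Note that because the layer norm carries no parameters, each normalized row has mean zero and unit variance by construction, which is one source of the training stability the design aims for.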
The data underpinning model pretraining is as critical as the models themselves. OLMo's training dataset, Dolma, is a curated amalgamation of publicly available texts processed through a rigorous pipeline. By releasing Dolma, OLMo empowers researchers to replicate and understand the intricacies of assembling pretraining corpora that are both diverse and of high quality, promoting more transparent language model experimentation.
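As a rough illustration of what one stage of such a pipeline can look like, here is a stand-in sketch of exact deduplication by content hash combined with two simple quality heuristics. This is not Dolma's actual pipeline, which is far more extensive; the thresholds and heuristics are invented for the example.

```python
import hashlib

def curate(documents, min_words=50, max_symbol_ratio=0.1):
    """Illustrative curation pass: drop exact duplicates (by SHA-256 of
    the text), very short documents, and documents dominated by
    non-alphanumeric symbols (a crude markup/boilerplate signal)."""
    seen, kept = set(), []
    for doc in documents:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier document
        seen.add(digest)
        if len(doc.split()) < min_words:
            continue  # too short to be useful pretraining text
        symbols = sum(not c.isalnum() and not c.isspace() for c in doc)
        if symbols / max(len(doc), 1) > max_symbol_ratio:
            continue  # likely markup or boilerplate
        kept.append(doc)
    return kept
```

Real curation pipelines add many more stages (language identification, fuzzy deduplication, toxicity filtering), but even this sketch shows why releasing the pipeline matters: every threshold is a consequential design decision.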
Empirical evaluation constitutes an essential part of the LLM development lifecycle. OLMo's evaluation framework operates along two dimensions: in-loop assessment during training to inform model adjustments, and detailed offline evaluation against established benchmarks. The released checkpoints include sufficient metadata to allow methodical analysis of the model's performance over the course of training.
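A sketch of how such trajectory analysis might be scored, using perplexity over held-out text as the metric. The checkpoint-loading function here is hypothetical, a stand-in for running each released checkpoint on an evaluation set.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities:
    exp(-mean(log p)). Lower is better."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def trajectory(checkpoint_steps, load_logprobs):
    """Score each checkpoint, yielding a (step, perplexity) curve.
    `load_logprobs` is a hypothetical stand-in that runs the checkpoint
    at a given step over held-out text and returns token log-probs."""
    return [(step, perplexity(load_logprobs(step)))
            for step in checkpoint_steps]

# Sanity check: a model assigning uniform probability over a
# 100-token vocabulary has perplexity exactly 100.
uniform = [math.log(1 / 100)] * 10
print(round(perplexity(uniform), 6))  # 100.0
```

Plotting such a curve across the released checkpoints is exactly the kind of training-dynamics analysis the per-checkpoint metadata is meant to enable.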
In line with escalating environmental concerns, the paper also documents the models' training efficiency and carbon emissions. OLMo was trained on both NVIDIA and AMD GPUs, with explicit documentation of power consumption and emissions, raising awareness of the environmental impact of high-performance computing.
The project cements its commitment to openness by releasing all of its assets under the Apache 2.0 License. This permissive license facilitates wide-ranging experimentation and application, lowering barriers to entry into LLM research.
By releasing models, code, data, and insights from OLMo, the authors deliver a rich repository to the research community. This effort not only bridges the existing transparency gap in language model research but also provides a foundational platform to nurture understanding and foster innovation in the field.