Language models (LMs) have become ubiquitous in both NLP research and commercial product offerings. As their commercial importance has surged, the most powerful models have become closed off, gated behind proprietary interfaces, with important details of their training data, architectures, and development undisclosed. Given the importance of these details in scientifically studying these models, including their biases and potential risks, we believe it is essential for the research community to have access to powerful, truly open LMs. To this end, this technical report details the first release of OLMo, a state-of-the-art, truly Open Language Model, and its framework to build and study the science of language modeling. Unlike most prior efforts that have only released model weights and inference code, we release OLMo and the whole framework, including training data and training and evaluation code. We hope this release will empower and strengthen the open research community and inspire a new wave of innovation.

Figure: Progression of OLMo-7B's accuracy on 8 core tasks from the Catwalk evaluation suite.


  • OLMo provides a comprehensive framework for LLMs, enhancing open access by including training data, logs, model checkpoints, and evaluation tools.

  • The architecture of OLMo is a decoder-only transformer tuned for efficient resource utilization and training stability, with variants at the 1B and 7B scales that adopt modifications used in other state-of-the-art models.

  • OLMo's pretraining data, called Dolma, is a meticulously curated dataset aimed at promoting transparent and high-quality language model development.

  • The evaluation framework of OLMo includes both continuous assessment during training and detailed offline benchmarking, complete with rich metadata.

  • The project emphasizes training efficiency and carbon footprint transparency, thoroughly documenting power usage and emissions for environmental awareness.

Overview of OLMo

OLMo contributes to the open-access landscape of LLMs by providing a comprehensive framework that includes not only the models but also the components needed to develop and evaluate them. Unlike earlier efforts that limited openness by sharing only model weights or parts of the pipeline, OLMo offers the complete suite, from the training data and logs to the model checkpoints and evaluation tools. This degree of access lets researchers examine every stage of LLM development and provides a holistic resource for understanding and advancing the science of language modeling.

Architecture & Framework

The OLMo models use a decoder-only transformer architecture, tuned for efficient use of computational resources and for minimizing training instabilities. The paper presents variants of the model at the 1B and 7B scales, with modifications such as removing all bias terms, using non-parametric layer normalization (no learned gain or bias), and adopting the SwiGLU activation function. These modifications parallel those adopted in other state-of-the-art models, and the paper's comparisons against those models show that OLMo's structural design is in line with current practice.
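To make the three modifications concrete, here is a minimal pure-Python sketch of a bias-free SwiGLU feed-forward block preceded by non-parametric layer normalization. It is an illustration under stated assumptions, not OLMo's actual implementation: the function names, the toy dimensions, and the weight values are all hypothetical.

```python
import math

def layer_norm_no_params(x, eps=1e-5):
    """Normalize a vector to zero mean and (near) unit variance.

    "Non-parametric" here means there is no learned gain or bias.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x. No bias term is added."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def silu(v):
    """SiLU (Swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def swiglu_ffn(x, w_gate, w_up, w_down):
    """Bias-free SwiGLU block: w_down @ (silu(w_gate @ x) * (w_up @ x))."""
    gate = [silu(v) for v in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)

# Toy example: model width 4, hidden width 2 (hypothetical values).
x = [1.0, 2.0, 3.0, 4.0]
w_gate = [[0.1, 0.2, 0.3, 0.4], [0.0, -0.1, 0.1, 0.0]]
w_up = [[0.5, 0.0, -0.5, 0.0], [0.2, 0.2, 0.2, 0.2]]
w_down = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 1.0]]

normed = layer_norm_no_params(x)
out = swiglu_ffn(normed, w_gate, w_up, w_down)
```

Because no bias terms appear anywhere in the block, an all-zero input maps to an all-zero output, which is an easy property to check when experimenting with the sketch.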

Pretraining Data: A Deep Dive

The data used for pretraining is as critical as the models themselves. OLMo's training dataset, Dolma, is a curated collection of publicly available texts processed through a rigorous pipeline. By disclosing Dolma, OLMo lets researchers replicate and study how diverse, high-quality pretraining corpora are assembled, promoting more transparent language model experimentation.

Evaluation Protocol

Empirical evaluation is an essential part of the LLM development lifecycle. OLMo's evaluation framework operates in two modes: an in-loop assessment during training that informs model adjustments, and a detailed offline evaluation against established benchmarks. The released checkpoints include sufficient metadata to allow systematic analysis of the model's performance over the course of training.
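The in-loop half of this protocol can be illustrated with a toy training loop that pauses at fixed intervals to evaluate and to store a checkpoint with its metadata. This is only a sketch of the general pattern: the objective (fitting a single weight to the value 3), the function name, and the checkpoint fields are hypothetical and do not reflect OLMo's training code.

```python
def train_with_in_loop_eval(num_steps, eval_every, lr=0.1):
    """Toy loop: minimize (w - 3)^2 by gradient descent, evaluating periodically."""
    w = 0.0
    checkpoints = []
    for step in range(1, num_steps + 1):
        grad = 2.0 * (w - 3.0)   # derivative of (w - 3)^2
        w -= lr * grad
        if step % eval_every == 0:
            # In-loop evaluation: measure the loss and save it with the checkpoint.
            eval_loss = (w - 3.0) ** 2
            checkpoints.append({"step": step, "weights": w, "eval_loss": eval_loss})
    return checkpoints

ckpts = train_with_in_loop_eval(num_steps=100, eval_every=20)
losses = [c["eval_loss"] for c in ckpts]
```

Storing the step and metric alongside each checkpoint is what makes it possible to trace performance over the course of training after the run has finished.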

Training Efficiency and Carbon Footprint

In line with growing environmental concerns, the paper also reports the models' training efficiency and carbon emissions. OLMo was trained on both NVIDIA and AMD GPUs, with explicit documentation of power consumption and emissions, raising awareness of the environmental impact of high-performance computing.
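A common way to turn measured power consumption into an emissions estimate is to multiply the energy used by the data center's power usage effectiveness (PUE) and by the carbon intensity of the local grid. The sketch below assumes that formula; the function name and all input values are placeholders chosen for illustration, not figures reported for OLMo.

```python
def estimate_emissions_tco2e(energy_mwh, pue, carbon_intensity_t_per_mwh):
    """Estimate emissions in tonnes of CO2-equivalent.

    energy_mwh: energy drawn by the accelerators during training (MWh)
    pue: power usage effectiveness, i.e. facility overhead multiplier (>= 1.0)
    carbon_intensity_t_per_mwh: grid carbon intensity (tCO2e per MWh)
    """
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return energy_mwh * pue * carbon_intensity_t_per_mwh

# Placeholder inputs: 100 MWh of GPU energy, PUE of 1.1, 0.5 tCO2e per MWh.
grid_run = estimate_emissions_tco2e(100.0, 1.1, 0.5)

# A facility powered entirely by carbon-free energy has zero intensity.
clean_run = estimate_emissions_tco2e(100.0, 1.1, 0.0)
```

With these placeholder inputs the first run comes to roughly 55 tCO2e, while the second is zero regardless of how much energy was consumed, which is why reporting both power usage and grid intensity matters.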

Artifacts and Licensing

The project backs its commitment to openness by releasing its code and model weights under the Apache 2.0 License. This permissive license allows wide-ranging experimentation and application, potentially lowering barriers to entry into LLM research.

By releasing models, code, data, and insights from OLMo, the authors deliver a rich repository to the research community. This effort not only bridges the existing transparency gap in language model research but also provides a foundational platform to nurture understanding and foster innovation in the field.

