OLMo: Accelerating the Science of Language Models

(arXiv: 2402.00838)
Published Feb 1, 2024 in cs.CL

Abstract

Language models (LMs) have become ubiquitous in both NLP research and in commercial product offerings. As their commercial importance has surged, the most powerful models have become closed off, gated behind proprietary interfaces, with important details of their training data, architectures, and development undisclosed. Given the importance of these details in scientifically studying these models, including their biases and potential risks, we believe it is essential for the research community to have access to powerful, truly open LMs. To this end, this technical report details the first release of OLMo, a state-of-the-art, truly Open Language Model and its framework to build and study the science of language modeling. Unlike most prior efforts that have only released model weights and inference code, we release OLMo and the whole framework, including training data and training and evaluation code. We hope this release will empower and strengthen the open research community and inspire a new wave of innovation.

Figure: Progression of OLMo-7B's accuracy on eight core tasks from the Catwalk evaluation suite.

Overview

  • OLMo provides a comprehensive framework for LLMs, enhancing open access by including training data, logs, model checkpoints, and evaluation tools.

  • The architecture of OLMo features a decoder-only transformer optimized for resource utilization and training stability, with 1B- and 7B-parameter variants that incorporate enhancements used in other state-of-the-art models.

  • OLMo's pretraining data, called Dolma, is a meticulously curated dataset aimed at promoting transparent and high-quality language model development.

  • The evaluation framework of OLMo includes both continuous assessment during training and detailed offline benchmarking, complete with rich metadata.

  • The project emphasizes training efficiency and carbon footprint transparency, thoroughly documenting power usage and emissions for environmental awareness.

Overview of OLMo

OLMo represents an essential contribution to the open-access landscape of LLMs by providing a comprehensive framework that includes not only the models but also the vital components enabling their development and evaluation. Unlike preceding efforts that limited openness to model weights or parts of the pipeline, OLMo distinguishes itself by offering the complete suite, from the training data and logs to the model checkpoints and evaluation tools. This unprecedented degree of access is poised to democratize LLM research, providing a holistic resource for deeper understanding and advancement of language modeling science.

Architecture & Framework

The OLMo models use a decoder-only transformer architecture, optimized for efficient use of computational resources and for training stability. The paper presents model variants at the 1B and 7B scales, with enhancements such as the elimination of bias terms, non-parametric layer normalization, and the SwiGLU activation function. These modifications parallel those adopted in other state-of-the-art models, and comparisons against them show that OLMo's structural design is at the cutting edge.
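To make these design choices concrete, the sketch below shows a bias-free SwiGLU feed-forward block and a non-parametric (no learnable gain or bias) layer norm in PyTorch. It is a minimal illustration of the named techniques; the module names and dimensions are hypothetical and not taken from the OLMo codebase.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonParametricLayerNorm(nn.Module):
    """Layer normalization without learnable gain or bias."""
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.dim = dim
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Passing no weight/bias disables the elementwise affine transform.
        return F.layer_norm(x, (self.dim,), weight=None, bias=None, eps=self.eps)

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: W_d(SiLU(x W_g) * x W_u), all projections bias-free."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

# Example with placeholder sizes (not the actual OLMo dimensions):
x = torch.randn(2, 16, 4096)                                   # (batch, sequence, model dim)
block = nn.Sequential(NonParametricLayerNorm(4096), SwiGLU(4096, 11008))
print(block(x).shape)                                          # torch.Size([2, 16, 4096])
```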

Pretraining Data: A Deep Dive

The data underpinning model pretraining is as critical as the models themselves. OLMo's training dataset, Dolma, is a curated amalgamation of publicly available texts processed through a rigorous pipeline. By releasing Dolma, OLMo enables researchers to replicate and understand the intricacies of assembling pretraining corpora that are diverse and of high quality, promoting more transparent language model experimentation.
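To illustrate what such a pipeline involves, here is a deliberately simplified sketch of the kind of filtering and exact-deduplication pass a pretraining corpus goes through. It is illustrative only; the heuristics, thresholds, and function names are assumptions and do not reproduce Dolma's actual filters or tooling.

```python
import hashlib
from typing import Iterable, Iterator

def clean_corpus(docs: Iterable[str], min_words: int = 50) -> Iterator[str]:
    """Toy pipeline: length filter, crude quality heuristic, exact dedup by hash."""
    seen_hashes = set()
    for text in docs:
        words = text.split()
        if len(words) < min_words:                # drop very short documents
            continue
        alpha_ratio = sum(w.isalpha() for w in words) / len(words)
        if alpha_ratio < 0.8:                     # drop documents that are mostly non-words
            continue
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:                 # exact deduplication
            continue
        seen_hashes.add(digest)
        yield text

# Usage: filtered = list(clean_corpus(raw_documents))
```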

Evaluation Protocol

Empirical evaluation is an essential part of the development lifecycle of LLMs. OLMo's evaluation framework operates along two dimensions: ongoing in-loop assessment during training to inform model adjustments, and detailed offline evaluation against established benchmarks. The released checkpoints include sufficient metadata to allow methodical analysis of the model's performance over the course of its training run.
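The sketch below illustrates the general shape of an in-loop downstream check: a lightweight multiple-choice accuracy measurement that a trainer could run every N steps and log alongside the checkpoint step. It assumes a Hugging Face-style causal LM interface; the task format, scoring approximation, and helper names are assumptions for illustration, not the actual OLMo or Catwalk code.

```python
import torch

@torch.no_grad()
def multiple_choice_accuracy(model, tokenizer, examples, device="cpu"):
    """Score each answer choice by (approximate) total log-likelihood and pick the argmax."""
    correct = 0
    for ex in examples:                            # ex = {"question", "choices", "label"}
        scores = []
        for choice in ex["choices"]:
            ids = tokenizer(ex["question"] + " " + choice,
                            return_tensors="pt").input_ids.to(device)
            out = model(ids, labels=ids)           # causal LM loss = mean NLL per token
            scores.append(-out.loss.item() * (ids.shape[1] - 1))  # approx. total log-likelihood
        pred = max(range(len(scores)), key=scores.__getitem__)
        correct += int(pred == ex["label"])
    return correct / len(examples)

# Hypothetical in-loop usage: every eval_interval steps,
#   acc = multiple_choice_accuracy(model, tokenizer, dev_examples)
#   logger.log({"step": step, "downstream/accuracy": acc})
```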

Training Efficiency and Carbon Footprint

In line with escalating environmental concerns, the paper also documents the models' training efficiency and carbon emissions. OLMo was trained on both NVIDIA and AMD GPUs, with explicit reporting of power consumption and emissions, raising awareness of the environmental impact of high-performance computing.
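A common way to arrive at such figures is to multiply measured GPU energy by the data center's power usage effectiveness (PUE) and the local grid's carbon intensity. The snippet below sketches that arithmetic with placeholder numbers; it is not the formula or the values reported in the paper.

```python
def estimate_emissions(gpu_count: int, avg_power_kw: float, hours: float,
                       pue: float = 1.1, grid_intensity_kg_per_kwh: float = 0.4) -> float:
    """Rough CO2 estimate (kg): GPU energy draw scaled by PUE and grid carbon intensity."""
    energy_kwh = gpu_count * avg_power_kw * hours * pue
    return energy_kwh * grid_intensity_kg_per_kwh

# Placeholder example: 256 GPUs drawing 0.3 kW each for 1,000 hours
print(f"{estimate_emissions(256, 0.3, 1_000):,.0f} kg CO2eq")
```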

Artifacts and Licensing

The project underscores its commitment to openness by releasing all of its assets under the Apache 2.0 License. This permissive license facilitates wide-ranging experimentation and application, lowering barriers to entry into LLM research.

By releasing models, code, data, and insights from OLMo, the authors deliver a rich repository to the research community. This effort not only bridges the existing transparency gap in language model research but also provides a foundational platform to nurture understanding and foster innovation in the field.

References
  1. SemDeDup: Data-efficient learning at web-scale through semantic deduplication
  2. The Falcon Series of Open Language Models
  3. Layer Normalization
  4. A neural probabilistic language model. J. Mach. Learn. Res., 3:1137–1155, 2003. https://api.semanticscholar.org/CorpusID:221275765.

  5. Pythia: A suite for analyzing LLMs across training and scaling. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 2397–2430. PMLR, 23–29 Jul 2023. https://proceedings.mlr.press/v202/biderman23a.html.

  6. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
  7. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432–7439, 2020. https://ojs.aaai.org/index.php/AAAI/article/view/6239.

  8. GPT-NeoX-20B: An Open-Source Autoregressive Language Model
  9. Demographic dialectal variation in social media: A case study of African-American English. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1119–1130, Austin, Texas, November 2016. Association for Computational Linguistics. doi: 10.18653/v1/D16-1120. https://aclanthology.org/D16-1120.

  10. Language Models are Few-Shot Learners
  11. PaLM: Scaling Language Modeling with Pathways
  12. Efficient hierarchical domain adaptation for pretrained language models. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1336–1351, Seattle, United States, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.naacl-main.96. https://aclanthology.org/2022.naacl-main.96.

  13. UniMax: Fairer and more Effective Language Sampling for Large-Scale Multilingual Pretraining
  14. BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
  15. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
  16. Measuring the carbon intensity of ai in cloud instances, 2022. https://dl.acm.org/doi/10.1145/3531146.3533234.
  17. Automatically constructing a corpus of sentential paraphrases. In International Joint Conference on Natural Language Processing, 2005. https://www.microsoft.com/en-us/research/publication/automatically-constructing-a-corpus-of-sentential-paraphrases/.

  18. What's In My Big Data?
  19. The Pile: An 800GB Dataset of Diverse Text for Language Modeling
  20. A framework for few-shot language model evaluation, 12 2023. https://zenodo.org/records/10256836.

  21. The international corpus of english (ICE) project. World Englishes, 15(1):3–15, mar 1996. doi: 10.1111/j.1467-971x.1996.tb00088.x. https://doi.org/10.1111%2Fj.1467-971x.1996.tb00088.x.

  22. Catwalk: A Unified Language Model Evaluation Framework for Many Datasets
  23. OpenLM: a minimal but performative language modeling (lm) repository, 2023. https://github.com/mlfoundations/open_lm/. GitHub repository.

  24. Camels in a Changing Climate: Enhancing LM Adaptation with Tulu 2
  25. Mixtral of Experts
  26. Holistic Evaluation of Language Models
  27. LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning
  28. LLM360: Towards Fully Transparent Open-Source LLMs
  29. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. https://openreview.net/forum?id=Bkg6RiCqY7.

  30. Estimating the Carbon Footprint of BLOOM, a 176B Parameter Language Model
  31. Paloma: A Benchmark for Evaluating Language Model Fit
  32. Treebank-3, 1999. https://catalog.ldc.upenn.edu/LDC99T42.

  33. Pointer Sentinel Mixture Models
  34. Mixed Precision Training
  35. Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering
  36. Distributed representations of words and phrases and their compositionality. In Neural Information Processing Systems, 2013. https://api.semanticscholar.org/CorpusID:16447573.

  37. MosaicML NLP Team. Introducing MPT-7B: A new standard for open-source, commercially usable LLMs, 2023. www.mosaicml.com/blog/mpt-7b. Accessed: 2023-05-05.
  38. Scaling Data-Constrained Language Models
  39. Davide Nunes. Preprocessed penn tree bank, 2020. https://zenodo.org/record/3910021.

  40. GPT-4 Technical Report
  41. Raiders of the lost kek: 3.5 years of augmented 4chan posts from the politically incorrect board. Proceedings of the International AAAI Conference on Web and Social Media, 14:885–894, may 2020. doi: 10.1609/icwsm.v14i1.7354. https://doi.org/10.1609%2Ficwsm.v14i1.7354.

  42. Carbon Emissions and Large Neural Network Training
  43. The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
  44. Deep contextualized word representations
  45. WiC: the Word-in-Context Dataset for Evaluating Context-Sensitive Meaning Representations
  46. Scaling Language Models: Methods, Analysis & Insights from Training Gopher
  47. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(1), jan 2020. ISSN 1532-4435.
  48. Zero: Memory optimizations toward training trillion parameter models. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16, 2019. https://api.semanticscholar.org/CorpusID:203736482.

  49. M2D2: A massively multi-domain language modeling dataset. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 964–975, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. https://aclanthology.org/2022.emnlp-main.63.

  50. The evolution of the manosphere across the web. Proceedings of the International AAAI Conference on Web and Social Media, 15:196–207, may 2021. doi: 10.1609/icwsm.v15i1.18053. https://doi.org/10.1609%2Ficwsm.v15i1.18053.

  51. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series, 2011. https://aaai.org/papers/02418-2418-choice-of-plausible-alternatives-an-evaluation-of-commonsense-causal-reasoning/.

  52. Ronald Rosenfeld. Two decades of statistical language modeling: Where do we go from here? Proceedings of the IEEE, 88(8):1270–1278, 2000.
  53. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021. https://dl.acm.org/doi/abs/10.1145/3474381.

  54. GLU Variants Improve Transformer
  55. Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
  56. Energy and policy considerations for deep learning in NLP. In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3645–3650, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1355. https://aclanthology.org/P19-1355.

  57. RoFormer: Enhanced Transformer with Rotary Position Embedding
  58. Together Computer. RedPajama: An Open Source Recipe to Reproduce LLaMA training dataset, April 2023. https://github.com/togethercomputer/RedPajama-Data.

  59. LLaMA: Open and Efficient Foundation Language Models
  60. Llama 2: Open Foundation and Fine-Tuned Chat Models
  61. Water Security and Climate Change: Hydropower Reservoir Greenhouse Gas Emissions, pages 69–94. Springer Singapore, Singapore, 2022. ISBN 978-981-16-5493-0. doi: 10.1007/978-981-16-5493-0_5. https://doi.org/10.1007/978-981-16-5493-0_5.

  62. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.

  63. HEAD-QA: A healthcare dataset for complex reasoning. In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 960–966, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1092. https://aclanthology.org/P19-1092.

  64. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
  65. How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources
  66. Crowdsourcing Multiple Choice Science Questions
  67. Sustainable AI: Environmental Implications, Challenges and Opportunities
  68. What is gab: A bastion of free speech or an alt-right echo chamber. In Companion Proceedings of the The Web Conference 2018, WWW ’18, page 1007–1014, Republic and Canton of Geneva, CHE, 2018. International World Wide Web Conferences Steering Committee. ISBN 9781450356404. doi: 10.1145/3184558.3191531. https://doi.org/10.1145/3184558.3191531.
  69. HellaSwag: Can a Machine Really Finish Your Sentence?
  70. Root Mean Square Layer Normalization
  71. OPT: Open Pre-trained Transformer Language Models
  72. Pytorch fsdp: Experiences on scaling fully sharded data parallel. Proc. VLDB Endow., 16:3848–3860, 2023. https://api.semanticscholar.org/CorpusID:258297871.
