
LLM360: Towards Fully Transparent Open-Source LLMs

(arXiv:2312.06550)
Published Dec 11, 2023 in cs.CL, cs.AI, and cs.LG

Abstract

The recent surge in open-source LLMs, such as LLaMA, Falcon, and Mistral, provides diverse options for AI practitioners and researchers. However, most LLMs have only released partial artifacts, such as the final model weights or inference code, and technical reports increasingly limit their scope to high-level design choices and surface statistics. These choices hinder progress in the field by degrading transparency into the training of LLMs and forcing teams to rediscover many details in the training process. We present LLM360, an initiative to fully open-source LLMs, which advocates for all training code and data, model checkpoints, and intermediate results to be made available to the community. The goal of LLM360 is to support open and collaborative AI research by making the end-to-end LLM training process transparent and reproducible by everyone. As a first step of LLM360, we release two 7B parameter LLMs pre-trained from scratch, Amber and CrystalCoder, including their training code, data, intermediate checkpoints, and analyses (at https://www.llm360.ai). We are committed to continually pushing the boundaries of LLMs through this open-source effort. More large-scale and stronger models are underway and will be released in the future.

Figure: CrystalCoder's performance on the Open LLM Leaderboard across its three training stages, with stage boundaries marked by grey dashed lines.

Overview

  • LLM360 is an initiative dedicated to increasing the transparency of LLMs by advocating for the open-sourcing of training details.

  • The paper describes the challenges in LLM research, such as data provenance issues, lack of reproducibility, and barriers to collaboration.

  • LLM360 introduces two open-source LLMs, Amber and CrystalCoder, complete with training code, data, model checkpoints, and analyses.

  • The initiative emphasizes full transparency, from model weights to training code, to foster innovation and reproducibility in LLM research.

  • LLM360 envisions the continuous release of open-source models, contributing to the advancement of LLM pre-training techniques and community collaboration.

Introduction

The paper introduces LLM360, an initiative to enhance the transparency of LLMs by promoting the open-sourcing of comprehensive training details. It underscores the recent trend of restricting access to LLM training processes, which creates hurdles for reproducibility and innovation. LLM360 aims to reverse this trend by advocating the sharing of training code, data, model checkpoints, and analyses. As part of this initiative, the paper highlights the release of two LLMs, Amber and CrystalCoder, accompanied by extensive training materials made available to the public.

Transparency and Challenges in LLM Research

The open-sourcing philosophy behind LLM360 extends beyond model weights to training code and the nuanced details involved in creating LLMs. This approach is designed to address several challenges in the LLM field, such as:

  • Data provenance: understanding what data a model was trained on, which is necessary for assessing and mitigating biases.
  • Reproducibility hurdles: without the full training configuration, reported results are difficult to validate.
  • Barriers to open collaboration: releasing only final model weights limits research into emergent abilities and into how training data shapes LLM behavior (a minimal checkpoint-analysis sketch follows this list).
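
Because intermediate checkpoints are released, this kind of study reduces to a short script rather than a re-training effort. Below is a minimal sketch, assuming the checkpoints are published on the Hugging Face Hub as revision branches of a model repo such as LLM360/Amber; the repo id, branch names, and probe prompt are illustrative assumptions, not details taken from the paper.

```python
# Sketch: score a fixed probe prompt at several points along the training run
# to see how the model's behavior changes as training progresses.
# The repo id and revision names below are assumptions, not taken from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

REPO = "LLM360/Amber"                              # assumed Hub repo id
REVISIONS = ["ckpt_100", "ckpt_200", "ckpt_300"]   # hypothetical checkpoint branches

probe = "Question: What is 17 + 25?\nAnswer: 42"

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

tokenizer = AutoTokenizer.from_pretrained(REPO)
inputs = {k: v.to(device) for k, v in tokenizer(probe, return_tensors="pt").items()}

for rev in REVISIONS:
    model = (
        AutoModelForCausalLM.from_pretrained(REPO, revision=rev, torch_dtype=dtype)
        .to(device)
        .eval()
    )
    with torch.no_grad():
        # Language-modeling loss of the probe under this checkpoint; a falling
        # loss across revisions suggests the behavior is being acquired.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    print(f"{rev}: probe loss = {loss.item():.3f}")
```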

LLM360 Framework and Initial Model Releases

LLM360 focuses on a complete open-source effort that includes all training components, intermediate checkpoints, model configurations, and data origins. Specifically, the paper introduces Amber and CrystalCoder, two 7B-parameter LLMs trained from scratch, and details their development, data sources, and training methodologies. The framework embodies transparency across code, training procedures, and intermediate checkpoints, aiming to set a standard for future model releases. A short sketch of how these released artifacts can be fetched follows.
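
As a concrete illustration of what "complete" means here, the sketch below pulls one intermediate checkpoint and one slice of the released pre-training data. It is a minimal sketch under the assumption that the artifacts are hosted on the Hugging Face Hub; the repo ids, revision name, and file layout are hypothetical placeholders, not taken from the paper.

```python
# Sketch: download a single intermediate checkpoint and one pre-training data shard.
# Repo ids, revision, and the shard path pattern are assumptions for illustration.
from huggingface_hub import snapshot_download

# One intermediate model checkpoint, assumed to be published as a git revision
# (branch/tag) of the model repository.
ckpt_dir = snapshot_download(repo_id="LLM360/Amber", revision="ckpt_100")

# A slice of the corresponding pre-training data, assumed to be released as a
# dataset repository; allow_patterns restricts the download to one shard.
data_dir = snapshot_download(
    repo_id="LLM360/AmberDatasets",          # hypothetical dataset repo id
    repo_type="dataset",
    allow_patterns=["train/chunk_100/*"],    # hypothetical shard layout
)

print("checkpoint files in:", ckpt_dir)
print("data shard files in:", data_dir)
```

Pairing a checkpoint with the data it was trained on is what enables the checkpoint-level analyses (reproducibility checks, memorization and data-effect studies) that motivate the initiative.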

Future Directions and Conclusion

Looking ahead, LLM360 promises the release of larger, more powerful models while maintaining its open-source principles. The initiative paves the way for continuous research collaboration and methodological development, aiming to improve training data mixtures, filtering techniques, and optimization strategies. The paper concludes with a commitment to the LLM360 vision of advancing both capability and openness in LLM pre-training, while acknowledging the need for responsible use, risk management, and community engagement.
