The recent surge in open-source LLMs, such as LLaMA, Falcon, and Mistral, provides diverse options for AI practitioners and researchers. However, most LLMs have only released partial artifacts, such as the final model weights or inference code, and technical reports increasingly limit their scope to high-level design choices and surface statistics. These choices hinder progress in the field by degrading transparency into the training of LLMs and forcing teams to rediscover many details in the training process. We present LLM360, an initiative to fully open-source LLMs, which advocates for all training code and data, model checkpoints, and intermediate results to be made available to the community. The goal of LLM360 is to support open and collaborative AI research by making the end-to-end LLM training process transparent and reproducible by everyone. As a first step of LLM360, we release two 7B parameter LLMs pre-trained from scratch, Amber and CrystalCoder, including their training code, data, intermediate checkpoints, and analyses (at https://www.llm360.ai). We are committed to continually pushing the boundaries of LLMs through this open-source effort. More large-scale and stronger models are underway and will be released in the future.
LLM360 is an initiative dedicated to increasing the transparency of LLMs by advocating for the open-sourcing of training details.
The paper describes the challenges in LLM research, such as data provenance issues, lack of reproducibility, and barriers to collaboration.
LLM360 introduces two open-sourced LLMs, Amber and CrystalCoder, complete with training materials, model checkpoints, and analyses.
The initiative emphasizes full transparency, from model weights to training code, to foster innovation and replicability in LLM research.
LLM360 envisions the continuous release of open-source models, contributing to the advancement of LLM pre-training techniques and community collaboration.
The paper introduces LLM360, an initiative to enhance the transparency of LLMs by promoting the open-sourcing of comprehensive training details. The initiative responds to the recent trend of restricting access to the training processes of LLMs, which creates hurdles to replicability and innovation. LLM360 aims to reverse this trend by advocating the sharing of training code, data, model checkpoints, and analyses. As part of this initiative, the paper highlights the release of two LLMs, Amber and CrystalCoder, accompanied by extensive training materials made available to the public.
The open-sourcing philosophy behind LLM360 extends from model weights to training code and the nuanced details involved in the creation of LLMs. This approach is designed to combat several challenges in the LLM field, such as unclear data provenance, irreproducible results, and barriers to open collaboration.
LLM360 focuses on a complete open-source effort that includes all training components, intermediate checkpoints, model configurations, and data origins. Specifically, the paper introduces Amber and CrystalCoder, two LLMs trained from scratch at the 7B parameter scale, and documents their development details, data sources, and training methodologies. The framework embodies transparency across code, training procedures, and intermediate checkpoints, aiming to set a standard for future model releases.
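To make the notion of intermediate checkpoints concrete, the following is a minimal sketch of loading one released checkpoint for analysis with the Hugging Face transformers library. The repository name and revision tag used here (LLM360/Amber, ckpt_100) are illustrative assumptions; the actual artifact names should be taken from https://www.llm360.ai.

```python
# Minimal sketch: inspecting an intermediate pre-training checkpoint.
# Assumes the checkpoints are published as revisions of a Hugging Face
# model repository; "LLM360/Amber" and "ckpt_100" are illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "LLM360/Amber"   # assumed repository name
revision = "ckpt_100"   # assumed tag for an intermediate checkpoint

tokenizer = AutoTokenizer.from_pretrained(repo, revision=revision)
model = AutoModelForCausalLM.from_pretrained(repo, revision=revision)

# Generate a short continuation to probe the partially trained model.
inputs = tokenizer("Open-source language models enable", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Loading several such revisions in a loop is one way to study how model behavior evolves over the course of pre-training, which is the kind of analysis the released checkpoints are intended to support.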
Looking ahead, LLM360 promises the publication of larger, more powerful models while maintaining open-source principles. The initiative paves the way for continued research collaboration and methodological development, informing better choices of training data mixtures, filtering techniques, and optimization strategies. The paper concludes with a commitment to the LLM360 vision of advancing both capability and openness in LLM pre-training, while acknowledging the need for responsible use, risk management, and community engagement.