LLM360: Towards Fully Transparent Open-Source LLMs (2312.06550v1)
Abstract: The recent surge in open-source LLMs, such as LLaMA, Falcon, and Mistral, provides diverse options for AI practitioners and researchers. However, most open-source releases include only partial artifacts, such as the final model weights or inference code, and technical reports increasingly limit their scope to high-level design choices and surface statistics. These choices hinder progress in the field by reducing transparency into how LLMs are trained and forcing teams to rediscover many details of the training process. We present LLM360, an initiative to fully open-source LLMs, which advocates for all training code and data, model checkpoints, and intermediate results to be made available to the community. The goal of LLM360 is to support open and collaborative AI research by making the end-to-end LLM training process transparent and reproducible for everyone. As a first step, we release two 7B-parameter LLMs pre-trained from scratch, Amber and CrystalCoder, including their training code, data, intermediate checkpoints, and analyses (at https://www.LLM360.ai). We are committed to continually pushing the boundaries of LLMs through this open-source effort, and larger, stronger models are underway and will be released in the future.
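To make concrete what releasing intermediate checkpoints looks like in practice, here is a minimal sketch of loading one of the released models with the Hugging Face transformers library. The repository id LLM360/Amber and the use of a revision argument to select an intermediate checkpoint are assumptions about how the artifacts are hosted, not details taken from the paper; https://www.LLM360.ai remains the authoritative source.

```python
# Minimal sketch (assumptions, not from the paper): load a released LLM360 model
# with Hugging Face transformers and run a short generation.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LLM360/Amber"  # assumed Hugging Face repository id for the Amber release

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    # revision="ckpt_100",  # hypothetical branch name; intermediate checkpoints
    #                       # are assumed to be exposed as repository revisions
)

prompt = "Fully transparent open-source LLMs enable"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```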
- OpenAI. GPT-4 technical report, 2023.
- Anthropic. Claude 2.1 model card. Technical report, Anthropic, 2023.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116, 2023.
- Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.
- Skywork: A more open bilingual foundation model, 2023.
- Don’t make your LLM an evaluation benchmark cheater, 2023.
- Together Computer. RedPajama: An open dataset for training large language models, 2023.
- Together Computer. RedPajama-INCITE-7B-Base, 2023.
- OpenLLaMA: An open reproduction of LLaMA, May 2023.
- Emergent and predictable memorization in large language models. arXiv preprint arXiv:2304.11158, 2023.
- Emergent abilities of large language models, 2022.
- Large language model as attributed training data generator: A tale of diversity and bias, 2023.
- DoReMi: Optimizing data mixtures speeds up language model pretraining. arXiv preprint arXiv:2305.10429, 2023.
- Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397–2430. PMLR, 2023.
- Open LLM Leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, 2023.
- GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax, May 2021.
- GPT-NeoX-20B: An open-source autoregressive language model, 2022.
- OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
- BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022.
- MosaicML NLP Team. Introducing MPT-7B: A new standard for open-source, commercially usable LLMs, 2023. Accessed: 2023-05-05.
- The Falcon series of open language models, 2023.
- Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
- 01.ai. 01-ai/yi: A series of large language models trained from scratch by developers @01-ai, 2023.
- SlimPajama-DC: Understanding data combinations for LLM training, 2023.
- GPT-NeoX: Large scale autoregressive language modeling in PyTorch, September 2023.
- Scaling laws and interpretability of learning from repeated data, 2022.
- Efficient large-scale language model training on GPU clusters using Megatron-LM. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–15, 2021.
- RoFormer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864, 2021.
- LLM.int8(): 8-bit matrix multiplication for transformers at scale, 2022.
- LLM-QAT: Data-free quantization aware training for large language models, 2023.
- Adam: A method for stochastic optimization, 2017.
- GLM-130B: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414, 2023.
- Mixed precision training. arXiv preprint arXiv:1710.03740, 2017.
- WizardLM: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023.
- Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, 2023.
- PyTorch FSDP: Experiences on scaling fully sharded data parallel, 2023.
- Direct preference optimization: Your language model is secretly a reward model, 2023.
- BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset, 2023.
- SlimPajama: A 627B token cleaned and deduplicated version of RedPajama. https://www.cerebras.net/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama, 2023.
- StarCoder: May the source be with you! arXiv preprint arXiv:2305.06161, 2023.
- Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
- Tensor programs v: Tuning large neural networks via zero-shot hyperparameter transfer. arXiv preprint arXiv:2203.03466, 2022.
- Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), pages 2633–2650, 2021.
- Quantifying memorization across neural language models. arXiv preprint arXiv:2202.07646, 2022.
- Pufferfish: Communication-efficient models at no extra cost. Proceedings of Machine Learning and Systems, 3:365–386, 2021.
- Cuttlefish: Low-rank model training without all the tuning. Proceedings of Machine Learning and Systems, 5, 2023.
- Taxonomy of risks posed by language models. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’22, pages 214–229, New York, NY, USA, 2022. Association for Computing Machinery.
- BOLD: Dataset and metrics for measuring biases in open-ended language generation. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21. ACM, March 2021.