JetMoE: Reaching Llama2 Performance with 0.1M Dollars (2404.07413v1)
Abstract: LLMs have achieved remarkable results, but their increasing resource demands have become a major obstacle to the development of powerful and accessible super-human intelligence. This report introduces JetMoE-8B, a new LLM trained for less than $0.1 million, using 1.25T tokens from carefully mixed open-source corpora and 30,000 H100 GPU hours. Despite its low cost, JetMoE-8B demonstrates impressive performance: it outperforms the Llama2-7B model, and JetMoE-8B-Chat surpasses the Llama2-13B-Chat model. These results suggest that LLM training can be much more cost-effective than generally thought. JetMoE-8B is based on an efficient Sparsely-gated Mixture-of-Experts (SMoE) architecture composed of attention and feedforward experts. Both layers are sparsely activated, allowing JetMoE-8B to have 8B parameters while activating only 2B per input token, reducing inference computation by about 70% compared to Llama2-7B. Moreover, JetMoE-8B is highly open and academia-friendly, using only public datasets and training code. All training parameters and data mixtures are detailed in this report to facilitate future efforts in the development of open foundation models. This transparency aims to encourage collaboration and further advancements in the field of accessible and efficient LLMs. The model weights are publicly available at https://github.com/myshell-ai/JetMoE.
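To make the sparse-activation claim concrete, the sketch below shows a generic top-2 sparsely-gated MoE feedforward layer in PyTorch. It is a minimal illustration, not JetMoE's actual implementation: the module name, hidden sizes, expert count, and top-2 routing are assumed for the example, and JetMoE additionally applies the same routing idea to attention experts, which is omitted here.

```python
# Minimal sketch of a sparsely-gated mixture-of-experts (SMoE) feedforward layer
# with top-2 routing. All dimensions and the expert count are illustrative
# placeholders, not JetMoE-8B's configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoEFeedForward(nn.Module):
    def __init__(self, d_model=512, d_hidden=1024, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Router scores each token against every expert.
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # One small two-layer MLP per expert.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):
        # x: (batch, seq_len, d_model) -> flatten tokens for routing.
        tokens = x.reshape(-1, x.shape[-1])
        scores = self.router(tokens)                         # (num_tokens, num_experts)
        top_vals, top_idx = scores.topk(self.top_k, dim=-1)  # keep only the top-k experts
        weights = F.softmax(top_vals, dim=-1)                # renormalize over chosen experts

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            # Which (token, slot) pairs routed to expert e?
            token_ids, slot_ids = (top_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            out[token_ids] += weights[token_ids, slot_ids].unsqueeze(-1) * expert(tokens[token_ids])
        return out.reshape(x.shape)


# Only the selected experts run for each token, so the active parameter count per
# token is a fraction of the total -- the property that lets JetMoE-8B activate
# roughly 2B of its 8B parameters per input token.
moe = SparseMoEFeedForward()
y = moe(torch.randn(2, 16, 512))
print(y.shape)  # torch.Size([2, 16, 512])
```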