Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM (2403.07816v1)
Abstract: We investigate efficient methods for training LLMs to possess capabilities in multiple specialized domains, such as coding, math reasoning, and world knowledge. Our method, named Branch-Train-MiX (BTX), starts from a seed model, which is branched to train experts in an embarrassingly parallel fashion with high throughput and reduced communication cost. After individual experts are asynchronously trained, BTX brings together their feedforward parameters as experts in Mixture-of-Experts (MoE) layers and averages the remaining parameters, followed by an MoE-finetuning stage to learn token-level routing. BTX generalizes two special cases: the Branch-Train-Merge method, which does not have the MoE finetuning stage to learn routing, and sparse upcycling, which omits the stage of training experts asynchronously. Compared to alternative approaches, BTX achieves the best accuracy-efficiency tradeoff.
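To make the merging step concrete, below is a minimal sketch (not the authors' released code) of how branched dense experts could be combined into an MoE layer as the abstract describes: each expert's feedforward (FFN) weights become one MoE expert, all remaining parameters are averaged into a single shared copy, and a small linear router, learned during the MoE-finetuning stage, selects the top-k experts per token. Module and parameter names (`ffn`, `d_model`, `top_k`) are illustrative assumptions, not names from the paper.

```python
# Sketch of the BTX merge + token-level routing idea, under assumed module names.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEFFN(nn.Module):
    """Token-level top-k routed mixture over the branched experts' FFNs."""

    def __init__(self, expert_ffns: list[nn.Module], d_model: int, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(expert_ffns)   # one FFN per branched dense expert
        # Router is new and is learned in the MoE-finetuning stage.
        self.router = nn.Linear(d_model, len(expert_ffns), bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Pick top-k experts per token and mix their outputs.
        logits = self.router(x)                                 # (tokens, n_experts)
        weights, idx = torch.topk(logits, self.top_k, dim=-1)   # (tokens, top_k)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


def average_non_ffn_parameters(expert_state_dicts: list[dict]) -> dict:
    """Average all non-FFN parameters across the branched experts; only the FFN
    weights are kept separate as MoE experts (assumed '.ffn.' naming convention)."""
    merged = {}
    for name in expert_state_dicts[0]:
        if ".ffn." in name:
            continue
        merged[name] = torch.stack([sd[name] for sd in expert_state_dicts]).mean(dim=0)
    return merged
```

Consistent with the abstract, dropping the router finetuning corresponds to the Branch-Train-Merge special case, while skipping the asynchronous domain training and initializing all experts from the same seed checkpoint corresponds to sparse upcycling.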
- GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Expert gate: Lifelong learning with a network of experts. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7120–7129, 2017. https://api.semanticscholar.org/CorpusID:914027.
- Program synthesis with large language models. ArXiv, abs/2108.07732, 2021. https://api.semanticscholar.org/CorpusID:237142385.
- Continual learning with neural networks: A review. In Proceedings of the ACM India Joint International Conference on Data Science and Management of Data, pages 362–365, 2019.
- Llemma: An open language model for mathematics. ArXiv, abs/2310.10631, 2023. https://api.semanticscholar.org/CorpusID:264172303.
- PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7432–7439, 2020.
- Language models are few-shot learners. ArXiv, abs/2005.14165, 2020. https://api.semanticscholar.org/CorpusID:218971783.
- Evaluating large language models trained on code. ArXiv, abs/2107.03374, 2021. https://api.semanticscholar.org/CorpusID:235755472.
- Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.
- Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models. ArXiv, abs/2401.06066, 2024. https://api.semanticscholar.org/CorpusID:266933338.
- DiLoCo: Distributed low-communication training of language models. ArXiv, abs/2311.08105, 2023. https://api.semanticscholar.org/CorpusID:265158012.
- Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. The Journal of Machine Learning Research, 23(1):5232–5270, 2022.
- Gemini Team. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- DEMix layers: Disentangling domains for modular language modeling. In North American Chapter of the Association for Computational Linguistics, 2021. https://api.semanticscholar.org/CorpusID:236976189.
- Scaling expert language models with unsupervised domain discovery. arXiv preprint arXiv:2303.14177, 2023.
- Measuring massive multitask language understanding. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021a. https://openreview.net/forum?id=d7KBjmI3GmQ.
- Measuring mathematical problem solving with the MATH dataset. ArXiv, abs/2103.03874, 2021b. https://api.semanticscholar.org/CorpusID:232134851.
- Adaptive mixtures of local experts. Neural Computation, 3:79–87, 1991. https://api.semanticscholar.org/CorpusID:572361.
- Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144, 2016.
- Mixtral of experts. ArXiv, abs/2401.04088, 2024. https://api.semanticscholar.org/CorpusID:266844877.
- TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. ArXiv, abs/1705.03551, 2017. https://api.semanticscholar.org/CorpusID:26501419.
- Sparse upcycling: Training mixture-of-experts from dense checkpoints. ArXiv, abs/2212.05055, 2022. https://api.semanticscholar.org/CorpusID:254535822.
- Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 2019.
- A continual learning survey: Defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44:3366–3385, 2019. https://api.semanticscholar.org/CorpusID:218889912.
- Base layers: Simplifying training of large, sparse models. In International Conference on Machine Learning, 2021. https://api.semanticscholar.org/CorpusID:232428341.
- Branch-Train-Merge: Embarrassingly parallel training of expert language models. ArXiv, abs/2208.03306, 2022a. https://api.semanticscholar.org/CorpusID:251371375.
- Competition-level code generation with AlphaCode. Science, 378:1092–1097, 2022b. https://api.semanticscholar.org/CorpusID:246527904.
- Training language models to follow instructions with human feedback. ArXiv, abs/2203.02155, 2022. https://api.semanticscholar.org/CorpusID:246426909.
- Hash layers for large sparse models. In Neural Information Processing Systems, 2021. https://api.semanticscholar.org/CorpusID:235367626.
- Code Llama: Open foundation models for code. ArXiv, abs/2308.12950, 2023. https://api.semanticscholar.org/CorpusID:261100919.
- Progressive neural networks. ArXiv, abs/1606.04671, 2016. https://api.semanticscholar.org/CorpusID:15350923.
- WinoGrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021.
- SocialIQA: Commonsense reasoning about social interactions. arXiv preprint arXiv:1904.09728, 2019.
- DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. ArXiv, abs/2402.03300, 2024. https://api.semanticscholar.org/CorpusID:267412607.
- Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. ArXiv, abs/1701.06538, 2017. https://api.semanticscholar.org/CorpusID:12462234.
- Llama 2: Open foundation and fine-tuned chat models, 2023.
- Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. ArXiv, abs/2203.05482, 2022. https://api.semanticscholar.org/CorpusID:247362886.
- OpenMoE: An early effort on open mixture-of-experts language models. arXiv preprint arXiv:2402.01739, 2024.
- Deep learning with elastic averaging SGD. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. https://proceedings.neurips.cc/paper_files/paper/2015/file/d18f655c3fce66ca401d5f38b48c89af-Paper.pdf.
- OPT: Open pre-trained transformer language models. ArXiv, abs/2205.01068, 2022. https://api.semanticscholar.org/CorpusID:248496292.
- Llama beyond English: An empirical study on language capability transfer. arXiv preprint arXiv:2401.01055, 2024.