Fast Inference of Mixture-of-Experts Language Models with Offloading (2312.17238v1)
Abstract: With the widespread adoption of LLMs, many deep learning practitioners are looking for strategies to run these models more efficiently. One such strategy is to use sparse Mixture-of-Experts (MoE), a class of model architectures in which only a fraction of the model's layers are active for any given input. This property allows MoE-based LLMs to generate tokens faster than their dense counterparts, but it also increases model size because of the multiple experts. Unfortunately, this makes state-of-the-art MoE LLMs difficult to run without high-end GPUs. In this work, we study the problem of running large MoE LLMs on consumer hardware with limited accelerator memory. We build upon parameter offloading algorithms and propose a novel strategy that accelerates offloading by taking advantage of innate properties of MoE LLMs. Using this strategy, we can run Mixtral-8x7B with mixed quantization on desktop hardware and free-tier Google Colab instances.
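The property the abstract leans on, that only a few experts are active for each token, is what makes offloading practical: inactive experts can stay in host memory and only the experts selected by the router need to be resident on the GPU. The snippet below is a minimal sketch of that idea under assumptions of my own (a hypothetical `OffloadedExpertCache` with an LRU eviction policy, built on PyTorch); it is not the paper's implementation, which additionally relies on mixed quantization.

```python
# Minimal sketch (not the paper's implementation): all experts live in CPU RAM,
# and only experts chosen by the router are copied to the GPU, with a small LRU
# cache to avoid re-transferring experts that were used recently.
import copy
from collections import OrderedDict

import torch
import torch.nn as nn


class OffloadedExpertCache:
    def __init__(self, cpu_experts: nn.ModuleList, capacity: int, device: str = "cuda"):
        self.cpu_experts = cpu_experts      # master copies, kept in host memory
        self.capacity = capacity            # number of experts that fit on the GPU
        self.device = device
        self._gpu = OrderedDict()           # expert_id -> GPU copy, in LRU order

    def get(self, expert_id: int) -> nn.Module:
        """Return a GPU-resident copy of the requested expert, loading it on demand."""
        if expert_id in self._gpu:          # cache hit: mark as most recently used
            self._gpu.move_to_end(expert_id)
            return self._gpu[expert_id]
        if len(self._gpu) >= self.capacity: # cache full: evict least recently used
            self._gpu.popitem(last=False)
        # cache miss: copy the expert's weights from host to accelerator memory
        gpu_expert = copy.deepcopy(self.cpu_experts[expert_id]).to(self.device)
        self._gpu[expert_id] = gpu_expert
        return gpu_expert


if __name__ == "__main__":
    # Toy usage: 8 small MLP "experts", only 4 of which fit on the accelerator.
    experts = nn.ModuleList([nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))
                             for _ in range(8)])
    device = "cuda" if torch.cuda.is_available() else "cpu"
    cache = OffloadedExpertCache(experts, capacity=4, device=device)
    x = torch.randn(1, 16, device=device)
    for expert_id in (0, 3, 0, 7):          # router decisions would supply these ids
        y = cache.get(expert_id)(x)
    print(y.shape)
```

The LRU policy here is only one reasonable heuristic; the point of the sketch is that each forward pass pays host-to-device transfer cost only for experts that are not already resident on the accelerator.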