Distributed Inference and Fine-tuning of Large Language Models Over The Internet (2312.08361v1)
Abstract: LLMs are useful in many NLP tasks and become more capable with size, with the best open-source models having over 50 billion parameters. However, using these 50B+ models requires high-end hardware, making them inaccessible to most researchers. In this work, we investigate methods for cost-efficient inference and fine-tuning of LLMs, comparing local and distributed strategies. We observe that a large enough model (50B+) can run efficiently even on geodistributed devices in a consumer-grade network. This could allow running LLMs efficiently by pooling together idle compute resources of multiple research groups and volunteers. We address two open problems: (1) how to perform inference and fine-tuning reliably if any device can disconnect abruptly and (2) how to partition LLMs between devices with uneven hardware that can join and leave at will. To do so, we develop fault-tolerant inference algorithms and load-balancing protocols that automatically assign devices to maximize the total system throughput. We showcase these algorithms in Petals, a decentralized system that runs Llama 2 (70B) and BLOOM (176B) over the Internet up to 10x faster than offloading for interactive generation. We evaluate the performance of our system in simulated conditions and in a real-world setup spanning two continents.
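To make the setup concrete, below is a minimal client-side sketch of how such a system could be driven, assuming the public `petals` Python package exposes an `AutoDistributedModelForCausalLM` class with a Transformers-style interface; the class name, keyword usage, and model identifier are assumptions for illustration, not details taken from the abstract.

```python
# Minimal client-side sketch (assumes the `petals` and `transformers` packages
# are installed and a public swarm is serving the chosen checkpoint).
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

model_name = "bigscience/bloom"  # example checkpoint (176B BLOOM, as in the abstract)

tokenizer = AutoTokenizer.from_pretrained(model_name)
# Only the embeddings and LM head are loaded locally; the transformer blocks
# are executed by remote servers that the client discovers and load-balances across.
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("A cat sat on", return_tensors="pt")["input_ids"]
# Interactive generation: tokens are produced one at a time, with attention
# caches held by the servers; a disconnected server is replaced transparently.
outputs = model.generate(inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The point of the sketch is the division of labor: the client keeps only the lightweight input/output layers and orchestrates a chain of volunteer servers, which is what makes fault tolerance and load balancing the central algorithmic problems the abstract names.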