Distributed Inference and Fine-tuning of Large Language Models Over The Internet

(2312.08361)
Published Dec 13, 2023 in cs.LG and cs.DC

Abstract

LLMs are useful in many NLP tasks and become more capable with size, with the best open-source models having over 50 billion parameters. However, using these 50B+ models requires high-end hardware, making them inaccessible to most researchers. In this work, we investigate methods for cost-efficient inference and fine-tuning of LLMs, comparing local and distributed strategies. We observe that a large enough model (50B+) can run efficiently even on geodistributed devices in a consumer-grade network. This could allow running LLMs efficiently by pooling together idle compute resources of multiple research groups and volunteers. We address two open problems: (1) how to perform inference and fine-tuning reliably if any device can disconnect abruptly and (2) how to partition LLMs between devices with uneven hardware, joining and leaving at will. In order to do that, we develop special fault-tolerant inference algorithms and load-balancing protocols that automatically assign devices to maximize the total system throughput. We showcase these algorithms in Petals - a decentralized system that runs Llama 2 (70B) and BLOOM (176B) over the Internet up to 10x faster than offloading for interactive generation. We evaluate the performance of our system in simulated conditions and a real-world setup spanning two continents.

Figure: system design, with servers storing the LLM's transformer layers and clients holding the embedding layers for inference.

Overview

  • The paper explores overcoming hardware constraints for LLMs by utilizing distributed computing across internet-connected devices.

  • A fault-tolerant autoregressive inference algorithm and a decentralized load-balancing mechanism improve reliability despite server failures and variable latency.

  • An adaptive load-balancing protocol optimizes throughput by assigning transformer blocks efficiently across the distributed system.

  • The system supports fine-tuning on the client side, reducing network load and enabling task-specific model adjustments without server strain.

  • Performance evaluations show the system offers significant speed improvements for interactive generation tasks and maintains robustness across different continents.

Distributed Inference and Fine-tuning of LLMs

Introduction

The deployment of LLMs with over 50 billion parameters for NLP tasks has been constrained by the need for high-end hardware. Traditional workarounds, such as offloading parameters to RAM, are too slow for latency-sensitive applications like chatbots and search engines. The alternative this study focuses on is distributed computing over the internet: running these LLMs on a swarm of unreliable, geographically distributed devices.
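To see why offloading falls short for interactive use, a rough back-of-the-envelope estimate helps: each generated token requires streaming every layer's weights from host memory to the GPU, so per-token latency is bounded below by model size divided by host-to-device bandwidth. The figures below are illustrative assumptions, not measurements from the paper.

```python
# Rough, illustrative estimate of per-token latency with RAM offloading.
# All figures are assumptions for illustration, not numbers from the paper.
model_params = 70e9        # e.g. Llama 2 (70B)
bytes_per_param = 2        # fp16 weights
pcie_bandwidth = 16e9      # ~16 GB/s effective host-to-GPU bandwidth (PCIe 3.0 x16)

weights_bytes = model_params * bytes_per_param       # ~140 GB of weights
seconds_per_token = weights_bytes / pcie_bandwidth   # every token re-streams all layers
print(f"~{seconds_per_token:.1f} s per generated token")   # ~8.8 s/token
```

Several seconds per token is unusable for a chatbot, whereas keeping the weights resident on a pool of (possibly remote) accelerators avoids re-streaming them at every step.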

Fault-Tolerance in Model Inference

The study introduces algorithms tailored to distributed environments where devices are unreliable and network latencies vary. A novel fault-tolerant autoregressive inference algorithm, combined with a decentralized load-balancing mechanism, lets the system recover quickly from server failures. Fault tolerance comes from maintaining dual attention caches: servers keep their local attention caches, while the client keeps a copy of the inputs it has sent to each server, so a replacement server can rapidly restore a failed server's state. This keeps the amount of re-transmitted data to the minimum necessary when a failure occurs.
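The recovery logic can be illustrated with a toy simulation. This is a minimal sketch of the idea, not the Petals implementation; `ToyServer`, `find_server`, and the rest are hypothetical stand-ins (in the real system, servers are remote peers and `forward` is a network call).

```python
import random

# Toy simulation of fault-tolerant autoregressive inference (illustrative only).
# The client keeps a copy of every input it sent to each pipeline stage, so a
# replacement server can rebuild the attention cache of a server that dropped out.

class ServerFailure(Exception):
    pass

class ToyServer:
    """Stand-in for a remote server hosting one pipeline stage."""
    def __init__(self, stage, fail_prob=0.1):
        self.stage, self.cache, self.fail_prob = stage, [], fail_prob

    def forward(self, hidden):
        if random.random() < self.fail_prob:
            raise ServerFailure(f"stage {self.stage} dropped out")
        self.cache.append(hidden)       # server-side half of the dual cache
        return hidden + 1.0             # stand-in for applying a transformer block

    def replay(self, past_inputs):
        self.cache = list(past_inputs)  # rebuild the cache from the client's copy

def find_server(stage):
    return ToyServer(stage)             # real system: look up a live server for this stage

def generate(hidden, num_stages=4, max_new_tokens=3):
    past_inputs = [[] for _ in range(num_stages)]   # client-side half of the dual cache
    servers = [find_server(s) for s in range(num_stages)]
    for _ in range(max_new_tokens):
        for s in range(num_stages):
            while True:
                try:
                    out = servers[s].forward(hidden)
                    past_inputs[s].append(hidden)    # record only after success
                    hidden = out
                    break
                except ServerFailure:
                    servers[s] = find_server(s)        # pick a replacement server
                    servers[s].replay(past_inputs[s])  # restore its attention cache
    return hidden

print(generate(0.0))   # 12.0: three tokens through four stages, each adding 1
```

Only the inputs of the failed stage are replayed; healthy servers keep their caches, which is what keeps re-transmission to a minimum.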

Load Balancing and Fine-Tuning

The study also tackles the dynamic and uneven nature of consumer-grade hardware and network resources with a load-balancing protocol. This adaptive mechanism assigns transformer blocks across the distributed system to maximize overall throughput, even as servers join and leave freely. The system further supports parameter-efficient fine-tuning, in which clients, not servers, store and update the trainable parameters, adapting the model to different tasks without straining the servers or the network.
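The paper gives the full protocol; the sketch below is a simplified greedy version of the underlying idea, with hypothetical function names and made-up example numbers: since a pipeline is only as fast as its least-served blocks, a joining server should host the contiguous span of transformer blocks that currently forms the bottleneck.

```python
# Simplified, illustrative sketch of the load-balancing idea (not the paper's exact
# algorithm): a joining server claims the contiguous span of blocks whose current
# aggregate throughput is lowest, since those blocks bottleneck the whole pipeline.

def assign_blocks(block_throughput, server_throughput, span):
    """Pick a start index for `span` consecutive blocks and update throughput in place.

    block_throughput  -- current aggregate throughput per transformer block
    server_throughput -- throughput the joining server adds to each block it hosts
    span              -- how many consecutive blocks fit in the server's memory
    """
    n = len(block_throughput)
    best_start = min(
        range(n - span + 1),
        key=lambda s: (min(block_throughput[s:s + span]),   # weakest block in the span
                       sum(block_throughput[s:s + span])),  # tie-break: least served overall
    )
    for b in range(best_start, best_start + span):
        block_throughput[b] += server_throughput
    return best_start

# Example: blocks 4-6 are under-served, so a new server with room for 3 blocks covers them.
throughput = [30, 30, 25, 25, 10, 10, 12, 30]
print(assign_blocks(throughput, server_throughput=15, span=3))   # 4
print(throughput)   # [30, 30, 25, 25, 25, 25, 27, 30]
```

In the real system, servers periodically re-evaluate such assignments as peers join and leave, so the block allocation keeps tracking the current set of participants.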

Performance Evaluation

Extensive simulations and real-world experiments confirm that the system can run LLMs efficiently over the internet. Compared to local offloading, the approach was up to ten times faster for interactive generation. Tests spanning two continents demonstrated the system's robustness and efficiency despite the challenges of geodistribution.

Conclusion

The paper concludes that the proposed decentralized system is a cost-effective way to run LLMs on distributed, unreliable devices: it leverages the collective power of idle compute resources while guaranteeing correct model outputs and delivering significant speedups over traditional offloading. The authors also call attention to privacy considerations and potential future improvements, such as integrating secure multi-party computation to safeguard sensitive data processed by the system.
