Distributed Inference and Fine-tuning of Large Language Models Over The Internet

(2312.08361)
Published Dec 13, 2023 in cs.LG and cs.DC

Abstract

LLMs are useful in many NLP tasks and become more capable with size, with the best open-source models having over 50 billion parameters. However, using these 50B+ models requires high-end hardware, making them inaccessible to most researchers. In this work, we investigate methods for cost-efficient inference and fine-tuning of LLMs, comparing local and distributed strategies. We observe that a large enough model (50B+) can run efficiently even on geodistributed devices in a consumer-grade network. This could allow running LLMs efficiently by pooling together idle compute resources of multiple research groups and volunteers. We address two open problems: (1) how to perform inference and fine-tuning reliably if any device can disconnect abruptly and (2) how to partition LLMs between devices with uneven hardware, joining and leaving at will. In order to do that, we develop special fault-tolerant inference algorithms and load-balancing protocols that automatically assign devices to maximize the total system throughput. We showcase these algorithms in Petals - a decentralized system that runs Llama 2 (70B) and BLOOM (176B) over the Internet up to 10x faster than offloading for interactive generation. We evaluate the performance of our system in simulated conditions and a real-world setup spanning two continents.

Figure: system design, with servers storing the LLM's transformer layers and clients holding the embedding layers for inference.

Overview

  • The paper explores overcoming hardware constraints for LLMs by utilizing distributed computing across internet-connected devices.

  • A fault-tolerant autoregressive inference algorithm and a decentralized load-balancing mechanism improve reliability despite server failures and variable latency.

  • An adaptive load-balancing protocol optimizes throughput by assigning transformer blocks efficiently across the distributed system.

  • The system supports fine-tuning on the client side, reducing network load and enabling task-specific model adjustments without server strain.

  • Performance evaluations show the system offers significant speed improvements for interactive generation tasks and maintains robustness across different continents.

Distributed Inference and Fine-tuning of LLMs

Introduction

The deployment of LLMs with over 50 billion parameters for NLP tasks has been constrained by the need for high-end hardware. Traditional workarounds, such as offloading parameters to RAM, are too slow for latency-sensitive applications like chatbots and search engines. The alternative this study focuses on is distributed computing over the internet: running these LLMs on a swarm of unreliable, geographically distributed devices.
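To see why offloading falls short for interactive use, a rough back-of-the-envelope estimate helps: each generated token requires streaming every layer's weights from host memory to the GPU, so per-token latency is bounded below by model size divided by host-to-device bandwidth. The figures below are illustrative assumptions, not measurements from the paper.

```python
# Rough, illustrative estimate of per-token latency with RAM offloading.
# All figures are assumptions for illustration, not numbers from the paper.
model_params = 70e9        # e.g. Llama 2 (70B)
bytes_per_param = 2        # fp16 weights
pcie_bandwidth = 16e9      # ~16 GB/s effective host-to-GPU bandwidth (PCIe 3.0 x16)

weights_bytes = model_params * bytes_per_param       # ~140 GB of weights
seconds_per_token = weights_bytes / pcie_bandwidth   # every token re-streams all layers
print(f"~{seconds_per_token:.1f} s per generated token")   # ~8.8 s/token
```

Several seconds per token is unusable for a chatbot, whereas keeping the weights resident on a pool of (possibly remote) accelerators avoids re-streaming them at every step.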

Fault-Tolerance in Model Inference

The study introduces algorithms tailored to distributed environments where devices are unreliable and network latencies vary. A novel fault-tolerant autoregressive inference algorithm, combined with a decentralized load-balancing mechanism, lets the system recover quickly from server failures. Fault tolerance comes from maintaining dual attention caches: servers keep their local attention caches, while the client keeps a copy of the inputs it has sent to each server, so a replacement server can rapidly restore a failed server's state. This keeps the amount of re-transmitted data to the minimum necessary when a failure occurs.
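The recovery logic can be illustrated with a toy simulation. This is a minimal sketch of the idea, not the Petals implementation; `ToyServer`, `find_server`, and the rest are hypothetical stand-ins (in the real system, servers are remote peers and `forward` is a network call).

```python
import random

# Toy simulation of fault-tolerant autoregressive inference (illustrative only).
# The client keeps a copy of every input it sent to each pipeline stage, so a
# replacement server can rebuild the attention cache of a server that dropped out.

class ServerFailure(Exception):
    pass

class ToyServer:
    """Stand-in for a remote server hosting one pipeline stage."""
    def __init__(self, stage, fail_prob=0.1):
        self.stage, self.cache, self.fail_prob = stage, [], fail_prob

    def forward(self, hidden):
        if random.random() < self.fail_prob:
            raise ServerFailure(f"stage {self.stage} dropped out")
        self.cache.append(hidden)       # server-side half of the dual cache
        return hidden + 1.0             # stand-in for applying a transformer block

    def replay(self, past_inputs):
        self.cache = list(past_inputs)  # rebuild the cache from the client's copy

def find_server(stage):
    return ToyServer(stage)             # real system: look up a live server for this stage

def generate(hidden, num_stages=4, max_new_tokens=3):
    past_inputs = [[] for _ in range(num_stages)]   # client-side half of the dual cache
    servers = [find_server(s) for s in range(num_stages)]
    for _ in range(max_new_tokens):
        for s in range(num_stages):
            while True:
                try:
                    out = servers[s].forward(hidden)
                    past_inputs[s].append(hidden)    # record only after success
                    hidden = out
                    break
                except ServerFailure:
                    servers[s] = find_server(s)        # pick a replacement server
                    servers[s].replay(past_inputs[s])  # restore its attention cache
    return hidden

print(generate(0.0))   # 12.0: three tokens through four stages, each adding 1
```

Only the inputs of the failed stage are replayed; healthy servers keep their caches, which is what keeps re-transmission to a minimum.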

Load Balancing and Fine-Tuning

The study also tackles the dynamic and uneven nature of consumer-grade hardware and network resources with a load-balancing protocol. This adaptive mechanism assigns transformer blocks across the distributed system to maximize overall throughput, even as servers join and leave freely. The system further supports parameter-efficient fine-tuning, in which clients, not servers, store and update the trainable parameters, adapting the model to different tasks without straining the servers or the network.
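The paper gives the full protocol; the sketch below is a simplified greedy version of the underlying idea, with hypothetical function names and made-up example numbers: since a pipeline is only as fast as its least-served blocks, a joining server should host the contiguous span of transformer blocks that currently forms the bottleneck.

```python
# Simplified, illustrative sketch of the load-balancing idea (not the paper's exact
# algorithm): a joining server claims the contiguous span of blocks whose current
# aggregate throughput is lowest, since those blocks bottleneck the whole pipeline.

def assign_blocks(block_throughput, server_throughput, span):
    """Pick a start index for `span` consecutive blocks and update throughput in place.

    block_throughput  -- current aggregate throughput per transformer block
    server_throughput -- throughput the joining server adds to each block it hosts
    span              -- how many consecutive blocks fit in the server's memory
    """
    n = len(block_throughput)
    best_start = min(
        range(n - span + 1),
        key=lambda s: (min(block_throughput[s:s + span]),   # weakest block in the span
                       sum(block_throughput[s:s + span])),  # tie-break: least served overall
    )
    for b in range(best_start, best_start + span):
        block_throughput[b] += server_throughput
    return best_start

# Example: blocks 4-6 are under-served, so a new server with room for 3 blocks covers them.
throughput = [30, 30, 25, 25, 10, 10, 12, 30]
print(assign_blocks(throughput, server_throughput=15, span=3))   # 4
print(throughput)   # [30, 30, 25, 25, 25, 25, 27, 30]
```

In the real system, servers periodically re-evaluate such assignments as peers join and leave, so the block allocation keeps tracking the current set of participants.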

Performance Evaluation

Extensive simulations and real-world experiments confirm that the system can run LLMs efficiently over the internet. Compared to local offloading, the approach was up to ten times faster for interactive generation. Tests spanning two continents demonstrated the system's robustness and efficiency despite the challenges of geodistribution.

Conclusion

The paper concludes that the proposed decentralized system is a cost-effective way to run LLMs on distributed, unreliable devices: it leverages the collective power of idle compute resources while guaranteeing correct model outputs and delivering significant speedups over traditional offloading. The authors also call attention to privacy considerations and potential future improvements, such as integrating secure multi-party computation to safeguard sensitive data processed by the system.
