DeltaZip: Efficient Serving of Multiple Full-Model-Tuned LLMs (2312.05215v3)

Published 8 Dec 2023 in cs.DC and cs.LG

Abstract: Fine-tuning LLMs greatly improves model quality for downstream tasks. However, serving many fine-tuned LLMs concurrently is challenging due to the sporadic, bursty, and varying request patterns of different LLMs. To bridge this gap, we present DeltaZip, an LLM serving system that efficiently serves multiple full-parameter fine-tuned models concurrently by aggressively compressing model deltas by up to 10x while maintaining high model quality. The key insight behind this design is that fine-tuning results in small-magnitude changes to the pre-trained model. By co-designing the serving system with the compression algorithm, DeltaZip achieves 2x to 12x improvement in throughput compared to the state-of-the-art systems.
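The key insight in the abstract, that fine-tuning changes the pre-trained weights only by small magnitudes, can be illustrated with a toy sketch. The abstract does not describe DeltaZip's actual compression algorithm, so the code below swaps in plain uniform quantization of the delta as an assumed stand-in. The function names `quantize_delta` and `reconstruct`, the 4-bit setting, and the synthetic weights are all hypothetical and chosen only for illustration.

```python
import random


def quantize_delta(base, tuned, bits=4):
    """Uniformly quantize the delta (tuned - base) to signed integer codes.

    Assumption: simple symmetric uniform quantization, used here as a
    stand-in; it is not the compression scheme from the paper.
    """
    delta = [t - b for b, t in zip(base, tuned)]
    max_abs = max(abs(d) for d in delta) or 1.0  # avoid a zero scale
    levels = 2 ** (bits - 1) - 1                 # 7 levels per sign for 4 bits
    scale = max_abs / levels
    codes = [round(d / scale) for d in delta]
    return codes, scale


def reconstruct(base, codes, scale):
    """Rebuild approximate fine-tuned weights from the shared base and the codes."""
    return [b + c * scale for b, c in zip(base, codes)]


# Synthetic data (hypothetical): base weights near unit scale,
# fine-tuning perturbations roughly 100x smaller.
random.seed(0)
base = [random.gauss(0.0, 1.0) for _ in range(1000)]
tuned = [b + random.gauss(0.0, 0.01) for b in base]

codes, scale = quantize_delta(base, tuned)
approx = reconstruct(base, codes, scale)

# Rounding to the nearest level bounds the per-weight error by scale / 2.
max_err = max(abs(a - t) for a, t in zip(approx, tuned))
```

Because the quantization step is set by the largest delta rather than the largest weight, a small delta range gives a small reconstruction error at the same bit width. Under this sketch, only the low-bit codes and one scale need to be stored per fine-tuned model, while the base weights are shared.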

