CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference (2401.11240v1)

Published 20 Jan 2024 in cs.DC

Abstract: Pre-trained LLMs often need specialization for domain-specific tasks. Low-Rank Adaptation (LoRA) is a popular approach that adapts a base model to multiple tasks by adding lightweight trainable adapters. In this paper, we present CaraServe, a system that efficiently serves many LoRA adapters derived from a common base model. CaraServe maintains the base model on GPUs and dynamically loads activated LoRA adapters from main memory. As GPU loading results in a cold-start that substantially delays token generation, CaraServe employs a CPU-assisted approach. It early starts the activated adapters on CPUs for prefilling as they are being loaded onto GPUs; after loading completes, it then switches to the GPUs for generative LoRA inference. CaraServe develops a highly optimized synchronization mechanism to efficiently coordinate LoRA computation on the CPU and GPU. Moreover, CaraServe employs a rank-aware scheduling algorithm to optimally schedule heterogeneous LoRA requests for maximum service-level objective (SLO) attainment. We have implemented CaraServe and evaluated it against state-of-the-art LoRA serving systems. Our results demonstrate that CaraServe can speed up the average request serving latency by up to 1.4× and achieve an SLO attainment of up to 99%.


Summary

  • The paper presents CaraServe, a system that uses CPU-assisted prefilling and rank-aware scheduling to hide the cold-start latency of loading LoRA adapters during LLM inference.
  • It relies on asynchronous memory copies and shared-memory coordination to overlap adapter loading with computation and to split LoRA work between the GPU and CPU.
  • Experiments show up to 1.4x lower average request serving latency and SLO attainment of up to 99%, improving performance in multi-tenant environments.

CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference

CaraServe is a system for efficiently serving many Low-Rank Adaptation (LoRA) adapters derived from a common base model during generative LLM inference. It targets two problems in multi-adapter LLM deployment: the cold-start delay of loading adapters onto the GPU and the difficulty of meeting service-level objectives (SLOs) when requests activate adapters of different ranks. To this end, CaraServe uses both CPU and GPU resources, dynamically manages LoRA adapter activation, and schedules requests in a rank-aware manner.
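For context, a LoRA adapter augments a frozen base weight matrix with a low-rank update. The sketch below is a minimal, illustrative PyTorch rendering of that computation; the class and argument names are our own, not CaraServe's API.

```python
import torch

class LoRALinear(torch.nn.Module):
    """Frozen base linear layer W0 plus a rank-r adapter update (B @ A)."""

    def __init__(self, base: torch.nn.Linear, rank: int, alpha: float = 16.0):
        super().__init__()
        self.base = base  # frozen, shared across all tenants of the base model
        d_out, d_in = base.weight.shape
        # Per-tenant adapter weights; zeros here as placeholders -- in serving,
        # trained adapter weights would be loaded into A and B.
        self.A = torch.nn.Parameter(torch.zeros(rank, d_in))
        self.B = torch.nn.Parameter(torch.zeros(d_out, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = x W0^T + scaling * (x A^T) B^T ; the adapter term is cheap since rank << d
        return self.base(x) + self.scaling * (x @ self.A.T) @ self.B.T
```

Because the adapter matrices are small relative to the base weights, many such adapters can sit in main memory and be loaded on demand, which is exactly the multiplexing CaraServe exploits.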

CPU-Assisted Serving Mechanism

CaraServe mitigates GPU cold-start delays with CPU-assisted serving: while an activated LoRA adapter is still being loaded onto the GPU, the CPU carries out the adapter's share of the prefill computation (Figure 1).

Figure 1: Illustration of CPU-assisted LoRA serving.

In practice, when a request arrives, CaraServe starts computing the activated LoRA adapter's prefill on the CPU while the adapter weights are still being copied onto the GPU. Asynchronous memory copies and lightweight signaling coordinate the computation across the two devices, so adapter loading is overlapped with useful work instead of adding to token-generation latency.
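The snippet below is a simplified, single-step sketch of this overlap using PyTorch CUDA streams. It collapses the per-layer adapter computation into one step, and `base_prefill` stands in for the GPU prefill of the frozen base model; all names are illustrative rather than CaraServe's actual interface.

```python
import torch

def lora_term_cpu(adapter: dict, hidden_cpu: torch.Tensor) -> torch.Tensor:
    """Adapter contribution (x A^T B^T), computed with CPU threads."""
    return (hidden_cpu @ adapter["A"].T) @ adapter["B"].T

def cpu_assisted_prefill(base_prefill, adapter_cpu: dict, prompt_ids, device="cuda"):
    """Overlap the adapter's host-to-GPU copy with CPU-side LoRA prefill (sketch)."""
    copy_stream = torch.cuda.Stream()

    # 1. Start an asynchronous host-to-device copy of the adapter weights.
    with torch.cuda.stream(copy_stream):
        adapter_gpu = {k: v.pin_memory().to(device, non_blocking=True)
                       for k, v in adapter_cpu.items()}

    # 2. Meanwhile, prefill: base model on the GPU, LoRA term on the CPU.
    hidden = base_prefill(prompt_ids)                    # runs on the GPU
    lora_out = lora_term_cpu(adapter_cpu, hidden.cpu())  # runs on CPU threads

    # 3. Merge both contributions; wait for the copy before decoding on the GPU.
    hidden = hidden + lora_out.to(device)
    torch.cuda.current_stream().wait_stream(copy_stream)
    return hidden, adapter_gpu  # subsequent decode steps use adapter_gpu
```

In a real serving stack this overlap happens per transformer layer and per request in a batch, but the structure is the same: the GPU never idles waiting for adapter weights to arrive.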

Efficient GPU-CPU Coordination

CaraServe optimizes the synchronization needed to split LoRA computation between the GPU and CPU. It uses shared memory for fast data exchange, which substantially reduces inter-process communication overhead (Figure 2).

Figure 2: Illustration of coordinated LoRA computation on GPU and CPU per transformer block's attention layer.

Additionally, CaraServe implements a profiling-guided parallelization scheme allowing LoRA computations to scale across multiple CPUs, addressing potential bottlenecks when processing long input prompts.
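As one illustration of this style of coordination (a sketch under our own assumptions, not CaraServe's implementation), hidden states can be exposed to CPU LoRA workers through a shared-memory segment, and the number of CPU threads can be chosen from an offline profile keyed by prompt length.

```python
import numpy as np
from multiprocessing import shared_memory

def publish_hidden_states(hidden: np.ndarray, name: str = "lora_hidden"):
    """GPU-server side: expose hidden states via shared memory so CPU workers
    can read them without a serialization round-trip (illustrative)."""
    shm = shared_memory.SharedMemory(create=True, size=hidden.nbytes, name=name)
    view = np.ndarray(hidden.shape, dtype=hidden.dtype, buffer=shm.buf)
    view[:] = hidden                 # one memcpy into the shared segment
    return shm                       # caller closes/unlinks when finished

def cpu_lora_worker(shape, dtype, A, B, name: str = "lora_hidden"):
    """CPU-worker side: attach to the segment and compute the adapter term."""
    shm = shared_memory.SharedMemory(name=name)
    hidden = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    out = (hidden @ A.T) @ B.T       # rank-r LoRA matmul on CPU cores
    shm.close()
    return out

def threads_for_prompt(seq_len: int, profile: dict) -> int:
    """Pick a CPU thread count from an offline latency profile, e.g.
    {512: 4, 2048: 8, 8192: 16}; bucket boundaries here are hypothetical."""
    buckets = sorted(profile)
    for b in buckets:
        if seq_len <= b:
            return profile[b]
    return profile[buckets[-1]]
```

The design intent is the same as described above: shared memory avoids copying activations through sockets or pipes, and the profiled thread count keeps long prompts from stalling the CPU-side prefill.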

Rank-Aware Scheduling

In multi-tenant environments, incoming requests activate LoRA adapters of varying ranks. CaraServe therefore introduces a rank-aware scheduling algorithm informed by performance models built through offline profiling (Figure 3).

Figure 3: Performance models for the BGMV (left) and MBGMV (right) kernels. Both linear regression models achieve a high coefficient of determination (R² = 0.96).

For each request, CaraServe's scheduler scores candidate servers based on the rank heterogeneity of their current batches and routes the request to the server with the lowest cost score, keeping batch compositions efficient and maximizing SLO attainment.
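A minimal sketch of such cost-based routing follows. The exact cost features are our assumptions rather than the paper's formulas: we assume a BGMV-style kernel whose latency grows with batch size times the maximum rank (due to padding) and an MBGMV-style kernel whose latency grows with the summed ranks, each fitted by a linear model as in Figure 3.

```python
from dataclasses import dataclass

@dataclass
class KernelModel:
    """Linear latency model fitted offline: t = a * x + b."""
    a: float
    b: float

    def predict(self, x: float) -> float:
        return self.a * x + self.b

def batch_cost(ranks, bgmv: KernelModel, mbgmv: KernelModel) -> float:
    """Estimated LoRA cost of co-batching requests with these adapter ranks."""
    n = len(ranks)
    cost_bgmv = bgmv.predict(n * max(ranks))  # pads every adapter to the max rank
    cost_mbgmv = mbgmv.predict(sum(ranks))    # no padding; work scales with ranks
    return min(cost_bgmv, cost_mbgmv)         # assume the cheaper kernel is used

def pick_server(server_batches: dict, new_rank: int,
                bgmv: KernelModel, mbgmv: KernelModel) -> str:
    """Route a request of rank `new_rank` to the server whose cost grows least."""
    def marginal_cost(ranks):
        before = batch_cost(ranks, bgmv, mbgmv) if ranks else 0.0
        return batch_cost(ranks + [new_rank], bgmv, mbgmv) - before
    return min(server_batches, key=lambda s: marginal_cost(server_batches[s]))
```

For example, given `server_batches = {"s0": [8, 8], "s1": [64]}` and a new rank-8 request, routing to s0 avoids raising any batch's maximum rank, so it generally scores lower under these assumed models.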

Architecture Overview

CaraServe comprises LLM inference servers, a scheduler, and a global LoRA registry. Each server keeps the shared base model on its GPUs and holds LoRA adapters in main memory for efficient multiplexing, while the scheduler applies the rank-aware policy above when routing requests (Figure 4).

Figure 4: An architecture overview of CaraServe.

Unlike existing systems, which suffer from either cold-start latency or inefficient scheduling, CaraServe addresses both challenges at once, providing a scalable way to serve many adapters efficiently.
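To make the registry's role concrete, here is a small illustrative data-structure sketch; the field names are our own, since the paper does not spell out the registry schema.

```python
from dataclasses import dataclass, field

@dataclass
class AdapterEntry:
    """Metadata a global LoRA registry might track per adapter (illustrative)."""
    adapter_id: str
    rank: int
    base_model: str   # all served adapters share this base model
    host_path: str    # where the adapter weights live in main memory / storage

@dataclass
class LoRARegistry:
    """adapter_id -> entry lookup consulted by the scheduler before routing."""
    entries: dict = field(default_factory=dict)

    def register(self, entry: AdapterEntry) -> None:
        self.entries[entry.adapter_id] = entry

    def rank_of(self, adapter_id: str) -> int:
        # The scheduler needs the rank to estimate batch cost before routing.
        return self.entries[adapter_id].rank
```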

Experimental Evaluation

CaraServe was evaluated against state-of-the-art LoRA serving systems and reduced average request serving latency by up to 1.4x while achieving SLO attainment of up to 99%. The experiments covered a range of workload conditions, varying request traffic and adapter rank configurations (Figure 5).

Figure 5: Prefill performance of different kernels on Llama2-7B model. Native: PyTorch default kernels. CaraServe: Implementation with our optimized kernels.

The evaluation highlights CaraServe's ability to use CPU and GPU resources together, reducing latency and improving responsiveness in real-time LLM deployments.

Conclusion

CaraServe is a robust solution for serving LoRA adapters in multi-tenant cloud environments, addressing cold-start latency and SLO compliance through rank-aware scheduling and CPU-assisted prefilling. Its architecture multiplexes GPU and CPU resources efficiently while maintaining high performance in generative AI applications.

Open Problems

We found no open problems mentioned in this paper.
