
Towards Pareto Optimal Throughput in Small Language Model Serving (2404.03353v1)

Published 4 Apr 2024 in cs.CL

Abstract: Large Language Models (LLMs) have revolutionized the state-of-the-art of many different natural language processing tasks. Although serving LLMs is computationally and memory demanding, the rise of Small Language Models (SLMs) offers new opportunities for resource-constrained users, who now are able to serve small models with cutting-edge performance. In this paper, we present a set of experiments designed to benchmark SLM inference at performance and energy levels. Our analysis provides a new perspective in serving, highlighting that the small memory footprint of SLMs allows for reaching the Pareto-optimal throughput within the resource capacity of a single accelerator. In this regard, we present an initial set of findings demonstrating how model replication can effectively improve resource utilization for serving SLMs.
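The abstract's central observation, that an SLM's small memory footprint leaves enough headroom on one accelerator to co-locate several model replicas and push aggregate throughput toward the Pareto frontier, can be illustrated with a minimal benchmarking sketch. The code below is not the authors' harness: it assumes a Hugging Face causal LM (facebook/opt-125m is a stand-in), NVIDIA's NVML bindings (pynvml) for power sampling, and one thread per replica, and it reports aggregate tokens per second and mean GPU power for a chosen replica count.

```python
"""Hedged sketch (not the paper's code): serve N replicas of a small model on
one GPU and report aggregate decode throughput and average GPU power."""
import threading
import time

import pynvml
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "facebook/opt-125m"   # assumption: any small causal LM that fits several times in GPU memory
NUM_REPLICAS = 2                   # replicas co-located on a single accelerator
MAX_NEW_TOKENS = 128
PROMPTS = ["Explain Pareto optimality in one sentence."] * 64  # illustrative request batch


def worker(replica_id, prompts, results):
    """Each replica owns its own copy of the weights and serves a shard of the requests."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME, torch_dtype=torch.float16
    ).to("cuda")
    generated = 0
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
        out = model.generate(**inputs, max_new_tokens=MAX_NEW_TOKENS, do_sample=False)
        generated += out.shape[1] - inputs["input_ids"].shape[1]  # count newly decoded tokens
    results[replica_id] = generated


def main():
    # NVML gives instantaneous board power in milliwatts; we sample it while replicas run.
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    results = {}
    shards = [PROMPTS[i::NUM_REPLICAS] for i in range(NUM_REPLICAS)]
    threads = [
        threading.Thread(target=worker, args=(i, shards[i], results))
        for i in range(NUM_REPLICAS)
    ]

    power_samples = []
    start = time.time()
    for t in threads:
        t.start()
    while any(t.is_alive() for t in threads):
        power_samples.append(pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0)  # watts
        time.sleep(0.5)
    for t in threads:
        t.join()
    elapsed = time.time() - start

    total_tokens = sum(results.values())
    print(f"replicas={NUM_REPLICAS}  throughput={total_tokens / elapsed:.1f} tok/s")
    if power_samples:
        print(f"avg GPU power={sum(power_samples) / len(power_samples):.1f} W")


if __name__ == "__main__":
    main()
```

With NUM_REPLICAS set to 1 this degenerates into a single-instance baseline, so sweeping the replica count gives a rough, order-of-magnitude view of how replication trades spare memory capacity for aggregate throughput and energy per token; the paper's own experiments use a proper serving stack rather than this thread-per-replica approximation.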
