LLM in a flash: Efficient Large Language Model Inference with Limited Memory (2312.11514v3)
Abstract: Large language models (LLMs) are central to modern natural language processing, delivering exceptional performance across a variety of tasks. However, their substantial computational and memory requirements pose challenges, especially for devices with limited DRAM capacity. This paper tackles the challenge of efficiently running LLMs that exceed the available DRAM capacity by storing the model parameters in flash memory and bringing them into DRAM on demand. Our method involves constructing an inference cost model that takes into account the characteristics of flash memory, guiding us to optimize in two critical areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks. Within this hardware-informed framework, we introduce two principal techniques. First, "windowing" strategically reduces data transfer by reusing previously activated neurons; second, "row-column bundling", tailored to the sequential data access strengths of flash memory, increases the size of the chunks read from flash. These methods collectively enable running models up to twice the size of the available DRAM, with a 4-5x increase in inference speed on CPU and a 20-25x increase on GPU compared to naive loading approaches. Our integration of sparsity awareness, context-adaptive loading, and a hardware-oriented design paves the way for effective inference of LLMs on devices with limited memory.
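To make the two techniques concrete, the following is a minimal Python sketch of the windowing idea, with a comment indicating where row-column bundling would apply. It is an illustrative reconstruction under stated assumptions, not the authors' implementation: `flash_read_bundle` is a hypothetical placeholder for whatever routine reads one neuron's bundled up-projection column and down-projection row from flash, and the predictor that decides which neurons are active for a token is assumed to exist elsewhere.

```python
# Sketch of "windowing": keep the weights for neurons active in the last
# `window` tokens resident in DRAM, and fetch from flash only the neurons
# that newly become active. Eviction and loading are reference-counted
# against a sliding window of recent tokens.

from collections import deque


class WindowedNeuronCache:
    def __init__(self, window: int, flash_read_bundle):
        self.window = window                        # number of recent tokens tracked
        self.flash_read_bundle = flash_read_bundle  # placeholder: loads one neuron's bundled weights
        self.history = deque()                      # per-token sets of active neuron ids
        self.resident = {}                          # neuron id -> weights currently in DRAM
        self.refcount = {}                          # neuron id -> tokens in window that use it

    def step(self, active_neurons):
        """Update the cache for one decoded token and return the resident weights."""
        # Load only neurons that are active now but not already in DRAM.
        for n in active_neurons:
            if n not in self.resident:
                # Row-column bundling: the up-projection column and the
                # down-projection row for neuron n are assumed to be stored
                # contiguously in flash, so this is one sequential read.
                self.resident[n] = self.flash_read_bundle(n)
            self.refcount[n] = self.refcount.get(n, 0) + 1

        self.history.append(set(active_neurons))

        # Evict neurons that no token in the sliding window still uses.
        if len(self.history) > self.window:
            expired = self.history.popleft()
            for n in expired:
                self.refcount[n] -= 1
                if self.refcount[n] == 0:
                    del self.refcount[n]
                    del self.resident[n]

        return self.resident
```

A caller would, for each generated token, run an activation predictor, pass the predicted neuron ids to `step`, and perform the sparse feed-forward computation using only the returned resident weights, so the amount of data read from flash per token is limited to neurons entering the window.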