Predictive Pipelined Decoding: A Compute-Latency Trade-off for Exact LLM Decoding (2307.05908v2)

Published 12 Jul 2023 in cs.CL and cs.LG

Abstract: This paper presents "Predictive Pipelined Decoding" (PPD), an approach that speeds up greedy decoding in LLMs while producing exactly the same output as the original decoding. Unlike conventional strategies, PPD employs additional compute resources to begin decoding subsequent tokens in parallel while the current token is still being decoded. This reduces decoding latency and reshapes the understanding of trade-offs in LLM decoding strategies. We develop a theoretical framework that allows us to analyze the trade-off between computation and latency, and to analytically estimate the potential latency reduction as a function of the match rate, denoted p_correct. The results demonstrate that extra computational resources have the potential to accelerate LLM decoding. We also implement PPD and conduct preliminary experiments to empirically validate its efficacy, addressing practical overheads not covered by the theoretical analysis.
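
To make the mechanism concrete, below is a minimal, illustrative sketch of the PPD idea in Python. Every name in it (DummyLM, full_forward, early_exit_guess) is a hypothetical stand-in, not the paper's implementation, and the toy model is deterministic so the early-exit guess always contains the true token (p_correct = 1). What it demonstrates is the control flow: cheap intermediate-layer guesses launch the next decoding step speculatively, the exact full-depth pass still decides every emitted token, and a matched guess lets the next step's latency overlap the current one without ever changing the output.

    # Minimal sketch of Predictive Pipelined Decoding (PPD). All names here
    # (DummyLM, full_forward, early_exit_guess) are hypothetical stand-ins,
    # not the paper's code.
    from concurrent.futures import ThreadPoolExecutor

    class DummyLM:
        """Toy language model. full_forward is the exact (full-depth) greedy
        step; early_exit_guess mimics a cheap intermediate-layer top-k
        prediction of the same token."""

        def __init__(self, vocab_size=100):
            self.vocab_size = vocab_size

        def full_forward(self, prefix):
            # Exact greedy decoding step (stands in for a full forward pass).
            return (sum(prefix) + len(prefix)) % self.vocab_size

        def early_exit_guess(self, prefix, k=2):
            # Intermediate-layer top-k guess (stands in for a partial
            # forward pass). In this toy it always contains the true token,
            # i.e. the match rate p_correct is 1.
            true_token = self.full_forward(prefix)
            return [true_token, (true_token + 1) % self.vocab_size][:k]

    def ppd_decode(model, prefix, steps, k=2):
        out = list(prefix)
        with ThreadPoolExecutor(max_workers=k) as pool:
            for _ in range(steps):
                # 1. Cheap intermediate-layer guesses for the token that is
                #    currently being decoded.
                guesses = model.early_exit_guess(out, k=k)
                # 2. Speculatively launch the *next* decoding step for each
                #    guess while the exact pass for the current step runs.
                spec = {g: pool.submit(model.full_forward, out + [g])
                        for g in guesses}
                # 3. The exact full-depth pass still decides the output
                #    token, so PPD never changes what greedy decoding emits.
                token = model.full_forward(out)
                out.append(token)
                if token in spec:
                    # Match: the next step's forward pass is already in
                    # flight, so its latency overlaps the step just finished.
                    spec[token].result()
                # Mismatch: the speculative futures are simply discarded.
        return out

    if __name__ == "__main__":
        model = DummyLM()
        print(ppd_decode(model, [1, 2, 3], steps=5))

In a real implementation the speculative launches would run on separate devices and a matched future's computation would be reused rather than recomputed; the toy keeps only the scheduling logic. The match rate p_correct of the early-exit guesses is exactly the quantity the paper's theoretical framework uses to estimate the expected latency reduction.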

