DynaMo: Accelerating Language Model Inference with Dynamic Multi-Token Sampling (2405.00888v1)
Abstract: Traditional LLMs operate autoregressively, i.e., they predict one token at a time. The rapid explosion in model sizes has resulted in high inference times. In this work, we propose DynaMo, a suite of multi-token prediction LLMs that reduce net inference times. Our models $\textit{dynamically}$ predict multiple tokens based on their confidence in the predicted joint probability distribution. We propose a lightweight technique to train these models that leverages the weights of their traditional autoregressive counterparts. Moreover, we propose novel ways to enhance the estimated joint probability and improve text generation quality, namely co-occurrence weighted masking and adaptive thresholding. We also propose systematic qualitative and quantitative methods to rigorously test the quality of text generated non-autoregressively. One of the models in our suite, DynaMo-7.3B-T3, generates text of the same quality as the baseline (Pythia-6.9B) while achieving a 2.57$\times$ speed-up with only 5.87% parameter and 2.67% training-time overheads.
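To make the decoding idea concrete, below is a minimal PyTorch sketch of confidence-gated multi-token sampling. It assumes a setup with one output head per predicted position and a single fixed probability threshold; the head structure, greedy acceptance rule, and threshold value are illustrative assumptions, not the paper's exact co-occurrence weighted masking or adaptive-thresholding procedure.

```python
import torch
import torch.nn as nn

def dynamic_multi_token_step(heads, hidden_state, threshold=0.3):
    """Sketch of confidence-gated multi-token sampling.

    `heads` is a list of linear output heads; head k predicts the
    (k+1)-th next token from the current hidden state (an assumed,
    Medusa-style parameterization). Tokens are accepted greedily as
    long as the running joint probability of the drafted sequence
    stays above `threshold`; otherwise we stop early and emit only
    the tokens accepted so far (always at least one, matching the
    autoregressive fallback).
    """
    accepted = []
    joint_prob = 1.0
    for k, head in enumerate(heads):
        probs = torch.softmax(head(hidden_state), dim=-1)
        prob, token = probs.max(dim=-1)
        joint_prob *= prob.item()
        if k > 0 and joint_prob < threshold:
            break  # not confident enough to commit to another token
        accepted.append(token.item())
    return accepted

# Toy usage with randomly initialized heads (hypothetical sizes).
hidden_size, vocab_size = 512, 32000
heads = [nn.Linear(hidden_size, vocab_size) for _ in range(3)]
h = torch.randn(hidden_size)
print(dynamic_multi_token_step(heads, h, threshold=0.3))
```

The key design point this sketch illustrates is the $\textit{dynamic}$ behavior: the number of tokens emitted per forward pass varies with the model's confidence, so easy continuations advance several tokens at once while uncertain ones fall back to standard one-token decoding.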