DynaMo: Accelerating Language Model Inference with Dynamic Multi-Token Sampling (2405.00888v1)
Abstract: Traditional LLMs operate autoregressively, i.e., they predict one token at a time. The rapid explosion in model sizes has resulted in high inference times. In this work, we propose DynaMo, a suite of multi-token prediction LLMs that reduce net inference times. Our models $\textit{dynamically}$ predict multiple tokens based on their confidence in the predicted joint probability distribution. We propose a lightweight technique to train these models that leverages the weights of their traditional autoregressive counterparts. Moreover, we propose novel ways to enhance the estimated joint probability and improve text generation quality, namely co-occurrence weighted masking and adaptive thresholding. We also propose systematic qualitative and quantitative methods to rigorously test the quality of text generated non-autoregressively. One of the models in our suite, DynaMo-7.3B-T3, generates text of the same quality as the baseline (Pythia-6.9B) while achieving a 2.57$\times$ speed-up with only 5.87% parameter and 2.67% training-time overheads.
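To make the decoding idea concrete, below is a minimal PyTorch sketch of confidence-gated multi-token sampling. It assumes a setup with one output head per predicted position and a single fixed probability threshold; the head structure, greedy acceptance rule, and threshold value are illustrative assumptions, not the paper's exact co-occurrence weighted masking or adaptive-thresholding procedure.

```python
import torch
import torch.nn as nn

def dynamic_multi_token_step(heads, hidden_state, threshold=0.3):
    """Sketch of confidence-gated multi-token sampling.

    `heads` is a list of linear output heads; head k predicts the
    (k+1)-th next token from the current hidden state (an assumed,
    Medusa-style parameterization). Tokens are accepted greedily as
    long as the running joint probability of the drafted sequence
    stays above `threshold`; otherwise we stop early and emit only
    the tokens accepted so far (always at least one, matching the
    autoregressive fallback).
    """
    accepted = []
    joint_prob = 1.0
    for k, head in enumerate(heads):
        probs = torch.softmax(head(hidden_state), dim=-1)
        prob, token = probs.max(dim=-1)
        joint_prob *= prob.item()
        if k > 0 and joint_prob < threshold:
            break  # not confident enough to commit to another token
        accepted.append(token.item())
    return accepted

# Toy usage with randomly initialized heads (hypothetical sizes).
hidden_size, vocab_size = 512, 32000
heads = [nn.Linear(hidden_size, vocab_size) for _ in range(3)]
h = torch.randn(hidden_size)
print(dynamic_multi_token_step(heads, h, threshold=0.3))
```

The key design point this sketch illustrates is the $\textit{dynamic}$ behavior: the number of tokens emitted per forward pass varies with the model's confidence, so easy continuations advance several tokens at once while uncertain ones fall back to standard one-token decoding.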