Think before you speak: Training Language Models With Pause Tokens (2310.02226v3)

Published 3 Oct 2023 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs generate responses by producing a series of tokens in immediate succession: the $(K+1)^{th}$ token is an outcome of manipulating $K$ hidden vectors per layer, one vector per preceding token. What if instead we were to let the model manipulate say, $K+10$ hidden vectors, before it outputs the $(K+1)^{th}$ token? We operationalize this idea by performing training and inference on LLMs with a (learnable) $\textit{pause}$ token, a sequence of which is appended to the input prefix. We then delay extracting the model's outputs until the last pause token is seen, thereby allowing the model to process extra computation before committing to an answer. We empirically evaluate $\textit{pause-training}$ on decoder-only models of 1B and 130M parameters with causal pretraining on C4, and on downstream tasks covering reasoning, question-answering, general understanding and fact recall. Our main finding is that inference-time delays show gains when the model is both pre-trained and finetuned with delays. For the 1B model, we witness gains on 8 of 9 tasks, most prominently, a gain of $18\%$ EM score on the QA task of SQuAD, $8\%$ on CommonSenseQA and $1\%$ accuracy on the reasoning task of GSM8k. Our work raises a range of conceptual and practical future research questions on making delayed next-token prediction a widely applicable new paradigm.


Summary

  • The paper demonstrates that incorporating pause tokens during both pretraining and finetuning yields significant gains, including an 18% increase in SQuAD EM.
  • The methodology leverages a delay mechanism to expand the model's computational width by allowing extra attention operations beyond the input length.
  • The study emphasizes that an optimal number of pause tokens is crucial, as excessive or missing tokens can degrade performance or reduce robustness.

Training LLMs with Pause Tokens: A Paradigm for Delayed Next-Token Prediction

Introduction and Motivation

This work introduces a novel approach to language model (LM) training and inference by incorporating "pause tokens": special, learnable tokens appended to the input sequence to intentionally delay the model's output generation. The central hypothesis is that the standard causal Transformer architecture imposes an arbitrary constraint: the number of per-layer operations available to compute the next token is limited by the number of tokens seen so far. By appending $M$ pause tokens, the model is afforded $K+M$ hidden vectors per layer (for $K$ input tokens), potentially enabling richer intermediate representations and more expressive computation before outputting the next token.

The method is operationalized by introducing a unique <pause> token, appended multiple times to the input during both pretraining and finetuning. The model is trained to ignore outputs corresponding to these tokens, only producing meaningful outputs after the final pause. This approach is evaluated on decoder-only models of 1B and 130M parameters, pretrained on C4 and finetuned on a diverse set of downstream tasks.
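To make the readout concrete, the sketch below shows how delayed next-token prediction might look in PyTorch-style code. This is a minimal sketch under our own assumptions, not the paper's implementation: the names (PauseWrapper, decoder, lm_head, tok_emb, num_pauses) are illustrative, and the paper actually adds <pause> to the vocabulary rather than keeping a separate embedding parameter as done here.

    import torch
    import torch.nn as nn

    class PauseWrapper(nn.Module):
        # decoder: embeddings [B, T, d] -> hidden states [B, T, d] (causal Transformer stack)
        # lm_head: hidden state [B, d] -> vocabulary logits
        # tok_emb: token ids [B, T] -> embeddings [B, T, d]
        def __init__(self, decoder, lm_head, tok_emb, d_model, num_pauses=10):
            super().__init__()
            self.decoder, self.lm_head, self.tok_emb = decoder, lm_head, tok_emb
            self.pause_emb = nn.Parameter(0.02 * torch.randn(d_model))  # learnable <pause> embedding
            self.num_pauses = num_pauses

        def next_token_logits(self, input_ids):                 # input_ids: [B, K]
            x = self.tok_emb(input_ids)                          # [B, K, d]
            pauses = self.pause_emb.expand(x.size(0), self.num_pauses, -1)
            x = torch.cat([x, pauses], dim=1)                    # [B, K+M, d]
            h = self.decoder(x)                                  # K+M hidden vectors per layer
            return self.lm_head(h[:, -1])                        # read logits only after the last <pause>

The key point is the last line: logits are taken from the position of the final pause token, so the prediction of the next output token can draw on $K+M$ per-layer hidden vectors rather than $K$.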

Methodology: Pause-Training and Inference

Pause-Token Mechanism

The pause-training paradigm consists of three stages:

  1. Pause-Pretraining: Insert $M_{\rm pt}$ <pause> tokens at uniformly random positions in each pretraining sequence. The model is trained with the standard next-token prediction loss, but loss terms for predicting <pause> tokens are omitted; this ensures the model learns to utilize the additional computation without being distracted by the need to predict the pause tokens themselves (the masking is sketched in code after this list).
  2. Pause-Finetuning: For downstream tasks, append $M_{\rm ft}$ <pause> tokens to the input prefix. The model is trained to predict the target sequence only after the last pause token, again ignoring outputs at the pause positions.
  3. Pause-Inference: At inference, append $M_{\rm inf}$ <pause> tokens to the input, and ignore the model's outputs until the last pause token has been seen.
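The loss masking used in stages 1 and 2 can be written compactly. The sketch below assumes a PyTorch setup, and PAUSE_ID is a hypothetical vocabulary id standing in for whatever id the <pause> token receives; it is not the paper's actual training code.

    import torch.nn.functional as F

    PAUSE_ID = 50257  # hypothetical vocabulary id for <pause>

    def pause_lm_loss(logits, input_ids):
        # logits: [B, T, V]; input_ids: [B, T], already containing the injected <pause> tokens
        targets = input_ids[:, 1:].clone()         # standard next-token targets
        targets[targets == PAUSE_ID] = -100        # drop loss terms whose target is <pause>
        return F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            targets.reshape(-1),
            ignore_index=-100,
        )

For pause-finetuning the same masking applies, with the additional convention from stage 2 that only target tokens appearing after the last <pause> contribute to the loss.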

This approach is compared across four combinations: standard pretraining with standard finetuning (StdPT_StdFT), standard pretraining with pause-finetuning (StdPT_PauseFT), pause-pretraining with standard finetuning (PausePT_StdFT), and pause-pretraining with pause-finetuning (PausePT_PauseFT).

(Figure 1)

Figure 1: Downstream performance for a 1B model. Injecting delays in both stages of training (PausePT_PauseFT) outperforms standard end-to-end training (StdPT_StdFT) on a wide variety of tasks (except HellaSwag). In contrast, introducing delays only in the finetuning stage (StdPT_PauseFT) provides only lukewarm gains, and even hurts on GSM8k.

Empirical Results

Main Findings

  • Pause-pretraining plus pause-finetuning (PausePT_PauseFT) yields consistent improvements across a range of tasks for the 1B model, with the most notable gains being an 18% increase in SQuAD EM, 8% on CommonSenseQA, and 1% in GSM8k accuracy over the standard baseline.
  • Pause-finetuning alone (StdPT_PauseFT) provides only mild or inconsistent gains, and in some cases degrades performance.
  • Pause-pretraining alone (PausePT_StdFT), i.e., without downstream delays, offers limited benefits, indicating that both pretraining and finetuning with pauses are necessary to realize the full advantage.
  • Filler tokens (e.g., periods) as delays do not confer benefits, corroborating prior findings that models must be explicitly trained to utilize such delays.

(Figure 2)

Figure 2: Downstream performance of pause-training on a 130M decoder-only model. On six out of nine tasks, PausePT_PauseFT outperforms StdPT_StdFT, but gains are less pronounced than for the 1B model.

Ablation Studies

  • Optimal Number of Pauses: Each downstream task has an optimal $M_{\rm ft}$; excessive pauses can degrade performance, likely due to overwhelming the self-attention mechanism.
  • Robustness to Inference-Time Delay Mismatch: Pause-trained models degrade gracefully when the number of inference-time pauses differs from training, but performance collapses if no pauses are provided at inference.
  • Appending vs. Prepending: Appending pauses is generally superior, but prepending still outperforms the baseline, suggesting positional biases induced by pause-pretraining.

(Figure 3)

Figure 3: Varying finetuning delay: there exists an optimal number of <pause> tokens for each downstream dataset; gains diminish or reverse beyond this point.

(Figure 4)

Figure 4: Zero-shot evaluation of pause-pretrained models. Zero-shot inference with pause tokens gives gains on some tasks, but absolute accuracies remain low for small models.

Theoretical Analysis

The authors formalize the intuition that pause tokens increase the "computational width" of the Transformer. In standard inference, the number of parallel operations per layer is limited by the input length $K$. By appending $M$ pause tokens, the model can perform $K+M$ parallel computations per layer, potentially enabling it to solve tasks that require more independent operations than the input length allows.

A key theoretical result in the paper demonstrates that, under reasonable assumptions about the representational capacity of the attention-feedforward block, there exist tasks that a 2-layer Transformer can solve with pause tokens but not without them. This is because the number of distinct operations the model can implement is bottlenecked by the input length in standard inference, but can be expanded to match the parameter count of the attention module with pause tokens.
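In our own simplified notation (not the paper's formal statement), the intuition can be phrased as a counting bound: if each position supplies one "slot" for an intermediate per-layer result, then a task requiring $T$ independent per-layer operations with

    K \;<\; T \;\le\; K + M

is within reach once $M$ pause tokens are appended, but not under standard inference, whose width is capped at the input length $K$.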

Computational and Practical Considerations

  • Parameter Efficiency: The addition of a single learnable pause token increases the parameter count negligibly (e.g., 1024 parameters for a 1B model).
  • FLOPS and Latency: Pause tokens increase the number of attention operations per layer, but do not add sequential depth. Thus, with sufficient parallelism, wall-clock overhead is minimal compared to adding layers or attention heads (a rough back-of-envelope estimate follows this list).
  • Comparison to Chain-of-Thought (CoT): While both pause-inference and CoT increase computational width, CoT also increases computational depth via autoregressive generation of intermediate steps. CoT does not require pretraining modifications, whereas pause-inference does.
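To make the FLOPS point concrete, here is a rough back-of-envelope estimate of our own (not from the paper), using the usual quadratic attention and linear feed-forward cost terms per layer and ignoring constant factors:

    def relative_overhead(K, M, d):
        """Approximate extra per-layer compute from appending M pause tokens
        to a K-token prefix, for model width d (rough; constant factors ignored)."""
        attn = lambda T: T * T * d      # ~ cost of the T x T attention map
        ffn = lambda T: T * d * d       # ~ cost of per-token projections / MLP
        base = attn(K) + ffn(K)
        paused = attn(K + M) + ffn(K + M)
        return paused / base - 1.0

    # Example: 512-token prefix, 10 pauses, width 2048 -> about 2.4% extra compute,
    # all of it parallel across positions rather than added sequential depth.
    print(f"{relative_overhead(512, 10, 2048):.1%}")

Under these assumptions the overhead is a few percent of per-layer compute, consistent with the claim that pause tokens widen, rather than deepen, the computation.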

Limitations and Open Questions

  • Generalization: Gains are not universal; some tasks (e.g., HellaSwag) do not benefit, and the optimal number of pauses is task-dependent.
  • Accessibility: The requirement for pause-pretraining limits immediate applicability to existing pretrained models.
  • Robustness: Pause-trained models are not robust to zero-delay inference; performance collapses if no pauses are provided.
  • Scaling: Preliminary results suggest that larger models benefit more from pause tokens, contrary to the hypothesis that smaller models would benefit due to increased effective capacity.
  • Future Directions: Open questions include developing methods to make pause-training effective for standard pretrained models, determining the optimal number of pauses adaptively, and extending the approach to other architectures and pretraining objectives.

Implications and Future Developments

The pause-training paradigm challenges the conventional design of causal LLMs by decoupling the number of per-layer operations from the input length. This has both theoretical and practical implications:

  • Theoretical: It motivates a re-examination of the relationship between model capacity, computational width, and input sequence length in Transformer architectures.
  • Practical: It suggests a new axis for model improvement—computational width via input manipulation—distinct from parameter scaling or architectural changes.

Potential future developments include adaptive pause mechanisms (varying the number of pauses per input), integration with adaptive compute methods, and exploration of pause tokens in encoder-decoder or multi-modal architectures.

Conclusion

Pause-training introduces a simple yet effective modification to the Transformer training and inference pipeline by leveraging learnable pause tokens to expand computational width. Empirical results demonstrate significant gains on a variety of tasks when pauses are introduced during both pretraining and finetuning. Theoretical analysis supports the intuition that pause tokens unlock otherwise inaccessible representational capacity. While the approach has limitations—most notably, the need for pause-pretraining and lack of robustness to zero-delay inference—it opens a promising direction for both the mechanistic understanding and practical enhancement of LLMs.
