Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking

(arXiv:2403.09629)
Published Mar 14, 2024 in cs.CL, cs.AI, and cs.LG

Abstract

When writing and talking, people sometimes pause to think. Although reasoning-focused works have often framed reasoning as a method of answering questions or completing agentic tasks, reasoning is implicit in almost all written text. For example, this applies to the steps not stated between the lines of a proof or to the theory of mind underlying a conversation. In the Self-Taught Reasoner (STaR, Zelikman et al. 2022), useful thinking is learned by inferring rationales from few-shot examples in question-answering and learning from those that lead to a correct answer. This is a highly constrained setting -- ideally, a language model could instead learn to infer unstated rationales in arbitrary text. We present Quiet-STaR, a generalization of STaR in which LMs learn to generate rationales at each token to explain future text, improving their predictions. We address key challenges, including 1) the computational cost of generating continuations, 2) the fact that the LM does not initially know how to generate or use internal thoughts, and 3) the need to predict beyond individual next tokens. To resolve these, we propose a tokenwise parallel sampling algorithm, using learnable tokens indicating a thought's start and end, and an extended teacher-forcing technique. Encouragingly, generated rationales disproportionately help model difficult-to-predict tokens and improve the LM's ability to directly answer difficult questions. In particular, after continued pretraining of an LM on a corpus of internet text with Quiet-STaR, we find zero-shot improvements on GSM8K (5.9% → 10.9%) and CommonsenseQA (36.3% → 47.2%) and observe a perplexity improvement of difficult tokens in natural text. Crucially, these improvements require no fine-tuning on these tasks. Quiet-STaR marks a step towards LMs that can learn to reason in a more general and scalable way.

Figure: Zero-shot accuracy on GSM8K and CommonsenseQA, used to analyze the model's generalization to reasoning problems.

Overview

  • Quiet-STaR introduces a methodology for enhancing generative language models by enabling them to generate internal rationales before making text predictions, building on the Self-Taught Reasoner concept.

  • This approach uses a process involving three core activities: think (generate rationale candidates), talk (assess utility of rationales), and learn (optimize rationale generation).

  • Applying Quiet-STaR to a Mistral 7B model yields significant zero-shot improvements on reasoning benchmarks such as GSM8K and CommonsenseQA without any task-specific finetuning.

  • The methodology opens avenues for research into dynamic rationale adaptation and into combining internal rationales with external techniques such as chain-of-thought prompting, pointing toward more sophisticated reasoning capabilities in language models.

Quiet-STaR: Enhancements in Generative Language Models through Self-Taught Reasoning

Introduction

Generative language models have shown significant capabilities in producing coherent and contextually relevant text. However, the latent reasoning these models could perform while generating text remains largely untapped. Quiet-STaR proposes to harness this capability by enabling models to 'think': generating internal rationales before making predictions about future text. The method builds on and generalizes the Self-Taught Reasoner (STaR). Rather than learning only from question-answering rationales that lead to correct answers, the model learns to infer the unstated rationales implicit in arbitrary text, allowing it to benefit from self-reasoning across a broad spectrum of text and tasks.

Methodology

The Quiet-STaR technique revolves around three core processes: think, talk, and learn. First, after each token in a sequence, the language model generates multiple candidate rationales, short internal thoughts bracketed by learned start-of-thought and end-of-thought tokens, intended to explain the text that follows. To manage this efficiently at scale, a tokenwise parallel sampling algorithm generates thoughts at every position simultaneously, significantly reducing computational overhead. Next, the generated rationales are assessed by their utility in improving the prediction of subsequent tokens: a mixing head, a small learned mechanism, weighs the post-thought prediction against the base prediction, adjusting each thought's influence according to its usefulness. Finally, rationale generation is optimized with a REINFORCE objective, training the model to favor rationales that increase the likelihood of the following text. A minimal sketch of this loop appears below.
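To make this concrete, here is a minimal, self-contained PyTorch sketch of one think/talk/learn step at a single position. It is an illustration under stated assumptions, not the paper's implementation: a toy GRU stands in for Mistral 7B, only one thought is sampled, the mixing-head architecture is invented, probabilities are interpolated rather than logits, and the reward is a simple log-likelihood difference instead of the paper's mean-reward baseline.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, DIM, N_THOUGHT, N_TRUE = 100, 64, 6, 4
START_TOK, END_TOK = VOCAB, VOCAB + 1        # learned thought delimiters

class ToyLM(nn.Module):
    """Tiny stand-in for the causal LM (the paper uses Mistral 7B)."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB + 2, DIM)  # +2 rows for the delimiters
        self.rnn = nn.GRU(DIM, DIM, batch_first=True)
        self.out = nn.Linear(DIM, VOCAB)         # delimiters are never emitted

    def forward(self, ids):
        h, _ = self.rnn(self.emb(ids))
        return self.out(h), h                    # logits and hidden states

lm = ToyLM()
# "talk": head producing a mixing weight from base + post-thought states
mix = nn.Sequential(nn.Linear(2 * DIM, DIM), nn.ReLU(),
                    nn.Linear(DIM, 1), nn.Sigmoid())

tokens = torch.randint(0, VOCAB, (1, 16))        # a batch of one sequence
prefix, future = tokens[:, :-N_TRUE], tokens[:, -N_TRUE:]

# --- think: sample a rationale wrapped in the delimiter tokens ----------
ids = torch.cat([prefix, torch.tensor([[START_TOK]])], dim=1)
log_probs = []                                   # log-probs of thought tokens
for _ in range(N_THOUGHT):
    logits, _ = lm(ids)
    dist = torch.distributions.Categorical(logits=logits[:, -1])
    tok = dist.sample()
    log_probs.append(dist.log_prob(tok))
    ids = torch.cat([ids, tok[:, None]], dim=1)
ids = torch.cat([ids, torch.tensor([[END_TOK]])], dim=1)

# --- talk: score the true future tokens with and without the thought ---
def future_logp(context):
    logits, h = lm(torch.cat([context, future], dim=1))
    lp = F.log_softmax(logits[:, -N_TRUE - 1:-1], dim=-1)
    return lp.gather(-1, future[..., None]).squeeze(-1), h[:, -N_TRUE - 1:-1]

base_lp, base_h = future_logp(prefix)
thought_lp, thought_h = future_logp(ids)
w = mix(torch.cat([base_h, thought_h], dim=-1)).squeeze(-1)
# mixture of predicted probabilities (the paper interpolates logits)
mixed_lp = torch.log(w * thought_lp.exp() + (1 - w) * base_lp.exp())

# --- learn: REINFORCE rewards thoughts that raise future likelihood ----
reward = (thought_lp.sum() - base_lp.sum()).detach()
loss = -mixed_lp.sum() - reward * torch.stack(log_probs).sum()
loss.backward()  # updates LM, delimiter embeddings, and mixing head
```

In the full method, this step runs at every token position simultaneously via the tokenwise parallel sampling algorithm, and the start-of-thought and end-of-thought delimiters are learnable token embeddings, as described in the paper's abstract.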

Results and Implications

Applying Quiet-STaR to a generative language model based on the Mistral architecture yields notable improvements on direct reasoning tasks. Zero-shot accuracy rose on GSM8K (5.9% → 10.9%) and CommonsenseQA (36.3% → 47.2%) without any task-specific finetuning, showcasing the model's enhanced reasoning capabilities. Crucially, these gains grew with the length of the generated rationales, indicating that more extensive internal reasoning contributes directly to predictive accuracy. Beyond benchmark scores, the perplexity improvements on natural text were concentrated on difficult-to-predict tokens, suggesting the model learns to deploy internal thoughts precisely where deeper reasoning helps.
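The following tiny sketch (illustrative only, not the paper's evaluation code; the threshold and numbers are made up) shows how one might check that gains concentrate on difficult tokens, by comparing per-token log-probabilities from a base pass and a thought-augmented pass:

```python
import torch

# Hypothetical per-token log-probabilities over one passage, e.g. from a
# base pass and a thought-augmented pass of the same model (made-up data).
base_lp    = torch.tensor([-0.2, -3.1, -0.4, -4.0, -0.9, -2.6])
thought_lp = torch.tensor([-0.2, -2.3, -0.4, -2.8, -0.8, -1.9])

delta = thought_lp - base_lp      # > 0 means the thought helped that token
hard = base_lp < -2.0             # "difficult" tokens under the base model
print(f"mean gain on hard tokens: {delta[hard].mean().item():.2f}")   # 0.90
print(f"mean gain on easy tokens: {delta[~hard].mean().item():.2f}")  # 0.03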

Future Directions

Quiet-STaR opens up several avenues for future exploration in generative language models and their reasoning capabilities. One area of interest could be the dynamic adaptation of rationale generation based on context or task requirements, potentially optimizing computational resources while maintaining or enhancing performance. Another exciting direction is the combination of Quiet-STaR with chain-of-thought prompting techniques, investigating the synergies between external and internal rationale generation for complex problem-solving tasks.

Conclusion

The Quiet-STaR methodology represents a meaningful step toward language models that teach themselves to reason across diverse text types and tasks. By generating and learning from internal rationales, a model improves its predictive performance and exhibits enhanced reasoning abilities, underscoring the potential of generative language models as general-purpose reasoners. These advances not only contribute to the ongoing discourse on generative AI but also lay the groundwork for more sophisticated and capable language-based reasoning systems.
