Emergent Mind

Abstract

We study whether transformers can learn to implicitly reason over parametric knowledge, a skill that even the most capable language models struggle with. Focusing on two representative reasoning types, composition and comparison, we consistently find that transformers can learn implicit reasoning, but only through grokking, i.e., extended training far beyond overfitting. The levels of generalization also vary across reasoning types: when faced with out-of-distribution examples, transformers fail to systematically generalize for composition but succeed for comparison. We delve into the model's internals throughout training, conducting analytical experiments that reveal: 1) the mechanism behind grokking, such as the formation of the generalizing circuit and its relation to the relative efficiency of generalizing and memorizing circuits, and 2) the connection between systematicity and the configuration of the generalizing circuit. Our findings guide data and training setup to better induce implicit reasoning and suggest potential improvements to the transformer architecture, such as encouraging cross-layer knowledge sharing. Furthermore, we demonstrate that for a challenging reasoning task with a large search space, GPT-4-Turbo and Gemini-1.5-Pro based on non-parametric memory fail badly regardless of prompting styles or retrieval augmentation, while a fully grokked transformer can achieve near-perfect accuracy, showcasing the power of parametric memory for complex reasoning.

Transformers implicitly reason after extensive training, generalizing better for comparison than composition.

Overview

  • The paper investigates how transformer models can implicitly reason over parametric knowledge, highlighting the grokking phenomenon, in which generalization emerges only after extended training far beyond overfitting.

  • Generalization differs by task: transformers generalize systematically in comparison tasks but not in composition tasks, a gap traced to differences in circuit efficiency and cross-layer memory sharing.

  • The research suggests that data distribution and architectural modifications, such as cross-layer memory sharing and parametric memory configurations, are crucial for enhancing transformers' reasoning abilities.

Implicit Reasoning in Transformers: Grokking and Its Mechanisms

The paper investigates the capacity of transformer models to reason implicitly over parametric knowledge. It examines whether transformers can overcome known challenges and exhibit the grokking phenomenon, i.e., generalization that emerges only after an extended training period far beyond overfitting. The research focuses on two archetypal reasoning tasks: composition and comparison.
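To make the two task types concrete, here is a minimal sketch of what composition and comparison queries look like over a store of atomic facts. All entity, relation, and attribute names here are illustrative assumptions, not the paper's actual vocabulary.

```python
# Illustrative sketch of the two reasoning types (names are invented
# for this example, not taken from the paper's dataset).

# Composition: answer a two-hop query by chaining two atomic facts.
atomic = {("alice", "mother"): "beth", ("beth", "employer"): "acme"}

def compose(head, r1, r2, facts):
    """Resolve (head, r1, r2): follow r1 to a bridge entity, then r2."""
    bridge = facts[(head, r1)]   # hop 1: alice --mother--> beth
    return facts[(bridge, r2)]   # hop 2: beth --employer--> acme

# Comparison: decide which of two entities has the larger attribute value.
ages = {"alice": 34, "beth": 29}

def compare(e1, e2, attrs):
    """Return whichever entity has the greater attribute value."""
    return e1 if attrs[e1] > attrs[e2] else e2

print(compose("alice", "mother", "employer", atomic))  # -> acme
print(compare("alice", "beth", ages))                  # -> alice
```

Implicit reasoning means the model must perform these lookups internally, over facts memorized in its parameters, without emitting any intermediate steps.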

Key Findings

  1. Implicit Reasoning via Grokking: The paper shows that transformers can develop implicit reasoning capabilities through grokking. For both composition and comparison tasks, high generalization performance is achieved only after prolonged training beyond overfitting.

  2. Differences in Generalization: The study reveals a crucial distinction in generalization capabilities:

  • For composition, transformers fail to generalize systematically in out-of-distribution (OOD) scenarios.
  • For comparison, transformers successfully generalize systematically even in OOD scenarios.
  3. Mechanistic Insights: Through analytical experiments, the researchers elucidate the internal mechanisms that form during training and grokking. Two primary insights are highlighted:
  • Generalizing Circuit Formation: Specific circuits in the transformer model, termed 'generalizing circuits', are responsible for successful implicit reasoning.
  • Circuit Efficiency: The relative efficiency of the generalizing circuit compared to the memorizing circuit is a vital factor in achieving grokking.
  4. Task-Specific Generalization: Mechanistic analysis indicates that while transformers can develop scalable solutions through parallel circuits in the comparison task, they struggle with the recursive memory sharing required for systematic composition reasoning.
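The grokking recipe behind finding 1 can be caricatured as a training loop that deliberately keeps optimizing long after training accuracy saturates. The skeleton below is a hedged sketch: the callback signatures, step budgets, and thresholds are assumptions for illustration, not the paper's actual training code.

```python
# Sketch of "train far beyond overfitting": keep stepping after the
# model fits the training set, watching for the delayed jump in
# generalization (test) accuracy that characterizes grokking.

def train_until_grokked(train_step, eval_train, eval_test,
                        max_steps=500_000, patience_after_fit=400_000):
    """Return (grok_step, fit_step); grok_step is None if no grokking."""
    fit_step = None
    for step in range(1, max_steps + 1):
        train_step()
        if step % 1000:          # evaluate only every 1000 steps
            continue
        if fit_step is None and eval_train() >= 0.99:
            fit_step = step      # model has (over)fit the training set
        if eval_test() >= 0.99:
            return step, fit_step  # grokked: generalization emerged
        if fit_step and step - fit_step > patience_after_fit:
            break                # budget exhausted without grokking
    return None, fit_step
```

The key design point mirrored from the paper: overfitting (fit_step) is not a stopping criterion; training continues for a multiple of that duration, since generalization may only emerge much later.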

Implications for Training and Architecture

  1. Data Distribution Over Size: The distribution of the training data, specifically the ratio of inferred to atomic facts, affects generalization speed far more than the absolute size of the training set. This observation suggests that prior hypotheses focusing on a critical data size may require reconsideration, with the emphasis shifting to data distribution instead.

  2. Cross-Layer Memory Sharing: Findings point to the need for architectural modifications to enhance generalization in tasks requiring sequential reasoning, such as composition. Applying techniques like memory augmentation and explicit recurrence may yield better results.

  3. Parametric Memory for Complex Reasoning: On a highly challenging reasoning task with an expansive search space, the paper illustrates the distinct advantages of parametric memory. Fully grokked transformers outperform state-of-the-art models like GPT-4-Turbo and Gemini-1.5-Pro, emphasizing the unique potential of parametric memory configurations for intricate reasoning tasks.
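To make the inferred-to-atomic ratio from point 1 concrete, here is a minimal sketch of assembling a composition training set at a target ratio. The construction and the `build_training_mix` helper are hypothetical, intended only to illustrate what the ratio controls, not the paper's exact data recipe.

```python
import random

# Hypothetical sketch: mix all atomic facts with a sample of inferred
# (two-hop) facts so that len(inferred_sample) / len(atomic) == ratio.

def build_training_mix(atomic, inferred, ratio, seed=0):
    """Return atomic facts plus a ratio-controlled sample of inferred facts."""
    k = min(len(inferred), int(ratio * len(atomic)))
    sample = random.Random(seed).sample(inferred, k)
    return list(atomic) + sample

# Toy chain e0 -> e1 -> e2 -> ...: atomic hops and their two-hop compositions.
atomic = [(f"e{i}", "r1", f"e{i+1}") for i in range(100)]
inferred = [(f"e{i}", "r1", "r1", f"e{i+2}") for i in range(98)]

mix = build_training_mix(atomic, inferred, ratio=0.5)
print(len(mix))  # 100 atomic + 50 inferred = 150
```

Under the paper's finding, raising this ratio (while holding total size roughly fixed) speeds up generalization, whereas simply adding more facts at a low ratio does not.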

Future Directions and Conclusion

The paper lays substantial groundwork for future developments in transformer-based reasoning:

  • Architectural Enhancements: Introducing cross-layer memory-sharing mechanisms to transformers could significantly improve their ability to generalize systematically in varied reasoning tasks.
  • Extended Analysis: Future research could further explore the exact dynamics of the generalizing circuits during grokking, offering deeper insights into the optimization process.
  • Balancing Parametric and Non-Parametric Approaches: A nuanced understanding of when to leverage parametric versus non-parametric memory is essential, particularly in complex reasoning scenarios requiring extensive knowledge integration and retrieval.

In summary, this research advances our understanding of how transformers can implicitly reason when subjected to extended training via grokking. It highlights crucial implications for the design of datasets and model architectures, aiming to maximize the transformers' potential for complex reasoning. The findings advocate for refined training setups and potential architectural revisions to foster more robust and systematic generalization capabilities in transformer models.
