TransformerFAM: Feedback attention is working memory

(2404.09173)
Published Apr 14, 2024 in cs.LG, cs.AI, and cs.CL

Abstract

While Transformers have revolutionized deep learning, their quadratic attention complexity hinders their ability to process infinitely long inputs. We propose Feedback Attention Memory (FAM), a novel Transformer architecture that leverages a feedback loop to enable the network to attend to its own latent representations. This design fosters the emergence of working memory within the Transformer, allowing it to process indefinitely long sequences. TransformerFAM requires no additional weights, enabling seamless integration with pre-trained models. Our experiments show that TransformerFAM significantly improves Transformer performance on long-context tasks across various model sizes (1B, 8B, and 24B). These results showcase the potential to empower LLMs to process sequences of unlimited length.

Overview

  • TransformerFAM introduces a feedback loop into the Transformer architecture, acting as working memory, enabling it to process very long sequences efficiently.

  • The Feedback Attention Memory (FAM) component of TransformerFAM allows for maintaining and updating past information without additional weights, facilitating integration with pre-trained models.

  • Experiments show that TransformerFAM significantly outperforms traditional Transformer models on long-context tasks, demonstrating its efficacy across various model sizes.

  • TransformerFAM's approach to integrating working memory into deep learning models opens new research avenues and enables practical applications that require processing long sequences.

TransformerFAM: Integrating Working Memory into Transformers Through Feedback Attention

Introduction to TransformerFAM

The paper introduces TransformerFAM, a novel architecture that extends the Transformer to process indefinitely long sequences by integrating a feedback loop that acts as working memory. This addresses a major limitation of existing Transformer models: their quadratic attention complexity, which prevents them from efficiently handling very long inputs. Unlike conventional approaches that either increase computational resources or implement variations of sliding window attention, TransformerFAM allows the model to attend to its own latent representations through a feedback loop, emulating the functionality of working memory in the human brain.
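
To make the idea concrete, below is a minimal sketch (ours, not the paper's code) of one attention layer processing a sequence block by block while carrying a feedback memory. The function names, the memory length, and the exact update rule are illustrative assumptions rather than the paper's precise formulation.

```python
# Minimal illustration of feedback attention memory (FAM):
# each block of queries attends to [FAM, current block], and the FAM is
# then refreshed by letting the memory tokens attend to [block, old FAM].
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    """Plain scaled dot-product attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def fam_layer(x, fam, block_size=4):
    """Process a long sequence block by block, carrying a feedback memory.

    x   : (seq_len, d) input activations for one layer
    fam : (fam_len, d) feedback attention memory carried across blocks
    """
    outputs = []
    for start in range(0, len(x), block_size):
        block = x[start:start + block_size]
        # 1) Block queries attend to the memory plus the current block.
        ctx = np.concatenate([fam, block], axis=0)
        outputs.append(attend(block, ctx, ctx))
        # 2) Memory queries attend to the block plus the previous memory,
        #    producing the updated FAM that is carried to the next block.
        mem_ctx = np.concatenate([block, fam], axis=0)
        fam = attend(fam, mem_ctx, mem_ctx)
    return np.concatenate(outputs, axis=0), fam

# Toy usage: sizes are arbitrary.
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))   # 16 tokens, model dim 8
fam = rng.normal(size=(2, 8))  # 2 memory slots
y, fam = fam_layer(x, fam)
print(y.shape, fam.shape)      # (16, 8) (2, 8)
```

Because the memory has a fixed length, the context each query sees stays bounded no matter how long the full sequence is, which is what allows the model to keep streaming new blocks indefinitely.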

Core Contributions

  • Feedback Attention Memory (FAM): The introduction of FAM enables the Transformer to maintain and update a working memory of past information, allowing it to process indefinitely long sequences with linear computational complexity (a back-of-the-envelope comparison appears after this list). This novel component does not introduce additional weights, facilitating its integration with pre-trained models.
  • Compatibility with Existing Models: Because FAM adds no new weights, TransformerFAM can reuse pre-trained Transformer checkpoints without retraining from scratch, and it remains applicable across model sizes, demonstrating its scalability.
  • Significant Performance Improvements: The experiments conducted show that TransformerFAM significantly outperforms standard Transformer models on long-context tasks, a result consistently observed across different model sizes.
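
The linear-complexity claim can be illustrated with a rough operation count: dense self-attention grows quadratically with sequence length, while block-wise attention over a fixed-size memory grows linearly. The block and memory sizes below are illustrative assumptions (not the paper's configuration), and the count ignores the FAM update and feed-forward layers.

```python
# Back-of-the-envelope attention cost: quadratic (full) vs. linear (block + memory).

def full_attention_ops(seq_len: int, d: int) -> int:
    """Score and value multiplications for dense self-attention."""
    return 2 * seq_len * seq_len * d

def fam_attention_ops(seq_len: int, d: int, block: int, mem: int) -> int:
    """Each block of queries attends only to its own block plus the memory."""
    n_blocks = seq_len // block
    ctx = block + mem
    return 2 * n_blocks * block * ctx * d

for L in (1_024, 16_384, 262_144):
    print(L, full_attention_ops(L, 64), fam_attention_ops(L, 64, block=1024, mem=64))
```

Doubling the sequence length doubles the block-wise cost but quadruples the dense cost, which is why the feedback-memory variant remains tractable at very long contexts.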

Experiments and Results

The experimental results underscore TransformerFAM's ability to enhance performance on tasks requiring long-context processing. For instance, on the PassKey retrieval task, TransformerFAM retrieved the key through filler contexts of up to 260k tokens, markedly exceeding models that rely on traditional sliding window attention. This behavior held across model sizes from 1B to 24B, indicating scalability.
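
For readers unfamiliar with the task, here is a rough sketch of how a PassKey-style prompt can be constructed: a short "remember this key" sentence is buried inside a long stretch of filler text, and the model is asked to recall the key at the end. The filler sentences, prompt wording, and function name are assumptions, not the paper's exact template.

```python
# Illustrative PassKey-style prompt construction (not the paper's template).
import random

def make_passkey_prompt(filler_tokens: int, seed: int = 0) -> tuple[str, str]:
    rng = random.Random(seed)
    passkey = str(rng.randint(10_000, 99_999))
    filler = "The grass is green. The sky is blue. The sun is yellow. "
    # Repeat filler until it roughly reaches the requested length,
    # then splice the passkey sentence into a random position.
    chunks = [filler] * (filler_tokens // len(filler.split()))
    insert_at = rng.randrange(len(chunks))
    chunks.insert(insert_at, f"The pass key is {passkey}. Remember it. ")
    prompt = "".join(chunks) + "\nWhat is the pass key?"
    return prompt, passkey

prompt, key = make_passkey_prompt(filler_tokens=1_000)
print(len(prompt.split()), key)
```

The task is a pure retrieval probe: a model with a working attention span over the whole input should recover the key regardless of how much filler separates it from the question.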

Implications and Future Prospects

  • Theoretical Implications: TransformerFAM presents a novel approach to integrating working memory into deep learning models, which could stimulate further research into models that more closely mimic human cognitive processes.
  • Practical Applications: The ability to process indefinitely long sequences efficiently opens up new avenues for application in areas such as document summarization, extended conversation understanding, and other settings where long-range contextual understanding is crucial.
  • Future Development: The architecture invites exploration into models that can handle increasingly heterogeneous data types, perhaps leading toward more integrative and versatile AI systems.

Conclusion

TransformerFAM represents a significant step forward in overcoming the limitations imposed by the quadratic attention complexity of traditional Transformers. By introducing a mechanism that emulates working memory, it not only enhances the model's ability to process long sequences but also aligns artificial neural network architectures more closely with the cognitive functions of the human brain. As such, TransformerFAM both advances the field of deep learning and opens new pathways for research into AI systems capable of complex, contextually rich information processing.
