
xLSTM: Extended Long Short-Term Memory (2405.04517v2)

Published 7 May 2024 in cs.LG, cs.AI, and stat.ML

Abstract: In the 1990s, the constant error carousel and gating were introduced as the central ideas of the Long Short-Term Memory (LSTM). Since then, LSTMs have stood the test of time and contributed to numerous deep learning success stories, in particular they constituted the first LLMs. However, the advent of the Transformer technology with parallelizable self-attention at its core marked the dawn of a new era, outpacing LSTMs at scale. We now raise a simple question: How far do we get in language modeling when scaling LSTMs to billions of parameters, leveraging the latest techniques from modern LLMs, but mitigating known limitations of LSTMs? Firstly, we introduce exponential gating with appropriate normalization and stabilization techniques. Secondly, we modify the LSTM memory structure, obtaining: (i) sLSTM with a scalar memory, a scalar update, and new memory mixing, (ii) mLSTM that is fully parallelizable with a matrix memory and a covariance update rule. Integrating these LSTM extensions into residual block backbones yields xLSTM blocks that are then residually stacked into xLSTM architectures. Exponential gating and modified memory structures boost xLSTM capabilities to perform favorably when compared to state-of-the-art Transformers and State Space Models, both in performance and scaling.

Citations (78)

Summary

  • The paper introduces exponential gating and novel memory architectures (sLSTM and mLSTM) to enhance traditional LSTM performance for sequence modeling tasks.
  • The xLSTM architecture achieves linear computation and improved memory capacity, outperforming standard LSTMs and approaching Transformer performance.
  • Experimental results on formal language tasks and the Long Range Arena highlight xLSTM’s potential across language modeling and reinforcement learning applications.

"xLSTM: Extended Long Short-Term Memory" (2405.04517)

Introduction

The "xLSTM: Extended Long Short-Term Memory" paper addresses the limitations of traditional LSTM models when scaled to billions of parameters and compared to Transformers. Despite the proven effectiveness of LSTMs in sequence modelling, notably within the realms of LLMs and reinforcement learning tasks, the sequential processing requirement of LSTMs has hindered their scalability and parallelization. The authors propose new modifications to the LSTM memory cell structure, introducing two variants: sLSTM and mLSTM, both equipped with exponential gating mechanisms. These enhancements aim to address the known constraints of LSTMs by offering improvements in memory capacity, processing speed, and overall model performance.

Extended LSTM Architecture

The xLSTM introduces two primary modifications: exponential gating and novel memory structures. The first variant, sLSTM, keeps a scalar memory and scalar update, adding exponential gates with normalization and stabilization as well as a new memory-mixing scheme. The second variant, mLSTM, expands the memory cell from a scalar to a matrix and retrieves memories via matrix multiplication; by abandoning hidden-hidden recurrent connections it becomes fully parallelizable. These memory structures are integrated into residual blocks, called xLSTM blocks, which are then residually stacked to form an xLSTM architecture capable of competitive performance in language modeling tasks (Figure 1).

Figure 1: The extended LSTM (xLSTM) family: the original LSTM memory cell alongside the sLSTM and mLSTM extensions, which introduce exponential gating and, in the case of mLSTM, a matrix memory.
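To make the update rules concrete, the sketch below implements a single recurrent step of an mLSTM-style cell with exponential gating, the covariance memory update, and a log-space stabilizer, following the description above. The dimensions, random inputs, and the choice of a sigmoid forget gate are simplifying assumptions for illustration; this is a minimal sketch, not the authors' reference implementation.

```python
# Minimal sketch of one mLSTM recurrence step: exponential input gate,
# covariance memory update, normalizer state, and a log-space stabilizer.
# All shapes and initializations here are illustrative assumptions.
import numpy as np

def mlstm_step(C, n, m, q, k, v, i_tilde, f_tilde, o):
    """C: (d, d) matrix memory, n: (d,) normalizer, m: scalar stabilizer.
    q, k, v: (d,) query/key/value; i_tilde, f_tilde: scalar gate
    pre-activations; o: (d,) output gate in (0, 1)."""
    log_f = -np.logaddexp(0.0, -f_tilde)        # log sigmoid(f_tilde)
    m_new = max(log_f + m, i_tilde)             # stabilizer bounds the exponentials
    i_gate = np.exp(i_tilde - m_new)            # stabilized exponential input gate
    f_gate = np.exp(log_f + m - m_new)          # stabilized forget gate

    C_new = f_gate * C + i_gate * np.outer(v, k)    # covariance update rule
    n_new = f_gate * n + i_gate * k                 # normalizer update
    h_tilde = C_new @ q / max(abs(n_new @ q), 1.0)  # normalized memory retrieval
    return C_new, n_new, m_new, o * h_tilde         # output gating

# Toy usage with random inputs for a cell of dimension d.
d = 4
rng = np.random.default_rng(0)
C, n, m = np.zeros((d, d)), np.zeros(d), 0.0
for _ in range(3):
    q, k, v = rng.normal(size=(3, d)) / np.sqrt(d)
    o = 1.0 / (1.0 + np.exp(-rng.normal(size=d)))
    C, n, m, h = mlstm_step(C, n, m, q, k, v, rng.normal(), rng.normal(), o)
```

Because each step touches the previous state only through elementwise gates and an outer-product accumulation, with gates that do not depend on the previous hidden state, the recurrence can also be unrolled in parallel over the sequence, which is what makes mLSTM amenable to efficient GPU execution.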

Memory Capacity and Speed Considerations

In contrast to Transformers, xLSTM offers linear computation and constant memory complexity with respect to sequence length. Because its memory is compressive, the low computational and memory footprint is attractive for industrial applications and edge deployments. mLSTM's matrix memory is more expensive to compute, but its operations parallelize well on GPUs, keeping the wall-clock overhead small. sLSTM is slower than mLSTM owing to its recurrent memory mixing, though an optimized CUDA implementation narrows the gap.
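To make the complexity contrast concrete, the toy calculation below compares the fixed-size state carried by an mLSTM-style recurrent cell during generation with the key-value cache a self-attention layer must retain; the specific dimension and sequence length are arbitrary assumptions chosen for illustration.

```python
# Illustrative memory comparison at generation time: a recurrent cell keeps a
# fixed-size state, while self-attention caches keys/values for every past token.
d = 64          # cell / head dimension (arbitrary choice for illustration)
seq_len = 2048  # number of generated tokens

# Recurrent (mLSTM-style): one (d, d) matrix memory plus a (d,) normalizer,
# regardless of how many tokens have been processed.
recurrent_state_floats = d * d + d

# Attention: the KV cache stores a key and a value vector per past position.
kv_cache_floats = 2 * seq_len * d

print(f"recurrent state: {recurrent_state_floats:>8} floats (constant in sequence length)")
print(f"KV cache:        {kv_cache_floats:>8} floats (grows linearly with sequence length)")
```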

The research aligns with efforts to linearize attention, such as Linformer and Performer, and with State Space Models, all of which are linear in context length and exhibit desirable scaling properties. Among scalable recurrent networks, xLSTM is conceptually related to approaches like RWKV and Retention (RetNet), which likewise aim to improve parallel processing and memory retention.

Experiments

The paper evaluates xLSTM on synthetic formal language tasks and on the Long Range Arena benchmark. The experiments show that xLSTM handles the memory-capacity demands of Multi-Query Associative Recall (MQAR) tasks well and extrapolates to longer sequence lengths, outperforming both traditional LSTMs and contemporary Transformer models (Figure 2).

Figure 2: Demonstration of xLSTM's exponential gating impact on formal language tasks as evaluated under the Chomsky hierarchy.
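For a concrete picture of the associative-recall setting, the snippet below sketches a toy generator for an MQAR-style example: key-value pairs are presented once, followed by queries (repeated keys) whose paired values the model must recall. The vocabulary ranges and sequence layout are simplifying assumptions, not the benchmark's exact construction.

```python
# Toy generator for a Multi-Query Associative Recall (MQAR)-style example:
# a context of key-value token pairs followed by queried keys; the targets
# are the values originally paired with those keys. Token ranges are
# illustrative assumptions only.
import random

def make_mqar_example(num_pairs=8, num_queries=4,
                      key_vocab=range(100, 200), value_vocab=range(200, 300),
                      seed=None):
    rng = random.Random(seed)
    keys = rng.sample(list(key_vocab), num_pairs)
    values = [rng.choice(list(value_vocab)) for _ in keys]
    kv = dict(zip(keys, values))

    context = [tok for pair in zip(keys, values) for tok in pair]  # k1 v1 k2 v2 ...
    queried = rng.sample(keys, num_queries)                        # keys seen earlier
    targets = [kv[k] for k in queried]                             # values to recall
    return context + queried, targets

tokens, targets = make_mqar_example(seed=0)
print(tokens)
print(targets)
```

A model solves such an example only if its state retains every key-value binding from the context, which is why MQAR is used to probe memory capacity.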

Limitations

Despite its enhanced performance, the xLSTM architecture has limitations: sLSTM's memory mixing prevents parallelization, the gating mechanisms require careful initialization to avoid computational inefficiencies, and the compressive memory may become overloaded when sequence lengths are extended beyond 16k.

Conclusion

The xLSTM extends the capabilities of LSTM architectures, presenting a viable alternative to state-of-the-art Transformer models in large-scale sequence modeling. It delivers promising results in language modeling, with indications that it will remain competitive with LLMs as it is scaled further. Looking ahead, xLSTM's impact may extend to other areas of AI, including time-series prediction and reinforcement learning, given its innovative enhancements to memory management within neural architectures.

