xLSTM: Extended Long Short-Term Memory (2405.04517v2)

Published 7 May 2024 in cs.LG, cs.AI, and stat.ML

Abstract: In the 1990s, the constant error carousel and gating were introduced as the central ideas of the Long Short-Term Memory (LSTM). Since then, LSTMs have stood the test of time and contributed to numerous deep learning success stories, in particular they constituted the first Large Language Models (LLMs). However, the advent of the Transformer technology with parallelizable self-attention at its core marked the dawn of a new era, outpacing LSTMs at scale. We now raise a simple question: How far do we get in language modeling when scaling LSTMs to billions of parameters, leveraging the latest techniques from modern LLMs, but mitigating known limitations of LSTMs? Firstly, we introduce exponential gating with appropriate normalization and stabilization techniques. Secondly, we modify the LSTM memory structure, obtaining: (i) sLSTM with a scalar memory, a scalar update, and new memory mixing, (ii) mLSTM that is fully parallelizable with a matrix memory and a covariance update rule. Integrating these LSTM extensions into residual block backbones yields xLSTM blocks that are then residually stacked into xLSTM architectures. Exponential gating and modified memory structures boost xLSTM capabilities to perform favorably when compared to state-of-the-art Transformers and State Space Models, both in performance and scaling.

Summary

  • The paper introduces exponential gating and novel memory architectures (sLSTM and mLSTM) to enhance traditional LSTM performance for sequence modeling tasks.
  • The xLSTM architecture achieves linear computation and improved memory capacity, outperforming standard LSTMs and approaching Transformer performance.
  • Experimental results on formal language tasks and the Long Range Arena highlight xLSTM’s potential across language modeling and reinforcement learning applications.

"xLSTM: Extended Long Short-Term Memory" (2405.04517)

Introduction

The "xLSTM: Extended Long Short-Term Memory" paper addresses the limitations of traditional LSTM models when scaled to billions of parameters and compared to Transformers. Despite the proven effectiveness of LSTMs in sequence modelling, notably within the realms of LLMs and reinforcement learning tasks, the sequential processing requirement of LSTMs has hindered their scalability and parallelization. The authors propose new modifications to the LSTM memory cell structure, introducing two variants: sLSTM and mLSTM, both equipped with exponential gating mechanisms. These enhancements aim to address the known constraints of LSTMs by offering improvements in memory capacity, processing speed, and overall model performance.

Extended LSTM Architecture

The xLSTM introduces two primary modifications: exponential gating and novel memory structures. The first variant, sLSTM, keeps a scalar memory and combines exponential gates with a normalizer state and new memory mixing. The second variant, mLSTM, expands the memory cell from a scalar to a matrix, retrieving memories via matrix multiplication and enabling parallelization by abandoning hidden-hidden recurrent connections. These memory cells are integrated into residual blocks called xLSTM blocks, which are then residually stacked to form an xLSTM architecture capable of competitive performance on language modeling tasks (Figure 1).

Figure 1: The extended LSTM (xLSTM) family: the original LSTM memory cell alongside the sLSTM and mLSTM memory cells, which introduce exponential gating and matrix memory.
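
To make these updates concrete, the following is a minimal NumPy sketch of a single recurrent mLSTM step with exponential gating, the covariance (outer-product) memory update, and a log-space stabilizer, following the description above. The variable names, shapes, and the `mlstm_step` helper are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def mlstm_step(x, C, n, m, params):
    """One recurrent mLSTM step: matrix memory C, normalizer n, scalar stabilizer m.

    Shapes (illustrative): x (d_in,), C (d, d), n (d,).
    `params` holds projection weights W_q/W_k/W_v/W_o of shape (d, d_in)
    and gate weight vectors w_i/w_f of shape (d_in,).
    """
    d = C.shape[0]
    q = params["W_q"] @ x                         # query
    k = (params["W_k"] @ x) / np.sqrt(d)          # key, scaled as in attention
    v = params["W_v"] @ x                         # value

    i_tilde = params["w_i"] @ x                   # scalar input-gate pre-activation
    f_tilde = params["w_f"] @ x                   # scalar forget-gate pre-activation
    o = 1.0 / (1.0 + np.exp(-(params["W_o"] @ x)))  # sigmoid output gate

    # Exponential gating, stabilized in log space so exp() cannot overflow.
    log_f = -np.log1p(np.exp(-f_tilde))           # log(sigmoid(f_tilde)); a purely
                                                  # exponential forget gate is also possible
    m_new = max(log_f + m, i_tilde)               # running stabilizer
    i_gate = np.exp(i_tilde - m_new)
    f_gate = np.exp(log_f + m - m_new)

    # Covariance update rule: store value v under key k via an outer product.
    C_new = f_gate * C + i_gate * np.outer(v, k)
    n_new = f_gate * n + i_gate * k

    # Retrieve with the query; the normalizer bounds the denominator away from zero.
    h = o * ((C_new @ q) / max(abs(n_new @ q), 1.0))
    return h, C_new, n_new, m_new

# Toy usage: the carried state (C, n, m) has a fixed size, independent of sequence length.
rng = np.random.default_rng(0)
d_in = d = 8
params = {name: 0.1 * rng.normal(size=(d, d_in)) for name in ("W_q", "W_k", "W_v", "W_o")}
params.update({name: 0.1 * rng.normal(size=d_in) for name in ("w_i", "w_f")})
C, n, m = np.zeros((d, d)), np.zeros(d), 0.0
for x in rng.normal(size=(16, d_in)):
    h, C, n, m = mlstm_step(x, C, n, m, params)
```

Because no quantity in the step depends on a previous hidden state (only on the memory C and normalizer n), all timesteps of a training sequence can also be computed in a parallel, attention-like form, which is what makes the mLSTM fully parallelizable.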

Memory Capacity and Speed Considerations

The xLSTM contrasts with Transformers by offering linear computation and constant memory complexity with respect to sequence length. Because its memory is compressive, the resulting low computational demands are attractive for industrial applications, especially edge deployments. Although the mLSTM's matrix memory raises computational cost, it is well suited to parallel GPU execution, keeping wall-clock overhead small. The sLSTM remains slower than the mLSTM, but optimized CUDA implementations reduce the speed gap.
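
For contrast, here is a similarly minimal sketch of a single sLSTM step (again with illustrative names and shapes, not the authors' optimized CUDA kernel). The recurrent matrices R_* feed the previous hidden state into every gate, which is the memory-mixing property that forces sequential processing; note also that the carried state (h, c, n, m) has a fixed size, consistent with the constant memory complexity described above.

```python
import numpy as np

def slstm_step(x, h_prev, c, n, m, params):
    """One sLSTM step with scalar cell states c, normalizer n, and stabilizer m.

    Shapes (illustrative): x (d_in,), h_prev/c/n/m (d,).
    `params` holds input weights W_* of shape (d, d_in) and recurrent R_* of shape (d, d).
    """
    p = params
    # Memory mixing: every gate pre-activation depends on h_prev via R_*.
    z_tilde = p["W_z"] @ x + p["R_z"] @ h_prev
    i_tilde = p["W_i"] @ x + p["R_i"] @ h_prev
    f_tilde = p["W_f"] @ x + p["R_f"] @ h_prev
    o_tilde = p["W_o"] @ x + p["R_o"] @ h_prev

    z = np.tanh(z_tilde)                         # cell input
    o = 1.0 / (1.0 + np.exp(-o_tilde))           # sigmoid output gate

    # Exponential input/forget gating, stabilized in log space.
    log_f = -np.log1p(np.exp(-f_tilde))          # sigmoid variant of the forget gate
    m_new = np.maximum(log_f + m, i_tilde)
    i_gate = np.exp(i_tilde - m_new)
    f_gate = np.exp(log_f + m - m_new)

    c_new = f_gate * c + i_gate * z              # scalar cell-state update
    n_new = f_gate * n + i_gate                  # normalizer update
    h_new = o * (c_new / n_new)                  # normalized, gated hidden state
    return h_new, c_new, n_new, m_new
```

Because each step needs the previous hidden state before any gate can be evaluated, the loop over time cannot be collapsed into one batched matrix product the way the mLSTM's parallel form can, which is why sLSTM speed relies on a fused CUDA kernel rather than on parallelization over the sequence.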

The research aligns with efforts to linearize attention through alternatives such as Linformer and Performer, as well as with State Space Models, all of which are linear in context length and exhibit desirable scaling properties. Among scalable recurrent networks, xLSTM is conceptually related to approaches such as RWKV and Retention (RetNet), which likewise aim to improve parallel processing and memory retention.

Experiments

The paper evaluates xLSTM on synthetic tasks from formal language domains and on the Long Range Arena benchmark. The experiments show that xLSTM handles the memory-capacity challenges of Multi-Query Associative Recall (MQAR) tasks well and extrapolates to longer sequence lengths, outperforming both traditional LSTMs and contemporary Transformer models (Figure 2).

Figure 2: The impact of xLSTM's exponential gating on formal language tasks evaluated under the Chomsky hierarchy.

Limitations

Despite its improved performance, the xLSTM architecture has limitations: the sLSTM cannot be parallelized because of its memory mixing, the initialization of the gating mechanisms requires careful tuning to avoid computational inefficiencies, and the memory may become overloaded when sequence lengths extend beyond 16k.

Conclusion

The xLSTM extends the capabilities of LSTM architectures, presenting a viable alternative to state-of-the-art Transformer models in large-scale sequence modeling. The architecture delivers promising language modeling results, with indications that it remains competitive as it is scaled further toward LLM sizes. Looking ahead, xLSTM's innovations in memory management may benefit diverse areas of AI, including time-series prediction and reinforcement learning.
