
Transformers Can Achieve Length Generalization But Not Robustly

(2402.09371)
Published Feb 14, 2024 in cs.LG, cs.AI, and cs.CL

Abstract

Length generalization, defined as the ability to extrapolate from shorter training sequences to longer test ones, is a significant challenge for language models. This issue persists even with large-scale Transformers handling relatively straightforward tasks. In this paper, we test the Transformer's ability of length generalization using the task of addition of two integers. We show that the success of length generalization is intricately linked to the data format and the type of position encoding. Using the right combination of data format and position encodings, we show for the first time that standard Transformers can extrapolate to a sequence length that is 2.5x the input length. Nevertheless, unlike in-distribution generalization, length generalization remains fragile, significantly influenced by factors like random weight initialization and training data order, leading to large variances across different random seeds.

Transformers trained on up to 40-digit addition can generalize to 100-digit addition with over 98% accuracy.

Overview

  • This paper critically examines the limitations of Transformer-based models with respect to length generalization, the ability to extrapolate from shorter training sequences to longer test sequences, using two-integer addition as the testbed.

  • It explores the impact of positional encodings and data formats on length generalization, identifying a combination that can improve generalization but lacks robustness.

  • An extensive empirical analysis is conducted across model configurations and sequence lengths, revealing that length generalization is fragile and varies widely with random weight initialization and training data order.

  • The findings highlight the need for further research into more resilient architectures and training methodologies to enhance the flexibility and reliability of Transformer models.

Examining the Fragility of Length Generalization in Transformers

Introduction

In the realm of natural language processing, the advent of Transformer-based models has marked a significant leap forward in performance across a variety of tasks. Despite these capabilities, one area where Transformers persistently fall short is length generalization: the capacity to apply knowledge learned from shorter sequences to accurately process longer ones. Length generalization is a significant challenge not only in theoretical contexts but also in practical applications, where models may encounter inputs of variable and unforeseen lengths. This paper undertakes a critical examination of the issue, focusing on the task of adding two integers. Through empirical analysis, it demonstrates that while Transformers can, under the right conditions, extrapolate well beyond the sequence lengths seen during training, this ability is highly sensitive to training conditions and is not robust across runs.

Position Encoding and Data Formats

Key to the paper's investigation is an exploration of how various position encodings (PEs) and data formats influence length generalization. The study spans an extensive evaluation of absolute position encodings (APE) and relative position encodings (RPE), alongside newer techniques such as FIRE position encodings and randomized position encodings. Data formatting proves just as important: writing digits in reversed order (least-significant digit first, the order in which addition is actually carried out) and adding index hints (a label attached to each digit so the model can align digits of equal place value across the operands and the answer) emerge as crucial factors in achieving successful length generalization.
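To make this formatting concrete, here is a minimal sketch of how a single addition example might be rendered with the reversed format and index hints. The helper names (format_operand, format_addition_example), the choice of lowercase letters as hint symbols, and the decision to reverse both operands and the answer are illustrative assumptions, not the paper's exact tokenization.

```python
HINTS = "abcdefghijklmnopqrstuvwxyz"  # one hint symbol per digit position (illustrative)

def format_operand(n: int, reverse: bool = True, index_hints: bool = True) -> str:
    """Render one integer least-significant digit first, with a hint letter
    prefixed to each digit so digits of equal place value share a hint."""
    digits = str(n)
    if reverse:
        digits = digits[::-1]
    if index_hints:
        return "".join(h + d for h, d in zip(HINTS, digits))
    return digits

def format_addition_example(a: int, b: int) -> str:
    """Format a full training example: reversed, hinted operands and answer."""
    return f"{format_operand(a)}+{format_operand(b)}={format_operand(a + b)}"

# 576 + 361 = 937, written least-significant digit first with index hints.
print(format_addition_example(576, 361))  # a6b7c5+a1b6c3=a7b3c9
```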

Recipe for Success and Its Limitations

Central to the study's findings is the identification of a specific combination of elements (FIRE position encodings, randomized position encodings, the reversed format, and index hints) that together enable standard Transformers to generalize to lengths up to 2.5x those encountered during training. However, the robustness of this generalization is far from guaranteed: it is acutely vulnerable to factors such as random weight initialization and the ordering of training data, producing large variance across random seeds. Such fragility underscores the precarious footing on which length generalization in Transformer models currently rests.
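Of these ingredients, randomized position encodings are the easiest to illustrate: during training, each sequence is assigned a sorted random subset of positions drawn from a range much larger than any training sequence, so the position values that appear at longer test lengths are not entirely novel to the model. The sketch below is a rough illustration under that assumption; the maximum-position value and function name are placeholders rather than the paper's settings. (FIRE, by contrast, learns the attention bias itself, roughly by passing a transformed relative distance through a small MLP.)

```python
import random

def sample_random_positions(seq_len: int, max_position: int = 2048) -> list:
    """Assign `seq_len` positions drawn without replacement from
    [0, max_position) and sorted, so relative order is preserved while
    absolute position values vary from example to example."""
    return sorted(random.sample(range(max_position), seq_len))

# Example: a 12-token training sequence receives positions spread over [0, 2048),
# including values larger than any position a fixed 0..11 indexing would use.
print(sample_random_positions(12))
```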

Experiments and Observations

An extensive experimental setup supports these insights. The experiments cover a range of sequence lengths and model variants, systematically analyzing how changes in data format and position encoding affect the models' ability to generalize across lengths. Notably, the study shows that while certain position encoding schemes can appear advantageous at first, their efficacy often diminishes as sequence length grows, underscoring the nuanced interplay between architectural choices and task-specific demands.
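The core measurement in such experiments is exact-match accuracy as a function of operand length. The sketch below shows one way this bucketing might be computed; `model.generate` and the structure of `test_set` are hypothetical stand-ins, not the paper's actual evaluation harness.

```python
from collections import defaultdict

def accuracy_by_length(model, test_set):
    """Exact-match accuracy bucketed by the number of digits in the operands.

    Assumes `test_set` yields (prompt, target, num_digits) triples and that
    `model.generate` returns the decoded answer string for a prompt."""
    correct, total = defaultdict(int), defaultdict(int)
    for prompt, target, num_digits in test_set:
        prediction = model.generate(prompt)
        total[num_digits] += 1
        correct[num_digits] += int(prediction == target)
    return {n: correct[n] / total[n] for n in sorted(total)}
```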

Implications and Speculations

The findings have significant implications for the development and deployment of Transformer models, especially in applications that must handle variable input lengths. The fragility of length generalization observed here points to a critical area for future work, suggesting an urgent need for more resilient architectures and training methodologies. Furthermore, the insights from the task-specific experiments pave the way for a broader understanding of how Transformers process and generalize information across differing contexts, offering valuable perspectives for advancing the field.

Conclusion

In summary, this research sheds light on the complex dynamics underpinning length generalization in Transformer models, marking an important step towards understanding and eventually overcoming these limitations. The pursuit of more robust mechanisms for length generalization remains an open and pressing challenge, one that holds the key to unlocking even greater potential for Transformer-based models in the evolving landscape of natural language processing and artificial intelligence.
