
Length Generalization of Causal Transformers without Position Encoding (2404.12224v2)

Published 18 Apr 2024 in cs.CL

Abstract: Generalizing to longer sentences is important for recent Transformer-based LLMs. Besides algorithms manipulating explicit position features, the success of Transformers without position encodings (NoPE) provides a new way to overcome the challenge. In this paper, we study the length generalization property of NoPE. We find that although NoPE can extend to longer sequences than the commonly used explicit position encodings, it still has a limited context length. We identify a connection between the failure of NoPE's generalization and the distraction of attention distributions. We propose a parameter-efficient tuning for searching attention heads' best temperature hyper-parameters, which substantially expands NoPE's context size. Experiments on long sequence language modeling, the synthetic passkey retrieval task and real-world long context tasks show that NoPE can achieve competitive performances with state-of-the-art length generalization algorithms. The source code is publicly accessible.

References (42)
  1. Proof-pile.
  2. Longbench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508.
  3. BigScience Workshop. 2022. Bloom (revision 4ab0472).
  4. bloc97. 2023a. Add NTK-Aware interpolation "by parts" correction.
  5. bloc97. 2023b. NTK-Aware Scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation.
  6. bloc97. 2023c. NTK-Aware Scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation.
  7. Extending context window of large language models via positional interpolation.
  8. Latent positional information is in the self-attention variance of transformer language models without positional embeddings. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1183–1193, Toronto, Canada. Association for Computational Linguistics.
  9. David Chiang and Peter Cholak. 2022. Overcoming a theoretical limitation of self-attention. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7654–7664.
  10. A survey on long text modeling with transformers. arXiv preprint arXiv:2302.14502.
  11. emozilla. 2023. Dynamically Scaled RoPE further increases performance of long context LLaMA with zero fine-tuning.
  12. Lm-infinite: Simple on-the-fly length generalization for large language models.
  13. Transformer language models without positional encodings still learn positional information. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 1382–1390, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  14. Automl: A survey of the state-of-the-art. Knowledge-Based Systems, 212:106622.
  15. Advancing transformer architecture in long-context large language models: A comprehensive survey. arXiv preprint arXiv:2311.12351.
  16. Atlas: Few-shot learning with retrieval augmented language models. J. Mach. Learn. Res., 24:251:1–251:43.
  17. Llm maybe longlm: Self-extend llm context window without tuning.
  18. kaiokendev. 2023. Things I'm learning while training SuperHOT.
  19. The impact of positional encoding on length generalization in transformers. In Thirty-seventh Conference on Neural Information Processing Systems.
  20. Starcoder: may the source be with you!
  21. Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. In International Conference on Learning Representations.
  22. Amirkeivan Mohtashami and Martin Jaggi. 2023a. Landmark attention: Random-access infinite context length for transformers.
  23. Amirkeivan Mohtashami and Martin Jaggi. 2023b. Random-access infinite context length for transformers. In Thirty-seventh Conference on Neural Information Processing Systems.
  24. MosaicML NLP Team. 2023. Introducing mpt-7b: A new standard for open-source, commercially usable llms. Accessed: 2023-05-05.
  25. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, UIST ’23, New York, NY, USA. Association for Computing Machinery.
  26. Yarn: Efficient context window extension of large language models.
  27. YaRN: Efficient context window extension of large language models. In The Twelfth International Conference on Learning Representations.
  28. Train short, test long: Attention with linear biases enables input length extrapolation. In International Conference on Learning Representations.
  29. Train short, test long: Attention with linear biases enables input length extrapolation.
  30. Compressive transformers for long-range sequence modelling. In International Conference on Learning Representations.
  31. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.
  32. SlimPajama: A 627B token cleaned and deduplicated version of RedPajama.
  33. Jianlin Su. 2021. Attention's scale operation from entropy invariance.
  34. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063.
  35. Roformer: Enhanced transformer with rotary position embedding. CoRR, abs/2104.09864.
  36. A length-extrapolatable transformer. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14590–14604, Toronto, Canada. Association for Computational Linguistics.
  37. Llama 2: Open foundation and fine-tuned chat models.
  38. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30.
  39. Rolellm: Benchmarking, eliciting, and enhancing role-playing abilities of large language models. arXiv preprint arXiv:2310.00746.
  40. Efficient streaming language models with attention sinks.
  41. Soaring from 4k to 400k: Extending llm’s context with activation beacon.
  42. Tinyllama: An open-source small language model.

Summary

  • The paper shows that causal Transformers trained without explicit position encoding (NoPE) can generalize to longer contexts than common explicit encodings, provided their attention distributions stay concentrated.
  • It introduces uniform and head-based softmax temperature scaling to concentrate attention and sustain lower perplexity beyond the pretraining window.
  • Empirical results on language modeling, synthetic retrieval, and real-world long-context tasks show that head-based scaling outperforms zero-shot RoPE extension and approaches methods that rely on additional finetuning.

Length Generalization of Causal Transformers without Position Encoding

Introduction and Motivation

Transformer architectures underpin most large-scale LLMs, with explicit positional encodings (PE) such as sinusoidal, relative, and Rotary Position Encoding (RoPE) ubiquitously used to infuse sequence order information into token embeddings. However, explicit PE ties a model to the position range seen during pretraining, and performance typically degrades sharply on longer, unseen input lengths. This paper investigates the capacity of causal, autoregressive Transformers to generalize to longer contexts when no explicit position encoding is used (NoPE).

Recent theoretical and empirical works have shown that masked attention alone can encode sufficient order information, and models can learn some positional awareness without explicit PE. The central question is: can such models generalize to much longer contexts, and what mechanisms underpin their failure or success in length generalization?

NoPE: Causal Transformers with No Position Encoding

The authors directly train large-scale Transformers with all explicit positional features ablated—removing RoPE from the TinyLlama codebase to produce a 1.1B parameter NoPE model. All other architectural and training details are kept identical to establish parity for evaluation.

Experimental validation demonstrates that within the pretraining window (2K tokens), NoPE matches RoPE in language modeling and commonsense reasoning (see Table 1 of the paper). However, outside the pretraining length, both approaches falter, raising questions about extending context length capability.
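
For readers who want the mechanics spelled out, the following is a minimal sketch (not the authors' code) of NoPE-style attention in PyTorch: standard causal self-attention with the rotary-embedding step simply omitted, so the causal mask is the only source of order information.

```python
import torch
import torch.nn.functional as F

def nope_causal_attention(q, k, v):
    """Causal self-attention with no position encoding (NoPE).

    q, k, v: (batch, heads, seq_len, head_dim). Unlike RoPE, no rotation
    is applied to q/k before the dot product; positional information can
    only be inferred from the causal mask.
    """
    d = q.size(-1)
    scores = (q @ k.transpose(-2, -1)) / d ** 0.5                    # (B, H, T, T)
    t = q.size(-2)
    causal = torch.tril(torch.ones(t, t, dtype=torch.bool, device=q.device))
    scores = scores.masked_fill(~causal, float("-inf"))              # hide future tokens
    return F.softmax(scores, dim=-1) @ v
```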

Analysis of Length Generalization Failure

The study pinpoints attention distribution distraction, where attention heads become uniformly spread rather than focused, as the point of failure in length generalization for both NoPE and RoPE. This is measured by the entropy of attention distributions as sequence length expands beyond the training regime.

Figure 1: NoPE and RoPE both exhibit performance and attention entropy inflection points as sequence length extends beyond pretraining; NoPE degrades later than RoPE.


Figure 2: Entropy patterns reveal highly diverse behaviors among attention heads, motivating per-head scaling.


Figure 3: Visualization of entropy across all layers and heads during 8K extension shows most lower-layer heads verge on theoretical maximum entropy as context increases.

The crucial insight is the tight coupling between rising entropy (more uniform, less focused attention) and surging perplexity, directly signaling generation failure.
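
The entropy diagnostic itself is straightforward to reproduce; a sketch is given below, where comparing against the theoretical maximum log(key_len) mirrors the "verge on theoretical maximum entropy" observation (the interface and normalization choice are assumptions, not the paper's exact script).

```python
import torch

def attention_entropy(attn, eps=1e-9):
    """Mean per-head entropy of attention distributions.

    attn: (batch, heads, query_len, key_len), rows summing to 1.
    Returns one entropy value per head plus the theoretical maximum
    log(key_len), which a perfectly uniform (distracted) head attains.
    """
    h = -(attn * (attn + eps).log()).sum(dim=-1)        # (B, H, Q)
    per_head = h.mean(dim=(0, 2))                       # average over batch and query positions
    max_entropy = torch.log(torch.tensor(float(attn.size(-1))))
    return per_head, max_entropy
```

Tracking per_head against max_entropy as the input grows past the pretraining window is one way to reproduce the inflection-point curves shown in Figures 1 to 3.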

Temperature Scaling: Uniform and Head-Based

To regain concentrated attention distributions and improve length generalization, the paper explores manipulating the softmax temperature $\lambda$ in the attention mechanism:

$$\alpha_{ij}^{(h)} = \frac{e^{\lambda \bm{q}_i^{(h)} \cdot \bm{k}_j^{(h)}}}{\sum_k e^{\lambda \bm{q}_i^{(h)} \cdot \bm{k}_k^{(h)}}}$$

Raising $\lambda$ re-concentrates attention and enables longer-context extrapolation without updating any model weights. Systematic scaling allows NoPE to sustain lower perplexity over significantly longer context windows.


Figure 4: Uniform scaling of softmax temperature immediately extends NoPE's useful context window, while offering little benefit to RoPE.
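
Concretely, uniform scaling amounts to multiplying the pre-softmax logits of every head by the same $\lambda$ at inference time, with no weight updates. A minimal sketch follows, keeping the usual $1/\sqrt{d}$ factor (into which $\lambda$ could equally be folded); the function name and interface are illustrative.

```python
import torch
import torch.nn.functional as F

def scaled_softmax_attention(q, k, v, lam=1.0, causal_mask=None):
    """Attention with a uniform softmax temperature multiplier lam.

    lam > 1 sharpens (re-concentrates) the attention distribution; lam = 1
    recovers standard attention. Applied at inference time only.
    """
    d = q.size(-1)
    scores = lam * (q @ k.transpose(-2, -1)) / d ** 0.5
    if causal_mask is not None:
        scores = scores.masked_fill(~causal_mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```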

The necessity for differentiated scaling arises: not all attention heads display the same entropy dynamics under length extrapolation. Uniform scaling overconcentrates some heads while undercompensating others.

The authors introduce head-based scaling: a separate $\lambda^{(h)}$ is learned for each attention head via parameter-efficient fine-tuning (only 704 delta parameters for the 1.1B model), with constraints to avoid degenerate distracted states. This fine-tuning is data- and compute-efficient, requiring only 0.03% of the original pretraining data.

Figure 5: Head-based scaling outperforms uniform scaling at extreme context windows, with log-perplexity rising more slowly as context grows.


Figure 6: Correlation analysis of entropy versus the optimal $\lambda^{(h)}$ per head demonstrates that heads with concentrated attention need larger scaling, while dispersed heads require less.
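
A sketch of how head-based scaling could be packaged as a parameter-efficient module is shown below: one learnable scalar per head per layer (for example, 22 layers x 32 heads = 704 scalars for a 1.1B model), trained while the base weights stay frozen. The clamp range used here to keep heads away from degenerate distracted states is an assumed stand-in for the paper's constraints, not their exact implementation.

```python
import torch
import torch.nn as nn

class HeadTemperatures(nn.Module):
    """Learnable per-head softmax temperatures lambda^(h).

    Only num_layers * num_heads scalars are trainable (704 for a
    22-layer, 32-head model); all base model weights remain frozen.
    """
    def __init__(self, num_layers=22, num_heads=32, init=1.0,
                 min_lam=1.0, max_lam=4.0):
        super().__init__()
        self.lam = nn.Parameter(torch.full((num_layers, num_heads), init))
        self.min_lam, self.max_lam = min_lam, max_lam   # assumed constraint range

    def forward(self, layer_idx, scores):
        """Scale the pre-softmax attention logits of one layer.

        scores: (batch, heads, query_len, key_len) logits before softmax.
        """
        lam = self.lam[layer_idx].clamp(self.min_lam, self.max_lam)
        return scores * lam.view(1, -1, 1, 1)
```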

Extensive Empirical Evaluation

The approach is benchmarked on both synthetic and real-world tasks:

  • Language modeling (PG19, Proof-pile): Head-based NoPE outperforms RoPE-based zero-shot extension (NTK), approaches methods with additional finetuning data (YaRN), and surpasses ALiBi baselines at longer contexts.
  • Synthetic passkey retrieval: NoPE with head-based scaling generalizes substantially beyond the pretraining context, unlike RoPE and ALiBi, retaining high retrieval accuracy at more than 2x the original context (a prompt-construction sketch follows this list).

Figure 7: Passkey retrieval accuracy demonstrates NoPE's superior context window generalization relative to RoPE.

  • LongBench: Head-based NoPE achieves higher or competitive scores vs. RoPE baselines on QA, summarization, and code tasks, retaining more utility in long-context settings.
  • Ablation Study: Removing the attention concentration constraints or the appropriate initialization degrades performance on both synthetic and real-world long-context tasks, confirming their importance.
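
For reference, a sketch of how a passkey retrieval example is commonly constructed: a random key is hidden inside repetitive filler text and the model is asked to repeat it. The filler wording, key format, and prompt below are illustrative assumptions, not the paper's exact setup.

```python
import random

def make_passkey_prompt(target_words=4000, seed=0):
    """Build a synthetic passkey-retrieval prompt of roughly target_words words."""
    rng = random.Random(seed)
    passkey = rng.randint(10000, 99999)
    filler = ("The grass is green. The sky is blue. "
              "The sun is yellow. Here we go. There and back again. ")
    needle = f"The pass key is {passkey}. Remember it. {passkey} is the pass key. "
    n_blocks = max(1, target_words // len(filler.split()))
    insert_at = rng.randint(0, n_blocks)                 # random depth for the needle
    context = filler * insert_at + needle + filler * (n_blocks - insert_at)
    return context + "What is the pass key? The pass key is", passkey
```

Accuracy is then the fraction of prompts for which the model's continuation contains the hidden passkey.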

Additionally, empirical fitting provides an explicit mapping between the uniform scale and the extension ratio:


Figure 8: The fitted optimal uniform scale shows that the required $\lambda$ increases logarithmically with the extension factor.
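
As a hedged illustration, such a fit can be expressed as $\lambda(s) \approx a \log s + b$ in the extension ratio $s$; the sketch below fits those coefficients with NumPy, using placeholder (ratio, $\lambda$) pairs rather than the paper's measured values.

```python
import numpy as np

# Placeholder (extension ratio, best uniform lambda) pairs; replace with
# values measured on the tuned model. The logarithmic form follows Figure 8.
ratios = np.array([1.0, 2.0, 4.0, 8.0])
best_lam = np.array([1.00, 1.18, 1.37, 1.55])            # illustrative only

a, b = np.polyfit(np.log(ratios), best_lam, deg=1)       # least-squares fit

def uniform_lambda(extension_ratio):
    """Predicted uniform softmax scale for a given context-extension ratio."""
    return a * np.log(extension_ratio) + b

print(round(uniform_lambda(16.0), 3))
```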

Implications and Future Directions

This work fundamentally challenges the perceived necessity of explicit positional encoding as the sole route to long-context generalization in Transformers. By identifying and directly manipulating attention concentration/entropy via temperature, the study demonstrates that NoPE architectures can be competitive with, or even surpass, position-encoding-dependent strategies—despite lacking any explicit order features.

Theoretically, this reframes length generalization as a problem of attention distribution dynamics, not position encoding alone. Practically, head-based scaling offers a highly parameter-efficient, finetune-light mechanism compatible with large deployed models, mitigating rigidity imposed by fixed context lengths at pretraining.

The persistence of quadratic complexity and memory constraints in NoPE for extremely long contexts leaves open critical optimization and scalability challenges. Moreover, while NoPE with scaling matches or exceeds the performance of several PE-based extension paradigms, it still trails the best results from those with more extensive finetuning (e.g., YaRN). Enhanced theoretical understanding of entropy evolution and more advanced scaling schemes might further close this gap.

Conclusion

The study rigorously demonstrates that Transformers without explicit position encoding (NoPE) possess robust length generalization capabilities when their attention distribution entropy is directly managed through softmax temperature scaling, particularly at a per-head level. The elimination of explicit positional biases does not preclude generalization—on the contrary, it enables new, efficient mechanisms for extending sequence modeling ability for long-context tasks. This reorientation invites a reconsideration of the role of position encoding, laying foundations for more flexible and adaptable Transformer architectures in both research and practical deployments.
