RULER: What's the Real Context Size of Your Long-Context Language Models? (2404.06654v3)

Published 9 Apr 2024 in cs.CL

Abstract: The needle-in-a-haystack (NIAH) test, which examines the ability to retrieve a piece of information (the "needle") from long distractor texts (the "haystack"), has been widely adopted to evaluate long-context LMs. However, this simple retrieval-based test is indicative of only a superficial form of long-context understanding. To provide a more comprehensive evaluation of long-context LMs, we create a new synthetic benchmark RULER with flexible configurations for customized sequence length and task complexity. RULER expands upon the vanilla NIAH test to encompass variations with diverse types and quantities of needles. Moreover, RULER introduces new task categories multi-hop tracing and aggregation to test behaviors beyond searching from context. We evaluate 17 long-context LMs with 13 representative tasks in RULER. Despite achieving nearly perfect accuracy in the vanilla NIAH test, almost all models exhibit large performance drops as the context length increases. While these models all claim context sizes of 32K tokens or greater, only half of them can maintain satisfactory performance at the length of 32K. Our analysis of Yi-34B, which supports context length of 200K, reveals large room for improvement as we increase input length and task complexity. We open source RULER to spur comprehensive evaluation of long-context LMs.


Summary

  • The paper introduces RULER, a synthetic benchmark for evaluating long-context language models on diverse tasks beyond simple retrieval.
  • It evaluates ten long-context models at context lengths up to 128K tokens, revealing significant performance degradation and discrepancies between claimed and effective context sizes.
  • It shows that performance on non-retrieval tasks, especially multi-hop tracing and aggregation, degrades as context length increases, with GPT-4 as the notable exception.

"RULER: What's the Real Context Size of Your Long-Context LLMs?" (2404.06654)

Introduction

The paper introduces "Ruler," a synthetic benchmark specifically designed to evaluate the long-context capabilities of LMs. Traditional evaluation methods like the needle-in-a-haystack (NIAH) test focus primarily on retrieval capabilities, leaving other aspects of long-context understanding unexplored. Ruler encompasses a suite of tasks aimed at challenging these models beyond mere retrieval, including multi-hop tracing, aggregation, and question answering (QA). The benchmark's flexibility in task configuration enables a comprehensive assessment across different sequence lengths and complexities.

Benchmark Structure

RULER comprises four major task categories:

  1. Retrieval (NIAH Extension): This category extends the vanilla NIAH test with diverse needle types and distractor setups, requiring retrieval proficiency across varied contexts (a minimal prompt-construction sketch follows this list).
  2. Multi-hop Tracing (Variable Tracking): This task emulates coreference chain resolution, requiring models to track entity references over long contexts.
  3. Aggregation (Common and Frequent Word Extraction): Tasks in this category simulate summarization, evaluating a model's ability to aggregate dispersed, relevant information.
  4. Question Answering (QA): By augmenting existing short-context QA datasets with distracting information, these tasks test models' QA capabilities at scale.
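
To make the retrieval category concrete, below is a minimal, illustrative Python sketch of how a multi-key NIAH example might be assembled: word-number key-value needles are inserted at random positions in distractor text, and the query asks for the value of one key. The function names, needle template, and parameters are illustrative assumptions rather than the benchmark's released code.

```python
import random

def build_niah_example(haystack_sentences, num_needles=4, seed=0):
    """Construct a toy multi-key needle-in-a-haystack prompt.

    Each needle is a (word key, 7-digit number) pair inserted at a random
    position in the distractor text; the query asks for one key's value.
    """
    rng = random.Random(seed)
    keys = [f"key-{i}-{rng.randint(1000, 9999)}" for i in range(num_needles)]
    values = [str(rng.randint(1_000_000, 9_999_999)) for _ in range(num_needles)]

    sentences = list(haystack_sentences)
    for key, value in zip(keys, values):
        needle = f"One of the special magic numbers for {key} is: {value}."
        sentences.insert(rng.randrange(len(sentences) + 1), needle)

    query_idx = rng.randrange(num_needles)
    prompt = (
        " ".join(sentences)
        + f"\nWhat is the special magic number for {keys[query_idx]}?"
    )
    return prompt, values[query_idx]  # prompt text and gold answer

# Example usage with a tiny filler haystack; real inputs would be much longer.
haystack = ["The grass is green and the sky is blue."] * 50
prompt, answer = build_niah_example(haystack)
```

Task difficulty can then be scaled along the lines RULER describes: adding more needles, using harder distractors, or asking for several values at once.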

Figure 1: In the aggregation tasks, words are sampled from a vocabulary following two distributions. Common words extraction (CWE) samples from a uniform distribution, while in frequent words extraction (FWE) the frequency of each word is determined by its rank in the vocabulary and the parameter α of a Zeta distribution.
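
As a hedged illustration of the two sampling schemes described in the caption (not the paper's implementation), CWE-style inputs can be drawn uniformly from a vocabulary, while FWE-style inputs follow a Zeta (Zipf-like) distribution over word ranks; the vocabulary size and α value below are arbitrary choices for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = [f"word{i}" for i in range(1, 1001)]  # toy 1,000-word vocabulary

# CWE-style sampling: every word is equally likely (uniform distribution).
cwe_words = rng.choice(vocab, size=5000, replace=True)

# FWE-style sampling: word frequency decays with rank following a Zeta distribution.
alpha = 2.0  # illustrative value; the benchmark treats this as a tunable parameter
ranks = rng.zipf(alpha, size=5000)
ranks = ranks[ranks <= len(vocab)]          # drop ranks that fall outside the vocabulary
fwe_words = [vocab[r - 1] for r in ranks]   # rank 1 maps to the most frequent word

# The aggregation tasks then ask the model to return the most common or most
# frequent words appearing in a long context built from such samples.
```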

Experimental Setup

The researchers benchmarked ten long-context LMs, including prominent models such as GPT-4, Command-R, and Yi-34B. The models were assessed on 13 representative tasks across varying complexity levels and context lengths up to 128K tokens. Using a combination of static thresholds and weighted averaging, the paper establishes an "effective length" as a measure of each model's usable context (a simplified sketch of this computation follows Figure 2).

Figure 2: Performance of Yi-34B on the needle-in-a-haystack (NIAH) tasks. By default, word-number key-value pairs serve as needles and Paul Graham essays as the haystack. Yi is not robust to changes in needle type and degrades as the number of distractors increases. (W: words; N: numbers; U: UUIDs; Full: entire haystack).
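
The following is a simplified sketch of how such measures could be computed; the threshold value and weighting scheme below are assumptions for illustration, not the exact definitions used in the paper.

```python
def effective_length(scores_by_length, threshold=85.0):
    """Largest context length whose average score clears the threshold.

    `scores_by_length` maps context length (tokens) -> average task score (%).
    The threshold here is an assumed placeholder, not the paper's exact value.
    """
    passing = [L for L, s in sorted(scores_by_length.items()) if s >= threshold]
    return max(passing) if passing else None

def weighted_average(scores_by_length, favor_long=True):
    """Length-weighted average score; longer (or shorter) contexts get more weight."""
    lengths = sorted(scores_by_length)
    weights = [L if favor_long else 1.0 / L for L in lengths]
    total = sum(weights)
    return sum(w * scores_by_length[L] for w, L in zip(weights, lengths)) / total

# Hypothetical per-length scores for one model (percent accuracy).
scores = {4_000: 96.0, 8_000: 94.0, 16_000: 91.0, 32_000: 87.0,
          64_000: 81.0, 128_000: 73.0}
print(effective_length(scores))          # -> 32000 with the assumed threshold
print(round(weighted_average(scores), 1))
```

With these hypothetical scores, the effective length is 32K even though the model accepts 128K-token inputs, mirroring the gap the paper reports between claimed and effective context size.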

Key Findings

The benchmark revealed significant discrepancies between claimed and effective context sizes, with all models exhibiting performance degradation as input length increased. Notably, only GPT-4 maintained satisfactory performance at 128K tokens. The paper also demonstrates that larger parameter counts and longer training context lengths generally correlate with improved performance in long-context scenarios.

Figure 3: Performance of Yi-34B on variable tracking (VT), frequent words extraction (FWE), and QA tasks across different task complexities. Yi shows large degradation and distinct trends as context size scales in these non-retrieval tasks, demonstrating the need to evaluate behaviors beyond retrieval from context.

Task-Specific Insights

  • Retrieval (NIAH): All models showed perfect scores on straightforward passkey retrieval tasks, but performance declined considerably as tasks incorporated hard distractors or required multi-value retrieval.
  • Multi-hop Tracing: Models struggled to reliably trace variable bindings through complex chains, especially as the number of distracting chains increased (see the toy example after this list).
  • Aggregation: Performance varied significantly based on the input word distribution, with models frequently misjudging word frequencies at large context sizes.
  • QA: Models frequently hallucinated answers, indicating reduced reliance on the provided context; this behavior became more pronounced as input length increased.
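
To make the multi-hop tracing behavior concrete, here is a small illustrative generator for a variable-tracking example in the spirit of the task described above: one gold chain of variable assignments plus distracting chains, with the model asked to name every variable that ends up holding the queried value. The statement format and naming are assumptions, not the benchmark's exact templates.

```python
import random

def build_variable_tracking(num_hops=4, num_noise_chains=2, seed=0):
    """Toy variable-tracking example: one gold chain plus distracting chains."""
    rng = random.Random(seed)
    value = str(rng.randint(10000, 99999))

    # Gold chain: VAR0 gets the value, each later variable copies the previous one.
    gold = [f"VAR{i}" for i in range(num_hops + 1)]
    statements = [f"{gold[0]} = {value}."]
    statements += [f"{gold[i]} = {gold[i - 1]}." for i in range(1, len(gold))]

    # Distracting chains with unrelated values and names.
    for c in range(num_noise_chains):
        noise_value = str(rng.randint(10000, 99999))
        names = [f"N{c}_{i}" for i in range(num_hops + 1)]
        statements.append(f"{names[0]} = {noise_value}.")
        statements += [f"{names[i]} = {names[i - 1]}." for i in range(1, len(names))]

    rng.shuffle(statements)
    question = f"Which variables contain the value {value}?"
    return " ".join(statements) + " " + question, gold  # prompt and gold answer
```

Difficulty grows with the number of hops in the chain and the number of distracting chains, which is where the degradation described above appears.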

Model Analysis

An analysis of non-Transformer architectures (e.g., Mamba, RWKV) showed that they lag behind Transformer-based models on length extrapolation. Furthermore, increasing the RoPE base frequency positively correlated with improved length extrapolation (a brief illustration follows Figure 4).

Figure 4: (Left and middle left): Comparison of the LargeWorldModel (LWM) series trained up to various context sizes with a fixed parameter size of 7B. (Middle right): Comparison of Yi suite models with different parameter sizes and a controlled training context length of 200K. (Right): Performance of non-Transformer architectures lags behind the Transformer baseline Llama2-7B by a large margin. Length extrapolation is shown with dashed lines.
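
As a brief illustration of why a larger RoPE base frequency can help at long range (a standard property of rotary position embeddings, sketched here rather than taken from the paper's analysis), the per-dimension rotary frequencies are θ_i = base^(-2i/d), so increasing the base lowers the frequencies and lengthens the corresponding wavelengths; the head dimension used below is an arbitrary example.

```python
import numpy as np

def rope_wavelengths(base, dim=128):
    """Per-dimension-pair RoPE wavelengths (in tokens): 2*pi / theta_i,
    with theta_i = base ** (-2i / dim)."""
    i = np.arange(dim // 2)
    theta = base ** (-2.0 * i / dim)
    return 2.0 * np.pi / theta

# Longest wavelength for two base frequencies (dim is an illustrative choice).
for base in (10_000, 1_000_000):
    print(base, f"{rope_wavelengths(base).max():,.0f} tokens")
# A larger base yields much longer maximum wavelengths, so distant relative
# positions remain distinguishable instead of wrapping around.
```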

Conclusion

The introduction of RULER sets a new standard for comprehensive evaluation of long-context LMs by incorporating tasks that extend beyond basic retrieval. The benchmark exposes the limitations of current models, particularly the gap between their claimed context sizes and their effective context utilization at longer sequences. RULER is likely to spur further research on models capable of handling truly extended contexts, ultimately enhancing their applicability in real-world scenarios that require long-range dependencies.
