$\infty$Bench: Extending Long Context Evaluation Beyond 100K Tokens (2402.13718v3)
Abstract: Processing and reasoning over long contexts is crucial for many practical applications of LLMs, such as document comprehension and agent construction. Despite recent strides in making LLMs process contexts with more than 100K tokens, there is currently a lack of a standardized benchmark to evaluate this long-context capability. Existing public benchmarks typically focus on contexts around 10K tokens, limiting the assessment and comparison of LLMs on longer contexts. In this paper, we propose $\infty$Bench, the first LLM benchmark featuring an average data length surpassing 100K tokens. $\infty$Bench comprises synthetic and realistic tasks spanning diverse domains, presented in both English and Chinese. The tasks in $\infty$Bench are designed to require a thorough understanding of long dependencies in the context, so that simply retrieving a limited number of passages is not sufficient to solve them. In our experiments, we use $\infty$Bench to evaluate state-of-the-art proprietary and open-source LLMs tailored for processing long contexts. The results indicate that existing long-context LLMs still require significant advancements to effectively process 100K+ token contexts. We further present three intriguing analyses regarding the behavior of LLMs when processing long contexts.
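To make the "average data length surpassing 100K tokens" claim concrete, the following is a minimal sketch of how one might load a long-context benchmark task and measure its prompt lengths with a GPT-4-family tokenizer. The dataset repository name, task/config name, and field names (`context`, `input`) are assumptions for illustration only; consult the official $\infty$Bench release for the exact identifiers and loading instructions.

```python
# Minimal sketch: estimate token lengths of long-context benchmark prompts.
# Repo name, config name, and field names below are assumed for illustration.
from datasets import load_dataset
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-4-family models

# Hypothetical repo/config; substitute the actual benchmark release.
ds = load_dataset("xinrongzhang2022/InfiniteBench", "longbook_qa_eng", split="test")

lengths = []
for example in ds:
    # Assumed schema: a long "context" document plus a task-specific "input" question.
    prompt = example["context"] + "\n\n" + example["input"]
    # disallowed_special=() avoids errors if the raw text happens to contain special tokens.
    lengths.append(len(enc.encode(prompt, disallowed_special=())))

print(f"examples:       {len(lengths)}")
print(f"average length: {sum(lengths) / len(lengths):.0f} tokens")
print(f"max length:     {max(lengths)} tokens")
```

A harness like this also makes it easy to check whether a given model's context window (e.g., 128K or 200K tokens) actually covers the benchmark's prompts before running a full evaluation.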