L-Eval: Instituting Standardized Evaluation for Long Context Language Models
Abstract: Recently, there has been growing interest in extending the context length of LLMs, aiming to effectively process long single-turn inputs or conversations with more extensive histories. While proprietary models such as GPT-4 and Claude largely preserve their reasoning ability in an extended context, open-source models are still in the early stages of development. To bridge this gap, we propose L-Eval to institute a more standardized evaluation for long context LLMs (LCLMs), addressing two key aspects: dataset construction and evaluation metrics. On the one hand, we build a new evaluation suite containing 20 sub-tasks, 508 long documents, and over 2,000 human-labeled query-response pairs encompassing diverse question styles, domains, and input lengths (3k$\sim$200k tokens). On the other hand, we investigate the effectiveness of evaluation metrics for LCLMs. Results show that popular n-gram matching metrics generally do not correlate well with human judgment, so we strongly advocate for length-instruction-enhanced (LIE) evaluation and for employing LLM judges. We conduct a comprehensive study of 4 popular commercial LLMs and 12 open-source counterparts using the L-Eval benchmark. Our empirical findings offer useful insights into the study of LCLMs and lay the groundwork for a more principled evaluation of these models.
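To make the length-instruction-enhanced (LIE) idea concrete, here is a minimal illustrative sketch, not the paper's actual evaluation code: the helper names, the instruction wording, and the example texts are assumptions, and a simple unigram F1 stands in for the n-gram matching metrics (e.g., ROUGE) discussed above.

```python
from collections import Counter

def add_length_instruction(question: str, target_len: int) -> str:
    """Append a length instruction so the model's output length roughly
    matches the reference, reducing length bias in n-gram metrics.
    (Hypothetical wording; the benchmark's exact instruction may differ.)"""
    return f"{question}\nPlease answer in about {target_len} words."

def unigram_f1(prediction: str, reference: str) -> float:
    """A simple unigram-overlap F1, illustrating the kind of n-gram
    matching score that often disagrees with human judgment."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    reference = "The merger was blocked because of antitrust concerns."
    question = "Why was the merger blocked?"
    prompt = add_length_instruction(question, target_len=len(reference.split()))
    print(prompt)

    # A correct but differently worded answer scores poorly under n-gram
    # matching, which is why pairing LIE prompts with an LLM judge is advocated.
    answer = "Regulators stopped the deal, citing competition issues in review."
    print(f"unigram F1: {unigram_f1(answer, reference):.2f}")
```

In practice, the LIE prompt constrains answer length so that surface-level metrics are less confounded by verbosity, and an LLM judge (or human) then rates the constrained responses.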