CLongEval: A Chinese Benchmark for Evaluating Long-Context Large Language Models (2403.03514v2)
Abstract: Developing LLMs with robust long-context capabilities has been a recent research focus, resulting in the emergence of long-context LLMs proficient in Chinese. However, the evaluation of these models remains underdeveloped due to a lack of benchmarks. To address this gap, we present CLongEval, a comprehensive Chinese benchmark for evaluating long-context LLMs. CLongEval is characterized by three key features: (1) Sufficient data volume, comprising 7 distinct tasks and 7,267 examples; (2) Broad applicability, accommodating models with context window sizes from 1K to 100K; (3) High quality, with over 2,000 manually annotated question-answer pairs in addition to the automatically constructed labels. With CLongEval, we undertake a comprehensive assessment of 6 open-source long-context LLMs and 2 leading commercial counterparts that feature both long-context abilities and proficiency in Chinese. We also provide an in-depth analysis of the empirical results, aiming to shed light on the critical capabilities that present challenges in long-context settings. The dataset, evaluation scripts, and model outputs are released.
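
The released evaluation scripts are not reproduced here; the sketch below only illustrates what scoring a task file from a benchmark of this kind might look like. The JSONL field names (`context`, `question`, `answer`), the prompt template, and the character-level F1 metric are assumptions made for illustration, not CLongEval's official data format or scoring procedure.

```python
# Minimal sketch of scoring a long-context Chinese QA task.
# Field names, prompt template, and metric are illustrative assumptions.
import json
from collections import Counter


def char_f1(prediction: str, reference: str) -> float:
    """Character-level F1, a common choice for Chinese QA evaluation."""
    pred_chars = Counter(prediction)
    ref_chars = Counter(reference)
    overlap = sum((pred_chars & ref_chars).values())  # shared character counts
    if overlap == 0:
        return 0.0
    precision = overlap / max(len(prediction), 1)
    recall = overlap / max(len(reference), 1)
    return 2 * precision * recall / (precision + recall)


def evaluate(jsonl_path: str, generate) -> float:
    """Average F1 over one task file; `generate` is any callable mapping a prompt to text."""
    scores = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            example = json.loads(line)
            # Hypothetical prompt layout: full long context, then the question.
            prompt = f"{example['context']}\n\n问题：{example['question']}\n回答："
            prediction = generate(prompt)
            scores.append(char_f1(prediction, example["answer"]))
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Sanity check of the metric; scoring a real model would look like:
    #   evaluate("clongeval_task.jsonl", my_model_generate)
    print(char_f1("北京是中国的首都", "中国的首都是北京"))  # 1.0: identical character multisets
```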
 