Xiezhi: An Ever-Updating Benchmark for Holistic Domain Knowledge Evaluation
Abstract: New Natural Language Processing (NLP) benchmarks are urgently needed to keep pace with the rapid development of LLMs. We present Xiezhi, the most comprehensive evaluation suite designed to assess holistic domain knowledge. Xiezhi comprises 249,587 multiple-choice questions spanning 516 diverse disciplines across 13 subjects, accompanied by Xiezhi-Specialty and Xiezhi-Interdiscipline, each with 15k questions. We evaluate 47 cutting-edge LLMs on Xiezhi. Results indicate that LLMs exceed the average performance of humans in science, engineering, agronomy, medicine, and art, but fall short in economics, jurisprudence, pedagogy, literature, history, and management. We anticipate Xiezhi will help analyze important strengths and shortcomings of LLMs, and the benchmark is released at https://github.com/MikeGu721/XiezhiBenchmark.
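To make the multiple-choice evaluation protocol concrete, below is a minimal sketch of how a benchmark like Xiezhi can be scored as plain accuracy. The file name, the JSON field names (`question`, `options`, `answer`), and the `ask_model` callable are assumptions for illustration; the actual data layout and evaluation scripts in the released repository may differ.

```python
import json
from typing import Callable

def evaluate_mcq(path: str, ask_model: Callable[[str], str]) -> float:
    """Score a model on a JSONL file of multiple-choice questions.

    Each line is assumed to hold a question, a list of options, and the
    letter of the correct answer -- these field names are hypothetical.
    """
    correct = 0
    total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            letters = "ABCDEFGH"[: len(item["options"])]
            # Format the question and lettered options into a single prompt.
            prompt = (
                item["question"]
                + "\n"
                + "\n".join(
                    f"{letter}. {opt}"
                    for letter, opt in zip(letters, item["options"])
                )
                + "\nAnswer with a single letter:"
            )
            # Take the first character of the model reply as its choice.
            prediction = ask_model(prompt).strip()[:1].upper()
            correct += int(prediction == item["answer"])
            total += 1
    return correct / max(total, 1)

# Usage (hypothetical file name and model wrapper):
# accuracy = evaluate_mcq("xiezhi_specialty.jsonl", my_llm_call)
```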