State of What Art? A Call for Multi-Prompt LLM Evaluation (2401.00595v3)
Abstract: Recent advances in LLMs have led to the development of various evaluation benchmarks. These benchmarks typically rely on a single instruction template for evaluating all LLMs on a specific task. In this paper, we comprehensively analyze the brittleness of results obtained via single-prompt evaluations across 6.5M instances, involving 20 different LLMs and 39 tasks from 3 benchmarks. To improve the robustness of the analysis, we propose evaluating LLMs with a set of diverse prompts instead. We discuss tailored evaluation metrics for specific use cases (e.g., LLM developers vs. developers interested in a specific downstream task), ensuring a more reliable and meaningful assessment of LLM capabilities. We then implement these criteria and conduct evaluations of multiple models, providing insights into the true strengths and limitations of current LLMs.
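To make the multi-prompt setting described in the abstract concrete, here is a minimal sketch of how per-prompt results could be aggregated differently for the two use cases mentioned (LLM developers vs. downstream-task developers). The variable names, the example numbers, and the two aggregations (mean over prompts vs. best prompt per model) are illustrative assumptions, not the paper's exact metrics or implementation.

```python
# Minimal sketch (illustrative only): aggregating multi-prompt evaluation results.
# Assumes scores[model][prompt] holds task accuracy for each (model, prompt) pair;
# in practice these would come from running a benchmark with paraphrased instructions.
import numpy as np

models = ["model_a", "model_b"]
prompts = ["paraphrase_1", "paraphrase_2", "paraphrase_3"]

# Hypothetical accuracies per (model, prompt) pair.
scores = np.array([
    [0.71, 0.64, 0.69],
    [0.66, 0.73, 0.60],
])

# Developer-oriented view: average over diverse prompts (robustness to wording).
mean_over_prompts = scores.mean(axis=1)

# Downstream-user view: best prompt per model (the prompt can be tuned for the task).
max_over_prompts = scores.max(axis=1)

for model, avg, best in zip(models, mean_over_prompts, max_over_prompts):
    print(f"{model}: mean over prompts = {avg:.3f}, best prompt = {best:.3f}")
```

Under this sketch, model rankings can differ between the two aggregations (here, model_a leads on the mean while model_b leads on the best single prompt), which is the kind of brittleness a single-prompt evaluation would hide.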