Beyond Probabilities: Unveiling the Misalignment in Evaluating Large Language Models (2402.13887v2)
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of applications, fundamentally reshaping the landscape of NLP research. However, recent evaluation frameworks often rely on a model's output probabilities to make predictions, primarily for computational efficiency, which diverges from how LLMs are used in real-world scenarios. Although widely adopted, the validity of these probability-based evaluation strategies remains an open research question. This study scrutinizes such methods in the context of using LLMs for multiple-choice questions (MCQs) and highlights their inherent limitations. Our empirical investigation shows that the prevalent probability-based evaluation method aligns poorly with generation-based prediction. These findings deepen the understanding of LLM evaluation methodologies and offer insights for future research in this domain.
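To make the contrast concrete, below is a minimal sketch of the two MCQ evaluation styles the abstract describes: scoring option letters by next-token probability versus letting the model generate an answer and parsing it. This assumes a HuggingFace causal LM; the checkpoint name, prompt template, and letter-matching parser are illustrative assumptions, not the paper's exact protocol.

```python
# Sketch: probability-based vs. generation-based MCQ prediction.
# The checkpoint, prompt format, and answer parser are illustrative
# assumptions; they are not the models or protocol used in the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder checkpoint, not one of the paper's models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = (
    "Question: Which planet is closest to the Sun?\n"
    "A. Venus\nB. Mercury\nC. Mars\nD. Earth\n"
    "Answer:"
)
options = ["A", "B", "C", "D"]
inputs = tokenizer(prompt, return_tensors="pt")

# Probability-based prediction: compare the next-token probability of each
# option letter and take the argmax, without generating any text.
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]
# Assumes each " A"/" B"/... maps to a single token (true for GPT-2's BPE).
option_ids = [tokenizer.encode(" " + o)[0] for o in options]
prob_pred = options[int(torch.argmax(next_token_logits[option_ids]))]

# Generation-based prediction: greedily generate a short continuation and
# parse out the first option letter it mentions (a simple heuristic parser).
with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=16,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
completion = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:])
gen_pred = next((o for o in options if o in completion), None)

print(f"probability-based: {prob_pred}, generation-based: {gen_pred}")
```

Under this sketch, the two predictions need not agree: the letter with the highest next-token probability can differ from the answer the model actually writes out, which is the kind of misalignment between probability-based and generation-based evaluation that the paper examines.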