Benchmarking LLMs via Uncertainty Quantification (2401.12794v3)
Abstract: The proliferation of open-source LLMs from various institutions has highlighted the urgent need for comprehensive evaluation methods. However, current evaluation platforms, such as the widely recognized HuggingFace open LLM leaderboard, neglect a crucial aspect -- uncertainty, which is vital for thoroughly assessing LLMs. To bridge this gap, we introduce a new benchmarking approach for LLMs that integrates uncertainty quantification. Our examination involves nine LLMs (LLM series) spanning five representative natural language processing tasks. Our findings reveal that: I) LLMs with higher accuracy may exhibit lower certainty; II) Larger-scale LLMs may display greater uncertainty compared to their smaller counterparts; and III) Instruction-finetuning tends to increase the uncertainty of LLMs. These results underscore the significance of incorporating uncertainty in the evaluation of LLMs.
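Below is a minimal sketch of one standard way an uncertainty measure of this kind can be computed, assuming a split conformal prediction setup over multiple-choice answer options. The scoring function, coverage level `alpha`, helper names, and calibration protocol are illustrative assumptions for this sketch, not the paper's exact procedure; the average size of the resulting prediction sets acts as the uncertainty signal, which is how a model can be accurate yet still uncertain.

```python
# Sketch: split conformal prediction for multiple-choice answers.
# Illustrative only -- the exact scores and calibration used in the paper are
# not stated in the abstract; this follows a common LAC-style recipe.
import numpy as np

def conformal_quantile(cal_probs, cal_labels, alpha=0.1):
    """Calibrate a score threshold on held-out examples.

    cal_probs : (n, k) softmax probabilities over k answer options
    cal_labels: (n,) indices of the correct options
    """
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]   # nonconformity scores
    level = np.ceil((n + 1) * (1 - alpha)) / n           # finite-sample correction
    return np.quantile(scores, min(level, 1.0), method="higher")

def prediction_set(test_probs, q_hat):
    """Return, per question, the answer options whose score is below the threshold."""
    return [np.where(1.0 - p <= q_hat)[0] for p in test_probs]

# Toy example: larger average set size = higher uncertainty, independent of accuracy.
rng = np.random.default_rng(0)
cal_p = rng.dirichlet(np.ones(4), size=500)   # stand-in model probabilities, 4 options
cal_y = rng.integers(0, 4, size=500)
q_hat = conformal_quantile(cal_p, cal_y, alpha=0.1)
sets = prediction_set(rng.dirichlet(np.ones(4), size=5), q_hat)
print(q_hat, [len(s) for s in sets])
```

The finite-sample correction in the quantile step is what gives conformal prediction its distribution-free coverage guarantee, so average set size can be compared across models of different scales without assuming their probabilities are calibrated.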