Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference (2403.04132v1)
Abstract: LLMs have unlocked new capabilities and applications; however, evaluating their alignment with human preferences still poses significant challenges. To address this issue, we introduce Chatbot Arena, an open platform for evaluating LLMs based on human preferences. Our methodology employs a pairwise comparison approach and leverages input from a diverse user base through crowdsourcing. The platform has been operational for several months, amassing over 240K votes. This paper describes the platform, analyzes the data we have collected so far, and explains the statistical methods we use for efficient and accurate evaluation and ranking of models. We confirm that the crowdsourced questions are sufficiently diverse and discriminating and that the crowdsourced human votes are in good agreement with those of expert raters. These analyses collectively establish a robust foundation for the credibility of Chatbot Arena. Because of its unique value and openness, Chatbot Arena has emerged as one of the most referenced LLM leaderboards, widely cited by leading LLM developers and companies. Our demo is publicly available at \url{https://chat.lmsys.org}.
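To make the ranking methodology concrete, below is a minimal sketch of how pairwise crowdsourced votes can be turned into model scores under a Bradley-Terry model, fit by simple gradient ascent on the log-likelihood. The model names, votes, and hyperparameters are made up for illustration, and the paper's actual estimator and confidence-interval procedure may differ.

```python
import numpy as np

# Minimal sketch: fit Bradley-Terry log-strengths from pairwise votes.
# Each vote is (model_a, model_b, winner), where winner == 0 means
# model_a won and winner == 1 means model_b won. Ties are omitted here.
def fit_bradley_terry(votes, models, lr=0.1, iters=2000):
    idx = {m: i for i, m in enumerate(models)}
    theta = np.zeros(len(models))  # log-strength per model

    for _ in range(iters):
        grad = np.zeros_like(theta)
        for a, b, winner in votes:
            i, j = idx[a], idx[b]
            # P(model_a beats model_b) under Bradley-Terry
            p_a = 1.0 / (1.0 + np.exp(theta[j] - theta[i]))
            # gradient of the log-likelihood contributed by this vote
            g = (1.0 - p_a) if winner == 0 else -p_a
            grad[i] += g
            grad[j] -= g
        theta += lr * grad / len(votes)
        theta -= theta.mean()  # fix the overall scale (identifiability)

    # report on an Elo-like scale for readability
    return {m: 1000 + 400 * theta[idx[m]] / np.log(10) for m in models}

# Hypothetical models and votes, purely for illustration.
models = ["model-x", "model-y", "model-z"]
votes = [
    ("model-x", "model-y", 0),
    ("model-y", "model-z", 0),
    ("model-x", "model-z", 0),
    ("model-y", "model-x", 1),
]
print(fit_bradley_terry(votes, models))
```

In practice the same pairwise data also supports uncertainty estimates (e.g., via bootstrap resampling of the votes), which is what allows a leaderboard to report confidence intervals alongside the point ratings.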