Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate (2401.16788v1)
Abstract: Despite the utility of LLMs across a wide range of tasks and scenarios, developing a method for reliably evaluating LLMs across varied contexts continues to be challenging. Modern evaluation approaches often use LLMs to assess responses generated by LLMs. However, the meta-evaluation conducted to assess the effectiveness of these LLMs as evaluators is typically constrained by the coverage of existing benchmarks or requires extensive human annotation. This underscores the urgency of methods for scalable meta-evaluation that can effectively, reliably, and efficiently evaluate the performance of LLMs as evaluators across diverse tasks and scenarios, particularly in potentially new, user-defined scenarios. To fill this gap, we propose ScaleEval, an agent-debate-assisted meta-evaluation framework that leverages the capabilities of multiple communicative LLM agents. This framework supports multi-round discussions to assist human annotators in discerning the most capable LLMs as evaluators, which significantly eases their workload in cases that used to require large-scale annotations during meta-evaluation. The code for our framework is publicly available at: https://github.com/GAIR-NLP/scaleeval.
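The abstract describes multi-round discussion among LLM agents that reduces the annotation load on humans, but it does not spell out the protocol. The following is a minimal sketch under stated assumptions, not the ScaleEval implementation: agents vote once per round, a unanimous vote ends the debate, and any case still unresolved after a fixed number of rounds is escalated to a human. The names `debate`, `meta_evaluate`, `ask_human`, and the stub agents are hypothetical and stand in for real LLM calls.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

# Hypothetical agent interface: (question, votes from earlier rounds) -> verdict label.
Agent = Callable[[str, List[List[str]]], str]


def debate(question: str, agents: List[Agent], max_rounds: int = 3) -> Optional[str]:
    """Run up to max_rounds of voting; return the unanimous verdict or None."""
    transcript: List[List[str]] = []
    for _ in range(max_rounds):
        # Each agent sees the votes from all previous rounds before answering.
        votes = [agent(question, transcript) for agent in agents]
        transcript.append(votes)
        if len(set(votes)) == 1:  # assumption: consensus means unanimity
            return votes[0]
    return None


def meta_evaluate(
    question: str,
    agents: List[Agent],
    ask_human: Callable[[str], str],
    max_rounds: int = 3,
) -> Tuple[str, bool]:
    """Return (verdict, needed_human). Humans are consulted only without consensus."""
    verdict = debate(question, agents, max_rounds)
    if verdict is not None:
        return verdict, False
    return ask_human(question), True


# Stub agents standing in for LLM evaluators (illustrative only).
def always_a(question: str, transcript: List[List[str]]) -> str:
    return "A"


def always_b(question: str, transcript: List[List[str]]) -> str:
    return "B"


def follower(question: str, transcript: List[List[str]]) -> str:
    # Starts with "B", then adopts the most common vote of the previous round.
    if not transcript:
        return "B"
    return Counter(transcript[-1]).most_common(1)[0][0]


if __name__ == "__main__":
    print(meta_evaluate("Which response is better?", [always_a, always_a, follower], lambda q: "B"))
    print(meta_evaluate("Which response is better?", [always_a, always_b], lambda q: "B"))
```

In this sketch the human is consulted only when the agents fail to agree, which is one way the annotation workload could shrink; the actual consensus rule and escalation policy are defined in the paper and its repository.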