A Survey on Evaluation of Large Language Models

Published 6 Jul 2023 in cs.CL and cs.AI | (2307.03109v9)

Abstract: LLMs are gaining increasing popularity in both academia and industry, owing to their unprecedented performance in various applications. As LLMs continue to play a vital role in both research and daily use, their evaluation becomes increasingly critical, not only at the task level, but also at the society level for better understanding of their potential risks. Over the past years, significant efforts have been made to examine LLMs from various perspectives. This paper presents a comprehensive review of these evaluation methods for LLMs, focusing on three key dimensions: what to evaluate, where to evaluate, and how to evaluate. Firstly, we provide an overview from the perspective of evaluation tasks, encompassing general natural language processing tasks, reasoning, medical usage, ethics, educations, natural and social sciences, agent applications, and other areas. Secondly, we answer the where' andhow' questions by diving into the evaluation methods and benchmarks, which serve as crucial components in assessing performance of LLMs. Then, we summarize the success and failure cases of LLMs in different tasks. Finally, we shed light on several future challenges that lie ahead in LLMs evaluation. Our aim is to offer invaluable insights to researchers in the realm of LLMs evaluation, thereby aiding the development of more proficient LLMs. Our key point is that evaluation should be treated as an essential discipline to better assist the development of LLMs. We consistently maintain the related open-source materials at: https://github.com/MLGroupJLU/LLM-eval-survey.

Abstract PDF Upgrade to Chat

Citations (1,009)

View on Semantic Scholar

Summary

The paper systematically categorizes LLM evaluation methods into key dimensions: tasks, datasets, and evaluation criteria.
The paper demonstrates that LLMs perform well in natural language tasks but face challenges with abstract reasoning and non-Latin scripts.
The paper outlines grand challenges and future research directions for developing dynamic, robust, and unified evaluation frameworks.

A Comprehensive Overview of LLM Evaluation Methodologies

The paper "A Survey on Evaluation of LLMs" (2307.03109) presents a systematic review of evaluation methods for LLMs, categorizing them along three dimensions: what, where, and how to evaluate. It argues for the importance of treating LLM evaluation as an essential discipline to better assist their development.

Figure 1: The evaluation process of AI models.

What to Evaluate: Tasks for LLMs

The survey categorizes evaluation tasks into several key areas, including natural language processing, robustness, ethics, biases, trustworthiness, social sciences, natural sciences, engineering, medical applications, and agent applications.

Natural Language Processing

This category encompasses tasks related to both natural language understanding (NLU) and natural language generation (NLG). NLU includes sentiment analysis, text classification, and natural language inference, while NLG covers summarization, dialogue generation, translation, and question answering. Factuality, multilingual capabilities are also explored. The paper highlights that while LLMs demonstrate commendable performance in sentiment analysis tasks, future work should focus on enhancing their capability to understand emotions in under-resourced languages.

Robustness, Ethics, Biases, and Trustworthiness

This section addresses the critical aspects of LLM performance beyond mere task completion. It covers robustness against adversarial inputs, ethical considerations, biases, and overall trustworthiness. The survey points out that existing LLMs have been found to internalize, spread, and potentially magnify harmful information existing in the crawled training corpora, usually, toxic languages, like offensiveness, hate speech, and insults as well as social biases.

The paper further explores the application and evaluation of LLMs in various specialized domains. In social science, the focus is on tasks related to economics, sociology, political science, and law. Natural science and engineering evaluations focus on mathematics, general science, and various engineering disciplines. In medical applications, evaluations are categorized into medical queries, medical examinations, and medical assistants. Agent applications explore the use of LLMs as agents equipped with external tools, greatly expanding the capabilities of the model.

Where to Evaluate: Datasets and Benchmarks

The survey provides a comprehensive overview of existing LLM evaluation benchmarks. These benchmarks are categorized into general benchmarks, specific benchmarks, and multi-modal benchmarks. General benchmarks, such as MMLU and C-Eval, are designed to evaluate overall performance across multiple tasks. Specific benchmarks, like MATH and APPS, focus on evaluating performance in specific domains. Multi-modal benchmarks, such as MME and MMBench, are designed for evaluating multi-modal LLMs.

How to Evaluate: Evaluation Criteria

The "How to Evaluate" section discusses the evaluation criteria used for assessing LLMs, which are divided into automatic evaluation and human evaluation. Automatic evaluation uses standard metrics and evaluation tools, whereas human evaluation involves human participation to evaluate the quality and accuracy of model-generated results.

Summary of Success and Failure Cases

Based on existing evaluation efforts, the paper summarizes the success and failure cases of LLMs in different tasks. LLMs generally perform well in generating text, language understanding, arithmetic reasoning, and contextual comprehension. However, they often struggle with NLI, semantic understanding, abstract reasoning, and tasks involving non-Latin scripts and limited resources.

Grand Challenges and Future Research

The survey concludes by outlining several grand challenges and opportunities for future research in LLM evaluation. These challenges include designing AGI benchmarks, developing complete behavioral evaluation methods, enhancing robustness evaluation, creating dynamic and evolving evaluation systems, developing principled and trustworthy evaluation methods, creating unified evaluation systems that support all LLM tasks, and going beyond evaluation to enhance LLMs. The central argument is that evaluation should be treated as an essential discipline to drive the success of LLMs and other AI models.

Conclusion

The paper offers a comprehensive survey of the current state of LLM evaluation, providing a valuable resource for researchers and practitioners in the field. The paper emphasizes the need for continuous development of evaluation methodologies to ensure the responsible and effective advancement of LLMs.

Markdown

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Paper Prompts

Top Community Prompts

Explain it Like I'm 14

off on

Knowledge Gaps

off on

Practical Applications

off on

Glossary

off on

Conceptual Simplification

off on

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Generate Now

Continue Learning

Authors (16)

First 10 authors:

Collections

GitHub

GitHub - MLGroupJLU/LLM-eval-survey: The official GitHub page for the survey paper "A Survey on Evaluation of Large Language Models". (1,286 stars)

Tweets

YouTube

Show All Videos

A Survey on Evaluation of Large Language Models

Summary

A Comprehensive Overview of LLM Evaluation Methodologies

What to Evaluate: Tasks for LLMs

Natural Language Processing

Robustness, Ethics, Biases, and Trustworthiness

Where to Evaluate: Datasets and Benchmarks

How to Evaluate: Evaluation Criteria

Summary of Success and Failure Cases

Grand Challenges and Future Research

Conclusion

Paper to Video (Beta)

Whiteboard

Paper Prompts

Top Community Prompts

Open Problems

Continue Learning

Authors (16)

Collections

GitHub

Tweets

YouTube

Don't miss out on important new AI/ML research

Sign up for free to explore the frontiers of research

A Survey on Evaluation of Large Language Models

Summary

A Comprehensive Overview of LLM Evaluation Methodologies

What to Evaluate: Tasks for LLMs

Natural Language Processing

Robustness, Ethics, Biases, and Trustworthiness

Social Science, Natural Science and Engineering, Medical Applications, and Agent Applications

Where to Evaluate: Datasets and Benchmarks

How to Evaluate: Evaluation Criteria

Summary of Success and Failure Cases

Grand Challenges and Future Research

Conclusion

Paper to Video (Beta)

Whiteboard

Paper Prompts

Top Community Prompts

Open Problems

Continue Learning

Related Papers

Authors (16)

Collections

GitHub

Tweets

YouTube

Don't miss out on important new AI/ML research

Sign up for free to explore the frontiers of research