Dynamic Evaluation of Large Language Models by Meta Probing Agents (2402.14865v2)

Published 21 Feb 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Evaluation of LLMs has raised great concerns in the community due to the issue of data contamination. Existing work designed evaluation protocols using well-defined algorithms for specific tasks, which cannot be easily extended to diverse scenarios. Moreover, current evaluation benchmarks can only provide the overall benchmark results and cannot support a fine-grained and multifaceted analysis of LLMs' abilities. In this paper, we propose meta probing agents (MPA), a general dynamic evaluation protocol inspired by psychometrics to evaluate LLMs. MPA is the key component of DyVal 2, which naturally extends the previous DyVal (Zhu et al., 2023). MPA designs the probing and judging agents to automatically transform an original evaluation problem into a new one following psychometric theory on three basic cognitive abilities: language understanding, problem solving, and domain knowledge. These basic abilities are also dynamically configurable, allowing multifaceted analysis. We conducted extensive evaluations using MPA and found that most LLMs achieve poorer performance, indicating room for improvement. Our multifaceted analysis demonstrated the strong correlation between the basic abilities and an implicit Matthew effect on model size, i.e., larger models possess stronger correlations of the abilities. MPA can also be used as a data augmentation approach to enhance LLMs. Code is available at: https://github.com/microsoft/promptbench.

Summary

The paper introduces the MPA framework, which dynamically generates evaluation samples and reveals a drop of approximately 15.7% in GPT-4-Turbo's performance on MMLU under MPA evaluation.
It employs probing and judge agents grounded in psychometrics to assess three basic abilities: language understanding, problem solving, and domain knowledge.
The framework enables robust data augmentation that enhances LLM performance through fine-tuning on dynamically generated samples.

Dynamic Evaluation with Meta Probing Agents: A Critical Overview

The paper "Dynamic Evaluation of Large Language Models by Meta Probing Agents" presents Meta Probing Agents (MPA), a dynamic evaluation protocol for LLMs and the key component of DyVal 2. The framework is inspired by psychometrics and addresses two challenges in LLM evaluation: data contamination and the lack of multifaceted analysis of model capabilities.

Core Contributions and Methodology

The primary contribution of the paper is the MPA framework, which differs from traditional evaluation methods by generating evaluation samples dynamically. Static benchmarks are vulnerable to data contamination: once their questions appear in training data, scores can reflect memorization rather than ability. Because MPA produces new variants of each problem, it is less exposed to this risk and supports a more fine-grained analysis. This matters for LLMs, whose capabilities are hard to characterize given their scale and the breadth of their training data.

MPA operates through two central components: probing agents and judge agents. The probing agents, based on various psychometrically inspired principles, transform existing evaluation problems into new ones, focusing on core cognitive abilities such as language understanding, problem-solving, and domain knowledge. The judge agents, in turn, validate this transformation to ensure the new evaluations maintain consistency with the original tasks. This agent-based design allows for nuanced benchmarking where the LLMs' multifaceted cognitive capabilities can be assessed and analyzed.
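The probe-then-judge loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `Question`, `probe_with_judge`, the callable signatures, and the retry limit are all hypothetical names chosen here, and in practice both agents would be LLM calls rather than plain functions.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Question:
    text: str
    answer: str


# Hypothetical signatures: a probing agent rewrites a question, and a judge
# agent returns True when the rewrite is still consistent with the original.
ProbeFn = Callable[[Question], Question]
JudgeFn = Callable[[Question, Question], bool]


def probe_with_judge(
    original: Question,
    probe: ProbeFn,
    judge: JudgeFn,
    max_attempts: int = 3,  # assumed retry budget, not taken from the paper
) -> Optional[Question]:
    """Return the first probed question the judge accepts, or None."""
    for _ in range(max_attempts):
        candidate = probe(original)
        if judge(original, candidate):
            return candidate
    return None


# Toy stand-ins for the two agents, used only to exercise the loop.
def toy_probe(q: Question) -> Question:
    return Question(text="Rephrased: " + q.text, answer=q.answer)


def toy_judge(original: Question, candidate: Question) -> bool:
    return candidate.answer == original.answer and candidate.text != original.text


new_question = probe_with_judge(Question("What is 2 + 3?", "5"), toy_probe, toy_judge)
```

Returning `None` when no candidate passes is one possible design choice: it lets the caller drop or keep the original item instead of evaluating on an unverified rewrite.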

Empirical Findings

The paper reports empirical results from evaluating several prominent LLMs, including GPT-4-Turbo, GPT-3.5-Turbo, and Gemini-Pro, on both traditional benchmarks and their MPA-generated counterparts. Performance drops significantly on the dynamic benchmarks, suggesting that scores on static benchmarks may be inflated by data contamination. For instance, GPT-4-Turbo's performance on the MMLU dataset decreases by approximately 15.7% under MPA evaluation, indicating room for improvement.

The authors also analyzed the individual probing principles and found that both language understanding and problem solving contribute substantially to the performance decline. When the principles were applied in combination, more complex configurations produced broader performance degradation across models.
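To make "combinations of principles" concrete, the sketch below enumerates every non-empty configuration of a set of principles, from single principles to the full set. It is an illustration only: the names in `PRINCIPLES` reuse the three ability labels as placeholders, treating each as a single on/off switch is an assumption, and `principle_configurations` is a hypothetical helper rather than part of the paper's code.

```python
from itertools import combinations

# Placeholder names: the three basic abilities stand in for the paper's
# probing principles, which this summary does not list individually.
PRINCIPLES = ("language_understanding", "problem_solving", "domain_knowledge")


def principle_configurations(principles):
    """Return every non-empty combination, ordered from simplest to most complex."""
    configs = []
    for size in range(1, len(principles) + 1):
        configs.extend(combinations(principles, size))
    return configs


for config in principle_configurations(PRINCIPLES):
    print(len(config), config)
```

With n principles this yields 2^n - 1 configurations, so an exhaustive sweep is only practical when the set of principles is small.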

Theoretical and Practical Implications

On the theoretical side, the paper sheds light on the structure of LLMs' cognitive abilities, which the findings show to be strongly correlated with one another. The analysis also reveals an implicit "Matthew effect" on model size: larger models exhibit stronger correlations among these abilities. Such fine-grained evaluation can inform the design of future LLMs.
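One common way to quantify a correlation between two abilities is the Pearson coefficient over paired scores. The sketch below assumes that measure; this summary does not state which statistic the paper uses. The score lists are invented solely to exercise the function and are not results from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Made-up accuracies for one hypothetical model on four benchmarks,
# each probed for a different ability. Illustrative values only.
language_understanding = [0.62, 0.71, 0.55, 0.80]
problem_solving = [0.58, 0.69, 0.50, 0.77]

r = pearson(language_understanding, problem_solving)
```

Computing `r` separately for each model and comparing the values across model sizes is how a trend like the one described above could be checked.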

Practically, MPA serves not only as an evaluation tool but also as a data augmentation method, which the paper demonstrates by fine-tuning GPT-3.5-Turbo. Training on MPA-generated samples improved model performance, indicating that MPA can help build more robust training datasets for future model iterations.

Future Directions and Limitations

Future research should explore incorporating a wider array of evaluation tasks to provide an even more comprehensive understanding of LLMs' capabilities. Additionally, while the utilization of sophisticated agents contributes significantly to MPA's robustness, there remains room to improve the alignment between generated and original questions to minimize inconsistencies.

In conclusion, MPA is a useful tool for both model assessment and model development. By grounding the evaluation process in psychometric theory, the paper enables a more structured analysis of LLM capabilities.
