DARG: Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement

Abstract

The current paradigm of evaluating LLMs through static benchmarks comes with significant limitations, such as vulnerability to data contamination and a lack of adaptability to the evolving capabilities of LLMs. Therefore, evaluation methods that can adapt and generate evaluation data with controlled complexity are urgently needed. In this work, we introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity. Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb these graphs to generate novel test data. The newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks. We further use a code-augmented LLM to ensure the label correctness of the newly generated data. We apply the DARG framework to reasoning tasks in four domains with 15 state-of-the-art LLMs. Experimental results show that almost all LLMs experience a performance decrease with increased complexity, and certain LLMs exhibit significant drops. Additionally, we find that LLMs exhibit more biases when evaluated on data generated by DARG at higher complexity levels. These observations provide useful insights into how to dynamically and adaptively evaluate LLMs. The code is available at https://github.com/SALT-NLP/DARG.

Figure: Proposed DARG framework for constructing reasoning graphs, augmenting benchmarks, and verifying label correctness.

Overview

  • The DARG framework offers a dynamic evaluation method for LLMs by generating controlled and diverse test data through reasoning graph extraction and perturbation.

  • Key innovations include Reasoning Graph Extraction, Graph Perturbation, and Graph-to-Text Decoding, which collectively provide a sophisticated mechanism for evaluating LLMs across various complexity levels.

  • Findings indicate that increased data complexity leads to performance degradation and bias amplification in LLMs, while larger models and those with Mixture of Experts architectures show better resilience.

Dynamic Evaluation of LLMs via Adaptive Reasoning Graph

The static evaluation paradigm for LLMs is rapidly becoming inadequate due to inherent limitations such as data contamination and a lack of alignment with the evolving capabilities of LLMs. In this context, the paper presents DARG (Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement), a novel framework designed to overcome these limitations by generating test data with controlled complexity and diversity.

Methodology

The DARG framework introduces several key innovations:

  1. Reasoning Graph Extraction: The framework begins by extracting reasoning graphs from existing benchmark data points. Nodes in these graphs represent basic reasoning units, while edges capture the relationships or operations between them. This step leverages the in-context learning (ICL) capabilities of LLMs to produce accurate graph representations.
  2. Graph Perturbation: The reasoning graphs are then perturbed to introduce varying degrees of complexity. Perturbations can modify numerical values as well as structural properties such as the graph's width and depth, yielding new data points whose complexity is controlled while their linguistic diversity stays close to that of the original benchmarks (a minimal sketch follows this list).
  3. Graph-to-Text Decoding: The perturbed graphs are translated back into natural language by an LLM, using exemplars to preserve the linguistic style and coherence of the original data. Because the decoding LLM may hallucinate, the generated text then undergoes strict label verification by a code-augmented LLM agent.
  4. Evaluation and Analysis: The framework was applied to four reasoning tasks: mathematical reasoning (GSM8K), social reasoning (BBQ), spatial reasoning (BBH Navigate), and symbolic reasoning (BBH Dyck Language), with 15 state-of-the-art LLMs. Performance was evaluated across the complexity dimensions introduced by DARG, highlighting the models' susceptibility to increased complexity.
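To make the pipeline concrete, the following is a minimal, illustrative sketch rather than the authors' implementation; the names `Node`, `ReasoningGraph`, `perturb_width`, and `solve` are assumptions made for this example. It represents a GSM8K-style reasoning graph explicitly, widens it with one extra quantity, and re-solves it in code, which is the spirit of the code-augmented label verification.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    """A basic reasoning unit: a named quantity that is either given (leaf)
    or computed from its parent nodes with a simple arithmetic operation."""
    name: str
    value: Optional[float] = None   # set for leaf nodes
    op: Optional[str] = None        # "+" or "*" applied over the parents' values
    parents: list = field(default_factory=list)

@dataclass
class ReasoningGraph:
    nodes: dict                     # name -> Node
    answer: str                     # name of the node holding the final answer

    def solve(self) -> float:
        """Evaluate the graph bottom-up. Re-solving the perturbed graph in code
        stands in for the label check done by the code-augmented agent."""
        def value_of(name: str) -> float:
            node = self.nodes[name]
            if node.value is not None:
                return node.value
            vals = [value_of(p) for p in node.parents]
            if node.op == "+":
                return float(sum(vals))
            if node.op == "*":
                out = 1.0
                for v in vals:
                    out *= v
                return out
            raise ValueError(f"unsupported op: {node.op!r}")
        return value_of(self.answer)

def perturb_width(graph: ReasoningGraph, name: str, value: float) -> ReasoningGraph:
    """One illustrative structural perturbation: widen the graph by attaching an
    extra given quantity as an additive contribution to the answer node."""
    graph.nodes[name] = Node(name=name, value=value)
    graph.nodes[graph.answer].parents.append(name)
    return graph

# Toy GSM8K-style problem: "3 boxes hold 4 apples each; how many apples in total?"
graph = ReasoningGraph(
    nodes={
        "boxes": Node("boxes", value=3),
        "apples_per_box": Node("apples_per_box", value=4),
        "boxed_apples": Node("boxed_apples", op="*", parents=["boxes", "apples_per_box"]),
        "total": Node("total", op="+", parents=["boxed_apples"]),
    },
    answer="total",
)
print(graph.solve())   # 12.0 -- label of the original data point

# Width perturbation: add 5 loose apples, then recompute the label programmatically.
perturb_width(graph, "loose_apples", 5)
print(graph.solve())   # 17.0 -- verified label for the new, more complex data point
```

In the actual framework the graph is extracted from, and decoded back into, natural language by an LLM; the point of the sketch is only that a perturbed graph stays machine-checkable, so the label of each generated data point can be recomputed rather than trusted.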

Key Findings

The application of DARG yielded significant insights:

  • Performance Degradation: As expected, almost all evaluated LLMs showed a decrease in performance with the increasing complexity of the generated test data. For instance, GPT-4 Turbo, although performing impressively on static benchmarks such as GSM8K, exhibited a substantial performance drop in more complex scenarios, underscoring the limitations of static benchmarking in capturing the true reasoning capabilities of LLMs.
  • Bias Amplification: On social reasoning tasks such as BBQ, higher-complexity data generated by DARG revealed an increase in biases, particularly against protected groups. LLMs like GPT-4 Turbo and Gemini-1.5-Pro exhibited heightened sensitivity and bias, choosing the "Cannot be determined" option even when the context clearly determined the answer, suggesting an over-alignment to ethical guidelines at the cost of accuracy (an illustrative scoring sketch follows this list).
  • Model Size and Resilience: Larger models and those employing Mixture of Experts (MoE) architectures demonstrated better resilience to complexity increases; for example, Mixtral-8×22B outperformed dense models of comparable size, suggesting that improved architectures and scaling help on complex reasoning tasks.
  • Training Data Utility: The paper also explored training models on DARG-generated data, showing that models fine-tuned on it handled increased complexity better than those fine-tuned on the original benchmark data. This highlights DARG's potential not just for evaluation but also for improving LLMs.
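To make the over-abstention observation measurable, here is a small, hypothetical scoring sketch; it is not the BBQ bias score or the paper's metric, and `Item`, `evaluation_summary`, and the field names are assumptions. It tracks accuracy alongside how often a model picks "Cannot be determined" on items whose context actually determines the answer.

```python
from dataclasses import dataclass

@dataclass
class Item:
    """A BBQ-style multiple-choice item; the schema is illustrative."""
    gold: str          # correct option, e.g. "The grandfather"
    prediction: str    # option chosen by the model
    answerable: bool   # True if the context disambiguates the answer

UNKNOWN = "Cannot be determined"

def evaluation_summary(items: list) -> dict:
    """Accuracy plus the rate of 'Cannot be determined' picks on answerable items,
    i.e. the over-abstention behavior reported on higher-complexity DARG data."""
    answerable = [it for it in items if it.answerable]
    correct = sum(it.prediction == it.gold for it in items)
    over_abstain = sum(it.prediction == UNKNOWN and it.gold != UNKNOWN
                       for it in answerable)
    return {
        "accuracy": correct / len(items) if items else 0.0,
        "over_abstention_rate": over_abstain / len(answerable) if answerable else 0.0,
    }

# Toy usage: three items, one over-abstention on an answerable item.
items = [
    Item(gold="The grandfather", prediction="The grandfather", answerable=True),
    Item(gold="The grandson", prediction=UNKNOWN, answerable=True),
    Item(gold=UNKNOWN, prediction=UNKNOWN, answerable=False),
]
print(evaluation_summary(items))   # {'accuracy': 0.666..., 'over_abstention_rate': 0.5}
```

A rising over-abstention rate at higher complexity, alongside falling accuracy, is one concrete way to operationalize the behavior described in the bullet above.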

Implications and Future Directions

The DARG framework provides a more nuanced and reliable assessment tool for LLM capabilities, offering several practical and theoretical implications:

  • Dynamic Benchmarking: Transitioning to dynamic evaluation methods like DARG can offer a more accurate measure of an LLM's reasoning abilities, ensuring that benchmarks evolve in tandem with model capabilities.
  • Bias and Fairness: The controlled perturbations in DARG can uncover latent biases in LLMs, providing researchers with valuable insights to inform the development of fairer and more ethical AI systems.
  • Model Improvement: The utility of DARG-generated data for training indicates a promising direction for developing more robust model architectures capable of handling diverse and complex reasoning tasks.

Future research could extend the DARG framework to other domains beyond reasoning tasks, exploring its potential in natural language understanding and generation tasks. Additionally, further refinement of graph extraction and perturbation methods, possibly incorporating open-source models, could enhance the versatility and accessibility of DARG.

Conclusion

The DARG framework represents a significant advance in the dynamic evaluation of LLMs, addressing critical limitations of static benchmarks. By providing a means to generate controlled, diverse, and complex evaluation data, DARG offers a more accurate and comprehensive measure of LLM capabilities, thereby paving the way for further advancements in AI research and development.
