AutoBencher: Towards Declarative Benchmark Construction (2407.08351v2)

Published 11 Jul 2024 in cs.CL and cs.LG

Abstract: We present AutoBencher, a declarative framework for automatic benchmark construction, and use it to scalably discover novel insights and vulnerabilities of existing LLMs. Concretely, given a few desiderata of benchmarks (e.g., question difficulty, topic salience), we operationalize each desideratum and cast benchmark creation as an optimization problem. Specifically, we experiment with two settings with different optimization objectives: (i) for capability evaluation, we declare the goal of finding a salient, difficult dataset that induces novel performance patterns; (ii) for safety evaluation, we declare the goal of finding a dataset of unsafe prompts that existing LMs fail to decline. To tackle this optimization problem, we use a LLM to iteratively propose and refine dataset descriptions, which are then used to generate topic-specific questions and answers. These descriptions are optimized to improve the declared desiderata. We use AutoBencher (powered by GPT-4) to create datasets for math, multilinguality, knowledge, and safety. The scalability of AutoBencher allows it to test fine-grained categories and tail knowledge, creating datasets that elicit 22% more model errors (i.e., difficulty) than existing benchmarks. On the novelty ends, AutoBencher also helps identify specific gaps not captured by existing benchmarks: e.g., Gemini-Pro has knowledge gaps on Permian Extinction and Fordism while GPT-4o fails to decline harmful requests about cryptocurrency scams.

Citations (5)

Summary

  • The paper introduces AutoBencher, an automated system that creates evaluation benchmarks optimized for salience, novelty, and difficulty.
  • It employs a metric-driven adaptive search and re-ranking process to achieve a 27% increase in novelty and 22% higher difficulty over traditional benchmarks.
  • This scalable framework pinpoints specific model weaknesses, streamlining improvements and guiding future developments in language model evaluation.

AutoBencher: Creating Salient, Novel, Difficult Datasets for LLMs

The paper "AutoBencher: Creating Salient, Novel, Difficult Datasets for LLMs" by Xiang Lisa Li et al. introduces AutoBencher, an automated system designed to generate evaluation benchmarks for LLMs that satisfy three critical desiderata: salience, novelty, and difficulty. Benchmarking in LLMs is crucial for evaluating model performance, discerning trends, and guiding the development of future models. Traditional benchmarks often fail to account for emerging model weaknesses, revealing a pressing need for adaptive and rigorous benchmarking strategies. This paper addresses such gaps through a metric-driven search algorithm employing LLMs to propose and create datasets meeting the specified desiderata.

Salience, Novelty, and Difficulty

The paper elaborates on the following key metrics:

  1. Salience: A benchmark should test practically important capabilities, such as performance on widely recognized historical events like World War II.
  2. Novelty: A benchmark should reveal new trends in model performance, distinguishing models in ways existing benchmarks do not.
  3. Difficulty: The benchmark should pose significant challenges to current models, leaving room for future improvements.

By formalizing these properties, the authors transform benchmark creation into an optimization problem, resolved through a search for datasets that balance all three requirements.
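
To make the optimization framing concrete, the following sketch shows one way a candidate dataset could be scored against the three desiderata. The salience floor, the weights, and the field names are illustrative assumptions for exposition, not the paper's exact formulation.

```python
# Illustrative scoring of a candidate dataset against the three desiderata.
# The threshold and weights below are assumptions, not the paper's values.
from dataclasses import dataclass


@dataclass
class DatasetCandidate:
    topic: str
    salience: float    # how practically important the topic is, in [0, 1]
    novelty: float     # how much model rankings deviate from existing benchmarks
    difficulty: float  # e.g., mean error rate of the evaluated models


def benchmark_score(c: DatasetCandidate,
                    min_salience: float = 0.5,
                    w_novelty: float = 0.5,
                    w_difficulty: float = 0.5) -> float:
    """Require a salience floor, then trade off novelty against difficulty."""
    if c.salience < min_salience:
        return float("-inf")  # reject topics too obscure to be worth testing
    return w_novelty * c.novelty + w_difficulty * c.difficulty


candidates = [
    DatasetCandidate("World War II logistics", salience=0.9, novelty=0.3, difficulty=0.4),
    DatasetCandidate("Permian Extinction", salience=0.6, novelty=0.7, difficulty=0.8),
]
best = max(candidates, key=benchmark_score)  # picks the Permian Extinction topic
```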

AutoBencher Framework

AutoBencher leverages LLMs to automate the creation of datasets, drawing on privileged information (sources available to the question generator but not assumed available to the evaluated models) to keep answers accurate while allowing questions to be difficult. The process involves the following steps:

  1. Dataset Construction: The system generates (question, answer) pairs using privileged information like Wikipedia articles for knowledge-intensive questions, translation systems for multilingual questions, and mathematical libraries for math questions. This ensures answers are accurate and provides grounding in reliable sources.
  2. Adaptive Search: AutoBencher performs iterative searches, using a history of proposed topics and their difficulties to guide subsequent topic proposals. This adaptive mechanism aims to iteratively enhance the difficulty and novelty of proposed evaluation topics.
  3. Re-Ranking for Final Selection: After generating datasets, topics are re-ranked based on salience, difficulty, and novelty. This ensures that the final chosen benchmark maximizes the overall objective function.
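
The overall procedure can be pictured as a simple propose-construct-evaluate-rerank loop. The sketch below shows this control flow only; the callables it receives (propose_topics, build_dataset, score_dataset) are hypothetical placeholders for the LM-driven components, not the paper's implementation or API.

```python
# Control-flow sketch of an AutoBencher-style search loop.  The injected
# callables are hypothetical stand-ins for the LM-driven components.
def autobencher_search(propose_topics, build_dataset, score_dataset,
                       n_iterations=5, topics_per_round=10):
    """Iteratively propose topics, build grounded QA datasets for them,
    score each dataset on the declared desiderata, and return the best."""
    history = []  # (topic, dataset, scores) for every candidate explored so far

    for _ in range(n_iterations):
        # Propose new topic descriptions conditioned on prior results, so the
        # proposer can steer toward regions that looked difficult or novel.
        for topic in propose_topics(history, k=topics_per_round):
            # Construct (question, answer) pairs grounded in privileged
            # information (e.g., a Wikipedia article, a translation system,
            # or a math library) so reference answers are reliable.
            dataset = build_dataset(topic)

            # Evaluate candidate models on the dataset and compute the
            # desiderata (salience, novelty, difficulty, combined score).
            scores = score_dataset(dataset)
            history.append((topic, dataset, scores))

    # Re-rank every explored candidate by the combined objective and return
    # the dataset that best satisfies the declared desiderata.
    return max(history, key=lambda item: item[2]["combined"])
```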

Experimental Results

AutoBencher was evaluated against existing, human-constructed benchmarks across several domains—history, economics, science, mathematics, and multilingual question answering. The system demonstrated significant enhancements in both novelty and difficulty:

  • Novelty Increase: AutoBencher-produced datasets showed a 27% improvement in revealing new model performance trends compared to human-constructed datasets.
  • Difficulty Increase: Datasets generated by AutoBencher exhibited 22% higher difficulty, challenging even state-of-the-art LLMs.
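
The headline numbers depend on how novelty and difficulty are quantified. As one plausible instantiation (an assumption for illustration, not the paper's exact definitions), difficulty can be taken as the mean error rate across evaluated models, and novelty as the degree to which model rankings on the new dataset disagree with rankings on an existing benchmark:

```python
# One plausible way to quantify difficulty and novelty from model accuracies.
# These definitions are illustrative assumptions, not the paper's exact metrics.
from scipy.stats import spearmanr


def difficulty(accuracies: dict[str, float]) -> float:
    """Mean error rate across evaluated models (higher means harder)."""
    return 1.0 - sum(accuracies.values()) / len(accuracies)


def novelty(new_acc: dict[str, float], existing_acc: dict[str, float]) -> float:
    """How differently models rank on the new dataset vs. an existing benchmark."""
    models = sorted(set(new_acc) & set(existing_acc))
    rho, _ = spearmanr([new_acc[m] for m in models],
                       [existing_acc[m] for m in models])
    return 1.0 - rho  # 0 = identical ranking; larger values = more reordering
```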

Specific examples highlight AutoBencher's ability to uncover unique model weaknesses. For instance, while Gemini Pro performed robustly on existing economics datasets, it struggled with questions on Fordism, and the same questions revealed an unexpected strength of OpenChat-3.5 in that area.

Discussion and Implications

The automated, scalable nature of AutoBencher could have profound implications for the future of LLM benchmarking. Key takeaways include:

  • Enhanced Model Evaluation: AutoBencher can continually generate challenging, novel datasets, thereby providing a sustainable methodology for tracking LLM advancements.
  • Identification of Specific Model Weaknesses: By identifying the domains where specific models underperform, AutoBencher helps target the areas most in need of improvement.
  • Scalability and Efficiency: The automation reduces the manual effort involved in benchmark creation, accelerating the feedback loop in model development.

Future Directions

Potential future developments of AutoBencher could explore broader domains, including aspects like model safety or efficiency, extending beyond the capabilities discussed (e.g., knowledge-intensive QA and mathematics). Additionally, relaxing the constraints of domain-specific proposals could enable more creative and comprehensive benchmarking strategies.

Conclusion

AutoBencher represents a significant step forward in the field of LLM evaluation, providing an automated, metric-driven approach to creating salient, novel, and difficult benchmarks. This work not only enhances the current landscape of model evaluation but also introduces a versatile tool that can adapt to the evolving challenges in AI research. Future explorations expanding its utility across diverse domains will likely further consolidate its role in the AI community.
