OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems (2402.14008v2)
Abstract: Recent advancements have seen LLMs and Large Multimodal Models (LMMs) surpassing general human capabilities in various tasks, approaching the proficiency level of human experts across multiple domains. With traditional benchmarks becoming less challenging for these models, new rigorous challenges are essential to gauge their advanced abilities. In this work, we present OlympiadBench, an Olympiad-level bilingual multimodal scientific benchmark, featuring 8,476 problems from Olympiad-level mathematics and physics competitions, including the Chinese college entrance exam. Each problem is detailed with expert-level annotations for step-by-step reasoning. Evaluating top-tier models on OlympiadBench, we implement a comprehensive assessment methodology to accurately evaluate model responses. Notably, the best-performing model, GPT-4V, attains an average score of 17.97% on OlympiadBench, with a mere 10.74% in physics, highlighting the benchmark's rigor and the intricacy of physical reasoning. Our analysis of GPT-4V's responses reveals prevalent issues with hallucination, knowledge omission, and logical fallacies. We hope that our challenging benchmark can serve as a valuable resource for future AGI research. The data and evaluation code are available at \url{https://github.com/OpenBMB/OlympiadBench}
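The abstract mentions an automated assessment methodology for scoring model responses. As a minimal sketch of how such answer checking might work, the snippet below extracts a final `\boxed{...}` answer from a response and compares it numerically against a gold answer within a relative tolerance. The function names, regex, and 1% tolerance are illustrative assumptions, not the paper's actual evaluation pipeline.

```python
import math
import re

def extract_final_answer(response: str):
    """Pull the last \\boxed{...} expression out of a model response, or None."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1] if matches else None

def answers_match(pred, gold: str, rel_tol: float = 1e-2) -> bool:
    """Compare two answer strings numerically within a relative tolerance;
    fall back to an exact string match for non-numeric (symbolic) answers."""
    try:
        return math.isclose(float(pred), float(gold), rel_tol=rel_tol)
    except (TypeError, ValueError):
        return pred is not None and pred.strip() == gold.strip()

response = "Thus the acceleration is \\boxed{9.81} m/s^2."
print(answers_match(extract_final_answer(response), "9.8"))  # True (within 1%)
```

A real scorer would additionally need to normalize units, handle intervals and sets, and check symbolic equivalence (e.g. via a CAS) rather than raw string equality.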
- 01-ai. 2023. Yi-34B-Chat model card.
- 01-ai. 2024. Yi-VL-34B model card.
- MathQA: Towards interpretable math word problem solving with operation-based formalisms. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2357–2367, Minneapolis, Minnesota. Association for Computational Linguistics.
- Have LLMs advanced enough? A challenging problem solving benchmark for large language models.
- Qwen-VL: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966.
- Daniel Bobrow et al. 1964. Natural language input for a computer problem solving system.
- Sparks of artificial general intelligence: Early experiments with GPT-4.
- Jie Cao and Jing Xiao. 2022. An augmented benchmark dataset for geometric question answering through dual parallel text encoding. In Proceedings of the 29th International Conference on Computational Linguistics, pages 1511–1520, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
- UniGeo: Unifying geometry logical reasoning via reformulating mathematical expression. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3313–3323, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- GeoQA: A geometric question answering benchmark towards multimodal numerical reasoning. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 513–523, Online. Association for Computational Linguistics.
- Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- Evaluating language models for mathematics through interactions. arXiv preprint arXiv:2306.01694.
- Mathematical capabilities of ChatGPT.
- CMMU: A benchmark for Chinese multi-modal multi-type question understanding and reasoning.
- Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.
- Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874.
- C-Eval: A multi-level multi-discipline Chinese evaluation suite for foundation models.
- Draft, sketch, and prove: Guiding formal theorem provers with informal proofs.
- MAWPS: A math word problem repository. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1152–1157, San Diego, California. Association for Computational Linguistics.
- Solving quantitative reasoning problems with language models.
- Program induction by rationale generation: Learning to solve and explain algebraic word problems. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 158–167, Vancouver, Canada. Association for Computational Linguistics.
- LLaVA-NeXT: Improved reasoning, OCR, and world knowledge.
- Visual instruction tuning. arXiv preprint arXiv:2304.08485.
- MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. In International Conference on Learning Representations (ICLR).
- Inter-GPS: Interpretable geometry problem solving with formal language and symbolic reasoning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6774–6786, Online. Association for Computational Linguistics.
- Learn to explain: Multimodal reasoning via thought chains for science question answering. In The 36th Conference on Neural Information Processing Systems (NeurIPS).
- A survey of deep learning for mathematical reasoning.
- Ha-Thanh Nguyen. 2023. A brief report on LawGPT 1.0: A virtual legal assistant based on GPT-3. arXiv preprint arXiv:2302.05729.
- NousResearch. 2023. Nous-Hermes-2-Yi-34B model card.
- OpenAI. 2023a. GPT-4 technical report.
- OpenAI. 2023b. GPT-4V(ision) system card.
- Transfer knowledge from natural language to electrocardiography: Can we detect cardiovascular disease through language models? arXiv preprint arXiv:2301.09017.
- DeepSeekMath: Pushing the limits of mathematical reasoning in open language models.
- SciEval: A multi-level large language model evaluation benchmark for scientific research.
- Gemini Team. 2023. Gemini: A family of highly capable multimodal models.
- Solving olympiad geometry without human demonstrations. Nature, 625(7995):476–482.
- SciBench: Evaluating college-level scientific problem-solving abilities of large language models.
- Deep neural solver for math word problems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 845–854, Copenhagen, Denmark. Association for Computational Linguistics.
- Emergent abilities of large language models.
- CMATH: Can your language model pass Chinese elementary school math tests?
- Crowdsourcing multiple choice science questions. In Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 94–106, Copenhagen, Denmark. Association for Computational Linguistics.
- NaturalProofs: Mathematical theorem proving in natural language.
- MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI.
- Large language models meet NL2Code: A survey. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7443–7464, Toronto, Canada. Association for Computational Linguistics.
- AI for mathematics: A cognitive science perspective. arXiv preprint arXiv:2310.13021.
- MM-LLMs: Recent advances in multimodal large language models.
- CMMMU: A Chinese massive multi-discipline multimodal understanding benchmark.
- A survey of large language models.
- MiniF2F: A cross-system benchmark for formal Olympiad-level mathematics.
- Solving challenging math word problems using GPT-4 Code Interpreter with code-based self-verification.