Abstract

The evolution of AI has been significantly accelerated by advancements in LLMs and Large Multimodal Models (LMMs), gradually showcasing potential cognitive reasoning abilities in problem-solving and scientific discovery (i.e., AI4Science) once exclusive to human intellect. To comprehensively evaluate current models' performance in cognitive reasoning abilities, we introduce OlympicArena, which includes 11,163 bilingual problems across both text-only and interleaved text-image modalities. These challenges encompass a wide range of disciplines spanning seven fields and 62 international Olympic competitions, rigorously examined for data leakage. We argue that the challenges in Olympic competition problems are ideal for evaluating AI's cognitive reasoning due to their complexity and interdisciplinary nature, which are essential for tackling complex scientific challenges and facilitating discoveries. Beyond evaluating performance across various disciplines using answer-only criteria, we conduct detailed experiments and analyses from multiple perspectives. We delve into the models' cognitive reasoning abilities, their performance across different modalities, and their outcomes in process-level evaluations, which are vital for tasks requiring complex reasoning with lengthy solutions. Our extensive evaluations reveal that even advanced models like GPT-4o only achieve a 39.97% overall accuracy, illustrating current AI limitations in complex reasoning and multimodal integration. Through the OlympicArena, we aim to advance AI towards superintelligence, equipping it to address more complex challenges in science and beyond. We also provide a comprehensive set of resources to support AI research, including a benchmark dataset, an open-source annotation platform, a detailed evaluation tool, and a leaderboard with automatic submission features.

Figure: Overview of the OlympicArena benchmark.

Overview

  • The paper introduces the OlympicArena benchmark, which evaluates AI's cognitive reasoning through complex, interdisciplinary problems derived from Olympic-level competitions.

  • It uses a dataset of over 11,000 bilingual problems spanning multiple disciplines and integrates multimodal assessments for a comprehensive evaluation of AI models.

  • Experimental results indicate significant challenges for AI in interdisciplinary reasoning, particularly in subjects like mathematics and physics, with advanced models achieving only moderate accuracy.

An Overview of OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI

The emergence of LLMs and Large Multimodal Models (LMMs) has prompted a significant shift in the domain of AI, particularly in cognitive reasoning and problem-solving. The paper "OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI" by Zhen Huang et al. presents a comprehensive benchmark tailored to evaluate AI's cognitive reasoning by leveraging complex, interdisciplinary problems modeled after international Olympic competitions.

Key Contributions

The authors introduce the "OlympicArena" benchmark, designed to rigorously test cognitive reasoning capabilities of advanced AI models. This benchmark features:

  1. Extensive Problem Collection: The dataset encompasses 11,163 bilingual (English and Chinese) problems across text-only and interleaved text-image modalities. These problems span seven disciplines: mathematics, physics, chemistry, biology, geography, astronomy, and computer science, derived from 62 different international Olympic-level competitions.
  2. Multimodal and Process-Level Evaluation: Unlike traditional benchmarks that primarily focus on text-based problems, OlympicArena integrates multimodal assessments and detailed process-level evaluations. This approach scrutinizes AI models on both the correctness of the final answers and the intermediate reasoning steps, thus providing a more comprehensive evaluation.
  3. Fine-Grained Cognitive Reasoning Analysis: The benchmark categorizes cognitive reasoning into eight types of logical reasoning abilities and five types of visual reasoning abilities. This categorization facilitates in-depth analysis of model performance across different cognitive dimensions (a sample record layout is sketched after this list).
  4. Resource Provision: The paper details the provision of a benchmark dataset, an open-source annotation platform, a detailed evaluation tool, and a leaderboard with automatic submission features to support ongoing research in AI.
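
To make the annotation scheme above concrete, here is a minimal sketch of how a single benchmark problem might be represented. The field names are hypothetical and not the paper's actual schema; they simply reflect the attributes the benchmark reports (discipline, source competition, language, modality, reasoning-ability tags, and a reference solution for process-level scoring).

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class OlympicArenaProblem:
    """Hypothetical record layout for one benchmark problem (field names are illustrative)."""
    problem_id: str
    subject: str                      # one of the seven disciplines, e.g. "Physics"
    competition: str                  # one of the 62 source competitions
    language: str                     # "EN" or "ZH" (bilingual benchmark)
    modality: str                     # "text-only" or "interleaved text-image"
    statement: str                    # problem text, possibly with image placeholders
    images: List[str] = field(default_factory=list)             # paths for interleaved images
    logical_reasoning: List[str] = field(default_factory=list)  # subset of the 8 logical-reasoning types
    visual_reasoning: List[str] = field(default_factory=list)   # subset of the 5 visual-reasoning types
    answer: Optional[str] = None                                # gold answer for answer-only scoring
    solution_steps: List[str] = field(default_factory=list)     # reference steps for process-level scoring
```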

Experimental Evaluation

The authors conducted meticulous experiments using top-performing proprietary models (e.g., GPT-4o, GPT-4V, Claude 3 Sonnet) and open-source models (e.g., LLaVA-NeXT-34B, InternVL-Chat-V1.5). Three experimental settings were explored (input construction for each is sketched after the list):

  • Multimodal Setting: Assessed LMMs using interleaved text and image inputs.
  • Image-Caption Setting: Used textual descriptions of images to facilitate better problem understanding.
  • Text-Only Setting: Served as a baseline without visual inputs.
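
As a rough illustration of how these settings differ at the input level, the sketch below assembles a chat-style prompt for each one. The message schema and field names (`statement`, `images`, `captions`) are assumptions for illustration, not the authors' actual evaluation harness.

```python
from typing import Dict, List

def build_prompt(problem: Dict, setting: str) -> List[Dict]:
    """Assemble model input for one problem under one of the three evaluation settings."""
    if setting == "multimodal":
        # Interleave the problem text with the raw images for an LMM.
        content = [{"type": "text", "text": problem["statement"]}]
        content += [{"type": "image", "image": img} for img in problem.get("images", [])]
        return [{"role": "user", "content": content}]

    if setting == "image-caption":
        # Replace each image with a textual description so a text-only model can use it.
        captions = "\n".join(problem.get("captions", []))
        text = f"{problem['statement']}\n\nImage descriptions:\n{captions}"
        return [{"role": "user", "content": [{"type": "text", "text": text}]}]

    if setting == "text-only":
        # Baseline: drop all visual information.
        return [{"role": "user", "content": [{"type": "text", "text": problem["statement"]}]}]

    raise ValueError(f"unknown setting: {setting}")
```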

Main Findings

  1. Overall Performance: Advanced models like GPT-4o achieved only 39.97% accuracy, whereas many open-source models could not surpass 20% overall accuracy. This highlights the benchmark's difficulty and the current limitations of AI in interdisciplinary cognitive reasoning.
  2. Subject-Specific Performance: Mathematics and physics presented the most significant challenges, reflecting their reliance on complex reasoning. Computer science problems also proved difficult, indicating gaps in models' algorithmic reasoning abilities.
  3. Fine-Grained Analysis: Models displayed varied performance across the different logical and visual reasoning abilities. Most notably:
  • LLMs generally performed better on abductive and cause-and-effect reasoning tasks.
  • LMMs struggled with complex visual tasks requiring spatial and geometric reasoning and the understanding of abstract symbols.
  4. Process-Level Insights: The process-level evaluations showed that models often performed some reasoning steps correctly even when the final answers were incorrect (a scoring sketch follows this list). This underscores the latent potential of AI models in handling complex reasoning tasks if their intermediate steps can be better managed.
  5. Multimodal Performance: Results indicated that very few LMMs demonstrated significant performance improvements when given visual inputs, suggesting an area for future enhancement.
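
The contrast between answer-only and process-level credit in findings 1 and 4 can be sketched as follows. The string-matching rule and the step-judging mechanism here are simplified placeholders, not the paper's actual evaluation tool.

```python
from typing import List

def answer_only_score(predicted: str, gold: str) -> float:
    """Binary credit for the final answer (case-insensitive match as a stand-in
    for the benchmark's real answer checker)."""
    return float(predicted.strip().lower() == gold.strip().lower())

def process_level_score(step_judgments: List[bool]) -> float:
    """Fraction of reference reasoning steps judged correct, e.g. by a rubric or judge model."""
    return sum(step_judgments) / len(step_judgments) if step_judgments else 0.0

# A model can earn partial process-level credit even with a wrong final answer,
# which is the pattern finding 4 above describes.
final = answer_only_score("x = 3", "x = 5")          # 0.0
process = process_level_score([True, True, False])   # ~0.67
print(final, process)
```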

Implications and Future Directions

The introduction of OlympicArena is a significant step in pushing the boundaries of AI capabilities. By presenting a robust, challenging benchmark, the authors highlight several key insights and areas requiring further research and development:

  • Refinement of Multimodal Models: Enhancing the ability of LMMs to effectively integrate and leverage visual information remains an open challenge.
  • Improving Reasoning Pathways: Given that many models demonstrate potential by correctly executing some intermediate steps, future research should focus on optimizing the reasoning process.
  • Reducing Knowledge Deficits: The error analysis indicates that models still lack domain-specific knowledge, which is critical for solving complex interdisciplinary problems.

In conclusion, OlympicArena serves as a rigorous and comprehensive benchmark that significantly contributes to the field of AI cognitive reasoning. It sets a high bar for future AI systems, guiding researchers towards developing more sophisticated models capable of tackling complex, real-world challenges.
