
OlympicArena Medal Ranks: Who Is the Most Intelligent AI So Far?

(2406.16772)
Published Jun 24, 2024 in cs.CL and cs.AI

Abstract

In this report, we pose the following question: Who is the most intelligent AI model to date, as measured by the OlympicArena (an Olympic-level, multi-discipline, multi-modal benchmark for superintelligent AI)? We specifically focus on the most recently released models: Claude-3.5-Sonnet, Gemini-1.5-Pro, and GPT-4o. For the first time, we propose using an Olympic medal Table approach to rank AI models based on their comprehensive performance across various disciplines. Empirical results reveal: (1) Claude-3.5-Sonnet shows highly competitive overall performance over GPT-4o, even surpassing GPT-4o on a few subjects (i.e., Physics, Chemistry, and Biology). (2) Gemini-1.5-Pro and GPT-4V are ranked consecutively just behind GPT-4o and Claude-3.5-Sonnet, but with a clear performance gap between them. (3) The performance of AI models from the open-source community significantly lags behind these proprietary models. (4) The performance of these models on this benchmark has been less than satisfactory, indicating that we still have a long way to go before achieving superintelligence. We remain committed to continuously tracking and evaluating the performance of the latest powerful models on this benchmark (available at https://github.com/GAIR-NLP/OlympicArena).

Overview

  • The paper introduces the 'OlympicArena' framework inspired by the Olympic Games, providing a rigorous multi-disciplinary benchmark to evaluate AI capabilities under competitive conditions.

  • It compares advanced AI models like Claude-3.5-Sonnet and Gemini-1.5-Pro against OpenAI's GPT series using the OlympicArena Medal Table, which aggregates performance across various subjects.

  • The study finds that GPT-4o leads the overall ranking, while noting significant performance disparities between proprietary and open-source models, indicating areas for future research and development.

OlympicArena Medal Ranks: Evaluating AI Performance in a Competitive Benchmark

In this paper, researchers Zhen Huang, Zengzhi Wang, Shijie Xia, and Pengfei Liu from Shanghai Jiao Tong University and the Generative AI Research Lab (GAIR) introduce an innovative evaluation method for AI models called the "OlympicArena". This framework draws its inspiration from the Olympic Games, providing a multi-disciplinary and multi-modal benchmark aimed at rigorously testing AI capabilities in various subjects under competitive conditions. Specifically, the paper focuses on the latest models: Claude-3.5-Sonnet, Gemini-1.5-Pro, GPT-4o, and GPT-4V, and ranks them using the newly devised OlympicArena Medal Table.

Key Contributions

The paper presents several key contributions to the field of AI model evaluation:

  1. Comparison of Advanced Models: Through a detailed analysis, the paper compares the state-of-the-art models Claude-3.5-Sonnet and Gemini-1.5-Pro against OpenAI's established GPT series, including GPT-4o.
  2. Introduction of the OlympicArena Medal Table: A novel ranking mechanism, the OlympicArena Medal Table, is introduced. The table aggregates performance across various disciplines and awards medals akin to the Olympic Games, offering a clear and competitive framework for comparison.
  3. Fine-Grained Analysis: The paper delivers a fine-grained analysis that enriches the OlympicArena benchmark, providing deeper insights into the distinct capabilities and limitations of these models.

Setup and Evaluation Methods

  • Data and Testing: The OlympicArena benchmark comprises 11,163 bilingual problems spanning text-only and interleaved text-image modalities, covering subjects including Math, Physics, Chemistry, and Biology.
  • Evaluation Metrics: The study employs accuracy for non-programming tasks and unbiased pass@k for programming tasks. Pass@k is calculated using the formula:

\[ \operatorname{pass}@k := \underset{\text{Problems}}{\mathbb{E}}\left[1-\frac{\binom{n-c}{k}}{\binom{n}{k}}\right] \]

where \( k=1 \) and \( n=5 \), with \( c \) denoting the number of correct samples.
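To make the estimator concrete, here is a minimal Python sketch that computes unbiased pass@k per problem and averages it across problems. The helper name `pass_at_k` and the per-problem correct counts in the example are illustrative assumptions, not taken from the paper's codebase; only the setting \( k=1 \), \( n=5 \) mirrors the paper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: total samples generated for a problem
    c: number of those samples that are correct
    k: evaluation budget
    """
    if n - c < k:
        # Fewer than k incorrect samples, so any k samples include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Paper's setting: k=1, n=5. A problem with 2 correct samples out of 5
# contributes pass@1 = 1 - C(3,1)/C(5,1) = 0.4.
print(pass_at_k(n=5, c=2, k=1))  # 0.4

# The benchmark-level score is the expectation over problems, i.e. the mean
# of per-problem estimates (the correct counts below are made up).
correct_counts = [2, 0, 5, 1]
scores = [pass_at_k(5, c, 1) for c in correct_counts]
print(sum(scores) / len(scores))  # 0.4 on this toy set
```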

  • Ranking Mechanism: The OlympicArena Medal Table ranks models by the number of Gold medals first, then by overall scores if tied. This method highlights top-performing models in specific disciplines.
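The ranking rule itself is simple to express in code. The sketch below assumes per-model gold-medal counts and an aggregate overall score have already been computed; the `ModelResult` structure and the overall-score values are hypothetical placeholders, and only the gold counts of 4 and 3 for GPT-4o and Claude-3.5-Sonnet come from the reported results.

```python
from typing import NamedTuple

class ModelResult(NamedTuple):
    name: str
    golds: int            # subjects in which the model ranks first
    overall_score: float  # aggregate score across all disciplines

def medal_table_rank(results: list[ModelResult]) -> list[ModelResult]:
    """Sort by gold-medal count, breaking ties with the overall score."""
    return sorted(results, key=lambda r: (r.golds, r.overall_score), reverse=True)

# Illustrative entries (overall scores are made up for the example).
table = medal_table_rank([
    ModelResult("Claude-3.5-Sonnet", golds=3, overall_score=38.0),
    ModelResult("GPT-4o", golds=4, overall_score=39.0),
    ModelResult("Gemini-1.5-Pro", golds=0, overall_score=34.0),
    ModelResult("GPT-4V", golds=0, overall_score=32.0),
])
for rank, result in enumerate(table, start=1):
    print(rank, result.name, result.golds, result.overall_score)
```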

Results and Analysis

Overall Performance

  • Top Models: The results show that GPT-4o leads overall, securing a total of 4 Gold medals, followed by Claude-3.5-Sonnet with 3 Gold medals. Notably, Gemini-1.5-Pro and GPT-4V are ranked consecutively just behind these two, but with a clear performance gap.
  • Performance Gap: There is a significant disparity between proprietary models like GPT-4o and Claude-3.5-Sonnet and their open-source counterparts, highlighting the ongoing challenge for open-source models to compete with state-of-the-art proprietary systems.
  • Superintelligence Hurdle: The study concludes that none of the models have reached a level that can be considered superintelligent, emphasizing that the journey towards superintelligence remains ongoing.

Subject-Specific Performance

  • Strengths and Weaknesses: GPT-4o shows robust performance in deductive and inductive reasoning tasks, particularly in Math and Computer Science, while Claude-3.5-Sonnet excels in knowledge-intensive subjects like Biology and Chemistry.
  • Reasoning Abilities: Fine-grained analysis shows that GPT-4o leads in traditional reasoning capabilities, while Claude-3.5-Sonnet surpasses it in cause-and-effect reasoning, decompositional reasoning, and quantitative reasoning.

Language and Modality Analysis

  • Language Performance: Models generally perform better on English tasks than on Chinese ones, except for certain models optimized on Chinese data, which show improved performance in those scenarios.
  • Multi-Modal Capabilities: The analysis highlights that current models still perform better in text-only tasks compared to multi-modal ones, indicating room for improvement in utilizing multi-modal information for complex reasoning tasks.

Implications and Future Directions

The comprehensive evaluation using the OlympicArena benchmark underscores the need for continuous advancements in both proprietary and open-source AI models. The fine-grained analysis provides valuable insights into specific strengths and areas requiring further development, such as multi-modal reasoning capabilities and support for non-English languages. This work not only offers a robust competitive framework for evaluating AI models but also sets the stage for future research and development aimed at closing these performance gaps and pushing the boundaries of AI capabilities further towards superintelligence.
