OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems (2402.14008v2)
Abstract: Recent advancements have seen LLMs and Large Multimodal Models (LMMs) surpassing general human capabilities in various tasks, approaching the proficiency level of human experts across multiple domains. With traditional benchmarks becoming less challenging for these models, new rigorous challenges are essential to gauge their advanced abilities. In this work, we present OlympiadBench, an Olympiad-level bilingual multimodal scientific benchmark, featuring 8,476 problems from Olympiad-level mathematics and physics competitions, including the Chinese college entrance exam. Each problem is detailed with expert-level annotations for step-by-step reasoning. Evaluating top-tier models on OlympiadBench, we implement a comprehensive assessment methodology to accurately evaluate model responses. Notably, the best-performing model, GPT-4V, attains an average score of 17.97% on OlympiadBench, with a mere 10.74% in physics, highlighting the benchmark's rigor and the intricacy of physical reasoning. Our analysis of GPT-4V's responses reveals prevalent issues with hallucination, knowledge omission, and logical fallacies. We hope that our challenging benchmark can serve as a valuable resource for future AGI research. The data and evaluation code are available at \url{https://github.com/OpenBMB/OlympiadBench}
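The abstract mentions an automated assessment methodology for scoring model responses. As a minimal sketch of how such answer checking might work, the snippet below extracts a final `\boxed{...}` answer from a response and compares it numerically against a gold answer within a relative tolerance. The function names, regex, and 1% tolerance are illustrative assumptions, not the paper's actual evaluation pipeline.

```python
import math
import re

def extract_final_answer(response: str):
    """Pull the last \\boxed{...} expression out of a model response, or None."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1] if matches else None

def answers_match(pred, gold: str, rel_tol: float = 1e-2) -> bool:
    """Compare two answer strings numerically within a relative tolerance;
    fall back to an exact string match for non-numeric (symbolic) answers."""
    try:
        return math.isclose(float(pred), float(gold), rel_tol=rel_tol)
    except (TypeError, ValueError):
        return pred is not None and pred.strip() == gold.strip()

response = "Thus the acceleration is \\boxed{9.81} m/s^2."
print(answers_match(extract_final_answer(response), "9.8"))  # True (within 1%)
```

A real scorer would additionally need to normalize units, handle intervals and sets, and check symbolic equivalence (e.g. via a CAS) rather than raw string equality.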
- 01-ai. 2023. Yi-34B-Chat model card.
- 01-ai. 2024. Yi-VL-34B model card.
- MathQA: Towards interpretable math word problem solving with operation-based formalisms. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2357–2367, Minneapolis, Minnesota. Association for Computational Linguistics.
- Have LLMs advanced enough? A challenging problem solving benchmark for large language models.
- Qwen-VL: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966.
- Daniel Bobrow et al. 1964. Natural language input for a computer problem solving system.
- Sparks of artificial general intelligence: Early experiments with GPT-4.
- Jie Cao and Jing Xiao. 2022. An augmented benchmark dataset for geometric question answering through dual parallel text encoding. In Proceedings of the 29th International Conference on Computational Linguistics, pages 1511–1520, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
- UniGeo: Unifying geometry logical reasoning via reformulating mathematical expression. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3313–3323, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- GeoQA: A geometric question answering benchmark towards multimodal numerical reasoning. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 513–523, Online. Association for Computational Linguistics.
- Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- Evaluating language models for mathematics through interactions. arXiv preprint arXiv:2306.01694.
- Mathematical capabilities of ChatGPT.
- CMMU: A benchmark for Chinese multi-modal multi-type question understanding and reasoning.
- Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.
- Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874.
- C-Eval: A multi-level multi-discipline Chinese evaluation suite for foundation models.
- Draft, sketch, and prove: Guiding formal theorem provers with informal proofs.
- MAWPS: A math word problem repository. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1152–1157, San Diego, California. Association for Computational Linguistics.
- Solving quantitative reasoning problems with language models.
- Program induction by rationale generation: Learning to solve and explain algebraic word problems. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 158–167, Vancouver, Canada. Association for Computational Linguistics.
- LLaVA-NeXT: Improved reasoning, OCR, and world knowledge.
- Visual instruction tuning. arXiv preprint arXiv:2304.08485.
- MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. In International Conference on Learning Representations (ICLR).
- Inter-GPS: Interpretable geometry problem solving with formal language and symbolic reasoning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6774–6786, Online. Association for Computational Linguistics.
- Learn to explain: Multimodal reasoning via thought chains for science question answering. In The 36th Conference on Neural Information Processing Systems (NeurIPS).
- A survey of deep learning for mathematical reasoning.
- Ha-Thanh Nguyen. 2023. A brief report on LawGPT 1.0: A virtual legal assistant based on GPT-3. arXiv preprint arXiv:2302.05729.
- NousResearch. 2023. Nous-Hermes-2-Yi-34B model card.
- OpenAI. 2023a. GPT-4 technical report.
- OpenAI. 2023b. GPT-4V(ision) system card.
- Transfer knowledge from natural language to electrocardiography: Can we detect cardiovascular disease through language models? arXiv preprint arXiv:2301.09017.
- DeepSeekMath: Pushing the limits of mathematical reasoning in open language models.
- SciEval: A multi-level large language model evaluation benchmark for scientific research.
- Gemini Team. 2023. Gemini: A family of highly capable multimodal models.
- Solving olympiad geometry without human demonstrations. Nature, 625(7995):476–482.
- SciBench: Evaluating college-level scientific problem-solving abilities of large language models.
- Deep neural solver for math word problems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 845–854, Copenhagen, Denmark. Association for Computational Linguistics.
- Emergent abilities of large language models.
- CMATH: Can your language model pass Chinese elementary school math tests?
- Crowdsourcing multiple choice science questions. In Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 94–106, Copenhagen, Denmark. Association for Computational Linguistics.
- NaturalProofs: Mathematical theorem proving in natural language.
- MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI.
- Large language models meet NL2Code: A survey. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7443–7464, Toronto, Canada. Association for Computational Linguistics.
- AI for mathematics: A cognitive science perspective. arXiv preprint arXiv:2310.13021.
- MM-LLMs: Recent advances in multimodal large language models.
- CMMMU: A Chinese massive multi-discipline multimodal understanding benchmark.
- A survey of large language models.
- MiniF2F: A cross-system benchmark for formal Olympiad-level mathematics.
- Solving challenging math word problems using GPT-4 Code Interpreter with code-based self-verification.