How Far Are We from Intelligent Visual Deductive Reasoning? (2403.04732v3)
Abstract: Vision-Language Models (VLMs) have recently demonstrated impressive strides on diverse vision-language tasks. We dig into vision-based deductive reasoning, a more sophisticated but less explored realm, and find previously unexposed blind spots in the current SOTA VLMs. Specifically, we leverage Raven's Progressive Matrices (RPMs) to assess VLMs' abilities to perform multi-hop relational and deductive reasoning relying solely on visual clues. We perform comprehensive evaluations of several popular VLMs employing standard strategies such as in-context learning, self-consistency, and Chain-of-Thought (CoT) on three diverse datasets: the Mensa IQ test, IntelligenceTest, and RAVEN. The results reveal that despite the impressive capabilities of LLMs in text-based reasoning, we are still far from achieving comparable proficiency in visual deductive reasoning. We find that certain standard strategies that are effective when applied to LLMs do not translate seamlessly to the challenges presented by visual reasoning tasks. A detailed analysis reveals that VLMs struggle to solve these tasks mainly because they are unable to perceive and comprehend multiple, confounding abstract patterns in RPM examples.
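As a concrete illustration of the evaluation setup the abstract describes, the sketch below shows how one might query a VLM on a single RPM puzzle with CoT prompting and aggregate sampled answers by a self-consistency majority vote. This is a minimal sketch, not the paper's actual harness: `query_vlm` is a hypothetical placeholder for whichever VLM API is under evaluation, and the prompt wording and `Answer: <number>` output format are assumptions.

```python
from collections import Counter

# Hypothetical placeholder for the VLM under evaluation (e.g., GPT-4V,
# Gemini, Qwen-VL); not a real library call -- wire in an actual API here.
def query_vlm(image_path: str, prompt: str, temperature: float = 0.7) -> str:
    raise NotImplementedError("plug in the VLM being evaluated")

# Assumed CoT prompt: ask for step-by-step pattern analysis,
# then a machine-parsable final answer.
COT_PROMPT = (
    "The image shows a Raven's Progressive Matrix with one cell missing "
    "and a set of numbered candidate answers. Describe the pattern along "
    "each row and column step by step, then finish with 'Answer: <number>'."
)

def extract_answer(response: str) -> str:
    """Pull the final 'Answer: <number>' line out of a CoT response."""
    for line in reversed(response.splitlines()):
        if "Answer:" in line:
            return line.split("Answer:")[-1].strip()
    return ""  # treat an unparsable response as a miss

def self_consistent_answer(image_path: str, n_samples: int = 5) -> str:
    """Self-consistency: sample several CoT responses, majority-vote them."""
    votes = Counter(
        extract_answer(query_vlm(image_path, COT_PROMPT, temperature=0.7))
        for _ in range(n_samples)
    )
    votes.pop("", None)  # discard unparsable samples before voting
    return votes.most_common(1)[0][0] if votes else ""
```

Note that majority voting only helps when a model's errors are uncorrelated across samples; the paper's finding that VLMs misperceive the underlying abstract patterns suggests sampled answers often agree on the same wrong choice, consistent with self-consistency not transferring well to these tasks.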
- LMRL Gym: Benchmarks for multi-turn reinforcement learning with language models. arXiv preprint arXiv:2311.18232, 2023.
- Communicating natural programs to humans and machines. Advances in Neural Information Processing Systems, 35:3731–3743, 2022.
- The curious case of nonverbal abstract reasoning with multi-modal large language models. arXiv preprint arXiv:2401.12117, 2024.
- Self-imagine: Effective unimodal reasoning with multimodal models using self-imagination. arXiv preprint arXiv:2401.08025, 2024.
- Neural module networks. In CVPR, June 2016.
- VQA: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pp. 2425–2433, 2015.
- Qwen-VL: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
- Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. arXiv preprint arXiv:2401.12168, 2024. URL https://arxiv.org/abs/2401.12168.
- Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
- Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457, 2018.
- Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- GuessWhat?! Visual object discovery through multi-modal dialogue. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5503–5512, 2017.
- Did Aristotle use a laptop? A question answering benchmark with implicit reasoning strategies. Transactions of the Association for Computational Linguistics, 9:346–361, 2021.
- Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
- Jie Huang and Kevin Chen-Chuan Chang. Towards reasoning in large language models: A survey. arXiv preprint arXiv:2212.10403, 2022.
- CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR, pp. 2901–2910, 2017.
- ReferItGame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 787–798, Doha, Qatar, October 2014. Association for Computational Linguistics. doi: 10.3115/v1/D14-1086. URL https://aclanthology.org/D14-1086.
- A computational model for solving problems from the Raven's Progressive Matrices intelligence test using iconic visual representations. Cognitive Systems Research, 22:47–66, 2013.
- LLMs as factual reasoners: Insights from existing benchmarks and beyond. arXiv preprint arXiv:2305.14540, 2023.
- Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022.
- Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.
- MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. In ICLR, 2024.
- An in-depth look at Gemini's language abilities. arXiv preprint, 2023.
- OpenAI. GPT-4 technical report, 2023.
- Language models as knowledge bases? In EMNLP, 2019.
- SuperGlue: Learning feature matching with graph neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4938–4947, 2020.
- TextCaps: A dataset for image captioning with reading comprehension. In ECCV, pp. 742–758. Springer, 2020.
- Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.
- Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
- Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
- The dawn of LMMs: Preliminary explorations with GPT-4V(ision), 2023.
- MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. arXiv preprint arXiv:2311.16502, 2023.
- RAVEN: A dataset for relational and analogical visual reasoning. In CVPR, pp. 5317–5327, 2019.
- The entity-deduction arena: A playground for probing the conversational reasoning and planning capabilities of LLMs. arXiv preprint arXiv:2310.01468, 2023.