IsoBench: Benchmarking Multimodal Foundation Models on Isomorphic Representations (2404.01266v3)

Published 1 Apr 2024 in cs.AI and cs.CL

Abstract: Current foundation models exhibit impressive capabilities when prompted either with text only or with both image and text inputs. But do their capabilities change depending on the input modality? In this work, we propose $\textbf{IsoBench}$, a benchmark dataset containing problems from four major areas: math, science, algorithms, and games. Each example is presented with multiple $\textbf{isomorphic representations}$ of inputs, such as visual, textual, and mathematical presentations. IsoBench provides fine-grained feedback to diagnose performance gaps caused by the form of the representation. Across various foundation models, we observe that on the same problem, models have a consistent preference towards textual representations. Most prominently, when evaluated on all IsoBench problems, Claude-3 Opus performs 28.7 points worse when provided with images instead of text; similarly, GPT-4 Turbo is 18.7 points worse and Gemini Pro is 14.9 points worse. Finally, we present two prompting techniques, $\textit{IsoCombination}$ and $\textit{IsoScratchPad}$, which improve model performance by considering combinations of, and translations between, different input representations.

Summary

  • The paper introduces IsoBench, a benchmark that systematically evaluates multimodal models using isomorphic visual and textual representations.
  • It assesses performance across four domains—mathematics, games, algorithms, and science—to uncover biases and modality-specific gaps.
  • The study proposes IsoCombination and IsoScratchPad strategies to improve multimodal fusion and enhance AI reasoning capabilities.

Evaluating Multimodal Foundation Models with IsoBench: Insights and Challenges

Introduction to IsoBench

IsoBench is a benchmark designed to systematically evaluate the capabilities of multimodal foundation models across a diverse range of tasks that require understanding text, images, or a combination of the two. The benchmark spans four domains: mathematics, science, algorithms, and games. Unique to IsoBench is its emphasis on isomorphic representations: the same problem is presented in multiple modalities, including visual and textual formats. This allows a fine-grained assessment of how well models handle semantically equivalent inputs in distinct representations, revealing preferences or biases toward specific modalities.
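
To make the idea concrete, the sketch below builds two isomorphic representations of a single graph-connectivity question from the algorithms domain: a textual adjacency list and a rendered image of the same graph. This is only an illustrative example under assumed conventions; the graph, prompt wording, and output path are hypothetical and do not reflect IsoBench's actual data format.

```python
# Illustrative sketch: two isomorphic representations of one graph-connectivity
# problem (textual adjacency list vs. rendered image). Not the official IsoBench
# data format; the graph, prompt, and file name are made up for demonstration.
import networkx as nx
import matplotlib.pyplot as plt

# One underlying problem instance: is the graph connected?
G = nx.Graph([(0, 1), (1, 2), (3, 4)])  # two components -> not connected

# Textual representation: an adjacency list the model reads as plain text.
text_repr = "\n".join(f"{u}: {sorted(G.neighbors(u))}" for u in sorted(G.nodes))

# Visual representation: the same graph rendered as an image.
nx.draw(G, with_labels=True, node_color="lightgray")
plt.savefig("graph.png")  # hypothetical output path

prompt = (
    "Determine whether the following undirected graph is connected.\n"
    f"Adjacency list:\n{text_repr}"
)
print(prompt)  # the visual variant would attach graph.png with the same question
```

Both inputs encode exactly the same problem, so any performance difference between them isolates the effect of the representation rather than the task.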

Domains and Tasks

IsoBench comprises four major domains, each testing different aspects of model capabilities:

  1. Mathematics: Tasks focus on continuous mathematics and plot understanding, including classifying function properties and identifying breakpoints in piecewise functions.
  2. Games: Chess puzzles and winner identification tasks test strategic reasoning and understanding of complex game states.
  3. Algorithms: Graph algorithms such as connectivity, maximum flow, and isomorphism challenge the models' algorithmic reasoning skills.
  4. Science: Chemistry and physics questions assess the models' understanding of scientific concepts and their ability to interpret diagrams and visual information.

Key Observations and Findings

Across the evaluated multimodal foundation models, a consistent preference for textual representations over visual ones was observed: on the full set of IsoBench problems, Claude-3 Opus scores 28.7 points lower when given images instead of text, GPT-4 Turbo 18.7 points lower, and Gemini Pro 14.9 points lower. This runs counter to the human tendency to benefit from visual presentation of information, and it raises questions about the multimodal fusion mechanisms in these models and their ability to leverage visual inputs effectively. The findings from IsoBench highlight several limitations and challenges:

  • Vision Model Shortcomings: Visual recognition errors and a lack of capability in utilizing low-level visual features for reasoning suggest that the vision components may not be optimally integrated or trained.
  • Input Format Sensitivity: Models display varying performance across different textual representations, indicating potential biases or overfitting to specific formats encountered during training.
  • Multimodal Fusion Gaps: The observed performance gaps between visual and textual representations suggest that current fusion techniques may not effectively leverage the complementary strengths of different modalities.

Addressing the Gaps: IsoCombination and IsoScratchPad

To mitigate the performance discrepancies observed between input modalities, two prompting strategies were introduced: IsoCombination (IsoCB) and IsoScratchPad (IsoSP). IsoCB combines multiple isomorphic representations into a single input, giving the model a richer view of the same problem. IsoSP instead uses a two-step process in which the model first translates a visual input into text and then solves the task from that higher-performing textual representation. Both strategies showed promising improvements, with IsoCB in particular substantially reducing the performance gap on certain tasks.
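
As a rough sketch of how these two strategies could be wired up, the snippet below expresses IsoCB as concatenating isomorphic inputs into one prompt and IsoSP as a translate-then-solve pipeline. The `query_model` helper and prompt wording are hypothetical placeholders, not the paper's implementation or any specific vendor SDK.

```python
# Hypothetical sketch of IsoCombination (IsoCB) and IsoScratchPad (IsoSP).
# `query_model` stands in for any chat API that accepts an optional image;
# it is a placeholder and must be wired to a real model to run end to end.
from typing import Optional

def query_model(prompt: str, image_path: Optional[str] = None) -> str:
    """Placeholder for a call to a (multimodal) foundation model."""
    raise NotImplementedError("connect your model API here")

def iso_combination(question: str, text_repr: str, image_path: str) -> str:
    # IsoCB: feed multiple isomorphic representations together in one prompt.
    prompt = (
        f"{question}\n\n"
        f"Textual representation:\n{text_repr}\n\n"
        "An image of the same problem is attached."
    )
    return query_model(prompt, image_path=image_path)

def iso_scratchpad(question: str, image_path: str) -> str:
    # IsoSP step 1: translate the visual input into a textual description.
    transcription = query_model(
        "Describe the attached figure precisely enough that the problem "
        "could be solved from the text alone.",
        image_path=image_path,
    )
    # IsoSP step 2: answer from the (typically better-handled) text representation.
    return query_model(f"{question}\n\nFigure description:\n{transcription}")
```

The design intent in both cases is the same: route as much of the reasoning as possible through the representation the model handles best, while still grounding the answer in the original visual input.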

Implications and Future Directions

The findings from IsoBench underscore the need for advances in the representations and fusion techniques used by multimodal foundation models to more effectively process and integrate information across modalities. The observed preference for textual inputs points to potential biases in current models, possibly stemming from imbalances in pre-training data or limitations in the models' architectural design.

Future research should focus on developing more sophisticated multimodal fusion mechanisms that can capitalize on the unique advantages of each modality. Additionally, expanding the diversity of tasks and representations in benchmarks like IsoBench will be crucial for comprehensively assessing and improving the capabilities of multimodal foundation models.

In summary, IsoBench brings to light critical challenges in current multimodal foundation models and proposes avenues for research to enhance their understanding and reasoning capabilities across diverse input modalities. With continued development and evaluation, we can move closer to models that truly comprehend and reason with the richness of human communication.