
Exploring Failure Cases in Multimodal Reasoning About Physical Dynamics (2402.15654v1)

Published 24 Feb 2024 in cs.CL

Abstract: In this paper, we present an exploration of LLMs' abilities to solve problems requiring physical reasoning in situated environments. We construct a simple simulated environment and demonstrate examples where, in a zero-shot setting, both text-only and multimodal LLMs display atomic world knowledge about various objects but fail to compose this knowledge into correct solutions for an object manipulation and placement task. We also use BLIP, a vision-language model trained with more sophisticated cross-modal attention, to identify cases relevant to object physical properties that the model fails to ground. Finally, we present a procedure for discovering the relevant properties of objects in the environment and propose a method to distill this knowledge back into the LLM.
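
The grounding probe the abstract describes can be made concrete. Below is a minimal sketch (not the paper's actual code) of scoring physical-property captions against a scene image with BLIP's image-text matching head, assuming the public HuggingFace checkpoint Salesforce/blip-itm-base-coco; the image filename and captions are hypothetical placeholders.

```python
# Sketch: probe whether BLIP grounds object physical properties via its
# image-text matching (ITM) head. Assumes the HuggingFace "transformers"
# library and the public checkpoint "Salesforce/blip-itm-base-coco".
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForImageTextRetrieval

processor = BlipProcessor.from_pretrained("Salesforce/blip-itm-base-coco")
model = BlipForImageTextRetrieval.from_pretrained("Salesforce/blip-itm-base-coco")
model.eval()

# Screenshot of the simulated scene (hypothetical file name).
image = Image.open("scene.png").convert("RGB")

# Captions encoding physical properties; a model that grounds these
# properties should rank the caption matching the scene higher.
captions = [
    "a round ball that would roll off a stack of blocks",
    "a flat block that can support objects placed on top of it",
]

for caption in captions:
    inputs = processor(images=image, text=caption, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # itm_score holds logits over [no-match, match];
    # softmax over the last dim yields a match probability.
    match_prob = torch.softmax(outputs.itm_score, dim=1)[0, 1].item()
    print(f"{caption!r}: match probability = {match_prob:.3f}")
```

A low match probability for a caption describing a property the scene clearly exhibits is the kind of grounding failure the paper sets out to catalog.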

References (36)
  1. Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691.
  2. Limits for Learning with Language Models. arXiv preprint arXiv:2306.12213.
  3. Baillargeon, R. 1987. Object permanence in 3½- and 4½-month-old infants. Developmental Psychology, 23(5): 655.
  4. ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. Advances in neural information processing systems, 32.
  5. Analyzing Semantic Faithfulness of Language Models via Input Intervention on Conversational Question Answering. arXiv preprint arXiv:2212.10696.
  6. How is ChatGPT’s behavior changing over time? arXiv preprint arXiv:2307.09009.
  7. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30.
  8. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Burstein, J.; Doran, C.; and Solorio, T., eds., Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186. Minneapolis, Minnesota: Association for Computational Linguistics.
  9. PaLM-E: An embodied multimodal language model. arXiv preprint arXiv:2303.03378.
  10. Detecting and accommodating novel types and concepts in an embodied simulation environment. In Proceedings of the 10th Annual Conference on Advances in Cognitive Systems.
  11. Grounding and distinguishing conceptual vocabulary through similarity learning in embodied simulations. In Proceedings of the 15th International Conference on Computational Semantics.
  12. Goertzel, B. 2023. Generative AI vs. AGI: The Cognitive Strengths and Weaknesses of Modern LLMs. arXiv preprint arXiv:2309.10371.
  13. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
  14. ChatGPT: Jack of all trades, master of none. Information Fusion, 101861.
  15. Krishnaswamy, N. 2017. Monte Carlo Simulation Generation Through Operationalization of Spatial Primitives. Brandeis University.
  16. The VoxWorld platform for multimodal embodied agents. In LREC proceedings, volume 13.
  17. Affordance embeddings for situated language understanding. Frontiers in Artificial Intelligence, 5: 774752.
  18. Learning physical intuition of block towers by example. In International conference on machine learning, 430–438. PMLR.
  19. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, 12888–12900. PMLR.
  20. Towards practical multi-object manipulation using relational reinforcement learning. In 2020 IEEE international conference on robotics and automation (ICRA), 4051–4058. IEEE.
  21. Visual instruction tuning. arXiv preprint arXiv:2304.08485.
  22. Reinforcement learning for pick and place operations in robotics: A survey. Robotics, 10(3): 105.
  23. Hybrid Machine Learning/Knowledge Base Systems Learning through Natural Language Dialogue with Deep Learning Models. In AAAI Spring Symposium: Challenges Requiring the Combination of Machine Learning and Knowledge Engineering.
  24. How understanding large language models can inform the use of ChatGPT in physics education. European Journal of Physics.
  25. Pustejovsky, J. 2013. Dynamic event structure and habitat theory. In Proceedings of the 6th International Conference on Generative Approaches to the Lexicon (GL2013), 1–10.
  26. VoxML: A Visualization Modeling Language. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), 4606–4613.
  27. Is ChatGPT a general-purpose natural language processing task solver? arXiv preprint arXiv:2302.06476.
  28. Learning transferable visual models from natural language supervision. In International conference on machine learning, 8748–8763. PMLR.
  29. An Independent Evaluation of ChatGPT on Mathematical Word Problems (MWP). arXiv preprint arXiv:2302.13814.
  30. Shanahan, M. 2022. Talking about large language models. arXiv preprint arXiv:2212.03551.
  31. Spelke, E. S. 1985. Perception of unity, persistence, and identity: Thoughts on infants’ conceptions of objects.
  32. Spelke, E. S. 1990. Principles of object perception. Cognitive Science, 14(1): 29–56.
  33. Object perception in infancy: Interaction of spatial and kinetic information for object boundaries. Developmental Psychology, 25(2): 185.
  34. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  35. NEWTON: Are Large Language Models Capable of Physical Reasoning? In Findings of the Association for Computational Linguistics: EMNLP 2023, 9743–9758.
  36. RLCD: Reinforcement learning from contrast distillation for language model alignment. arXiv preprint arXiv:2307.12950.
