Octopi: Object Property Reasoning with Large Tactile-Language Models (2405.02794v2)
Abstract: Physical reasoning is important for effective robot manipulation. Recent work has investigated both the vision and language modalities for physical reasoning; vision can reveal information about objects in the environment, while language serves as an abstraction and communication medium for additional context. Although these works have demonstrated success on a variety of physical reasoning tasks, they are limited to physical properties that can be inferred from visual or language inputs. In this work, we investigate combining tactile perception with language, which enables embodied systems to obtain physical properties through interaction and to apply commonsense reasoning. We contribute a new dataset, PhysiCLeAR, which comprises physical property reasoning tasks together with annotated tactile videos collected using a GelSight tactile sensor. We then introduce Octopi, a system that leverages both tactile representation learning and large vision-language models to predict and reason about tactile inputs with minimal language fine-tuning. Our evaluations on PhysiCLeAR show that Octopi effectively uses intermediate physical property predictions to improve its performance on a range of tactile-related tasks. PhysiCLeAR and Octopi are available at https://github.com/clear-nus/octopi.
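To make the abstract's pipeline concrete, below is a minimal sketch of the general pattern it describes: tactile frames from a GelSight-style sensor are encoded, projected into the token-embedding space of a language model, and prepended to an embedded text prompt so the model can reason about object properties. The module names, dimensions, and projector architecture here are illustrative assumptions, not the authors' actual implementation (see the linked repository for that).

```python
# Hypothetical sketch of a tactile-language pipeline in the spirit of Octopi.
# The projector shape, embedding sizes, and prompt layout are assumptions
# made for illustration only.
import torch
import torch.nn as nn


class TactileProjector(nn.Module):
    """Maps per-frame tactile embeddings into an LLM's token-embedding space."""

    def __init__(self, tactile_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(tactile_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, frame_embeddings: torch.Tensor) -> torch.Tensor:
        # frame_embeddings: (num_frames, tactile_dim) from a frozen tactile encoder
        return self.proj(frame_embeddings)  # (num_frames, llm_dim)


def build_prompt_embeddings(text_embeds: torch.Tensor,
                            tactile_embeds: torch.Tensor) -> torch.Tensor:
    """Prepend projected tactile tokens to the embedded text prompt so a
    (frozen) language model can condition its answer on the tactile video."""
    return torch.cat([tactile_embeds, text_embeds], dim=0)


if __name__ == "__main__":
    # Pretend five GelSight frames were already encoded by a visual encoder.
    frames = torch.randn(5, 768)
    projector = TactileProjector()
    tactile_tokens = projector(frames)   # (5, 4096)
    text_tokens = torch.randn(12, 4096)  # embedded property-reasoning question
    prompt = build_prompt_embeddings(text_tokens, tactile_tokens)
    print(prompt.shape)                  # torch.Size([17, 4096])
```

In such a setup, only the projector (and optionally lightweight adapters in the language model) would be trained, which is consistent with the abstract's claim of minimal language fine-tuning; the specific training recipe is described in the paper and repository rather than here.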