
One-Shot Open Affordance Learning with Foundation Models (2311.17776v1)

Published 29 Nov 2023 in cs.CV

Abstract: We introduce One-shot Open Affordance Learning (OOAL), where a model is trained with just one example per base object category, but is expected to identify novel objects and affordances. While vision-language models excel at recognizing novel objects and scenes, they often struggle to understand finer levels of granularity such as affordances. To handle this issue, we conduct a comprehensive analysis of existing foundation models, to explore their inherent understanding of affordances and assess the potential for data-limited affordance learning. We then propose a vision-language framework with simple and effective designs that boost the alignment between visual features and affordance text embeddings. Experiments on two affordance segmentation benchmarks show that the proposed method outperforms state-of-the-art models with less than 1% of the full training data, and exhibits reasonable generalization capability on unseen objects and affordances.
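
The abstract does not spell out the framework's details, but its core idea, scoring dense visual features against affordance text embeddings, can be illustrated with a minimal sketch. This is not the authors' implementation: the shapes, the `visual_feats` and `text_embeds` placeholders, the temperature value, and the bilinear upsampling step are all assumptions standing in for the paper's actual design.

```python
import torch
import torch.nn.functional as F

def affordance_segmentation(visual_feats, text_embeds, image_size):
    """Score dense visual features against affordance text embeddings.

    visual_feats: (B, H*W, D) patch features from a frozen vision encoder
                  (e.g. a DINOv2/CLIP-style backbone); placeholder here.
    text_embeds:  (K, D) embeddings of K affordance names from a text
                  encoder; placeholder here.
    Returns per-pixel affordance probabilities, shape (B, K, image_size, image_size).
    """
    B, N, D = visual_feats.shape
    H = W = int(N ** 0.5)  # assumes a square patch grid

    # L2-normalize both sides so the dot product is cosine similarity.
    v = F.normalize(visual_feats, dim=-1)
    t = F.normalize(text_embeds, dim=-1)

    # (B, N, K): similarity of every patch to every affordance label.
    logits = v @ t.T

    # Reshape to a feature map and upsample to full image resolution.
    logits = logits.permute(0, 2, 1).reshape(B, -1, H, W)
    logits = F.interpolate(logits, size=(image_size, image_size),
                           mode="bilinear", align_corners=False)

    # Temperature-scaled softmax over affordance classes per pixel
    # (0.07 is an assumed CLIP-style temperature, not the paper's value).
    return (logits / 0.07).softmax(dim=1)

# Dummy example: 16x16 patch grid, 512-d features, 5 affordance labels.
feats = torch.randn(1, 256, 512)
texts = torch.randn(5, 512)
probs = affordance_segmentation(feats, texts, image_size=224)
print(probs.shape)  # torch.Size([1, 5, 224, 224])
```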

