Learning Multi-Step Manipulation Tasks from A Single Human Demonstration (2312.15346v2)
Abstract: Learning from human demonstrations has achieved remarkable results in robot manipulation. However, developing a robot system that matches human data efficiency and generalization remains a challenge, particularly in complex, unstructured real-world scenarios. We propose a system that processes RGBD videos to translate human actions into robot primitives and identifies task-relevant key poses of objects using Grounded Segment Anything. We then address the challenges a robot faces in replicating human actions, accounting for human-robot differences in kinematics and collision geometry. To test the effectiveness of our system, we conducted experiments on manual dishwashing. With a single human demonstration recorded in a mockup kitchen, the system achieved a 50–100% success rate for each step and up to a 40% success rate for the whole task with different objects in a home kitchen. Videos are available at https://robot-dishwashing.github.io
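The perception pipeline the abstract describes (segment a task-relevant object in an RGBD frame with Grounded Segment Anything, lift the mask to a point cloud, and register an object model against it to recover a 6-DoF key pose) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `grounded_sam_mask` is a hypothetical stand-in for the Grounded-Segment-Anything repo's GroundingDINO-plus-SAM call, and Open3D's stock point-to-point ICP stands in for the registration step (a trimmed or otherwise robust variant, as in the trimmed-ICP reference below, would tolerate segmentation outliers better).

```python
import numpy as np
import open3d as o3d


def grounded_sam_mask(rgb: np.ndarray, prompt: str) -> np.ndarray:
    """Hypothetical stand-in for Grounded-Segment-Anything: GroundingDINO
    detects a box for the text prompt (e.g. "plate"), SAM segments inside
    it. Returns an HxW boolean mask."""
    raise NotImplementedError("plug in Grounded-Segment-Anything here")


def mask_to_cloud(depth_m: np.ndarray, mask: np.ndarray,
                  intrinsic: o3d.camera.PinholeCameraIntrinsic) -> o3d.geometry.PointCloud:
    """Lift only the masked depth pixels (meters) into a camera-frame cloud."""
    masked = np.where(mask, depth_m, 0.0).astype(np.float32)  # 0 = invalid pixel
    return o3d.geometry.PointCloud.create_from_depth_image(
        o3d.geometry.Image(masked), intrinsic, depth_scale=1.0)


def estimate_key_pose(model: o3d.geometry.PointCloud,
                      observed: o3d.geometry.PointCloud,
                      init: np.ndarray = np.eye(4)) -> np.ndarray:
    """Point-to-point ICP from the object model to the observed cloud;
    returns a 4x4 camera-frame object pose. ICP only refines locally, so
    `init` should come from a coarse aligner (centroids, global features)."""
    result = o3d.pipelines.registration.registration_icp(
        model, observed, 0.02, init,
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    return result.transformation
```

A per-step robot primitive (e.g. picking up the sponge) would then be parameterized by the key pose such a routine returns.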
- A. Amini, A. Selvam Periyasamy, and S. Behnke, “Yolopose: Transformer-based multi-object 6d pose estimation using keypoint regression,” in International Conference on Intelligent Autonomous Systems. Springer, 2022, pp. 392–406.
- S. P. Arunachalam, I. Güzey, S. Chintala, and L. Pinto, “Holo-dex: Teaching dexterity with immersive mixed reality,” in 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 5962–5969.
- S. P. Arunachalam, S. Silwal, B. Evans, and L. Pinto, “Dexterous imitation made easy: A learning-based framework for efficient dexterous manipulation,” in 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 5954–5961.
- O. Batchelor, “Multi-camera calibration using one or more calibration patterns,” May 2023. [Online]. Available: https://github.com/oliver-batchelor/multical
- P. Beeson and B. Ames, “Trac-ik: An open-source library for improved solving of generic inverse kinematics,” in 2015 IEEE-RAS 15th International Conference on Humanoid Robots (Humanoids). IEEE, 2015, pp. 928–935.
- H. Bharadhwaj, A. Gupta, and S. Tulsiani, “Visual affordance prediction for guiding robot exploration,” IEEE International Conference on Robotics and Automation (ICRA), 2023.
- T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.
- S. Caelles, K.-K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool, “One-shot video object segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 221–230.
- S. Calinon, F. Guenter, and A. Billard, “On learning, representing, and generalizing a task in a humanoid robot,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 37, no. 2, pp. 286–298, 2007.
- Y.-W. Chao, W. Yang, Y. Xiang, P. Molchanov, A. Handa, J. Tremblay, Y. S. Narang, K. Van Wyk, U. Iqbal, S. Birchfield et al., “Dexycb: A benchmark for capturing hand grasping of objects,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 9044–9053.
- D. Chetverikov, D. Svirko, D. Stepanov, and P. Krsek, “The trimmed iterative closest point algorithm,” in 2002 International Conference on Pattern Recognition, vol. 3. IEEE, 2002, pp. 545–548.
- E. Coumans and Y. Bai, “Pybullet, a python module for physics simulation for games, robotics and machine learning,” http://pybullet.org, 2016–2021.
- A. Darkhalil, D. Shan, B. Zhu, J. Ma, A. Kar, R. Higgins, S. Fidler, D. Fouhey, and D. Damen, “Epic-kitchens visor benchmark: Video segmentations and object relations,” Advances in Neural Information Processing Systems, vol. 35, pp. 13745–13758, 2022.
- S. Devgon, J. Ichnowski, A. Balakrishna, H. Zhang, and K. Goldberg, “Orienting novel 3d objects using self-supervised learning of rotation transforms,” in 2020 IEEE 16th International Conference on Automation Science and Engineering (CASE). IEEE, 2020, pp. 1453–1460.
- B. Drost, M. Ulrich, N. Navab, and S. Ilic, “Model globally, match locally: Efficient and robust 3d object recognition,” in 2010 IEEE computer society conference on computer vision and pattern recognition. IEEE, 2010, pp. 998–1005.
- C. Feichtenhofer, H. Fan, J. Malik, and K. He, “Slowfast networks for video recognition,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 6202–6211.
- Grounded-SAM Contributors, “Grounded-Segment-Anything,” Apr. 2023. [Online]. Available: https://github.com/IDEA-Research/Grounded-Segment-Anything
- F. Hagelskjær and A. G. Buch, “Pointvotenet: Accurate object detection and 6 dof pose estimation in point clouds,” in 2020 IEEE International Conference on Image Processing (ICIP). IEEE, 2020, pp. 2641–2645.
- T. Hodan, P. Haluza, Š. Obdržálek, J. Matas, M. Lourakis, and X. Zabulis, “T-less: An rgb-d dataset for 6d pose estimation of texture-less objects,” in 2017 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2017, pp. 880–888.
- Y. Hu, P. Fua, and M. Salzmann, “Perspective flow aggregation for data-limited 6d object pose estimation,” in European Conference on Computer Vision. Springer, 2022, pp. 89–106.
- L. Huang, J. Tan, J. Meng, J. Liu, and J. Yuan, “Hot-net: Non-autoregressive transformer for 3d hand-object pose estimation,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 3136–3145.
- J. Ji, R. Krishna, L. Fei-Fei, and J. C. Niebles, “Action genome: Actions as compositions of spatio-temporal scene graphs,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10236–10247.
- M. Kazhdan, M. Bolitho, and H. Hoppe, “Poisson surface reconstruction,” in Proceedings of the fourth Eurographics symposium on Geometry processing, 2006.
- A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, P. Dollár, and R. Girshick, “Segment anything,” arXiv:2304.02643, 2023.
- J. J. Kuffner and S. M. LaValle, “Rrt-connect: An efficient approach to single-query path planning,” in Proceedings 2000 ICRA. Millennium Conference. IEEE International Conference on Robotics and Automation. Symposia Proceedings (Cat. No. 00CH37065), vol. 2. IEEE, 2000, pp. 995–1001.
- T. Kunz and M. Stilman, “Time-optimal trajectory generation for path following with bounded acceleration and velocity,” Robotics: Science and Systems VIII, pp. 1–8, 2012.
- Y. Labbé, J. Carpentier, M. Aubry, and J. Sivic, “Cosypose: Consistent multi-view multi-object 6d pose estimation,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVII 16. Springer, 2020, pp. 574–591.
- S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1334–1373, 2016.
- F. Li, S. R. Vutukur, H. Yu, I. Shugurov, B. Busam, S. Yang, and S. Ilic, “Nerf-pose: A first-reconstruct-then-regress approach for weakly-supervised 6d object pose estimation.”
- T. Lin, X. Liu, X. Li, E. Ding, and S. Wen, “Bmn: Boundary-matching network for temporal action proposal generation,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 3889–3898.
- Y. Lin, A. S. Wang, G. Sutanto, A. Rai, and F. Meier, “Polymetis,” https://facebookresearch.github.io/fairo/polymetis/, 2021.
- S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, C. Li, J. Yang, H. Su, J. Zhu et al., “Grounding dino: Marrying dino with grounded pre-training for open-set object detection,” arXiv preprint arXiv:2303.05499, 2023.
- M. T. Mason and J. K. Salisbury Jr., “Robot hands and the mechanics of manipulation.” MIT Press, 1985.
- E. Olson, “Apriltag: A robust and flexible visual fiducial system,” in 2011 IEEE International Conference on Robotics and Automation, 2011, pp. 3400–3407.
- A. Padalkar, A. Pooley, A. Jain, A. Bewley, A. Herzog, A. Irpan, A. Khazatsky, A. Rai, A. Singh, A. Brohan et al., “Open x-embodiment: Robotic learning datasets and rt-x models,” arXiv preprint arXiv:2310.08864, 2023.
- C. Pan, B. Okorn, H. Zhang, B. Eisner, and D. Held, “Tax-pose: Task-specific cross-pose estimation for robot manipulation,” in Conference on Robot Learning. PMLR, 2023, pp. 1783–1792.
- F. Perazzi, A. Khoreva, R. Benenson, B. Schiele, and A. Sorkine-Hornung, “Learning video object segmentation from static images,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2663–2672.
- F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung, “A benchmark dataset and evaluation methodology for video object segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 724–732.
- Y. Qin, Y.-H. Wu, S. Liu, H. Jiang, R. Yang, Y. Fu, and X. Wang, “Dexmv: Imitation learning for dexterous manipulation from human videos,” in European Conference on Computer Vision. Springer, 2022, pp. 570–587.
- A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763.
- D. Shan, J. Geng, M. Shu, and D. F. Fouhey, “Understanding human hands in contact at internet scale,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 9869–9878.
- S. H. Shivakumar, M. Oberweger, M. Rad, and V. Lepetit, “Ho-3d: A multi-user, multi-object dataset for joint 3d hand-object pose estimation,” 2019.
- A. Sivakumar, K. Shaw, and D. Pathak, “Robotic telekinesis: Learning a robotic hand imitator by watching humans on youtube,” RSS, 2022.
- S. Song, A. Zeng, J. Lee, and T. Funkhouser, “Grasping in the wild: Learning 6dof closed-loop grasping from low-cost demonstrations,” IEEE Robotics and Automation Letters, 2020.
- Y. Su, M. Saleh, T. Fetzer, J. Rambach, N. Navab, B. Busam, D. Stricker, and F. Tombari, “Zebrapose: Coarse to fine surface encoding for 6dof object pose estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 6738–6748.
- M. Sundermeyer, T. Hodaň, Y. Labbe, G. Wang, E. Brachmann, B. Drost, C. Rother, and J. Matas, “Bop challenge 2022 on detection, segmentation and pose estimation of specific rigid objects,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2784–2793.
- H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.
- V-HACD Contributors, “The V-HACD library decomposes a 3D surface into a set of “near” convex parts,” Oct. 2022. [Online]. Available: https://github.com/kmammou/v-hacd
- J. Vidal, C.-Y. Lin, X. Lladó, and R. Martí, “A method for 6d pose estimation of free-form rigid objects using point pair features on range data,” Sensors, vol. 18, no. 8, p. 2678, 2018.
- P. Voigtlaender and B. Leibe, “Online adaptation of convolutional neural networks for video object segmentation,” arXiv preprint arXiv:1706.09364, 2017.
- H. Walke, K. Black, A. Lee, M. J. Kim, M. Du, C. Zheng, T. Zhao, P. Hansen-Estruch, Q. Vuong, A. He, V. Myers, K. Fang, C. Finn, and S. Levine, “Bridgedata v2: A dataset for robot learning at scale,” in Conference on Robot Learning (CoRL), 2023.
- G. Wang, F. Manhardt, F. Tombari, and X. Ji, “Gdr-net: Geometry-guided direct regression network for monocular 6d object pose estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 16611–16621.
- J. Wang, S. Dasari, M. K. Srirama, S. Tulsiani, and A. Gupta, “Manipulate by seeing: Creating manipulation controllers from pre-trained representations,” ICCV, 2023.
- W. Wang, M. Feiszli, H. Wang, and D. Tran, “Unidentified video objects: A benchmark for dense, open-world segmentation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10776–10785.
- B. Wen, W. Lian, K. Bekris, and S. Schaal, “You only demonstrate once: Category-level manipulation from single visual demonstration,” arXiv preprint arXiv:2201.12716, 2022.
- Y. Wu, A. Javaheri, M. Zand, and M. Greenspan, “Keypoint cascade voting for point cloud based 6dof pose estimation,” in 2022 International Conference on 3D Vision (3DV). IEEE, 2022, pp. 176–186.
- N. Xu, L. Yang, Y. Fan, D. Yue, Y. Liang, J. Yang, and T. Huang, “Youtube-vos: A large-scale video object segmentation benchmark,” 2018.
- L. Yang, Y. Wang, X. Xiong, J. Yang, and A. K. Katsaggelos, “Efficient video object segmentation via network modulation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 6499–6507.
- L. Yen-Chen, P. Florence, J. T. Barron, A. Rodriguez, P. Isola, and T.-Y. Lin, “inerf: Inverting neural radiance fields for pose estimation,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021, pp. 1323–1330.
- T. Zhang, Z. McCarthy, O. Jow, D. Lee, X. Chen, K. Goldberg, and P. Abbeel, “Deep imitation learning for complex manipulation tasks from virtual reality teleoperation,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 5628–5635.
- X. Zhang, Y. Lu, W. Wang, A. Yan, J. Yan, L. Qin, H. Wang, X. Yan, W. Y. Wang, and L. R. Petzold, “GPT-4V(ision) as a generalist evaluator for vision-language tasks,” arXiv preprint arXiv:2311.01361, 2023.
- Z. Zhang, W. Chen, L. Zheng, A. Leonardis, and H. J. Chang, “Trans6d: Transformer-based 6d object pose estimation and refinement,” in European Conference on Computer Vision. Springer, 2022, pp. 112–128.
- T. Z. Zhao, V. Kumar, S. Levine, and C. Finn, “Learning fine-grained bimanual manipulation with low-cost hardware,” arXiv preprint arXiv:2304.13705, 2023.
- Y. Zhu, Z. Jiang, P. Stone, and Y. Zhu, “Learning generalizable manipulation policies with object-centric 3d representations,” in 7th Annual Conference on Robot Learning, 2023. [Online]. Available: https://openreview.net/forum?id=9SM6l0HyY_
- Y. Zhu, A. Joshi, P. Stone, and Y. Zhu, “VIOLA: Object-centric imitation learning for vision-based robot manipulation,” in 6th Annual Conference on Robot Learning, 2022. [Online]. Available: https://openreview.net/forum?id=L8hCfhPbFho
- L. Zou, Z. Huang, N. Gu, and G. Wang, “6d-vit: Category-level 6d object pose estimation via transformer-based instance representation learning,” IEEE Transactions on Image Processing, vol. 31, pp. 6907–6921, 2022.