DEFT: Dexterous Fine-Tuning for Real-World Hand Policies (2310.19797v2)
Abstract: Dexterity is often seen as a cornerstone of complex manipulation. Humans perform a host of skills with their hands, from preparing food to operating tools. In this paper, we investigate the challenges of dexterous manipulation, especially with soft, deformable objects and complex, relatively long-horizon tasks. Learning such behaviors from scratch, however, can be data-inefficient. To circumvent this, we propose a novel approach, DEFT (DExterous Fine-Tuning for Hand Policies), that leverages human-driven priors executed directly in the real world. To improve upon these priors, DEFT employs an efficient online optimization procedure. By integrating human-based learning with online fine-tuning on a soft robotic hand, DEFT succeeds across a variety of tasks, establishing a robust, data-efficient pathway toward general dexterous manipulation. Video results are available at https://dexterous-finetuning.github.io.
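The abstract describes refining a human-driven prior with an efficient online optimization procedure. As a minimal illustration of that idea, the sketch below refines a prior parameter vector with a cross-entropy-method-style sampling loop; the parameterization, reward function, and every name here are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch: online fine-tuning of a human-driven prior via a
# CEM-style sampling loop. All names and the toy reward are assumptions.
import numpy as np

def cem_finetune(prior_params, reward_fn, n_iters=5, pop=16,
                 elite_frac=0.25, init_std=0.05, seed=0):
    """Refine a prior parameter vector by iteratively sampling
    perturbations, scoring them, and refitting to the elites."""
    rng = np.random.default_rng(seed)
    mean = np.asarray(prior_params, dtype=float)
    std = np.full_like(mean, init_std)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(n_iters):
        # Sample candidate parameters around the current estimate.
        samples = mean + std * rng.standard_normal((pop, mean.size))
        rewards = np.array([reward_fn(s) for s in samples])
        # Keep the best-scoring candidates and refit the distribution.
        elites = samples[np.argsort(rewards)[-n_elite:]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean

# Toy usage: the reward peaks at a target the prior only approximates,
# standing in for task success measured from real-world rollouts.
target = np.array([0.3, -0.1, 0.2])
prior = target + 0.1                      # imperfect human-driven prior
reward = lambda p: -np.sum((p - target) ** 2)
refined = cem_finetune(prior, reward)
```

In this toy setup, the refined parameters land closer to the optimum than the prior did, mirroring how online fine-tuning is meant to improve upon the human-provided initialization rather than replace it.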