Towards Generalizable Zero-Shot Manipulation via Translating Human Interaction Plans

(2312.00775)
Published Dec 1, 2023 in cs.RO , cs.CV , and cs.LG

Abstract

We pursue the goal of developing robots that can interact zero-shot with generic unseen objects via a diverse repertoire of manipulation skills and show how passive human videos can serve as a rich source of data for learning such generalist robots. Unlike typical robot learning approaches which directly learn how a robot should act from interaction data, we adopt a factorized approach that can leverage large-scale human videos to learn how a human would accomplish a desired task (a human plan), followed by translating this plan to the robot's embodiment. Specifically, we learn a human plan predictor that, given a current image of a scene and a goal image, predicts the future hand and object configurations. We combine this with a translation module that learns a plan-conditioned robot manipulation policy, and allows following human plans for generic manipulation tasks in a zero-shot manner with no deployment-time training. Importantly, while the plan predictor can leverage large-scale human videos for learning, the translation module only requires a small amount of in-domain data, and can generalize to tasks not seen during training. We show that our learned system can perform over 16 manipulation skills that generalize to 40 objects, encompassing 100 real-world tasks for table-top manipulation and diverse in-the-wild manipulation. https://homangab.github.io/hopman/

Overview

  • The paper introduces a system that enables robots to perform tasks they were never specifically trained on by translating human interaction plans into robot actions.

  • A plan predictor, trained on human videos rather than robot data, predicts how a scene should evolve from a current image to a goal image, producing interaction plans for the robot to execute.

  • The translation module converts these interaction plans into physical robot actions, requiring only a small amount of paired human-robot data for training.

  • Experiments showed the robot could generalize its skills to new tasks and objects in both structured and unstructured environments like offices and kitchens.

  • The research highlights how robots can efficiently learn from human-provided visual data, expanding the potential of zero-shot manipulation in robotics.

Introduction

In robotics, teaching machines to perform tasks they were never explicitly trained on is a tantalizing yet challenging goal. This work addresses the challenge by translating human interaction plans into robot actions, building on the observation that humans perform a vast array of manipulations that robots could learn to emulate.

The Underlying Approach

At the core of this approach is a two-part system. The first part is a plan predictor learned from human interactions: given a current image and a goal image, it predicts future hand and object configurations, forming an interaction plan. The second part is a translation module that converts these plans into actions the robot can execute.

Rather than relying on robot interaction data, the plan predictor learns predominantly from large-scale human videos available on the web. The translation module, in contrast, requires only a small set of in-domain training data. Together, the two components allow the system to handle a broad range of tasks and objects without any additional training at deployment time, as sketched below.
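
The factorized pipeline can be summarized with a simple interface. The sketch below is illustrative only: the class names, observation format, and environment API are assumptions made for this summary, not the paper's code.

```python
# Minimal sketch of the factorized pipeline; all names and shapes are illustrative.
import numpy as np


class PlanPredictor:
    """Trained on large-scale human videos: (current image, goal image) -> human interaction plan."""

    def predict(self, current_image: np.ndarray, goal_image: np.ndarray) -> list:
        # Returns a sequence of future hand/object configurations (e.g. segmentation masks).
        raise NotImplementedError


class TranslationModule:
    """Trained on a small amount of in-domain robot data: a plan-conditioned robot policy."""

    def act(self, observation: np.ndarray, plan: list) -> np.ndarray:
        # Returns a robot action that follows the next step of the human plan.
        raise NotImplementedError


def zero_shot_rollout(env, plan_predictor, translator, goal_image, horizon=50):
    """Execute a task specified only by a goal image, with no deployment-time training.

    `env` is a hypothetical environment exposing reset()/step() and an "image" observation.
    """
    obs = env.reset()
    plan = plan_predictor.predict(obs["image"], goal_image)
    for _ in range(horizon):
        action = translator.act(obs["image"], plan)
        obs = env.step(action)
    return obs
```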

Learning from Humans

A notable aspect of the system is that interaction plans are represented visually as hand and object masks, focusing on motion rather than attempting full image prediction. A diffusion model trained on videos of human interactions produces likely sequences of future masks representing hand and object movements.
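
The summary does not spell out the exact architecture; the snippet below is a minimal sketch of how a conditional diffusion model could sample future hand/object masks given current and goal images. The `denoiser` network, tensor shapes, and noise schedule are all assumptions for illustration.

```python
# Hedged sketch: DDPM-style reverse sampling of future hand/object masks,
# conditioned on the current and goal images. Names, shapes, and schedule are illustrative.
import torch


@torch.no_grad()
def sample_future_masks(denoiser, current_img, goal_img,
                        num_frames=8, mask_hw=(64, 64), steps=100):
    """denoiser(noisy_masks, t, cond) -> predicted noise; cond packs current/goal images."""
    device = current_img.device
    betas = torch.linspace(1e-4, 0.02, steps, device=device)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    cond = torch.cat([current_img, goal_img], dim=1)          # condition on start and goal
    x = torch.randn(1, num_frames, *mask_hw, device=device)   # start from pure noise

    for t in reversed(range(steps)):
        eps = denoiser(x, torch.tensor([t], device=device), cond)
        mean = (x - (1.0 - alphas[t]) / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise

    return x.sigmoid()  # per-frame hand/object masks in [0, 1]
```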

For physical execution, the translation module is trained on a limited set of paired human-robot data, learning to map human manipulation plans to robot movements; the resulting policy is then tested in real-world environments.
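
As a rough illustration, such a plan-conditioned policy could be fit with simple behavior cloning on the paired data. The dataset fields, feature encoders, network sizes, and loss below are assumptions for the sketch, not the paper's exact recipe.

```python
# Hedged sketch: behavior-cloning a plan-conditioned robot policy on a small
# paired human-robot dataset. All dimensions and the loss are illustrative.
import torch
import torch.nn as nn


class PlanConditionedPolicy(nn.Module):
    def __init__(self, obs_dim=512, plan_dim=512, action_dim=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + plan_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, obs_feat, plan_feat):
        # Concatenate observation and plan features, predict a robot action.
        return self.net(torch.cat([obs_feat, plan_feat], dim=-1))


def train_translation_module(policy, loader, epochs=10, lr=1e-4):
    """`loader` yields (obs_feat, plan_feat, expert_action) from the paired human-robot data."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        for obs_feat, plan_feat, expert_action in loader:
            loss = nn.functional.mse_loss(policy(obs_feat, plan_feat), expert_action)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return policy
```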

Experimentation and Generalization

The experiments assessing this framework involved both a table-top setup with a robot arm and in-the-wild manipulation, where the robot operated in unstructured environments such as offices and kitchens. With a repertoire of over 16 skills applied to 40 different objects, the robot displayed a significant ability to generalize across tasks, demonstrating manipulation skills in diverse, unforeseen situations.

Evaluated against structured generalization criteria spanning object categories, instances, skills, and configurations, the research shows how a robot can acquire manipulation skills without on-site training. The system proved especially effective at translating human interactions observed in video into robot actions, pushing the envelope of zero-shot manipulation in robotics.
