MimicPlay: Long-Horizon Imitation Learning by Watching Human Play (2302.12422v2)
Abstract: Imitation learning from human demonstrations is a promising paradigm for teaching robots manipulation skills in the real world. However, learning complex long-horizon tasks often requires an unattainable amount of demonstrations. To reduce the high data requirement, we resort to human play data - video sequences of people freely interacting with the environment using their hands. Even with different morphologies, we hypothesize that human play data contain rich and salient information about physical interactions that can readily facilitate robot policy learning. Motivated by this, we introduce a hierarchical learning framework named MimicPlay that learns latent plans from human play data to guide low-level visuomotor control trained on a small number of teleoperated demonstrations. With systematic evaluations of 14 long-horizon manipulation tasks in the real world, we show that MimicPlay outperforms state-of-the-art imitation learning methods in task success rate, generalization ability, and robustness to disturbances. Code and videos are available at https://mimic-play.github.io
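The abstract describes a two-level hierarchy: a high-level planner trained on human play video produces a latent plan, and a low-level visuomotor policy, trained on a small number of teleoperated demonstrations, conditions on that plan to output robot actions. The sketch below illustrates only this control-flow idea with random linear maps; all class names, dimensions, and the frequency split are hypothetical placeholders, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

class LatentPlanner:
    """High-level planner (hypothetical sketch): maps an observation
    embedding and a goal-image embedding to a latent plan vector.
    In MimicPlay this level is learned from human play data; here it
    is just a fixed random linear map for illustration."""
    def __init__(self, embed_dim=64, plan_dim=16):
        self.W = rng.normal(size=(plan_dim, 2 * embed_dim)) / np.sqrt(2 * embed_dim)

    def plan(self, obs_embed, goal_embed):
        # Latent plan summarizing "how to get from obs toward goal".
        return np.tanh(self.W @ np.concatenate([obs_embed, goal_embed]))

class VisuomotorPolicy:
    """Low-level controller (hypothetical sketch): conditions on the
    latent plan plus proprioception and emits an end-effector action.
    In MimicPlay this level is trained on a small set of teleoperated
    robot demonstrations."""
    def __init__(self, plan_dim=16, proprio_dim=7, action_dim=7):
        self.W = rng.normal(size=(action_dim, plan_dim + proprio_dim)) * 0.1

    def act(self, latent_plan, proprio):
        return self.W @ np.concatenate([latent_plan, proprio])

# One hierarchical step: plan once, then act conditioned on the plan.
planner, policy = LatentPlanner(), VisuomotorPolicy()
obs_embed, goal_embed = rng.normal(size=64), rng.normal(size=64)
z = planner.plan(obs_embed, goal_embed)   # latent plan, shape (16,)
action = policy.act(z, np.zeros(7))       # 7-DoF action, shape (7,)
```

In the paper the point of this split is data efficiency: the expensive-to-collect robot demonstrations only have to teach the low-level controller, while long-horizon task structure comes from cheap human play video.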