Dreamitate: Real-World Visuomotor Policy Learning via Video Generation

(arXiv:2406.16862)
Published Jun 24, 2024 in cs.RO and cs.CV

Abstract

A key challenge in manipulation is learning a policy that can robustly generalize to diverse visual environments. A promising mechanism for learning robust policies is to leverage video generative models, which are pretrained on large-scale datasets of internet videos. In this paper, we propose a visuomotor policy learning framework that fine-tunes a video diffusion model on human demonstrations of a given task. At test time, we generate an example of an execution of the task conditioned on images of a novel scene, and use this synthesized execution directly to control the robot. Our key insight is that using common tools allows us to effortlessly bridge the embodiment gap between the human hand and the robot manipulator. We evaluate our approach on four tasks of increasing complexity and demonstrate that harnessing internet-scale generative models allows the learned policy to achieve a significantly higher degree of generalization than existing behavior cloning approaches.

Figure: The Dreamitate framework fine-tunes a video generative model for real-world visuomotor policy learning.

Overview

  • Dreamitate enhances visuomotor policy learning by fine-tuning a video diffusion model on human demonstrations, so that learned robot behaviors generalize across varied visual environments.

  • The method uses common, trackable tools to bridge the embodiment gap between human and robot manipulation: the tool's trajectory in the synthesized video is translated directly into robot actions.

  • The approach was validated on four real-world tasks, showing superior performance over an existing behavior cloning baseline in precision, multi-step planning, and generalization to new scenarios.

Dreamitate: Real-World Visuomotor Policy Learning via Video Generation

Dreamitate presents an innovative strategy for visuomotor policy learning in robotics by leveraging video generation models to enhance the generalization of policies across diverse visual environments. Existing behavior cloning (BC) methods, while effective, are limited by their reliance on ground truth actions for the robot, which hinders scalability and generalization. Dreamitate addresses these limitations by fine-tuning video diffusion models on human demonstrations and using the resultant synthesized videos to guide real-world robot actions.
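
The pipeline can be pictured as a short test-time loop: generate a video of the task being performed in the observed scene, track the tool in the generated frames, and replay that trajectory on the robot. The sketch below is a simplified illustration under stated assumptions, not the paper's implementation; video_model, tracker, robot, and the helper names are hypothetical interfaces, and the tool-in-gripper transform T_ee_tool is assumed to come from a prior calibration step.

```python
import numpy as np

def dreamitate_rollout(stereo_obs, video_model, tracker, robot, T_ee_tool):
    """Hypothetical test-time loop: synthesize an execution video of the tool
    performing the task, track the tool's 6-DoF pose in each generated frame,
    and replay that trajectory with the robot's end-effector."""
    # 1. Generate a video of the task conditioned on stereo images of the
    #    novel scene, using the fine-tuned video diffusion model.
    video = video_model.generate(condition_frames=stereo_obs)

    # 2. Estimate the tool's pose (4x4 homogeneous transform in the world
    #    frame) in every synthesized frame.
    tool_poses = [tracker.estimate_pose(frame) for frame in video]

    # 3. Map each tool pose to an end-effector target via the fixed pose of
    #    the tool in the gripper frame (T_ee_tool), then execute the waypoints.
    for T_world_tool in tool_poses:
        T_world_ee = T_world_tool @ np.linalg.inv(T_ee_tool)
        robot.move_to_pose(T_world_ee)
```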

Key Contributions

  1. Video Model Fine-Tuning: The researchers fine-tune a video diffusion model pre-trained on large-scale internet video datasets, using stereo video recordings of human demonstrations. The premise is that pre-trained video models capture extensive priors over human activities that transfer to robot behaviors, so fine-tuning preserves the models' generalization ability while adapting them to specific tasks (a simplified training-step sketch follows this list).

  2. Tool-Based Transfer Learning: A critical insight of the work is the use of common, trackable tools to bridge the embodiment gap between human demonstrators and robot manipulators. Human actions are translated to robot actions by tracking the tool's trajectory in the synthesized video and executing that trajectory with the robot's end-effector.

  3. Real-World Task Evaluation: Dreamitate was validated on four tasks of increasing complexity: object rotation, granular material scooping, tabletop sweeping, and shape pushing. These tasks were designed to test various aspects of manipulation, including precision, multi-step planning, and generalization to unseen scenarios. The experiments demonstrated that Dreamitate yields superior performance compared to the state-of-the-art Diffusion Policy.
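
As a rough illustration of item 1, the following sketch shows a standard epsilon-prediction denoising step for a video diffusion model conditioned on initial observation frames. The denoiser call signature, the linear-beta noise schedule, and the single-conditioning-frame simplification are assumptions for illustration and do not reflect the paper's exact architecture or training recipe.

```python
import torch
import torch.nn.functional as F

def diffusion_finetune_step(denoiser, optimizer, demo_clip, cond_frames, num_steps=1000):
    """One illustrative fine-tuning step on a human-demonstration clip.

    demo_clip:   (B, T, C, H, W) demonstration video (target to generate)
    cond_frames: (B, C, H, W)    initial observation the model is conditioned on
    """
    B = demo_clip.shape[0]
    device = demo_clip.device

    # Sample a diffusion timestep and corrupt the clip with Gaussian noise
    # (simple linear-beta schedule shown; real models use their own schedule).
    t = torch.randint(0, num_steps, (B,), device=device)
    betas = torch.linspace(1e-4, 0.02, num_steps, device=device)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t].view(B, 1, 1, 1, 1)
    noise = torch.randn_like(demo_clip)
    noisy_clip = alpha_bar.sqrt() * demo_clip + (1 - alpha_bar).sqrt() * noise

    # Predict the injected noise given the corrupted clip, the timestep,
    # and the conditioning frame; regress it with an MSE loss.
    pred = denoiser(noisy_clip, t, cond_frames)
    loss = F.mse_loss(pred, noise)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```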

Experimental Results

Across all four tasks, the empirical evaluation showed consistent gains over the Diffusion Policy baseline:

  • Rotation Task: Dreamitate achieved a success rate of 92.5%, significantly outperforming the 55% success rate of Diffusion Policy. Failures in the baseline approach were mainly due to improper grasping points and slippage during manipulation.
  • Scooping Task: Dreamitate exhibited an 85% success rate in transferring granular material, compared to 55% for Diffusion Policy. The improvement was attributed to more robust handling of the small tool and more precise positioning.
  • Sweeping Task: Dreamitate achieved a 92.5% success rate, while Diffusion Policy managed only 12.5%. The baseline struggled with obstacle avoidance and the multimodal nature of the task, both of which Dreamitate handled effectively.
  • Push-Shape Task: On this long-horizon task, Dreamitate achieved a mean intersection-over-union (mIoU) of 0.731 and an average rotation error of 8.0 degrees, compared to Diffusion Policy's 0.550 mIoU and 48.2-degree rotation error (both metrics are sketched below).
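
For reference, the Push-Shape metrics can be computed along the following lines. This is a generic sketch of mIoU over final-versus-goal shape masks and of absolute rotation error; the paper's exact evaluation protocol may differ.

```python
import numpy as np

def mean_iou(pred_masks, goal_masks):
    """Mean intersection-over-union between final and goal shape masks
    (boolean arrays of shape (N, H, W)); higher is better."""
    ious = []
    for pred, goal in zip(pred_masks, goal_masks):
        inter = np.logical_and(pred, goal).sum()
        union = np.logical_or(pred, goal).sum()
        ious.append(inter / union if union > 0 else 1.0)
    return float(np.mean(ious))

def rotation_error_deg(theta_final, theta_goal):
    """Smallest absolute angular difference in degrees; lower is better."""
    diff = (theta_final - theta_goal + 180.0) % 360.0 - 180.0
    return abs(diff)
```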

Implications and Future Work

The success of Dreamitate has far-reaching implications. Practically, the method enables the development of more autonomous and adaptable robotic systems capable of performing complex manipulation tasks in varied environments without extensive reprogramming. Theoretically, the approach underscores the potential of integrating generative models with control policies to enhance the adaptability of robotic systems.

However, limitations remain, such as reliance on visually trackable tools and increased computational demands for real-time closed-loop control. Future research may focus on incorporating advancements in object tracking and accelerating video model inference to address these issues. Additionally, extending the approach to more complex and dynamic environments could further validate the method's robustness and applicability.

In conclusion, Dreamitate represents a significant step in the evolution of visuomotor policy learning, demonstrating that leveraging large-scale generative models can substantially enhance the generalization and scalability of robotic manipulation policies. The research opens new avenues for integrating generative models with robotics, fostering advancements in both fields.
