Vision-based Manipulation from Single Human Video with Open-World Object Graphs

(2405.20321)
Published May 30, 2024 in cs.RO, cs.CV, and cs.LG

Abstract

We present an object-centric approach to empower robots to learn vision-based manipulation skills from human videos. We investigate the problem of imitating robot manipulation from a single human video in the open-world setting, where a robot must learn to manipulate novel objects from one video demonstration. We introduce ORION, an algorithm that tackles the problem by extracting an object-centric manipulation plan from a single RGB-D video and deriving a policy that conditions on the extracted plan. Our method enables the robot to learn from videos captured by daily mobile devices such as an iPad and generalize the policies to deployment environments with varying visual backgrounds, camera angles, spatial layouts, and novel object instances. We systematically evaluate our method on both short-horizon and long-horizon tasks, demonstrating the efficacy of ORION in learning from a single human video in the open world. Videos can be found on the project website https://ut-austin-rpl.github.io/ORION-release.

Figure: Plan generation process in ORION, including tracking, keyframe identification, and Open-world Object Graph creation.

Overview

  • The paper presents ORION, a methodology for teaching robots to perform manipulation tasks by imitating the actions observed in a single human video, targeting open-world settings with novel objects and varied deployment conditions.

  • Central to the approach are Open-world Object Graphs (OOGs), a graph-based, object-centric representation that captures the states and interactions of task-relevant objects throughout the task, facilitating a seamless transition from human demonstration to robot execution.

  • Extensive experimental validation shows that ORION achieves a 69.3% average success rate across a range of manipulation tasks, surpassing hand-motion-imitation and dense-correspondence baselines and indicating strong generalization and robustness under varied real-world conditions.

The paper "Vision-based Manipulation from Single Human Video with Open-World Object Graphs" introduces a novel approach for teaching robots to perform manipulation tasks by imitating actions observed in a single human video. This methodology is particularly designed for scenarios involving novel objects and dynamic environments, leveraging an object-centric representation to infer and execute the manipulation tasks. The proposed algorithm, ORION (Open-world video ImitatiON), leverages recent advancements in vision foundation models to achieve robust generalization across diverse spatial layouts, visual backgrounds, and novel object instances.

Summary of Contributions

The key contributions of the paper are threefold:

  1. Problem Framing: The paper formulates the challenge of learning vision-based robot manipulation from a single human video in an open-world context, involving varied visual backgrounds, camera angles, and spatial configurations.
  2. Open-world Object Graphs (OOGs): OOGs are introduced as a graph-based, object-centric representation that captures the states and interactions of task-relevant objects, facilitating the transition from human demonstration to robot execution.
  3. ORION Algorithm: The ORION algorithm constructs manipulation policies directly from single RGB-D video demonstrations, ensuring generalizability to different environmental conditions and object instances.

Technical Approach

Object Tracking and Keyframe Detection

The process begins with localizing task-relevant objects in the human video using open-world vision models like Grounded-SAM for initial frame annotation, followed by propagation using video object segmentation models such as Cutie. Keyframes are identified based on the velocity statistics of tracked keypoints, capturing critical transitions in object contact relations.
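
For concreteness, the sketch below shows one plausible way to select keyframes from keypoint velocity statistics, assuming 3D keypoint tracks are already available; the thresholds, array shapes, and function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.signal import find_peaks

def detect_keyframes(keypoints, speed_thresh=0.002, min_gap=10):
    """Pick keyframes where the mean keypoint speed dips to a local minimum.

    keypoints: (T, N, 3) array of tracked 3D keypoint positions over T frames.
    Returns sorted frame indices, always including the first and last frame.
    (Illustrative sketch; thresholds and shapes are assumptions.)
    """
    # Per-frame mean speed over all tracked keypoints.
    velocities = np.diff(keypoints, axis=0)              # (T-1, N, 3)
    speed = np.linalg.norm(velocities, axis=-1).mean(1)  # (T-1,)

    # Local minima of speed ~ moments where objects settle,
    # which often coincide with changes in contact relations.
    dips, _ = find_peaks(-speed, distance=min_gap)
    dips = [int(d) for d in dips if speed[d] < speed_thresh]

    return sorted({0, *dips, len(keypoints) - 1})
```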

Open-world Object Graph Construction

OOGs are generated for each keyframe, encapsulating object node features (3D point clouds) and hand interaction cues obtained from hand-reconstruction models like HaMeR. The edges within OOGs represent contact relationships, enabling robust association and mapping of objects and their interactions across frames.
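
The sketch below illustrates one plausible data structure for such a graph, assuming per-object point clouds and hand cues have already been extracted; all class and field names are hypothetical and are not taken from the ORION codebase.

```python
from dataclasses import dataclass, field
from typing import Optional
import numpy as np

@dataclass
class ObjectNode:
    """One task-relevant object at a keyframe (fields are illustrative)."""
    object_id: int
    points: np.ndarray                      # (P, 3) point cloud from the RGB-D keyframe
    keypoints: Optional[np.ndarray] = None  # (K, 3) tracked points on the object

@dataclass
class OpenWorldObjectGraph:
    """Graph for one keyframe: object nodes, a hand cue, and contact edges."""
    frame_index: int
    objects: list[ObjectNode] = field(default_factory=list)
    hand_pose: Optional[np.ndarray] = None  # e.g. a wrist pose from a hand-reconstruction model
    contacts: set[tuple[int, int]] = field(default_factory=set)  # unordered object-id pairs

    def add_contact(self, a: int, b: int) -> None:
        """Record that objects a and b are in contact at this keyframe."""
        self.contacts.add((min(a, b), max(a, b)))
```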

Policy Construction and Execution

The ORION policy dynamically retrieves keyframes from the manipulation plan by matching observed object states with precomputed OOGs. Trajectories are predicted by warping video-observed keypoint motions, and SE(3) transformations are optimized to align these trajectories with the robot's end-effector actions. This structured optimization ensures the robot's actions are accurately guided by the observed human demonstration, effectively generalizing across varied environmental conditions.
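
As an illustration of the alignment step, the sketch below fits a rigid SE(3) transform between corresponding 3D keypoints using the standard closed-form Kabsch solution and applies it to a video-observed trajectory; this is a generic stand-in, not the paper's exact optimization.

```python
import numpy as np

def fit_se3(source: np.ndarray, target: np.ndarray):
    """Least-squares rigid transform (R, t) mapping source points onto target points.

    source, target: (N, 3) corresponding 3D keypoints.
    Classic Kabsch solution via SVD of the cross-covariance matrix.
    """
    mu_s, mu_t = source.mean(0), target.mean(0)
    H = (source - mu_s).T @ (target - mu_t)
    U, _, Vt = np.linalg.svd(H)
    # Guard against reflections so R is a proper rotation.
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ S @ U.T
    t = mu_t - R @ mu_s
    return R, t

def retarget_trajectory(video_traj: np.ndarray, R: np.ndarray, t: np.ndarray):
    """Map a (T, 3) keypoint trajectory from the video scene into the robot scene."""
    return video_traj @ R.T + t
```

In a pipeline of this kind, the fitted transform could then be composed with gripper pose targets at each retrieved keyframe to produce end-effector commands.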

Experimental Validation

The efficacy of ORION is systematically evaluated on a series of manipulation tasks, including both short-horizon single-action tasks and long-horizon multi-stage tasks. Experiments show that ORION achieves an average success rate of 69.3% in diverse real-world scenarios, a strong result given the spatial variability and the presence of novel objects.

Comparative Analysis

ORION is compared against baselines such as Hand-motion-imitation and Dense-Correspondence. Results show that the object-centric approach significantly outperforms hand-centric imitation, owing to its robustness in reaching target object configurations and generalizing to new spatial setups. Point tracking (TAP) further improves performance by accurately capturing critical keyframes and motion features, outperforming dense-correspondence alternatives based on optical flow.

Implications and Future Directions

The implications of this research are significant for advancing robot autonomy in complex, unstructured environments. The object-centric abstraction and use of open-world vision models enable robots to learn and execute tasks from readily available human videos, such as those found on the internet.

Future research directions include addressing the limitations related to video capturing constraints, such as moving cameras and reliance on RGB-D data. Enhancing the system to infer human intentions from inherently ambiguous video data, leveraging both semantic and geometric information for object correspondence, and reconstructing scenes from dynamic video streams present promising avenues for further investigation.

Conclusion

The paper presents a robust framework for vision-based robotic manipulation, leveraging object-centric representations and foundation models to achieve high generalizability and performance from a single human video. The proposed ORION algorithm demonstrates substantial advancements in enabling robots to effectively learn and adapt manipulation strategies in dynamic, open-world environments.
