- The paper introduces an imitation learning framework that discovers, aligns, and transfers cross-embodiment skills from human demonstration videos to robotic actions.
- It employs conditional diffusion models and self-supervised clustering to create a unified skill representation space across human and robot data.
- Evaluations in simulated and real-world settings show significant improvements over baselines, including robustness to variations in execution speed and generalization to unseen task compositions.
 
 
XSkill: Cross Embodiment Skill Discovery
Introduction
XSkill introduces an imitation learning framework that tackles the challenges of skill discovery, transfer, and composition across different embodiments. By leveraging human demonstration videos, which are abundant and intuitive to collect, XSkill bridges the gap between human demonstrations and robot actions through a cross-embodiment skill representation space, a skill-conditioned diffusion policy, and a Skill Alignment Transformer.
Methodology
Cross-Embodiment Skill Discovery
XSkill begins with the discovery of robust cross-embodiment skill prototypes using self-supervised learning on human and robot video data.
 
Figure 1: Cross Embodiment Skill Discovery. XSkill identifies and executes skills through a learned representation space.
The framework clusters video features onto a set of learnable skill prototypes that are shared between the human and robot data. Because both embodiments must express their skills through the same prototypes, the representation is discouraged from splitting along embodiment lines and instead forms a unified skill space. The sketch after this paragraph illustrates the idea.
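As a concrete illustration, the sketch below shows prototype-based clustering in the style of SwAV: embeddings from either embodiment are softly assigned to a shared set of learnable prototypes, with Sinkhorn-balanced targets and a swapped-prediction loss. The names and sizes here (`SkillEncoder`, `NUM_PROTOTYPES`, the encoder architecture) are assumptions for illustration, not XSkill's actual implementation.

```python
# Minimal sketch of shared-prototype skill clustering (SwAV-style).
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_PROTOTYPES = 64   # assumed number of shared skill prototypes
EMBED_DIM = 128       # assumed skill-embedding dimension

class SkillEncoder(nn.Module):
    """Maps per-clip visual features to a normalized skill embedding."""
    def __init__(self, in_dim=2048, embed_dim=EMBED_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, embed_dim),
        )
        # Prototypes are shared between human and robot clips, which is
        # what pushes both embodiments into one skill space.
        self.prototypes = nn.Linear(embed_dim, NUM_PROTOTYPES, bias=False)

    def forward(self, clip_features):
        z = F.normalize(self.net(clip_features), dim=-1)
        return z, self.prototypes(z)  # embedding and prototype scores

@torch.no_grad()
def sinkhorn(scores, eps=0.05, iters=3):
    """Balanced soft cluster assignments via Sinkhorn-Knopp, as in SwAV."""
    q = torch.exp(scores / eps).T      # (K, B)
    q /= q.sum()
    K, B = q.shape
    for _ in range(iters):
        q /= q.sum(dim=1, keepdim=True); q /= K   # normalize rows
        q /= q.sum(dim=0, keepdim=True); q /= B   # normalize columns
    return (q * B).T                   # (B, K)

def swapped_prediction_loss(scores_a, scores_b, temperature=0.1):
    """Predict view A's prototype assignment from view B's scores and vice versa."""
    with torch.no_grad():
        q_a, q_b = sinkhorn(scores_a), sinkhorn(scores_b)
    p_a = F.log_softmax(scores_a / temperature, dim=-1)
    p_b = F.log_softmax(scores_b / temperature, dim=-1)
    return -0.5 * ((q_b * p_a).sum(-1) + (q_a * p_b).sum(-1)).mean()
```

In a setup like this, the loss would be summed over pairs of augmented or temporally nearby clips drawn from both human and robot videos, so the same prototypes receive gradients from both embodiments.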
Transfer via Skill-Conditioned Imitation Learning
The transfer phase learns a skill-conditioned visuomotor policy by imitation. XSkill uses diffusion models, specifically Denoising Diffusion Probabilistic Models (DDPMs), to capture the multimodal action distributions present in teleoperated robot demonstrations, keeping training stable and expressive even with relatively small datasets. A hedged sketch of such a training step follows.
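The sketch below shows what a skill-conditioned DDPM training step typically looks like: the action sequence is corrupted with Gaussian noise at a random diffusion step, and the network is trained to predict that noise given the observation and skill embeddings. The function and argument names (`noise_pred_net`, `obs_emb`, `skill_emb`) are hypothetical, not the paper's exact interface.

```python
# Illustrative DDPM training step for a skill-conditioned policy.
import torch
import torch.nn.functional as F

NUM_DIFFUSION_STEPS = 100  # assumed diffusion horizon

def ddpm_loss(noise_pred_net, actions, obs_emb, skill_emb, alphas_cumprod):
    """Corrupt the action sequence at a random timestep, then train the
    network to recover the injected noise given observation and skill."""
    B = actions.shape[0]
    t = torch.randint(0, NUM_DIFFUSION_STEPS, (B,), device=actions.device)
    noise = torch.randn_like(actions)
    a_bar = alphas_cumprod[t].view(B, *([1] * (actions.dim() - 1)))
    noisy_actions = a_bar.sqrt() * actions + (1 - a_bar).sqrt() * noise
    pred = noise_pred_net(noisy_actions, t, obs_emb, skill_emb)
    return F.mse_loss(pred, noise)
```

At inference time, the same conditioning (observation plus skill embedding) would steer the reverse diffusion process toward actions that realize the requested skill.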
 
Figure 2: XSkill Discover: Temporal encoding and clustering process for aligning cross-embodiment skill representations.
Skill Composition for New Tasks
In the final phase, a single human prompt video specifies a new task, and the robot must compose the corresponding sequence of discovered skills. Speed mismatches between human and robot execution make naive frame-by-frame tracking of the prompt unreliable, so XSkill employs a Skill Alignment Transformer (SAT) that aligns the prompt's skill plan with the robot's own execution progress, which also adds robustness to execution errors along the way. A sketch of this kind of alignment module appears below.
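The sketch below illustrates one plausible way such an alignment module could be structured: a transformer attends over the prompt video's skill sequence with a query derived from the robot's current observation, and returns the skill embedding to condition the policy on. The layer sizes, the learned query token, and the pooling scheme are assumptions, not the paper's exact SAT architecture.

```python
# Minimal sketch of a skill-alignment module (assumed architecture).
import torch
import torch.nn as nn

class SkillAlignmentTransformer(nn.Module):
    def __init__(self, embed_dim=128, n_heads=4, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.query_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.obs_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, prompt_skills, robot_obs_emb):
        """prompt_skills: (B, T, D) skill embeddings from the human prompt.
        robot_obs_emb: (B, D) embedding of the robot's current observation.
        Returns a (B, D) skill embedding aligned to the robot's progress."""
        B = prompt_skills.shape[0]
        # Query token carries the robot's current state into the attention.
        query = self.query_token.expand(B, 1, -1) \
            + self.obs_proj(robot_obs_emb).unsqueeze(1)
        tokens = torch.cat([query, prompt_skills], dim=1)
        out = self.encoder(tokens)
        return out[:, 0]  # the query token summarizes "which skill now"
```

Because the output depends on the robot's current observation rather than on elapsed time, the resulting plan naturally tolerates the robot executing skills slower or faster than the human in the prompt video.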
 
Figure 3: Transfer and Composition: Skill execution plan alignment using SAT for mismatch handling between human demonstrations and robot tasks.
Evaluation
The framework is evaluated in both simulated and real-world environments, where it shows significant improvements over baselines on cross-embodiment skill transfer. XSkill outperforms goal-conditioned diffusion policies and remains robust to variations in task-execution speed.
 
Figure 4: XSkill embedding. Alignment of skill representations via t-SNE visualization, showcasing embodiment consistency.
The advantage is most pronounced on unseen skill compositions, where the composition framework extends imitation learning to tasks that were never demonstrated on the robot, substantially broadening its applicability to real-world scenarios. A sketch of the kind of embedding check behind Figure 4 follows.
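For readers who want to reproduce an embedding check like the one in Figure 4, the sketch below projects human and robot skill embeddings with t-SNE and colors them by embodiment; well-aligned representations should overlap rather than separate into two clouds. The function and variable names are hypothetical.

```python
# Hypothetical t-SNE check of cross-embodiment skill alignment.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_skill_embeddings(human_emb: np.ndarray, robot_emb: np.ndarray):
    """human_emb, robot_emb: (N, D) skill embeddings from each embodiment."""
    all_emb = np.concatenate([human_emb, robot_emb], axis=0)
    coords = TSNE(n_components=2, perplexity=30).fit_transform(all_emb)
    n = len(human_emb)
    plt.scatter(coords[:n, 0], coords[:n, 1], s=5, label="human")
    plt.scatter(coords[n:, 0], coords[n:, 1], s=5, label="robot")
    plt.legend()
    plt.title("Cross-embodiment skill embeddings (t-SNE)")
    plt.show()
```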
Key Findings and Implications
XSkill excels at leveraging human demonstration videos to infer cross-embodiment skills, improving the scalability and generalizability of skill transfer to robots. The method does, however, require sufficiently diverse robot teleoperation data to cover the transition dynamics of the target environments. Future work may diversify data sources further, for example through varied camera setups and large publicly available collections of human video.
 
Figure 5: Robot execution on novel tasks, demonstrating robustness and skill re-composition after perturbation.
Conclusion
XSkill's framework provides a compelling approach to cross-embodiment skill learning, using shared skill prototypes to bridge the gap between distinct embodiments. It lays a foundation for extending imitation capabilities across robotic systems while keeping learning from human demonstrations cost-effective and scalable. The combination of discovery, transfer, and composition phases points toward a more general solution to imitation learning, with clear room for further extension across domains.