Any-point Trajectory Modeling for Policy Learning (2401.00025v3)
Abstract: Learning from demonstration is a powerful method for teaching robots new skills, and more demonstration data often improves policy learning. However, the high cost of collecting demonstrations is a significant bottleneck. Videos, as a rich data source, contain knowledge of behaviors, physics, and semantics, but extracting control-specific information from them is challenging because they lack action labels. In this work, we introduce a novel framework, Any-point Trajectory Modeling (ATM), that utilizes video demonstrations by pre-training a trajectory model to predict the future trajectories of arbitrary points within a video frame. Once trained, these predicted trajectories provide detailed control guidance, enabling the learning of robust visuomotor policies with minimal action-labeled data. Across the more than 130 language-conditioned tasks we evaluated in both simulation and the real world, ATM outperforms strong video pre-training baselines by 80% on average. Furthermore, we demonstrate effective transfer of manipulation skills from human videos and from videos of a robot with a different morphology. Visualizations and code are available at: https://xingyu-lin.github.io/atm.
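To make the framework concrete, below is a minimal PyTorch sketch of the two-stage recipe the abstract describes: a trajectory model pre-trained on action-free video to predict the future paths of query points, and a downstream policy that conditions on those predicted tracks. All module names, tensor shapes, layer sizes, and the choice of a transformer encoder here are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch of the two-stage ATM recipe, assuming PyTorch.
# Shapes, layer sizes, and module names are illustrative, not the paper's.
import torch
import torch.nn as nn


class TrackPredictor(nn.Module):
    """Stage 1: predict future 2D trajectories of arbitrary query points.

    Pre-trained on action-free videos; supervision for the tracks is assumed
    to come from an off-the-shelf point tracker.
    """

    def __init__(self, img_dim=512, horizon=16, hidden=256):
        super().__init__()
        self.horizon = horizon
        self.point_embed = nn.Linear(2, hidden)        # (x, y) query point
        self.frame_embed = nn.Linear(img_dim, hidden)  # precomputed frame feature
        layer = nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(hidden, horizon * 2)     # future (x, y) per step

    def forward(self, frame_feat, query_points):
        # frame_feat: (B, img_dim); query_points: (B, N, 2), normalized coords.
        tokens = torch.cat(
            [self.frame_embed(frame_feat).unsqueeze(1),
             self.point_embed(query_points)],
            dim=1,
        )
        point_tokens = self.encoder(tokens)[:, 1:]     # drop the frame token
        B, N, _ = query_points.shape
        return self.head(point_tokens).view(B, N, self.horizon, 2)


class TrackGuidedPolicy(nn.Module):
    """Stage 2: map observation features plus predicted tracks to an action,
    trained by behavior cloning on a small action-labeled dataset."""

    def __init__(self, img_dim=512, horizon=16, n_points=32, act_dim=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim + n_points * horizon * 2, 256),
            nn.ReLU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, frame_feat, tracks):
        # tracks: (B, N, horizon, 2) from the frozen track predictor.
        return self.net(torch.cat([frame_feat, tracks.flatten(1)], dim=-1))


# Usage: pre-train the predictor on video, freeze it, then train the policy.
predictor, policy = TrackPredictor(), TrackGuidedPolicy()
frame_feat = torch.randn(4, 512)            # e.g., from a frozen vision encoder
query_points = torch.rand(4, 32, 2)         # 32 arbitrary points per frame
with torch.no_grad():
    tracks = predictor(frame_feat, query_points)
action = policy(frame_feat, tracks)         # (4, 7) action vector
```

The design choice this sketch mirrors is that the interface between the video-pretrained stage and the action-labeled stage is a set of point trajectories rather than raw pixels, so the policy only needs a small amount of labeled data to map predicted motion into actions.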