Any-point Trajectory Modeling for Policy Learning (2401.00025v3)

Published 28 Dec 2023 in cs.RO and cs.CV

Abstract: Learning from demonstration is a powerful method for teaching robots new skills, and having more demonstration data often improves policy learning. However, the high cost of collecting demonstration data is a significant bottleneck. Videos, as a rich data source, contain knowledge of behaviors, physics, and semantics, but extracting control-specific information from them is challenging due to the lack of action labels. In this work, we introduce a novel framework, Any-point Trajectory Modeling (ATM), that utilizes video demonstrations by pre-training a trajectory model to predict future trajectories of arbitrary points within a video frame. Once trained, these trajectories provide detailed control guidance, enabling the learning of robust visuomotor policies with minimal action-labeled data. Across over 130 language-conditioned tasks we evaluated in both simulation and the real world, ATM outperforms strong video pre-training baselines by 80% on average. Furthermore, we show effective transfer learning of manipulation skills from human videos and videos from a different robot morphology. Visualizations and code are available at: https://xingyu-lin.github.io/atm
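The abstract describes a two-stage recipe: first pre-train a trajectory model on action-free video to predict future trajectories of arbitrary query points, then use those predicted trajectories as guidance when learning a visuomotor policy from a small amount of action-labeled data. The sketch below illustrates only that pipeline shape; the module names (`TrajectoryModel`, `TrajectoryGuidedPolicy`), dimensions, MSE losses, and simple MLP architectures are illustrative assumptions, not the paper's actual design.

```python
# Minimal sketch of the two-stage ATM recipe from the abstract.
# All shapes, modules, and losses are placeholder assumptions.
import torch
import torch.nn as nn

class TrajectoryModel(nn.Module):
    """Stage 1: predict future 2D positions of arbitrary query points in a frame."""
    def __init__(self, img_dim=512, horizon=16):
        super().__init__()
        self.horizon = horizon
        self.encoder = nn.Linear(img_dim, 256)           # stand-in for a visual backbone
        self.head = nn.Sequential(
            nn.Linear(256 + 2, 256), nn.ReLU(),
            nn.Linear(256, horizon * 2),                  # (x, y) for each future step
        )

    def forward(self, img_feat, query_pts):
        # img_feat: (B, img_dim); query_pts: (B, P, 2) normalized pixel coordinates
        ctx = self.encoder(img_feat).unsqueeze(1).expand(-1, query_pts.size(1), -1)
        out = self.head(torch.cat([ctx, query_pts], dim=-1))
        return out.view(query_pts.size(0), query_pts.size(1), self.horizon, 2)

class TrajectoryGuidedPolicy(nn.Module):
    """Stage 2: map the observation plus predicted point trajectories to an action."""
    def __init__(self, img_dim=512, n_points=32, horizon=16, act_dim=7):
        super().__init__()
        traj_dim = n_points * horizon * 2
        self.net = nn.Sequential(
            nn.Linear(img_dim + traj_dim, 512), nn.ReLU(),
            nn.Linear(512, act_dim),
        )

    def forward(self, img_feat, pred_trajs):
        flat = pred_trajs.flatten(start_dim=1)
        return self.net(torch.cat([img_feat, flat], dim=-1))

# --- Stage 1: pre-train on action-free video with tracked-point supervision ---
traj_model = TrajectoryModel()
opt1 = torch.optim.Adam(traj_model.parameters(), lr=1e-4)
img_feat = torch.randn(8, 512)            # placeholder frame features
query_pts = torch.rand(8, 32, 2)          # arbitrary query points in the frame
gt_trajs = torch.rand(8, 32, 16, 2)       # tracks from an off-the-shelf point tracker
loss1 = nn.functional.mse_loss(traj_model(img_feat, query_pts), gt_trajs)
loss1.backward(); opt1.step()

# --- Stage 2: behavior cloning with trajectory guidance on a small labeled set ---
policy = TrajectoryGuidedPolicy()
opt2 = torch.optim.Adam(policy.parameters(), lr=1e-4)
with torch.no_grad():
    guidance = traj_model(img_feat, query_pts)   # frozen trajectory model supplies guidance
expert_actions = torch.randn(8, 7)
loss2 = nn.functional.mse_loss(policy(img_feat, guidance), expert_actions)
loss2.backward(); opt2.step()
```

In this reading, the trajectory model is trained purely from video (supervised by an automatic point tracker, so no action labels are needed), and the downstream policy only has to map predicted point motion to actions, which is why a small action-labeled dataset can suffice.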
