Robot Learning with Sensorimotor Pre-training (2306.10007v2)

Published 16 Jun 2023 in cs.RO, cs.CV, and cs.LG

Abstract: We present a self-supervised sensorimotor pre-training approach for robotics. Our model, called RPT, is a Transformer that operates on sequences of sensorimotor tokens. Given a sequence of camera images, proprioceptive robot states, and actions, we encode the sequence into tokens, mask out a subset, and train a model to predict the missing content from the rest. We hypothesize that if a robot can predict the masked-out content it will have acquired a good model of the physical world that can enable it to act. RPT is designed to operate on latent visual representations which makes prediction tractable, enables scaling to larger models, and allows fast inference on a real robot. To evaluate our approach, we collected a dataset of 20,000 real-world trajectories over 9 months using a combination of motion planning and grasping algorithms. We find that sensorimotor pre-training consistently outperforms training from scratch, has favorable scaling properties, and enables transfer across different tasks, environments, and robots.
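
As a rough illustration of the masked sensorimotor prediction objective described in the abstract, the sketch below trains a small Transformer to reconstruct masked visual-latent, proprioception, and action tokens. All module names, dimensions, the masking ratio, and the loss details are illustrative assumptions; this is not the authors' exact RPT architecture or training recipe.

```python
# Minimal sketch of masked sensorimotor pre-training in the spirit of RPT.
# Dimensions, masking scheme, and loss weighting are illustrative assumptions.
import torch
import torch.nn as nn

class MaskedSensorimotorModel(nn.Module):
    def __init__(self, vis_dim=768, prop_dim=7, act_dim=7, d_model=256,
                 n_heads=8, n_layers=4, seq_len=16):
        super().__init__()
        # Per-modality projections into a shared token space.
        self.vis_proj = nn.Linear(vis_dim, d_model)    # latent visual features (e.g. from a frozen encoder)
        self.prop_proj = nn.Linear(prop_dim, d_model)  # proprioceptive robot state
        self.act_proj = nn.Linear(act_dim, d_model)    # actions
        # Learned mask token and positional embeddings over the full token sequence.
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_emb = nn.Parameter(torch.zeros(1, 3 * seq_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # Prediction heads that reconstruct each modality from encoded tokens.
        self.vis_head = nn.Linear(d_model, vis_dim)
        self.prop_head = nn.Linear(d_model, prop_dim)
        self.act_head = nn.Linear(d_model, act_dim)

    def forward(self, vis, prop, act, mask_ratio=0.5):
        # vis: (B, T, vis_dim), prop: (B, T, prop_dim), act: (B, T, act_dim)
        tokens = torch.cat(
            [self.vis_proj(vis), self.prop_proj(prop), self.act_proj(act)], dim=1
        )  # (B, 3T, d_model)
        B, N, _ = tokens.shape
        # Randomly mask a subset of tokens and replace them with the mask token.
        mask = torch.rand(B, N, device=tokens.device) < mask_ratio
        tokens = torch.where(mask.unsqueeze(-1),
                             self.mask_token.expand(B, N, -1), tokens)
        z = self.encoder(tokens + self.pos_emb[:, :N])
        T = vis.shape[1]
        # Reconstruction loss is computed on masked positions only.
        losses = []
        for head, target, sl in [
            (self.vis_head, vis, slice(0, T)),
            (self.prop_head, prop, slice(T, 2 * T)),
            (self.act_head, act, slice(2 * T, 3 * T)),
        ]:
            pred = head(z[:, sl])
            m = mask[:, sl].unsqueeze(-1).float()
            losses.append(((pred - target) ** 2 * m).sum() / m.sum().clamp(min=1))
        return sum(losses)

# Example usage on random stand-in data.
model = MaskedSensorimotorModel()
vis = torch.randn(2, 16, 768)   # latent visual tokens per timestep
prop = torch.randn(2, 16, 7)    # e.g. joint positions / gripper state
act = torch.randn(2, 16, 7)     # commanded actions
loss = model(vis, prop, act)
loss.backward()
```

Per the abstract, RPT operates on latent visual representations rather than raw pixels, which is what the `vis` tensor above stands in for; that choice is what makes prediction tractable and keeps inference fast on a real robot.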
