Natural Language Can Help Bridge the Sim2Real Gap (2405.10020v2)
Abstract: The main challenge in learning image-conditioned robotic policies is acquiring a visual representation conducive to low-level control. Due to the high dimensionality of the image space, learning a good visual representation requires a considerable amount of visual data. However, when learning in the real world, data is expensive. Sim2Real is a promising paradigm for overcoming data scarcity in the real-world target domain by using a simulator to collect large amounts of cheap data closely related to the target task. However, it is difficult to transfer an image-conditioned policy from sim to real when the domains are very visually dissimilar. To bridge the sim2real visual gap, we propose using natural language descriptions of images as a unifying signal across domains that captures the underlying task-relevant semantics. Our key insight is that if two image observations from different domains are labeled with similar language, the policy should predict similar action distributions for both images. We demonstrate that training the image encoder to predict the language description of an image, or the distance between descriptions of a sim and a real image, serves as a useful, data-efficient pretraining step for learning a domain-invariant image representation. We can then use this image encoder as the backbone of an IL policy trained simultaneously on a large number of simulated demonstrations and a handful of real ones. Our approach outperforms widely used prior sim2real methods and strong vision-language pretraining baselines like CLIP and R3M by 25 to 40%. See additional videos and materials at https://robin-lab.cs.utexas.edu/lang4sim2real/.
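To make the abstract's pretraining idea concrete, below is a minimal PyTorch sketch of the two language-supervised objectives it describes: regressing an image embedding onto the embedding of its description, and matching the distance between a sim image and a real image to the distance between their descriptions. The ResNet-18 backbone, the frozen MiniLM sentence encoder, and the exact loss forms are illustrative assumptions, not the paper's verified implementation.

```python
# Minimal sketch (assumed design, not the paper's exact code) of the two
# language-supervised pretraining objectives described in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18
from sentence_transformers import SentenceTransformer


class ImageEncoder(nn.Module):
    """Image backbone whose output lives in the sentence-embedding space."""

    def __init__(self, embed_dim: int = 384):  # 384 = MiniLM embedding size
        super().__init__()
        self.backbone = resnet18(weights=None)
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, embed_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.backbone(images), dim=-1)


# Frozen language encoder; MiniLM appears in the references, but its use
# here is an assumption.
lang_encoder = SentenceTransformer("all-MiniLM-L6-v2")


def embed_text(descs: list[str]) -> torch.Tensor:
    """Embed a batch of descriptions; no gradients reach the text model."""
    with torch.no_grad():
        return F.normalize(
            lang_encoder.encode(descs, convert_to_tensor=True), dim=-1
        )


def lang_regression_loss(enc: ImageEncoder, imgs, descs) -> torch.Tensor:
    """'Predict the language description': pull each image embedding
    toward the embedding of its own description (cosine distance)."""
    return (1.0 - F.cosine_similarity(enc(imgs), embed_text(descs), dim=-1)).mean()


def lang_distance_loss(enc, imgs_a, imgs_b, descs_a, descs_b) -> torch.Tensor:
    """'Predict the distance between descriptions': make the distance
    between a sim and a real image match that of their descriptions."""
    lang_dist = 1.0 - F.cosine_similarity(
        embed_text(descs_a), embed_text(descs_b), dim=-1
    )
    img_dist = 1.0 - F.cosine_similarity(enc(imgs_a), enc(imgs_b), dim=-1)
    return F.mse_loss(img_dist, lang_dist)
```

In the abstract's framing, either objective pushes sim and real observations that share task-relevant semantics toward nearby points in representation space; the pretrained encoder is then reused as the backbone of an imitation-learning policy co-trained on many simulated demonstrations and a few real ones.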
- A unified approach for motion and force control of robot manipulators: The operational space formulation. IEEE Journal on Robotics and Automation, 3(1):43–53, 1987.
- Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022.
- Invariance is key to generalization: Examining the role of representation in sim-to-real transfer for visual navigation. arXiv preprint arXiv:2310.15020, 2023.
- Learning dexterous in-hand manipulation. The International Journal of Robotics Research, 39(1):3–20, 2020.
- Unsupervised pixel-level domain adaptation with generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3722–3731, 2017.
- Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022.
- Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023.
- Open-vocabulary queryable scene representations for real world planning. arXiv preprint arXiv:2209.09874, 2022.
- Scaling egocentric vision: The epic-kitchens dataset. In European Conference on Computer Vision (ECCV), 2018.
- Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
- Minedojo: Building open-ended embodied agents with internet-scale knowledge. Advances in Neural Information Processing Systems, 35:18343–18362, 2022.
- Using natural language for reward shaping in reinforcement learning. arXiv preprint arXiv:1903.02020, 2019.
- Pixl2r: Guiding reinforcement learning using natural language by mapping pixels to rewards. arXiv preprint arXiv:2007.15543, 2020.
- The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE international conference on computer vision, pages 5842–5850, 2017.
- Ego4d: Around the world in 3,000 hours of egocentric video. arXiv preprint arXiv:2110.07058, 2021.
- Maskvit: Masked visual pre-training for video prediction. arXiv preprint arXiv:2206.11894, 2022.
- Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
- Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020.
- Retinagan: An object-aware approach to sim-to-real transfer. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 10920–10926. IEEE, 2021.
- Inner monologue: Embodied reasoning through planning with language models. arXiv preprint arXiv:2207.05608, 2022.
- Sim-to-real via sim-to-sim: Data-efficient robotic grasping via randomized-to-canonical adaptation networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12627–12637, 2019.
- BC-z: Zero-shot task generalization with robotic imitation learning. In 5th Annual Conference on Robot Learning, 2021. URL https://openreview.net/forum?id=8kbp23tSGYv.
- Vima: General robot manipulation with multimodal prompts. arXiv, 2022.
- Exploring visual pre-training for robot manipulation: Datasets, models and methods. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 11390–11395. IEEE, 2023.
- Lila: Language-informed latent actions. In 5th Annual Conference on Robot Learning, 2021. URL https://arxiv.org/pdf/2111.03205.
- Sim2real transfer for reinforcement learning without dynamics randomization. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4383–4388. IEEE, 2020.
- Curl: Contrastive unsupervised representations for reinforcement learning. In International Conference on Machine Learning, pages 5639–5650. PMLR, 2020.
- Grounding dino: Marrying dino with grounded pre-training for open-set object detection, 2023.
- Language conditioned imitation learning over unstructured data. Robotics: Science and Systems, 2021. URL https://arxiv.org/abs/2005.07648.
- Liv: Language-image representations and rewards for robotic control. arXiv preprint arXiv:2306.00958, 2023.
- Sim-to-real reinforcement learning for deformable object manipulation. In Conference on Robot Learning, pages 734–743. PMLR, 2018.
- Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:2112.03227, 2021.
- What matters in language conditioned imitation learning. arXiv preprint arXiv:2204.06252, 2022.
- Driving policy transfer via modularity and abstraction. arXiv preprint arXiv:1804.09364, 2018.
- Learning language-conditioned robot behavior from offline data and crowd-sourced annotation. In 5th Annual Conference on Robot Learning, 2021. URL https://arxiv.org/pdf/2109.01115.
- R3m: A universal visual representation for robot manipulation. arXiv preprint arXiv:2203.12601, 2022.
- Solving Rubik's cube with a robot hand, 2019.
- Open x-embodiment: Robotic learning datasets and rt-x models. arXiv preprint arXiv:2310.08864, 2023.
- Film: Visual reasoning with a general conditioning layer. In AAAI, 2018.
- Dean Pomerleau. Alvinn: An autonomous land vehicle in a neural network. In Conference on Neural Information Processing Systems (NeurIPS), 1988.
- Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.
- Real-world robot learning with masked visual pre-training. In Conference on Robot Learning, pages 416–426. PMLR, 2023.
- Cape: Corrective actions from precondition errors using large language models. In 2nd Workshop on Language and Robot Learning: Language as Grounding, 2023.
- Rl-cyclegan: Reinforcement learning aware simulation-to-real. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11157–11166, 2020.
- Stefan Schaal. Is imitation learning the route to humanoid robots? Trends in cognitive sciences, 3(6):233–242, 1999.
- Bleurt: Learning robust metrics for text generation. arXiv preprint arXiv:2004.04696, 2020.
- Masked world models for visual control. In Conference on Robot Learning, pages 1332–1344. PMLR, 2023.
- Mutex: Learning unified policies from multimodal task specifications. arXiv preprint arXiv:2309.14320, 2023.
- Concept2robot: Learning manipulation concepts from instructions and human demonstrations. In Proceedings of Robotics: Science and Systems (RSS), 2020.
- Cliport: What and where pathways for robotic manipulation. In Proceedings of the 5th Conference on Robot Learning (CoRL), 2021.
- Perceiver-actor: A multi-task transformer for robotic manipulation. In Conference on Robot Learning, 2022.
- Lancon-learn: Learning with language to enable generalization in multi-task manipulation. IEEE Robotics and Automation Letters, 2021.
- Multi-task reinforcement learning with context-based representations. arXiv preprint arXiv:2102.06177, 2021.
- Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 23–30. IEEE, 2017.
- Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ international conference on intelligent robots and systems, pages 5026–5033. IEEE, 2012.
- Deep domain confusion: Maximizing for domain invariance, 2014.
- Vrl3: A data-driven framework for visual deep reinforcement learning. Advances in Neural Information Processing Systems, 35:32974–32988, 2022.
- Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers, 2020.
- Using both demonstrations and language instructions to efficiently learn robotic tasks. arXiv preprint arXiv:2210.04476, 2022.
- Preparing for the unknown: Learning a universal policy with online system identification. arXiv preprint arXiv:1702.02453, 2017.
- Pre-trained image encoder for generalizable visual reinforcement learning. Advances in Neural Information Processing Systems, 35:13022–13037, 2022.
- What makes representation learning from videos hard for control? 2022. URL https://api.semanticscholar.org/CorpusID:252635608.
- Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment. arXiv preprint arXiv:2310.01852, 2023.
- robosuite: A modular simulation framework and benchmark for robot learning. arXiv preprint arXiv:2009.12293, 2020.