URDFormer: A Pipeline for Constructing Articulated Simulation Environments from Real-World Images (2405.11656v3)
Abstract: Constructing simulation scenes that are both visually and physically realistic is a problem of practical interest in domains ranging from robotics to computer vision. This problem has become even more relevant as researchers wielding large data-hungry learning methods seek new sources of training data for physical decision-making systems. However, building simulation models is often still done by hand: a graphic designer and a simulation engineer work with predefined assets to construct rich scenes with realistic dynamic and kinematic properties. While this may scale to small numbers of scenes, achieving the generalization required for data-driven robotic control demands a pipeline that can synthesize large numbers of realistic scenes, complete with 'natural' kinematic and dynamic structures. To attack this problem, we develop models for inferring structure and generating simulation scenes from natural images, allowing for scalable scene generation from web-scale datasets. To train these image-to-simulation models, we show how controllable text-to-image generative models can be used to generate paired training data, enabling us to model the inverse problem: mapping from realistic images back to complete scene models. We show how this paradigm allows us to build large datasets of scenes in simulation with semantic and physical realism. We present an integrated end-to-end pipeline that generates simulation scenes, complete with articulated kinematic and dynamic structures, from real-world images, and we use these scenes to train robotic control policies. We then deploy these policies robustly in the real world for tasks like articulated object manipulation. In doing so, our work provides both a pipeline for large-scale generation of simulation environments and an integrated system for training robust robotic control policies in the resulting environments.
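To make the image-to-simulation idea concrete, the sketch below shows one way a model's predicted part structure (part bounding boxes plus joint types, the kind of output described above) could be serialized into a URDF scene description loadable by a physics simulator. This is a minimal illustrative sketch, not the paper's actual code; all function and part names here are hypothetical.

```python
# Hypothetical sketch: turn predicted articulated structure into URDF XML.
# Part names, sizes, and the API shape below are illustrative assumptions.
import xml.etree.ElementTree as ET


def box_link(name, size):
    """Create a <link> with a simple box visual geometry."""
    link = ET.Element("link", name=name)
    visual = ET.SubElement(link, "visual")
    geom = ET.SubElement(visual, "geometry")
    ET.SubElement(geom, "box", size=" ".join(str(s) for s in size))
    return link


def revolute_joint(name, parent, child, origin, axis):
    """Connect parent and child links with a hinged (revolute) joint."""
    joint = ET.Element("joint", name=name, type="revolute")
    ET.SubElement(joint, "parent", link=parent)
    ET.SubElement(joint, "child", link=child)
    ET.SubElement(joint, "origin", xyz=" ".join(str(v) for v in origin))
    ET.SubElement(joint, "axis", xyz=" ".join(str(v) for v in axis))
    ET.SubElement(joint, "limit", lower="0", upper="1.57",
                  effort="10", velocity="1")
    return joint


def build_urdf(parts):
    """parts: list of (name, size, parent_or_None, origin, axis) tuples,
    as an image-to-structure model might predict for one object."""
    robot = ET.Element("robot", name="predicted_object")
    for name, size, parent, origin, axis in parts:
        robot.append(box_link(name, size))
        if parent is not None:
            robot.append(revolute_joint(f"{parent}_to_{name}",
                                        parent, name, origin, axis))
    return ET.tostring(robot, encoding="unicode")


# Example: a cabinet body with one hinged door.
urdf = build_urdf([
    ("body", (0.6, 0.4, 0.8), None, None, None),
    ("door", (0.02, 0.4, 0.8), "body", (0.3, 0.2, 0.0), (0, 0, 1)),
])
```

The resulting string could then be written to a `.urdf` file and loaded into a simulator such as PyBullet with `loadURDF`, yielding an interactable articulated object whose kinematics mirror the predicted structure.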
Authors: Zoey Chen, Aaron Walsman, Marius Memmel, Kaichun Mo, Alex Fang, Karthikeya Vemuri, Alan Wu, Dieter Fox, Abhishek Gupta