
URDFormer: A Pipeline for Constructing Articulated Simulation Environments from Real-World Images (2405.11656v3)

Published 19 May 2024 in cs.RO and cs.AI

Abstract: Constructing simulation scenes that are both visually and physically realistic is a problem of practical interest in domains ranging from robotics to computer vision. This problem has become even more relevant as researchers wielding large data-hungry learning methods seek new sources of training data for physical decision-making systems. However, building simulation models is often still done by hand. A graphic designer and a simulation engineer work with predefined assets to construct rich scenes with realistic dynamic and kinematic properties. While this may scale to small numbers of scenes, to achieve the generalization properties that are required for data-driven robotic control, we require a pipeline that is able to synthesize large numbers of realistic scenes, complete with 'natural' kinematic and dynamic structures. To attack this problem, we develop models for inferring structure and generating simulation scenes from natural images, allowing for scalable scene generation from web-scale datasets. To train these image-to-simulation models, we show how controllable text-to-image generative models can be used in generating paired training data that allows for modeling of the inverse problem, mapping from realistic images back to complete scene models. We show how this paradigm allows us to build large datasets of scenes in simulation with semantic and physical realism. We present an integrated end-to-end pipeline that generates simulation scenes complete with articulated kinematic and dynamic structures from real-world images and use these for training robotic control policies. We then robustly deploy in the real world for tasks like articulated object manipulation. In doing so, our work provides both a pipeline for large-scale generation of simulation environments and an integrated system for training robust robotic control policies in the resulting environments.


Summary

  • The paper introduces URDFormer, a novel pipeline that infers URDF scene descriptions from single real-world images using synthetic data generation.
  • It employs a two-phase process with a forward stage creating paired datasets via controllable generative models and an inverse stage leveraging a Vision Transformer for scene structure extraction.
  • The methodology demonstrates robust real-to-sim-to-real transfer, enabling robot manipulation tasks with a 78% success rate and offering a scalable solution for simulation asset creation.

The paper "URDFormer: A Pipeline for Constructing Articulated Simulation Environments from Real-World Images" (2405.11656) introduces a novel system for generating realistic, diverse, and controllable articulated simulation environments directly from real-world RGB images. This addresses a critical bottleneck in data-driven approaches for robotics, computer vision, and other domains: the manual and labor-intensive process of creating high-quality simulation assets with accurate physical and kinematic properties.

The core idea is to train an inverse model that can infer a structured scene description, specifically in the Unified Robot Description Format (URDF), from a single real-world image. Since large datasets of real-world images paired with corresponding URDFs do not exist, the authors propose an "inversion through synthesis" approach. This involves a two-phase pipeline:

  1. Forward Phase (Data Generation): Create a large, paired dataset of structured simulation scenes (represented as URDFs, $z$) and corresponding realistic RGB images ($x$). This is achieved by rendering procedurally generated or existing simulated scenes and then using controllable text-to-image generative models to augment these renders into visually realistic images while preserving the underlying structure.
  2. Inverse Phase (Model Training): Train a neural network, named URDFormer, on this synthetic paired dataset to learn the mapping from realistic images ($x$) back to the structured simulation scene descriptions ($z$).

Problem Formulation

The scene structure $z$ is defined as a collection of objects, each specified by its class label, 3D bounding box, 3D transform, kinematic parent, and joint type. This representation is akin to URDF, commonly used for describing robots and articulated objects. The challenge is inferring this complex $z$ from a simple observation $x$ (like an image), which is the result of an unknown forward function $f(z) = x$. The lack of real-world $(z, x)$ pairs necessitates the synthetic data generation approach.
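To make the representation concrete, here is a minimal sketch of a scene structure as a list of nodes, serialized to a URDF-like XML string. The `SceneNode` class, its field names, and the `to_urdf` helper are hypothetical and are not taken from the paper's code. The sketch stores only a translation for the 3D transform and omits the `<axis>` and `<limit>` elements that a complete URDF needs for movable joints.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple
import xml.etree.ElementTree as ET


@dataclass
class SceneNode:
    """One object or part in the scene structure (hypothetical layout)."""
    name: str
    class_label: str
    bbox_size: Tuple[float, float, float]  # 3D bounding box extents
    position: Tuple[float, float, float]   # translation relative to the parent
    parent: Optional[str]                  # None marks a root node
    joint_type: str = "fixed"              # "fixed", "prismatic", or "revolute"


def _fmt(values: Tuple[float, float, float]) -> str:
    """Format a 3-vector the way URDF attributes expect: 'x y z'."""
    return " ".join(f"{v:g}" for v in values)


def to_urdf(nodes: List[SceneNode], robot_name: str = "scene") -> str:
    """Serialize nodes to a simplified URDF string with box geometry only."""
    robot = ET.Element("robot", name=robot_name)
    # One link per node, drawn as a box with the node's bounding box size.
    for node in nodes:
        link = ET.SubElement(robot, "link", name=node.name)
        visual = ET.SubElement(link, "visual")
        geometry = ET.SubElement(visual, "geometry")
        ET.SubElement(geometry, "box", size=_fmt(node.bbox_size))
    # One joint per non-root node, connecting it to its kinematic parent.
    for node in nodes:
        if node.parent is None:
            continue
        joint = ET.SubElement(
            robot, "joint",
            name=f"{node.parent}_to_{node.name}", type=node.joint_type,
        )
        ET.SubElement(joint, "parent", link=node.parent)
        ET.SubElement(joint, "child", link=node.name)
        ET.SubElement(joint, "origin", xyz=_fmt(node.position), rpy="0 0 0")
    return ET.tostring(robot, encoding="unicode")


cabinet = SceneNode("cabinet", "cabinet", (1.0, 0.5, 1.0), (0.0, 0.0, 0.0), None)
drawer = SceneNode("drawer_0", "drawer", (0.8, 0.4, 0.2), (0.0, 0.0, 0.3),
                   "cabinet", "prismatic")
print(to_urdf([cabinet, drawer]))
```

The flat list with parent names mirrors the per-object fields listed above; the tree is recovered by following the `parent` references.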

Controlled Generation of Paired Datasets

To overcome the lack of real-world paired data, URDFormer leverages the capabilities of pre-trained controllable generative models (like Stable Diffusion). Simulated renders, while structurally accurate, often lack visual realism. Generative models can enhance these renders, but naive application can alter structural details. The authors propose a controlled generation process that differentiates between scene-level and object-level data generation:

  • Scene-Level: Render an entire simulated scene (e.g., a kitchen) and use a text-to-image diffusion model guided by the rendered image and a text prompt. The diffusion model is conditioned to maintain the global layout from the render but adds realistic textures and details. This process might change low-level object details or categories, so the resulting paired data contains complete images ($x$) but only partial labels ($\tilde{z}$), including high-level object bounding boxes, transforms, and parents, but not accurate low-level part details.
  • Object-Level: For individual articulated objects (e.g., cabinets with drawers and doors), the generative process needs to preserve fine-grained part structures. Instead of full image generation, a texture-guided approach is used. Diverse texture images are generated or sourced (Appendix A). These textures are then overlaid onto the rendered object parts using perspective warping based on the known geometry from the simulation. Generative models are then used for background generation and smoothing boundaries, ensuring consistency at the part level. This results in partial images ($\tilde{x}$, focusing on a single object) but complete labels ($z$) for that object and its parts.

This controlled generation yields two datasets: $\mathcal{D}_{\text{scene}} = \{(x, \tilde{z})\}$ and $\mathcal{D}_{\text{object}} = \{(\tilde{x}, z)\}$.

Learning Inverse Generative Models (URDFormer Architecture)

The URDFormer architecture is designed to process images and predict URDF primitives, using the two partially complete datasets. Both the scene-level ($f^{-1}_\theta$) and object-level ($g^{-1}_\phi$) models share the same fundamental architecture but are trained on their respective datasets.

The architecture processes an input image:

  1. A Vision Transformer (ViT) extracts global image features.
  2. Bounding boxes corresponding to objects or parts are provided (either ground truth during training or from a detection model during inference).
  3. ROI alignment extracts features for each bounding box.
  4. Box features are combined with learned embeddings of the bounding box coordinates.
  5. A Transformer processes these features to produce a representation for each object/part.
  6. An MLP decodes each object/part feature into:
    • An optional base class label (used in object-level prediction).
    • A discretized 3D position and bounding box relative to its parent.
    • Learned child and parent embeddings.
  7. Hierarchical relationships (parent-child) are predicted using a scene graph generation technique: computing dot products between parent and child embeddings to form a relationship score matrix.
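The parent-child scoring in the last step can be illustrated with plain dot products. This is a minimal sketch under stated assumptions: the function names are hypothetical, the embeddings are toy 2-D vectors, and choosing each node's parent by a per-row argmax (with self-parenting excluded) is an illustrative decoding rule, not necessarily the one used in the paper.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def relation_scores(child_emb: List[Sequence[float]],
                    parent_emb: List[Sequence[float]]) -> List[List[float]]:
    """scores[i][j] = how strongly candidate j is the parent of node i."""
    return [[dot(c, p) for p in parent_emb] for c in child_emb]


def predict_parents(child_emb: List[Sequence[float]],
                    parent_emb: List[Sequence[float]]) -> List[int]:
    """Pick the highest-scoring parent per node (assumed decoding rule).

    parent_emb holds one embedding per node, optionally followed by extra
    root embeddings, so a node may attach to another node or to a root.
    """
    parents = []
    for i, row in enumerate(relation_scores(child_emb, parent_emb)):
        # A node cannot be its own parent, so drop column i.
        candidates = [(score, j) for j, score in enumerate(row) if j != i]
        parents.append(max(candidates, key=lambda t: t[0])[1])
    return parents


# Two nodes plus one extra root embedding at column index 2.
child_embeddings = [[1.0, 0.0], [0.0, 1.0]]
parent_embeddings = [[0.0, 2.0], [0.2, 0.0], [1.0, 1.0]]
print(predict_parents(child_embeddings, parent_embeddings))  # [2, 0]
```

Here node 0 attaches to the root (index 2) and node 1 attaches to node 0. Appending root embeddings as extra columns is one simple way to let the same score matrix cover both node-to-node and node-to-root attachments.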

For scene-level prediction, special learned embeddings for root objects (walls, floor, ceiling) are included so that scene objects can attach to them. At test time, a real image is fed to a detection model to get initial bounding boxes. The Global URDFormer ($f^{-1}_\theta$) uses these boxes and the image to predict high-level scene structure (object positions and parents). Then, regions corresponding to predicted objects are cropped, a second detection model finds part-level boxes, and the Part URDFormer ($g^{-1}_\phi$) is applied to each object crop and its part boxes to predict the detailed kinematic structure of parts.
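The test-time flow described above can be sketched as a composition of four components. Everything here is a placeholder: `detect_objects`, `global_model`, `detect_parts`, and `part_model` are hypothetical callables standing in for the two detectors and the two URDFormer models, and images are plain nested lists so the sketch runs without any dependencies.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixel coordinates
Image = List[List[int]]          # rows of pixel values (toy stand-in)


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image covered by box."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def image_to_scene(
    image: Image,
    detect_objects: Callable[[Image], List[Box]],
    global_model: Callable[[Image, List[Box]], List[Dict]],
    detect_parts: Callable[[Image], List[Box]],
    part_model: Callable[[Image, List[Box]], List[Dict]],
) -> List[Dict]:
    """Two-stage inference: scene-level structure first, then per-object parts."""
    object_boxes = detect_objects(image)
    # Stage 1: one prediction dict per detected object (position, parent, ...).
    scene = global_model(image, object_boxes)
    # Stage 2: refine each object from its own crop and part-level boxes.
    for obj, box in zip(scene, object_boxes):
        patch = crop(image, box)
        obj["parts"] = part_model(patch, detect_parts(patch))
    return scene


# Stub components, only to show how the stages connect.
toy_image = [[0] * 6 for _ in range(4)]
scene = image_to_scene(
    toy_image,
    detect_objects=lambda img: [(1, 1, 4, 3)],
    global_model=lambda img, boxes: [{"class": "cabinet", "parent": "wall"} for _ in boxes],
    detect_parts=lambda patch: [(0, 0, len(patch[0]), len(patch))],
    part_model=lambda patch, boxes: [{"class": "drawer", "joint": "prismatic"} for _ in boxes],
)
print(scene)
```

Passing the components as arguments keeps the sketch independent of any particular detector or model implementation.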

Using URDFormer for Robotic Control

A key application demonstrated is using URDFormer in a real-to-simulation-to-real pipeline for training robot manipulation policies. Instead of creating a perfect "digital twin" for model-based control (which is fragile due to potential inaccuracies), URDFormer enables training learning-based policies using targeted randomization in simulation.

The pipeline involves:

  1. Scene Generation: Given a real-world observation (RGB-D point cloud), use URDFormer on the RGB image to predict a URDF structure. Scale the predicted structure using depth measurements.
  2. Targeted Randomization: Import the predicted URDF into a physics simulator. Collect training data by solving tasks (e.g., opening/closing drawers) using an efficient motion planner (like cuRobo) which has access to privileged simulation information. To bridge the sim2real gap and account for URDFormer's prediction errors (e.g., incorrect mesh details), randomize the simulated environment around the predicted structure. This includes replacing meshes of parts with variations from datasets like PartNet, randomizing textures (by cropping real textures and generating variations with Stable Diffusion), and applying standard image augmentations. This randomization is "targeted" because it is based on the predicted real-world configuration, unlike blind procedural generation.
  3. Policy Synthesis: Train a robot policy (e.g., a language-conditioned behavior cloning policy operating on RGB point clouds, using an M2T2-like architecture) on the large dataset of successful trajectories collected in the randomized simulation.
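The targeted randomization step can be sketched as sampling scene variants that keep the predicted kinematic structure fixed while swapping appearance and geometry. This is a simplified illustration: the dictionary layout, the `targeted_randomize` name, and the mesh and texture identifiers are hypothetical, and real randomization would also involve image augmentation and the simulator itself.

```python
import random
from copy import deepcopy
from typing import Dict, List


def targeted_randomize(
    predicted_parts: List[Dict],
    mesh_library: Dict[str, List[str]],
    textures: List[str],
    num_variants: int,
    seed: int = 0,
) -> List[List[Dict]]:
    """Sample scene variants around one predicted structure.

    The parent and joint fields come from the prediction and stay untouched;
    only the mesh and texture of each part are resampled.
    """
    rng = random.Random(seed)
    variants = []
    for _ in range(num_variants):
        scene = deepcopy(predicted_parts)  # never mutate the prediction itself
        for part in scene:
            # Swap in a mesh of the same class (e.g. from a part dataset).
            part["mesh"] = rng.choice(mesh_library[part["class"]])
            part["texture"] = rng.choice(textures)
        variants.append(scene)
    return variants


prediction = [
    {"name": "drawer_0", "class": "drawer", "parent": "cabinet", "joint": "prismatic"},
    {"name": "door_0", "class": "door", "parent": "cabinet", "joint": "revolute"},
]
meshes = {"drawer": ["drawer_a", "drawer_b"], "door": ["door_a", "door_b"]}
for variant in targeted_randomize(prediction, meshes, ["oak", "steel", "white"], 3):
    print([(p["name"], p["mesh"], p["texture"]) for p in variant])
```

Every variant shares the same parents and joint types, which is what distinguishes this from blind procedural generation: the randomization stays centered on the predicted real-world configuration.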

This pipeline allows training policies that generalize well to the real world from raw perceptual input with minimal human effort compared to manual scene creation or extensive real-world data collection.

Experiments

The paper evaluates URDFormer in several ways:

  • Real-world Robot Control: A UR5 robot with an RGB-D camera is used for articulated object manipulation tasks on five different cabinets. The URDFormer-TR pipeline (URDFormer prediction + Targeted Randomization training) is compared against zero-shot OWL-ViT detection for motion planning, standard Domain Randomization (DR), and a URDFormer-ICP approach (URDFormer prediction + ICP pose tracking for model-based execution). Results show URDFormer-TR achieves an average 78% success rate across tasks, significantly outperforming baselines (DR: 9%, OWL-ViT: 0%, URDFormer-ICP: 53.3% on available tasks), demonstrating the benefit of targeted randomization informed by URDFormer's prediction.
  • Simulation Content Generation Accuracy: URDFormer's ability to generate plausible and accurate URDFs from internet images is evaluated on manually labeled test sets of individual objects (300 images) and kitchen scenes (54 images). Metrics include category accuracy, parent accuracy, spatial error, precision, and recall for both high-level objects and low-level parts. Qualitative results show URDFormer captures scene structure reasonably well, though errors occur. An ablation study confirms that training with generated realistic textures improves global scene prediction accuracy, while part prediction is less affected, possibly because bounding box spatial relationships are sufficient for simple part structures. The paper also highlights the performance gap when using bounding boxes from a fine-tuned detector (Model Soup of pretrained and fine-tuned Grounding DINO) compared to ground truth boxes, though detection performance is improved by the Model Soup technique (F1 79.7% vs 53.4% for pretrained).
  • Generalization: The authors demonstrate URDFormer's ability to generalize to new object categories (toilet, microwave, desk, laptop, chair) and scene categories (bedroom, bathroom, laundry room, study room) by training on expanded datasets (Figures 9-12). They also show the pipeline can be applied to a different robot (Stretch) for a multi-step task (Figures 1 and 14), showcasing its flexibility.
  • Reality Gym: The generated assets form the basis of Reality Gym, a new robot learning suite providing diverse, interactive simulation environments derived from real-world images.
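For readers unfamiliar with the detection metrics quoted above, the following sketch shows how precision, recall, and F1 are commonly computed for bounding boxes. The greedy matching and the IoU threshold of 0.5 are assumptions chosen for illustration; the summary does not state the paper's exact matching protocol.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def detection_f1(pred: List[Box], gt: List[Box],
                 thresh: float = 0.5) -> Tuple[float, float, float]:
    """Greedily match predictions to ground truth; return (P, R, F1)."""
    unmatched = list(gt)
    true_pos = 0
    for p in pred:
        # Best remaining ground-truth box for this prediction, if any.
        best = max(unmatched, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) >= thresh:
            unmatched.remove(best)  # each ground-truth box matches at most once
            true_pos += 1
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gt) if gt else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


predicted = [(0, 0, 10, 10), (20, 20, 30, 30), (50, 50, 60, 60)]
ground_truth = [(0, 0, 10, 10), (21, 21, 30, 30)]
print(detection_f1(predicted, ground_truth))
```

In this toy case two of three predictions match, giving a precision of 2/3, a recall of 1, and an F1 of 0.8.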

Limitations

The paper acknowledges several limitations:

  • Reliance on the performance of the bounding box detection model.
  • Inability to reconstruct accurate meshes or complex textures; relies on predefined meshes and simple texture projection.
  • Limited to basic URDF primitives (prismatic/revolute joints), not complex objects like cars.
  • Predicted URDFs may have link collisions requiring post-processing.
  • The pipeline consists of multiple non-end-to-end trained components.
  • Physical properties (mass, friction) are not inferred from images.

Conclusion

URDFormer presents a significant step towards scalable generation of articulated simulation environments from real-world images. By synthesizing paired data with controllable generative models and training an inverse model, it enables the creation of diverse and realistic simulation assets. Integrating this pipeline with targeted domain randomization proves highly effective for training robot manipulation policies that transfer zero-shot to the real world, reducing the dependency on manual simulation design and extensive real-world data collection. The Reality Gym dataset provides a valuable resource for future research leveraging this approach.
