Disentangled 3D Scene Generation with Layout Learning (2402.16936v1)
Abstract: We introduce a method to generate 3D scenes that are disentangled into their component objects. This disentanglement is unsupervised, relying only on the knowledge of a large pretrained text-to-image model. Our key insight is that objects can be discovered by finding parts of a 3D scene that, when rearranged spatially, still produce valid configurations of the same scene. Concretely, our method jointly optimizes multiple NeRFs from scratch, each representing its own object, along with a set of layouts that composite these objects into scenes. We then encourage these composited scenes to be in-distribution according to the image generator. We show that despite its simplicity, our approach successfully generates 3D scenes decomposed into individual objects, enabling new capabilities in text-to-3D content creation. For results and an interactive demo, see our project page at https://dave.ml/layoutlearning/
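As a rough illustration of what "layouts that composite these objects into scenes" could mean, the sketch below places toy per-object density fields into a shared world frame under two different layouts. It is a minimal sketch under stated assumptions, not the paper's implementation: the ball-shaped densities stand in for learned NeRFs, a layout is assumed to be one rigid pose (yaw plus translation) per object, and all names (`make_ball`, `world_to_object`, `composite_density`) are hypothetical. Rendering and the image-generator loss are omitted.

```python
import math
from typing import Callable, Sequence, Tuple

Vec3 = Tuple[float, float, float]
# Assumption: a layout entry is a rigid pose, yaw about the z-axis plus a translation.
Pose = Tuple[float, Vec3]
Field = Callable[[Vec3], float]


def make_ball(radius: float) -> Field:
    """Toy stand-in for one object's NeRF: density 1 inside a ball at the origin."""
    def density(p: Vec3) -> float:
        return 1.0 if math.sqrt(p[0] ** 2 + p[1] ** 2 + p[2] ** 2) <= radius else 0.0
    return density


def world_to_object(p: Vec3, pose: Pose) -> Vec3:
    """Invert the pose: undo the translation, then rotate by -yaw."""
    yaw, t = pose
    x, y, z = p[0] - t[0], p[1] - t[1], p[2] - t[2]
    c, s = math.cos(-yaw), math.sin(-yaw)
    return (c * x - s * y, s * x + c * y, z)


def composite_density(p: Vec3, fields: Sequence[Field], layout: Sequence[Pose]) -> float:
    """Sum each object's density at world point p, with objects placed by the layout."""
    return sum(f(world_to_object(p, pose)) for f, pose in zip(fields, layout))


# Two objects shared across layouts; only their placement changes.
fields = [make_ball(0.5), make_ball(0.25)]
layout_a = [(0.0, (0.0, 0.0, 0.0)), (0.0, (1.0, 0.0, 0.0))]
layout_b = [(0.0, (0.0, 0.0, 0.0)), (math.pi / 2, (0.0, 1.0, 0.0))]

probe = (1.0, 0.0, 0.0)
print(composite_density(probe, fields, layout_a))  # small ball is placed at the probe
print(composite_density(probe, fields, layout_b))  # small ball has been moved elsewhere
```

In this toy setup the same two object fields are reused by both layouts, which mirrors the abstract's idea that an object is something that can be rearranged while the scene stays valid; in the actual method the fields and layouts would be optimized rather than fixed by hand.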