Variational Inference for Scalable 3D Object-centric Learning (2309.14010v1)

Published 25 Sep 2023 in cs.CV

Abstract: We tackle the task of scalable unsupervised object-centric representation learning on 3D scenes. Existing approaches to object-centric representation learning show limitations in generalizing to larger scenes as their learning processes rely on a fixed global coordinate system. In contrast, we propose to learn view-invariant 3D object representations in localized object coordinate systems. To this end, we estimate the object pose and appearance representation separately and explicitly map object representations across views while maintaining object identities. We adopt an amortized variational inference pipeline that can process sequential input and scalably update object latent distributions online. To handle large-scale scenes with a varying number of objects, we further introduce a Cognitive Map that allows the registration and query of objects on a per-scene global map to achieve scalable representation learning. We explore the object-centric neural radiance field (NeRF) as our 3D scene representation, which is jointly modeled within our unsupervised object-centric learning framework. Experimental results on synthetic and real datasets show that our proposed method can infer and maintain object-centric representations of 3D scenes and outperforms previous models.
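The registration-and-query idea in the abstract can be illustrated with a small toy sketch. This is not the paper's implementation, which relies on learned networks and amortized variational inference. The class name `ToyCognitiveMap`, the `match_radius` threshold, nearest-neighbour matching on global position, and the precision-weighted fusion of diagonal Gaussians are all assumptions made here purely for illustration.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Sequence, Tuple


@dataclass
class ObjectEntry:
    """One registered object (hypothetical structure, not from the paper)."""
    position: Tuple[float, ...]  # object centre in the per-scene global frame
    mean: List[float]            # mean of the object's latent distribution
    var: List[float]             # diagonal variance of the latent distribution


@dataclass
class ToyCognitiveMap:
    """Toy per-scene map: objects are matched by distance in the global frame."""
    match_radius: float = 0.5    # assumed matching threshold
    entries: List[ObjectEntry] = field(default_factory=list)

    def query(self, position: Sequence[float]) -> Optional[int]:
        """Return the index of the nearest object within match_radius, or None."""
        best, best_dist = None, self.match_radius
        for i, entry in enumerate(self.entries):
            dist = math.dist(entry.position, position)
            if dist <= best_dist:
                best, best_dist = i, dist
        return best

    def register(self, position: Sequence[float],
                 mean: Sequence[float], var: Sequence[float]) -> int:
        """Add a new object, or fuse the observation into a matching one."""
        idx = self.query(position)
        if idx is None:
            self.entries.append(
                ObjectEntry(tuple(position), list(mean), list(var)))
            return len(self.entries) - 1
        entry = self.entries[idx]
        # Precision-weighted product of two diagonal Gaussians: a stand-in
        # for an online update of the object's latent distribution.
        # The stored position is left unchanged in this sketch.
        for k in range(len(entry.mean)):
            precision = 1.0 / entry.var[k] + 1.0 / var[k]
            entry.mean[k] = (entry.mean[k] / entry.var[k]
                             + mean[k] / var[k]) / precision
            entry.var[k] = 1.0 / precision
        return idx


if __name__ == "__main__":
    scene = ToyCognitiveMap(match_radius=0.5)
    first = scene.register((0.0, 0.0, 0.0), [0.0], [1.0])
    again = scene.register((0.1, 0.0, 0.0), [2.0], [1.0])  # same object, new view
    other = scene.register((3.0, 0.0, 0.0), [5.0], [1.0])  # a different object
    print(first, again, other, scene.entries[0].mean, scene.entries[0].var)
```

In this sketch, a second observation near an already registered object keeps that object's identity and tightens its latent variance, while an observation far from every entry creates a new one, so the number of objects is not fixed in advance.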
