
Dynamic Scene Understanding through Object-Centric Voxelization and Neural Rendering (2407.20908v2)

Published 30 Jul 2024 in cs.CV

Abstract: Learning object-centric representations from videos without supervision is challenging. Unlike most previous approaches that focus on decomposing 2D images, we present a 3D generative model named DynaVol-S for dynamic scenes that enables object-centric learning within a differentiable volume rendering framework. The key idea is to perform object-centric voxelization to capture the 3D nature of the scene, which infers per-object occupancy probabilities at individual spatial locations. These voxel features evolve through a canonical-space deformation function and are optimized in an inverse rendering pipeline with a compositional NeRF. Additionally, our approach integrates 2D semantic features to create 3D semantic grids, representing the scene through multiple disentangled voxel grids. DynaVol-S significantly outperforms existing models in both novel view synthesis and unsupervised decomposition tasks for dynamic scenes. By jointly considering geometric structures and semantic features, it effectively addresses challenging real-world scenarios involving complex object interactions. Furthermore, once trained, the explicitly meaningful voxel features enable additional capabilities that 2D scene decomposition methods cannot achieve, such as novel scene generation through editing geometric shapes or manipulating the motion trajectories of objects.
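The core rendering idea in the abstract — mixing per-object densities and colors with per-location occupancy probabilities before standard volume rendering — can be illustrated with a minimal sketch. The snippet below is not the paper's implementation; the function name `composite_ray` and the array shapes are assumptions introduced purely for illustration, and the quadrature follows the standard NeRF-style volume rendering formula.

```python
import numpy as np

def composite_ray(densities, colors, occupancy, deltas):
    """Minimal sketch of compositional volume rendering along a single ray.

    densities: (S, K) per-sample, per-object density sigma_k
    colors:    (S, K, 3) per-sample, per-object radiance
    occupancy: (S, K) per-object occupancy probabilities (sum to 1 over K)
    deltas:    (S,) distances between consecutive ray samples
    Shapes and names are illustrative, not the paper's exact interface.
    """
    # Mix per-object quantities using the occupancy probabilities.
    sigma = (occupancy * densities).sum(axis=-1)             # (S,)
    color = (occupancy[..., None] * colors).sum(axis=-2)     # (S, 3)

    # Standard volume-rendering quadrature (alpha compositing with transmittance).
    alpha = 1.0 - np.exp(-sigma * deltas)                     # (S,)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1] + 1e-10]))
    weights = trans * alpha                                   # (S,)
    return (weights[:, None] * color).sum(axis=0)             # (3,) pixel RGB

if __name__ == "__main__":
    # Toy usage with random inputs: 64 samples along the ray, 3 objects.
    S, K = 64, 3
    rng = np.random.default_rng(0)
    rgb = composite_ray(rng.uniform(0, 5, (S, K)),
                        rng.uniform(0, 1, (S, K, 3)),
                        rng.dirichlet(np.ones(K), size=S),
                        np.full(S, 0.05))
    print(rgb)
```

Because the per-object occupancy probabilities enter the rendering equation explicitly, editing them after training (e.g., removing or re-posing one object's voxel grid) changes the rendered scene directly, which is the basis for the editing capabilities the abstract mentions.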
