Multi-View Unsupervised Image Generation with Cross Attention Guidance (2312.04337v1)
Abstract: The growing interest in novel view synthesis, driven by Neural Radiance Field (NeRF) models, is hindered by scalability issues due to their reliance on precisely annotated multi-view images. Recent models address this by fine-tuning large text-to-image diffusion models on synthetic multi-view data. Despite robust zero-shot generalization, they may need post-processing and can face quality issues due to the synthetic-real domain gap. This paper introduces a novel pipeline for unsupervised training of a pose-conditioned diffusion model on single-category datasets. With the help of pretrained self-supervised Vision Transformers (DINOv2), we identify object poses by clustering the dataset, comparing the visibility and locations of specific object parts. The pose-conditioned diffusion model, trained on these pose labels and equipped with cross-frame attention at inference time, ensures cross-view consistency, which is further aided by our novel hard-attention guidance. Our model, MIRAGE, surpasses prior work in novel view synthesis on real images. Furthermore, MIRAGE is robust to diverse textures and geometries, as demonstrated by our experiments on synthetic images generated with pretrained Stable Diffusion.
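The cross-frame attention mentioned in the abstract can be illustrated with a minimal sketch, assuming the common formulation in which the tokens of every generated view attend to the keys and values of one reference view. The names `attend`, `cross_frame_attention`, and `ref_index` are hypothetical, learned query/key/value projections are omitted, and the `hard` flag shows only a generic argmax variant of attention; it is not claimed to reproduce the paper's hard-attention guidance.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values, hard=False):
    """Scaled dot-product attention over lists of token vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        if hard:
            # Assumption: "hard" means a one-hot weight on the best-matching key.
            best = max(range(len(scores)), key=scores.__getitem__)
            weights = [1.0 if i == best else 0.0 for i in range(len(scores))]
        else:
            weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(dim)])
    return out


def cross_frame_attention(frames, ref_index=0, hard=False):
    """Every frame queries the reference frame's tokens (used as keys and values)."""
    ref = frames[ref_index]
    return [attend(frame, ref, ref, hard=hard) for frame in frames]


# Two toy "views", each with two 2-dimensional tokens.
frames = [[[1.0, 0.0], [0.0, 1.0]], [[10.0, 0.0], [0.0, 10.0]]]
shared = cross_frame_attention(frames, ref_index=0, hard=True)
```

Because every view reads its values from the same reference tokens, the outputs of all views are built from shared appearance features, which is the intuition behind using this mechanism for cross-view consistency.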