MVDiff: Scalable and Flexible Multi-View Diffusion for 3D Object Reconstruction from Single-View (2405.03894v2)

Published 6 May 2024 in cs.CV and cs.LG

Abstract: Generating consistent multiple views for 3D reconstruction remains a challenge for existing image-to-3D diffusion models. In general, incorporating 3D representations into a diffusion model reduces the model's speed, generalizability, and quality. This paper proposes a general framework that generates consistent multi-view images from a single image, leveraging a scene representation transformer and a view-conditioned diffusion model. Within the model, we introduce epipolar geometry constraints and multi-view attention to enforce 3D consistency. From as few as one input image, our model generates 3D meshes that surpass baseline methods on evaluation metrics including PSNR, SSIM, and LPIPS.
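The abstract does not say how the epipolar constraints are implemented inside the model. As background only, the sketch below shows the standard two-view epipolar relation x2ᵀ F x1 = 0 with F = K2⁻ᵀ [t]ₓ R K1⁻¹, plus one common way such a constraint can gate cross-view attention: keeping only view-2 pixels within a pixel distance `tau` of the query's epipolar line. Assumptions (not from the paper): zero-skew pinhole intrinsics, the relative-pose convention X2 = R X1 + t, and the hypothetical helper names `fundamental_matrix`, `epipolar_mask`, and so on. This is not MVDiff's implementation.

```python
# Hypothetical helpers illustrating standard two-view epipolar geometry.
# Assumes zero-skew pinhole cameras and relative pose X2 = R @ X1 + t.
# This is a background sketch, NOT the MVDiff authors' implementation.
from typing import List, Sequence, Tuple

Mat3 = List[List[float]]
Vec3 = List[float]


def matmul(a: Mat3, b: Mat3) -> Mat3:
    """3x3 matrix product a @ b."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def matvec(a: Mat3, v: Sequence[float]) -> Vec3:
    """3x3 matrix times a 3-vector."""
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]


def transpose(a: Mat3) -> Mat3:
    return [[a[j][i] for j in range(3)] for i in range(3)]


def skew(t: Sequence[float]) -> Mat3:
    """Cross-product matrix: matvec(skew(t), v) equals the cross product t x v."""
    return [[0.0, -t[2], t[1]],
            [t[2], 0.0, -t[0]],
            [-t[1], t[0], 0.0]]


def intrinsics_inverse(fx: float, fy: float, cx: float, cy: float) -> Mat3:
    """Closed-form inverse of a zero-skew pinhole intrinsic matrix K."""
    return [[1.0 / fx, 0.0, -cx / fx],
            [0.0, 1.0 / fy, -cy / fy],
            [0.0, 0.0, 1.0]]


def fundamental_matrix(k1_inv: Mat3, k2_inv: Mat3, r: Mat3, t: Sequence[float]) -> Mat3:
    """F = K2^-T [t]_x R K1^-1 for the relative pose X2 = R @ X1 + t."""
    essential = matmul(skew(t), r)
    return matmul(matmul(transpose(k2_inv), essential), k1_inv)


def epipolar_line(f: Mat3, x1: Tuple[float, float]) -> Vec3:
    """Coefficients (a, b, c) of the line a*u + b*v + c = 0 in view 2."""
    return matvec(f, [x1[0], x1[1], 1.0])


def epipolar_residual(f: Mat3, x1, x2) -> float:
    """Algebraic residual x2^T F x1; zero when x1 and x2 image the same 3D point."""
    a, b, c = epipolar_line(f, x1)
    return a * x2[0] + b * x2[1] + c


def epipolar_mask(f: Mat3, x1, candidates, tau: float) -> List[bool]:
    """Keep view-2 pixels whose distance to x1's epipolar line is at most tau.

    Hypothetical usage: such a mask could restrict cross-view attention so a
    query pixel only attends to geometrically plausible matches.
    """
    a, b, c = epipolar_line(f, x1)
    norm = (a * a + b * b) ** 0.5  # converts the algebraic residual to pixels
    return [abs(a * u + b * v + c) / norm <= tau for (u, v) in candidates]


def project(fx: float, fy: float, cx: float, cy: float, p: Sequence[float]) -> Tuple[float, float]:
    """Pinhole projection of a camera-frame point p (requires z > 0) to pixels."""
    return (fx * p[0] / p[2] + cx, fy * p[1] / p[2] + cy)
```

For example, with identical intrinsics, R = I, and t = (-1, 0, 0) (a rectified stereo pair), every epipolar line is the image row v2 = v1, so the mask keeps only view-2 pixels near the query pixel's row.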

Authors (2)
