Zero-1-to-3: Zero-shot One Image to 3D Object

Published 20 Mar 2023 in cs.CV, cs.GR, and cs.RO | (2303.11328v1)

Abstract: We introduce Zero-1-to-3, a framework for changing the camera viewpoint of an object given just a single RGB image. To perform novel view synthesis in this under-constrained setting, we capitalize on the geometric priors that large-scale diffusion models learn about natural images. Our conditional diffusion model uses a synthetic dataset to learn controls of the relative camera viewpoint, which allow new images to be generated of the same object under a specified camera transformation. Even though it is trained on a synthetic dataset, our model retains a strong zero-shot generalization ability to out-of-distribution datasets as well as in-the-wild images, including impressionist paintings. Our viewpoint-conditioned diffusion approach can further be used for the task of 3D reconstruction from a single image. Qualitative and quantitative experiments show that our method significantly outperforms state-of-the-art single-view 3D reconstruction and novel view synthesis models by leveraging Internet-scale pre-training.


Authors (6)

Citations (824)


Summary

The paper introduces a diffusion-based framework that achieves zero-shot novel view synthesis and 3D reconstruction from one RGB image.
It fine-tunes a pre-trained conditional diffusion model on a synthetic dataset with relative camera transformations, learning viewpoint control that transfers to real-world images.
The approach outperforms state-of-the-art methods in metrics like PSNR, SSIM, LPIPS, and volumetric IoU, advancing applications in AR/VR and robotics.

Zero-1-to-3: Zero-Shot Control for 3D Reconstruction and View Synthesis

The paper presents Zero-1-to-3, a framework for zero-shot novel view synthesis and 3D reconstruction from a single RGB image. It builds on large-scale pre-trained diffusion models, which are known for generating diverse images, and adapts them to model the geometric transformations involved in changing the camera viewpoint.

Methodology Overview

Zero-1-to-3 fine-tunes a conditional diffusion model on a synthetic dataset so that the relative camera viewpoint becomes a controllable input. Although it is fine-tuned on synthetic data, the model shows strong zero-shot generalization to both out-of-distribution datasets and in-the-wild images, such as impressionist paintings. This matters because the model can synthesize new views of objects without relying on expensive 3D annotations or category-specific priors.
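As a rough illustration of what a viewpoint-conditioned diffusion model means in practice, the standard noise-prediction objective for a conditional latent diffusion model can be written as below. This summary does not define any notation, so the symbols are introduced here as assumptions: x is the input view, x_tgt is the target view, E is the latent encoder, z_t is the noised latent at timestep t, and c(x, R, T) is an embedding of the input view together with the relative camera rotation R and translation T. Read it as a sketch, not as a transcription of the paper's equation.

```latex
% Sketch of a viewpoint-conditioned latent diffusion objective (notation assumed).
% z_t is obtained by adding noise \epsilon to z = \mathcal{E}(x_{\mathrm{tgt}}) at timestep t.
\min_{\theta} \;
\mathbb{E}_{z = \mathcal{E}(x_{\mathrm{tgt}}),\; t,\; \epsilon \sim \mathcal{N}(0, I)}
\left\| \epsilon - \epsilon_{\theta}\bigl(z_t,\, t,\, c(x, R, T)\bigr) \right\|_2^2
```

Once the denoiser is trained this way, sampling with a chosen (R, T) produces an image of the same object under that camera transformation.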

The approach builds on the observation that large-scale diffusion models, like Stable Diffusion, learn geometric priors from natural images. The researchers fine-tune the model on pairs of images of the same object together with the relative camera transformation between them, which teaches the model to control camera extrinsics. With this formulation, the model extrapolates to unseen object classes, achieving state-of-the-art results in novel view synthesis and zero-shot 3D reconstruction.
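The relative camera transformation used as conditioning can be illustrated with a small sketch. It assumes that both cameras look at an object centred at the origin and that the viewpoint change is expressed in spherical coordinates (polar angle, azimuth, radius). The function names and the four-value sin/cos encoding are illustrative choices, not taken from this summary.

```python
import math

def cartesian_to_spherical(position):
    """Convert a camera position (x, y, z) to (polar, azimuth, radius)."""
    x, y, z = position
    radius = math.sqrt(x * x + y * y + z * z)
    if radius == 0.0:
        raise ValueError("camera position must not coincide with the origin")
    polar = math.acos(z / radius)   # angle from the +z axis, in [0, pi]
    azimuth = math.atan2(y, x)      # angle in the xy-plane, in [-pi, pi]
    return polar, azimuth, radius

def relative_viewpoint(source_position, target_position):
    """Return (d_polar, d_azimuth, d_radius) going from source to target camera."""
    p_src, a_src, r_src = cartesian_to_spherical(source_position)
    p_tgt, a_tgt, r_tgt = cartesian_to_spherical(target_position)
    # Wrap the azimuth difference into [-pi, pi) so that small rotations stay small.
    d_azimuth = (a_tgt - a_src + math.pi) % (2.0 * math.pi) - math.pi
    return p_tgt - p_src, d_azimuth, r_tgt - r_src

def viewpoint_embedding(d_polar, d_azimuth, d_radius):
    """Encode the azimuth with sin/cos so the vector is continuous across the wrap."""
    return [d_polar, math.sin(d_azimuth), math.cos(d_azimuth), d_radius]

# Example: rotate the camera a quarter turn around the object at constant height.
delta = relative_viewpoint((1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
embedding = viewpoint_embedding(*delta)
```

In a full pipeline, a vector like `embedding` would be combined with an embedding of the input image and passed to the denoiser as conditioning; that step is omitted here.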

Experimental Evaluation

The paper evaluates Zero-1-to-3 against existing state-of-the-art techniques on Google Scanned Objects (GSO), a dataset of scanned household items, and RTMV (Ray-Traced Multi-View), a synthetic dataset of complex scenes. The baselines include DietNeRF, which regularizes NeRF optimization with a semantic consistency loss across views, and Image Variations, which samples semantically similar images from a fine-tuned Stable Diffusion model. On PSNR, SSIM, LPIPS, and FID, the proposed model outperforms these baselines, indicating higher-fidelity generated views.
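Of the image metrics listed, PSNR is the simplest to state exactly. The following is a minimal sketch of how it is commonly computed between a ground-truth view and a generated view, assuming both are arrays of the same shape with pixel values in [0, 1]. The function name and the `max_value` parameter are illustrative and are not the paper's evaluation code.

```python
import numpy as np

def psnr(reference, generated, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    reference = np.asarray(reference, dtype=np.float64)
    generated = np.asarray(generated, dtype=np.float64)
    if reference.shape != generated.shape:
        raise ValueError("images must have the same shape")
    mse = np.mean((reference - generated) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

# Example: a uniform error of 0.1 on a [0, 1] image gives roughly 20 dB.
ground_truth = np.zeros((8, 8, 3))
prediction = np.full((8, 8, 3), 0.1)
score = psnr(ground_truth, prediction)
```

Higher PSNR means lower pixel-wise error. SSIM, LPIPS, and FID measure structural, perceptual, and distributional similarity respectively, and are usually taken from existing library implementations.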

For 3D reconstruction, the model is compared against established techniques like MCC and Point-E, and it generalizes well, reconstructing high-fidelity 3D meshes with low surface error. Most notably, its volumetric IoU significantly exceeds that of the other methods, suggesting that the reconstructions better capture object silhouettes and depth.
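Volumetric IoU compares the space occupied by a reconstructed shape with that of the ground truth. The sketch below assumes both shapes have already been voxelized into boolean occupancy grids of the same resolution. Voxelization itself, the function name, and the convention of returning 1.0 for two empty grids are assumptions made for illustration.

```python
import numpy as np

def volumetric_iou(pred_occupancy, gt_occupancy):
    """Intersection over union of two boolean voxel occupancy grids."""
    pred = np.asarray(pred_occupancy, dtype=bool)
    gt = np.asarray(gt_occupancy, dtype=bool)
    if pred.shape != gt.shape:
        raise ValueError("occupancy grids must have the same shape")
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both grids empty: treated as a perfect match (a convention)
    intersection = np.logical_and(pred, gt).sum()
    return float(intersection / union)

# Example: two slabs of 32 voxels each that overlap in 16 voxels.
pred = np.zeros((4, 4, 4), dtype=bool)
gt = np.zeros((4, 4, 4), dtype=bool)
pred[:2] = True    # slices 0 and 1
gt[1:3] = True     # slices 1 and 2
iou = volumetric_iou(pred, gt)  # 16 / 48
```

A value of 1.0 means the two volumes coincide exactly, and 0.0 means they do not overlap at all.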

Implications and Future Directions

This work is relevant to areas like AR/VR, robotics, and autonomous navigation, where 3D spaces must be understood and manipulated from minimal data. It both demonstrates that diffusion models learn geometric priors and advances image-based 3D reconstruction.

Future directions may include extending these methods to dynamic scenes, object relations in complex environments, and videos. Combining traditional graphics rendering techniques with diffusion models could also open new possibilities in realistic image generation and scene manipulation.

Conclusion

Zero-1-to-3 leverages the latent 3D information learned by diffusion models to achieve zero-shot view synthesis and 3D reconstruction. Its performance across diverse datasets underscores the potential of large-scale generative models to simplify tasks that otherwise require complex, specialized pipelines. This work is a step toward exploiting the implicit knowledge encoded in modern generative architectures for practical applications in computer vision and graphics.
