
Generic 3D Diffusion Adapter Using Controlled Multi-View Editing

(2403.12032)
Published Mar 18, 2024 in cs.CV and cs.GR

Abstract

Open-domain 3D object synthesis has been lagging behind image synthesis due to limited data and higher computational complexity. To bridge this gap, recent works have investigated multi-view diffusion but often fall short in either 3D consistency, visual quality, or efficiency. This paper proposes MVEdit, which functions as a 3D counterpart of SDEdit, employing ancestral sampling to jointly denoise multi-view images and output high-quality textured meshes. Built on off-the-shelf 2D diffusion models, MVEdit achieves 3D consistency through a training-free 3D Adapter, which lifts the 2D views of the last timestep into a coherent 3D representation, then conditions the 2D views of the next timestep using rendered views, without compromising visual quality. With an inference time of only 2-5 minutes, this framework achieves a better trade-off between quality and speed than score distillation. MVEdit is highly versatile and extendable, with a wide range of applications including text/image-to-3D generation, 3D-to-3D editing, and high-quality texture synthesis. In particular, evaluations demonstrate state-of-the-art performance in both image-to-3D and text-guided texture generation tasks. Additionally, we introduce a method for fine-tuning 2D latent diffusion models on small 3D datasets with limited resources, enabling fast low-resolution text-to-3D initialization.

MVEdit significantly improves the resolution of multi-view inconsistencies, fixing broken structures and unclear textures in reconstructions.

Overview

  • The paper introduces MVEdit, a framework that enhances 3D object synthesis by leveraging pre-trained 2D models and a novel 3D Adapter to ensure 3D consistency and high visual quality.

  • MVEdit utilizes ControlNets to condition the denoising steps of diffusion models, facilitating the transformation of 2D views into coherent 3D models without extensive model adjustments.

  • This approach significantly improves the balance between quality, speed, and 3D consistency in 3D content generation, outperforming previous techniques like score distillation.

  • MVEdit demonstrates its versatility across various tasks, including text/image-to-3D generation and 3D model editing, and suggests avenues for future research in enhancing 3D generative models.


Introduction

Open-domain 3D object synthesis has long been a challenge in computer graphics and artificial intelligence, constrained by sparse data and heavy computational demands. Recent progress has come from multi-view diffusion models that leverage pre-trained 2D models for 3D generation tasks. However, these techniques often struggle to ensure 3D consistency, retain high visual quality, or operate efficiently. Addressing these issues, this study introduces MVEdit, a new framework that implements a 3D Adapter mechanism to produce high-quality textured meshes by employing ancestral sampling and conditioning techniques on multi-view images.

MVEdit Overview

MVEdit capitalizes on off-the-shelf 2D diffusion models and integrates a novel, training-free 3D Adapter to ensure 3D consistency across multi-view inputs. The key innovation lies in its ability to lift 2D views into a coherent 3D representation and then condition subsequent 2D views on this 3D model, facilitating cross-view information exchange without compromising visual fidelity. Inference takes only 2-5 minutes, presenting a better balance between quality, speed, and 3D consistency than previous techniques like score distillation.
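
To make the sampling loop concrete, here is a minimal pseudocode-style sketch of how such a training-free 3D Adapter can be interleaved with ancestral sampling. All function names (denoise_step, fit_3d_proxy, render_views) and the pipeline interface are hypothetical placeholders chosen for illustration, not the authors' actual implementation.

```python
def mvedit_sample(noisy_views, cameras, pipe, num_steps=50):
    """Jointly denoise N multi-view images while enforcing 3D consistency.

    noisy_views: N noisy view images/latents (pure noise for generation, or a
                 partially noised input for SDEdit-style 3D-to-3D editing).
    cameras:     the N camera poses used to render the intermediate 3D proxy.
    pipe:        a pre-trained 2D diffusion pipeline with a ControlNet branch.
    """
    x_t = noisy_views
    proxy = None  # intermediate 3D representation (e.g. NeRF or textured mesh)

    for t in pipe.timesteps(num_steps):
        # 1) Denoise every view for one step; once a 3D proxy exists, condition
        #    each view on its rendering from the matching camera (the "3D Adapter").
        controls = render_views(proxy, cameras) if proxy is not None else None
        x0_pred = denoise_step(pipe, x_t, t, control=controls)

        # 2) Lift the per-view denoised predictions into one coherent 3D proxy.
        proxy = fit_3d_proxy(x0_pred, cameras, init=proxy)

        # 3) Ancestral sampling update toward the next timestep, using renderings
        #    of the fused proxy so that all views share a single geometry.
        x_t = pipe.ancestral_step(render_views(proxy, cameras), x_t, t)

    return proxy  # final output: a textured mesh
```

The design point worth noting is that consistency comes entirely from this reconstruct-and-render loop rather than from retraining or fine-tuning the 2D diffusion model, which is what makes the adapter training-free.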

Core Contributions

  • 3D Adapter on Existing Diffusion Models: Unlike prior approaches requiring substantial model adjustments or end-to-end training for 3D consistency, MVEdit uses ControlNets to condition the denoising steps of pre-trained 2D diffusion models on 3D-aware renderings (a concrete conditioning example follows this list).
  • Versatile and Extendable Framework: Demonstrated across various tasks such as text/image-to-3D generation, 3D-to-3D editing, and texture synthesis, MVEdit showcases state-of-the-art performance, particularly in image-to-3D and text-guided texture generation.
  • Fast Text-to-3D Initialization: Introducing StableSSDNeRF, a method to fine-tune 2D latent diffusion models for 3D initialization, MVEdit circumvents the scarcity of large 3D datasets and achieves rapid low-resolution 3D generation.
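
As an illustration of the building block this relies on (not the authors' actual pipeline), the snippet below shows how an off-the-shelf depth ControlNet from the Hugging Face diffusers library can condition a pre-trained Stable Diffusion model on a control image; in MVEdit, analogous control images come from renderings of the evolving 3D representation at each denoising step. The model IDs, file path, and prompt here are placeholders.

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Off-the-shelf depth ControlNet attached to a frozen, pre-trained 2D diffusion model.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Hypothetical input: a depth map rendered from the current 3D proxy.
depth_map = Image.open("rendered_depth.png")

image = pipe(
    prompt="a photorealistic carved wooden chair, studio lighting",
    image=depth_map,                      # the control image steering the denoiser
    num_inference_steps=30,
    controlnet_conditioning_scale=0.8,    # strength of the 3D-aware conditioning
).images[0]
image.save("conditioned_view.png")
```

In MVEdit, this kind of conditioning is applied inside the joint multi-view sampling loop sketched earlier, rather than as a one-shot generation.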

Practical and Theoretical Implications

The MVEdit framework marks a significant step toward efficient 3D content generation from 2D data, highlighting the potential of reusing pre-trained models across dimensions without extensive retraining. Theoretically, it addresses the feasibility of achieving cross-dimensional consistency through conditional diffusion processes, providing a blueprint for future research in 3D generative models.

From a practical standpoint, the versatility and extendability of MVEdit unlock new possibilities in digital content creation, enabling intricate 3D model generation and editing with minimal input requirements. This could particularly benefit industries reliant on rapid prototyping and visualization, like gaming, virtual reality, and film production.

Future Directions in AI and 3D Generation

Looking ahead, the development of purpose-built 3D Adapters, specifically trained to augment 2D diffusion models for 3D tasks, could further improve the efficiency, quality, and consistency of generated objects. Moreover, enhancing the understanding and optimization of the underlying conditioning mechanisms between 2D imagery and 3D models stands as an exciting area for ongoing research, with the potential to bridge the current gap between these dimensions more seamlessly.

In conclusion, MVEdit represents a notable advancement in the domain of 3D object synthesis, promoting a more effective utilization of existing 2D models for 3D generation tasks. Its methodological advancements and practical applications suggest a promising avenue for further exploration and development within the AI and computer graphics research communities.
