
Sketch2NeRF: Multi-view Sketch-guided Text-to-3D Generation

(2401.14257)
Published Jan 25, 2024 in cs.CV and cs.AI

Abstract

Recently, text-to-3D approaches have achieved high-fidelity 3D content generation using text descriptions. However, the generated objects are stochastic and lack fine-grained control. Sketches provide a cheap way to introduce such fine-grained control. Nevertheless, it is challenging to achieve flexible control from these sketches due to their abstraction and ambiguity. In this paper, we present a multi-view sketch-guided text-to-3D generation framework (namely, Sketch2NeRF) to add sketch control to 3D generation. Specifically, our method leverages pretrained 2D diffusion models (e.g., Stable Diffusion and ControlNet) to supervise the optimization of a 3D scene represented by a neural radiance field (NeRF). We propose a novel synchronized generation and reconstruction method to effectively optimize the NeRF. In the experiments, we collected two kinds of multi-view sketch datasets to evaluate the proposed method. We demonstrate that our method can synthesize 3D consistent contents with fine-grained sketch control while being high-fidelity to text prompts. Extensive results show that our method achieves state-of-the-art performance in terms of sketch similarity and text alignment.

Sketch2NeRF transforms sketches into high-fidelity, controllable 3D objects resembling the input, exemplified by a teapot.

Overview

  • The paper introduces Sketch2NeRF, a framework that integrates sketch-based control with text-to-3D generative models to create high-fidelity 3D content.

  • Sketch2NeRF allows for more granular control in 3D content generation, overcoming limitations of previous methods that lacked fine-grained control from multi-view sketches.

  • The framework utilizes Neural Radiance Fields (NeRF) and sketch-conditional 2D diffusion models to optimize 3D object generation without the need for a large sketch-3D dataset.

  • Quantitative and qualitative evaluations on the OmniObject3D-Sketch and THuman-Sketch datasets show that Sketch2NeRF outperforms other methods in accuracy and fidelity.

  • Findings reveal that Sketch2NeRF enables the creation of 3D models with a high level of detail and precision, aligning closely with both text prompts and input sketches.

Introduction

With the proliferation of text-to-3D methods, the natural next step in 3D content generation is the inclusion of finer controls within the generative process. High-fidelity 3D content creation from text has seen remarkable advances but often lacks the granular control creators seek. Sketches offer control at a much more nuanced level, yet they have been a challenging medium to integrate into 3D generation, particularly because of their inherently 2D nature and the complexity of processing multi-view information. The Sketch2NeRF framework bridges this gap by incorporating the fine-grained control of sketches into text-to-3D generative models, thereby tackling the multi-view sketch-guided 3D object generation problem.

Related Work

Historically, controllable 3D generation from text has relied on prompt engineering and initial geometries, while image-guided generation struggles to offer fine control in 3D space. Such methods can synthesize diverse results but remain limited when it comes to fine-grained control. In prior work, diffusion models trained on large-scale image datasets and conditioned on text have been adapted for 3D synthesis, yet they do not provide sketch-based controllability. These pretrained models serve as the backbone for the innovation presented in Sketch2NeRF, which marks a significant departure from earlier techniques.

Methodology

Sketch2NeRF's approach is two-pronged. First, a neural radiance field (NeRF) represents the underlying 3D object. The framework then uses a sketch-conditioned 2D diffusion model (specifically, a pretrained Stable Diffusion model combined with a sketch-conditioned ControlNet) to supervise the NeRF's optimization through a novel synchronized generation and reconstruction method. Furthermore, an annealed time schedule is introduced to improve the quality of the generated objects. This methodology bypasses the need for a large sketch-3D dataset, a significant advantage over strategies that train 3D generators directly.
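
Although the summary contains no code, the structure of such a loop can be pictured with a short, self-contained sketch. Everything below is illustrative rather than the authors' implementation: the scribble ControlNet checkpoint, the linear anneal of the img2img `strength` value (standing in for the annealed time schedule), the blank placeholder sketches, and the tiny MLP standing in for a volume-rendered NeRF are all assumptions made for the example.

```python
import numpy as np
import torch
import torch.nn.functional as F
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetImg2ImgPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
H = W = 64  # tiny render resolution to keep the example cheap

# Sketch-conditioned 2D diffusion model: Stable Diffusion plus a scribble ControlNet
# (assumed checkpoints, used here as a stand-in for the paper's generator).
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-scribble")
pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet
).to(device)

# Multi-view sketch set: (camera pose, sketch image) pairs supplied by the user.
# Blank white images stand in for real hand-drawn or edge-extracted sketches.
poses = torch.randn(4, 3)
sketches = [Image.new("RGB", (512, 512), "white") for _ in range(4)]

# Stand-in for the NeRF: a small MLP mapping (pixel coords, pose) to RGB.
# A real implementation would volume-render a radiance field along camera rays.
field = torch.nn.Sequential(
    torch.nn.Linear(5, 128), torch.nn.ReLU(),
    torch.nn.Linear(128, 128), torch.nn.ReLU(),
    torch.nn.Linear(128, 3),
)
optimizer = torch.optim.Adam(field.parameters(), lr=1e-3)

def render_view(pose):
    """Placeholder differentiable render of one view of the shared 3D representation."""
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij"
    )
    coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)
    inp = torch.cat([coords, pose.expand(coords.shape[0], -1)], dim=-1)
    return field(inp).reshape(H, W, 3).sigmoid()

def to_pil(img):
    return Image.fromarray((img.detach().cpu().numpy() * 255).astype("uint8"))

prompt = "a photo of a teapot"
num_iters, recon_steps = 500, 10
for it in range(num_iters):
    # Assumed linear anneal: early iterations let the diffusion model change the
    # render a lot; later iterations keep its edits small so the object converges.
    strength = 0.98 - 0.8 * (it / num_iters)

    # Generation step: pick a view, render it, and ask the sketch-conditioned
    # diffusion model for an improved image of that view.
    idx = np.random.randint(len(poses))
    with torch.no_grad():
        render = render_view(poses[idx])
    target_pil = pipe(
        prompt,
        image=to_pil(render).resize((512, 512)),
        control_image=sketches[idx],
        strength=strength,
        num_inference_steps=20,
    ).images[0]
    target = torch.from_numpy(
        np.array(target_pil.resize((W, H)), dtype=np.float32) / 255.0
    )

    # Reconstruction step: fit the shared field to the freshly generated view.
    for _ in range(recon_steps):
        optimizer.zero_grad()
        loss = F.mse_loss(render_view(poses[idx]), target)
        loss.backward()
        optimizer.step()
```

The structural point is the alternation: each outer iteration first generates a sketch- and text-conditioned target for one view, then runs a few reconstruction steps so the single shared 3D representation absorbs that view, which is what keeps the generated views mutually consistent.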

Experiments and Results

Experiments on the OmniObject3D-Sketch and THuman-Sketch datasets demonstrate Sketch2NeRF's capacity for fine-grained control in 3D generation. The method is evaluated with metrics for sketch similarity and text alignment. Qualitative and quantitative results show that Sketch2NeRF outperforms other approaches in generating 3D objects consistent with both the input multi-view sketches and the text descriptions, as evidenced by lower Chamfer Distance (CD) and higher CLIP R-Precision scores.
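
As an illustration of the text-alignment side of this evaluation, the snippet below sketches a CLIP R-Precision check at R=1 using the Hugging Face transformers CLIP model. The checkpoint, distractor pool, and helper names are assumptions for the example, not the authors' evaluation code.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def r_precision_at_1(rendered_images, true_prompts, distractor_prompts):
    """Fraction of renders whose ground-truth prompt is the top CLIP match
    among that prompt plus a pool of distractor prompts."""
    hits = 0
    for image, prompt in zip(rendered_images, true_prompts):
        candidates = [prompt] + distractor_prompts
        inputs = processor(
            text=candidates, images=image, return_tensors="pt", padding=True
        )
        with torch.no_grad():
            logits = model(**inputs).logits_per_image  # shape (1, num_candidates)
        hits += int(logits.argmax(dim=-1).item() == 0)  # index 0 is the true prompt
    return hits / len(rendered_images)

# Example usage with placeholder data (a gray image standing in for a NeRF render):
renders = [Image.new("RGB", (224, 224), "gray")]
score = r_precision_at_1(
    renders,
    ["a photo of a teapot"],
    ["a photo of a chair", "a photo of a dog"],
)
print(score)
```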

Conclusion

The research advances text-to-3D generative modeling by introducing a multi-view sketch-guided methodology, Sketch2NeRF. The authors present a framework that leverages sketch-conditioned guidance to create 3D assets aligned with both textual prompts and the provided sketches. Their findings underscore the framework's ability to generate 3D content with high fidelity and controllability, faithfully capturing the shape and concept conveyed in the sketches, a marked advance for creators aiming for precision in their 3D models.
