
Sketch2NeRF: Multi-view Sketch-guided Text-to-3D Generation

(2401.14257)
Published Jan 25, 2024 in cs.CV and cs.AI

Abstract

Recently, text-to-3D approaches have achieved high-fidelity 3D content generation using text descriptions. However, the generated objects are stochastic and lack fine-grained control. Sketches provide a cheap way to introduce such fine-grained control. Nevertheless, it is challenging to achieve flexible control from these sketches due to their abstraction and ambiguity. In this paper, we present a multi-view sketch-guided text-to-3D generation framework (namely, Sketch2NeRF) to add sketch control to 3D generation. Specifically, our method leverages pretrained 2D diffusion models (e.g., Stable Diffusion and ControlNet) to supervise the optimization of a 3D scene represented by a neural radiance field (NeRF). We propose a novel synchronized generation and reconstruction method to effectively optimize the NeRF. In the experiments, we collected two kinds of multi-view sketch datasets to evaluate the proposed method. We demonstrate that our method can synthesize 3D consistent contents with fine-grained sketch control while being high-fidelity to text prompts. Extensive results show that our method achieves state-of-the-art performance in terms of sketch similarity and text alignment.

Sketch2NeRF transforms sketches into high-fidelity, controllable 3D objects resembling the input, exemplified by a teapot.

Overview

  • The paper introduces Sketch2NeRF, a framework that integrates sketch-based control with text-to-3D generative models to create high-fidelity 3D content.

  • Sketch2NeRF allows for more granular control in 3D content generation, overcoming limitations of previous methods that lacked fine-grained control from multi-view sketches.

  • The framework utilizes Neural Radiance Fields (NeRF) and sketch-conditional 2D diffusion models to optimize 3D object generation without the need for a large sketch-3D dataset.

  • Quantitative and qualitative evaluations on the OmniObject3D-Sketch and THuman-Sketch datasets show that Sketch2NeRF outperforms other methods in accuracy and fidelity.

  • Findings reveal that Sketch2NeRF enables the creation of 3D models with a high level of detail and precision, aligning closely with both text prompts and input sketches.

Introduction

With the proliferation of text-to-3D methods, the natural next step in 3D content generation is the inclusion of finer controls within the generative process. High-fidelity 3D content creation from text has seen remarkable advances but often lacks the granular control creators seek. Sketches offer control at a much more nuanced level, yet they have been a challenging medium to integrate into 3D generation, particularly because of their inherently 2D nature and the complexity of processing multi-view information. The Sketch2NeRF framework bridges this gap by incorporating the fine-grained control of sketches into text-to-3D generative models, thereby tackling the multi-view sketch-guided 3D object generation problem.

Related Work

Historically, controllable 3D generation from text has relied on prompt engineering and initial geometries, while image-guided generation struggles to offer fine control in 3D space. Such methods can synthesize diverse results but remain limited when it comes to fine-grained control. In prior work, diffusion models trained on large-scale image datasets and conditioned on text have been adapted for 3D synthesis, yet they do not provide sketch-based controllability. These pretrained models serve as the backbone for the innovation presented in Sketch2NeRF, which marks a significant departure from earlier techniques.

Methodology

Sketch2NeRF's approach is two-pronged. First, a neural radiance field (NeRF) represents the underlying 3D object. The framework then uses a sketch-conditioned 2D diffusion model (specifically, a pretrained Stable Diffusion model combined with a sketch-conditioned ControlNet) to supervise the NeRF's optimization through a novel synchronized generation and reconstruction method. Furthermore, an annealed time schedule is introduced to improve the quality of the generated objects. This methodology bypasses the need for a large sketch-3D dataset, a significant advantage over strategies that train 3D generators directly.
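
Although the summary contains no code, the structure of such a loop can be pictured with a short, self-contained sketch. Everything below is illustrative rather than the authors' implementation: the scribble ControlNet checkpoint, the linear anneal of the img2img `strength` value (standing in for the annealed time schedule), the blank placeholder sketches, and the tiny MLP standing in for a volume-rendered NeRF are all assumptions made for the example.

```python
import numpy as np
import torch
import torch.nn.functional as F
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetImg2ImgPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
H = W = 64  # tiny render resolution to keep the example cheap

# Sketch-conditioned 2D diffusion model: Stable Diffusion plus a scribble ControlNet
# (assumed checkpoints, used here as a stand-in for the paper's generator).
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-scribble")
pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet
).to(device)

# Multi-view sketch set: (camera pose, sketch image) pairs supplied by the user.
# Blank white images stand in for real hand-drawn or edge-extracted sketches.
poses = torch.randn(4, 3)
sketches = [Image.new("RGB", (512, 512), "white") for _ in range(4)]

# Stand-in for the NeRF: a small MLP mapping (pixel coords, pose) to RGB.
# A real implementation would volume-render a radiance field along camera rays.
field = torch.nn.Sequential(
    torch.nn.Linear(5, 128), torch.nn.ReLU(),
    torch.nn.Linear(128, 128), torch.nn.ReLU(),
    torch.nn.Linear(128, 3),
)
optimizer = torch.optim.Adam(field.parameters(), lr=1e-3)

def render_view(pose):
    """Placeholder differentiable render of one view of the shared 3D representation."""
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij"
    )
    coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)
    inp = torch.cat([coords, pose.expand(coords.shape[0], -1)], dim=-1)
    return field(inp).reshape(H, W, 3).sigmoid()

def to_pil(img):
    return Image.fromarray((img.detach().cpu().numpy() * 255).astype("uint8"))

prompt = "a photo of a teapot"
num_iters, recon_steps = 500, 10
for it in range(num_iters):
    # Assumed linear anneal: early iterations let the diffusion model change the
    # render a lot; later iterations keep its edits small so the object converges.
    strength = 0.98 - 0.8 * (it / num_iters)

    # Generation step: pick a view, render it, and ask the sketch-conditioned
    # diffusion model for an improved image of that view.
    idx = np.random.randint(len(poses))
    with torch.no_grad():
        render = render_view(poses[idx])
    target_pil = pipe(
        prompt,
        image=to_pil(render).resize((512, 512)),
        control_image=sketches[idx],
        strength=strength,
        num_inference_steps=20,
    ).images[0]
    target = torch.from_numpy(
        np.array(target_pil.resize((W, H)), dtype=np.float32) / 255.0
    )

    # Reconstruction step: fit the shared field to the freshly generated view.
    for _ in range(recon_steps):
        optimizer.zero_grad()
        loss = F.mse_loss(render_view(poses[idx]), target)
        loss.backward()
        optimizer.step()
```

The structural point is the alternation: each outer iteration first generates a sketch- and text-conditioned target for one view, then runs a few reconstruction steps so the single shared 3D representation absorbs that view, which is what keeps the generated views mutually consistent.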

Experiments and Results

Experiments on the OmniObject3D-Sketch and THuman-Sketch datasets demonstrate Sketch2NeRF's capacity for fine-grained control in 3D generation. The method is evaluated with metrics for sketch similarity and text alignment. Qualitative and quantitative results show that Sketch2NeRF outperforms other approaches in generating 3D objects consistent with both the input multi-view sketches and the text descriptions, as evidenced by lower Chamfer Distance (CD) and higher CLIP R-Precision scores.
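
As an illustration of the text-alignment side of this evaluation, the snippet below sketches a CLIP R-Precision check at R=1 using the Hugging Face transformers CLIP model. The checkpoint, distractor pool, and helper names are assumptions for the example, not the authors' evaluation code.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def r_precision_at_1(rendered_images, true_prompts, distractor_prompts):
    """Fraction of renders whose ground-truth prompt is the top CLIP match
    among that prompt plus a pool of distractor prompts."""
    hits = 0
    for image, prompt in zip(rendered_images, true_prompts):
        candidates = [prompt] + distractor_prompts
        inputs = processor(
            text=candidates, images=image, return_tensors="pt", padding=True
        )
        with torch.no_grad():
            logits = model(**inputs).logits_per_image  # shape (1, num_candidates)
        hits += int(logits.argmax(dim=-1).item() == 0)  # index 0 is the true prompt
    return hits / len(rendered_images)

# Example usage with placeholder data (a gray image standing in for a NeRF render):
renders = [Image.new("RGB", (224, 224), "gray")]
score = r_precision_at_1(
    renders,
    ["a photo of a teapot"],
    ["a photo of a chair", "a photo of a dog"],
)
print(score)
```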

Conclusion

The research advances text-to-3D generative modeling by introducing a multi-view sketch-guided methodology, Sketch2NeRF. The authors present a framework that leverages sketch-conditioned guidance to create 3D assets aligned with both textual prompts and the provided sketches. Their findings underscore the framework's ability to generate 3D content with high fidelity and controllability, faithfully capturing the shape and concept conveyed in the sketches, a marked advance for creators aiming for precision in their 3D models.
