Sketch-A-Shape: Zero-Shot Sketch-to-3D Shape Generation (2307.03869v1)
Abstract: Significant progress has recently been made in creative applications of large pre-trained models for downstream tasks in 3D vision, such as text-to-shape generation. This motivates our investigation of how such pre-trained models can be used effectively to generate 3D shapes from sketches, which has largely remained an open challenge due to the scarcity of paired sketch-shape datasets and the varying levels of abstraction in sketches. We find that conditioning a 3D generative model during training on features of synthetic renderings, obtained from a frozen large pre-trained vision model, enables us to generate 3D shapes from sketches at inference time. This suggests that features from large pre-trained vision models carry semantic signals that are resilient to domain shift: we train only on RGB renderings, yet generalize to sketches at inference time. We conduct a comprehensive set of experiments investigating different design factors and demonstrate that our straightforward approach generates multiple 3D shapes per input sketch, regardless of the sketch's level of abstraction, without requiring any paired datasets during training.
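The core idea of the abstract can be sketched as follows. This is a minimal, illustrative toy in NumPy: the "frozen encoder" stands in for a large pre-trained vision model (e.g. a ViT), here reduced to a fixed random projection whose weights are never updated. The array sizes, the `encode` helper, and the feature normalization are all assumptions for illustration, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a FROZEN pre-trained vision encoder: a fixed linear map whose
# weights are never updated during training of the shape generator.
W_frozen = rng.standard_normal((512, 64))

def encode(image: np.ndarray) -> np.ndarray:
    """Map a (flattened, 512-d here) image to a 64-d conditioning vector."""
    z = image @ W_frozen
    return z / np.linalg.norm(z)  # normalized feature vector

# Training time: the 3D generative model is conditioned on features of
# synthetic RGB renderings of the training shapes.
rendering = rng.standard_normal(512)
cond_train = encode(rendering)

# Inference time: a sketch is passed through the SAME frozen encoder; no
# sketches were seen in training, relying on the encoder's robustness to
# the rendering-to-sketch domain shift.
sketch = rng.standard_normal(512)
cond_infer = encode(sketch)

# Both domains land in the same conditioning space for the generator.
assert cond_train.shape == cond_infer.shape == (64,)
```

Because the encoder is shared and frozen, renderings and sketches are mapped into a common feature space, which is what lets the generator trained only on renderings accept sketch features at test time.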