Sketch-A-Shape: Zero-Shot Sketch-to-3D Shape Generation (2307.03869v1)
Abstract: Significant progress has recently been made in creative applications of large pre-trained models for downstream tasks in 3D vision, such as text-to-shape generation. This motivates our investigation of how such pre-trained models can be used effectively to generate 3D shapes from sketches, which has largely remained an open challenge due to the scarcity of paired sketch-shape datasets and the varying levels of abstraction in sketches. We find that conditioning a 3D generative model during training on features of synthetic renderings, obtained from a frozen large pre-trained vision model, enables us to generate 3D shapes from sketches at inference time. This suggests that features from large pre-trained vision models carry semantic signals that are resilient to domain shift: we train only on RGB renderings, yet generalize to sketches at inference time. We conduct a comprehensive set of experiments investigating different design factors and demonstrate that our straightforward approach generates multiple 3D shapes per input sketch, regardless of the sketch's level of abstraction, without requiring any paired datasets during training.
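The core idea of the abstract can be sketched as follows. This is a minimal, illustrative toy in NumPy: the "frozen encoder" stands in for a large pre-trained vision model (e.g. a ViT), here reduced to a fixed random projection whose weights are never updated. The array sizes, the `encode` helper, and the feature normalization are all assumptions for illustration, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a FROZEN pre-trained vision encoder: a fixed linear map whose
# weights are never updated during training of the shape generator.
W_frozen = rng.standard_normal((512, 64))

def encode(image: np.ndarray) -> np.ndarray:
    """Map a (flattened, 512-d here) image to a 64-d conditioning vector."""
    z = image @ W_frozen
    return z / np.linalg.norm(z)  # normalized feature vector

# Training time: the 3D generative model is conditioned on features of
# synthetic RGB renderings of the training shapes.
rendering = rng.standard_normal(512)
cond_train = encode(rendering)

# Inference time: a sketch is passed through the SAME frozen encoder; no
# sketches were seen in training, relying on the encoder's robustness to
# the rendering-to-sketch domain shift.
sketch = rng.standard_normal(512)
cond_infer = encode(sketch)

# Both domains land in the same conditioning space for the generator.
assert cond_train.shape == cond_infer.shape == (64,)
```

Because the encoder is shared and frozen, renderings and sketches are mapped into a common feature space, which is what lets the generator trained only on renderings accept sketch features at test time.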