Renderers are Good Zero-Shot Representation Learners: Exploring Diffusion Latents for Metric Learning (2306.10721v1)
Abstract: Can the latent spaces of modern generative neural rendering models serve as representations for 3D-aware discriminative visual understanding tasks? We use retrieval as a proxy for measuring the metric learning properties of the latent spaces of Shap-E, including view-independence and the ability to aggregate a scene representation from the representations of individual image views. We find that Shap-E representations outperform those of a classical EfficientNet baseline zero-shot, and remain competitive when both methods are trained using a contrastive loss. These findings give a preliminary indication that 3D-based rendering and generative models can yield useful representations for discriminative tasks in our innately 3D-native world. Our code is available at https://github.com/michaelwilliamtang/golden-retriever.
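
To make the retrieval-as-proxy idea concrete, the following is a minimal sketch of how view embeddings might be aggregated into a scene embedding and scored by nearest-neighbor retrieval. It is not the authors' implementation: mean pooling, cosine similarity, the recall@k metric, and the helper names `aggregate_views` and `recall_at_k` are all assumptions made for illustration, and the embeddings are synthetic random vectors rather than Shap-E or EfficientNet outputs.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def aggregate_views(view_embeddings):
    """Mean-pool (n_views, dim) view embeddings into one unit-norm scene embedding.

    Mean pooling is an assumption here; the abstract does not specify the aggregation.
    """
    return l2_normalize(np.mean(view_embeddings, axis=0))

def recall_at_k(queries, query_labels, gallery, gallery_labels, k=1):
    """Fraction of queries whose k most cosine-similar gallery items include the true label."""
    sims = l2_normalize(queries) @ l2_normalize(gallery).T   # (n_queries, n_gallery)
    topk = np.argsort(-sims, axis=1)[:, :k]                  # indices of the k best matches
    hits = (gallery_labels[topk] == query_labels[:, None]).any(axis=1)
    return float(hits.mean())

# Synthetic stand-in for per-view latents: each scene has a center, each view adds small noise.
rng = np.random.default_rng(0)
n_scenes, n_views, dim = 5, 4, 16
centers = rng.normal(size=(n_scenes, dim))
views = centers[:, None, :] + 0.05 * rng.normal(size=(n_scenes, n_views, dim))

# Gallery: one aggregated embedding per scene, built from all views but the last.
gallery = np.stack([aggregate_views(v[:-1]) for v in views])
# Queries: the held-out last view of each scene.
queries = views[:, -1, :]
labels = np.arange(n_scenes)

score = recall_at_k(queries, labels, gallery, labels, k=1)
print(f"recall@1 on synthetic data: {score:.2f}")
```

A view-independent representation should place a held-out view close to the aggregate of the other views of the same scene, so higher recall under this kind of protocol is read as better metric structure in the latent space.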