Deformable One-shot Face Stylization via DINO Semantic Guidance (2403.00459v2)
Abstract: This paper addresses the complex problem of one-shot face stylization, focusing on the simultaneous consideration of appearance and structure, an aspect where previous methods have fallen short. We explore deformation-aware face stylization that diverges from the traditional single style-image reference, opting instead for a real-style image pair. The cornerstone of our method is the use of a self-supervised vision transformer, specifically DINO-ViT, to establish a robust and consistent facial structure representation across both the real and style domains. Our stylization process begins by adapting the StyleGAN generator to be deformation-aware through the integration of spatial transformer networks (STNs). We then introduce two innovative constraints for generator fine-tuning under the guidance of DINO semantics: i) a directional deformation loss that regulates directional vectors in DINO space, and ii) a relative structural consistency constraint based on DINO token self-similarities, ensuring diverse generation. Additionally, style-mixing is employed to align color generation with the reference, minimizing inconsistent correspondences. This framework delivers enhanced deformability for general one-shot face stylization and achieves notable efficiency, with a fine-tuning duration of approximately 10 minutes. Extensive qualitative and quantitative comparisons demonstrate our superiority over state-of-the-art one-shot face stylization methods. Code is available at https://github.com/zichongc/DoesFS
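The two DINO-guided constraints named in the abstract can be sketched as follows. This is a minimal PyTorch illustration, not the authors' implementation: it assumes DINO-ViT token features have already been extracted (e.g., keys from a late attention layer, as is common practice), and the exact layer choice, loss weighting, and pooling used in the paper may differ. The directional loss aligns the real-to-style direction of the reference pair with the source-to-output direction of each generated pair; the structural constraint matches token self-similarity matrices between source and stylized output.

```python
import torch
import torch.nn.functional as F


def directional_deformation_loss(feat_real, feat_style, feat_src, feat_gen):
    """Align cross-domain direction vectors in DINO feature space.

    feat_real / feat_style: pooled DINO features of the reference pair.
    feat_src / feat_gen: pooled DINO features of a generated source image
    and its stylized counterpart. All tensors: (batch, dim).
    """
    d_ref = feat_style - feat_real  # direction of the reference pair
    d_gen = feat_gen - feat_src     # direction of the generated pair
    return (1.0 - F.cosine_similarity(d_ref, d_gen, dim=-1)).mean()


def self_similarity(tokens):
    """Token-wise cosine self-similarity matrix from DINO-ViT tokens.

    tokens: (batch, n_tokens, dim) -> (batch, n_tokens, n_tokens).
    """
    t = F.normalize(tokens, dim=-1)
    return t @ t.transpose(-1, -2)


def structural_consistency_loss(tokens_src, tokens_gen):
    """Relative structural constraint: the stylized output should preserve
    the self-similarity structure of its source, not its raw features."""
    return F.l1_loss(self_similarity(tokens_gen), self_similarity(tokens_src))


if __name__ == "__main__":
    # Stand-in features; in practice these come from a frozen DINO-ViT.
    feat = torch.randn(4, 768)
    toks = torch.randn(2, 196, 768)
    print(directional_deformation_loss(feat, feat + 1.0, feat, feat + 1.0))
    print(structural_consistency_loss(toks, toks))
```

Because both losses compare relative quantities (directions and self-similarities) rather than absolute features, they constrain structure without forcing the output to copy the reference's appearance, which is consistent with the abstract's goal of diverse generation.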