DiffArtist: Towards Structure and Appearance Controllable Image Stylization (2407.15842v3)
Abstract: Artistic style comprises both structural and appearance elements. Existing neural stylization techniques focus primarily on transferring appearance features such as color and texture, often neglecting the equally crucial aspect of structural stylization. In this paper, we present a comprehensive study of the simultaneous stylization of the structure and appearance of 2D images. Specifically, we introduce DiffArtist, which, to the best of our knowledge, is the first stylization method to offer dual controllability over structure and appearance. Our key insight is to represent structure and appearance as separate diffusion processes, achieving complete disentanglement without any training and thereby giving users unprecedented control over both components. Evaluating the stylization of both appearance and structure, however, remains challenging because it requires semantic understanding. To this end, we further propose a multimodal-LLM-based style evaluator, which aligns with human preferences better than metrics that lack semantic understanding. With this evaluator, we conduct extensive analysis demonstrating that DiffArtist achieves superior style fidelity, editability, and structure-appearance disentanglement. These merits make DiffArtist a highly versatile solution for creative applications. Project homepage: https://github.com/songrise/Artist.
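The abstract's key insight, representing structure and appearance as two separate diffusion processes that are recombined during denoising with no training required, can be illustrated with a short sketch. The code below is a hypothetical, minimal illustration built on Hugging Face diffusers, not the authors' implementation: the prompts, the per-branch guidance weights `W_STRUCT` and `W_APPEAR`, and the simple noise-prediction blend (closer in spirit to composable classifier-free guidance than to DiffArtist's actual design) are all assumptions for exposition.

```python
# Hypothetical sketch: run a "structure" branch and an "appearance" branch as
# separate diffusion processes and blend their noise predictions each step.
# NOT the DiffArtist implementation; prompts, weights, and the blend rule are
# illustrative assumptions (essentially composable classifier-free guidance).
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=dtype
).to(device)
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

W_STRUCT, W_APPEAR = 4.0, 4.0  # independent control knobs (assumed values)
NUM_STEPS = 50

@torch.no_grad()
def encode(prompt: str) -> torch.Tensor:
    ids = pipe.tokenizer(
        prompt, padding="max_length", truncation=True,
        max_length=pipe.tokenizer.model_max_length, return_tensors="pt",
    ).input_ids.to(device)
    return pipe.text_encoder(ids)[0]

@torch.no_grad()
def dual_branch_sample(structure_prompt: str, appearance_prompt: str, seed: int = 0):
    cond_s = encode(structure_prompt)   # condition for the structure process
    cond_a = encode(appearance_prompt)  # condition for the appearance process
    uncond = encode("")                 # shared unconditional baseline

    gen = torch.Generator(device).manual_seed(seed)
    latents = torch.randn(
        1, pipe.unet.config.in_channels, 64, 64,
        generator=gen, device=device, dtype=dtype,
    ) * pipe.scheduler.init_noise_sigma
    pipe.scheduler.set_timesteps(NUM_STEPS, device=device)

    for t in pipe.scheduler.timesteps:
        lat_in = pipe.scheduler.scale_model_input(latents, t)
        # One U-Net pass per diffusion process.
        eps_u = pipe.unet(lat_in, t, encoder_hidden_states=uncond).sample
        eps_s = pipe.unet(lat_in, t, encoder_hidden_states=cond_s).sample
        eps_a = pipe.unet(lat_in, t, encoder_hidden_states=cond_a).sample
        # Structure and appearance contribute separately weighted guidance
        # directions, so each can be dialed up or down without retraining.
        eps = eps_u + W_STRUCT * (eps_s - eps_u) + W_APPEAR * (eps_a - eps_u)
        latents = pipe.scheduler.step(eps, t, latents).prev_sample

    image = pipe.vae.decode(latents / pipe.vae.config.scaling_factor).sample
    return pipe.image_processor.postprocess(image, output_type="pil")[0]

img = dual_branch_sample(
    "a cat sitting on a wooden chair",                  # structure
    "Van Gogh oil painting with swirling brushstrokes"  # appearance
)
```

Because the two guidance directions are weighted independently, raising `W_STRUCT` while holding `W_APPEAR` fixed (or vice versa) adjusts one component without retraining anything, which is the kind of dual controllability the abstract describes.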
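The proposed multimodal-LLM style evaluator can likewise be sketched. The snippet below is a minimal illustration using the OpenAI Python SDK with a vision-capable chat model; the rubric wording, the 1-to-10 scales, and the choice of `gpt-4o` are illustrative assumptions, not the paper's actual evaluation protocol.

```python
# Hypothetical sketch of an MLLM-based style evaluator: ask a vision-capable
# chat model to score a stylized image against a target style description.
# The rubric, scales, and model choice are assumptions, not the paper's setup.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = (
    "You are evaluating image stylization. Given the stylized image and the "
    "target style description, rate style fidelity, content preservation, "
    "and overall quality on a 1-10 scale each. Reply as a JSON object with "
    "keys 'style', 'content', and 'overall'."
)

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def evaluate_stylization(image_path: str, style_prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": f"{RUBRIC}\nTarget style: {style_prompt}"},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{encode_image(image_path)}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

print(evaluate_stylization("stylized.png", "Van Gogh oil painting"))
```

Unlike pixel- or feature-space metrics, a judge of this kind can reason about whether a structural exaggeration is stylistically intended, which is why the abstract argues such an evaluator aligns better with human preferences.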