Transparent Image Layer Diffusion using Latent Transparency (2402.17113v4)
Abstract: We present LayerDiffuse, an approach that enables large-scale pretrained latent diffusion models to generate transparent images. The method supports generating either a single transparent image or multiple transparent layers. It learns a "latent transparency" that encodes alpha-channel transparency into the latent manifold of a pretrained latent diffusion model, and it preserves the production-ready quality of the large diffusion model by regulating the added transparency as a latent offset with minimal changes to the original latent distribution. In this way, any latent diffusion model can be converted into a transparent image generator by finetuning it on the adjusted latent space. We train the model on 1M transparent image layer pairs collected with a human-in-the-loop scheme. We show that latent transparency can be applied to different open-source image generators, or adapted to various conditional control systems, to achieve applications such as foreground/background-conditioned layer generation, joint layer generation, and structural control of layer contents. A user study finds that in most cases (97%) users prefer our natively generated transparent content over previous ad-hoc solutions such as generating and then matting. Users also report that the quality of our generated transparent images is comparable to real commercial transparent assets such as Adobe Stock.
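The core idea — encoding alpha as a small offset added to a pretrained model's latent, so the adjusted latent stays close to the original distribution — can be illustrated with a toy numpy sketch. Everything below (shapes, the linear encode/decode pair, the `scale` parameter) is a hypothetical stand-in for illustration, not the paper's actual networks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: a "pretrained" RGB latent and an alpha matte to hide inside it.
z_rgb = rng.standard_normal((4, 8, 8))   # latent (C, H, W), 8x downsampled
alpha = rng.random((1, 64, 64))          # alpha matte in [0, 1] at pixel resolution

SCALE = 0.05  # keeping the offset small regularizes the adjusted latent
              # toward the original latent distribution

def encode_transparency(alpha):
    """Hypothetical 'latent transparency' encoder: map alpha into a small
    latent-space offset (here, 8x average-pool then center and scale)."""
    a = alpha.reshape(1, 8, 8, 8, 8).mean(axis=(2, 4))  # (1, 8, 8)
    return SCALE * (a - 0.5)

def decode_transparency(offset):
    """Hypothetical inverse: recover a coarse alpha from the latent offset."""
    return offset / SCALE + 0.5

offset = encode_transparency(alpha)
z_adjusted = z_rgb + offset  # offset broadcasts over the channel axis

# The perturbation is tiny relative to the latent, so a pretrained decoder
# would still see an (almost) in-distribution latent.
rel_change = np.linalg.norm(z_adjusted - z_rgb) / np.linalg.norm(z_rgb)

# Recover the coarse alpha by isolating the offset from the adjusted latent.
alpha_rec = decode_transparency(
    (z_adjusted - z_rgb).mean(axis=0, keepdims=True)
)
```

In the actual method the encode/decode pair is learned jointly so that the offset is both invisible to the pretrained decoder and sufficient to reconstruct the alpha channel; the toy linear version above only conveys the offset-in-latent-space framing.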