DiffUTE: Universal Text Editing Diffusion Model (2305.10825v3)
Abstract: Diffusion-model-based, language-guided image editing has achieved great success recently. However, existing state-of-the-art diffusion models struggle to render correct text and text styles during generation. To tackle this problem, we propose a universal self-supervised text editing diffusion model (DiffUTE), which aims to replace or modify words in the source image with new ones while maintaining a realistic appearance. Specifically, we build our model on a diffusion model and carefully modify the network structure so that the model can draw multilingual characters with the help of glyph and position information. Moreover, we design a self-supervised learning framework that leverages large amounts of web data to improve the representation ability of the model. Experimental results show that our method achieves impressive performance and enables controllable, high-fidelity editing of in-the-wild images. Our code will be available at \url{https://github.com/chenhaoxing/DiffUTE}.
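To make the abstract's two ideas concrete (glyph/position conditioning of the diffusion backbone, and self-supervised training that masks and redraws text regions in web images), here is a minimal, self-contained PyTorch sketch of how such conditioning could be wired into a denoising training step. Every name, shape, and schedule below (`GlyphEncoder`, `TinyCondUNet`, the linear noising) is an illustrative assumption, not the authors' implementation.

```python
# Illustrative sketch only: the real DiffUTE architecture, noise schedule,
# and APIs differ; every module and name below is hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlyphEncoder(nn.Module):
    """Encodes a rendered image of the target text into a spatial feature map."""
    def __init__(self, out_ch=27):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(32, out_ch, 3, stride=2, padding=1),
        )

    def forward(self, glyph):  # glyph: (B, 1, H, W) grayscale text rendering
        return self.conv(glyph)

class TinyCondUNet(nn.Module):
    """Toy stand-in for the diffusion backbone; predicts the added noise."""
    def __init__(self, in_ch=4, cond_ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch + cond_ch, 64, 3, padding=1), nn.GELU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.GELU(),
            nn.Conv2d(64, in_ch, 3, padding=1),
        )

    def forward(self, noisy_latents, cond_map):
        return self.net(torch.cat([noisy_latents, cond_map], dim=1))

def training_step(unet, glyph_enc, latents, masked_latents, glyph, pos_mask):
    """Self-supervised step: a text region of the source image is masked out,
    and the model learns to redraw it given a rendering of that same text
    (the glyph) plus a position mask marking where to draw."""
    b = latents.size(0)
    t = torch.rand(b, 1, 1, 1)                 # toy continuous timestep
    noise = torch.randn_like(latents)
    noisy = (1 - t) * latents + t * noise      # simple linear noising (sketch)

    glyph_map = F.interpolate(glyph_enc(glyph), size=latents.shape[-2:])
    cond = torch.cat([masked_latents, pos_mask, glyph_map], dim=1)
    return F.mse_loss(unet(noisy, cond), noise)

# Smoke test with random tensors standing in for VAE latents of a source image.
unet, genc = TinyCondUNet(in_ch=4, cond_ch=4 + 1 + 27), GlyphEncoder(out_ch=27)
latents = torch.randn(2, 4, 32, 32)                        # full-image latents
masked = latents.clone(); masked[:, :, 8:16, 4:28] = 0.0   # text region erased
glyph = torch.randn(2, 1, 64, 256)                         # rendered target text
mask = torch.zeros(2, 1, 32, 32); mask[:, :, 8:16, 4:28] = 1.0
training_step(unet, genc, latents, masked, glyph, mask).backward()
```

In a real latent-diffusion setup the latents would come from a pretrained VAE and the noising would follow a proper DDPM/DDIM schedule; the sketch is only meant to show the conditioning pathway, in which masked image latents, a position mask, and glyph features are concatenated as input to the denoiser.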
Authors: Haoxing Chen, Zhuoer Xu, Zhangxuan Gu, Jun Lan, Xing Zheng, Yaohui Li, Changhua Meng, Huijia Zhu, Weiqiang Wang