Improving Diffusion-Based Image Synthesis with Context Prediction (2401.02015v1)
Abstract: Diffusion models are a new class of generative models and have dramatically advanced image generation with unprecedented quality and diversity. Existing diffusion models mainly try to reconstruct the input image from a corrupted one with a pixel-wise or feature-wise constraint along spatial axes. However, such point-based reconstruction may fail to make each predicted pixel/feature fully preserve its neighborhood context, impairing diffusion-based image synthesis. As a powerful source of automatic supervisory signal, context has been well studied for learning representations. Inspired by this, we propose ConPreDiff, the first approach to improve diffusion-based image synthesis with context prediction. We explicitly reinforce each point to predict its neighborhood context (i.e., multi-stride features/tokens/pixels) with a context decoder at the end of the diffusion denoising blocks during training, and remove the decoder for inference. In this way, each point can better reconstruct itself by preserving its semantic connections with its neighborhood context. This new paradigm of ConPreDiff can generalize to arbitrary discrete and continuous diffusion backbones without introducing extra parameters into the sampling procedure. Extensive experiments are conducted on unconditional image generation, text-to-image generation, and image inpainting tasks. Our ConPreDiff consistently outperforms previous methods and achieves new SOTA text-to-image generation results on MS-COCO, with a zero-shot FID score of 6.21.
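To make the core idea concrete, the sketch below (PyTorch) illustrates one plausible way to attach an auxiliary context decoder to the output features of a denoising backbone during training and discard it at inference. This is not the paper's implementation: the backbone interface, the 8-neighbor context definition, the 1x1-conv decoder, and the plain MSE reconstruction loss are all assumptions made for illustration (the paper operates on multi-stride features/tokens/pixels and may use a different context loss).

```python
# Minimal sketch, assuming a backbone that returns (spatial features, noise prediction).
# The context decoder is used only to compute an auxiliary training loss; sampling
# uses the backbone alone, so no extra parameters are introduced at inference time.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenoiserWithContextHead(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int = 256, stride: int = 1):
        super().__init__()
        self.backbone = backbone        # any diffusion denoising network (U-Net, transformer, ...)
        self.stride = stride            # how far the predicted neighborhood reaches
        # Hypothetical context decoder: predicts the 8 surrounding neighbor features per point.
        self.context_decoder = nn.Conv2d(feat_dim, 8 * feat_dim, kernel_size=1)

    def forward(self, x_t, t):
        feats, eps_pred = self.backbone(x_t, t)   # assumed interface, for illustration only
        return feats, eps_pred

    def context_loss(self, feats):
        """Auxiliary loss: each spatial point predicts its (stride-shifted) neighbors."""
        b, c, h, w = feats.shape
        pred = self.context_decoder(feats).view(b, 8, c, h, w)
        shifts = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
        loss = feats.new_zeros(())
        for i, (dy, dx) in enumerate(shifts):
            # torch.roll wraps at the border; a real implementation would mask or pad edges.
            target = torch.roll(feats, shifts=(dy * self.stride, dx * self.stride), dims=(2, 3))
            loss = loss + F.mse_loss(pred[:, i], target.detach())
        return loss / len(shifts)


# Training (conceptually): total_loss = denoising_loss + lambda_ctx * model.context_loss(feats)
# Inference: call only model.backbone; the context decoder is simply dropped.
```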