Improved Vector Quantized Diffusion Models (2205.16007v2)

Published 31 May 2022 in cs.CV

Abstract: Vector quantized diffusion (VQ-Diffusion) is a powerful generative model for text-to-image synthesis, but sometimes can still generate low-quality samples or weakly correlated images with text input. We find these issues are mainly due to the flawed sampling strategy. In this paper, we propose two important techniques to further improve the sample quality of VQ-Diffusion. 1) We explore classifier-free guidance sampling for discrete denoising diffusion model and propose a more general and effective implementation of classifier-free guidance. 2) We present a high-quality inference strategy to alleviate the joint distribution issue in VQ-Diffusion. Finally, we conduct experiments on various datasets to validate their effectiveness and show that the improved VQ-Diffusion suppresses the vanilla version by large margins. We achieve an 8.44 FID score on MSCOCO, surpassing VQ-Diffusion by 5.42 FID score. When trained on ImageNet, we dramatically improve the FID score from 11.89 to 4.83, demonstrating the superiority of our proposed techniques.

Summary

The paper introduces discrete classifier-free guidance that effectively balances prior and posterior probabilities for improved image quality.
The authors propose an improved inference strategy using fewer token changes and a purity prior to preserve inter-token dependencies.
Experiments show FID (lower is better) improving by 5.42 points to 8.44 on MSCOCO and from 11.89 to 4.83 on ImageNet, indicating stronger text-to-image and class-conditional synthesis.

Improved Vector Quantized Diffusion Models

The paper "Improved Vector Quantized Diffusion Models" proposes two enhancements to the Vector Quantized Diffusion (VQ-Diffusion) framework for text-to-image synthesis. Despite its capabilities, VQ-Diffusion sometimes generates low-quality samples or images that align poorly with the input text. The authors attribute these problems primarily to a flawed sampling strategy and propose methodological changes at inference time to improve VQ-Diffusion's performance.

Key Contributions

The contributions of the paper focus on two primary techniques:

Discrete Classifier-free Guidance:
- The researchers adapt classifier-free guidance sampling to the discrete setting of VQ-Diffusion. Because a discrete diffusion model predicts a probability distribution over tokens rather than a noise estimate, guidance is applied to the predicted probabilities directly, which lets sampling account for both the prior over images and the posterior of the text given the image. For the unconditional branch, a learnable vector is used as the condition instead of a fixed null input, which the authors describe as a more general and effective implementation that improves generated image quality.
High-quality Inference Strategy:
- The authors identify and address the joint distribution issue that arises when many tokens are sampled independently at each denoising step. Their strategy changes fewer tokens per step and uses a "purity prior" to pick the high-confidence positions to sample first, thus preserving inter-token dependencies and improving sample coherence.
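
The guidance idea above can be illustrated with a minimal sketch for a single token position. It assumes the sampling objective log p(x|y) + s log p(y|x), which by Bayes' rule equals log p(x) + (s + 1)(log p(x|y) - log p(x)) up to a constant. The function name `guided_distribution`, the toy probabilities, and the scale values are illustrative assumptions, not taken from the official implementation.

```python
import math

def guided_distribution(cond_probs, uncond_probs, scale):
    """Combine conditional and unconditional token probabilities.

    Works in log space: log p(x) + (scale + 1) * (log p(x|y) - log p(x)),
    then renormalizes so the result is a valid distribution.
    """
    eps = 1e-30  # guards against log(0); an assumption of this sketch
    logits = []
    for c, u in zip(cond_probs, uncond_probs):
        log_c = math.log(c + eps)
        log_u = math.log(u + eps)
        logits.append(log_u + (scale + 1.0) * (log_c - log_u))
    # Softmax with max-subtraction for numerical stability.
    m = max(logits)
    weights = [math.exp(v - m) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]

# Toy distributions over a codebook of three tokens (made-up numbers).
cond = [0.6, 0.3, 0.1]    # stands in for p(x | y), predicted with the text
uncond = [0.4, 0.4, 0.2]  # stands in for p(x), predicted with the learnable vector

print(guided_distribution(cond, uncond, 0.0))  # scale 0 recovers p(x | y)
print(guided_distribution(cond, uncond, 3.0))  # larger scale sharpens toward tokens favored by the text
```

With a scale of 0 the conditional distribution is returned unchanged; increasing the scale shifts probability mass toward tokens whose likelihood rises most when the condition is present.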

The paper reports significant improvements with these methods on CUB-200, MSCOCO, and ImageNet. Specifically, the improved VQ-Diffusion achieves an FID of 8.44 on MSCOCO, 5.42 points better than the original version (lower FID is better), and on ImageNet the FID improves from 11.89 to 4.83.

Practical and Theoretical Implications

The enhancements proposed offer substantial practical implications for generative modelling in image synthesis:

Sample Quality: By addressing the posterior constraint issue, the model produces images that are better aligned with the text input, which benefits applications requiring precise text-to-visual coherence.
Efficiency in Inference: Although the high-quality inference strategy increases computational demands, the resultant gains in image quality and fidelity can significantly impact fields like content creation, where quality is paramount.
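
The trade-off behind the high-quality inference strategy can be seen in a simplified sketch: decoding only a few positions per step, chosen by purity (taken here as the largest probability in a position's predicted distribution), requires more network calls than decoding everything at once. The predictor `toy_predict`, the `MASK` id, and the function `purity_guided_decode` are hypothetical stand-ins for illustration and do not reproduce the paper's exact algorithm.

```python
import random

MASK = -1  # hypothetical id marking a position that is still masked

def toy_predict(tokens):
    """Hypothetical stand-in for the denoising network (codebook of size 3).

    A position becomes confident once its left neighbour is decoded,
    mimicking dependencies between neighbouring tokens.
    """
    probs = []
    for i in range(len(tokens)):
        left = tokens[i - 1] if i > 0 else MASK
        if left == MASK:
            probs.append([0.4, 0.3, 0.3])
        else:
            p = [0.05, 0.05, 0.05]
            p[left] = 0.9  # strongly prefer the left neighbour's token
            probs.append(p)
    return probs

def purity_guided_decode(predict, length, tokens_per_step, rng):
    """Unmask `tokens_per_step` positions per step, highest purity first."""
    if tokens_per_step < 1:
        raise ValueError("tokens_per_step must be at least 1")
    tokens = [MASK] * length
    steps = 0
    while MASK in tokens:
        probs = predict(tokens)  # one distribution per position
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # Purity of a position: the maximum probability it assigns to any token.
        masked.sort(key=lambda i: max(probs[i]), reverse=True)
        for i in masked[:tokens_per_step]:
            ids = list(range(len(probs[i])))
            tokens[i] = rng.choices(ids, weights=probs[i], k=1)[0]
        steps += 1
    return tokens, steps

decoded, steps = purity_guided_decode(toy_predict, 8, 1, random.Random(0))
print(decoded, steps)
```

In this toy setup, decoding one token per step lets each new token be conditioned on its already-decoded neighbour, at the cost of one predictor call per token; a larger `tokens_per_step` reduces the number of calls but samples more positions independently.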

The authors suggest that these strategies could inform future developments in discrete generative models beyond the scope of image synthesis.

Future Developments

The findings open several avenues for further research:

Cross-domain Applications: Given the improvements, similar techniques might be adapted to other discrete generative tasks, such as text or video generation.
Parameter Optimization: Exploring different learnable parameters and fine-tuning strategies could yield further enhancements in classifier-free guidance.

Overall, the paper delivers a well-articulated advancement to the field of generative models, providing both a detailed methodology and a robust evaluative framework to substantiate the improvements in sample quality and text-image alignment within the Vector Quantized Diffusion framework.

GitHub

GitHub - microsoft/VQ-Diffusion: Official implementation of VQ-Diffusion (852 stars)