
HQ-VAE: Hierarchical Discrete Representation Learning with Variational Bayes (2401.00365v2)

Published 31 Dec 2023 in cs.LG, cs.AI, and cs.CV

Abstract: Vector quantization (VQ) is a technique to deterministically learn features with discrete codebook representations. It is commonly performed with a variational autoencoding model, VQ-VAE, which can be further extended to hierarchical structures for making high-fidelity reconstructions. However, such hierarchical extensions of VQ-VAE often suffer from the codebook/layer collapse issue, where the codebook is not efficiently used to express the data, and hence degrades reconstruction accuracy. To mitigate this problem, we propose a novel unified framework to stochastically learn hierarchical discrete representation on the basis of the variational Bayes framework, called hierarchically quantized variational autoencoder (HQ-VAE). HQ-VAE naturally generalizes the hierarchical variants of VQ-VAE, such as VQ-VAE-2 and residual-quantized VAE (RQ-VAE), and provides them with a Bayesian training scheme. Our comprehensive experiments on image datasets show that HQ-VAE enhances codebook usage and improves reconstruction performance. We also validated HQ-VAE in terms of its applicability to a different modality with an audio dataset.


Summary

  • The paper introduces a novel hierarchical discrete representation framework using a Bayesian variational approach that mitigates codebook collapse.
  • It employs bottom-up and top-down stochastic quantization to effectively capture both local and global features across diverse data modalities.
  • Experimental results on image and audio datasets demonstrate improved reconstruction accuracy and competitive generative performance compared to prior VQ-based models.

HQ-VAE: Hierarchical Discrete Representation Learning with Variational Bayes

Introduction

The HQ-VAE paper presents a novel framework for hierarchical discrete representation learning that addresses a key limitation of vector-quantized variational autoencoders (VQ-VAEs): codebook collapse, in which the codebook is used inefficiently and reconstruction accuracy degrades. HQ-VAE integrates a variational Bayes approach to enhance codebook usage and improve performance across modalities, including both image and audio datasets.

Methodology

HQ-VAE extends the VQ-VAE framework by employing a hierarchical structure. This structure consists of bottom-up and top-down paths, facilitating the capture of both local and global information. The key innovation is the stochastic quantization within the variational Bayes framework, which mitigates the codebook collapse problem.

Figure 1: HQ-VAE consists of bottom-up and top-down paths. Red arrows represent the approximated posterior.

In the HQ-VAE framework, hierarchical discrete representations are learned by introducing multiple latent variable groups, each associated with a trainable codebook. The following key components are integral to the HQ-VAE structure:

  • Stochastic Quantization: A dequantization step followed by a stochastic quantization operation, implemented by sampling from a Gumbel-softmax distribution, which provides a continuous, differentiable approximation of the categorical code assignment.
  • Variational Framework: Training maximizes the evidence lower bound (ELBO), which combines a reconstruction term with regularization terms derived within the variational Bayes framework.
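The stochastic quantization step can be sketched in NumPy as follows. This is a minimal, illustrative version of Gumbel-softmax assignment to a codebook, not the paper's exact parameterization: the use of negative squared distances as logits, the temperature value, and the function names are assumptions for the sketch.

```python
import numpy as np

def gumbel_softmax_quantize(z, codebook, tau=1.0, rng=None):
    """Stochastically assign each latent vector to a codebook entry.

    Negative squared distances to the codes serve as categorical logits;
    adding Gumbel noise and applying a temperature `tau` yields a relaxed,
    differentiable sample over codes (illustrative, not the paper's exact
    formulation).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Pairwise squared distances: (num_vectors, codebook_size).
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    logits = -d2
    # Gumbel(0, 1) noise via the inverse-CDF trick.
    g = -np.log(-np.log(rng.uniform(size=logits.shape) + 1e-10) + 1e-10)
    y = (logits + g) / tau
    y = np.exp(y - y.max(axis=1, keepdims=True))
    probs = y / y.sum(axis=1, keepdims=True)  # relaxed one-hot weights
    hard = probs.argmax(axis=1)               # discrete code indices
    z_q = codebook[hard]                      # quantized latent vectors
    return z_q, hard, probs

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))  # 16 codes of dimension 4
z = rng.normal(size=(8, 4))          # 8 latent vectors to quantize
z_q, idx, probs = gumbel_softmax_quantize(z, codebook, tau=0.5, rng=rng)
```

As `tau` anneals toward zero, the relaxed weights `probs` concentrate on the nearest code, recovering deterministic VQ-style assignment.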

Experiments and Results

Image Reconstruction:

Experiments on datasets like CIFAR10, CelebA-HQ, and ImageNet demonstrate HQ-VAE's ability to outperform VQ-VAE-2 in terms of reconstruction accuracy and codebook utilization.

Figure 2: Impact of codebook capacity on reconstruction of images in (a) CIFAR10 and (b) CelebA-HQ.
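Codebook utilization of the kind compared here is commonly measured as the perplexity of the empirical code-usage distribution. The sketch below shows that standard metric (the paper's exact measurement protocol is not specified in this summary, so this is an assumption):

```python
import numpy as np

def codebook_perplexity(indices, codebook_size):
    """Perplexity exp(H) of the empirical code-usage distribution.

    Equals `codebook_size` when all codes are used uniformly, and
    collapses toward 1 when only a few codes are used (codebook collapse).
    """
    counts = np.bincount(indices, minlength=codebook_size).astype(float)
    p = counts / counts.sum()
    nz = p[p > 0]  # ignore unused codes; 0 * log(0) is taken as 0
    return float(np.exp(-(nz * np.log(nz)).sum()))

uniform = np.tile(np.arange(8), 4)     # every code used equally
collapsed = np.zeros(32, dtype=int)    # a single code used for everything
full_usage = codebook_perplexity(uniform, 8)     # -> 8.0
collapse_usage = codebook_perplexity(collapsed, 8)  # -> 1.0
```

A model that mitigates codebook collapse should show perplexity close to the codebook size on held-out data.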

Audio Reconstruction:

The paper also validates HQ-VAE on UrbanSound8K, where it achieves better RMSE than a conventional RQ-VAE, demonstrating its applicability to the audio modality.

Hierarchical Model Instances

Two main instances of HQ-VAE are explored:

  • SQ-VAE-2: Extends VQ-VAE-2 with injected top-down layers to better capture multi-resolution features via stochastic quantization.
  • RSQ-VAE: Refines hierarchical representations through residual top-down layers, offering superior reconstruction across varying compression rates without heuristic strategies like EMA updates.

Figure 3: Impact of codebook capacity on reconstructions of images from (a) CIFAR10 and (b) CelebA-HQ.
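The residual layering that RSQ-VAE builds on can be illustrated with plain (deterministic, nearest-neighbor) residual quantization: each layer quantizes the residual left by the layers above, so reconstructions refine coarse-to-fine. This sketch shows the generic RQ mechanism only; RSQ-VAE replaces the hard nearest-neighbor assignment with stochastic quantization under the variational Bayes framework, and the toy codebooks below are assumptions for illustration.

```python
import numpy as np

def residual_quantize(z, codebooks):
    """Multi-layer residual quantization (generic RQ-VAE-style sketch).

    Each layer snaps the current residual to its nearest code, accumulates
    it into the reconstruction, and passes the remainder to the next layer.
    """
    residual = z.copy()
    recon = np.zeros_like(z)
    indices = []
    for cb in codebooks:
        d2 = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = d2.argmin(axis=1)  # nearest code per vector
        q = cb[idx]
        recon += q               # coarse-to-fine accumulation
        residual -= q            # remainder handled by deeper layers
        indices.append(idx)
    return recon, indices

rng = np.random.default_rng(1)
z = rng.normal(size=(4, 3))
# Toy codebooks; each includes a zero code so a layer can "pass" a residual,
# which guarantees the reconstruction error never increases with depth.
codebooks = [np.vstack([np.zeros((1, 3)), rng.normal(size=(7, 3))])
             for _ in range(3)]
recon, idxs = residual_quantize(z, codebooks)
```

Because every layer's chosen code is at least as close to the residual as the zero code, the final reconstruction error is bounded by the error of using no codes at all, and in practice it shrinks layer by layer.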

Application in Generative Modeling

HQ-VAE models are effectively applied to generative tasks:

  • FFHQ Dataset: RSQ-VAE demonstrates competitive FID scores using RQ-Transformers, yielding high-quality image generations.
  • ImageNet: SQ-VAE-2 achieves competitive FID and inception scores, validating the generative capability of HQ-VAE in complex data scenarios.

Conclusion

HQ-VAE introduces a robust framework for variational hierarchical discrete representation learning, overcoming the limitations of prior VQ models, particularly in enhancing codebook efficiency and reconstruction quality. The Bayesian approach simplifies training by reducing dependence on ad-hoc techniques and hyperparameters, offering a scalable solution across varied data modalities.

Future research paths may explore further semantic disentanglement within the hierarchical discrete representations and broader applications in high-fidelity generation tasks.
