
HQ-VAE: Hierarchical Discrete Representation Learning with Variational Bayes (2401.00365v2)

Published 31 Dec 2023 in cs.LG, cs.AI, and cs.CV

Abstract: Vector quantization (VQ) is a technique to deterministically learn features with discrete codebook representations. It is commonly performed with a variational autoencoding model, VQ-VAE, which can be further extended to hierarchical structures for making high-fidelity reconstructions. However, such hierarchical extensions of VQ-VAE often suffer from the codebook/layer collapse issue, where the codebook is not efficiently used to express the data, and hence degrades reconstruction accuracy. To mitigate this problem, we propose a novel unified framework to stochastically learn hierarchical discrete representation on the basis of the variational Bayes framework, called hierarchically quantized variational autoencoder (HQ-VAE). HQ-VAE naturally generalizes the hierarchical variants of VQ-VAE, such as VQ-VAE-2 and residual-quantized VAE (RQ-VAE), and provides them with a Bayesian training scheme. Our comprehensive experiments on image datasets show that HQ-VAE enhances codebook usage and improves reconstruction performance. We also validated HQ-VAE in terms of its applicability to a different modality with an audio dataset.


Summary

  • The paper introduces a novel hierarchical discrete representation framework using a Bayesian variational approach that mitigates codebook collapse.
  • It employs bottom-up and top-down stochastic quantization to effectively capture both local and global features across diverse data modalities.
  • Experimental results on image and audio datasets demonstrate improved reconstruction accuracy and competitive generative performance compared to prior VQ-based models.

HQ-VAE: Hierarchical Discrete Representation Learning with Variational Bayes

Introduction

The HQ-VAE paper presents a novel framework for hierarchical discrete representation learning that addresses a key limitation of vector-quantized variational autoencoders (VQ-VAEs): codebook collapse, in which the codebook is used inefficiently and reconstruction accuracy degrades. HQ-VAE integrates a variational Bayes approach to enhance codebook usage and improve performance across modalities, including both image and audio datasets.

Methodology

HQ-VAE extends the VQ-VAE framework by employing a hierarchical structure. This structure consists of bottom-up and top-down paths, facilitating the capture of both local and global information. The key innovation is the stochastic quantization within the variational Bayes framework, which mitigates the codebook collapse problem.

Figure 1: HQ-VAE consists of bottom-up and top-down paths. Red arrows represent the approximated posterior.

In the HQ-VAE framework, hierarchical discrete representations are learned by introducing multiple latent variable groups, each associated with a trainable codebook. The following key components are integral to the HQ-VAE structure:

  • Stochastic Quantization: A dequantization step followed by a stochastic quantization operation, implemented by sampling from a Gumbel-softmax distribution, which provides a continuous, differentiable approximation of the categorical code assignment.
  • Variational Framework: Training maximizes the evidence lower bound (ELBO), which combines a reconstruction term with regularization terms derived within the variational Bayes framework.
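The stochastic quantization step can be sketched in NumPy as follows. This is a minimal, illustrative version of Gumbel-softmax assignment to a codebook, not the paper's exact parameterization: the use of negative squared distances as logits, the temperature value, and the function names are assumptions for the sketch.

```python
import numpy as np

def gumbel_softmax_quantize(z, codebook, tau=1.0, rng=None):
    """Stochastically assign each latent vector to a codebook entry.

    Negative squared distances to the codes serve as categorical logits;
    adding Gumbel noise and applying a temperature `tau` yields a relaxed,
    differentiable sample over codes (illustrative, not the paper's exact
    formulation).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Pairwise squared distances: (num_vectors, codebook_size).
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    logits = -d2
    # Gumbel(0, 1) noise via the inverse-CDF trick.
    g = -np.log(-np.log(rng.uniform(size=logits.shape) + 1e-10) + 1e-10)
    y = (logits + g) / tau
    y = np.exp(y - y.max(axis=1, keepdims=True))
    probs = y / y.sum(axis=1, keepdims=True)  # relaxed one-hot weights
    hard = probs.argmax(axis=1)               # discrete code indices
    z_q = codebook[hard]                      # quantized latent vectors
    return z_q, hard, probs

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))  # 16 codes of dimension 4
z = rng.normal(size=(8, 4))          # 8 latent vectors to quantize
z_q, idx, probs = gumbel_softmax_quantize(z, codebook, tau=0.5, rng=rng)
```

As `tau` anneals toward zero, the relaxed weights `probs` concentrate on the nearest code, recovering deterministic VQ-style assignment.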

Experiments and Results

Image Reconstruction:

Experiments on datasets like CIFAR10, CelebA-HQ, and ImageNet demonstrate HQ-VAE's ability to outperform VQ-VAE-2 in terms of reconstruction accuracy and codebook utilization.

Figure 2: Impact of codebook capacity on reconstruction of images in (a) CIFAR10 and (b) CelebA-HQ.
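Codebook utilization of the kind compared here is commonly measured as the perplexity of the empirical code-usage distribution. The sketch below shows that standard metric (the paper's exact measurement protocol is not specified in this summary, so this is an assumption):

```python
import numpy as np

def codebook_perplexity(indices, codebook_size):
    """Perplexity exp(H) of the empirical code-usage distribution.

    Equals `codebook_size` when all codes are used uniformly, and
    collapses toward 1 when only a few codes are used (codebook collapse).
    """
    counts = np.bincount(indices, minlength=codebook_size).astype(float)
    p = counts / counts.sum()
    nz = p[p > 0]  # ignore unused codes; 0 * log(0) is taken as 0
    return float(np.exp(-(nz * np.log(nz)).sum()))

uniform = np.tile(np.arange(8), 4)     # every code used equally
collapsed = np.zeros(32, dtype=int)    # a single code used for everything
full_usage = codebook_perplexity(uniform, 8)     # -> 8.0
collapse_usage = codebook_perplexity(collapsed, 8)  # -> 1.0
```

A model that mitigates codebook collapse should show perplexity close to the codebook size on held-out data.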

Audio Reconstruction:

The paper also validates HQ-VAE on UrbanSound8K, where it achieves better RMSE than a conventional RQ-VAE, demonstrating its applicability to the audio modality.

Hierarchical Model Instances

Two main instances of HQ-VAE are explored:

  • SQ-VAE-2: Extends VQ-VAE-2 with injected top-down layers to better capture multi-resolution features via stochastic quantization.
  • RSQ-VAE: Refines hierarchical representations through residual top-down layers, offering superior reconstruction across varying compression rates without heuristic strategies like EMA updates.

Figure 3: Impact of codebook capacity on reconstructions of images from (a) CIFAR10 and (b) CelebA-HQ.
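The residual layering that RSQ-VAE builds on can be illustrated with plain (deterministic, nearest-neighbor) residual quantization: each layer quantizes the residual left by the layers above, so reconstructions refine coarse-to-fine. This sketch shows the generic RQ mechanism only; RSQ-VAE replaces the hard nearest-neighbor assignment with stochastic quantization under the variational Bayes framework, and the toy codebooks below are assumptions for illustration.

```python
import numpy as np

def residual_quantize(z, codebooks):
    """Multi-layer residual quantization (generic RQ-VAE-style sketch).

    Each layer snaps the current residual to its nearest code, accumulates
    it into the reconstruction, and passes the remainder to the next layer.
    """
    residual = z.copy()
    recon = np.zeros_like(z)
    indices = []
    for cb in codebooks:
        d2 = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = d2.argmin(axis=1)  # nearest code per vector
        q = cb[idx]
        recon += q               # coarse-to-fine accumulation
        residual -= q            # remainder handled by deeper layers
        indices.append(idx)
    return recon, indices

rng = np.random.default_rng(1)
z = rng.normal(size=(4, 3))
# Toy codebooks; each includes a zero code so a layer can "pass" a residual,
# which guarantees the reconstruction error never increases with depth.
codebooks = [np.vstack([np.zeros((1, 3)), rng.normal(size=(7, 3))])
             for _ in range(3)]
recon, idxs = residual_quantize(z, codebooks)
```

Because every layer's chosen code is at least as close to the residual as the zero code, the final reconstruction error is bounded by the error of using no codes at all, and in practice it shrinks layer by layer.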

Application in Generative Modeling

HQ-VAE models are effectively applied to generative tasks:

  • FFHQ Dataset: RSQ-VAE demonstrates competitive FID scores using RQ-Transformers, yielding high-quality image generations.
  • ImageNet: SQ-VAE-2 achieves competitive FID and inception scores, validating the generative capability of HQ-VAE in complex data scenarios.

Conclusion

HQ-VAE introduces a robust framework for variational hierarchical discrete representation learning, overcoming the limitations of prior VQ models, particularly in enhancing codebook efficiency and reconstruction quality. The Bayesian approach simplifies training by reducing dependence on ad-hoc techniques and hyperparameters, offering a scalable solution across varied data modalities.

Future research paths may explore further semantic disentanglement within the hierarchical discrete representations and broader applications in high-fidelity generation tasks.
