- The paper introduces a transformer variant that replaces fixed-vocabulary token lookups with real-valued vector sequences, using a linear input projection and a Gaussian mixture model output head.
- It demonstrates competitive performance against models like VQ-GAN and MaskGIT on tasks including image generation, panoptic segmentation, and depth estimation.
- By integrating continuous latent space learning with inference techniques such as temperature sampling, the approach paves the way for more flexible and robust generative applications.
Generative Infinite-Vocabulary Transformers (GIVT) represent a significant step forward for generative AI models. Traditional generative transformers are limited to producing sequences of tokens from a fixed, finite vocabulary, which suits natural language well but is less than ideal for inherently continuous data such as images. GIVT addresses this limitation by generating sequences of real-valued vectors, effectively lifting the restriction of a finite vocabulary.
This is achieved through two key modifications to the decoder-only transformer architecture. First, GIVT replaces the finite-vocabulary embedding lookup table at the input stage with a linear projection of the input vectors. Second, at the output stage, it replaces the logits prediction (which parameterizes a categorical distribution) with the parameters of a multivariate Gaussian mixture model. Instead of choosing from a fixed set of tokens, GIVT therefore operates over a continuous, effectively infinite space of possible outputs.
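To make these two modifications concrete, here is a minimal PyTorch sketch, not the authors' implementation. Names such as `GIVTSketch`, `GIVTHead`, `latent_dim`, and `num_mixtures` are illustrative assumptions, and the output distribution is simplified to an independent Gaussian mixture per latent dimension:

```python
import torch
import torch.nn as nn


class GIVTHead(nn.Module):
    """Maps transformer hidden states to Gaussian-mixture parameters
    (mixture logits, means, log-scales) instead of categorical logits."""

    def __init__(self, hidden_dim: int, latent_dim: int, num_mixtures: int):
        super().__init__()
        self.latent_dim = latent_dim
        self.num_mixtures = num_mixtures
        # For each latent dimension: mixture weights, means, and log-scales.
        self.proj = nn.Linear(hidden_dim, latent_dim * num_mixtures * 3)

    def forward(self, h: torch.Tensor):
        # h: (batch, seq_len, hidden_dim)
        p = self.proj(h).view(*h.shape[:-1], self.latent_dim, self.num_mixtures, 3)
        logits, means, log_scales = p.unbind(dim=-1)
        return logits, means, log_scales


class GIVTSketch(nn.Module):
    def __init__(self, latent_dim=16, hidden_dim=256, num_mixtures=8, num_layers=4):
        super().__init__()
        # (1) Input: a linear projection of real-valued vectors replaces
        #     the usual finite-vocabulary embedding lookup table.
        self.input_proj = nn.Linear(latent_dim, hidden_dim)
        layer = nn.TransformerEncoderLayer(hidden_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers)
        # (2) Output: Gaussian-mixture parameters replace categorical logits.
        self.head = GIVTHead(hidden_dim, latent_dim, num_mixtures)

    def forward(self, x: torch.Tensor):
        # x: (batch, seq_len, latent_dim), a sequence of real-valued vectors.
        causal_mask = nn.Transformer.generate_square_subsequent_mask(x.shape[1])
        h = self.backbone(self.input_proj(x), mask=causal_mask)
        return self.head(h)
```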
GIVT demonstrates its versatility across several applications, including class-conditional image generation and dense prediction tasks such as panoptic segmentation and depth estimation. A VAE (variational autoencoder) first learns a continuous latent space for the data, and GIVT then models sequences in that space autoregressively. Notably, it performs competitively with, or even outperforms, previous models such as VQ-GAN and MaskGIT in certain scenarios. This is remarkable given that it does so without the quantization-specific training techniques common in the VQ-VAE literature.
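The following sketch illustrates how such a two-stage training step might look, assuming a pretrained `vae` object with an `encode` method that returns a continuous latent grid, and the `GIVTSketch` model from the previous snippet. All names and the per-dimension mixture likelihood are illustrative assumptions, not the paper's actual API:

```python
import math
import torch
import torch.nn.functional as F


def gmm_nll(logits, means, log_scales, target):
    """Negative log-likelihood of `target` under an independent
    Gaussian mixture per latent dimension."""
    t = target.unsqueeze(-1)                       # (batch, seq, latent_dim, 1)
    log_weights = F.log_softmax(logits, dim=-1)
    comp_log_prob = (
        -0.5 * ((t - means) / log_scales.exp()) ** 2
        - log_scales
        - 0.5 * math.log(2 * math.pi)
    )
    # Log-sum-exp over mixture components, sum over latent dims, mean over the rest.
    return -torch.logsumexp(log_weights + comp_log_prob, dim=-1).sum(-1).mean()


def train_step(vae, givt, optimizer, images):
    with torch.no_grad():
        z = vae.encode(images)                     # (batch, h, w, latent_dim)
        z = z.flatten(1, 2)                        # flatten the grid to a sequence
    # Teacher forcing: predict each latent vector from the preceding ones.
    logits, means, log_scales = givt(z[:, :-1])
    loss = gmm_nll(logits, means, log_scales, z[:, 1:])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```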
The paper also adapts several inference techniques, including temperature sampling and classifier-free guidance, to continuous distributions, and shows that careful tuning of these techniques significantly affects sample quality, as sketched below.
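Below is a hedged sketch of continuous-distribution analogues of these two techniques. The paper develops its own variants; here, as a simplifying assumption, temperature scales the mixture components' standard deviations, and classifier-free guidance linearly combines the conditional and unconditional means rather than reweighting densities. Function names are hypothetical:

```python
import torch
import torch.nn.functional as F


def sample_gmm(logits, means, log_scales, temperature=1.0):
    """Draw one real-valued vector per position from the predicted mixture,
    applying temperature to the component standard deviations."""
    weights = F.softmax(logits, dim=-1)                         # (..., latent_dim, K)
    comp = torch.distributions.Categorical(weights).sample()    # (..., latent_dim)
    mean = torch.gather(means, -1, comp.unsqueeze(-1)).squeeze(-1)
    std = torch.gather(log_scales, -1, comp.unsqueeze(-1)).squeeze(-1).exp()
    return mean + temperature * std * torch.randn_like(mean)


def guided_means(means_cond, means_uncond, guidance_scale=1.5):
    """Simplified classifier-free guidance for continuous outputs:
    push the conditional means away from the unconditional ones."""
    return means_uncond + guidance_scale * (means_cond - means_uncond)
```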
In summary, by enabling transformer models to operate on real-valued vector sequences, GIVT removes the constraints of a fixed vocabulary and the training complications that come with it, providing a new level of flexibility for generative models. This is a foundational change that paves the way for further innovations across diverse areas of generative AI.