- The paper introduces a transformer variant that replaces fixed-vocabulary token lookups with real-valued vector sequences, using a linear input projection and a Gaussian mixture model output head.
- It demonstrates competitive performance against models like VQ-GAN and MaskGIT on tasks including image generation, panoptic segmentation, and depth estimation.
- By integrating continuous latent space learning with inference techniques such as temperature sampling, the approach paves the way for more flexible and robust generative applications.
Generative Infinite-Vocabulary Transformers (GIVT) represent a significant step forward for generative AI models. Traditional generative transformers are limited to producing sequences of tokens from a fixed, finite vocabulary, which suits natural language well but is less than ideal for inherently continuous data such as images. GIVT addresses this limitation by generating sequences of real-valued vectors, effectively lifting the restriction of a finite vocabulary.
This is achieved through two key modifications to the decoder-only transformer architecture. First, GIVT replaces the finite-vocabulary embedding lookup table at the input stage with a linear projection of the input vectors. Second, at the output stage, it replaces the logits prediction (which parameterizes a categorical distribution) with the parameters of a multivariate Gaussian mixture model. Instead of choosing from a fixed set of tokens, GIVT therefore operates over a continuous, effectively infinite space of possible outputs.
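To make these two modifications concrete, here is a minimal PyTorch sketch, not the authors' implementation. Names such as `GIVTSketch`, `GIVTHead`, `latent_dim`, and `num_mixtures` are illustrative assumptions, and the output distribution is simplified to an independent Gaussian mixture per latent dimension:

```python
import torch
import torch.nn as nn


class GIVTHead(nn.Module):
    """Maps transformer hidden states to Gaussian-mixture parameters
    (mixture logits, means, log-scales) instead of categorical logits."""

    def __init__(self, hidden_dim: int, latent_dim: int, num_mixtures: int):
        super().__init__()
        self.latent_dim = latent_dim
        self.num_mixtures = num_mixtures
        # For each latent dimension: mixture weights, means, and log-scales.
        self.proj = nn.Linear(hidden_dim, latent_dim * num_mixtures * 3)

    def forward(self, h: torch.Tensor):
        # h: (batch, seq_len, hidden_dim)
        p = self.proj(h).view(*h.shape[:-1], self.latent_dim, self.num_mixtures, 3)
        logits, means, log_scales = p.unbind(dim=-1)
        return logits, means, log_scales


class GIVTSketch(nn.Module):
    def __init__(self, latent_dim=16, hidden_dim=256, num_mixtures=8, num_layers=4):
        super().__init__()
        # (1) Input: a linear projection of real-valued vectors replaces
        #     the usual finite-vocabulary embedding lookup table.
        self.input_proj = nn.Linear(latent_dim, hidden_dim)
        layer = nn.TransformerEncoderLayer(hidden_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers)
        # (2) Output: Gaussian-mixture parameters replace categorical logits.
        self.head = GIVTHead(hidden_dim, latent_dim, num_mixtures)

    def forward(self, x: torch.Tensor):
        # x: (batch, seq_len, latent_dim), a sequence of real-valued vectors.
        causal_mask = nn.Transformer.generate_square_subsequent_mask(x.shape[1])
        h = self.backbone(self.input_proj(x), mask=causal_mask)
        return self.head(h)
```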
GIVT demonstrates its versatility across several applications, including class-conditional image generation and dense prediction tasks such as panoptic segmentation and depth estimation. A VAE (variational autoencoder) first learns a continuous latent space for the data, and GIVT then models sequences in that space autoregressively. Notably, it performs competitively with, or even outperforms, previous models such as VQ-GAN and MaskGIT in certain scenarios. This is remarkable given that it does so without the quantization-specific training techniques common in the VQ-VAE literature.
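The following sketch illustrates how such a two-stage training step might look, assuming a pretrained `vae` object with an `encode` method that returns a continuous latent grid, and the `GIVTSketch` model from the previous snippet. All names and the per-dimension mixture likelihood are illustrative assumptions, not the paper's actual API:

```python
import math
import torch
import torch.nn.functional as F


def gmm_nll(logits, means, log_scales, target):
    """Negative log-likelihood of `target` under an independent
    Gaussian mixture per latent dimension."""
    t = target.unsqueeze(-1)                       # (batch, seq, latent_dim, 1)
    log_weights = F.log_softmax(logits, dim=-1)
    comp_log_prob = (
        -0.5 * ((t - means) / log_scales.exp()) ** 2
        - log_scales
        - 0.5 * math.log(2 * math.pi)
    )
    # Log-sum-exp over mixture components, sum over latent dims, mean over the rest.
    return -torch.logsumexp(log_weights + comp_log_prob, dim=-1).sum(-1).mean()


def train_step(vae, givt, optimizer, images):
    with torch.no_grad():
        z = vae.encode(images)                     # (batch, h, w, latent_dim)
        z = z.flatten(1, 2)                        # flatten the grid to a sequence
    # Teacher forcing: predict each latent vector from the preceding ones.
    logits, means, log_scales = givt(z[:, :-1])
    loss = gmm_nll(logits, means, log_scales, z[:, 1:])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```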
The paper also adapts several inference techniques, including temperature sampling and classifier-free guidance, to continuous distributions, and shows that careful tuning of these techniques significantly affects sample quality, as sketched below.
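Below is a hedged sketch of continuous-distribution analogues of these two techniques. The paper develops its own variants; here, as a simplifying assumption, temperature scales the mixture components' standard deviations, and classifier-free guidance linearly combines the conditional and unconditional means rather than reweighting densities. Function names are hypothetical:

```python
import torch
import torch.nn.functional as F


def sample_gmm(logits, means, log_scales, temperature=1.0):
    """Draw one real-valued vector per position from the predicted mixture,
    applying temperature to the component standard deviations."""
    weights = F.softmax(logits, dim=-1)                         # (..., latent_dim, K)
    comp = torch.distributions.Categorical(weights).sample()    # (..., latent_dim)
    mean = torch.gather(means, -1, comp.unsqueeze(-1)).squeeze(-1)
    std = torch.gather(log_scales, -1, comp.unsqueeze(-1)).squeeze(-1).exp()
    return mean + temperature * std * torch.randn_like(mean)


def guided_means(means_cond, means_uncond, guidance_scale=1.5):
    """Simplified classifier-free guidance for continuous outputs:
    push the conditional means away from the unconditional ones."""
    return means_uncond + guidance_scale * (means_cond - means_uncond)
```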
In summary, by enabling transformer models to operate on real-valued vector sequences, GIVT removes the constraints of a fixed vocabulary and the training complications that come with it, providing a new level of flexibility for generative models. This is a foundational change that paves the way for further innovations across diverse areas of generative AI.