GIVT: Generative Infinite-Vocabulary Transformers

(2312.02116)
Published Dec 4, 2023 in cs.CV

Abstract

We introduce generative infinite-vocabulary transformers (GIVT) which generate vector sequences with real-valued entries, instead of discrete tokens from a finite vocabulary. To this end, we propose two surprisingly simple modifications to decoder-only transformers: 1) at the input, we replace the finite-vocabulary lookup table with a linear projection of the input vectors; and 2) at the output, we replace the logits prediction (usually mapped to a categorical distribution) with the parameters of a multivariate Gaussian mixture model. Inspired by the image-generation paradigm of VQ-GAN and MaskGIT, where transformers are used to model the discrete latent sequences of a VQ-VAE, we use GIVT to model the unquantized real-valued latent sequences of a $\beta$-VAE. In class-conditional image generation, GIVT outperforms VQ-GAN (and improved variants thereof) as well as MaskGIT, and achieves performance competitive with recent latent diffusion models. Finally, we obtain strong results outside of image generation when applying GIVT to panoptic segmentation and depth estimation with a VAE variant of the UViM framework.

Figure: Variational Autoencoder samples demonstrating the model's ability to generate diverse and accurate representations.

Overview

  • GIVT generates sequences of real-valued vectors instead of discrete tokens from a finite vocabulary, extending transformers beyond discrete data.

  • The model makes two changes to the decoder-only transformer architecture: the finite-vocabulary lookup table at the input is replaced by a linear projection of the input vectors, and the output logits are replaced by the parameters of a multivariate Gaussian mixture model.

  • GIVT performs well across generative tasks such as class-conditional image generation, panoptic segmentation, and depth estimation, outperforming previous models in some cases.

  • It models the continuous latent space of a VAE, achieving competitive performance without the complex quantization-related training methods of discrete-token approaches.

  • Inference techniques such as temperature sampling and classifier-free guidance are adapted to continuous distributions, giving fine-grained control over the quality of generated samples.

Generative Infinite-Vocabulary Transformers (GIVT) present a significant breakthrough in the field of generative AI models. Traditional generative transformer models have been limited to producing sequences of tokens from a fixed, finite vocabulary, which aligns well with natural language processing but is less than ideal when dealing with non-discrete data like images. GIVT addresses this limitation by generating sequences with real-valued vectors, effectively lifting the restriction of a finite vocabulary.

This innovation is achieved through two key modifications to the decoder-only transformer architecture. First, GIVT replaces the finite-vocabulary lookup table, typically used at the input stage, with a linear projection of the input vectors. Second, at the output stage, it substitutes the logits prediction (which maps to a categorical distribution) with the parameters of a multivariate Gaussian mixture model. This means that instead of choosing from a fixed set of tokens, GIVT operates over an effectively infinite set of possible outputs.
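To make these two modifications concrete, the following is a minimal PyTorch-style sketch of the idea, not the authors' implementation: the class name GIVTDecoder, the layer sizes, and the use of nn.TransformerEncoder with a causal mask are our own assumptions, and positional embeddings are omitted for brevity.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical, Independent, MixtureSameFamily, Normal

class GIVTDecoder(nn.Module):
    """Illustrative decoder-only transformer with the two GIVT modifications."""

    def __init__(self, latent_dim=32, d_model=512, num_layers=8, num_heads=8, num_mixtures=16):
        super().__init__()
        self.latent_dim, self.num_mixtures = latent_dim, num_mixtures
        # 1) Input: a linear projection of real-valued latent vectors replaces
        #    the finite-vocabulary embedding lookup table.
        self.input_proj = nn.Linear(latent_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers)
        # 2) Output: per-step mixture weights, means, and per-dimension scales of a
        #    k-component Gaussian mixture replace the vocabulary logits.
        self.head = nn.Linear(d_model, num_mixtures * (1 + 2 * latent_dim))

    def forward(self, x):
        # x: (batch, seq_len, latent_dim) real-valued latent vectors.
        mask = nn.Transformer.generate_square_subsequent_mask(x.shape[1]).to(x.device)
        h = self.backbone(self.input_proj(x), mask=mask)
        params = self.head(h)
        k, d = self.num_mixtures, self.latent_dim
        logits, means, log_scales = params.split([k, k * d, k * d], dim=-1)
        means = means.reshape(*means.shape[:-1], k, d)
        scales = log_scales.reshape(*means.shape).exp()
        components = Independent(Normal(means, scales), 1)
        # One continuous distribution over the next latent vector at each position.
        return MixtureSameFamily(Categorical(logits=logits), components)
```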

GIVT demonstrates its versatility and robustness across several applications, such as class-conditional image generation and dense prediction tasks including panoptic segmentation and depth estimation. A VAE (Variational AutoEncoder) is first trained to learn a continuous latent space, and GIVT then models the resulting unquantized latent sequences with the generative transformer approach. Notably, it performs competitively with, or even outperforms, previous models like VQ-GAN and MaskGIT in certain scenarios, and does so without the complex training techniques often associated with the VQ-VAE literature.
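A rough sketch of this training pipeline, continuing the illustration above: the helper name vae_encoder and the zero start vector are hypothetical stand-ins, and the paper's actual setup uses a beta-VAE encoder and its own conditioning scheme.

```python
import torch

def givt_training_step(givt, vae_encoder, images):
    """One teacher-forced training step on unquantized VAE latents (illustrative)."""
    with torch.no_grad():
        latents = vae_encoder(images)            # (batch, seq_len, latent_dim), no quantization
    # Shift inputs by one position: latent t is predicted from latents < t.
    start = torch.zeros_like(latents[:, :1])     # stand-in for a learned start embedding
    inputs = torch.cat([start, latents[:, :-1]], dim=1)
    dist = givt(inputs)                          # Gaussian mixture per sequence position
    # Negative log-likelihood of the real-valued targets replaces cross-entropy.
    return -dist.log_prob(latents).mean()
```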

The paper also adapts several inference techniques, including temperature sampling and classifier-free guidance, to work with continuous distributions. It shows that careful tuning of these techniques can influence the quality of generated samples significantly.
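One way such adaptations can look for a Gaussian-mixture output is sketched below: temperature is applied by scaling the component scales (and optionally sharpening the mixture weights), and classifier-free guidance is approximated by resampling candidates drawn from the conditional model in proportion to the guided density ratio. This is an illustration under our own assumptions, not necessarily the exact procedure used in the paper.

```python
import torch
from torch.distributions import Categorical, Independent, MixtureSameFamily, Normal

def with_temperature(gmm, scale_temp=0.9, weight_temp=1.0):
    """Temperature for a Gaussian mixture (assumes Independent(Normal) components):
    scale_temp < 1 shrinks the component scales; weight_temp < 1 sharpens the weights."""
    comp = gmm.component_distribution.base_dist   # underlying Normal(means, scales)
    mix = gmm.mixture_distribution                # Categorical over mixture components
    return MixtureSameFamily(
        Categorical(logits=mix.logits / weight_temp),
        Independent(Normal(comp.loc, comp.scale * scale_temp), 1),
    )

def cfg_sample_step(cond_gmm, uncond_gmm, w=0.5, num_candidates=64):
    """Guidance-style sampling for one step of a continuous sequence:
    sampling-importance-resampling toward p_cond^(1+w) / p_uncond^w.
    Assumes batch shape (batch,) and event shape (latent_dim,) for both inputs."""
    cand = cond_gmm.sample((num_candidates,))                              # (C, batch, latent_dim)
    log_ratio = w * (cond_gmm.log_prob(cand) - uncond_gmm.log_prob(cand))  # (C, batch)
    idx = Categorical(logits=log_ratio.T).sample()                         # one candidate per batch element
    return cand[idx, torch.arange(cand.shape[1])]                          # (batch, latent_dim)
```

Smaller scale_temp or larger w trades sample diversity for fidelity, which is the kind of tuning the paper reports as having a significant effect on sample quality.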

In summary, by enabling transformer models to work with real-valued vector sequences, GIVT overcomes previous constraints related to vocabulary size and related training complications, providing a new level of flexibility for generative models. This is a foundational change that paves the way for further innovations in diverse fields of generative AI.
