- The paper presents SLATE, a slot-based autoencoder that achieves zero-shot image composition without relying on textual prompts.
- It uses a Slot Attention Encoder and a transformer decoder to extract and model detailed object-centric representations from images.
- SLATE demonstrates superior image reconstruction and robust zero-shot generalization, composing novel scenes from a library of visual concepts obtained by clustering its slots.
Illiterate DALL-E Learns to Compose: An Overview
The paper "Illiterate DALL-E Learns to Compose," authored by Gautam Singh, Fei Deng, and Sungjin Ahn, introduces a novel slot-based autoencoding architecture named SLATE, aimed at enhancing object-centric representation models in zero-shot image generation. The primary focus of the paper is to bridge the gap between conventional image-generating models that rely heavily on text-based prompts, such as DALL-E, and those capable of independently inferring compositional structures from images.
Key Contributions and Methodology
SLATE stands for SLot Attention TransformEr and is developed to combine the advantages of DALL-E and object-centric representation learning models. Unlike DALL-E, which achieves compositionality through text-image pairs, SLATE attempts to achieve similar systematic generalization without the aid of text. It does so by learning object-centric representations directly from images, enabling zero-shot generation without text inputs, effectively making it an "illiterate" DALL-E.
The innovative structure of SLATE is primarily centered around:
- Slot Attention Encoder: This module generates a set of object representation vectors, or slots, from each input image. These slots are used to encode information about different objects within the scene.
- Transformer-based Decoder: SLATE first maps each image to a grid of discrete tokens with a discrete VAE (dVAE); an Image GPT-style transformer decoder then reconstructs these tokens autoregressively, conditioned on the slots. Unlike the pixel-mixture decoders of prior object-centric models, this decoder can capture complex interactions among slots and image regions, substantially improving the quality of both reconstructions and zero-shot generations (a minimal sketch of this pipeline follows the list).
- Visual Concept Library: SLATE generates a library of reusable visual concepts by clustering the learned slots, allowing the model to compose images by sampling from these learned concepts.
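To make the encode-decode path concrete, below is a minimal PyTorch sketch of the pipeline described above. It is an illustrative approximation rather than the authors' implementation: the simplified Slot Attention update, the layer sizes, and names such as `SLATESketch` are assumptions, and the dVAE that produces the discrete tokens is omitted.

```python
import torch
import torch.nn as nn


class SlotAttention(nn.Module):
    """Simplified Slot Attention: iteratively routes N input features into K slots."""

    def __init__(self, num_slots=4, dim=64, iters=3):
        super().__init__()
        self.num_slots, self.iters, self.scale = num_slots, iters, dim ** -0.5
        self.slots_init = nn.Parameter(torch.randn(1, num_slots, dim))
        self.to_q, self.to_k, self.to_v = (nn.Linear(dim, dim) for _ in range(3))
        self.gru = nn.GRUCell(dim, dim)
        self.norm_in, self.norm_slots = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, inputs):                                  # inputs: (B, N, dim)
        B = inputs.size(0)
        inputs = self.norm_in(inputs)
        k, v = self.to_k(inputs), self.to_v(inputs)
        slots = self.slots_init.expand(B, -1, -1)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            attn = (q @ k.transpose(1, 2) * self.scale).softmax(dim=1)  # slots compete for inputs
            attn = attn / attn.sum(dim=-1, keepdim=True)
            updates = attn @ v                                          # (B, K, dim)
            slots = self.gru(updates.reshape(-1, updates.size(-1)),
                             slots.reshape(-1, slots.size(-1))).view(B, self.num_slots, -1)
        return slots                                            # (B, K, dim) object slots


class SLATESketch(nn.Module):
    """dVAE tokens -> Slot Attention -> autoregressive transformer decoder over tokens."""

    def __init__(self, vocab_size=512, num_tokens=256, dim=64, num_slots=4):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)            # embeddings of discrete dVAE codes
        self.pos_emb = nn.Parameter(torch.randn(1, num_tokens, dim) * 0.02)
        self.slot_attn = SlotAttention(num_slots, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.to_logits = nn.Linear(dim, vocab_size)

    def forward(self, tokens):                                  # tokens: (B, N) dVAE code indices
        x = self.tok_emb(tokens) + self.pos_emb                 # (B, N, dim)
        slots = self.slot_attn(x)                               # object-centric slots
        # Teacher forcing: predict token t from tokens < t while cross-attending to the slots.
        tgt = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)  # right-shift, zero "BOS"
        causal = torch.triu(torch.full((tgt.size(1), tgt.size(1)), float("-inf")), diagonal=1)
        h = self.decoder(tgt, slots, tgt_mask=causal)
        return self.to_logits(h)                                # per-position logits over the codebook


# Training signal: cross-entropy reconstruction of the token grid (dVAE training omitted).
model = SLATESketch()
tokens = torch.randint(0, 512, (2, 256))
logits = model(tokens)
loss = nn.functional.cross_entropy(logits.reshape(-1, 512), tokens.reshape(-1))
```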
Experimental Evaluation
The authors conduct extensive experiments to compare SLATE with conventional models like Slot Attention that use mixture decoders. Evaluations are performed on datasets with composable objects, including 3D Shapes, CLEVR-Mirror, Shapestacks, Bitmoji, and others. Some key findings from these evaluations include:
- SLATE significantly improves the quality of zero-shot generation and image reconstruction over traditional mixture decoders.
- The model demonstrates robust zero-shot generalization, rendering novel object compositions by recombining slots drawn from different images or from the concept library (see the sketch after this list).
- SLATE produces better object attention masks on textured images, largely resolving the object-merging failures that mixture-decoder models exhibit on complex scenes.
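Building the concept library and composing with it can likewise be sketched in a few lines. The k-means clustering and greedy decoding below reuse the hypothetical `SLATESketch` from the earlier sketch; the function names, cluster count, and sampling scheme are illustrative assumptions rather than the paper's exact procedure, and the final dVAE decoding of tokens back to pixels is omitted.

```python
import torch


@torch.no_grad()
def build_concept_library(model, token_batches, num_concepts=20, kmeans_iters=10):
    """Pool slots from many images and cluster them (k-means) into reusable visual concepts."""
    pool = []
    for tokens in token_batches:                               # each: (1, N) dVAE code indices
        x = model.tok_emb(tokens) + model.pos_emb
        pool.append(model.slot_attn(x).squeeze(0))             # (K, dim) slots for this image
    pool = torch.cat(pool, dim=0)                              # (M, dim) slot pool
    centers = pool[torch.randperm(len(pool))[:num_concepts]].clone()
    for _ in range(kmeans_iters):
        assign = torch.cdist(pool, centers).argmin(dim=1)      # nearest-center assignment
        for c in range(num_concepts):
            if (assign == c).any():
                centers[c] = pool[assign == c].mean(dim=0)
    # Keep, per non-empty concept, the slots assigned to it.
    return [pool[assign == c] for c in range(num_concepts) if (assign == c).any()]


@torch.no_grad()
def compose(model, library, concept_ids, num_tokens=256):
    """Compose a novel scene: pick one slot per chosen concept, then decode tokens greedily."""
    picks = [library[c][torch.randint(len(library[c]), (1,))] for c in concept_ids]
    slots = torch.cat(picks, dim=0).unsqueeze(0)               # (1, K', dim) recombined slots
    tokens = torch.zeros(1, 0, dtype=torch.long)
    for t in range(num_tokens):
        emb = model.tok_emb(tokens) + model.pos_emb[:, :t]     # tokens generated so far
        tgt = torch.cat([torch.zeros(1, 1, emb.size(-1)), emb], dim=1)  # zero "BOS" + history
        h = model.decoder(tgt, slots)                          # cross-attend to the composed slots
        next_tok = model.to_logits(h[:, -1]).argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens                                              # feed to the dVAE decoder for pixels


# Example: mix concepts discovered from different images into one new scene.
# library = build_concept_library(model, [torch.randint(0, 512, (1, 256)) for _ in range(32)])
# new_tokens = compose(model, library, concept_ids=[0, 3, 7])
```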
Implications and Future Directions
This research is significant for increasing the autonomy of machine learning models in understanding and generating visual content. By removing the dependence on textual descriptions, models like SLATE improve the flexibility and generalization of AI systems when faced with novel situations and compositions.
Potential future developments could include more robust online clustering for the visual concept library, density modeling at the slot level, and optimizations for computational efficiency. This line of work also opens the door to deeper exploration of how unsupervised learning can improve AI's intrinsic understanding of complex visual scenes.
In conclusion, by dropping the dependence on text for visual compositionality, SLATE offers a promising direction for strengthening the innate image-generation capabilities of AI systems while keeping its architecture simple.