CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers (2204.14217v2)

Published 28 Apr 2022 in cs.CV and cs.LG

Abstract: The development of transformer-based text-to-image models is impeded by their slow generation and complexity for high-resolution images. In this work, we put forward a solution based on hierarchical transformers and local parallel auto-regressive generation. We pretrain a 6B-parameter transformer with a simple and flexible self-supervised task, Cross-modal general language model (CogLM), and finetune it for fast super-resolution. The new text-to-image system, CogView2, shows very competitive generation compared to concurrent state-of-the-art DALL-E-2, and naturally supports interactive text-guided editing on images.

Authors (4)

Ming Ding (219 papers)
Wendi Zheng (12 papers)
Wenyi Hong (14 papers)
Jie Tang (302 papers)

Citations (279)

Summary

The paper introduces CogView2, which leverages hierarchical transformers and local parallel autoregressive strategies for faster, high-resolution text-to-image generation.
The paper details a three-stage process (low-resolution generation, direct super-resolution, and iterative refinement) that balances generation speed against image quality.
The paper demonstrates that CogView2 is approximately ten times faster than its predecessor while achieving competitive results on metrics such as FID and Inception Score.

CogView2: Advancements in Text-to-Image Generation with Hierarchical Transformers

The paper "CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers" presents a significant enhancement to transformer-based text-to-image models through the implementation of hierarchical transformers and local parallel autoregressive generation strategies. The authors address several longstanding issues associated with high-resolution image generation, namely slow autoregressive generation, expensive training times due to high-resolution outputs, and the unidirectional nature of existing models.

Core Contribution and Methodology

The core contribution of this paper is CogView2, a text-to-image system built on a 6-billion-parameter pretrained transformer, the Cross-Modal General Language Model (CogLM). CogLM is pretrained with a text and image token masking strategy that trains the model to predict the missing tokens autoregressively, and it is then fine-tuned for fast super-resolution. This approach enables the model to perform multiple tasks, such as text-to-image generation, image infilling, and image captioning, without additional architectural changes.
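The masking idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `[MASK]` placeholder, the function names, and the toy token strings are all hypothetical, and the real model operates on learned token IDs with a trained transformer rather than a lookup.

```python
MASK = "[MASK]"  # hypothetical placeholder token, not taken from the paper


def mask_span(tokens, start, length):
    """Hide a contiguous span of tokens.

    Returns (masked_input, targets), where targets maps each masked
    position to the original token the model should predict there.
    """
    end = min(start + length, len(tokens))
    masked = list(tokens)
    targets = {}
    for i in range(start, end):
        targets[i] = tokens[i]
        masked[i] = MASK
    return masked, targets


def fill_autoregressively(masked, predict):
    """Fill masked positions left to right.

    `predict(sequence, position)` stands in for the model: it sees the
    partially filled sequence and returns a token for `position`.
    """
    out = list(masked)
    for i, tok in enumerate(out):
        if tok == MASK:
            out[i] = predict(out, i)
    return out


# Toy sequence: two text tokens, a separator, then three image tokens.
tokens = ["a", "cat", "<img>", "t1", "t2", "t3"]
masked, targets = mask_span(tokens, start=3, length=2)
# An oracle "model" that returns the ground truth, for demonstration only.
filled = fill_autoregressively(masked, lambda seq, i: targets[i])
```

Masking image tokens while keeping the text corresponds to text-to-image generation or infilling, and masking text tokens while keeping the image corresponds to captioning, which is why a single masking objective can cover several tasks.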

The hierarchical nature of CogView2 is pivotal to its performance. The generation process is divided into three main stages:

Low-resolution image generation, which utilizes the previously described cross-modal generation strategy.
A direct super-resolution module, which transforms these preliminary low-resolution images into higher-resolution outputs by means of a cross-resolution local attention mechanism.
An iterative super-resolution module, which further refines these high-resolution images, addressing local coherence and optimizing the output using a Local Parallel Autoregressive (LoPAR) approach.
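To see why local parallel generation cuts the number of model passes, consider the following simplified sketch. It assumes non-overlapping square windows and a fixed raster order inside each window; the function name `lopar_schedule` and the toy grid size are hypothetical, and the schedule used in the paper may differ in its details.

```python
def lopar_schedule(height, width, window):
    """Group grid positions by their offset inside a window x window tile.

    All positions in one group sit at the same offset in different tiles,
    so they can be produced together in a single parallel model pass.
    """
    steps = []
    for dy in range(window):
        for dx in range(window):
            group = [
                (y, x)
                for y in range(dy, height, window)
                for x in range(dx, width, window)
            ]
            steps.append(group)
    return steps


# Toy example: an 8x8 token grid split into 2x2 windows.
steps = lopar_schedule(8, 8, 2)
sequential_passes = 8 * 8      # one pass per token in plain autoregression
parallel_passes = len(steps)   # one pass per within-window offset
```

In this toy setting the number of passes depends only on the window size, not on the size of the image, which is the property that makes the approach attractive for high-resolution outputs.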

Comparative Performance and Evaluation

CogView2 demonstrates comparable performance to state-of-the-art models like DALL-E-2, particularly in generating high-resolution images with improved generation speed. It is reported to be approximately ten times faster than its predecessor, the original CogView, when generating images at similar resolutions.

The authors evaluated the model using Fréchet Inception Distance (FID) and Inception Score (IS), reporting metrics competitive with other leading models. Additionally, through cluster sampling and an optimized local attention kernel, CogView2 achieves significant gains in computational efficiency, with a reported reduction from 3,600 to 6 model forward passes in certain settings.
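For reference, FID compares the statistics of Inception features extracted from real and generated images. The expression below is the standard definition of the metric rather than anything specific to this summary; the symbols $\mu_r, \Sigma_r$ (mean and covariance of real-image features) and $\mu_g, \Sigma_g$ (the same for generated images) are notation introduced here.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

Lower FID indicates that the generated distribution is closer to the real one, whereas a higher Inception Score is better.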

Implications and Future Directions

CogView2's enhancements have substantial practical implications, offering a viable route to fast, high-quality image generation. This capability is particularly relevant in applications demanding rapid visual synthesis from textual descriptions, such as creative industries and interactive media.

Theoretically, the integration of hierarchical transformers and local parallel autoregressive mechanisms may guide future developments in other modalities and cross-modal tasks, extending beyond direct image synthesis. This work also provides a framework for mitigating computational overheads in large-scale model training and generation processes.

Conclusion

The paper positions CogView2 at the forefront of text-to-image generation research, offering a strategic balance between speed, resolution, and output quality. Future iterations may explore deeper hierarchical architectures and additional levels of super-resolution, as suggested by the authors. The broader impact on multimedia applications and ethical considerations regarding the production of synthetic content are also briefly noted, affirming the importance of responsible AI deployment.

In summary, the advancements detailed in this paper reflect a concerted effort to improve transformer-based image generation and represent a meaningful step forward for the field.
