Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation (2311.16201v2)
Abstract: Recent advances in image tokenizers, such as VQ-VAE, have enabled text-to-image generation using auto-regressive methods, similar to language modeling. However, these methods have yet to leverage pre-trained LLMs, despite their adaptability to various downstream tasks. In this work, we explore this gap by adapting a pre-trained LLM for auto-regressive text-to-image generation, and find that pre-trained LLMs offer limited help. We provide a two-fold explanation by analyzing tokens from each modality. First, we demonstrate that image tokens possess significantly different semantics compared to text tokens, rendering pre-trained LLMs no more effective in modeling them than randomly initialized ones. Second, the text tokens in image-text datasets are too simple compared to normal LLM pre-training data, which causes catastrophic degradation of the LLMs' language capability.
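The setup described in the abstract can be illustrated with a minimal sketch: a VQ tokenizer turns the image into discrete codes, the codes are appended to the text tokens in a shared vocabulary, and a decoder-only transformer is trained with ordinary next-token prediction. This is not the authors' code; the model size, vocabulary sizes, and shapes are illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's implementation) of
# auto-regressive text-to-image training over a joint text+image vocabulary.
import torch
import torch.nn as nn

TEXT_VOCAB = 32_000               # assumed text vocabulary size
IMAGE_VOCAB = 8_192               # assumed VQ codebook size
VOCAB = TEXT_VOCAB + IMAGE_VOCAB  # image codes occupy a disjoint id range


class TinyARModel(nn.Module):
    """Decoder-only transformer over the combined text + image-code vocabulary."""

    def __init__(self, d_model=512, n_layers=4, n_heads=8, max_len=1024):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, 4 * d_model, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, tokens):
        B, T = tokens.shape
        x = self.embed(tokens) + self.pos(torch.arange(T, device=tokens.device))
        # Causal mask: each position attends only to earlier positions.
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(tokens.device)
        return self.head(self.blocks(x, mask=mask))


def training_step(model, text_ids, image_codes):
    """text_ids: (B, T_text) text-token ids; image_codes: (B, T_img) VQ codes."""
    # Shift image codes into their own id range so the two modalities
    # never collide in the shared embedding table.
    seq = torch.cat([text_ids, image_codes + TEXT_VOCAB], dim=1)
    logits = model(seq[:, :-1])
    # Standard next-token prediction loss over text and image tokens alike.
    return nn.functional.cross_entropy(
        logits.reshape(-1, VOCAB), seq[:, 1:].reshape(-1))
```

Initializing `TinyARModel` from a pre-trained LLM checkpoint versus from random weights is the comparison the paper studies; its finding is that the pre-trained initialization offers limited benefit on the image-token portion of the sequence.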