Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation (2311.16201v2)
Abstract: Recent advances in image tokenizers, such as VQ-VAE, have enabled text-to-image generation using auto-regressive methods, analogous to language modeling. However, these methods have yet to leverage pre-trained language models, despite their adaptability to various downstream tasks. In this work, we explore this gap by adapting a pre-trained language model for auto-regressive text-to-image generation, and find that pre-trained language models offer limited help. We provide a two-fold explanation by analyzing tokens from each modality. First, we demonstrate that image tokens possess significantly different semantics compared to text tokens, rendering pre-trained language models no more effective in modeling them than randomly initialized ones. Second, the text tokens in image-text datasets are too simple compared to typical language model pre-training data, which causes a catastrophic degradation of the language model's capability.
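The recipe the abstract refers to can be illustrated with a minimal sketch (not the paper's exact code): a pre-trained language model's vocabulary is extended with discrete image-codebook tokens, and the model is trained with the usual next-token objective on concatenated [caption tokens ; image tokens]. Here GPT-2 stands in for the pre-trained LM, the codebook size, image sequence length, and the random image ids are assumptions; in practice the image ids come from a frozen VQ tokenizer such as VQ-VAE or VQGAN.

```python
# Hedged sketch of auto-regressive text-to-image training with a pre-trained LM.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

CODEBOOK_SIZE = 1024      # size of the image tokenizer's codebook (assumed)
IMAGE_SEQ_LEN = 256       # e.g., a 16x16 grid of latent codes (assumed)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Append CODEBOOK_SIZE new ids to the LM vocabulary for the image codes.
text_vocab_size = model.config.vocab_size
model.resize_token_embeddings(text_vocab_size + CODEBOOK_SIZE)

# A toy caption, and random ids standing in for real VQ-tokenized image codes.
caption_ids = tokenizer("a photo of a corgi on the beach",
                        return_tensors="pt").input_ids             # (1, T_text)
image_codes = torch.randint(0, CODEBOOK_SIZE, (1, IMAGE_SEQ_LEN))  # (1, T_img)
image_ids = image_codes + text_vocab_size   # offset into the extended vocab

# Causal LM objective over the concatenated sequence; masking the caption
# positions out of the loss is one common choice, not necessarily the paper's.
input_ids = torch.cat([caption_ids, image_ids], dim=1)
labels = input_ids.clone()
labels[:, :caption_ids.shape[1]] = -100     # ignore loss on text positions

loss = model(input_ids=input_ids, labels=labels).loss
loss.backward()
```

At inference time, the same model would condition on the caption tokens and sample image ids auto-regressively, which are then decoded back to pixels by the VQ tokenizer's decoder.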