Grounding Language Models to Images for Multimodal Inputs and Outputs (2301.13823v4)
Abstract: We propose an efficient method to ground pretrained text-only LLMs to the visual domain, enabling them to process arbitrarily interleaved image-and-text data, and generate text interleaved with retrieved images. Our method leverages the abilities of LLMs learned from large-scale text-only pretraining, such as in-context learning and free-form text generation. We keep the LLM frozen, and finetune input and output linear layers to enable cross-modality interactions. This allows our model to process arbitrarily interleaved image-and-text inputs, and generate free-form text interleaved with retrieved images. We achieve strong zero-shot performance on grounded tasks such as contextual image retrieval and multimodal dialogue, and showcase compelling interactive abilities. Our approach works with any off-the-shelf LLM and paves the way towards an effective, general solution for leveraging pretrained LLMs in visually grounded settings.
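The core design described in the abstract, a frozen LLM with small trainable linear layers at its input and output, can be sketched as follows. This is a minimal illustration only, assuming a HuggingFace-style causal LM interface and CLIP-like visual features; the class name, `visual_dim`, `ret_dim`, and the specific projection layout are hypothetical stand-ins, not the paper's released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrozenLMWithVisualGrounding(nn.Module):
    """Sketch: frozen text-only LM plus trainable linear layers that
    (a) map image features into the LM's input embedding space, and
    (b) map LM hidden states and image features into a shared space
    used to retrieve images to interleave with generated text."""

    def __init__(self, lm, visual_dim=1024, ret_dim=256):
        super().__init__()
        self.lm = lm
        for p in self.lm.parameters():  # keep the pretrained LLM frozen
            p.requires_grad = False
        hidden = lm.config.hidden_size
        # Trainable input projection: image features -> LM token embeddings.
        self.visual_in = nn.Linear(visual_dim, hidden)
        # Trainable output projections into a shared retrieval space.
        self.text_out = nn.Linear(hidden, ret_dim)
        self.image_out = nn.Linear(visual_dim, ret_dim)

    def embed_interleaved(self, token_ids, image_feats):
        """Prepend a projected image embedding to the text token embeddings,
        forming an interleaved multimodal input sequence."""
        text_emb = self.lm.get_input_embeddings()(token_ids)   # (B, T, H)
        img_emb = self.visual_in(image_feats).unsqueeze(1)     # (B, 1, H)
        return torch.cat([img_emb, text_emb], dim=1)

    def forward(self, token_ids, image_feats):
        inputs_embeds = self.embed_interleaved(token_ids, image_feats)
        out = self.lm(inputs_embeds=inputs_embeds, output_hidden_states=True)
        # Retrieval embedding taken from the final token's last hidden state.
        ret_emb = self.text_out(out.hidden_states[-1][:, -1, :])
        return out.logits, F.normalize(ret_emb, dim=-1)
```

Under this sketch, only the three linear layers receive gradients; image retrieval would score candidate images by the dot product between the normalized `text_out` embedding and `image_out` projections of candidate image features.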