Grounding Language Models to Images for Multimodal Inputs and Outputs (2301.13823v4)

Published 31 Jan 2023 in cs.CL, cs.AI, cs.CV, and cs.LG

Abstract: We propose an efficient method to ground pretrained text-only LLMs to the visual domain, enabling them to process arbitrarily interleaved image-and-text data, and generate text interleaved with retrieved images. Our method leverages the abilities of LLMs learnt from large scale text-only pretraining, such as in-context learning and free-form text generation. We keep the LLM frozen, and finetune input and output linear layers to enable cross-modality interactions. This allows our model to process arbitrarily interleaved image-and-text inputs, and generate free-form text interleaved with retrieved images. We achieve strong zero-shot performance on grounded tasks such as contextual image retrieval and multimodal dialogue, and showcase compelling interactive abilities. Our approach works with any off-the-shelf LLM and paves the way towards an effective, general solution for leveraging pretrained LLMs in visually grounded settings.

References (60)
  1. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
  2. Cm3: A causal masked multimodal model of the internet. arXiv preprint arXiv:2201.07520, 2022.
  3. Flamingo: a visual language model for few-shot learning. NeurIPS, 2022.
  4. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005.
  5. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp.  610–623, 2021.
  6. Multimodal datasets: misogyny, pornography, and malignant stereotypes. arXiv preprint arXiv:2110.01963, 2021.
  7. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
  8. Language models are few-shot learners. NeurIPS, 2020.
  9. Data distributional properties drive emergent few-shot learning in transformers. NeurIPS, 2022.
  10. Learning a similarity metric discriminatively, with application to face verification. In CVPR, 2005.
  11. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
  12. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.
  13. Transformer-xl: Attentive language models beyond a fixed-length context. ACL, 2019.
  14. Visual dialog. In CVPR, 2017.
  15. LLM.int8(): 8-bit matrix multiplication for transformers at scale. NeurIPS, 2022.
  16. Cogview2: Faster and better text-to-image generation via hierarchical transformers. arXiv preprint arXiv:2204.14217, 2022.
  17. Magma–multimodal augmentation of generative models through adapter-based finetuning. EMNLP, 2022.
  18. Taming transformers for high-resolution image synthesis. In CVPR, 2021.
  19. Realtoxicityprompts: Evaluating neural toxic degeneration in language models. EMNLP, 2020.
  20. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In CVPR, 2017.
  21. Training compute-optimal large language models. NeurIPS, 2022.
  22. The curious case of neural text degeneration. ICLR, 2020.
  23. Parameter-efficient transfer learning for nlp. In ICML, 2019.
  24. Visual storytelling. In NAACL-HLT, 2016.
  25. Scaling up visual and vision-language representation learning with noisy text supervision. In ICLR, 2021.
  26. Adam: A method for stochastic optimization. ICLR, 2015.
  27. The power of scale for parameter-efficient prompt tuning. EMNLP, 2021.
  28. The winograd schema challenge. In Thirteenth international conference on the principles of knowledge representation and reasoning, 2012.
  29. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, 2022a.
  30. Prefix-tuning: Optimizing continuous prompts for generation. ACL, 2021.
  31. Contrastive decoding: Open-ended text generation as optimization. arXiv preprint arXiv:2210.15097, 2022b.
  32. Microsoft coco: Common objects in context. In ECCV, 2014.
  33. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. NeurIPS, 2019.
  34. Pretrained transformers as universal computation engines. AAAI, 2022.
  35. Linearly mapping from image to text space. arXiv preprint arXiv:2209.15162, 2022.
  36. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  37. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022.
  38. Bleu: a method for automatic evaluation of machine translation. In ACL, 2002.
  39. Pytorch: An imperative style, high-performance deep learning library. NeurIPS, 2019.
  40. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  41. Learning transferable visual models from natural language supervision. In ICML, 2021.
  42. Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446, 2021.
  43. Zero-shot text-to-image generation. In ICML, 2021.
  44. Generative adversarial text to image synthesis. In ICML, 2016.
  45. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
  46. Neural machine translation of rare words with subword units. ACL, 2015.
  47. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. ACL, 2018.
  48. Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990, 2022.
  49. Progressive generation of long text with pretrained language models. NAACL, 2021.
  50. Transcending scaling laws with 0.1% extra compute. arXiv preprint arXiv:2210.11399, 2022.
  51. Multimodal few-shot learning with frozen language models. NeurIPS, 2021.
  52. Attention is all you need. NeurIPS, 2017.
  53. Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. ICML, 2022.
  54. Finetuned language models are zero-shot learners. ICLR, 2021.
  55. Emergent abilities of large language models. TMLR, 2022.
  56. Re3: Generating longer stories with recursive reprompting and revision. EMNLP, 2022.
  57. Vector-quantized image modeling with improved vqgan. ICLR, 2021.
  58. Scaling autoregressive models for content-rich text-to-image generation. TMLR, 2022a.
  59. Multimodal knowledge alignment with reinforcement learning. arXiv preprint arXiv:2205.12630, 2022b.
  60. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.

Summary

  • The paper introduces FROMAGe, a novel approach that grounds text-only LLMs to process interleaved image-and-text data with strong zero-shot retrieval performance.
  • It employs fine-tuning of input/output linear layers while keeping the LLM frozen to facilitate effective cross-modality interactions.
  • Experimental results on VIST and VisDial datasets demonstrate superior contextual image retrieval and competitive zero-shot dialogue performance.

The paper "Grounding LLMs to Images for Multimodal Inputs and Outputs" (2301.13823) introduces Frozen Retrieval Over Multimodal Data for Autoregressive Generation (FROMAGe), a method for grounding pre-trained text-only LLMs to the visual domain. This enables the model to process arbitrarily interleaved image-and-text data and generate text interleaved with retrieved images. The key idea is to leverage the existing capabilities of LLMs, such as in-context learning and free-form text generation, while adapting them to handle visual information.

The approach involves keeping the LLM frozen and fine-tuning input and output linear layers to facilitate cross-modality interactions. The model is trained with a multi-task objective:

  • Image captioning: learning to process interleaved multimodal inputs.
  • Image-text retrieval: learning to produce interleaved multimodal outputs.

For image captioning, visual embeddings are extracted using a pre-trained visual encoder. A linear mapping, $\mathbf{W}_c \in \mathbb{R}^{m \times kd}$, is learned via a maximum-likelihood objective to map these embeddings into the input space of the LLM, where $m$ is the dimensionality of the visual embeddings, $k$ is the number of vectors each image is mapped to, and $d$ is the hidden dimensionality of the LLM.
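
To make the input-side mapping concrete, here is a minimal PyTorch sketch (with hypothetical dimensions and tensor names, not the released implementation) of projecting a visual embedding into $k$ soft tokens for the LLM:

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: m-dim visual embeddings, k soft tokens of the
# LLM's hidden size d.
m, k, d = 1024, 4, 4096

# Plays the role of W_c in R^{m x kd}.
W_c = nn.Linear(m, k * d)

visual_emb = torch.randn(2, m)                # batch of 2 embeddings from a frozen visual encoder
soft_tokens = W_c(visual_emb).view(2, k, d)   # k pseudo-token embeddings per image,
                                              # interleaved with the text token embeddings
```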

For image-text retrieval, the LLM learns a new [RET] token representing an image. Another linear mapping, $\mathbf{W}_t \in \mathbb{R}^{p \times q}$, is trained with contrastive learning to map the [RET] embedding produced for a caption close to the visual embedding of its paired image. The visual embeddings $v_{\phi}(y_i)$ are mapped into the same retrieval space with a third linear mapping, $\mathbf{W}_i \in \mathbb{R}^{m \times q}$. Here $p$ is the dimensionality of the [RET] hidden representation from the last hidden layer of the LLM, and $q$ is the retrieval dimension, with $q < p$.

The normalized cosine similarity for the image and text embeddings is computed as:

$\text{sim}(x, y) = \frac{(h_{\theta}(x)^T \mathbf{W}_t)\,(v_{\phi}(y)^T \mathbf{W}_i)^T}{\lVert h_{\theta}(x)^T \mathbf{W}_t \rVert \, \lVert v_{\phi}(y)^T \mathbf{W}_i \rVert}$

where $x$ is the caption, $y$ is its paired image, $h_{\theta}(x)$ is the output of the last hidden layer of the LLM at the [RET] token, $v_{\phi}(y)$ is the output of the visual encoder for the image, $\mathbf{W}_t$ maps the [RET] hidden representation into the retrieval space, and $\mathbf{W}_i$ maps the visual embeddings into the same space.
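
The following sketch (hypothetical dimensions; not the authors' code) shows how the two projections and the normalized similarity matrix could be computed for a batch of caption-image pairs:

```python
import torch
import torch.nn.functional as F

# Hypothetical dimensions: p = LLM hidden size, m = visual embedding size,
# q = retrieval dimension, N = batch size.
p, m, q, N = 4096, 1024, 256, 8

W_t = torch.randn(p, q)        # projects the [RET] hidden state into the retrieval space
W_i = torch.randn(m, q)        # projects the visual embedding into the retrieval space

h_ret = torch.randn(N, p)      # h_theta(x): last-layer hidden states at the [RET] token
v_img = torch.randn(N, m)      # v_phi(y): embeddings from the frozen visual encoder

text_proj = F.normalize(h_ret @ W_t, dim=-1)   # unit-norm h_theta(x)^T W_t
img_proj = F.normalize(v_img @ W_i, dim=-1)    # unit-norm v_phi(y)^T W_i
sim = text_proj @ img_proj.T                   # sim[i, j] = sim(x_i, y_j)
```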

The InfoNCE loss is minimized for text-to-image (t2i) and image-to-text (i2t) retrieval over a batch of $N$ text-image pairs $(x_i, y_i)$. The loss functions are:

$\mathcal{L}_{\text{t2i}} = -\frac{1}{N} \sum_{i=1}^N \left( \log \frac{\exp(\text{sim}(x_i, y_i) / \tau)}{\sum_{j=1}^N \exp(\text{sim}(x_i, y_j) / \tau)} \right)$

$\mathcal{L}_{\text{i2t}} = -\frac{1}{N} \sum_{i=1}^N \left( \log \frac{\exp(\text{sim}(y_i, x_i) / \tau)}{\sum_{j=1}^N \exp(\text{sim}(y_i, x_j) / \tau)} \right)$

where $\tau$ is a learnable temperature parameter.
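
In practice, this symmetric InfoNCE objective reduces to a cross-entropy loss over the similarity matrix with the matching pairs on the diagonal. A standalone sketch (with a random stand-in for the `sim` matrix above; not the released code):

```python
import torch
import torch.nn.functional as F

N = 8
sim = torch.randn(N, N)                        # stand-in for sim[i, j] = sim(x_i, y_j)
tau = torch.nn.Parameter(torch.tensor(0.07))   # learnable temperature
targets = torch.arange(N)                      # matched pairs sit on the diagonal

loss_t2i = F.cross_entropy(sim / tau, targets)     # text -> image retrieval
loss_i2t = F.cross_entropy(sim.T / tau, targets)   # image -> text retrieval
```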

The final training loss is a weighted sum of the captioning loss $\mathcal{L}_{\text{c}}$ and the retrieval losses:

$\mathcal{L} = \lambda_c \mathcal{L}_{\text{c}} + \lambda_r (\mathcal{L}_{\text{t2i}} + \mathcal{L}_{\text{i2t}})$

where $\lambda_c$ and $\lambda_r$ are the captioning and retrieval loss weights, respectively.

During training, only the linear mappings ($\mathbf{W}_c$, $\mathbf{W}_t$, and $\mathbf{W}_i$) and the [RET] embedding vector are updated.
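
A minimal sketch of this parameter-efficient setup follows (assumed sizes and hyperparameters; the LLM and visual encoder would be loaded separately and kept frozen):

```python
import torch
import torch.nn as nn

# Hypothetical dimensions matching the notation above.
m, k, d, q = 1024, 4, 4096, 256

W_c = nn.Linear(m, k * d)                # visual embedding -> k LLM input tokens
W_t = nn.Linear(d, q)                    # [RET] hidden state -> retrieval space
W_i = nn.Linear(m, q)                    # visual embedding -> retrieval space
ret_embedding = nn.Parameter(torch.randn(d) * 0.02)   # new [RET] input embedding

# Only these parameters receive gradients; everything else stays frozen.
trainable = [ret_embedding, *W_c.parameters(), *W_t.parameters(), *W_i.parameters()]
optimizer = torch.optim.Adam(trainable, lr=3e-4)       # hypothetical learning rate

# Per training step (with loss_c, loss_t2i, loss_i2t computed as above):
#   loss = lambda_c * loss_c + lambda_r * (loss_t2i + loss_i2t)
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
```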

The paper evaluates FROMAGe on tasks such as contextual image retrieval and visual dialogue, demonstrating strong zero-shot performance. Key findings include:

  • Autoregressive LLMs can perform text-to-image retrieval with greater sensitivity to input text compared to existing models.
  • The existing capabilities of pre-trained text-only LLMs can be leveraged for visually grounded tasks.

Experiments on the Visual Storytelling (VIST) dataset [huang2016visual] show that FROMAGe outperforms CLIP [radford2021learning] in contextual image retrieval, especially when provided with longer, temporally dependent sentences and interleaved image-and-text context. On Visual Dialog (VisDial) [das2017visual], FROMAGe achieves competitive results in zero-shot text answer selection and significantly outperforms prior work in text-to-image retrieval. Ablation studies validate the importance of freezing the LLM and using a dedicated retrieval token. The paper also presents results showing a positive correlation between model size and performance.
