Grounding Language Models to Images for Multimodal Inputs and Outputs (2301.13823v4)

Published 31 Jan 2023 in cs.CL, cs.AI, cs.CV, and cs.LG

Abstract: We propose an efficient method to ground pretrained text-only LLMs to the visual domain, enabling them to process arbitrarily interleaved image-and-text data, and generate text interleaved with retrieved images. Our method leverages the abilities of LLMs learnt from large scale text-only pretraining, such as in-context learning and free-form text generation. We keep the LLM frozen, and finetune input and output linear layers to enable cross-modality interactions. This allows our model to process arbitrarily interleaved image-and-text inputs, and generate free-form text interleaved with retrieved images. We achieve strong zero-shot performance on grounded tasks such as contextual image retrieval and multimodal dialogue, and showcase compelling interactive abilities. Our approach works with any off-the-shelf LLM and paves the way towards an effective, general solution for leveraging pretrained LLMs in visually grounded settings.

References (60)
  1. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
  2. Cm3: A causal masked multimodal model of the internet. arXiv preprint arXiv:2201.07520, 2022.
  3. Flamingo: a visual language model for few-shot learning. NeurIPS, 2022.
  4. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005.
  5. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp.  610–623, 2021.
  6. Multimodal datasets: misogyny, pornography, and malignant stereotypes. arXiv preprint arXiv:2110.01963, 2021.
  7. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
  8. Language models are few-shot learners. NeurIPS, 2020.
  9. Data distributional properties drive emergent few-shot learning in transformers. NeurIPS, 2022.
  10. Learning a similarity metric discriminatively, with application to face verification. In CVPR, 2005.
  11. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
  12. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.
  13. Transformer-xl: Attentive language models beyond a fixed-length context. ACL, 2019.
  14. Visual dialog. In CVPR, 2017.
  15. LLM.int8(): 8-bit matrix multiplication for transformers at scale. NeurIPS, 2022.
  16. Cogview2: Faster and better text-to-image generation via hierarchical transformers. arXiv preprint arXiv:2204.14217, 2022.
  17. Magma–multimodal augmentation of generative models through adapter-based finetuning. EMNLP, 2022.
  18. Taming transformers for high-resolution image synthesis. In CVPR, 2021.
  19. Realtoxicityprompts: Evaluating neural toxic degeneration in language models. EMNLP, 2020.
  20. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In CVPR, 2017.
  21. Training compute-optimal large language models. NeurIPS, 2022.
  22. The curious case of neural text degeneration. ICLR, 2020.
  23. Parameter-efficient transfer learning for nlp. In ICML, 2019.
  24. Visual storytelling. In NAACL-HLT, 2016.
  25. Scaling up visual and vision-language representation learning with noisy text supervision. In ICLR, 2021.
  26. Adam: A method for stochastic optimization. ICLR, 2015.
  27. The power of scale for parameter-efficient prompt tuning. EMNLP, 2021.
  28. The winograd schema challenge. In Thirteenth international conference on the principles of knowledge representation and reasoning, 2012.
  29. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, 2022a.
  30. Prefix-tuning: Optimizing continuous prompts for generation. ACL, 2021.
  31. Contrastive decoding: Open-ended text generation as optimization. arXiv preprint arXiv:2210.15097, 2022b.
  32. Microsoft coco: Common objects in context. In ECCV, 2014.
  33. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. NeurIPS, 2019.
  34. Pretrained transformers as universal computation engines. AAAI, 2022.
  35. Linearly mapping from image to text space. arXiv preprint arXiv:2209.15162, 2022.
  36. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  37. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022.
  38. Bleu: a method for automatic evaluation of machine translation. In ACL, 2002.
  39. Pytorch: An imperative style, high-performance deep learning library. NeurIPS, 2019.
  40. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  41. Learning transferable visual models from natural language supervision. In ICML, 2021.
  42. Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446, 2021.
  43. Zero-shot text-to-image generation. In ICML, 2021.
  44. Generative adversarial text to image synthesis. In ICML, 2016.
  45. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
  46. Neural machine translation of rare words with subword units. ACL, 2015.
  47. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. ACL, 2018.
  48. Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990, 2022.
  49. Progressive generation of long text with pretrained language models. NAACL, 2021.
  50. Transcending scaling laws with 0.1% extra compute. arXiv preprint arXiv:2210.11399, 2022.
  51. Multimodal few-shot learning with frozen language models. NeurIPS, 2021.
  52. Attention is all you need. NeurIPS, 2017.
  53. Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. ICML, 2022.
  54. Finetuned language models are zero-shot learners. ICLR, 2021.
  55. Emergent abilities of large language models. TMLR, 2022.
  56. Re3: Generating longer stories with recursive reprompting and revision. EMNLP, 2022.
  57. Vector-quantized image modeling with improved vqgan. ICLR, 2021.
  58. Scaling autoregressive models for content-rich text-to-image generation. TMLR, 2022a.
  59. Multimodal knowledge alignment with reinforcement learning. arXiv preprint arXiv:2205.12630, 2022b.
  60. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.

Summary

  • The paper introduces FROMAGe, a novel approach that grounds text-only LLMs to process interleaved image-and-text data with strong zero-shot retrieval performance.
  • It employs fine-tuning of input/output linear layers while keeping the LLM frozen to facilitate effective cross-modality interactions.
  • Experimental results on VIST and VisDial datasets demonstrate superior contextual image retrieval and competitive zero-shot dialogue performance.

The paper "Grounding LLMs to Images for Multimodal Inputs and Outputs" (2301.13823) introduces Frozen Retrieval Over Multimodal Data for Autoregressive Generation (FROMAGe), a method for grounding pre-trained text-only LLMs to the visual domain. This enables the model to process arbitrarily interleaved image-and-text data and generate text interleaved with retrieved images. The key idea is to leverage the existing capabilities of LLMs, such as in-context learning and free-form text generation, while adapting them to handle visual information.

The approach involves keeping the LLM frozen and fine-tuning input and output linear layers to facilitate cross-modality interactions. The model is trained with a multi-task objective:

  • Image captioning: learning to process interleaved multimodal inputs.
  • Image-text retrieval: learning to produce interleaved multimodal outputs.

For image captioning, visual embeddings are extracted using a pre-trained visual encoder. A linear mapping, $\mathbf{W}_c \in \mathbb{R}^{m \times kd}$, is learned via a maximum-likelihood objective to map these embeddings into the input space of the LLM, where $m$ is the dimensionality of the visual embeddings, $k$ is the number of vectors each image is mapped to, and $d$ is the hidden dimensionality of the LLM.
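
To make the input-side mapping concrete, here is a minimal PyTorch sketch (with hypothetical dimensions and tensor names, not the released implementation) of projecting a visual embedding into $k$ soft tokens for the LLM:

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: m-dim visual embeddings, k soft tokens of the
# LLM's hidden size d.
m, k, d = 1024, 4, 4096

# Plays the role of W_c in R^{m x kd}.
W_c = nn.Linear(m, k * d)

visual_emb = torch.randn(2, m)                # batch of 2 embeddings from a frozen visual encoder
soft_tokens = W_c(visual_emb).view(2, k, d)   # k pseudo-token embeddings per image,
                                              # interleaved with the text token embeddings
```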

For image-text retrieval, the LLM learns a new [RET] token representing an image. Another linear mapping, $\mathbf{W}_t \in \mathbb{R}^{p \times q}$, is trained with contrastive learning to map the [RET] embedding produced for a caption close to the visual embedding of its paired image. The visual embeddings $v_{\phi}(y_i)$ are mapped into the same retrieval space with a third linear mapping, $\mathbf{W}_i \in \mathbb{R}^{m \times q}$. Here $p$ is the dimensionality of the [RET] hidden representation from the last hidden layer of the LLM, and $q$ is the retrieval dimension, with $q < p$.

The normalized cosine similarity for the image and text embeddings is computed as:

$\text{sim}(x, y) = \frac{(h_{\theta}(x)^T \mathbf{W}_t)\,(v_{\phi}(y)^T \mathbf{W}_i)^T}{\lVert h_{\theta}(x)^T \mathbf{W}_t \rVert \, \lVert v_{\phi}(y)^T \mathbf{W}_i \rVert}$

where $x$ is the caption, $y$ is its paired image, $h_{\theta}(x)$ is the output of the last hidden layer of the LLM at the [RET] token, $v_{\phi}(y)$ is the output of the visual encoder for the image, $\mathbf{W}_t$ maps the [RET] hidden representation into the retrieval space, and $\mathbf{W}_i$ maps the visual embeddings into the same space.
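
The following sketch (hypothetical dimensions; not the authors' code) shows how the two projections and the normalized similarity matrix could be computed for a batch of caption-image pairs:

```python
import torch
import torch.nn.functional as F

# Hypothetical dimensions: p = LLM hidden size, m = visual embedding size,
# q = retrieval dimension, N = batch size.
p, m, q, N = 4096, 1024, 256, 8

W_t = torch.randn(p, q)        # projects the [RET] hidden state into the retrieval space
W_i = torch.randn(m, q)        # projects the visual embedding into the retrieval space

h_ret = torch.randn(N, p)      # h_theta(x): last-layer hidden states at the [RET] token
v_img = torch.randn(N, m)      # v_phi(y): embeddings from the frozen visual encoder

text_proj = F.normalize(h_ret @ W_t, dim=-1)   # unit-norm h_theta(x)^T W_t
img_proj = F.normalize(v_img @ W_i, dim=-1)    # unit-norm v_phi(y)^T W_i
sim = text_proj @ img_proj.T                   # sim[i, j] = sim(x_i, y_j)
```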

The InfoNCE loss is minimized for text-to-image (t2i) and image-to-text (i2t) retrieval over a batch of $N$ text-image pairs $(x_i, y_i)$. The loss functions are:

$\mathcal{L}_{\text{t2i}} = -\frac{1}{N} \sum_{i=1}^N \left( \log \frac{\exp(\text{sim}(x_i, y_i) / \tau)}{\sum_{j=1}^N \exp(\text{sim}(x_i, y_j) / \tau)} \right)$

$\mathcal{L}_{\text{i2t}} = -\frac{1}{N} \sum_{i=1}^N \left( \log \frac{\exp(\text{sim}(y_i, x_i) / \tau)}{\sum_{j=1}^N \exp(\text{sim}(y_i, x_j) / \tau)} \right)$

where $\tau$ is a learnable temperature parameter.
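
In practice, this symmetric InfoNCE objective reduces to a cross-entropy loss over the similarity matrix with the matching pairs on the diagonal. A standalone sketch (with a random stand-in for the `sim` matrix above; not the released code):

```python
import torch
import torch.nn.functional as F

N = 8
sim = torch.randn(N, N)                        # stand-in for sim[i, j] = sim(x_i, y_j)
tau = torch.nn.Parameter(torch.tensor(0.07))   # learnable temperature
targets = torch.arange(N)                      # matched pairs sit on the diagonal

loss_t2i = F.cross_entropy(sim / tau, targets)     # text -> image retrieval
loss_i2t = F.cross_entropy(sim.T / tau, targets)   # image -> text retrieval
```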

The final training loss is a weighted sum of the captioning loss $\mathcal{L}_{\text{c}}$ and the retrieval losses:

$\mathcal{L} = \lambda_c \mathcal{L}_{\text{c}} + \lambda_r (\mathcal{L}_{\text{t2i}} + \mathcal{L}_{\text{i2t}})$

where $\lambda_c$ and $\lambda_r$ are the captioning and retrieval loss weights, respectively.

During training, only the linear mappings ($\mathbf{W}_c$, $\mathbf{W}_t$, and $\mathbf{W}_i$) and the [RET] embedding vector are updated.
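
A minimal sketch of this parameter-efficient setup follows (assumed sizes and hyperparameters; the LLM and visual encoder would be loaded separately and kept frozen):

```python
import torch
import torch.nn as nn

# Hypothetical dimensions matching the notation above.
m, k, d, q = 1024, 4, 4096, 256

W_c = nn.Linear(m, k * d)                # visual embedding -> k LLM input tokens
W_t = nn.Linear(d, q)                    # [RET] hidden state -> retrieval space
W_i = nn.Linear(m, q)                    # visual embedding -> retrieval space
ret_embedding = nn.Parameter(torch.randn(d) * 0.02)   # new [RET] input embedding

# Only these parameters receive gradients; everything else stays frozen.
trainable = [ret_embedding, *W_c.parameters(), *W_t.parameters(), *W_i.parameters()]
optimizer = torch.optim.Adam(trainable, lr=3e-4)       # hypothetical learning rate

# Per training step (with loss_c, loss_t2i, loss_i2t computed as above):
#   loss = lambda_c * loss_c + lambda_r * (loss_t2i + loss_i2t)
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
```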

The paper evaluates FROMAGe on tasks such as contextual image retrieval and visual dialogue, demonstrating strong zero-shot performance. Key findings include:

  • Autoregressive LLMs can perform text-to-image retrieval with greater sensitivity to input text compared to existing models.
  • The existing capabilities of pre-trained text-only LLMs can be leveraged for visually grounded tasks.

Experiments on the Visual Storytelling (VIST) dataset [huang2016visual] show that FROMAGe outperforms CLIP [radford2021learning] in contextual image retrieval, especially when provided with longer, temporally dependent sentences and interleaved image-and-text context. On Visual Dialog (VisDial) [das2017visual], FROMAGe achieves competitive results in zero-shot text answer selection and significantly outperforms prior work in text-to-image retrieval. Ablation studies validate the importance of freezing the LLM and using a dedicated retrieval token. The paper also presents results showing a positive correlation between model size and performance.
