Linearly Mapping from Image to Text Space

(2209.15162)
Published Sep 30, 2022 in cs.CL and cs.LG

Abstract

The extent to which text-only language models (LMs) learn to represent features of the non-linguistic world is an open question. Prior work has shown that pretrained LMs can be taught to caption images when a vision model's parameters are optimized to encode images in the language space. We test a stronger hypothesis: that the conceptual representations learned by frozen text-only models and vision-only models are similar enough that this can be achieved with a linear map. We show that the image representations from vision models can be transferred as continuous prompts to frozen LMs by training only a single linear projection. Using these to prompt the LM achieves competitive performance on captioning and visual question answering tasks compared to models that tune both the image encoder and text decoder (such as the MAGMA model). We compare three image encoders with increasing amounts of linguistic supervision seen during pretraining: BEIT (no linguistic information), NF-ResNET (lexical category information), and CLIP (full natural language descriptions). We find that all three encoders perform equally well at transferring visual property information to the language model (e.g., whether an animal is large or small), but that image encoders pretrained with linguistic supervision more saliently encode category information (e.g., distinguishing hippo vs. elephant) and thus perform significantly better on benchmark language-and-vision tasks. Our results indicate that LMs encode conceptual information structurally similarly to vision-based models, even those that are solely trained on images. Code is available here: https://github.com/jmerullo/limber

Overview

  • The paper introduces LiMBeR, a method that aligns an image encoder's embedding space with a language model's input space through a single linear transformation.

  • Experiments show that the amount of linguistic supervision a vision model receives during pretraining affects how well its representations transfer conceptual information to LMs.

  • A frozen LM can produce relevant image descriptions when image encoder outputs are passed through the linear projection, even though the LM itself has had no multimodal training.

  • The study explores the limitations of conceptual knowledge transfer between vision and language models and suggests future research directions.

The study examines representational transfer between pretrained computer vision models and language models (LMs). The authors propose that the embedding spaces of image encoders and LMs can be aligned through a linear projection, a method they name LiMBeR (Linearly Mapping Between Representation spaces). Remarkably, a frozen LM can be "soft prompted" with image representations passed through a single linear transformation, without further tuning either model. The findings have implications for integrating visual inputs into LMs and for the broader question of what text-only LMs learn about the non-linguistic world.

Hypothesis and Methodology

The key hypothesis tested in this paper is that an LM can generate relevant image descriptions from a linearly projected embedding produced by a vision model, even though the LM has been trained only on text. To test it, the authors train a single linear layer that projects the output of an image encoder into the input-embedding space of a frozen language model, where the projected vectors serve as a continuous prompt. They evaluate image encoders pretrained with different levels of linguistic supervision, ranging from full natural-language descriptions (CLIP) to lexical category labels (NF-ResNet) to none at all (BEIT). This design isolates the relationship between vision representation learning and language understanding.
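
To make the mechanics concrete, below is a minimal sketch of this kind of setup in PyTorch, assuming a GPT-2-style frozen LM from Hugging Face transformers and an image encoder that outputs a fixed-size feature vector. The feature dimensionality, number of prompt tokens, and training details are illustrative stand-ins, not the paper's exact configuration.

```python
# Minimal LiMBeR-style sketch: only the linear projection is trained;
# the image encoder and the LM stay frozen. Hyperparameters are assumptions.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

d_img = 2048  # dimensionality of the image encoder's output (assumed)
k = 4         # number of soft-prompt vectors per image (assumed)

lm = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
d_lm = lm.config.n_embd

# The only trainable parameters: one linear map from image space
# to k vectors in the LM's input-embedding space.
projection = nn.Linear(d_img, k * d_lm)
for p in lm.parameters():
    p.requires_grad = False  # LM weights are never updated

def caption_loss(image_features, caption_ids):
    """Teacher-forced captioning loss with the projected image as a prefix.

    image_features: (batch, d_img) tensor from a frozen image encoder.
    caption_ids:    (batch, T) token ids of the reference captions.
    """
    batch = image_features.size(0)
    # Project image features to k continuous prompt vectors per image.
    prompts = projection(image_features).view(batch, k, d_lm)
    token_embeds = lm.transformer.wte(caption_ids)       # (batch, T, d_lm)
    inputs = torch.cat([prompts, token_embeds], dim=1)   # (batch, k+T, d_lm)
    # Ignore loss on the prompt positions; predict only the caption tokens.
    labels = torch.cat(
        [torch.full((batch, k), -100, dtype=torch.long), caption_ids], dim=1
    )
    return lm(inputs_embeds=inputs, labels=labels).loss
```

In this sketch, gradients from the captioning loss flow back through the frozen LM into the projection layer, which is the only module an optimizer would update.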

Evaluating Projection Performance

The linear mapping was evaluated on image captioning and visual question-answering (VQA) tasks, showing that the projections from image encoders into the LM do transfer meaningful information. Performance increased with the amount of linguistic supervision the vision model received during pretraining: CLIP, which was aligned with natural-language descriptions during pretraining, performed best, while BEIT, which saw no linguistic supervision, transferred only coarse conceptual information. This suggests that conceptual information in vision models is structured similarly to that in language models, to a degree that depends on linguistic supervision.
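
At evaluation time, the projected vectors can simply be prepended as a prefix and the frozen LM decoded from there. A rough sketch, continuing the hypothetical setup above (greedy decoding, single image, no key-value caching for brevity):

```python
@torch.no_grad()
def generate_caption(image_features, max_new_tokens=20):
    """Greedy decoding from the frozen LM, conditioned only on the projected image.

    image_features: (1, d_img) tensor for a single image.
    """
    prompts = projection(image_features).view(1, k, d_lm)
    inputs = prompts
    generated = []
    for _ in range(max_new_tokens):
        # Re-encode the whole prefix each step (simple but inefficient).
        logits = lm(inputs_embeds=inputs).logits[:, -1, :]
        next_id = logits.argmax(dim=-1)                      # greedy choice
        generated.append(next_id.item())
        next_embed = lm.transformer.wte(next_id).unsqueeze(1)
        inputs = torch.cat([inputs, next_embed], dim=1)
    return tokenizer.decode(generated)
```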

Analysis and Interpretation

The research then scrutinizes the granularity of concept transfer, asking whether incorrect lexical categories generated with encoders like BEIT stem from imprecise alignment of conceptual spaces or from insufficiently fine-grained representations. Although BEIT relays broad conceptual information, it often yields generic or semantically related terms instead of specific ones. These findings are supported by probing experiments and representation similarity analyses that further clarify the structure and limitations of the learned representations.
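
Representation similarity analysis of the kind referenced here compares the pairwise similarity structure of two models' representations over the same set of items. The sketch below is a generic version of that idea (cosine similarities plus a Spearman correlation over the upper triangles), not the paper's exact protocol:

```python
# Generic representation similarity analysis (RSA) between two models'
# embeddings of the same n concepts. Input shapes: (n, d_a) and (n, d_b).
import numpy as np
from scipy.stats import spearmanr

def rsa_score(reps_a, reps_b):
    def cosine_sim_matrix(x):
        x = np.asarray(x, dtype=float)
        x = x / np.linalg.norm(x, axis=1, keepdims=True)
        return x @ x.T

    sim_a = cosine_sim_matrix(reps_a)
    sim_b = cosine_sim_matrix(reps_b)
    # Correlate the off-diagonal similarity structures of the two spaces.
    iu = np.triu_indices(sim_a.shape[0], k=1)
    return spearmanr(sim_a[iu], sim_b[iu]).correlation
```

A high score indicates that items which are close in one model's space tend to be close in the other's, regardless of the spaces' dimensionalities.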

Implications and Future Work

In conclusion, the study opens promising directions for understanding how visual features can be used by language models, pointing to a structural similarity between visual and textual representations. Approaching representation transfer through a simple linear projection can serve both as a baseline for more complex multimodal models and as a tool for probing how conceptual knowledge can be bridged across modalities. The released code and trained model weights support the reproducibility of the research and encourage further experimentation and adaptation. Future work could explore how multimodal pretraining can exploit the structural representational similarities highlighted by LiMBeR, particularly for grounding language models in visual data.
