Linearly Mapping from Image to Text Space

(2209.15162)
Published Sep 30, 2022 in cs.CL and cs.LG

Abstract

The extent to which text-only language models (LMs) learn to represent features of the non-linguistic world is an open question. Prior work has shown that pretrained LMs can be taught to caption images when a vision model's parameters are optimized to encode images in the language space. We test a stronger hypothesis: that the conceptual representations learned by frozen text-only models and vision-only models are similar enough that this can be achieved with a linear map. We show that the image representations from vision models can be transferred as continuous prompts to frozen LMs by training only a single linear projection. Using these to prompt the LM achieves competitive performance on captioning and visual question answering tasks compared to models that tune both the image encoder and text decoder (such as the MAGMA model). We compare three image encoders with increasing amounts of linguistic supervision seen during pretraining: BEIT (no linguistic information), NF-ResNET (lexical category information), and CLIP (full natural language descriptions). We find that all three encoders perform equally well at transferring visual property information to the language model (e.g., whether an animal is large or small), but that image encoders pretrained with linguistic supervision more saliently encode category information (e.g., distinguishing hippo vs. elephant) and thus perform significantly better on benchmark language-and-vision tasks. Our results indicate that LMs encode conceptual information structurally similarly to vision-based models, even those that are solely trained on images. Code is available here: https://github.com/jmerullo/limber

Overview

  • The paper introduces LiMBeR, a method that aligns an image encoder's embedding space with a language model's input space through a single linear transformation.

  • Experiments show that the amount of linguistic supervision a vision model receives during pretraining affects how well its representations transfer conceptual information to LMs.

  • A frozen LM can produce relevant image descriptions when image encoder outputs are passed through the linear projection, even though the LM itself has had no multimodal training.

  • The study explores the limitations of conceptual knowledge transfer between vision and language models and suggests future research directions.

The study examines representational transfer between pretrained computer vision models and language models (LMs). The authors propose that the embedding spaces of image encoders and LMs can be aligned through a linear projection, a method they name LiMBeR (Linearly Mapping Between Representation spaces). Remarkably, a frozen LM can be "soft prompted" with image representations passed through a single linear transformation, without further tuning either model. The findings have implications for integrating visual inputs into LMs and for the broader question of what text-only LMs learn about the non-linguistic world.

Hypothesis and Methodology

The key hypothesis tested in this paper is that an LM can generate relevant image descriptions from a linearly projected embedding produced by a vision model, even though the LM has been trained only on text. To test it, the authors train a single linear layer that projects the output of an image encoder into the input-embedding space of a frozen language model, where the projected vectors serve as a continuous prompt. They evaluate image encoders pretrained with different levels of linguistic supervision, ranging from full natural-language descriptions (CLIP) to lexical category labels (NF-ResNet) to none at all (BEIT). This design isolates the relationship between vision representation learning and language understanding.
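
To make the mechanics concrete, below is a minimal sketch of this kind of setup in PyTorch, assuming a GPT-2-style frozen LM from Hugging Face transformers and an image encoder that outputs a fixed-size feature vector. The feature dimensionality, number of prompt tokens, and training details are illustrative stand-ins, not the paper's exact configuration.

```python
# Minimal LiMBeR-style sketch: only the linear projection is trained;
# the image encoder and the LM stay frozen. Hyperparameters are assumptions.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

d_img = 2048  # dimensionality of the image encoder's output (assumed)
k = 4         # number of soft-prompt vectors per image (assumed)

lm = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
d_lm = lm.config.n_embd

# The only trainable parameters: one linear map from image space
# to k vectors in the LM's input-embedding space.
projection = nn.Linear(d_img, k * d_lm)
for p in lm.parameters():
    p.requires_grad = False  # LM weights are never updated

def caption_loss(image_features, caption_ids):
    """Teacher-forced captioning loss with the projected image as a prefix.

    image_features: (batch, d_img) tensor from a frozen image encoder.
    caption_ids:    (batch, T) token ids of the reference captions.
    """
    batch = image_features.size(0)
    # Project image features to k continuous prompt vectors per image.
    prompts = projection(image_features).view(batch, k, d_lm)
    token_embeds = lm.transformer.wte(caption_ids)       # (batch, T, d_lm)
    inputs = torch.cat([prompts, token_embeds], dim=1)   # (batch, k+T, d_lm)
    # Ignore loss on the prompt positions; predict only the caption tokens.
    labels = torch.cat(
        [torch.full((batch, k), -100, dtype=torch.long), caption_ids], dim=1
    )
    return lm(inputs_embeds=inputs, labels=labels).loss
```

In this sketch, gradients from the captioning loss flow back through the frozen LM into the projection layer, which is the only module an optimizer would update.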

Evaluating Projection Performance

The linear mapping was evaluated on image captioning and visual question-answering (VQA) tasks, showing that the projections from image encoders into the LM do transfer meaningful information. Performance increased with the amount of linguistic supervision the vision model received during pretraining: CLIP, which was aligned with natural-language descriptions during pretraining, performed best, while BEIT, which saw no linguistic supervision, transferred only coarse conceptual information. This suggests that conceptual information in vision models is structured similarly to that in language models, to a degree that depends on linguistic supervision.
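
At evaluation time, the projected vectors can simply be prepended as a prefix and the frozen LM decoded from there. A rough sketch, continuing the hypothetical setup above (greedy decoding, single image, no key-value caching for brevity):

```python
@torch.no_grad()
def generate_caption(image_features, max_new_tokens=20):
    """Greedy decoding from the frozen LM, conditioned only on the projected image.

    image_features: (1, d_img) tensor for a single image.
    """
    prompts = projection(image_features).view(1, k, d_lm)
    inputs = prompts
    generated = []
    for _ in range(max_new_tokens):
        # Re-encode the whole prefix each step (simple but inefficient).
        logits = lm(inputs_embeds=inputs).logits[:, -1, :]
        next_id = logits.argmax(dim=-1)                      # greedy choice
        generated.append(next_id.item())
        next_embed = lm.transformer.wte(next_id).unsqueeze(1)
        inputs = torch.cat([inputs, next_embed], dim=1)
    return tokenizer.decode(generated)
```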

Analysis and Interpretation

The research then scrutinizes the granularity of concept transfer, asking whether incorrect lexical categories generated with encoders like BEIT stem from imprecise alignment of conceptual spaces or from insufficiently fine-grained representations. Although BEIT relays broad conceptual information, it often yields generic or semantically related terms instead of specific ones. These findings are supported by probing experiments and representation similarity analyses that further clarify the structure and limitations of the learned representations.
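
Representation similarity analysis of the kind referenced here compares the pairwise similarity structure of two models' representations over the same set of items. The sketch below is a generic version of that idea (cosine similarities plus a Spearman correlation over the upper triangles), not the paper's exact protocol:

```python
# Generic representation similarity analysis (RSA) between two models'
# embeddings of the same n concepts. Input shapes: (n, d_a) and (n, d_b).
import numpy as np
from scipy.stats import spearmanr

def rsa_score(reps_a, reps_b):
    def cosine_sim_matrix(x):
        x = np.asarray(x, dtype=float)
        x = x / np.linalg.norm(x, axis=1, keepdims=True)
        return x @ x.T

    sim_a = cosine_sim_matrix(reps_a)
    sim_b = cosine_sim_matrix(reps_b)
    # Correlate the off-diagonal similarity structures of the two spaces.
    iu = np.triu_indices(sim_a.shape[0], k=1)
    return spearmanr(sim_a[iu], sim_b[iu]).correlation
```

A high score indicates that items which are close in one model's space tend to be close in the other's, regardless of the spaces' dimensionalities.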

Implications and Future Work

In conclusion, the study opens promising directions for understanding how visual features can be used by language models, pointing to a structural similarity between visual and textual representations. Approaching representation transfer through a simple linear projection can serve both as a baseline for more complex multimodal models and as a tool for probing how conceptual knowledge can be bridged across modalities. The released code and trained model weights support the reproducibility of the research and encourage further experimentation and adaptation. Future work could explore how multimodal pretraining can exploit the structural representational similarities highlighted by LiMBeR, particularly for grounding language models in visual data.
