
Improving Visual Commonsense in Language Models via Multiple Image Generation

(2406.13621)
Published Jun 19, 2024 in cs.CL, cs.CV, and cs.LG

Abstract

Commonsense reasoning is fundamentally based on multimodal knowledge. However, existing LLMs are primarily trained using textual data only, limiting their ability to incorporate essential visual information. In contrast, Visual Language Models, which excel at visually-oriented tasks, often fail at non-visual tasks such as basic commonsense reasoning. This divergence highlights a critical challenge: the integration of robust visual understanding with foundational text-based language reasoning. To this end, we introduce a method aimed at enhancing LLMs' visual commonsense. Specifically, our method generates multiple images based on the input text prompt and integrates these into the model's decision-making process by mixing their prediction probabilities. To facilitate multimodal grounded language modeling, we employ a late-fusion layer that combines the projected visual features with the output of a pre-trained LLM conditioned on text only. This late-fusion layer enables predictions based on comprehensive image-text knowledge as well as on text alone when that is required. We evaluate our approach using several visual commonsense reasoning tasks together with traditional NLP tasks, including commonsense reasoning and reading comprehension. Our experimental results demonstrate significant superiority over existing baselines. When applied to recent state-of-the-art LLMs (e.g., Llama3), we observe improvements not only in visual commonsense but also in traditional NLP benchmarks. Code and models are available at https://github.com/guyyariv/vLMIG.

Figure: Proposed method overview. Training uses image-text pairs and synthetic images; a late-fusion layer combines visual and textual tokens.

Overview

  • The paper proposes a novel methodology to integrate visual commonsense into Language Models (LMs) by generating multiple images from text prompts.

  • The vLMIG (Visual Language Models via Multiple Image Generation) approach includes a Visual Token Projector (VTP) and a Late Fusion Attention Layer (LFAL) to merge visual and textual information effectively.

  • Experiments show that vLMIG significantly outperforms existing models on tasks related to object properties and visual commonsense, and that it maintains or slightly improves performance on commonsense reasoning and reading comprehension tasks.

Improving Visual Commonsense in Language Models via Multiple Image Generation

"Improving Visual Commonsense in Language Models via Multiple Image Generation" by Yariv et al. addresses the challenge of integrating visual commonsense capabilities into Language Models (LMs). While LMs excel in natural language tasks, they traditionally lack the ability to incorporate essential visual information, an area where Vision Language Models (VLMs) shine. However, VLMs often falter in tasks requiring basic commonsense reasoning that is not visually oriented. This paper proposes a novel method to bridge this gap.

The authors introduce a dual-component approach designed to enhance LLMs with visual commonsense. The first component involves a training architecture that incorporates a late-fusion layer for merging visual and textual information. The second component deals with inference, leveraging multiple images generated from text prompts by a pre-trained text-to-image model. The technique integrates these images into the decision-making process, guiding the final prediction by averaging probability vectors from several image variations.

Methodology

The proposed approach, vLMIG (Visual Language Models via Multiple Image Generation), comprises four primary components: a pre-trained LLM, a pre-trained Vision Encoder, a Visual Token Projector (VTP), and a Late Fusion Attention Layer (LFAL). The Vision Encoder and LLM remain frozen during training to prevent disruption of their learned representations, with the VTP and LFAL layers being the main trainable modules.
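
A minimal sketch of this training setup, assuming PyTorch; the handles vision_encoder, llm, vtp, and lfal are hypothetical stand-ins for the paper's components, and the learning rate is an illustrative choice:

```python
# Sketch of the training setup described above (not the authors' code): the vision
# encoder and LLM stay frozen, and only the VTP and LFAL parameters are optimized.
import torch

def configure_trainable_parameters(vision_encoder, llm, vtp, lfal, lr=1e-4):
    # Freeze the pre-trained backbones to preserve their learned representations.
    for module in (vision_encoder, llm):
        for p in module.parameters():
            p.requires_grad_(False)
    # Only the projector and late-fusion layer receive gradient updates.
    trainable = list(vtp.parameters()) + list(lfal.parameters())
    return torch.optim.AdamW(trainable, lr=lr)  # learning rate is an illustrative choice
```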

  • Visual Token Projector (VTP): This module transforms visual features extracted by the Vision Encoder into a pseudo-text representation, aligning visual data dimensions with those of textual embeddings.
  • Late Fusion Attention Layer (LFAL): Positioned immediately before the prediction layer, this attention mechanism integrates visual and textual features, allowing the model to attend to both pseudo-text tokens from visual input and text tokens.
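
A minimal PyTorch sketch of how these two modules might look, not the authors' released implementation; the linear projector, head count, and dimension names are illustrative assumptions:

```python
# Illustrative modules (not the authors' code). A linear Visual Token Projector maps
# frozen vision-encoder features into the LLM's embedding space, and a single
# cross-attention layer fuses them with text hidden states just before the LM head.
import torch.nn as nn

class VisualTokenProjector(nn.Module):
    """Projects vision-encoder features to pseudo-text tokens of width txt_dim."""
    def __init__(self, vis_dim: int, txt_dim: int):
        super().__init__()
        self.proj = nn.Linear(vis_dim, txt_dim)

    def forward(self, vis_feats):             # (B, N_vis, vis_dim)
        return self.proj(vis_feats)            # (B, N_vis, txt_dim)

class LateFusionAttentionLayer(nn.Module):
    """Cross-attention placed right before the prediction layer: text hidden states
    attend to the projected visual tokens, with a residual path preserved."""
    def __init__(self, txt_dim: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(txt_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(txt_dim)

    def forward(self, txt_hidden, vis_tokens):  # (B, T, D), (B, N_vis, D)
        fused, _ = self.attn(query=txt_hidden, key=vis_tokens, value=vis_tokens)
        return self.norm(txt_hidden + fused)     # residual keeps the text-only path usable
```

The residual connection around the cross-attention lets predictions fall back on the unchanged text hidden states when the visual tokens contribute little, which mirrors the paper's goal of supporting both image-text and text-only predictions.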

The inference process involves generating multiple images from the textual input using a pre-trained text-to-image model. These images, representing different aspects or variations of the input text, are passed through the visually augmented LLM. The prediction probability vectors obtained from each image are then averaged, so the final prediction is informed by several visual interpretations of the prompt (see the sketch below).
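
A sketch of this inference-time mixing under stated assumptions: the components passed in (vision_encoder, vtp, lfal, llm, lm_head) are hypothetical stand-ins, and the image-generation call itself is omitted.

```python
# Illustrative inference loop (a sketch, not the released implementation): run the
# prompt once through the frozen LLM, fuse its hidden states with each generated
# image, and average the resulting next-token probability vectors.
import torch

@torch.no_grad()
def fused_next_token_probs(prompt_ids, images, vision_encoder, vtp, lfal, llm, lm_head):
    # prompt_ids: (B, T) token ids; images: tensors produced by a text-to-image model.
    txt_hidden = llm(prompt_ids)                      # frozen LLM hidden states, (B, T, D)
    per_image_probs = []
    for img in images:
        vis_tokens = vtp(vision_encoder(img))         # project visual features to pseudo-text tokens
        fused = lfal(txt_hidden, vis_tokens)          # late fusion just before the prediction layer
        logits = lm_head(fused[:, -1])                # next-token logits conditioned on this image
        per_image_probs.append(torch.softmax(logits, dim=-1))
    # Mix the per-image predictions by averaging their probability vectors.
    return torch.stack(per_image_probs).mean(dim=0)
```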

Experimental Setup and Results

The authors evaluated vLMIG on several tasks, including object commonsense, visual commonsense, commonsense reasoning, and reading comprehension. Their experiments involved models of various scales and established benchmarks:

  • Object Commonsense: They used a zero-shot benchmark evaluating tasks related to object properties such as color, shape, and relative size. The results showed that vLMIG outperforms baselines such as BERT and GPT-2 by large margins, achieving notable improvements in all object-related tasks.
  • Visual Commonsense: Evaluation on ImageNetVC's question-answer pairs demonstrated vLMIG's superior performance in visual commonsense tasks. Impressively, the method was not confined to merely answering visually intuitive questions but extended its strength to diverse domains.
  • Commonsense Reasoning and Reading Comprehension: The results in these categories showed slight improvements for vLMIG in comparison to existing models, indicating that the method maintains, and sometimes even enhances, text-based language reasoning capabilities.

Implications and Future Directions

The implications of this research are both practical and theoretical. From a practical standpoint, vLMIG's capability to enhance LLMs with visual commonsense can drive improvements in applications requiring multimodal understanding, such as interactive assistants, automated reasoning systems, and more sophisticated AI-driven visual question answering platforms. Theoretically, the framework presents a pathway for integrating modular visual and textual components in a late-fusion manner, an approach that could be adapted and expanded upon in future research.

Further developments could involve optimizing the integration process to reduce inference times—a current limitation due to the requirement of generating multiple images. Additionally, exploring more advanced text-to-image models could yield even richer visual representations, further enhancing performance across both visual and non-visual tasks.

In conclusion, this paper makes a substantive contribution to the field of AI by addressing the integration of visual commonsense in LMs. The innovative approach and significant improvements across various benchmarks emphasize its potential to advance the capabilities of LLM-based systems. As text-to-image models and multimodal integrations advance, methods like vLMIG will likely play a crucial role in the evolution of intelligent, visually-aware language models.
