
Abstract

Recently, Multi-Modal Large Language Models (MM-LLMs) have unlocked many complex use cases that require MM understanding (e.g., image captioning or visual question answering) and MM generation (e.g., text-guided image generation or editing) capabilities. To further improve the output fidelity of MM-LLMs, we introduce the model-agnostic UniRAG technique, which adds relevant retrieved information to prompts as few-shot examples during inference. Contrary to the common belief that Retrieval Augmentation (RA) mainly improves generation or understanding of uncommon entities, our evaluation results on the MSCOCO dataset with common entities show that both proprietary models like GPT4 and Gemini-Pro and smaller open-source models like Llava, LaVIT, and Emu2 significantly enhance their generation quality when their input prompts are augmented with relevant information retrieved by MM retrievers like UniIR models.

Figure: Caption examples using CLIP-SF as the retriever for $k \in \{0, 1\}$.

Overview

  • UniRAG is a technique designed to enhance Multi-Modal LLMs (MM-LLMs) by incorporating Retrieval Augmentation (RA) to improve the quality and fidelity of generated outputs.

  • The methodology involves a two-stage process: retrieval, where relevant candidates are gathered using models like CLIP-SF and BLIP-FF, and generation, where MM-LLMs produce outputs from the enriched prompts via zero-shot and few-shot prompting.

  • Experiments using the MSCOCO dataset show that UniRAG significantly improves performance in tasks like image captioning and image generation, demonstrating the practical and theoretical benefits of this approach.

Enhancing Multi-Modal Models with UniRAG Technique

Introduction

In recent years, Multi-Modal LLMs (MM-LLMs) like GPT4, Gemini-Pro, and others have allowed for exciting advancements in tasks that bridge different data types, such as converting images to text (image captioning) and generating images from text descriptions. Despite their impressive capabilities, these models often struggle when tasked with generating accurate results for lesser-known or more recent subjects. This limitation primarily arises because MM-LLMs can only generate output based on their training data without considering external information.

To address this challenge, the paper presents UniRAG, a technique that enhances MM-LLMs by incorporating Retrieval Augmentation (RA). UniRAG retrieves relevant information (images, captions, etc.) and adds it to the models' prompts as few-shot examples. This approach aims to improve the overall quality and fidelity of the generated outputs, especially when dealing with common entities.

Methodology

UniRAG is built on a two-stage process: retrieval and generation.

1. Retrieval Stage

The retrieval stage retrieves the top-k relevant candidates from a multi-modal database using models like UniIR's CLIP Score Fusion (CLIP-SF) and BLIP Feature Fusion (BLIP-FF). These models are designed to handle both image and text modalities, enabling retrieval of heterogeneous (image or text) candidates. The retrieved information is then used to enrich the input prompt of the MM-LLM.
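The paper relies on UniIR's retrievers; as a rough illustration of the retrieval stage only, the sketch below runs image-to-text retrieval with a generic off-the-shelf CLIP checkpoint from Hugging Face. The checkpoint name, helper functions, and single-modality scoring are illustrative assumptions, not the CLIP-SF score-fusion setup itself.

```python
# Minimal retrieval-stage sketch: embed a query image and candidate captions with a
# generic CLIP model, then return the k most similar captions by cosine similarity.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_texts(captions):
    """Encode candidate captions into L2-normalized CLIP text embeddings."""
    inputs = processor(text=captions, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def embed_image(image):
    """Encode a query image (PIL.Image) into an L2-normalized CLIP image embedding."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def retrieve_top_k(query_image, candidate_captions, k=1):
    """Return the k captions most similar to the query image."""
    image_emb = embed_image(query_image)            # shape (1, d)
    text_embs = embed_texts(candidate_captions)     # shape (n, d)
    scores = (image_emb @ text_embs.T).squeeze(0)   # cosine similarities, shape (n,)
    top = torch.topk(scores, k=min(k, len(candidate_captions)))
    return [candidate_captions[i] for i in top.indices.tolist()]
```

In practice the candidate embeddings would be precomputed and indexed (e.g., with a nearest-neighbor library) rather than re-encoded for every query.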

2. Generation Stage

In the generation stage, MM-LLMs are guided by the enriched prompts to produce the desired outputs. This stage leverages zero-shot and few-shot prompting techniques:

  • Zero-shot: The model generates output based solely on the input query.
  • Few-shot: The model is additionally given the examples retrieved in the first stage, enhancing its response generation (see the prompt-assembly sketch below).
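As a sketch of how retrieved pairs can be folded into a captioning prompt, the snippet below assembles an OpenAI-style chat message with interleaved images and captions. The instruction wording and message layout are assumptions for illustration; the paper's exact prompt templates are not reproduced here.

```python
# Minimal few-shot prompt assembly for image captioning: retrieved (image, caption)
# pairs are interleaved before the query image, which the model must caption.
import base64

def encode_image(path):
    """Base64-encode an image file for inclusion in a vision prompt."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def build_captioning_messages(query_image_path, retrieved_examples):
    """retrieved_examples: list of (image_path, caption) pairs from the retrieval stage.
    An empty list degenerates to the zero-shot prompt."""
    content = [{"type": "text",
                "text": "Describe the final image in one sentence, "
                        "following the style of the preceding image-caption examples."}]
    for image_path, caption in retrieved_examples:  # few-shot examples (k >= 1)
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{encode_image(image_path)}"}})
        content.append({"type": "text", "text": f"Caption: {caption}"})
    # The query image goes last, with no caption, for the model to complete.
    content.append({"type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{encode_image(query_image_path)}"}})
    return [{"role": "user", "content": content}]
```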

Experimental Setup and Results

The paper evaluates the performance of UniRAG using the MSCOCO dataset, focusing on two primary tasks: image captioning (image-to-text) and image generation (text-to-image).

Image Captioning

Various MM-LLMs, including Llava, Gemini-Pro, and GPT4, were used for the image captioning task. The results indicated that adding relevant captions as few-shot examples significantly improved the models' performance. Key metrics such as SPICE showed notable improvements (an evaluation sketch follows the list):

  • Llava: Adding one relevant example (k=1) improved SPICE by 11.44 points.
  • Gemini-Pro and GPT4: Continued to see improvements as more relevant examples were added (up to k=10).
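For reference, caption quality can be scored with the SPICE implementation in the pycocoevalcap package; the snippet below is a minimal sketch under that assumption (the scorer also requires a Java runtime) and not the paper's exact evaluation harness.

```python
# Minimal SPICE evaluation sketch: score generated captions against references.
from pycocoevalcap.spice.spice import Spice

def spice_score(references, hypotheses):
    """references: dict image_id -> list of ground-truth caption strings
    hypotheses: dict image_id -> single generated caption string"""
    gts = {image_id: caps for image_id, caps in references.items()}
    res = {image_id: [cap] for image_id, cap in hypotheses.items()}
    corpus_score, _ = Spice().compute_score(gts, res)
    return corpus_score  # in [0, 1]; multiply by 100 to compare with points-style reporting
```

Comparing zero-shot (k=0) and few-shot (k>=1) runs then amounts to calling spice_score twice with the same references and the two sets of generated captions.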

Image Generation

For the image generation task, the LaVIT and Emu2-Gen models were employed. Effectiveness was evaluated using metrics such as FID (lower is better) and CLIP Score (higher is better); a metric-computation sketch follows the list:

  • LaVIT: Adding a single relevant image (k=1) improved FID from 155.75 to 119.54.
  • Emu2-Gen: Showed substantial improvements with an FID reduction from 61.18 to 26.53 when k=1 was used.
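As an illustration of how these two metrics can be computed, the snippet below uses the torchmetrics implementations of FID and CLIP score; the metric settings and CLIP checkpoint are assumptions and may differ from the paper's evaluation setup.

```python
# Minimal generation-evaluation sketch: FID against real MSCOCO images and CLIP score
# against the text prompts used for generation.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.multimodal.clip_score import CLIPScore

def evaluate_generation(real_images, generated_images, prompts):
    """real_images, generated_images: uint8 tensors of shape (N, 3, H, W) in [0, 255].
    prompts: list of N caption strings the images were generated from."""
    fid = FrechetInceptionDistance(feature=2048)
    fid.update(real_images, real=True)
    fid.update(generated_images, real=False)

    clip_score = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
    clip_score.update(generated_images, prompts)

    # Lower FID -> generated distribution closer to the real one;
    # higher CLIP score -> better alignment between images and their prompts.
    return {"fid": fid.compute().item(), "clip_score": clip_score.compute().item()}
```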

These results demonstrate that MM-LLMs benefit significantly from the UniRAG approach, irrespective of their baseline capabilities.

Implications and Future Directions

The implications of this research are both practical and theoretical:

  • Practical: Incorporating retrieval-augmented generation can significantly enhance the fidelity of MM-LLM outputs across various tasks, making these models more reliable in real-world applications.
  • Theoretical: The technique helps bridge the gap between what a model learned at training time and the external knowledge available at inference time, paving the way for more adaptive AI systems.

Future Work:

  • Experiment with out-of-domain retrieval to test the generalization capabilities of UniRAG.
  • Conduct ablation studies on different prompt templates to explore their influence on the model’s performance.

In summary, UniRAG provides a scalable and effective solution for enhancing the capabilities of MM-LLMs by integrating external knowledge at inference time. This model-agnostic technique holds promise for future developments in the field of AI, offering a robust way to improve the accuracy and relevance of model-generated content.
