
Abstract

Recently, Multi-Modal Large Language Models (MM-LLMs) have unlocked many complex use cases that require MM understanding (e.g., image captioning or visual question answering) and MM generation (e.g., text-guided image generation or editing) capabilities. To further improve the output fidelity of MM-LLMs, we introduce the model-agnostic UniRAG technique, which adds relevant retrieved information to prompts as few-shot examples during inference. Contrary to the common belief that Retrieval Augmentation (RA) mainly improves generation or understanding of uncommon entities, our evaluation results on the MSCOCO dataset with common entities show that both proprietary models like GPT4 and Gemini-Pro and smaller open-source models like Llava, LaVIT, and Emu2 significantly enhance their generation quality when their input prompts are augmented with relevant information retrieved by MM retrievers like UniIR models.

Figure: Caption examples using CLIP-SF as the retriever for $k \in \{0, 1\}$.

Overview

  • UniRAG is a technique designed to enhance Multi-Modal LLMs (MM-LLMs) by incorporating Retrieval Augmentation (RA) to improve the quality and fidelity of generated outputs.

  • The methodology involves a two-stage process: retrieval, where relevant candidates are gathered using models like CLIP-SF and BLIP-FF, and generation, where MM-LLMs produce outputs from the enriched prompts via zero-shot and few-shot prompting.

  • Experiments using the MSCOCO dataset show that UniRAG significantly improves performance in tasks like image captioning and image generation, demonstrating the practical and theoretical benefits of this approach.

Enhancing Multi-Modal Models with UniRAG Technique

Introduction

In recent years, Multi-Modal LLMs (MM-LLMs) like GPT4, Gemini-Pro, and others have allowed for exciting advancements in tasks that bridge different data types, such as converting images to text (image captioning) and generating images from text descriptions. Despite their impressive capabilities, these models often struggle when tasked with generating accurate results for lesser-known or more recent subjects. This limitation primarily arises because MM-LLMs can only generate output based on their training data without considering external information.

To address this challenge, the paper presents UniRAG, a technique that enhances MM-LLMs by incorporating Retrieval Augmentation (RA). UniRAG retrieves relevant information (images, captions, etc.) and adds it to the models' prompts as few-shot examples. This approach aims to improve the overall quality and fidelity of the generated outputs, especially when dealing with common entities.

Methodology

UniRAG is built on a two-stage process: retrieval and generation.

1. Retrieval Stage

The retrieval stage retrieves the top-k relevant candidates from a multi-modal database using models like UniIR's CLIP Score Fusion (CLIP-SF) and BLIP Feature Fusion (BLIP-FF). These models are designed to handle both image and text modalities, enabling retrieval of heterogeneous (image or text) candidates. The retrieved information is then used to enrich the input prompt of the MM-LLM.
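The paper relies on UniIR's retrievers; as a rough illustration of the retrieval stage only, the sketch below runs image-to-text retrieval with a generic off-the-shelf CLIP checkpoint from Hugging Face. The checkpoint name, helper functions, and single-modality scoring are illustrative assumptions, not the CLIP-SF score-fusion setup itself.

```python
# Minimal retrieval-stage sketch: embed a query image and candidate captions with a
# generic CLIP model, then return the k most similar captions by cosine similarity.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_texts(captions):
    """Encode candidate captions into L2-normalized CLIP text embeddings."""
    inputs = processor(text=captions, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def embed_image(image):
    """Encode a query image (PIL.Image) into an L2-normalized CLIP image embedding."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def retrieve_top_k(query_image, candidate_captions, k=1):
    """Return the k captions most similar to the query image."""
    image_emb = embed_image(query_image)            # shape (1, d)
    text_embs = embed_texts(candidate_captions)     # shape (n, d)
    scores = (image_emb @ text_embs.T).squeeze(0)   # cosine similarities, shape (n,)
    top = torch.topk(scores, k=min(k, len(candidate_captions)))
    return [candidate_captions[i] for i in top.indices.tolist()]
```

In practice the candidate embeddings would be precomputed and indexed (e.g., with a nearest-neighbor library) rather than re-encoded for every query.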

2. Generation Stage

In the generation stage, MM-LLMs are guided by the enriched prompts to produce the desired outputs. This stage leverages zero-shot and few-shot prompting techniques:

  • Zero-shot: The model generates output based solely on the input query.
  • Few-shot: The model is additionally given the examples retrieved in the first stage, enhancing its response generation (see the prompt-assembly sketch below).
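As a sketch of how retrieved pairs can be folded into a captioning prompt, the snippet below assembles an OpenAI-style chat message with interleaved images and captions. The instruction wording and message layout are assumptions for illustration; the paper's exact prompt templates are not reproduced here.

```python
# Minimal few-shot prompt assembly for image captioning: retrieved (image, caption)
# pairs are interleaved before the query image, which the model must caption.
import base64

def encode_image(path):
    """Base64-encode an image file for inclusion in a vision prompt."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def build_captioning_messages(query_image_path, retrieved_examples):
    """retrieved_examples: list of (image_path, caption) pairs from the retrieval stage.
    An empty list degenerates to the zero-shot prompt."""
    content = [{"type": "text",
                "text": "Describe the final image in one sentence, "
                        "following the style of the preceding image-caption examples."}]
    for image_path, caption in retrieved_examples:  # few-shot examples (k >= 1)
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{encode_image(image_path)}"}})
        content.append({"type": "text", "text": f"Caption: {caption}"})
    # The query image goes last, with no caption, for the model to complete.
    content.append({"type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{encode_image(query_image_path)}"}})
    return [{"role": "user", "content": content}]
```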

Experimental Setup and Results

The paper evaluates the performance of UniRAG using the MSCOCO dataset, focusing on two primary tasks: image captioning (image-to-text) and image generation (text-to-image).

Image Captioning

Various MM-LLMs, including Llava, Gemini-Pro, and GPT4, were used for the image captioning task. The results indicated that adding relevant captions as few-shot examples significantly improved the models' performance. Key metrics such as SPICE showed notable improvements (an evaluation sketch follows the list):

  • Llava: Adding one relevant example (k=1) improved SPICE by 11.44 points.
  • Gemini-Pro and GPT4: Continued to see improvements as more relevant examples were added (up to k=10).
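For reference, caption quality can be scored with the SPICE implementation in the pycocoevalcap package; the snippet below is a minimal sketch under that assumption (the scorer also requires a Java runtime) and not the paper's exact evaluation harness.

```python
# Minimal SPICE evaluation sketch: score generated captions against references.
from pycocoevalcap.spice.spice import Spice

def spice_score(references, hypotheses):
    """references: dict image_id -> list of ground-truth caption strings
    hypotheses: dict image_id -> single generated caption string"""
    gts = {image_id: caps for image_id, caps in references.items()}
    res = {image_id: [cap] for image_id, cap in hypotheses.items()}
    corpus_score, _ = Spice().compute_score(gts, res)
    return corpus_score  # in [0, 1]; multiply by 100 to compare with points-style reporting
```

Comparing zero-shot (k=0) and few-shot (k>=1) runs then amounts to calling spice_score twice with the same references and the two sets of generated captions.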

Image Generation

For the image generation task, the LaVIT and Emu2-Gen models were employed. Effectiveness was evaluated using metrics such as FID (lower is better) and CLIP Score (higher is better); a metric-computation sketch follows the list:

  • LaVIT: Adding a single relevant image (k=1) improved FID from 155.75 to 119.54.
  • Emu2-Gen: Showed substantial improvements with an FID reduction from 61.18 to 26.53 when k=1 was used.
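As an illustration of how these two metrics can be computed, the snippet below uses the torchmetrics implementations of FID and CLIP score; the metric settings and CLIP checkpoint are assumptions and may differ from the paper's evaluation setup.

```python
# Minimal generation-evaluation sketch: FID against real MSCOCO images and CLIP score
# against the text prompts used for generation.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.multimodal.clip_score import CLIPScore

def evaluate_generation(real_images, generated_images, prompts):
    """real_images, generated_images: uint8 tensors of shape (N, 3, H, W) in [0, 255].
    prompts: list of N caption strings the images were generated from."""
    fid = FrechetInceptionDistance(feature=2048)
    fid.update(real_images, real=True)
    fid.update(generated_images, real=False)

    clip_score = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
    clip_score.update(generated_images, prompts)

    # Lower FID -> generated distribution closer to the real one;
    # higher CLIP score -> better alignment between images and their prompts.
    return {"fid": fid.compute().item(), "clip_score": clip_score.compute().item()}
```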

These results demonstrate that MM-LLMs benefit significantly from the UniRAG approach, irrespective of their baseline capabilities.

Implications and Future Directions

The implications of this research are both practical and theoretical:

  • Practical: Incorporating retrieval-augmented generation can significantly enhance the fidelity of MM-LLM outputs across various tasks, making these models more reliable in real-world applications.
  • Theoretical: The technique helps bridge the gap between what a model learned at training time and the external knowledge available at inference time, paving the way for more adaptive AI systems.

Future Work:

  • Experiment with out-of-domain retrieval to test the generalization capabilities of UniRAG.
  • Conduct ablation studies on different prompt templates to explore their influence on the model’s performance.

In summary, UniRAG provides a scalable and effective solution for enhancing the capabilities of MM-LLMs by integrating external knowledge at inference time. This model-agnostic technique holds promise for future developments in the field of AI, offering a robust way to improve the accuracy and relevance of model-generated content.
