UniRAG: Universal Retrieval Augmentation for Multi-Modal Large Language Models (2405.10311v2)
Abstract: Recently, Multi-Modal (MM) LLMs have unlocked many complex use-cases that require MM understanding (e.g., image captioning or visual question answering) and MM generation (e.g., text-guided image generation or editing) capabilities. To further improve the output fidelity of MM-LLMs, we introduce UniRAG, a plug-and-play technique that adds relevant retrieved information to prompts as few-shot examples during inference. Contrary to the common belief that Retrieval Augmentation (RA) mainly improves generation or understanding of uncommon entities, our evaluation results on the MSCOCO dataset, which features common entities, show that both proprietary models like GPT-4o and Gemini-Pro and smaller open-source models like LLaVA, LaVIT, and Emu2 significantly improve their generation quality when their input prompts are augmented with relevant information retrieved by MM retrievers such as UniIR models.
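At a high level, UniRAG retrieves examples similar to the query (e.g., image-caption pairs) and prepends them to the MM-LLM's prompt as few-shot demonstrations before generation. The sketch below illustrates this flow for image captioning; the toy embeddings, in-memory corpus, cosine scorer, and chat-message format are all illustrative assumptions rather than the paper's implementation, which uses UniIR retrievers over an MSCOCO candidate pool and MM-LLMs such as GPT-4o, Gemini-Pro, LLaVA, LaVIT, and Emu2.

```python
# Minimal sketch of UniRAG-style prompt augmentation for image captioning.
# All concrete names here (CORPUS, retrieve, build_prompt, message fields)
# are hypothetical; a real setup would embed images with an MM retriever
# and search a prebuilt nearest-neighbour index instead of a Python list.
import math

# Toy retrieval corpus: (image embedding, image id, ground-truth caption).
CORPUS = [
    ([0.9, 0.1], "coco_001.jpg", "A dog running on a beach."),
    ([0.1, 0.9], "coco_002.jpg", "A plate of pasta on a table."),
    ([0.8, 0.3], "coco_003.jpg", "Two dogs playing with a ball."),
]

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query_emb, k=2):
    """Return the k corpus entries most similar to the query embedding."""
    ranked = sorted(CORPUS, key=lambda row: cosine(query_emb, row[0]), reverse=True)
    return ranked[:k]

def build_prompt(query_emb, query_image):
    """Prepend retrieved image-caption pairs as few-shot examples,
    then append the query image whose caption the MM-LLM should produce."""
    messages = []
    for _, image_id, caption in retrieve(query_emb):
        messages.append({"role": "user", "image": image_id, "text": "Caption this image."})
        messages.append({"role": "assistant", "text": caption})
    messages.append({"role": "user", "image": query_image, "text": "Caption this image."})
    return messages  # hand this to any MM-LLM chat API

print(build_prompt([0.85, 0.2], "query.jpg"))
```

The same pattern applies in the generation direction: for text-guided image generation, retrieved caption-image pairs would be prepended as exemplars so the model conditions on visually relevant references, keeping the technique plug-and-play across both understanding and generation tasks.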
- Sahel Sharifymoghaddam
- Shivani Upadhyay
- Wenhu Chen
- Jimmy Lin