BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions (2308.09936v3)
Abstract: Vision-Language Models (VLMs), which extend Large Language Models (LLMs) with visual understanding capability, have demonstrated significant advances on open-ended visual question-answering (VQA) tasks. However, these models often cannot accurately interpret images infused with text, a common occurrence in real-world scenarios. The standard procedure for extracting information from an image is to learn a fixed set of query embeddings that encapsulate the image context and are then fed to the LLM as soft prompts. This process, however, is limited by the fixed number of query tokens, which can curtail recognition of scenes with text-rich context. To address this limitation, this work introduces BLIVA: an augmented version of InstructBLIP with Visual Assistant. BLIVA retains the query embeddings from InstructBLIP and additionally projects the encoded patch embeddings directly into the LLM, a technique inspired by LLaVA. This design helps the model capture intricate details that might otherwise be missed during the query decoding process. Empirically, BLIVA significantly improves performance on text-rich VQA benchmarks (up to 17.76% on the OCR-VQA benchmark) and on general, not particularly text-rich, VQA benchmarks (up to 7.9% on the Visual Spatial Reasoning benchmark), and achieves a 17.72% overall improvement on a comprehensive multimodal LLM benchmark (MME), compared to our baseline, InstructBLIP. BLIVA thus demonstrates strong capability in decoding real-world images, whether or not text is present. To demonstrate the broad industry applications enabled by BLIVA, we also evaluate the model on a new dataset of YouTube thumbnails paired with question-answer sets across 11 diverse categories. Our code and models are freely accessible at https://github.com/mlpc-ucsd/BLIVA.
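The architectural idea described in the abstract, concatenating the Q-Former's learned query embeddings with directly projected patch embeddings to form a longer visual soft prompt for the LLM, can be sketched in a few lines of PyTorch. The sketch below is illustrative only and is not the released implementation: the module names, the dimensions (e.g. 1408 for the ViT patch features, 4096 for the LLM embedding space), and the single cross-attention layer standing in for the full Q-Former are all assumptions.

```python
# Minimal sketch (illustrative, not the authors' released code) of a
# BLIVA-style visual soft prompt: learned query embeddings plus directly
# projected patch embeddings, concatenated in front of the text embeddings.
import torch
import torch.nn as nn


class BlivaStyleVisualPrompt(nn.Module):
    def __init__(self, patch_dim=1408, qformer_dim=768, llm_dim=4096,
                 num_query_tokens=32):
        super().__init__()
        # Stand-in for the Q-Former: a fixed set of learned queries that
        # cross-attend to the patch embeddings. (The real Q-Former is a
        # BERT-style transformer; one attention layer keeps the sketch short.)
        self.query_tokens = nn.Parameter(torch.randn(num_query_tokens, qformer_dim))
        self.to_kv = nn.Linear(patch_dim, qformer_dim)
        self.cross_attn = nn.MultiheadAttention(qformer_dim, num_heads=8,
                                                batch_first=True)
        # Two projections into the LLM embedding space: one for the query
        # embeddings (as in InstructBLIP) and one for the raw patch
        # embeddings (the LLaVA-inspired branch that BLIVA adds).
        self.query_proj = nn.Linear(qformer_dim, llm_dim)
        self.patch_proj = nn.Linear(patch_dim, llm_dim)

    def forward(self, patch_embeds, text_embeds):
        # patch_embeds: (B, num_patches, patch_dim) from a frozen ViT encoder
        # text_embeds:  (B, num_text_tokens, llm_dim) embedded instruction text
        batch = patch_embeds.size(0)
        queries = self.query_tokens.unsqueeze(0).expand(batch, -1, -1)
        keys_values = self.to_kv(patch_embeds)
        query_out, _ = self.cross_attn(queries, keys_values, keys_values)

        query_tokens = self.query_proj(query_out)     # (B, 32, llm_dim)
        patch_tokens = self.patch_proj(patch_embeds)  # (B, num_patches, llm_dim)

        # Soft prompt fed to the LLM: [query tokens | patch tokens | text tokens]
        return torch.cat([query_tokens, patch_tokens, text_embeds], dim=1)


if __name__ == "__main__":
    prompt = BlivaStyleVisualPrompt()
    patches = torch.randn(2, 257, 1408)  # e.g. EVA ViT-g/14 patch embeddings
    text = torch.randn(2, 16, 4096)      # e.g. Vicuna-7B token embeddings
    print(prompt(patches, text).shape)   # torch.Size([2, 305, 4096])
```

In this sketch the query branch yields a compact, fixed-length summary of the image, while the patch branch preserves every encoded patch; the latter is the extra signal the abstract credits with recovering details missed during query decoding in text-rich scenes.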
- Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems (NeurIPS), 35: 23716–23736.
- OpenFlamingo.
- LaTr: Layout-Aware Transformer for Scene-Text VQA. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 16548–16558.
- Scaling Instruction-Finetuned Language Models. arXiv:2210.11416.
- InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. arXiv:2305.06500.
- Visual Dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
- MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models. arXiv preprint arXiv:2306.13394.
- ImageBind: One Embedding Space To Bind Them All. In Computer Vision and Pattern Recognition Conference (CVPR).
- MultiModal-GPT: A Vision and Language Model for Dialogue with Humans. arXiv:2305.04790.
- Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR).
- VizWiz Grand Challenge: Answering Visual Questions from Blind People. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3608–3617.
- Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor. arXiv:2212.09689.
- LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations (ICLR).
- ICDAR2019 Competition on Scanned Receipt OCR and Information Extraction. In 2019 International Conference on Document Analysis and Recognition (ICDAR), 1516–1520. IEEE.
- GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering. In Computer Vision and Pattern Recognition (CVPR).
- FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents. In 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW), volume 2, 1–6. IEEE.
- The Hateful Memes Challenge: Detecting Hate Speech in Multimodal Memes. Advances in Neural Information Processing Systems (NeurIPS), 33: 2611–2624.
- Visual information extraction in the wild: practical dataset and end-to-end solution. In International Conference on Document Analysis and Recognition, 36–53. Springer.
- Mimic-it: Multi-modal in-context instruction tuning. arXiv preprint arXiv:2306.05425.
- LAVIS: A One-stop Library for Language-Vision Intelligence. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), 31–41. Toronto, Canada: Association for Computational Linguistics.
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In ICML.
- Microsoft COCO: Common Objects in Context. arXiv:1405.0312.
- Visual Spatial Reasoning. Transactions of the Association for Computational Linguistics (TACL), 11: 635–651.
- Visual Instruction Tuning. arXiv:2304.08485.
- On the Hidden Mystery of OCR in Large Multimodal Models. arXiv:2305.07895.
- Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
- IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning. arXiv:2110.13214.
- OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 3195–3204.
- ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning. In Findings of the Association for Computational Linguistics: ACL 2022, 2263–2279.
- InfographicVQA. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 1697–1706.
- DocVQA: A Dataset for VQA on Document Images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2200–2209.
- OCR-VQA: Visual Question Answering by Reading Text in Images. In ICDAR.
- Large-Scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline. In Vedaldi, A.; Bischof, H.; Brox, T.; and Frahm, J.-M., eds., ECCV.
- OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774.
- Im2Text: Describing Images Using 1 Million Captioned Photographs. In Shawe-Taylor, J.; Zemel, R.; Bartlett, P.; Pereira, F.; and Weinberger, K., eds., Advances in Neural Information Processing Systems, volume 24. Curran Associates, Inc.
- Multitask Prompted Training Enables Zero-Shot Task Generalization. arXiv:2110.08207.
- LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs. arXiv:2111.02114.
- A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge. arXiv.
- Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning. In Proceedings of ACL.
- TextCaps: a Dataset for Image Captioning with Reading Comprehension. arXiv:2003.12462.
- Towards VQA Models That Can Read. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 8317–8326.
- PandaGPT: One Model To Instruction-Follow Them All. arXiv preprint arXiv:2305.16355.
- EVA-CLIP: Improved Training Techniques for CLIP at Scale. arXiv:2303.15389.
- Stanford Alpaca: An Instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.
- LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971.
- CIDEr: Consensus-based image description evaluation. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 4566–4575.
- OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework. CoRR, abs/2202.03052.
- On the general value of evidence, and bilingual scene-text visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10126–10135.
- Self-Instruct: Aligning Language Models with Self-Generated Instructions. arXiv:2212.10560.
- Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks. arXiv:2204.07705.
- Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652.
- Improving Cross-Task Generalization with Step-by-Step Instructions. arXiv preprint arXiv:2305.04429.
- Video Question Answering via Gradually Refined Attention over Appearance and Motion. In Proceedings of the 25th ACM International Conference on Multimedia, 1645–1653.
- MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning. arXiv:2212.10773.
- Just ask: Learning to answer questions from millions of narrated videos. In International Conference on Computer Vision (ICCV), 1686–1697.
- mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality. arXiv:2304.14178.
- From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics (TACL), 2: 67–78.
- OPT: Open Pre-trained Transformer Language Models. arXiv preprint arXiv:2205.01068.
- Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv:2306.05685.
- MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. arXiv preprint arXiv:2304.10592.