Multimodal Prompt Retrieval for Generative Visual Question Answering (2306.17675v1)
Abstract: Recent years have witnessed impressive results from pre-trained vision-language models on knowledge-intensive tasks such as visual question answering (VQA). Despite these advances, existing methods mainly adopt a discriminative formulation that predicts answers within a pre-defined label set, which overfits easily on low-resource domains with limited labeled data (e.g., medicine) and generalizes poorly under domain shift to other datasets. To tackle this limitation, we propose a novel generative model enhanced by multimodal prompt retrieval (MPR), which integrates retrieved prompts with multimodal features to generate answers as free text. Our generative model enables rapid zero-shot adaptation to unseen data distributions and open-set answer labels across datasets. Experiments on medical VQA tasks show that MPR outperforms its non-retrieval counterpart by up to 30 accuracy points in a few-shot domain adaptation setting.
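The core mechanism the abstract describes, retrieving similar question-answer pairs and prepending them as a prompt for a generative model, can be sketched as below. This is a minimal illustration, not the paper's implementation: the toy `embed` function (a deterministic character-bigram hash) stands in for a real multimodal encoder such as CLIP, and the `build_prompt` format is a hypothetical template.

```python
import math

def embed(text, dim=8):
    # Toy deterministic "embedding": hash character bigrams into a fixed-size
    # vector. A stand-in for a real multimodal encoder (e.g. CLIP-style).
    vec = [0.0] * dim
    for i in range(len(text) - 1):
        vec[sum(ord(c) for c in text[i:i + 2]) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    # Vectors are already L2-normalized, so the dot product is cosine similarity.
    return sum(x * y for x, y in zip(a, b))

def retrieve_prompts(query, memory, k=2):
    """Return the k (question, answer) pairs from the retrieval memory most
    similar to the query, ranked by cosine similarity of toy embeddings."""
    q = embed(query)
    ranked = sorted(memory, key=lambda qa: cosine(q, embed(qa[0])), reverse=True)
    return ranked[:k]

def build_prompt(question, memory, k=2):
    """Prepend the retrieved QA pairs to the target question, forming the
    input sequence a generative model would decode an answer from."""
    context = " ".join(f"question: {q} answer: {a}"
                       for q, a in retrieve_prompts(question, memory, k))
    return f"{context} question: {question} answer:"

# Hypothetical retrieval memory of labeled medical VQA examples.
memory = [
    ("is there a fracture in the left arm", "no"),
    ("what organ is shown in this ct scan", "liver"),
    ("does the chest x-ray show pneumonia", "yes"),
]
print(build_prompt("what organ appears in the mri scan", memory, k=1))
```

Because answers are generated as free text conditioned on the retrieved prompt rather than scored against a fixed label set, the same model can be pointed at a new dataset simply by swapping the retrieval memory, which is what enables the zero-shot and few-shot adaptation the abstract claims.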