
Multimodal Prompt Retrieval for Generative Visual Question Answering (2306.17675v1)

Published 30 Jun 2023 in cs.CV and cs.AI

Abstract: Recent years have witnessed impressive results from pre-trained vision-language models on knowledge-intensive tasks such as visual question answering (VQA). Despite these advances, existing VQA methods mainly adopt a discriminative formulation that predicts answers within a pre-defined label set, leading to easy overfitting on low-resource domains with limited labeled data (e.g., medicine) and poor generalization under domain shift to other datasets. To tackle this limitation, we propose a novel generative model enhanced by multimodal prompt retrieval (MPR) that integrates retrieved prompts and multimodal features to generate answers in free text. Our generative model enables rapid zero-shot dataset adaptation to unseen data distributions and open-set answer labels across datasets. Our experiments on medical VQA tasks show that MPR outperforms its non-retrieval counterpart by up to 30 accuracy points in a few-shot domain adaptation setting.
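The core idea in the abstract — retrieve related prompts for a query, then condition a generative model on them — can be sketched as a small retrieval step. This is a minimal illustration, not the paper's implementation: the embeddings, the toy QA corpus, and the helper names (`retrieve_prompts`, `build_mpr_input`) are all hypothetical, and a real MPR system would fuse visual features with the text before decoding.

```python
import numpy as np

def retrieve_prompts(query_emb, index_embs, prompts, k=2):
    """Return the k prompts whose embeddings are most similar to the
    query under cosine similarity (standard retrieval-augmentation)."""
    q = query_emb / np.linalg.norm(query_emb)
    m = index_embs / np.linalg.norm(index_embs, axis=1, keepdims=True)
    scores = m @ q
    top = np.argsort(-scores)[:k]
    return [prompts[i] for i in top]

def build_mpr_input(question, retrieved):
    """Concatenate retrieved QA pairs with the target question to form the
    text input for a generative (free-text) answerer."""
    context = " ".join(f"question: {q} answer: {a}" for q, a in retrieved)
    return f"{context} question: {question} answer:"

# Toy retrieval corpus of (question, answer) pairs with made-up 2-D embeddings;
# a real system would embed images and questions with a pretrained encoder.
prompts = [("Is the lung normal?", "yes"),
           ("What modality is shown?", "x-ray"),
           ("Is there a fracture?", "no")]
index_embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.4]])

query_emb = np.array([1.0, 0.1])  # stub embedding of the target question
retrieved = retrieve_prompts(query_emb, index_embs, prompts, k=2)
model_input = build_mpr_input("Is the left lung healthy?", retrieved)
```

The resulting `model_input` string would then be fed, together with image features, to a sequence-to-sequence model that decodes the answer in free text, which is what lets the approach handle open-set answer labels.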

