All You May Need for VQA are Image Captions (2205.01883v1)
Published 4 May 2022 in cs.CV and cs.CL
Abstract: Visual Question Answering (VQA) has benefited from increasingly sophisticated models, but has not enjoyed the same level of engagement in terms of data creation. In this paper, we propose a method that automatically derives VQA examples at volume by leveraging the abundance of existing image-caption annotations combined with neural models for textual question generation. We show that the resulting data is of high quality. VQA models trained on our data improve state-of-the-art zero-shot accuracy by double digits and achieve a level of robustness lacking in the same model trained on human-annotated VQA data.
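
The pipeline described in the abstract can be sketched as follows: feed each existing image caption to a sequence-to-sequence question-generation model and collect the resulting question-answer pairs as synthetic VQA training examples. The sketch below is illustrative only; the checkpoint name, the "question? answer" output format, and the helper function are assumptions, not the paper's actual model or code.

```python
# Minimal sketch of caption-to-VQA data generation, assuming a seq2seq
# question-generation model fine-tuned to emit "question? answer" strings.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder checkpoint name -- not the model used in the paper.
MODEL_NAME = "your-org/t5-caption-to-qa"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def caption_to_vqa_examples(image_id: str, caption: str, n: int = 3):
    """Turn one image caption into up to `n` (image, question, answer) triples."""
    inputs = tokenizer(caption, return_tensors="pt", truncation=True)
    outputs = model.generate(
        **inputs,
        num_beams=n,
        num_return_sequences=n,
        max_new_tokens=64,
    )
    examples = []
    for seq in outputs:
        text = tokenizer.decode(seq, skip_special_tokens=True)
        # Assumed "question? answer" output format; skip malformed generations.
        if "?" in text:
            question, answer = text.split("?", 1)
            examples.append((image_id, question.strip() + "?", answer.strip()))
    return examples

# Usage: each caption annotation yields several synthetic VQA training examples.
print(caption_to_vqa_examples("img_0001", "A brown dog catches a frisbee in a park."))
```

Because captions already exist at scale, running a generator like this over an image-caption corpus produces VQA training data with no additional human annotation, which is the core idea behind the paper's data-creation method.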
- Soravit Changpinyo
- Doron Kukliansky
- Idan Szpektor
- Xi Chen
- Nan Ding
- Radu Soricut