Detection-based Intermediate Supervision for Visual Question Answering (2312.16012v1)
Abstract: Recently, neural module networks (NMNs) have achieved continued success in answering compositional visual questions, especially those involving multi-hop visual and logical reasoning. NMNs decompose a complex question into several sub-tasks, each handled by an instance-module along the question's reasoning path, and exploit intermediate supervision to guide answer prediction, thereby improving the interpretability of inference. However, their performance may be hindered by coarse modeling of intermediate supervision. For instance, (1) the prior assumption that each instance-module refers to only one grounded object overlooks other potentially associated grounded objects, impeding full cross-modal alignment learning; (2) IoU-based intermediate supervision may introduce noisy signals, since overlapping bounding boxes can steer the model's focus toward irrelevant objects. To address these issues, a novel method, \textbf{\underline{D}}etection-based \textbf{\underline{I}}ntermediate \textbf{\underline{S}}upervision (DIS), is proposed; it adopts a generative detection framework to provide multiple grounding supervisions via sequence generation. As such, DIS offers more comprehensive and accurate intermediate supervision, thereby boosting answer-prediction performance. Furthermore, by taking intermediate results into account, DIS improves the consistency of answers to compositional questions and their sub-questions. Extensive experiments demonstrate the superiority of the proposed DIS, showing both improved accuracy and state-of-the-art reasoning consistency compared with prior approaches.
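
To make the idea of sequence-generation-based grounding supervision concrete, below is a minimal, illustrative Python sketch (not the authors' implementation): it serializes all grounded bounding boxes associated with one reasoning step into a single discrete token sequence, in the spirit of Pix2seq-style generative detection, so supervision is not limited to a single IoU-matched object. The bin count, special tokens, and names such as `boxes_to_target_sequence` are assumptions made for this example.

```python
# Illustrative sketch (not the paper's code): serialize the grounded objects of
# one instance-module into a discrete token sequence, Pix2seq-style, so that a
# sequence decoder can be supervised on ALL associated objects at once.
from typing import List, Tuple

NUM_BINS = 1000                     # coordinate quantization bins (assumed)
BOS, EOS = NUM_BINS, NUM_BINS + 1   # special tokens placed after the bin vocabulary

def quantize(coord: float, image_size: float) -> int:
    """Map a continuous coordinate in [0, image_size] to a discrete bin index."""
    coord = min(max(coord, 0.0), image_size)
    return min(int(coord / image_size * NUM_BINS), NUM_BINS - 1)

def boxes_to_target_sequence(
    boxes: List[Tuple[float, float, float, float]],  # (x1, y1, x2, y2) per object
    width: float,
    height: float,
) -> List[int]:
    """Build one target sequence covering every grounded box of a reasoning step,
    instead of supervising only a single IoU-matched object."""
    seq = [BOS]
    for x1, y1, x2, y2 in boxes:
        seq += [
            quantize(x1, width), quantize(y1, height),
            quantize(x2, width), quantize(y2, height),
        ]
    seq.append(EOS)
    return seq

# Example: two grounded objects for one reasoning step in a 640x480 image.
targets = boxes_to_target_sequence([(32, 48, 200, 240), (300, 100, 420, 260)], 640, 480)
```

A sequence decoder trained with token-level cross-entropy on such targets can, in principle, ground any number of objects per module, which is the property the abstract highlights over single-object, IoU-based supervision.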
Authors: Yuhang Liu, Daowan Peng, Wei Wei, Yuanyuan Fu, Wenfeng Xie, Dangyang Chen