
Enhancing Semantics in Multimodal Chain of Thought via Soft Negative Sampling (2405.09848v1)

Published 16 May 2024 in cs.CL and cs.AI

Abstract: Chain of thought (CoT) has proven useful for problems requiring complex reasoning. Many of these problems are both textual and multimodal: given inputs in different modalities, a model generates a rationale and then uses it to answer a question. Because of hallucination, generated rationales with high textual quality but illogical semantics do not always help improve answer accuracy. This study proposes a rationale generation method using soft negative sampling (SNSE-CoT) to mitigate hallucination in multimodal CoT. Five methods were applied to generate soft negative samples that share highly similar text with, but differ in semantics from, the original. Bidirectional margin loss (BML) was applied to introduce them into the traditional contrastive learning framework, which involves only positive and negative samples. Extensive experiments on the ScienceQA dataset demonstrated the effectiveness of the proposed method. Code and data are released at https://github.com/zgMin/SNSE-CoT.
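The core idea of bidirectional margin loss is to treat a soft negative as neither a positive nor a hard negative: its similarity to the anchor should sit in a band below the positive's similarity, bounded on both sides. The sketch below is a minimal illustration of that idea, not the paper's implementation; the function name, the convention that the gap delta = cos(anchor, soft_neg) - cos(anchor, pos) is pushed into [-beta, -alpha], and the margin values are assumptions (the released repository defines the actual loss).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def bidirectional_margin_loss(anchor, positive, soft_negative,
                              alpha=0.1, beta=0.3):
    """Hypothetical BML sketch: penalize the soft negative when its
    similarity gap to the positive leaves the band [-beta, -alpha].

    delta = cos(anchor, soft_neg) - cos(anchor, pos)
    loss  = max(0, delta + alpha) + max(0, -delta - beta)
    The loss is zero only when -beta <= delta <= -alpha, i.e. the soft
    negative is somewhat less similar than the positive, but not pushed
    as far away as a hard negative would be.
    """
    delta = cosine(anchor, soft_negative) - cosine(anchor, positive)
    return max(0.0, delta + alpha) + max(0.0, -delta - beta)
```

With the toy vectors below, a soft negative at similarity 0.8 (gap -0.2) falls inside the band and incurs no loss, while a soft negative identical to the anchor (gap 0) is penalized by exactly alpha.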
