
CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs (2401.02582v1)

Published 5 Jan 2024 in cs.CV

Abstract: When exploring the development of AGI, a critical task for large models involves interpreting and processing information from multiple image inputs. However, Large Multimodal Models (LMMs) encounter two issues in such scenarios: (1) a lack of fine-grained perception, and (2) a tendency to blend information across multiple images. We first extensively investigate the capability of LMMs to perceive fine-grained visual details when dealing with multiple input images. The research focuses on two aspects: first, image-to-image matching (to evaluate whether LMMs can effectively reason and pair relevant images), and second, multi-image-to-text matching (to assess whether LMMs can accurately capture and summarize detailed image information). We conduct evaluations on a range of both open-source and closed-source large models, including GPT-4V, Gemini, OpenFlamingo, and MMICL. To enhance model performance, we further develop a Contrastive Chain-of-Thought (CoCoT) prompting approach based on multi-input multimodal models. This method prompts LMMs to compare the similarities and differences among multiple image inputs, and then guides them to answer detailed questions about the multi-image inputs based on the identified similarities and differences. Our experimental results showcase CoCoT's proficiency in enhancing the multi-image comprehension capabilities of large multimodal models.
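Based only on the abstract's description, a CoCoT prompt has two stages: it first asks the model to contrast the input images, and then asks it to answer the question in light of that contrast. The sketch below shows one way such a prompt could be assembled; the exact wording used in the paper is not reproduced here, and the query_lmm helper, model choice, and file names are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of a CoCoT-style prompt for a multi-image question.
# The two-stage structure (contrast first, then answer) follows the
# abstract's description; query_lmm() and its signature are hypothetical
# placeholders for whatever multimodal model API is actually used.

from typing import List


def build_cocot_prompt(question: str, num_images: int) -> str:
    """Compose a contrastive chain-of-thought prompt for multiple images."""
    contrast_step = (
        f"You are given {num_images} images. "
        "First, describe the similarities and differences between them "
        "in fine-grained detail."
    )
    answer_step = (
        "Then, based on the identified similarities and differences, "
        f"answer the following question: {question}"
    )
    return f"{contrast_step}\n{answer_step}"


def query_lmm(prompt: str, image_paths: List[str]) -> str:
    """Hypothetical wrapper around a multi-image LMM (e.g. GPT-4V, Gemini, MMICL)."""
    raise NotImplementedError("Replace with a call to your multimodal model.")


if __name__ == "__main__":
    images = ["left.jpg", "right.jpg"]  # placeholder image files
    prompt = build_cocot_prompt(
        question="Which image shows the larger object?",
        num_images=len(images),
    )
    print(prompt)  # inspect the assembled CoCoT prompt
    # answer = query_lmm(prompt, images)  # enable once query_lmm is wired to a real model
```

In use, the assembled prompt would be sent together with the images to a multi-image-capable LMM; the contrastive step is what distinguishes CoCoT from a plain chain-of-thought prompt.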
