
All in an Aggregated Image for In-Image Learning (2402.17971v2)

Published 28 Feb 2024 in cs.CV, cs.AI, and cs.CL

Abstract: This paper introduces a new in-context learning (ICL) mechanism called In-Image Learning (I$^2$L) that combines demonstration examples, visual cues, and chain-of-thought reasoning into a single aggregated image to enhance the capabilities of Large Multimodal Models (e.g., GPT-4V) on multimodal reasoning tasks. Unlike previous approaches that rely on converting images to text or incorporating visual input into LLMs, I$^2$L consolidates all information into one aggregated image and leverages the image processing, understanding, and reasoning abilities of large multimodal models. This has several advantages: it reduces inaccurate textual descriptions of complex images, offers flexibility in positioning demonstration examples, and avoids multiple input images and lengthy prompts. We also introduce I$^2$L-Hybrid, a method that combines the strengths of I$^2$L with other ICL methods: an automatic strategy selects the most suitable method (I$^2$L or an alternative ICL method) for each task instance. We conduct extensive experiments to assess the effectiveness of I$^2$L and I$^2$L-Hybrid on MathVista, which covers a variety of complex multimodal reasoning tasks. Additionally, we investigate how image resolution, the number of demonstration examples in a single image, and the positions of these demonstrations within the aggregated image affect the effectiveness of I$^2$L. Our code is publicly available at https://github.com/AGI-Edgerunners/IIL.
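
As a rough illustration of the aggregation step described in the abstract, the sketch below composes demonstration images, their chain-of-thought text, and the test image onto one canvas using Pillow. The helper name, layout constants, and placeholder caption are illustrative assumptions, not taken from the paper's released code.

```python
# Minimal sketch of building an "aggregated image" in the spirit of I^2L:
# demonstration images, their chain-of-thought text, and the test image are
# stacked onto a single canvas that can be sent to a multimodal model as one
# visual prompt. Layout values and names here are assumptions for illustration.
from PIL import Image, ImageDraw

def build_aggregated_image(demos, test_image, canvas_width=1024, pad=16, text_band=80):
    """demos: list of (PIL.Image.Image, str) pairs -- a demo image and its
    chain-of-thought text. Returns a single aggregated PIL image."""
    panel_w = canvas_width - 2 * pad
    panels = []
    # Rescale every panel to a common width; the test instance goes last.
    for img, caption in list(demos) + [(test_image, "Question: <test instance>")]:
        h = max(1, int(img.height * panel_w / img.width))
        panels.append((img.resize((panel_w, h)), caption, h))

    total_h = sum(h + text_band + pad for _, _, h in panels) + pad
    canvas = Image.new("RGB", (canvas_width, total_h), "white")
    draw = ImageDraw.Draw(canvas)

    y = pad
    for img, caption, h in panels:
        canvas.paste(img, (pad, y))
        # Render the demonstration text (reasoning steps, visual cues) directly
        # below its image so the model reads everything from one picture.
        draw.text((pad, y + h + 8), caption, fill="black")
        y += h + text_band + pad
    return canvas
```

The resulting canvas is then submitted as a single image input, which is the property the paper leverages to avoid multiple interleaved images and lengthy textual prompts.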
