Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers (2401.01974v2)

Published 3 Jan 2024 in cs.CV, cs.AI, and cs.LG

Abstract: Visual reasoning is dominated by end-to-end neural networks scaled to billions of model parameters and training examples. However, even the largest models struggle with compositional reasoning, generalization, fine-grained spatial and temporal reasoning, and counting. Visual reasoning with LLMs as controllers can, in principle, address these limitations by decomposing the task and solving subtasks by orchestrating a set of (visual) tools. Recently, these models achieved great performance on tasks such as compositional visual question answering, visual grounding, and video temporal reasoning. Nevertheless, in their current form, these models heavily rely on human engineering of in-context examples in the prompt, which are often dataset- and task-specific and require significant labor by highly skilled programmers. In this work, we present a framework that mitigates these issues by introducing spatially and temporally abstract routines and by leveraging a small number of labeled examples to automatically generate in-context examples, thereby avoiding human-created in-context examples. On a number of visual reasoning tasks, we show that our framework leads to consistent gains in performance, makes LLMs as controllers setup more robust, and removes the need for human engineering of in-context examples.


Summary

  • The paper introduces a novel framework enabling zero-shot compositional visual reasoning with LLMs as programmers through abstract APIs.
  • It leverages automatically generated in-context examples and spatial-temporal routines to boost performance across multiple visual reasoning datasets.
  • A self-correction mechanism dynamically refines model outputs, enhancing robustness and reducing the need for manual engineering.

"Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers"

Introduction

The paper "Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers" (2401.01974) introduces a novel framework to enhance visual reasoning tasks through LLMs acting as controllers. This framework seeks to address the persistent challenges faced by conventional end-to-end neural networks, particularly their limitations in compositional reasoning, generalization, and spatial-temporal tasks. Unlike traditional models that have extensive reliance on dataset-specific, human-engineered in-context examples, this approach leverages automatically generated examples and introduces spatially and temporally abstract routines to facilitate reasoning processes, thereby making LLMs more robust controllers without significant manual engineering.

Framework Components

Abstract API Development

The core innovation of the paper is the development of an "Abstract API" composed of spatial and temporal routines that help circumvent the limitations of current LLMs in spatial and temporal reasoning. By encapsulating complex operations within higher-order abstractions, the API reduces cognitive overhead and allows LLMs to perform reasoning tasks more efficiently (Figure 1).

Figure 1: Using our Abstract API improves performance over the ViperGPT API across all datasets. Similarly, ACEs consistently improve performance, and these gains compound with the gains from the Abstract API. Uncertainty bars represent standard deviations computed over three random seeds.
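
To illustrate what such routines might look like, here is a minimal sketch of spatially abstract helpers. The `Patch` dataclass and the routine names and signatures are illustrative assumptions in the spirit of the paper's API, not its exact definitions.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Patch:
    """A detected image region with bounding-box coordinates in pixels."""
    left: float
    lower: float
    right: float
    upper: float
    name: str = ""

# Spatially abstract routines: the LLM composes these calls instead of
# writing raw coordinate arithmetic itself.

def sort_patches_left_to_right(patches: List[Patch]) -> List[Patch]:
    """Order patches by the horizontal center of their bounding boxes."""
    return sorted(patches, key=lambda p: (p.left + p.right) / 2)

def get_patches_left_of(patches: List[Patch], anchor: Patch) -> List[Patch]:
    """Return patches whose center lies left of the anchor's center."""
    anchor_cx = (anchor.left + anchor.right) / 2
    return [p for p in patches
            if (p.left + p.right) / 2 < anchor_cx and p is not anchor]

def get_middle_patch(patches: List[Patch]) -> Patch:
    """Return the spatially middle patch after a left-to-right sort."""
    ordered = sort_patches_left_to_right(patches)
    return ordered[len(ordered) // 2]
```

Keeping coordinate arithmetic inside such routines means the generated programs only compose high-level calls, which is exactly the burden the Abstract API is designed to lift from the LLM.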

Automatic Generation of In-context Examples (ACEs)

One significant advancement is the automatic generation of in-context examples (ACEs). Typically, such examples are hand-crafted, requiring substantial expertise and labor. The framework instead generates ACEs from a small set of labeled examples, improving performance significantly and ensuring broader applicability across datasets without manual tuning (Figure 2).

Figure 2: Increasing the number of ACEs in the prompt improves performance. Notably, using the ViperGPT API yields only three correct ACEs in certain scenarios, so performance plateaus after a few examples.
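
A plausible way to realize this, consistent with the description above, is to run the zero-shot setup on the few labeled examples and keep only the query-program pairs whose execution yields the correct answer. The sketch below assumes hypothetical `answer_query_zero_shot` and `is_correct` callables and is not the paper's exact procedure.

```python
def generate_aces(labeled_examples, answer_query_zero_shot, is_correct, max_aces=16):
    """Build in-context examples (ACEs) from a few labeled (image, query, answer)
    triplets by keeping only programs whose execution matches the label.

    `answer_query_zero_shot(image, query)` is assumed to return
    (prediction, generated_program).
    """
    aces = []
    for image, query, answer in labeled_examples:
        prediction, program = answer_query_zero_shot(image, query)
        if is_correct(prediction, answer):    # e.g. exact match or an IoU threshold
            aces.append((query, program))     # reuse the verified program as an ACE
        if len(aces) >= max_aces:
            break
    return aces

def format_aces_for_prompt(aces):
    """Render verified (query, program) pairs as in-context examples."""
    return "\n\n".join(f"# Query: {q}\n{p}" for q, p in aces)
```

Because only verified programs enter the prompt, a weaker API naturally yields fewer usable ACEs, which matches the plateau noted in the Figure 2 caption.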

Self-Correction Mechanism

The framework includes a self-correction mechanism that allows the LLM to perform self-debugging and self-tuning. This enables the model to adjust dynamically without external feedback, enhancing resilience and reliability when executing complex visual reasoning tasks (Figure 3).

Figure 3: Increasing the number of "self-tuning" steps leads to improved performance. Our Abstract API (Abs. API) consistently outperforms the ViperGPT API (Vip. API). The best performance is achieved when using dynamic object detector thresholds in combination with ACEs.
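
A minimal sketch of such a loop is shown below, combining self-debugging (feeding the execution error back to the LLM) with self-tuning of the detector confidence threshold. The `generate_program` and `run_program` interfaces and the threshold schedule are assumptions for illustration, not the paper's exact mechanism.

```python
def run_with_self_correction(image, query, generate_program, run_program,
                             max_debug_rounds=2, thresholds=(0.5, 0.3, 0.1)):
    """Sketch of self-debugging plus "self-tuning" of the detection threshold.

    `generate_program(query, feedback)` asks the LLM for code, optionally with
    the previous error as feedback; `run_program(program, image, threshold)`
    executes it with a given object-detector confidence threshold.
    """
    feedback = None
    for _ in range(max_debug_rounds + 1):
        program = generate_program(query, feedback)
        # Self-tuning: retry with progressively lower detector thresholds,
        # e.g. when execution fails because nothing was detected.
        for threshold in thresholds:
            try:
                return run_program(program, image, threshold)
            except Exception as error:  # sketch only; real code should be stricter
                feedback = f"The previous program failed with: {error!r}"
    return None  # fall back to an "unanswerable" default
```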

Experimental Validation

The framework was evaluated on multiple datasets, including RefCOCO, RefCOCO+, GQA, and NExT-QA, showing consistent gains over the baseline LLM-as-controller setup. Each component of the framework was tested individually and in combination, demonstrating substantial improvements in IoU and accuracy metrics across diverse tasks (Figure 4).

Figure 4: Error diagrams for the ViperGPT API and our Abstract API, visualizing the percentage of samples whose IoU falls in each range and highlighting execution failures due to specific errors.

Implications and Future Directions

The implications of this research are profound, offering a pathway to truly zero-shot learning in visual reasoning tasks, thus reducing the need for extensive dataset-specific engineering. Future research could explore optimizing the set of abstract routines and further automating prompt generation to enhance task comprehension without manual intervention. Additionally, revisiting benchmarks and datasets to assess the generalization capabilities of LLMs is essential for sustained advancement in this field.

Conclusion

The paper presents a framework that redefines the use of LLMs in compositional visual reasoning tasks, overcoming historical limitations through its Abstract API design and the automation of in-context example generation. Together with the self-correction capabilities, these components make the approach more dynamic and adaptable, and point to promising directions for future work. This work moves towards truly zero-shot compositional reasoning, with LLMs as versatile programmers, paving the way for more generalized, robust AI systems.
