AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn

Abstract

Recent research on LLMs has led to remarkable advances in general-purpose NLP assistants. Some studies have further explored using LLMs to plan and invoke models or APIs to address more general multi-modal user queries. Despite this progress, complex visual tasks remain challenging due to their diverse nature. This diversity is reflected in two aspects: 1) Reasoning paths. For many real-life applications, it is hard to accurately decompose a query by examining the query alone; planning based on the specific visual content and the result of each step is usually required. 2) Flexible inputs and intermediate results. Input forms can be flexible for in-the-wild cases, involving not only a single image or video but a mixture of videos and images, e.g., a user-view image with several reference videos. Moreover, a complex reasoning process also generates diverse multi-modal intermediate results, e.g., video narrations and segmented video clips. To address such general cases, we propose a multi-modal AI assistant, AssistGPT, which uses an interleaved code-and-language reasoning approach called Plan, Execute, Inspect, and Learn (PEIL) to integrate LLMs with various tools. Specifically, the Planner uses natural language to decide which tool in the Executor should be invoked next based on the current reasoning progress. The Inspector is an efficient memory manager that assists the Planner in feeding the proper visual information into a specific tool. Finally, since the entire reasoning process is complex and flexible, a Learner is designed to enable the model to autonomously explore and discover the optimal solution. We conducted experiments on the A-OKVQA and NExT-QA benchmarks, achieving state-of-the-art results. Moreover, our showcases demonstrate that the system can handle questions far more complex than those found in these benchmarks.

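To illustrate how the four PEIL modules could fit together, the sketch below is a minimal, hypothetical Python rendering of the loop described in the abstract: a Planner that emits one tool call per step, an Executor that wraps external vision/language tools, an Inspector that records intermediate results with short textual summaries, and a Learner that self-checks the final answer. All class, method, and field names here (plan_step, run, register, check, etc.) are illustrative assumptions, not the paper's actual interfaces.

```python
# Minimal sketch of the PEIL (Plan, Execute, Inspect, Learn) loop, assuming
# hypothetical interfaces for the four modules; not the paper's actual code.
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional


@dataclass
class Inspector:
    """Records visual inputs and intermediate results (clips, narrations, ...)
    together with short textual summaries the Planner can read."""
    memory: Dict[str, Dict[str, Any]] = field(default_factory=dict)

    def register(self, key: str, value: Any, summary: str) -> None:
        self.memory[key] = {"value": value, "summary": summary}

    def summaries(self) -> str:
        return "\n".join(f"{k}: {v['summary']}" for k, v in self.memory.items())


class Executor:
    """Wraps external tools (captioner, detector, ASR, grounding, ...)."""
    def __init__(self, tools: Dict[str, Callable[..., Any]]):
        self.tools = tools

    def run(self, tool: str, **kwargs: Any) -> Any:
        return self.tools[tool](**kwargs)


def peil_answer(query: str, planner: Any, executor: Executor,
                inspector: Inspector, learner: Any,
                max_steps: int = 10, max_retries: int = 2) -> Optional[str]:
    """Plan-Execute-Inspect until the Planner emits a final answer; the
    Learner then self-checks and, on failure, the loop retries with feedback
    so a different plan can be explored."""
    feedback: Optional[str] = None
    answer: Optional[str] = None
    for _ in range(max_retries + 1):
        for _ in range(max_steps):
            # Plan: the LLM picks the next tool call (or a final answer)
            # from the query, memory summaries, and any prior feedback.
            step = planner.plan_step(query, inspector.summaries(), feedback)
            if step.is_final:
                answer = step.answer
                break
            # Execute: invoke the chosen tool with the referenced inputs.
            result = executor.run(step.tool, **step.arguments)
            # Inspect: store the intermediate result with a short summary.
            inspector.register(step.output_key, result, step.summary)
        # Learn: accept the answer if the self-check passes, else retry.
        if answer is not None and learner.check(query, answer):
            return answer
        feedback = f"Answer {answer!r} was judged unreasonable; revise the plan."
    return answer
```

In this reading, the Inspector's summaries are what let the Planner reason over intermediate multi-modal results without feeding raw images or video back into the LLM; the actual module interfaces and prompting details are specified in the paper itself.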
