VCoder: Versatile Vision Encoders for Multimodal Large Language Models (2312.14233v1)
Abstract: Humans possess the remarkable skill of Visual Perception, the ability to see and understand the seen, helping them make sense of the visual world and, in turn, reason. Multimodal Large Language Models (MLLMs) have recently achieved impressive performance on vision-language tasks ranging from visual question-answering and image captioning to visual reasoning and image generation. However, when prompted to identify or count (perceive) the entities in a given image, existing MLLM systems fail. Working towards developing an accurate MLLM system for perception and reasoning, we propose using Versatile vision enCoders (VCoder) as perception eyes for Multimodal LLMs. First, we feed the VCoder with perception modalities such as segmentation or depth maps, improving the MLLM's perception abilities. Second, we leverage the images from COCO and outputs from off-the-shelf vision perception models to create our COCO Segmentation Text (COST) dataset for training and evaluating MLLMs on the object perception task. Third, we introduce metrics to assess the object perception abilities of MLLMs on our COST dataset. Lastly, we provide extensive experimental evidence demonstrating the VCoder's improved object-level perception skills over existing Multimodal LLMs, including GPT-4V. We open-source our dataset, code, and models to promote research; the code is available at https://github.com/SHI-Labs/VCoder
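The abstract mentions metrics for assessing object perception on the COST dataset but does not define them here. As a minimal sketch of what such per-image scoring could look like, the Python snippet below compares per-category object counts parsed from a model's answer against COCO-style ground-truth counts, rewarding correctly counted instances and penalizing hallucinated ones. The function names, signatures, and exact scoring rules are illustrative assumptions, not the paper's definitions.

```python
def count_score(predicted: dict[str, int], ground_truth: dict[str, int]) -> float:
    """Fraction of ground-truth object instances the model accounted for.

    Illustrative assumption, not the paper's exact metric. Both arguments map
    category names (e.g. "person", "car") to instance counts for one image.
    """
    if not ground_truth:
        return 1.0 if not predicted else 0.0
    total_gt = sum(ground_truth.values())
    matched = sum(min(predicted.get(cat, 0), n) for cat, n in ground_truth.items())
    return matched / total_gt


def hallucination_score(predicted: dict[str, int], ground_truth: dict[str, int]) -> float:
    """Fraction of predicted instances with no ground-truth counterpart (lower is better).

    Also an illustrative assumption; it captures the object-hallucination
    failure mode the abstract describes (naming entities that are not there).
    """
    total_pred = sum(predicted.values())
    if total_pred == 0:
        return 0.0
    extra = sum(max(n - ground_truth.get(cat, 0), 0) for cat, n in predicted.items())
    return extra / total_pred


if __name__ == "__main__":
    gt = {"person": 3, "dog": 1}
    pred = {"person": 2, "dog": 1, "cat": 1}  # misses one person, hallucinates a cat
    print(count_score(pred, gt))          # 0.75
    print(hallucination_score(pred, gt))  # 0.25
```

In an actual evaluation, `predicted` would be extracted from the MLLM's free-form answer (e.g. by parsing "2 people, 1 dog, 1 cat"), and the per-image scores would be averaged over the evaluation split.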
- Jitesh Jain
- Jianwei Yang
- Humphrey Shi