Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic (2306.15195v2)
Abstract: In human conversations, individuals can indicate relevant regions within a scene while addressing others, and the other person can in turn respond by referring to specific regions if necessary. This natural referential ability in dialogue remains absent in current Multimodal LLMs (MLLMs). To fill this gap, this paper proposes an MLLM called Shikra, which can handle spatial coordinate inputs and outputs in natural language. Its architecture consists of a vision encoder, an alignment layer, and an LLM. It is designed to be straightforward and simple, requiring no extra vocabularies, position encoders, pre-/post-detection modules, or external plug-in models. All inputs and outputs are in natural-language form. Referential dialogue is a superset of various vision-language (VL) tasks: Shikra naturally handles location-related tasks such as REC and PointQA, as well as conventional VL tasks such as Image Captioning and VQA. Experimental results showcase Shikra's promising performance. It also enables numerous exciting applications, such as providing the coordinates of mentioned objects in chains of thought and comparing the similarity of user-pointed regions. Our code, model, and dataset are available at https://github.com/shikras/shikra.
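To make the described design concrete, below is a minimal sketch (not the authors' code) of the architecture the abstract outlines: a vision encoder, a single alignment layer projecting visual features into the LLM's embedding space, and spatial coordinates serialized as plain text so that no extra vocabulary, position encoder, or detection module is needed. All class and function names here (`SimpleMLLM`, `box_to_text`) are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of a Shikra-style MLLM: encode the image, project its features
# into the LLM embedding space, prepend them to the text embeddings, and let the
# LLM read/write box coordinates as ordinary text tokens.
import torch
import torch.nn as nn


class SimpleMLLM(nn.Module):
    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int, llm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder          # e.g. a ViT; assumed frozen
        self.align = nn.Linear(vision_dim, llm_dim)   # the alignment layer
        self.llm = llm                                # decoder-only language model

    def forward(self, image: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # vision_encoder is assumed to return patch features of shape (B, N, vision_dim)
        visual_tokens = self.align(self.vision_encoder(image))   # (B, N, llm_dim)
        inputs = torch.cat([visual_tokens, text_embeds], dim=1)  # prepend image tokens
        return self.llm(inputs)


def box_to_text(box, width, height, precision=3):
    """Serialize a pixel-space box as a plain-text coordinate string the model can
    read and write, e.g. '[0.412,0.150,0.889,0.642]' (normalized to image size)."""
    x1, y1, x2, y2 = box
    vals = [x1 / width, y1 / height, x2 / width, y2 / height]
    return "[" + ",".join(f"{v:.{precision}f}" for v in vals) + "]"


# Example: a referential-dialogue prompt with an inline region reference.
prompt = f"What is the person {box_to_text((330, 120, 711, 514), 800, 800)} holding?"
```

Because the coordinates are ordinary text, the same interface covers both directions: the user can point at a region in the prompt, and the model can answer with boxes of its own, which is what makes tasks like REC, PointQA, and grounded chain-of-thought fall out of a single sequence-to-sequence formulation.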