Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic (2306.15195v2)
Abstract: In human conversations, individuals can indicate relevant regions within a scene while addressing others, and the other person can in turn respond by referring to specific regions if necessary. This natural referential ability in dialogue remains absent in current Multimodal LLMs (MLLMs). To fill this gap, this paper proposes an MLLM called Shikra, which can handle spatial coordinate inputs and outputs in natural language. Its architecture consists of a vision encoder, an alignment layer, and an LLM. It is designed to be straightforward and simple, requiring no extra vocabularies, position encoders, pre-/post-detection modules, or external plug-in models. All inputs and outputs are in natural-language form. Referential dialogue is a superset of various vision-language (VL) tasks: Shikra naturally handles location-related tasks such as REC and PointQA, as well as conventional VL tasks such as Image Captioning and VQA. Experimental results showcase Shikra's promising performance. It also enables numerous exciting applications, such as providing the coordinates of mentioned objects in chains of thought and comparing the similarity of user-pointed regions. Our code, model, and dataset are available at https://github.com/shikras/shikra.
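To make the described design concrete, below is a minimal sketch (not the authors' code) of the architecture the abstract outlines: a vision encoder, a single alignment layer projecting visual features into the LLM's embedding space, and spatial coordinates serialized as plain text so that no extra vocabulary, position encoder, or detection module is needed. All class and function names here (`SimpleMLLM`, `box_to_text`) are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of a Shikra-style MLLM: encode the image, project its features
# into the LLM embedding space, prepend them to the text embeddings, and let the
# LLM read/write box coordinates as ordinary text tokens.
import torch
import torch.nn as nn


class SimpleMLLM(nn.Module):
    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int, llm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder          # e.g. a ViT; assumed frozen
        self.align = nn.Linear(vision_dim, llm_dim)   # the alignment layer
        self.llm = llm                                # decoder-only language model

    def forward(self, image: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # vision_encoder is assumed to return patch features of shape (B, N, vision_dim)
        visual_tokens = self.align(self.vision_encoder(image))   # (B, N, llm_dim)
        inputs = torch.cat([visual_tokens, text_embeds], dim=1)  # prepend image tokens
        return self.llm(inputs)


def box_to_text(box, width, height, precision=3):
    """Serialize a pixel-space box as a plain-text coordinate string the model can
    read and write, e.g. '[0.412,0.150,0.889,0.642]' (normalized to image size)."""
    x1, y1, x2, y2 = box
    vals = [x1 / width, y1 / height, x2 / width, y2 / height]
    return "[" + ",".join(f"{v:.{precision}f}" for v in vals) + "]"


# Example: a referential-dialogue prompt with an inline region reference.
prompt = f"What is the person {box_to_text((330, 120, 711, 514), 800, 800)} holding?"
```

Because the coordinates are ordinary text, the same interface covers both directions: the user can point at a region in the prompt, and the model can answer with boxes of its own, which is what makes tasks like REC, PointQA, and grounded chain-of-thought fall out of a single sequence-to-sequence formulation.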