
Abstract

In recent years, the application of multimodal LLMs (MLLMs) in various fields has achieved remarkable success. However, as the foundation model for many downstream tasks, current MLLMs are built on the well-known Transformer network, whose quadratic computational complexity makes them less efficient. To improve the efficiency of such basic models, we propose Cobra, an MLLM with linear computational complexity. Specifically, Cobra integrates the efficient Mamba language model into the visual modality. Moreover, we explore and study various modal fusion schemes to create an effective multi-modal Mamba. Extensive experiments demonstrate that (1) Cobra achieves extremely competitive performance with current computationally efficient state-of-the-art methods, e.g., LLaVA-Phi, TinyLLaVA, and MobileVLM v2, and is faster thanks to its linear sequential modeling. (2) Interestingly, results on challenging closed-set prediction benchmarks show that Cobra performs well at overcoming visual illusions and judging spatial relationships. (3) Notably, Cobra even achieves comparable performance to LLaVA with about 43% of the number of parameters. We will make all of Cobra's code open-source and hope that the proposed method can facilitate future research on complexity problems in MLLMs. Our project page is available at: https://sites.google.com/view/cobravlm.

Figure: Cobra's architecture with a Mamba backbone. The identical Mamba blocks are shown in detail; the vision encoder's parameters are frozen during training.

Overview

  • Cobra is a multi-modal large language model (MLLM) that extends the Mamba language model to incorporate visual information, aiming for efficient inference with linear rather than quadratic computational cost.

  • The model delivers strong performance with improved computational efficiency, running faster than models such as LLaVA-Phi, TinyLLaVA, and MobileVLM v2 thanks to its linear-complexity design.

  • Various modality fusion strategies are explored to process visual and linguistic information efficiently, yielding a balanced and robust multimodal representation through input-dependent, selective state-space modeling.

  • Despite its strengths, Cobra struggles with recognizing text within images and is constrained by the numerical precision its architecture requires, marking areas for future improvement.

Extending Mamba with Cobra: A Leap Towards Efficient Multi-Modal Large Language Modeling

Introduction to Cobra

The paper introduces Cobra, a multi-modal large language model (MLLM) that extends the Mamba language model to incorporate visual information, enabling efficient inference. Cobra distinguishes itself by replacing the quadratic complexity of conventional Transformer networks with a linear-complexity approach. Its architecture integrates the visual modality with the Mamba model, and several modal fusion strategies are studied to strengthen the multimodal representation. Extensive experiments validate Cobra, showing a competitive edge in both performance and speed over existing state-of-the-art methods such as LLaVA-Phi, TinyLLaVA, and MobileVLM v2.
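
To make this pipeline concrete, below is a minimal sketch, assuming a generic frozen vision encoder, a projector, and a Mamba-style language backbone as described above; all class and argument names are illustrative stand-ins rather than the paper's released implementation.

    import torch
    import torch.nn as nn

    class CobraLikePipeline(nn.Module):
        """Sketch of the described flow: frozen vision encoder -> projector ->
        Mamba-style backbone over the fused token sequence (illustrative only)."""

        def __init__(self, vision_encoder: nn.Module, projector: nn.Module,
                     language_backbone: nn.Module, token_embedding: nn.Embedding):
            super().__init__()
            self.vision_encoder = vision_encoder
            self.projector = projector
            self.backbone = language_backbone
            self.embed = token_embedding
            # Per the figure caption, the vision encoder stays frozen during training.
            for p in self.vision_encoder.parameters():
                p.requires_grad = False

        def forward(self, pixel_values: torch.Tensor, input_ids: torch.Tensor):
            # Encode the image into patch features without tracking gradients.
            with torch.no_grad():
                image_feats = self.vision_encoder(pixel_values)   # (B, N_img, D_vis)
            # Map visual features into the language embedding space.
            image_tokens = self.projector(image_feats)            # (B, N_img, D_lm)
            text_tokens = self.embed(input_ids)                   # (B, N_txt, D_lm)
            # Prepend visual tokens to the text embeddings; the backbone then
            # processes the combined sequence with linear-time state-space layers.
            fused = torch.cat([image_tokens, text_tokens], dim=1)
            return self.backbone(fused)

In this reading, the projector simply maps visual features into the language model's embedding space; its exact form is one of the fusion design choices the paper studies.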

Efficiency and Performance

One of Cobra's significant achievements is strong performance combined with much better computational efficiency. The model leverages the linear complexity of the Mamba backbone, substantially reducing computational overhead without compromising the quality of multimodal integration. The reported results show that Cobra not only closely matches the performance of the significantly larger LLaVA, but also runs inference 3 to 4 times faster than MobileVLM v2 3B and TinyLLaVA 3B. Remarkably, it matches LLaVA's performance with only about 43% of its parameter count.
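
As a back-of-the-envelope account of where the speedup comes from (standard complexity figures for attention and state-space layers, not numbers reported in the paper): for a sequence of length L, hidden size d, and SSM state size N,

\[
\text{self-attention per layer: } \mathcal{O}(L^2 d), \qquad
\text{selective SSM per layer: } \mathcal{O}(L\,d\,N).
\]

During generation the gap widens further: a state-space model carries a fixed-size state, so each new token costs \(\mathcal{O}(d\,N)\), whereas attention must revisit a key-value cache that grows with the prefix length.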

Modality Fusion Strategies

The paper explores various modal fusion strategies to find an optimal balance between efficiently processing visual and linguistic information. These strategies underscore the importance of selecting and integrating visual encoders and projectors that align with the inherently efficient nature of the Mamba model. By adopting input-dependent coefficients within a selective SSM framework, Cobra dynamically adjusts its processing pathway, responding to the complexities of multimodal data.
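
The selective mechanism referenced here follows the standard Mamba formulation (restated from the Mamba literature, so the exact parameterization used in Cobra may differ in detail): the step size and the state-space coefficients are computed from the current input, which lets the recurrence emphasize or suppress individual tokens.

\[
\Delta_t = \mathrm{softplus}(W_\Delta x_t), \qquad B_t = W_B x_t, \qquad C_t = W_C x_t,
\]
\[
\bar{A}_t = \exp(\Delta_t A), \qquad \bar{B}_t \approx \Delta_t B_t,
\]
\[
h_t = \bar{A}_t h_{t-1} + \bar{B}_t x_t, \qquad y_t = C_t^{\top} h_t.
\]

Because \(\Delta_t\), \(B_t\), and \(C_t\) depend on \(x_t\), the state update is input-dependent (the "dynamic adjustment" the overview refers to) while the per-token cost stays constant.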

Experimental Validation

Cobra is validated carefully on prominent multimodal benchmarks, competing in both open-ended visual question answering tasks and closed-set prediction tasks, which highlight its ability to handle visual illusions and spatial relationships. The model's robustness is further evidenced by strong performance across benchmarks including GQA, VQA v2, and more specialized tasks that assess visual hallucination avoidance and spatial reasoning. Comparative analysis with state-of-the-art models shows Cobra delivering highly efficient yet accurate multimodal language modeling.

Limitations and Future Prospects

Despite its breakthroughs, Cobra encounters challenges in text recognition within images, signaling a direction for future enhancements. Additionally, the model's inference capabilities, although superior, are bounded by the precision requirements of its underlying architecture. This limitation introduces a discussion point on model optimization and reduction techniques that could potentially facilitate Cobra's deployment on resource-constrained platforms without degrading performance.

Conclusion

Cobra sets a new precedent in multimodal language modeling, advancing the frontier with its efficient inference and strong performance. By addressing the computational complexity limitations inherent in traditional Transformer-based models, it offers a viable path toward more sustainable, yet powerful, MLLMs. The implications of Cobra extend beyond academic interest, hinting at potential applications in real-world scenarios that demand high-frequency processing of visual information with linguistic context. With the authors planning to open-source Cobra, the work invites further research on complexity problems in MLLMs, promising enriching contributions to the field of artificial intelligence.
