
Abstract

In recent years, the application of multimodal LLMs (MLLMs) in various fields has achieved remarkable success. However, as the foundation model for many downstream tasks, current MLLMs are built on the well-known Transformer network, whose quadratic computational complexity makes them less efficient. To improve the efficiency of such basic models, we propose Cobra, an MLLM with linear computational complexity. Specifically, Cobra integrates the efficient Mamba language model into the visual modality. Moreover, we explore and study various modal fusion schemes to create an effective multi-modal Mamba. Extensive experiments demonstrate that (1) Cobra achieves extremely competitive performance with current computationally efficient state-of-the-art methods, e.g., LLaVA-Phi, TinyLLaVA, and MobileVLM v2, and is faster thanks to its linear sequential modeling. (2) Interestingly, results on challenging closed-set prediction benchmarks show that Cobra performs well at overcoming visual illusions and judging spatial relationships. (3) Notably, Cobra even achieves comparable performance to LLaVA with about 43% of the number of parameters. We will make all of Cobra's code open-source and hope that the proposed method can facilitate future research on complexity problems in MLLMs. Our project page is available at: https://sites.google.com/view/cobravlm.

Figure: Cobra's architecture with a Mamba backbone. The identical Mamba blocks are shown in detail; the vision encoder's parameters are frozen during training.

Overview

  • Cobra is a multi-modal large language model (MLLM) that extends the Mamba language model to incorporate visual information, aiming for efficient inference with linear rather than quadratic computational cost.

  • The model delivers strong performance with improved computational efficiency, running faster than models such as LLaVA-Phi, TinyLLaVA, and MobileVLM v2 thanks to its linear-complexity design.

  • Various modality fusion strategies are explored to process visual and linguistic information efficiently, yielding a balanced and robust multimodal representation through input-dependent, selective state-space modeling.

  • Despite its strengths, Cobra struggles with recognizing text within images and is constrained by the numerical precision its architecture requires, marking areas for future improvement.

Extending Mamba with Cobra: A Leap Towards Efficient Multi-Modal Large Language Modeling

Introduction to Cobra

The paper introduces Cobra, a multi-modal large language model (MLLM) that extends the Mamba language model to incorporate visual information, enabling efficient inference. Cobra distinguishes itself by replacing the quadratic complexity of conventional Transformer networks with a linear-complexity approach. Its architecture integrates the visual modality with the Mamba model, and several modal fusion strategies are studied to strengthen the multimodal representation. Extensive experiments validate Cobra, showing a competitive edge in both performance and speed over existing state-of-the-art methods such as LLaVA-Phi, TinyLLaVA, and MobileVLM v2.
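
To make this pipeline concrete, below is a minimal sketch, assuming a generic frozen vision encoder, a projector, and a Mamba-style language backbone as described above; all class and argument names are illustrative stand-ins rather than the paper's released implementation.

    import torch
    import torch.nn as nn

    class CobraLikePipeline(nn.Module):
        """Sketch of the described flow: frozen vision encoder -> projector ->
        Mamba-style backbone over the fused token sequence (illustrative only)."""

        def __init__(self, vision_encoder: nn.Module, projector: nn.Module,
                     language_backbone: nn.Module, token_embedding: nn.Embedding):
            super().__init__()
            self.vision_encoder = vision_encoder
            self.projector = projector
            self.backbone = language_backbone
            self.embed = token_embedding
            # Per the figure caption, the vision encoder stays frozen during training.
            for p in self.vision_encoder.parameters():
                p.requires_grad = False

        def forward(self, pixel_values: torch.Tensor, input_ids: torch.Tensor):
            # Encode the image into patch features without tracking gradients.
            with torch.no_grad():
                image_feats = self.vision_encoder(pixel_values)   # (B, N_img, D_vis)
            # Map visual features into the language embedding space.
            image_tokens = self.projector(image_feats)            # (B, N_img, D_lm)
            text_tokens = self.embed(input_ids)                   # (B, N_txt, D_lm)
            # Prepend visual tokens to the text embeddings; the backbone then
            # processes the combined sequence with linear-time state-space layers.
            fused = torch.cat([image_tokens, text_tokens], dim=1)
            return self.backbone(fused)

In this reading, the projector simply maps visual features into the language model's embedding space; its exact form is one of the fusion design choices the paper studies.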

Efficiency and Performance

One of Cobra's significant achievements is strong performance combined with much better computational efficiency. The model leverages the linear complexity of the Mamba backbone, substantially reducing computational overhead without compromising the quality of multimodal integration. The reported results show that Cobra not only closely matches the performance of the significantly larger LLaVA, but also runs inference 3 to 4 times faster than MobileVLM v2 3B and TinyLLaVA 3B. Remarkably, it matches LLaVA's performance with only about 43% of its parameter count.
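
As a back-of-the-envelope account of where the speedup comes from (standard complexity figures for attention and state-space layers, not numbers reported in the paper): for a sequence of length L, hidden size d, and SSM state size N,

\[
\text{self-attention per layer: } \mathcal{O}(L^2 d), \qquad
\text{selective SSM per layer: } \mathcal{O}(L\,d\,N).
\]

During generation the gap widens further: a state-space model carries a fixed-size state, so each new token costs \(\mathcal{O}(d\,N)\), whereas attention must revisit a key-value cache that grows with the prefix length.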

Modality Fusion Strategies

The paper explores various modal fusion strategies to find an optimal balance between efficiently processing visual and linguistic information. These strategies underscore the importance of selecting and integrating visual encoders and projectors that align with the inherently efficient nature of the Mamba model. By adopting input-dependent coefficients within a selective SSM framework, Cobra dynamically adjusts its processing pathway, responding to the complexities of multimodal data.
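
The selective mechanism referenced here follows the standard Mamba formulation (restated from the Mamba literature, so the exact parameterization used in Cobra may differ in detail): the step size and the state-space coefficients are computed from the current input, which lets the recurrence emphasize or suppress individual tokens.

\[
\Delta_t = \mathrm{softplus}(W_\Delta x_t), \qquad B_t = W_B x_t, \qquad C_t = W_C x_t,
\]
\[
\bar{A}_t = \exp(\Delta_t A), \qquad \bar{B}_t \approx \Delta_t B_t,
\]
\[
h_t = \bar{A}_t h_{t-1} + \bar{B}_t x_t, \qquad y_t = C_t^{\top} h_t.
\]

Because \(\Delta_t\), \(B_t\), and \(C_t\) depend on \(x_t\), the state update is input-dependent (the "dynamic adjustment" the overview refers to) while the per-token cost stays constant.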

Experimental Validation

Cobra is validated carefully on prominent multimodal benchmarks, competing in both open-ended visual question answering tasks and closed-set prediction tasks, which highlight its ability to handle visual illusions and spatial relationships. The model's robustness is further evidenced by strong performance across benchmarks including GQA, VQA v2, and more specialized tasks that assess visual hallucination avoidance and spatial reasoning. Comparative analysis with state-of-the-art models shows Cobra delivering highly efficient yet accurate multimodal language modeling.

Limitations and Future Prospects

Despite its breakthroughs, Cobra encounters challenges in text recognition within images, signaling a direction for future enhancements. Additionally, the model's inference capabilities, although superior, are bounded by the precision requirements of its underlying architecture. This limitation introduces a discussion point on model optimization and reduction techniques that could potentially facilitate Cobra's deployment on resource-constrained platforms without degrading performance.

Conclusion

Cobra sets a new precedent in multimodal language modeling, advancing the frontier with its efficient inference and strong performance. By addressing the computational complexity limitations inherent in traditional Transformer-based models, it offers a viable path toward more sustainable, yet powerful, MLLMs. The implications of Cobra extend beyond academic interest, hinting at potential applications in real-world scenarios that demand high-frequency processing of visual information with linguistic context. With the authors planning to open-source Cobra, the work invites further research on complexity problems in MLLMs, promising enriching contributions to the field of artificial intelligence.
