
Abstract

We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. While stronger language models can enhance multimodal capabilities, the design choices for vision components are often insufficiently explored and disconnected from visual representation learning research. This gap hinders accurate sensory grounding in real-world scenarios. Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations, offering new insights into different models and architectures -- self-supervised, strongly supervised, or combinations thereof -- based on experiments with over 20 vision encoders. We critically examine existing MLLM benchmarks, addressing the difficulties involved in consolidating and interpreting results from various tasks, and introduce a new vision-centric benchmark, CV-Bench. To further improve visual grounding, we propose the Spatial Vision Aggregator (SVA), a dynamic and spatially-aware connector that integrates high-resolution vision features with LLMs while reducing the number of tokens. Additionally, we discuss the curation of high-quality visual instruction-tuning data from publicly available sources, emphasizing the importance of data source balancing and distribution ratio. Collectively, Cambrian-1 not only achieves state-of-the-art performance but also serves as a comprehensive, open cookbook for instruction-tuned MLLMs. We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes. We hope our release will inspire and accelerate advancements in multimodal systems and visual representation learning.

Cambrian-1 outperforms other open-source models, especially in OCR & Chart and Vision-Centric benchmarks.

Overview

  • The paper introduces Cambrian-1, a set of multimodal LLMs that focus on enhancing visual representation to improve real-world task performance.

  • Key innovations include the exploration and evaluation of various vision encoder designs, the introduction of the Spatial Vision Aggregator (SVA), and the creation of the comprehensive CV-Bench for evaluating visual grounding capabilities.

  • The study emphasizes the importance of well-curated instruction-tuning data and addresses the 'answer machine' phenomenon, in which models default to terse benchmark-style answers, preserving stronger conversational and reasoning capabilities in MLLMs.

Cambrian-1: A Vision-Centric Approach to Multimodal LLMs

The paper introduces Cambrian-1, a family of multimodal LLMs (MLLMs) built around a vision-centric design. The research emphasizes the substantial yet often neglected role of visual representation in the performance and real-world grounding of MLLMs. It addresses the disconnect between current language-centric approaches and the need for robust visual representation learning, proposing a comprehensive framework that explores and evaluates vision encoder designs, connector architectures, and instruction-tuning data curation strategies.

Key Contributions and Findings

Vision Encoder Exploration

The research assesses multiple vision encoder designs, ranging from language-supervised models like CLIP and SigLIP to self-supervised models such as DINOv2. The evaluation reveals that while language-supervised models generally outperform others in most benchmark categories, self-supervised models like DINOv2 demonstrate competitive performance in vision-centric tasks. This suggests that improvements in self-supervised models and training them on larger datasets could bridge the performance gap with language-supervised models.
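
To make this setup concrete, the sketch below shows the general pattern of using the LLM plus visual instruction tuning as an interface for comparing encoders: each candidate encoder is frozen, its patch tokens are extracted, and a small projector maps them into the LLM embedding space. The checkpoint names, the two-layer MLP projector, and the assumed LLM hidden size are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch (not the authors' code): swapping vision encoders behind a
# common interface, then projecting patch tokens into the LLM embedding space.
import torch
import torch.nn as nn
from transformers import AutoImageProcessor, AutoModel, CLIPVisionModel

def load_encoder(name: str):
    """Return (processor, model) for a couple of candidate encoders (illustrative checkpoints)."""
    if name == "clip":
        ckpt = "openai/clip-vit-large-patch14-336"
        return AutoImageProcessor.from_pretrained(ckpt), CLIPVisionModel.from_pretrained(ckpt)
    if name == "dinov2":
        ckpt = "facebook/dinov2-large"
        return AutoImageProcessor.from_pretrained(ckpt), AutoModel.from_pretrained(ckpt)
    raise ValueError(f"unknown encoder: {name}")

class MLPProjector(nn.Module):
    """Maps vision patch tokens to the LLM hidden size (4096 is an assumed value)."""
    def __init__(self, vision_dim: int, llm_dim: int = 4096):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(vision_dim, llm_dim), nn.GELU(),
                                 nn.Linear(llm_dim, llm_dim))
    def forward(self, x):
        return self.net(x)

@torch.no_grad()
def encode_image(image, processor, encoder, projector):
    pixels = processor(images=image, return_tensors="pt").pixel_values
    tokens = encoder(pixel_values=pixels).last_hidden_state  # (1, num_patches[+cls], dim)
    return projector(tokens)  # visual tokens to prepend to the LLM input
```

Only the encoder and projector change between runs; the LLM, instruction-tuning data, and evaluation suite stay fixed, so benchmark differences can be attributed to the visual representation.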

Spatial Vision Aggregator (SVA)

A novel connector, the Spatial Vision Aggregator (SVA), is introduced to integrate high-resolution visual features into the LLM. SVA incorporates spatial inductive biases and cross-attends to the vision features multiple times across different LLM layers, significantly improving performance, especially on OCR and vision-centric tasks. The design also condenses the visual tokens, reducing computational overhead while maintaining high performance.
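
The following is a simplified, illustrative sketch of an SVA-style connector rather than the paper's implementation: a learnable G x G grid of queries cross-attends to the matching spatial window of each encoder's feature map, so the number of visual tokens passed to the LLM is fixed at G*G regardless of feature resolution. Square feature maps, a single aggregation step, and the dimensions used are simplifying assumptions.

```python
# Illustrative SVA-style connector sketch: spatially local cross-attention
# from a fixed query grid onto multiple vision feature maps.
import torch
import torch.nn as nn

class SpatialAggregator(nn.Module):
    def __init__(self, vision_dims, llm_dim=4096, grid=24, heads=8):
        super().__init__()
        self.grid = grid
        self.queries = nn.Parameter(torch.randn(grid * grid, llm_dim) * 0.02)
        self.proj = nn.ModuleList([nn.Linear(d, llm_dim) for d in vision_dims])
        self.attn = nn.MultiheadAttention(llm_dim, heads, batch_first=True)

    def forward(self, feature_maps):
        # feature_maps: list of (B, H_i, W_i, D_i), each H_i = W_i divisible by `grid`
        windows = []
        for fm, proj in zip(feature_maps, self.proj):
            B, H, W, _ = fm.shape
            s = H // self.grid                      # local window side for this encoder
            fm = proj(fm)
            # split the map into grid x grid windows: (B, G, s, G, s, D) -> (B, G*G, s*s, D)
            fm = fm.view(B, self.grid, s, self.grid, s, -1).permute(0, 1, 3, 2, 4, 5)
            windows.append(fm.reshape(B, self.grid * self.grid, s * s, -1))
        keys = torch.cat(windows, dim=2)            # concat encoders within each window
        q = self.queries.unsqueeze(0).expand(keys.shape[0], -1, -1)
        GG, L, D = keys.shape[1:]
        B = keys.shape[0]
        # attend per spatial window by folding the window axis into the batch
        out, _ = self.attn(q.reshape(B * GG, 1, D),
                           keys.reshape(B * GG, L, D),
                           keys.reshape(B * GG, L, D))
        return out.reshape(B, GG, D)                # G*G visual tokens for the LLM
```

In the paper, this kind of aggregation is applied not only before the LLM but repeatedly at deeper LLM layers, which the one-shot sketch above omits.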

Comprehensive Benchmarking: CV-Bench

The paper proposes CV-Bench, a new vision-centric benchmark designed to assess the capabilities of MLLMs in fundamental 2D and 3D visual understanding tasks. CV-Bench repurposes standard vision benchmarks into VQA format, ensuring a thorough evaluation of the model's visual grounding capabilities. The benchmark includes tasks like spatial relationships, object counting, depth order, and relative distance.
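
As an illustration of how such repurposing might look, the sketch below turns a detection-style annotation into a multiple-choice counting question. The prompt wording, option construction, and record schema are assumptions for illustration; CV-Bench's actual formats may differ.

```python
# Illustrative conversion of a detection-style annotation into a
# multiple-choice VQA counting item (not CV-Bench's exact format).
import random

def make_count_question(image_id: str, annotations: list, category: str) -> dict:
    """annotations: COCO-style dicts, each with a 'category' field."""
    count = sum(1 for a in annotations if a["category"] == category)
    # build distractor options around the true count
    options = sorted({count, max(0, count - 1), count + 1, count + 2})
    random.shuffle(options)
    letters = ["A", "B", "C", "D"]
    return {
        "image": image_id,
        "question": f"How many {category}s are in the image? Answer with the option letter.",
        "choices": {letter: str(opt) for letter, opt in zip(letters, options)},
        "answer": letters[options.index(count)],
    }

example = make_count_question(
    "coco_000000139.jpg",
    [{"category": "chair"}, {"category": "chair"}, {"category": "person"}],
    "chair",
)
```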

Instruction Tuning Data

Data curation proves critical in training MLLMs. The study constructs Cambrian-10M, a large and diverse dataset amalgamating various VQA, OCR, and interaction datasets. By applying systematic data filtering and balancing techniques, the researchers create Cambrian-7M, a more efficient and performance-optimized dataset mix. The results reaffirm the importance of well-balanced, targeted data selection for strong performance across general, knowledge-based, and vision-centric benchmarks.
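
A minimal sketch of per-source balancing in this spirit is shown below: each source's contribution is capped at a threshold so that a few very large sources cannot dominate the mixture. The cap value and the record schema are placeholders, not the paper's exact filtering pipeline.

```python
# Minimal per-source balancing sketch: cap each data source's contribution
# at a threshold before mixing (cap value is a placeholder).
import random
from collections import defaultdict

def balance_sources(samples: list, cap: int = 250_000, seed: int = 0) -> list:
    """samples: instruction-tuning records, each a dict with a 'source' field."""
    rng = random.Random(seed)
    by_source = defaultdict(list)
    for s in samples:
        by_source[s["source"]].append(s)
    mixture = []
    for source, items in by_source.items():
        rng.shuffle(items)
        mixture.extend(items[:cap])   # keep at most `cap` records from this source
    rng.shuffle(mixture)
    return mixture
```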

Alleviating the "Answer Machine" Phenomenon

To enhance the conversational and reasoning capabilities of MLLMs, the researchers incorporate system prompts in the instruction-tuning data, addressing the tendency of models to provide overly concise responses. This approach maintains high benchmark performance while improving the models' ability to produce more comprehensive, engaged responses in conversation.
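
A hedged sketch of the idea: short-answer training examples get an explicit response-format prompt appended, so terse outputs are tied to an instruction rather than becoming the model's default style. The prompt wording and record schema below are illustrative assumptions.

```python
# Illustrative sketch: tag short-answer examples with an explicit
# response-format instruction during instruction-tuning data preparation.
SHORT_ANSWER_PROMPT = "Answer the question using a single word or phrase."

def add_system_prompt(example: dict) -> dict:
    """example: {'question': str, 'answer': str, 'answer_type': str} (assumed schema)."""
    if example["answer_type"] == "short":   # e.g. VQA-style one-word answers
        example["question"] = f"{example['question']}\n{SHORT_ANSWER_PROMPT}"
    return example
```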

Practical and Theoretical Implications

Cambrian-1 represents a significant advancement in the construction and evaluation of MLLMs with a vision-centric approach. The findings underscore the necessity of integrating robust visual representations to achieve substantial improvements in multimodal understanding and performance in real-world applications.

Practically, the provision of open-source code, datasets, and detailed recipes promises to expedite future research and development in this domain, fostering an open research community. The paper's insights into data curation, vision encoder designs, and connector architectures provide a solid foundation for developing more sophisticated and capable multimodal systems.

Theoretically, the introduction of CV-Bench and the emphasis on vision-centric benchmarks highlight the importance of rethinking multimodal evaluation protocols to reflect the diverse and complex challenges of real-world perception tasks. This shift will guide future research towards more holistic and integrated model designs, ultimately contributing to the progress of visual representation learning and multimodal AI.

Future Directions

The research opens multiple avenues for future exploration. Improving the scalability and efficiency of connector modules such as the SVA could further condense high-resolution visual information into a manageable number of tokens without sacrificing performance. Additionally, exploring reinforcement learning for post-training alignment is an exciting prospect for refining model capabilities beyond the limits of supervised fine-tuning.

Further improving and expanding the data curation pipeline, alongside developing even more comprehensive and diversified benchmarks, will be essential to push the boundaries of MLLMs. As the complexities of real-world applications continue to evolve, the insights and methodologies presented in Cambrian-1 will play a pivotal role in shaping the future landscape of multimodal AI systems.

In conclusion, Cambrian-1 marks a substantial progression in the field of multimodal AI, presenting a vision-centric framework that addresses the gaps in current MLLM research. Through rigorous evaluation, innovative design, and thoughtful data curation, this work sets a new standard for the development and assessment of multimodal systems.
