4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities

(2406.09406)
Published Jun 13, 2024 in cs.CV, cs.AI, and cs.LG

Abstract

Current multimodal and multitask foundation models like 4M or UnifiedIO show promising results, but in practice their out-of-the-box abilities to accept diverse inputs and perform diverse tasks are limited by the (usually rather small) number of modalities and tasks they are trained on. In this paper, we expand upon their capabilities by training a single model on tens of highly diverse modalities and by performing co-training on large-scale multimodal datasets and text corpora. This includes training on several semantic and geometric modalities, feature maps from recent state-of-the-art models like DINOv2 and ImageBind, pseudo labels of specialist models like SAM and 4DHumans, and a range of new modalities that allow for novel ways to interact with the model and steer the generation, for example image metadata or color palettes. A crucial step in this process is performing discrete tokenization on various modalities, whether they are image-like, neural network feature maps, vectors, structured data like instance segmentation or human poses, or data that can be represented as text. Through this, we expand on the out-of-the-box capabilities of multimodal models and specifically show the possibility of training one model to solve at least 3x more tasks/modalities than existing ones and doing so without a loss in performance. This enables more fine-grained and controllable multimodal generation capabilities and allows us to study the distillation of models trained on diverse data and objectives into a unified model. We successfully scale the training to a three billion parameter model using tens of modalities and different datasets. The resulting models and training code are open sourced at 4m.epfl.ch.

Fine-grained multimodal generation with human poses, polygon edits, and improved text understanding using diverse modalities.

Overview

  • The paper '4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities' presents the 4M-21 model, a unified approach to handling a wide array of vision tasks and modalities without the performance degradation that multitask models typically show relative to specialized ones.

  • The authors employ a versatile pre-training scheme that converts diverse data types into sequences of discrete tokens using modality-specific tokenizers; the tokens are then processed by a shared Transformer architecture, giving a uniform representation that supports tasks such as image segmentation, depth estimation, and more.

  • Evaluations demonstrate the model's robustness across various benchmarks, significantly outperforming standard baselines in tasks like surface normal estimation, depth estimation, and semantic segmentation, suggesting practical applications in areas like autonomous systems and multi-modal data retrieval.

4M-21: An Advanced Model for Vision Tasks and Modalities

The research paper "4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities" by Roman Bachmann et al. introduces a state-of-the-art unified model, 4M-21, capable of addressing a significantly broader set of tasks and modalities than existing models. The study expands on previous work by training a single model on a diverse and extensive range of modalities, and it does so without the loss in performance that multitask training typically incurs relative to single-task or few-task specialized models.

Methodology

The authors adopt a comprehensive pre-training scheme in which diverse modalities are first converted into sequences of discrete tokens using modality-specific tokenizers. This tokenization approach supports image-like data, feature maps, textual data, and structured data such as human poses and segmentation instances. The tokens are then processed by a shared Transformer architecture, whose uniform representation space is what gives the model its flexibility.
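As a rough illustration of this flow, the sketch below maps each modality's discrete token ids into one shared embedding table (with a per-modality id offset) and runs the concatenated sequence through a single Transformer encoder. The class, vocabulary sizes, and hyperparameters are illustrative assumptions, not the released 4M-21 architecture.

```python
# Minimal sketch (not the 4M-21 implementation): every modality arrives as a
# sequence of discrete token ids from its own tokenizer, and all sequences are
# embedded into one space and processed by a shared Transformer.
import torch
import torch.nn as nn

class ToySharedBackbone(nn.Module):
    def __init__(self, vocab_sizes: dict, d_model: int = 256):
        super().__init__()
        # One contiguous vocabulary: each modality gets its own id range.
        self.offsets, total = {}, 0
        for name, size in vocab_sizes.items():
            self.offsets[name] = total
            total += size
        self.embed = nn.Embedding(total, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)

    def forward(self, token_ids: dict) -> torch.Tensor:
        # token_ids: modality name -> LongTensor of shape (batch, seq_len)
        pieces = [ids + self.offsets[name] for name, ids in token_ids.items()]
        x = self.embed(torch.cat(pieces, dim=1))   # (batch, total_len, d_model)
        return self.encoder(x)                     # unified token representation

# Usage with dummy tokens standing in for tokenizer outputs:
model = ToySharedBackbone({"rgb": 1024, "depth": 1024, "caption": 512})
tokens = {
    "rgb": torch.randint(0, 1024, (2, 196)),
    "depth": torch.randint(0, 1024, (2, 196)),
    "caption": torch.randint(0, 512, (2, 32)),
}
features = model(tokens)   # shape (2, 196 + 196 + 32, 256)
```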

The various modalities include, but are not limited to, the following (a tokenizer-mapping sketch follows the list):

  • Standard image modalities: RGB images, surface normals, depth maps.
  • High-level features: Feature maps from recent state-of-the-art models like DINOv2 and ImageBind.
  • Specialist outputs and structured data: 3D human poses (e.g., 4DHumans), SAM instances, edges, color palettes.
  • Text and metadata: Captions, T5-XXL embeddings, and a variety of image-derived metadata.
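To make this grouping concrete, the snippet below sketches one plausible mapping from modality names to tokenizer families: spatial VQ-style tokenizers for dense image-like maps, text-style tokenization for sequence data, and small learned quantizers for global embeddings and poses. The names and assignments are assumptions for illustration; the exact tokenizer choices are described in the paper and the released code at 4m.epfl.ch.

```python
# Illustrative grouping only, not the authoritative configuration.
TOKENIZER_FAMILY = {
    # Dense, image-like maps -> spatial VQ-style tokenizers (grid of codes)
    "rgb": "spatial_vq",
    "depth": "spatial_vq",
    "surface_normals": "spatial_vq",
    "semantic_segmentation": "spatial_vq",
    "edges": "spatial_vq",
    # Sequence-like data -> text-style tokenization
    "caption": "text",
    "bounding_boxes": "text",
    "color_palette": "text",
    "image_metadata": "text",
    # Global feature vectors and structured poses -> learned vector quantizers
    "dinov2_global_embedding": "vector_quantizer",
    "imagebind_global_embedding": "vector_quantizer",
    "human_pose": "vector_quantizer",
}

def tokenizer_for(modality: str) -> str:
    """Return the (assumed) tokenizer family for a modality name."""
    return TOKENIZER_FAMILY.get(modality, "unknown")
```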

Importantly, the 4M-21 model also integrates modalities such as metadata, enhancing its ability to generate and retrieve information grounded in nuanced semantic and contextual details.
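The sketch below illustrates, at a high level, how such conditioning could drive any-to-any generation: whatever modality tokens are available (for example a caption plus a color palette) condition the step-by-step decoding of a target modality's tokens. The `model.predict_next` interface is a hypothetical stand-in, not the released 4M API.

```python
# Hedged sketch of conditional, token-by-token generation of one target modality.
import torch

@torch.no_grad()
def generate_modality(model, conditioning: dict, target: str,
                      target_len: int, temperature: float = 1.0) -> torch.Tensor:
    """Sample `target_len` discrete tokens of `target`, given conditioning tokens."""
    generated = torch.empty(0, dtype=torch.long)
    for _ in range(target_len):
        # Hypothetical interface: a distribution over the target modality's
        # vocabulary, given the conditioning and what has been generated so far.
        logits = model.predict_next(conditioning, target, generated)  # (vocab_size,)
        probs = torch.softmax(logits / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        generated = torch.cat([generated, next_token])
    return generated  # handed to the target modality's detokenizer afterwards
```

For instance, calling this with `{"caption": cap_tokens, "color_palette": palette_tokens}` as conditioning and `"depth"` as the target would yield a token map for the depth detokenizer; chaining such calls across modalities roughly mirrors the kind of steerable, fine-grained generation the paper describes.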

Performance and Capabilities

The authors provide thorough evaluations demonstrating that 4M-21 excels in tasks traditionally challenging for multitask models, without the common pitfalls of negative transfer and performance degradation. The performance metrics on datasets such as DIODE for surface normal and depth estimation, COCO for semantic and instance segmentation, ImageNet-1K for kNN retrieval, and 3DPW for 3D human keypoint estimation affirm the model's robust generalization capabilities.

Key findings include:

  • Surface normal estimation: The 4M-21 XL model outperformed baselines with a mean angular error of 20.8°.
  • Depth estimation: The mean L1 depth error was reduced to 0.68.
  • Semantic segmentation: Achieved up to 48.1 mIoU on COCO.
  • Instance segmentation and retrieval: Demonstrated strong instance retrieval capabilities extending beyond DINOv2 benchmarks (see the retrieval sketch after this list).
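For the retrieval results, a minimal kNN sketch is shown below: a query embedding (in 4M-21, a global feature such as a DINOv2- or ImageBind-style embedding predicted from some input modality) is ranked against a gallery by cosine similarity. The tensors here are random placeholders, and the function is an illustrative assumption rather than the paper's evaluation code.

```python
# Minimal cosine-similarity kNN retrieval sketch with placeholder embeddings.
import torch
import torch.nn.functional as F

def knn_retrieve(query: torch.Tensor, gallery: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Return indices of the k gallery rows most similar to the query vector."""
    q = F.normalize(query, dim=-1)    # (d,)
    g = F.normalize(gallery, dim=-1)  # (n, d)
    sims = g @ q                      # cosine similarities, shape (n,)
    return sims.topk(k).indices

gallery = torch.randn(1000, 768)  # placeholder gallery embeddings
query = torch.randn(768)          # placeholder predicted embedding
print(knn_retrieve(query, gallery, k=5))
```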

Implications and Future Directions

The implications of 4M-21 extend both practically and theoretically. Practically, the model supports a multitude of applications from high-fidelity image generation conditioned on varying inputs to nuanced data retrieval across multiple modalities. This makes it highly relevant for applications in areas such as computer vision-driven content generation, autonomous systems requiring complex multi-modal understanding, and holistic scene comprehension.

Theoretically, the success of 4M-21 in handling a broad array of tasks without performance regression pushes the boundary on the scalability and versatility of foundation models. It challenges the traditional limitations of multitask learning and sets a precedent for future investigations into even more expansive and coordinated model architectures.

Further research could explore more efficient tokenization techniques, improved co-training strategies, and stronger cross-modal interaction capabilities. Integrating even more diverse datasets could further enhance the model's ability to uncover and leverage emergent capabilities, similar to breakthroughs witnessed in LLMs.

In conclusion, 4M-21 represents a significant step forward in vision research, offering a unified model that remarkably balances versatility and performance, laying a robust foundation for future advancements in multi-modal and multi-task learning paradigms.
