Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

(2312.17172)
Published Dec 28, 2023 in cs.CV, cs.AI, and cs.CL

Abstract

We present Unified-IO 2, the first autoregressive multimodal model that is capable of understanding and generating image, text, audio, and action. To unify different modalities, we tokenize inputs and outputs -- images, text, audio, action, bounding boxes, etc., into a shared semantic space and then process them with a single encoder-decoder transformer model. Since training with such diverse modalities is challenging, we propose various architectural improvements to stabilize model training. We train our model from scratch on a large multimodal pre-training corpus from diverse sources with a multimodal mixture of denoisers objective. To learn an expansive set of skills, such as following multimodal instructions, we construct and finetune on an ensemble of 120 datasets with prompts and augmentations. With a single unified model, Unified-IO 2 achieves state-of-the-art performance on the GRIT benchmark and strong results in more than 35 benchmarks, including image generation and understanding, natural language understanding, video and audio understanding, and robotic manipulation. We release all our models to the research community.

Figure: Illustration of pre-training data distribution by sampling rates, data types, and specific datasets.

Overview

  • The model is the first autoregressive multimodal model able to understand and generate image, text, audio, and action by tokenizing all modalities into a shared semantic space.

  • It leverages an encoder-decoder transformer architecture with improvements such as 2D rotary embeddings and scaled cosine attention for efficient training.

  • A multimodal mixture of denoisers pre-training strategy allows the model to incorporate multiple modalities and balance their representation.

  • After instruction tuning with over 120 datasets, the model excels in instruction-following and achieves state-of-the-art results on various benchmarks.

  • All models are released to the research community; performance on niche tasks can likely be improved further through targeted augmentation and finetuning.

Pre-training Achievements

Unified-IO 2 marks a noteworthy milestone as the first autoregressive multimodal model capable of comprehending and generating outputs across image, text, audio, and action. The model maps inputs from these diverse modalities into a shared semantic space, enabling it to produce free-form multimodal responses. Training drew on a vast and varied corpus, including image-text pairs, raw text, video clips, and other sources, to impart a broad set of skills to the model.
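A hedged sketch of the "shared semantic space" idea follows: each modality is mapped to tokens by its own tokenizer, and the results are concatenated with modality markers into a single sequence the transformer can consume. The marker strings and tokenizer interfaces are assumptions for illustration, not the released pipeline.

```python
def to_unified_sequence(example, text_tokenizer, image_tokenizer, audio_tokenizer):
    """Concatenate per-modality tokens into one sequence with modality markers.

    Marker strings and tokenizer call signatures are illustrative assumptions.
    """
    sequence = []
    if "text" in example:
        sequence += ["[Text]"] + text_tokenizer(example["text"])
    if "image" in example:
        sequence += ["[Image]"] + image_tokenizer(example["image"])  # e.g. VQ code indices
    if "audio" in example:
        sequence += ["[Audio]"] + audio_tokenizer(example["audio"])  # e.g. spectrogram codes
    return sequence
```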

Model Architecture and Training

The model employs an encoder-decoder transformer architecture, tailored with novel architectural improvements to address the challenges of multimodal training instability. Techniques like 2D rotary embeddings, normalization enhancements, and scaled cosine attention mechanisms have been essential to stabilize and optimize the training process. Moreover, dynamic packing techniques have been used to efficiently handle variable sequence lengths, significantly increasing training throughput.
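To make the stabilization ideas concrete, here is a minimal PyTorch-style sketch of scaled cosine attention (cosine-similarity logits with a learnable, clamped scale) and a 2D rotary embedding applied to the two halves of the head dimension. Shapes, names, and the clamp value are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F


def scaled_cosine_attention(q, k, v, logit_scale):
    # q, k, v: (batch, heads, seq, head_dim); logit_scale: learnable per-head scalar.
    # Cosine-similarity logits bound attention magnitudes, which helps keep
    # training stable when sequences mix very different modalities.
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    attn = (q @ k.transpose(-2, -1)) * logit_scale.clamp(max=100.0)  # clamp value assumed
    return F.softmax(attn, dim=-1) @ v


def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


def apply_2d_rope(x, cos_h, sin_h, cos_w, sin_w):
    # Rotary embedding over two axes: one half of the head dimension is rotated
    # by the patch row position, the other half by the column position.
    # cos_*/sin_* are precomputed tables broadcastable to each half of x.
    xh, xw = x.chunk(2, dim=-1)
    xh = xh * cos_h + rotate_half(xh) * sin_h
    xw = xw * cos_w + rotate_half(xw) * sin_w
    return torch.cat((xh, xw), dim=-1)
```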

Pre-training Strategy

A multimodal mixture of denoisers objective was designed to exploit diverse self-supervised learning signals, making it possible to incorporate multiple modalities effectively. This pre-training strategy, together with a careful methodology for selecting and processing input data, balances the representation of output modalities and the interactions between them.
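As a rough illustration of how a mixture-of-denoisers objective combines self-supervised signals, the sketch below samples one denoising paradigm per example: short-span corruption, extreme denoising, or causal continuation, in the spirit of UL2-style [R]/[X]/[S] tags. The tags, rates, weights, and single-span masking are simplifications assumed for the example, not the paper's exact configuration.

```python
import random

# (tag, corruption_rate, sampling_weight); values are illustrative only
DENOISERS = [
    ("[R]", 0.15, 0.5),   # standard short-span corruption
    ("[X]", 0.50, 0.25),  # extreme denoising: much larger corrupted fraction
    ("[S]", None, 0.25),  # prefix-LM style causal continuation
]

def sample_denoiser():
    weights = [w for _, _, w in DENOISERS]
    return random.choices(DENOISERS, weights=weights, k=1)[0]

def make_example(tokens):
    tag, rate, _ = sample_denoiser()
    if tag == "[S]":
        split = len(tokens) // 2
        return [tag] + tokens[:split], tokens[split:]
    # Mask one contiguous span (a toy stand-in for multi-span corruption).
    n = max(1, int(len(tokens) * rate))
    start = random.randrange(0, len(tokens) - n + 1)
    inputs = [tag] + tokens[:start] + ["<mask_0>"] + tokens[start + n:]
    targets = ["<mask_0>"] + tokens[start:start + n]
    return inputs, targets
```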

Instruction Tuning and Benchmark Performance

The model underwent extensive instruction tuning on a mixture of over 120 datasets covering more than 220 tasks. This enables it to handle tasks unseen during training, demonstrating versatile instruction following. The unified model achieves state-of-the-art performance on the GRIT benchmark and strong results across more than 35 benchmarks, spanning image generation and understanding as well as language, video, and audio comprehension.
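The sketch below illustrates, under assumed dataset contents, prompt templates, and rates, how such an instruction-tuning mixture might be sampled: each task contributes examples wrapped in a randomly chosen template, with per-dataset rates keeping any single dataset from dominating the mixture.

```python
import random

# Placeholder mixture: (examples, prompt templates, sampling rate) per dataset.
MIXTURE = [
    ([{"image": "img_0", "answer": "a cat on a sofa"}],
     ["What is shown in {image}?", "Briefly describe {image}."], 0.6),
    ([{"audio": "clip_0", "answer": "a dog barking"}],
     ["What sound can be heard in {audio}?"], 0.4),
]

def sample_instruction_example():
    rates = [r for _, _, r in MIXTURE]
    examples, templates, _ = random.choices(MIXTURE, weights=rates, k=1)[0]
    example = random.choice(examples)
    fields = {k: v for k, v in example.items() if k != "answer"}
    prompt = random.choice(templates).format(**fields)
    return {"inputs": prompt, "targets": example["answer"]}
```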

Release to Research Community

Releasing the models to the research community opens the door to further exploration and development. Future work may include scaling to a decoder-only architecture and improving data quality. Although the model learns a wide array of tasks and exhibits strong multitasking ability, it may still fall short on certain niche capabilities such as depth estimation and 3D object detection; targeted augmentation and finetuning could improve performance in these areas.
