
Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities

arXiv: 2405.18669
Published May 29, 2024 in cs.LG, cs.AI, cs.CL, and eess.AS

Abstract

Integrating multiple generative foundation models, especially those trained on different modalities, into something greater than the sum of its parts poses significant challenges. Two key hurdles are the availability of aligned data (concepts that contain similar meaning but are expressed differently in different modalities), and effectively leveraging unimodal representations in cross-domain generative tasks, without compromising their original unimodal capabilities. We propose Zipper, a multi-tower decoder architecture that addresses these concerns by using cross-attention to flexibly compose multimodal generative models from independently pre-trained unimodal decoders. In our experiments fusing speech and text modalities, we show the proposed architecture performs very competitively in scenarios with limited aligned text-speech data. We also showcase the flexibility of our model to selectively maintain unimodal generation performance (e.g., text-to-text generation) by freezing the corresponding modal tower (e.g., text). In cross-modal tasks such as automatic speech recognition (ASR), where the output modality is text, we show that freezing the text backbone results in negligible performance degradation. In cross-modal tasks such as text-to-speech generation (TTS), where the output modality is speech, we show that using a pre-trained speech backbone results in superior performance to the baseline.

Figure: Zipper model with gated cross-attention and projection layers, using speech tokens.

Overview

  • The Zipper model introduces a multi-tower decoder architecture that fuses independently pre-trained unimodal decoders using gated cross-attention layers, focusing on speech and text modalities.

  • Zipper demonstrates competitive performance on Automatic Speech Recognition (ASR) and significant improvements in Text-to-Speech (TTS) tasks, particularly in scenarios with limited aligned data.

  • The model maintains unimodal generation capabilities while adding cross-modal functionalities, showing potential for future extensions to additional modalities and larger datasets.

Integrating Independently Pre-trained Unimodal Decoders in Multimodal Generation Tasks: The Zipper Model

Introduction

The paper presents a novel approach, Zipper, designed to address challenges associated with integrating multiple generative foundation models trained on different modalities. The main difficulties in achieving efficient multimodal integration involve aligning data between modalities and leveraging unimodal representations in cross-domain tasks without losing their original capabilities.

Methodology

Zipper introduces a multi-tower decoder architecture that uses cross-attention to fuse independently pre-trained unimodal decoders, focusing in this work on the speech and text modalities. This allows multimodal generative models to be composed flexibly from existing unimodal models rather than trained jointly from scratch. The architecture consists of two autoregressive decoder towers, a text backbone and a speech backbone, each pre-trained independently with next-token prediction and then combined using gated cross-attention layers, as sketched below.
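
The following is a minimal sketch of a gated cross-attention block that lets one decoder tower (e.g., speech) attend to the hidden states of the other (e.g., text). It assumes PyTorch; the module name, projection, and zero-initialized gate are illustrative choices, not the paper's implementation.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Illustrative gated cross-attention layer inserted into one tower."""

    def __init__(self, d_model: int, d_other: int, n_heads: int):
        super().__init__()
        # Project the other tower's hidden states into this tower's dimension.
        self.proj = nn.Linear(d_other, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        # Gate initialized at zero so the fused model initially behaves like
        # the original pre-trained unimodal decoder.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor, other_hidden: torch.Tensor) -> torch.Tensor:
        # x:            (batch, seq_len, d_model)    this tower's activations
        # other_hidden: (batch, other_len, d_other)  the other tower's activations
        kv = self.proj(other_hidden)
        attn_out, _ = self.attn(self.norm(x), kv, kv)
        return x + torch.tanh(self.gate) * attn_out
```

Because the gate starts at zero, cross-modal information is blended in gradually during fine-tuning, which is one common way to preserve a frozen backbone's original behavior at initialization.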

Auto-regressive masking is adapted to multimodal sequences, allowing the model to generate outputs in a specified order of modalities during inference. This flexible design also allows unimodal generation performance to be retained by freezing the corresponding modality tower; for example, the text tower can be frozen to preserve text-to-text generation while the model is trained on cross-modal alignment tasks such as automatic speech recognition (ASR).
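
As a rough illustration of supervising only the output modality in an interleaved sequence, the sketch below builds a loss mask from per-token modality tags. The modality IDs and masking scheme are assumptions for exposition, not the paper's code.

```python
import torch

TEXT, SPEECH = 0, 1

def output_loss_mask(modality_ids: torch.Tensor, output_modality: int) -> torch.Tensor:
    """modality_ids: (batch, seq_len) tensor tagging each token's modality.
    Returns a float mask that zeroes out the loss on tokens not belonging
    to the designated output modality."""
    return (modality_ids == output_modality).float()

# Example: an ASR-style sequence [speech ..., text ...] where only the
# text portion (the output modality) contributes to the loss.
ids = torch.tensor([[SPEECH, SPEECH, SPEECH, TEXT, TEXT]])
print(output_loss_mask(ids, TEXT))  # tensor([[0., 0., 0., 1., 1.]])
```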

Experiments and Results

Automatic Speech Recognition (ASR)

The Zipper architecture exhibits competitive performance on ASR tasks when compared to the conventional Single Decoder baseline, which expands a single model's vocabulary to include speech tokens. The paper reports that with a frozen text backbone, Zipper's word error rate (WER) is comparable to the baseline, with only small differences, particularly on the noisier test-other subset.
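
For reference, WER is the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. A plain-Python sketch, included only to make the metric concrete:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via dynamic-programming edit distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sit"))  # 0.333...
```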

Text-to-Speech Generation (TTS)

On TTS, the Zipper model demonstrates a significant reduction in WER compared to the Single Decoder model, cutting WER by 12 absolute points (a 40% relative error reduction) when the speech backbone is unfrozen. The advantage stems from building on a strong pre-trained speech backbone, which becomes especially important as the speech context length grows during generation.

Limited Aligned Data Scenarios

Empirical results highlight Zipper's ability to learn efficiently from a small amount of aligned data. With as little as 1% of the original aligned data, Zipper achieves a WER in the mid-twenties on ASR, significantly outperforming the Single Decoder model under identical conditions. This underscores Zipper's advantage in data-constrained scenarios, where it can lean on strong unimodal pre-training.

Implications and Future Work

The Zipper model's ability to retain unimodal generative performance while adding cross-modal functionality addresses several challenges in multimodal model integration. Given its flexibility and reduced dependence on large amounts of aligned data, it is well suited to tasks where aligned cross-modal data is scarce.

Going forward, several expansions are envisaged. The model can be extended to integrate more than two modalities, incorporating text, speech, images, video, and other niche modalities like protein sequences. Future work will also explore scaling Zipper to larger model sizes and more diverse datasets, providing a comprehensive solution for multimodal generation tasks.

Conclusion

The Zipper architecture demonstrates a robust method for integrating unimodal generative models, preserving their core capabilities while adding cross-modal functionalities. The empirical results in ASR and TTS tasks affirm its competitive edge against traditional approaches, especially in data-constrained scenarios. This work lays a foundation for more flexible and scalable multimodal generative models, potentially transforming various applications in AI where multimodal data integration is paramount.
