AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation (2406.07686v1)

Published 11 Jun 2024 in cs.CV

Abstract: Recent Diffusion Transformers (DiTs) have shown impressive capabilities in generating high-quality single-modality content, including images, videos, and audio. However, it is still under-explored whether the transformer-based diffuser can efficiently denoise the Gaussian noises towards superb multimodal content creation. To bridge this gap, we introduce AV-DiT, a novel and efficient audio-visual diffusion transformer designed to generate high-quality, realistic videos with both visual and audio tracks. To minimize model complexity and computational costs, AV-DiT utilizes a shared DiT backbone pre-trained on image-only data, with only lightweight, newly inserted adapters being trainable. This shared backbone facilitates both audio and video generation. Specifically, the video branch incorporates a trainable temporal attention layer into a frozen pre-trained DiT block for temporal consistency. Additionally, a small number of trainable parameters adapt the image-based DiT block for audio generation. An extra shared DiT block, equipped with lightweight parameters, facilitates feature interaction between audio and visual modalities, ensuring alignment. Extensive experiments on the AIST++ and Landscape datasets demonstrate that AV-DiT achieves state-of-the-art performance in joint audio-visual generation with significantly fewer tunable parameters. Furthermore, our results highlight that a single shared image generative backbone with modality-specific adaptations is sufficient for constructing a joint audio-video generator. Our source code and pre-trained models will be released.

Summary

Overview of AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation

This paper presents AV-DiT, an audio-visual diffusion transformer architecture for generating audio and video jointly. It addresses a gap in the literature, where single-modality models dominate and multimodal generation remains under-explored. AV-DiT uses a shared diffusion transformer backbone that is pre-trained on image-only data and kept frozen, and adds lightweight trainable adapters for joint audio and video generation. This substantially reduces both the computational cost and the number of tunable parameters.

Methodology

Diffusion Process and Transformer Architecture

AV-DiT builds on diffusion models, which pair a forward process that gradually adds Gaussian noise to data with a reverse process in which a denoising network progressively removes that noise to produce clean samples. What distinguishes AV-DiT is that it applies a Diffusion Transformer (DiT) pre-trained only on images to joint audio-visual generation. To do so, it keeps the pre-trained DiT backbone frozen and inserts modality-specific adapters: a trainable temporal attention layer that gives the video branch temporal consistency, and a small number of trainable parameters that adapt the image-based blocks to audio. An extra shared DiT block lets audio and visual features interact so that the two modalities stay aligned.
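To make the forward process concrete, the following is a minimal sketch of the standard DDPM-style noising step, not AV-DiT's actual training code. The linear schedule, the 1000 steps, the noise-prediction loss, and all function names are assumptions chosen for illustration; the summary does not state which schedule or objective the paper uses.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (an assumed, DDPM-style schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = prod_{s<=t} (1 - beta_s): share of signal variance left at step t."""
    alpha_bar, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def forward_noise(x0, t, alpha_bar, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, I)."""
    signal = math.sqrt(alpha_bar[t])
    noise = math.sqrt(1.0 - alpha_bar[t])
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [signal * x + noise * e for x, e in zip(x0, eps)]
    return x_t, eps


def denoising_loss(eps_pred, eps):
    """Mean squared error between predicted and true noise (assumed training target)."""
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps)) / len(eps)


betas = linear_beta_schedule(1000)
alpha_bar = cumulative_alphas(betas)

rng = random.Random(0)
x0 = [rng.gauss(0.0, 1.0) for _ in range(16)]  # stand-in for a flattened latent
x_t, eps = forward_noise(x0, t=500, alpha_bar=alpha_bar, rng=rng)
```

In a joint audio-visual setting, the same noising step would be applied to the video and audio latents, and the denoising network would be trained to recover the noise (or the clean signal) for both. The reverse process then runs this in the opposite direction, starting from pure Gaussian noise.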

Audio-Visual Integration

The paper performs joint generation with a multimodal denoising network $\theta_{av}$, trained to fit the reverse (denoising) process for the video and audio modalities simultaneously. Key architectural components include trainable temporal attention layers and modality-specific LoRA adaptations
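The idea of modality-specific LoRA on a shared frozen backbone can be sketched as follows. This is an illustrative toy under stated assumptions, not the paper's implementation: the class name `LoRALinear`, the rank, the layer sizes, the scaling factor, and the choice of which layer to adapt are all hypothetical. It only shows the general LoRA pattern, in which a frozen weight `W` is shared and each modality trains its own low-rank update `B @ A`.

```python
import random


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


class LoRALinear:
    """Frozen weight W shared by all modalities, plus one low-rank update per modality."""

    def __init__(self, d_in, d_out, rank, modalities, rng, scale=1.0):
        # Stand-in for a pre-trained weight; it is never updated.
        self.W = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
        self.scale = scale
        self.adapters = {}
        for name in modalities:
            A = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)]
            # B starts at zero, so every adapter initially leaves the frozen layer unchanged.
            B = [[0.0] * rank for _ in range(d_out)]
            self.adapters[name] = (A, B)

    def forward(self, x, modality):
        A, B = self.adapters[modality]
        base = matvec(self.W, x)          # frozen path, identical for all modalities
        delta = matvec(B, matvec(A, x))   # low-rank, modality-specific path
        return [b + self.scale * d for b, d in zip(base, delta)]

    def trainable_params(self):
        return sum(len(A) * len(A[0]) + len(B) * len(B[0])
                   for A, B in self.adapters.values())

    def frozen_params(self):
        return len(self.W) * len(self.W[0])


rng = random.Random(0)
layer = LoRALinear(d_in=64, d_out=64, rank=4, modalities=("video", "audio"), rng=rng)
x = [rng.gauss(0.0, 1.0) for _ in range(64)]
video_out = layer.forward(x, "video")
audio_out = layer.forward(x, "audio")
print(layer.trainable_params(), layer.frozen_params())  # 1024 4096
```

With rank 4, the two adapters together hold 1024 trainable values against 4096 frozen ones in this toy layer. The gap widens as the layer grows, because the adapter size scales with `rank * (d_in + d_out)` while the frozen weight scales with `d_in * d_out`.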