SonicDiffusion: Audio-Driven Image Generation and Editing with Pretrained Diffusion Models (2405.00878v1)
Abstract: We are witnessing a revolution in conditional image synthesis driven by the recent success of large-scale text-to-image generation methods. This success also opens up new opportunities for controlling the generation and editing process with multi-modal input. While spatial control using cues such as depth, sketches, and other images has attracted considerable research attention, we argue that audio is an equally effective modality, since sound and sight are two primary components of human perception. Hence, we propose a method to enable audio conditioning in large-scale image diffusion models. Our method first maps features extracted from audio clips to tokens that can be injected into the diffusion model in a fashion similar to text tokens. We introduce additional audio-image cross-attention layers, which we finetune while freezing the weights of the original layers of the diffusion model. In addition to audio-conditioned image generation, our method can also be used in conjunction with diffusion-based editing methods to enable audio-conditioned image editing. We demonstrate our method on a wide range of audio and image datasets, and extensive comparisons with recent methods show favorable performance.
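The abstract outlines the core mechanism: audio features are projected to token sequences shaped like text tokens, and new audio-image cross-attention layers are trained while the pretrained diffusion U-Net stays frozen. Below is a minimal PyTorch sketch of that adapter pattern under stated assumptions; the module names, dimensions, token count, and zero-initialized gating are illustrative choices, not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class AudioProjector(nn.Module):
    """Maps a clip-level audio embedding to a short sequence of tokens
    with the same channel width as the model's text tokens.
    Sizes here (1024 -> 8 x 768) are illustrative assumptions."""

    def __init__(self, audio_dim=1024, token_dim=768, num_tokens=8):
        super().__init__()
        self.num_tokens = num_tokens
        self.proj = nn.Linear(audio_dim, num_tokens * token_dim)
        self.norm = nn.LayerNorm(token_dim)

    def forward(self, audio_emb):  # audio_emb: (B, audio_dim)
        tokens = self.proj(audio_emb).view(audio_emb.size(0), self.num_tokens, -1)
        return self.norm(tokens)   # (B, num_tokens, token_dim)


class AudioCrossAttention(nn.Module):
    """Extra audio-image cross-attention placed alongside the frozen text
    cross-attention in a U-Net block. Only these layers (plus the projector)
    receive gradients; the pretrained weights remain frozen."""

    def __init__(self, query_dim, token_dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(query_dim, heads,
                                          kdim=token_dim, vdim=token_dim,
                                          batch_first=True)
        # Zero-initialized gate: tanh(0) = 0, so training starts from the
        # unmodified pretrained model and gradually admits the audio signal.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, hidden_states, audio_tokens):
        # hidden_states: (B, L, query_dim) image features from the U-Net block
        out, _ = self.attn(hidden_states, audio_tokens, audio_tokens)
        return hidden_states + self.gate.tanh() * out
```

In this sketch, finetuning would update only `AudioProjector` and the `AudioCrossAttention` layers while the original text cross-attention and the rest of the U-Net stay frozen, consistent with the adapter-style training the abstract describes.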