SonicVisionLM: Playing Sound with Vision Language Models (2401.04394v3)
Abstract: There has been growing interest in generating sound for silent videos, primarily because of its practicality in streamlining video post-production. However, existing video-to-sound methods attempt to create sound directly from visual representations, which is challenging because visual and audio representations are difficult to align. In this paper, we present SonicVisionLM, a novel framework for generating a wide range of sound effects by leveraging vision-language models (VLMs). Instead of generating audio directly from video, we exploit the capabilities of powerful VLMs: given a silent video, our approach first identifies events within the video using a VLM and suggests sounds that match the video content. This shift transforms the challenging task of aligning images with audio into the better-studied sub-problems of image-to-text and text-to-audio alignment, addressed with popular diffusion models. To improve the quality of audio recommendations from LLMs, we collected an extensive dataset that maps text descriptions to specific sound effects and developed a time-controlled audio adapter. Our approach surpasses current state-of-the-art video-to-audio methods, producing better synchronization with the visuals and improved alignment between the audio and video components. Project page: https://yusiissy.github.io/SonicVisionLM.github.io/
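The abstract describes a two-stage pipeline: a VLM first proposes time-stamped sound-event descriptions for the silent video, and a text-to-audio diffusion model with a time-controlled adapter then renders each effect at the right moment. Below is a minimal Python sketch of that control flow only; the function names (`vlm_describe_events`, `text_to_audio`, `sonify`) and the event schema are hypothetical placeholders, not the authors' actual API, and the model calls are stubbed out.

```python
from dataclasses import dataclass

@dataclass
class SoundEvent:
    description: str   # text prompt for the effect, e.g. "footsteps on gravel"
    onset_s: float     # event start time in the video (seconds)
    duration_s: float  # intended length of the effect (seconds)

def vlm_describe_events(video_path: str) -> list[SoundEvent]:
    """Hypothetical stand-in for the VLM stage: watch the silent video and
    propose time-stamped sound-event descriptions (image-to-text alignment)."""
    return [SoundEvent("footsteps on gravel", onset_s=1.2, duration_s=2.0)]

def text_to_audio(event: SoundEvent, sample_rate: int = 16_000) -> list[float]:
    """Hypothetical stand-in for the diffusion stage: synthesize the described
    effect, with the time-controlled adapter constraining onset and duration
    (text-to-audio alignment). Returns a silent placeholder waveform here."""
    return [0.0] * int(event.duration_s * sample_rate)

def sonify(video_path: str, video_len_s: float,
           sample_rate: int = 16_000) -> list[float]:
    """Assemble the soundtrack by placing each rendered effect at its onset."""
    track = [0.0] * int(video_len_s * sample_rate)
    for event in vlm_describe_events(video_path):
        wave = text_to_audio(event, sample_rate)
        start = int(event.onset_s * sample_rate)
        for i, sample in enumerate(wave):
            if start + i < len(track):
                track[start + i] += sample
    return track

if __name__ == "__main__":
    soundtrack = sonify("silent_clip.mp4", video_len_s=5.0)
    print(f"rendered {len(soundtrack)} samples")
```

The key design point this sketch illustrates is that audio never conditions on pixels directly: the VLM output is plain text with timing, so any off-the-shelf text-to-audio diffusion model can serve as the second stage.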
Authors: Zhifeng Xie, Shengye Yu, Mengtian Li, Qile He