C3LLM: Conditional Multimodal Content Generation Using Large Language Models (2405.16136v1)

Published 25 May 2024 in cs.AI, cs.CL, cs.LG, cs.SD, and eess.AS

Abstract: We introduce C3LLM (Conditioned-on-Three-Modalities LLM), a novel framework that combines the three tasks of video-to-audio, audio-to-text, and text-to-audio generation. C3LLM adapts the LLM structure as a bridge that aligns different modalities, synthesizes the given conditional information, and performs multimodal generation in a discrete manner. Our contributions are as follows. First, we adopt a hierarchical structure for audio generation built on pre-trained audio codebooks: the LLM is trained to generate audio semantic tokens from the given conditions, and a non-autoregressive transformer then generates successive levels of acoustic tokens to enhance the fidelity of the generated audio. Second, based on the intuition that LLMs were originally designed for discrete tasks via next-word prediction, we use a discrete representation for audio and compress its semantic content into acoustic tokens, in effect adding an "acoustic vocabulary" to the LLM. Third, our method unifies the previously separate tasks of audio understanding, video-to-audio generation, and text-to-audio generation into a single end-to-end model, providing greater versatility. C3LLM achieves improved results on various automated evaluation metrics and better semantic alignment than previous methods.
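
The abstract describes a two-stage, token-based pipeline: an autoregressive LLM first maps the conditioning input (text or video features) to audio semantic tokens, and a non-autoregressive transformer then predicts several codebook levels of acoustic tokens that a neural codec would decode into a waveform. The sketch below illustrates that flow only at a schematic level; the module names, vocabulary sizes, layer counts, and the omission of a causal mask are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the two-stage hierarchical audio generation outlined
# in the abstract: an autoregressive LM predicts semantic tokens from a
# conditioning prefix, then a non-autoregressive transformer predicts several
# levels of acoustic (codec) tokens. All names and sizes are assumptions.
import torch
import torch.nn as nn

SEM_VOCAB, ACOUSTIC_VOCAB, N_LEVELS, DIM = 1024, 1024, 4, 256

class SemanticLM(nn.Module):
    """Autoregressive 'LLM' stand-in: conditioning tokens -> semantic audio tokens."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(SEM_VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, SEM_VOCAB)

    @torch.no_grad()
    def generate(self, prefix: torch.Tensor, steps: int) -> torch.Tensor:
        tokens = prefix
        for _ in range(steps):
            h = self.encoder(self.embed(tokens))           # causal mask omitted for brevity
            next_tok = self.head(h[:, -1]).argmax(-1, keepdim=True)
            tokens = torch.cat([tokens, next_tok], dim=1)  # next-token prediction
        return tokens[:, prefix.size(1):]                  # generated semantic tokens only

class AcousticNAR(nn.Module):
    """Non-autoregressive stage: semantic tokens -> N_LEVELS of acoustic tokens."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(SEM_VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.heads = nn.ModuleList([nn.Linear(DIM, ACOUSTIC_VOCAB) for _ in range(N_LEVELS)])

    @torch.no_grad()
    def forward(self, semantic: torch.Tensor) -> torch.Tensor:
        h = self.encoder(self.embed(semantic))
        # one head per codebook level, predicted in parallel over all timesteps
        return torch.stack([head(h).argmax(-1) for head in self.heads], dim=1)

if __name__ == "__main__":
    prefix = torch.randint(0, SEM_VOCAB, (1, 8))   # stand-in for encoded text/video condition
    semantic = SemanticLM().generate(prefix, steps=16)
    acoustic = AcousticNAR()(semantic)             # (batch, N_LEVELS, time) codec tokens
    print(acoustic.shape)                          # a codec decoder would turn this into audio
```

In the paper's setup the second stage refines acoustic detail level by level; here all levels are predicted by parallel heads purely to keep the sketch short.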
