C3LLM: Conditional Multimodal Content Generation Using Large Language Models (2405.16136v1)

Published 25 May 2024 in cs.AI, cs.CL, cs.LG, cs.SD, and eess.AS

Abstract: We introduce C3LLM (Conditioned-on-Three-Modalities LLMs), a novel framework that combines the three tasks of video-to-audio, audio-to-text, and text-to-audio generation. C3LLM adapts the LLM structure as a bridge for aligning different modalities, synthesizing the given conditional information, and performing multimodal generation in a discrete manner. Our contributions are as follows. First, we adapt a hierarchical structure for audio generation with pre-trained audio codebooks: the LLM is trained to generate audio semantic tokens from the given conditions, and a non-autoregressive transformer then generates successive levels of acoustic tokens to enhance the fidelity of the generated audio. Second, based on the intuition that LLMs were originally designed for discrete next-word prediction, we use a discrete representation for audio generation and compress its semantic content into acoustic tokens, effectively adding an "acoustic vocabulary" to the LLM. Third, our method unifies the previously separate tasks of audio understanding, video-to-audio generation, and text-to-audio generation in a single end-to-end model, providing greater versatility. C3LLM achieves improved results on various automated evaluation metrics and provides better semantic alignment than previous methods.
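The abstract describes a two-stage, fully discrete generation pipeline: an LLM autoregressively predicts audio semantic tokens from the conditioning input, and a non-autoregressive transformer then fills in the layered acoustic (codec) tokens. Below is a minimal PyTorch sketch of that idea; the module choices, vocabulary sizes, number of codebook levels, and other hyperparameters are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch of a two-stage semantic-then-acoustic token pipeline.
# All sizes and architectural details are assumptions for illustration only.
import torch
import torch.nn as nn

SEM_VOCAB = 1024       # assumed size of the semantic-token "acoustic vocabulary"
ACOUSTIC_VOCAB = 1024  # assumed per-level codebook size (e.g. an RVQ codec)
N_LEVELS = 4           # assumed number of acoustic codebook levels
DIM = 512              # assumed model width

class SemanticLM(nn.Module):
    """Autoregressive stand-in for the LLM that emits audio semantic tokens
    conditioned on a prefix of (video/text) condition embeddings."""
    def __init__(self):
        super().__init__()
        self.tok = nn.Embedding(SEM_VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, SEM_VOCAB)

    @torch.no_grad()
    def generate(self, cond, steps=32):
        # cond: (B, T_cond, DIM) condition embeddings; greedy next-token decoding.
        B = cond.size(0)
        tokens = torch.zeros(B, 1, dtype=torch.long)  # start token (id 0)
        for _ in range(steps):
            x = torch.cat([cond, self.tok(tokens)], dim=1)
            causal = torch.triu(
                torch.full((x.size(1), x.size(1)), float("-inf")), diagonal=1)
            logits = self.head(self.body(x, mask=causal))[:, -1]
            tokens = torch.cat([tokens, logits.argmax(-1, keepdim=True)], dim=1)
        return tokens[:, 1:]  # drop the start token

class AcousticNAR(nn.Module):
    """Non-autoregressive transformer that predicts every codebook level of
    acoustic tokens in parallel from the semantic tokens."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(SEM_VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=2)
        self.heads = nn.ModuleList(
            [nn.Linear(DIM, ACOUSTIC_VOCAB) for _ in range(N_LEVELS)])

    @torch.no_grad()
    def generate(self, semantic_tokens):
        h = self.body(self.emb(semantic_tokens))
        # One token per frame per codebook level; a neural codec decoder would
        # turn these layered tokens into a waveform.
        return torch.stack([head(h).argmax(-1) for head in self.heads], dim=1)

if __name__ == "__main__":
    cond = torch.randn(1, 8, DIM)                # placeholder condition embeddings
    semantic = SemanticLM().generate(cond)       # stage 1: semantic tokens
    acoustic = AcousticNAR().generate(semantic)  # stage 2: layered acoustic tokens
    print(semantic.shape, acoustic.shape)        # (1, 32) and (1, 4, 32)
```

In this sketch the second stage is deliberately simple: it predicts all codebook levels in one parallel pass, which is one common way to realize the "non-autoregressive transformer over layered acoustic tokens" idea; the paper's exact layering scheme may differ.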
