Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning (2405.18386v2)
Abstract: Recent advances in text-to-music editing, which use text queries to modify music (e.g., by changing its style or adjusting instrumental components), present unique challenges and opportunities for AI-assisted music creation. Previous approaches in this domain have either trained task-specific editing models from scratch, which is both resource-intensive and inefficient, or used LLMs to predict edited music, which yields imprecise audio reconstruction. To combine the strengths of both approaches and address their limitations, we introduce Instruct-MusicGen, a novel approach that finetunes a pretrained MusicGen model to efficiently follow editing instructions such as adding, removing, or separating stems. Our approach augments the original MusicGen architecture with a text fusion module and an audio fusion module, which allow the model to process instruction text and input audio concurrently and yield the desired edited music. Remarkably, Instruct-MusicGen adds only 8% new parameters to the original MusicGen model and is trained for only 5K steps, yet it outperforms existing baselines across all tasks and achieves performance comparable to models trained for specific tasks. This advancement not only improves the efficiency of text-to-music editing but also broadens the applicability of music LLMs in dynamic music production environments.
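The abstract describes the architecture only at a high level. Below is a minimal, hypothetical PyTorch sketch of the general idea: a frozen pretrained decoder layer wrapped with two small trainable modules, one cross-attending to instruction-text embeddings ("text fusion") and one injecting embeddings of the audio to be edited ("audio fusion"), so that only a small fraction of parameters is trained. All class names, tensor shapes, and the exact placement of the fusion modules are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch only: a frozen MusicGen-style decoder layer augmented with
# trainable text-fusion (cross-attention over instruction embeddings) and
# audio-fusion (projection of conditioning-audio embeddings) modules.
import torch
import torch.nn as nn


class TextFusionBlock(nn.Module):
    """Cross-attention from music-token states to instruction-text embeddings (hypothetical)."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # x: (B, T_music, D); text_emb: (B, T_text, D) from a frozen text encoder (e.g. T5)
        attn_out, _ = self.attn(query=x, key=text_emb, value=text_emb)
        return self.norm(x + attn_out)


class AudioFusionBlock(nn.Module):
    """Projects embeddings of the input (to-be-edited) audio and adds them to the hidden states."""

    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor, audio_emb: torch.Tensor) -> torch.Tensor:
        # audio_emb: (B, T_music, D), e.g. embeddings of EnCodec tokens of the conditioning audio
        return x + self.proj(audio_emb)


class EditableDecoderLayer(nn.Module):
    """Frozen base layer wrapped with the two new, trainable fusion modules (hypothetical wiring)."""

    def __init__(self, base_layer: nn.Module, d_model: int):
        super().__init__()
        self.base_layer = base_layer
        for p in self.base_layer.parameters():  # keep the pretrained weights frozen
            p.requires_grad_(False)
        self.text_fusion = TextFusionBlock(d_model)
        self.audio_fusion = AudioFusionBlock(d_model)

    def forward(self, x, text_emb, audio_emb):
        x = self.audio_fusion(x, audio_emb)   # inject the conditioning audio
        x = self.base_layer(x)                # frozen pretrained computation
        x = self.text_fusion(x, text_emb)     # fuse the editing instruction
        return x


if __name__ == "__main__":
    d = 64
    base = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
    layer = EditableDecoderLayer(base, d_model=d)
    x = torch.randn(2, 100, d)      # music-token hidden states
    text = torch.randn(2, 16, d)    # instruction-text embeddings
    audio = torch.randn(2, 100, d)  # conditioning-audio embeddings
    print(layer(x, text, audio).shape)  # torch.Size([2, 100, 64])
```

If every decoder layer were wrapped this way and only the fusion modules (plus any low-rank adapters) were trained, the trainable parameter count would remain a small fraction of the base model, which is consistent with the roughly 8% figure reported in the abstract; the actual module design in the paper may differ.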
- MusicLM: Generating music from text. CoRR, abs/2301.11325, 2023. doi: 10.48550/arxiv.2301.11325. URL https://doi.org/10.48550/arxiv.2301.11325.
- InstructPix2Pix: Learning to follow image editing instructions. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pages 18392–18402. IEEE, 2023. doi: 10.1109/CVPR52729.2023.01764. URL https://doi.org/10.1109/CVPR52729.2023.01764.
- Pix2Video: Video editing using image diffusion. In IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, pages 23149–23160. IEEE, 2023. doi: 10.1109/ICCV51070.2023.02121. URL https://doi.org/10.1109/ICCV51070.2023.02121.
- StableVideo: Text-driven consistency-aware diffusion video editing. In IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, pages 22983–22993. IEEE, 2023. doi: 10.1109/ICCV51070.2023.02106. URL https://doi.org/10.1109/ICCV51070.2023.02106.
- MusicLDM: Enhancing novelty in text-to-music generation using beat-synchronous mixup strategies. CoRR, abs/2308.01546, 2023. doi: 10.48550/arxiv.2308.01546. URL https://doi.org/10.48550/arxiv.2308.01546.
- Simple and controllable music generation. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023. URL http://papers.nips.cc/paper_files/paper/2023/hash/94b472a1842cd7c56dcb125fb2765fbd-Abstract-Conference.html.
- High fidelity neural audio compression. CoRR, abs/2210.13438, 2022. doi: 10.48550/arxiv.2210.13438. URL https://doi.org/10.48550/arxiv.2210.13438.
- ComposerX: Multi-agent symbolic music composition with LLMs. arXiv preprint arXiv:2404.18081, 2024.
- Fast timing-conditioned latent audio diffusion. CoRR, abs/2402.04825, 2024a. doi: 10.48550/arxiv.2402.04825. URL https://doi.org/10.48550/arxiv.2402.04825.
- Long-form music generation with latent diffusion. arXiv preprint arXiv:2404.10301, 2024b.
- InstructME: An instruction guided music edit and remix framework with latent diffusion models. CoRR, abs/2308.14360, 2023. doi: 10.48550/arxiv.2308.14360. URL https://doi.org/10.48550/arxiv.2308.14360.
- LoRA: Low-rank adaptation of large language models. CoRR, abs/2106.09685, 2021. URL https://arxiv.org/abs/2106.09685.
- M²UGen: Multi-modal music understanding and generation with the power of large language models. CoRR, abs/2311.11255, 2023. doi: 10.48550/arxiv.2311.11255. URL https://doi.org/10.48550/arxiv.2311.11255.
- Single-channel multi-speaker separation using deep clustering. In Nelson Morgan, editor, Interspeech 2016, 17th Annual Conference of the International Speech Communication Association, San Francisco, CA, USA, September 8-12, 2016, pages 545–549. ISCA, 2016. doi: 10.21437/INTERSPEECH.2016-1176. URL https://doi.org/10.21437/Interspeech.2016-1176.
- Fréchet audio distance: A reference-free metric for evaluating music enhancement algorithms. In Gernot Kubin and Zdravko Kacic, editors, Interspeech 2019, 20th Annual Conference of the International Speech Communication Association, Graz, Austria, 15-19 September 2019, pages 2350–2354. ISCA, 2019. doi: 10.21437/INTERSPEECH.2019-2219. URL https://doi.org/10.21437/Interspeech.2019-2219.
- JEN-1: Text-guided universal music generation with omnidirectional diffusion models. CoRR, abs/2308.04729, 2023. doi: 10.48550/arxiv.2308.04729. URL https://doi.org/10.48550/arxiv.2308.04729.
- Music style transfer with time-varying inversion of diffusion models. In Michael J. Wooldridge, Jennifer G. Dy, and Sriraam Natarajan, editors, Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 2024, Fourteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2024, February 20-27, 2024, Vancouver, Canada, pages 547–555. AAAI Press, 2024. doi: 10.1609/AAAI.V38I1.27810. URL https://doi.org/10.1609/aaai.v38i1.27810.
- WavCraft: Audio editing and generation with large language models. In ICLR 2024 Workshop on Large Language Model (LLM) Agents, 2024.
- Content-based controls for music large language modeling. CoRR, abs/2310.17162, 2023. doi: 10.48550/arxiv.2310.17162. URL https://doi.org/10.48550/arxiv.2310.17162.
- Arrange, inpaint, and refine: Steerable long-term music audio generation and editing via content-based controls. CoRR, abs/2402.09508, 2024. doi: 10.48550/arxiv.2402.09508. URL https://doi.org/10.48550/arxiv.2402.09508.
- AudioLDM 2: Learning holistic audio generation with self-supervised pretraining. CoRR, abs/2308.05734, 2023a. doi: 10.48550/arxiv.2308.05734. URL https://doi.org/10.48550/arxiv.2308.05734.
- Visual instruction tuning. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023b. URL http://papers.nips.cc/paper_files/paper/2023/hash/6dcf277ea32ce3288914faf369fe6de0-Abstract-Conference.html.
- Separate anything you describe. CoRR, abs/2308.05037, 2023c. doi: 10.48550/arxiv.2308.05037. URL https://doi.org/10.48550/arxiv.2308.05037.
- Cutting music source separation some Slakh: A dataset to study the impact of training data quality and quantity. In 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA 2019, New Paltz, NY, USA, October 20-23, 2019, pages 45–49. IEEE, 2019. doi: 10.1109/WASPAA.2019.8937170. URL https://doi.org/10.1109/WASPAA.2019.8937170.
- Zero-shot unsupervised and text-based audio editing using DDPM inversion. CoRR, abs/2402.10009, 2024. doi: 10.48550/arxiv.2402.10009. URL https://doi.org/10.48550/arxiv.2402.10009.
- Multi-source diffusion models for simultaneous music generation and separation. CoRR, abs/2302.02257, 2023. doi: 10.48550/arxiv.2302.02257. URL https://doi.org/10.48550/arxiv.2302.02257.
- Mustango: Toward controllable text-to-music generation. arXiv preprint arXiv:2311.08355, 2023.
- StemGen: A music generation model that listens. CoRR, abs/2312.08723, 2023. doi: 10.48550/arxiv.2312.08723. URL https://doi.org/10.48550/arxiv.2312.08723.
- MoisesDB: A dataset for source separation beyond 4-stems. In Augusto Sarti, Fabio Antonacci, Mark Sandler, Paolo Bestagini, Simon Dixon, Beici Liang, Gaël Richard, and Johan Pauwels, editors, Proceedings of the 24th International Society for Music Information Retrieval Conference, ISMIR 2023, Milan, Italy, November 5-9, 2023, pages 619–626, 2023. doi: 10.5281/ZENODO.10265363. URL https://doi.org/10.5281/zenodo.10265363.
- Generalized multi-source inference for text conditioned music diffusion models. CoRR, abs/2403.11706, 2024. doi: 10.48550/arxiv.2403.11706. URL https://doi.org/10.48550/arxiv.2403.11706.
- Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67, 2020. URL http://jmlr.org/papers/v21/20-074.html.
- SDR - half-baked or well done? In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2019, Brighton, United Kingdom, May 12-17, 2019, pages 626–630. IEEE, 2019. doi: 10.1109/ICASSP.2019.8683855. URL https://doi.org/10.1109/ICASSP.2019.8683855.
- AUDIT: Audio editing by following instructions with latent diffusion models. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023. URL http://papers.nips.cc/paper_files/paper/2023/hash/e1b619a9e241606a23eb21767f16cf81-Abstract-Conference.html.
- Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004. doi: 10.1109/TIP.2003.819861. URL https://doi.org/10.1109/TIP.2003.819861.
- Music ControlNet: Multiple time-varying controls for music generation. CoRR, abs/2311.07069, 2023a. doi: 10.48550/arxiv.2311.07069. URL https://doi.org/10.48550/arxiv.2311.07069.
- Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023, Rhodes Island, Greece, June 4-10, 2023, pages 1–5. IEEE, 2023b. doi: 10.1109/ICASSP49357.2023.10095969. URL https://doi.org/10.1109/ICASSP49357.2023.10095969.
- UniAudio: An audio foundation model toward universal audio generation. CoRR, abs/2310.00704, 2023. doi: 10.48550/arxiv.2310.00704. URL https://doi.org/10.48550/arxiv.2310.00704.
- JEN-1 Composer: A unified framework for high-fidelity multi-track music generation. CoRR, abs/2310.19180, 2023. doi: 10.48550/arxiv.2310.19180. URL https://doi.org/10.48550/arxiv.2310.19180.
- MusicAgent: An AI agent for music understanding and generation with large language models. In Yansong Feng and Els Lefever, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023 - System Demonstrations, Singapore, December 6-10, 2023, pages 246–255. Association for Computational Linguistics, 2023. doi: 10.18653/V1/2023.EMNLP-DEMO.21. URL https://doi.org/10.18653/v1/2023.emnlp-demo.21.
- SoundStream: An end-to-end neural audio codec. IEEE ACM Trans. Audio Speech Lang. Process., 30:495–507, 2022. doi: 10.1109/TASLP.2021.3129994. URL https://doi.org/10.1109/TASLP.2021.3129994.
- LLaMA-Adapter: Efficient fine-tuning of language models with zero-init attention. CoRR, abs/2303.16199, 2023a. doi: 10.48550/arxiv.2303.16199. URL https://doi.org/10.48550/arxiv.2303.16199.
- Loop Copilot: Conducting AI ensembles for music generation and iterative editing. CoRR, abs/2310.12404, 2023b. doi: 10.48550/arxiv.2310.12404. URL https://doi.org/10.48550/arxiv.2310.12404.
- MusicMagus: Zero-shot text-to-music editing via diffusion models. CoRR, abs/2402.06178, 2024. doi: 10.48550/arxiv.2402.06178. URL https://doi.org/10.48550/arxiv.2402.06178.