Generating Sample-Based Musical Instruments Using Neural Audio Codec Language Models (2407.15641v1)
Abstract: In this paper, we propose and investigate the use of neural audio codec language models for the automatic generation of sample-based musical instruments from text or reference audio prompts. Our approach extends a generative audio framework to condition on pitch across an 88-key spectrum, velocity, and a combined text/audio embedding. We identify maintaining timbral consistency within the generated instruments as a major challenge, and to address it we introduce three distinct conditioning schemes. We evaluate our methods through objective metrics and human listening tests, demonstrating that our approach can produce compelling musical instruments. Specifically, we introduce a new objective metric to evaluate the timbral consistency of the generated instruments and adapt the average Contrastive Language-Audio Pretraining (CLAP) score to the text-to-instrument case, noting that its naive application is unsuitable for assessing this task. Our findings reveal a complex interplay between timbral consistency, the quality of generated samples, and their correspondence to the input prompt.
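As an illustration of the kind of adapted evaluation the abstract describes, the sketch below computes an average CLAP score for a generated instrument by embedding the text prompt once and averaging its cosine similarity against the embeddings of every generated note. This is a minimal sketch assuming the open-source `laion_clap` package; the prompt text and note file names are hypothetical, and the paper's exact checkpoint, pooling, and prompt-handling choices are not reproduced here.

```python
import numpy as np
import laion_clap

# Load a pretrained CLAP model (downloads a default checkpoint).
model = laion_clap.CLAP_Module(enable_fusion=False)
model.load_ckpt()

# Hypothetical text prompt and generated note files for one instrument.
text_prompt = ["a warm analog synthesizer pad"]
note_files = ["note_021.wav", "note_060.wav", "note_108.wav"]

# Embed the prompt (shape (1, D)) and all generated notes (shape (N, D)).
text_emb = model.get_text_embedding(text_prompt)
audio_emb = model.get_audio_embedding_from_filelist(x=note_files)

# L2-normalise both, then average cosine similarity over all notes.
t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
a = audio_emb / np.linalg.norm(audio_emb, axis=-1, keepdims=True)
avg_clap_score = float((a @ t.T).mean())
print(f"average CLAP score: {avg_clap_score:.3f}")
```

Averaging over all pitches and velocities of an instrument, rather than scoring a single clip, is one straightforward way to extend a clip-level CLAP score to the instrument level.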
Authors: Shahan Nercessian, Johannes Imort, Ninon Devis, Frederik Blang