
Joint Music and Language Attention Models for Zero-shot Music Tagging (2310.10159v1)

Published 16 Oct 2023 in cs.SD, cs.CL, and eess.AS

Abstract: Music tagging is the task of predicting the tags of music recordings. However, previous music tagging research has focused primarily on closed-set tagging tasks, which cannot be generalized to new tags. In this work, we propose a zero-shot music tagging system based on a joint music and language attention (JMLA) model to address the open-set music tagging problem. The JMLA model consists of an audio encoder, modeled by a pretrained masked autoencoder, and a decoder, modeled by Falcon7B. We introduce a perceiver resampler to convert arbitrary-length audio into fixed-length embeddings. We introduce dense attention connections between encoder and decoder layers to improve the information flow between them. We collect a large-scale music and description dataset from the internet. We propose using ChatGPT to convert the raw descriptions into formalized and diverse descriptions to train the JMLA models. Our proposed JMLA system achieves a zero-shot audio tagging accuracy of 64.82% on the GTZAN dataset, outperforming previous zero-shot systems, and achieves results comparable to previous systems on the FMA and MagnaTagATune datasets.
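
The abstract's step of converting arbitrary-length audio into fixed-length embeddings can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single cross-attention step with no learned projections, and the names `resample` and `random_vectors` and the sizes (4 latents, dimension 8) are hypothetical. A trained perceiver resampler would typically add learned query, key, and value projections and stack several layers. The only point shown here is that a fixed set of latent queries attending over the audio frames yields an output whose length does not depend on the input length.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def resample(frames, latents):
    """Cross-attend fixed latent queries over a variable-length frame sequence.

    frames:  T vectors of dimension d (T may differ per recording)
    latents: M query vectors of dimension d (M is fixed)
    Returns M vectors of dimension d, independent of T.
    """
    dim = len(latents[0])
    scale = 1.0 / math.sqrt(dim)
    output = []
    for query in latents:
        # Scaled dot-product score between this latent and every frame.
        scores = [scale * sum(q * f for q, f in zip(query, frame)) for frame in frames]
        weights = softmax(scores)
        # Attention-weighted average; frames act as both keys and values
        # in this simplified sketch (no learned projections).
        pooled = [sum(w * frame[i] for w, frame in zip(weights, frames)) for i in range(dim)]
        output.append(pooled)
    return output


def random_vectors(count, dim, rng):
    """Hypothetical stand-in for encoder outputs or learned latents."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
latents = random_vectors(4, 8, rng)       # M = 4 latent queries, d = 8 (illustrative sizes)
short_clip = random_vectors(10, 8, rng)   # 10 audio frames
long_clip = random_vectors(250, 8, rng)   # 250 audio frames

short_out = resample(short_clip, latents)
long_out = resample(long_clip, latents)
print(len(short_out), len(short_out[0]))  # 4 8
print(len(long_out), len(long_out[0]))    # 4 8
```

Both clips come out as 4 vectors of dimension 8, which is what would let a decoder with a fixed-size audio prefix consume recordings of any duration.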

