An Experimental Comparison of Multi-view Self-supervised Methods for Music Tagging (2404.09177v1)

Published 14 Apr 2024 in cs.SD, cs.LG, and eess.AS

Abstract: Self-supervised learning has emerged as a powerful way to pre-train generalizable machine learning models on large amounts of unlabeled data. It is particularly compelling in the music domain, where obtaining labeled data is time-consuming, error-prone, and ambiguous. During the self-supervised process, models are trained on pretext tasks, with the primary objective of acquiring robust and informative features that can later be fine-tuned for specific downstream tasks. The choice of the pretext task is critical as it guides the model to shape the feature space with meaningful constraints for information encoding. In the context of music, most works have relied on contrastive learning or masking techniques. In this study, we expand the scope of pretext tasks applied to music by investigating and comparing the performance of new self-supervised methods for music tagging. We open-source a simple ResNet model trained on a diverse catalog of millions of tracks. Our results demonstrate that, although most of these pre-training methods yield similar downstream results, contrastive learning consistently leads to better downstream performance than other self-supervised pre-training methods. This holds true in a limited-data downstream context.
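
To make the contrastive pretext task mentioned in the abstract concrete, below is a minimal, illustrative sketch of a SimCLR-style NT-Xent loss applied to embeddings of two augmented views of the same tracks. This is not the authors' implementation; the use of PyTorch, the temperature value, and the embedding dimension are assumptions for illustration only.

```python
# Illustrative sketch (not the paper's code): SimCLR-style NT-Xent contrastive loss
# over two augmented "views" of the same audio clips.
import torch
import torch.nn.functional as F


def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """NT-Xent (normalized temperature-scaled cross-entropy) loss.

    z1, z2: (batch, dim) embeddings of two augmented views of the same tracks.
    """
    batch_size = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2B, D), unit-norm rows
    sim = z @ z.T / temperature                          # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))                    # exclude self-similarity
    # For row i < B the positive is row i + B (and vice versa): the other view of the same clip.
    targets = torch.cat([torch.arange(batch_size) + batch_size,
                         torch.arange(batch_size)])
    return F.cross_entropy(sim, targets)


if __name__ == "__main__":
    # Stand-in embeddings; in practice these would come from an audio encoder.
    z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
    print(f"NT-Xent loss: {nt_xent_loss(z1, z2).item():.4f}")
```

In a real pipeline, the two views would come from a shared encoder (e.g., the ResNet mentioned in the abstract) applied to two differently augmented excerpts of each track, and the pre-trained encoder would later be probed or fine-tuned for the downstream tagging task.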
