An Experimental Comparison Of Multi-view Self-supervised Methods For Music Tagging (2404.09177v1)
Abstract: Self-supervised learning has emerged as a powerful way to pre-train generalizable machine learning models on large amounts of unlabeled data. It is particularly compelling in the music domain, where labeled data is time-consuming, error-prone, and ambiguous to obtain. During self-supervised pre-training, models are trained on pretext tasks whose primary objective is to learn robust, informative features that can later be fine-tuned for specific downstream tasks. The choice of pretext task is critical, as it guides the model to shape the feature space with meaningful constraints for information encoding. In the context of music, most works have relied on contrastive learning or masking techniques. In this study, we expand the scope of pretext tasks applied to music by investigating and comparing the performance of new self-supervised methods for music tagging. We open-source a simple ResNet model trained on a diverse catalog of millions of tracks. Our results demonstrate that, although most of these pre-training methods yield similar downstream performance, contrastive learning consistently outperforms the other self-supervised approaches, and this advantage persists even when downstream labeled data is limited.
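
The multi-view methods compared here all train on pairs of "views" of the same track. As a concrete illustration, below is a minimal sketch of a SimCLR-style NT-Xent contrastive loss, the kind of objective contrastive pre-training optimizes, assuming paired embeddings from two augmented views of the same audio excerpts. The function name, temperature value, and PyTorch implementation details are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z_a: torch.Tensor, z_b: torch.Tensor,
                 temperature: float = 0.1) -> torch.Tensor:
    """SimCLR-style NT-Xent contrastive loss (illustrative sketch).

    z_a, z_b: (batch, dim) embeddings of two augmented views of the
    same excerpts; row i of z_a and row i of z_b form a positive pair,
    and all other rows in the batch act as negatives.
    """
    batch = z_a.shape[0]
    z = torch.cat([z_a, z_b], dim=0)        # (2B, dim)
    z = F.normalize(z, dim=1)               # cosine similarity via dot product
    sim = z @ z.t() / temperature           # (2B, 2B) similarity matrix
    # Mask self-similarity so a sample is never its own positive.
    sim.fill_diagonal_(float("-inf"))
    # For row i, the positive example sits at index i + B (mod 2B).
    targets = torch.cat([torch.arange(batch) + batch, torch.arange(batch)])
    return F.cross_entropy(sim, targets.to(sim.device))
```

In a music setting, the two views might be different excerpts of the same track or stochastically augmented mel-spectrograms passed through a shared ResNet encoder; batch size and temperature are the main knobs controlling how hard the in-batch negatives are.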