Emergent Mind

Semi-Supervised Contrastive Learning of Musical Representations

(arXiv:2407.13840)
Published Jul 18, 2024 in eess.AS

Abstract

Despite the success of contrastive learning in Music Information Retrieval, the inherent ambiguity of contrastive self-supervision presents a challenge. Relying solely on augmentation chains and self-supervised positive sampling strategies can lead to a pretraining objective that does not capture key musical information for downstream tasks. We introduce semi-supervised contrastive learning (SemiSupCon), a simple method for leveraging musically informed labeled data (supervision signals) in the contrastive learning of musical representations. Our approach introduces musically relevant supervision signals into self-supervised contrastive learning by combining supervised and self-supervised contrastive objectives in a simpler framework than previous approaches. This framework improves downstream performance and robustness to audio corruptions on a range of downstream MIR tasks with moderate amounts of labeled data. Our approach enables shaping the learned similarity metric through the choice of labeled data that (1) infuses the representations with musical domain knowledge and (2) improves out-of-domain performance with minimal general downstream performance loss. We show strong transfer learning performance on musically related yet not trivially similar tasks - such as pitch and key estimation. Additionally, our approach shows performance improvement on automatic tagging over self-supervised approaches with only 5% of available labels included in pretraining.

Figure: Semi-supervised contrastive learning uses labeled and unlabeled data to compute a pairwise similarity loss.

Overview

  • The paper introduces SemiSupCon, a semi-supervised contrastive learning framework, to improve Music Information Retrieval (MIR) by incorporating labeled data into self-supervised learning (SSL).

  • The proposed methodology demonstrates higher robustness to audio corruptions and effective transfer learning capabilities across various MIR tasks such as genre classification, instrument identification, and pitch classification.

  • Experiments show that SemiSupCon achieves incremental performance gains with even a small amount of labeled data, outperforming both conventional SSL and fully supervised models.

Semi-supervised Contrastive Learning of Musical Representations

In the paper titled "Semi-supervised Contrastive Learning of Musical Representations," authors Julien Guinot, Elio Quinton, and György Fazekas propose SemiSupCon, a method for enhancing contrastive learning in Music Information Retrieval (MIR) by incorporating labeled data into the self-supervised learning (SSL) framework. The approach targets a core limitation of purely self-supervised pretraining: relying solely on augmentation chains and self-supervised positive sampling can yield pretraining objectives that fail to capture key musical information essential for downstream tasks.

Introduction

The authors introduce SemiSupCon as a solution to improve upon current self-supervised learning methods by integrating supervised signals in a contrastive learning framework. This method aims to leverage both labeled and unlabeled data to infuse the learned representations with musical domain knowledge, thereby enhancing the model's performance across a range of MIR tasks. The key contributions of this work are threefold:

  1. The introduction of a simple framework that integrates supervised and self-supervised contrastive learning.
  2. Demonstrating the ability of SemiSupCon to shape representations according to the supervision signal with minimal performance loss on unrelated tasks.
  3. Proposing a representation learning framework that exhibits higher robustness to audio corruptions.

Methodology

The methodology involves three primary components: self-supervised contrastive learning, supervised contrastive learning, and the hybrid semi-supervised contrastive learning approach. Each component uses an embedding encoder and a projection head to map audio data into a latent space designed for contrastive learning.
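
As a toy illustration of the encoder-plus-projection-head stage (the shapes, the two-layer MLP, and the weight scale here are assumptions for the sketch, not the paper's architecture), the mapping into the contrastive latent space can be written as:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the pipeline described above: an encoder embedding h is mapped
# by a small MLP projection head to z, which is L2-normalised so that the
# contrastive loss can operate on cosine similarities.
W1 = rng.normal(size=(512, 256)) * 0.05   # hypothetical head weights
W2 = rng.normal(size=(256, 128)) * 0.05

def project(h):
    z = np.maximum(h @ W1, 0.0) @ W2      # linear -> ReLU -> linear
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

h = rng.normal(size=(4, 512))             # batch of encoder outputs
z = project(h)                            # (4, 128) unit-norm projections
```

Normalising the projections means pairwise dot products are cosine similarities, which is the quantity the contrastive objectives below compare.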

  • Self-Supervised Contrastive Learning: The conventional SSL is based on maximizing agreement between augmented views of the same audio sample while minimizing it for different samples using a contrastive loss.
  • Supervised Contrastive Learning: This extends the SSL approach by leveraging labeled data to create a supervised contrastive loss that uses class labels to define positive pairs.
  • Semi-supervised Contrastive Learning (SemiSupCon): This method combines both approaches. By using a hybrid loss function that encompasses both labeled and unlabeled data, the authors aim to leverage the strengths of both learning paradigms.
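
The hybrid objective can be sketched in NumPy as follows. This is a minimal NT-Xent-style sketch; the function name, the `-1` convention for unlabeled clips, and folding supervised and self-supervised positives into a single positive mask are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def semisup_contrastive_loss(z, labels, temperature=0.1):
    """Sketch of a semi-supervised contrastive loss.

    z: (2N, d) L2-normalised projections; rows i and i+N are two
       augmented views of clip i.
    labels: length-N int array of class labels; -1 marks an unlabeled clip.
    """
    n = len(labels)
    full = np.concatenate([labels, labels])       # label per view
    sim = z @ z.T / temperature                   # pairwise similarities
    np.fill_diagonal(sim, -np.inf)                # exclude self-pairs

    # Self-supervised positives: the other augmented view of the same clip.
    pos = np.zeros((2 * n, 2 * n), dtype=bool)
    idx = np.arange(n)
    pos[idx, idx + n] = pos[idx + n, idx] = True

    # Supervised positives: any pair of views whose clips share a label.
    labelled = full >= 0
    pos |= (full[:, None] == full[None, :]) & labelled[:, None] & labelled[None, :]
    np.fill_diagonal(pos, False)

    # Cross-entropy of each anchor's softmax mass on its positives.
    logp = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    per_anchor = -np.where(pos, logp, 0.0).sum(axis=1) / pos.sum(axis=1)
    return float(per_anchor.mean())
```

With all labels set to `-1` this reduces to the self-supervised loss; as labels are added, same-class pairs become extra positives, which is how the supervision signal shapes the learned similarity metric.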

Experiments and Results

Automatic Tagging

The authors evaluate the performance of SemiSupCon on multiple MIR tasks, with particular focus on automatic tagging using the MagnaTagATune (MTAT) dataset. They find that incorporating only 5% of labeled data improves the area under the receiver operating characteristic curve (AUROC) from 88.8 to 89.4. The semi-supervised approach shows incremental performance gains as the proportion of labeled data increases, surpassing both self-supervised and fully supervised baselines under similar computing conditions.

Cross-Task Generalization

In a comprehensive set of experiments, the paper evaluates SemiSupCon's performance across various downstream tasks using multiple datasets. These include genre classification, instrument identification, and pitch classification among others. SemiSupCon demonstrates robust performance improvements, exhibiting effective transfer learning capabilities across tasks that are musically related but not trivially similar. For instance, pretraining with a pitch classification dataset significantly improved performance on instrument classification tasks, demonstrating the utility of musically informed supervision signals.

Robustness to Data Corruption

SemiSupCon models also exhibited greater robustness to input data corruptions compared to traditional end-to-end supervised models. This robustness stems from the contrastive learning framework's inherent design, which trains models to be invariant to augmentations applied during training.

Positive Mining Strategies

The paper explores various strategies for mining positive samples from multi-label datasets. A criterion based on shared class labels, or a continuous 'semantic weighting' strategy, is shown to produce more nuanced and effective supervision signals, further enhancing the performance of SemiSupCon.
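
One plausible reading of these two strategies can be sketched as below; the function `positive_weights`, its `min_shared` parameter, and the use of cosine similarity for the continuous weights are illustrative assumptions, not the paper's exact definitions:

```python
import numpy as np

def positive_weights(tags, min_shared=None):
    """Illustrative positive-mining weights from multi-hot tag vectors.

    With min_shared set, a pair is a binary positive when the two clips
    share at least that many tags. Otherwise the pair gets a continuous
    weight: the cosine similarity of their tag vectors ("semantic
    weighting").
    """
    tags = tags.astype(float)
    if min_shared is not None:
        w = (tags @ tags.T >= min_shared).astype(float)
    else:
        norms = np.linalg.norm(tags, axis=1, keepdims=True)
        norms[norms == 0] = 1.0           # untagged clips get zero weight
        unit = tags / norms
        w = unit @ unit.T
    np.fill_diagonal(w, 0.0)              # a clip is not its own positive
    return w
```

The continuous variant grades pairs by how much of their tag sets overlap rather than forcing a hard positive/negative decision, which is what makes the resulting supervision signal more nuanced.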

Implications and Future Work

The implications of this research are substantial for the field of MIR. The ability to leverage modest amounts of labeled data within a predominantly self-supervised framework could significantly reduce the labeling burden, which is often a bottleneck in MIR tasks. The demonstrated robustness and transfer learning capacity suggest that SemiSupCon could be adapted to a wide range of audio analysis tasks beyond MIR.

Future research directions highlighted in the paper include the exploration of additional supervision signals, such as perceptual metrics and chord estimation, and extending the framework to multimodal learning scenarios. Moreover, a deeper understanding of the interaction between the proportion of labeled data and the quality of learned representations could provide further insights into optimizing the SemiSupCon framework.

Conclusion

The introduction of SemiSupCon establishes a promising direction for representation learning in MIR. By effectively combining labeled and unlabeled data, this method addresses the limitations of traditional SSL approaches, offering enhanced performance, robustness, and transferability across a variety of musical tasks. The results and insights presented in this paper provide a solid foundation for further advancements in semi-supervised learning methodologies within the broader context of audio analysis and beyond.

Overall, this paper contributes a meaningful advancement in leveraging semi-supervised contrastive learning for musical representations, significantly enriching the toolbox available to researchers and practitioners in the field of MIR.
