
End-to-end music source separation: is it possible in the waveform domain? (1810.12187v2)

Published 29 Oct 2018 in cs.SD, cs.LG, and eess.AS

Abstract: Most of the currently successful source separation techniques use the magnitude spectrogram as input, and are therefore by default omitting part of the signal: the phase. To avoid omitting potentially useful information, we study the viability of using end-to-end models for music source separation --- which take into account all the information available in the raw audio signal, including the phase. Although during the last decades end-to-end music source separation has been considered almost unattainable, our results confirm that waveform-based models can perform similarly (if not better) than a spectrogram-based deep learning model. Namely: a Wavenet-based model we propose and Wave-U-Net can outperform DeepConvSep, a recent spectrogram-based deep learning model.

Authors (3)
  1. Francesc Lluís (7 papers)
  2. Jordi Pons (36 papers)
  3. Xavier Serra (82 papers)
Citations (71)

Summary

  • The paper introduces an end-to-end model that processes raw audio to preserve phase information and challenge traditional spectrogram approaches.
  • It employs a Wavenet-inspired deep learning architecture to directly regress on waveforms while minimizing mean absolute error.
  • Performance tests using SDR, SIR, and SAR metrics demonstrate competitive results, particularly in separating drums and bass.

End-to-end Music Source Separation in the Waveform Domain

Music source separation, the extraction of individual audio sources from a mixture, is a pivotal problem in signal processing. Traditionally, spectrogram-based methods, which often discard the phase, have dominated this domain. However, recent advances suggest the viability of end-to-end models that directly process the raw waveform, thereby preserving all available information, including the phase. This paper evaluates whether such end-to-end waveform-domain models can achieve comparable, or even superior, performance to their spectrogram-based counterparts.

Motivation and Background

Current successful methodologies rely heavily on the magnitude spectrogram, which, while effective, inherently discards phase information. This exclusion can lead to suboptimal results, as the phase carries valuable cues about sound source characteristics and interactions. The work challenges the status quo by positing that waveform-based models, backed by deep learning's strong acoustic modeling capabilities, can rival spectrogram-focused systems. Waveform-based music source separation has historically been deemed impractical owing to the raw signal's complexity and high dimensionality; this research revisits its feasibility, adding a crucial perspective to an underexplored area.

Methodological Framework

The authors propose a Wavenet-inspired model that operates in a non-causal mode, adapting techniques from recent advances in audio processing. The model regresses directly on audio waveforms, minimizing the mean absolute error (MAE) in an end-to-end learning process that avoids traditional time-frequency transformations. The work compares against established baselines: DeepConvSep, a spectrogram-based model, and Wave-U-Net, a waveform-based alternative that applies the popular U-Net architecture in the waveform domain.
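To make the regression setup concrete, the following is a minimal PyTorch sketch of this idea, with hypothetical layer counts and channel widths; the paper's actual architecture (adapted from a Wavenet-style model, with its own dilation pattern and skip connections) differs in detail.

```python
import torch
import torch.nn as nn

class DilatedSeparator(nn.Module):
    """Toy non-causal Wavenet-style regressor: mixture waveform -> source waveforms.
    Sizes are illustrative, not the paper's exact configuration."""
    def __init__(self, n_sources=2, channels=32, n_layers=8):
        super().__init__()
        layers, in_ch = [], 1
        for i in range(n_layers):
            d = 2 ** i  # exponentially growing dilation widens the receptive field
            # Symmetric padding makes the convolution non-causal (uses future context)
            layers += [nn.Conv1d(in_ch, channels, kernel_size=3, padding=d, dilation=d),
                       nn.Tanh()]
            in_ch = channels
        layers.append(nn.Conv1d(channels, n_sources, kernel_size=1))
        self.net = nn.Sequential(*layers)

    def forward(self, mix):          # mix: (batch, 1, samples)
        return self.net(mix)         # (batch, n_sources, samples)

model = DilatedSeparator()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
mae = nn.L1Loss()                    # mean absolute error, as in the paper

mix = torch.randn(4, 1, 16384)       # dummy batch of mixture excerpts
targets = torch.randn(4, 2, 16384)   # dummy ground-truth source waveforms

opt.zero_grad()
loss = mae(model(mix), targets)      # regress directly on raw samples
loss.backward()
opt.step()
```

No STFT or masking appears anywhere in this pipeline: the network maps samples to samples, so phase is handled implicitly.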

Evaluation and Results

Detailed analyses of several model variants were conducted, focusing on the number of dilated-layer stacks and their parameterization (e.g., the number of CNN filters). Evaluating through both perceptual tests and the standard source separation metrics, namely Signal-to-Distortion Ratio (SDR), Signal-to-Interference Ratio (SIR), and Signal-to-Artifacts Ratio (SAR), the paper demonstrates that waveform-based models can yield competitive results. Their strength is especially evident in matching or exceeding spectrogram-based models in specific scenarios, particularly when separating drums and bass. Nonetheless, challenges remain: spectrogram-based models still outperform their waveform counterparts in median SDR by approximately 1.5 dB.
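For reference, these BSS Eval metrics can be computed with the mir_eval package; the sketch below uses dummy signals, and the array shapes and noise level are illustrative assumptions:

```python
import numpy as np
import mir_eval

# Stack sources as (n_sources, n_samples); dummy signals for illustration.
rng = np.random.default_rng(0)
reference = rng.standard_normal((2, 44100))                    # e.g. true drums and bass
estimated = reference + 0.1 * rng.standard_normal((2, 44100))  # noisy estimates

# Returns per-source SDR, SIR, SAR (all in dB, higher is better)
# plus the best source-to-estimate permutation.
sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(reference, estimated)
print("SDR:", sdr, "SIR:", sir, "SAR:", sar)
```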

Implications and Future Perspectives

While the results underscore the potential of end-to-end models for source separation, significant challenges must be addressed before they reach parity with spectrogram-based methods in all settings. This research paves the way for further examination of the waveform representation's role in audio processing, particularly regarding the long-standing phase problem and the mask-based filtering paradigm. The authors also suggest continued exploration of training strategies and loss functions that could improve waveform-based models' efficiency and performance.
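For contrast, here is a minimal sketch of the mask-based filtering paradigm that waveform models sidestep, using SciPy's STFT; the ratio-mask computation stands in for a network's mask prediction, and the variable names are illustrative:

```python
import numpy as np
from scipy.signal import stft, istft

def mask_filter(mixture, est_source_mag, fs=44100, nperseg=2048):
    """Spectrogram masking: scale the mixture's magnitude with a soft mask,
    then reuse the mixture's phase for reconstruction (the 'phase problem')."""
    _, _, mix_stft = stft(mixture, fs=fs, nperseg=nperseg)
    mix_mag = np.abs(mix_stft)
    # Soft (ratio) mask from an estimated source magnitude (same shape as
    # mix_stft); in practice a network predicts this from the mixture.
    mask = np.clip(est_source_mag / (mix_mag + 1e-8), 0.0, 1.0)
    _, source = istft(mask * mix_stft, fs=fs, nperseg=nperseg)  # mixture phase kept
    return source
```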

Conclusion

This paper indicates that end-to-end music source separation is not only possible but promising. Leveraging deep learning on raw audio may circumvent issues inherent to spectrogram methodologies, potentially initiating a paradigm shift in audio processing. By advocating a comprehensive approach that includes waveform-domain solutions, the research invites further inquiry and application in music and speech source separation, and contributes a step toward separation techniques that exploit every facet of the signal without compromise.
