Automatic multitrack mixing with a differentiable mixing console of neural audio effects

Published 20 Oct 2020 in eess.AS and cs.SD | (2010.10291v1)

Abstract: Applications of deep learning to automatic multitrack mixing are largely unexplored. This is partly due to the limited available data, coupled with the fact that such data is relatively unstructured and variable. To address these challenges, we propose a domain-inspired model with a strong inductive bias for the mixing task. We achieve this with the application of pre-trained sub-networks and weight sharing, as well as with a sum/difference stereo loss function. The proposed model can be trained with a limited number of examples, is permutation invariant with respect to the input ordering, and places no limit on the number of input sources. Furthermore, it produces human-readable mixing parameters, allowing users to manually adjust or refine the generated mix. Results from a perceptual evaluation involving audio engineers indicate that our approach generates mixes that outperform baseline approaches. To the best of our knowledge, this work demonstrates the first approach in learning multitrack mixing conventions from real-world data at the waveform level, without knowledge of the underlying mixing parameters.

Summary

The paper presents a differentiable mixing console (DMC) that leverages a stereo loss function and pre-trained sub-networks to automate multitrack mixing.
It employs a temporal convolutional network to simulate equalization, compression, and reverberation, closely mimicking traditional audio mixing practices.
Experimental evaluations on datasets like ENST-Drums and MedleyDB show that DMC produces mixes of competitive quality compared to human-engineered outputs.

Automatic Multitrack Mixing with a Differentiable Mixing Console of Neural Audio Effects

Christian J. Steinmetz et al. present an approach to automatic multitrack mixing with neural networks, a contribution to the field of intelligent music production. The paper addresses the relatively understudied application of deep learning to multitrack audio mixing and aims to model the signal processing performed in a traditional mixing console.

The paper's central proposal is a differentiable mixing console (DMC). The model carries a strong inductive bias drawn from domain knowledge: it uses pre-trained sub-networks and weight sharing, and it is trained with a stereo loss function designed for the automatic mixing task. The stereo loss is computed on the sum and difference of the left and right channels, which makes it invariant to left-right orientation, a property that is critical when training models on stereo mixes.
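The sum/difference idea can be illustrated with a minimal sketch. It assumes the distance is taken between magnitude spectra, so that the sign flip of the difference signal under a left-right swap has no effect; the single-frame DFT, the L1 distance, and all function names here are illustrative assumptions, not the authors' implementation.

```python
import cmath
import math


def magnitude_spectrum(x):
    """Magnitudes of a naive DFT of x (non-negative frequency bins only)."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]


def spectral_l1(a, b):
    """Mean absolute difference between two magnitude spectra (assumed distance)."""
    mag_a, mag_b = magnitude_spectrum(a), magnitude_spectrum(b)
    return sum(abs(p - q) for p, q in zip(mag_a, mag_b)) / len(mag_a)


def sum_diff_loss(pred_left, pred_right, target_left, target_right):
    """Compare sum (L+R) and difference (L-R) signals instead of L and R directly."""
    pred_sum = [l + r for l, r in zip(pred_left, pred_right)]
    pred_diff = [l - r for l, r in zip(pred_left, pred_right)]
    target_sum = [l + r for l, r in zip(target_left, target_right)]
    target_diff = [l - r for l, r in zip(target_left, target_right)]
    return spectral_l1(pred_sum, target_sum) + spectral_l1(pred_diff, target_diff)
```

Swapping the left and right channels leaves the sum signal unchanged and only negates the difference signal, whose magnitude spectrum is unaffected. A prediction that is a mirror image of the target is therefore not penalized for its orientation.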

The differentiable mixing console is built upon a temporal convolutional network (TCN) that emulates equalization, compression, and reverberation, the effects standard in mixing. Because digital signal processing effects can be applied programmatically, the large number of examples needed to train this transformation network can be generated rather than collected. To do so, the authors use a Python package named pymixconsole, which builds training examples from processing chains that resemble real-world ones, instead of depending on scarce parametric data.
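For readers unfamiliar with TCNs, the following is a minimal sketch of the general building block: a stack of dilated 1-D convolutions whose receptive field grows with the sum of the dilations. The causal padding, the ReLU activation, the kernel values, and the function names are assumptions for illustration; the summary does not describe the paper's exact architecture.

```python
def dilated_causal_conv(x, kernel, dilation):
    """1-D convolution where tap i reads the sample i * dilation steps in the past."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for i, weight in enumerate(kernel):
            idx = t - i * dilation
            if idx >= 0:  # samples before the start are treated as zero
                acc += weight * x[idx]
        out.append(acc)
    return out


def tcn(x, kernels, dilations):
    """Apply one dilated convolution plus ReLU per (kernel, dilation) pair."""
    for kernel, dilation in zip(kernels, dilations):
        x = [max(0.0, v) for v in dilated_causal_conv(x, kernel, dilation)]
    return x


def receptive_field(kernel_size, dilations):
    """Number of input samples that can influence one output sample."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Example: four layers with kernel size 3 and dilations doubling per layer.
dilations = [1, 2, 4, 8]
kernels = [[1.0, 1.0, 1.0] for _ in dilations]
impulse = [1.0] + [0.0] * 39
response = tcn(impulse, kernels, dilations)
```

Doubling the dilation at each layer lets the receptive field grow roughly exponentially with depth, which is what makes this family of models practical for effects with long memory such as compression and reverberation.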

The DMC was evaluated on two realistic datasets, ENST-Drums and MedleyDB, with a perceptual evaluation carried out by experienced audio engineers. The evaluation indicated that the DMC's mixes are competitive in quality with baseline and human-engineered mixes. Although audio aesthetics are subjective, the model showed promising results, especially on tasks with consistent sources and mixing techniques.

The paper also contrasts the DMC with classical time-domain deep learning models, which are limited not only by their lack of inductive bias but also by difficulty in handling variation in the input. This underlines the value of building architectural intuitions from traditional mixing practice into the model, so that neural networks remain effective despite limited data and varied source inputs.
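One way to see how a console-like structure copes with varied input is the sketch below: the same function is applied to every track (weight sharing) and the results are summed, so the mix does not depend on track order or track count. The `predict_params` rule is a hypothetical stand-in for a learned controller, and the gain/pan channel is a deliberately reduced console; neither is taken from the paper.

```python
import math


def predict_params(track):
    """Hypothetical shared controller: derive gain and pan from the track itself."""
    rms = math.sqrt(sum(s * s for s in track) / len(track))
    gain = 0.5 / (rms + 1e-8)  # assumed rule: bring every track to a common level
    pan = 0.5                  # assumed rule: place every track in the centre
    return gain, pan


def process_channel(track, gain, pan):
    """Apply gain and a linear pan to one mono track, returning (left, right)."""
    left = [gain * (1.0 - pan) * s for s in track]
    right = [gain * pan * s for s in track]
    return left, right


def mix(tracks):
    """Process each track with the same functions, then sum onto a stereo bus."""
    n = len(tracks[0])
    bus_left, bus_right = [0.0] * n, [0.0] * n
    for track in tracks:
        gain, pan = predict_params(track)
        left, right = process_channel(track, gain, pan)
        bus_left = [a + b for a, b in zip(bus_left, left)]
        bus_right = [a + b for a, b in zip(bus_right, right)]
    return bus_left, bus_right
```

Because the per-track parameters are explicit values such as gain and pan, they can be read and adjusted by a user after the fact, in line with the abstract's point about human-readable mixing parameters.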

The implications of Steinmetz et al.'s work are multifaceted. Practically, tools like the differentiable mixing console can streamline workflows for audio engineers, lower entry barriers for novice artists, and provide new analytical insights into contemporary multitrack mixing practices. Theoretically, this work opens avenues in AI research focused on audio and music production, supporting the case for inductive biases that mirror domain-specific processes. Future developments may broaden the range of signal processing tasks the DMC can accurately emulate and further strengthen the bridges between AI methodologies and traditional audio engineering practice.
