- The paper introduces a regression-based method to precisely model piano note onsets, offsets, and pedal events.
- It leverages convolutional layers and bidirectional GRUs on log mel spectrograms to outperform traditional frame-based techniques.
- The system achieves a 96.72% onset F1 score and 91.86% pedal F1 score on the MAESTRO dataset, enabling robust high-resolution transcription.
High-Resolution Piano Transcription with Pedals by Regressing Onset and Offset Times
This paper addresses the task of Automatic Music Transcription (AMT) for piano recordings, focusing on improving transcription resolution and robustness. AMT involves converting audio recordings into symbolic representations like MIDI files. The authors introduce a novel method that enhances the resolution of piano transcription by regressing precise onset and offset times for notes and pedal events, diverging from traditional frame-wise methods constrained by frame hop sizes.
Methodology
The proposed system recasts onset and offset detection as regression rather than binary classification. Frame-wise classification limits temporal resolution to the frame hop size and is sensitive to misaligned labels. Instead, the network regresses the time difference between the center of each frame and the nearest onset or offset, so precise sub-frame timing information is preserved.
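The idea can be sketched as follows: frames near an onset receive a soft target that decays with their distance to the onset, rather than a hard 0/1 label. This is a minimal illustration of the encoding, not the paper's exact implementation; the frame rate and ramp width `J` below are assumed values, and `regression_targets` is a hypothetical helper name.

```python
import numpy as np

# Assumed parameters (illustrative, not the paper's exact configuration):
FRAMES_PER_SECOND = 100   # 10 ms frame hop
J = 5                     # half-width of the target ramp, in frames

def regression_targets(onset_times, num_frames):
    """Soft onset targets: 1.0 at the frame whose center is nearest an
    onset, decaying linearly to 0 over J frames on either side. Encodes
    the frame-center-to-onset distance instead of a binary label."""
    frame_centers = np.arange(num_frames) / FRAMES_PER_SECOND
    targets = np.zeros(num_frames)
    for t in onset_times:
        delta = np.abs(frame_centers - t)            # seconds to the onset
        ramp = 1.0 - delta * FRAMES_PER_SECOND / J   # 1 at onset, 0 at J frames away
        targets = np.maximum(targets, np.maximum(ramp, 0.0))
    return targets

targets = regression_targets([0.123], num_frames=30)
print(targets.argmax())  # frame 12 (center at 0.12 s) is nearest the onset
```

Because the peak value and ramp slope encode how far the true onset lies from the nearest frame center, the exact onset time can be recovered at inference with better-than-hop-size precision.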
The architecture applies convolutional layers followed by bidirectional GRUs to log mel spectrogram inputs, capturing both spectral and temporal structure. Rather than classifying note presence or absence in each frame, the system is trained with binary cross-entropy against the continuous regression targets.
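Binary cross-entropy extends naturally to soft targets in [0, 1], which is what makes this training setup possible. A minimal sketch of the loss, assuming sigmoid-activated predictions (the network architecture itself is omitted here):

```python
import numpy as np

def bce_loss(pred, target, eps=1e-7):
    """Binary cross-entropy with continuous targets in [0, 1].
    Unlike hard 0/1 labels, the targets here are the soft regression
    values, so the loss is minimized when pred matches each target
    exactly, not when pred saturates at 0 or 1."""
    pred = np.clip(pred, eps, 1.0 - eps)  # guard against log(0)
    return -np.mean(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred))

soft_targets = np.array([0.0, 0.4, 1.0, 0.4, 0.0])
perfect = bce_loss(soft_targets, soft_targets)   # matching predictions
shifted = bce_loss(np.roll(soft_targets, 1), soft_targets)
print(perfect < shifted)  # True: exact match scores strictly better
```

Note that with soft targets the minimum of the loss is not zero (it equals the entropy of the target), but it is still attained exactly when the prediction equals the target.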
Numerical Results
The system exhibits strong performance on the MAESTRO dataset, achieving an onset F1 score of 96.72% and surpassing the earlier Onsets and Frames approach. The paper also reports robustness to label misalignment, showing that the regression-based approach degrades more gracefully under shifted labels than frame-based methods.
Additionally, the inclusion of a novel sustain pedal detection mechanism within the framework is noteworthy. The pedal transcription demonstrates an F1 score of 91.86%, providing a benchmark for future evaluations.
Implications and Future Work
This regression-based approach opens new avenues for high-resolution music transcription, with applications in music information retrieval, performance analysis, and intelligent music editing. The architecture is adaptable to future work on multi-instrument transcription and could potentially improve real-time transcription systems.
However, limitations such as dependency on audio quality and the need for system modifications for real-time applications are acknowledged. Future work may focus on expanding this framework to accommodate diverse musical contexts and instruments.
The release of the system’s source code encourages further validation and adaptation by the research community, fostering advancements in digital music processing.