Emergent Mind

Using Cell Phone Pictures of Sheet Music To Retrieve MIDI Passages

(2004.11724)
Published Apr 22, 2020 in cs.MM, cs.SD, eess.AS, and eess.IV

Abstract

This article investigates a cross-modal retrieval problem in which a user would like to retrieve a passage of music from a MIDI file by taking a cell phone picture of several lines of sheet music. This problem is challenging for two reasons: it has a significant runtime constraint since it is a user-facing application, and there is very little relevant training data containing cell phone images of sheet music. To solve this problem, we introduce a novel feature representation called a bootleg score, which encodes the position of noteheads relative to staff lines in sheet music. The MIDI representation can be converted into a bootleg score using deterministic rules of Western musical notation, and the sheet music image can be converted into a bootleg score using classical computer vision techniques for detecting simple geometrical shapes. Once the MIDI and cell phone image have been converted into bootleg scores, we can estimate the alignment using dynamic programming. The most notable characteristic of our system is that it has no trainable weights at all, only a set of about 40 hyperparameters. With a training set of just 400 images, we show that our system generalizes well to a much larger set of 1600 test images from 160 unseen musical scores. Our system achieves a test F-measure of 0.89, has an average runtime of 0.90 seconds, and outperforms baseline systems based on music object detection and sheet-audio alignment. We provide extensive experimental validation and analysis of our system.
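
The two rule-based steps named in the abstract, converting MIDI notes to notehead positions and aligning two such sequences with dynamic programming, can be illustrated with a minimal sketch. It is not the paper's implementation: the row range (`LOW`, `HIGH`), the Hamming-distance cost, the subsequence-DTW step pattern, and all function names are assumptions chosen for illustration, and the image-processing side is omitted entirely.

```python
# Simplified illustration only. The row layout, cost function, and step
# pattern below are assumptions, not the paper's actual design.

# Diatonic step within an octave for each of the 12 pitch classes.
# A sharpened note shares the line/space of its natural (C# sits where C does).
PITCH_CLASS_TO_STEP = [0, 0, 1, 1, 2, 3, 3, 4, 4, 5, 5, 6]


def staff_position(midi_pitch):
    """Diatonic position of a MIDI pitch, counted in staff steps from MIDI 0."""
    octave, pitch_class = divmod(midi_pitch, 12)
    return 7 * octave + PITCH_CLASS_TO_STEP[pitch_class]


# Hypothetical vertical range of the representation (MIDI 36 to MIDI 84).
LOW = staff_position(36)
HIGH = staff_position(84)


def bootleg_column(midi_pitches):
    """Binary column with a 1 at every staff position holding a notehead."""
    column = [0] * (HIGH - LOW + 1)
    for pitch in midi_pitches:
        pos = staff_position(pitch)
        if LOW <= pos <= HIGH:
            column[pos - LOW] = 1
    return column


def hamming(a, b):
    """Number of rows in which two binary columns differ."""
    return sum(x != y for x, y in zip(a, b))


def subsequence_align(query, reference):
    """Subsequence DTW: find the reference span that best matches the query.

    Returns (start_index, end_index, total_cost) with inclusive indices
    into the reference sequence.
    """
    n, m = len(query), len(reference)
    cost = [[0] * m for _ in range(n)]
    start = [[0] * m for _ in range(n)]
    # The query may begin at any reference column at no extra cost.
    for j in range(m):
        cost[0][j] = hamming(query[0], reference[j])
        start[0][j] = j
    for i in range(1, n):
        for j in range(m):
            candidates = [(cost[i - 1][j], start[i - 1][j])]
            if j > 0:
                candidates.append((cost[i - 1][j - 1], start[i - 1][j - 1]))
                candidates.append((cost[i][j - 1], start[i][j - 1]))
            best_cost, best_start = min(candidates)
            cost[i][j] = best_cost + hamming(query[i], reference[j])
            start[i][j] = best_start
    # The query may also end at any reference column.
    end = min(range(m), key=lambda j: cost[n - 1][j])
    return start[n - 1][end], end, cost[n - 1][end]


# A C major scale as the "MIDI" reference; the query is its 4th to 6th notes.
reference = [bootleg_column([p]) for p in (60, 62, 64, 65, 67, 69, 71, 72)]
query = reference[3:6]
match = subsequence_align(query, reference)
```

In this toy setting `match` holds the start and end columns of the best-matching reference span together with its accumulated cost. A nonnegative cost such as the Hamming distance is used here so that longer warping paths are never rewarded for their length alone.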
