Pop Music Transformer: Beat-based Modeling and Generation of Expressive Pop Piano Compositions (2002.00212v3)

Published 1 Feb 2020 in cs.SD, cs.AI, eess.AS, and stat.ML

Abstract: A great number of deep learning based models have been recently proposed for automatic music composition. Among these models, the Transformer stands out as a prominent approach for generating expressive classical piano performance with a coherent structure of up to one minute. The model is powerful in that it learns abstractions of data on its own, without much human-imposed domain knowledge or constraints. In contrast with this general approach, this paper shows that Transformers can do even better for music modeling, when we improve the way a musical score is converted into the data fed to a Transformer model. In particular, we seek to impose a metrical structure in the input data, so that Transformers can be more easily aware of the beat-bar-phrase hierarchical structure in music. The new data representation maintains the flexibility of local tempo changes, and provides hurdles to control the rhythmic and harmonic structure of music. With this approach, we build a Pop Music Transformer that composes Pop piano music with better rhythmic structure than existing Transformer models.

Citations (39)

Summary

  • The paper introduces REMI, a beat-based MIDI event representation that improves rhythmic and harmonic modeling in Transformer-based music generation.
  • It employs a structured grid with Bar, Position, Tempo, and Chord events to capture metrical and harmonic nuances.
  • Experimental evaluations show superior rhythmic consistency and listener preference over baseline Music Transformer models.

Pop Music Transformer: Beat-based Modeling and Generation of Expressive Pop Piano Compositions

The paper presents an approach to automatic music composition using a Transformer architecture tailored to generating expressive Pop piano music. The primary contribution is REMI, a revamped MIDI-derived event representation that makes rhythmic and harmonic structure explicit in the sequences fed to the Transformer model.

Key Contributions

The integration of a beat-based data representation is the central theme of the paper. The authors argue that traditional MIDI-like representations, while effective for certain tasks, leave the metrical structure of music implicit. REMI addresses this by introducing new events, such as Bar and Position, that align with musical bars and beats, enabling the Transformer to capture rhythmic regularity more readily. Additional Tempo and Chord events are included to further support expressive timing and harmonic control.
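
To make the representation concrete, the following is an illustrative REMI-style event stream for one bar. The event types follow the paper's vocabulary, but the specific values (tempo, chord, pitches, velocities, durations) are invented here purely for illustration:

    Bar
    Position(1/16)  Tempo Class(mid)  Tempo Value(10)
    Position(1/16)  Chord(C:maj)
    Position(1/16)  Note Velocity(16)  Note On(60)  Note Duration(8)
    Position(5/16)  Note Velocity(14)  Note On(64)  Note Duration(4)
    Position(9/16)  Note Velocity(15)  Note On(67)  Note Duration(8)
    Bar

Every note (and every tempo or chord change) is preceded by a Position event marking where in the bar it falls, which is what gives the model an explicit metrical grid to attend to.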

Technical Details

The REMI framework changes how music is encoded for language modeling with a Transformer-XL architecture. Key distinctions from the MIDI-like baseline representation include (a code sketch of the encoding follows the list):

  1. Note-On and Note Duration: The REMI representation uses Note Duration instead of Note-Off, explicitly defining note lengths and improving rhythmic representation.
  2. Position and Bar Events: These are included to provide a structured grid that encapsulates the hierarchical beat-bar structure, aiding the model in capturing rhythmic regularity.
  3. Tempo and Chord Events: Tempo changes and chord progressions are explicitly represented, allowing for expressive variability and harmonic control in generated compositions.
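
As a rough sketch of how such a sequence might be produced from symbolic data, the following Python function (an assumption for illustration, not the authors' released code) maps already-quantized notes onto Bar, Position, and note events; Tempo and Chord events, which require beat-level tempo estimation and chord recognition, are omitted here:

    # Illustrative REMI-style encoder (a sketch, not the authors' code).
    # Assumes notes are already quantized to a 16th-note grid in 4/4 time.
    from dataclasses import dataclass

    @dataclass
    class Note:
        start: int      # onset in 16th-note steps from the beginning of the piece
        duration: int   # length in 32nd-note multiples, as in the Note Duration events
        pitch: int      # MIDI pitch number
        velocity: int   # MIDI velocity (0-127)

    POSITIONS_PER_BAR = 16  # 16th-note resolution within a bar

    def encode_remi(notes):
        """Convert a list of quantized Notes into a REMI-like event sequence."""
        events = []
        current_bar = -1
        for note in sorted(notes, key=lambda n: n.start):
            bar, position = divmod(note.start, POSITIONS_PER_BAR)
            while current_bar < bar:              # emit one Bar event per new bar
                events.append("Bar")
                current_bar += 1
            events.append(f"Position({position + 1}/{POSITIONS_PER_BAR})")
            events.append(f"Note Velocity({note.velocity // 4})")  # 32 velocity bins (assumed)
            events.append(f"Note On({note.pitch})")
            events.append(f"Note Duration({note.duration})")
        return events

    # Example: two notes in the first bar
    print(encode_remi([Note(0, 8, 60, 64), Note(8, 4, 64, 80)]))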

To build the training data, the implementation combines music information retrieval (MIR) techniques, such as automatic piano transcription and beat/downbeat tracking, to convert audio recordings of Pop piano into structured event sequences suitable for the model.
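
To illustrate the shape of this preprocessing step, the sketch below uses librosa's beat tracker as a stand-in; the paper itself relies on dedicated transcription and downbeat-tracking models, so the tooling and the 4/4, 16th-note assumptions here are illustrative only:

    # Illustrative audio preprocessing sketch (assumed tooling, not the paper's exact pipeline).
    import bisect
    import librosa

    def estimate_beats(audio_path):
        """Estimate a beat grid (times in seconds) for a recording."""
        y, sr = librosa.load(audio_path)
        tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
        return tempo, librosa.frames_to_time(beat_frames, sr=sr)

    def quantize_onset(onset_time, beat_times, subdivisions=4):
        """Snap a transcribed note onset to the nearest 16th-note step,
        assuming four subdivisions per beat (a 16th-note grid in 4/4)."""
        i = min(max(bisect.bisect(beat_times, onset_time), 1), len(beat_times) - 1)
        left, right = beat_times[i - 1], beat_times[i]
        frac = (onset_time - left) / (right - left)
        return (i - 1) * subdivisions + int(round(frac * subdivisions))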

Experimental Evaluation

The evaluation framework included both objective and subjective assessments, with notable findings:

  • Objective Evaluation: The proposed model performed better on metrics related to rhythmic consistency, such as the standard deviation of estimated beat and downbeat positions, indicating improved beat stability and salience over the baseline models (a sketch of one such statistic follows this list).
  • Subjective Evaluation: Listening tests demonstrated a preference for the compositions generated by the REMI-based model over variations of the Music Transformer, suggesting enhanced perceptual coherence and pleasantness.
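
As a concrete example of such a statistic, one can render a generated piece to audio and measure how steady its estimated beat is. The snippet below computes the standard deviation of inter-beat intervals with librosa; it is in the spirit of the paper's rhythmic-consistency metrics but is not necessarily identical to them:

    # Assumed rhythmic-consistency statistic (not necessarily the paper's exact metric).
    import numpy as np
    import librosa

    def beat_interval_std(audio_path):
        """Standard deviation of inter-beat intervals estimated from rendered audio;
        lower values suggest a steadier perceived beat."""
        y, sr = librosa.load(audio_path)
        _, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
        beat_times = librosa.frames_to_time(beat_frames, sr=sr)
        return float(np.std(np.diff(beat_times)))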

Implications and Future Directions

Practically, this research could influence the development of more sophisticated AI-based music composition tools. Theoretically, it suggests that the incorporation of domain-specific structures in data representations can significantly improve the performance of general-purpose models like Transformers.

Looking forward, this methodology could be extended to multi-instrumental compositions by embedding broader musical knowledge such as groove and emotion. Furthermore, exploring architectures that can handle longer sequences might capture even more complex musical structures, thereby expanding the boundaries of machine-generated music.

Conclusion

The paper effectively demonstrates that embedding prior human knowledge of musical structures through sophisticated event representations can substantially enhance the capabilities of Transformer models for music composition. This approach invites further exploration into domain-specific modifications of neural architectures to better align computational models with task characteristics and domain requirements.
