ChatMusician: Understanding and Generating Music Intrinsically with LLM (2402.16153v1)
Abstract: While LLMs demonstrate impressive capabilities in text generation, we find that this ability has yet to generalize to music, humanity's creative language. We introduce ChatMusician, an open-source LLM with intrinsic musical abilities. It is built by continually pre-training and finetuning LLaMA2 on a text-compatible music representation, ABC notation, treating music as a second language. ChatMusician can understand and generate music with a pure text tokenizer, without any external multi-modal neural structures or tokenizers. Interestingly, endowing the model with musical abilities does not harm its language abilities; it even achieves a slightly higher MMLU score. Our model can compose well-structured, full-length music conditioned on texts, chords, melodies, motifs, musical forms, etc., surpassing the GPT-4 baseline. On our meticulously curated college-level music understanding benchmark, MusicTheoryBench, ChatMusician surpasses LLaMA2 and GPT-3.5 in the zero-shot setting by a noticeable margin. Our work reveals that LLMs can be an excellent compressor for music, but significant territory remains to be conquered. We release our 4B-token music-language corpus MusicPile, the collected MusicTheoryBench, code, model, and demo on GitHub.
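Because ABC notation is plain text, generation can in principle be driven through a standard causal-LM text interface with no music-specific tokenizer. Below is a minimal sketch assuming a Hugging Face Transformers workflow; the checkpoint ID `m-a-p/ChatMusician`, the prompt wording, and the sampling settings are illustrative assumptions, not the project's documented usage.

```python
# Minimal sketch (not the project's documented inference code): driving ABC-notation
# generation through a standard Hugging Face causal-LM interface. The checkpoint ID,
# prompt wording, and sampling settings below are assumptions for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "m-a-p/ChatMusician"  # assumed checkpoint name; check the project page
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# ABC notation is plain text, so the ordinary text tokenizer handles the music.
prompt = (
    "Develop a melody in D major and 6/8 time that continues this motif:\n"
    "X:1\n"
    "L:1/8\n"
    "M:6/8\n"
    "K:D\n"
    "|: A,FA dfa |"
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.9,
)

# Strip the prompt tokens and print only the newly generated ABC continuation.
generated = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(generated, skip_special_tokens=True))
```

The sketch illustrates the paper's central claim: music input and output are ordinary text to the model, and the returned ABC can then be rendered to a score or MIDI with standard ABC tooling.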
- Musiclm: Generating music from text. arXiv preprint arXiv:2301.11325.
- Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712.
- CCARH at Stanford University. 2023. A library of virtual musical scores in the humdrum **kern data format.
- A systematic review of artificial intelligence-based music generation: Scope, applications, and future trends. Expert Systems with Applications, page 118190.
- Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- Free dolly: Introducing the world’s first truly open instruction-tuned llm.
- Simple and controllable music generation. arXiv preprint arXiv:2306.05284.
- What is missing in deep music generation? A study of repetition and structure in popular music. arXiv preprint arXiv:2209.00182.
- Jukebox: A generative model for music. arXiv preprint arXiv:2005.00341.
- Singsong: Generating musical accompaniments from singing. arXiv preprint arXiv:2301.12662.
- High fidelity neural audio compression.
- Chessgpt: Bridging policy learning and language modeling. arXiv preprint arXiv:2306.09200.
- The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027.
- Measuring massive multitask language understanding. In International Conference on Learning Representations.
- Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
- Music transformer: Generating music with long-term structure. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net. Also available as arXiv preprint arXiv:1809.04281.
- Yu-Siang Huang and Yi-Hsuan Yang. 2020. Pop music transformer: Beat-based modeling and generation of expressive pop piano compositions. In MM ’20: The 28th ACM International Conference on Multimedia, Virtual Event / Seattle, WA, USA, October 12-16, 2020, pages 1180–1188. ACM. Also available as CoRR, abs/2002.00212.
- Harsh Jhamtani and Taylor Berg-Kirkpatrick. 2019. Modeling self-repetition in music generation using generative adversarial networks. In Machine Learning for Music Discovery Workshop, ICML.
- Matthew Kenney. 2023. arxiv-math-instruct-50.
- Camel: Communicative agents for "mind" exploration of large scale language model society.
- Emergent world representations: Exploring a sequence model trained on a synthetic task. arXiv preprint arXiv:2210.13382.
- LinkSoul-AI. 2023. LinkSoul/instruction_merge_set. https://huggingface.co/datasets/LinkSoul/instruction_merge_set.
- Musecoco: Generating symbolic music from text. arXiv preprint arXiv:2306.00110.
- Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583.
- Elizabeth Hellmuth Margulis and Rhimmon Simchy-Gross. 2016. Repetition enhances the musicality of randomly generated tone sequences. Music Perception: An Interdisciplinary Journal, 33(4):509–514.
- Nobuo Masataka. 2007. Music, evolution and language. Developmental science, 10(1):35–39.
- Nobuo Masataka. 2009. The origins of language and the evolution of music: A comparative perspective. Physics of Life Reviews, 6(1):11–22.
- This time with feeling: Learning expressive musical performance. CoRR, abs/1808.03715.
- Christine Payne. 2019. MuseNet. OpenAI Blog. https://openai.com/research/musenet.
- The refinedweb dataset for falcon llm: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116.
- Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277.
- The association between music and language in children: A state-of-the-art review. Children, 10(5):801.
- Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3505–3506.
- Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950.
- Folk music style modelling by recurrent neural networks with long short term memory units. In 16th International Society for Music Information Retrieval Conference.
- Anticipatory music transformer. arXiv preprint arXiv:2306.08620.
- Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- Attention is all you need. Advances in neural information processing systems, 30.
- Openchat: Advancing open-source language models with mixed-quality data. arXiv preprint arXiv:2309.11235.
- How far can camels go? exploring the state of instruction tuning on open resources. arXiv preprint arXiv:2306.04751.
- Self-instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560.
- Wikipedia contributors. 2023. Wikipedia database.
- Chord-conditioned melody harmonization with controllable harmonicity. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE.
- Shangda Wu and Maosong Sun. 2022. Exploring the efficacy of pre-trained checkpoints in text-to-music generation task. arXiv preprint arXiv:2211.11216.
- Shangda Wu and Maosong Sun. 2023. Tunesformer: Forming tunes with control codes. arXiv preprint arXiv:2301.02884.
- Natural language reasoning, a survey. arXiv preprint arXiv:2303.14725.
- Marble: Music audio representation benchmark for universal evaluation. arXiv preprint arXiv:2306.10548.
- Mammoth: Building math generalist models through hybrid instruction tuning. arXiv preprint arXiv:2309.05653.
- A survey of large language models. arXiv preprint arXiv:2303.18223.
- Video background music generation: Dataset, method and evaluation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15637–15647.