Towards audio language modeling -- an overview (2402.13236v1)
Published 20 Feb 2024 in eess.AS and cs.SD
Abstract: Neural audio codecs were initially introduced to compress audio data into compact codes to reduce transmission latency. Researchers recently discovered the potential of codecs as suitable tokenizers for converting continuous audio into discrete codes, which can be employed to develop audio language models (LMs). Numerous high-performance neural audio codecs and codec-based LMs have been developed. This paper aims to provide a thorough and systematic overview of neural audio codec models and codec-based LMs.
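Most of the codecs surveyed below (e.g., SoundStream, EnCodec, HiFi-Codec, and the improved RVQGAN) discretize audio with residual vector quantization (RVQ): an encoder maps the waveform to latent frames, and each quantizer stage encodes the residual left over by the previous stage, so every frame becomes a short stack of codebook indices. The sketch below illustrates only that mechanism; the codebooks are random stand-ins for the learned ones, and all names and sizes are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy residual vector quantizer (RVQ). Real codecs learn these codebooks
# jointly with a convolutional encoder/decoder; random codebooks here only
# illustrate the encode/decode mechanics. Sizes are arbitrary.
DIM, CODEBOOK_SIZE, N_STAGES = 8, 16, 4
codebooks = rng.normal(size=(N_STAGES, CODEBOOK_SIZE, DIM))

def rvq_encode(frame):
    """Map one latent frame to N_STAGES discrete code indices."""
    residual, codes = frame, []
    for cb in codebooks:
        idx = int(np.argmin(((cb - residual) ** 2).sum(axis=1)))  # nearest codeword
        codes.append(idx)
        residual = residual - cb[idx]  # next stage quantizes what this one missed
    return codes

def rvq_decode(codes):
    """Sum the chosen codeword from each stage to reconstruct the frame."""
    return sum(cb[idx] for cb, idx in zip(codebooks, codes))

frame = rvq_encode.__defaults__ or rng.normal(size=DIM)  # one encoder-output frame
frame = rng.normal(size=DIM)
codes = rvq_encode(frame)                  # e.g. [3, 11, 0, 7] -- the "audio tokens"
error = np.linalg.norm(frame - rvq_decode(codes))
print(codes, error)                        # reconstruction error shrinks as stages are added
```

Interleaving these per-frame index stacks over time is what turns audio into a token sequence that codec-based LMs such as AudioLM can model with ordinary next-token prediction; codebook patterns like the stack-and-delay scheme cited below are different choices of that serialization order.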
- Alexandre Défossez et al., “High fidelity neural audio compression,” arXiv preprint arXiv:2210.13438, 2022.
- Neil Zeghidour et al., “Soundstream: An end-to-end neural audio codec,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 495–507, 2021.
- Zalán Borsos et al., “Soundstorm: Efficient parallel audio generation,” arXiv preprint arXiv:2305.09636, 2023.
- Yi-Chiao Wu et al., “Audiodec: An open-source streaming high-fidelity neural audio codec,” in 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
- Dongchao Yang et al., “Hifi-codec: Group-residual vector quantization for high fidelity audio codec,” arXiv preprint arXiv:2305.02765, 2023.
- Zhihao Du et al., “Funcodec: A fundamental, reproducible and integrable open-source toolkit for neural speech codec,” arXiv preprint arXiv:2309.07405, 2023.
- Xin Zhang et al., “Speechtokenizer: Unified speech tokenizer for speech large language models,” arXiv preprint arXiv:2308.16692, 2023.
- Rithesh Kumar et al., “High-fidelity audio compression with improved rvqgan,” arXiv preprint arXiv:2306.06546, 2023.
- Zalán Borsos et al., “Audiolm: A language modeling approach to audio generation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
- Paul K Rubenstein et al., “Audiopalm: A large language model that can speak and listen,” arXiv preprint arXiv:2306.12925, 2023.
- Andrea Agostinelli et al., “Musiclm: Generating music from text,” arXiv preprint arXiv:2301.11325, 2023.
- Chengyi Wang et al., “Neural codec language models are zero-shot text to speech synthesizers,” arXiv preprint arXiv:2301.02111, 2023.
- Ziqiang Zhang et al., “Speak foreign languages with your own voice: Cross-lingual neural codec language modeling,” arXiv preprint arXiv:2303.03926, 2023.
- Tianrui Wang et al., “Viola: Unified codec language models for speech recognition, synthesis, and translation,” arXiv preprint arXiv:2305.16107, 2023.
- Dongchao Yang et al., “Uniaudio: An audio foundation model toward universal audio generation,” arXiv preprint arXiv:2310.00704, 2023.
- Qian Chen et al., “Lauragpt: Listen, attend, understand, and regenerate audio with gpt,” arXiv preprint arXiv:2310.04673, 2023.
- Xiaofei Wang et al., “Speechx: Neural codec language model as a versatile speech transformer,” arXiv preprint arXiv:2308.06873, 2023.
- Jade Copet et al., “Simple and controllable music generation,” arXiv preprint arXiv:2306.05284, 2023.
- Gael Le Lan et al., “Stack-and-delay: a new codebook pattern for music generation,” arXiv preprint arXiv:2309.08804, 2023.
- Felix Kreuk et al., “Audiogen: Textually guided audio generation,” arXiv preprint arXiv:2209.15352, 2022.
- Jean-Marc Valin et al., “RFC 6716: Definition of the Opus audio codec,” 2012.
- Martin Dietz et al., “Overview of the evs codec architecture,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5698–5702.
- Marco Tagliasacchi et al., “Seanet: A multi-modal speech enhancement network,” arXiv preprint arXiv:2009.02095, 2020.
- Kundan Kumar et al., “Melgan: Generative adversarial networks for conditional waveform synthesis,” Advances in Neural Information Processing Systems, vol. 32, 2019.
- Sepp Hochreiter and Jürgen Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
- Ashish Vaswani et al., “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.
- Jungil Kong et al., “Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis,” Advances in Neural Information Processing Systems, vol. 33, pp. 17022–17033, 2020.
- Wei-Ning Hsu et al., “Hubert: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021.
- Liu Ziyin et al., “Neural networks fail to learn periodic functions and how to fix it,” Advances in Neural Information Processing Systems, vol. 33, pp. 1583–1594, 2020.
- Sang-gil Lee et al., “Bigvgan: A universal neural vocoder with large-scale training,” arXiv preprint arXiv:2206.04658, 2022.
- Yu-An Chung et al., “W2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training,” in 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2021, pp. 244–250.
- Rohan Anil et al., “Palm 2 technical report,” arXiv preprint arXiv:2305.10403, 2023.
- Jinze Bai et al., “Qwen technical report,” arXiv preprint arXiv:2309.16609, 2023.
- Lili Yu et al., “Megabyte: Predicting million-byte sequences with multiscale transformers,” arXiv preprint arXiv:2305.07185, 2023.
- Qingqing Huang et al., “Mulan: A joint embedding of music audio and natural language,” arXiv preprint arXiv:2208.12415, 2022.
- Dan Wells et al., “Phonetic analysis of self-supervised representations of English speech,” in Proc. Interspeech 2022, 2022, pp. 3583–3587.
- Adam Polyak et al., “Speech resynthesis from discrete disentangled self-supervised representations,” in Proc. Interspeech 2021, 2021, pp. 3615–3619.
- Kushal Lakhotia et al., “On generative spoken language modeling from raw audio,” Transactions of the Association for Computational Linguistics, vol. 9, pp. 1336–1354, 2021.
- Eugene Kharitonov et al., “Text-free prosody-aware generative spoken language modeling,” arXiv preprint arXiv:2109.03264, 2021.
- Tu Anh Nguyen et al., “Generative spoken dialogue language modeling,” Transactions of the Association for Computational Linguistics, vol. 11, pp. 250–266, 2023.
- Michael Hassid et al., “Textually pretrained speech language models,” arXiv preprint arXiv:2305.13009, 2023.
- Sravya Popuri et al., “Enhanced direct speech-to-speech translation using self-supervised pre-training and data augmentation,” in Proc. Interspeech 2022, 2022, pp. 5195–5199.
- Alexei Baevski et al., “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in Neural Information Processing Systems, vol. 33, pp. 12449–12460, 2020.
- Hirofumi Inaguma et al., “Unity: Two-pass direct speech-to-speech translation with discrete units,” arXiv preprint arXiv:2212.08055, 2022.
- Loïc Barrault et al., “Seamlessm4t: Massively multilingual & multimodal machine translation,” arXiv preprint arXiv:2308.11596, 2023.
- Loïc Barrault et al., “Seamless: Multilingual expressive and streaming speech translation,” arXiv preprint arXiv:2312.05187, 2023.
- Kai-Wei Chang et al., “An exploration of prompt tuning on generative spoken language model for speech processing tasks,” in Proc. Interspeech 2022, 2022, pp. 5005–5009.
- Kai-Wei Chang et al., “Speechprompt v2: Prompt tuning for speech classification tasks,” arXiv preprint arXiv:2303.00733, 2023.
- Haibin Wu et al., “Speechgen: Unlocking the generative power of speech language models with prompts,” arXiv preprint arXiv:2306.02207, 2023.
- Ming-Hao Hsu et al., “An exploration of in-context learning for speech language model,” arXiv preprint arXiv:2310.12477, 2023.
- “Towards general-purpose text-instruction-guided voice conversion,” in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2023, pp. 1–8.
- Chien-yu Huang et al., “Dynamic-superb: Towards a dynamic, collaborative, and comprehensive instruction-tuning benchmark for speech,” arXiv preprint arXiv:2309.09510, 2023.