PSCodec: A Series of High-Fidelity Low-bitrate Neural Speech Codecs Leveraging Prompt Encoders (2404.02702v3)

Published 3 Apr 2024 in cs.SD and cs.AI

Abstract: Neural speech codecs have recently emerged as a focal point in the fields of speech compression and generation. Despite this progress, achieving high-quality speech reconstruction under low-bitrate scenarios remains a significant challenge. In this paper, we propose PSCodec, a series of neural speech codecs based on prompt encoders, comprising PSCodec-Base, PSCodec-DRL-ICT, and PSCodec-CasAN, which are capable of delivering high-performance speech reconstruction at low bandwidths. Specifically, we first introduce PSCodec-Base, which leverages a pretrained speaker-verification-model-based prompt encoder (VPP-Enc) and a learnable Mel-spectrogram-based prompt encoder (MelP-Enc) to effectively disentangle and integrate voiceprint and Mel-related features in utterances. To further enhance feature utilization efficiency, we propose PSCodec-DRL-ICT, incorporating a structural similarity (SSIM) based disentangled representation loss (DRL) and an incremental continuous training (ICT) strategy. While PSCodec-DRL-ICT demonstrates impressive performance, its reliance on extensive hyperparameter tuning and multi-stage training makes it somewhat labor-intensive. To circumvent these limitations, we propose PSCodec-CasAN, utilizing an advanced cascaded attention network (CasAN) to enhance the representational capacity of the entire system. Extensive experiments show that our proposed PSCodec-Base, PSCodec-DRL-ICT, and PSCodec-CasAN all significantly outperform several state-of-the-art neural codecs, exhibiting substantial improvements in both speech reconstruction quality and speaker similarity under low-bitrate conditions.
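The abstract does not spell out how the SSIM-based disentangled representation loss is computed, so the following is only a minimal, hypothetical PyTorch sketch of such a penalty between the two prompt-encoder outputs: the names `ssim`, `disentangle_loss`, `vp_feat`, and `mel_feat`, and the choice of a global SSIM over feature vectors, are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch (not the authors' code): an SSIM-style penalty that
# discourages the voiceprint-prompt and Mel-prompt representations from
# carrying redundant information.
import torch
import torch.nn.functional as F


def ssim(x: torch.Tensor, y: torch.Tensor, c1: float = 1e-4, c2: float = 9e-4) -> torch.Tensor:
    """Global SSIM between two (batch, dim) feature tensors (assumption:
    each feature vector is treated as a 1-D signal)."""
    mu_x, mu_y = x.mean(dim=-1), y.mean(dim=-1)
    var_x = x.var(dim=-1, unbiased=False)
    var_y = y.var(dim=-1, unbiased=False)
    cov_xy = ((x - mu_x.unsqueeze(-1)) * (y - mu_y.unsqueeze(-1))).mean(dim=-1)
    num = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den


def disentangle_loss(vp_feat: torch.Tensor, mel_feat: torch.Tensor) -> torch.Tensor:
    """Penalize similarity between the two prompt representations."""
    # If the two encoders output different dimensions, pool to a common size
    # (an assumption made purely so the sketch is self-contained).
    if vp_feat.shape[-1] != mel_feat.shape[-1]:
        mel_feat = F.adaptive_avg_pool1d(mel_feat.unsqueeze(1), vp_feat.shape[-1]).squeeze(1)
    # Minimizing |SSIM| pushes the two branches toward carrying distinct features.
    return ssim(vp_feat, mel_feat).abs().mean()
```

In this sketch the DRL term would simply be added to the codec's reconstruction and adversarial losses with a tunable weight; the paper's actual weighting and training schedule (including the ICT strategy) are not described in the abstract.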

Authors (9)
  1. Yu Pan (155 papers)
  2. Lei Ma (197 papers)
  3. Jianjun Zhao (63 papers)
  4. Xiang Zhang (395 papers)
  5. Yuguang Yang (37 papers)
  6. Jixun Yao (36 papers)
  7. Yanni Hu (8 papers)
  8. Jianhao Ye (9 papers)
  9. Hongbin Zhou (28 papers)
Citations (6)
