BrainTalker: Low-Resource Brain-to-Speech Synthesis with Transfer Learning using Wav2Vec 2.0 (2312.13600v1)
Abstract: Decoding spoken speech from neural activity in the brain is a fast-emerging research topic, as it could enable communication for people who have difficulty producing audible speech. For this task, electrocorticography (ECoG) is a common method for recording brain activity with high temporal resolution and high spatial precision. However, because obtaining ECoG recordings requires a risky surgical procedure, relatively little of this data has been collected, and the amount is insufficient to train a neural-network-based Brain-to-Speech (BTS) system. To address this problem, we propose BrainTalker, a novel BTS framework that generates intelligible spoken speech from ECoG signals under extremely low-resource scenarios. We apply a transfer learning approach utilizing a pre-trained self-supervised model, Wav2Vec 2.0. Specifically, we train an encoder module to map ECoG signals to latent embeddings that match the Wav2Vec 2.0 representations of the corresponding spoken speech. These embeddings are then transformed into mel-spectrograms using stacked convolutional and transformer-based layers, which are fed into a neural vocoder to synthesize the speech waveform. Experimental results demonstrate that our proposed framework achieves outstanding performance on both subjective and objective metrics, including a Pearson correlation coefficient of 0.9 between generated and ground-truth mel-spectrograms. Demos and code are publicly available.
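To make the described pipeline concrete, below is a minimal PyTorch-style sketch of the training setup the abstract outlines: an ECoG encoder trained to match frozen Wav2Vec 2.0 features of the time-aligned speech, followed by a convolutional/transformer decoder that maps those embeddings to mel-spectrograms. All module names, layer configurations, and the choice of L1 losses are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a BrainTalker-style training setup, assuming an L1
# feature-matching loss against frozen Wav2Vec 2.0 features and an L1
# mel-spectrogram loss. Dimensions and layer choices are illustrative.
import torch
import torch.nn as nn


class ECoGEncoder(nn.Module):
    """Maps multichannel ECoG to Wav2Vec 2.0-sized latent embeddings."""

    def __init__(self, n_channels=64, d_model=768):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_channels, d_model, kernel_size=5, padding=2),
            nn.LeakyReLU(0.1),
            nn.Conv1d(d_model, d_model, kernel_size=5, padding=2),
        )

    def forward(self, ecog):                     # (batch, n_channels, time)
        return self.conv(ecog).transpose(1, 2)   # (batch, time, d_model)


class MelDecoder(nn.Module):
    """Stacked convolutional + transformer layers predicting mels."""

    def __init__(self, d_model=768, n_mels=80, n_layers=4):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.proj = nn.Linear(d_model, n_mels)

    def forward(self, z):                        # (batch, time, d_model)
        z = self.conv(z.transpose(1, 2)).transpose(1, 2)
        return self.proj(self.transformer(z))    # (batch, time, n_mels)


def training_losses(encoder, decoder, ecog, w2v_target, mel_target):
    """w2v_target: frozen Wav2Vec 2.0 features of the aligned speech;
    mel_target: ground-truth mel-spectrogram of the same utterance."""
    z = encoder(ecog)
    embed_loss = nn.functional.l1_loss(z, w2v_target)  # match Wav2Vec 2.0
    mel_loss = nn.functional.l1_loss(decoder(z), mel_target)
    return embed_loss + mel_loss
```

At inference time, the predicted mel-spectrogram would be passed to a pretrained neural vocoder (such as the Parallel WaveGAN or HiFi-GAN models cited below) to recover the waveform.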
- C. Herff, D. Heger, A. De Pesters, D. Telaar, P. Brunner, G. Schalk, and T. Schultz, “Brain-to-text: decoding spoken phrases from phone representations in the brain,” Frontiers in Neuroscience, vol. 9, p. 217, 2015.
- G. H. Wilson, S. D. Stavisky, F. R. Willett, D. T. Avansino, J. N. Kelemen, L. R. Hochberg, J. M. Henderson, S. Druckmann, and K. V. Shenoy, “Decoding spoken English from intracortical electrode arrays in dorsal precentral gyrus,” Journal of Neural Engineering, vol. 17, no. 6, p. 066007, 2020.
- Y.-E. Lee, S.-H. Lee, S.-H. Kim, and S.-W. Lee, “Towards voice reconstruction from EEG during imagined speech,” arXiv preprint arXiv:2301.07173, 2023.
- Y. A. Furman, V. Sevast’yanov, and K. Ivanov, “Modern problems of brain-signal analysis and approaches to their solution,” Pattern Recognition and Image Analysis, vol. 29, pp. 99–119, 2019.
- Z. Gao, W. Dang, X. Wang, X. Hong, L. Hou, K. Ma, and M. Perc, “Complex networks and deep learning for EEG signal analysis,” Cognitive Neurodynamics, vol. 15, pp. 369–388, 2021.
- J. S. Kumar and P. Bhuvaneswari, “Analysis of electroencephalography (EEG) signals and its categorization: a study,” Procedia Engineering, vol. 38, pp. 2525–2536, 2012.
- A. Dubey and S. Ray, “Cortical electrocorticogram (ECoG) is a local signal,” Journal of Neuroscience, vol. 39, no. 22, pp. 4299–4311, 2019.
- G. Buzsáki, C. A. Anastassiou, and C. Koch, “The origin of extracellular fields and currents: EEG, ECoG, LFP and spikes,” Nature Reviews Neuroscience, vol. 13, no. 6, pp. 407–420, 2012.
- G. K. Anumanchipalli, J. Chartier, and E. F. Chang, “Speech synthesis from neural decoding of spoken sentences,” Nature, vol. 568, no. 7753, pp. 493–498, 2019.
- A. Graves, “Long short-term memory,” in Supervised Sequence Labelling with Recurrent Neural Networks. Springer, 2012, pp. 37–45.
- M. Angrick, C. Herff, E. Mugler, M. C. Tate, M. W. Slutzky, D. J. Krusienski, and T. Schultz, “Speech synthesis from ECoG using densely connected 3D convolutional neural networks,” Journal of Neural Engineering, vol. 16, no. 3, p. 036019, 2019.
- G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4700–4708.
- A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio,” in The 9th ISCA Speech Synthesis Workshop, Sunnyvale, CA, USA, 13-15 September 2016. ISCA, 2016, p. 125.
- K. Shigemi, S. Komeiji, T. Mitsuhashi, Y. Iimura, H. Suzuki, H. Sugano, K. Shinoda, K. Yatabe, and T. Tanaka, “Synthesizing speech from ECoG with a combination of transformer-based encoder and neural vocoder,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
- R. Yamamoto, E. Song, and J.-M. Kim, “Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6199–6203.
- A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in Neural Information Processing Systems, vol. 33, pp. 12449–12460, 2020.
- J. Kong, J. Kim, and J. Bae, “HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis,” Advances in Neural Information Processing Systems, vol. 33, pp. 17022–17033, 2020.
- L. Ou, X. Gu, and Y. Wang, “Towards transfer learning of wav2vec 2.0 for automatic lyric transcription,” in International Society for Music Information Retrieval Conference, 2022.
- D. Kostas, S. Aroca-Ouellette, and F. Rudzicz, “BENDR: using transformers and a contrastive self-supervised learning task to learn from massive amounts of EEG data,” Frontiers in Human Neuroscience, vol. 15, p. 653659, 2021.
- J. Xu, Z. Li, B. Du, M. Zhang, and J. Liu, “Reluplex made more practical: Leaky ReLU,” in 2020 IEEE Symposium on Computers and Communications (ISCC). IEEE, 2020, pp. 1–7.
- R. Xiong, Y. Yang, D. He, K. Zheng, S. Zheng, C. Xing, H. Zhang, Y. Lan, L. Wang, and T. Liu, “On layer normalization in the transformer architecture,” in International Conference on Machine Learning. PMLR, 2020, pp. 10524–10533.
- K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
- Z. Piao, M. Kim, H. Yoon, and H.-G. Kang, “HappyQuokka system for ICASSP 2023 auditory EEG challenge,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–2.
- J. P. Rauschecker and S. K. Scott, “Maps and streams in the auditory cortex: nonhuman primates illuminate human speech processing,” Nature Neuroscience, vol. 12, no. 6, pp. 718–724, 2009.
- M. Shum, D. M. Shiller, S. R. Baum, and V. L. Gracco, “Sensorimotor integration for speech motor learning involves the inferior parietal cortex,” European Journal of Neuroscience, vol. 34, no. 11, pp. 1817–1822, 2011.
- F. Geranmayeh, S. L. Brownsett, R. Leech, C. F. Beckmann, Z. Woodhead, and R. J. Wise, “The contribution of the inferior parietal cortex to spoken language production,” Brain and Language, vol. 121, no. 1, pp. 47–57, 2012.
- D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations (ICLR), 2015.
- C. Veaux, J. Yamagishi, K. MacDonald et al., “CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit,” University of Edinburgh, The Centre for Speech Technology Research (CSTR), 2017.
- R. C. Streijl, S. Winkler, and D. S. Hands, “Mean opinion score (MOS) revisited: methods and applications, limitations and alternatives,” Multimedia Systems, vol. 22, no. 2, pp. 213–227, 2016.
- R. Kubichek, “Mel-cepstral distance measure for objective speech quality assessment,” in Proceedings of IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, vol. 1. IEEE, 1993, pp. 125–128.