
An Audio-textual Diffusion Model For Converting Speech Signals Into Ultrasound Tongue Imaging Data (2403.05820v2)

Published 9 Mar 2024 in cs.SD, cs.CL, and eess.AS

Abstract: Acoustic-to-articulatory inversion (AAI) converts audio into articulator movements, such as ultrasound tongue imaging (UTI) data. A limitation of existing AAI methods is that they rely only on personalized acoustic information to derive the general patterns of tongue motion, which limits the quality of the generated UTI data. To address this issue, this paper proposes an audio-textual diffusion model for the UTI data generation task. In this model, the inherent acoustic characteristics of individuals, which relate to the details of tongue motion, are encoded with wav2vec 2.0, while ASR transcriptions, which capture the universal patterns of tongue motion, are encoded with BERT. UTI data are then generated by a diffusion module. Experimental results showed that the proposed diffusion model can generate high-quality UTI data with clear tongue contours, which is crucial for linguistic analysis and clinical assessment. The project page is at https://yangyudong2020.github.io/wav2uti/
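The conditioning idea in the abstract can be sketched as follows. This is a minimal, hypothetical toy: random vectors stand in for the wav2vec 2.0 audio embedding and the BERT text embedding, a flattened 8x8 array stands in for a UTI frame, and a single linear map stands in for the denoising network. All names, shapes, and the cosine noise schedule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine_alphas(T=50):
    # Cosine noise schedule: cumulative signal fractions alpha_bar_t,
    # normalized so alpha_bar_0 = 1 and decreasing toward 0 at t = T.
    t = np.arange(T + 1) / T
    f = np.cos((t + 0.008) / 1.008 * np.pi / 2) ** 2
    return f / f[0]

def add_noise(x0, t, alphas_bar, rng):
    # Forward process: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps
    eps = rng.standard_normal(x0.shape)
    a = alphas_bar[t]
    return np.sqrt(a) * x0 + np.sqrt(1 - a) * eps, eps

def denoiser(x_t, t, audio_emb, text_emb, W):
    # Toy "network": predicts the noise eps from the noisy frame
    # concatenated with both conditioning embeddings and the timestep.
    cond = np.concatenate([x_t, audio_emb, text_emb, [float(t)]])
    return W @ cond

# Illustrative dimensions: one UTI "frame" is a flattened 8x8 image.
frame = rng.standard_normal(64)       # stand-in for a UTI frame
audio_emb = rng.standard_normal(16)   # stand-in for wav2vec 2.0 features
text_emb = rng.standard_normal(16)    # stand-in for BERT features
W = rng.standard_normal((64, 64 + 16 + 16 + 1)) * 0.01

alphas_bar = cosine_alphas()
x_t, eps = add_noise(frame, 25, alphas_bar, rng)
eps_hat = denoiser(x_t, 25, audio_emb, text_emb, W)
print(x_t.shape, eps_hat.shape)
```

In a real model, the denoiser would be a deep network trained to minimize the error between `eps_hat` and `eps`, and generation would run the learned reverse process from pure noise while keeping both embeddings fixed, which is how the personal (audio) and universal (text) cues jointly shape the generated frames.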

