Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
166 tokens/sec
GPT-4o
7 tokens/sec
Gemini 2.5 Pro Pro
42 tokens/sec
o3 Pro
4 tokens/sec
GPT-4.1 Pro
38 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

AV-RIR: Audio-Visual Room Impulse Response Estimation (2312.00834v2)

Published 30 Nov 2023 in cs.SD and cs.CV

Abstract: Accurate estimation of Room Impulse Response (RIR), which captures an environment's acoustic properties, is important for speech processing and AR/VR applications. We propose AV-RIR, a novel multi-modal multi-task learning approach to accurately estimate the RIR from a given reverberant speech signal and the visual cues of its corresponding environment. AV-RIR builds on a novel neural codec-based architecture that effectively captures environment geometry and materials properties and solves speech dereverberation as an auxiliary task by using multi-task learning. We also propose Geo-Mat features that augment material information into visual cues and CRIP that improves late reverberation components in the estimated RIR via image-to-RIR retrieval by 86%. Empirical results show that AV-RIR quantitatively outperforms previous audio-only and visual-only approaches by achieving 36% - 63% improvement across various acoustic metrics in RIR estimation. Additionally, it also achieves higher preference scores in human evaluation. As an auxiliary benefit, dereverbed speech from AV-RIR shows competitive performance with the state-of-the-art in various spoken language processing tasks and outperforms reverberation time error score in the real-world AVSpeech dataset. Qualitative examples of both synthesized reverberant speech and enhanced speech can be found at https://www.youtube.com/watch?v=tTsKhviukAE.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (96)
  1. About this reverberation business. 1977.
  2. Steam audio, 2018.
  3. Oculus spatializer, 2019.
  4. Microsoft project acoustics, 2019.
  5. Audiocite.net: Livres audio gratuits mp3, 2023.
  6. Nobuharu Aoshima. Computer‐generated pulse signal applied for sound measurement. Journal of the Acoustical Society of America, 69:1484–1488, 1981.
  7. Auditorium Acoustics and Architectural Design. The Journal of the Acoustical Society of America, 96(1):612–612, 1994.
  8. Boundary element methods in acoustics. 1991.
  9. Interactive sound propagation with bidirectional path tracing. ACM Trans. Graph., 35(6):180:1–180:11, 2016.
  10. The ami meeting corpus: A pre-announcement. In Proceedings of the Second International Conference on Machine Learning for Multimodal Interaction, page 28–39, Berlin, Heidelberg, 2005. Springer-Verlag.
  11. Matterport3d: Learning from rgb-d data in indoor environments. International Conference on 3D Vision (3DV), 2017.
  12. Soundspaces: Audio-visual navigaton in 3d environments. In ECCV, 2020.
  13. Visual acoustic matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18858–18868, 2022a.
  14. Soundspaces 2.0: A simulation platform for visual-acoustic learning. In NeurIPS 2022 Datasets and Benchmarks Track, 2022b.
  15. Novel-view acoustic synthesis. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6409–6419, 2023a.
  16. Learning audio-visual dereverberation. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2023b.
  17. Be everywhere - hear everything (bee): Audio scene reconstruction by sparse audio-visual samples. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 7853–7862, 2023c.
  18. Adverb: Visually guided audio dereverberation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 7884–7896, 2023.
  19. Room impulse response estimation using sparse online prediction and absolute loss. In 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, pages III–III, 2006.
  20. Simultaneous echo cancellation and car noise suppression employing a microphone array. In 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 239–242 vol.1, 1997.
  21. High fidelity neural audio compression. arXiv preprint arXiv:2210.13438, 2022.
  22. Real Time Speech Enhancement in the Waveform Domain. In Proc. Interspeech 2020, pages 3291–3295, 2020.
  23. Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation. ACM Trans. Graph., 37(4), 2018.
  24. Speech dereverberation using fully convolutional networks. In 2018 26th European Signal Processing Conference (EUSIPCO), pages 390–394, 2018.
  25. Angelo Farina. Advancements in impulse response measurements by sine sweeps. Journal of The Audio Engineering Society, 2007.
  26. Sparse modeling of the early part of noisy room impulse responses with sparse bayesian learning. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 586–590, 2022.
  27. Metricgan+: An improved version of metricgan for speech enhancement. arXiv preprint arXiv:2104.03538, 2021.
  28. Visual Speech Enhancement. In Proc. Interspeech 2018, pages 1170–1174, 2018.
  29. Digging into self-supervised monocular depth prediction. 2019.
  30. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
  31. Audio-visual speech enhancement using multimodal deep convolutional neural networks. IEEE Transactions on Emerging Topics in Computational Intelligence, 2(2):117–128, 2018.
  32. Tag2text: Guiding vision-language model via image tagging. arXiv preprint arXiv:2303.05657, 2023.
  33. Cortical adaptation to sound reverberation. eLife, 11:e75090, 2022.
  34. A multi-microphone signal subspace approach for speech enhancement. In 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221), pages 205–208 vol.1, 2001.
  35. Generative Adversarial Network-Based Postfilter for STFT Spectrograms. In Proc. Interspeech 2017, pages 3389–3393, 2017.
  36. Estimation of modal decay parameters from noisy response measurements. Journal of the Audio Engineering Society, 50(11):867–878, 2002.
  37. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  38. A summary of the reverb challenge: State-of-the-art and remaining challenges in reverberant speech processing research. Journal on Advances in Signal Processing, 2016, 2016.
  39. Segment anything. arXiv:2304.02643, 2023.
  40. Christoph Kling. Absorption coefficient database, 2018.
  41. Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis. In Advances in Neural Information Processing Systems, pages 17022–17033. Curran Associates, Inc., 2020.
  42. Complex-Valued Time-Frequency Self-Attention for Speech Dereverberation. In Proc. Interspeech 2022, pages 2543–2547, 2022a.
  43. Vinay Kothapally and John H. L. Hansen. Skipconvgan: Monaural speech dereverberation using generative adversarial networks via complex time-frequency masking. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:1600–1613, 2022b.
  44. Melgan: Generative adversarial networks for conditional waveform synthesis. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2019.
  45. Yet another generative model for room impulse response estimation. In 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 1–5, 2023.
  46. Virtual reality system with integrated sound field simulation and reproduction. EURASIP J. Adv. Signal Process, 2007(1):187, 2007.
  47. Real-time speech frequency bandwidth extension. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 691–695, 2021.
  48. Av-nerf: Learning neural fields for real-world audio-visual scene synthesis. arXiv preprint arXiv:2302.02088, 2023a.
  49. Neural acoustic context field: Rendering realistic room impulse response with neural fields. In NeurIPS, 2023b.
  50. Bayesian regularization and nonnegative deconvolution for room impulse response estimation. IEEE Transactions on Signal Processing, 54(3):839–847, 2006.
  51. VoiceFixer: A Unified Framework for High-Fidelity Speech Restoration. In Proc. Interspeech 2022, pages 4232–4236, 2022.
  52. Sound Synthesis, Propagation, and Rendering. Morgan & Claypool Publishers, 2022.
  53. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.
  54. Learning neural acoustic fields. In Advances in Neural Information Processing Systems, pages 3165–3177. Curran Associates, Inc., 2022.
  55. Few-shot audio-visual learning of environment acoustics. In Advances in Neural Information Processing Systems, 2022.
  56. Analysis of noise reduction and dereverberation techniques based on microphone arrays with postfiltering. IEEE Transactions on Speech and Audio Processing, 6(3):240–259, 1998.
  57. Wave-based sound propagation in large open scenes using an equivalent source formulation. ACM Trans. Graph., 32(2), 2013.
  58. Inverse filtering of room acoustics. IEEE Transactions on Acoustics, Speech, and Signal Processing, 36(2):145–152, 1988.
  59. Speech dereverberation based on variance-normalized delayed linear prediction. IEEE Transactions on Audio, Speech, and Language Processing, 18(7):1717–1731, 2010.
  60. Speech Dereverberation. Springer Publishing Company, Incorporated, 1st edition, 2010.
  61. Librispeech: An asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210, 2015.
  62. Tcnn: Temporal convolutional neural network for real-time speech enhancement in the time domain. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6875–6879, 2019.
  63. Machine-learning-based estimation of reverberation time using room geometry for room effect rendering. In Proceedings of the 23rd International Congress on Acoustics: integrating 4th EAA Euroregio, page 13, 2019.
  64. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  65. Interactive and Immersive Auralization. Springer, 2022.
  66. Precomputed wave simulation for real-time sound propagation of dynamic sources in complex scenes. ACM Trans. Graph., 29(4), 2010.
  67. Listen2Scene: Interactive material-aware binaural sound propagation for reconstructed 3D scenes. arXiv e-prints, art. arXiv:2302.02809, 2023.
  68. IR-GAN: room impulse response generator for far-field speech recognition. In Interspeech, pages 286–290. ISCA, 2021.
  69. Mesh2ir: Neural acoustic impulse response generator for complex 3d scenes. In Proceedings of the 30th ACM International Conference on Multimedia, page 924–933, New York, NY, USA, 2022a. Association for Computing Machinery.
  70. Fast-rir: Fast neural diffuse room impulse response generator. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 571–575, 2022b.
  71. Towards improved room impulse response estimation for speech recognition. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2023.
  72. SpeechBrain: A general-purpose speech toolkit, 2021. arXiv:2106.04624.
  73. Generating diverse high-fidelity images with vq-vae-2. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2019.
  74. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China, 2019. Association for Computational Linguistics.
  75. Small-rooms dedicated to music: From room response analysis to acoustic design. Journal of The Audio Engineering Society, 2016.
  76. Gsound: Interactive sound propagation for games. In Audio Engineering Society Conference: 41st International Conference: Audio for Games. Audio Engineering Society, 2011.
  77. Interactive sound propagation and rendering for large multi-source scenes. ACM Trans. Graph., 36(4), 2016.
  78. Acoustic classification and optimization for multi-modal rendering of real-world scenes. IEEE Transactions on Visualization and Computer Graphics, 24(3):1246–1259, 2018.
  79. M. R. Schroeder. Integrated‐impulse method measuring sound decay without using impulses. The Journal of the Acoustical Society of America, 66(2):497–500, 1979.
  80. M. R. Schroeder. New Method of Measuring Reverberation Time. The Journal of the Acoustical Society of America, 37(3):409–412, 2005.
  81. Image2reverb: Cross-modal reverb impulse response synthesis. In ICCV, pages 286–295. IEEE, 2021.
  82. Self-supervised visual acoustic matching. In NeurIPS, 2023.
  83. Filtered noise shaping for time domain room impulse response estimation from reverberant speech. In 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 221–225, 2021.
  84. HiFi-GAN: High-Fidelity Denoising and Dereverberation Based on Speech Deep Features in Adversarial Networks. In Proc. Interspeech 2020, pages 4506–4510, 2020.
  85. Horizonnet: Learning room layout with 1d representation and pano stretch data augmentation. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1047–1056, 2019.
  86. Scene-aware audio rendering via deep acoustic analysis. IEEE Transactions on Visualization and Computer Graphics, 26(5):1991–2001, 2020.
  87. Gwa: A large high-quality acoustic dataset for audio processing. In ACM SIGGRAPH 2022 Conference Proceedings, New York, NY, USA, 2022. Association for Computing Machinery.
  88. Guided multiview ray tracing for fast auralization. IEEE Transactions on Visualization and Computer Graphics, 18(11):1797–1810, 2012.
  89. Neural discrete representation learning. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2017.
  90. A review of vector quantization techniques. IEEE Potentials, 25(4):39–47, 2006.
  91. Cross-domain diffusion based speech enhancement for very noisy speech. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2023.
  92. Audiodec: An open-source streaming high-fidelity neural audio codec. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2023.
  93. Hifi-codec: Group-residual vector quantization for high fidelity audio codec. arXiv preprint arXiv:2305.02765, 2023.
  94. Audio-visual speech codecs: Rethinking audio-visual speech enhancement by re-synthesis. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8217–8227, 2022.
  95. Wave-ray coupling for interactive sound propagation in large complex scenes. ACM Trans. Graph., 32(6), 2013.
  96. Soundstream: An end-to-end neural audio codec. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:495–507, 2022.
Citations (8)

Summary

We haven't generated a summary for this paper yet.