sVAD: A Robust, Low-Power, and Light-Weight Voice Activity Detection with Spiking Neural Networks (2403.05772v1)
Abstract: Speech applications are expected to be low-power and robust under noisy conditions. An effective Voice Activity Detection (VAD) front-end lowers the computational demand. Spiking Neural Networks (SNNs) are known to be biologically plausible and power-efficient. However, SNN-based VADs have yet to achieve noise robustness and often require large models for high performance. This paper introduces a novel SNN-based VAD model, referred to as sVAD, which features an auditory encoder with an SNN-based attention mechanism. In particular, it provides effective auditory feature representation through SincNet and 1D convolution, and improves noise robustness with attention mechanisms. The classifier utilizes Spiking Recurrent Neural Networks (sRNN) to exploit temporal speech information. Experimental results demonstrate that our sVAD achieves remarkable noise robustness while maintaining low power consumption and a small footprint, making it a promising solution for real-world VAD applications.
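The abstract outlines a three-stage pipeline: a SincNet-style front-end over raw audio, a 1D convolutional encoder with attention, and a spiking recurrent classifier producing per-frame speech/non-speech decisions. Below is a minimal PyTorch sketch of that structure. The layer sizes, the leaky integrate-and-fire (LIF) dynamics, and the use of a plain learnable Conv1d as a stand-in for SincNet's parameterized band-pass filters are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class LIF(nn.Module):
    """Leaky integrate-and-fire layer; emits binary spikes each time step.
    Note: training such a hard threshold normally requires a surrogate gradient."""
    def __init__(self, beta=0.9, threshold=1.0):
        super().__init__()
        self.beta, self.threshold = beta, threshold

    def forward(self, x, mem):
        mem = self.beta * mem + x                      # leaky integration
        spk = (mem >= self.threshold).float()          # hard threshold
        mem = mem - spk * self.threshold               # soft reset
        return spk, mem

class SVADSketch(nn.Module):
    """Hypothetical sketch of the sVAD pipeline: front-end -> conv + attention -> sRNN."""
    def __init__(self, n_filters=64, hidden=128, kernel=251):
        super().__init__()
        # Stand-in for SincNet: a learnable Conv1d over the raw waveform.
        self.frontend = nn.Conv1d(1, n_filters, kernel_size=kernel, stride=80)
        self.conv = nn.Conv1d(n_filters, n_filters, kernel_size=5, padding=2)
        # Channel attention to emphasize speech-dominant filter bands.
        self.attn = nn.Sequential(nn.Linear(n_filters, n_filters), nn.Sigmoid())
        # Spiking recurrent classifier.
        self.rec_in = nn.Linear(n_filters, hidden)
        self.rec_hh = nn.Linear(hidden, hidden, bias=False)
        self.lif = LIF()
        self.readout = nn.Linear(hidden, 2)            # speech / non-speech

    def forward(self, wav):                            # wav: (batch, samples)
        feats = torch.relu(self.conv(torch.relu(self.frontend(wav.unsqueeze(1)))))
        feats = feats.transpose(1, 2)                  # (batch, frames, filters)
        feats = feats * self.attn(feats.mean(dim=1, keepdim=True))
        spk = mem = torch.zeros(wav.size(0), self.rec_hh.in_features, device=wav.device)
        logits = []
        for t in range(feats.size(1)):                 # unroll the sRNN over frames
            cur = self.rec_in(feats[:, t]) + self.rec_hh(spk)
            spk, mem = self.lif(cur, mem)
            logits.append(self.readout(spk))
        return torch.stack(logits, dim=1)              # per-frame VAD logits

# Example: per-frame speech/non-speech scores for one second of 16 kHz audio.
model = SVADSketch()
scores = model(torch.randn(2, 16000))
print(scores.shape)                                    # (2, frames, 2)
```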