Detection of Deepfake Environmental Audio (2403.17529v2)
Abstract: With the ever-rising quality of deep generative models, it is increasingly important to be able to discern whether the audio data at hand have been recorded or synthesized. Although the detection of fake speech signals has been studied extensively, this is not the case for the detection of fake environmental audio. We propose a simple and efficient pipeline for detecting fake environmental sounds based on the CLAP audio embedding. We evaluate this detector using audio data from the 2023 DCASE challenge task on Foley sound synthesis. Our experiments show that fake sounds generated by 44 state-of-the-art synthesizers can be detected on average with 98% accuracy. We show that using an audio embedding learned on environmental audio is beneficial over a standard VGGish one, as it yields a 10% increase in detection performance. Informal listening to false-negative examples (fake sounds the detector misses) reveals audible features of fakeness, such as distortion and implausible background noise, that the detector fails to exploit.
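The abstract describes the pipeline only at a high level: a frozen CLAP audio embedding followed by a shallow real-vs-fake classifier. The sketch below illustrates one way such a detector could be wired up; it is a minimal sketch, not the authors' implementation. It assumes the `msclap` package (Elizalde et al.'s CLAP) for the embeddings and substitutes a scikit-learn logistic regression for the downstream classifier; `real_files` and `fake_files` are hypothetical caller-supplied lists of audio paths.

```python
# Hedged sketch of a CLAP-embedding-based real/fake audio detector.
# Assumptions (mine, not the paper's): the `msclap` package supplies the
# pretrained CLAP model, and LogisticRegression stands in for whatever
# shallow classifier sits on top of the embeddings.
import numpy as np
from msclap import CLAP  # pip install msclap
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def embed(clap: CLAP, files: list[str]) -> np.ndarray:
    """Return one CLAP embedding per audio file (one row per file)."""
    emb = clap.get_audio_embeddings(files, resample=True)
    return emb.detach().cpu().numpy()


def train_detector(real_files: list[str], fake_files: list[str]) -> float:
    """Fit a real-vs-fake classifier on CLAP embeddings; return held-out accuracy."""
    clap = CLAP(version="2023", use_cuda=False)  # load the pretrained model once
    X = np.vstack([embed(clap, real_files), embed(clap, fake_files)])
    y = np.array([0] * len(real_files) + [1] * len(fake_files))  # label 1 = fake
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0
    )
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))
```

A shallow classifier suffices in this sketch because the heavy lifting is done by the embedding; that division of labor is presumably why the abstract reports a 10% gain from swapping VGGish features for CLAP, which was trained on environmental audio.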
- H. Tak, M. Todisco, X. Wang, J.-w. Jung, J. Yamagishi, and N. Evans, “Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation,” in Proc. The Speaker and Language Recognition Workshop (Odyssey 2022), Beijing, China, 2022.
- T. Zhang, “Deepfake generation and detection, a survey,” Multimedia Tools and Applications, vol. 81, no. 5, pp. 6259–6276, 2022.
- Z. Wu, N. Evans, T. Kinnunen, J. Yamagishi, F. Alegre, and H. Li, “Spoofing and countermeasures for speaker verification: A survey,” Speech Communication, vol. 66, pp. 130–153, 2015.
- S. Das, S. Seferbekov, A. Datta, M. S. Islam, and M. R. Amin, “Towards solving the deepfake problem: An analysis on improving deepfake detection using dynamic face augmentation,” in 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW). IEEE, 2021, pp. 3769–3778.
- Y. Zhao, J. Yi, J. Tao, C. Wang, C. Zhang, T. Wang, and Y. Dong, “EmoFake: An initial dataset for emotion fake audio detection,” arXiv preprint, Nov. 2022.
- J. Yi, C. Wang, J. Tao, Z. Tian, C. Fan, H. Ma, and R. Fu, “SceneFake: An initial dataset and benchmarks for scene fake audio detection,” arXiv preprint arXiv:2211.06073, 2022.
- J. Yi, Y. Bai, J. Tao, H. Ma, Z. Tian, C. Wang, T. Wang, and R. Fu, “Half-Truth: A Partially Fake Audio Detection Dataset,” in Proc. Interspeech 2021, 2021, pp. 1654–1658.
- B. Elizalde, S. Deshmukh, M. Al Ismail, and H. Wang, “CLAP: Learning audio concepts from natural language supervision,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
- K. Choi, J. Im, L. Heller, B. McFee, K. Imoto, Y. Okamoto, M. Lagrange, and S. Takamichi, “Foley sound synthesis at the DCASE 2023 challenge,” arXiv preprint, Apr. 2023.
- A. K. Singh and P. Singh, “Detection of AI-synthesized speech using cepstral & bispectral statistics,” in 2021 IEEE 4th International Conference on Multimedia Information Processing and Retrieval (MIPR). IEEE, 2021, pp. 412–417.
- C. Borrelli, P. Bestagini, F. Antonacci, A. Sarti, and S. Tubaro, “Synthetic speech detection through short-term and long-term prediction traces,” EURASIP Journal on Information Security, vol. 2021, no. 1, pp. 1–14, 2021.
- I. Altalahin, S. AlZu’bi, A. Alqudah, and A. Mughaid, “Unmasking the truth: A deep learning approach to detecting deepfake audio through MFCC features,” in 2023 International Conference on Information Technology (ICIT). IEEE, 2023, pp. 511–518.
- A. Qais, A. Rastogi, A. Saxena, A. Rana, and D. Sinha, “Deepfake audio detection with neural networks using audio features,” in 2022 International Conference on Intelligent Controller and Computing for Smart Power (ICICCSP). IEEE, 2022, pp. 1–6.
- D. M. Ballesteros, Y. Rodriguez-Ortega, D. Renza, and G. Arce, “Deep4SNet: Deep learning for fake speech classification,” Expert Systems with Applications, vol. 184, p. 115465, 2021.
- K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in 3rd International Conference on Learning Representations (ICLR 2015). Computational and Biological Learning Society, 2015, pp. 1–14.
- Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, and M. D. Plumbley, “PANNs: Large-scale pretrained audio neural networks for audio pattern recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2880–2894, 2020.
- Y. Yuan, H. Liu, X. Liu, X. Kang, M. D. Plumbley, and W. Wang, “Latent diffusion model based foley sound generation system for DCASE challenge 2023 task 7,” arXiv preprint arXiv:2305.15905, 2023.
- H. V. Pham, S. Qian, J. Wang, T. Lutellier, J. Rosenthal, L. Tan, Y. Yu, and N. Nagappan, “Problems and opportunities in training deep learning software systems: An analysis of variance,” in 2020 35th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2020, pp. 771–783.