PITCH: AI-assisted Tagging of Deepfake Audio Calls using Challenge-Response (2402.18085v4)
Abstract: The rise of AI voice-cloning technology, particularly audio Real-time Deepfakes (RTDFs), has intensified social engineering attacks by enabling real-time voice impersonation that bypasses conventional enrollment-based authentication. This technology represents an existential threat to phone-based authentication systems, while total identity fraud losses reached $43 billion. Unlike traditional robocalls, these personalized AI-generated voice attacks target high-value accounts and circumvent existing defensive measures, creating an urgent cybersecurity challenge. To address this, we propose PITCH, a robust challenge-response method to detect and tag interactive deepfake audio calls. We developed a comprehensive taxonomy of audio challenges based on the human auditory system, linguistics, and environmental factors, yielding 20 prospective challenges. Testing against leading voice-cloning systems using a novel dataset (18,600 original and 1.6 million deepfake samples from 100 users), PITCH's challenges enhanced machine detection capabilities to 88.7% AUROC score, enabling us to identify 10 highly effective challenges. For human evaluation, we filtered a challenging, balanced subset on which human evaluators independently achieved 72.6% accuracy, while machines scored 87.7%. Recognizing that call environments require human control, we developed a novel human-AI collaborative system that tags suspicious calls as "Deepfake-likely." Contrary to prior findings, we discovered that integrating human intuition with machine precision offers complementary advantages, giving users maximum control while boosting detection accuracy to 84.5%. This significant improvement positions PITCH as a potential AI-assisted pre-screener for verifying calls, offering an adaptable approach to combat real-time voice-cloning attacks while maintaining human decision authority.
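The abstract reports machine detection performance as an AUROC score. As a reading aid, the sketch below shows how such a score can be computed from detector outputs: it is the probability that a randomly chosen deepfake sample receives a higher score than a randomly chosen genuine one. The function name `auroc`, the label convention (1 = deepfake, 0 = genuine), and the toy scores are illustrative assumptions; this is not the authors' evaluation code or data.

```python
def auroc(labels, scores):
    """Pairwise AUROC: fraction of (deepfake, genuine) pairs in which the
    deepfake sample has the higher detector score; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]  # deepfake scores
    neg = [s for y, s in zip(labels, scores) if y == 0]  # genuine scores
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical detector scores for six calls (assumed values, not from the paper).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
toy_auroc = auroc(labels, scores)
print(f"toy AUROC = {toy_auroc:.3f}")
```

In this toy case 8 of the 9 deepfake/genuine pairs are ranked correctly, so the score is 8/9. The quadratic pairwise loop is chosen for clarity; a rank-based computation gives the same value more efficiently on large datasets.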