Proposal-based Few-shot Sound Event Detection for Speech and Environmental Sounds with Perceivers (2107.13616v2)

Published 28 Jul 2021 in eess.AS, cs.NE, and cs.SD

Abstract: Many applications involve detecting and localizing specific sound events within long, untrimmed documents, including keyword spotting, medical observation, and bioacoustic monitoring for conservation. Deep learning techniques often set the state of the art for these tasks. However, for some types of events, there is insufficient labeled data to train such models. In this paper, we propose a region proposal-based approach to few-shot sound event detection utilizing the Perceiver architecture. Motivated by a lack of suitable benchmark datasets, we generate two new few-shot sound event localization datasets: "Vox-CASE," using clips of celebrity speech as the sound event, and "ESC-CASE," using environmental sound events. Our highest-performing proposed few-shot approaches achieve F1-scores of 0.483 and 0.418, respectively, on 5-shot 5-way tasks on these two datasets. These represent relative improvements of 72.5% and 11.2% over strong proposal-free few-shot sound event detection baselines.
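To make the "relative improvement" figures concrete, here is a minimal sketch of the arithmetic, assuming the usual definition (new - baseline) / baseline. The function name `implied_baseline` is hypothetical. The values it returns are back-computed from the rounded numbers in the abstract, not scores reported in the abstract itself, so they should be read as approximate.

```python
def implied_baseline(score: float, relative_improvement_pct: float) -> float:
    """Back out a baseline score from a new score and its relative improvement.

    Assumes relative improvement = (score - baseline) / baseline, expressed
    in percent. Inputs taken from the abstract are rounded, so the result
    is only approximate.
    """
    return score / (1.0 + relative_improvement_pct / 100.0)


# F1-scores and relative improvements as quoted in the abstract.
vox_case_baseline = implied_baseline(0.483, 72.5)
esc_case_baseline = implied_baseline(0.418, 11.2)

print(f"Implied Vox-CASE baseline F1: {vox_case_baseline:.3f}")
print(f"Implied ESC-CASE baseline F1: {esc_case_baseline:.3f}")
```

Under this definition, the much larger relative gain on Vox-CASE corresponds to a lower implied baseline there than on ESC-CASE, even though the absolute F1-scores of the proposed approaches are of similar magnitude.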
