
Abstract

In the acoustic scene classification (ASC) task, an acoustic scene consists of diverse sounds and is inferred by identifying combinations of distinct attributes among them. This study aims to extract and cluster these attributes effectively using an improved multiple-instance learning (MIL) framework for ASC. MIL, a weakly supervised learning method, extracts instances from the bundle of frames composing an input audio clip and infers the scene corresponding to the input using these unlabeled instances. However, many studies have pointed out an underestimation problem in MIL. In this study, we develop a MIL framework better suited to ASC systems by defining instance-level labels and an instance-level loss to extract and cluster instances effectively. Furthermore, we design a fully separated convolutional module, a lightweight neural network comprising pointwise, frequency-sided depthwise, and temporal-sided depthwise convolutional filters. As a result, compared to vanilla MIL, the confidence and proportion of positive instances increase significantly, overcoming the underestimation problem and improving classification accuracy by up to 11%. The proposed system achieves accuracies of 81.1% and 72.3% on the TAU Urban Acoustic Scenes 2019 and 2020 Mobile datasets, respectively, with 139 K parameters. In particular, it achieves the highest performance among systems with fewer than 1 M parameters on the TAU Urban Acoustic Scenes 2019 dataset.
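The abstract specifies only that the fully separated convolutional module combines pointwise, frequency-sided depthwise, and temporal-sided depthwise filters. The PyTorch sketch below shows one plausible arrangement of such a block; the class name, kernel sizes, layer ordering, and the BatchNorm/ReLU placement are assumptions for illustration, not the authors' exact specification.

```python
import torch
import torch.nn as nn

class FullySeparatedConv(nn.Module):
    """Hypothetical sketch of a fully separated convolutional block:
    a pointwise (1x1) convolution to mix channels, followed by depthwise
    convolutions applied separately along the frequency and time axes.
    Input is assumed to be (batch, channels, freq_bins, time_frames)."""

    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        # Pointwise: channel mixing, no spatial context.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        # Frequency-sided depthwise: (k, 1) kernel, one filter per channel.
        self.freq_dw = nn.Conv2d(out_ch, out_ch, kernel_size=(kernel_size, 1),
                                 padding=(pad, 0), groups=out_ch)
        # Temporal-sided depthwise: (1, k) kernel, one filter per channel.
        self.time_dw = nn.Conv2d(out_ch, out_ch, kernel_size=(1, kernel_size),
                                 padding=(0, pad), groups=out_ch)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.pointwise(x)
        x = self.freq_dw(x)
        x = self.time_dw(x)
        return self.act(self.bn(x))

# Example: a batch of 8 feature maps (16 channels, 64 mel bins, 100 frames).
x = torch.randn(8, 16, 64, 100)
y = FullySeparatedConv(16, 32)(x)
print(y.shape)  # torch.Size([8, 32, 64, 100])
```

Splitting a standard k x k convolution into pointwise plus two one-dimensional depthwise passes sharply reduces the parameter count, which is consistent with the lightweight 139 K-parameter budget the abstract reports.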
