
Audio Event Detection using Weakly Labeled Data (1605.02401v3)

Published 9 May 2016 in cs.SD, cs.AI, and cs.MM

Abstract: Acoustic event detection is essential for content analysis and description of multimedia recordings. The majority of current literature on the topic learns the detectors through fully-supervised techniques employing strongly labeled data. However, the labels available for the majority of multimedia data are generally weak and do not provide sufficient detail for such methods to be employed. In this paper we propose a framework for learning acoustic event detectors using only weakly labeled data. We first show that audio event detection using weak labels can be formulated as a Multiple Instance Learning problem. We then suggest two frameworks for solving multiple-instance learning, one based on support vector machines and the other on neural networks. The proposed methods can help in removing the time-consuming and expensive process of manually annotating data to facilitate fully supervised learning. Moreover, the proposed approach can not only detect events in a recording but can also provide their temporal locations within it. This helps in obtaining a more complete description of the recording and is notable since temporal information is not available in weakly labeled data in the first place.

Authors (2)
  1. Anurag Kumar (118 papers)
  2. Bhiksha Raj (180 papers)
Citations (169)

Summary

Audio Event Detection using Weakly Labeled Data

This paper presents a method for learning acoustic event detectors utilizing weakly labeled data, bypassing the need for extensive manual annotation of multimedia recordings. Typically, audio event detection frameworks rely on strongly labeled datasets that include precise temporal markers identifying when events occur in audio sequences. However, this approach can be resource-intensive and impractical given the vast quantity of unannotated multimedia data available today. Kumar and Raj propose a solution by framing audio event detection within the context of Multiple Instance Learning (MIL), a paradigm that facilitates classification using labels available at the level of entire bags of instances, rather than individually labeled instances.
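
In MIL terms, each recording is a bag of segment-level instances, and only the bag (recording) label is observed. A minimal statement of the assumed bag-label relationship, with notation introduced here rather than taken from the paper: for a bag $B = \{x_1, \dots, x_m\}$ with latent instance labels $y_i \in \{0, 1\}$, the bag label is $y_B = \max_{1 \le i \le m} y_i$, so a recording is positive if at least one of its segments contains the event and negative only if none do.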

Key Contributions

The authors introduce an MIL framework wherein audio recordings are segmented into instances, each representing potential occurrences of acoustic events. This perspective interprets weakly labeled recordings as collections of instances, allowing for the application of MIL methods to train event detectors. Two methodologies are introduced for this process:

  1. mi-SVM (Multiple-Instance Support Vector Machines): This approach adapts support vector machines to weak labels by treating the labels of instances in positive bags as latent variables: every instance in a negative bag is constrained to be negative, at least one instance in each positive bag must be positive, and the SVM is trained to maximize the instance-level margin under these constraints.
  2. BP-MIL (Backpropagation for Multiple Instance Learning): Trains a neural network whose bag-level divergence (error) compares the weak bag label with the maximum network output over the instances in the bag, so the gradient is driven by the most confidently detected segment (a minimal sketch of this max-over-instances training appears after this list).
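
To make the BP-MIL idea concrete, the following is a minimal, hypothetical PyTorch sketch, not the authors' implementation: a small network scores every segment in a bag, the bag score is the maximum instance score, and the weak recording-level label supervises only that maximum. The layer sizes, binary cross-entropy loss, and toy data are illustrative assumptions.

```python
# Hypothetical sketch of BP-MIL-style training (assumptions: architecture,
# loss, and data shapes are illustrative, not taken from the paper).
import torch
import torch.nn as nn

class InstanceScorer(nn.Module):
    """Maps one instance (segment feature vector) to an event score in (0, 1)."""
    def __init__(self, feat_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, instances: torch.Tensor) -> torch.Tensor:
        # instances: (num_instances, feat_dim) -> (num_instances,) scores
        return self.net(instances).squeeze(-1)

def bag_loss(model: InstanceScorer, bag: torch.Tensor,
             bag_label: torch.Tensor, criterion: nn.Module) -> torch.Tensor:
    instance_scores = model(bag)        # one score per segment
    bag_score = instance_scores.max()   # bag score = most confident segment
    return criterion(bag_score, bag_label)

if __name__ == "__main__":
    torch.manual_seed(0)
    feat_dim = 40                       # e.g. dimensionality of per-segment features
    model = InstanceScorer(feat_dim)
    optim = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.BCELoss()

    # Toy weakly labeled data: each bag is a recording split into segments,
    # labeled only at the recording level.
    bags = [torch.randn(10, feat_dim), torch.randn(7, feat_dim)]
    labels = [torch.tensor(1.0), torch.tensor(0.0)]

    for epoch in range(5):
        for bag, label in zip(bags, labels):
            optim.zero_grad()
            loss = bag_loss(model, bag, label, criterion)
            loss.backward()
            optim.step()
```

Only the maximum-scoring segment receives a direct gradient for each bag, which is what lets the network recover approximate temporal locations despite seeing only recording-level labels.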

Experimental Evaluation

Experiments were conducted on a subset of the TRECVID-MED 2011 dataset, annotated for specific acoustic events such as clapping or engine noise. Recordings were divided into four parts, three for training and one for testing, with MFCCs serving as the low-level representation from which segment-level features were built. Two types of Gaussian mixture model (GMM)-based features were explored:

  • $\vec{F}$ features, which represent a segment as a soft-count histogram of its MFCC vectors over the components of a background GMM (see the sketch after this list).
  • $\vec{M}$ features, derived from MAP adaptation of the background GMM to each segment, capturing segment-specific modes of the MFCC distribution.
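
For intuition about the $\vec{F}$ features, the sketch below shows one plausible way to compute a soft-count histogram over GMM components from MFCC frames. The librosa/scikit-learn pipeline, component count, and file paths are assumptions for illustration, not the paper's exact configuration.

```python
# Hypothetical soft-count histogram feature: fit a background GMM on MFCC
# frames pooled from training audio, then represent a segment by the average
# posterior probability (responsibility) of each component over its frames.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(path: str, n_mfcc: int = 20) -> np.ndarray:
    """Return a (num_frames, n_mfcc) matrix of MFCCs for one audio file."""
    y, sr = librosa.load(path, sr=None)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def fit_background_gmm(training_paths, n_components: int = 64) -> GaussianMixture:
    """Fit a diagonal-covariance GMM on MFCC frames pooled across recordings."""
    frames = np.vstack([mfcc_frames(p) for p in training_paths])
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm.fit(frames)
    return gmm

def soft_count_histogram(gmm: GaussianMixture, segment_frames: np.ndarray) -> np.ndarray:
    """Average per-component posteriors over a segment's frames -> feature vector."""
    responsibilities = gmm.predict_proba(segment_frames)  # (frames, components)
    return responsibilities.mean(axis=0)                  # (components,)

# Usage (paths are placeholders):
# gmm = fit_background_gmm(["train1.wav", "train2.wav"])
# feature = soft_count_histogram(gmm, mfcc_frames("segment.wav"))
```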

Across several configurations, the paper demonstrates that both mi-SVM and BP-MIL can achieve competitive performance compared to fully supervised methods, with AUC values ranging from approximately 0.6 to 0.8 for different events across varying setups. This indicates that MIL frameworks are viable for detecting and temporally localizing audio events from weakly labeled data.
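
At test time, detection and localization can be read off the same instance scores. A minimal illustrative sketch, with the threshold and segment length as assumed parameters rather than values from the paper:

```python
# Hypothetical test-time use of a trained instance scorer: the recording-level
# score is the max over segment scores (used for AUC), and segments above a
# threshold give approximate temporal locations of the event.
import numpy as np
from sklearn.metrics import roc_auc_score

def detect_and_localize(segment_scores: np.ndarray,
                        segment_len_s: float = 1.0,
                        threshold: float = 0.5):
    recording_score = segment_scores.max()
    hits = np.flatnonzero(segment_scores >= threshold)
    intervals = [(i * segment_len_s, (i + 1) * segment_len_s) for i in hits]
    return recording_score, intervals

# Recording-level AUC over a test set (labels/scores are placeholders):
# auc = roc_auc_score(recording_labels, recording_scores)
```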

Implications and Future Work

The implications of this paper are multifaceted. Practically, it suggests a scalable approach for harnessing large volumes of unannotated multimedia content, leveraging weak labels from metadata to train effective acoustic event detectors. Theoretically, it establishes a bridge between MIL paradigms and audio content analysis, encouraging further exploration of GMM-based features and alternative MIL models to enhance detection capabilities. The paper underscores the potential of MIL frameworks in offering temporal localization of events, making it possible to extract detailed descriptions of recordings where explicit temporal data was absent initially.

Future developments could involve refinement of feature representations, expansion of the event vocabulary, and exploration of alternative classifiers and learning models to improve detection precision and scalability. Moreover, integrating this approach into higher-level multimedia event detection concepts would allow for more comprehensive content analysis systems, facilitating advanced retrieval and indexing of multimedia data. This positions MIL within a broader context of automated content-based multimedia analysis and retrieval, indicating its promise for ongoing and future research endeavors in artificial intelligence.