Background Suppression Network for Weakly-supervised Temporal Action Localization (1911.09963v1)

Published 22 Nov 2019 in cs.CV

Abstract: Weakly-supervised temporal action localization is a very challenging problem because frame-wise labels are not given in the training stage while the only hint is video-level labels: whether each video contains action frames of interest. Previous methods aggregate frame-level class scores to produce video-level prediction and learn from video-level action labels. This formulation does not fully model the problem in that background frames are forced to be misclassified as action classes to predict video-level labels accurately. In this paper, we design Background Suppression Network (BaS-Net) which introduces an auxiliary class for background and has a two-branch weight-sharing architecture with an asymmetrical training strategy. This enables BaS-Net to suppress activations from background frames to improve localization performance. Extensive experiments demonstrate the effectiveness of BaS-Net and its superiority over the state-of-the-art methods on the most popular benchmarks - THUMOS'14 and ActivityNet. Our code and the trained model are available at https://github.com/Pilhyeon/BaSNet-pytorch.

Summary

The paper introduces BaS-Net, a dual-branch architecture with an auxiliary background class to improve weakly-supervised temporal action localization.
It employs an asymmetrical training strategy that refines frame classification by actively suppressing background noise through a dedicated filtering module.
Empirical results on benchmarks like THUMOS'14 and ActivityNet show significant improvements in mean Average Precision compared to previous methods.

Background Suppression Network for Weakly-supervised Temporal Action Localization: An Expert Overview

The paper "Background Suppression Network for Weakly-supervised Temporal Action Localization" presents a novel approach to weakly-supervised temporal action localization (WTAL). The problem is challenging because no frame-level labels are available during training; the only supervision is a video-level label. The paper argues that previous methods, which do not treat background frames separately, force those frames to be classified as actions, and that this degrades localization accuracy. The proposed solution is the Background Suppression Network (BaS-Net), which adds a distinct class for background frames and uses a two-branch architecture to improve action localization.

Overview of the BaS-Net

BaS-Net uses a two-branch structure in which both branches share weights. The key innovation is an explicit background class, which previous techniques often ignored. The Base branch aggregates segment-level scores into video-level predictions. The Suppression branch places a filtering module in front of the same network; this module attenuates input features from background frames, focusing the network’s attention on action frames and reducing false positives caused by background.
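The two operations described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, top-k mean pooling is assumed as the aggregation rule (the overview does not say which rule is used), and the filtering module is reduced to a sigmoid weight per segment.

```python
import math


def sigmoid(x):
    """Map a real-valued logit to a weight in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def topk_mean_pool(segment_scores, k):
    """Aggregate segment-level scores (T rows, C classes) into one
    video-level score per class by averaging the k highest segments.

    Top-k mean pooling is an assumption made for this sketch.
    """
    num_classes = len(segment_scores[0])
    video_scores = []
    for c in range(num_classes):
        column = sorted((row[c] for row in segment_scores), reverse=True)
        top = column[:k]
        video_scores.append(sum(top) / len(top))
    return video_scores


def suppress_background(features, foreground_logits):
    """Scale each segment's feature vector by a foreground weight.

    Segments with a low weight (likely background) are attenuated
    before they reach the shared classifier.
    """
    weights = [sigmoid(z) for z in foreground_logits]
    return [[w * f for f in feat] for w, feat in zip(weights, features)]


if __name__ == "__main__":
    scores = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]  # 3 segments, 2 classes
    print(topk_mean_pool(scores, k=2))
    print(suppress_background([[2.0, 4.0]], [0.0]))
```

In the Suppression branch, the attenuated features would be passed through the same classifier as the Base branch, which is what weight sharing means here.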

Methodological Innovations

The methodological advancements of BaS-Net are rooted in:

Auxiliary Background Class: An auxiliary background class gives non-action frames a class of their own. However, introducing this class alone does not improve performance: training provides no negative samples for it, so the network risks labeling every frame as background.
Two-branch Architecture: Featuring the Base and Suppression branches, this architecture ensures the network can simultaneously optimize both action class identification and background suppression. Shared weights enforce a balance between recognizing action frames and minimizing background noise. The Suppression branch is particularly tasked with leveraging contrasting objectives to refine frame classification.
Asymmetrical Training Strategy: This strategy uses diverging training objectives for each branch, emphasizing background frame suppression in the Suppression branch. This dual-objective training is critical for improving the precision of localization outcomes.
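One way to realize different objectives for the two branches is to give them different targets for the background class. The sketch below follows our reading of the paper, in which the Base branch treats background as present and the Suppression branch treats it as absent; the function names and the use of binary cross-entropy are assumptions made for illustration.

```python
import math


def branch_targets(action_labels):
    """Build video-level targets for both branches.

    The action part is identical; only the appended background
    entry differs (1 for the Base branch, 0 for the Suppression branch).
    """
    base = list(action_labels) + [1.0]
    suppression = list(action_labels) + [0.0]
    return base, suppression


def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over classes (an assumed loss choice)."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(probs)


if __name__ == "__main__":
    base_t, supp_t = branch_targets([0.0, 1.0, 0.0])
    video_probs = [0.1, 0.8, 0.2, 0.7]  # hypothetical predictions, background last
    print(binary_cross_entropy(video_probs, base_t))
    print(binary_cross_entropy(video_probs, supp_t))
```

Because the weights are shared, the only way the Suppression branch can meet its target is for the filtering module to remove the background evidence from the input, which supplies the negative samples the background class otherwise lacks.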

Empirical Evidence and Implications

The paper reports that BaS-Net surpasses previous state-of-the-art methods on two popular benchmarks, THUMOS'14 and ActivityNet, as measured by mean Average Precision (mAP). The results indicate that the background class, combined with the two-branch framework, reduces background interference and improves the detection of action instances even without frame-level annotations.
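For context, mAP in temporal localization is conventionally computed at one or more temporal IoU thresholds: a predicted segment counts as correct only if its overlap with a ground-truth segment is high enough. The helper below is a generic sketch of that overlap measure, with a hypothetical function name; it is not taken from the paper's evaluation code.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two time intervals given as (start, end)."""
    (pred_start, pred_end), (gt_start, gt_end) = pred, gt
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    # A prediction covering 2s-6s against a ground truth of 4s-8s.
    print(temporal_iou((2.0, 6.0), (4.0, 8.0)))
```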

The paper demonstrates that incorporating background modeling within weakly-supervised contexts is not only feasible but beneficial. The approach has theoretical implications for how action frames are represented and learned, offering a pathway to bridging the gap between weakly-supervised and fully-supervised localization methods.

Future Directions

The implications of this research extend toward enhancing WTAL frameworks, fostering future developments where background and action contexts are more robustly defined. This method could serve as a precursor to more sophisticated models capable of real-time action detection in dynamic environments, potentially leading to advancements in fields such as video surveillance, human-computer interaction, and autonomous vehicle navigation.

In closing, the paper introduces a method that surpasses previous WTAL frameworks by giving background frames explicit treatment in the architecture rather than leaving them to be absorbed by the action classes. The findings set a precedent for future work on weak supervision and reinforce the value of explicit background modeling in action localization.

GitHub

GitHub - Pilhyeon/BaSNet-pytorch: Official Pytorch Implementation of 'Background Suppression Network for Weakly-supervised Temporal Action Localization' (AAAI-20 Spotlight) (172 stars)