STM: SpatioTemporal and Motion Encoding for Action Recognition

Published 7 Aug 2019 in cs.CV | (1908.02486v2)

Abstract: Spatiotemporal and motion features are two complementary and crucial information for video action recognition. Recent state-of-the-art methods adopt a 3D CNN stream to learn spatiotemporal features and another flow stream to learn motion features. In this work, we aim to efficiently encode these two features in a unified 2D framework. To this end, we first propose an STM block, which contains a Channel-wise SpatioTemporal Module (CSTM) to present the spatiotemporal features and a Channel-wise Motion Module (CMM) to efficiently encode motion features. We then replace original residual blocks in the ResNet architecture with STM blocks to form a simple yet effective STM network by introducing very limited extra computation cost. Extensive experiments demonstrate that the proposed STM network outperforms the state-of-the-art methods on both temporal-related datasets (i.e., Something-Something v1 & v2 and Jester) and scene-related datasets (i.e., Kinetics-400, UCF-101, and HMDB-51) with the help of encoding spatiotemporal and motion features together.

Authors (5)

Citations (367)

Summary

The paper introduces an STM network that replaces standard CNN residual blocks with specialized spatiotemporal and motion encoding modules to improve efficiency.
The proposed Channel-wise SpatioTemporal Module and Channel-wise Motion Module enable effective feature extraction without relying on computationally expensive 3D convolutions or optical flow.
STM achieves state-of-the-art performance on temporal video datasets at significantly lower computational cost than 3D CNN or two-stream approaches, which makes it a good candidate for real-time applications.

Overview of STM: SpatioTemporal and Motion Encoding for Action Recognition

This paper presents an approach to video action recognition that addresses a key challenge: efficiently encoding spatiotemporal and motion features in a unified 2D convolutional neural network (CNN) framework. Earlier methods have often relied on computationally expensive 3D CNNs or on two-stream architectures that combine RGB and optical flow streams to reach high accuracy. The proposed SpatioTemporal and Motion (STM) network introduces an STM block composed of the Channel-wise SpatioTemporal Module (CSTM) and the Channel-wise Motion Module (CMM), which together capture these features without the overhead of precomputing optical flow or running full 3D convolutions.
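To give a rough sense of why avoiding full 3D convolutions matters, the sketch below counts weights for a single layer. It assumes, for illustration only, that the 3D convolution is replaced by an ordinary 2D convolution plus a per-channel 1D temporal filter; the layer width of 256 and the kernel size of 3 are hypothetical values, not figures taken from the paper.

```python
def conv3d_params(channels, k=3):
    """Weights in a full k x k x k 3D convolution (bias ignored)."""
    return channels * channels * k ** 3


def conv2d_params(channels, k=3):
    """Weights in a full k x k 2D convolution (bias ignored)."""
    return channels * channels * k ** 2


def channelwise_temporal_params(channels, k=3):
    """Weights in a per-channel 1D temporal convolution: one k-tap filter per channel."""
    return channels * k


channels = 256  # hypothetical layer width, chosen only for illustration
full_3d = conv3d_params(channels)
factorized = conv2d_params(channels) + channelwise_temporal_params(channels)
print("3D conv weights:       ", full_3d)
print("2D + temporal weights: ", factorized)
print("ratio:                 ", round(full_3d / factorized, 2))
```

Under these assumptions the per-channel temporal filter adds only a few hundred weights on top of the 2D convolution, so the factorized layer needs roughly a third of the weights of the full 3D one. This counts parameters only and says nothing about measured runtime.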

Key Contributions

Unified 2D CNN with STM Blocks: The authors replace standard residual blocks in ResNet with STM blocks, integrating spatiotemporal and motion encoding without substantially increasing computational demand. This integration allows the network to capture essential action dynamics more efficiently compared to maintaining separate processing paths for spatial and temporal information.
Novel CSTM and CMM Components:
- CSTM: This module uses channel-wise 1D convolution to learn temporal combinations independently for each channel. This leverages different temporal relationships for each channel while keeping additional computational cost low.
- CMM: Learned motion patterns are generated by comparing consecutive frames, thus encoding motion without having to compute traditional optical flow. This lightweight alternative captures distinct edges much like optical flow but incorporates them into the network learning in a more resource-efficient manner.
Performance: Experiments on both temporal-focused datasets (e.g., Something-Something v1 and v2, Jester) and scene-related datasets (e.g., Kinetics-400, UCF-101, HMDB-51) show that STM achieves state-of-the-art performance on the temporal datasets and competitive results on the scene-focused ones. On Something-Something v1 specifically, STM improves top-1 accuracy by 29.5% over the TSN baseline, with comparable evaluations on the other benchmarks.
Efficiency: The STM network achieves these results without the heavy computation of 3D CNNs or the optical flow extraction typical of two-stream methods. This lowers cost and makes deployment more feasible in practical applications.
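The two module ideas described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, features are assumed to have shape (T, C, H, W), and any additional convolutions the full CSTM and CMM may apply around these steps are omitted.

```python
import numpy as np


def channelwise_temporal_conv(x, weights):
    """Apply a separate 1D temporal filter to every channel.

    x:       features with shape (T, C, H, W)
    weights: one kernel per channel, shape (C, K) with K odd
    Zero padding along time keeps the output length equal to T.
    """
    T, C, H, W = x.shape
    K = weights.shape[1]
    pad = K // 2
    padded = np.pad(x, ((pad, pad), (0, 0), (0, 0), (0, 0)))
    out = np.zeros(x.shape, dtype=float)
    for k in range(K):
        # weights[:, k] broadcasts over time and space, so channels never mix.
        out += padded[k:k + T] * weights[:, k].reshape(1, C, 1, 1)
    return out


def frame_difference_motion(x):
    """Motion-like features: difference between consecutive time steps.

    The last step has no successor, so it is filled with zeros.
    """
    motion = np.zeros(x.shape, dtype=float)
    motion[:-1] = x[1:] - x[:-1]
    return motion


rng = np.random.default_rng(0)
features = rng.standard_normal((8, 4, 5, 5))  # T=8 frames, C=4 channels
kernels = rng.standard_normal((4, 3))         # one 3-tap filter per channel
spatiotemporal = channelwise_temporal_conv(features, kernels)
motion = frame_difference_motion(features)
print(spatiotemporal.shape, motion.shape)
```

Both outputs keep the input shape, so they can be combined and passed on to ordinary 2D layers. In this sketch the temporal filter holds only C × K weights, and the motion step is a subtraction that needs no precomputed optical flow.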

Implications and Future Developments

The successful integration of spatiotemporal and motion features within a unified 2D CNN framework, as demonstrated by the STM network, has significant implications for the design of efficient video recognition systems. The method is particularly relevant to real-time applications and to devices with limited processing capability, such as edge devices and mobile platforms.

From a theoretical perspective, this approach opens avenues for further optimizing feature encoding strategies that bypass traditional preprocessing-heavy methods. Future work might extend these principles to longer video sequences or integrate additional modalities, such as audio, for richer context.

Moreover, the STM work reflects a move towards more parsimonious model designs, which could inspire similar developments in other areas of computer vision and machine learning where model efficiency is paramount.

STM represents a meaningful advancement in action recognition methodology by striking a sound balance between model complexity and empirical performance, paving the way for more resource-aware designs in video understanding frameworks.
