Two Stream LSTM: A Deep Fusion Framework for Human Action Recognition (1704.01194v1)

Published 4 Apr 2017 in cs.CV

Abstract: In this paper we address the problem of human action recognition from video sequences. Inspired by the exemplary results obtained via automatic feature learning and deep learning approaches in computer vision, we focus our attention towards learning salient spatial features via a convolutional neural network (CNN) and then map their temporal relationship with the aid of Long-Short-Term-Memory (LSTM) networks. Our contribution in this paper is a deep fusion framework that more effectively exploits spatial features from CNNs with temporal features from LSTM models. We also extensively evaluate their strengths and weaknesses. We find that by combining both the sets of features, the fully connected features effectively act as an attention mechanism to direct the LSTM to interesting parts of the convolutional feature sequence. The significance of our fusion method is its simplicity and effectiveness compared to other state-of-the-art methods. The evaluation results demonstrate that this hierarchical multi stream fusion method has higher performance compared to single stream mapping methods allowing it to achieve high accuracy outperforming current state-of-the-art methods in three widely used databases: UCF11, UCFSports, jHMDB.

Summary

The paper introduces a two-stream LSTM framework that effectively fuses spatial features from CNNs and temporal features from LSTMs to improve human action recognition accuracy.
The proposed fusion models, particularly fu-2, achieve state-of-the-art results on benchmark datasets, including 94.6% accuracy on UCF11, 99.1% on UCF Sports, and 69.0% on jHMDB.
This deep fusion approach provides a robust method for recognizing complex actions in challenging video environments, applicable to surveillance, sports analysis, and human-machine interaction.

Analysis of the "Two Stream LSTM: A Deep Fusion Framework for Human Action Recognition" Paper

The paper "Two Stream LSTM: A Deep Fusion Framework for Human Action Recognition" addresses human action recognition in video sequences, a problem relevant to domains such as surveillance, sports analysis, and human-machine interaction. The authors' primary contribution is a deep fusion framework that combines spatial features extracted by Convolutional Neural Networks (CNNs) with temporal features modeled by Long-Short-Term-Memory (LSTM) networks.

Key Contributions

The authors introduce a mechanism in which CNN layer outputs (from the last convolutional layer and from the first fully connected layer) are fed frame by frame into LSTM networks that model their temporal relationships. The core of the work is the two-stream LSTM approach, which merges these two feature sequences so that the automatic feature learning of CNNs is combined with the sequence modeling strength of LSTMs, yielding more accurate action classification.
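For reference, the temporal modeling step can be written with the standard LSTM recurrence. This is the common textbook formulation and not necessarily the exact variant used in the paper, which this summary does not specify. Here x_t is assumed to be the CNN feature vector of frame t, and the symbols W, U, and b are generic weight and bias names introduced for illustration.

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
g_t &= \tanh(W_g x_t + U_g h_{t-1} + b_g) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The hidden state h_t summarizes the frames seen so far, so the state after the last frame (or the full sequence of states) can be passed on to a classifier.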

Four distinct models are proposed:

conv-L: Utilizes convolutional layer outputs.
fc-L: Leverages fully connected layer outputs.
fu-1 and fu-2: Fusion models that combine both convolutional and fully connected features; fu-2 additionally enables joint backpropagation through both streams for further performance gains.
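The two-stream idea behind the fusion models can be illustrated with a minimal NumPy forward pass. This is a sketch under stated assumptions, not the authors' implementation: the fusion operator (concatenating the final hidden states of the two LSTMs), the single softmax output layer, the feature dimensions, and every function and parameter name below are hypothetical choices made for illustration. Weights are random and untrained, so the output is only a shape and sanity check.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()

def init_lstm(input_dim, hidden_dim, rng, scale=0.1):
    """Random weights for one LSTM layer; gates are stacked as (i, f, o, g)."""
    return {
        "W": rng.normal(0.0, scale, (4 * hidden_dim, input_dim)),
        "U": rng.normal(0.0, scale, (4 * hidden_dim, hidden_dim)),
        "b": np.zeros(4 * hidden_dim),
    }

def run_lstm(params, seq):
    """Run the LSTM over seq of shape (T, input_dim); return states of shape (T, hidden_dim)."""
    hidden_dim = params["U"].shape[1]
    h = np.zeros(hidden_dim)
    c = np.zeros(hidden_dim)
    states = []
    for x_t in seq:
        z = params["W"] @ x_t + params["U"] @ h + params["b"]
        i, f, o, g = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        states.append(h)
    return np.stack(states)

def fused_forward(conv_seq, fc_seq, params):
    """Two streams, one LSTM each, fused by concatenation (an assumed fusion operator)."""
    h_conv = run_lstm(params["conv_lstm"], conv_seq)[-1]  # conv-L style stream
    h_fc = run_lstm(params["fc_lstm"], fc_seq)[-1]        # fc-L style stream
    fused = np.concatenate([h_conv, h_fc])
    return softmax(params["W_out"] @ fused + params["b_out"])

# Toy dimensions (assumptions); 11 classes matches the UCF11 label count.
rng = np.random.default_rng(0)
T, conv_dim, fc_dim, hidden, n_classes = 8, 64, 32, 16, 11
params = {
    "conv_lstm": init_lstm(conv_dim, hidden, rng),
    "fc_lstm": init_lstm(fc_dim, hidden, rng),
    "W_out": rng.normal(0.0, 0.1, (n_classes, 2 * hidden)),
    "b_out": np.zeros(n_classes),
}

# Stand-ins for per-frame CNN features; a real pipeline would extract these from video frames.
conv_seq = rng.normal(size=(T, conv_dim))
fc_seq = rng.normal(size=(T, fc_dim))

probs = fused_forward(conv_seq, fc_seq, params)
print(probs.shape)
```

In a trainable version, gradients from the classification loss would flow back through the fusion point into both LSTMs, which is the kind of joint backpropagation attributed to fu-2 above.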

Evaluation and Results

The paper evaluates these models on three established benchmarks: UCF11, UCF Sports, and jHMDB. The fusion models, particularly fu-2, are reported to outperform the state-of-the-art methods of the time, with accuracies of 94.6%, 99.1%, and 69.0% on the three datasets respectively. The results underscore the effectiveness and efficiency of the proposed fusion framework, which achieves this accuracy with fewer trainable parameters than previous methods.

A notable advantage of the proposed approach is the joint learning enabled by the fusion models. When both feature sets are combined, the fully connected features effectively act as an attention mechanism, directing the LSTM toward the informative parts of the convolutional feature sequence and thereby improving its modeling of temporal correlations.

Implications

From a practical perspective, the framework's ability to recognize complex human actions in challenging video datasets makes it a candidate for real-world scenarios with cluttered backgrounds and variable lighting. From a theoretical perspective, the work shows how a hierarchical, multi-stream structure can improve a deep learning classifier, an idea that could extend to domains beyond action recognition.

Future Directions

While the proposed two-stream LSTM architecture achieves high accuracy, future research could integrate additional modalities or hybrid architectures to capture more contextually relevant features. Similar fusion strategies could also be applied to sequence modeling problems in other disciplines. Lightweight models that offer comparable performance at lower computational cost remain a compelling avenue for subsequent studies.

In summary, the two-stream LSTM framework represents a notable advance in human action recognition and demonstrates the potential of combining spatial and temporal-sequence information within a single deep learning architecture.