Action Recognition using Visual Attention

Published 12 Nov 2015 in cs.LG and cs.CV | (1511.04119v3)

Abstract: We propose a soft attention based model for the task of action recognition in videos. We use multi-layered Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) units which are deep both spatially and temporally. Our model learns to focus selectively on parts of the video frames and classifies videos after taking a few glimpses. The model essentially learns which parts in the frames are relevant for the task at hand and attaches higher importance to them. We evaluate the model on UCF-11 (YouTube Action), HMDB-51 and Hollywood2 datasets and analyze how the model focuses its attention depending on the scene and the action being performed.

Summary

The paper demonstrates that integrating a soft attention mechanism with LSTM networks improves action recognition accuracy over softmax regression and pooled LSTM baselines.
The paper feeds GoogLeNet convolutional features into multi-layered LSTM RNNs that selectively focus on the salient regions of each video frame.
The paper reports 84.96% accuracy on UCF-11 and 41.31% accuracy on HMDB-51, which is competitive with other RGB-only models.

Action Recognition using Visual Attention: An Expert Overview

The paper "Action Recognition using Visual Attention" by Shikhar Sharma, Ryan Kiros, and Ruslan Salakhutdinov presents a soft attention approach to action recognition in videos. The model uses multi-layered Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) units that are deep both spatially and temporally. This architecture lets the model focus selectively on the relevant parts of each video frame and classify a video after taking only a few glimpses.

Methodology

Central to the paper is a soft attention mechanism, in contrast to hard attention models, which are inherently stochastic and require computationally expensive sampling to train. Soft attention is deterministic: the mapping from attention weights to RNN inputs is differentiable, so the whole model can be trained end to end with standard backpropagation. The model predicts action classes from convolutional features extracted with the GoogLeNet architecture, and at each time step a location softmax over the spatial positions of the feature map determines which regions contribute most to the LSTM input.
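A single soft attention step can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: the function names, the weight matrix `W`, the hidden size of 512, and the 7 x 7 x 1024 feature-map shape are all assumptions chosen for the example. It shows only the two operations described above, a location softmax computed from the previous hidden state, followed by a weighted sum of the per-location features that becomes the next recurrent input.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def soft_attention_step(features, h_prev, W):
    """One soft attention step (hypothetical helper, not from the paper's code).

    features: (L, D) array, one D-dimensional feature vector per spatial location
    h_prev:   (H,) previous hidden state of the recurrent network
    W:        (L, H) assumed weight matrix mapping the hidden state to location scores
    Returns the attended input vector (D,) and the attention weights (L,).
    """
    scores = W @ h_prev            # one unnormalized score per location
    weights = softmax(scores)      # location softmax: non-negative, sums to 1
    x_t = weights @ features       # expected feature vector under the attention weights
    return x_t, weights

# Example with assumed sizes: a 7 x 7 grid of 1024-dimensional features.
rng = np.random.default_rng(0)
K, D, H = 7, 1024, 512
features = rng.standard_normal((K * K, D))
h_prev = rng.standard_normal(H)
W = 0.01 * rng.standard_normal((K * K, H))

x_t, weights = soft_attention_step(features, h_prev, W)
print(x_t.shape, weights.shape)
```

Because every operation here is differentiable, gradients can flow from the classification loss back through `weights` into `W`, which is what makes soft attention trainable with ordinary backpropagation. A hard attention model would instead sample a single location from `weights`, which is not differentiable.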

The authors evaluate their model on three well-known datasets, UCF-11, HMDB-51, and Hollywood2, which cover a diverse range of human activities captured in real-world video. The model is loosely analogous to human visual attention, in that its focus shifts dynamically to the pertinent elements of each frame.

Quantitative Analysis

The proposed model demonstrates measurable improvements over baseline approaches such as softmax regression and traditional pooled LSTMs. On the UCF-11 dataset, for instance, the attention model achieves an accuracy of 84.96%, a noticeable improvement over the baselines. Similarly, on the HMDB-51 dataset, it registers an accuracy of 41.31%, outperforming other models relying on RGB video input.

Comparative Evaluation

Compared with state-of-the-art models, particularly those that use only RGB data, the proposed soft attention model remains competitive. It balances performance with interpretability, which distinguishes it from methods that incorporate optical flow or other additional data modalities.

Qualitative Results

The visual attention mechanism also makes the model more interpretable, because the attention weights show where the model is looking while it classifies a video. Several examples show the model attending to critical elements, such as sports equipment or human motion cues associated with a specific action, which supports correct classification.
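One simple way to inspect such attention weights is to upsample the coarse attention grid to the frame size and use it to modulate pixel brightness. The sketch below is an assumption-laden illustration, not the paper's visualization code: the helper name `attention_overlay`, the 7 x 7 grid, the grayscale frame, and the nearest-neighbour upsampling via `np.kron` are all choices made for this example.

```python
import numpy as np

def attention_overlay(frame, weights, k=7):
    """Brighten attended regions of a frame (hypothetical helper).

    frame:   (H, W) grayscale image with values in [0, 1]; H and W divisible by k
    weights: (k*k,) attention weights over a k x k spatial grid
    Returns a (H, W) image in which each grid cell is scaled by its
    attention weight relative to the most attended cell.
    """
    h, w = frame.shape
    att = weights.reshape(k, k)
    att = att / att.max()                          # most attended cell gets scale 1
    # Nearest-neighbour upsampling: each grid cell becomes an (h/k, w/k) block.
    up = np.kron(att, np.ones((h // k, w // k)))
    return frame * up

# Example: a uniform white frame with all attention on the top-left cell.
frame = np.ones((224, 224))
weights = np.zeros(49)
weights[0] = 1.0
overlay = attention_overlay(frame, weights)
print(overlay.shape)
```

In the resulting image, bright blocks mark the locations the model weighted most heavily at that time step. A smoother interpolation could replace the block upsampling if a less pixelated map is preferred.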

Implications and Future Directions

Attention mechanisms in video action recognition open avenues for better interpretability and efficiency in temporal modeling tasks. The success of this model could spur further research into attention-based frameworks, potentially including hybrid strategies that combine soft and hard attention. Future work could also scale the model to larger datasets or augment it with multi-resolution features to capture a wider range of video contexts.

In summary, the paper presents a method that not only improves action classification accuracy but also enhances understanding of the underlying model decisions. Its contributions lie in advancing the integration of attention mechanisms in temporal sequence analysis, setting the stage for future innovations in video understanding tasks.