Online Tracking by Learning Discriminative Saliency Map with Convolutional Neural Network

Published 24 Feb 2015 in cs.CV | (1502.06796v1)

Abstract: We propose an online visual tracking algorithm that learns a discriminative saliency map using a Convolutional Neural Network (CNN). Given a CNN pre-trained offline on a large-scale image repository, our algorithm takes outputs from hidden layers of the network as feature descriptors, since they show excellent representation performance in various general visual recognition problems. The features are used to learn discriminative target appearance models using an online Support Vector Machine (SVM). In addition, we construct a target-specific saliency map by backpropagating CNN features with guidance of the SVM, and obtain the final tracking result in each frame based on the appearance model generatively constructed with the saliency map. Since the saliency map effectively visualizes the spatial configuration of the target, it improves target localization accuracy and enables us to achieve pixel-level target segmentation. We verify the effectiveness of our tracking algorithm through extensive experiments on a challenging benchmark, where our method demonstrates outstanding performance compared to state-of-the-art tracking algorithms.

Summary

- The paper introduces an online tracking method that combines CNN-extracted features with an online SVM to generate target-specific saliency maps.
- It reports stronger benchmark performance than prior trackers, localizing targets accurately under occlusion, illumination changes, and fast motion.
- The approach also yields pixel-level segmentation of the target, which is relevant to applications such as surveillance and autonomous navigation.

The paper proposes an approach to online visual tracking based on convolutional neural networks (CNNs) and discriminative saliency maps. Its primary contribution is a tracking algorithm that uses a pre-trained CNN to extract features, which then drive online learning of a target-specific appearance model. The goal is to improve both the accuracy and the adaptability of visual tracking.

Methodology and Contributions

The proposed algorithm can be summarized as follows:

Feature Extraction:
- The algorithm utilizes a CNN pre-trained on a large-scale image repository for feature extraction. The first fully-connected layer of the CNN provides feature vectors for samples extracted from the input frame.
Online SVM for Discriminative Learning:
- An online Support Vector Machine (SVM) is employed to discriminate between target and background using the CNN features. The SVM model is updated incrementally as new samples are processed.
Target-Specific Saliency Map:
- The SVM and CNN are combined to generate a target-specific saliency map. By back-propagating the features that are deemed relevant by the SVM, the algorithm highlights regions in the input image that are most discriminative for the target.
- The saliency maps for positive samples are aggregated to produce a comprehensive target-specific saliency map for each frame.
Generative Appearance Model:
- A generative model of the target appearance is constructed using the target-specific saliency maps from previous frames.
- This model captures the spatial configuration of the target and supports accurate localization. Tracking is performed via sequential Bayesian filtering, using the saliency map as the observation.
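
The saliency step can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: a single ReLU layer with hypothetical weights `W` and `b` stands in for the pre-trained CNN, `svm_w` stands in for the learned SVM weights, and treating features with a positive SVM weight as the relevant ones is an assumption about how relevance is decided. The sketch differentiates the positive-weight part of the SVM score with respect to the input, takes the gradient magnitude as saliency, and merges per-sample maps with a pixel-wise maximum (also an assumed choice of aggregation).

```python
import numpy as np


def target_specific_saliency(x, W, b, svm_w):
    """Saliency of a flattened input patch under a toy one-layer ReLU network.

    x      : input patch, shape (D,)
    W, b   : hypothetical feature layer (stand-in for the CNN), shapes (F, D), (F,)
    svm_w  : linear SVM weights over the F features, shape (F,)
    """
    pre = W @ x + b                      # pre-activation of the feature layer
    active = pre > 0                     # units where ReLU passes the gradient
    # Assumption: only features with positive SVM weight count as target-specific.
    w_plus = np.where(svm_w > 0, svm_w, 0.0)
    # Gradient of sum_i w_plus[i] * relu(pre[i]) with respect to the input x.
    grad_x = W.T @ (w_plus * active)
    return np.abs(grad_x)                # magnitude used as saliency (assumption)


def aggregate_saliency(frame_shape, patches):
    """Merge per-sample maps into one frame-sized map by pixel-wise maximum.

    patches: iterable of (top, left, saliency_2d); each patch is assumed to
    lie fully inside the frame.
    """
    frame_map = np.zeros(frame_shape, dtype=float)
    for top, left, sal in patches:
        h, w = sal.shape
        region = frame_map[top:top + h, left:left + w]   # view into frame_map
        np.maximum(region, sal, out=region)              # in-place update
    return frame_map
```

In a real tracker the gradient would be back-propagated through every layer of the network down to the image pixels, and each sample's map would be reshaped to its patch size before being passed to `aggregate_saliency`.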

Results

The algorithm's effectiveness was validated on a challenging benchmark dataset, demonstrating superior performance compared to state-of-the-art methods. Specifically:

- The proposed method achieved higher success rates and precision than the compared trackers on sequences involving challenges such as occlusion, illumination variation, and fast motion.
- The algorithm also improved the accuracy of locating and segmenting the target at the pixel level.
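
The summary does not define these metrics. As a rough sketch, assuming the conventions common in tracking benchmarks, the success rate can be read as the fraction of frames whose predicted box overlaps the ground truth by more than a threshold, and precision as the fraction of frames whose center location error stays within a pixel threshold. The `(x, y, w, h)` box format, the function names, and the default thresholds below are assumptions for illustration.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def center_error(a, b):
    """Euclidean distance between the centers of two (x, y, w, h) boxes."""
    dx = (a[0] + a[2] / 2) - (b[0] + b[2] / 2)
    dy = (a[1] + a[3] / 2) - (b[1] + b[3] / 2)
    return math.hypot(dx, dy)


def success_rate(predicted, ground_truth, overlap_threshold=0.5):
    """Fraction of frames with IoU above the (assumed) overlap threshold."""
    hits = sum(iou(p, g) > overlap_threshold
               for p, g in zip(predicted, ground_truth))
    return hits / len(ground_truth)


def precision(predicted, ground_truth, distance_threshold=20.0):
    """Fraction of frames with center error within the (assumed) pixel threshold."""
    hits = sum(center_error(p, g) <= distance_threshold
               for p, g in zip(predicted, ground_truth))
    return hits / len(ground_truth)
```

Sweeping `overlap_threshold` from 0 to 1, or `distance_threshold` over a range of pixel distances, gives curves that make trackers easier to compare than a single operating point.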

Implications and Future Research

The work has several practical implications:

Improved Target Localization:
- By utilizing a target-specific saliency map, the algorithm effectively captures spatial configurations, thereby enhancing target localization accuracy.
Robustness to Challenges:
- The algorithm's ability to handle common visual tracking challenges (e.g., occlusion, deformation) highlights its robustness and potential applicability in real-world scenarios.
Pixel-Level Segmentation:
- The method's capability to achieve pixel-level segmentation opens new opportunities for applications requiring precise tracking, such as video surveillance and autonomous navigation.

Future research may explore several avenues:

Adaptation to Different Contexts:
- Investigating the adaptation of this algorithm to different environmental contexts and its performance across varied datasets would be valuable.
Hybrid Models:
- Further integration of generative and discriminative models could be explored to balance the trade-offs between robustness and flexibility.
Efficiency Improvements:
- Optimizing computational efficiency, particularly for real-time applications, would be crucial for practical deployment.

Conclusion

This paper advances online visual tracking by combining the discriminative power of an online SVM with the feature representation of a pre-trained CNN. Generating target-specific saliency maps and building a generative appearance model from them supports both accurate localization and pixel-level segmentation. The reported benchmark results suggest that the method can improve the performance and reliability of visual tracking systems.
