SiamCAR: Siamese Fully Convolutional Classification and Regression for Visual Tracking (1911.07241v2)

Published 17 Nov 2019 in cs.CV

Abstract: By decomposing the visual tracking task into two subproblems as classification for pixel category and regression for object bounding box at this pixel, we propose a novel fully convolutional Siamese network to solve visual tracking end-to-end in a per-pixel manner. The proposed framework SiamCAR consists of two simple subnetworks: one Siamese subnetwork for feature extraction and one classification-regression subnetwork for bounding box prediction. Our framework takes ResNet-50 as backbone. Different from state-of-the-art trackers like Siamese-RPN, SiamRPN++ and SPM, which are based on region proposal, the proposed framework is both proposal and anchor free. Consequently, we are able to avoid the tricky hyper-parameter tuning of anchors and reduce human intervention. The proposed framework is simple, neat and effective. Extensive experiments and comparisons with state-of-the-art trackers are conducted on many challenging benchmarks like GOT-10K, LaSOT, UAV123 and OTB-50. Without bells and whistles, our SiamCAR achieves the leading performance with a considerable real-time speed.

Summary

The paper introduces SiamCAR, a novel approach that decomposes visual tracking into per-pixel classification and regression tasks.
It utilizes a Siamese network with a ResNet-50 backbone and depth-wise cross-correlation to enhance feature extraction and robustness.
Experiments show SiamCAR’s superior accuracy, with a 5.2% AO gain on GOT-10K over state-of-the-art trackers, while running at real-time speed.

An Expert Analysis of SiamCAR: Siamese Fully Convolutional Classification and Regression for Visual Tracking

The paper "SiamCAR: Siamese Fully Convolutional Classification and Regression for Visual Tracking" introduces a novel approach to object tracking using convolutional neural networks. The authors propose SiamCAR, a fully convolutional Siamese network that decomposes the tracking task into classification and regression subproblems. Unlike region-proposal-based trackers such as SiamRPN and SiamRPN++, the method is both anchor-free and proposal-free, which simplifies the model architecture and removes the need to tune anchor hyperparameters.

Framework and Methodology

The architecture of SiamCAR consists of two central components: a Siamese subnetwork for feature extraction and a classification-regression subnetwork for bounding box prediction. ResNet-50 serves as the backbone and provides the feature representation. A depth-wise cross-correlation layer then correlates the template features with the search-region features channel by channel, producing a multi-channel response map that preserves per-channel similarity information for the prediction subnetwork.
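To make the depth-wise cross-correlation step concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes feature maps are stored as nested lists indexed `[channel][row][col]`; the function name `depthwise_xcorr` is hypothetical. Each search channel is correlated only with the template channel of the same index, so the output keeps one response map per channel.

```python
def depthwise_xcorr(search, template):
    """Correlate each search channel with the matching template channel.

    search:   C channels, each an H x W nested list
    template: C channels, each an h x w nested list (h <= H, w <= W)
    returns:  C channels, each (H - h + 1) x (W - w + 1)
    """
    if len(search) != len(template):
        raise ValueError("search and template must have the same channel count")
    response = []
    for s_ch, t_ch in zip(search, template):
        H, W = len(s_ch), len(s_ch[0])
        h, w = len(t_ch), len(t_ch[0])
        out = []
        # Slide the template window over every valid position (no padding).
        for i in range(H - h + 1):
            row = []
            for j in range(W - w + 1):
                acc = 0.0
                for di in range(h):
                    for dj in range(w):
                        acc += s_ch[i + di][j + dj] * t_ch[di][dj]
                row.append(acc)
            out.append(row)
        response.append(out)
    return response
```

In practice this operation is usually expressed as a grouped convolution in a deep learning framework; the loops here only spell out what is computed for each channel.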

Unlike anchor-based approaches, SiamCAR performs end-to-end visual tracking with two per-pixel predictions: classification, which predicts the category of each pixel, and regression, which predicts the object bounding box at that pixel. This per-pixel formulation avoids the hyperparameter tuning associated with anchors and yields a streamlined architecture that remains effective across various benchmarks.
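The per-pixel formulation can be illustrated with a short decoding sketch. It assumes the regression output at each location is a tuple of distances `(l, t, r, b)` from that location to the left, top, right, and bottom sides of the box, and that response-map indices map to image coordinates through a fixed `stride` and `offset`; these values, and the function name `decode_best_box`, are illustrative assumptions rather than details given in this summary. Any additional score refinement used by the actual tracker is omitted.

```python
def decode_best_box(cls_scores, reg_ltrb, stride=8, offset=0):
    """Pick the highest-scoring location and turn its (l, t, r, b) into a box.

    cls_scores: H x W nested list of foreground scores
    reg_ltrb:   H x W nested list of (l, t, r, b) distances in image pixels
    returns:    ((x1, y1, x2, y2), score)
    """
    best = None
    for i, row in enumerate(cls_scores):
        for j, score in enumerate(row):
            if best is None or score > best[0]:
                best = (score, i, j)
    if best is None:
        raise ValueError("cls_scores must not be empty")
    score, i, j = best
    # Map the response-map index back to image coordinates (assumed mapping).
    x = offset + j * stride
    y = offset + i * stride
    l, t, r, b = reg_ltrb[i][j]
    return (x - l, y - t, x + r, y + b), score
```

Because every location predicts its own box directly, no anchor sizes or aspect ratios need to be chosen, which is the simplification the paragraph above describes.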

Experimental Results

Extensive evaluations were conducted on prominent datasets, such as GOT-10K, LaSOT, UAV123, and OTB-50. SiamCAR achieved leading accuracy while running at real-time speed. For instance, on the GOT-10K dataset, it considerably outperformed existing state-of-the-art trackers including SiamRPN++ and SPM, with an average overlap (AO) improvement of 5.2%. Similarly, the results on LaSOT and OTB-50 highlighted its robustness in handling diverse tracking challenges like occlusion, scale variation, and background clutter.
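For readers unfamiliar with the AO metric, the sketch below shows one way to compute it, assuming AO is the mean intersection-over-union (IoU) between predicted and ground-truth boxes over all frames and that boxes are given as `(x1, y1, x2, y2)` corners. The function names are hypothetical, and benchmark toolkits may differ in details such as how sequences are aggregated.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp to zero so disjoint boxes yield no intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_overlap(predictions, ground_truths):
    """Mean IoU over paired per-frame boxes."""
    if not predictions or len(predictions) != len(ground_truths):
        raise ValueError("need equally sized, non-empty box lists")
    total = sum(iou(p, g) for p, g in zip(predictions, ground_truths))
    return total / len(predictions)
```

For example, a tracker that matches the ground truth exactly in one frame and overlaps it by one third in another would score an AO of 2/3 over those two frames.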

Implications and Future Directions

Practically, SiamCAR's efficiency and simplicity present a significant advancement in real-time object tracking systems, pertinent for applications in surveillance and autonomous vehicles. The anchor-free design reduces both computational cost and the complexity of model training.

Theoretically, this work contributes to the understanding of how per-pixel classification and regression can simplify and improve tracking algorithms. It provides insights that may influence the design of future tracking models, particularly in exploiting fully convolutional architectures without relying on pre-defined anchors.

Looking forward, SiamCAR's simplicity might open avenues for further customization and enhancement, such as integrating more sophisticated data augmentation or adaptive learning strategies. Its flexible architecture can serve as a foundation for future research on real-time, robust visual tracking systems.
