Learning Policies for Adaptive Tracking with Deep Feature Cascades (1708.02973v2)

Published 9 Aug 2017 in cs.CV

Abstract: Visual object tracking is a fundamental and time-critical vision task. Recent years have seen many shallow tracking methods based on real-time pixel-based correlation filters, as well as deep methods that have top performance but need a high-end GPU. In this paper, we learn to improve the speed of deep trackers without losing accuracy. Our fundamental insight is to take an adaptive approach, where easy frames are processed with cheap features (such as pixel values), while challenging frames are processed with invariant but expensive deep features. We formulate the adaptive tracking problem as a decision-making process, and learn an agent to decide whether to locate objects with high confidence on an early layer, or continue processing subsequent layers of a network. This significantly reduces the feed-forward cost for easy frames with distinct or slow-moving objects. We train the agent offline in a reinforcement learning fashion, and further demonstrate that learning all deep layers (so as to provide good features for adaptive tracking) can lead to near real-time average tracking speed of 23 fps on a single CPU while achieving state-of-the-art performance. Perhaps most tellingly, our approach provides a 100X speedup for almost 50% of the time, indicating the power of an adaptive approach.

Summary

The paper proposes an adaptive tracking method, EAST, using a reinforcement learning agent to dynamically decide how many deep feature layers are needed per frame, reducing computational cost.
Evaluations on the OTB and VOT benchmarks show that EAST reaches state-of-the-art accuracy while running much faster than comparable deep trackers, at a near real-time average of 23 fps on a single CPU.
This learned adaptive policy allows for deploying deep learning-based visual tracking on resource-constrained devices without sacrificing accuracy.

Overview of "Learning Policies for Adaptive Tracking with Deep Feature Cascades"

The paper, "Learning Policies for Adaptive Tracking with Deep Feature Cascades," addresses the computational cost of single-object visual tracking. The task is central to applications such as video surveillance and autonomous driving, where both accuracy and speed matter. The authors aim to combine the accuracy of deep features with the speed of cheap features.

The key observation is that the frames of a video differ in how much computation they need. Frames in which the object moves slowly or stands out clearly from the background can be handled with cheap features, whereas ambiguous or rapidly changing scenes call for invariant but more expensive deep features. The paper therefore proposes an adaptive tracker that scales its computation to the difficulty of each frame.

Methodology

The method treats the layers of an existing deep convolutional neural network (CNN) as a feature cascade. An agent, trained with reinforcement learning, decides at each layer whether the object can already be located with high confidence or whether the next layer must be computed. Framing tracking as a decision-making process lets the agent learn a policy that processes as few layers as possible on easy frames while preserving accuracy on hard ones.
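The early-stopping idea can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the learned agent is replaced here by a simple confidence threshold, and the names `track_with_early_stop`, `toy_layer`, `toy_localize`, and `threshold_policy` are hypothetical stand-ins for the network layers, the localization step, and the policy.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)
Features = Sequence[float]


def track_with_early_stop(
    frame: Features,
    layers: List[Callable[[Features], Features]],
    localize: Callable[[Features], Tuple[Box, float]],
    should_stop: Callable[[float, int], bool],
) -> Tuple[Box, int]:
    """Run the cascade layer by layer and stop once the policy says so.

    Returns the predicted box and the number of layers actually evaluated.
    """
    if not layers:
        raise ValueError("the cascade needs at least one layer")
    features = frame
    for depth, layer in enumerate(layers, start=1):
        features = layer(features)            # pay for one more layer
        box, confidence = localize(features)  # locate the target on this layer
        if should_stop(confidence, depth):    # stand-in for the learned agent
            return box, depth
    return box, len(layers)                   # fell through: used every layer


# Toy stand-ins (assumptions for illustration only).
def toy_layer(features: Features) -> Features:
    return [v + 1.0 for v in features]


def toy_localize(features: Features) -> Tuple[Box, float]:
    confidence = min(1.0, features[0] / 4.0)
    return (0.0, 0.0, 10.0, 10.0), confidence


def threshold_policy(confidence: float, depth: int) -> bool:
    return confidence >= 0.75


cascade = [toy_layer] * 5
_, easy_depth = track_with_early_stop([2.0], cascade, toy_localize, threshold_policy)
_, hard_depth = track_with_early_stop([0.0], cascade, toy_localize, threshold_policy)
print(easy_depth, hard_depth)
```

In this toy setup the "easy" frame stops after one layer and the "hard" frame needs three, which mirrors how per-frame cost depends on the stopping depth.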

The tracker is built on a fully-convolutional Siamese network, which locates the target by comparing features of a target template with features of the search region in the current frame. The agent is trained with deep Q-learning, and its rewards reflect localization accuracy, measured by the Intersection-over-Union (IoU) between the predicted and ground-truth boxes.
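To make the reward signal concrete, the sketch below computes IoU between two boxes and turns it into a reward for a "stop" decision, followed by the standard one-step Q-learning target. The thresholded reward, the threshold of 0.6, and the discount factor of 0.9 are assumptions chosen for illustration; this summary does not state the exact reward definition or hyperparameters used in the paper.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0.0 else 0.0


def stop_reward(pred: Box, truth: Box, threshold: float = 0.6) -> float:
    """Hypothetical reward: +1 if stopping here localizes well, else -1."""
    return 1.0 if iou(pred, truth) >= threshold else -1.0


def q_target(
    reward: float,
    next_q_values: Sequence[float],
    gamma: float = 0.9,   # assumed discount factor
    terminal: bool = False,
) -> float:
    """One-step Q-learning target: r + gamma * max_a' Q(s', a')."""
    if terminal or not next_q_values:
        return reward
    return reward + gamma * max(next_q_values)


truth = (0.0, 0.0, 2.0, 2.0)
print(iou(truth, (1.0, 1.0, 3.0, 3.0)))          # overlap 1, union 7
print(stop_reward((0.0, 0.0, 2.0, 2.0), truth))  # perfect overlap
```

A deep Q-network would regress its prediction for the chosen action toward `q_target`; stopping ends the episode for that frame, so its target is just the reward.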

Experimental Results and Implications

The proposed model, the EArly-Stopping Tracker (EAST), was evaluated on the OTB-50, OTB-100, VOT-14, and VOT-15 tracking benchmarks. On these benchmarks it outperformed several state-of-the-art trackers, both correlation-filter-based and deep-learning-based, while running in near real time on a single CPU.

The results show that EAST learns to halt at early layers on most frames, which yields large speedups without loss of accuracy; the abstract reports a 100X speedup for almost 50% of the time. For practical applications, this is a step toward real-time deep tracking on resource-constrained hardware such as mobile and embedded systems.
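A short calculation shows why the stopping depth on the remaining frames matters as much as the fastest case. The sketch below computes the average speedup from the fraction of frames at each stopping point and their cost relative to running the full network. All numbers other than the "100X for about half the frames" figure are hypothetical and are not measurements from the paper.

```python
from typing import Sequence


def average_speedup(
    fractions: Sequence[float],
    relative_costs: Sequence[float],
) -> float:
    """Speedup over always running the full network.

    fractions[i] is the share of frames that stop at point i, and
    relative_costs[i] is their cost as a fraction of the full network.
    """
    if len(fractions) != len(relative_costs):
        raise ValueError("fractions and relative_costs must have equal length")
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    mean_cost = sum(p * c for p, c in zip(fractions, relative_costs))
    return 1.0 / mean_cost


# Half the frames are 100X cheaper, the other half pay the full cost.
print(average_speedup([0.5, 0.5], [0.01, 1.0]))

# Hypothetical mix: some of the remaining frames stop at a middle layer.
print(average_speedup([0.5, 0.3, 0.2], [0.01, 0.2, 1.0]))
```

In the first case the average speedup stays just under 2X, because the expensive half dominates the mean cost. The overall gain therefore also depends on hard frames stopping at intermediate layers rather than always running the whole network.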

Conclusion and Speculation on Future Directions

By showing that an adaptive policy can be learned on top of a deep network, this work points to a broader direction: making the computational cost of a deep model depend on the difficulty of its input. The idea is not specific to visual tracking and could extend to other time-critical tasks.

Future work could improve the robustness of the adaptive mechanism, explore unsupervised policy learning, or study how well such policies transfer across network architectures and tasks. Fourier-transform techniques could also be applied to further speed up the convolutional layers. Overall, learning efficient operating policies from data, rather than relying on static heuristics, is a promising way to make future systems both accurate and responsive.
