DEFT: Detection Embeddings for Tracking (2102.02267v2)

Published 3 Feb 2021 in cs.CV

Abstract: Most modern multiple object tracking (MOT) systems follow the tracking-by-detection paradigm, consisting of a detector followed by a method for associating detections into tracks. There is a long history in tracking of combining motion and appearance features to provide robustness to occlusions and other challenges, but typically this comes with the trade-off of a more complex and slower implementation. Recent successes on popular 2D tracking benchmarks indicate that top scores can be achieved using a state-of-the-art detector and relatively simple associations relying on single-frame spatial offsets -- notably outperforming contemporary methods that leverage learned appearance features to help re-identify lost tracks. In this paper, we propose an efficient joint detection and tracking model named DEFT, or "Detection Embeddings for Tracking." Our approach relies on an appearance-based object matching network jointly learned with an underlying object detection network. An LSTM is also added to capture motion constraints. DEFT has comparable accuracy and speed to the top methods on 2D online tracking leaderboards while having significant advantages in robustness when applied to more challenging tracking data. DEFT raises the bar on the nuScenes monocular 3D tracking challenge, more than doubling the performance of the previous top method. Code is publicly available.

Summary

The paper presents a novel unified framework that jointly learns detection and tracking via shared feature embeddings.
It employs multiscale feature embeddings and an LSTM-based motion module to improve tracking robustness under occlusion and large displacements.
Empirical evaluations on benchmarks like nuScenes demonstrate DEFT's superior performance compared to methods such as CenterTrack.

Overview of "DEFT: Detection Embeddings for Tracking"

The paper "DEFT: Detection Embeddings for Tracking" presents an approach to multi-object tracking (MOT) that matches the accuracy and speed of leading methods on 2D benchmarks while being more robust on harder tracking data. At its core is the DEFT model (Detection Embeddings for Tracking), which learns detection and tracking jointly within a single framework. This targets two persistent failure modes of MOT systems: occlusion and large inter-frame displacements.

The traditional tracking-by-detection paradigm involves two stages: object detection, followed by association of the detected objects across frames. Because the stages are built separately, adding appearance cues to the association step typically means a more complex and slower pipeline. DEFT addresses this by integrating appearance-based object matching into the detection network itself: a matching network is trained jointly with the detection backbone and reuses its features, which helps the system keep object identities consistent under challenging conditions.
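The association step described above can be illustrated with a minimal sketch. It assumes each track and each detection already has an embedding vector, and it uses cosine similarity with greedy matching as a simple stand-in for DEFT's learned matching network. The function names and the `min_sim` threshold are hypothetical and do not come from the paper or its code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def associate(track_embs, det_embs, min_sim=0.5):
    """Greedily match tracks to detections in order of descending similarity.

    Stand-in for a learned matching head (assumption, not the paper's method).
    Returns (matches, unmatched_tracks, unmatched_detections), where matches
    is a list of (track_index, detection_index) pairs.
    """
    pairs = sorted(
        (
            (cosine_similarity(t, d), ti, di)
            for ti, t in enumerate(track_embs)
            for di, d in enumerate(det_embs)
        ),
        reverse=True,
    )
    matches, used_tracks, used_dets = [], set(), set()
    for sim, ti, di in pairs:
        if sim < min_sim:
            break  # remaining pairs are even less similar
        if ti in used_tracks or di in used_dets:
            continue  # each track and detection is matched at most once
        matches.append((ti, di))
        used_tracks.add(ti)
        used_dets.add(di)
    unmatched_tracks = [i for i in range(len(track_embs)) if i not in used_tracks]
    unmatched_dets = [i for i in range(len(det_embs)) if i not in used_dets]
    return matches, unmatched_tracks, unmatched_dets


# Two existing tracks, three detections in the new frame.
tracks = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
dets = [[0.1, 0.9, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
matches, lost, new = associate(tracks, dets)
print(sorted(matches), lost, new)
```

In this toy case both tracks find their detection and the third detection is left unmatched, so a tracker would typically start a new track for it.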

Technical Contributions

The pivotal contribution of DEFT lies in its leveraging of detection embeddings for association, which allows the tracker to maintain object identity through features drawn directly from the detection network. This paper highlights a few innovative components that fuel DEFT's efficacy:

Joint Training of Detection and Tracking: By training both tasks simultaneously, DEFT promotes a synergistic relationship between detection and tracking modules. This joint approach ensures that feature representations are optimized for both object localization and re-identification, yielding greater tracking fidelity.
Multiscale Feature Embeddings: DEFT employs feature embeddings extracted from multiple scales within the detection network. This multiscale approach enhances the robustness of appearance-based tracking, mitigating the effects of scale variation in tracked objects.
LSTM-Based Motion Forecasting Module: To further bolster tracking reliability, DEFT introduces an LSTM module that forecasts future object positions, providing temporal coherence that aids in differentiating objects with similar appearances. This is particularly crucial in occlusion scenarios where visual cues alone might be insufficient.
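One way a motion forecast can constrain appearance matching is sketched below. This is an illustration under stated assumptions, not the paper's implementation: a constant-velocity predictor stands in for the LSTM, and the gating rule (zeroing affinities for detections farther than `max_dist` from the forecast) is a hypothetical choice, as are all function names.

```python
import math


def predict_next_center(history):
    """Constant-velocity forecast from the last two (x, y) centers.

    Stand-in for a learned LSTM forecaster (assumption, not the paper's model).
    """
    if len(history) < 2:
        return history[-1]
    (x0, y0), (x1, y1) = history[-2], history[-1]
    return (2 * x1 - x0, 2 * y1 - y0)


def gate_affinities(affinity, track_histories, det_centers, max_dist):
    """Zero out affinities for detections far from each track's forecast.

    affinity[i][j] is the appearance score between track i and detection j.
    """
    gated = []
    for row, history in zip(affinity, track_histories):
        px, py = predict_next_center(history)
        gated.append([
            score if math.hypot(cx - px, cy - py) <= max_dist else 0.0
            for score, (cx, cy) in zip(row, det_centers)
        ])
    return gated


# One track moving right by 2 px per frame; two look-alike detections.
affinity = [[0.9, 0.8]]
histories = [[(10.0, 10.0), (12.0, 10.0)]]
det_centers = [(14.5, 10.0), (60.0, 40.0)]
gated = gate_affinities(affinity, histories, det_centers, max_dist=5.0)
print(gated)
```

The second detection looks similar (score 0.8) but lies far from the forecast position, so its affinity is suppressed and it cannot steal the track's identity.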

Empirical Evaluations

DEFT is evaluated on several benchmarks, including MOT16/MOT17, KITTI, and the more challenging nuScenes dataset. The results show that DEFT is competitive on the 2D tracking datasets (MOT, KITTI) while significantly outperforming alternatives on the nuScenes monocular 3D tracking benchmark.

nuScenes Benchmark: DEFT excels here, more than doubling the performance of the previous top method on the nuScenes monocular 3D tracking challenge, with the gains reported in AMOTA (Average Multi-Object Tracking Accuracy). This indicates robust handling of large displacements and occlusions, and illustrates DEFT's suitability for applications such as autonomous driving, where these difficulties are more pronounced.
Comparison with CenterTrack: The paper contrasts DEFT with CenterTrack, noting that while both methods use similar detection backbones, DEFT's memory of appearance embeddings and its LSTM-based motion constraints lead to better tracking under demanding conditions, in particular recovery from longer occlusions and handling of fast-moving objects.
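The idea of keeping a memory of appearance embeddings, so that a track can be re-identified after an occlusion, can be sketched as follows. This is a minimal illustration, not DEFT's actual data structure: the class name, the `max_len` and `max_age` parameters, and the mean dot-product score are all assumptions made for the example.

```python
from collections import deque


class TrackMemory:
    """Keeps the most recent embeddings of one track (hypothetical helper)."""

    def __init__(self, track_id, embedding, max_len=5, max_age=3):
        self.track_id = track_id
        self.embeddings = deque([embedding], maxlen=max_len)  # oldest dropped first
        self.max_age = max_age  # frames a track may stay unmatched
        self.missed = 0

    def update(self, embedding):
        """Store the embedding of a newly matched detection."""
        self.embeddings.append(embedding)
        self.missed = 0

    def mark_missed(self):
        """Record an unmatched frame; return True while the track stays alive."""
        self.missed += 1
        return self.missed <= self.max_age

    def affinity(self, embedding):
        """Mean dot product between a detection and all stored embeddings."""
        scores = [
            sum(a * b for a, b in zip(stored, embedding))
            for stored in self.embeddings
        ]
        return sum(scores) / len(scores)


# A track is occluded for three frames, then its object reappears.
track = TrackMemory(track_id=7, embedding=[1.0, 0.0])
alive = [track.mark_missed() for _ in range(3)]
score = track.affinity([1.0, 0.0])
track.update([1.0, 0.0])
print(alive, score, track.missed)
```

Because the stored embeddings survive the missed frames, the reappearing detection still scores highly against the track, whereas an association based only on single-frame offsets has nothing to compare against once the object has been lost.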

Implications and Future Directions

DEFT's methodological improvements underscore the potential for integrated detection and tracking systems within computer vision applications. The advancements presented hold particular promise for autonomous driving technologies and surveillance systems where reliable multi-object tracking is critical.

Future work may extend DEFT's capabilities to a broader range of sensor modalities, such as LiDAR and radar, offering enhanced versatility across diverse environmental conditions. Furthermore, exploring algorithmic optimizations that reduce computational overhead without compromising tracking efficacy would be of significant interest, especially for deployment in real-time applications with hardware constraints.

In conclusion, DEFT represents a step forward in MOT research, showing that joint optimization of detection and appearance matching, combined with a learned motion model, can yield integrated, robust, and efficient tracking. Future research can build on this model to address the remaining challenges in multi-object tracking.

GitHub

GitHub - MedChaabane/DEFT: Joint detection and tracking model named DEFT, or "Detection Embeddings for Tracking." Our approach relies on an appearance-based object matching network jointly-learned with an underlying object detection network. An LSTM is also added to capture motion constraints. (275 stars)