- The paper presents an end-to-end deep affinity model that jointly learns object features and frame associations, reducing reliance on hand-crafted constraints.
- The method achieves competitive tracking accuracy and processing speed (6.3 fps) across benchmark datasets such as MOT15, MOT17, and UA-DETRAC.
- The approach streamlines the data association process in multiple object tracking, offering enhanced robustness to occlusions and scene changes.
Deep Affinity Network for Multiple Object Tracking
The paper introduces a novel approach for Multiple Object Tracking (MOT), focusing on improving the data association stage through a Deep Affinity Network (DAN). The traditional MOT pipeline pairs object detection with data association, and the association stage has often depended on hand-crafted constraints. This research instead leverages deep learning to model object appearances and their frame-to-frame affinities in an end-to-end manner.
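To make the baseline concrete, the following is a minimal sketch of the kind of hand-crafted data association DAN aims to replace: detections in consecutive frames are linked purely by bounding-box overlap (IoU) with a fixed threshold. The function names and the 0.3 threshold are illustrative choices, not from the paper.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes -- a typical
    hand-crafted affinity cue that learned affinity models aim to replace."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union > 0 else 0.0

def associate_by_iou(prev_boxes, next_boxes, min_iou=0.3):
    """Hand-crafted data association: greedily link each previous detection
    to the unclaimed next-frame detection of highest overlap, if any overlap
    exceeds the threshold. Unlinked previous boxes are treated as lost."""
    links, taken = {}, set()
    for i, pb in enumerate(prev_boxes):
        scores = [(iou(pb, nb), j) for j, nb in enumerate(next_boxes)
                  if j not in taken]
        if not scores:
            continue
        s, j = max(scores)          # best remaining overlap for this box
        if s >= min_iou:
            links[i] = j
            taken.add(j)
    return links
```

Rules like this break down under occlusion and appearance change, which is precisely the gap the learned affinity model targets.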
DAN is designed to learn compact yet comprehensive features of objects detected in video frames at various levels of abstraction. It performs exhaustive permutations of these features across frames to infer affinities, accounting for objects entering or leaving the scene. Because these affinity computations are efficient, the network can associate the current frame against data from multiple previous frames, enabling robust online tracking.
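The core idea can be sketched as follows: score every (previous, next) feature pair, then pad the resulting matrix with an extra row and column that absorb objects entering or leaving the scene. This is a simplified stand-in, not the paper's implementation: cosine similarity replaces DAN's learned affinity head, greedy assignment replaces its softmax-based association, and the `tau` entry/exit score is an invented threshold.

```python
import numpy as np

def affinity_matrix(feats_prev, feats_next, tau=0.5):
    """Score all (previous, next) feature pairs -- a stand-in for DAN's
    exhaustive feature permutation followed by its learned affinity head.
    Cosine similarity here replaces the learned scoring network."""
    a = feats_prev / np.linalg.norm(feats_prev, axis=1, keepdims=True)
    b = feats_next / np.linalg.norm(feats_next, axis=1, keepdims=True)
    M = a @ b.T                                    # (N_prev, N_next) scores
    # Extra column: objects leaving the scene; extra row: objects entering.
    M = np.hstack([M, np.full((M.shape[0], 1), tau)])
    M = np.vstack([M, np.full((1, M.shape[1]), tau)])
    return M

def associate(M):
    """Greedy association on the padded matrix: each previous-frame object
    claims its best remaining next-frame slot, or the exit column."""
    n_prev, n_next = M.shape[0] - 1, M.shape[1] - 1
    taken, links = set(), {}
    for i in np.argsort(-M[:n_prev].max(axis=1)):  # most confident rows first
        order = np.argsort(-M[i, :])
        j = next(j for j in order if j == n_next or j not in taken)
        links[int(i)] = None if j == n_next else int(j)  # None => exited
        if j != n_next:
            taken.add(int(j))
    return links
```

The padded row and column are what let a fixed-size matrix represent a variable number of objects, since an unmatched object simply lands on the entry/exit slot instead of being forced onto a poor match.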
The proposed method was evaluated on the challenging MOT15, MOT17, and UA-DETRAC datasets, demonstrating competitiveness with existing techniques across twelve evaluation metrics. The results show that DAN achieves a high level of accuracy, underscoring its potential as a reliable tool for online tracking in diverse scenarios.
Key Contributions and Findings
- End-to-End Affinity Learning: The paper presents a unique approach to model both appearance and affinity simultaneously within a single deep network framework. By extending this joint learning to non-consecutive frames during training, DAN becomes robust to occlusions, enabling deep trajectory tracking.
- Quantitative Performance: DAN achieves 6.3 fps on standard datasets and performs competitively against state-of-the-art online and offline trackers. It particularly excels in MOTA and other key metrics.
- Extensive Benchmarking: The approach has been rigorously benchmarked on publicly available tracking challenges, confirming its reliability and effectiveness across diverse video scenarios.
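The occlusion robustness described above can be illustrated with a toy online tracker: each track keeps its last-seen feature, and unmatched tracks survive for a few frames so that an occluded object can be re-linked when it reappears, instead of spawning a new identity. Everything here is an illustrative simplification of DAN's deep track association: the class name, the cosine scoring, the greedy matching, and the `max_age`/`thresh` values are all assumptions, not the paper's method.

```python
import numpy as np

class OcclusionTolerantTracker:
    """Toy online tracker in the spirit of associating against previous
    frames: unmatched tracks are aged rather than deleted, so a briefly
    occluded object can recover its original identity."""
    def __init__(self, max_age=3, thresh=0.5):
        self.tracks = {}              # track id -> (feature, frames unseen)
        self.next_id = 0
        self.max_age, self.thresh = max_age, thresh

    def step(self, feats):
        """Consume one frame's features; return a track id per detection."""
        feats = np.asarray(feats, dtype=float)
        ids, assigned = list(self.tracks), {}
        if ids and len(feats):
            F = np.stack([self.tracks[t][0] for t in ids])
            F = F / np.linalg.norm(F, axis=1, keepdims=True)
            G = feats / np.linalg.norm(feats, axis=1, keepdims=True)
            M = F @ G.T                       # track-vs-detection affinities
            while True:                       # greedy: best pairs first
                i, j = np.unravel_index(np.argmax(M), M.shape)
                if M[i, j] < self.thresh:
                    break
                assigned[int(j)] = ids[int(i)]
                M[i, :], M[:, j] = -np.inf, -np.inf
        out, matched = [], set(assigned.values())
        for j, f in enumerate(feats):
            tid = assigned.get(j)
            if tid is None:                   # new object entering the scene
                tid, self.next_id = self.next_id, self.next_id + 1
            self.tracks[tid] = (f, 0)
            out.append(tid)
        for t in ids:                         # age tracks missed this frame
            if t not in matched:
                f, age = self.tracks[t]
                if age + 1 > self.max_age:
                    del self.tracks[t]        # occluded too long: drop
                else:
                    self.tracks[t] = (f, age + 1)
        return out
```

In a short run, an object that disappears for one frame and returns is assigned its original id, which is the behavior the joint training over non-consecutive frames is meant to produce at scale.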
Practical and Theoretical Implications
Practically, DAN's integration into the MOT pipeline simplifies data association, replacing manual feature engineering with a learned affinity model. This not only improves tracking accuracy but also enhances adaptability to complex scenes involving lighting changes, occlusions, and dense crowds.
Theoretically, this approach pushes the boundaries of how deep neural networks can be trained to learn problem-specific features, carving a path toward networks that simultaneously model multiple aspects of a problem domain. The contribution of a single, cohesive architecture for both feature extraction and affinity estimation may inspire future research in fusing similar tasks within tracking and related domains.
Future Directions
The deployment of DAN opens several avenues for further exploration:
- Scalability and Efficiency: Future work could focus on enhancing the speed and scalability of the model to handle larger datasets and real-time applications more effectively.
- Robustness to Challenging Scenarios: Improving the model's robustness against extreme occlusions and visually similar objects would extend its applicability to more dynamic environments.
- Integration with Other Sensors: Expanding the input modalities to include additional sensory data such as depth or thermal imaging could provide richer context and performance improvements, particularly under challenging lighting conditions or in occluded scenes.
In summary, this paper provides a substantial contribution to the MOT field, demonstrating the efficacy of deep learning for comprehensive affinity modeling and providing a strong foundation for future enhancements in tracking applications.