- The paper introduces an unsupervised Re-ID learning module that forms identity associations via intra- and inter-frame similarities without requiring labeled data.

- It presents an occlusion estimation module that predicts overlapping regions to recover occluded objects, improving detection in crowded scenes.

- Experiments on MOTChallenge datasets show significant improvements in tracking metrics, such as higher MOTA and IDF1 scores, validating the approach.

Introduction
The paper "Online Multi-Object Tracking with Unsupervised Re-Identification Learning and Occlusion Estimation" (2201.01297) presents two modules designed to enhance online multi-object tracking (MOT) systems. To address the challenges posed by occlusions and re-identification (Re-ID), it introduces an unsupervised Re-ID learning module and an occlusion estimation module. Together, these additions aim to reduce the dependency on annotated identity information and to improve tracking performance by recovering occluded objects.
Unsupervised Re-Identification Learning Module
The unsupervised Re-ID learning module leverages the similarity in appearance between objects in adjacent video frames to build associations without requiring labeled identity information. This approach follows two key supervision signals:
- Strong Supervision: Objects within the same frame should not be associated with each other.
 
- Weak Supervision: Objects in adjacent frames are likely to share the same identity based on appearance.
 
The module uses a similarity matrix, S, that measures the cosine similarity between object features. A dynamic placeholder is added to the assignment matrix, M′, to handle objects that appear or disappear between frames. Learning is guided by an intra-frame loss, an inter-frame margin loss, and a cycle-consistency constraint loss, which together optimize the association matrix (Figure 1).
Figure 1: The proposed unsupervised Re-ID learning method, which assigns identities between adjacent frames without using explicit identity information.
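The similarity matrix and the placeholder column can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function names, the fixed placeholder value of 0.5 (the paper describes the placeholder as dynamic), the softmax temperature, and the squared-error form of the intra-frame term are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def cosine_similarity(a, b):
    """Similarity matrix S with S[i, j] = cos(a[i], b[j])."""
    return l2_normalize(a) @ l2_normalize(b).T

def assignment_with_placeholder(sim, placeholder=0.5, temperature=0.1):
    """Row-wise softmax over [S | placeholder column], giving M'.

    The extra last column absorbs objects with no good match in the other
    frame. The constant placeholder and temperature are assumed values.
    """
    n = sim.shape[0]
    padded = np.hstack([sim, np.full((n, 1), placeholder)])
    logits = padded / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

def intra_frame_loss(feats):
    """Strong supervision: objects in one frame should only match themselves."""
    s = cosine_similarity(feats, feats)
    return float(np.mean((s - np.eye(len(feats))) ** 2))

# Toy features: three objects in the previous frame, three in the current one.
prev = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0, 0.0],
                 [0.0, 0.0, 1.0, 0.0]])
curr = np.array([[0.1, 1.0, 0.0, 0.0],   # looks like prev[1]
                 [1.0, 0.1, 0.0, 0.0],   # looks like prev[0]
                 [0.0, 0.0, 0.0, 1.0]])  # newly appeared object

S = cosine_similarity(prev, curr)
M = assignment_with_placeholder(S)
matches = M.argmax(axis=1)  # index 3 is the placeholder column
```

In this toy case `prev[0]` and `prev[1]` are matched to their look-alikes in the current frame, while `prev[2]`, which has no counterpart, is assigned to the placeholder column rather than being forced onto an unrelated object.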
Occlusion Estimation Module
Occlusions pose significant challenges in MOT, often leading to missed detections. The occlusion estimation module predicts the locations of possible occlusions using a key-point estimation approach, so that lost objects can be found again in subsequent frames (Figure 2). The module generates an occlusion heatmap by estimating the center of the overlap between bounding boxes. The tracking algorithm then uses the predicted occlusion centers, together with predicted motion, to recover occluded objects.
Figure 2: Typical occlusion cases. The translucent blue areas signify where occlusions occur, highlighted by red occlusion centers.
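One way to picture how an occlusion heatmap target could be built from bounding boxes is sketched below. This is an assumption-laden illustration rather than the paper's code: boxes are taken to be `(x1, y1, x2, y2)` tuples in heatmap coordinates, every pairwise overlap contributes one center, and the Gaussian width `sigma` is a hypothetical constant.

```python
import numpy as np
from itertools import combinations

def overlap_box(a, b):
    """Intersection of two (x1, y1, x2, y2) boxes, or None if they are disjoint."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    if x2 <= x1 or y2 <= y1:
        return None
    return (x1, y1, x2, y2)

def occlusion_centers(boxes):
    """Center (cx, cy) of the overlap region for every overlapping box pair."""
    centers = []
    for a, b in combinations(boxes, 2):
        inter = overlap_box(a, b)
        if inter is not None:
            centers.append(((inter[0] + inter[2]) / 2, (inter[1] + inter[3]) / 2))
    return centers

def render_heatmap(centers, height, width, sigma=2.0):
    """Place one Gaussian peak per occlusion center (sigma is an assumed value)."""
    ys, xs = np.mgrid[0:height, 0:width]
    heatmap = np.zeros((height, width))
    for cx, cy in centers:
        peak = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
        heatmap = np.maximum(heatmap, peak)  # keep the strongest response
    return heatmap

# Two overlapping pedestrians and one isolated pedestrian (toy coordinates).
boxes = [(2, 2, 12, 20), (8, 4, 18, 22), (30, 5, 38, 20)]
centers = occlusion_centers(boxes)
heatmap = render_heatmap(centers, height=32, width=48)
```

Here only the first two boxes overlap, so the heatmap has a single peak at the middle of their intersection. A key-point head trained on such targets would flag that location as a place where a detection may be hidden.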
Implementation in Existing MOT Systems
These modules can be added to existing tracking systems such as FairMOT and CenterTrack. For FairMOT, the Re-ID learning mechanism is replaced by the unsupervised approach, and the occlusion estimation module is added alongside the detection head. The unsupervised Re-ID improves scalability and reduces reliance on labeled data, while the occlusion estimation improves the system's ability to handle densely packed scenes by explicitly predicting where objects are occluded (Figure 3).
Figure 3: Application of the unsupervised Re-ID module and occlusion module to FairMOT.
Experimentation and Results
Experiments on the MOTChallenge datasets (MOT16, MOT17, and MOT20) show significant improvements in tracking metrics such as MOTA, IDF1, and identity switches (IDS) when these modules are integrated. The results show fewer false negatives and higher tracking accuracy, which the paper attributes to recovering heavily occluded objects (Figure 4).
Figure 4: Cases where lost objects are re-identified by the occlusion estimation module.
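To see why fewer false negatives translate into a higher MOTA, recall the standard definition MOTA = 1 - (FN + FP + IDSW) / GT, where GT is the total number of ground-truth objects over all frames. The short sketch below applies it to made-up counts; the numbers are hypothetical and are not results from the paper.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """Multiple Object Tracking Accuracy: 1 - (FN + FP + IDSW) / GT."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt

# Hypothetical counts, chosen only to illustrate the formula.
baseline = mota(false_negatives=200, false_positives=100, id_switches=20, num_gt=1000)
# Recovering 50 occluded objects removes 50 false negatives.
with_recovery = mota(false_negatives=150, false_positives=100, id_switches=20, num_gt=1000)
gain = with_recovery - baseline
```

Because all three error types share the same denominator, each recovered object raises MOTA by 1 / GT, provided the recovery does not introduce new false positives or identity switches.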
Conclusion
The paper contributes to enhancing MOT by introducing two key modules that address occlusions and Re-ID without explicit labeling requirements. The methodology presents a scalable approach, applicable to real-world tracking scenarios and adaptable to large-scale video data. Future developments could focus on further optimizing these modules to handle increasingly complex scenes.
This work also underscores the potential of unsupervised learning in tracking systems, suggesting a shift away from conventional supervised techniques reliant on extensive annotated datasets, thereby broadening applicability across varied domains.