MeMOTR: Long-Term Memory-Augmented Transformer for Multi-Object Tracking

Published 28 Jul 2023 in cs.CV | (2307.15700v3)

Abstract: As a video task, Multiple Object Tracking (MOT) is expected to capture temporal information of targets effectively. Unfortunately, most existing methods only explicitly exploit the object features between adjacent frames, while lacking the capacity to model long-term temporal information. In this paper, we propose MeMOTR, a long-term memory-augmented Transformer for multi-object tracking. Our method is able to make the same object's track embedding more stable and distinguishable by leveraging long-term memory injection with a customized memory-attention layer. This significantly improves the target association ability of our model. Experimental results on DanceTrack show that MeMOTR impressively surpasses the state-of-the-art method by 7.9% and 13.0% on HOTA and AssA metrics, respectively. Furthermore, our model also outperforms other Transformer-based methods on association performance on MOT17 and generalizes well on BDD100K. Code is available at https://github.com/MCG-NJU/MeMOTR.

Summary

The paper introduces MeMOTR, a memory-augmented Transformer that significantly enhances long-term association in multi-object tracking.
It employs a customized memory-attention layer and adaptive aggregation strategy to maintain stable track embeddings over extended sequences.
On DanceTrack, MeMOTR surpasses the previous state-of-the-art method by 7.9% on HOTA and 13.0% on AssA; it also outperforms other Transformer-based methods on association performance on MOT17 and generalizes well to BDD100K.

Analyzing MeMOTR: Long-Term Memory-Augmented Transformer for Multi-Object Tracking

The research paper introduces MeMOTR, a novel approach to multi-object tracking (MOT) leveraging a long-term memory-augmented Transformer. Traditional methods in MOT primarily focus on associating object features between adjacent frames, often neglecting long-term temporal information that can enhance tracking accuracy and stability. The proposed MeMOTR architecture integrates long-term memory with Transformer mechanisms to improve the robustness of object association over time.

Methodology and Architectural Insights

MeMOTR advances the MOT task by addressing the limitations of short-term object feature association and improving the target's track embedding through long-term memory integration. The core contributions of the paper include:

Long-Term Memory Incorporation: The method maintains a long-term memory for each tracked object using an exponential recursion update algorithm. This approach allows for a more stable track embedding by injecting this memory into the model, ultimately enhancing the model's ability to distinguish and associate tracked objects over extended sequences.
Customized Memory-Attention Layer: A memory-attention layer is employed to generate a distinguishable representation of objects. By interacting with long-term memory, this layer reduces abrupt changes in track embeddings between frames, which is crucial for maintaining consistent object tracking, especially in complex scenes with many similar objects.
Adaptive Aggregation: The model utilizes an adaptive aggregation strategy, fusing object features from adjacent frames to enhance tracking robustness. This strategy serves to alleviate issues such as occlusion and blur by dynamically adjusting the influence of the current and previous frame's outputs.
Improved Detection-Tracking Alignment: Addressing the potential semantic gap between detection and tracking queries, the study introduces an additional layer within the Transformer architecture specifically tailored for initial object detection. This layer aids in better aligning detection with the existing tracked targets, facilitating more accurate tracking.
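
The long-term memory update described in the first contribution can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the "exponential recursion" is an exponential moving average of the form M_t = (1 - lam) * M_{t-1} + lam * O_t, where O_t is the track's output embedding at frame t. The function names, the plain-list vector type, the toy embeddings, and the value of lam are hypothetical choices made for readability.

```python
from typing import List

Vector = List[float]


def update_long_term_memory(memory: Vector, embedding: Vector, lam: float = 0.01) -> Vector:
    """Blend the current frame's embedding into the track's long-term memory.

    Assumed form (exponential moving average):
        new_memory = (1 - lam) * memory + lam * embedding
    A small lam makes the memory change slowly, which is what keeps it stable.
    """
    if len(memory) != len(embedding):
        raise ValueError("memory and embedding must have the same length")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [(1.0 - lam) * m + lam * e for m, e in zip(memory, embedding)]


def l2_distance(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


# Toy per-frame embeddings for one track (hypothetical values).
# The third frame is an outlier, e.g. caused by occlusion or blur.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.1]]

# Initialise the memory with the first observed embedding, then update per frame.
memory = list(embeddings[0])
for emb in embeddings[1:]:
    memory = update_long_term_memory(memory, emb, lam=0.1)

# How far the memory drifted from its starting point, versus the outlier frame.
memory_drift = l2_distance(memory, embeddings[0])
outlier_jump = l2_distance(embeddings[2], embeddings[0])
```

In this toy run the outlier frame jumps far from the initial embedding, while the memory moves only a small fraction of that distance. That is the stabilizing effect the summary attributes to long-term memory injection. In a real model the embeddings would be high-dimensional tensors and the memory would feed the memory-attention layer rather than be compared directly.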

Experimental Evaluation and Results

The experimental results substantiate the effectiveness of the MeMOTR model. On the DanceTrack dataset, recognized for its association challenges, MeMOTR outperformed prior state-of-the-art methods on Higher Order Tracking Accuracy (HOTA) and Association Accuracy (AssA). The model also outperformed other Transformer-based methods on association performance on MOT17 and generalized well on BDD100K, further supporting the efficacy of the proposed long-term memory mechanism.
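
For reference, the relationship between HOTA and AssA can be written out. The definitions below follow the standard formulation of the HOTA metric rather than anything stated in this summary. At a localization threshold α, TP, FN, and FP are the matched, missed, and spurious detections, and TPA(c), FNA(c), and FPA(c) are the true positive, false negative, and false positive associations for a matched detection c.

```latex
\mathrm{DetA}_\alpha = \frac{|\mathrm{TP}|}{|\mathrm{TP}| + |\mathrm{FN}| + |\mathrm{FP}|},
\qquad
\mathrm{AssA}_\alpha = \frac{1}{|\mathrm{TP}|} \sum_{c \in \mathrm{TP}}
  \frac{|\mathrm{TPA}(c)|}{|\mathrm{TPA}(c)| + |\mathrm{FNA}(c)| + |\mathrm{FPA}(c)|}

\mathrm{HOTA}_\alpha = \sqrt{\mathrm{DetA}_\alpha \cdot \mathrm{AssA}_\alpha}
```

The reported HOTA score averages HOTA_α over thresholds α from 0.05 to 0.95 in steps of 0.05. Because HOTA is the geometric mean of detection and association accuracy, a gain in AssA raises HOTA even when detection quality is unchanged.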

Specifically, MeMOTR surpassed the previous state-of-the-art method on DanceTrack by 7.9% on HOTA and 13.0% on AssA. These results reflect the impact of a memory-augmented Transformer on association reliability in MOT, particularly in complex scenarios such as tracking group dancers or sports players.

Implications and Future Directions

The work has both theoretical and practical implications. On the theoretical side, it extends the capabilities of Transformers in temporal modeling within the vision domain, suggesting that long-term memory can yield more informative features for sequence-based tasks. Practically, MeMOTR could be useful in applications requiring precise and consistent tracking over time, such as autonomous driving and intelligent surveillance systems.

Future developments in this area might explore optimizing the long-term memory update strategy for different datasets, examining alternative memory structures, or integrating additional cues (e.g., motion estimation models) to further refine tracking accuracy. Moreover, adapting the MeMOTR framework to work seamlessly with other object detection paradigms or backbone architectures could further extend its applicability and effectiveness across varying MOT scenarios.

Overall, the introduction of MeMOTR represents a significant stride towards leveraging long-term temporal dynamics in multi-object tracking, providing a strong foundation for future research and applications in this domain.
