ACTrack: Adding Spatio-Temporal Condition for Visual Object Tracking (2403.07914v1)
Abstract: Efficiently modeling spatio-temporal relations of objects is a key challenge in visual object tracking (VOT). Existing methods track by appearance-based similarity or long-term relation modeling, so the rich temporal context between consecutive frames is easily overlooked. Moreover, training trackers from scratch or fine-tuning large pre-trained models demands substantial time and memory. In this paper, we present ACTrack, a new tracking framework with additive spatio-temporal conditions. It preserves the quality and capabilities of the pre-trained Transformer backbone by freezing its parameters, and introduces a trainable, lightweight additive network to model spatio-temporal relations in tracking. We design an additive siamese convolutional network to ensure the integrity of spatial features, and perform temporal sequence modeling to simplify the tracking pipeline. Experimental results on several benchmarks show that ACTrack balances training efficiency and tracking performance.
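The core training recipe in the abstract (freeze the pre-trained backbone, learn only a small additive branch whose output is summed with the backbone's features) can be sketched with a toy scalar model. This is a minimal illustrative sketch, not the paper's actual architecture: the linear "backbone", the side-network initialization, and the loss are all assumptions made for the example.

```python
# Illustrative sketch of additive conditioning: a frozen "backbone" plus a
# small trainable additive branch. Names and the 1-D toy data are hypothetical.

def frozen_backbone(x):
    """Stand-in for the frozen pre-trained Transformer: weights never change."""
    return 2.0 * x + 1.0  # fixed weights (w=2, b=1)

class AdditiveSideNet:
    """Trainable lightweight branch; its output is ADDED to the backbone's."""
    def __init__(self):
        # Zero init, so the combined model initially equals the frozen
        # backbone (a common choice for additive side networks).
        self.w = 0.0
        self.b = 0.0

    def forward(self, x):
        return self.w * x + self.b

    def sgd_step(self, x, target, lr=0.05):
        # The squared-error gradient flows ONLY into the side net;
        # the backbone's parameters are untouched.
        pred = frozen_backbone(x) + self.forward(x)
        err = pred - target
        self.w -= lr * 2 * err * x
        self.b -= lr * 2 * err

side = AdditiveSideNet()
x, target = 1.0, 5.0            # toy sample: the backbone alone predicts 3.0
for _ in range(200):
    side.sgd_step(x, target)

combined = frozen_backbone(x) + side.forward(x)
print(round(combined, 3))       # → 5.0 (side net has learned the residual)
```

Only the side network's two parameters are ever updated, which mirrors why this style of training is cheaper in time and memory than fine-tuning the full backbone.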
- Yushan Han
- Kaer Huang