Segment Anything Meets Point Tracking (2307.01197v2)
Abstract: The Segment Anything Model (SAM) has established itself as a powerful zero-shot image segmentation model, enabled by efficient point-centric annotation and prompt-based models. While click and brush interactions are both well explored in interactive image segmentation, existing video methods focus on mask annotation and propagation. This paper presents SAM-PT, a novel method for point-centric interactive video segmentation that combines SAM with long-term point tracking. SAM-PT leverages robust, sparse point selection and propagation for mask generation: query points sampled on the target object are propagated through the video and used to prompt SAM for a mask in every frame. Unlike traditional object-centric mask propagation, point propagation exploits local structure information that is agnostic to object semantics. We highlight the merits of point-based tracking through direct evaluation on the zero-shot open-world Unidentified Video Objects (UVO) benchmark. Our experiments on popular video object segmentation and multi-object tracking-and-segmentation benchmarks, including DAVIS, YouTube-VOS, and BDD100K, suggest that a point-based segmentation tracker yields better zero-shot performance and more efficient interactions. We release our code, which integrates different point trackers and video segmentation benchmarks, at https://github.com/SysCV/sam-pt.
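The abstract's pipeline (sample sparse query points on the object, propagate them with a point tracker, prompt a segmenter with the tracked points in each frame) can be sketched as follows. This is an illustrative toy, not the authors' implementation: the k-medoids-style sampler is a simplified stand-in for the paper's point selection, and `tracker`/`segmenter` are hypothetical placeholder callables standing in for a long-term point tracker (e.g. PIPS or CoTracker) and SAM.

```python
import numpy as np

def select_query_points(mask, k=8, iters=10, seed=0):
    """Pick k representative foreground points from a binary mask with a
    simple k-medoids-style clustering (illustrative stand-in for the
    paper's point selection, not the released code)."""
    rng = np.random.default_rng(seed)
    pts = np.argwhere(mask)                       # (N, 2) foreground pixel coords
    medoids = pts[rng.choice(len(pts), size=k, replace=False)]
    for _ in range(iters):
        # assign every foreground pixel to its nearest medoid
        d = np.linalg.norm(pts[:, None] - medoids[None], axis=-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            cluster = pts[labels == j]
            if len(cluster) == 0:
                continue
            # medoid = cluster member minimizing total distance to the cluster
            dd = np.linalg.norm(cluster[:, None] - cluster[None], axis=-1).sum(1)
            medoids[j] = cluster[dd.argmin()]
    return medoids

def segment_video(frames, init_mask, tracker, segmenter, k=8):
    """SAM-PT-style loop (sketch): propagate sparse query points from
    frame to frame and prompt a promptable segmenter with the points
    that remain visible. `tracker(prev, cur, pts) -> (pts, visible)`
    and `segmenter(frame, pts) -> mask` are hypothetical interfaces."""
    points = select_query_points(init_mask, k=k)
    masks = [init_mask]
    for prev, cur in zip(frames, frames[1:]):
        points, visible = tracker(prev, cur, points)    # propagate points
        masks.append(segmenter(cur, points[visible]))   # prompt with visible points
    return masks
```

Because only a handful of points are propagated per object (rather than a dense mask), the tracking step stays cheap and the mask itself is regenerated from scratch by the segmenter on every frame.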