Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation (2404.03645v1)
Abstract: Referring video segmentation relies on natural language expressions to identify and segment objects, often emphasizing motion clues. Previous works treat a sentence as a whole and perform identification directly at the video level, mixing static image-level cues with temporal motion cues. However, image-level features cannot adequately capture the motion cues in a sentence, and static cues are not crucial for temporal perception. In fact, static cues can sometimes interfere with temporal perception by overshadowing motion cues. In this work, we propose to decouple video-level referring expression understanding into static and motion perception, with a specific emphasis on enhancing temporal comprehension. First, we introduce an expression-decoupling module that lets static cues and motion cues play their distinct roles, alleviating the issue of sentence embeddings overlooking motion cues. Second, we propose a hierarchical motion perception module to capture temporal information effectively across varying timescales. Third, we employ contrastive learning to distinguish the motions of visually similar objects. These contributions yield state-of-the-art performance across five datasets, including a remarkable 9.2% $\mathcal{J}\&\mathcal{F}$ improvement on the challenging MeViS dataset. Code is available at https://github.com/heshuting555/DsHmp.
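To make the contrastive-learning idea concrete, here is a minimal sketch of a generic InfoNCE-style loss. The abstract does not state which contrastive objective the authors use, so this is an assumption chosen for illustration, not their implementation. The function name `info_nce`, the `temperature` value, and the idea of treating one motion feature as the anchor, a matching feature as the positive, and features of visually similar objects as negatives are all hypothetical.

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss for a single anchor (illustrative sketch only).

    anchor:    (D,)   feature of the target object's motion (assumed)
    positive:  (D,)   feature that should match the anchor
    negatives: (K, D) features of other, visually similar objects
    """
    def normalize(x):
        # L2-normalize along the feature dimension so dot products are cosines.
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    a = normalize(anchor)
    p = normalize(positive)
    n = normalize(negatives)

    # Index 0 holds the positive similarity, the rest hold the negatives.
    logits = np.concatenate([[a @ p], n @ a]) / temperature
    logits = logits - logits.max()  # shift for numerical stability

    # Negative log-probability of picking the positive among all candidates.
    log_prob = logits[0] - np.log(np.exp(logits).sum())
    return -log_prob
```

The loss is small when the anchor is closer to the positive than to every negative, and grows when a negative is more similar than the positive, which is the behavior needed to pull apart objects that look alike but move differently.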
Authors: Shuting He, Henghui Ding