Abstract

The reference-based object segmentation tasks, namely referring image segmentation (RIS), few-shot image segmentation (FSS), referring video object segmentation (RVOS), and video object segmentation (VOS), aim to segment a specific object given either language or annotated masks as references. Despite significant progress in each respective field, current methods are task-specifically designed and developed in different directions, which hinders multi-task capabilities across these tasks. In this work, we end this fragmentation and propose UniRef++ to unify the four reference-based object segmentation tasks with a single architecture. At the heart of our approach is the proposed UniFusion module, which performs multiway fusion to handle different tasks with respect to their specified references. A unified Transformer architecture is then adopted for instance-level segmentation. With these unified designs, UniRef++ can be jointly trained on a broad range of benchmarks and can flexibly complete multiple tasks at run time by specifying the corresponding references. We evaluate our unified models on various benchmarks. Extensive experimental results indicate that UniRef++ achieves state-of-the-art performance on RIS and RVOS, and performs competitively on FSS and VOS with a parameter-shared network. Moreover, we show that the proposed UniFusion module can be easily incorporated into the current advanced foundation model SAM and obtains satisfactory results with parameter-efficient finetuning. Code and models are available at https://github.com/FoundationVision/UniRef.

Figure: The UniFusion module as a plug-in component for SAM in the cited research.

Overview

  • UniRef++ proposes a unified architecture for object segmentation across both images and videos, handling multiple tasks with a single model.

  • It introduces the UniFusion module, which fuses various reference information into visual features for a Transformer-based architecture.

  • The architecture demonstrates state-of-the-art performance in RIS and RVOS, and is competitive in FSS and VOS tasks.

  • The UniFusion module can be plugged into existing foundation models such as SAM, enabling parameter-efficient finetuning for reference-based segmentation.

  • The unified design replaces several task-specific models with a single network, reducing overall complexity and fostering cross-task synergies in vision-based AI.

UniRef++: A Unified Architecture for Referring Object Segmentation Across Images and Videos

Object segmentation is a crucial task in computer vision, enabling computers to delineate the shapes of specific objects within images or videos. Traditionally, researchers have approached object segmentation with specialized, task-specific models, leading to fragmented development and a multiplicity of models. A recent work, UniRef++, proposes an alternative: a single, unified architecture capable of handling four different object segmentation tasks.

Unified Approach to Segmentation

At the heart of UniRef++ is the UniFusion module, which performs multiway fusion for the various segmentation tasks by injecting different types of reference information, such as language descriptions and mask annotations, into visual features. This enables a Transformer-based architecture to process the fused features and carry out instance-level segmentation.
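The paper's exact implementation is not reproduced in this summary, but a minimal PyTorch sketch conveys the core idea: reference tokens (language embeddings or mask-pooled features) attend into the visual features via cross-attention. The class name, shapes, and single-layer design below are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class UniFusionSketch(nn.Module):
    """Minimal sketch of reference-into-visual fusion via cross-attention.

    Hypothetical re-creation for illustration; UniRef++'s actual UniFusion
    module is more elaborate (multiway fusion across task types).
    """

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual: torch.Tensor, reference: torch.Tensor) -> torch.Tensor:
        # visual:    (B, H*W, C) flattened image features (attention queries)
        # reference: (B, N, C) reference tokens -- language embeddings for
        #            RIS/RVOS, or mask-pooled features for FSS/VOS
        fused, _ = self.cross_attn(query=visual, key=reference, value=reference)
        return self.norm(visual + fused)  # residual keeps visual features intact
```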

One particularly appealing aspect of UniRef++ is its versatility. By specifying the appropriate reference, the same model can switch between tasks at run time, effectively serving as a Swiss Army knife for object segmentation.
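To make this reference switching concrete, here is a toy usage of the sketch above. The token counts and random tensors are stand-ins for real text-encoder and mask-pooling outputs.

```python
fusion = UniFusionSketch(dim=256)
visual = torch.randn(1, 64 * 64, 256)    # flattened image feature map

lang_tokens = torch.randn(1, 20, 256)    # stand-in text-encoder output (RIS/RVOS)
mask_tokens = torch.randn(1, 16, 256)    # stand-in mask-pooled features (FSS/VOS)

fused_ris = fusion(visual, lang_tokens)  # same weights, language reference
fused_vos = fusion(visual, mask_tokens)  # same weights, mask reference
```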

Performance Benchmarking

UniRef++ doesn't just excel in theory. Tested against a range of benchmarks, the architecture achieves state-of-the-art performance on referring image segmentation (RIS) and referring video object segmentation (RVOS), and remains competitive on few-shot image segmentation (FSS) and video object segmentation (VOS).

Incorporating into Existing Models

Another striking feature is the adaptability of the UniFusion module itself. It can be incorporated into existing segmentation foundation models such as SAM, where parameter-efficient finetuning adapts the pretrained model to UniRef++'s more general reference-based approach.
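The summary does not detail how the module is wired into SAM, but the parameter-efficient pattern it describes is standard: freeze the foundation model's weights and train only the plugged-in fusion module. The stand-in encoder below is an assumption; a real SAM integration would feed the fused features into SAM's mask decoder.

```python
import torch
import torch.nn as nn

# Stand-in for a frozen foundation model such as SAM's image encoder; the
# real architecture is far larger -- this only demonstrates the freezing
# pattern behind parameter-efficient finetuning.
frozen_encoder = nn.Conv2d(3, 256, kernel_size=16, stride=16)
for param in frozen_encoder.parameters():
    param.requires_grad = False  # foundation-model weights stay fixed

fusion = UniFusionSketch(dim=256)  # only these plug-in weights are trained
optimizer = torch.optim.AdamW(fusion.parameters(), lr=1e-4)
```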

In Conclusion

UniRef++ signifies a substantial step toward unified model architectures in vision-based AI. By collapsing multiple tasks into a single model, it fosters synergies across tasks while reducing model complexity and the number of networks to maintain. Its adaptability and performance across the four reference-based segmentation tasks suggest a promising move toward a more holistic, less fragmented future for AI vision capabilities.
