
OMG-Seg: Is One Model Good Enough For All Segmentation?

(2401.10229)
Published Jan 18, 2024 in cs.CV

Abstract

In this work, we address various segmentation tasks, each traditionally tackled by distinct or partially unified models. We propose OMG-Seg, One Model that is Good enough to efficiently and effectively handle all the segmentation tasks, including image semantic, instance, and panoptic segmentation, as well as their video counterparts, open vocabulary settings, prompt-driven, interactive segmentation like SAM, and video object segmentation. To our knowledge, this is the first model to handle all these tasks in one model and achieve satisfactory performance. We show that OMG-Seg, a transformer-based encoder-decoder architecture with task-specific queries and outputs, can support over ten distinct segmentation tasks and yet significantly reduce computational and parameter overhead across various tasks and datasets. We rigorously evaluate the inter-task influences and correlations during co-training. Code and models are available at https://github.com/lxtGH/OMG-Seg.

OMG-Seg achieves high-quality segmentation and tracking across multiple datasets with a single shared model.

Overview

  • OMG-Seg presents a unified encoder-decoder architecture for visual segmentation, handling image and video semantic, instance, and panoptic segmentation.

  • Utilizes a transformer-based architecture with task-specific queries and outputs to minimize computation and parameter overhead.

  • Achieves competitive accuracy across over ten distinct segmentation tasks and multiple datasets.

  • Enables parameter sharing and simplifies training and inference by co-training on combined datasets (see the training-loop sketch after this list).

  • Demonstrates versatility in complex video segmentation and interactive settings, aiming to standardize the visual segmentation process.
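
How co-training on combined datasets might look in practice can be sketched with a short, hypothetical PyTorch-style training loop. The dataset names and the `model(images, task=...)` and `criterion` interfaces below are assumptions for illustration, not code from the OMG-Seg repository.

```python
import random

def cotrain(model, criterion, optimizer, loaders, steps=10_000):
    """Hypothetical co-training loop: each step samples a batch from a
    randomly chosen dataset, so one set of shared weights sees every
    task during training."""
    iters = {name: iter(dl) for name, dl in loaders.items()}
    names = list(loaders)
    for _ in range(steps):
        name = random.choice(names)         # e.g. "coco_panoptic", "youtube_vis"
        try:
            images, targets = next(iters[name])
        except StopIteration:               # restart an exhausted loader
            iters[name] = iter(loaders[name])
            images, targets = next(iters[name])
        outputs = model(images, task=name)  # task tag selects the query setup
        loss = criterion(outputs, targets, task=name)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```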

Introduction to OMG-Seg

The landscape of visual segmentation in computer vision is complex, with a variety of tasks that have traditionally required distinct models or architectures. The newly introduced OMG-Seg, short for "One Model that is Good enough," aims to unify this area. Unlike previous models that focus on particular segmentation tasks, OMG-Seg efficiently addresses multiple segmentation challenges with a single encoder-decoder architecture: image and video semantic, instance, and panoptic segmentation, as well as interactive and open-vocabulary segmentation.

Achievements and Evaluation

OMG-Seg uses a transformer-based architecture that integrates task-specific queries and outputs. This design reduces the footprint of traditional multi-model pipelines in both computation and parameter count. Its performance has been rigorously tested across more than ten distinct segmentation tasks and multiple datasets, showing that OMG-Seg maintains a satisfactory level of accuracy while covering a broad scope of applications.

Technology Behind OMG-Seg

The core of OMG-Seg's innovation lies in its shared encoder-decoder structure coupled with a unified representation for different segmentation outputs. Queries within the model can represent masks, unique IDs, or visual prompts, enabling the shared decoder to process a diverse array of query types. This design allows significant parameter sharing across tasks and simplifies both training and inference. By co-training on combined datasets, OMG-Seg handles multiple segmentation tasks, from individual frames to entire video sequences.
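
To make the unified-query idea concrete, here is a minimal PyTorch-style sketch of a shared decoder that processes learned object queries and, optionally, prompt queries for interactive segmentation. The class and parameter names (`UnifiedSegmentationHead`, `num_object_queries`, and so on) are hypothetical and not drawn from the official OMG-Seg code; see the linked repository for the actual implementation.

```python
import torch
import torch.nn as nn

class UnifiedSegmentationHead(nn.Module):
    """Sketch of one shared decoder over heterogeneous queries: learned
    object queries cover semantic/instance/panoptic (and video) masks,
    while prompt queries encode clicks or boxes for SAM-style
    interactive segmentation."""

    def __init__(self, dim=256, num_object_queries=100, num_classes=133):
        super().__init__()
        self.object_queries = nn.Embedding(num_object_queries, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.class_head = nn.Linear(dim, num_classes + 1)  # +1 for "no object"
        self.mask_head = nn.Linear(dim, dim)               # per-query mask embeddings

    def forward(self, features, prompt_queries=None):
        # features: (B, HW, dim) flattened encoder/backbone features
        B = features.size(0)
        queries = self.object_queries.weight.unsqueeze(0).expand(B, -1, -1)
        if prompt_queries is not None:                     # interactive mode
            queries = torch.cat([queries, prompt_queries], dim=1)
        refined = self.decoder(queries, features)          # (B, Q, dim)
        logits = self.class_head(refined)                  # per-query class scores
        mask_emb = self.mask_head(refined)
        # dot product with per-pixel features -> per-query mask logits
        masks = torch.einsum("bqd,bpd->bqp", mask_emb, features)
        return logits, masks
```

Under a scheme like this, semantic, instance, and panoptic results differ mainly in how the same per-query class and mask logits are post-processed, which is where the parameter sharing across tasks comes from.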

Unified Approach Over Specialized Methods

Comparisons with other methods shed light on the competitiveness and potential of OMG-Seg. While specialized models may have the upper hand in certain segmentation tasks, none can match the universality of OMG-Seg's framework. A comparative study on various models demonstrates that OMG-Seg holds its ground against task-specific architectures. Its ability to operate in a wide range of scenarios, including complex video segmentation and interactive settings, underscores the model's flexibility.

In essence, OMG-Seg is not just another segmentation tool; it is a step toward a universal model, a Swiss Army knife of visual segmentation. By successfully training one model for numerous segmentation tasks, this approach paves the way for more efficient and simplified image and video analysis. As visual segmentation continues to play a crucial role in applications such as autonomous driving and augmented reality, the impact of a model like OMG-Seg could be far-reaching.
