- The paper introduces an object-aware prompt sampling method that replaces grid-search prompt sampling, reducing mask decoding time by at least 16×.
- It achieves an average 3.6% boost in mask AR@K on the LVIS dataset by leveraging YOLOv8-based object detection for prompt sampling.
- The approach integrates with MobileSAM’s distilled encoders to deliver efficient segmentation for both SegAny and SegEvery tasks.
Overview of MobileSAMv2: Efficient Segmentation
The paper "MobileSAMv2: Faster Segment Anything to Everything" addresses the computational inefficiencies present in the Segment Anything Model (SAM) regarding segmentation tasks. It specifically focuses on two tasks: segment anything (SegAny) and segment everything (SegEvery). MobileSAMv2 aims to enhance the efficiency of SegEvery by refining the mask generation process, reducing reliance on dense grid-search prompt sampling, and promoting more intelligent object-aware prompt sampling.
Key Contributions
The research identifies the efficiency bottleneck of SegEvery in SAM as the mask decoder: grid-search sampling forces it to process an exhaustive, largely redundant set of prompts, whose outputs must then be filtered to retain valid masks.
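To make the bottleneck concrete, here is a minimal sketch of the dense point-grid sampling that SAM's automatic mask generator performs, modeled on the `build_point_grid` helper in the official SAM repository; every sampled point becomes a prompt that the mask decoder must process.

```python
import numpy as np

def build_point_grid(n_per_side: int) -> np.ndarray:
    """Uniform n x n grid of (x, y) point prompts in normalized [0, 1]
    coordinates, mirroring SAM's automatic mask generator."""
    offset = 1.0 / (2 * n_per_side)            # center each point in its grid cell
    coords = np.linspace(offset, 1.0 - offset, n_per_side)
    xs, ys = np.meshgrid(coords, coords)
    return np.stack([xs.ravel(), ys.ravel()], axis=-1)

grid = build_point_grid(64)                    # 64 x 64 = 4096 point prompts
print(grid.shape)                              # (4096, 2): each one hits the decoder
```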
- Object-Aware Prompt Sampling:
- The paper introduces a novel method that samples prompts with a modern object detector, notably YOLOv8, replacing conventional grid-search sampling with object-aware prompts (see the sketch after this list). This reduces the number of prompts needed while achieving comparable or superior segmentation performance.
- Enhanced Performance Metrics:
- The approach yields at least a 16-fold reduction in time spent on the mask decoder while boosting performance by 3.6% on the mask AR@K metric on the LVIS dataset. Because prompts correspond to detected objects rather than arbitrary grid points, the method also avoids over-segmentation while still producing fine-grained masks.
- Compatibility with MobileSAM:
- MobileSAMv2 is designed to integrate seamlessly with the distilled image encoders from the original MobileSAM, forming a cohesive framework that delivers efficient solutions for both the SegAny and SegEvery tasks.
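The following hedged sketch shows how detector boxes can replace the point grid, using the `ultralytics` YOLOv8 API and the MobileSAM fork of `segment_anything` (packaged as `mobile_sam`, which registers a `vit_t` encoder). The checkpoint names, image path, confidence threshold, and one-box-at-a-time loop are illustrative simplifications; MobileSAMv2 itself batches box prompts through the decoder.

```python
import cv2
from ultralytics import YOLO                       # pip install ultralytics
from mobile_sam import sam_model_registry, SamPredictor  # MobileSAM fork of SAM

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)  # hypothetical path

# Stage 1: object discovery. Detector boxes become the prompts.
detector = YOLO("yolov8n.pt")                      # any YOLOv8 checkpoint; illustrative
boxes = detector(image, conf=0.25)[0].boxes.xyxy.cpu().numpy()      # (N, 4) xyxy

# Stage 2: prompt-guided mask decoding with MobileSAM's distilled encoder.
sam = sam_model_registry["vit_t"](checkpoint="mobile_sam.pt").eval()
predictor = SamPredictor(sam)
predictor.set_image(image)

masks = []
for box in boxes:                                  # MobileSAMv2 batches these instead
    mask, _, _ = predictor.predict(box=box, multimask_output=False)
    masks.append(mask[0])                          # one confident mask per detected object
```

Because each box prompt already localizes one object, the per-prompt cost is a single decoder pass with no candidate ranking, which is where the reported decoding-time reduction comes from.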
Methodological Insights
The authors replace the underlying inefficiency of the default grid-search with object-aware prompt sampling, which fits naturally into SAM's prompt-guided mask decoding architecture. Adopting bounding boxes as prompts circumvents the ambiguity that simpler point prompts introduce, eliminating the dependency on multi-mask outputs and cumbersome post-filtering.
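To illustrate the contrast, the hedged snippet below (continuing with the `predictor` set up in the previous sketch; all coordinates are made up) compares the two prompt types: a point prompt is ambiguous about scale, so SAM returns several candidate masks that must be scored and filtered, whereas a box prompt pins down the object and a single confident mask suffices.

```python
import numpy as np

# `predictor` as prepared in the previous sketch; coordinates are illustrative.
point = np.array([[350, 220]])                     # a single foreground click
candidates, scores, _ = predictor.predict(
    point_coords=point, point_labels=np.ones(1),
    multimask_output=True)                         # ambiguous: several candidate masks
best = candidates[np.argmax(scores)]               # post-hoc filtering is required

box = np.array([300, 180, 420, 300])               # xyxy box around the same object
mask, _, _ = predictor.predict(box=box, multimask_output=False)  # one mask, no filtering
```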
Strong Numerical Results
Experiments demonstrate that MobileSAMv2 with approximately 320 box prompts matches the performance of SAM's highest-density grid-search configurations (e.g., a 4096-point grid), a marked improvement in efficiency without compromising segmentation quality.
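A small sketch of how such a prompt budget might be enforced, assuming the detector returns per-box confidence scores (the cap of roughly 320 follows the paper; the helper name is hypothetical):

```python
import numpy as np

def select_prompts(boxes: np.ndarray, scores: np.ndarray, k: int = 320) -> np.ndarray:
    """Keep at most the k highest-confidence detector boxes as prompts.

    boxes: (N, 4) in xyxy format; scores: (N,). YOLOv8's built-in NMS has
    already removed duplicate boxes, so a confidence cut-off is all that remains.
    """
    keep = np.argsort(-scores)[:k]
    return boxes[keep]

# ~320 object-aware prompts replace a 4096-point grid: over 12x fewer
# decoder invocations at comparable zero-shot proposal quality.
```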
Implications and Future Directions
MobileSAMv2's improvements extend beyond computer vision benchmarks, offering practical gains for AI applications in which segmentation latency and accuracy are critical. The work also invites further exploration of more capable object discovery algorithms and more advanced distilled encoders to refine prompt sampling and accelerate processing.
Future investigations could evaluate MobileSAMv2 in real-time and resource-constrained environments, assessing its adaptability and performance across diverse datasets. The demand for optimized, scalable AI solutions continues to grow, and innovations such as this push the discipline toward more efficient and versatile applications.
In summary, MobileSAMv2 stands as a significant advance in segmentation methodology, offering researchers and practitioners a pragmatic route to higher efficiency and performance in class-agnostic segmentation under real-world computational constraints.