
RepViT-SAM: Towards Real-Time Segmenting Anything

(2312.05760)
Published Dec 10, 2023 in cs.CV

Abstract

Segment Anything Model (SAM) has recently shown impressive zero-shot transfer performance for various computer vision tasks. However, its heavy computational cost remains daunting for practical applications. MobileSAM proposes to replace the heavyweight image encoder in SAM with TinyViT via distillation, which significantly reduces computational requirements. However, its deployment on resource-constrained mobile devices still encounters challenges due to the substantial memory and computational overhead of self-attention mechanisms. Recently, RepViT achieved a state-of-the-art performance-latency trade-off on mobile devices by incorporating efficient architectural designs of ViTs into CNNs. Here, to achieve real-time segmenting anything on mobile devices, we follow MobileSAM and replace the heavyweight image encoder in SAM with a RepViT model, yielding RepViT-SAM. Extensive experiments show that RepViT-SAM enjoys significantly better zero-shot transfer capability than MobileSAM, along with nearly $10\times$ faster inference speed. The code and models are available at \url{https://github.com/THU-MIG/RepViT}.

Comparative visualization of zero-shot edge detection methods on BSDS500 by SAM, MobileSAM, and RepViT-SAM.

Overview

  • RepViT-SAM adapts the Segment Anything Model (SAM) for real-time segmentation on mobile devices.

  • It replaces SAM's heavyweight image encoder with an efficient RepViT backbone, a CNN that incorporates architectural designs from ViTs.

  • The model runs nearly 10× faster than MobileSAM (measured on a MacBook) without sacrificing zero-shot transfer performance.

  • Extensive testing across edge detection, instance segmentation, and video object segmentation confirmed its superior zero-shot transfer ability.

  • It achieves performance comparable to heavyweight ViT-based SAM models, giving developers a resource-efficient segmentation tool.

Introduction

In the field of computer vision, the Segment Anything Model (SAM) has recently been recognized for its exceptional ability to adapt to various tasks without additional training. Despite this flexibility, SAM's computational intensity has hindered its deployment on mobile devices, heavily limiting its practicality. MobileSAM addressed some of these issues by distilling SAM's encoder into a lightweight TinyViT image encoder, but the self-attention mechanisms in that encoder still incur substantial speed and memory costs on mobile platforms.

Methodology

The newly proposed RepViT-SAM model refines SAM's architecture for real-time performance on mobile devices. It replaces the heavyweight image encoder of the original SAM with a RepViT model, a CNN architecture that incorporates efficient designs from Vision Transformers (ViTs). Leveraging components such as early convolutions, structurally reparameterized depthwise convolutions, and squeeze-and-excitation layers, RepViT-SAM delivers high-quality segmentation at a significantly reduced computational cost. In tests, it demonstrated nearly 10× faster inference than MobileSAM on a MacBook without compromising zero-shot transfer capability.
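
To make the structural reparameterization idea concrete, below is a minimal PyTorch sketch of a reparameterized depthwise convolution, the kind of building block RepViT relies on. All names here are illustrative (this is not the authors' implementation), and the sketch omits the batch-normalization folding a real block would also perform.

```python
import torch
import torch.nn as nn

class RepDWConv(nn.Module):
    """Illustrative structurally reparameterized depthwise convolution.

    Training-time form: a 3x3 depthwise conv plus an identity shortcut.
    Inference-time form: a single fused 3x3 depthwise conv.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.conv3x3 = nn.Conv2d(channels, channels, kernel_size=3,
                                 padding=1, groups=channels, bias=True)
        self.fused = None  # populated by reparameterize()

    def forward(self, x):
        if self.fused is not None:
            return self.fused(x)          # single-branch inference path
        return self.conv3x3(x) + x        # multi-branch training path

    @torch.no_grad()
    def reparameterize(self):
        # The identity shortcut is equivalent to a 3x3 depthwise conv
        # whose kernel is 1 at the center and 0 elsewhere, so the two
        # branches can be summed into one set of weights.
        c = self.conv3x3.in_channels
        identity_kernel = torch.zeros_like(self.conv3x3.weight)
        identity_kernel[:, 0, 1, 1] = 1.0
        self.fused = nn.Conv2d(c, c, kernel_size=3, padding=1,
                               groups=c, bias=True)
        self.fused.weight.copy_(self.conv3x3.weight + identity_kernel)
        self.fused.bias.copy_(self.conv3x3.bias)
```

After training, calling `reparameterize()` collapses the two branches into one depthwise convolution that produces identical outputs, which is what removes the multi-branch memory overhead at inference time.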

Experimental Results

As part of the assessment, RepViT-SAM was compared against other models in the domain across a comprehensive set of experiments: zero-shot edge detection, zero-shot instance segmentation, segmentation in the wild, video object segmentation, and other real-world applications. RepViT-SAM exhibited superior zero-shot transfer ability and significantly faster inference than its MobileSAM counterpart. Moreover, it achieved performance comparable to the heavyweight ViT-based SAM models, showcasing a promising balance between efficiency and effectiveness.
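
As a rough illustration of how such latency comparisons can be made, the sketch below times an arbitrary image encoder on SAM's default 1024×1024 input. This is a hypothetical benchmarking helper written for this summary, not the paper's measurement protocol (which targets on-device deployment).

```python
import time
import torch

@torch.no_grad()
def encoder_latency_ms(encoder, input_size=1024, warmup=10, runs=50):
    """Average wall-clock latency (ms) of an image encoder on a
    single 3x1024x1024 input, SAM's default resolution."""
    x = torch.randn(1, 3, input_size, input_size)
    encoder.eval()
    for _ in range(warmup):   # warm up allocator/caches before timing
        encoder(x)
    start = time.perf_counter()
    for _ in range(runs):
        encoder(x)
    return (time.perf_counter() - start) / runs * 1e3
```

Any `nn.Module`-style encoder that accepts a `(1, 3, 1024, 1024)` tensor can be passed in, so the same helper can time, for example, a RepViT encoder against a TinyViT one.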

Conclusion

RepViT-SAM has established itself as a formidable model for efficient image and video segmentation tasks, especially suited to mobile and other resource-constrained devices. Its distillation strategy and architectural choices pave the way for future work on lightweight, real-time computer vision systems. With the public release of its code and models, developers and researchers now have a robust framework for building segmentation solutions that run in real time on mobile devices.
