
SAM 2: Segment Anything in Images and Videos

(2408.00714)
Published Aug 1, 2024 in cs.CV, cs.AI, and cs.LG

Abstract

We present Segment Anything Model 2 (SAM 2), a foundation model towards solving promptable visual segmentation in images and videos. We build a data engine, which improves model and data via user interaction, to collect the largest video segmentation dataset to date. Our model is a simple transformer architecture with streaming memory for real-time video processing. SAM 2 trained on our data provides strong performance across a wide range of tasks. In video segmentation, we observe better accuracy, using 3x fewer interactions than prior approaches. In image segmentation, our model is more accurate and 6x faster than the Segment Anything Model (SAM). We believe that our data, model, and insights will serve as a significant milestone for video segmentation and related perception tasks. We are releasing a version of our model, the dataset and an interactive demo.

SAM 2 interactively segments video frames with prompts using a streaming memory for previous inputs.

Overview

  • The Segment Anything Model 2 (SAM 2) aims to facilitate promptable visual segmentation for both images and videos, leveraging interactive prompts and a transformer architecture with a memory mechanism for real-time video processing.

  • The authors introduce the Segment Anything Video (SA-V) dataset, the largest video segmentation dataset to date, with 35.5 million masks across 50.9 thousand videos, created through model-assisted annotation and human validation.

  • SAM 2 achieves significant improvements in segmentation accuracy and speed, surpassing previous models and setting new benchmarks in video object segmentation (VOS), with applications across various domains like augmented reality (AR) and autonomous vehicles.

SAM 2: Segment Anything in Images and Videos

The paper "SAM 2: Segment Anything in Images and Videos" presents the Segment Anything Model 2 (SAM 2), which aims to address the task of promptable visual segmentation across both images and videos. This research builds upon the foundations established by the original Segment Anything Model (SAM), and introduces several innovations that adapt the model's abilities to handle the temporal complexities inherent in video data.

Overview

SAM 2 is designed to unify image and video segmentation tasks, leveraging an interactive approach where prompts (clicks, boxes, or masks) are used to segment objects within frames, thus enabling precise video tracking. At its core, SAM 2 employs a transformer architecture equipped with a memory mechanism to handle real-time video processing. This enables the model to maintain context across video frames and improve segmentation accuracy through iterative user inputs.
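To make the interaction model concrete, the sketch below follows the usage pattern of the publicly released sam2 package: a click on one frame produces a mask immediately, and the streaming memory then propagates that object through the rest of the video. The checkpoint/config paths, the frame directory, and the click coordinates are placeholders, and method names such as add_new_points_or_box may differ slightly between package releases.

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Placeholder paths: substitute whichever checkpoint/config you downloaded.
checkpoint = "checkpoints/sam2_hiera_large.pt"
model_cfg = "sam2_hiera_l.yaml"
predictor = build_sam2_video_predictor(model_cfg, checkpoint)

# Assumes a CUDA GPU; "./video_frames" is a directory of JPEG frames.
with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    state = predictor.init_state(video_path="./video_frames")

    # One positive click (x, y) on frame 0 for object id 1 returns a mask
    # for that frame right away.
    _, obj_ids, mask_logits = predictor.add_new_points_or_box(
        inference_state=state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[210, 350]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),  # 1 = positive click, 0 = negative
    )

    # Propagate the prompted object through the video; the streaming memory
    # carries its context from frame to frame.
    video_masks = {}
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        video_masks[frame_idx] = (mask_logits > 0.0).cpu().numpy()
```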

Dataset and Data Engine

A significant contribution of this work is the creation of the largest video segmentation dataset to date, termed the Segment Anything Video (SA-V) dataset. This extensive dataset was constructed using a meticulously designed data engine, which combines model-assisted annotation with human validation to ensure high-quality annotations. The SA-V dataset comprises 35.5 million masks across 50.9 thousand videos, providing a diverse and challenging groundwork for training and evaluating video segmentation models.
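At a high level, the engine alternates between model proposals, human correction, and retraining, so that each phase of annotation benefits from a stronger model. The sketch below captures that loop with hypothetical callables (propose_masklets, verify_and_correct, retrain) standing in for the annotation tooling and training pipeline described in the paper; it is an illustration of the workflow, not the authors' implementation.

```python
def run_data_engine(videos, model, propose_masklets, verify_and_correct, retrain, phases=3):
    """Model-in-the-loop annotation sketch: each phase uses an improved model.

    `propose_masklets`, `verify_and_correct`, and `retrain` are hypothetical
    callables standing in for the model-assisted annotation tool, the human
    verification step, and the training pipeline, respectively.
    """
    dataset = []
    for _ in range(phases):
        for video in videos:
            proposals = propose_masklets(model, video)     # model suggests spatio-temporal masks
            dataset.extend(verify_and_correct(proposals))  # annotators accept, fix, or reject them
        model = retrain(model, dataset)                    # stronger model assists the next phase
    return model, dataset
```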

Model Architecture

SAM 2 extends the SAM framework by incorporating a streaming memory mechanism built from the following components (a minimal sketch of the resulting per-frame loop follows the list):

  • Image Encoder: A pre-trained hierarchical transformer model (Hiera) processes video frames individually.
  • Memory Attention: SAM 2 utilizes memory attention to condition current frame embeddings on historical frames, allowing context retention across the video stream.
  • Prompt Encoder and Mask Decoder: Enhanced to handle various types of prompts and predict multiple masks per frame when ambiguity in object segmentation arises.
  • Memory Encoder and Bank: Stores and retrieves frame embeddings and segmentation masks, enabling the model to maintain and update object contexts.
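The PyTorch skeleton below illustrates how these components could fit together in a per-frame streaming loop. The submodule interfaces and the fixed-size memory bank are simplifications chosen for illustration, not the released implementation.

```python
import torch
import torch.nn as nn
from collections import deque

class StreamingSegmenter(nn.Module):
    """Illustrative skeleton of SAM 2's per-frame streaming loop.

    The submodules passed in (image_encoder, memory_attention, ...) are
    stand-ins for the components named above, not the released code.
    """

    def __init__(self, image_encoder, memory_attention, prompt_encoder,
                 mask_decoder, memory_encoder, bank_size=7):
        super().__init__()
        self.image_encoder = image_encoder        # e.g. a Hiera backbone
        self.memory_attention = memory_attention  # cross-attends to stored memories
        self.prompt_encoder = prompt_encoder
        self.mask_decoder = mask_decoder
        self.memory_encoder = memory_encoder
        self.memory_bank = deque(maxlen=bank_size)  # recent and prompted frame memories

    @torch.inference_mode()
    def step(self, frame, prompts=None):
        # 1) Encode the current frame independently of other frames.
        feats = self.image_encoder(frame)

        # 2) Condition the frame features on stored memories, if any exist yet.
        if len(self.memory_bank) > 0:
            feats = self.memory_attention(feats, list(self.memory_bank))

        # 3) Encode any user prompts (clicks / boxes / masks) and decode masks.
        prompt_tokens = self.prompt_encoder(prompts) if prompts is not None else None
        masks, ious = self.mask_decoder(feats, prompt_tokens)

        # 4) Summarize this frame's prediction into a memory and store it.
        self.memory_bank.append(self.memory_encoder(feats, masks))
        return masks, ious
```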

Key Results and Performance

SAM 2 demonstrates substantial improvements over previous models, with notable highlights:

  • In video segmentation, SAM 2 achieves better accuracy while requiring 3x fewer interactions than prior approaches.
  • In image segmentation, it is both more accurate and 6x faster than the original Segment Anything Model (SAM).
  • The model sets a new state of the art on several video object segmentation (VOS) benchmarks, surpassing existing methods on standard accuracy metrics such as J&F and G.

The release includes not only the SAM 2 model and the SA-V dataset but also an interactive demo, fostering broader engagement within the research community.

Implications and Future Directions

The practical implications of SAM 2 are profound, spanning multiple domains such as augmented reality (AR), virtual reality (VR), autonomous vehicles, and video editing, where accurate and real-time object tracking is critical. Theoretically, SAM 2 pushes the boundaries of what is achievable with transformer models in video understanding tasks, particularly through the innovative integration of memory mechanisms.

Future AI developments may build upon this work by exploring more sophisticated memory architectures, enhancing the temporal coherence in segmentation, and reducing the reliance on prompt frequency. Moreover, applying SAM 2 in diverse, real-world scenarios will provide deeper insights into its robustness and the potential need for additional model tuning or dataset expansion.

In conclusion, SAM 2 represents a significant advancement in promptable segmentation models, particularly within the context of video data, and sets a new standard in both the breadth and precision of segmenting objects in multimedia content.
