
SAM 2: Segment Anything in Images and Videos

(2408.00714)
Published Aug 1, 2024 in cs.CV, cs.AI, and cs.LG

Abstract

We present Segment Anything Model 2 (SAM 2), a foundation model towards solving promptable visual segmentation in images and videos. We build a data engine, which improves model and data via user interaction, to collect the largest video segmentation dataset to date. Our model is a simple transformer architecture with streaming memory for real-time video processing. SAM 2 trained on our data provides strong performance across a wide range of tasks. In video segmentation, we observe better accuracy, using 3x fewer interactions than prior approaches. In image segmentation, our model is more accurate and 6x faster than the Segment Anything Model (SAM). We believe that our data, model, and insights will serve as a significant milestone for video segmentation and related perception tasks. We are releasing a version of our model, the dataset and an interactive demo.

SAM 2 interactively segments video frames with prompts using a streaming memory for previous inputs.

Overview

  • The Segment Anything Model 2 (SAM 2) aims to facilitate promptable visual segmentation for both images and videos, leveraging interactive prompts and a transformer architecture with a memory mechanism for real-time video processing.

  • The authors introduce the Segment Anything Video (SA-V) dataset, the largest video segmentation dataset to date, with 35.5 million masks across 50.9 thousand videos, created through model-assisted annotation and human validation.

  • SAM 2 achieves significant improvements in segmentation accuracy and speed, surpassing previous models and setting new benchmarks in video object segmentation (VOS), with applications across various domains like augmented reality (AR) and autonomous vehicles.

SAM 2: Segment Anything in Images and Videos

The paper "SAM 2: Segment Anything in Images and Videos" presents the Segment Anything Model 2 (SAM 2), which aims to address the task of promptable visual segmentation across both images and videos. This research builds upon the foundations established by the original Segment Anything Model (SAM), and introduces several innovations that adapt the model's abilities to handle the temporal complexities inherent in video data.

Overview

SAM 2 is designed to unify image and video segmentation tasks, leveraging an interactive approach where prompts (clicks, boxes, or masks) are used to segment objects within frames, thus enabling precise video tracking. At its core, SAM 2 employs a transformer architecture equipped with a memory mechanism to handle real-time video processing. This enables the model to maintain context across video frames and improve segmentation accuracy through iterative user inputs.
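To make the interaction model concrete, the sketch below follows the usage pattern of the publicly released sam2 package: a click on one frame produces a mask immediately, and the streaming memory then propagates that object through the rest of the video. The checkpoint/config paths, the frame directory, and the click coordinates are placeholders, and method names such as add_new_points_or_box may differ slightly between package releases.

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Placeholder paths: substitute whichever checkpoint/config you downloaded.
checkpoint = "checkpoints/sam2_hiera_large.pt"
model_cfg = "sam2_hiera_l.yaml"
predictor = build_sam2_video_predictor(model_cfg, checkpoint)

# Assumes a CUDA GPU; "./video_frames" is a directory of JPEG frames.
with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    state = predictor.init_state(video_path="./video_frames")

    # One positive click (x, y) on frame 0 for object id 1 returns a mask
    # for that frame right away.
    _, obj_ids, mask_logits = predictor.add_new_points_or_box(
        inference_state=state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[210, 350]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),  # 1 = positive click, 0 = negative
    )

    # Propagate the prompted object through the video; the streaming memory
    # carries its context from frame to frame.
    video_masks = {}
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        video_masks[frame_idx] = (mask_logits > 0.0).cpu().numpy()
```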

Dataset and Data Engine

A significant contribution of this work is the creation of the largest video segmentation dataset to date, termed the Segment Anything Video (SA-V) dataset. This extensive dataset was constructed using a meticulously designed data engine, which combines model-assisted annotation with human validation to ensure high-quality annotations. The SA-V dataset comprises 35.5 million masks across 50.9 thousand videos, providing a diverse and challenging groundwork for training and evaluating video segmentation models.
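At a high level, the engine alternates between model proposals, human correction, and retraining, so that each phase of annotation benefits from a stronger model. The sketch below captures that loop with hypothetical callables (propose_masklets, verify_and_correct, retrain) standing in for the annotation tooling and training pipeline described in the paper; it is an illustration of the workflow, not the authors' implementation.

```python
def run_data_engine(videos, model, propose_masklets, verify_and_correct, retrain, phases=3):
    """Model-in-the-loop annotation sketch: each phase uses an improved model.

    `propose_masklets`, `verify_and_correct`, and `retrain` are hypothetical
    callables standing in for the model-assisted annotation tool, the human
    verification step, and the training pipeline, respectively.
    """
    dataset = []
    for _ in range(phases):
        for video in videos:
            proposals = propose_masklets(model, video)     # model suggests spatio-temporal masks
            dataset.extend(verify_and_correct(proposals))  # annotators accept, fix, or reject them
        model = retrain(model, dataset)                    # stronger model assists the next phase
    return model, dataset
```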

Model Architecture

SAM 2 extends the SAM framework by incorporating a streaming memory mechanism built from the following components (a minimal sketch of the resulting per-frame loop follows the list):

  • Image Encoder: A pre-trained hierarchical transformer model (Hiera) processes video frames individually.
  • Memory Attention: SAM 2 utilizes memory attention to condition current frame embeddings on historical frames, allowing context retention across the video stream.
  • Prompt Encoder and Mask Decoder: Enhanced to handle various types of prompts and predict multiple masks per frame when ambiguity in object segmentation arises.
  • Memory Encoder and Bank: Stores and retrieves frame embeddings and segmentation masks, enabling the model to maintain and update object contexts.
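The PyTorch skeleton below illustrates how these components could fit together in a per-frame streaming loop. The submodule interfaces and the fixed-size memory bank are simplifications chosen for illustration, not the released implementation.

```python
import torch
import torch.nn as nn
from collections import deque

class StreamingSegmenter(nn.Module):
    """Illustrative skeleton of SAM 2's per-frame streaming loop.

    The submodules passed in (image_encoder, memory_attention, ...) are
    stand-ins for the components named above, not the released code.
    """

    def __init__(self, image_encoder, memory_attention, prompt_encoder,
                 mask_decoder, memory_encoder, bank_size=7):
        super().__init__()
        self.image_encoder = image_encoder        # e.g. a Hiera backbone
        self.memory_attention = memory_attention  # cross-attends to stored memories
        self.prompt_encoder = prompt_encoder
        self.mask_decoder = mask_decoder
        self.memory_encoder = memory_encoder
        self.memory_bank = deque(maxlen=bank_size)  # recent and prompted frame memories

    @torch.inference_mode()
    def step(self, frame, prompts=None):
        # 1) Encode the current frame independently of other frames.
        feats = self.image_encoder(frame)

        # 2) Condition the frame features on stored memories, if any exist yet.
        if len(self.memory_bank) > 0:
            feats = self.memory_attention(feats, list(self.memory_bank))

        # 3) Encode any user prompts (clicks / boxes / masks) and decode masks.
        prompt_tokens = self.prompt_encoder(prompts) if prompts is not None else None
        masks, ious = self.mask_decoder(feats, prompt_tokens)

        # 4) Summarize this frame's prediction into a memory and store it.
        self.memory_bank.append(self.memory_encoder(feats, masks))
        return masks, ious
```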

Key Results and Performance

SAM 2 demonstrates substantial improvements over previous models, with notable highlights:

  • In video segmentation, SAM 2 achieves better accuracy while requiring 3x fewer interactions than prior approaches.
  • In image segmentation, it is both more accurate and 6x faster than the original Segment Anything Model (SAM).
  • The model sets a new state of the art on several video object segmentation (VOS) benchmarks, surpassing existing methods on standard accuracy metrics such as J&F and G.

The release includes not only the SAM 2 model and the SA-V dataset but also an interactive demo, fostering broader engagement within the research community.

Implications and Future Directions

The practical implications of SAM 2 are profound, spanning multiple domains such as augmented reality (AR), virtual reality (VR), autonomous vehicles, and video editing, where accurate and real-time object tracking is critical. Theoretically, SAM 2 pushes the boundaries of what is achievable with transformer models in video understanding tasks, particularly through the innovative integration of memory mechanisms.

Future AI developments may build upon this work by exploring more sophisticated memory architectures, enhancing the temporal coherence in segmentation, and reducing the reliance on prompt frequency. Moreover, applying SAM 2 in diverse, real-world scenarios will provide deeper insights into its robustness and the potential need for additional model tuning or dataset expansion.

In conclusion, SAM 2 represents a significant advancement in promptable segmentation models, particularly within the context of video data, and sets a new standard in both the breadth and precision of segmenting objects in multimedia content.
