FusionSeg: Learning to combine motion and appearance for fully automatic segmentation of generic objects in videos

Published 19 Jan 2017 in cs.CV | (1701.05384v2)

Abstract: We propose an end-to-end learning framework for segmenting generic objects in videos. Our method learns to combine appearance and motion information to produce pixel level segmentation masks for all prominent objects in videos. We formulate this task as a structured prediction problem and design a two-stream fully convolutional neural network which fuses together motion and appearance in a unified framework. Since large-scale video datasets with pixel level segmentations are problematic, we show how to bootstrap weakly annotated videos together with existing image recognition datasets for training. Through experiments on three challenging video segmentation benchmarks, our method substantially improves the state-of-the-art for segmenting generic (unseen) objects. Code and pre-trained models are available on the project website.

Authors (3)

Citations (371)

Summary

The paper introduces a two-stream network that combines RGB and optical flow data for fully automatic segmentation of video objects.
It leverages bootstrapped training with pseudo-labels from existing datasets to overcome limited pixel-level supervision.
Evaluations on DAVIS, YouTube-Objects, and Segtrack-v2 show that its fusion strategy significantly outperforms state-of-the-art methods.

Overview of FusionSeg: An End-to-End Framework for Video Object Segmentation

The paper "FusionSeg: Learning to combine motion and appearance for fully automatic segmentation of generic objects in videos" presents a deep learning approach to segmenting generic objects in video frames. The task is to produce pixel-wise masks for prominent objects in a video, regardless of their category. The method uses a two-stream fully convolutional neural network: one stream takes RGB frames and the other takes optical flow, so that appearance and motion cues can compensate for each other's weaknesses.

Methodology

The authors introduce a novel framework that consists of the following key components:

Two-Stream Network Architecture: The network has two parallel streams, one handling appearance (RGB images) and the other handling motion (optical flow), whose outputs are subsequently fused. The architecture modifies ResNet-101 to preserve feature resolution and adds multi-resolution layers to handle objects of varying size.
Bootstrapping Training Data: Training deep networks requires substantial labeled data, which is scarce for video object segmentation at the pixel level. The authors address this by generating pseudo-labels from existing image segmentation datasets and weakly labeled videos from ImageNet-Video. A filtering step then keeps only high-quality training samples, using bounding box tests and optical flow analysis to check spatio-temporal coherence.
Fusion Strategy: To combine appearance and motion cues, the model fuses the outputs of the two streams once each has been processed independently. The fusion has three branches: one carrying appearance, one carrying motion, and one combining both through element-wise operations. The final prediction takes the element-wise maximum over these branches, which lets the framework exploit the complementary strengths of appearance and motion.
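
The fusion step can be illustrated with a minimal sketch. It assumes that the element-wise operation in the joint branch is a multiplication and that each stream has already produced a map of non-negative per-pixel foreground activations. The learned layers that surround each branch in the actual network are omitted, and the names `ScoreMap`, `elementwise`, and `fuse` are hypothetical.

```python
from typing import Callable, List

# Per-pixel foreground activations (rows x cols). Hypothetical type alias.
ScoreMap = List[List[float]]


def elementwise(op: Callable[[float, float], float],
                a: ScoreMap, b: ScoreMap) -> ScoreMap:
    """Apply a binary operation pixel by pixel to two same-sized maps."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("score maps must have the same shape")
    return [[op(x, y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def fuse(appearance: ScoreMap, motion: ScoreMap) -> ScoreMap:
    """Three-branch fusion: appearance, motion, and their product.

    Assumption: the joint branch is an element-wise multiplication.
    The output keeps, for every pixel, the strongest of the three branches.
    """
    joint = elementwise(lambda x, y: x * y, appearance, motion)
    best_single = elementwise(max, appearance, motion)
    return elementwise(max, best_single, joint)


# Toy 2x2 activations (made-up values, not taken from the paper).
appearance = [[2.0, 0.5], [0.0, 1.5]]
motion = [[1.5, 3.0], [0.2, 1.0]]
fused = fuse(appearance, motion)
```

In this toy input the top-left pixel is decided by the joint branch (both cues agree), while the top-right pixel is decided by motion alone, which is the kind of complementarity the maximum is meant to capture.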

Experimental Evaluation

The authors evaluate FusionSeg on three benchmark datasets: DAVIS, YouTube-Objects, and Segtrack-v2. The joint fusion model outperforms both the appearance-only and motion-only variants and improves substantially on the state of the art in fully automatic segmentation. The model is also competitive with semi-supervised methods, which rely on manual annotation of the target object at test time.

On DAVIS, FusionSeg achieves a higher Jaccard score (intersection over union) than competitive semi-supervised methods, demonstrating its efficacy in challenging scenarios involving occlusion and motion blur.
On YouTube-Objects, the approach shows robust performance across diverse and unconstrained categories, validating its generalization capability.
The results on Segtrack-v2 reinforce the method's competence, despite the dataset's lower resolution and unique challenges, further highlighting the importance of the motion contribution.
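
For reference, the Jaccard score used in these comparisons is the intersection over union of the predicted and ground-truth masks, J = |P ∩ G| / |P ∪ G|. The sketch below computes it for a single frame. The function name `jaccard`, the nested-list mask format, and the handling of two empty masks are assumptions made for illustration, not the benchmark's official evaluation code.

```python
from typing import List

# Binary mask: 1 marks an object pixel, 0 marks background.
Mask = List[List[int]]


def jaccard(pred: Mask, truth: Mask) -> float:
    """Intersection over union of two same-sized binary masks."""
    if len(pred) != len(truth) or any(
        len(p) != len(t) for p, t in zip(pred, truth)
    ):
        raise ValueError("masks must have the same shape")
    inter = 0
    union = 0
    for prow, trow in zip(pred, truth):
        for p, t in zip(prow, trow):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


# Toy masks: 2 shared pixels, 4 pixels covered by either mask.
pred = [[1, 1, 0], [0, 1, 0]]
truth = [[1, 0, 0], [0, 1, 1]]
score = jaccard(pred, truth)  # 2 / 4 = 0.5
```

A per-video or per-dataset score would then be an average of such per-frame values.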

Implications & Future Directions

The FusionSeg framework introduces several innovations that have important implications for the field of video segmentation:

End-to-End Trainability with Limited Supervision: By bootstrapping from weak labels, the authors reduce the amount of pixel-level annotation needed for training, which makes the approach practical when dense video annotations are scarce.
Generic Object Segmentation: The decoupling of appearance and motion learning enables the model to handle a wide array of objects beyond pre-defined categories, making it adaptable to various applications, from autonomous vehicles to content creation.
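
The weak-label bootstrapping mentioned above can be pictured as a filter over pseudo-labeled frames. The sketch below is only an illustration of that idea: it keeps a frame when the predicted mask agrees with the weak bounding box label and the box moves differently from the background. The function names, the box format, and both thresholds (`min_inside`, `min_contrast`) are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Mask = List[List[int]]        # 1 = predicted object pixel, 0 = background
FlowMag = List[List[float]]   # per-pixel optical flow magnitude
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive


def in_box(r: int, c: int, box: Box) -> bool:
    top, left, bottom, right = box
    return top <= r < bottom and left <= c < right


def mask_inside_fraction(mask: Mask, box: Box) -> float:
    """Share of predicted object pixels that fall inside the weak box label."""
    total = 0
    inside = 0
    for r, row in enumerate(mask):
        for c, v in enumerate(row):
            if v:
                total += 1
                if in_box(r, c, box):
                    inside += 1
    return inside / total if total else 0.0


def flow_contrast(flow_mag: FlowMag, box: Box) -> float:
    """Absolute gap between mean flow magnitude inside and outside the box."""
    inside: List[float] = []
    outside: List[float] = []
    for r, row in enumerate(flow_mag):
        for c, v in enumerate(row):
            (inside if in_box(r, c, box) else outside).append(v)
    if not inside or not outside:
        return 0.0
    return abs(sum(inside) / len(inside) - sum(outside) / len(outside))


def keep_frame(mask: Mask, flow_mag: FlowMag, box: Box,
               min_inside: float = 0.75,      # hypothetical threshold
               min_contrast: float = 1.0) -> bool:  # hypothetical threshold
    """Keep a pseudo-labeled frame only if both coherence tests pass."""
    return (mask_inside_fraction(mask, box) >= min_inside
            and flow_contrast(flow_mag, box) >= min_contrast)


# Toy 4x4 frame: the object sits in the central 2x2 box, one stray pixel outside.
box = (1, 1, 3, 3)
mask = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 1]]
moving = [[3.0 if in_box(r, c, box) else 0.5 for c in range(4)] for r in range(4)]
static = [[0.5] * 4 for _ in range(4)]
kept_moving = keep_frame(mask, moving, box)
kept_static = keep_frame(mask, static, box)
```

Here the frame with distinct object motion passes both tests, while the frame with uniform flow is rejected because motion offers no evidence that the box contains a moving object.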

Looking ahead, future work could address finer-grained challenges, such as separating multiple touching objects or incorporating domain knowledge for specific video contexts (e.g., medical imaging or surveillance). Other possible directions include modeling more sophisticated motion patterns and reaching real-time performance for deployment in dynamic environments.

In summary, FusionSeg is a significant contribution to the domain of video segmentation, leveraging deep learning to elegantly integrate appearance and motion, thereby achieving superior segmentation performance without requiring extensive manual labeling.
