- The paper demonstrates that integrating semantic segmentation with an attention module effectively automates high-quality image matting.
- It employs a two-stage network, a DeepLabV3+-based trimap generator followed by a MobileNetV2 encoder-decoder matting network, and shows robustness to complex transparency.
- Experimental results on an e-commerce dataset indicate performance competitive with state-of-the-art methods on metrics such as MSE, SAD, Conn, and Grad.
AlphaNet: An Attention Guided Deep Network for Automatic Image Matting
Introduction
The paper, "AlphaNet: An Attention Guided Deep Network for Automatic Image Matting," introduces AlphaNet, a deep network model that aims to automate the process of image matting without requiring human intervention. Image matting refers to the extraction of high-quality foreground objects from images, a crucial task in various applications like mixed reality and e-commerce. The proposed method integrates semantic segmentation with image matting into a unified network, facilitating the extraction of detailed semantic mattes, which are crucial for tasks like virtual try-ons in e-commerce platforms.
Model Architecture
AlphaNet consists of a segmentation/trimap-prediction network and a matting network. The segmentation network is based on the DeepLabV3+ architecture, extended with an erosion-dilation (ED) layer that turns binary masks into coarse trimaps marking known foreground, known background, and unknown pixel regions. The matting network is an encoder-decoder with a MobileNetV2 backbone and includes an attention module that guides the upsampling and downsampling operations.
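The paper implements erosion and dilation as a network layer; its exact kernel sizes are not reproduced here. The sketch below shows the underlying idea using plain OpenCV morphology, where `binary_mask_to_trimap` and its kernel size are illustrative assumptions:

```python
import cv2
import numpy as np

def binary_mask_to_trimap(mask: np.ndarray, kernel_size: int = 10) -> np.ndarray:
    """Derive a coarse trimap from a binary foreground mask.

    Eroded mask -> known foreground (255); pixels outside the dilated
    mask -> known background (0); the band in between -> unknown (128).
    The kernel size is an illustrative assumption, not from the paper.
    """
    mask = (mask > 0).astype(np.uint8)
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    eroded = cv2.erode(mask, kernel, iterations=1)
    dilated = cv2.dilate(mask, kernel, iterations=1)
    trimap = np.full(mask.shape, 128, dtype=np.uint8)  # start as unknown
    trimap[eroded == 1] = 255   # confidently foreground
    trimap[dilated == 0] = 0    # confidently background
    return trimap
```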
Attention Module
The attention module uses feature maps from the encoder to generate attention maps that modulate the spatial operations applied to the input. This is crucial for capturing fine boundary detail and producing high-quality mattes. The attention maps are produced by a fully convolutional network with normalization layers that keep the spatial transformations stable during training. This module enables the network to handle complex transparency scenarios effectively.
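A minimal PyTorch sketch of such a block follows; the channel widths, BatchNorm placement, and sigmoid gating are assumptions for illustration, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of a fully convolutional attention block: encoder features
    are mapped to a per-pixel weight in [0, 1] that gates features flowing
    through the decoder. Channel widths and normalization choices are
    illustrative assumptions."""

    def __init__(self, in_channels: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=3, padding=1),
            nn.BatchNorm2d(hidden),   # normalization for stable training
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, kernel_size=1),
            nn.Sigmoid(),             # attention weights in [0, 1]
        )

    def forward(self, encoder_feat: torch.Tensor,
                decoder_feat: torch.Tensor) -> torch.Tensor:
        # Assumes both feature maps share the same spatial resolution.
        attn = self.net(encoder_feat)   # (N, 1, H, W) attention map
        return decoder_feat * attn      # reweight decoder features
```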
Experimental Setup and Results
A new dataset focused on e-commerce was created for training and evaluating AlphaNet, consisting of images emphasizing human portraits and fashion accessories. The network was trained to predict alpha mattes, with evaluation metrics including Mean Squared Error (MSE), Sum of Absolute Differences (SAD), and perceptually motivated errors like Connectivity (Conn) and Gradient (Grad).
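For concreteness, here is a minimal sketch of the pixel-wise metrics. Conventions vary across matting papers (e.g., whether SAD is scaled by 1000 and whether errors are computed only over the trimap's unknown region); the connectivity metric is omitted because it requires a more involved region-growing computation:

```python
import numpy as np

def matting_mse(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean squared error between predicted and ground-truth alpha mattes
    (both assumed to be float arrays in [0, 1])."""
    return float(np.mean((pred - gt) ** 2))

def matting_sad(pred: np.ndarray, gt: np.ndarray) -> float:
    """Sum of absolute differences, conventionally reported in thousands."""
    return float(np.sum(np.abs(pred - gt)) / 1000.0)

def matting_grad(pred: np.ndarray, gt: np.ndarray) -> float:
    """Simplified gradient error: squared difference of gradient magnitudes.
    The original metric uses Gaussian first-derivative filters; np.gradient
    is a plain finite-difference stand-in."""
    pg = np.hypot(*np.gradient(pred))
    gg = np.hypot(*np.gradient(gt))
    return float(np.sum((pg - gg) ** 2))
```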
The results show that AlphaNet performs competitively against state-of-the-art methods such as Deep Image Matting (DIM) [Xu et al.] and KNN matting. Evaluated both with and without human-provided trimaps, AlphaNet handled complex images automatically with minimal degradation in matting quality, demonstrating its robustness.
Comparison with State-of-the-art Methods
AlphaNet's automatic pipeline was benchmarked against both interactive and automatic matting approaches. The attention module was found to significantly improve the matting results, preserving the intricate structural details that other methods often lose. The paper reports that AlphaNet handles diverse matting challenges well and highlights the advantages of the proposed homogeneous network, which simplifies training and eliminates the need for extensive manual input or additional ground-truth data at inference time.
Conclusion and Future Work
AlphaNet demonstrates that fully automated, high-fidelity image matting is achievable by combining semantic segmentation with deep matting in a single network. The attention module proved pivotal in refining the model's predictions, especially around complex boundaries and transparent regions. While the model performs well in its targeted application domain, future work could expand the dataset to cover a broader range of transparency types and explore architectural improvements that handle more diverse image content automatically.