- The paper demonstrates that integrating semantic segmentation with an attention module effectively automates high-quality image matting.
- It employs a two-stage network, a DeepLabV3+-based trimap generator followed by a MobileNetV2 encoder-decoder matting network, and shows robustness to complex transparency.
- Experimental results on an e-commerce dataset indicate performance competitive with state-of-the-art methods on metrics such as MSE, SAD, Conn, and Grad.
AlphaNet: An Attention Guided Deep Network for Automatic Image Matting
Introduction
The paper, "AlphaNet: An Attention Guided Deep Network for Automatic Image Matting," introduces AlphaNet, a deep network model that aims to automate the process of image matting without requiring human intervention. Image matting refers to the extraction of high-quality foreground objects from images, a crucial task in various applications like mixed reality and e-commerce. The proposed method integrates semantic segmentation with image matting into a unified network, facilitating the extraction of detailed semantic mattes, which are crucial for tasks like virtual try-ons in e-commerce platforms.
Model Architecture
AlphaNet consists of a segmentation/trimap-prediction network and a matting network. The segmentation network is based on the DeepLabV3+ architecture, extended with an erosion-dilation (ED) layer that turns binary masks into coarse trimaps marking known foreground, known background, and unknown pixel regions. The matting network is an encoder-decoder with a MobileNetV2 backbone and includes an attention module that guides the upsampling and downsampling operations.
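The paper implements erosion and dilation as a network layer; its exact kernel sizes are not reproduced here. The sketch below shows the underlying idea using plain OpenCV morphology, where `binary_mask_to_trimap` and its kernel size are illustrative assumptions:

```python
import cv2
import numpy as np

def binary_mask_to_trimap(mask: np.ndarray, kernel_size: int = 10) -> np.ndarray:
    """Derive a coarse trimap from a binary foreground mask.

    Eroded mask -> known foreground (255); pixels outside the dilated
    mask -> known background (0); the band in between -> unknown (128).
    The kernel size is an illustrative assumption, not from the paper.
    """
    mask = (mask > 0).astype(np.uint8)
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    eroded = cv2.erode(mask, kernel, iterations=1)
    dilated = cv2.dilate(mask, kernel, iterations=1)
    trimap = np.full(mask.shape, 128, dtype=np.uint8)  # start as unknown
    trimap[eroded == 1] = 255   # confidently foreground
    trimap[dilated == 0] = 0    # confidently background
    return trimap
```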
Attention Module
The attention module uses feature maps from the encoder to generate attention maps that modulate the spatial operations applied to the input. This is crucial for capturing fine boundary detail and producing high-quality mattes. The attention maps are produced by a fully convolutional network with normalization layers that keep the spatial transformations stable during training. This module enables the network to handle complex transparency scenarios effectively.
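A minimal PyTorch sketch of such a block follows; the channel widths, BatchNorm placement, and sigmoid gating are assumptions for illustration, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of a fully convolutional attention block: encoder features
    are mapped to a per-pixel weight in [0, 1] that gates features flowing
    through the decoder. Channel widths and normalization choices are
    illustrative assumptions."""

    def __init__(self, in_channels: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=3, padding=1),
            nn.BatchNorm2d(hidden),   # normalization for stable training
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, kernel_size=1),
            nn.Sigmoid(),             # attention weights in [0, 1]
        )

    def forward(self, encoder_feat: torch.Tensor,
                decoder_feat: torch.Tensor) -> torch.Tensor:
        # Assumes both feature maps share the same spatial resolution.
        attn = self.net(encoder_feat)   # (N, 1, H, W) attention map
        return decoder_feat * attn      # reweight decoder features
```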
Experimental Setup and Results
A new dataset focused on e-commerce was created for training and evaluating AlphaNet, consisting of images emphasizing human portraits and fashion accessories. The network was trained to predict alpha mattes, with evaluation metrics including Mean Squared Error (MSE), Sum of Absolute Differences (SAD), and perceptually motivated errors like Connectivity (Conn) and Gradient (Grad).
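For concreteness, here is a minimal sketch of the pixel-wise metrics. Conventions vary across matting papers (e.g., whether SAD is scaled by 1000 and whether errors are computed only over the trimap's unknown region); the connectivity metric is omitted because it requires a more involved region-growing computation:

```python
import numpy as np

def matting_mse(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean squared error between predicted and ground-truth alpha mattes
    (both assumed to be float arrays in [0, 1])."""
    return float(np.mean((pred - gt) ** 2))

def matting_sad(pred: np.ndarray, gt: np.ndarray) -> float:
    """Sum of absolute differences, conventionally reported in thousands."""
    return float(np.sum(np.abs(pred - gt)) / 1000.0)

def matting_grad(pred: np.ndarray, gt: np.ndarray) -> float:
    """Simplified gradient error: squared difference of gradient magnitudes.
    The original metric uses Gaussian first-derivative filters; np.gradient
    is a plain finite-difference stand-in."""
    pg = np.hypot(*np.gradient(pred))
    gg = np.hypot(*np.gradient(gt))
    return float(np.sum((pg - gg) ** 2))
```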
The results show that AlphaNet performs competitively against state-of-the-art methods such as Deep Image Matting (DIM) [Xu et al.] and KNN matting. Evaluated both with and without human-provided trimaps, AlphaNet handled complex images automatically with minimal degradation in matting quality, demonstrating its robustness.
Comparison with State-of-the-art Methods
AlphaNet's automatic pipeline was benchmarked against both interactive and automatic matting approaches. The attention module was found to significantly improve the matting results, preserving the intricate structural details that other methods often lose. The paper reports that AlphaNet handles diverse matting challenges well and highlights the advantages of the proposed homogeneous network, which simplifies training and eliminates the need for extensive manual input or additional ground-truth data at inference time.
Conclusion and Future Work
AlphaNet demonstrates that fully automated, high-fidelity image matting is achievable by combining semantic segmentation with deep matting in a single network. The attention module proved pivotal in refining the model's predictions, especially around complex boundaries and transparent regions. While the model performs well in its targeted application domain, future work could expand the dataset to cover a broader range of transparency types and explore architectural improvements that handle more diverse image content automatically.