- The paper introduces DeepLabv3, a model that leverages cascaded atrous convolutions and ASPP to capture multi-scale context for semantic segmentation.
- The paper demonstrates how atrous convolution mitigates feature resolution loss from pooling and strides, enabling denser and more precise feature maps.
- Experiments on PASCAL VOC 2012 validate the approach, with DeepLabv3 attaining 85.7% mIoU on the test set without DenseCRF post-processing.
Rethinking Atrous Convolution for Semantic Image Segmentation
The paper "Rethinking Atrous Convolution for Semantic Image Segmentation" by Liang-Chieh Chen et al. investigates the utility of atrous convolution in semantic segmentation tasks. The authors propose DeepLabv3, an enhanced version of their previous DeepLab architectures, which employs multiple strategies to improve segmentation performance, particularly in handling objects at multiple scales.
Atrous Convolution and Semantic Segmentation
The paper revisits atrous convolution (also known as dilated convolution), highlighting its benefits for semantic image segmentation. By spacing filter taps apart with a chosen rate, atrous convolution enables the extraction of dense feature maps and gives explicit control over the resolution at which DCNNs compute feature responses. This directly counters the reduced feature resolution caused by pooling operations and convolution strides in standard DCNNs, which hampers the dense, pixel-level predictions that semantic segmentation requires.
The paper identifies two primary challenges in using DCNNs for semantic segmentation: the reduced feature resolution and the presence of objects at multiple scales. To overcome these challenges, the authors leverage atrous convolution, which allows the adaptation of pretrained ImageNet networks to generate denser feature maps by removing downsampling operations and upsampling the corresponding filter kernels.
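The effect of the atrous rate is easiest to see in one dimension. The sketch below (illustrative code, not from the paper) shows how spacing kernel taps `rate` samples apart enlarges the filter's field of view without adding parameters or downsampling the signal:

```python
import numpy as np

def atrous_conv1d(signal, kernel, rate):
    """1-D atrous (dilated) convolution: kernel taps are spaced `rate`
    samples apart, enlarging the field of view without extra parameters.
    Illustrative sketch only."""
    k = len(kernel)
    span = (k - 1) * rate + 1              # effective kernel size
    out_len = len(signal) - span + 1
    out = np.zeros(out_len)
    for i in range(out_len):
        for j in range(k):
            out[i] += signal[i + j * rate] * kernel[j]
    return out

x = np.arange(8, dtype=float)              # [0, 1, ..., 7]
w = np.array([1.0, 1.0, 1.0])
print(atrous_conv1d(x, w, rate=1))         # standard convolution, span 3
print(atrous_conv1d(x, w, rate=2))         # dilated, span 5: [6. 9. 12. 15.]
```

With `rate=1` this reduces to an ordinary convolution; larger rates let the same three-tap kernel aggregate context over a wider span, which is exactly how the paper recovers long-range context after removing downsampling.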
Multi-Scale Context Capture
To address the challenge of segmenting objects at multiple scales, the authors propose two main techniques:
- Cascaded Atrous Convolution Modules: These modules apply atrous convolution in a cascade to capture multi-scale context. The paper experiments with replicating several copies of the last block in the ResNet architecture (e.g., ResNet block4) and arranging them in a cascade with varying atrous rates. This setup effectively increases the field-of-view of filters and allows the network to incorporate longer-range context without losing spatial resolution.
- Atrous Spatial Pyramid Pooling (ASPP): ASPP enhances atrous convolution by capturing multi-scale information through parallel atrous convolutions with different rates. The authors further augment the ASPP module by incorporating image-level features through global average pooling, which encodes global context and improves segmentation performance.
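The ASPP design described above can be sketched in PyTorch. This is a simplified illustration, not the authors' code: the rates (6, 12, 18, as in the paper at output_stride = 16) and a 1x1 branch plus image-level pooling follow the description, but batch normalization and activations are omitted for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Sketch of Atrous Spatial Pyramid Pooling: one 1x1 conv, three 3x3
    atrous convs with different rates, and global-average-pooled
    image-level features, all concatenated and projected."""
    def __init__(self, in_ch, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1)] +
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates]
        )
        # Image-level features: global average pooling + 1x1 conv
        self.image_pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(in_ch, out_ch, 1)
        )
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1)

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [b(x) for b in self.branches]
        # Upsample pooled features back to the feature-map resolution
        pooled = F.interpolate(self.image_pool(x), size=(h, w),
                               mode="bilinear", align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))

x = torch.randn(1, 2048, 33, 33)   # e.g. ResNet features at output_stride 16
print(ASPP(2048)(x).shape)         # torch.Size([1, 256, 33, 33])
```

Because each 3x3 branch uses `padding=rate` with `dilation=rate`, all branches preserve spatial resolution, so their outputs can be concatenated directly.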
Implementation and Results
The DeepLabv3 model is evaluated on the PASCAL VOC 2012 semantic segmentation benchmark. Several training and inference strategies are employed to optimize performance, including varying output_stride values, multi-scale inputs during inference, and left-right image flipping. Additionally, a simple but effective bootstrapping method is introduced to handle rare and finely annotated objects, such as bicycles, by duplicating images containing these classes in the training set.
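The multi-scale and left-right-flip inference strategy can be sketched as follows. This is a simplified version under stated assumptions: `model` is any callable mapping an image batch to per-pixel class logits, and the scale set matches the one reported in the paper:

```python
import torch
import torch.nn.functional as F

def multiscale_flip_inference(model, image,
                              scales=(0.5, 0.75, 1.0, 1.25, 1.5, 1.75)):
    """Average segmentation logits over multiple input scales and
    left-right flips (sketch of the paper's inference strategy)."""
    _, _, h, w = image.shape
    total = 0
    for s in scales:
        scaled = F.interpolate(image, scale_factor=s, mode="bilinear",
                               align_corners=False)
        for flip in (False, True):
            inp = torch.flip(scaled, dims=[3]) if flip else scaled
            logits = model(inp)
            if flip:                        # un-flip so pixels line up
                logits = torch.flip(logits, dims=[3])
            # Resize logits to the original resolution before averaging
            total = total + F.interpolate(logits, size=(h, w),
                                          mode="bilinear", align_corners=False)
    return total / (2 * len(scales))
```

Averaging logits (rather than hard predictions) across scales and flips is what lets the ensemble smooth out scale-specific errors before the final argmax.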
Experiments show that the proposed methods significantly improve on previous DeepLab versions. In particular, DeepLabv3 achieves 85.7% mIoU on the PASCAL VOC 2012 test set without DenseCRF post-processing, highlighting its efficacy in dense prediction tasks.
Practical and Theoretical Implications
The practical implications of this research are significant for applications requiring high-accuracy semantic segmentation. By enabling more precise segmentation of objects at varying scales and improving feature resolution, the techniques proposed can enhance the performance of computer vision systems in areas such as autonomous driving, medical image analysis, and video surveillance.
Theoretically, this research advances the understanding of how atrous convolution can be extended and optimized for dense prediction tasks. The use of multi-grid methods and the integration of image-level features in ASPP offer new directions for further improving the capture of multi-scale context in convolutional networks.
Future Developments in AI
Future developments may focus on refining these techniques for even better performance and efficiency. Potential areas of improvement include optimizing computational requirements for large-scale and real-time applications. Additionally, integrating these methods with other advanced techniques, such as deformable convolutions or attention mechanisms, could further augment their capabilities.
In conclusion, the paper by Chen et al. provides a rigorous and detailed exploration of alternative architectures and methods for improving semantic segmentation using atrous convolution. The proposed DeepLabv3 model demonstrates significant advancements, proving its value both in theory and in practical applications. Future research will likely build upon these findings to push the boundaries of what can be achieved in semantic image segmentation.