Searching for Efficient Multi-Scale Architectures for Dense Image Prediction (1809.04184v1)

Published 11 Sep 2018 in cs.CV, cs.LG, and stat.ML

Abstract: The design of neural network architectures is an important component for achieving state-of-the-art performance with machine learning systems across a broad array of tasks. Much work has endeavored to design and build architectures automatically through clever construction of a search space paired with simple learning algorithms. Recent progress has demonstrated that such meta-learning methods may exceed scalable human-invented architectures on image classification tasks. An open question is the degree to which such methods may generalize to new domains. In this work we explore the construction of meta-learning techniques for dense image prediction focused on the tasks of scene parsing, person-part segmentation, and semantic image segmentation. Constructing viable search spaces in this domain is challenging because of the multi-scale representation of visual information and the necessity to operate on high resolution imagery. Based on a survey of techniques in dense image prediction, we construct a recursive search space and demonstrate that even with efficient random search, we can identify architectures that outperform human-invented architectures and achieve state-of-the-art performance on three dense prediction tasks including 82.7% on Cityscapes (street scene parsing), 71.3% on PASCAL-Person-Part (person-part segmentation), and 87.9% on PASCAL VOC 2012 (semantic image segmentation). Additionally, the resulting architecture is more computationally efficient, requiring half the parameters and half the computational cost as previous state of the art systems.

Summary

The paper presents a meta-learning framework that defines a recursive search space for Dense Prediction Cells (DPCs) and pairs it with a fast proxy task to identify efficient architectures.
It demonstrates that the discovered architectures outperform traditional human-designed models on tasks like scene parsing, person-part segmentation, and semantic image segmentation.
Experimental results show state-of-the-art performance, with 82.7% mIOU on Cityscapes and 71.34% mIOU on PASCAL-Person-Part, while using fewer parameters and FLOPS.

Efficient Multi-Scale Architectures for Dense Image Prediction

The paper "Searching for Efficient Multi-Scale Architectures for Dense Image Prediction" (1809.04184) introduces a meta-learning approach to design neural network architectures for dense image prediction tasks. The method focuses on constructing a search space of network motifs and operators suitable for multi-scale representation of visual information, and it addresses the computational challenges of operating on high-resolution imagery. Through efficient random search within this space, the paper identifies architectures that outperform human-designed counterparts on scene parsing, person-part segmentation, and semantic image segmentation tasks.

Key Methodological Components

The approach hinges on two critical components: the design of the search space and the design of a proxy task. The search space is constructed to be both expressive, encompassing a wide range of architectures, and tractable, allowing for efficient identification of high-performing models. A proxy task is designed to be rapidly computable while maintaining predictive power regarding the performance of architectures in large-scale training settings.

Architecture Search Space

The paper introduces a recursive search space over a module that encodes multi-scale context information, termed a Dense Prediction Cell (DPC). A DPC is represented as a directed acyclic graph (DAG) consisting of $\mathcal{B}$ branches, each mapping an input tensor to an output tensor. A branch $b_i$ is defined as a 3-tuple $(X_i, OP_i, Y_i)$, where $X_i \in \mathcal{X}_i$ specifies the input tensor, $OP_i \in \mathcal{OP}$ specifies the operation applied to it, and $Y_i$ denotes the output tensor. The search space is recursive because the candidate inputs $\mathcal{X}_i$ of branch $b_i$ comprise the backbone feature map together with the outputs of all earlier branches, $Y_1, \dots, Y_{i-1}$. The final DPC output, $Y$, is the concatenation of all branch outputs: $Y = \mathrm{concat}(Y_1, Y_2, \dots, Y_{\mathcal{B}})$. The operator space, $\mathcal{OP}$, includes:

Convolution with a $1\times1$ kernel.
$3\times3$ atrous separable convolution with rate $r_h \times r_w$, where $r_h, r_w \in \{1, 3, 6, 9, \dots, 21\}$.
Average spatial pyramid pooling with grid size $g_h \times g_w$, where $g_h, g_w \in \{1, 2, 4, 8\}$.
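
To make the size of this search space concrete, the following is a minimal Python sketch that enumerates the operator space listed above and samples one cell at random. It rests on several assumptions that are not stated in this summary: the rate set is expanded as multiples of 3 up to 21, each branch may read either the backbone feature map (named `F` here) or the output of any earlier branch, and the names `OPERATORS`, `sample_dpc`, and `search_space_size` are purely illustrative.

```python
import math
import random

# Assumed expansion of the rate set {1, 3, 6, 9, ..., 21}.
ATROUS_RATES = [1, 3, 6, 9, 12, 15, 18, 21]
POOL_GRIDS = [1, 2, 4, 8]

# Operator space: 1 + 8*8 + 4*4 = 81 choices under the assumptions above.
OPERATORS = (
    [("conv1x1",)]
    + [("atrous_sep_conv3x3", rh, rw) for rh in ATROUS_RATES for rw in ATROUS_RATES]
    + [("avg_spp", gh, gw) for gh in POOL_GRIDS for gw in POOL_GRIDS]
)


def sample_dpc(num_branches, rng):
    """Sample one cell as a list of (input_name, operator) branches."""
    branches = []
    for i in range(num_branches):
        # Assumption: branch i may read the backbone features "F"
        # or the output of any earlier branch ("Y1", "Y2", ...).
        candidates = ["F"] + [f"Y{j + 1}" for j in range(i)]
        branches.append((rng.choice(candidates), rng.choice(OPERATORS)))
    return branches


def search_space_size(num_branches):
    # The i-th branch (1-indexed) has i input choices and len(OPERATORS) operators.
    return math.factorial(num_branches) * len(OPERATORS) ** num_branches


cell = sample_dpc(5, random.Random(0))
```

Under these assumptions, a cell with five branches (an illustrative value) admits $5! \times 81^5 \approx 4.2 \times 10^{11}$ configurations, which is why exhaustive enumeration is impractical and a sampling-based search is used instead.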

Figure 1: Schematic diagram of architecture search for dense image prediction. Example tasks explored in this paper include scene parsing, semantic image segmentation and person-part segmentation.

Proxy Task Design

The proxy task must be fast to compute yet predictive of performance in the large-scale setting. The authors build it from a smaller network backbone and cache the feature maps that this backbone produces on the training set, so each candidate cell trains on precomputed features. Caching is equivalent to not back-propagating gradients into the backbone in the real setting. They also employ early stopping, training each candidate architecture for only 30K iterations.
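
The feature-caching idea can be illustrated with a toy Python sketch. Everything here is hypothetical: `CountingBackbone` stands in for a frozen feature extractor, the "head" is a single scalar weight rather than a DPC, and the step budget is far smaller than the 30K iterations mentioned above. The sketch only shows that the backbone runs once per image, however many candidates are trained.

```python
class CountingBackbone:
    """Hypothetical stand-in for an expensive, frozen feature extractor."""

    def __init__(self):
        self.calls = 0

    def __call__(self, image):
        self.calls += 1
        return [2.0 * pixel for pixel in image]  # placeholder "feature map"


def cache_features(backbone, images):
    # One backbone pass per image; reusable because the backbone is never updated.
    return [backbone(image) for image in images]


def train_head(weight, cached_features, targets, max_steps, lr=0.01):
    """Fit a one-parameter toy head on cached features, stopping at max_steps."""
    for step in range(max_steps):
        idx = step % len(cached_features)
        feature_sum = sum(cached_features[idx])
        error = weight * feature_sum - targets[idx]
        weight -= lr * 2.0 * error * feature_sum  # gradient of the squared error
    return weight


backbone = CountingBackbone()
images = [[0.5, 1.0], [1.0, 0.5]]
targets = [6.0, 6.0]
cache = cache_features(backbone, images)

# Three candidate heads share the same cache; the backbone is not run again.
trained = [train_head(w0, cache, targets, max_steps=200) for w0 in (0.0, 1.0, 5.0)]
```

Because no gradient reaches the backbone, its outputs never change, so the cost of the backbone forward pass is paid once and amortized over every candidate evaluated.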

Figure 2: Measuring the fidelity of proxy tasks for a dense prediction cell (DPC) in a reduced search space. A comparison of (a) small to large network backbones, and (b) proxy versus large-scale training with MobileNet-v2 backbone.

Experimental Results

The proposed method was evaluated on three dense prediction tasks: scene parsing (Cityscapes), person-part segmentation (PASCAL-Person-Part), and semantic image segmentation (PASCAL VOC 2012). The training protocol pre-trains the network backbone on the COCO dataset and uses a polynomial learning rate schedule, large crop sizes, fine-tuned batch normalization parameters, and small-batch training. The architecture search was conducted on Cityscapes with random search, exploring 28K DPC architectures on 370 GPUs over one week.
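
As a rough illustration of the search loop, here is a minimal Python sketch of pure random search that keeps the best `top_k` candidates by proxy score. It is a simplification under stated assumptions: the paper's "efficient random search" may differ from uniform sampling, and `sample_fn`, `score_fn`, and the toy objective below are hypothetical placeholders for sampling a DPC and training it on the proxy task.

```python
import heapq
import random


def random_search(sample_fn, score_fn, num_trials, top_k, seed=0):
    """Sample num_trials candidates and return the top_k as (score, trial, arch)."""
    rng = random.Random(seed)
    best = []  # min-heap, so best[0] is the weakest candidate currently kept
    for trial in range(num_trials):
        arch = sample_fn(rng)
        score = score_fn(arch)
        if len(best) < top_k:
            heapq.heappush(best, (score, trial, arch))
        elif score > best[0][0]:
            heapq.heapreplace(best, (score, trial, arch))
    return sorted(best, reverse=True)


# Toy stand-ins: an "architecture" is an integer, and the score peaks at 42.
results = random_search(
    sample_fn=lambda rng: rng.randint(0, 100),
    score_fn=lambda arch: -abs(arch - 42),
    num_trials=200,
    top_k=5,
)
```

Each trial is independent, so a loop of this kind parallelizes trivially across workers, which fits the many-GPU setup described above.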

Figure 3: Measuring the fidelity of the proxy tasks for a dense prediction cell (DPC) in the full search space. (a) Score distribution on the proxy task. The search algorithm is able to explore a diversity of architectures. (b) Correlation of the found top-50 architectures between the proxy dataset and large-scale training with MobileNet-v2 backbone.
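
One common way to quantify how well a proxy task preserves the ordering of architectures is a rank correlation between proxy scores and large-scale scores. The Python sketch below computes Spearman's rank correlation from scratch. Whether the paper reports exactly this statistic is an assumption, and the function names and sample scores are illustrative only.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    # Spearman's rho is the Pearson correlation of the rank vectors.
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical scores for five architectures on the proxy and the full task.
proxy_scores = [0.61, 0.63, 0.66, 0.68, 0.70]
full_scores = [0.74, 0.73, 0.77, 0.76, 0.79]
rho = spearman(proxy_scores, full_scores)
```

A value near 1 would indicate that ranking architectures by proxy score closely matches their ranking after full training, while a value near 0 would mean the proxy carries little information about final performance.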

Performance on Cityscapes

The best learned DPC, trained with both MobileNet-v2 and modified Xception backbones, improved results on the Cityscapes validation set. Notably, with the Xception backbone the best DPC required half the parameters and fewer FLOPS than the previous state-of-the-art network. On the Cityscapes test set, the DPC achieved 82.7% mIOU, surpassing human-invented architectures.
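
For readers unfamiliar with the metric, mIOU (mean intersection-over-union) averages, over classes, the ratio of correctly labeled pixels to the union of predicted and ground-truth pixels for that class. The following Python sketch computes it from a confusion matrix. The function name, the convention of skipping classes that never occur, and the toy matrix are assumptions for illustration, not the benchmark's official evaluation code.

```python
def mean_iou(confusion):
    """confusion[t][p] counts pixels of true class t predicted as class p."""
    num_classes = len(confusion)
    ious = []
    for c in range(num_classes):
        true_pos = confusion[c][c]
        false_neg = sum(confusion[c]) - true_pos
        false_pos = sum(confusion[r][c] for r in range(num_classes)) - true_pos
        union = true_pos + false_pos + false_neg
        if union > 0:  # skip classes absent from both prediction and ground truth
            ious.append(true_pos / union)
    return sum(ious) / len(ious)


# Toy two-class example: per-class IOUs are 3/5 and 5/7.
toy_confusion = [[3, 1],
                 [1, 5]]
score = mean_iou(toy_confusion)
```

Because each class contributes equally to the average, mIOU penalizes poor performance on rare classes more than plain pixel accuracy does.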

Performance on PASCAL-Person-Part

The DPC architecture achieved state-of-the-art performance of 71.34% on the PASCAL-Person-Part dataset, a 3.74% improvement over the previous state of the art, without requiring extra MPII training data.

Figure 4: Visualization of predictions on PASCAL-Person-Part validation set.

Performance on PASCAL VOC 2012

On the PASCAL VOC 2012 benchmark, the DPC architecture reached 87.9%, outperforming previous models by more than 1.7% and achieving results comparable to concurrent work.

Figure 5: Visualization of predictions on PASCAL VOC 2012 validation set.

Implications and Future Directions

This work demonstrates the potential of architecture search for dense image prediction. Both the recursive search space and the fast proxy task are critical to these results. The learned architecture outperforms human-designed architectures across multiple dense image prediction tasks while being more computationally efficient. Applying more sophisticated search algorithms and expanding the search space may lead to further improvements. The authors suggest that these ideas could extend to other domains such as depth prediction and object detection.

Conclusion

The paper advances the application of meta-learning to dense image prediction. By designing a suitable search space and proxy task, the authors show that architecture search can discover efficient, high-performing networks for complex vision tasks. The results highlight the potential of automated architecture design to achieve state-of-the-art performance at reduced computational cost.