- The paper presents a meta-learning framework that defines a recursive search space, known as Dense Prediction Cells (DPC), along with a fast proxy task to identify efficient architectures.
- It demonstrates that the discovered architectures outperform traditional human-designed models on tasks like scene parsing, person-part segmentation, and semantic image segmentation.
- Experimental results show state-of-the-art performance, with 82.7% mIOU on Cityscapes and 71.34% mIOU on PASCAL-Person-Part, while using fewer parameters and fewer FLOPS than previous state-of-the-art networks.
Efficient Multi-Scale Architectures for Dense Image Prediction
The paper "Searching for Efficient Multi-Scale Architectures for Dense Image Prediction" (1809.04184) introduces a meta-learning approach to design neural network architectures for dense image prediction tasks. The method focuses on constructing a search space of network motifs and operators suitable for multi-scale representation of visual information, and it addresses the computational challenges of operating on high-resolution imagery. Through efficient random search within this space, the paper identifies architectures that outperform human-designed counterparts on scene parsing, person-part segmentation, and semantic image segmentation tasks.
Key Methodological Components
The approach hinges on two critical components: the design of the search space and the design of a proxy task. The search space is constructed to be both expressive, encompassing a wide range of architectures, and tractable, allowing for efficient identification of high-performing models. A proxy task is designed to be rapidly computable while maintaining predictive power regarding the performance of architectures in large-scale training settings.
Architecture Search Space
The paper introduces a recursive search space for encoding multi-scale context information, termed a Dense Prediction Cell (DPC). The DPC is represented as a directed acyclic graph (DAG) consisting of $B$ branches, each mapping an input tensor to an output tensor. A branch $b_i$ in a DPC is defined as a 3-tuple, $(X_i, OP_i, Y_i)$, where $X_i \in \mathcal{X}_i$ specifies the input tensor, $OP_i \in \mathcal{OP}$ specifies the operation, and $Y_i$ denotes the output tensor. The final DPC output, $Y$, is the concatenation of all branch outputs: $Y = \mathrm{concat}(Y_1, Y_2, \ldots, Y_B)$. The operator space, $\mathcal{OP}$, includes:
- Convolution with a $1 \times 1$ kernel.
- $3 \times 3$ atrous separable convolution with rate $r_h \times r_w$, where $r_h, r_w \in \{1, 3, 6, 9, \ldots, 21\}$.
- Average spatial pyramid pooling with grid size $g_h \times g_w$, where $g_h, g_w \in \{1, 2, 4, 8\}$.
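To make the branch encoding concrete, the following minimal sketch samples one random DPC as a list of (input, operation) pairs. It rests on several assumptions that are not stated in this summary: the elided atrous rates are taken to continue in steps of 3, each branch is allowed to read either the backbone features or the output of any earlier branch, and the function names and the choice of five branches are purely illustrative.

```python
import random
from itertools import product

# Assumption: the elided rates in {1, 3, 6, 9, ..., 21} continue in steps of 3.
ATROUS_RATES = [1, 3, 6, 9, 12, 15, 18, 21]
POOL_GRIDS = [1, 2, 4, 8]


def build_operator_space():
    """Enumerate the operator space as (name, param_h, param_w) tuples."""
    ops = [("conv1x1", None, None)]
    ops += [("atrous_sep_conv3x3", rh, rw)
            for rh, rw in product(ATROUS_RATES, repeat=2)]
    ops += [("avg_spp", gh, gw) for gh, gw in product(POOL_GRIDS, repeat=2)]
    return ops


def sample_dpc(num_branches, rng):
    """Sample one DPC as a list of (input_index, op) pairs.

    input_index 0 stands for the backbone features; index j >= 1 stands
    for the output Y_j of an earlier branch (an assumption of this sketch).
    """
    ops = build_operator_space()
    cell = []
    for i in range(1, num_branches + 1):
        input_index = rng.randrange(i)  # one of 0 .. i-1, so the graph stays acyclic
        cell.append((input_index, rng.choice(ops)))
    return cell


rng = random.Random(0)
cell = sample_dpc(num_branches=5, rng=rng)  # 5 branches is an illustrative choice
for i, (src, op) in enumerate(cell, start=1):
    source_name = "F" if src == 0 else "Y" + str(src)
    print("branch", i, "input:", source_name, "op:", op)
```

Because every branch can only read tensors that already exist, any sampled cell is a valid DAG by construction, and the final output would be the concatenation of all branch outputs.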
Figure 1: Schematic diagram of architecture search for dense image prediction. Example tasks explored in this paper include scene parsing, semantic image segmentation and person-part segmentation.
Proxy Task Design
The proxy task is designed to be fast to compute and predictive of performance in a large-scale setting. The authors build it by using a smaller network backbone and caching the feature maps that this backbone produces on the training set, so that only the DPC is trained on top of the cached features. This is equivalent to not back-propagating gradients into the network backbone in the real setting. They also employ early stopping, training each candidate architecture for only 30K iterations.
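The caching idea can be illustrated with a toy NumPy sketch. Everything here is a stand-in chosen for illustration: the "backbone" is a fixed random projection rather than a real network, the data are random, the trainable part is a linear head rather than a DPC, and the step count is arbitrary. The point is only that the backbone runs once and never receives gradients.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen backbone: a fixed random projection followed by ReLU.
W_backbone = rng.normal(size=(16, 8)) / 4.0


def backbone(x):
    return np.maximum(x @ W_backbone, 0.0)


# Toy training set (random numbers, not image data).
inputs = rng.normal(size=(256, 16))
targets = rng.normal(size=(256, 1))

# Step 1: run the backbone once and cache its feature maps.
cached_features = backbone(inputs)


def train_head(features, labels, num_steps, lr=0.05):
    """Fit a linear head on cached features with plain gradient descent.

    Only the head weights are updated; the backbone is never touched.
    """
    w = np.zeros((features.shape[1], 1))
    for _ in range(num_steps):
        residual = features @ w - labels
        w -= lr * (features.T @ residual) / len(features)
    return w, float(np.mean((features @ w - labels) ** 2))


# Step 2: train a candidate briefly on the cache (a short, fixed budget
# plays the role of early stopping in this sketch).
initial_loss = float(np.mean(targets ** 2))
_, final_loss = train_head(cached_features, targets, num_steps=200)
print("loss before:", initial_loss, "loss after:", final_loss)
```

Each additional candidate would reuse `cached_features`, so the cost of the backbone forward pass is paid once rather than once per candidate.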
Figure 2: Measuring the fidelity of proxy tasks for a dense prediction cell (DPC) in a reduced search space. A comparison of (a) small to large network backbones, and (b) proxy versus large-scale training with MobileNet-v2 backbone.
Experimental Results
The proposed method was evaluated on three dense prediction tasks: scene parsing (Cityscapes), person-part segmentation (PASCAL-Person-Part), and semantic image segmentation (PASCAL VOC 2012). The training protocol involved pre-training the network backbone on the COCO dataset, employing a polynomial learning-rate schedule, using large crop sizes, fine-tuning batch normalization parameters, and training with small batches. The architecture search was conducted on Cityscapes with random search, exploring 28K DPC architectures on 370 GPUs over one week.
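For readers unfamiliar with the polynomial learning-rate schedule mentioned in the training protocol, a minimal sketch follows. The exponent of 0.9 and the base learning rate of 0.01 are assumptions borrowed from common segmentation practice, not values taken from this summary, and the function name is hypothetical.

```python
def poly_learning_rate(step, total_steps, base_lr, power=0.9):
    """Polynomial decay: base_lr * (1 - step / total_steps) ** power.

    power=0.9 is an assumed default, not a value confirmed by the source.
    """
    fraction_left = 1.0 - min(step, total_steps) / total_steps
    return base_lr * fraction_left ** power


# Illustrative usage with a hypothetical base learning rate.
for step in (0, 10_000, 20_000, 30_000):
    print(step, poly_learning_rate(step, total_steps=30_000, base_lr=0.01))
```

The schedule starts at the base learning rate and decays smoothly to zero at the final step.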
Figure 3: Measuring the fidelity of the proxy tasks for a dense prediction cell (DPC) in the full search space. (a) Score distribution on the proxy task. The search algorithm is able to explore a diversity of architectures. (b) Correlation of the found top-50 architectures between the proxy dataset and large-scale training with MobileNet-v2 backbone.
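One way to quantify the kind of proxy-versus-large-scale agreement described in panel (b) is a rank correlation between the two sets of scores. The sketch below computes Spearman's rank correlation with NumPy; using Spearman's coefficient is an assumption made here for illustration, the scores are made-up numbers, and the simple ranking step does not handle ties.

```python
import numpy as np


def spearman_rank_correlation(a, b):
    """Spearman's rho for two score lists, assuming no tied values."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # argsort of argsort converts scores into 0-based ranks.
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(rank_a, rank_b)[0, 1])


# Hypothetical scores for five candidate architectures.
proxy_scores = [0.61, 0.64, 0.58, 0.66, 0.63]
large_scale_scores = [0.72, 0.74, 0.70, 0.75, 0.745]
print(spearman_rank_correlation(proxy_scores, large_scale_scores))
```

A value close to 1 would indicate that ranking candidates by the cheap proxy score is a reasonable substitute for ranking them by the expensive large-scale score.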
The best learned DPC, when trained with MobileNet-v2 and modified Xception backbones, showed improvements on the validation set. Notably, with the Xception backbone, the best DPC required half the parameters and a smaller fraction of the FLOPS compared to previous state-of-the-art networks. On the Cityscapes test set, the DPC achieved 82.7% mIOU, surpassing human-invented architectures.
The DPC architecture achieved a state-of-the-art 71.34% on the PASCAL-Person-Part dataset, a 3.74% improvement over existing models, without requiring extra MPII training data.
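Since the headline numbers are reported as mIOU (mean intersection over union), a small sketch of how that metric is typically computed from a confusion matrix may help. This is a generic illustration rather than the benchmarks' official evaluation code: the function name is hypothetical, labels are assumed to be integers in the range 0 to `num_classes - 1`, and there is no handling of "ignore" labels.

```python
import numpy as np


def mean_iou(pred, label, num_classes):
    """Mean IoU over classes that appear in the prediction or the ground truth."""
    pred = np.asarray(pred, dtype=np.int64).ravel()
    label = np.asarray(label, dtype=np.int64).ravel()
    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    confusion = np.bincount(
        label * num_classes + pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)
    intersection = np.diag(confusion)
    union = confusion.sum(axis=0) + confusion.sum(axis=1) - intersection
    present = union > 0  # skip classes absent from both pred and label
    return float(np.mean(intersection[present] / union[present]))


# Tiny example with two classes and four pixels.
print(mean_iou(pred=[0, 0, 1, 1], label=[0, 1, 1, 1], num_classes=2))
```

In the tiny example, class 0 has an IoU of 1/2 and class 1 has an IoU of 2/3, so the mean is 7/12.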
Figure 4: Visualization of predictions on PASCAL-Person-Part validation set.
On the PASCAL VOC 2012 benchmark, the DPC architecture outperformed previous models by more than 1.7%, achieving results comparable to concurrent work.
Figure 5: Visualization of predictions on PASCAL VOC 2012 validation set.
Implications and Future Directions
This work demonstrates the potential of architecture search techniques for dense image prediction tasks. Constructing a recursive search space and a fast proxy task is critical to achieving these results. The learned architecture outperforms human-designed architectures across multiple dense image prediction tasks while being more computationally efficient. Applying more sophisticated search algorithms and expanding the search space may lead to further improvements. The authors suggest that these ideas could be extended to other domains, such as depth prediction and object detection.
Conclusion
The paper presents a significant advancement in applying meta-learning to dense image prediction. By designing a suitable search space and proxy task, the authors have shown that architecture search can discover efficient and high-performing networks for complex vision tasks. The results highlight the potential of automated architecture design in achieving state-of-the-art performance with reduced computational costs.