
Deep Layer Aggregation (1707.06484v3)

Published 20 Jul 2017 in cs.CV and cs.LG

Abstract: Visual recognition requires rich representations that span levels from low to high, scales from small to large, and resolutions from fine to coarse. Even with the depth of features in a convolutional network, a layer in isolation is not enough: compounding and aggregating these representations improves inference of what and where. Architectural efforts are exploring many dimensions for network backbones, designing deeper or wider architectures, but how to best aggregate layers and blocks across a network deserves further attention. Although skip connections have been incorporated to combine layers, these connections have been "shallow" themselves, and only fuse by simple, one-step operations. We augment standard architectures with deeper aggregation to better fuse information across layers. Our deep layer aggregation structures iteratively and hierarchically merge the feature hierarchy to make networks with better accuracy and fewer parameters. Experiments across architectures and tasks show that deep layer aggregation improves recognition and resolution compared to existing branching and merging schemes. The code is at https://github.com/ucbdrive/dla.

Citations (1,260)

Summary

  • The paper introduces iterative and hierarchical deep aggregation (IDA and HDA) techniques to progressively fuse multi-scale features in CNNs.
  • It demonstrates that DLA networks achieve improved accuracy on ImageNet, with models like DLA-34 reaching a top-1 error rate of 25.1% using fewer parameters.
  • DLA models prove versatile by excelling in tasks such as fine-grained recognition, semantic segmentation, and boundary detection.

Deep Layer Aggregation

Overview

"Deep Layer Aggregation" explores an advanced architectural approach for neural networks aimed at improving visual recognition systems. The authors, Fisher Yu, Dequan Wang, Evan Shelhamer, and Trevor Darrell from UC Berkeley, propose novel structures for aggregating layers in convolutional networks to enhance feature representation and thereby improve accuracy and efficiency in various visual tasks.

Key Insights and Contributions

The essence of the paper lies in proposing two specific structures for layer aggregation:

  1. Iterative Deep Aggregation (IDA): This structure is designed to progressively fuse features from different stages of a network, with each stage focusing on refining and aggregating features from previous ones.
  2. Hierarchical Deep Aggregation (HDA): This structure combines features hierarchically by establishing tree-like connections across various layers, ensuring that shallower and deeper layers are merged in a manner that spans the entire feature hierarchy.
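The two patterns above can be sketched as plain functions over a list of stage features, shallowest first. Here `node` stands in for the paper's learned aggregation node; this is a simplified sketch (in particular, the paper's HDA also feeds intermediate aggregations back into the backbone, which is omitted here):

```python
def iterative_deep_aggregation(features, node):
    """IDA: fold features from shallowest to deepest,
    refining the aggregate one stage at a time."""
    agg = features[0]
    for f in features[1:]:
        agg = node([agg, f])  # merge the next stage into the running aggregate
    return agg

def hierarchical_deep_aggregation(features, node):
    """HDA (simplified): merge features in a balanced tree so that
    shallow and deep layers are combined across the whole hierarchy."""
    if len(features) == 1:
        return features[0]
    mid = len(features) // 2
    left = hierarchical_deep_aggregation(features[:mid], node)
    right = hierarchical_deep_aggregation(features[mid:], node)
    return node([left, right])

# Toy usage: merge scalar "features" with summation as the node.
feats = [1, 2, 3, 4]
print(iterative_deep_aggregation(feats, sum))    # 10
print(hierarchical_deep_aggregation(feats, sum)) # 10
```

The structural difference matters even though both toy calls give the same sum: IDA builds a left-leaning chain (deepest features pass through the fewest nodes), while HDA builds a tree whose depth grows only logarithmically with the number of stages.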

Network Architecture

The paper explores the scalability and flexibility of the proposed architectures by integrating them with existing network backbones like ResNet and ResNeXt. The deep layer aggregation (DLA) structures are designed to be compatible with different modules and stages, without requiring alteration to the building blocks of contemporary networks.
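This drop-in compatibility rests on the aggregation node itself being a small, generic block: concatenate the incoming features along the channel axis and apply a 1x1-convolution-style projection followed by a nonlinearity. The NumPy sketch below illustrates that pattern under simplifying assumptions (no batch norm, and no residual connection, both of which the paper's nodes can include; `aggregation_node` and its parameters are illustrative names, not the released API):

```python
import numpy as np

def aggregation_node(features, weight, bias):
    """Sketch of a DLA-style aggregation node: channel-concatenate the
    inputs, apply a 1x1-conv-like linear projection, then ReLU."""
    x = np.concatenate(features, axis=0)           # (c1 + c2, H, W)
    h, w = x.shape[1], x.shape[2]
    flat = x.reshape(x.shape[0], -1)               # (c_in, H*W)
    y = weight @ flat + bias[:, None]              # 1x1 projection per pixel
    return np.maximum(y, 0.0).reshape(-1, h, w)    # ReLU, back to (c_out, H, W)

# Toy usage: merge two 2-channel 4x4 feature maps into 3 output channels.
rng = np.random.default_rng(0)
f1 = rng.normal(size=(2, 4, 4))
f2 = rng.normal(size=(2, 4, 4))
W = rng.normal(size=(3, 4))                        # c_out=3, c_in=2+2
b = np.zeros(3)
out = aggregation_node([f1, f2], W, b)
print(out.shape)  # (3, 4, 4)
```

Because the node only consumes and emits feature maps, it composes with residual blocks, ResNeXt groups, or any other backbone module unchanged.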

In the context of image classification, the paper tests the proposed DLA networks on the standard ImageNet dataset, demonstrating significant improvements in accuracy and parameter efficiency over standard ResNet and ResNeXt architectures. Notably, DLA structures achieve comparable or better results while often using fewer parameters.

Numerical Results

  • ImageNet Classification:
    • DLA-34 achieves a top-1 error rate of 25.1%, outperforming the baseline ResNet-34 which has a higher error rate with more parameters.
    • Similarly, DLA-X-102 exceeds the performance of ResNeXt-101 with fewer parameters, achieving a top-1 error rate very close to 21%.
  • Compact Models:
    • DLA-X-60-C achieves a top-1 error rate of 32.0% with only 1.3M parameters and 0.59B fused multiply-adds (FMAs), surpassing compact models like SqueezeNet.

Practical Applications

The paper extends the evaluation to fine-grained visual recognition tasks, semantic segmentation, and boundary detection, verifying the general applicability of DLA.

  • Fine-grained Recognition: DLA models are applied to datasets such as Birds, Cars, Planes, and Food, achieving state-of-the-art performance on several benchmarks without specific fine-tuning for these tasks.
  • Semantic Segmentation: On the Cityscapes and CamVid datasets, DLA models outperform or match the leading methods in terms of mean intersection-over-union (mIoU) scores. For instance, DLA-169 achieves a high mIoU of 75.9% on Cityscapes.
  • Boundary Detection: DLA networks achieve leading results on BSDS and PASCAL Boundaries, particularly excelling in metrics like ODS and OIS.
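For the dense prediction tasks above, the paper applies IDA across resolutions: starting from the coarsest stage output, each step upsamples and fuses with the next finer stage until full resolution is recovered. The sketch below is a deliberately minimal stand-in, using nearest-neighbor upsampling and summation where the paper uses learned (bilinear-initialized) upsampling and learned aggregation nodes:

```python
import numpy as np

def upsample_nn(x, factor):
    """Nearest-neighbor upsampling of a (C, H, W) feature map."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def ida_upsample(stage_outputs):
    """IDA across resolutions (sketch): from the coarsest stage,
    repeatedly upsample 2x and merge with the next finer stage.
    Merging here is a plain sum; the paper uses learned nodes."""
    agg = stage_outputs[-1]                  # coarsest-resolution features
    for finer in reversed(stage_outputs[:-1]):
        agg = upsample_nn(agg, 2) + finer    # fuse with the next finer stage
    return agg

# Toy usage: three 1-channel stages at 16x16, 8x8, and 4x4.
stages = [np.ones((1, 16, 16)), np.ones((1, 8, 8)), np.ones((1, 4, 4))]
out = ida_upsample(stages)
print(out.shape)  # (1, 16, 16)
```

The output sits at the finest stage's resolution while incorporating every coarser stage, which is what lets a single DLA backbone serve classification and pixel-level prediction alike.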

Implications and Future Developments

The theoretical and practical implications of this research extend to both the design and application of convolutional neural networks (CNNs). The paper's in-depth study of layer aggregation underscores its critical role in learning and inference. The proposed architectures demonstrate how strategic aggregation can lead to significant advancements in feature representation, thereby enhancing both classification and dense prediction tasks.

For future developments, the deep layer aggregation framework can be further explored and potentially integrated with emerging network architectures. Additionally, the implications of this work can inspire more research into optimizing CNN connectivity, particularly in terms of balancing computational efficiency and model accuracy.

Conclusion

"Deep Layer Aggregation" offers a robust framework for enhancing visual recognition systems by effectively merging and refining features at different stages and resolutions within neural networks. The proposed IDA and HDA structures are versatile, compatible with existing architectures, and demonstrate significant improvements across various visual tasks. This work stands as a significant contribution to the ongoing development and optimization of convolutional network architectures.