- The paper introduces a novel scale-permuted architecture in which feature map resolution can rise and fall across the network, improving both recognition and localization.
- The architecture is found via neural architecture search on the COCO dataset, yielding roughly a 3% AP improvement over ResNet-FPN baselines while using 10-20% fewer FLOPs.
- The resulting SpineNet also generalizes to classification, improving top-1 accuracy by 5% on iNaturalist and demonstrating its versatility.
Analysis of SpineNet: Scale-Permuted Backbone for Visual Tasks
The paper "SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization" introduces a novel architecture designed to overcome the limitations of traditional CNN backbones, especially in the context of simultaneous recognition and localization tasks. The proposed approach, SpineNet, strategically embraces a scale-permuted network pattern to better integrate multi-scale features, addressing inefficiencies inherent in scale-decreased models typically used in conjunction with encoder-decoder architectures.
Key Contributions and Results
- Scale-Permuted Architecture: The core contribution is the introduction of scale permutations among intermediate feature blocks, diverging from the conventional scale-decreased design in which feature maps are progressively down-sampled. By allowing feature map resolution to increase or decrease mid-network, SpineNet retains spatial information that is pivotal for tasks like object detection, where both recognition and accurate localization are essential (see the resampling-and-fusion sketch after this list).
- Neural Architecture Search (NAS): The SpineNet architecture is optimized via NAS on the COCO dataset, ensuring the backbone is tailored specifically to detection. The search determines both the permutation of feature blocks and the cross-scale connections between them, removing the need for a separate decoder network (a sketch of this search space also follows the list).
- Performance Gains: SpineNet demonstrates significant improvements over established baselines, raising average precision (AP) by approximately 3% over comparable ResNet-FPN models while using 10-20% fewer FLOPs. The largest variant, SpineNet-190, achieves a single-model COCO AP of 52.5% with Mask R-CNN and 52.1% with RetinaNet, without test-time augmentation.
- Generalization to Classification Tasks: Although the architecture was searched on a detection task, SpineNet transfers effectively to classification, achieving a 5% increase in top-1 accuracy on the challenging iNaturalist fine-grained dataset. This demonstrates that the scale-permuted backbone is a versatile feature extractor across visual recognition tasks, not a detection-only design.
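The mechanics behind the scale-permuted architecture can be sketched with a minimal PyTorch module: each block resamples its parent feature maps (which may sit at coarser or finer scales) to its own target resolution, fuses them, and applies a convolutional unit. Names like `Resample` and `ScalePermutedBlock` are our own, and the 1x1-projection-plus-nearest-resize resampling is a simplification of the paper's strided-conv/upsampling scheme, not its reference implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Resample(nn.Module):
    """Bring a parent feature map to a target block's width and resolution."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=1)  # match channels

    def forward(self, x, target_hw):
        x = self.proj(x)
        if x.shape[-2:] != target_hw:
            # Nearest-neighbour resize covers both up- and down-sampling here.
            x = F.interpolate(x, size=target_hw, mode="nearest")
        return x

class ScalePermutedBlock(nn.Module):
    """Fuse two parents (possibly at different scales) at this block's own
    level, then refine the result with a small conv unit."""
    def __init__(self, parent_chs, out_ch, level, input_size=256):
        super().__init__()
        s = input_size // 2 ** level          # stride 2**level resolution
        self.target_hw = (s, s)
        self.resample = nn.ModuleList(Resample(c, out_ch) for c in parent_chs)
        self.body = nn.Sequential(            # stand-in for a bottleneck block
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, parents):
        fused = sum(r(p, self.target_hw) for r, p in zip(self.resample, parents))
        return self.body(fused)

# Usage: a level-3 block consuming one finer (level-2) and one coarser
# (level-5) parent, for a 256px input.
p2 = torch.randn(1, 64, 64, 64)   # level 2: stride 4
p5 = torch.randn(1, 256, 8, 8)    # level 5: stride 32
block = ScalePermutedBlock(parent_chs=[64, 256], out_ch=128, level=3)
print(block([p2, p5]).shape)      # torch.Size([1, 128, 32, 32])
```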
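The NAS component can likewise be pictured as sampling from a simple encoding: each candidate consists of (a) a permutation of a fixed block budget and (b) two parent connections per block, each drawn from any earlier block. The sketch below samples candidates uniformly just to show the encoding; the actual search uses a reinforcement-learning controller with proxy detector training on COCO, and the helper names here are hypothetical:

```python
import random

BLOCK_LEVELS = [2] * 3 + [3] * 4 + [4] * 6 + [5] * 3  # ResNet-50 block budget
NUM_STEM_BLOCKS = 2  # fixed stem blocks that searched blocks may connect to

def sample_candidate(rng):
    """Sample one point in the search space: a scale permutation plus two
    parent connections per block."""
    levels = BLOCK_LEVELS[:]
    rng.shuffle(levels)                                   # (a) permutation
    connections = [
        (rng.randrange(NUM_STEM_BLOCKS + i), rng.randrange(NUM_STEM_BLOCKS + i))
        for i in range(len(levels))                       # (b) cross-scale wiring
    ]
    return levels, connections

rng = random.Random(0)
levels, connections = sample_candidate(rng)
print(levels)
print(connections[:4])
```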
Implications of the Research
The significance of SpineNet lies in its rethinking of feature extraction for convolutional architectures. By permitting flexible movement across scales rather than adhering to rigid, hierarchical down-sampling, SpineNet better preserves and combines feature maps at diverse resolutions, which is crucial for complex visual tasks where spatial detail matters differently for objects of different scales.
Speculation on Future Directions
The adaptable nature of the scale-permuted backbone suggests several avenues for future research. First, similar architectural principles could be applied beyond object detection and classification, for instance to semantic segmentation or video tasks that demand multi-scale temporal feature modeling. Second, combining the backbone with newer structural components, such as more efficient convolutional blocks, could further improve accuracy while reducing computational cost.
In conclusion, SpineNet represents a substantial evolution in convolutional network design, paving the way for more dynamically adaptable feature extraction. The work signals a shift in how future architectures may be conceived, particularly in balancing computational efficiency against the need for high-resolution spatial features. Its use of NAS for backbone design also exemplifies an increasingly data-driven approach to model optimization, a trend likely to continue shaping AI research in the years to come.