No More Strided Convolutions or Pooling: A New CNN Building Block for Low-Resolution Images and Small Objects

Published 7 Aug 2022 in cs.CV and cs.LG | (2208.03641v1)

Abstract: Convolutional neural networks (CNNs) have achieved resounding success in many computer vision tasks such as image classification and object detection. However, their performance degrades rapidly on tougher tasks where images are of low resolution or objects are small. In this paper, we point out that this stems from a defective yet common design in existing CNN architectures, namely the use of strided convolution and/or pooling layers, which results in a loss of fine-grained information and the learning of less effective feature representations. To this end, we propose a new CNN building block called SPD-Conv in place of each strided convolution layer and each pooling layer (thus eliminating them altogether). SPD-Conv comprises a space-to-depth (SPD) layer followed by a non-strided convolution (Conv) layer, and can be applied in most if not all CNN architectures. We explain this new design under the two most representative computer vision tasks: object detection and image classification. We then create new CNN architectures by applying SPD-Conv to YOLOv5 and ResNet, and empirically show that our approach significantly outperforms state-of-the-art deep learning models, especially on tougher tasks with low-resolution images and small objects. We have open-sourced our code at https://github.com/LabSAINT/SPD-Conv.


Authors (2)

Citations (210)


Summary

The paper introduces SPD-Conv, a building block that replaces strided convolutions and pooling to preserve fine spatial details in low-resolution images.
The paper shows that integrating SPD-Conv into YOLOv5 and ResNet yields significant gains in small-object average precision and classification accuracy.
The paper redefines CNN downscaling with a space-to-depth transformation, offering practical benefits for mobile vision and real-time analytics scenarios.

SPD-Conv: A New CNN Building Block for Low-Resolution Images and Small Objects

Convolutional neural networks (CNNs) such as YOLO, ResNet, and their derivatives perform well on object detection and image classification when the input is of adequate quality, which typically means well-resolved images containing mostly larger objects. Their accuracy drops, however, on low-resolution images and small objects, which limits their broader applicability.

The paper "No More Strided Convolutions or Pooling: A New CNN Building Block for Low-Resolution Images and Small Objects" by Sunkara and Luo examines the shortcomings of existing CNN architectures and attributes this performance degradation primarily to strided convolutions and pooling layers. These components reduce computational cost by downsampling feature maps, but in doing so they discard fine-grained spatial information that is essential for low-resolution and small-object tasks. The authors introduce SPD-Conv, a building block designed to replace these components while retaining that information.
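The kind of information loss at issue can be illustrated with a toy example. The sketch below is not taken from the paper: the helper names `subsample` and `max_pool` and the 4 × 4 inputs are hypothetical, and plain subsampling stands in for the extreme case of a stride-2 convolution with a 1 × 1 kernel (a strided convolution with a larger kernel still sees every pixel, so it loses information less abruptly).

```python
import numpy as np


def subsample(x: np.ndarray, stride: int = 2) -> np.ndarray:
    """Keep one pixel per stride x stride cell (what a stride-2, 1x1 kernel can see)."""
    return x[::stride, ::stride]


def max_pool(x: np.ndarray, size: int = 2) -> np.ndarray:
    """Non-overlapping max pooling on an (H, W) map whose sides are divisible by size."""
    h, w = x.shape
    return x.reshape(h // size, size, w // size, size).max(axis=(1, 3))


# Two 4x4 maps that differ only in where a one-pixel "object" sits.
a = np.zeros((4, 4))
a[1, 1] = 1.0
b = np.zeros((4, 4))
b[0, 1] = 1.0

# Subsampling never visits position (1, 1), so the object in `a` disappears.
print(subsample(a))

# Max pooling keeps the object but forgets where it was inside its 2x2 cell,
# so the two different inputs produce identical outputs.
print(np.array_equal(max_pool(a), max_pool(b)))
```

In both cases the output has 4 values where the input had 16, and the original map cannot be recovered from it.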

SPD-Conv Design and Methodology

SPD-Conv comprises two components: a space-to-depth (SPD) layer followed by a non-strided convolution layer. The SPD layer rearranges each feature map so that spatial resolution is traded for channels: with a scale factor of 2, a map with C channels and H × W positions becomes one with 4C channels and H/2 × W/2 positions, so every input value is kept. A non-strided (stride-1) convolution is then applied to the result, providing a learnable transformation without skipping any positions. Because the same block replaces both strided convolutions and pooling layers, it gives a single, uniform way to downscale feature maps across CNN architectures.
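The two steps described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation (their code is in the linked repository): the function names are hypothetical, the input is a single (C, H, W) feature map rather than a batch, and the non-strided convolution is simplified to a 1 × 1 kernel so that it reduces to a per-position channel mix. `depth_to_space` is included only to show that the SPD step can be undone exactly.

```python
import numpy as np


def space_to_depth(x: np.ndarray, scale: int = 2) -> np.ndarray:
    """Rearrange (C, H, W) into (scale*scale*C, H/scale, W/scale) without dropping values."""
    _, h, w = x.shape
    if h % scale or w % scale:
        raise ValueError("height and width must be divisible by scale")
    # Sub-map (i, j) keeps every scale-th pixel starting at row i, column j.
    sub_maps = [x[:, i::scale, j::scale] for i in range(scale) for j in range(scale)]
    return np.concatenate(sub_maps, axis=0)


def depth_to_space(y: np.ndarray, scale: int = 2) -> np.ndarray:
    """Inverse of space_to_depth, shown only to demonstrate that SPD is lossless."""
    c_out, h, w = y.shape
    c = c_out // (scale * scale)
    x = np.empty((c, h * scale, w * scale), dtype=y.dtype)
    for k in range(scale * scale):
        i, j = divmod(k, scale)  # same (i, j) order used when concatenating
        x[:, i::scale, j::scale] = y[k * c:(k + 1) * c]
    return x


def pointwise_conv(y: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """Stride-1 convolution with a 1x1 kernel: mixes channels, keeps H and W."""
    return np.einsum("oc,chw->ohw", weight, y)


def spd_conv(x: np.ndarray, weight: np.ndarray, scale: int = 2) -> np.ndarray:
    """SPD layer followed by a non-strided convolution (1x1 kernel assumed here)."""
    return pointwise_conv(space_to_depth(x, scale), weight)


x = np.arange(2 * 4 * 4, dtype=np.float64).reshape(2, 4, 4)  # C=2, H=W=4
y = space_to_depth(x)                                        # shape (8, 2, 2)
weight = np.random.default_rng(0).normal(size=(6, 8))        # 6 output filters, arbitrary
z = spd_conv(x, weight)                                      # shape (6, 2, 2)
```

The spatial size is halved as it would be by a stride-2 layer, but the reduction happens in the rearrangement step, where nothing is thrown away; the convolution that follows decides which of the preserved values matter.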

Empirical Evaluation

The authors evaluate SPD-Conv by modifying YOLOv5 and ResNet, which yields the YOLOv5-SPD and ResNet-SPD models. These models are tested on COCO-2017, Tiny ImageNet, and CIFAR-10, with a focus on small objects and low-resolution images.

In object detection on COCO-2017, the YOLOv5-SPD variants outperform their original YOLOv5 counterparts, most notably in average precision (AP) for small objects. The authors take this as evidence that SPD-Conv captures fine detail better in challenging settings.

For image classification tasks on Tiny ImageNet and CIFAR-10, the ResNet-SPD models also demonstrate superior top-1 accuracy compared to standard ResNets, reinforcing the broader applicability of SPD-Conv beyond just object detection.

Theoretical and Practical Implications

The theoretical contribution of SPD-Conv lies in reevaluating how CNNs reduce spatial dimensions: it shows that a downscaling step which preserves information can yield more faithful feature representations. Practically, the research suggests that integrating SPD-Conv could benefit a broad range of vision tasks, especially those constrained by input quality, such as mobile vision applications and real-time analytics where high-resolution inputs are infeasible.

Future Directions

Future research could explore optimizing SPD-Conv for a wider range of architectures, with particular attention to computational efficiency, since the larger channel count produced by the SPD layer can increase the cost of the convolution that follows. Further integration into deep learning libraries such as PyTorch and TensorFlow is anticipated, which would make adoption and experimentation easier for the research community. Exploring SPD-Conv within hybrid models that incorporate transformer-based components might also prove useful for vision tasks with high variability in object scale and resolution.
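The cost concern can be made concrete with a back-of-the-envelope count. The sketch below is an illustration, not a measurement from the paper: it assumes the replacement convolution keeps the same kernel size and number of output filters as the strided layer it replaces, and the layer sizes (64 input channels, 128 filters, 3 × 3 kernel, 32 × 32 input) are hypothetical.

```python
def conv_cost(c_in: int, c_out: int, kernel: int, out_h: int, out_w: int) -> tuple[int, int]:
    """Return (weights, multiply-accumulates) of a bias-free convolution."""
    weights = kernel * kernel * c_in * c_out
    macs = weights * out_h * out_w  # one pass of every filter per output position
    return weights, macs


def compare(c_in: int = 64, c_out: int = 128, kernel: int = 3,
            size: int = 32, scale: int = 2) -> tuple[tuple[int, int], tuple[int, int]]:
    """Cost of a strided convolution versus SPD followed by a non-strided one."""
    out = size // scale  # both variants produce an (out x out) feature map
    strided = conv_cost(c_in, c_out, kernel, out, out)
    # After SPD the convolution sees scale*scale times as many input channels.
    spd = conv_cost(scale * scale * c_in, c_out, kernel, out, out)
    return strided, spd


strided, spd = compare()
print("strided conv: weights=%d, MACs=%d" % strided)
print("SPD + conv:   weights=%d, MACs=%d" % spd)
```

Under these assumptions both the weight count and the multiply-accumulate count grow by a factor of scale², that is 4 for a scale of 2. Using a smaller kernel or fewer filters in the convolution after the SPD layer is one way such overhead could be reduced.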

In conclusion, the paper presents an empirically evaluated change to CNN architecture that extends the conditions under which computer vision models perform well. By keeping models robust to the resolution constraints often encountered in real-world scenarios, it makes them usable in settings where high-quality input is not available.
