InceptionNeXt: When Inception Meets ConvNeXt

Published 29 Mar 2023 in cs.CV, cs.AI, and cs.LG | (2303.16900v3)

Abstract: Inspired by the long-range modeling ability of ViTs, large-kernel convolutions are widely studied and adopted recently to enlarge the receptive field and improve model performance, like the remarkable work ConvNeXt which employs 7x7 depthwise convolution. Although such depthwise operator only consumes a few FLOPs, it largely harms the model efficiency on powerful computing devices due to the high memory access costs. For example, ConvNeXt-T has similar FLOPs with ResNet-50 but only achieves ~60% throughputs when trained on A100 GPUs with full precision. Although reducing the kernel size of ConvNeXt can improve speed, it results in significant performance degradation, which poses a challenging problem: How to speed up large-kernel-based CNN models while preserving their performance. To tackle this issue, inspired by Inceptions, we propose to decompose large-kernel depthwise convolution into four parallel branches along channel dimension, i.e., small square kernel, two orthogonal band kernels, and an identity mapping. With this new Inception depthwise convolution, we build a series of networks, namely InceptionNeXt, which not only enjoy high throughputs but also maintain competitive performance. For instance, InceptionNeXt-T achieves 1.6x higher training throughputs than ConvNeXt-T, as well as attains 0.2% top-1 accuracy improvement on ImageNet-1K. We anticipate InceptionNeXt can serve as an economical baseline for future architecture design to reduce carbon footprint. Code is available at https://github.com/sail-sg/inceptionnext.

Summary

The paper proposes a CNN architecture that decomposes large-kernel depthwise convolutions into four parallel branches to raise training throughput while preserving accuracy.
InceptionNeXt-T achieves 1.6x higher training throughput than ConvNeXt-T, along with a 0.2% top-1 accuracy improvement on ImageNet-1K.
The design scales across model sizes and is offered as an economical baseline for future CNN architecture design.

This paper introduces InceptionNeXt, a convolutional neural network (CNN) architecture designed to improve computational efficiency while maintaining accuracy in models built on large-kernel depthwise convolutions. It addresses the high memory access cost of these convolutions, which limits throughput even when FLOPs are low, with a decomposition inspired by Inception modules.
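To make the gap between FLOPs and memory access concrete, the following is a rough sketch that compares a $7 \times 7$ depthwise convolution with a $1 \times 1$ convolution under an idealized cost model. The function names, the feature-map shape (96 channels at $56 \times 56$), the expansion ratio of 4, and the assumption that every tensor is read or written exactly once are all illustrative choices, not figures taken from the paper.

```python
def depthwise_cost(c, h, w, k, bytes_per_elem=4):
    """FLOPs and idealized memory traffic of a k x k depthwise convolution."""
    flops = 2 * k * k * c * h * w        # one multiply and one add per kernel tap
    elems = 2 * c * h * w + c * k * k    # read input, write output, read weights
    return flops, elems * bytes_per_elem


def pointwise_cost(c_in, c_out, h, w, bytes_per_elem=4):
    """FLOPs and idealized memory traffic of a 1 x 1 convolution."""
    flops = 2 * c_in * c_out * h * w
    elems = (c_in + c_out) * h * w + c_in * c_out
    return flops, elems * bytes_per_elem


# Hypothetical layer shape: 96 channels on a 56 x 56 feature map, fp32 values.
c, h, w = 96, 56, 56
dw_flops, dw_bytes = depthwise_cost(c, h, w, k=7)
pw_flops, pw_bytes = pointwise_cost(c, 4 * c, h, w)  # assumed 4x channel expansion

# Arithmetic intensity: how much compute is done per byte moved.
dw_intensity = dw_flops / dw_bytes
pw_intensity = pw_flops / pw_bytes

print(f"depthwise 7x7: {dw_flops / 1e6:.1f} MFLOPs, {dw_intensity:.1f} FLOPs/byte")
print(f"pointwise 1x1: {pw_flops / 1e6:.1f} MFLOPs, {pw_intensity:.1f} FLOPs/byte")
```

Under these assumptions the depthwise layer performs far fewer FLOPs than the pointwise layer but also does less work per byte moved, which is one way to read the claim that such layers can be limited by memory access rather than by arithmetic. Real kernels, caches, and hardware behave differently, so the numbers are only indicative.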

Overview

InceptionNeXt combines the block design of ConvNeXt with the multi-branch idea of Inception modules. The paper identifies a significant problem with large-kernel depthwise convolutions, such as the $7 \times 7$ kernels in ConvNeXt: despite their low FLOPs, they run inefficiently on powerful computing devices such as GPUs because of high memory access costs. The authors propose decomposing these large-kernel operations into four parallel branches along the channel dimension:

Small Square Kernel Convolution: A $3 \times 3$ depthwise kernel is applied to one group of channels, keeping the local spatial mixing of a conventional small-kernel convolution at low cost.
Two Orthogonal Band Kernels: A $1 \times k$ kernel and a $k \times 1$ kernel are each applied to their own channel group. Following Inception's factorized convolutions, they extend the receptive field along the width and height without the cost of a full $k \times k$ kernel.
Identity Mapping: The remaining channels bypass convolution entirely, which reduces computation and memory access and further accelerates processing.

This method effectively enlarges the receptive field while minimizing the associated computational costs, achieving a balance between performance and speed.
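The four-branch decomposition can be sketched in plain Python on nested lists, which is slow but makes the channel split explicit. This is a minimal illustration, not the authors' implementation: the helper names, the branch ordering (identity channels first), and the defaults of a 1/8 channel share per convolution branch and a band size of 11 are assumptions. The text above only specifies a $3 \times 3$ square kernel and $1 \times k$ / $k \times 1$ bands.

```python
def split_channels(channels, branch_ratio=0.125):
    """Channel counts for (identity, square, band_w, band_h); ratio is assumed."""
    gc = int(channels * branch_ratio)  # channels given to each convolution branch
    return channels - 3 * gc, gc, gc, gc


def depthwise_conv2d(x, kernels):
    """Zero-padded, stride-1 depthwise convolution (cross-correlation).

    x: list of C feature maps, each an H x W list of lists.
    kernels: list of C kernels, each kh x kw with odd kh and kw.
    """
    out = []
    for fmap, k in zip(x, kernels):
        h, w = len(fmap), len(fmap[0])
        kh, kw = len(k), len(k[0])
        ph, pw = kh // 2, kw // 2
        res = [[0.0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                acc = 0.0
                for di in range(kh):
                    for dj in range(kw):
                        ii, jj = i + di - ph, j + dj - pw
                        if 0 <= ii < h and 0 <= jj < w:  # taps outside the map are zero padding
                            acc += fmap[ii][jj] * k[di][dj]
                res[i][j] = acc
        out.append(res)
    return out


def inception_dwconv(x, square_k, band_w_k, band_h_k):
    """Apply the four branches to disjoint channel groups and concatenate."""
    c_sq, c_w, c_h = len(square_k), len(band_w_k), len(band_h_k)
    c_id = len(x) - (c_sq + c_w + c_h)
    x_id = x[:c_id]                              # identity branch: untouched
    x_sq = x[c_id:c_id + c_sq]                   # small square kernel branch
    x_w = x[c_id + c_sq:c_id + c_sq + c_w]       # 1 x k band branch
    x_h = x[c_id + c_sq + c_w:]                  # k x 1 band branch
    return (
        [[row[:] for row in fmap] for fmap in x_id]
        + depthwise_conv2d(x_sq, square_k)
        + depthwise_conv2d(x_w, band_w_k)
        + depthwise_conv2d(x_h, band_h_k)
    )


def dwconv_params(channels, kernel=7):
    """Weights in a full kernel x kernel depthwise convolution (no bias)."""
    return channels * kernel * kernel


def inception_dwconv_params(channels, square=3, band=11, branch_ratio=0.125):
    """Weights in the decomposed layer under the assumed defaults (no bias)."""
    _, gc, _, _ = split_channels(channels, branch_ratio)
    return gc * (square * square + 2 * band)


print(dwconv_params(96), inception_dwconv_params(96))  # 4704 372
```

With 96 channels and the assumed defaults, 60 channels pass through unchanged and only 36 are convolved, so the layer touches far fewer weights and activations than a full $7 \times 7$ depthwise convolution over all channels.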

Key Results

The paper presents several salient results demonstrating InceptionNeXt's efficacy:

Training Throughput Improvement: InceptionNeXt-T achieved a 1.6x gain in training throughput versus ConvNeXt-T while showing a modest top-1 accuracy improvement of 0.2% on the ImageNet-1K benchmark.
Speed and Performance Trade-off: InceptionNeXt-T roughly matches ResNet-50 in training throughput while being markedly more accurate. This is consistent with the abstract's figures: ConvNeXt-T reaches about 60% of ResNet-50's throughput, and a 1.6x gain over that gives roughly $0.6 \times 1.6 \approx 0.96$ of ResNet-50's speed.
Design Adaptability: The method scales efficiently across different model sizes (T, S, B configurations).

Implications and Future Directions

InceptionNeXt revisits how large-kernel operations are computed instead of simply shrinking the kernel, which the abstract notes causes significant performance degradation. The main implication is lower computational cost, and with it a smaller carbon footprint, for training and deploying large-scale networks. The authors position the architecture as an economical baseline for future architecture design.

Looking forward, one open direction is optimizing the approach further at the hardware level. Others include integrating the decomposition into hybrid models and adapting it to tasks beyond image classification, particularly dense prediction tasks such as semantic segmentation.

The study by Yu et al. thus contributes a more efficient large-kernel CNN design, useful both as a practical model and as a reference point for future architecture work.
