ShiftwiseConv: Small Convolutional Kernel with Large Kernel Effect (arXiv:2401.12736v2)
Abstract: Large kernels make standard convolutional neural networks (CNNs) great again over transformer architectures in various vision tasks. Nonetheless, recent studies built around ever-increasing kernel sizes show diminishing returns or stagnating performance, so the hidden factors through which large-kernel convolution affects model performance remain unexplored. In this paper, we reveal that the key hidden factors of large kernels can be summarized as two separate components: extracting features at a certain granularity and fusing features through multiple pathways. Building on this, we leverage multi-path, long-distance sparse dependencies to enhance feature utilization via the proposed Shiftwise (SW) convolution operator in a pure CNN architecture. Across a wide range of vision tasks such as classification, segmentation, and detection, SW surpasses state-of-the-art transformer and CNN architectures, including SLaK and UniRepLKNet. More importantly, our experiments demonstrate that $3 \times 3$ convolutions can replace the large convolutions in existing large-kernel CNNs and achieve comparable results, which may inspire follow-up work. Code and all models are available at https://github.com/lidc54/shift-wiseConv.
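The abstract's core claim, that small kernels combined with shifted multi-path fusion can reproduce a large-kernel effect, can be illustrated with a minimal 1-D NumPy sketch. This is an illustrative decomposition under simplifying assumptions, not the paper's exact SW operator: a "large" 9-tap kernel is split into three 3-tap pieces, each piece is convolved with the input separately, and the partial outputs are shifted and summed, recovering the large-kernel result exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(32)   # 1-D input signal
k = rng.standard_normal(9)    # a "large" 9-tap kernel

# Direct large-kernel convolution.
y_large = np.convolve(x, k, mode="full")      # length 32 + 9 - 1 = 40

# Shiftwise-style decomposition (illustrative): split the large kernel
# into three 3-tap sub-kernels, convolve each one independently, then
# shift each partial output by the sub-kernel's offset and accumulate.
y_sw = np.zeros(len(x) + len(k) - 1)
for i in range(3):
    piece = k[3 * i: 3 * i + 3]               # one small 3-tap kernel
    y_i = np.convolve(x, piece, mode="full")  # length 32 + 3 - 1 = 34
    y_sw[3 * i: 3 * i + len(y_i)] += y_i      # shift by 3*i, then add

assert np.allclose(y_large, y_sw)             # both paths agree exactly
```

The identity follows from linearity of convolution: concatenating the sub-kernels is equivalent to summing their individually shifted responses, which is why small kernels plus spatial shifts can stand in for a single large kernel.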
- Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, IEEE Computer Society (2016) 770–778
- DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(4) (2018) 834–848
- UniRepLKNet: A universal perception large-kernel ConvNet for audio, video, point cloud, time-series and image recognition (2023)
- Geometry-aware guided loss for deep crack recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. (2022) 4703–4712
- The devil is in the crack orientation: A new perspective for crack detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. (2023) 6653–6663
- A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. (2022) 11976–11986
- Scaling up your kernels to 31x31: Revisiting large kernel design in CNNs. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. (2022) 11963–11975
- More ConvNets in the 2020s: Scaling up kernels beyond 51x51 using sparsity. arXiv preprint arXiv:2207.03620 (2022)
- ImageNet classification with deep convolutional neural networks. In Bartlett, P.L., Pereira, F.C.N., Burges, C.J.C., Bottou, L., Weinberger, K.Q., eds.: Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States. (2012) 1106–1114
- Going deeper with convolutions. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, IEEE Computer Society (2015) 1–9
- Very deep convolutional networks for large-scale image recognition. In Bengio, Y., LeCun, Y., eds.: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. (2015)
- Bag of tricks for image classification with convolutional neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, Computer Vision Foundation / IEEE (2019) 558–567
- Pyramid scene parsing network. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, IEEE Computer Society (2017) 6230–6239
- Scale-aware trident networks for object detection. In: 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, IEEE (2019) 6053–6062
- Fully convolutional networks for semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, IEEE Computer Society (2015) 3431–3440
- SegNeXt: Rethinking convolutional attention design for semantic segmentation (2022)
- Large kernel matters - improve semantic segmentation by global convolutional network. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, IEEE Computer Society (2017) 1743–1751
- Bilinear CNNs for fine-grained visual recognition (2017)
- Gated-SCNN: Gated shape CNNs for semantic segmentation. In: 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, IEEE (2019) 5228–5237
- Dual attention network for scene segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, Computer Vision Foundation / IEEE (2019) 3146–3154
- Non-local neural networks. In: 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, IEEE Computer Society (2018) 7794–7803
- Deformable convolutional networks. In: IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, IEEE Computer Society (2017) 764–773
- Dynamic snake convolution based on topological geometric constraints for tubular structure segmentation (2023)
- LinK: Linear kernel for LiDAR-based 3D perception (2023)
- PointRend: Image segmentation as rendering. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, IEEE (2020) 9796–9805
- ParCNetV2: Oversized kernel with enhanced attention. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. (2023) 5752–5762
- A survey of transformers (2021)
- Visual attention network. Computational Visual Media 9(4) (2023) 733–752
- Squeeze-and-excitation networks. In: 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, IEEE Computer Society (2018) 7132–7141
- InternImage: Exploring large-scale vision foundation models with deformable convolutions (2023)
- Dilated convolution with learnable spacings. arXiv preprint arXiv:2112.03740 (2021)
- Large separable kernel attention: Rethinking the large kernel attention design in CNN. Expert Systems with Applications 236 (2024) 121352
- Convolutional networks with oriented 1D kernels. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. (2023) 6222–6232
- Beyond self-attention: Deformable large kernel attention for medical image segmentation (2023)
- Are large kernels better teachers than transformers for ConvNets? arXiv preprint arXiv:2305.19412 (2023)
- Shift: A zero flop, zero parameter alternative to spatial convolutions. In: 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, IEEE Computer Society (2018) 9127–9135
- Constructing fast network through deconstruction of convolution. In Bengio, S., Wallach, H.M., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R., eds.: Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada. (2018) 5955–5965
- All you need is a few shifts: Designing efficient convolutional neural networks for image classification. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, Computer Vision Foundation / IEEE (2019) 7241–7250
- On the integration of self-attention and convolution (2022)
- Skeleton-based action recognition with shift graph convolutional network. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, IEEE (2020) 180–189
- X-volution: On the unification of convolution and self-attention (2021)
- AKConv: Convolutional kernel with arbitrary sampled shapes and arbitrary number of parameters (2023)
- GhostNet: More features from cheap operations. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, IEEE (2020) 1577–1586
- CSPNet: A new backbone that can enhance learning capability of CNN. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). (2020) 1571–1580
- Run, don't walk: Chasing higher FLOPS for faster neural networks. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). (2023) 12021–12031
- RepVGG: Making VGG-style ConvNets great again. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. (2021) 13733–13742
- ExpandNets: Linear over-parameterization to train compact convolutional networks. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., eds.: Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual. (2020)
- MobileOne: An improved one millisecond mobile backbone (2023)
- VanillaNet: The power of minimalism in deep learning (2023)
- Lart: Five implementation strategies of the spatial-shift-operation. https://www.yuque.com/lart/ugkv9f/nnor5p (2022-05-18)
- Taichi: a language for high-performance computation on spatially sparse data structures. ACM Transactions on Graphics (TOG) 38(6) (2019) 201
- Swin Transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. (2021) 10012–10022
- CSWin Transformer: A general vision transformer backbone with cross-shaped windows. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. (2022) 12124–12134