Building Variable-sized Models via Learngene Pool (2312.05743v2)
Abstract: Recently, Stitchable Neural Networks (SN-Net) was proposed to stitch several pre-trained networks for quickly building numerous networks with different complexity-performance trade-offs, alleviating the burden of designing or training variable-sized networks for application scenarios with diverse resource constraints. However, SN-Net still faces a few challenges. 1) Stitching multiple independently pre-trained anchors incurs high storage consumption. 2) SN-Net struggles to build small models for low resource constraints. 3) SN-Net uses an unlearned initialization method for its stitch layers, which limits the final performance. To overcome these challenges, and motivated by the recently proposed Learngene framework, we propose a novel method called Learngene Pool. Briefly, Learngene distills the critical knowledge of a large pre-trained model into a small part (termed the learngene) and then expands this small part into several variable-sized models. In our method, we distill one pre-trained large model into multiple small models whose network blocks serve as learngene instances to construct the learngene pool. Since only one large model is used, we avoid storing multiple large models as SN-Net does, and after distillation, smaller learngene instances can be created to build small models that satisfy low resource constraints. We also insert learnable transformation matrices between the instances to stitch them into variable-sized models, which improves the performance of these models. Extensive experiments validate the effectiveness of the proposed Learngene Pool compared with SN-Net.
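The stitching mechanism described in the abstract, i.e. inserting learnable transformation matrices between learngene instances to assemble variable-sized models, can be sketched in PyTorch. This is a minimal illustration based only on the abstract, not the authors' implementation; names such as `StitchLayer`, `StitchedModel`, `instances`, and `dims` are assumptions.

```python
import torch
import torch.nn as nn


class StitchLayer(nn.Module):
    """Learnable linear transformation mapping features from one learngene
    instance's embedding dimension to the next one's (unlike SN-Net's
    unlearned least-squares initialization, this is trained end-to-end)."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, in_dim) -> (batch, tokens, out_dim)
        return self.proj(x)


class StitchedModel(nn.Module):
    """Builds one variable-sized model by concatenating block groups
    (learngene instances) and inserting a stitch layer wherever two
    adjacent groups have different embedding dimensions."""
    def __init__(self, instances: list[nn.Module], dims: list[int], num_classes: int):
        super().__init__()
        layers: list[nn.Module] = []
        for i, inst in enumerate(instances):
            layers.append(inst)
            if i + 1 < len(instances) and dims[i] != dims[i + 1]:
                layers.append(StitchLayer(dims[i], dims[i + 1]))
        self.body = nn.Sequential(*layers)
        self.head = nn.Linear(dims[-1], num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.body(x)             # (batch, tokens, dim)
        return self.head(x.mean(1))  # mean-pool tokens, then classify


# Hypothetical usage: stitch two instances of different widths.
# pool = [nn.TransformerEncoderLayer(192, 3, batch_first=True),
#         nn.TransformerEncoderLayer(384, 6, batch_first=True)]
# model = StitchedModel(pool, dims=[192, 384], num_classes=1000)
# logits = model(torch.randn(2, 197, 192))
```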
- Revisiting Model Stitching to Compare Neural Representations. In Neural Information Processing Systems.
- Similarity and Matching of Neural Network Representations. In Neural Information Processing Systems.
- Scaling Vision Transformers to 22 Billion Parameters. arXiv preprint arXiv:2302.05442.
- An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
- Mosaicking to Distill: Knowledge Distillation from Out-of-Domain Data. arXiv preprint arXiv:2110.15094.
- Up to 100x Faster Data-free Knowledge Distillation. In AAAI Conference on Artificial Intelligence.
- Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. arXiv preprint arXiv:1703.03400.
- Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning. arXiv preprint arXiv:2208.11580.
- SqueezeNext: Hardware-Aware Neural Network Design. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).
- Single Path One-Shot Neural Architecture Search with Uniform Sampling. In European Conference on Computer Vision.
- Convolutional Neural Network Compression through Generalized Kronecker Product Decomposition. In AAAI Conference on Artificial Intelligence.
- Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.
- Distilling the Knowledge in a Neural Network. arXiv preprint arXiv:1503.02531.
- MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv preprint arXiv:1704.04861.
- LayerCAM: Exploring Hierarchical Class Activation Maps for Localization. IEEE Transactions on Image Processing, 5875–5888.
- TinyBERT: Distilling BERT for Natural Language Understanding. In Findings of EMNLP.
- Deep learning. Nature, 521(7553): 436–444.
- Understanding Image Representations by Measuring Their Equivariance and Equivalence. International Journal of Computer Vision, 127: 456–476.
- MicroNet: Towards Image Recognition with Extremely Low FLOPs. arXiv preprint arXiv:2011.12289.
- Rethinking Vision Transformers for MobileNet Size and Speed. arXiv preprint arXiv:2212.08059.
- EfficientFormer: Vision Transformers at MobileNet Speed. arXiv preprint arXiv:2206.01191.
- MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer. arXiv preprint arXiv:2110.02178.
- Stitchable Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16102–16112.
- ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115: 211–252.
- Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization. International Journal of Computer Vision, 336–359.
- Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- Training data-efficient image transformers & distillation through attention. In International conference on machine learning, 10347–10357. PMLR.
- Vanschoren, J. 2018. Meta-Learning: A Survey. arXiv preprint arXiv:1810.03548.
- Learngene: Inheriting Condensed Knowledge from the Ancestry Model to Descendant Models. arXiv preprint arXiv:2305.02279.
- Learngene: From open-world to your learning task. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 8557–8565.
- Wightman, R. 2019. PyTorch Image Models. https://github.com/rwightman/pytorch-image-models.
- Deep Model Reassembly. arXiv preprint arXiv:2210.17409.
- Visualizing and Understanding Convolutional Networks. In European Conference on Computer Vision.
- Are All Layers Created Equal? arXiv preprint arXiv:1902.01996.
- Advancing Model Pruning via Bi-level Optimization. arXiv preprint arXiv:2210.04092.
- Decoupled Knowledge Distillation. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 11943–11952.