Initializing Models with Larger Ones (2311.18823v1)
Abstract: Weight initialization plays an important role in neural network training. Widely used initialization methods were designed and evaluated for networks trained from scratch. However, the growing number of pretrained models now offers new opportunities for tackling this classical problem of weight initialization. In this work, we introduce weight selection, a method for initializing smaller models by selecting a subset of weights from a pretrained larger model. This enables the transfer of knowledge from pretrained weights to smaller models. Our experiments demonstrate that weight selection can significantly enhance the performance of small models and reduce their training time. Notably, it can also be combined with knowledge distillation. Weight selection offers a new approach to leveraging the power of pretrained models in resource-constrained settings, and we hope it will be a useful tool for training small models in the large-model era. Code is available at https://github.com/OscarXZQ/weight-selection.
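To make the idea concrete, below is a minimal sketch of initializing a smaller model from a larger pretrained one. It assumes the subset of weights is taken as a leading slice of each tensor the two models share by name; the helper `init_from_larger` and the toy MLP example are illustrative assumptions, not the paper's exact selection procedure.

```python
# Minimal sketch (assumption: select the leading slice of each shared tensor).
import torch
import torch.nn as nn


def init_from_larger(small_model: nn.Module, large_state: dict) -> nn.Module:
    """Initialize `small_model` from a larger model's state dict.

    For every parameter/buffer the small model shares (by name) with the
    pretrained larger model, keep the first entries along each dimension so
    the shapes match; anything without a match keeps its default init.
    """
    new_state = {}
    for name, tensor in small_model.state_dict().items():
        if name in large_state and large_state[name].dim() == tensor.dim():
            big = large_state[name]
            # Slice the larger tensor down to the smaller tensor's shape.
            index = tuple(slice(0, s) for s in tensor.shape)
            new_state[name] = big[index].clone()
        else:
            new_state[name] = tensor  # fall back to default initialization
    small_model.load_state_dict(new_state)
    return small_model


if __name__ == "__main__":
    # Toy example: a wide pretrained MLP initializes a narrower one.
    large = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 10))
    small = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
    init_from_larger(small, large.state_dict())
    assert torch.equal(small[0].weight, large[0].weight[:64, :32])
```

Because the copied slices inherit the pretrained statistics, the small model starts from an informed initialization rather than a random one; the same sliced state dict can also serve as the student initialization when training with knowledge distillation.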