Initializing Models with Larger Ones (2311.18823v1)

Published 30 Nov 2023 in cs.LG and cs.CV

Abstract: Weight initialization plays an important role in neural network training. Widely used initialization methods are proposed and evaluated for networks that are trained from scratch. However, the growing number of pretrained models now offers new opportunities for tackling this classical problem of weight initialization. In this work, we introduce weight selection, a method for initializing smaller models by selecting a subset of weights from a pretrained larger model. This enables the transfer of knowledge from pretrained weights to smaller models. Our experiments demonstrate that weight selection can significantly enhance the performance of small models and reduce their training time. Notably, it can also be used together with knowledge distillation. Weight selection offers a new approach to leverage the power of pretrained models in resource-constrained settings, and we hope it can be a useful tool for training small models in the large-model era. Code is available at https://github.com/OscarXZQ/weight-selection.


Summary

  • The paper presents a novel weight selection methodology that transfers pretrained knowledge from large models to effectively initialize smaller models.
  • The paper demonstrates that integrating weight selection with knowledge distillation significantly boosts performance across diverse image classification datasets.
  • The paper shows that supervised pretraining and pretrained teachers closer in size to the target model yield better initializations than alternative pretraining regimes and teacher sizes.

Overview of "Initializing Models with Larger Ones"

The paper "Initializing Models with Larger Ones" investigates the innovative concept of utilizing pretrained large models to initialize smaller models, addressing the traditional bottleneck of weight initialization in neural network training. This work introduces a method called weight selection, which selects weights from a larger, pretrained model to initialize a smaller one within the same model family. The paper emphasizes the significance of such an approach in reducing large model training time and improving the performance of smaller models in resource-constrained environments.

Key Contributions

  1. Weight Selection Methodology: The authors propose a simple weight selection approach consisting of layer selection, component mapping, and element selection. The method selects a subset of weights from a larger pretrained model and uses them to initialize a smaller model, effectively transferring knowledge from the larger model to its smaller counterpart (a hedged code sketch of this step appears after this list).
  2. Compatibility with Knowledge Distillation: The paper explores combining weight selection with knowledge distillation, including logit-based and feature-based distillation. Results show that the two techniques are complementary, and using them together amplifies the performance gains (see the loss sketch after this list).
  3. Comprehensive Evaluation: The authors conduct experiments across nine image classification datasets of varying scale to validate the proposed method. Weight selection improves test accuracy on all datasets, with the largest gains observed on smaller datasets.
  4. Analysis of Pretraining Regimes and Teacher Size: By examining different pretraining regimes and varying the size of the pretrained model, the work provides insight into what determines initialization quality. Supervised pretraining emerges as the most effective regime, and pretrained models closer in size to the target model provide better initializations.
  5. Empirical Comparisons: The paper offers a robust comparison between weight selection and other initialization strategies, such as Xavier and Kaiming initialization, structured and unstructured pruning, and mimetic initialization. Weight selection consistently demonstrates superior performance, which is attributed to its direct utilization of pretrained weights.
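
To make item 1 concrete, the following is a minimal sketch of weight selection, assuming the student and the pretrained teacher come from the same model family, that layer selection has already been handled so the state-dict keys line up, and that element selection keeps the leading entries of each weight dimension. The helper names below are hypothetical illustrations, not taken from the paper's released code.

```python
import torch
import torch.nn as nn


def select_elements(large: torch.Tensor, small_shape: torch.Size) -> torch.Tensor:
    """Take the leading slice of every dimension of `large` so it matches `small_shape`."""
    index = tuple(slice(0, s) for s in small_shape)
    return large[index].clone()


@torch.no_grad()
def weight_selection_init(student: nn.Module, teacher_state: dict) -> None:
    """Copy element subsets of a larger pretrained state dict into `student`,
    assuming keys already correspond after layer selection."""
    student_state = student.state_dict()
    for name, param in student_state.items():
        source = teacher_state.get(name)
        if source is None:
            continue  # component absent in the teacher: keep the default init
        student_state[name] = select_elements(source, param.shape)
    student.load_state_dict(student_state)
```

After this initialization, the student is trained as usual; the selection itself adds negligible computational cost since it only slices and copies existing tensors.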
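
For item 2, a weight-selection-initialized student can additionally be trained with logit-based distillation. The sketch below shows one standard formulation of that loss; the temperature, weighting factor, and function name are illustrative assumptions rather than the paper's exact training recipe.

```python
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 4.0, alpha: float = 0.5):
    """Cross-entropy on ground-truth labels plus KL divergence to the teacher's softened logits."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return (1.0 - alpha) * ce + alpha * kd
```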

Implications and Future Directions

This research presents significant implications for the efficient deployment of neural networks, particularly in settings with constrained computational resources. The proposed method offers a practical and computationally inexpensive solution to leverage the expansive knowledge encapsulated in large pretrained models, promoting broader applicability of deep learning models on devices with limited resources.

Theoretically, the paper contributes to the understanding of how pretrained models can be utilized beyond standard transfer learning paradigms. It opens new avenues for future research on weight initialization methods tailored to various model architectures and on hybrid approaches that integrate multiple knowledge transfer techniques.

Looking forward, the methodology could inspire further exploration into modular designs and adaptive initialization strategies, potentially influencing the development of more efficient algorithms for training lightweight neural networks. The implications extend to the architecture design of models, where the focus could shift towards developing architectures that naturally facilitate weight selection across different scales.

In conclusion, "Initializing Models with Larger Ones" presents a compelling argument for rethinking weight initialization in modern neural networks, leveraging pretrained models to drive substantial improvements in model training efficiency and performance. This work is poised to inspire ongoing exploration into the scalable, efficient training of neural networks, especially in the context of increasingly sophisticated AI applications.
