
MixtureGrowth: Growing Neural Networks by Recombining Learned Parameters

(2311.04251)
Published Nov 7, 2023 in cs.LG, cs.AI, and cs.CV

Abstract

Most deep neural networks are trained under fixed network architectures and require retraining when the architecture changes. If expanding the network's size is needed, it is necessary to retrain from scratch, which is expensive. To avoid this, one can grow from a small network by adding random weights over time to gradually achieve the target network size. However, this naive approach falls short in practice as it brings too much noise to the growing process. Prior work tackled this issue by leveraging the already learned weights and training data to generate new weights through a computationally expensive analysis step. In this paper, we introduce MixtureGrowth, a new approach to growing networks that circumvents the initialization overhead of prior work. Before growing, each layer in our model is generated as a linear combination of parameter templates. Newly grown layer weights are generated using new linear combinations of a layer's existing templates. On one hand, these templates are already trained for the task, providing a strong initialization. On the other, the new coefficients provide flexibility for the added layer weights to learn something new. We show that our approach boosts top-1 accuracy over the state of the art by 2-2.5% on CIFAR-100 and ImageNet, while achieving performance comparable to a larger network trained from scratch with fewer FLOPs. Code is available at https://github.com/chaudatascience/mixturegrowth.

Overview

  • MixtureGrowth introduces a technique for expanding neural networks by reusing learned weights through new linear combinations, aiming to combine the efficiency of small models with the superior performance of larger ones without full retraining.

  • The method uses parameter templates and linear combinations of these templates to add new weights, preserving computational efficiency and learned representations.

  • Experimental results show a 2-2.5% improvement in top-1 accuracy over the state of the art on CIFAR-100 and ImageNet, along with performance comparable to a larger network trained from scratch at lower computational cost, demonstrating the method's efficacy.

  • Future research directions include exploring recombination strategies across various architectures and refining the template and combination processes to further improve scalability and efficiency.

MixtureGrowth: An Efficient Approach for Increasing Neural Network Size through Recombination of Learned Parameters

Introduction

In the quest to enhance the performance of deep neural networks, researchers have sought various strategies, including neural architecture search (NAS), knowledge distillation, and parameter pruning, among others. These approaches, while effective, often result in models that optimize inference performance at the cost of increased computational complexity during the training phase. An alternative strategy that has gained interest involves starting with a smaller network model and progressively growing its size. This approach benefits from the initial reduced computational requirement of smaller models and the eventual superior performance of larger networks. However, the critical challenge lies in expanding the network size without necessitating a complete retraining from scratch, which could nullify the computational savings.

MixtureGrowth Methodology

MixtureGrowth introduces a novel technique for growing neural networks by reusing already learned weights. At its core, the idea is to enlarge a network by introducing new weights that are linear combinations of pre-existing parameter templates. This maintains computational efficiency, since learned parameters are reused, and ensures that the expanded network inherits the learned representations. More specifically, the approach sidesteps the expensive analysis step that prior work required to initialize new weights: expansion amounts to forming new linear combinations of the already trained templates.
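
To make the template-mixing idea concrete, the sketch below shows a convolutional layer whose weight tensor is a coefficient-weighted sum of shared templates. This is a minimal illustration, not the authors' implementation: the class name TemplateMixedConv2d, the template bank shape, and the coefficient initialization are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemplateMixedConv2d(nn.Module):
    """Conv layer whose weight is a per-layer linear combination of shared templates.

    Illustrative sketch only: the template bank shape and coefficient
    initialization are assumptions, not the paper's configuration.
    """

    def __init__(self, templates: nn.Parameter):
        super().__init__()
        # templates: (num_templates, out_ch, in_ch, k, k), shared across layers
        self.templates = templates
        num_templates = templates.shape[0]
        # one trainable mixing coefficient per template for this layer
        self.coeffs = nn.Parameter(torch.randn(num_templates) / num_templates ** 0.5)

    def forward(self, x):
        # weight = sum_t coeffs[t] * templates[t]
        weight = torch.einsum('t,toihw->oihw', self.coeffs, self.templates)
        return F.conv2d(x, weight, padding=weight.shape[-1] // 2)


# usage: one bank of 4 templates for 3x3, 64-channel convs, shared by two
# layers that differ only in their mixing coefficients
bank = nn.Parameter(torch.randn(4, 64, 64, 3, 3) * 0.01)
layer_a, layer_b = TemplateMixedConv2d(bank), TemplateMixedConv2d(bank)
out = layer_b(layer_a(torch.randn(1, 64, 32, 32)))
```

Because the templates are shared, layers built on the same bank differ only in their small coefficient vectors, which is what makes growing by adding new coefficient vectors cheap.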

  • Parameter Templates and Linear Combinations: A network designated to grow benefits from a mechanism that integrates newly generated weights without disturbing the learned representations. MixtureGrowth achieves this by maintaining a set of parameter templates from which the smaller model's layer weights are already composed (as sketched above); new weights are then introduced as fresh linear combinations of these same templates.
  • Growth Strategies and Implementation: A pivotal aspect of the growth process is how to initialize the new sets of linear coefficients for the added weights. Through experimental analysis, the paper identifies orthogonal initialization as an effective choice, promoting diversity and robustness in the new weights; a sketch of this growth step follows this list.
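
As a rough illustration of the growth step itself, the sketch below assembles a width-doubled weight from the already trained templates: the original block keeps its trained coefficients, while three newly added blocks use fresh coefficient vectors drawn with orthogonal initialization. The function name grow_layer, the 2x2 block layout, and the specific orthogonal draw are assumptions for illustration, not the paper's exact procedure.

```python
import torch


def grow_layer(templates, old_coeffs):
    """Assemble a width-doubled weight from the already trained templates.

    templates:  (T, out_ch, in_ch, k, k) shared, trained templates
    old_coeffs: (T,) trained coefficients of the original weight block
    Returns a (2*out_ch, 2*in_ch, k, k) weight. Sketch only: the 2x2 block
    layout and the orthogonal draw of new coefficients are assumptions.
    """
    num_templates = templates.shape[0]
    # three new coefficient vectors for the added blocks; drawing them
    # mutually orthogonal encourages the newly generated blocks to be diverse
    new_coeffs = torch.nn.init.orthogonal_(torch.empty(3, num_templates))

    def block(c):
        # generate one weight block from one coefficient vector
        return torch.einsum('t,toihw->oihw', c, templates)

    top = torch.cat([block(old_coeffs), block(new_coeffs[0])], dim=1)      # along input channels
    bottom = torch.cat([block(new_coeffs[1]), block(new_coeffs[2])], dim=1)
    return torch.cat([top, bottom], dim=0)                                 # along output channels


# usage: grow a 64-channel block into a 128-channel weight
templates = torch.randn(4, 64, 64, 3, 3)
big_weight = grow_layer(templates, old_coeffs=torch.randn(4))
print(big_weight.shape)  # torch.Size([128, 128, 3, 3])
```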

Experimental Findings

MixtureGrowth demonstrates its effectiveness through substantial improvements in top-1 accuracy when growing models on CIFAR-100 and ImageNet, at reduced computational cost. Key findings include:

  • Up to 2.5% improvement in top-1 accuracy on the CIFAR-100 dataset over state-of-the-art methods under equivalent computational constraints.
  • Comparable performance to larger networks trained from scratch while requiring significantly fewer FLOPs, showcasing the efficiency of the approach.

Analysis and Future Directions

Several key insights emerge from experimenting with MixtureGrowth, notably the impact of the initialization strategy for the linear coefficients and the choice of when to grow during training. The analysis suggests that initializing the new weights' coefficients orthogonally after growth leads to larger performance gains.
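
As a small, hedged illustration of why orthogonal coefficients promote diversity, the snippet below compares randomly drawn coefficient vectors with orthogonally initialized ones using average pairwise cosine similarity; the diagnostic and the template count are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn.functional as F

num_templates = 16                                                  # assumed template count
rand_c = torch.randn(3, num_templates)                              # randomly drawn coefficient vectors
orth_c = torch.nn.init.orthogonal_(torch.empty(3, num_templates))   # orthogonally initialized vectors


def mean_abs_cosine(c):
    # average pairwise |cosine similarity| between the coefficient vectors
    c = F.normalize(c, dim=1)       # unit-norm rows
    sims = c @ c.t()                # pairwise cosine similarities
    off_diag = sims - torch.eye(c.shape[0])
    return off_diag.abs().sum() / (c.shape[0] * (c.shape[0] - 1))


print(mean_abs_cosine(rand_c))   # typically well above 0: correlated directions
print(mean_abs_cosine(orth_c))   # ~0: the new coefficient vectors are mutually orthogonal
```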

One promising avenue for future research could involve investigating the recombination and growth strategies across different network architectures and task domains. Furthermore, refinements in template selection and the linear combination process could extend the methodology's applicability, potentially opening new paths toward dynamically scalable neural networks that efficiently adapt to varying computational resources and task complexities.

Conclusion

MixtureGrowth presents a compelling strategy for increasing neural network size with minimal computational overhead, leveraging the strength of parameter recombination. Its ability to significantly boost performance while maintaining or even reducing the total computational cost poses an exciting prospect for the development of more efficient and adaptable neural networks.
