Fisher Mask Nodes for Language Model Merging (2403.09891v3)
Abstract: Fine-tuning pre-trained models provides significant advantages in downstream performance. The ubiquity of pre-trained models such as BERT and its derivatives in natural language processing has also led to a proliferation of task-specific fine-tuned models. As these models typically only perform one task well, additional training or ensembling is required in multi-task scenarios. The growing field of model merging addresses this challenge by combining multiple task-specific models into a single multi-task model. In this study, we introduce a novel model merging method for Transformers that combines insights from previous work on Fisher-weighted averaging and the use of Fisher information in model pruning. Using the Fisher information of mask nodes within the Transformer architecture, we devise a computationally efficient weighted-averaging scheme. Our method exhibits consistent and significant performance gains across various models in the BERT family, outperforming full-scale Fisher-weighted averaging at a fraction of the computational cost, with improvements over the baseline of up to +6.5 points and speedups between 57.4x and 321.7x across models. Our results demonstrate the potential of our method in current multi-task learning environments and suggest its scalability and adaptability to new model architectures and learning scenarios.
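For readers unfamiliar with the baseline being improved upon, the following is a minimal sketch of full-scale Fisher-weighted averaging (Matena and Raffel, 2022), the scheme the abstract compares against, not the paper's mask-node method. It assumes PyTorch checkpoints fine-tuned from the same pre-trained model; the function names `diagonal_fisher` and `fisher_weighted_merge` and the HuggingFace-style loss call are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of diagonal-Fisher-weighted model merging, the full-scale
# scheme (Matena and Raffel, 2022) that this paper builds on, not the authors'
# mask-node method. Assumes the checkpoints were fine-tuned from the same
# pre-trained model and share a state_dict layout; the HuggingFace-style
# `model(**batch).loss` call and the function names are illustrative.
from typing import Dict, List

import torch

def diagonal_fisher(model: torch.nn.Module, dataloader,
                    n_batches: int = 32) -> Dict[str, torch.Tensor]:
    """Empirical diagonal Fisher: average squared gradient of the loss."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    model.eval()
    n_used = 0
    for batch in dataloader:
        if n_used >= n_batches:
            break
        model.zero_grad()
        loss = model(**batch).loss  # assumption: labels are included in the batch
        loss.backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
        n_used += 1
    return {n: f / max(n_used, 1) for n, f in fisher.items()}

def fisher_weighted_merge(state_dicts: List[Dict[str, torch.Tensor]],
                          fishers: List[Dict[str, torch.Tensor]],
                          eps: float = 1e-8) -> Dict[str, torch.Tensor]:
    """Each merged parameter is the Fisher-weighted average of the
    corresponding task-specific parameters across the input models."""
    merged = {}
    for name, value in state_dicts[0].items():
        if name in fishers[0]:
            num = sum(f[name] * sd[name] for sd, f in zip(state_dicts, fishers))
            den = sum(f[name] for f in fishers) + eps
            merged[name] = num / den
        else:
            # Non-parameter buffers (e.g. position ids): keep the first model's copy.
            merged[name] = value.clone()
    return merged
```

Per the abstract, the paper's method replaces the per-parameter Fisher estimates above with Fisher information computed at the Transformer's mask nodes (the per-head and per-neuron mask variables familiar from Fisher-based pruning) and uses those much cheaper scores as merging weights, which is what yields the reported 57.4x to 321.7x speedup over the full-scale scheme.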
- Git re-basin: Merging models modulo permutation symmetries.
- Ensemble of averages: Improving model selection and boosting performance in domain generalization.
- SWAD: Domain generalization by seeking flat minima.
- Fusing finetuned models for better pretraining.
- BERT: Pre-training of deep bidirectional transformers for language understanding.
- The role of permutation invariance in linear mode connectivity of neural networks.
- Efficiently identifying task groupings for multi-task learning. Advances in Neural Information Processing Systems, 34:27503–27516.
- Pre-trained models: Past, present and future. AI Open, 2:225–250.
- Measuring data leakage in machine-learning models with Fisher information.
- Editing models with task arithmetic.
- Dataless knowledge fusion by merging weights of language models.
- Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526.
- A fast post-training pruning framework for transformers. In Advances in Neural Information Processing Systems.
- On the convergence of FedAvg on non-IID data.
- Convergent learning: Do different neural networks learn the same representations?
- Group Fisher pruning for practical network compression. In International Conference on Machine Learning, pages 7021–7032. PMLR.
- RoBERTa: A robustly optimized BERT pretraining approach.
- Michael Matena and Colin Raffel. 2022. Merging models with fisher-weighted averaging.
- Communication-efficient learning of deep networks from decentralized data. In International Conference on Artificial Intelligence and Statistics.
- Recent advances in natural language processing via large pre-trained language models: A survey.
- Razvan Pascanu and Yoshua Bengio. 2014. Revisiting natural gradient for deep networks.
- Mary Phuong and Marcus Hutter. 2022. Formal algorithms for transformers.
- Pre-trained models for natural language processing: A survey. Science China Technological Sciences, 63(10):1872–1897.
- Sidak Pal Singh and Martin Jaggi. 2023. Model fusion via optimal transport.
- On the variance of the Fisher information for deep learning.
- Optimizing mode connectivity via neuron alignment.
- Well-read students learn better: On the importance of pre-training compact models.
- Attention is all you need. Advances in Neural Information Processing Systems, 30.
- GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.
- Model soups: Averaging weights of multiple fine-tuned models improves accuracy without increasing inference time.
- Resolving interference when merging models.
- A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering, 34(12):5586–5609.
- A comprehensive survey on pretrained foundation models: A history from BERT to ChatGPT.
Authors: Thennal D K, Ganesh Nathan, Suchithra M S