Fisher Mask Nodes for Language Model Merging (2403.09891v3)

Published 14 Mar 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Fine-tuning pre-trained models provides significant advantages in downstream performance. The ubiquitous nature of pre-trained models such as BERT and its derivatives in natural language processing has also led to a proliferation of task-specific fine-tuned models. As these models typically only perform one task well, additional training or ensembling is required in multi-task scenarios. The growing field of model merging provides a solution, dealing with the challenge of combining multiple task-specific models into a single multi-task model. In this study, we introduce a novel model merging method for Transformers, combining insights from previous work in Fisher-weighted averaging and the use of Fisher information in model pruning. Utilizing the Fisher information of mask nodes within the Transformer architecture, we devise a computationally efficient weighted-averaging scheme. Our method exhibits a consistent and significant performance increase across various models in the BERT family, outperforming full-scale Fisher-weighted averaging at a fraction of the computational cost, with baseline performance improvements of up to +6.5 and a speedup between 57.4x and 321.7x across models. Our results demonstrate the potential of our method in current multi-task learning environments and suggest its scalability and adaptability to new model architectures and learning scenarios.
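
For context, the sketch below illustrates the kind of diagonal Fisher-weighted parameter averaging the abstract builds on (in the spirit of Matena & Raffel, 2022): an empirical diagonal Fisher estimate per fine-tuned model, followed by a per-parameter weighted average. This is an assumed PyTorch implementation of the full-parameter baseline, not the paper's mask-node variant, and all function and variable names are illustrative rather than taken from the authors' code.

```python
# Sketch of diagonal Fisher-weighted averaging (full-parameter baseline).
# The paper's method instead computes Fisher information only for the
# Transformer's mask nodes, which is what yields its reported speedup;
# that restriction is not reproduced here.
import torch
import torch.nn.functional as F


def diagonal_fisher(model, data_loader, num_batches=32):
    """Estimate the empirical diagonal Fisher as the average squared
    gradient of the loss w.r.t. each parameter.

    Assumes `model(inputs)` returns classification logits.
    """
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    seen = 0
    for inputs, labels in data_loader:
        if seen >= num_batches:
            break
        model.zero_grad()
        loss = F.cross_entropy(model(inputs), labels)
        loss.backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
        seen += 1
    return {n: f / max(seen, 1) for n, f in fisher.items()}


def fisher_weighted_merge(models, fishers, eps=1e-8):
    """Per-parameter weighted average: theta* = sum_i F_i * theta_i / sum_i F_i."""
    merged = {}
    with torch.no_grad():
        param_dicts = [dict(m.named_parameters()) for m in models]
        for name in param_dicts[0]:
            num = sum(f[name] * p[name] for f, p in zip(fishers, param_dicts))
            den = sum(f[name] for f in fishers) + eps
            merged[name] = num / den
    return merged  # load with merged_model.load_state_dict(merged, strict=False)
```

As the abstract notes, the paper's contribution is to avoid the cost of estimating Fisher information for every parameter by using only the mask nodes of the Transformer, which is where the reported 57.4x to 321.7x speedups over full-scale Fisher-weighted averaging come from.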

Authors (3)
  1. Thennal D K (4 papers)
  2. Ganesh Nathan (1 paper)
  3. Suchithra M S (1 paper)
Citations (4)
