MULTIFLOW: Shifting Towards Task-Agnostic Vision-Language Pruning (2404.05621v1)
Abstract: While excellent at transfer learning, Vision-Language Models (VLMs) come with high computational costs due to their large number of parameters. Removing parameters via model pruning is a viable way to address this issue. However, existing techniques for VLMs are task-specific, and thus require pruning the network from scratch for each new task of interest. In this work, we explore a new direction: Task-Agnostic Vision-Language Pruning (TA-VLP). Given a pretrained VLM, the goal is to find a unique pruned counterpart transferable to multiple unknown downstream tasks. In this challenging setting, preserving the transferable representations already encoded in the pretrained model is a key aspect. We therefore propose Multimodal Flow Pruning (MULTIFLOW), a first gradient-free pruning framework for TA-VLP, in which: (i) the importance of a parameter is expressed in terms of its magnitude and its information flow, by incorporating the saliency of the neurons it connects; and (ii) pruning is driven by the emergent (multimodal) distribution of the VLM parameters after pretraining. We benchmark eight state-of-the-art pruning algorithms in the context of TA-VLP, experimenting with two VLMs, three vision-language tasks, and three pruning ratios. Our experimental results show that MULTIFLOW outperforms recent, sophisticated combinatorial competitors in the vast majority of cases, paving the way towards addressing TA-VLP. The code is publicly available at https://github.com/FarinaMatteo/multiflow.
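To make the scoring idea in point (i) concrete, below is a minimal PyTorch sketch of one plausible reading of a "magnitude times information flow" score: a weight's importance combines its own magnitude with the saliency of the two neurons it connects. This is not the authors' implementation (see the linked repository for that); the function names are hypothetical, the neuron-saliency definition used here (mean absolute incident weight) is an assumption, and the multimodal, pretraining-driven prior of point (ii) is omitted for brevity.

```python
import torch

def flow_scores(weight: torch.Tensor) -> torch.Tensor:
    """Hypothetical magnitude-times-information-flow score for one
    linear layer with a weight matrix of shape (out_dim, in_dim).

    Assumption (not stated in the abstract): a neuron's saliency is
    the mean absolute magnitude of the weights incident to it.
    """
    magnitude = weight.abs()
    in_saliency = magnitude.mean(dim=0)   # one value per input neuron
    out_saliency = magnitude.mean(dim=1)  # one value per output neuron
    # Each weight's score blends its own magnitude with the saliency
    # of the input and output neurons it connects (via broadcasting).
    return magnitude * out_saliency.unsqueeze(1) * in_saliency.unsqueeze(0)

def prune_by_global_threshold(weights: dict, sparsity: float = 0.75) -> dict:
    """Keep the top-(1 - sparsity) fraction of weights model-wide and
    return a {layer_name: binary_mask} dict. A single global threshold
    is a simplification; MULTIFLOW additionally shapes per-layer budgets
    with the emergent multimodal parameter distribution."""
    scores = {name: flow_scores(w) for name, w in weights.items()}
    flat = torch.cat([s.flatten() for s in scores.values()])
    threshold = torch.quantile(flat, sparsity)
    return {name: (s > threshold).float() for name, s in scores.items()}

# Usage on toy layers standing in for a VLM's linear weights:
toy = {"vision.fc": torch.randn(8, 16), "text.fc": torch.randn(8, 16)}
masks = prune_by_global_threshold(toy, sparsity=0.75)
print({k: int(v.sum().item()) for k, v in masks.items()})  # surviving weights per layer
```

Note that the sketch is gradient-free, matching the framework's stated property: scores depend only on weight magnitudes, with no forward or backward passes over data.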