HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models (2403.13447v1)
Abstract: Recent advancements indicate that scaling up Multimodal Large Language Models (MLLMs) effectively enhances performance on downstream multimodal tasks. The prevailing MLLM paradigm, e.g., LLaVA, transforms visual features into text-like tokens with a static vision-language mapper, enabling a static LLM to comprehend visual information through visual instruction tuning. Although promising, this static tuning strategy (the trained model keeps the same parameters across all inputs) may constrain performance on different downstream multimodal tasks. In light of this, we introduce HyperLLaVA, which adaptively tunes the projector and LLM parameters with a dynamic visual expert and a dynamic language expert, respectively. These experts are derived from HyperNetworks, which generate adaptive parameter shifts from visual and language guidance, enabling dynamic projector and LLM modeling across the two-stage training. Our experiments demonstrate that our solution significantly surpasses LLaVA on existing MLLM benchmarks, including MME, MMBench, SEED-Bench, and LLaVA-Bench. Our project is available at https://github.com/DCDmLLM/HyperLLaVA.
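To make the mechanism concrete, below is a minimal PyTorch sketch of the idea described in the abstract: a hypernetwork-style "visual expert" that maps a guidance vector (here, mean-pooled visual tokens) to a low-rank, sample-conditioned parameter shift that is added on top of a static vision-language projector. This is an illustrative sketch under assumptions, not the authors' implementation; the class names, dimensions (1024-d visual features, 4096-d LLM embeddings), rank, and the pooling choice are all hypothetical.

```python
import torch
import torch.nn as nn


class HyperExpert(nn.Module):
    """Minimal hypernetwork: maps a guidance vector (e.g. pooled visual features)
    to a low-rank, per-sample weight shift for a linear layer. Illustrative only."""

    def __init__(self, guidance_dim: int, in_dim: int, out_dim: int, rank: int = 8):
        super().__init__()
        self.in_dim, self.out_dim, self.rank = in_dim, out_dim, rank
        # The hypernetwork head emits the two low-rank factors of the shift.
        self.to_factors = nn.Sequential(
            nn.Linear(guidance_dim, 256),
            nn.GELU(),
            nn.Linear(256, rank * (in_dim + out_dim)),
        )

    def forward(self, guidance: torch.Tensor) -> torch.Tensor:
        # guidance: (batch, guidance_dim) -> per-sample shift (batch, out_dim, in_dim)
        factors = self.to_factors(guidance)
        a, b = factors.split([self.rank * self.out_dim, self.rank * self.in_dim], dim=-1)
        a = a.view(-1, self.out_dim, self.rank)
        b = b.view(-1, self.rank, self.in_dim)
        return torch.bmm(a, b)  # low-rank delta W


class DynamicProjector(nn.Module):
    """Static vision-language projector plus a hypernetwork-generated shift."""

    def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096, rank: int = 8):
        super().__init__()
        self.static_proj = nn.Linear(vis_dim, llm_dim)          # shared (static) mapper
        self.visual_expert = HyperExpert(vis_dim, vis_dim, llm_dim, rank)

    def forward(self, vis_tokens: torch.Tensor) -> torch.Tensor:
        # vis_tokens: (batch, num_patches, vis_dim)
        guidance = vis_tokens.mean(dim=1)                        # visual guidance vector
        delta_w = self.visual_expert(guidance)                   # (batch, llm_dim, vis_dim)
        static_out = self.static_proj(vis_tokens)
        dynamic_out = torch.einsum("bpv,bov->bpo", vis_tokens, delta_w)
        return static_out + dynamic_out                          # text-like tokens for the LLM


if __name__ == "__main__":
    proj = DynamicProjector()
    tokens = proj(torch.randn(2, 576, 1024))
    print(tokens.shape)  # torch.Size([2, 576, 4096])
```

The same pattern would apply analogously to a language expert inside the LLM blocks, with language features as the guidance signal; the low-rank factorization is one plausible way to keep the generated parameter shift small.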
- GPT-4 technical report. arXiv preprint arXiv:2303.08774.
- Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736.
- SpeechT5: Unified-modal encoder-decoder pre-training for spoken language processing. arXiv preprint arXiv:2110.07205.
- Qwen-VL: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966.
- Multimodal machine learning: A survey and taxonomy. IEEE transactions on pattern analysis and machine intelligence, 41(2):423–443.
- SMASH: one-shot model architecture search through hypernetworks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net.
- Shikra: Unleashing multimodal LLM’s referential dialogue magic. arXiv preprint arXiv:2306.15195.
- ShareGPT4V: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793.
- PaLM: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113.
- InstructBLIP: Towards general-purpose vision-language models with instruction tuning. arXiv preprint, 2023.
- BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394.
- MultiModal-GPT: A vision and language model for dialogue with humans. arXiv preprint arXiv:2305.04790.
- Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913.
- VizWiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3608–3617.
- HyperNetworks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net.
- Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pages 2790–2799. PMLR.
- LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
- Language is not all you need: Aligning perception with language models. arXiv preprint arXiv:2302.14045.
- Drew A Hudson and Christopher D Manning. 2019. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709.
- OBELICS: An open web-scale filtered dataset of interleaved image-text documents.
- MIMIC-IT: Multi-modal in-context instruction tuning. arXiv preprint arXiv:2306.05425.
- SEED-Bench: Benchmarking multimodal LLMs with generative comprehension. arXiv preprint arXiv:2307.16125.
- Fine-tuning multimodal llms to follow zero-shot demonstrative instructions. In The Twelfth International Conference on Learning Representations.
- Variational cross-graph reasoning and adaptive structured semantics learning for compositional temporal grounding. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597.
- BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR.
- Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355.
- Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744.
- Visual instruction tuning. arXiv preprint arXiv:2304.08485.
- MMBench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281.
- Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
- Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521.
- An empirical study of scaling instruct-tuned large multimodal models. arXiv preprint arXiv:2309.09958.
- IDEAL: Toward high-efficiency device-cloud collaborative and dynamic recommendation system. arXiv preprint arXiv:2302.07335.
- DUET: A tuning-free device-cloud collaborative parameters generation framework for efficient device model generalization. In Proceedings of the ACM Web Conference 2023.
- Parameter-efficient multi-task fine-tuning for transformers via shared hypernetworks. arXiv preprint arXiv:2106.04489.
- Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155.
- Scaling speech technology to 1,000+ languages. arXiv preprint arXiv:2305.13516.
- Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR.
- Improving language understanding by generative pre-training.
- Searching for activation functions. arXiv preprint arXiv:1710.05941.
- FitNets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550.
- Towards VQA models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326.
- PandaGPT: One model to instruction-follow them all. arXiv preprint arXiv:2305.16355.
- DeeBERT: Dynamic early exiting for accelerating BERT inference. arXiv preprint arXiv:2004.12993.
- mPLUG-Owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178.
- MM-Vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490.
- Graph hypernetworks for neural architecture search. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.
- MM-LLMs: Recent advances in multimodal large language models. arXiv preprint arXiv:2401.13601.
- Revisiting the domain shift and sample uncertainty in multi-source active domain transfer. arXiv preprint arXiv:2311.12905.
- MAGIC: Multimodal relational graph adversarial inference for diverse and unpaired text-based image captioning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 3335–3343.
- Frame augmented alternating attention network for video question answering. IEEE Transactions on Multimedia, 22(4):1032–1041.
- BoostMIS: Boosting medical image semi-supervised learning with adaptive pseudo labeling and informative active annotation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20666–20676.
- Enhanced visual instruction tuning for text-rich image understanding. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following.
- SVIT: Scaling up visual instruction tuning. arXiv preprint arXiv:2307.04087.
- InfMLLM: A unified framework for visual-language tasks. arXiv preprint arXiv:2311.06791.