CoTBal: Comprehensive Task Balancing for Multi-Task Visual Instruction Tuning (2403.04343v2)
Abstract: Visual instruction tuning is a key training stage for large multimodal models. However, learning multiple visual tasks simultaneously can yield suboptimal and imbalanced overall performance because of latent knowledge conflicts across tasks. To mitigate this issue, we introduce Comprehensive Task Balancing (CoTBal), a novel algorithm tailored for multi-task visual instruction tuning. To our knowledge, this is the first work to explore multi-task optimization in visual instruction tuning. Specifically, we consider two critical dimensions for task balancing: (1) Inter-Task Contribution, the phenomenon whereby learning one task can improve performance on others owing to overlapping knowledge domains across tasks, and (2) Intra-Task Difficulty, the inherent learning difficulty of a single task. We quantify both dimensions with performance-based metrics and achieve comprehensive task balancing by assigning greater weight to tasks that contribute substantially to others, receive minimal contributions from others, and are difficult to learn. Extensive experiments on three benchmarks demonstrate that CoTBal yields superior and more balanced overall performance in multi-task visual instruction tuning.
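The weighting idea in the abstract can be illustrated with a minimal sketch. The abstract does not give CoTBal's formulas, so everything below is an assumption made for illustration: the `contribution` matrix, the `difficulty` scores, the way they are combined (contribution given minus contribution received plus difficulty), and the softmax normalization are all hypothetical and should not be read as the paper's method.

```python
import math


def task_weights(contribution, difficulty, temperature=1.0):
    """Illustrative task weights (NOT the paper's formula).

    contribution[i][j]: hypothetical score for how much training on
        task i improves performance on task j.
    difficulty[i]: hypothetical learning-difficulty score for task i.
    Returns one weight per task; the weights sum to the number of tasks.
    """
    n = len(difficulty)
    scores = []
    for i in range(n):
        # Contribution task i offers to the other tasks.
        given = sum(contribution[i][j] for j in range(n) if j != i)
        # Contribution task i receives from the other tasks.
        received = sum(contribution[j][i] for j in range(n) if j != i)
        # Assumed combination: favor tasks that give much, receive
        # little, and are hard to learn.
        scores.append((given - received + difficulty[i]) / temperature)

    # Numerically stable softmax, rescaled so the mean weight is 1.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [n * e / total for e in exps]


def weighted_loss(losses, weights):
    """Combine per-task losses into a single training objective."""
    return sum(w * l for w, l in zip(weights, losses))


# Toy example with three tasks: task 0 helps the others the most.
contribution = [
    [0.0, 0.5, 0.2],
    [0.1, 0.0, 0.1],
    [0.0, 0.0, 0.0],
]
difficulty = [0.3, 0.3, 0.3]
weights = task_weights(contribution, difficulty)
total_loss = weighted_loss([1.0, 1.0, 1.0], weights)
```

In this toy setup task 0 receives the largest weight because it contributes the most to the other tasks while receiving little in return. Rescaling the weights to a mean of 1 keeps the overall loss magnitude comparable to unweighted training, which is a design choice of this sketch rather than something stated in the abstract.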