OmniFusion Technical Report (2404.06212v1)
Abstract: Last year, multimodal architectures brought about a revolution in AI-based approaches and solutions, extending the capabilities of large language models (LLMs). We propose an \textit{OmniFusion} model based on a pretrained LLM and adapters for the visual modality. We evaluated and compared several architecture design choices for better coupling of text and visual data: MLP and transformer adapters, various CLIP ViT-based encoders (SigLIP, InternViT, etc.) and ways of fusing them, the image encoding method (whole-image or tile encoding), and two 7B LLMs (a proprietary one and the open-source Mistral). Experiments on 8 visual-language benchmarks (VizWiz, POPE, MM-Vet, ScienceQA, MMBench, TextVQA, VQAv2, MMMU) show that the best OmniFusion setup achieves the top scores on various VQA tasks in comparison with open-source LLaVA-like solutions. We also present a variety of situations in which OmniFusion provides highly detailed answers across different domains: housekeeping, sightseeing, culture, medicine, recognition of handwritten and scanned equations, etc. The Mistral-based OmniFusion model is an open-source solution, with weights and training and inference scripts available at https://github.com/AIRI-Institute/OmniFusion.
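The abstract describes the architecture at a high level: a pretrained vision encoder (e.g., a CLIP ViT or SigLIP model) produces image features, and a trainable adapter (MLP or transformer) projects them into the LLM's embedding space, where they are fused with the text tokens. The sketch below is a minimal illustration of the MLP-adapter variant, not the authors' implementation; the dimensions (1152 for vision features, 4096 for the LLM hidden size), the module name `MLPAdapter`, and the prepend-visual-tokens fusion strategy are assumptions made for illustration only.

```python
# Minimal sketch (not the authors' code) of the adapter idea from the abstract:
# a frozen vision encoder yields patch embeddings, a small trainable MLP adapter
# maps them into the LLM embedding space, and the projected "visual tokens" are
# prepended to the text token embeddings before the frozen LLM.
import torch
import torch.nn as nn


class MLPAdapter(nn.Module):
    """Two-layer MLP projecting vision features to the LLM hidden size (illustrative dims)."""

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 4096, hidden: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim) from e.g. a SigLIP/CLIP ViT encoder
        return self.proj(vision_feats)  # (batch, num_patches, llm_dim)


# Usage: fuse projected visual tokens with text embeddings before feeding the LLM.
adapter = MLPAdapter()
vision_feats = torch.randn(1, 729, 1152)   # placeholder SigLIP-like patch features
text_embeds = torch.randn(1, 32, 4096)     # placeholder LLM token embeddings
visual_tokens = adapter(vision_feats)
llm_inputs = torch.cat([visual_tokens, text_embeds], dim=1)  # prepend image tokens
```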
- Flamingo: a visual language model for few-shot learning. arXiv:2204.14198, 2022.
- Llama-adapter v2: Parameter-efficient visual instruction model. arXiv:2304.15010, 2023.
- Video-llava: Learning united visual representation by alignment before projection. arXiv:2311.10122, 2023.
- Llava-plus: Learning to use tools for creating multimodal agents. arXiv:2311.05437, 2023.
- Llava-med: Training a large language-and-vision assistant for biomedicine in one day. arXiv:2306.00890, 2023.
- A challenger to gpt-4v? Early explorations of gemini in visual expertise. arXiv:2312.12436, 2023.
- Learning transferable visual models from natural language supervision, 2021.
- Sigmoid loss for language image pre-training, 2023.
- Llava-next: Improved reasoning, ocr, and world knowledge, January 2024.
- Vizwiz grand challenge: Answering visual questions from blind people, 2018.
- Evaluating object hallucination in large vision-language models, 2023.
- Mm-vet: Evaluating large multimodal models for integrated capabilities, 2023.
- Learn to explain: Multimodal reasoning via thought chains for science question answering, 2022.
- Mmbench: Is your multi-modal model an all-around player?, 2023.
- Towards vqa models that can read, 2019.
- Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6325–6334, 2017.
- Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi, 2023.
- Improved baselines with visual instruction tuning. arXiv:2310.03744, 2023.
- Vision-flan: Scaling human-labeled tasks in visual instruction tuning, 2024.
- Sharegpt4v: Improving large multi-modal models with better captions, 2023.
- Microsoft coco captions: Data collection and evaluation server, 2015.
- Segment anything, 2023.
- Internlm2 technical report, 2024.
- Towards vqa models that can read. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8317–8326, 2019.
- Ocr-vqa: Visual question answering by reading text in images. In ICDAR, 2019.
- Laion-5b: An open large-scale dataset for training next generation image-text models, 2022.
- Openclip, July 2021.
- Internlm-xcomposer2: Mastering free-form text-image composition and comprehension in vision-language large model, 2024.
- Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv:2312.14238, 2023.
- From clip to dino: Visual encoders shout in multi-modal large language models, 2024.
- Icfhr2016 crohme: Competition on recognition of online handwritten mathematical expressions. In 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pages 607–612, 2016.
- Ocr-free document understanding transformer. In European Conference on Computer Vision, pages 498–517. Springer, 2022.
- Lmms-eval: Accelerating the development of large multimodal models, March 2024.
- Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. 2023.
- Mini-gemini: Mining the potential of multi-modality vision language models, 2024.
- Deepseek-vl: Towards real-world vision-language understanding, 2024.
- Judging llm-as-a-judge with mt-bench and chatbot arena, 2023.
- Grounding language models to images for multimodal generation. arXiv:2301.13823, 2023.
- Visual instruction tuning. arXiv:2304.08485, 2023.
- Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv:2304.10592, 2023.
- Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, 2023.
- Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv:2305.06500, 2023.
- Bootstrapping vision-language learning with decoupled language pre-training. arXiv:2307.07063, 2023.
- Lyrics: Boosting fine-grained language-vision alignment and comprehension via semantic-aware visual objects. arXiv:2312.05278, 2023.
- Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv:2303.16199, 2023.
- Infmllm: A unified framework for visual-language tasks. arXiv:2311.06791, 2023.
- Cosmo: Contrastive streamlined multimodal model with interleaved pre-training, 2024.
- Kosmos-2: Grounding multimodal large language models to the world, 2023.
- Pali: A jointly-scaled multilingual language-image model, 2023.
- Llava-grounding: Grounded visual chat with large multimodal models. arXiv:2312.02949, 2023.
- Moe-llava: Mixture of experts for large vision-language models. arXiv:2401.15947, 2024.
- Llava-phi: Efficient multi-modal assistant with small language model. arXiv:2401.02330, 2024.
- Vila: On pre-training for visual language models. arXiv:2312.07533, 2023.
- Jack of all tasks, master of many: Designing general-purpose coarse-to-fine vision-language model. arXiv:2312.12423, 2023.
- Kandinsky: An improved text-to-image synthesis with image prior and latent diffusion. In Yansong Feng and Els Lefever, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 286–295, Singapore, December 2023. Association for Computational Linguistics.