OmniFusion Technical Report (2404.06212v1)
Abstract: Last year, multimodal architectures brought about a revolution in AI-based approaches and solutions, extending the capabilities of large language models (LLMs). We propose the \textit{OmniFusion} model, based on a pretrained LLM with adapters for the visual modality. We evaluated and compared several architecture design choices for better coupling of text and visual data: MLP and transformer adapters, various CLIP ViT-based encoders (SigLIP, InternViT, etc.) and approaches to fusing them, the image encoding method (whole-image or tile encoding), and two 7B LLMs (a proprietary one and the open-source Mistral). Experiments on eight visual-language benchmarks (VizWiz, POPE, MM-Vet, ScienceQA, MMBench, TextVQA, VQAv2, MMMU) show that the best OmniFusion setup achieves top scores across different VQA tasks in comparison with open-source LLaVA-like solutions. We also present a variety of scenarios in which OmniFusion provides highly detailed answers in different domains: housekeeping, sightseeing, culture, medicine, recognition of handwritten and scanned equations, etc. The Mistral-based OmniFusion model is an open-source solution with weights, training and inference scripts available at https://github.com/AIRI-Institute/OmniFusion.
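The abstract describes a LLaVA-style design: a pretrained LLM coupled with a trainable adapter that maps visual-encoder features into the LLM's embedding space. The sketch below is a minimal illustration of that idea only, not the authors' released code; the class name, feature dimensions, and the concatenation step are illustrative assumptions (see the repository linked above for the actual implementation).

```python
# Minimal sketch of an MLP adapter for visual features, assuming CLIP-ViT-like
# patch tokens and a 7B LLM hidden size of 4096. Illustrative only.
import torch
import torch.nn as nn


class MLPAdapter(nn.Module):
    """Two-layer MLP mapping visual-encoder features to the LLM embedding size."""

    def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096, hidden: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vis_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, llm_dim),
        )

    def forward(self, vis_tokens: torch.Tensor) -> torch.Tensor:
        # vis_tokens: (batch, num_patches, vis_dim) -> (batch, num_patches, llm_dim)
        return self.proj(vis_tokens)


if __name__ == "__main__":
    adapter = MLPAdapter()
    clip_features = torch.randn(1, 256, 1024)   # e.g. ViT patch tokens (assumed shape)
    visual_embeds = adapter(clip_features)
    text_embeds = torch.randn(1, 32, 4096)      # LLM token embeddings (assumed shape)
    # Visual embeddings are prepended to the text embeddings along the sequence
    # dimension before being fed to the LLM, as in LLaVA-like pipelines.
    llm_inputs = torch.cat([visual_embeds, text_embeds], dim=1)
    print(llm_inputs.shape)  # torch.Size([1, 288, 4096])
```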
- Flamingo: a visual language model for few-shot learning. ArXiv, abs/2204.14198, 2022.
- Llama-adapter v2: Parameter-efficient visual instruction model. ArXiv, abs/2304.15010, 2023.
- Video-llava: Learning united visual representation by alignment before projection. ArXiv, abs/2311.10122, 2023.
- Llava-plus: Learning to use tools for creating multimodal agents. ArXiv, abs/2311.05437, 2023.
- Llava-med: Training a large language-and-vision assistant for biomedicine in one day. ArXiv, abs/2306.00890, 2023.
- A challenger to gpt-4v? early explorations of gemini in visual expertise. arXiv preprint arXiv:2312.12436, 2023.
- Learning transferable visual models from natural language supervision, 2021.
- Sigmoid loss for language image pre-training, 2023.
- Llava-next: Improved reasoning, ocr, and world knowledge, January 2024.
- Vizwiz grand challenge: Answering visual questions from blind people, 2018.
- Evaluating object hallucination in large vision-language models, 2023.
- Mm-vet: Evaluating large multimodal models for integrated capabilities, 2023.
- Learn to explain: Multimodal reasoning via thought chains for science question answering, 2022.
- Mmbench: Is your multi-modal model an all-around player?, 2023.
- Towards vqa models that can read, 2019.
- Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6325–6334, 2017.
- Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi, 2023.
- Improved baselines with visual instruction tuning. ArXiv, abs/2310.03744, 2023.
- Vision-flan: Scaling human-labeled tasks in visual instruction tuning, 2024.
- Sharegpt4v: Improving large multi-modal models with better captions, 2023.
- Microsoft coco captions: Data collection and evaluation server, 2015.
- Segment anything, 2023.
- Internlm2 technical report, 2024.
- Towards vqa models that can read. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8317–8326, 2019.
- Ocr-vqa: Visual question answering by reading text in images. In ICDAR, 2019.
- Laion-5b: An open large-scale dataset for training next generation image-text models, 2022.
- OpenCLIP, July 2021.
- Internlm-xcomposer2: Mastering free-form text-image composition and comprehension in vision-language large model, 2024.
- Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023.
- From clip to dino: Visual encoders shout in multi-modal large language models, 2024.
- ICFHR2016 CROHME: Competition on recognition of online handwritten mathematical expressions. In 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pages 607–612, 2016.
- Ocr-free document understanding transformer. In European Conference on Computer Vision, pages 498–517. Springer, 2022.
- Lmms-eval: Accelerating the development of large multimodal models, March 2024.
- Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. 2023.
- Mini-gemini: Mining the potential of multi-modality vision language models, 2024.
- Deepseek-vl: Towards real-world vision-language understanding, 2024.
- Judging llm-as-a-judge with mt-bench and chatbot arena, 2023.
- Grounding language models to images for multimodal generation. ArXiv, abs/2301.13823, 2023.
- Visual instruction tuning. ArXiv, abs/2304.08485, 2023.
- Minigpt-4: Enhancing vision-language understanding with advanced large language models. ArXiv, abs/2304.10592, 2023.
- Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, 2023.
- Instructblip: Towards general-purpose vision-language models with instruction tuning. ArXiv, abs/2305.06500, 2023.
- Bootstrapping vision-language learning with decoupled language pre-training. ArXiv, abs/2307.07063, 2023.
- Lyrics: Boosting fine-grained language-vision alignment and comprehension via semantic-aware visual objects. ArXiv, abs/2312.05278, 2023.
- Llama-adapter: Efficient fine-tuning of language models with zero-init attention. ArXiv, abs/2303.16199, 2023.
- Infmllm: A unified framework for visual-language tasks. ArXiv, abs/2311.06791, 2023.
- Cosmo: Contrastive streamlined multimodal model with interleaved pre-training, 2024.
- Kosmos-2: Grounding multimodal large language models to the world, 2023.
- Pali: A jointly-scaled multilingual language-image model, 2023.
- Llava-grounding: Grounded visual chat with large multimodal models. ArXiv, abs/2312.02949, 2023.
- Moe-llava: Mixture of experts for large vision-language models. ArXiv, abs/2401.15947, 2024.
- Llava-phi: Efficient multi-modal assistant with small language model. ArXiv, abs/2401.02330, 2024.
- Vila: On pre-training for visual language models. ArXiv, abs/2312.07533, 2023.
- Jack of all tasks, master of many: Designing general-purpose coarse-to-fine vision-language model. ArXiv, abs/2312.12423, 2023.
- Kandinsky: An improved text-to-image synthesis with image prior and latent diffusion. In Yansong Feng and Els Lefever, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 286–295, Singapore, December 2023. Association for Computational Linguistics.
Authors:
- Elizaveta Goncharova
- Anton Razzhigaev
- Matvey Mikhalchuk
- Maxim Kurkin
- Irina Abdullaeva
- Matvey Skripkin
- Ivan Oseledets
- Denis Dimitrov
- Andrey Kuznetsov