OmniFusion Technical Report (2404.06212v1)

Published 9 Apr 2024 in cs.CV, cs.AI, and cs.LG

Abstract: Last year, multimodal architectures served up a revolution in AI-based approaches and solutions, extending the capabilities of large language models (LLMs). We propose an OmniFusion model based on a pretrained LLM and adapters for visual modality. We evaluated and compared several architecture design principles for better text and visual data coupling: MLP and transformer adapters, various CLIP ViT-based encoders (SigLIP, InternVIT, etc.), and their fusing approach, image encoding method (whole image or tiles encoding) and two 7B LLMs (the proprietary one and open-source Mistral). Experiments on 8 visual-language benchmarks show the top score for the best OmniFusion setup in terms of different VQA tasks in comparison with open-source LLaVA-like solutions: VizWiz, Pope, MM-Vet, ScienceQA, MMBench, TextVQA, VQAv2, MMMU. We also propose a variety of situations, where OmniFusion provides highly-detailed answers in different domains: housekeeping, sightseeing, culture, medicine, handwritten and scanned equations recognition, etc. Mistral-based OmniFusion model is an open-source solution with weights, training and inference scripts available at https://github.com/AIRI-Institute/OmniFusion.

References (55)
  1. Flamingo: a visual language model for few-shot learning. ArXiv, abs/2204.14198, 2022.
  2. Llama-adapter v2: Parameter-efficient visual instruction model. ArXiv, abs/2304.15010, 2023.
  3. Video-llava: Learning united visual representation by alignment before projection. ArXiv, abs/2311.10122, 2023.
  4. Llava-plus: Learning to use tools for creating multimodal agents. ArXiv, abs/2311.05437, 2023.
  5. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. ArXiv, abs/2306.00890, 2023.
  6. A challenger to gpt-4v? early explorations of gemini in visual expertise. arXiv preprint arXiv:2312.12436, 2023.
  7. Learning transferable visual models from natural language supervision, 2021.
  8. Sigmoid loss for language image pre-training, 2023.
  9. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024.
  10. Vizwiz grand challenge: Answering visual questions from blind people, 2018.
  11. Evaluating object hallucination in large vision-language models, 2023.
  12. Mm-vet: Evaluating large multimodal models for integrated capabilities, 2023.
  13. Learn to explain: Multimodal reasoning via thought chains for science question answering, 2022.
  14. Mmbench: Is your multi-modal model an all-around player?, 2023.
  15. Towards vqa models that can read, 2019.
  16. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6325–6334, 2017.
  17. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi, 2023.
  18. Improved baselines with visual instruction tuning. ArXiv, abs/2310.03744, 2023.
  19. Vision-flan: Scaling human-labeled tasks in visual instruction tuning, 2024.
  20. Sharegpt4v: Improving large multi-modal models with better captions, 2023.
  21. Microsoft coco captions: Data collection and evaluation server, 2015.
  22. Segment anything, 2023.
  23. Internlm2 technical report, 2024.
  24. Towards vqa models that can read. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8317–8326, 2019.
  25. Ocr-vqa: Visual question answering by reading text in images. In ICDAR, 2019.
  26. Laion-5b: An open large-scale dataset for training next generation image-text models, 2022.
  27. OpenCLIP, July 2021.
  28. Internlm-xcomposer2: Mastering free-form text-image composition and comprehension in vision-language large model, 2024.
  29. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023.
  30. From clip to dino: Visual encoders shout in multi-modal large language models, 2024.
  31. Icfhr2016 crohme: Competition on recognition of online handwritten mathematical expressions. In 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pages 607–612, 2016.
  32. Ocr-free document understanding transformer. In European Conference on Computer Vision, pages 498–517. Springer, 2022.
  33. Lmms-eval: Accelerating the development of large multimodal models, March 2024.
  34. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. 2023.
  35. Mini-gemini: Mining the potential of multi-modality vision language models, 2024.
  36. Deepseek-vl: Towards real-world vision-language understanding, 2024.
  37. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023.
  38. Grounding language models to images for multimodal generation. ArXiv, abs/2301.13823, 2023.
  39. Visual instruction tuning. ArXiv, abs/2304.08485, 2023.
  40. Minigpt-4: Enhancing vision-language understanding with advanced large language models. ArXiv, abs/2304.10592, 2023.
  41. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, 2023.
  42. Instructblip: Towards general-purpose vision-language models with instruction tuning. ArXiv, abs/2305.06500, 2023.
  43. Bootstrapping vision-language learning with decoupled language pre-training. ArXiv, abs/2307.07063, 2023.
  44. Lyrics: Boosting fine-grained language-vision alignment and comprehension via semantic-aware visual objects. ArXiv, abs/2312.05278, 2023.
  45. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. ArXiv, abs/2303.16199, 2023.
  46. Infmllm: A unified framework for visual-language tasks. ArXiv, abs/2311.06791, 2023.
  47. Cosmo: Contrastive streamlined multimodal model with interleaved pre-training, 2024.
  48. Kosmos-2: Grounding multimodal large language models to the world, 2023.
  49. Pali: A jointly-scaled multilingual language-image model, 2023.
  50. Llava-grounding: Grounded visual chat with large multimodal models. ArXiv, abs/2312.02949, 2023.
  51. Moe-llava: Mixture of experts for large vision-language models. ArXiv, abs/2401.15947, 2024.
  52. Llava-phi: Efficient multi-modal assistant with small language model. ArXiv, abs/2401.02330, 2024.
  53. Vila: On pre-training for visual language models. ArXiv, abs/2312.07533, 2023.
  54. Jack of all tasks, master of many: Designing general-purpose coarse-to-fine vision-language model. ArXiv, abs/2312.12423, 2023.
  55. Kandinsky: An improved text-to-image synthesis with image prior and latent diffusion. In Yansong Feng and Els Lefever, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 286–295, Singapore, December 2023. Association for Computational Linguistics.
Authors (9)
  1. Elizaveta Goncharova (10 papers)
  2. Anton Razzhigaev (14 papers)
  3. Matvey Mikhalchuk (6 papers)
  4. Maxim Kurkin (2 papers)
  5. Irina Abdullaeva (3 papers)
  6. Matvey Skripkin (4 papers)
  7. Ivan Oseledets (187 papers)
  8. Denis Dimitrov (27 papers)
  9. Andrey Kuznetsov (36 papers)
Citations (3)

Summary

  • The paper introduces OmniFusion, which couples a pretrained LLM with specialized visual adapters for joint text-image processing.
  • It compares transformer- and MLP-based adapters, several vision encoders and ways of fusing their features, and both whole-image and tiled image encoding.
  • Grid splitting of high-resolution images and multi-encoder feature mixing improve performance on VQA, OCR, and document benchmarks, pointing toward broader multimodal applications.

Overview of the OmniFusion Technical Report

The paper "OmniFusion Technical Report" introduces the OmniFusion model, a novel approach in the field of multimodal architectures that combines pretrained LLMs with specialized adapters for visual modalities. This integration serves to enhance the joint processing capabilities of text and images, aiming to address the inherent challenges in multimodal data coupling. The paper undertakes a comprehensive assessment of architectural design strategies, including the employment of MLP and transformer adapters, diverse image encoders such as CLIP-ViT variants, and their corresponding image encoding methodologies.

Central to OmniFusion's design is its flexibility in image encoding: the model supports both whole-image encoding and tiled encoding, in which the image is split into parts that are encoded separately. This flexibility helps preserve fine-grained visual detail and contributes to strong results across visual-language benchmarks, spanning visual question answering (VQA) as well as domain-specific applications such as culture, medicine, and handwritten equation recognition.

Model Architecture and Training

The core architecture integrates a pretrained LLM with adapters designed to process visual embeddings. The adapter-based approach avoids the heavy computational demands of end-to-end training pipelines, which typically require vast interleaved image-text datasets. The two primary design decisions are the choice of adapter technique and the strategy for encoding visual data.

Visual and textual modalities are aligned through special trainable embeddings that mark the boundaries of the visual token sequence in the LLM input. The visual features themselves are projected into the language model's embedding space by either a transformer adapter or a two-layer MLP, optionally merging features from distinct encoders such as CLIP-ViT-L and DINO-v2.
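
A minimal sketch of such an adapter, assuming a two-layer MLP projection and learnable boundary embeddings (module names and dimensions are hypothetical, not the authors' released code):

```python
import torch
import torch.nn as nn

class VisualAdapter(nn.Module):
    """Projects frozen vision-encoder features into the LLM embedding space."""

    def __init__(self, vision_dim: int, llm_dim: int, hidden_dim: int = 4096):
        super().__init__()
        # Two-layer MLP projection, one of the adapter options discussed in the paper.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, llm_dim),
        )
        # Trainable embeddings that demarcate the visual token span in the LLM input.
        self.img_start = nn.Parameter(torch.randn(1, 1, llm_dim) * 0.02)
        self.img_end = nn.Parameter(torch.randn(1, 1, llm_dim) * 0.02)

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim), e.g. from CLIP-ViT-L
        tokens = self.proj(vision_feats)  # (batch, num_patches, llm_dim)
        batch = tokens.shape[0]
        return torch.cat(
            [self.img_start.expand(batch, -1, -1), tokens, self.img_end.expand(batch, -1, -1)],
            dim=1,
        )
```

The resulting sequence would be concatenated with the text token embeddings before being passed to the LLM; when features from two encoders are mixed, vision_dim is simply the combined feature dimensionality.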

Training Regimen

OmniFusion's training unfolds in two stages. First, the adapter and the special tokens are pretrained on extensive datasets of image-text pairs, teaching the adapter to map visual features into the language model's embedding space. Second, the model is fine-tuned on instructional dialogues, using task-specific datasets to strengthen the integration of textual and visual information and to mitigate the pitfalls of synthetic data.
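
A schematic of this two-stage regime, under the assumption that the LLM stays frozen while the adapter is pretrained and may be unfrozen for instruction tuning (function names, loaders, and learning rates are illustrative):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader

def run_stage(model: nn.Module, loader: DataLoader, params, lr: float, epochs: int = 1):
    """One training stage: optimize only `params` with a next-token prediction loss."""
    optimizer = torch.optim.AdamW(params, lr=lr)
    model.train()
    for _ in range(epochs):
        for batch in loader:
            loss = model(**batch).loss  # assumes a HF-style forward that returns .loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

def train_omnifusion(model, adapter, llm, caption_loader, instruction_loader):
    # Stage 1: freeze the LLM; pretrain the adapter and its special embeddings
    # on large collections of image-text pairs.
    for p in llm.parameters():
        p.requires_grad = False
    run_stage(model, caption_loader, adapter.parameters(), lr=1e-4)

    # Stage 2: fine-tune on instructional dialogues; the LLM can be unfrozen here.
    for p in llm.parameters():
        p.requires_grad = True
    run_stage(model, instruction_loader, model.parameters(), lr=2e-5)
```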

Experimental Insights

The experimental analysis investigates various vision encoders and adapter options, showing that larger image encoders such as InternViT-6B-448px-V1-2 deliver the best overall performance across multiple benchmarks. In addition, mixing features from multiple encoders improves certain task-specific metrics.
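
One straightforward way to realize such feature mixing, used here purely as an illustrative sketch rather than the paper's exact recipe, is to concatenate per-patch features from the two encoders along the feature dimension before the adapter projection:

```python
import torch
import torch.nn.functional as F

def mix_encoder_features(feats_a: torch.Tensor, feats_b: torch.Tensor) -> torch.Tensor:
    """Concatenate per-patch features from two encoders (e.g. CLIP-ViT-L and DINO-v2).

    feats_a: (batch, patches_a, dim_a)
    feats_b: (batch, patches_b, dim_b)
    returns: (batch, patches_a, dim_a + dim_b)
    """
    if feats_b.shape[1] != feats_a.shape[1]:
        # If the two encoders produce different patch grids, resample one token
        # sequence to match the other (a simplification; other alignments are possible).
        feats_b = F.interpolate(
            feats_b.transpose(1, 2), size=feats_a.shape[1], mode="linear", align_corners=False
        ).transpose(1, 2)
    return torch.cat([feats_a, feats_b], dim=-1)
```

The concatenated features can then be fed to an adapter like the VisualAdapter sketch above, with vision_dim set to the combined dimensionality.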

A notable capability is the model's effective handling of high-resolution images through grid splitting of the input into tiles. This boosts results on OCR and document-oriented tasks, where fine-grained visual detail matters.
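
A minimal sketch of grid splitting, assuming square non-overlapping tiles at the encoder's input resolution and an optional downscaled whole-image view (whether a global view accompanies the tiles is an assumption here, not stated in the summary):

```python
import torch
import torch.nn.functional as F

def grid_split(image: torch.Tensor, tile: int = 448) -> torch.Tensor:
    """Split a high-resolution image into encoder-sized tiles plus a downscaled global view.

    image: (channels, height, width); height and width are assumed to be multiples of `tile`.
    returns: (num_tiles + 1, channels, tile, tile); the last entry is the whole-image view
             (including a global view is an assumption of this sketch).
    """
    c, h, w = image.shape
    tiles = (
        image.unfold(1, tile, tile)   # (c, h // tile, w, tile)
             .unfold(2, tile, tile)   # (c, h // tile, w // tile, tile, tile)
             .permute(1, 2, 0, 3, 4)  # (h // tile, w // tile, c, tile, tile)
             .reshape(-1, c, tile, tile)
    )
    global_view = F.interpolate(
        image.unsqueeze(0), size=(tile, tile), mode="bilinear", align_corners=False
    ).squeeze(0)
    return torch.cat([tiles, global_view.unsqueeze(0)], dim=0)
```

Each tile (and the global view) would then pass through the vision encoder, and the resulting token sequences are concatenated before the adapter.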

Implications and Future Directions

The implications of the OmniFusion work are both practical and theoretical. Practically, integrating carefully chosen visual embedding strategies into LLM frameworks extends the capabilities of multimodal systems. Theoretically, the comparisons of adapters, encoders, and image encoding strategies add to the foundational understanding of multimodal learning and point toward deeper integration of diverse data types.

Looking ahead, the authors plan to further explore image embeddings, improve context processing, and extend the model to video. They also note that combining these advances with image generation models such as Kandinsky might unlock new capabilities in multimedia generation.

In summary, the OmniFusion technical report gives a clear account of the design space for multimodal AI systems, presents benchmark evidence for the model's effectiveness, and charts a course for future developments in the field.