OmniFusion Technical Report (2404.06212v1)

Published 9 Apr 2024 in cs.CV, cs.AI, and cs.LG

Abstract: Last year, multimodal architectures served up a revolution in AI-based approaches and solutions, extending the capabilities of large language models (LLMs). We propose an OmniFusion model based on a pretrained LLM and adapters for visual modality. We evaluated and compared several architecture design principles for better text and visual data coupling: MLP and transformer adapters, various CLIP ViT-based encoders (SigLIP, InternVIT, etc.), and their fusing approach, image encoding method (whole image or tiles encoding) and two 7B LLMs (the proprietary one and open-source Mistral). Experiments on 8 visual-language benchmarks show the top score for the best OmniFusion setup in terms of different VQA tasks in comparison with open-source LLaVA-like solutions: VizWiz, Pope, MM-Vet, ScienceQA, MMBench, TextVQA, VQAv2, MMMU. We also propose a variety of situations, where OmniFusion provides highly-detailed answers in different domains: housekeeping, sightseeing, culture, medicine, handwritten and scanned equations recognition, etc. Mistral-based OmniFusion model is an open-source solution with weights, training and inference scripts available at https://github.com/AIRI-Institute/OmniFusion.

References (55)
  1. Flamingo: a visual language model for few-shot learning. ArXiv, abs/2204.14198, 2022.
  2. Llama-adapter v2: Parameter-efficient visual instruction model. ArXiv, abs/2304.15010, 2023.
  3. Video-llava: Learning united visual representation by alignment before projection. ArXiv, abs/2311.10122, 2023.
  4. Llava-plus: Learning to use tools for creating multimodal agents. ArXiv, abs/2311.05437, 2023.
  5. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. ArXiv, abs/2306.00890, 2023.
  6. A challenger to gpt-4v? early explorations of gemini in visual expertise. arXiv preprint arXiv:2312.12436, 2023.
  7. Learning transferable visual models from natural language supervision, 2021.
  8. Sigmoid loss for language image pre-training, 2023.
  9. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024.
  10. Vizwiz grand challenge: Answering visual questions from blind people, 2018.
  11. Evaluating object hallucination in large vision-language models, 2023.
  12. Mm-vet: Evaluating large multimodal models for integrated capabilities, 2023.
  13. Learn to explain: Multimodal reasoning via thought chains for science question answering, 2022.
  14. Mmbench: Is your multi-modal model an all-around player?, 2023.
  15. Towards vqa models that can read, 2019.
  16. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6325–6334, 2017.
  17. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi, 2023.
  18. Improved baselines with visual instruction tuning. ArXiv, abs/2310.03744, 2023.
  19. Vision-flan: Scaling human-labeled tasks in visual instruction tuning, 2024.
  20. Sharegpt4v: Improving large multi-modal models with better captions, 2023.
  21. Microsoft coco captions: Data collection and evaluation server, 2015.
  22. Segment anything, 2023.
  23. Internlm2 technical report, 2024.
  24. Towards vqa models that can read. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8317–8326, 2019.
  25. Ocr-vqa: Visual question answering by reading text in images. In ICDAR, 2019.
  26. Laion-5b: An open large-scale dataset for training next generation image-text models, 2022.
  27. OpenCLIP, July 2021.
  28. Internlm-xcomposer2: Mastering free-form text-image composition and comprehension in vision-language large model, 2024.
  29. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023.
  30. From clip to dino: Visual encoders shout in multi-modal large language models, 2024.
  31. Icfhr2016 crohme: Competition on recognition of online handwritten mathematical expressions. In 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pages 607–612, 2016.
  32. Ocr-free document understanding transformer. In European Conference on Computer Vision, pages 498–517. Springer, 2022.
  33. Lmms-eval: Accelerating the development of large multimodal models, March 2024.
  34. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. 2023.
  35. Mini-gemini: Mining the potential of multi-modality vision language models, 2024.
  36. Deepseek-vl: Towards real-world vision-language understanding, 2024.
  37. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023.
  38. Grounding language models to images for multimodal generation. ArXiv, abs/2301.13823, 2023.
  39. Visual instruction tuning. ArXiv, abs/2304.08485, 2023.
  40. Minigpt-4: Enhancing vision-language understanding with advanced large language models. ArXiv, abs/2304.10592, 2023.
  41. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, 2023.
  42. Instructblip: Towards general-purpose vision-language models with instruction tuning. ArXiv, abs/2305.06500, 2023.
  43. Bootstrapping vision-language learning with decoupled language pre-training. ArXiv, abs/2307.07063, 2023.
  44. Lyrics: Boosting fine-grained language-vision alignment and comprehension via semantic-aware visual objects. ArXiv, abs/2312.05278, 2023.
  45. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. ArXiv, abs/2303.16199, 2023.
  46. Infmllm: A unified framework for visual-language tasks. ArXiv, abs/2311.06791, 2023.
  47. Cosmo: Contrastive streamlined multimodal model with interleaved pre-training, 2024.
  48. Kosmos-2: Grounding multimodal large language models to the world, 2023.
  49. Pali: A jointly-scaled multilingual language-image model, 2023.
  50. Llava-grounding: Grounded visual chat with large multimodal models. ArXiv, abs/2312.02949, 2023.
  51. Moe-llava: Mixture of experts for large vision-language models. ArXiv, abs/2401.15947, 2024.
  52. Llava-phi: Efficient multi-modal assistant with small language model. ArXiv, abs/2401.02330, 2024.
  53. Vila: On pre-training for visual language models. ArXiv, abs/2312.07533, 2023.
  54. Jack of all tasks, master of many: Designing general-purpose coarse-to-fine vision-language model. ArXiv, abs/2312.12423, 2023.
  55. Kandinsky: An improved text-to-image synthesis with image prior and latent diffusion. In Yansong Feng and Els Lefever, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 286–295, Singapore, December 2023. Association for Computational Linguistics.
Authors (9)
  1. Elizaveta Goncharova (10 papers)
  2. Anton Razzhigaev (14 papers)
  3. Matvey Mikhalchuk (6 papers)
  4. Maxim Kurkin (2 papers)
  5. Irina Abdullaeva (3 papers)
  6. Matvey Skripkin (4 papers)
  7. Ivan Oseledets (187 papers)
  8. Denis Dimitrov (27 papers)
  9. Andrey Kuznetsov (36 papers)
Citations (3)

Summary

  • The paper introduces OmniFusion, which couples a pretrained LLM with specialized visual adapters for joint text-image processing.
  • It compares transformer- and MLP-based adapters, several vision encoders and ways of fusing their features, and both whole-image and tiled image encoding.
  • Grid splitting of high-resolution images and multi-encoder feature mixing improve performance on VQA, OCR, and document benchmarks, pointing toward broader multimodal applications.

Overview of the OmniFusion Technical Report

The paper "OmniFusion Technical Report" introduces the OmniFusion model, a novel approach in the field of multimodal architectures that combines pretrained LLMs with specialized adapters for visual modalities. This integration serves to enhance the joint processing capabilities of text and images, aiming to address the inherent challenges in multimodal data coupling. The paper undertakes a comprehensive assessment of architectural design strategies, including the employment of MLP and transformer adapters, diverse image encoders such as CLIP-ViT variants, and their corresponding image encoding methodologies.

Central to OmniFusion's design is its flexibility in image encoding: the model supports both whole-image encoding and tiled encoding, in which the image is split into parts that are encoded separately. This flexibility helps preserve fine-grained visual detail and contributes to strong results across visual-language benchmarks, spanning visual question answering (VQA) as well as domain-specific applications such as culture, medicine, and handwritten equation recognition.

Model Architecture and Training

The core architecture integrates a pretrained LLM with adapters designed to process visual embeddings. The adapter-based approach avoids the heavy computational demands of end-to-end training pipelines, which typically require vast interleaved image-text datasets. The two primary design decisions are the choice of adapter technique and the strategy for encoding visual data.

Visual and textual modalities are aligned through special trainable embeddings that mark the boundaries of the visual token sequence in the LLM input. The visual features themselves are projected into the language model's embedding space by either a transformer adapter or a two-layer MLP, optionally merging features from distinct encoders such as CLIP-ViT-L and DINO-v2.
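
A minimal sketch of such an adapter, assuming a two-layer MLP projection and learnable boundary embeddings (module names and dimensions are hypothetical, not the authors' released code):

```python
import torch
import torch.nn as nn

class VisualAdapter(nn.Module):
    """Projects frozen vision-encoder features into the LLM embedding space."""

    def __init__(self, vision_dim: int, llm_dim: int, hidden_dim: int = 4096):
        super().__init__()
        # Two-layer MLP projection, one of the adapter options discussed in the paper.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, llm_dim),
        )
        # Trainable embeddings that demarcate the visual token span in the LLM input.
        self.img_start = nn.Parameter(torch.randn(1, 1, llm_dim) * 0.02)
        self.img_end = nn.Parameter(torch.randn(1, 1, llm_dim) * 0.02)

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim), e.g. from CLIP-ViT-L
        tokens = self.proj(vision_feats)  # (batch, num_patches, llm_dim)
        batch = tokens.shape[0]
        return torch.cat(
            [self.img_start.expand(batch, -1, -1), tokens, self.img_end.expand(batch, -1, -1)],
            dim=1,
        )
```

The resulting sequence would be concatenated with the text token embeddings before being passed to the LLM; when features from two encoders are mixed, vision_dim is simply the combined feature dimensionality.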

Training Regimen

OmniFusion's training unfolds in two stages. First, the adapter and the special tokens are pretrained on extensive datasets of image-text pairs, teaching the adapter to map visual features into the language model's embedding space. Second, the model is fine-tuned on instructional dialogues, using task-specific datasets to strengthen the integration of textual and visual information and to mitigate the pitfalls of synthetic data.
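
A schematic of this two-stage regime, under the assumption that the LLM stays frozen while the adapter is pretrained and may be unfrozen for instruction tuning (function names, loaders, and learning rates are illustrative):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader

def run_stage(model: nn.Module, loader: DataLoader, params, lr: float, epochs: int = 1):
    """One training stage: optimize only `params` with a next-token prediction loss."""
    optimizer = torch.optim.AdamW(params, lr=lr)
    model.train()
    for _ in range(epochs):
        for batch in loader:
            loss = model(**batch).loss  # assumes a HF-style forward that returns .loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

def train_omnifusion(model, adapter, llm, caption_loader, instruction_loader):
    # Stage 1: freeze the LLM; pretrain the adapter and its special embeddings
    # on large collections of image-text pairs.
    for p in llm.parameters():
        p.requires_grad = False
    run_stage(model, caption_loader, adapter.parameters(), lr=1e-4)

    # Stage 2: fine-tune on instructional dialogues; the LLM can be unfrozen here.
    for p in llm.parameters():
        p.requires_grad = True
    run_stage(model, instruction_loader, model.parameters(), lr=2e-5)
```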

Experimental Insights

The experimental analysis investigates various vision encoders and adapter options, showing that larger image encoders such as InternViT-6B-448px-V1-2 deliver the best overall performance across multiple benchmarks. In addition, mixing features from multiple encoders improves certain task-specific metrics.
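
One straightforward way to realize such feature mixing, used here purely as an illustrative sketch rather than the paper's exact recipe, is to concatenate per-patch features from the two encoders along the feature dimension before the adapter projection:

```python
import torch
import torch.nn.functional as F

def mix_encoder_features(feats_a: torch.Tensor, feats_b: torch.Tensor) -> torch.Tensor:
    """Concatenate per-patch features from two encoders (e.g. CLIP-ViT-L and DINO-v2).

    feats_a: (batch, patches_a, dim_a)
    feats_b: (batch, patches_b, dim_b)
    returns: (batch, patches_a, dim_a + dim_b)
    """
    if feats_b.shape[1] != feats_a.shape[1]:
        # If the two encoders produce different patch grids, resample one token
        # sequence to match the other (a simplification; other alignments are possible).
        feats_b = F.interpolate(
            feats_b.transpose(1, 2), size=feats_a.shape[1], mode="linear", align_corners=False
        ).transpose(1, 2)
    return torch.cat([feats_a, feats_b], dim=-1)
```

The concatenated features can then be fed to an adapter like the VisualAdapter sketch above, with vision_dim set to the combined dimensionality.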

A notable capability is the model's effective handling of high-resolution images through grid splitting of the input into tiles. This boosts results on OCR and document-oriented tasks, where fine-grained visual detail matters.
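
A minimal sketch of grid splitting, assuming square non-overlapping tiles at the encoder's input resolution and an optional downscaled whole-image view (whether a global view accompanies the tiles is an assumption here, not stated in the summary):

```python
import torch
import torch.nn.functional as F

def grid_split(image: torch.Tensor, tile: int = 448) -> torch.Tensor:
    """Split a high-resolution image into encoder-sized tiles plus a downscaled global view.

    image: (channels, height, width); height and width are assumed to be multiples of `tile`.
    returns: (num_tiles + 1, channels, tile, tile); the last entry is the whole-image view
             (including a global view is an assumption of this sketch).
    """
    c, h, w = image.shape
    tiles = (
        image.unfold(1, tile, tile)   # (c, h // tile, w, tile)
             .unfold(2, tile, tile)   # (c, h // tile, w // tile, tile, tile)
             .permute(1, 2, 0, 3, 4)  # (h // tile, w // tile, c, tile, tile)
             .reshape(-1, c, tile, tile)
    )
    global_view = F.interpolate(
        image.unsqueeze(0), size=(tile, tile), mode="bilinear", align_corners=False
    ).squeeze(0)
    return torch.cat([tiles, global_view.unsqueeze(0)], dim=0)
```

Each tile (and the global view) would then pass through the vision encoder, and the resulting token sequences are concatenated before the adapter.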

Implications and Future Directions

The implications of the OmniFusion work are both practical and theoretical. Practically, integrating carefully chosen visual embedding strategies into LLM frameworks extends the capabilities of multimodal systems. Theoretically, the comparisons of adapters, encoders, and image encoding strategies add to the foundational understanding of multimodal learning and point toward deeper integration of diverse data types.

Looking ahead, the authors plan to further explore image embeddings, improve context processing, and extend the model to video. They also note that combining these advances with image generation models such as Kandinsky might unlock new capabilities in multimedia generation.

In summary, the OmniFusion technical report gives a clear account of the design space for multimodal AI systems, presents benchmark evidence for the model's effectiveness, and charts a course for future developments in the field.