VITA: Towards Open-Source Interactive Omni Multimodal LLM

(2408.05211)
Published Aug 9, 2024 in cs.CV, cs.AI, and cs.CL

Abstract

The remarkable multimodal capabilities and interactive experience of GPT-4o underscore their necessity in practical applications, yet open-source models rarely excel in both areas. In this paper, we introduce VITA, the first-ever open-source Multimodal Large Language Model (MLLM) adept at simultaneous processing and analysis of Video, Image, Text, and Audio modalities, while also offering an advanced multimodal interactive experience. Starting from Mixtral 8x7B as a language foundation, we expand its Chinese vocabulary and then perform bilingual instruction tuning. We further endow the language model with visual and audio capabilities through two-stage multi-task learning of multimodal alignment and instruction tuning. VITA demonstrates robust foundational capabilities in multilingual, vision, and audio understanding, as evidenced by its strong performance across a range of unimodal and multimodal benchmarks. Beyond these foundational capabilities, we have made considerable progress in enabling a natural multimodal human-computer interaction experience. To the best of our knowledge, we are the first to exploit non-awakening interaction and audio interrupt in an MLLM. VITA is a first step for the open-source community toward the seamless integration of multimodal understanding and interaction. While much work remains for VITA to approach its closed-source counterparts, we hope that its role as a pioneer can serve as a cornerstone for subsequent research. Project Page: https://vita-home.github.io.

VITA's training pipeline enhances language model proficiency, aligns multimodal data, and enables multimodal human-computer interaction.

Overview

  • VITA introduces a comprehensive Multimodal Large Language Model (MLLM) that integrates video, image, text, and audio inputs, addressing gaps in open-source multimodal capabilities and interactive functionalities.

  • The development pipeline involves instruction tuning with Mixtral 8x7B, multimodal alignment using specialized encoders, and a duplex deployment scheme for real-time user interaction without explicit wake-up commands.

  • Evaluations demonstrate improved bilingual language performance, strong audio processing capabilities, and competitive results on multimodal understanding benchmarks, although proprietary models still lead, particularly on video tasks.

Overview of "VITA: Towards Open-Source Interactive Omni Multimodal LLM"

VITA presents a novel Multimodal Large Language Model (MLLM) designed to process and integrate video, image, text, and audio inputs and to respond effectively across these modalities. The work addresses the persistent gap in open-source models' multimodal capabilities and interactive functionality by establishing VITA as a robust open platform. This summary elaborates on the core developments, methodologies, and evaluative outcomes of the VITA model.

Development and Training Pipeline

The model's development encompasses three core stages:

  1. LLM Instruction Tuning: Starting from Mixtral 8x7B as the base model, the authors expand its Chinese vocabulary and conduct bilingual instruction tuning on a high-quality corpus. This step strengthens the model's proficiency in both Chinese and English, a departure from the primarily English-focused base model.
  2. Multimodal Alignment and Training: Specialized encoders are employed for visual and audio inputs, and their outputs are aligned with the text representation space using high-quality datasets from diverse sources. Training also includes multimodal instruction tuning, which teaches the model to understand and respond to varied queries that combine audio and image data, laying the groundwork for a comprehensive interactive experience.
  3. Duplex Pipeline Deployment: The duplex scheme is a critical aspect of VITA, enabling real-time human-computer interaction without explicit wake-up commands. It supports non-awakening interaction and audio interruption by running two VITA instances concurrently: one generates the current response while the other monitors incoming audio, so the system can drop its reply and address a new query immediately (a minimal sketch of this idea follows the list).
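
The Python sketch below illustrates only this two-instance idea. The class names, the queue-based hand-off, and the toy query detector are assumptions introduced for illustration; this is not VITA's released implementation.

```python
# Minimal duplex sketch: one "model" streams an answer while a second one
# watches the audio stream and interrupts it when a new query arrives.
import queue
import threading


class GenerationModel:
    """Stand-in for the VITA instance that streams the current answer."""

    def __init__(self):
        self.interrupted = threading.Event()

    def generate(self, query):
        # Stream a canned answer token by token, checking for interruptions.
        for token in ["This", "is", "a", "streamed", "answer."]:
            if self.interrupted.is_set():
                return  # stop as soon as the monitor signals a new query
            print(token, end=" ", flush=True)
        print()

    def interrupt(self):
        self.interrupted.set()


class MonitorModel:
    """Stand-in for the second VITA instance that screens incoming audio."""

    def is_effective_query(self, audio_chunk):
        # In VITA this decision is learned (query audio vs. background noise);
        # the sketch fakes it with a flag carried on the chunk.
        return audio_chunk.get("is_query", False)


def duplex_loop(audio_stream):
    generator, monitor = GenerationModel(), MonitorModel()
    pending = queue.Queue()

    def monitor_worker():
        for chunk in audio_stream:
            if monitor.is_effective_query(chunk):
                generator.interrupt()  # audio interrupt: cut off the current reply
                pending.put(chunk)     # hand the new query to the generator

    threading.Thread(target=monitor_worker, daemon=True).start()
    while True:
        query = pending.get()          # blocks until the monitor forwards a query
        generator.interrupted.clear()
        generator.generate(query)
```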

Technical Innovations

  • State Tokens for Interaction Scenarios: To distinguish genuine queries from background noise, the model attaches state tokens to its inputs: <1> marks effective query audio, <2> marks noisy (non-query) audio, and <3> marks text queries. This categorization lets the model selectively process relevant inputs and respond only when a response is warranted (see the dispatch sketch after this list).
  • Architectural Enhancements: VITA pairs a capable visual encoder with a dedicated audio processing pipeline. Visual inputs are dynamically split into patch tokens, while audio inputs are converted to Mel spectrograms and then passed through convolutional downsampling and a transformer-based encoder (see the encoder sketch after this list). These components underpin the model's multimodal comprehension.
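
As a simplified illustration of the state tokens, the snippet below shows how the three tokens could gate whether the model answers at all. The token strings follow the paper; the `should_respond` helper and its interface are hypothetical names for this sketch.

```python
# Illustrative only: how the three state tokens could gate response behaviour.
EFFECTIVE_AUDIO, NOISY_AUDIO, TEXT_QUERY = "<1>", "<2>", "<3>"


def should_respond(state_token: str) -> bool:
    """Answer effective audio queries and text queries; ignore background noise."""
    if state_token in (EFFECTIVE_AUDIO, TEXT_QUERY):
        return True
    if state_token == NOISY_AUDIO:
        return False
    raise ValueError(f"unknown state token: {state_token}")


assert should_respond("<1>") and should_respond("<3>")
assert not should_respond("<2>")
```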
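
The audio path described above (Mel spectrogram, convolutional downsampling, transformer encoding) can be sketched as follows. All hyperparameters here, such as the 80 Mel bins, the roughly 4x temporal downsampling, and the layer counts, are assumptions for illustration and not VITA's actual configuration; the sketch only mirrors the described sequence of operations.

```python
# Minimal sketch of a Mel-spectrogram -> convolution -> Transformer audio front-end.
import torch
import torch.nn as nn
import torchaudio


class AudioEncoderSketch(nn.Module):
    def __init__(self, d_model: int = 512, n_layers: int = 4):
        super().__init__()
        # Log-Mel features from 16 kHz waveforms (80 bins is an assumed setting).
        self.mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=16_000, n_fft=400, hop_length=160, n_mels=80
        )
        # Two strided convolutions give roughly 4x temporal downsampling.
        self.conv = nn.Sequential(
            nn.Conv1d(80, d_model, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
        )
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) at 16 kHz
        feats = self.mel(waveform).clamp(min=1e-10).log()  # (batch, 80, frames)
        feats = self.conv(feats)                           # (batch, d_model, frames/4)
        feats = feats.transpose(1, 2)                      # (batch, frames/4, d_model)
        return self.encoder(feats)                         # audio tokens for the LLM


# Example: a one-second clip becomes a short sequence of audio tokens.
tokens = AudioEncoderSketch()(torch.randn(1, 16_000))
```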

Evaluation and Results

The evaluation of VITA across various benchmarks shows notable performance in multimodal understanding:

  • Language Performance: VITA shows significant improvements on Chinese benchmarks (C-Eval, AGIEval) while maintaining strong performance on English tasks (MMLU, GSM8K). These outcomes underline the effectiveness of the bilingual instruction tuning approach.
  • Audio Performance: Evaluated on the WenetSpeech and LibriSpeech datasets, VITA demonstrates robust ASR capabilities, indicating effective audio training and alignment.
  • Multimodal Benchmarks: In comparison with existing open-source and closed-source models, VITA achieves competitive results in image and video understanding benchmarks. However, there remains a performance gap when compared to proprietary models, particularly in video understanding tasks.

Implications and Future Directions

VITA represents a significant contribution to the field of MLLMs with several practical and theoretical implications:

  1. Practical Usability: The duplex interaction scheme and real-time query processing capabilities make VITA a valuable tool for applications requiring seamless human-computer interaction across multiple modalities.
  2. Enhanced Multimodal Interaction: By addressing non-awakening and audio interrupt interactions, VITA sets a precedent for future models aiming to enhance user engagement and application responsiveness.
  3. Bilingual Capabilities: The integration of Chinese alongside English expands the applicability of VITA across diverse linguistic contexts, promoting inclusivity in advanced language technologies.

Conclusion

While VITA marks a substantial step forward in multimodal interaction and open-source MLLMs, future work can focus on strengthening foundational capabilities, refining the construction of noisy-audio training samples, and integrating end-to-end Text-to-Speech (TTS) capabilities. Such developments would further consolidate VITA's role in advancing multimodal AI research and application, reinforcing its pioneering contribution of combining multimodal understanding with interactive functionality.

This thorough exploration of the VITA model elucidates its technical framework, evaluative outcomes, and broader implications, situating it as a foundational asset within the ongoing evolution of multimodal language models.
