OneLLM: One Framework to Align All Modalities with Language (2312.03700v1)
Abstract: Multimodal LLMs (MLLMs) have gained significant attention due to their strong multimodal understanding capability. However, existing works rely heavily on modality-specific encoders, which usually differ in architecture and are limited to common modalities. In this paper, we present OneLLM, an MLLM that aligns eight modalities to language using a unified framework. We achieve this through a unified multimodal encoder and a progressive multimodal alignment pipeline. In detail, we first train an image projection module to connect a vision encoder with the LLM. Then, we build a universal projection module (UPM) by mixing multiple image projection modules with dynamic routing. Finally, we progressively align more modalities to the LLM with the UPM. To fully leverage the potential of OneLLM in following instructions, we also curate a comprehensive multimodal instruction dataset, including 2M items from image, audio, video, point cloud, depth/normal map, IMU and fMRI brain activity. OneLLM is evaluated on 25 diverse benchmarks, encompassing tasks such as multimodal captioning, question answering and reasoning, where it delivers excellent performance. Code, data, model and online demo are available at https://github.com/csuhan/OneLLM
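The abstract describes the UPM as a mixture of image projection modules combined by dynamic routing. The sketch below is a minimal illustration of that idea only, assuming PyTorch, MLP projection experts, and a mean-pooled linear soft router; the layer sizes, router design, and all names (`ProjectionExpert`, `UniversalProjection`, `enc_dim`, `llm_dim`, `num_experts`) are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a universal projection module (UPM) that mixes several
# projection "experts" with a learned soft router. All sizes and the router
# design are assumptions for illustration, not the paper's exact architecture.
import torch
import torch.nn as nn


class ProjectionExpert(nn.Module):
    """One projection module mapping encoder features to the LLM embedding space."""

    def __init__(self, enc_dim: int, llm_dim: int, hidden_dim: int = 2048):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(enc_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mlp(x)


class UniversalProjection(nn.Module):
    """Combines K projection experts with per-input soft routing weights."""

    def __init__(self, enc_dim: int, llm_dim: int, num_experts: int = 3):
        super().__init__()
        self.experts = nn.ModuleList(
            ProjectionExpert(enc_dim, llm_dim) for _ in range(num_experts)
        )
        # Router scores each expert from the mean-pooled input features.
        self.router = nn.Linear(enc_dim, num_experts)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, tokens, enc_dim) from a shared multimodal encoder.
        weights = torch.softmax(self.router(feats.mean(dim=1)), dim=-1)   # (batch, K)
        outputs = torch.stack([e(feats) for e in self.experts], dim=1)    # (batch, K, tokens, llm_dim)
        return (weights[:, :, None, None] * outputs).sum(dim=1)           # (batch, tokens, llm_dim)


if __name__ == "__main__":
    upm = UniversalProjection(enc_dim=1024, llm_dim=4096)
    tokens = upm(torch.randn(2, 256, 1024))
    print(tokens.shape)  # torch.Size([2, 256, 4096])
```

In this reading, progressive alignment would amount to training the routed mixture on new modalities while reusing the image-pretrained experts, but the exact training schedule is described only at a high level in the abstract.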