MM-LLMs: Recent Advances in MultiModal Large Language Models (2401.13601v5)
Abstract: Over the past year, MultiModal LLMs (MM-LLMs) have advanced substantially, augmenting off-the-shelf LLMs to support multimodal inputs or outputs via cost-effective training strategies. The resulting models not only preserve the inherent reasoning and decision-making capabilities of LLMs but also support a diverse range of MM tasks. In this paper, we provide a comprehensive survey aimed at facilitating further research on MM-LLMs. First, we outline general design formulations for model architecture and the training pipeline. We then introduce a taxonomy encompassing 126 MM-LLMs, each characterized by its specific formulations. Furthermore, we review the performance of selected MM-LLMs on mainstream benchmarks and summarize key training recipes for strengthening MM-LLMs. Finally, we explore promising directions for MM-LLMs while maintaining a real-time tracking website for the latest developments in the field. We hope this survey contributes to the ongoing advancement of the MM-LLMs domain.
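To make the "general design formulation" mentioned above concrete, the sketch below illustrates the architecture most surveyed MM-LLMs share: a (typically frozen) modality encoder, a small trainable input projector that maps encoder features into the LLM's token-embedding space, and a (typically frozen) LLM backbone. This is a minimal, illustrative PyTorch sketch under those assumptions; the class names, dimensions, and the toy encoder/embedding stand-ins are hypothetical and not taken from the paper.

```python
# Minimal sketch of the canonical MM-LLM design: frozen modality encoder,
# trainable input projector, frozen LLM backbone. All names, dimensions,
# and the toy encoder/embedding stand-ins are illustrative assumptions.
import torch
import torch.nn as nn


class InputProjector(nn.Module):
    """Maps modality-encoder features into the LLM's token-embedding space."""

    def __init__(self, enc_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(enc_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.proj(feats)


class ToyMMLLM(nn.Module):
    """Stand-in for encoder + projector + LLM; only the projector is trained."""

    def __init__(self, enc_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(3 * 224 * 224, enc_dim)    # placeholder vision encoder
        self.projector = InputProjector(enc_dim, llm_dim)   # trainable connector
        self.llm_embed = nn.Embedding(32000, llm_dim)       # placeholder LLM embedding table
        # Freeze the encoder and LLM parts; only the projector stays trainable,
        # reflecting the cost-effective training recipes the survey discusses.
        for p in self.encoder.parameters():
            p.requires_grad = False
        for p in self.llm_embed.parameters():
            p.requires_grad = False

    def forward(self, image: torch.Tensor, text_ids: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(image.flatten(1)).unsqueeze(1)   # (B, 1, enc_dim)
        visual_tokens = self.projector(feats)                 # (B, 1, llm_dim)
        text_tokens = self.llm_embed(text_ids)                # (B, T, llm_dim)
        # Prepend the projected visual tokens to the text sequence fed to the LLM.
        return torch.cat([visual_tokens, text_tokens], dim=1)


if __name__ == "__main__":
    model = ToyMMLLM()
    fused = model(torch.randn(2, 3, 224, 224), torch.randint(0, 32000, (2, 8)))
    print(fused.shape)  # torch.Size([2, 9, 4096])
```

In real systems the placeholder encoder would be a pretrained model such as a CLIP ViT, and the fused sequence would be consumed by a pretrained LLM; output-side projectors and modality generators extend the same pattern to MM outputs.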