
CogVLM: Visual Expert for Pretrained Language Models

(arXiv:2311.03079)
Published Nov 6, 2023 in cs.CV

Abstract

We introduce CogVLM, a powerful open-source visual language foundation model. Unlike the popular shallow alignment method, which maps image features into the input space of the language model, CogVLM bridges the gap between the frozen pretrained language model and image encoder with a trainable visual expert module in the attention and FFN layers. As a result, CogVLM enables deep fusion of vision and language features without sacrificing any performance on NLP tasks. CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks, including NoCaps, Flickr30K captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA, and TDIUC, and ranks 2nd on VQAv2, OKVQA, TextVQA, COCO captioning, etc., surpassing or matching PaLI-X 55B. Code and checkpoints are available at https://github.com/THUDM/CogVLM.

Figure: CogVLM's effectiveness across multi-modal tasks, outperforming current methods.

Overview

  • VLMs are tools for interpreting and generating visual and textual content, applicable to tasks like image captioning and visual question answering.

  • Training VLMs that effectively integrate vision and language remains challenging, with difficulties in deeply fusing the two while retaining NLP capabilities.

  • CogVLM introduces a 'visual expert module' to a pretrained language model for improved visual-language feature integration without added computational burden.

  • CogVLM-17B achieves state-of-the-art or second-best performance on 14 classic cross-modal benchmarks, matching or surpassing other leading VLMs.

  • CogVLM is open-source, which broadens research opportunities, and future advancements may focus on training refinement and reducing content hallucination.

Introduction to Visual Language Models (VLMs)

Visual Language Models (VLMs) have emerged as robust tools capable of understanding and generating content across both visual and textual domains. These models can tackle tasks such as image captioning, visual question answering (VQA), visual grounding, and more. VLMs also exhibit in-context learning, and their performance on downstream tasks improves as model size scales.

Challenges in Training VLMs

Training high-performance VLMs that maintain their language capabilities while incorporating visual understanding is a complex task. The traditional approach is a 'shallow alignment' strategy, which connects a pretrained vision encoder to a language model through a small trainable module such as a linear projection or a Q-Former. Models trained this way converge quickly, but they do not reach the performance of models in which vision and language components are trained jointly. The gap arises because vision and language features are only superficially integrated in shallow alignment methods. Deeply fusing these features while retaining NLP capabilities remains a key challenge in the field.
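For intuition, here is a minimal PyTorch sketch of the shallow alignment setup described above (this is the baseline approach, not CogVLM's method). The module names and dimensions are illustrative, and the frozen vision encoder and language model are stood in for by dummy tensors.

```python
# Sketch of shallow alignment: only a small projection is trained; the vision encoder
# and the language model stay frozen and never adapt to each other's features.
import torch
import torch.nn as nn

class ShallowAlignmentAdapter(nn.Module):
    """Hypothetical adapter mapping frozen vision features into the LM embedding space."""
    def __init__(self, vision_dim: int, lm_hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(vision_dim, lm_hidden_dim)  # the only trainable piece

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim) from a frozen vision encoder
        return self.proj(image_features)  # -> (batch, num_patches, lm_hidden_dim)

# Usage: the projected "visual tokens" are concatenated with text token embeddings and
# fed to the frozen language model as ordinary input embeddings.
vision_dim, lm_dim = 1024, 4096                   # illustrative sizes
adapter = ShallowAlignmentAdapter(vision_dim, lm_dim)
image_feats = torch.randn(1, 256, vision_dim)     # dummy encoder output
text_embeds = torch.randn(1, 32, lm_dim)          # dummy text token embeddings
lm_inputs = torch.cat([adapter(image_feats), text_embeds], dim=1)  # (1, 288, lm_dim)
```

Because only the projection is updated, the language model's attention and FFN weights never specialize to visual features; this is the shallow-fusion limitation that CogVLM's visual expert is designed to address.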

Introducing CogVLM

CogVLM addresses the deep fusion challenge by adding a 'visual expert module' to each layer of a pretrained language model. The module consists of a separate QKV transformation matrix and a separate feedforward network (MLP) applied to image tokens, while text tokens continue to use the original pretrained weights. This enables rich visual-language feature integration without increasing the computation per token, and it preserves the language model's original behavior on text-only inputs. CogVLM demonstrates remarkable performance on 14 cross-modal benchmarks, outperforming or matching state-of-the-art alternatives.
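To make the idea concrete, below is a simplified, single-head PyTorch sketch of how such token-type routing can work. All names are assumptions for illustration; multi-head attention, causal masking, normalization, and other details of the released implementation are omitted, and the routed output projection is shown only for symmetry.

```python
# Sketch of the visual-expert idea: image-token positions use parallel trainable
# weights, text-token positions keep the frozen pretrained language-model weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualExpertAttention(nn.Module):
    """Single-head attention with per-token-type QKV and output projections (illustrative)."""
    def __init__(self, dim: int):
        super().__init__()
        self.qkv_text = nn.Linear(dim, 3 * dim)    # pretrained LM weights, kept frozen
        self.qkv_image = nn.Linear(dim, 3 * dim)   # visual expert, trainable
        self.out_text = nn.Linear(dim, dim)
        self.out_image = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor, image_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); image_mask: (batch, seq) bool, True at image-token positions
        route = image_mask[..., None]
        qkv = torch.where(route, self.qkv_image(x), self.qkv_text(x))
        q, k, v = qkv.chunk(3, dim=-1)
        attn = F.softmax((q @ k.transpose(-2, -1)) * self.scale, dim=-1)
        out = attn @ v
        return torch.where(route, self.out_image(out), self.out_text(out))

class VisualExpertFFN(nn.Module):
    """Feedforward block routed the same way: a parallel trainable MLP for image tokens."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.ffn_text = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.ffn_image = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x: torch.Tensor, image_mask: torch.Tensor) -> torch.Tensor:
        route = image_mask[..., None]
        return torch.where(route, self.ffn_image(x), self.ffn_text(x))

# Usage with a dummy sequence of 256 image tokens followed by 32 text tokens.
dim = 512
x = torch.randn(1, 288, dim)
image_mask = torch.zeros(1, 288, dtype=torch.bool)
image_mask[:, :256] = True
h = VisualExpertAttention(dim)(x, image_mask)
h = VisualExpertFFN(dim, 4 * dim)(h, image_mask)
```

Because text-token positions keep the pretrained weights, a text-only input follows exactly the original computation path, which is how the language model's NLP behavior is preserved.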

Benefits and Future Directions

CogVLM is open-source, which is significant given that most preceding VLMs are proprietary, limiting research and application development. The model is suitable for both research and commercial use, and its release is expected to contribute substantially to advances in visual understanding. Future VLM development may explore better training alignment, reinforcement learning from human feedback (RLHF), and strategies to reduce hallucination in generated content. As the field continues to evolve, CogVLM establishes a strong foundation for multimodal AI.
