Grounded Language-Image Pre-training

(2112.03857)
Published Dec 7, 2021 in cs.CV, cs.AI, cs.CL, cs.LG, and cs.MM

Abstract

This paper presents a grounded language-image pre-training (GLIP) model for learning object-level, language-aware, and semantic-rich visual representations. GLIP unifies object detection and phrase grounding for pre-training. The unification brings two benefits: 1) it allows GLIP to learn from both detection and grounding data to improve both tasks and bootstrap a good grounding model; 2) GLIP can leverage massive image-text pairs by generating grounding boxes in a self-training fashion, making the learned representations semantic-rich. In our experiments, we pre-train GLIP on 27M grounding data, including 3M human-annotated and 24M web-crawled image-text pairs. The learned representations demonstrate strong zero-shot and few-shot transferability to various object-level recognition tasks. 1) When directly evaluated on COCO and LVIS (without seeing any images in COCO during pre-training), GLIP achieves 49.8 AP and 26.9 AP, respectively, surpassing many supervised baselines. 2) After fine-tuning on COCO, GLIP achieves 60.8 AP on val and 61.5 AP on test-dev, surpassing prior SoTA. 3) When transferred to 13 downstream object detection tasks, a 1-shot GLIP rivals a fully-supervised Dynamic Head. Code is released at https://github.com/microsoft/GLIP.

Unified model for detecting and grounding objects by aligning image regions with text prompts.

Overview

  • The paper introduces GLIP, a model for pre-training object-level, language-aware visual representations that unifies object detection and phrase grounding and scales pre-training with massive image-text data.

  • GLIP's architecture incorporates deep cross-modality fusion layers and a dual-encoder structure, enabling early alignment of linguistic and visual modalities to enhance semantic richness.

  • GLIP demonstrates state-of-the-art performance on benchmarks like COCO and LVIS, with strong zero-shot transfer (49.8 AP on COCO for GLIP-L) and state-of-the-art fine-tuned results (60.8 AP on COCO val), alongside notable data efficiency on downstream detection tasks.

Grounded Language-Image Pre-training

The paper "Grounded Language-Image Pre-training" introduces GLIP, a novel model tailored for pre-training object-level, language-aware, and rich semantic visual representations. This model unifies object detection and phrase grounding through a groundbreaking approach that leverages massive image-text data, enabling seamless adaptation to downstream tasks with minimal human annotation while achieving state-of-the-art (SoTA) results.

Model and Methodological Approach

GLIP integrates object detection and phrase grounding under a unified framework. This integration lets the model learn from combined detection and grounding data, improving both tasks. Object detection is reformulated as phrase grounding: the candidate categories are assembled into a text prompt, and the model predicts bounding boxes aligned with the phrases found in that prompt.
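
As a concrete illustration, the sketch below shows the detection-as-grounding idea in PyTorch: category names are concatenated into a prompt, and region features are scored against prompt token features to produce word-region alignment logits. The shapes, variable names, and random stand-in features are assumptions for exposition, not the released GLIP code.

```python
# Illustrative sketch of GLIP's detection-as-grounding reformulation.
# Random features stand in for the encoders; shapes and names are assumptions,
# not the official GLIP implementation.
import torch

categories = ["person", "bicycle", "car"]        # detection label space
prompt = ". ".join(categories) + "."             # e.g. "person. bicycle. car."

num_regions, num_tokens, dim = 100, len(categories), 256

# Stand-ins for the vision encoder (region/box features) and the
# language encoder (one feature per category phrase in the prompt).
region_feats = torch.randn(num_regions, dim)     # O: object/region features
token_feats = torch.randn(num_tokens, dim)       # P: prompt phrase features

# Word-region alignment scores replace the usual classification logits:
# alignment[i, j] = similarity between region i and phrase j.
alignment = region_feats @ token_feats.T         # (num_regions, num_tokens)

# During training these logits are supervised with grounding targets;
# at inference the best-matching phrase gives each box its label.
pred_labels = [categories[j] for j in alignment.argmax(dim=1)[:5].tolist()]
print(prompt)
print(pred_labels)
```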

The model architecture incorporates deep cross-modality fusion layers that align the linguistic and visual modalities early in processing. This early fusion makes the visual features language-aware and significantly enriches the semantics captured by the visual representations. The model employs a dual-encoder structure, with a vision encoder and a language encoder whose intermediate features are fused via cross-modality attention, which helps on tasks requiring fine-grained image understanding.
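
A minimal sketch of one such fusion step, modeled as bidirectional cross-attention between the visual and textual streams, is shown below. It captures the spirit of the deep fusion described in the paper but is a simplified assumption, not GLIP's actual cross-modality multi-head attention module.

```python
# Rough sketch of one bidirectional cross-modality fusion step
# (cross-attention between image and text features). Layer choices and
# hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        # image attends to text, and text attends to image
        self.img_to_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_to_img = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_feats: torch.Tensor, txt_feats: torch.Tensor):
        # img_feats: (B, num_regions, dim), txt_feats: (B, num_tokens, dim)
        img_update, _ = self.img_to_txt(img_feats, txt_feats, txt_feats)
        txt_update, _ = self.txt_to_img(txt_feats, img_feats, img_feats)
        # residual updates keep both streams, now language- and vision-aware
        return img_feats + img_update, txt_feats + txt_update

fusion = CrossModalFusion()
img = torch.randn(2, 100, 256)   # stand-in visual features
txt = torch.randn(2, 12, 256)    # stand-in prompt token features
img_fused, txt_fused = fusion(img, txt)
print(img_fused.shape, txt_fused.shape)
```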

Data and Experimental Setup

In pre-training, GLIP leverages a substantial amount of labeled data. Specifically, it is pre-trained on 27 million grounding examples, comprising roughly 3 million human-annotated detection and grounding examples and 24 million web-crawled image-text pairs. The paper presents results from multiple GLIP variants differing in backbone architecture (Swin-Tiny and Swin-Large) and in the composition of the pre-training data.
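
The web-crawled pairs do not come with boxes; as the abstract notes, grounding boxes for them are generated in a self-training fashion by a teacher grounding model. The sketch below outlines such a pseudo-labeling loop; `teacher_glip` and `extract_noun_phrases` are hypothetical placeholders, and the thresholding details are assumptions rather than the paper's exact procedure.

```python
# Hedged sketch of the self-training step that turns raw web image-text pairs
# into grounding data. The helper names and data structures are hypothetical
# stand-ins, not the released GLIP pipeline.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class PseudoExample:
    image_id: str
    caption: str
    boxes: List[Tuple[float, float, float, float]]  # xyxy
    phrases: List[str]                               # grounded phrase per box

def pseudo_label(pairs, teacher_glip, extract_noun_phrases, score_thresh=0.5):
    """Run a teacher grounding model over image-text pairs and keep
    confident box-phrase matches as extra pre-training data."""
    out = []
    for image_id, image, caption in pairs:
        phrases = extract_noun_phrases(caption)           # candidate phrases
        boxes, matched, scores = teacher_glip(image, phrases)
        keep = [i for i, s in enumerate(scores) if s >= score_thresh]
        out.append(PseudoExample(
            image_id=image_id,
            caption=caption,
            boxes=[boxes[i] for i in keep],
            phrases=[matched[i] for i in keep],
        ))
    return out
```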

The proposed model was evaluated under various experimental settings, including zero-shot domain transfer and supervised fine-tuning, on benchmarks such as COCO, LVIS, and Flickr30K. These settings underscore GLIP's robustness in learning representations transferable across different tasks and datasets. Notably, the model exhibits impressive zero-shot performance, demonstrating its ability to generalize without task-specific re-training.

Numerical Results and Performance

Key numerical results highlight the model's efficacy:

Zero-Shot Performance:

  • On COCO, GLIP-T (C) achieved 46.7 AP and GLIP-L obtained 49.8 AP, surpassing many traditional supervised baselines.
  • On LVIS, GLIP-T (C) and GLIP-L obtained 26.0 AP and 26.9 AP respectively, performing better than various supervised baselines.

Supervised Fine-Tuning:

  • After fine-tuning on COCO, GLIP achieved a state-of-the-art AP of 60.8 on val and 61.5 on test-dev.

Data Efficiency:

  • GLIP demonstrated significant data efficiency on 13 downstream object detection tasks: a 1-shot GLIP-L rivals a fully supervised Dynamic Head, and even the zero-shot model remains competitive, highlighting how little task-specific data it needs to generalize.

Implications and Future Directions

From both a practical and theoretical perspective, GLIP presents a robust framework for advancing visual recognition systems. Practically, its data efficiency and transferability can significantly reduce the annotation effort required for deploying object detection systems in new domains. The unification of detection and grounding enhances model adaptability, making it suitable for a wide range of applications.

Theoretically, the deep cross-modality fusion and self-training using large-scale image-text data underscore the potential of leveraging linguistic information in visual tasks. The successful application of this approach hints at future explorations where more nuanced linguistic cues could be integrated, potentially improving fine-grained visual understanding tasks further.

Proposed future developments include:

  • Scaling up the pre-training datasets beyond the currently used 27 million pairs to explore the limits of the model's scalability.
  • Extending the model to handle even more complex visual tasks, such as video understanding and temporal object grounding.
  • Investigating the model's performance across other low-resource settings and diverse languages to generalize its applicability.

In summary, "Grounded Language-Image Pre-training" presents a significant step forward in unified visual-linguistic models, offering scalable and adaptable models that achieve impressive performance across varying tasks with minimal supervision. This work potentially paves the way for new research directions in AI and computer vision, emphasizing the symbiotic relationship between linguistic and visual data.
