
Tokenize Anything via Prompting

(2312.09128)
Published Dec 14, 2023 in cs.CV

Abstract

We present a unified, promptable model capable of simultaneously segmenting, recognizing, and captioning anything. Unlike SAM, we aim to build a versatile region representation in the wild via visual prompting. To achieve this, we train a generalizable model with massive segmentation masks, e.g., SA-1B masks, and semantic priors from a pre-trained CLIP model with 5 billion parameters. Specifically, we construct a promptable image decoder by adding a semantic token to each mask token. The semantic token is responsible for learning the semantic priors in a predefined concept space. Through joint optimization of segmentation on mask tokens and concept prediction on semantic tokens, our model exhibits strong regional recognition and localization capabilities. For example, an additional 38M-parameter causal text decoder trained from scratch sets a new record with a CIDEr score of 150.7 on the Visual Genome region captioning task. We believe this model can be a versatile region-level image tokenizer, capable of encoding general-purpose region context for a broad range of perception tasks. Code and models are available at https://github.com/baaivision/tokenize-anything.

TAP enhances SAM's architecture with a semantically enriched promptable image decoder and a fine-tuned causal text decoder for region captioning.

Overview

  • The TAP model is a cutting-edge computer vision framework that unifies segmentation, recognition, and captioning for a holistic interpretation of visual content.

  • The model is built using 1.1 billion segmentation masks from the SA-1B dataset and leverages a pre-trained CLIP model with 5 billion parameters for semantic learning.

  • With a causal text decoder, the TAP model achieves a CIDEr score of 150.7 on the Visual Genome region captioning task, surpassing previous benchmarks.

  • Through fine-tuned language modeling, the TAP model turns its visual semantic tokens into contextually relevant region descriptions.

  • This work paves the way for innovative applications in vision and language tasks, enabling models to 'see' and 'speak' about the visual world with high accuracy.

TAP Model: A Breakthrough in Visual Perception

The Unification of Segmentation, Recognition, and Captioning

The Tokenize Anything via Prompting (TAP) model can segment, recognize, and caption arbitrary visual content within a single promptable framework. It integrates functions that previous systems handled as separate, specialized models, enabling accurate localization and understanding of arbitrary image regions and marking a significant step toward a more holistic interpretation of visual scenes.

Promptable Tokenization and Semantic Learning

At its core, the model combines massive segmentation supervision, 1.1 billion masks from the SA-1B dataset, with semantic priors distilled from a pre-trained CLIP model with 5 billion parameters. The promptable image decoder pairs each mask token with a semantic token, so the model learns to associate pixel-precise image segments with concepts in a predefined concept space. Jointly optimizing segmentation on the mask tokens and concept prediction on the semantic tokens lets the model both delineate visual elements precisely and attach meaning to them.
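
The PyTorch sketch below illustrates this paired mask-token/semantic-token idea and the joint loss under stated assumptions: class names, layer counts, and dimensions are illustrative placeholders, not the paper's configuration (the official code is at https://github.com/baaivision/tokenize-anything).

```python
# Minimal sketch: pair each mask token with a semantic token in a promptable
# decoder, then supervise masks and CLIP-space concept prediction jointly.
# All module names and hyper-parameters here are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptableDecoderSketch(nn.Module):
    def __init__(self, dim=256, num_masks=4, concept_dim=1024):
        super().__init__()
        # One mask token and one semantic token per predicted mask.
        self.mask_tokens = nn.Parameter(torch.randn(num_masks, dim))
        self.semantic_tokens = nn.Parameter(torch.randn(num_masks, dim))
        # Stand-in for a SAM-style lightweight transformer decoder.
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.mask_head = nn.Linear(dim, dim)             # mask embeddings
        self.concept_head = nn.Linear(dim, concept_dim)  # project into CLIP concept space

    def forward(self, image_embed, prompt_embed):
        # image_embed: (B, HW, dim) flattened image features
        # prompt_embed: (B, P, dim) encoded point/box prompts
        B = image_embed.size(0)
        tokens = torch.cat([self.mask_tokens, self.semantic_tokens], dim=0)
        queries = torch.cat([tokens.unsqueeze(0).expand(B, -1, -1), prompt_embed], dim=1)
        out = self.decoder(queries, image_embed)
        n = self.mask_tokens.size(0)
        mask_out, sem_out = out[:, :n], out[:, n:2 * n]
        # Mask logits: dot product between mask embeddings and image features.
        mask_logits = torch.einsum("bnd,bld->bnl", self.mask_head(mask_out), image_embed)
        # Semantic embeddings, later scored against a frozen CLIP text bank.
        concept_embed = F.normalize(self.concept_head(sem_out), dim=-1)
        return mask_logits, concept_embed

def joint_loss(mask_logits, gt_masks, concept_embed, clip_text_bank, gt_concepts):
    # Segmentation supervised on mask tokens, concept prediction on semantic tokens.
    seg = F.binary_cross_entropy_with_logits(mask_logits, gt_masks)
    logits = concept_embed @ F.normalize(clip_text_bank, dim=-1).T / 0.07
    cls = F.cross_entropy(logits.flatten(0, 1), gt_concepts.flatten())
    return seg + cls
```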

Breaking Records in Captioning

One highlight is the model's captioning performance. Its 38M-parameter causal text decoder, trained from scratch, sets a new record on the Visual Genome region captioning task with a CIDEr score of 150.7, a result all the more notable given how small the decoder is compared with prior captioning models.

Versatile Foundations for Broad Visual Tasks

Through fine-tuned language modeling, the semantic tokens derived from visual prompts directly condition the generation of descriptive captions. The result is a model that is not only skilled at understanding regions within an image but also adept at describing them in contextually relevant, detailed language.
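
A minimal sketch of that flow, assuming a greedy-decoding causal decoder conditioned on a projected semantic token; the class name, vocabulary size, and layer counts are placeholders rather than the actual 38M-parameter configuration:

```python
# Hypothetical sketch: project the semantic token into the text decoder's
# embedding space and use it as a prefix for autoregressive caption generation.
import torch
import torch.nn as nn

class RegionCaptionerSketch(nn.Module):
    def __init__(self, sem_dim=256, vocab_size=32000, dim=512, layers=6):
        super().__init__()
        self.prefix_proj = nn.Linear(sem_dim, dim)   # semantic token -> text space
        self.tok_embed = nn.Embedding(vocab_size, dim)
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=layers)
        self.lm_head = nn.Linear(dim, vocab_size)

    @torch.no_grad()
    def generate(self, semantic_token, bos_id=1, eos_id=2, max_len=20):
        # semantic_token: (B, sem_dim) produced by the promptable image decoder.
        B = semantic_token.size(0)
        prefix = self.prefix_proj(semantic_token).unsqueeze(1)   # (B, 1, dim)
        ids = torch.full((B, 1), bos_id, dtype=torch.long)
        for _ in range(max_len):
            x = torch.cat([prefix, self.tok_embed(ids)], dim=1)
            L = x.size(1)
            causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
            h = self.blocks(x, mask=causal)                      # causal self-attention
            next_id = self.lm_head(h[:, -1]).argmax(-1, keepdim=True)
            ids = torch.cat([ids, next_id], dim=1)
            if (next_id == eos_id).all():
                break
        return ids  # caption token ids; decode with a tokenizer of your choice
```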

In essence, the model proposes a foundation where vision and language tasks converge: a versatile region-level image tokenizer that encodes general-purpose region context for a broad range of perception tasks. It opens avenues for far-reaching applications in vision-and-language work and lays the groundwork for the next wave of vision-language models that can 'see' and 'speak' about the visual world with greater accuracy and depth.
