
Tokenize Anything via Prompting

(2312.09128)
Published Dec 14, 2023 in cs.CV

Abstract

We present a unified, promptable model capable of simultaneously segmenting, recognizing, and captioning anything. Unlike SAM, we aim to build a versatile region representation in the wild via visual prompting. To achieve this, we train a generalizable model with massive segmentation masks, e.g., SA-1B masks, and semantic priors from a pre-trained CLIP model with 5 billion parameters. Specifically, we construct a promptable image decoder by adding a semantic token to each mask token. The semantic token is responsible for learning the semantic priors in a predefined concept space. Through joint optimization of segmentation on mask tokens and concept prediction on semantic tokens, our model exhibits strong regional recognition and localization capabilities. For example, an additional 38M-parameter causal text decoder trained from scratch sets a new record with a CIDEr score of 150.7 on the Visual Genome region captioning task. We believe this model can be a versatile region-level image tokenizer, capable of encoding general-purpose region context for a broad range of perception tasks. Code and models are available at https://github.com/baaivision/tokenize-anything.

TAP enhances SAM's architecture with a semantically enriched promptable image decoder and a fine-tuned causal text decoder for region captioning.

Overview

  • The TAP model is a cutting-edge computer vision framework that unifies segmentation, recognition, and captioning for a holistic interpretation of visual content.

  • The model is built using 1.1 billion segmentation masks from the SA-1B dataset and leverages a pre-trained CLIP model with 5 billion parameters for semantic learning.

  • With a causal text decoder, the TAP model achieves a CIDEr score of 150.7 on the Visual Genome region captioning task, surpassing previous benchmarks.

  • Through fine-tuned language modeling, the TAP model turns its visual semantic tokens into contextually relevant region descriptions.

  • This work paves the way for innovative applications in vision and language tasks, enabling models to 'see' and 'speak' about the visual world with high accuracy.

TAP Model: A Breakthrough in Visual Perception

The Unification of Segmentation, Recognition, and Captioning

The Tokenize Anything via Prompting (TAP) model can segment, recognize, and caption arbitrary visual content within a single promptable framework. It integrates functions that previous systems handled as separate, specialized models, enabling accurate localization and understanding of arbitrary image regions and marking a significant step toward a more holistic interpretation of visual scenes.

Promptable Tokenization and Semantic Learning

At its core, the model combines massive segmentation supervision, 1.1 billion masks from the SA-1B dataset, with semantic priors distilled from a pre-trained CLIP model with 5 billion parameters. The promptable image decoder pairs each mask token with a semantic token, so the model learns to associate pixel-precise image segments with concepts in a predefined concept space. Jointly optimizing segmentation on the mask tokens and concept prediction on the semantic tokens lets the model both delineate visual elements precisely and attach meaning to them.
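
The PyTorch sketch below illustrates this paired mask-token/semantic-token idea and the joint loss under stated assumptions: class names, layer counts, and dimensions are illustrative placeholders, not the paper's configuration (the official code is at https://github.com/baaivision/tokenize-anything).

```python
# Minimal sketch: pair each mask token with a semantic token in a promptable
# decoder, then supervise masks and CLIP-space concept prediction jointly.
# All module names and hyper-parameters here are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptableDecoderSketch(nn.Module):
    def __init__(self, dim=256, num_masks=4, concept_dim=1024):
        super().__init__()
        # One mask token and one semantic token per predicted mask.
        self.mask_tokens = nn.Parameter(torch.randn(num_masks, dim))
        self.semantic_tokens = nn.Parameter(torch.randn(num_masks, dim))
        # Stand-in for a SAM-style lightweight transformer decoder.
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.mask_head = nn.Linear(dim, dim)             # mask embeddings
        self.concept_head = nn.Linear(dim, concept_dim)  # project into CLIP concept space

    def forward(self, image_embed, prompt_embed):
        # image_embed: (B, HW, dim) flattened image features
        # prompt_embed: (B, P, dim) encoded point/box prompts
        B = image_embed.size(0)
        tokens = torch.cat([self.mask_tokens, self.semantic_tokens], dim=0)
        queries = torch.cat([tokens.unsqueeze(0).expand(B, -1, -1), prompt_embed], dim=1)
        out = self.decoder(queries, image_embed)
        n = self.mask_tokens.size(0)
        mask_out, sem_out = out[:, :n], out[:, n:2 * n]
        # Mask logits: dot product between mask embeddings and image features.
        mask_logits = torch.einsum("bnd,bld->bnl", self.mask_head(mask_out), image_embed)
        # Semantic embeddings, later scored against a frozen CLIP text bank.
        concept_embed = F.normalize(self.concept_head(sem_out), dim=-1)
        return mask_logits, concept_embed

def joint_loss(mask_logits, gt_masks, concept_embed, clip_text_bank, gt_concepts):
    # Segmentation supervised on mask tokens, concept prediction on semantic tokens.
    seg = F.binary_cross_entropy_with_logits(mask_logits, gt_masks)
    logits = concept_embed @ F.normalize(clip_text_bank, dim=-1).T / 0.07
    cls = F.cross_entropy(logits.flatten(0, 1), gt_concepts.flatten())
    return seg + cls
```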

Breaking Records in Captioning

One highlight is the model's captioning performance. Its 38M-parameter causal text decoder, trained from scratch, sets a new record on the Visual Genome region captioning task with a CIDEr score of 150.7, a result all the more notable given how small the decoder is compared with prior captioning models.

Versatile Foundations for Broad Visual Tasks

Through fine-tuned language modeling, the semantic tokens derived from visual prompts directly condition the generation of descriptive captions. The result is a model that is not only skilled at understanding regions within an image but also adept at describing them in contextually relevant, detailed language.
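
A minimal sketch of that flow, assuming a greedy-decoding causal decoder conditioned on a projected semantic token; the class name, vocabulary size, and layer counts are placeholders rather than the actual 38M-parameter configuration:

```python
# Hypothetical sketch: project the semantic token into the text decoder's
# embedding space and use it as a prefix for autoregressive caption generation.
import torch
import torch.nn as nn

class RegionCaptionerSketch(nn.Module):
    def __init__(self, sem_dim=256, vocab_size=32000, dim=512, layers=6):
        super().__init__()
        self.prefix_proj = nn.Linear(sem_dim, dim)   # semantic token -> text space
        self.tok_embed = nn.Embedding(vocab_size, dim)
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=layers)
        self.lm_head = nn.Linear(dim, vocab_size)

    @torch.no_grad()
    def generate(self, semantic_token, bos_id=1, eos_id=2, max_len=20):
        # semantic_token: (B, sem_dim) produced by the promptable image decoder.
        B = semantic_token.size(0)
        prefix = self.prefix_proj(semantic_token).unsqueeze(1)   # (B, 1, dim)
        ids = torch.full((B, 1), bos_id, dtype=torch.long)
        for _ in range(max_len):
            x = torch.cat([prefix, self.tok_embed(ids)], dim=1)
            L = x.size(1)
            causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
            h = self.blocks(x, mask=causal)                      # causal self-attention
            next_id = self.lm_head(h[:, -1]).argmax(-1, keepdim=True)
            ids = torch.cat([ids, next_id], dim=1)
            if (next_id == eos_id).all():
                break
        return ids  # caption token ids; decode with a tokenizer of your choice
```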

In essence, the model proposes a foundation where vision and language tasks converge: a versatile region-level image tokenizer that encodes general-purpose region context for a broad range of perception tasks. It opens avenues for far-reaching applications in vision-and-language work and lays the groundwork for the next wave of vision-language models that can 'see' and 'speak' about the visual world with greater accuracy and depth.
