UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling

Published 23 Nov 2021 in cs.CV | (2111.12085v2)

Abstract: We propose UniTAB that Unifies Text And Box outputs for grounded vision-language (VL) modeling. Grounded VL tasks such as grounded captioning require the model to generate a text description and align predicted words with object regions. To achieve this, models must generate desired text and box outputs together, and meanwhile indicate the alignments between words and boxes. In contrast to existing solutions that use multiple separate modules for different outputs, UniTAB represents both text and box outputs with a shared token sequence, and introduces a special <obj> token to naturally indicate word-box alignments in the sequence. UniTAB thus could provide a more comprehensive and interpretable image description, by freely grounding generated words to object regions. On grounded captioning, UniTAB presents a simpler solution with a single output head, and significantly outperforms state of the art in both grounding and captioning evaluations. On general VL tasks that have different desired output formats (i.e., text, box, or their combination), UniTAB with a single network achieves better or comparable performance than task-specific state of the art. Experiments cover 7 VL benchmarks, including grounded captioning, visual grounding, image captioning, and visual question answering. Furthermore, UniTAB's unified multi-task network and the task-agnostic output sequence design make the model parameter efficient and generalizable to new tasks.

Authors (8)

Citations (102)

Summary

The paper introduces a unified sequence generation framework that integrates text and box outputs for grounded vision-language tasks.
It employs a specialized token to align image regions with descriptive text, markedly improving grounding F1 and CIDEr scores.
UniTAB’s streamlined, parameter-efficient architecture enables effective multi-task training and has broad real-world VL applications.

The paper presents UniTAB, a framework for vision-language (VL) modeling that handles text and box outputs together by representing both within a single sequence generation task. Unlike earlier models, which split text generation and box prediction into separate modules, UniTAB uses one shared token sequence for both outputs, giving a direct mechanism for grounding language descriptions in visual content.

UniTAB's key architectural idea is a special <obj> token that marks the association between words and object regions in the image. This lets the model perform grounded captioning, where the generated description must align with specific object regions. The same task-agnostic output token sequence also covers other VL tasks, such as visual grounding and visual question answering (VQA).
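As a rough illustration of the shared token sequence described above, the following Python sketch serializes a caption with optional boxes into one flat token list and then recovers the word-box alignments from it. It is not the authors' implementation. The number of coordinate bins (`NUM_BINS`), the `<bin_k>` token spelling, the closing `</obj>` marker, and the order "words, then four box tokens" are all assumptions made for this example.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels
Phrase = Tuple[str, Optional[Box]]       # caption fragment plus an optional box

NUM_BINS = 200  # assumed number of coordinate bins, chosen for illustration


def quantize_box(box: Box, width: int, height: int,
                 num_bins: int = NUM_BINS) -> List[str]:
    """Map pixel coordinates to discrete box tokens such as '<bin_57>'."""
    x0, y0, x1, y1 = box
    sizes = (width, height, width, height)
    bins = [min(num_bins - 1, int(v * num_bins / s))
            for v, s in zip((x0, y0, x1, y1), sizes)]
    return [f"<bin_{b}>" for b in bins]


def serialize(phrases: List[Phrase], width: int, height: int) -> List[str]:
    """Flatten text and boxes into a single output token sequence.

    Grounded fragments are wrapped as: <obj> words... 4 box tokens </obj>
    (hypothetical layout, see the assumptions stated above).
    """
    tokens: List[str] = []
    for text, box in phrases:
        words = text.split()
        if box is None:
            tokens.extend(words)
        else:
            tokens.append("<obj>")
            tokens.extend(words)
            tokens.extend(quantize_box(box, width, height))
            tokens.append("</obj>")
    return tokens


def parse(tokens: List[str]) -> Tuple[str, List[Tuple[str, List[int]]]]:
    """Recover the plain caption and the (phrase, box bins) alignments."""
    caption: List[str] = []
    groundings: List[Tuple[str, List[int]]] = []
    span_words: Optional[List[str]] = None  # None means "outside an <obj> span"
    span_bins: List[int] = []
    for tok in tokens:
        if tok == "<obj>":
            span_words, span_bins = [], []
        elif tok == "</obj>":
            if span_words is not None:
                groundings.append((" ".join(span_words), span_bins))
            span_words = None
        elif tok.startswith("<bin_"):
            span_bins.append(int(tok[5:-1]))
        else:
            caption.append(tok)
            if span_words is not None:
                span_words.append(tok)
    return " ".join(caption), groundings


example = [("a man", (64, 48, 320, 432)),
           ("holds", None),
           ("a frisbee", (352, 120, 448, 192))]
sequence = serialize(example, width=640, height=480)
caption, groundings = parse(sequence)
```

Because ungrounded words, grounded words, and box tokens all live in one sequence, a text-only task can simply emit no `<obj>` spans, and a box-only task can emit a span with no surrounding caption, which is one way to read the "task-agnostic" output design.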

Numerical Results and Performance

In evaluations across seven benchmarks, UniTAB shows strong grounding and captioning performance. On the Flickr30k Entities dataset for grounded captioning, CIDEr rises from 62.5 to 69.7 (a gain of 7.2 points) and grounding F1 rises from 8.44 to 12.95 (a gain of 4.51 points). UniTAB also surpasses recent state-of-the-art models, including MDETR, in accuracy on referring expression tasks.
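The summary does not say how referring expression accuracy is scored. A common convention on such benchmarks is to count a predicted box as correct when its intersection-over-union (IoU) with the ground-truth box is at least 0.5. The Python sketch below illustrates that convention only. The function names and the default threshold are assumptions for this example, not details taken from the paper.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(preds: Sequence[Box], targets: Sequence[Box],
                       threshold: float = 0.5) -> float:
    """Fraction of predictions whose IoU with the target meets the threshold.

    The 0.5 default follows a common convention and is an assumption here.
    """
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(iou(p, t) >= threshold for p, t in zip(preds, targets))
    return hits / len(preds)


predictions = [(0, 0, 10, 10), (0, 0, 10, 10)]
ground_truth = [(0, 0, 10, 8), (5, 0, 15, 10)]
accuracy = grounding_accuracy(predictions, ground_truth)
```

In this toy case the first prediction overlaps its target with IoU 0.8 and counts as a hit, while the second reaches only one third and counts as a miss.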

The paper emphasizes parameter efficiency: UniTAB's unified architecture removes the need for task-specific models. This makes the model more efficient and adaptable, which is most evident in its ability to train effectively on multiple VL tasks at once.

Theoretical and Practical Implications

The theoretical contribution of UniTAB lies in its unified approach to VL modeling, which harmonizes disparate outputs into a holistic framework, suggesting a move towards more generalized vision systems. The elimination of multiple task-specific modules leads to a streamlined architecture that is conceptually simpler and potentially more robust against variations in input data across tasks. This architectural advancement could drive further research into more integrated models that require fewer manual adjustments and are adaptable across an even broader spectrum of tasks.

Practically, UniTAB’s versatility in handling different VL tasks without modifications to its core design makes it an attractive candidate for deployment in applications demanding high degrees of flexibility, such as interactive media systems or robotic vision applications. Its grounding capabilities open pathways for generating highly interpretable image descriptions, a critical feature in domains where traceability and explanation of AI decisions are necessary, such as healthcare and autonomous driving.

Future Directions

Looking ahead, the unification approach employed by UniTAB could be extended by integrating additional data modalities or by pre-training the model on broader datasets, similar to trends in language modeling with models like GPT. Further optimizing sequence generation, possibly through more refined sampling techniques or syntactic constraints, might also improve both grounding accuracy and sequence clarity.

In summary, UniTAB sets a vital precedent in the field of grounded vision-language modeling by demonstrating the feasibility and advantages of unifying text and box outputs. Its success across multiple tasks provides a robust platform upon which more advanced, general-purpose vision systems might be developed, and its impact is likely to spur continued exploration into integrated VL systems with expanded capabilities and applications.
