GIT: A Generative Image-to-text Transformer for Vision and Language

Published 27 May 2022 in cs.CV | (2205.14100v5)

Abstract: In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering. While generative models provide a consistent network architecture between pre-training and fine-tuning, existing work typically contains complex structures (uni/multi-modal encoder/decoder) and depends on external modules such as object detectors/taggers and optical character recognition (OCR). In GIT, we simplify the architecture as one image encoder and one text decoder under a single language modeling task. We also scale up the pre-training data and the model size to boost the model performance. Without bells and whistles, our GIT establishes new state of the arts on 12 challenging benchmarks with a large margin. For instance, our model surpasses the human performance for the first time on TextCaps (138.2 vs. 125.5 in CIDEr). Furthermore, we present a new scheme of generation-based image classification and scene text recognition, achieving decent performance on standard benchmarks. Codes are released at https://github.com/microsoft/GenerativeImage2Text.

Summary

The paper introduces a unified generative transformer that streamlines vision-language tasks using a single image encoder and text decoder, eliminating the need for external modules.
It leverages a large-scale pre-training dataset of 0.8 billion image-text pairs to achieve impressive CIDEr scores on benchmarks such as TextCaps (138.2) and COCO (148.8).
Performance improves with both model size and pre-training data, and the single language modeling objective yields new state-of-the-art results on 12 benchmarks.

Overview of Generative Image-to-Text Transformer (GIT)

This paper introduces the Generative Image-to-text Transformer (GIT), a unified architecture for vision-language tasks such as image/video captioning and question answering (QA). GIT uses a single image encoder and a single text decoder, avoiding the complex encoder/decoder structures of earlier work and their dependence on external modules such as object detectors and OCR. The whole model is trained under a single language modeling task.
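One way to picture how a single text decoder can condition on image tokens is through the attention mask. The sketch below is illustrative only: the summary does not specify the mask, so the layout here (image tokens attend to each other, text tokens attend to all image tokens and to earlier text tokens) is an assumption, and `seq2seq_attention_mask` is a hypothetical helper name.

```python
def seq2seq_attention_mask(num_image_tokens: int, num_text_tokens: int) -> list[list[bool]]:
    """Build a mask where mask[i][j] is True when position i may attend to position j.

    Assumed layout: image tokens come first, followed by text tokens.
    """
    total = num_image_tokens + num_text_tokens
    mask = []
    for i in range(total):
        row = []
        for j in range(total):
            if j < num_image_tokens:
                # Every position (image or text) can see all image tokens.
                allowed = True
            elif i < num_image_tokens:
                # Image tokens do not attend to text tokens.
                allowed = False
            else:
                # Text tokens see themselves and earlier text tokens only.
                allowed = j <= i
            row.append(allowed)
        mask.append(row)
    return mask


mask = seq2seq_attention_mask(num_image_tokens=2, num_text_tokens=3)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

With a mask of this shape, each text position is predicted from the image and the preceding text only, which is what lets captioning and QA share one language modeling objective.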

Performance and Methodology

GIT sets a new state of the art on 12 benchmarks. For example, it surpasses human performance on TextCaps, with a CIDEr score of 138.2 against the human score of 125.5. The result is notable given the model's relative simplicity: one image encoder and one text decoder cover a diverse range of image and video tasks.

Results are strong across several captioning datasets: on COCO the CIDEr score reaches 148.8, and on VizWiz it reaches 114.4. These results indicate that the model generalizes well across different contexts. GIT also extends to video captioning by encoding multiple sampled frames.
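The multi-frame extension can be sketched in two steps: pick a few frames spread across the clip, then join their tokens into one sequence for the decoder. This is a minimal sketch under stated assumptions: the uniform sampling rule, the frame-position tag (standing in for some form of temporal embedding), and the function names `sample_frame_indices` and `concat_frame_tokens` are illustrative and not taken from the summary.

```python
def sample_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick num_samples frame indices spread evenly over a clip."""
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    if num_samples >= total_frames:
        # Short clip: use every frame.
        return list(range(total_frames))
    # Take the centre of each of num_samples equal-length segments.
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


def concat_frame_tokens(frame_tokens: list[list[str]]) -> list[tuple[int, str]]:
    """Flatten per-frame token lists into one sequence.

    Each token is tagged with its frame position so that temporal order
    is still recoverable after concatenation (an assumed stand-in for a
    temporal embedding).
    """
    return [
        (frame_pos, token)
        for frame_pos, tokens in enumerate(frame_tokens)
        for token in tokens
    ]


indices = sample_frame_indices(total_frames=60, num_samples=6)
sequence = concat_frame_tokens([["a", "b"], ["c"]])
```

The concatenated sequence then plays the same role as the token sequence of a single image, so the text decoder itself needs no change.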

Data and Architecture

GIT is pre-trained on a large-scale dataset of 0.8 billion image-text pairs. The image encoder is a Swin-like vision transformer pre-trained with a contrastive task, which removes the need for additional object detection modules.

Pre-training uses a language modeling (LM) loss, which is more efficient than typical Masked Language Modeling (MLM): every token contributes to the loss at each iteration, whereas MLM predicts only the masked subset (commonly around 15% of tokens). GIT's generative formulation also lets it predict image labels directly as text, giving a generation-based approach to image classification.
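A language modeling loss of this kind is the average cross-entropy of the correct next token at each decoding step. The sketch below uses plain Python lists in place of model outputs; the function name `language_modeling_loss` and the toy probabilities are illustrative assumptions, not the paper's implementation.

```python
import math


def language_modeling_loss(step_probs: list[list[float]], target_ids: list[int]) -> float:
    """Average negative log-likelihood of the target tokens.

    step_probs[i] is the predicted distribution over the vocabulary at
    step i, conditioned (in a real model) on the image and on the
    tokens before step i. target_ids[i] is the correct token at step i.
    """
    if not target_ids or len(step_probs) != len(target_ids):
        raise ValueError("need one distribution per target token")
    total = 0.0
    for probs, target in zip(step_probs, target_ids):
        # Cross-entropy with a one-hot target reduces to -log p(target).
        total += -math.log(probs[target])
    return total / len(target_ids)


# Toy example: vocabulary of 2 tokens, caption of 2 tokens.
loss = language_modeling_loss([[0.5, 0.5], [0.25, 0.75]], [0, 1])
```

Every position in the caption contributes a term to the sum, which is the source of the efficiency advantage over objectives that score only a masked subset of tokens.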

Analysis of Model and Data Scaling

The analysis shows that both increasing model size and scaling up pre-training datasets significantly improve task performance, especially in scene-text-related QA tasks. It also reveals that a strong image encoder, pre-trained with contrastive methods, crucially impacts the overall VL performance.

Implications and Future Directions

This research underscores the efficacy of generative models in unified vision-language tasks, emphasizing the importance of scalable data and model architectures. The results suggest that a simplified model structure can achieve competitive and even superior performance on complex tasks with appropriate scaling.

The paper opens avenues for further work on generative models, in particular extending GIT to incorporate text-only data and thereby strengthen the text decoder. Future work may also explore in-context learning and control over generated outputs, both of which remain challenging in the current framework.

Conclusion

GIT sets a new standard in vision-language modeling by replacing complex task-specific architectures with a simple yet effective generative model. Its performance across a wide range of benchmarks illustrates the potential of scaling both data and model size. The methodology and insights from this work are likely to inform future generative models for vision and language tasks.
