Emergent Mind

Detection-Oriented Image-Text Pretraining for Open-Vocabulary Detection

(2310.00161)
Published Sep 29, 2023 in cs.CV, cs.AI, and cs.LG

Abstract

We present a new open-vocabulary detection approach based on detection-oriented image-text pretraining to bridge the gap between image-level pretraining and open-vocabulary object detection. At the pretraining phase, we replace the commonly used classification architecture with the detector architecture, which better serves the region-level recognition needs of detection by enabling the detector heads to learn from noisy image-text pairs. Using only standard contrastive loss and no pseudo-labeling, our approach is a simple yet effective extension of the contrastive learning method to learn emergent object-semantic cues. In addition, we propose a shifted-window learning approach upon window attention to make the backbone representation more robust, translation-invariant, and less biased by the window pattern. On the popular LVIS open-vocabulary detection benchmark, our approach sets a new state of the art of 40.4 mask AP$_r$ using the common ViT-L backbone, significantly outperforming the best existing approach by +6.5 mask AP$_r$ at system level. On the COCO benchmark, we achieve very competitive 40.8 novel AP without pseudo labeling or weak supervision. In addition, we evaluate our approach on the transfer detection setup, where ours outperforms the baseline significantly. Visualization reveals emerging object locality from the pretraining recipes compared to the baseline. Code and models will be publicly released.

Detection-oriented pretraining: DITO trains detector components, including the FPN and detection heads, during pretraining for improved region-level performance.

Overview

  • The paper introduces Detection-Oriented Image-Text Pretraining (DITO), a new method for improving open-vocabulary detection by incorporating region-level object semantics during the pretraining phase.

  • It proposes Shifted-Window Learning (SWL), a novel technique to enhance the robustness and translation-invariance of vision transformer (ViT) backbone representations by modifying window attention mechanisms.

  • The DITO approach demonstrates substantial improvements over previous methods on LVIS and COCO benchmarks, achieving state-of-the-art performance metrics without the need for pseudo-labeling.

Detection-Oriented Image-Text Pretraining for Open-Vocabulary Detection

The paper "Detection-Oriented Image-Text Pretraining for Open-Vocabulary Detection" presents a methodology for enhancing the performance of open-vocabulary detection (OVD). The authors propose a novel approach called Detection-Oriented Image-Text Pretraining (DITO), which aims to bridge the gap between image-level pretraining and the region-level recognition required for effective object detection. By using the detection architecture during the pretraining phase, DITO instills region-level object semantics into the learned representations, leading to marked improvements on open-vocabulary detection tasks.
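The pretraining objective underlying this family of methods is the standard symmetric image-text contrastive (InfoNCE) loss, here applied to embeddings produced by the detector architecture rather than a classification head. Below is a minimal numpy sketch of that loss; it is an illustration of the general CLIP-style objective, not the paper's implementation, and the function name and temperature value are illustrative.

```python
import numpy as np

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    image_emb, text_emb: (batch, dim) arrays; row i of each forms a
    matching image-text pair, all other rows serve as negatives.
    """
    # L2-normalize so dot products are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = image_emb @ text_emb.T / temperature  # (batch, batch)

    def cross_entropy(lg):
        # Numerically stable log-softmax; matching pairs sit on the diagonal.
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

In DITO the image-side embeddings come from the detector heads over noisy web image-text pairs, which is what lets region-level semantics emerge from this otherwise image-level objective.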

Key Contributions

  1. Detection-Oriented Image-Text Pretraining (DITO): The authors replace the standard classification architecture with a detector architecture during the pretraining phase. This allows the detector heads to learn from noisy image-text pairs directly. Standard contrastive loss is applied without the need for pseudo-labeling, making it a straightforward extension of existing contrastive learning methods.
  2. Shifted-Window Learning (SWL): A novel technique is proposed to enhance the robustness and translation-invariance of the backbone representation. This is achieved by modifying window attention mechanisms in the vision transformer (ViT) to mitigate biases introduced by fixed window patterns.
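The core mechanism behind the second contribution is window partitioning with a cyclic shift, so that tokens near a fixed window boundary in one pass fall inside the same window in another. The sketch below shows that partitioning step in numpy, in the spirit of the shifted windows popularized by Swin Transformer; it is a simplified illustration under the assumption of square, evenly dividing windows, and the function name is ours, not the paper's.

```python
import numpy as np

def window_partition(feat, window, shift=0):
    """Split an (H, W, C) feature map into non-overlapping square windows.

    With shift > 0, the map is first cyclically rolled, so window
    boundaries fall in different places and tokens near the old borders
    end up grouped (and can attend) together.
    """
    if shift:
        feat = np.roll(feat, (-shift, -shift), axis=(0, 1))
    H, W, C = feat.shape
    # Reshape into a grid of windows, then flatten the grid dimension.
    feat = feat.reshape(H // window, window, W // window, window, C)
    windows = feat.transpose(0, 2, 1, 3, 4).reshape(-1, window, window, C)
    return windows
```

Comparing the partitions with and without a shift makes the bias concrete: the unshifted call always groups the same tokens, while the shifted call mixes tokens across the old window borders, which is what SWL exploits to reduce the window-pattern bias of the backbone features.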

Experimental Results

The approach was empirically validated on the LVIS and COCO open-vocabulary detection benchmarks, demonstrating substantial improvements over previous methods. Notably, DITO achieved:

  • LVIS Benchmark: The method achieved a state-of-the-art mask AP$_r$ of 40.4 using the ViT-L backbone, which outperforms the best existing approach by +6.5 mask AP$_r$. In the setting with external box annotations, it reached 45.8 box AP$_r$, surpassing the previous best by +12.5 points.
  • COCO Benchmark: DITO demonstrated a competitive 40.8 novel AP without pseudo-labeling or weak supervision. When trained with additional box annotations, DITO achieved 46.1 novel AP, showcasing its adaptability and performance in different settings.

Implications

Practical Implications

The proposed DITO framework enhances the generalization capability of OVD models by leveraging detection-sensitive representations. This is particularly important for applications that require a high degree of accuracy and reliability in object detection, such as autonomous driving, video surveillance, and image search. Removing complex pseudo-labeling steps and relying only on noisy image-text data makes the approach scalable and practical for real-world applications.

Theoretical Implications

From a theoretical perspective, the use of detection-oriented pretraining reshapes the standard understanding of how image-text representations can be optimized for region-level tasks. The introduction of SWL within the vision transformer architecture adds another layer of robustness to the feature extraction process, potentially inspiring further innovations in transformer-based architectures for computer vision tasks.

Future Directions

The results of this study open several avenues for future research:

  1. Extension to Other Architectures: The applicability of detection-oriented pretraining and shifted-window learning could be extended to other backbone architectures beyond ViT. Evaluating the robustness of these methods across different model structures would be an important area of exploration.
  2. Scaling the Pretraining Dataset: Further research could investigate the impact of significantly larger and more diverse pretraining datasets to understand the scalability and limits of DITO.
  3. Fine-Grained Object Detection: Applying this pretraining methodology to tasks that require fine-grained object detection and segmentation can potentially reveal insights about the capabilities of models pretrained in a detection-oriented manner.

In conclusion, this paper introduces a rigorous and thoughtfully designed approach that significantly advances open-vocabulary detection. Detection-oriented pretraining and shifted-window learning jointly improve both the learning process and the final detection performance, making this a noteworthy contribution to computer vision and machine learning. The insights and experimental validation suggest strong potential for broad adoption and adaptation in future research.
