Emergent Mind

Detection-Oriented Image-Text Pretraining for Open-Vocabulary Detection

(2310.00161)
Published Sep 29, 2023 in cs.CV, cs.AI, and cs.LG

Abstract

We present a new open-vocabulary detection approach based on detection-oriented image-text pretraining to bridge the gap between image-level pretraining and open-vocabulary object detection. At the pretraining phase, we replace the commonly used classification architecture with the detector architecture, which better serves the region-level recognition needs of detection by enabling the detector heads to learn from noisy image-text pairs. Using only standard contrastive loss and no pseudo-labeling, our approach is a simple yet effective extension of the contrastive learning method to learn emergent object-semantic cues. In addition, we propose a shifted-window learning approach upon window attention to make the backbone representation more robust, translation-invariant, and less biased by the window pattern. On the popular LVIS open-vocabulary detection benchmark, our approach sets a new state of the art of 40.4 mask AP$_r$ using the common ViT-L backbone, significantly outperforming the best existing approach by +6.5 mask AP$_r$ at system level. On the COCO benchmark, we achieve very competitive 40.8 novel AP without pseudo labeling or weak supervision. In addition, we evaluate our approach on the transfer detection setup, where ours outperforms the baseline significantly. Visualization reveals emerging object locality from the pretraining recipes compared to the baseline. Code and models will be publicly released.

Detection-oriented pretraining: DITO trains detector components, including the FPN and detection heads, during pretraining for improved region-level performance.

Overview

  • The paper introduces Detection-Oriented Image-Text Pretraining (DITO), a new method for improving open-vocabulary detection by incorporating region-level object semantics during the pretraining phase.

  • It proposes Shifted-Window Learning (SWL), a novel technique to enhance the robustness and translation-invariance of vision transformer (ViT) backbone representations by modifying window attention mechanisms.

  • The DITO approach demonstrates substantial improvements over previous methods on LVIS and COCO benchmarks, achieving state-of-the-art performance metrics without the need for pseudo-labeling.

Detection-Oriented Image-Text Pretraining for Open-Vocabulary Detection

The paper "Detection-Oriented Image-Text Pretraining for Open-Vocabulary Detection" presents a methodology for enhancing the performance of open-vocabulary detection (OVD). The authors propose a novel approach called Detection-Oriented Image-Text Pretraining (DITO), which aims to bridge the gap between image-level pretraining and the region-level recognition required for effective object detection. By using the detection architecture during the pretraining phase, DITO instills region-level object semantics into the learned representations, leading to marked improvements on open-vocabulary detection tasks.
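The pretraining objective underlying this family of methods is the standard symmetric image-text contrastive (InfoNCE) loss, here applied to embeddings produced by the detector architecture rather than a classification head. Below is a minimal numpy sketch of that loss; it is an illustration of the general CLIP-style objective, not the paper's implementation, and the function name and temperature value are illustrative.

```python
import numpy as np

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    image_emb, text_emb: (batch, dim) arrays; row i of each forms a
    matching image-text pair, all other rows serve as negatives.
    """
    # L2-normalize so dot products are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = image_emb @ text_emb.T / temperature  # (batch, batch)

    def cross_entropy(lg):
        # Numerically stable log-softmax; matching pairs sit on the diagonal.
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

In DITO the image-side embeddings come from the detector heads over noisy web image-text pairs, which is what lets region-level semantics emerge from this otherwise image-level objective.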

Key Contributions

  1. Detection-Oriented Image-Text Pretraining (DITO): The authors replace the standard classification architecture with a detector architecture during the pretraining phase. This allows the detector heads to learn from noisy image-text pairs directly. Standard contrastive loss is applied without the need for pseudo-labeling, making it a straightforward extension of existing contrastive learning methods.
  2. Shifted-Window Learning (SWL): A novel technique is proposed to enhance the robustness and translation-invariance of the backbone representation. This is achieved by modifying window attention mechanisms in the vision transformer (ViT) to mitigate biases introduced by fixed window patterns.
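The core mechanism behind the second contribution is window partitioning with a cyclic shift, so that tokens near a fixed window boundary in one pass fall inside the same window in another. The sketch below shows that partitioning step in numpy, in the spirit of the shifted windows popularized by Swin Transformer; it is a simplified illustration under the assumption of square, evenly dividing windows, and the function name is ours, not the paper's.

```python
import numpy as np

def window_partition(feat, window, shift=0):
    """Split an (H, W, C) feature map into non-overlapping square windows.

    With shift > 0, the map is first cyclically rolled, so window
    boundaries fall in different places and tokens near the old borders
    end up grouped (and can attend) together.
    """
    if shift:
        feat = np.roll(feat, (-shift, -shift), axis=(0, 1))
    H, W, C = feat.shape
    # Reshape into a grid of windows, then flatten the grid dimension.
    feat = feat.reshape(H // window, window, W // window, window, C)
    windows = feat.transpose(0, 2, 1, 3, 4).reshape(-1, window, window, C)
    return windows
```

Comparing the partitions with and without a shift makes the bias concrete: the unshifted call always groups the same tokens, while the shifted call mixes tokens across the old window borders, which is what SWL exploits to reduce the window-pattern bias of the backbone features.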

Experimental Results

The approach was empirically validated on the LVIS and COCO open-vocabulary detection benchmarks, demonstrating substantial improvements over previous methods. Notably, DITO achieved:

  • LVIS Benchmark: The method achieved a state-of-the-art mask AP$_r$ of 40.4 using the ViT-L backbone, which outperforms the best existing approach by +6.5 mask AP$_r$. In the setting with external box annotations, it reached 45.8 box AP$_r$, surpassing the previous best by +12.5 points.
  • COCO Benchmark: DITO demonstrated a competitive 40.8 novel AP without pseudo-labeling or weak supervision. When trained with additional box annotations, DITO achieved 46.1 novel AP, showcasing its adaptability and performance in different settings.

Implications

Practical Implications

The proposed DITO framework enhances the generalization capability of OVD models by leveraging detection-sensitive representations. This is particularly important for applications that require a high degree of accuracy and reliability in object detection, such as autonomous driving, video surveillance, and image search. Removing complex pseudo-labeling steps and relying only on noisy image-text data makes the approach scalable and practical for real-world applications.

Theoretical Implications

From a theoretical perspective, the use of detection-oriented pretraining reshapes the standard understanding of how image-text representations can be optimized for region-level tasks. The introduction of SWL within the vision transformer architecture adds another layer of robustness to the feature extraction process, potentially inspiring further innovations in transformer-based architectures for computer vision tasks.

Future Directions

The results of this study open several avenues for future research:

  1. Extension to Other Architectures: The applicability of detection-oriented pretraining and shifted-window learning could be extended to other backbone architectures beyond ViT. Evaluating the robustness of these methods across different model structures would be an important area of exploration.
  2. Scaling the Pretraining Dataset: Further research could investigate the impact of significantly larger and more diverse pretraining datasets to understand the scalability and limits of DITO.
  3. Fine-Grained Object Detection: Applying this pretraining methodology to tasks that require fine-grained object detection and segmentation can potentially reveal insights about the capabilities of models pretrained in a detection-oriented manner.

In conclusion, this paper introduces a rigorous and thoughtfully designed approach that significantly advances open-vocabulary detection. Detection-oriented pretraining and shifted-window learning jointly improve both the learning process and the final detection performance, making this a noteworthy contribution to computer vision and machine learning. The insights and experimental validation suggest strong potential for broad adoption and adaptation in future research.
