YOLO-World: Real-Time Open-Vocabulary Object Detection

(arXiv:2401.17270)
Published Jan 30, 2024 in cs.CV

Abstract

The You Only Look Once (YOLO) series of detectors have established themselves as efficient and practical tools. However, their reliance on predefined and trained object categories limits their applicability in open scenarios. Addressing this limitation, we introduce YOLO-World, an innovative approach that enhances YOLO with open-vocabulary detection capabilities through vision-language modeling and pre-training on large-scale datasets. Specifically, we propose a new Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN) and region-text contrastive loss to facilitate the interaction between visual and linguistic information. Our method excels in detecting a wide range of objects in a zero-shot manner with high efficiency. On the challenging LVIS dataset, YOLO-World achieves 35.4 AP with 52.0 FPS on V100, which outperforms many state-of-the-art methods in terms of both accuracy and speed. Furthermore, the fine-tuned YOLO-World achieves remarkable performance on several downstream tasks, including object detection and open-vocabulary instance segmentation.

Figure: the grounded YOLO architecture takes text input and uses cross-modality fusion for open-vocabulary object detection.

Overview

  • YOLO-World introduces vision-language modeling to traditional YOLO detectors, enabling detection of objects beyond predefined categories.

  • It demonstrates superior performance, achieving 35.4 AP at 52.0 FPS on an NVIDIA V100 GPU on the LVIS dataset.

  • The RepVL-PAN architecture is integral to YOLO-World, enabling effective fusion of visual and linguistic information.

  • Pre-training is crucial for enhancing open-vocabulary capabilities, with a unique scheme for region-text pairing.

  • YOLO-World outperforms comparable solutions in both accuracy and speed, and shows promise for downstream tasks such as fine-tuned object detection and open-vocabulary instance segmentation.

Unveiling YOLO-World for Open-Vocabulary Object Detection

Enhancing YOLO with Open-Vocabulary Capabilities

The YOLO (You Only Look Once) series has been used extensively in practical scenarios thanks to its efficiency in object detection. YOLO-World advances this line of work by incorporating vision-language modeling into the traditional YOLO detector, allowing it to detect objects beyond predefined categories. Through large-scale pre-training, YOLO-World demonstrates strong accuracy at high speed, reaching an Average Precision (AP) of 35.4 while maintaining 52.0 FPS on an NVIDIA V100 GPU, as measured on the challenging LVIS dataset.

Technical Innovations

A key innovation in YOLO-World is the Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN). This architecture fuses visual and linguistic information, strengthening the interaction between the two modalities. Alongside it, a region-text contrastive loss is introduced to boost open-vocabulary detection. Notably, the model follows the standard YOLO blueprint yet benefits significantly from the injection of a CLIP-based text encoder, which enriches its visual-semantic representations.
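To make these two components concrete, here is a minimal PyTorch sketch of the max-sigmoid text-guided attention that RepVL-PAN applies in its text-guided CSP layers, together with the similarity logits that underlie a region-text contrastive objective. The module structure, the text projection layer, and the logit scale are illustrative assumptions based on the paper's description, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextGuidedAttention(nn.Module):
    """Max-sigmoid text-guided attention (sketch of RepVL-PAN's gating).

    Re-weights each spatial location of an image feature map by its maximum
    similarity to any text embedding in the current vocabulary.
    """

    def __init__(self, feat_dim: int, text_dim: int):
        super().__init__()
        # Project text embeddings into the image feature space (assumption:
        # the exact projection in the released model may differ).
        self.text_proj = nn.Linear(text_dim, feat_dim)

    def forward(self, x: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # x: (B, D, H, W) image features; text_emb: (C, text_dim), C classes.
        b, d, h, w = x.shape
        t = self.text_proj(text_emb)              # (C, D)
        flat = x.flatten(2).transpose(1, 2)       # (B, H*W, D)
        sim = flat @ t.t()                        # (B, H*W, C) similarities
        gate = sim.max(dim=-1).values.sigmoid()   # max over classes, then gate
        return x * gate.view(b, 1, h, w)          # broadcast over channels


def region_text_logits(region_emb: torch.Tensor,
                       text_emb: torch.Tensor,
                       scale: float = 100.0) -> torch.Tensor:
    """Similarity logits for a region-text contrastive objective (sketch).

    region_emb: (N, D) region/object embeddings; text_emb: (C, D).
    Returns (N, C) logits for a cross-entropy-style contrastive loss.
    """
    r = F.normalize(region_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    return scale * (r @ t.t())
```

The intuition behind the gate is that a spatial region lights up if it matches any word in the vocabulary, while the contrastive logits are trained against region-text labels produced by the pre-training scheme described next.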

Pre-training and Precedence Over Existing Solutions

Pre-training plays a pivotal role in YOLO-World's development. The paper presents a training scheme that unifies detection, grounding, and image-text data into region-text pairs, leading to marked improvements in open-vocabulary capability. Pitted against comparable state-of-the-art methods, YOLO-World wins not only on accuracy but also on inference speed, offering roughly a 20x speedup that matters for real-world applications.
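The unification can be pictured as mapping every data source onto a common region-text record. The sketch below is a simplification: the field names and the `propose_regions` helper (standing in for the paper's automatic pseudo-labeling of captions) are assumptions for illustration, not the authors' pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

@dataclass
class RegionTextPair:
    box: Box
    text: str  # noun phrase describing the region

def from_detection(boxes: List[Box], category_ids: List[int],
                   id_to_name: dict) -> List[RegionTextPair]:
    # Detection data: reuse category names as the region texts.
    return [RegionTextPair(b, id_to_name[c]) for b, c in zip(boxes, category_ids)]

def from_grounding(boxes: List[Box], phrases: List[str]) -> List[RegionTextPair]:
    # Grounding data: boxes already come paired with noun phrases.
    return [RegionTextPair(b, p) for b, p in zip(boxes, phrases)]

def from_image_text(caption: str,
                    propose_regions: Callable[[str], Iterable[Tuple[str, Box]]]
                    ) -> List[RegionTextPair]:
    # Image-text data: mine noun phrases from the caption and pseudo-label
    # regions for them; `propose_regions` stands in for that automatic step.
    return [RegionTextPair(b, p) for p, b in propose_regions(caption)]
```

Once all three sources are in this shared format, a single region-text contrastive objective can be applied across the combined corpus.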

Open-vocabulary Detection and Downstream Tasks

YOLO-World transcends the fixed lexicon of traditional object detection, adapting readily to downstream tasks such as fine-tuned object detection and open-vocabulary instance segmentation. Its zero-shot performance reveals strong generalization, and its adaptability is further underscored by the remarkable results it achieves when fine-tuned for instance segmentation.
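This zero-shot flexibility follows the paper's prompt-then-detect deployment style: the user's vocabulary is encoded once, offline, and re-parameterized into the network so that no text encoder runs at inference time. Below is a minimal sketch of the offline step, assuming the OpenAI `clip` package as the text encoder; how the cached embeddings are folded into the detector is model-specific and indicated only in comments.

```python
# Offline "prompt-then-detect" vocabulary encoding (sketch).
import clip   # pip install git+https://github.com/openai/CLIP.git
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)  # CLIP with a text encoder

vocabulary = ["person", "helmet", "red backpack", "traffic cone"]
with torch.no_grad():
    tokens = clip.tokenize(vocabulary).to(device)              # (C, 77)
    text_emb = model.encode_text(tokens)                       # (C, 512)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)  # L2-normalize

# The normalized embeddings can now be cached and re-parameterized into the
# fusion layers and classification head, so the deployed model behaves like a
# fixed-vocabulary real-time detector for exactly this prompt set.
```

The payoff of this design is that changing what the detector looks for only requires re-running this cheap encoding step, not retraining or re-exporting the network.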

Conclusion and Availability

The paper concludes by positioning YOLO-World as a practical tool for real-world applications that require efficient, adaptive open-vocabulary detection. What makes YOLO-World particularly appealing to the research community is the authors' commitment to open-sourcing the pre-trained weights and code, broadening the horizons for large-vocabulary, real-time object detection in practice.
