A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future

Published 18 Jul 2023 in cs.CV | (2307.09220v2)

Abstract: As the most fundamental scene understanding tasks, object detection and segmentation have made tremendous progress in the deep learning era. Due to the expensive manual labeling cost, the annotated categories in existing datasets are often small-scale and pre-defined, i.e., state-of-the-art fully-supervised detectors and segmentors fail to generalize beyond the closed vocabulary. To resolve this limitation, in the last few years, the community has witnessed increasing attention toward Open-Vocabulary Detection (OVD) and Segmentation (OVS). By "open-vocabulary", we mean that the models can classify objects beyond pre-defined categories. In this survey, we provide a comprehensive review of recent developments in OVD and OVS. A taxonomy is first developed to organize different tasks and methodologies. We find that the permission and usage of weak supervision signals can well discriminate different methodologies, including: visual-semantic space mapping, novel visual feature synthesis, region-aware training, pseudo-labeling, knowledge distillation, and transfer learning. The proposed taxonomy is universal across different tasks, covering object detection, semantic/instance/panoptic segmentation, 3D and video understanding. The main design principles, key challenges, development routes, methodology strengths, and weaknesses are thoroughly analyzed. In addition, we benchmark each task along with the vital components of each method in the appendix, with updates maintained online at https://github.com/seanzhuh/awesome-open-vocabulary-detection-and-segmentation. Finally, several promising directions are provided and discussed to stimulate future research.

Summary

The paper introduces a detailed taxonomy that categorizes approaches based on weak supervision and semantic mapping in open-vocabulary detection and segmentation.
It highlights key methodologies including visual-semantic space mapping, feature synthesis, and region-aware training to enhance zero-shot learning.
The study evaluates benchmark results on datasets like COCO and LVIS while proposing future research directions to improve performance and extend applications.

The paper "A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future" by Chaoyang Zhu and Long Chen surveys recent advances in open-vocabulary object detection (OVD) and segmentation (OVS). It acknowledges the rapid progress that deep learning has brought to object detection and segmentation, yet it underscores a significant limitation: because manual labeling is expensive, these methods are traditionally confined to a closed set of predefined categories.

Core Contributions and Insights

The authors present a detailed taxonomy for classifying OVD and OVS methodologies. They categorize approaches by how they handle supervision, primarily distinguishing between methods that use weak supervision signals and those that do not. The taxonomy applies to both zero-shot and open-vocabulary settings and covers object detection, semantic/instance/panoptic segmentation, 3D scene understanding, and video understanding.

Types of Methodologies

The paper identifies several families of methods in this field, grouped by how they incorporate weak supervision or transfer learned features:

Visual-Semantic Space Mapping: These methods map visual features into the semantic space, semantic embeddings into the visual space, or both into a joint embedding space. The mapping enables zero-shot learning by leveraging semantic embeddings such as Word2Vec or GloVe, but it can suffer from the hubness problem and from bias toward seen classes.
Novel Visual Feature Synthesis: These methods use generative models to synthesize visual features for unseen classes, so that a classifier can be trained to recognize those classes in a zero-shot setting.
Region-Aware Training and Pseudo-Labeling: These methods leverage vision-language models (VLMs) such as CLIP to connect image regions with the corresponding textual entities, often via pseudo-labeling, so that the model can learn from data that has not been manually annotated.
Knowledge Distillation-Based and Transfer Learning-Based Approaches: These methods either distill knowledge from large VLMs into downstream models or fine-tune parts of a VLM to adapt it to a specific task, improving generalization across a diverse set of classes.
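
To make the first and third items above more concrete, the following is a minimal sketch of classifying a region by cosine similarity in a shared embedding space and then keeping only confident predictions as pseudo-labels. It is an illustration under stated assumptions, not the procedure of any specific method in the survey: the embeddings are hand-made toy vectors rather than real Word2Vec, GloVe, or CLIP outputs, and the function names, the temperature of 0.1, and the threshold of 0.8 are all hypothetical choices made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so that dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_scores(region_embedding, class_embeddings):
    """Cosine similarity between one region embedding and every class text embedding."""
    region = l2_normalize(region_embedding)
    return {
        name: sum(a * b for a, b in zip(region, l2_normalize(embedding)))
        for name, embedding in class_embeddings.items()
    }


def softmax(scores, temperature=0.1):
    """Turn similarities into probabilities; the temperature is an assumed value."""
    peak = max(scores.values())
    exps = {name: math.exp((s - peak) / temperature) for name, s in scores.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


def classify_region(region_embedding, class_embeddings, temperature=0.1):
    """Return the most probable class name and its probability for one region."""
    probs = softmax(cosine_scores(region_embedding, class_embeddings), temperature)
    label = max(probs, key=probs.get)
    return label, probs[label]


def pseudo_label(region_embeddings, class_embeddings, threshold=0.8):
    """Keep (region index, label, confidence) only for confident predictions."""
    kept = []
    for index, embedding in enumerate(region_embeddings):
        label, confidence = classify_region(embedding, class_embeddings)
        if confidence >= threshold:
            kept.append((index, label, confidence))
    return kept


# Toy stand-ins for text embeddings of class names. Because classes are just
# entries in this dictionary, adding a new name extends the vocabulary without
# retraining anything in this sketch.
CLASS_EMBEDDINGS = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "zebra": [0.0, 0.0, 1.0],
}

# Toy stand-ins for region features already projected into the same space.
REGION_EMBEDDINGS = [
    [0.9, 0.1, 0.0],    # close to "cat"
    [0.1, 0.2, 0.95],   # close to "zebra"
    [0.5, 0.5, 0.5],    # ambiguous: equally similar to every class
]

if __name__ == "__main__":
    for index, label, confidence in pseudo_label(REGION_EMBEDDINGS, CLASS_EMBEDDINGS):
        print(f"region {index}: {label} ({confidence:.3f})")
```

In this toy setup the ambiguous third region is discarded because its highest class probability falls below the threshold, which mirrors the general idea of filtering noisy pseudo-labels before using them for training. A real pipeline would obtain the region and text embeddings from trained encoders.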

Results and Challenges

The paper analyzes benchmark results for a range of models across open-vocabulary tasks and datasets such as COCO, LVIS, and ADE20K. Its empirical comparisons highlight the gains brought by techniques such as region-aware training and the inclusion of VLMs, which substantially extend general object recognition capabilities.

Future Directions

The authors propose several avenues for future research. These include improving the data efficiency and quality of weak supervision signals, leveraging VLMs more efficiently, and unifying detection and segmentation tasks to gain a cross-modality understanding. The paper also suggests extending these methodologies to tasks beyond image-based detection, such as video instance segmentation and 3D scene understanding.

In sum, this survey offers a comprehensive perspective on the current state and future potential of open-vocabulary detection and segmentation. By articulating both the achievements and the persistent challenges in this active area of research, it lays groundwork for future work on universally applicable, category-agnostic perception models.