nocaps: novel object captioning at scale

Published 20 Dec 2018 in cs.CV, cs.AI, cs.CL, and cs.LG | (1812.08658v3)

Abstract: Image captioning models have achieved impressive results on datasets containing limited visual concepts and large amounts of paired image-caption training data. However, if these models are to ever function in the wild, a much larger variety of visual concepts must be learned, ideally from less supervision. To encourage the development of image captioning models that can learn visual concepts from alternative data sources, such as object detection datasets, we present the first large-scale benchmark for this task. Dubbed 'nocaps', for novel object captioning at scale, our benchmark consists of 166,100 human-generated captions describing 15,100 images from the OpenImages validation and test sets. The associated training data consists of COCO image-caption pairs, plus OpenImages image-level labels and object bounding boxes. Since OpenImages contains many more classes than COCO, nearly 400 object classes seen in test images have no or very few associated training captions (hence, nocaps). We extend existing novel object captioning models to establish strong baselines for this benchmark and provide analysis to guide future work on this task.




Summary

The paper introduces a benchmark that drives enhanced generalization in image captioning by evaluating models across in-domain, near-domain, and out-of-domain images.
The paper extends existing captioning models with techniques like Constrained Beam Search, revealing a significant gap between automated captions and human annotations.
The paper emphasizes integrating object detection cues with language models to overcome limited vocabulary challenges, highlighting paths for future improvements in real-world captioning.

Summary of "nocaps: novel object captioning at scale"

The paper introduces "nocaps," a benchmark for evaluating how well image captioning models recognize and describe novel objects, that is, objects outside the limited set of classes covered by existing caption datasets such as COCO. The benchmark consists of 166,100 human-generated captions describing 15,100 images drawn from the OpenImages validation and test sets. Because OpenImages contains many more classes than COCO, nearly 400 object classes that appear in the test images have no or very few associated training captions.

The primary contribution is a benchmark that pushes models toward better generalization in recognizing and describing novel objects without additional paired image-caption data. Instead, training draws on alternative data sources such as object detection datasets to compensate for the limited object classes in caption datasets. The authors extend several existing image captioning models to provide baselines for nocaps and compare them with human performance, finding a significant gap that underscores how hard the task remains for current models in a realistic setting.

The benchmark evaluates models on three subsets, in-domain, near-domain, and out-of-domain, which group images according to whether the objects they contain occur in the COCO dataset. This grouping shows how well a model handles dataset-specific biases and domain shift, and where improvement is most needed. Experimentally, even with state-of-the-art models such as Neural Baby Talk (NBT) and decoding techniques such as Constrained Beam Search (CBS), automatic captions fall well short of human annotations, especially on out-of-domain images. CIDEr and SPICE scores show significant room for improvement in machine-generated captions.
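The three-way grouping can be illustrated with a short sketch. The exact rule below is an assumption made for illustration (an image with only COCO classes is in-domain, one mixing COCO and non-COCO classes is near-domain, and one with no COCO classes is out-of-domain); the summary itself only says that the subsets depend on object occurrence in COCO. The function name `nocaps_subset` and the tiny class list are hypothetical.

```python
# Hypothetical, heavily truncated stand-in for the COCO class list.
COCO_CLASSES = frozenset({"person", "dog", "car", "bicycle"})


def nocaps_subset(image_classes, coco_classes=COCO_CLASSES):
    """Assign an image to a subset from the object classes it contains.

    Assumed rule (not confirmed by the summary):
      only COCO classes          -> "in-domain"
      COCO and non-COCO classes  -> "near-domain"
      no COCO classes            -> "out-of-domain"
    """
    image_classes = set(image_classes)
    has_coco = bool(image_classes & coco_classes)
    has_novel = bool(image_classes - coco_classes)
    if has_coco and has_novel:
        return "near-domain"
    if has_novel:
        return "out-of-domain"
    # Images with no novel classes (including images with no labels at all)
    # fall through to in-domain in this sketch.
    return "in-domain"


if __name__ == "__main__":
    print(nocaps_subset({"person", "dog"}))
    print(nocaps_subset({"person", "accordion"}))
    print(nocaps_subset({"accordion", "okapi"}))
```

Reporting metrics separately per subset is what lets the benchmark distinguish a model that merely fits COCO from one that transfers to unseen classes.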

The paper implicitly challenges the research community to develop methods that bridge object detection and natural language generation. It emphasizes learning object recognition from detection datasets and combining it with the linguistic structure learned from caption datasets, a direction that could narrow the performance gap observed on the benchmark. The results highlight the need for models to disentangle object recognition from description generation and suggest that advances in language models and object detection, coupled with better integration of the two, will likely lead to captioning models that generalize to real-world images.
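As one illustration of combining detector output with a language model at decoding time, the following is a minimal toy sketch in the spirit of the Constrained Beam Search named earlier in this summary. It is not the authors' implementation. It assumes single-word constraints (for example, a detected object label), a hypothetical bigram table `BIGRAMS` standing in for a real captioning model, and one beam per set of satisfied constraints, so that hypotheses containing a required word are not pruned by higher-scoring ones that omit it.

```python
import math
from collections import defaultdict

# Hypothetical toy "language model": P(next word | previous word).
BIGRAMS = {
    "<s>": {"a": 1.0},
    "a": {"dog": 0.7, "okapi": 0.3},
    "dog": {"sits": 1.0},
    "okapi": {"sits": 1.0},
    "sits": {"</s>": 1.0},
}


def constrained_beam_search(constraints, beam_size=1, max_len=5):
    """Return the best word sequence that contains every constraint word.

    Beams are kept separately per state, where a state is the subset of
    constraint words already produced by a hypothesis.
    """
    constraints = frozenset(constraints)
    beams = {frozenset(): [(0.0, ["<s>"])]}
    finished = []
    for _ in range(max_len):
        candidates = defaultdict(list)
        for state, hyps in beams.items():
            for score, tokens in hyps:
                for word, prob in BIGRAMS.get(tokens[-1], {}).items():
                    new_state = state | ({word} & constraints)
                    hyp = (score + math.log(prob), tokens + [word])
                    if word == "</s>":
                        # Accept only hypotheses that satisfy all constraints.
                        if new_state == constraints:
                            finished.append(hyp)
                    else:
                        candidates[new_state].append(hyp)
        # Prune to the top `beam_size` hypotheses within each state.
        beams = {
            s: sorted(h, key=lambda x: x[0], reverse=True)[:beam_size]
            for s, h in candidates.items()
        }
        if not beams:
            break
    if not finished:
        return None
    best = max(finished, key=lambda x: x[0])
    return best[1][1:-1]  # strip <s> and </s>


if __name__ == "__main__":
    print(constrained_beam_search(set()))
    print(constrained_beam_search({"okapi"}))
```

With no constraints the toy model prefers the frequent word "dog"; requiring "okapi" keeps the lower-probability hypothesis alive in its own beam until it finishes, which is the basic idea behind forcing detected but rarely captioned objects into the output.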

In essence, nocaps provides a platform for advancing research in visual understanding and image captioning at scale, offering the data and tools needed to measure progress across a broad and diverse set of visual concepts. Future work will likely address the observed limitations by integrating detection and language modeling more closely, yielding models that scale to the variety and irregularity of real-world visual data.
