Noise-aware Learning from Web-crawled Image-Text Data for Image Captioning (2212.13563v2)

Published 27 Dec 2022 in cs.CV and cs.AI

Abstract: Image captioning is a task that can directly take advantage of large-scale web-crawled data, which provides rich knowledge about the visual world for a captioning model. However, because web-crawled data contains image-text pairs that are aligned to different degrees, the inherent noise (e.g., misaligned pairs) makes it difficult to learn a precise captioning model. A filtering strategy can effectively remove noisy data, but it reduces the amount of learnable knowledge and sometimes introduces a new problem of data deficiency. To get the best of both worlds, we propose a Noise-aware Captioning (NoC) framework, which learns rich knowledge from the whole web-crawled dataset while being less affected by the noise. This is achieved by the proposed alignment-level-controllable captioner, which is trained using the alignment levels of the image-text pairs as a control signal. This alignment-level-conditioned training allows the model to generate high-quality captions by simply setting the control signal to the desired alignment level at inference time. An in-depth analysis shows the effectiveness of our framework in handling noise. On two tasks, zero-shot captioning and text-to-image retrieval using generated captions (i.e., self-retrieval), we also demonstrate that our model can produce high-quality captions in terms of descriptiveness and distinctiveness. The code is available at https://github.com/kakaobrain/noc.
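
The conditioning idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how alignment levels are computed or how the control signal enters the model. The sketch assumes that each pair already has a scalar image-text similarity score, that scores are discretized with hand-picked bin boundaries, and that the level is passed as a prefix token. The names `alignment_level`, `control_token`, `build_training_example`, `build_inference_prefix`, and `BOUNDARIES` are all hypothetical.

```python
from bisect import bisect_right
from typing import List, Sequence


def alignment_level(score: float, boundaries: Sequence[float]) -> int:
    """Map a similarity score to a discrete level (0 = least aligned).

    `boundaries` must be sorted in ascending order; a score equal to a
    boundary falls into the higher level.
    """
    return bisect_right(boundaries, score)


def control_token(level: int) -> str:
    """Hypothetical textual encoding of the control signal."""
    return f"<align_{level}>"


def build_training_example(
    caption: str, score: float, boundaries: Sequence[float]
) -> List[str]:
    """Prefix the caption tokens with the level observed for this pair.

    Every pair is kept, including poorly aligned ones; the prefix tells
    the model how well the caption matches the image.
    """
    level = alignment_level(score, boundaries)
    return [control_token(level)] + caption.split()


def build_inference_prefix(desired_level: int) -> List[str]:
    """At inference time, request a caption at the chosen level."""
    return [control_token(desired_level)]


# Assumed bin boundaries, chosen only for illustration.
BOUNDARIES = [0.2, 0.3, 0.4]

noisy = build_training_example("best deals of the week", 0.12, BOUNDARIES)
clean = build_training_example("a dog running on the beach", 0.45, BOUNDARIES)
prefix = build_inference_prefix(len(BOUNDARIES))  # highest level
```

In this sketch no pair is filtered out. A poorly aligned pair is tagged with a low level during training, and the highest level is requested when generating captions.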

Authors (4)

Wooyoung Kang (6 papers)
Jonghwan Mun (16 papers)
Sungjun Lee (3 papers)
Byungseok Roh (16 papers)

Citations (12)

GitHub

GitHub - kakaobrain/noc (46 stars)
