Self-supervised learning of visual features through embedding images into text topic spaces (1705.08631v1)

Published 24 May 2017 in cs.CV

Abstract: End-to-end training from scratch of current deep architectures for new computer vision problems would require ImageNet-scale datasets, which are not always available. In this paper we present a method that takes advantage of freely available multi-modal content to train computer vision algorithms without human supervision. We put forward the idea of performing self-supervised learning of visual features by mining a large-scale corpus of multi-modal (text and image) documents. We show that discriminative visual features can be learnt efficiently by training a CNN to predict the semantic context in which a particular image is more likely to appear as an illustration. For this we leverage the hidden semantic structures discovered in the text corpus with a well-known topic modeling technique. Our experiments demonstrate state-of-the-art performance in image classification, object detection, and multi-modal retrieval compared to recent self-supervised or natural-supervised approaches.
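
The training signal described in the abstract can be illustrated with a minimal sketch. It assumes that a topic model has already assigned each document a probability distribution over topics, and that the CNN outputs one logit per topic for the image illustrating that document. The abstract does not state which loss is used; the softmax cross-entropy against soft topic targets below is an assumption chosen for illustration, and all names and numbers (`topic_probs`, `good_logits`, `bad_logits`) are hypothetical.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target_cross_entropy(logits, topic_probs):
    """Cross-entropy between predicted topic probabilities and soft targets.

    Assumption: this loss is one plausible choice, not confirmed by the source.
    """
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(topic_probs, probs) if t > 0)


# Hypothetical topic distribution of the text surrounding one image (3 topics).
topic_probs = [0.7, 0.2, 0.1]

# Hypothetical CNN outputs for that image: one set agrees with the text topics,
# the other ranks the topics in the opposite order.
good_logits = [2.0, 0.5, 0.0]
bad_logits = [0.0, 0.5, 2.0]

loss_good = soft_target_cross_entropy(good_logits, topic_probs)
loss_bad = soft_target_cross_entropy(bad_logits, topic_probs)
```

In this sketch the prediction that matches the text topics receives the lower loss, so minimizing it pushes the visual features toward the semantic context of the accompanying text without any human labels.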

Authors (5)

Lluis Gomez (42 papers)
Yash Patel (41 papers)
Marçal Rusiñol (20 papers)
Dimosthenis Karatzas (80 papers)
C. V. Jawahar (110 papers)

Citations (122)
