Probing Contextual Language Models for Common Ground with Visual Representations (2005.00619v5)

Published 1 May 2020 in cs.CL and cs.CV

Abstract: The success of large-scale contextual language models has attracted great interest in probing what is encoded in their representations. In this work, we consider a new question: to what extent are contextual representations of concrete nouns aligned with corresponding visual representations? We design a probing model that evaluates how effective text-only representations are in distinguishing between matching and non-matching visual representations. Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories. Moreover, they are effective in retrieving specific instances of image patches; textual context plays an important role in this process. Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly under-perform humans. We hope our analyses inspire future research in understanding and improving the visual capabilities of language models.

