
Using Sparse Semantic Embeddings Learned from Multimodal Text and Image Data to Model Human Conceptual Knowledge (1809.02534v3)

Published 7 Sep 2018 in cs.CL

Abstract: Distributional models provide a convenient way to model semantics using dense embedding spaces derived from unsupervised learning algorithms. However, the dimensions of dense embedding spaces are not designed to resemble human semantic knowledge. Moreover, embeddings are often built from a single source of information (typically text data), even though neurocognitive research suggests that semantics is deeply linked to both language and perception. In this paper, we combine multimodal information from both text and image-based representations derived from state-of-the-art distributional models to produce sparse, interpretable vectors using Joint Non-Negative Sparse Embedding. Through in-depth analyses comparing these sparse models to human-derived behavioural and neuroimaging data, we demonstrate their ability to predict interpretable linguistic descriptions of human ground-truth semantic knowledge.

Citations (11)

Summary

  • The paper proposes Joint Non-Negative Sparse Embedding (JNNSE) to integrate text and image data, yielding sparse, interpretable embeddings that mirror human conceptualization.
  • The method is validated through extensive comparisons with human behavioral and neuroimaging data, showing that the embeddings track human semantic judgments and patterns of neural activity.
  • The results demonstrate that combining multimodal data produces richer representations that capture the interplay between language and perception.

The paper "Using Sparse Semantic Embeddings Learned from Multimodal Text and Image Data to Model Human Conceptual Knowledge" investigates the creation of semantic embeddings that more closely resemble human conceptual knowledge by integrating information from both textual and visual modalities. Traditional distributional models typically generate dense embeddings from unsupervised algorithms primarily using text data. However, the authors argue that these dense embeddings do not align well with how humans conceptualize semantics, as they do not incorporate multimodal information or produce interpretable dimensions.

To address these limitations, the researchers construct sparse semantic embeddings using Joint Non-Negative Sparse Embedding (JNNSE). This method leverages both text and image data to generate vectors that are not only sparse but also interpretable, in the sense that individual dimensions can correspond to recognizable human concepts.
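JNNSE can be read as a joint matrix factorization in which two aligned feature matrices share a single non-negative, sparse code. The sketch below is a simplified illustration of that idea under stated assumptions, not the paper's exact solver: the function name, hyperparameters, and the alternating proximal-gradient scheme are choices made here for clarity.

```python
import numpy as np

def jnnse_sketch(X_text, X_img, k=50, lam=0.05, n_iter=200, seed=0):
    """Illustrative JNNSE-style joint factorization (not the paper's solver).

    Two aligned views, X_text (n_words x d_t) and X_img (n_words x d_i),
    are factorized into one shared code A (n_words x k), constrained to be
    non-negative and sparse, plus view-specific dictionaries D_t and D_i:

        min  ||X_text - A D_t||^2 + ||X_img - A D_i||^2 + lam * ||A||_1
        s.t. A >= 0
    """
    rng = np.random.default_rng(seed)
    A = np.abs(rng.standard_normal((X_text.shape[0], k)))  # non-negative init

    for _ in range(n_iter):
        # Dictionary step: ridge-stabilized least squares given the code A.
        G = A.T @ A + 1e-6 * np.eye(k)
        D_t = np.linalg.solve(G, A.T @ X_text)
        D_i = np.linalg.solve(G, A.T @ X_img)
        # Unit-norm dictionary rows keep the scale of A identifiable.
        D_t /= np.maximum(np.linalg.norm(D_t, axis=1, keepdims=True), 1e-8)
        D_i /= np.maximum(np.linalg.norm(D_i, axis=1, keepdims=True), 1e-8)

        # Code step: one proximal-gradient update on A
        # (soft-threshold for the L1 term, then project onto A >= 0).
        H = D_t @ D_t.T + D_i @ D_i.T                # k x k Gram matrix
        step = 1.0 / (np.linalg.norm(H, 2) + 1e-8)   # 1 / Lipschitz constant
        grad = A @ H - (X_text @ D_t.T + X_img @ D_i.T)
        A = np.maximum(A - step * grad - step * lam, 0.0)

    return A, D_t, D_i
```

The design point worth noting is the shared code A: because every latent dimension must help reconstruct both the text view and the image view, the dimensions are pushed toward cross-modal regularities, which is what makes them plausible candidates for human-interpretable concepts.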

The researchers validate their approach through extensive analyses, comparing the sparse semantic models against human behavioral data and neuroimaging evidence. These comparisons show that the multimodal sparse embeddings more closely approximate human ground-truth semantic knowledge, as expressed through interpretable linguistic descriptions.
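For the behavioral side, a standard way to run such a comparison is representational similarity analysis: correlate the model's pairwise concept distances with human pairwise dissimilarity judgments. The snippet below illustrates that generic protocol; the function name, cosine metric, and Spearman statistic are common conventions assumed here, not necessarily the paper's exact procedure.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def behavioral_alignment(A, human_dissim):
    """Rank-correlate model and human pairwise dissimilarities.

    A:            (n_concepts x k) embedding rows, ordered to match the
                  concepts in the human rating set.
    human_dissim: condensed vector of human pairwise dissimilarities, in
                  the same pair order that scipy's pdist produces.
    """
    model_dissim = pdist(A, metric="cosine")  # pairwise cosine distances
    rho, p_value = spearmanr(model_dissim, human_dissim)
    return rho, p_value
```

A higher rank correlation indicates that the geometry of the embedding space orders concept pairs the way human judgments do, which is the sense in which such models are said to approximate human semantic knowledge.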

One of the key contributions of this paper is the introduction of multimodal embeddings that combine the strengths of different types of data: text for linguistic information and images for perceptual information. Combining the two yields a richer and more nuanced representation of semantics, one that aligns more closely with human cognition.

The empirical evaluations confirm that these joint sparse embeddings can successfully predict human semantic knowledge, making them not only effective for computational tasks but also valuable for cognitive science research. This approach demonstrates the importance of multimodal data in creating more accurate models of human conceptual knowledge and opens up new avenues for integrating perceptual information into semantic modeling.

The paper represents a significant step toward understanding and modeling the complex interplay between language and perception, a foundational aspect of human semantics.
