VisualSem: A High-quality Knowledge Graph for Vision and Language (2008.09150v2)

Published 20 Aug 2020 in cs.CL, cs.AI, and cs.CV

Abstract: An exciting frontier in natural language understanding (NLU) and generation (NLG) calls for (vision-and-) LLMs that can efficiently access external structured knowledge repositories. However, many existing knowledge bases only cover limited domains, or suffer from noisy data, and most of all are typically hard to integrate into neural language pipelines. To fill this gap, we release VisualSem: a high-quality knowledge graph (KG) which includes nodes with multilingual glosses, multiple illustrative images, and visually relevant relations. We also release a neural multi-modal retrieval model that can use images or sentences as inputs and retrieves entities in the KG. This multi-modal retrieval model can be integrated into any (neural network) model pipeline. We encourage the research community to use VisualSem for data augmentation and/or as a source of grounding, among other possible uses. VisualSem as well as the multi-modal retrieval models are publicly available and can be downloaded in this URL: https://github.com/iacercalixto/visualsem

Citations (38)

Summary

  • The paper presents VisualSem as a multimodal knowledge graph with 90,000 nodes and over 1.3M glosses across 14 languages.
  • The paper introduces an image filtering procedure that yields higher-quality visual data than noisier resources such as BabelNet.
  • The paper provides pre-trained models for image and text retrieval, enabling improved entity retrieval in vision-language tasks.

An Examination of VisualSem: A High-quality Knowledge Graph for Vision & Language

The paper presents VisualSem, a knowledge graph designed to facilitate research in vision-and-language tasks, addressing common limitations found in existing knowledge bases. VisualSem is characterized by its integration of multilingual glosses, illustrative images, and visually relevant relations, making it a substantial resource for neural language models (LMs) that require retrieval of multimodal contextual information.

Core Contributions

VisualSem distinguishes itself from previous multimodal knowledge graphs by offering richer and more diverse domain coverage. It comprises approximately 90,000 nodes, over 1.3 million glosses across 14 languages, and around 938,000 curated images drawn from sources such as Wikipedia and ImageNet. The paper describes an image filtering approach that ensures high-quality visual data, a notable improvement over the noisy images found in other resources such as BabelNet.
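
To make the structure of the resource concrete, the following is a minimal sketch of how a VisualSem-style node could be represented in Python. The field names, example values, and the relation tuple format are illustrative assumptions rather than the released file schema, which is documented in the GitHub repository linked above.

```python
# Minimal sketch of a VisualSem-style node: multilingual glosses, illustrative
# images, and visually relevant relations. Field names and example values are
# illustrative assumptions, not the released file format.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class KGNode:
    node_id: str                                                   # unique node identifier
    glosses: Dict[str, List[str]] = field(default_factory=dict)    # language code -> glosses
    images: List[str] = field(default_factory=list)                # paths/URLs of illustrative images
    relations: List[Tuple[str, str]] = field(default_factory=list) # (relation_type, target_node_id)

# Example node (hypothetical identifiers and glosses).
dog = KGNode(
    node_id="dog",
    glosses={"en": ["A domesticated carnivorous mammal."],
             "pt": ["Mamífero carnívoro domesticado."]},
    images=["images/dog_001.jpg"],
    relations=[("is_a", "mammal")],
)
```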

The authors also address the challenge of entity retrieval in knowledge graphs by releasing pre-trained models capable of retrieving entities using either images or textual sentences. This multi-modal retrieval capability is seamlessly integrable into existing neural pipelines, providing a solid foundation for further exploration and application in diverse AI tasks.
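
As a rough illustration of this kind of retrieval, the sketch below ranks knowledge-graph nodes by cosine similarity between a query embedding and precomputed node embeddings. The toy `encode` function, the node ids, and the example glosses are placeholders; the released VisualSem models supply the actual sentence and image encoders.

```python
# Hedged sketch of sentence-to-node retrieval by cosine similarity.
# The encoder below is a toy stand-in; the released models provide real encoders.
import numpy as np

def encode(text: str, dim: int = 64) -> np.ndarray:
    """Toy stand-in encoder: hashes words into a fixed-size count vector."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    return vec

def retrieve_top_k(query: str, node_embeddings: np.ndarray, node_ids: list, k: int = 5):
    """Return the k node ids whose embeddings are most similar to the query (cosine)."""
    q = encode(query)
    q = q / (np.linalg.norm(q) + 1e-12)
    emb = node_embeddings / (np.linalg.norm(node_embeddings, axis=1, keepdims=True) + 1e-12)
    scores = emb @ q                       # cosine similarity against every node
    top = np.argsort(-scores)[:k]          # indices of the k best-scoring nodes
    return [(node_ids[i], float(scores[i])) for i in top]

# Example: glosses for three hypothetical nodes, embedded with the same toy encoder.
node_ids = ["dog", "cat", "violin"]
node_embeddings = np.stack([encode("a domesticated carnivorous mammal"),
                            encode("a small domesticated feline"),
                            encode("a bowed string instrument")])
print(retrieve_top_k("a mammal kept as a pet", node_embeddings, node_ids, k=2))
```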

Evaluation of Retrieval Models

The retrieval models were evaluated using metrics such as Hits@k and mean rank. Sentence retrieval performed well across languages, notably Portuguese and Chinese, while image retrieval, assessed with the CLIP model, left room for future improvement. The models' ability to handle queries in multiple languages points to applications in multilingual settings.
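
For reference, Hits@k and mean rank can be computed from the rank assigned to the gold entity for each query, as in the short sketch below; this follows the standard definitions of these metrics rather than the paper's exact evaluation code.

```python
# Standard Hits@k and mean rank from 1-based gold-entity ranks, one per query.
def hits_at_k(gold_ranks, k):
    """Fraction of queries whose gold entity is ranked within the top k."""
    return sum(r <= k for r in gold_ranks) / len(gold_ranks)

def mean_rank(gold_ranks):
    """Average 1-based rank of the gold entity across queries."""
    return sum(gold_ranks) / len(gold_ranks)

ranks = [1, 3, 12, 2, 7]                  # example ranks for five queries
print(hits_at_k(ranks, 1), hits_at_k(ranks, 10), mean_rank(ranks))
```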

Practical and Theoretical Implications

Practically, VisualSem provides a versatile tool for diverse AI applications including, but not limited to, named entity recognition, image captioning, and visual question answering. Theoretically, this work suggests an impactful shift towards integrating structured multimodal knowledge in LM pipelines, advancing the field of multimodal AI research. It serves as a stepping-stone for computational models capable of deeper contextual understanding by incorporating rich, diverse multimedia inputs.

Future Directions

The paper outlines several avenues for future work, particularly improving the image retrieval aspect and expanding VisualSem's node coverage. There is substantial potential to leverage this resource for further experimentation in data augmentation strategies across NLP and vision-language tasks.

VisualSem stands as a noteworthy contribution to the domain of knowledge graphs tailored for vision-and-language applications, poised to inspire continued advancements and innovations in multimodal AI research.