- The paper provides a comprehensive analysis of semantic similarity techniques, categorizing them into knowledge-based, corpus-based, and neural methods.
- It details methodologies such as edge-counting, information content, and modern embeddings like BERT, along with how each performs on standard benchmarks.
- It concludes that hybrid approaches, which merge complementary strengths from various methods, offer robust solutions for NLP challenges.
Evolution of Semantic Similarity – A Survey
Semantic similarity is pivotal to NLP tasks ranging from information retrieval and text classification to machine translation and question answering. The paper "Evolution of Semantic Similarity – A Survey" provides a robust analysis of methodologies developed over decades, systematically covering knowledge-based, corpus-based, deep neural network-based, and hybrid methods.
Introduction to Semantic Similarity
Semantic Textual Similarity (STS) assesses how close two pieces of text are in meaning, a capability crucial for applications like text summarization and essay evaluation. Early approaches relied on lexical measures such as bag-of-words (BoW) and TF-IDF, which capture surface overlap but not semantic depth. The paper emphasizes the distinction between semantic similarity and relatedness, treating similarity as a subset of relatedness; semantic distance serves as the inverse of both.
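As a concrete illustration of that gap, here is a minimal sketch (assuming scikit-learn is installed; the sentence pair is illustrative) in which two paraphrases that share no content words receive a zero TF-IDF cosine score:

```python
# A minimal sketch of why purely lexical measures fall short: two
# semantically close paraphrases with no word overlap score near zero.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

pair = ["the movie was fantastic", "that film is excellent"]
X = TfidfVectorizer().fit_transform(pair)
print(cosine_similarity(X[0], X[1]))  # ~0.0: no lexical overlap detected
```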
Figure 1: Survey Architecture.
Knowledge-Based Methods
Lexical Databases
- WordNet: A graph-based lexical ontology organized around synonym sets (synsets) linked by semantic relations.
- Wiktionary: A multilingual, broad-coverage lexical resource, though with a weaker taxonomic structure than WordNet.
- Wikipedia: Widely leveraged for semantic analysis, providing both structured taxonomic data and a large text corpus.
- BabelNet: A comprehensive hybrid resource fusing WordNet synsets with Wikipedia data.
Methods Employed
- Edge-Counting: Measures similarity by path length between concepts in the taxonomy graph; Wu and Palmer's measure additionally weights paths by the depth of the least common subsumer.
- Feature-Based: Measures like Lesk compute relatedness from overlaps between dictionary glosses.
- Information Content (IC): Resnik's measure scores similarity by the informativeness of the least common subsumer (LCS); extensions by Lin and Jiang-Conrath also factor in the information content of the compared words themselves (see the sketch after this list).
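To make these measures concrete, here is a minimal sketch using NLTK's WordNet interface, assuming the `wordnet` and `wordnet_ic` corpora have been downloaded; the word pair is illustrative:

```python
# Knowledge-based similarity measures over the WordNet taxonomy.
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

dog = wn.synset("dog.n.01")
cat = wn.synset("cat.n.01")

# Edge-counting: inverse of the shortest path length in the taxonomy.
print("path:", dog.path_similarity(cat))

# Wu-Palmer: scales the depth of the least common subsumer (LCS)
# against the depths of the two synsets.
print("wup:", dog.wup_similarity(cat))

# Information content measures, using IC estimated from the Brown corpus.
brown_ic = wordnet_ic.ic("ic-brown.dat")
print("resnik:", dog.res_similarity(cat, brown_ic))
print("lin:", dog.lin_similarity(cat, brown_ic))
print("jcn:", dog.jcn_similarity(cat, brown_ic))
```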
Corpus-Based Approaches
These methods rest on the distributional hypothesis: words that occur in similar contexts tend to have similar meanings, so word co-occurrence statistics can stand in for semantic distance.
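A toy sketch of this idea, assuming only NumPy; the corpus and window size are illustrative. Words are represented by their co-occurrence counts within a small window, and similarity is the cosine of those count vectors:

```python
import numpy as np

corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
M = np.zeros((len(vocab), len(vocab)))

# Count co-occurrences within a +/-2 word window.
for i, w in enumerate(corpus):
    for j in range(max(0, i - 2), min(len(corpus), i + 3)):
        if j != i:
            M[idx[w], idx[corpus[j]]] += 1

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# "cat" and "dog" appear in similar contexts, so their vectors align.
print(cos(M[idx["cat"]], M[idx["dog"]]))
```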
Word Embeddings
- Word2Vec: Learns dense word vectors with shallow neural networks, via the CBOW and Skip-gram training objectives.
- GloVe: Builds vectors from global co-occurrence matrices, aligning local context windows with corpus-wide statistics.
- FastText: Represents words as bags of character n-grams, improving coverage of rare and morphologically rich words.
- BERT: Transformer-based contextual embeddings that achieve high STS benchmark scores via attention mechanisms (contrasted with static embeddings in the sketch below).
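The sketch below contrasts static and contextual embeddings for similarity scoring. It assumes `gensim` and `sentence-transformers` are installed; the toy corpus and the `all-MiniLM-L6-v2` model name are illustrative choices, not details from the survey:

```python
from gensim.models import Word2Vec
from sentence_transformers import SentenceTransformer, util

# Static embeddings: train Skip-gram (sg=1) on a toy corpus.
corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "lay", "on", "the", "rug"]]
w2v = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)
print("w2v:", w2v.wv.similarity("cat", "dog"))

# Contextual embeddings: a pretrained transformer encodes whole sentences,
# so the same word gets different vectors in different contexts.
bert = SentenceTransformer("all-MiniLM-L6-v2")
emb = bert.encode(["A cat sat on the mat.", "A dog lay on the rug."])
print("sts:", util.cos_sim(emb[0], emb[1]).item())
```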

Figure 2: Distribution of articles over venues.
Other Models
- LSA/ESA and HAL: Build word and document vectors from co-occurrence data, with LSA applying singular value decomposition for dimensionality reduction and ESA representing meaning over Wikipedia concepts (a minimal LSA sketch follows this list).
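A minimal LSA sketch, assuming scikit-learn; the corpus and component count are illustrative. TF-IDF vectors are reduced with truncated SVD and compared by cosine similarity:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat on the mat",
        "a dog lay on the rug",
        "stocks fell sharply on friday"]
X = TfidfVectorizer().fit_transform(docs)
Z = TruncatedSVD(n_components=2).fit_transform(X)  # latent semantic space
print(cosine_similarity(Z[:1], Z[1:]))  # doc 0 vs. docs 1 and 2
```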
Deep Neural Network-Based Methods
The survey showcases advanced architectures built on LSTMs, Bi-LSTMs, CNNs, and recursive networks. These surpass many traditional approaches, with transformer models like BERT driving current advances. Word-level attention mechanisms have also become critical, weighting the contribution of individual words to the final sentence representation.
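As one hedged illustration of this encoder pattern, the PyTorch sketch below embeds two token sequences with a shared (Siamese) Bi-LSTM and scores them by cosine similarity; all dimensions and inputs are toy values, not settings from any surveyed system:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseEncoder(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=64, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                            bidirectional=True)

    def forward(self, tokens):
        out, _ = self.lstm(self.emb(tokens))
        return out.mean(dim=1)  # mean-pool the Bi-LSTM states over time

enc = SiameseEncoder()                # the SAME encoder embeds both texts
a = torch.randint(0, 1000, (1, 7))    # token ids of sentence A
b = torch.randint(0, 1000, (1, 9))    # token ids of sentence B
sim = F.cosine_similarity(enc(a), enc(b))
print(sim.item())
```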

Figure 3: Distribution of citation range over the articles.
Hybrid Methods
Hybrid models exploit complementary strengths across categories. Techniques such as NASARI integrate corpus statistics with structured knowledge sources like BabelNet, while MSSA refines multi-sense embeddings using WordNet synsets. Ensemble models, exemplified by UESTS, aggregate probabilistic measures and embedding-based techniques for improved semantic similarity scoring.
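A toy illustration of the ensemble idea, not any specific surveyed model: a WordNet-based score is blended with an embedding-based score by a weighted average. The 0.5 weight and the component measures are assumptions:

```python
from nltk.corpus import wordnet as wn

def hybrid_similarity(word1, word2, emb_score, alpha=0.5):
    """Weighted blend of a knowledge-based and an embedding-based score."""
    s1, s2 = wn.synsets(word1)[0], wn.synsets(word2)[0]
    kb_score = s1.wup_similarity(s2) or 0.0  # fall back to 0 if undefined
    return alpha * kb_score + (1 - alpha) * emb_score

# emb_score would come from any vector model, e.g. Word2Vec or BERT.
print(hybrid_similarity("car", "automobile", emb_score=0.9))
```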
Conclusion
The survey concludes that while individual methods bring unique advantages, hybrid approaches offer the most robust solutions by integrating complementary strengths. As the field consolidates around efficient embeddings and transformer models, future research could focus on computational efficiency and cross-domain applicability. The paper serves as a comprehensive resource for researchers aiming to advance semantic similarity in NLP.