Text Relatedness Based on a Word Thesaurus

Published 15 Jan 2014 in cs.CL | (1401.5699v1)

Abstract: The computation of relatedness between two fragments of text in an automated manner requires taking into account a wide range of factors pertaining to the meaning the two fragments convey, and the pairwise relations between their words. Without doubt, a measure of relatedness between text segments must take into account both the lexical and the semantic relatedness between words. Such a measure that captures well both aspects of text relatedness may help in many tasks, such as text retrieval, classification and clustering. In this paper we present a new approach for measuring the semantic relatedness between words based on their implicit semantic links. The approach exploits only a word thesaurus in order to devise implicit semantic links between words. Based on this approach, we introduce Omiotis, a new measure of semantic relatedness between texts which capitalizes on the word-to-word semantic relatedness measure (SR) and extends it to measure the relatedness between texts. We gradually validate our method: we first evaluate the performance of the semantic relatedness measure between individual words, covering word-to-word similarity and relatedness, synonym identification and word analogy; then, we proceed with evaluating the performance of our method in measuring text-to-text semantic relatedness in two tasks, namely sentence-to-sentence similarity and paraphrase recognition. Experimental evaluation shows that the proposed method outperforms every lexicon-based method of semantic relatedness in the selected tasks and the used data sets, and competes well against corpus-based and hybrid approaches.


Authors (3)

Citations (169)


Summary

The paper's main contributions are SR, a word-to-word semantic relatedness measure built on WordNet, and Omiotis, which extends SR to measure relatedness between texts.
SR uses a modified Dijkstra's algorithm to find the highest-scoring semantic path between word senses, and Omiotis weights word pairs by a harmonic mean of their TF-IDF weights to capture lexical relevance.
Experimental evaluation shows that the method outperforms the lexicon-based measures it is compared against and competes well with corpus-based and hybrid approaches, on tasks that include synonym identification and paraphrase recognition.

Text Relatedness Based on a Word Thesaurus: An Expert Analysis

The paper "Text Relatedness Based on a Word Thesaurus," published in the Journal of Artificial Intelligence Research, presents a method for estimating the semantic relatedness between text segments using only a word thesaurus, specifically WordNet. It introduces a new measure, Omiotis, designed to capture both the lexical and the semantic relatedness between texts.

Key Contributions and Methodology

The core contribution of the paper is Omiotis, a measure that operates at two levels: word-to-word and text-to-text relatedness. At the word level, it relies on a semantic relatedness measure (SR) that uses a word thesaurus to establish implicit semantic links between words. SR scores the semantic paths connecting a word pair using three factors: the length of the path, the weights of the semantic edges along it, and the specificity of its intermediate nodes, where specificity is reflected by a node's depth in WordNet's hierarchy.
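The three factors can be combined in more than one way, and this summary does not give the paper's exact formula. The following is a minimal sketch under stated assumptions: the function name `path_score`, the use of a product over edges, and the depth factor (harmonic mean of the depths of consecutive nodes, divided by a maximum depth) are illustrative choices, not the authors' definition. Depths are assumed to be positive integers.

```python
from math import prod

def path_score(edge_weights, node_depths, max_depth):
    """Score one path from its edge weights and node depths (illustrative only).

    edge_weights: weights in (0, 1], one per edge on the path.
    node_depths:  positive hierarchy depths, one per node on the path.
    max_depth:    largest depth in the hierarchy, used for normalization.
    """
    if len(node_depths) != len(edge_weights) + 1:
        raise ValueError("a path with k edges must have k + 1 nodes")
    # Edge factor: multiplying weights in (0, 1] penalizes long or weak paths.
    edge_factor = prod(edge_weights)
    # Depth factor (assumption): harmonic mean of the depths of each pair of
    # consecutive nodes, normalized by max_depth, so deeper nodes score higher.
    depth_factor = prod(
        (2 * d1 * d2 / (d1 + d2)) / max_depth
        for d1, d2 in zip(node_depths, node_depths[1:])
    )
    return edge_factor * depth_factor

# A one-edge path outscores a two-edge path with the same weights and depths.
short_path = path_score([0.5], [2, 2], max_depth=4)
long_path = path_score([0.5, 0.5], [2, 2, 2], max_depth=4)
```

Because every factor lies in (0, 1], each extra edge can only lower the score, which is how path length enters without a separate length term.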

Computational Approach

Semantic relatedness between word senses is computed through a modified Dijkstra's algorithm that identifies the path maximizing the product of edge weights. The SR measure achieves increased coverage and performance by leveraging all available parts of speech in WordNet and utilizing a comprehensive set of semantic relations rather than relying solely on hierarchical links.
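A maximum-product path search can be sketched with a small change to Dijkstra's algorithm: track the best product seen so far instead of the smallest sum, and expand the highest-scoring node first. The sketch below assumes every edge weight lies in (0, 1], so that extending a path never increases its score; the graph layout and the name `best_product_path` are hypothetical, and the paper's own modification may differ in detail.

```python
import heapq

def best_product_path(graph, source, target):
    """Return (score, path) maximizing the product of edge weights.

    graph: dict mapping node -> list of (neighbor, weight), weights in (0, 1].
    Returns (0.0, []) when the target cannot be reached.
    """
    best = {source: 1.0}           # best product found so far per node
    parent = {source: None}        # back-pointers for path reconstruction
    heap = [(-1.0, source)]        # scores are negated: heapq is a min-heap
    settled = set()
    while heap:
        neg_score, node = heapq.heappop(heap)
        if node in settled:
            continue               # stale heap entry
        settled.add(node)
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return -neg_score, path[::-1]
        for neighbor, weight in graph.get(node, []):
            candidate = -neg_score * weight
            if candidate > best.get(neighbor, 0.0):
                best[neighbor] = candidate
                parent[neighbor] = node
                heapq.heappush(heap, (-candidate, neighbor))
    return 0.0, []

# Toy graph with invented weights; not WordNet data.
toy_graph = {
    "car": [("vehicle", 0.9), ("wheel", 0.5)],
    "vehicle": [("bicycle", 0.9)],
    "wheel": [("bicycle", 1.0)],
    "bicycle": [],
}
score, path = best_product_path(toy_graph, "car", "bicycle")
```

In the toy graph the route through "vehicle" wins with a product of about 0.81, against 0.5 for the route through "wheel", even though the second route ends on a weight of 1.0.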

For texts, Omiotis measures the lexical relevance of a word pair as the harmonic mean of the two words' TF-IDF weights and combines it with the pair's semantic relatedness (SR). The resulting score reflects both how important the words are to their texts and how closely they are connected semantically.
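As a rough illustration of how a text-level score could be assembled from these two ingredients, the sketch below multiplies the harmonic mean of two TF-IDF weights by a word-level relatedness score. How the pair scores are aggregated is not stated in this summary; taking each word's best match in the other text, averaging, and symmetrizing are assumptions made here, and `word_sr` and the toy weights are hypothetical stand-ins.

```python
def harmonic_mean(x, y):
    """Harmonic mean of two non-negative weights (0.0 if both are zero)."""
    return 2 * x * y / (x + y) if x + y > 0 else 0.0

def directed_score(weights_a, weights_b, word_sr):
    """Average, over words of A, of the best weighted match found in B."""
    if not weights_a or not weights_b:
        return 0.0
    total = 0.0
    for a, tfidf_a in weights_a.items():
        total += max(
            harmonic_mean(tfidf_a, tfidf_b) * word_sr(a, b)
            for b, tfidf_b in weights_b.items()
        )
    return total / len(weights_a)

def text_relatedness(weights_a, weights_b, word_sr):
    """Symmetric text score: mean of the two directed scores (assumption)."""
    return 0.5 * (
        directed_score(weights_a, weights_b, word_sr)
        + directed_score(weights_b, weights_a, word_sr)
    )

# Invented TF-IDF weights and relatedness scores, for illustration only.
text_a = {"car": 0.5, "road": 0.5}
text_b = {"automobile": 0.5}
TOY_SR = {
    frozenset(["car", "automobile"]): 1.0,
    frozenset(["road", "automobile"]): 0.5,
}

def toy_sr(w1, w2):
    return 1.0 if w1 == w2 else TOY_SR.get(frozenset([w1, w2]), 0.0)

relatedness = text_relatedness(text_a, text_b, toy_sr)
```

The harmonic mean is high only when both words carry weight in their own texts, so a pair involving one unimportant word contributes little even if the words are semantically close.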

Experimental Evaluation

The method was validated in stages across several tasks. In word-to-word similarity and relatedness experiments on benchmark data sets such as Rubenstein and Goodenough and Miller and Charles, SR scores were compared with human judgments. According to the paper, SR outperforms the lexicon-based measures it is compared against and competes well with corpus-based and hybrid methods.

In synonym identification on the TOEFL and ESL data sets, SR performed strongly, surpassing several established methods. The paper also evaluates SR on Scholastic Aptitude Test analogy questions, a word analogy task that tests whether the measure can handle relations between pairs of words. At the text level, Omiotis was evaluated on sentence-to-sentence similarity and on paraphrase recognition, using data sets such as the Microsoft Research Paraphrase Corpus.
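A synonym question of this kind gives a target word and a few candidate answers, and a relatedness measure can answer it by choosing the candidate with the highest score. The sketch below shows that selection step only; `pick_synonym`, `toy_sr`, and the scores are hypothetical and invented for illustration, not results from the paper.

```python
def pick_synonym(target, candidates, word_sr):
    """Return the candidate whose relatedness to the target is highest."""
    return max(candidates, key=lambda candidate: word_sr(target, candidate))

# Invented scores for illustration only; not taken from the paper or WordNet.
TOY_SCORES = {"big": 0.2, "quick": 0.9, "quiet": 0.1, "late": 0.05}

def toy_sr(target, candidate):
    # Stand-in for a real word-to-word relatedness measure.
    return TOY_SCORES.get(candidate, 0.0)

answer = pick_synonym("fast", ["big", "quick", "quiet", "late"], toy_sr)
```

Accuracy on such a data set is then the fraction of questions for which the selected candidate matches the answer key.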

Theoretical and Practical Implications

A measure that captures both the lexical and the semantic aspects of text relatedness may help in tasks such as text retrieval, classification, clustering, and paraphrase recognition. Because Omiotis works at both the word level and the text level, the same underlying measure can serve tasks of different granularity, from synonym identification to sentence comparison.

Future Prospects

Directions for future work include improving the computational scalability of Omiotis and exploring its use in further NLP tasks such as cross-lingual information retrieval, document clustering, and query expansion. Semantic relatedness scores could also serve as features in machine learning systems.

In summary, the paper presents Omiotis, a thesaurus-based measure of semantic relatedness that performs well across word-level and text-level tasks. Its results suggest that a word thesaurus alone can support competitive relatedness measurement, and they point to further research on thesaurus-driven semantic processing.
