Semantic similarity between words and sentences using lexical database and word embeddings

Abstract

Calculating the semantic similarity between sentences is a long-standing problem in the area of natural language processing. The semantic analysis field has a crucial role to play in the research related to the text analytics. The meaning of the word in general English language differs as the context changes. Hence, the semantic similarity varies significantly as the domain of operation differs. For this reason, it is crucial to consider the appropriate definition of the words when they are compared semantically. We present an unsupervised method that can be applied across multiple domains by incorporating corpora based statistics into a standardized semantic similarity algorithm. To calculate the semantic similarity between words and sentences, the proposed method follows an edge-based approach using a lexical database. When tested on both benchmark standards and mean human similarity dataset, the methodology achieves a high correlation value for both word (Pearsons Correlation Coefficient = 0.8753) and sentence similarity (PCC = 0.8793) while comparing Rubenstein and Goodenough standard; and the SICK dataset (PCC = 0.8324) outperforming other unsupervised models. We use the semantic similarity algorithm and extend it to compare the Learning Objectives from course outlines. The course description provided by instructors is an essential piece of information as it defines what is expected from the instructor and what he/she is going to deliver during a particular course. One of the key components of a course description is the Learning Objectives section. The contents of this section are used by program managers who are tasked to compare and match two different courses during the development of Transfer Agreements between various institutions. This research introduces the development of semantic similarity algorithms to calculate the similarity between two learning objectives of the same domain. We present a methodology which deals with the semantic similarity by using a previously established algorithm and integrating it with the domain corpus to utilize domain statistics. The disambiguated domain serves as a supervised learning data for the algorithm. We also introduce Bloom Index to calculate the similarity between action verbs in the Learning Objectives referring to the Bloom's taxonomy. We also study and present the approach to calculate the semantic similarity between words under the word2vec model for a specific domain. We present a methodology to compile a corpus for a specific domain using Wikipedia. We then present a case to show the variance in the semantic similarity between words using different corpora. The core contributions of this thesis are a semantic similarity algorithm for words and sentences, and the corpus compilation of a specific domain to train the word2vec model. We also provide the practical uses of algorithms and the implementation

    Similar works