
    Creating a Probabilistic Graph for WordNet using Markov Logic Network

    The paper shows how to create a probabilistic graph for WordNet. A node is created for every word and phrase in WordNet. An edge between two nodes is labeled with the probability that a user who is interested in the source concept will also be interested in the destination concept. For example, an edge with weight 0.3 between “canine” and “dog” indicates that there is a 30% probability that a user who searches for “canine” will be interested in results that contain the word “dog”. We refer to the graph as probabilistic because we enforce the constraint that the weights of all the edges that go out of a node add up to one. Structural (e.g., the word “canine” is a hypernym, that is, a more general term, of the word “dog”) and textual (e.g., the word “canine” appears in the textual definition of the word “dog”) data from WordNet is used to create a Markov logic network, that is, a set of first-order formulas with probabilities. The Markov logic network is then used to compute the weights of the edges in the probabilistic graph. We experimentally validate the quality of the data in the probabilistic graph on two independent benchmarks: Miller and Charles and WordSimilarity-353.
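    To make the normalization constraint concrete, here is a minimal Python sketch of the final step: turning raw edge scores (such as aggregated evidence from a Markov logic network) into a probabilistic graph whose outgoing edge weights sum to one. The function name and the toy scores are hypothetical, chosen only to reproduce the “canine”/“dog” example above.

        from collections import defaultdict

        def normalize_out_edges(raw_weights):
            # raw_weights: dict mapping (source, destination) -> nonnegative
            # score, e.g. aggregated evidence from the Markov logic network.
            totals = defaultdict(float)
            for (src, _), w in raw_weights.items():
                totals[src] += w
            # Rescale so that the edges leaving each node sum to one.
            return {(src, dst): w / totals[src]
                    for (src, dst), w in raw_weights.items()
                    if totals[src] > 0}

        # Toy scores (illustrative values, not taken from the paper):
        scores = {("canine", "dog"): 3.0, ("canine", "tooth"): 7.0}
        print(normalize_out_edges(scores))
        # {('canine', 'dog'): 0.3, ('canine', 'tooth'): 0.7}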

    Semantic Document Clustering Using Information from WordNet and DBPedia

    Semantic document clustering is a type of unsupervised learning in which documents are grouped together based on their meaning. Unlike traditional approaches that cluster documents based on common keywords, this technique can group documents that share no words in common as long as they are on the same subject. We compute the similarity between two documents as a function of the semantic similarity between the words and phrases in the documents. We model information from WordNet and DBPedia as a probabilistic graph that can be used to compute the similarity between two terms. We experimentally validate our algorithm on the Reuters-21578 benchmark, which contains 11,362 newswire stories that are grouped into 82 categories using human judgment. We apply the k-means clustering algorithm to group the documents, once using a similarity metric that is based on keyword matching and once using one that is based on the probabilistic graph. We show that the second approach produces higher precision and recall, which corresponds to better alignment with the classification that was done by human experts.
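    As an illustration of computing document similarity from term similarities, the sketch below uses a symmetric best-match average: each term is paired with its most similar counterpart in the other document, and the matches are averaged in both directions. The aggregation scheme and the names document_similarity and term_sim are assumptions made for illustration; the paper's exact formula may differ.

        def document_similarity(doc_a, doc_b, term_sim):
            # doc_a, doc_b: lists of words/phrases extracted from each document.
            # term_sim: function (term, term) -> similarity in [0, 1], e.g. a
            # lookup into the WordNet/DBPedia probabilistic graph.
            def one_way(src, dst):
                # Average over the best match for each term in src.
                return sum(max(term_sim(s, d) for d in dst) for s in src) / len(src)
            # Symmetrize so the score does not depend on argument order.
            return (one_way(doc_a, doc_b) + one_way(doc_b, doc_a)) / 2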

    Implementing Semantic Document Search Using a Bounded Random Walk in a Probabilistic Graph

    Given a set of documents and an input query that is expressed using natural language, the problem of document search is to retrieve all relevant documents, ordered by their degree of relevance. Semantic document search fetches not only documents that contain words from the input query, but also documents that are semantically relevant. For example, the query “friendly pets” will match documents that contain the words “dog” and “cat”, among others. One way to implement semantic search is to use a probabilistic graph in which the input query is connected to the documents through paths that contain semantically similar words and phrases, where we use WordNet to initially populate the graph. Each edge in the graph is labeled with the conditional probability that the destination node is relevant given that the source node is relevant. Our semantic document search algorithm works in two phases. In the first phase, we find all documents in the graph that are close to the input query and create a bounded subgraph that includes the query, the found documents, and the paths that connect them. In the second phase, we simulate multiple random walks. Each random walk starts at the input query and continues until a document is reached, a jump outside the bounded subgraph is made, or the number of allowed jumps is exhausted. This allows us to rank the documents based on the number of random walks that terminated in them. We experimentally validate the algorithm on the Cranfield benchmark, which contains 1,400 documents and 225 natural language queries. We show that we achieve a higher value for the mean average precision (MAP) measure than a keyword-based search algorithm and a previously published algorithm that relies on a variation of the probabilistic graph.
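    The second phase can be sketched in a few lines of Python. The graph representation (nodes outside the bounded subgraph simply have no entry), the function name, and the parameter values below are assumptions made for illustration, not the paper's implementation.

        import random
        from collections import Counter

        def rank_documents(graph, query, documents, num_walks=10000, max_jumps=8):
            # graph: dict node -> list of (neighbor, probability) pairs,
            #   restricted to the bounded subgraph built in phase one.
            # documents: set of document nodes found in phase one.
            hits = Counter()
            for _ in range(num_walks):
                node = query
                for _ in range(max_jumps):
                    edges = graph.get(node)
                    if not edges:
                        break  # the walk jumped outside the bounded subgraph
                    neighbors, probs = zip(*edges)
                    node = random.choices(neighbors, weights=probs)[0]
                    if node in documents:
                        hits[node] += 1  # walk terminated in a document
                        break
            # Documents reached by more walks rank higher.
            return hits.most_common()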

    Creating a Probabilistic Model for WordNet

    We present a probabilistic model for extracting and storing information from WordNet and the British National Corpus. We map the data into a directed probabilistic graph that can be used to compute the conditional probability between a pair of words from the English language. For example, the graph can be used to deduce that there is a 10% probability that someone who is interested in dogs is also interested in the word “canine”. We propose three ways of computing this probability, where the best results are achieved by performing multiple random walks in the graph. Unlike existing approaches that only process the structured data in WordNet, we process all available information, including the natural language descriptions. The available evidence is expressed as simple Horn clauses with probabilities. It is then aggregated using a Markov logic network model to create the probabilistic graph. We experimentally validate the quality of the data on five different benchmarks that contain collections of pairs of words and their semantic similarity as determined by humans. In the experimental section, we show that our random walk algorithm with a logarithmic distance metric produces higher correlation with the results of human judgment on three of the five benchmarks and a better overall average correlation than the current state-of-the-art algorithms.
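    A minimal sketch of the random-walk estimator with a logarithmic discount is shown below. The exact form of the paper's logarithmic distance metric is not reproduced here; the 1/log2(steps + 1) weighting, the function name, and the parameter values are illustrative assumptions.

        import math
        import random

        def walk_similarity(graph, source, target, num_walks=5000, max_jumps=6):
            # graph: dict node -> list of (neighbor, probability) pairs,
            #   with the probabilities leaving each node summing to one.
            score = 0.0
            for _ in range(num_walks):
                node = source
                for step in range(1, max_jumps + 1):
                    edges = graph.get(node)
                    if not edges:
                        break
                    neighbors, probs = zip(*edges)
                    node = random.choices(neighbors, weights=probs)[0]
                    if node == target:
                        # Walks that reach the target sooner contribute more.
                        score += 1.0 / math.log2(step + 1)
                        break
            return score / num_walks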