4 research outputs found

    Unsupervised Graph-Based Similarity Learning Using Heterogeneous Features.

    Full text link
    Relational data refers to data that contains explicit relations among objects. Nowadays, relational data are universal and have broad appeal in many different application domains. The problem of estimating similarity between objects is a core requirement for many standard Machine Learning (ML), Natural Language Processing (NLP) and Information Retrieval (IR) problems such as clustering, classification, word sense disambiguation, etc. Traditional machine learning approaches represent the data using simple, concise representations such as feature vectors. While this works very well for homogeneous data, i.e., data with a single feature type such as text, it does not fully exploit the availability of different feature types. For example, scientific publications have text, citations, authorship information and venue information, and each of these features can be used for estimating similarity. Representing such objects has been a key issue in efficient mining (Getoor and Taskar, 2007). In this thesis, we propose natural representations for relational data using multiple, connected layers of graphs, one for each feature type. We also propose novel algorithms for estimating similarity using multiple heterogeneous features, and we present novel algorithms for tasks such as topic detection and music recommendation using the estimated similarity measure. We demonstrate superior performance of the proposed algorithms (root mean squared error of 24.81 on the Yahoo! KDD Music recommendation data set and classification accuracy of 88% on the ACL Anthology Network data set) over many state-of-the-art algorithms, such as Latent Semantic Analysis (LSA), Multiple Kernel Learning (MKL) and spectral clustering, as well as baselines, on large, standard data sets.
    Ph.D., Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/89824/1/mpradeep_1.pd
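    The thesis' algorithms are not reproduced here, but the core idea it describes (one similarity "layer" per feature type, combined into a single object-object similarity) can be sketched. In the snippet below, the choice of cosine similarity for text, Jaccard similarity for citation sets, and uniform layer weights are all illustrative assumptions, not the methods from the thesis.

```python
# Minimal sketch (assumptions, not the thesis' algorithms): build one similarity
# "layer" per feature type and combine the layers into a single similarity matrix.
import numpy as np

def cosine_layer(X):
    """Similarity layer from a dense feature matrix (e.g., bag-of-words text)."""
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    return Xn @ Xn.T

def jaccard_layer(sets):
    """Similarity layer from set-valued features (e.g., citation or author sets)."""
    n = len(sets)
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            union = sets[i] | sets[j]
            S[i, j] = len(sets[i] & sets[j]) / len(union) if union else 0.0
    return S

def combine_layers(layers, weights=None):
    """Weighted sum of per-feature-type similarity layers (uniform by default)."""
    if weights is None:
        weights = [1.0 / len(layers)] * len(layers)
    return sum(w * L for w, L in zip(weights, layers))

# Toy example: three papers with text feature vectors and citation sets.
text = np.array([[1.0, 0.0, 2.0], [0.5, 1.0, 0.0], [1.0, 0.0, 1.5]])
cites = [{10, 11}, {11, 12}, {10, 12}]
similarity = combine_layers([cosine_layer(text), jaccard_layer(cites)])
print(similarity.round(2))
```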

    The Noisy Substring Matching Problem

    No full text
    Let T(U) be the set of words in the dictionary H which contain U as a substring. The problem considered here is the estimation of the set T(U) when U is not known, but Y, a noisy version of U, is available. The suggested set estimate S*(Y) of T(U) is a proper subset of H such that every element of it contains at least one substring which most closely resembles Y according to the Levenshtein metric. The proposed algorithm for the computation of S*(Y) requires cubic time. The algorithm uses the recursively computable dissimilarity measure Dk(X, Y), termed the kth distance between two strings X and Y, which is a dissimilarity measure between Y and a certain subset of the set of contiguous substrings of X. Another estimate of T(U), namely SM(Y), is also suggested. The accuracy of SM(Y) is only slightly less than that of S*(Y), but the computation time of SM(Y) is substantially less than that of S*(Y). Experimental results involving 1900 noisy substrings and dictionaries which are subsets of the 1023 most common English words [11] indicate that the accuracy of the estimate S*(Y) is around 99 percent and that of SM(Y) is about 98 percent.
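    A minimal sketch of the underlying idea (not the paper's Dk recurrence or its cubic-time algorithm): for each dictionary word X, compute the smallest Levenshtein distance between the noisy string Y and any contiguous substring of X via standard approximate substring matching, and keep the words attaining the overall minimum. The toy dictionary is illustrative.

```python
# Sketch of the noisy-substring idea under simplifying assumptions.
def best_substring_distance(y, x):
    """Smallest Levenshtein distance between y and any contiguous substring of x."""
    m, n = len(y), len(x)
    prev = [0] * (n + 1)              # row 0 is all zeros: a match may start anywhere in x
    for i in range(1, m + 1):
        curr = [i] + [0] * n          # matching the empty substring costs i edits
        for j in range(1, n + 1):
            cost = 0 if y[i - 1] == x[j - 1] else 1
            curr[j] = min(prev[j] + 1,          # skip a character of y
                          curr[j - 1] + 1,      # skip a character of x
                          prev[j - 1] + cost)   # match or substitute
        prev = curr
    return min(prev)                  # best over all ending positions in x

def estimate_word_set(y, dictionary):
    """Words of the dictionary containing a substring closest to the noisy string y."""
    dists = {x: best_substring_distance(y, x) for x in dictionary}
    best = min(dists.values())
    return {x for x, d in dists.items() if d == best}

# Toy example: both words containing the closest substring are returned.
print(estimate_word_set("informa", ["information", "informal", "pattern", "form"]))
```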

    PATTERN RECOGNITION WITH STRINGS, SUBSTRINGS AND BOUNDARIES

    No full text
    The purpose of this research is to study similarity and dissimilarity measures between strings, substrings and polygons, and to use these measures in various pattern recognition problems. An abstract basis for many of the known similarity and dissimilarity measures involving a set of strings has been presented. By virtue of the abstract formulation, many of the numerical and non-numerical measures of similarity involving strings can be computed using a common computational scheme. A deterministic algorithm which possesses certain optimal computational properties has been proposed for the recognition of noisy strings. Further, a stochastic model for a channel causing deletion, insertion and substitution errors in strings according to an arbitrary distribution has been discussed. An algorithm to compute the probability of receiving one string Y, given that a string X was transmitted, has been presented. Using these results, error correction of strings can be achieved with a minimum probability of error. The question of estimating a set of words containing a certain string by processing a noisy version of this string has been studied. This problem, which had not previously been tackled in the literature, is called the noisy substring matching problem. A deterministic algorithm has been proposed to solve this problem. Finally, some geometrical dissimilarity measures between polygons have been proposed. These measures utilize the entire geometrical information in the boundaries of the contours, and not merely the global features of the boundaries. Using these dissimilarity measures, pattern recognition of closed contours can be performed. Experimental results have been included which justify the theoretical results presented. In the study of strings and substrings, the experiments have been conducted using subsets of the 1023 most common English words. Four of the Great Lakes of North America (Erie, Huron, Michigan and Superior) have been used in the experiments related to the recognition of closed contours.
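    As an illustration of the channel model described above, the sketch below computes P(Y | X), the probability of receiving string Y when string X was transmitted, under a simplified channel with fixed, position-independent insertion, deletion and substitution probabilities. The thesis allows an arbitrary error distribution; the alphabet and error rates used here are assumptions for the example only.

```python
# Simplified channel sketch: P(Y | X) under independent per-step error probabilities.
from functools import lru_cache

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
P_INS, P_DEL, P_SUB = 0.01, 0.02, 0.03          # assumed error rates
P_OK = 1.0 - P_DEL - P_SUB                       # correct transmission, given no insertion

def channel_probability(x, y):
    """Probability of receiving y when x was transmitted through the channel."""
    A = len(ALPHABET)

    @lru_cache(maxsize=None)
    def p(i, j):
        # Probability that the suffix x[i:] produces exactly the suffix y[j:].
        if i == len(x) and j == len(y):
            return 1.0 - P_INS                    # stop emitting
        total = 0.0
        if j < len(y):
            total += P_INS / A * p(i, j + 1)      # spurious insertion producing y[j]
        if i < len(x):
            total += (1.0 - P_INS) * P_DEL * p(i + 1, j)          # x[i] deleted
            if j < len(y):
                emit = P_OK if x[i] == y[j] else P_SUB / (A - 1)  # correct or substituted
                total += (1.0 - P_INS) * emit * p(i + 1, j + 1)
        return total

    return p(0, 0)

# Toy usage: exact reception is far more likely than a one-character deletion.
print(channel_probability("huron", "huron"))
print(channel_probability("huron", "hurn"))
```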