Exploiting Intrinsic Clustering Structure in Discrete-Valued Data Sets for Efficient Knowledge Discovery in the Presence of Missing Data
Scalable algorithm design has become central in the era of large-scale data analysis. The vast amounts of data pouring in from a diverse set of application domains, such as bioinformatics, recommender systems, sensor systems, and social networks, cannot be analyzed efficiently using many data mining and statistical tools that were designed for a small-scale setting. It is an ongoing challenge to the data mining, machine learning, and statistics communities to design new methods for efficient data analysis. Compounding this challenge is the noisy and incomplete nature of real-world data sets. Research scientists as well as practitioners in industry need to find meaningful patterns in data with missing value rates often as high as 99%, in addition to errors in the data that can obstruct accurate analyses. My contribution to this line of research is the design of new algorithms for scalable clustering, data reduction, and similarity evaluation that exploit inherent clustering structure in the input data to overcome the challenges posed by large numbers of missing entries. I demonstrate that, by focusing on underlying clustering properties of the data, we can improve the efficiency of several data analysis methods on sparse, discrete-valued data sets. I highlight new methods that I have developed with my collaborators for three diverse knowledge discovery tasks: (1) clustering genetic markers into linkage groups, (2) reducing large-scale genetic data to a much smaller, more accurate representative data set, and (3) computing similarity between users in recommender systems. In each case, I point out how the underlying clustering structure can be used to design more efficient algorithms, even when high missing value rates are present.
LiRa: A New Likelihood-Based Similarity Score for Collaborative Filtering
Recommender system data presents unique challenges to the data mining,
machine learning, and algorithms communities. The high missing data rate, in
combination with the large scale and high dimensionality that is typical of
recommender systems data, requires new tools and methods for efficient data
analysis. Here, we address the challenge of evaluating similarity between two
users in a recommender system, where for each user only a small set of ratings
is available. We present a new similarity score, which we call LiRa, based on a
statistical model of user similarity, for large-scale, discrete-valued data
with many missing values. We show that this score, based on a ratio of
likelihoods, is more effective at identifying similar users than traditional
similarity scores in user-based collaborative filtering, such as the Pearson
correlation coefficient. We argue that our approach has significant potential
to improve both accuracy and scalability in collaborative filtering.
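
For readers unfamiliar with the baseline named above, the following is a minimal sketch of the Pearson correlation coefficient computed over the items that two users have both rated. It is not the LiRa score, whose definition is not given in this abstract. The function name `pearson_similarity`, the dictionary layout, and the sample ratings are assumptions made purely for illustration.

```python
from math import sqrt


def pearson_similarity(ratings_a, ratings_b):
    """Pearson correlation over co-rated items, or None if it is undefined.

    ratings_a, ratings_b: dicts mapping an item id to a numeric rating
    (a hypothetical layout chosen for this sketch).
    """
    # Only items rated by both users can contribute to the score.
    common = sorted(set(ratings_a) & set(ratings_b))
    n = len(common)
    if n < 2:
        return None  # fewer than two co-rated items: correlation undefined

    xs = [ratings_a[item] for item in common]
    ys = [ratings_b[item] for item in common]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # one user gave identical ratings: correlation undefined

    return cov / sqrt(var_x * var_y)


# Made-up ratings for three users with little overlap.
alice = {"item1": 5, "item2": 3, "item3": 4}
bob = {"item2": 1, "item3": 2, "item4": 5}
carol = {"item1": 4, "item5": 2}

print(pearson_similarity(alice, bob))    # two co-rated items
print(pearson_similarity(alice, carol))  # one co-rated item
```

With only two co-rated items whose ratings differ, the coefficient is always exactly +1 or -1, and with a single co-rated item it is undefined. This illustrates why a high missing data rate makes such scores hard to interpret.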