501 research outputs found
Statistical Significance of the Netflix Challenge
Inspired by the legacy of the Netflix contest, we provide an overview of what
has been learned---from our own efforts, and those of others---concerning the
problems of collaborative filtering and recommender systems. The data set
consists of about 100 million movie ratings (from 1 to 5 stars) involving some
480 thousand users and some 18 thousand movies; the associated ratings matrix
is about 99% sparse. The goal is to predict ratings that users will give to
movies; systems which can do this accurately have significant commercial
applications, particularly on the world wide web. We discuss, in some detail,
approaches to "baseline" modeling, singular value decomposition (SVD), as well
as kNN (nearest neighbor) and neural network models; temporal effects,
cross-validation issues, ensemble methods and other considerations are
discussed as well. We compare existing models in a search for new models, and
also discuss the mission-critical issues of penalization and parameter
shrinkage which arise when the dimensions of a parameter space reaches into the
millions. Although much work on such problems has been carried out by the
computer science and machine learning communities, our goal here is to address
a statistical audience, and to provide a primarily statistical treatment of the
lessons that have been learned from this remarkable set of data.Comment: Published in at http://dx.doi.org/10.1214/11-STS368 the Statistical
Science (http://www.imstat.org/sts/) by the Institute of Mathematical
Statistics (http://www.imstat.org
Maximum Inner-Product Search using Tree Data-structures
The problem of {\em efficiently} finding the best match for a query in a
given set with respect to the Euclidean distance or the cosine similarity has
been extensively studied in literature. However, a closely related problem of
efficiently finding the best match with respect to the inner product has never
been explored in the general setting to the best of our knowledge. In this
paper we consider this general problem and contrast it with the existing
best-match algorithms. First, we propose a general branch-and-bound algorithm
using a tree data structure. Subsequently, we present a dual-tree algorithm for
the case where there are multiple queries. Finally we present a new data
structure for increasing the efficiency of the dual-tree algorithm. These
branch-and-bound algorithms involve novel bounds suited for the purpose of
best-matching with inner products. We evaluate our proposed algorithms on a
variety of data sets from various applications, and exhibit up to five orders
of magnitude improvement in query time over the naive search technique.Comment: Under submission in KDD 201
Message-Passing Inference on a Factor Graph for Collaborative Filtering
This paper introduces a novel message-passing (MP) framework for the
collaborative filtering (CF) problem associated with recommender systems. We
model the movie-rating prediction problem popularized by the Netflix Prize,
using a probabilistic factor graph model and study the model by deriving
generalization error bounds in terms of the training error. Based on the model,
we develop a new MP algorithm, termed IMP, for learning the model. To show
superiority of the IMP algorithm, we compare it with the closely related
expectation-maximization (EM) based algorithm and a number of other matrix
completion algorithms. Our simulation results on Netflix data show that, while
the methods perform similarly with large amounts of data, the IMP algorithm is
superior for small amounts of data. This improves the cold-start problem of the
CF systems in practice. Another advantage of the IMP algorithm is that it can
be analyzed using the technique of density evolution (DE) that was originally
developed for MP decoding of error-correcting codes
Content-boosted Matrix Factorization Techniques for Recommender Systems
Many businesses are using recommender systems for marketing outreach.
Recommendation algorithms can be either based on content or driven by
collaborative filtering. We study different ways to incorporate content
information directly into the matrix factorization approach of collaborative
filtering. These content-boosted matrix factorization algorithms not only
improve recommendation accuracy, but also provide useful insights about the
contents, as well as make recommendations more easily interpretable
- …