Statistical Significance of the Netflix Challenge
Inspired by the legacy of the Netflix contest, we provide an overview of what
has been learned---from our own efforts, and those of others---concerning the
problems of collaborative filtering and recommender systems. The data set
consists of about 100 million movie ratings (from 1 to 5 stars) involving some
480 thousand users and some 18 thousand movies; the associated ratings matrix
is about 99% sparse. The goal is to predict ratings that users will give to
movies; systems which can do this accurately have significant commercial
applications, particularly on the world wide web. We discuss, in some detail,
approaches to "baseline" modeling, singular value decomposition (SVD), as well
as kNN (nearest neighbor) and neural network models; temporal effects,
cross-validation issues, ensemble methods and other considerations are
discussed as well. We compare existing models in a search for new models, and
also discuss the mission-critical issues of penalization and parameter
shrinkage which arise when the dimension of a parameter space reaches into the
millions. Although much work on such problems has been carried out by the
computer science and machine learning communities, our goal here is to address
a statistical audience, and to provide a primarily statistical treatment of the
lessons that have been learned from this remarkable set of data.
Comment: Published at http://dx.doi.org/10.1214/11-STS368 in Statistical
Science (http://www.imstat.org/sts/) by the Institute of Mathematical
Statistics (http://www.imstat.org
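As an illustration of the "baseline" modeling and parameter shrinkage mentioned in this abstract, here is a minimal sketch that fits a global mean plus shrunken per-movie and per-user offsets. It is not taken from the paper: the function names and the shrinkage constants `shrink_movie` and `shrink_user` are hypothetical, and dividing by (constant + count) is only one common way to pull sparsely observed offsets toward zero.

```python
from collections import defaultdict


def fit_baseline(ratings, shrink_movie=25.0, shrink_user=10.0):
    """Fit r_ui ~ mu + b_movie + b_user with shrunken offsets.

    ratings: list of (user, movie, rating) triples.
    shrink_*: hypothetical shrinkage constants; larger values pull
    offsets estimated from few ratings more strongly toward zero.
    """
    mu = sum(r for _, _, r in ratings) / len(ratings)

    # Movie offsets: shrunken mean residual from the global mean.
    movie_sum, movie_cnt = defaultdict(float), defaultdict(int)
    for _, m, r in ratings:
        movie_sum[m] += r - mu
        movie_cnt[m] += 1
    b_movie = {m: movie_sum[m] / (shrink_movie + movie_cnt[m])
               for m in movie_sum}

    # User offsets: shrunken mean residual after removing the movie effect.
    user_sum, user_cnt = defaultdict(float), defaultdict(int)
    for u, m, r in ratings:
        user_sum[u] += r - mu - b_movie[m]
        user_cnt[u] += 1
    b_user = {u: user_sum[u] / (shrink_user + user_cnt[u])
              for u in user_sum}
    return mu, b_movie, b_user


def predict(model, user, movie):
    """Predict a rating; unseen users or movies fall back to zero offset."""
    mu, b_movie, b_user = model
    raw = mu + b_movie.get(movie, 0.0) + b_user.get(user, 0.0)
    return min(5.0, max(1.0, raw))  # clip to the 1 to 5 star scale
```

With millions of users and movies, most offsets are estimated from only a handful of ratings, which is why some form of shrinkage matters; in practice the constants would be chosen by cross-validation rather than fixed as above.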
Support vector machine for functional data classification
In many applications, input data are sampled functions taking their values in
infinite-dimensional spaces rather than standard vectors. This fact has complex
consequences for data analysis algorithms and motivates modifying them.
In fact, most of the traditional data analysis tools for regression,
classification and clustering have been adapted to functional inputs under the
general name of Functional Data Analysis (FDA). In this paper, we investigate
the use of Support Vector Machines (SVMs) for functional data analysis and we
focus on the problem of curve discrimination. SVMs are large-margin classifiers
based on implicit nonlinear mappings of the data into high-dimensional
spaces by means of kernels. We show how to define simple kernels that
take into account the functional nature of the data and lead to consistent
classification. Experiments conducted on real world data emphasize the benefit
of taking into account some functional aspects of the problems.
Comment: 13 page
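One way to picture a kernel that "takes into account the functional nature of the data" is to apply a standard Gaussian kernel to a functional transformation of each sampled curve, such as its derivative. The sketch below is an assumption about what such a kernel could look like, not the construction used in the paper: the names `derivative`, `l2_distance_sq` and `functional_rbf`, the regular sampling grid, and the `gamma` parameter are all hypothetical.

```python
import math


def derivative(curve, step):
    """Forward finite-difference derivative of a curve on a regular grid."""
    return [(b - a) / step for a, b in zip(curve, curve[1:])]


def l2_distance_sq(f, g, step):
    """Riemann-sum approximation of the squared L2 distance between curves."""
    return step * sum((a - b) ** 2 for a, b in zip(f, g))


def functional_rbf(f, g, step, gamma=1.0, use_derivative=True):
    """Gaussian kernel on curves, optionally compared through derivatives.

    f, g: curve values sampled at the same regularly spaced points.
    step: spacing of the sampling grid.
    """
    if use_derivative:
        f, g = derivative(f, step), derivative(g, step)
    return math.exp(-gamma * l2_distance_sq(f, g, step))
```

Comparing derivatives makes the kernel insensitive to a constant vertical shift between curves, which a kernel on the raw sample vectors would treat as a large difference. Because the transformation is applied to each curve separately before a Gaussian kernel, the result remains a valid positive-definite kernel.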
Iteratively-Reweighted Least-Squares Fitting of Support Vector Machines: A Majorization--Minimization Algorithm Approach
Support vector machines (SVMs) are an important tool in modern data analysis.
Traditionally, support vector machines have been fitted via quadratic
programming, either using purpose-built or off-the-shelf algorithms. We present
an alternative approach to SVM fitting via the majorization--minimization (MM)
paradigm. Algorithms that are derived via MM algorithm constructions can be
shown to monotonically decrease their objectives at each iteration, as well as
be globally convergent to stationary points. We demonstrate the construction of
iteratively-reweighted least-squares (IRLS) algorithms, via the MM paradigm,
for SVM risk minimization problems involving the hinge, least-squares,
squared-hinge, and logistic losses, and 1-norm, 2-norm, and elastic net
penalizations. Successful implementations of our algorithms are presented via
some numerical examples.
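To make the IRLS idea concrete, here is a minimal sketch for the hinge loss with a 2-norm penalty. It rests on stated assumptions and is not necessarily the authors' exact formulation: the hinge is written as max(0, u) = (u + |u|)/2 with u = 1 - y x'beta, |u| is smoothed to sqrt(u^2 + eps), and the square root is majorized by a quadratic at the current iterate, so each MM step reduces to a weighted ridge-type linear system. The names `solve`, `objective` and `fit_svm_irls`, and the defaults for `lam`, `eps` and `iters`, are hypothetical.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def objective(beta, xs, ys, lam, eps):
    """Smoothed hinge risk plus ridge penalty (the quantity MM decreases)."""
    total = 0.0
    for x, y in zip(xs, ys):
        u = 1.0 - y * sum(b * v for b, v in zip(beta, x))
        total += 0.5 * (u + math.sqrt(u * u + eps))
    return total / len(xs) + lam * sum(b * b for b in beta)


def fit_svm_irls(xs, ys, lam=0.01, eps=1e-6, iters=50):
    """Fit a linear SVM by iteratively reweighted least squares.

    xs: feature rows (include a constant 1.0 for an intercept, which this
        sketch penalizes along with the other coefficients).
    ys: labels in {-1, +1}.
    """
    n, d = len(xs), len(xs[0])
    beta = [0.0] * d
    history = [objective(beta, xs, ys, lam, eps)]
    for _ in range(iters):
        # Normal equations of the quadratic majorizer at the current beta:
        # (Z' W^-1 Z + 4 n lam I) beta = Z' (1 + W^-1 1), with z_i = y_i x_i.
        lhs = [[4.0 * n * lam if i == j else 0.0 for j in range(d)]
               for i in range(d)]
        rhs = [0.0] * d
        for x, y in zip(xs, ys):
            z = [y * v for v in x]
            u = 1.0 - sum(b * v for b, v in zip(beta, z))
            w = math.sqrt(u * u + eps)  # weight from the current residual
            for i in range(d):
                rhs[i] += z[i] * (1.0 + 1.0 / w)
                for j in range(d):
                    lhs[i][j] += z[i] * z[j] / w
        beta = solve(lhs, rhs)
        history.append(objective(beta, xs, ys, lam, eps))
    return beta, history
```

The returned `history` records the smoothed objective after each step; because every update minimizes a majorizer that touches the objective at the current iterate, the recorded values should be non-increasing up to rounding error, which is the monotonicity property the abstract refers to.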