12,438 research outputs found
Statistical Significance of the Netflix Challenge
Inspired by the legacy of the Netflix contest, we provide an overview of what
has been learned---from our own efforts, and those of others---concerning the
problems of collaborative filtering and recommender systems. The data set
consists of about 100 million movie ratings (from 1 to 5 stars) involving some
480 thousand users and some 18 thousand movies; the associated ratings matrix
is about 99% sparse. The goal is to predict ratings that users will give to
movies; systems which can do this accurately have significant commercial
applications, particularly on the world wide web. We discuss, in some detail,
approaches to "baseline" modeling, singular value decomposition (SVD), as well
as kNN (nearest neighbor) and neural network models; temporal effects,
cross-validation issues, ensemble methods and other considerations are
discussed as well. We compare existing models in a search for new models, and
also discuss the mission-critical issues of penalization and parameter
shrinkage which arise when the dimensions of a parameter space reaches into the
millions. Although much work on such problems has been carried out by the
computer science and machine learning communities, our goal here is to address
a statistical audience, and to provide a primarily statistical treatment of the
lessons that have been learned from this remarkable set of data.Comment: Published in at http://dx.doi.org/10.1214/11-STS368 the Statistical
Science (http://www.imstat.org/sts/) by the Institute of Mathematical
Statistics (http://www.imstat.org
Graph-RAT: Combining data sources in music recommendation systems
The complexity of music recommendation systems has increased rapidly in recent years, drawing upon different sources of information: content analysis, web-mining, social tagging, etc. Unfortunately, the tools to scientifically evaluate such integrated systems are not readily available; nor are the base algorithms available. This article describes Graph-RAT (Graph-based Relational Analysis Toolkit), an open source toolkit that provides a framework for developing and evaluating novel hybrid systems. While this toolkit is designed for music recommendation, it has applications outside its discipline as well. An experiment—indicative of the sort of procedure that can be configured using the toolkit—is provided to illustrate its usefulness
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues
Recommender Systems
The ongoing rapid expansion of the Internet greatly increases the necessity
of effective recommender systems for filtering the abundant information.
Extensive research for recommender systems is conducted by a broad range of
communities including social and computer scientists, physicists, and
interdisciplinary researchers. Despite substantial theoretical and practical
achievements, unification and comparison of different approaches are lacking,
which impedes further advances. In this article, we review recent developments
in recommender systems and discuss the major challenges. We compare and
evaluate available algorithms and examine their roles in the future
developments. In addition to algorithms, physical aspects are described to
illustrate macroscopic behavior of recommender systems. Potential impacts and
future directions are discussed. We emphasize that recommendation has a great
scientific depth and combines diverse research fields which makes it of
interests for physicists as well as interdisciplinary researchers.Comment: 97 pages, 20 figures (To appear in Physics Reports
Analysis of UK and European NOx and VOC emission scenarios in the Defra model intercomparison exercise
This is a PDF file of an unedited manuscript that has been accepted for publication. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertainSimple emission scenarios have been implemented in eight United Kingdom air quality models with the aim of assessing how these models compared when addressing whether photochemical ozone formation in southern England was NOx- or VOC-sensitive and whether ozone precursor sources in the UK or in the Rest of Europe (RoE) were the most important during July 2006. The suite of models included three Eulerian-grid models (three implementations of one of these models), a Lagrangian atmospheric dispersion model and two moving box air parcel models. The assignments as to NOx- or VOC-sensitive and to UK- versus RoE-dominant, turned out to be highly variable and often contradictory between the individual models. However, when the assignments were filtered by model performance on each day, many of the contradictions could be eliminated. Nevertheless, no one model was found to be the 'best' model on all days, indicating that no single air quality model could currently be relied upon to inform policymakers robustly in terms of NOx- versus VOC-sensitivity and UK- versus RoE-dominance on each day. It is important to maintain a diversity in model approaches.Peer reviewedFinal Accepted Versio
- …