358 research outputs found
MLI: An API for Distributed Machine Learning
MLI is an Application Programming Interface designed to address the
challenges of building Machine Learn- ing algorithms in a distributed setting
based on data-centric computing. Its primary goal is to simplify the
development of high-performance, scalable, distributed algorithms. Our initial
results show that, relative to existing systems, this interface can be used to
build distributed implementations of a wide variety of common Machine Learning
algorithms with minimal complexity and highly competitive performance and
scalability
Stability of matrix factorization for collaborative filtering
We study the stability vis a vis adversarial noise of matrix factorization
algorithm for matrix completion. In particular, our results include: (I) we
bound the gap between the solution matrix of the factorization method and the
ground truth in terms of root mean square error; (II) we treat the matrix
factorization as a subspace fitting problem and analyze the difference between
the solution subspace and the ground truth; (III) we analyze the prediction
error of individual users based on the subspace stability. We apply these
results to the problem of collaborative filtering under manipulator attack,
which leads to useful insights and guidelines for collaborative filtering
system design.Comment: ICML201
Petuum: A New Platform for Distributed Machine Learning on Big Data
What is a systematic way to efficiently apply a wide spectrum of advanced ML
programs to industrial scale problems, using Big Models (up to 100s of billions
of parameters) on Big Data (up to terabytes or petabytes)? Modern
parallelization strategies employ fine-grained operations and scheduling beyond
the classic bulk-synchronous processing paradigm popularized by MapReduce, or
even specialized graph-based execution that relies on graph representations of
ML programs. The variety of approaches tends to pull systems and algorithms
design in different directions, and it remains difficult to find a universal
platform applicable to a wide range of ML programs at scale. We propose a
general-purpose framework that systematically addresses data- and
model-parallel challenges in large-scale ML, by observing that many ML programs
are fundamentally optimization-centric and admit error-tolerant,
iterative-convergent algorithmic solutions. This presents unique opportunities
for an integrative system design, such as bounded-error network synchronization
and dynamic scheduling based on ML program structure. We demonstrate the
efficacy of these system designs versus well-known implementations of modern ML
algorithms, allowing ML programs to run in much less time and at considerably
larger model sizes, even on modestly-sized compute clusters.Comment: 15 pages, 10 figures, final version in KDD 2015 under the same titl
Outlier-Resilient Web Service QoS Prediction
The proliferation of Web services makes it difficult for users to select the
most appropriate one among numerous functionally identical or similar service
candidates. Quality-of-Service (QoS) describes the non-functional
characteristics of Web services, and it has become the key differentiator for
service selection. However, users cannot invoke all Web services to obtain the
corresponding QoS values due to high time cost and huge resource overhead.
Thus, it is essential to predict unknown QoS values. Although various QoS
prediction methods have been proposed, few of them have taken outliers into
consideration, which may dramatically degrade the prediction performance. To
overcome this limitation, we propose an outlier-resilient QoS prediction method
in this paper. Our method utilizes Cauchy loss to measure the discrepancy
between the observed QoS values and the predicted ones. Owing to the robustness
of Cauchy loss, our method is resilient to outliers. We further extend our
method to provide time-aware QoS prediction results by taking the temporal
information into consideration. Finally, we conduct extensive experiments on
both static and dynamic datasets. The results demonstrate that our method is
able to achieve better performance than state-of-the-art baseline methods.Comment: 12 pages, to appear at the Web Conference (WWW) 202
A scalable recommender system : using latent topics and alternating least squares techniques
Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced AnalyticsA recommender system is one of the major techniques that handles information overload problem of Information Retrieval. Improves access and proactively recommends relevant information to each user, based on preferences and objectives. During the implementation and planning phases, designers have to cope with several issues and challenges that need proper attention. This thesis aims to show the issues and challenges in developing high-quality recommender systems.
A paper solves a current research problem in the field of job recommendations using a distributed algorithmic framework built on top of Spark for parallel computation which allows the algorithm to scale linearly with the growing number of users.
The final solution consists of two different recommenders which could be utilised for different purposes. The first method is mainly driven by latent topics among users, meanwhile the second technique utilises a latent factor algorithm that directly addresses the preference-confidence paradigm
- …