Balancing the Communication Load of Asynchronously Parallelized Machine Learning Algorithms
Stochastic Gradient Descent (SGD) is the standard numerical method used to
solve the core optimization problem for the vast majority of machine learning
(ML) algorithms. In the context of large scale learning, as utilized by many
Big Data applications, efficient parallelization of SGD is a focus of
active research. Recently, we were able to show that the asynchronous
communication paradigm can be applied to achieve a fast and scalable
parallelization of SGD. Asynchronous Stochastic Gradient Descent (ASGD)
outperforms other, mostly MapReduce based, parallel algorithms solving large
scale machine learning problems. In this paper, we investigate the impact of
asynchronous communication frequency and message size on the performance of
ASGD applied to large scale ML on HTC cluster and cloud environments. We
introduce a novel algorithm for the automatic balancing of the asynchronous
communication load, which allows ASGD to adapt to changing network bandwidths
and latencies.
Comment: arXiv admin note: substantial text overlap with arXiv:1505.0495
A Tale of Two Data-Intensive Paradigms: Applications, Abstractions, and Architectures
Scientific problems that depend on processing large amounts of data require
overcoming challenges in multiple areas: managing large-scale data
distribution, co-placement and scheduling of data with compute resources, and
storing and transferring large volumes of data. We analyze the ecosystems of
the two prominent paradigms for data-intensive applications, hereafter referred
to as the high-performance computing and the Apache-Hadoop paradigms. We propose
a basis, common terminology, and functional factors upon which to analyze the
two paradigms. We discuss the concept of "Big Data Ogres"
and their facets as means of understanding and characterizing the most common
application workloads found across the two paradigms. We then discuss the
salient features of the two paradigms, and compare and contrast the two
approaches. Specifically, we examine common implementations and approaches of these
paradigms, shed light upon the reasons for their current "architecture" and
discuss some typical workloads that utilize them. In spite of the significant
software distinctions, we believe there is architectural similarity. We discuss
the potential integration of different implementations, across the different
levels and components. Our comparison progresses from a fully qualitative
examination of the two paradigms to a semi-quantitative methodology. We use a
simple and broadly used Ogre (K-means clustering) and characterize its performance
on a range of representative platforms, covering several implementations from
both paradigms. Our experiments provide insight into the relative strengths
of the two paradigms. We propose that the set of Ogres will serve as a
benchmark to evaluate the two paradigms along different dimensions.
Comment: 8 pages, 2 figures
Challenges of Big Data Analysis
Big Data bring new opportunities to modern society and challenges to data
scientists. On one hand, Big Data hold great promise for discovering subtle
population patterns and heterogeneities that are not possible with small-scale
data. On the other hand, the massive sample size and high dimensionality of Big
Data introduce unique computational and statistical challenges, including
scalability and storage bottlenecks, noise accumulation, spurious correlation,
incidental endogeneity, and measurement errors. These challenges are
distinctive and require new computational and statistical paradigms. This
article gives an overview of the salient features of Big Data and how these
features drive paradigm change in statistical and computational methods as
well as computing architectures. We also provide various new perspectives on
Big Data analysis and computation. In particular, we emphasize the
viability of the sparsest solution in a high-confidence set and point out that
exogenous assumptions in most statistical methods for Big Data cannot be
validated due to incidental endogeneity. They can lead to wrong statistical
inferences and consequently wrong scientific conclusions.