MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That's Not a Nail!
Hadoop is currently the large-scale data analysis "hammer" of choice, but
there exist classes of algorithms that aren't "nails", in the sense that they
are not particularly amenable to the MapReduce programming model. To address
this, researchers have proposed MapReduce extensions or alternative programming
models in which these algorithms can be elegantly expressed. This essay
espouses a very different position: that MapReduce is "good enough", and that
instead of trying to invent screwdrivers, we should simply get rid of
everything that's not a nail. To be more specific, much discussion in the
literature surrounds the fact that iterative algorithms are a poor fit for
MapReduce: the simple solution is to find alternative non-iterative algorithms
that solve the same problem. This essay captures my personal experiences as an
academic researcher as well as a software engineer in a "real-world" production
analytics environment. From this combined perspective I reflect on the current
state and future of "big data" research.
GIANT: Globally Improved Approximate Newton Method for Distributed Optimization
For distributed computing environments, we consider the empirical risk
minimization problem and propose a distributed and communication-efficient
Newton-type optimization method. At every iteration, each worker locally finds
an Approximate NewTon (ANT) direction, which is sent to the main driver. The
main driver, then, averages all the ANT directions received from workers to
form a Globally Improved ANT (GIANT) direction. GIANT is highly
communication efficient and naturally exploits the trade-offs between local
computations and global communications in that more local computations result
in fewer overall rounds of communications. Theoretically, we show that GIANT
enjoys an improved convergence rate as compared with first-order methods and
existing distributed Newton-type methods. Further, and in sharp contrast with
many existing distributed Newton-type methods, as well as popular first-order
methods, a highly advantageous practical feature of GIANT is that it only
involves one tuning parameter. We conduct large-scale experiments on a computer
cluster and, empirically, demonstrate the superior performance of GIANT.
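
The averaging scheme described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a ridge-regression objective, equally sized shards, a unit step size, and an extra communication round in which the driver first averages the local gradients. All function names (`solve`, `local_gradient`, `local_hessian`, `average`, `giant_step`) are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def local_gradient(X, y, w, lam):
    """Gradient of 1/(2n) * sum((x.w - y)^2) + lam/2 * |w|^2 on one shard."""
    n, d = len(X), len(w)
    g = [lam * wj for wj in w]
    for xi, yi in zip(X, y):
        residual = sum(a * b for a, b in zip(xi, w)) - yi
        for j in range(d):
            g[j] += residual * xi[j] / n
    return g


def local_hessian(X, lam):
    """Hessian of the same ridge objective on one shard: X^T X / n + lam * I."""
    n, d = len(X), len(X[0])
    H = [[lam if i == j else 0.0 for j in range(d)] for i in range(d)]
    for xi in X:
        for i in range(d):
            for j in range(d):
                H[i][j] += xi[i] * xi[j] / n
    return H


def average(vectors):
    """Element-wise mean of equally weighted vectors (assumes equal shard sizes)."""
    m = len(vectors)
    return [sum(v[j] for v in vectors) / m for j in range(len(vectors[0]))]


def giant_step(shards, w, lam):
    """One iteration: average gradients, solve locally, average the directions."""
    # Assumed first round: the driver averages the workers' local gradients.
    g = average([local_gradient(X, y, w, lam) for X, y in shards])
    # Each worker solves its local Newton system against the shared gradient.
    directions = [solve(local_hessian(X, lam), g) for X, _ in shards]
    # The driver averages the local directions into one global direction.
    p = average(directions)
    # Unit step size is an assumption of this sketch.
    return [wj - pj for wj, pj in zip(w, p)]
```

Here `shards` is a list of `(X, y)` pairs, one per simulated worker, and repeated calls to `giant_step` play the role of successive communication rounds.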
Speculative Approximations for Terascale Analytics
Model calibration is a major challenge faced by the plethora of statistical
analytics packages that are increasingly used in Big Data applications.
Identifying the optimal model parameters is a time-consuming process that has
to be executed from scratch for every dataset/model combination even by
experienced data scientists. We argue that the incapacity to evaluate multiple
parameter configurations simultaneously and the lack of support to quickly
identify sub-optimal configurations are the principal causes. In this paper, we
develop two database-inspired techniques for efficient model calibration.
Speculative parameter testing applies advanced parallel multi-query processing
methods to evaluate several configurations concurrently. The number of
configurations is determined adaptively at runtime, while the configurations
themselves are extracted from a distribution that is continuously learned
following a Bayesian process. Online aggregation is applied to identify
sub-optimal configurations early in the processing by incrementally sampling
the training dataset and estimating the objective function corresponding to
each configuration. We design concurrent online aggregation estimators and
define halting conditions to accurately and timely stop the execution. We apply
the proposed techniques to distributed gradient descent optimization -- batch
and incremental -- for support vector machines and logistic regression models.
We implement the resulting solutions in GLADE PF-OLA -- a state-of-the-art Big
Data analytics system -- and evaluate their performance over terascale-size
synthetic and real datasets. The results confirm that as many as 32
configurations can be evaluated concurrently almost as fast as one, while
sub-optimal configurations are detected accurately in as little as a
fraction of the time.
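
A rough sketch of the two ideas in this abstract, concurrent evaluation of several configurations over a shared scan and early detection of sub-optimal ones from incremental samples, might look as follows. It is not the GLADE PF-OLA implementation: the fixed candidate list, the normal-approximation confidence bounds, the parameter `z`, and the name `prune_configs` are all assumptions made for illustration.

```python
import math


def prune_configs(data, configs, loss, chunk_size=100, z=3.0):
    """Scan data in chunks and drop configurations that are clearly worse.

    Returns (indices of surviving configs, per-config [count, mean, M2] stats).
    """
    # Running count, mean, and sum of squared deviations (Welford's method).
    stats = {k: [0, 0.0, 0.0] for k in range(len(configs))}
    alive = set(stats)
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        # One shared pass over the chunk serves every live configuration.
        for example in chunk:
            for k in alive:
                n, mean, m2 = stats[k]
                value = loss(configs[k], example)
                n += 1
                delta = value - mean
                mean += delta / n
                m2 += delta * (value - mean)
                stats[k] = [n, mean, m2]
        # Approximate confidence interval on each mean loss estimate.
        bounds = {}
        for k in alive:
            n, mean, m2 = stats[k]
            if n > 1:
                half = z * math.sqrt(max(m2, 0.0) / (n - 1) / n)
            else:
                half = math.inf
            bounds[k] = (mean - half, mean + half)
        # Halt a configuration once its lower bound exceeds the best upper bound.
        best_upper = min(upper for _, upper in bounds.values())
        alive = {k for k in alive if bounds[k][0] <= best_upper}
    return alive, stats
```

In this sketch `loss(config, example)` is any per-example objective supplied by the caller. The configuration with the smallest upper bound always survives, so the surviving set is never empty.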
Ray: A Distributed Execution Engine for the Machine Learning Ecosystem
In recent years, growing data volumes and more sophisticated computational procedures have greatly increased the demand for computational power. Machine learning and artificial intelligence applications, for example, are notorious for their computational requirements. At the same time, Moore's law is ending and processor speeds are stalling. As a result, distributed computing has become ubiquitous. While the cloud makes distributed hardware infrastructure widely accessible and therefore offers the potential of horizontal scale, developing these distributed algorithms and applications remains surprisingly hard. This is due to the inherent complexity of concurrent algorithms, the engineering challenges that arise when communicating between many machines, requirements such as fault tolerance and straggler mitigation that arise at large scale, and the lack of a general-purpose distributed execution engine that can support a wide variety of applications. In this thesis, we study the requirements for a general-purpose distributed computation model and present a solution that is easy to use yet expressive and resilient to faults. At its core, our model takes familiar concepts from serial programming, namely functions and classes, and generalizes them to the distributed world, thereby unifying stateless and stateful distributed computation. This model not only supports many machine learning workloads like training or serving, but is also a good fit for cross-cutting machine learning applications like reinforcement learning and data processing applications like streaming or graph processing. We implement this computational model as an open-source system called Ray, which matches or exceeds the performance of specialized systems in many application domains, while also offering horizontal scalability and strong fault tolerance properties.