Revisiting Distributed Synchronous SGD
Distributed training of deep learning models on large-scale training data is
typically conducted with asynchronous stochastic optimization to maximize the
rate of updates, at the cost of additional noise introduced from asynchrony. In
contrast, the synchronous approach is often thought to be impractical due to
idle time wasted on waiting for straggling workers. We revisit these
conventional beliefs in this paper, and examine the weaknesses of both
approaches. We demonstrate that a third approach, synchronous optimization with
backup workers, can avoid asynchronous noise while mitigating the effect of the
worst stragglers. Our approach is empirically validated and shown to converge
faster and to reach better test accuracies.
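For intuition, here is a minimal simulation of the backup-worker idea: each step dispatches the gradient job to more workers than needed, and the master aggregates only the first arrivals. The exponential completion times, worker counts, and learning rate are illustrative assumptions, not details from the paper.

```python
import random

def sync_step_with_backups(grad_fn, w, n_needed, n_workers, lr=0.1):
    """One synchronous SGD step with backup workers: dispatch the job to
    n_workers (> n_needed), average the first n_needed gradients to
    arrive, and discard the stragglers' results."""
    # Simulated per-worker completion times; exponential tails are a
    # common straggler model (an assumption, not from the paper).
    finish = sorted(random.expovariate(1.0) for _ in range(n_workers))
    step_time = finish[n_needed - 1]    # wall clock: wait for the n_needed-th
    grads = [grad_fn(w) for _ in range(n_needed)]  # stragglers' work dropped
    avg = [sum(g[i] for g in grads) / n_needed for i in range(len(w))]
    w = [wi - lr * gi for wi, gi in zip(w, avg)]
    return w, step_time
```

With n_workers greater than n_needed, the step time is an order statistic well below the maximum, which is how backup workers blunt the straggler tail without introducing asynchronous noise.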
Robust Gradient Descent via Moment Encoding with LDPC Codes
This paper considers the problem of implementing large-scale gradient descent
algorithms in a distributed computing setting in the presence of {\em
straggling} processors. To mitigate the effect of the stragglers, it has been
previously proposed to encode the data with an erasure-correcting code and
decode at the master server at the end of the computation. We, instead, propose
to encode the second moment of the data with a low-density parity-check (LDPC)
code. The iterative decoding algorithms for LDPC codes have very low
computational overhead and the number of decoding iterations can be made to
automatically adjust to the number of stragglers in the system. We show that,
under a random model for stragglers, the proposed moment-encoding-based
gradient descent method can be viewed as a stochastic gradient descent method.
This allows us to obtain convergence guarantees for the proposed solution.
Furthermore, the proposed moment-encoding-based method is shown to outperform
existing schemes in a real distributed computing setup.
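As a rough sketch of the second-moment formulation (without an actual LDPC code): for least squares, the full gradient is A w - b with A = X^T X and b = X^T y, so dropping the blocks of A held by straggling workers and rescaling the survivors yields an unbiased stochastic gradient, mirroring the paper's SGD view. The worker count, straggler probability, and plain block split below are assumptions standing in for the LDPC encoding and decoding.

```python
import numpy as np

def moment_gd(X, y, steps=100, lr=0.01, p_straggle=0.2, rng=None):
    """Sketch of second-moment-based gradient descent for least squares.
    Row blocks of A = X^T X are held by workers; blocks from straggling
    workers are dropped (a stand-in for partial LDPC decoding), which
    makes each step a stochastic-gradient step."""
    rng = rng or np.random.default_rng(0)
    A, b = X.T @ X, X.T @ y
    blocks = np.array_split(np.arange(A.shape[0]), 4)  # 4 workers (assumed)
    w = np.zeros(A.shape[1])
    for _ in range(steps):
        g = np.zeros_like(w)
        alive = [blk for blk in blocks if rng.random() > p_straggle]
        if not alive:
            continue
        # Unbiased rescaling given at least one surviving block: by
        # symmetry, each block survives with probability len(alive)/len(blocks).
        scale = len(blocks) / len(alive)
        for blk in alive:
            g[blk] = scale * (A[blk] @ w - b[blk])
        w -= lr * g
    return w
```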
Towards Stochastically Optimizing Data Computing Flows
With rapid growth in the amount of unstructured data produced by
memory-intensive applications, large scale data analytics has recently
attracted increasing interest. Processing, managing and analyzing this huge
amount of data poses several challenges in cloud and data center computing
domain. In particular, conventional frameworks for distributed data analytics
are based on the assumption that the data-processing nodes are homogeneous and
non-stochastic. This paper examines the fundamental factors that limit the
scaling of big data computation. We show that as the number of serial and
parallel computing servers increases, the tail (mean and variance) of the job
execution time increases as well. We first propose a model to predict the
response time of highly distributed processing tasks, and then propose a new
practical computational algorithm to optimize the response time.
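The tail growth the paper points at can be seen in a small Monte Carlo model: serial stages compose by addition, and parallel servers within a stage synchronize on the slowest (a max), whose mean grows roughly like ln(p) under exponential service times. The exponential service-time model and the stage structure below are illustrative assumptions.

```python
import random
import statistics

def job_time(serial_stages, parallel_servers, rng=random):
    """One sample of job execution time: serial stages add up, and each
    stage waits for its slowest parallel server (barrier sync)."""
    return sum(
        max(rng.expovariate(1.0) for _ in range(parallel_servers))
        for _ in range(serial_stages)
    )

# The per-stage max of p unit-rate exponentials has mean H_p ~ ln(p),
# so both the mean and the variance of the job time grow with scale.
for p in (1, 8, 64):
    samples = [job_time(serial_stages=4, parallel_servers=p)
               for _ in range(2000)]
    print(p, round(statistics.mean(samples), 2),
          round(statistics.variance(samples), 2))
```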
Anytime Stochastic Gradient Descent: A Time to Hear from all the Workers
In this paper, we focus on approaches to parallelizing stochastic gradient
descent (SGD) wherein data is farmed out to a set of workers, the results of
which, after a number of updates, are then combined at a central master node.
Although such synchronized SGD approaches parallelize well in idealized
computing environments, they often fail to realize their promised computational
acceleration in practical settings. One cause is slow workers, termed
stragglers, which can cause the fusion step at the master node to stall,
greatly slowing convergence. In many straggler-mitigation approaches, the work
completed by these nodes, while only partial, is discarded entirely. In this
paper, we propose an approach to parallelizing synchronous SGD that exploits
the work completed by all workers. The central idea is to fix the computation
time of each worker and then to combine distinct contributions of all workers.
We provide a convergence analysis and optimize the combination function. Our
numerical results demonstrate improvements of several factors of magnitude
over existing methods.
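A minimal sketch of the fixed-time idea: every worker runs until a common time budget expires, and the master combines all partial work rather than discarding it. Weighting workers by the number of gradients they finished is a simple stand-in for the optimized combination function in the paper; the simulated worker speeds are assumptions.

```python
import random

def anytime_sgd_step(grad_fn, w, n_workers, time_budget, lr=0.05):
    """One step: workers compute mini-batch gradients until the shared
    time budget expires; the master averages every gradient received,
    so even a straggler's partial work contributes."""
    sums, count = [0.0] * len(w), 0
    for _ in range(n_workers):
        # Simulated heterogeneous speed: gradients finished in the budget.
        speed = random.uniform(0.5, 2.0)        # assumed heterogeneity
        k = max(0, int(speed * time_budget))
        for _ in range(k):
            g = grad_fn(w)
            sums = [s + gi for s, gi in zip(sums, g)]
        count += k
    if count:
        w = [wi - lr * s / count for wi, s in zip(w, sums)]
    return w
```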
High Throughput Synchronous Distributed Stochastic Gradient Descent
We introduce a new, high-throughput, synchronous, distributed, data-parallel,
stochastic-gradient-descent learning algorithm. This algorithm uses amortized
inference in a compute-cluster-specific, deep, generative, dynamical model to
perform joint posterior predictive inference of the mini-batch gradient
computation times of all worker-nodes in a parallel computing cluster. We show
that, by utilizing such a model, a synchronous parameter server can choose a
cutoff time, beyond which mini-batch gradient messages from slow workers are
ignored, that maximizes overall mini-batch gradient computations per second.
In keeping with earlier findings we observe that, under realistic conditions,
eagerly discarding the mini-batch gradient computations of stragglers not only
increases throughput but actually increases the overall rate of convergence as
a function of wall-clock time by virtue of eliminating idleness. The principal
novel contribution and finding of this work goes beyond this by demonstrating
that using the predicted run-times from a generative model of cluster worker
performance to dynamically adjust the cutoff improves substantially over the
static-cutoff prior art, leading to, among other things, significantly reduced
deep neural net training times on large computer clusters.
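Once run-times are predicted, the cutoff choice itself reduces to a small optimization: waiting until time t collects every worker predicted to finish by t, so only the predicted finish times need to be scanned as candidates. A sketch, assuming the predictions are already available as a list (the generative model that produces them is not reproduced here):

```python
def best_cutoff(predicted_times):
    """Pick the cutoff that maximizes mini-batch gradients per second.
    Only the workers' own predicted finish times can be optimal cutoffs,
    since waiting any longer between finishes adds time but no gradients."""
    candidates = sorted(predicted_times)
    return max(candidates,
               key=lambda t: sum(p <= t for p in candidates) / t)

# Example: one 9x straggler -> cut off right after the fast workers.
print(best_cutoff([1.0, 1.1, 1.2, 9.0]))  # 1.2 (3 gradients / 1.2 s)
```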
AdaDelay: Delay Adaptive Distributed Stochastic Convex Optimization
We study distributed stochastic convex optimization under the delayed
gradient model where the server nodes perform parameter updates, while the
worker nodes compute stochastic gradients. We discuss, analyze, and experiment
with a setup motivated by the behavior of real-world distributed computation
networks, where machines are slow to different degrees at different times. We
therefore allow the parameter updates to be sensitive to the actual delays
experienced, rather than to worst-case bounds on the maximum delay. This
sensitivity leads to larger stepsizes that can help achieve rapid initial
convergence without having to wait too long for slower machines, while
maintaining the same asymptotic complexity. We obtain encouraging improvements
to overall convergence for distributed experiments on real datasets with up to
billions of examples and features.
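In spirit, the delay-sensitive update looks like the following sketch, where the step size shrinks with the gradient's actual observed delay instead of a worst-case bound. The specific schedule and constants here are illustrative assumptions, not the paper's exact choices.

```python
import math

def delay_adaptive_update(w, grad, t, delay, c=1.0, L=1.0):
    """Server-side update in the spirit of AdaDelay: the step size
    alpha = 1 / (L + c * sqrt(t + delay)) depends on the *actual* delay
    of the applied gradient, so fresh gradients take larger steps while
    stale ones are damped (schedule and constants are assumptions)."""
    alpha = 1.0 / (L + c * math.sqrt(t + delay))
    return [wi - alpha * gi for wi, gi in zip(w, grad)]
```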
Optimal Placement of Cores, Caches and Memory Controllers in Network On-Chip
Parallel programming is emerging fast, and intensive applications need more
resources, so there is huge demand for on-chip multiprocessors. Accessing the
L1 caches beside the cores is the fastest access after registers, but the size
of private caches cannot grow because of design, cost, and technology limits,
so split I-caches and D-caches are used together with a shared LLC (last-level
cache). For a unified shared LLC, a bus interface is not scalable, so a
distributed shared LLC (DSLLC) seems the better choice. Most papers assume a
distributed shared LLC beside each core in the on-chip network. Many works
assume that DSLLCs are placed at all cores; however, we show that this design
ignores the effect of traffic congestion in the on-chip network. Our work
instead focuses on the optimal placement of cores, DSLLCs, and even memory
controllers to minimize the expected latency, based on the traffic load, in a
mesh on-chip network with a fixed number of cores and a fixed total cache
capacity. We develop an analytical model, derive the intended cost function,
and then optimize the mean delay of on-chip network communication. The model
is to be validated using traffic patterns run on the CSIM simulator.
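A toy version of the placement problem, approximating latency by traffic-weighted Manhattan distance on the mesh and searching placements exhaustively. It omits the congestion modeling that is central to the paper, and the mesh size, units, and traffic volumes are assumptions.

```python
from itertools import permutations

def mean_latency(placement, traffic, mesh_w):
    """Traffic-weighted mean hop count, approximating latency between
    tiles by Manhattan distance on a mesh_w-wide mesh (no congestion)."""
    def xy(tile):
        return tile % mesh_w, tile // mesh_w
    total = weight = 0.0
    for (a, b), vol in traffic.items():
        (xa, ya), (xb, yb) = xy(placement[a]), xy(placement[b])
        total += vol * (abs(xa - xb) + abs(ya - yb))
        weight += vol
    return total / weight

# Toy 2x2 mesh with two cores, one DSLLC bank, and one memory controller;
# the traffic volumes are illustrative assumptions.
units = ["core0", "core1", "llc0", "mc0"]
traffic = {("core0", "llc0"): 5, ("core1", "llc0"): 5, ("llc0", "mc0"): 2}
best = min(
    (dict(zip(units, perm)) for perm in permutations(range(4))),
    key=lambda pl: mean_latency(pl, traffic, mesh_w=2),
)
print(best)  # the heavy core<->cache flows land on adjacent tiles
```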
Asynchronous Complex Analytics in a Distributed Dataflow Architecture
Scalable distributed dataflow systems have recently experienced widespread
adoption, with commodity dataflow engines such as Hadoop and Spark, and even
commodity SQL engines routinely supporting increasingly sophisticated analytics
tasks (e.g., support vector machines, logistic regression, collaborative
filtering). However, these systems' synchronous (often Bulk Synchronous
Parallel) dataflow execution model is at odds with an increasingly important
trend in the machine learning community: the use of asynchrony via shared,
mutable state (i.e., data races) in convex programming tasks, which has---in a
single-node context---delivered noteworthy empirical performance gains and
inspired new research into asynchronous algorithms. In this work, we attempt to
bridge this gap by evaluating the use of lightweight, asynchronous state
transfer within a commodity dataflow engine. Specifically, we investigate the
use of asynchronous sideways information passing (ASIP) that presents
single-stage parallel iterators with a Volcano-like intra-operator iterator
that can be used for asynchronous information passing. We port two synchronous
convex programming algorithms, stochastic gradient descent and the alternating
direction method of multipliers (ADMM), to use ASIPs. We evaluate an
implementation of ASIPs within Apache Spark that exhibits considerable
speedups as well as a rich set of performance trade-offs in the use of these
asynchronous algorithms.
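The single-node asynchrony that ASIP aims to bring to a dataflow engine can be illustrated with a Hogwild!-style sketch: threads update one shared parameter vector without barriers. This shows the shared-mutable-state idea only; it is not the Spark/ASIP implementation, and the hinge-loss objective and thread counts are assumptions.

```python
import threading
import numpy as np

def async_shared_sgd(X, y, n_workers=4, epochs=5, lr=0.01):
    """Shared-mutable-state SGD: workers read and write one parameter
    vector with no synchronization (benign data races), the single-node
    pattern whose gains motivated ASIP."""
    w = np.zeros(X.shape[1])              # shared, updated without locks
    parts = np.array_split(np.arange(X.shape[0]), n_workers)

    def work(idx):
        for _ in range(epochs):
            for i in idx:
                margin = y[i] * (X[i] @ w)       # y in {-1, +1}
                if margin < 1:                   # hinge-loss subgradient
                    w[:] += lr * y[i] * X[i]     # racy in-place update

    threads = [threading.Thread(target=work, args=(p,)) for p in parts]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w
```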
Harnessing the Power of Serverless Runtimes for Large-Scale Optimization
The event-driven and elastic nature of serverless runtimes makes them a very
efficient and cost-effective alternative for scaling up computations. So far,
they have mostly been used for stateless, data parallel and ephemeral
computations. In this work, we propose using serverless runtimes to solve
generic, large-scale optimization problems. Specifically, we build a
master-worker setup using AWS Lambda as the source of our workers, implement a
parallel optimization algorithm to solve a regularized logistic regression
problem, and show that relative speedups up to 256 workers and efficiencies
above 70% up to 64 workers can be expected. We also identify possible
algorithmic and system-level bottlenecks, propose improvements, and discuss the
limitations and challenges in realizing these improvements.
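The master side of such a setup can be sketched as a fan-out/fan-in step. Here `invoke` is a hypothetical stand-in for an AWS Lambda invocation returning a shard's partial gradient, and the logistic-gradient worker is included only so the sketch runs locally; names and payload shapes are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def serverless_grad_step(invoke, shards, w, lr=0.1, lam=1e-3):
    """One master step: fan the gradient task out to serverless workers,
    average the partial gradients, and apply an L2-regularized update."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: invoke(s, w), shards))
    g = np.mean(partials, axis=0) + lam * w   # ridge-penalty gradient term
    return w - lr * g

def local_invoke(shard, w):
    """Local stand-in for a Lambda worker: logistic-regression gradient
    on one data shard, with labels y in {-1, +1}."""
    X, y = shard
    p = 1.0 / (1.0 + np.exp(y * (X @ w)))     # per-example sigmoid factor
    return -(X.T @ (y * p)) / len(y)
```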
Strategies and Principles of Distributed Machine Learning on Big Data
The rise of Big Data has led to new demands for Machine Learning (ML) systems
to learn complex models with millions to billions of parameters that promise
adequate capacity to digest massive datasets and offer powerful predictive
analytics thereupon. In order to run ML algorithms at such scales, on a
distributed cluster with tens to thousands of machines, significant
engineering effort is often required, and one might fairly ask whether such
engineering truly falls within the domain of ML research. Taking the view that
Big ML systems can benefit greatly from ML-rooted statistical and algorithmic
insights, and that ML researchers should therefore not shy away from such
systems design, we discuss a series of principles and strategies
distilled from our recent efforts on industrial-scale ML solutions. These
principles and strategies span a continuum from application, to engineering,
and to theoretical research and development of Big ML systems and
architectures, with the goal of understanding how to make them efficient,
generally applicable, and supported with convergence and scaling guarantees.
They concern four key questions which traditionally receive little attention in
ML research: How to distribute an ML program over a cluster? How to bridge ML
computation with inter-machine communication? How to perform such
communication? What should be communicated between machines? By exposing
underlying statistical and algorithmic characteristics unique to ML programs
but not typically seen in traditional computer programs, and by dissecting
successful cases to reveal how we have harnessed these principles to design and
develop both high-performance distributed ML software as well as
general-purpose ML frameworks, we present opportunities for ML researchers and
practitioners to further shape and grow the area that lies between ML and
systems.
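As one concrete instance of the compute/communicate bridge those four questions circle, here is a toy parameter-server skeleton with bounded staleness. Bounded staleness is a well-known bridging model between ML computation and inter-machine communication; its use here is our illustration, not a claim about the specific systems the abstract describes.

```python
class ParameterServer:
    """Toy single-process parameter server: workers pull a (possibly
    stale) copy of the parameters, compute locally, and push additive
    updates.  A staleness bound keeps fast workers from racing too far
    ahead of the slowest one (constants are assumptions)."""

    def __init__(self, dim, staleness=2):
        self.w = [0.0] * dim
        self.clock = {}            # worker id -> last synced iteration
        self.staleness = staleness

    def pull(self, worker, iteration):
        # Refuse workers that have run too far ahead of the slowest;
        # the caller should wait and retry on None.
        slowest = min(self.clock.values(), default=0)
        if iteration - slowest > self.staleness:
            return None
        self.clock[worker] = iteration
        return list(self.w)

    def push(self, delta):
        # Updates are additive, so they commute across workers.
        self.w = [wi + di for wi, di in zip(self.w, delta)]
```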