GIANT: Globally Improved Approximate Newton Method for Distributed Optimization
For distributed computing environments, we consider the empirical risk
minimization problem and propose a distributed and communication-efficient
Newton-type optimization method. At every iteration, each worker locally finds
an Approximate NewTon (ANT) direction, which is sent to the main driver. The
main driver, then, averages all the ANT directions received from workers to
form a Globally Improved ANT (GIANT) direction. GIANT is highly
communication efficient and naturally exploits the trade-offs between local
computations and global communications in that more local computations result
in fewer overall rounds of communications. Theoretically, we show that GIANT
enjoys an improved convergence rate as compared with first-order methods and
existing distributed Newton-type methods. Further, and in sharp contrast with
many existing distributed Newton-type methods, as well as popular first-order
methods, a highly advantageous practical feature of GIANT is that it only
involves one tuning parameter. We conduct large-scale experiments on a computer
cluster and, empirically, demonstrate the superior performance of GIANT.
Comment: Fixed some typos. Improved writing
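The worker/driver exchange described in the GIANT abstract can be illustrated with a small sketch. It assumes a ridge-regularized least-squares objective in two dimensions, equal-size shards, and that each worker applies its local Hessian to the globally averaged gradient. These choices, the synthetic data, and all function names are illustrative assumptions, not the authors' implementation.

```python
import random

def solve2(H, g):
    """Solve the 2x2 linear system H p = g with Cramer's rule."""
    (a, b), (c, d) = H
    det = a * d - b * c
    return [(d * g[0] - b * g[1]) / det, (a * g[1] - c * g[0]) / det]

def local_grad_hess(X, y, w, lam):
    """Gradient and Hessian of 0.5*mean((x.w - y)^2) + 0.5*lam*|w|^2 on one shard."""
    n = len(X)
    g = [lam * w[0], lam * w[1]]
    H = [[lam, 0.0], [0.0, lam]]
    for x, t in zip(X, y):
        r = x[0] * w[0] + x[1] * w[1] - t
        for i in range(2):
            g[i] += r * x[i] / n
            for j in range(2):
                H[i][j] += x[i] * x[j] / n
    return g, H

def giant_like_step(shards, w, lam=1e-3):
    """One averaged approximate-Newton step (assumed form, equal-size shards)."""
    m = len(shards)
    stats = [local_grad_hess(X, y, w, lam) for X, y in shards]
    # Communication round 1: average the local gradients into a global gradient.
    g = [sum(s[0][i] for s in stats) / m for i in range(2)]
    # Communication round 2: each worker solves with its LOCAL Hessian,
    # and the driver averages the resulting directions.
    dirs = [solve2(s[1], g) for s in stats]
    p = [sum(d[i] for d in dirs) / m for i in range(2)]
    return [w[0] - p[0], w[1] - p[1]]

def make_shards(num_workers=4, per_worker=200, seed=0):
    """Hypothetical synthetic data generated from w_true = [2, -1]."""
    rng = random.Random(seed)
    w_true = [2.0, -1.0]
    shards = []
    for _ in range(num_workers):
        X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(per_worker)]
        y = [x[0] * w_true[0] + x[1] * w_true[1] + rng.gauss(0, 0.01) for x in X]
        shards.append((X, y))
    return shards

def run_giant_demo(iters=5):
    shards = make_shards()
    w = [0.0, 0.0]
    for _ in range(iters):
        w = giant_like_step(shards, w)
    return w

w = run_giant_demo()  # expected to land near the generating weights [2, -1]
```

In this sketch each iteration costs two rounds of communication of two-dimensional vectors, while the Hessian work stays local to each worker, which is the trade-off between local computation and communication that the abstract describes.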
Structure-Aware Dynamic Scheduler for Parallel Machine Learning
Training large machine learning (ML) models with many variables or parameters
can take a long time if one employs sequential procedures even with stochastic
updates. A natural solution is to turn to distributed computing on a cluster;
however, naive, unstructured parallelization of ML algorithms does not usually
lead to a proportional speedup and can even result in divergence, because
dependencies between model elements can attenuate the computational gains from
parallelization and compromise correctness of inference. Recent efforts toward
this issue have benefited from exploiting the static, a priori block structures
residing in ML algorithms. In this paper, we take this path further by
exploring the dynamic block structures and workloads therein present during ML
program execution, which offers new opportunities for improving convergence,
correctness, and load balancing in distributed ML. We propose and showcase a
general-purpose scheduler, STRADS, for coordinating distributed updates in ML
algorithms, which harnesses the aforementioned opportunities in a systematic
way. We provide theoretical guarantees for our scheduler, and demonstrate its
efficacy versus static block structures on Lasso and Matrix Factorization.
Distributed Bayesian Learning with Stochastic Natural-gradient Expectation Propagation and the Posterior Server
This paper makes two contributions to Bayesian machine learning algorithms.
Firstly, we propose stochastic natural gradient expectation propagation (SNEP),
a novel alternative to expectation propagation (EP), a popular variational
inference algorithm. SNEP is a black box variational algorithm, in that it does
not require any simplifying assumptions on the distribution of interest, beyond
the existence of some Monte Carlo sampler for estimating the moments of the EP
tilted distributions. Further, as opposed to EP which has no guarantee of
convergence, SNEP can be shown to be convergent, even when using Monte Carlo
moment estimates. Secondly, we propose a novel architecture for distributed
Bayesian learning which we call the posterior server. The posterior server
allows scalable and robust Bayesian learning in cases where a data set is
stored in a distributed manner across a cluster, with each compute node
containing a disjoint subset of data. An independent Monte Carlo sampler is run
on each compute node, with direct access only to the local data subset, but
which targets an approximation to the global posterior distribution given all
data across the whole cluster. This is achieved by using a distributed
asynchronous implementation of SNEP to pass messages across the cluster. We
demonstrate SNEP and the posterior server on distributed Bayesian learning of
logistic regression and neural networks.
Keywords: Distributed Learning, Large Scale Learning, Deep Learning, Bayesian
Learning, Variational Inference, Expectation Propagation, Stochastic
Approximation, Natural Gradient, Markov chain Monte Carlo, Parameter Server,
Posterior Server.
Comment: 37 pages, 7 figures
Stochastic Dual Ascent for Solving Linear Systems
We develop a new randomized iterative algorithm---stochastic dual ascent
(SDA)---for finding the projection of a given vector onto the solution space of
a linear system. The method is dual in nature, with the dual being a
non-strongly concave quadratic maximization problem without constraints. In
each iteration of SDA, a dual variable is updated by a carefully chosen point
in a subspace spanned by the columns of a random matrix drawn independently
from a fixed distribution. The distribution plays the role of a parameter of
the method. Our complexity results hold for a wide family of distributions of
random matrices, which opens the possibility to fine-tune the stochasticity of
the method to particular applications. We prove that primal iterates associated
with the dual process converge to the projection exponentially fast in
expectation, and give a formula and an insightful lower bound for the
convergence rate. We also prove that the same rate applies to dual function
values, primal function values and the duality gap. Unlike traditional
iterative methods, SDA converges under no additional assumptions on the system
(e.g., rank, diagonal dominance) beyond consistency. In fact, our lower bound
improves as the rank of the system matrix drops. Many existing randomized
methods for linear systems arise as special cases of SDA, including randomized
Kaczmarz, randomized Newton, randomized coordinate descent, Gaussian descent,
and their variants. In special cases where our method specializes to a known
algorithm, we either recover the best known rates, or improve upon them.
Finally, we show that the framework can be applied to the distributed average
consensus problem to obtain an array of new algorithms. The randomized gossip
algorithm arises as a special case.
Comment: This is a slightly refreshed version of the paper originally
submitted on Dec 21, 2015. We have added a numerical experiment involving
randomized Kaczmarz for rank-deficient systems, added a few relevant
references, and corrected a few typos. Stats: 29 pages, 2 algorithms, 1
figure
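Randomized Kaczmarz is named above as one special case of SDA. The following is a minimal sketch of that special case only (not of SDA in general), assuming rows are sampled with probability proportional to their squared norm. The example system is hypothetical: it is rank-deficient but consistent, so iterates started at the origin should approach the projection of the origin onto the solution set, which is the minimum-norm solution.

```python
import random

def randomized_kaczmarz(A, b, x0, iters=2000, seed=0):
    """Project the iterate onto one randomly chosen equation per step."""
    rng = random.Random(seed)
    # Squared row norms double as the sampling weights.
    norms = [sum(a * a for a in row) for row in A]
    x = list(x0)
    for _ in range(iters):
        i = rng.choices(range(len(A)), weights=norms)[0]
        row = A[i]
        # Signed, normalized residual of equation i at the current iterate.
        step = (b[i] - sum(a * xi for a, xi in zip(row, x))) / norms[i]
        x = [xi + step * a for xi, a in zip(x, row)]
    return x

# Third row is the sum of the first two, so the matrix has rank 2.
A = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 1.0],
     [1.0, 1.0, 1.0]]
b = [1.0, 2.0, 3.0]
x = randomized_kaczmarz(A, b, [0.0, 0.0, 0.0])
# x should be close to [1, 1, 1], the minimum-norm solution of A x = b.
```

The only assumption the example relies on is consistency of the system, which matches the abstract's statement that no rank condition is required.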
Asynchronous Approximation of a Single Component of the Solution to a Linear System
We present a distributed asynchronous algorithm for approximating a single
component of the solution to a system of linear equations $Ax = b$, where $A$
is a positive definite real matrix, and $b \in \mathbb{R}^n$. This is
equivalent to solving for $x_i$ in $x = Gx + z$ for some $G$ and $z$ such that
the spectral radius of $G$ is less than 1. Our algorithm relies on the Neumann
series characterization of the component $x_i$, and is based on residual
updates. We analyze our algorithm within the context of a cloud computation
model, in which the computation is split into small update tasks performed by
small processors with shared access to a distributed file system. We prove a
robust asymptotic convergence result when the spectral radius $\rho(|G|) < 1$,
regardless of the precise order and frequency in which the update tasks are
performed. We provide convergence rate bounds which depend on the order of
update tasks performed, analyzing both deterministic update rules via counting
weighted random walks, as well as probabilistic update rules via concentration
bounds. The probabilistic analysis requires analyzing the product of random
matrices which are drawn from distributions that are time and path dependent.
We specifically consider the setting where $n$ is large, yet $G$ is sparse,
e.g., each row has at most $d$ nonzero entries. This is motivated by
applications in which $G$ is derived from the edge structure of an underlying
graph. Our results prove that if the local neighborhood of the graph does not
grow too quickly as a function of $n$, our algorithm can provide significant
reduction in computation cost as opposed to any algorithm which computes the
global solution vector $x$. Our algorithm obtains an
additive approximation for $x_i$ in constant time with respect to the size of
the matrix when the maximum row sparsity and
- …
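The idea of estimating a single component through the Neumann series with residual updates can be sketched as follows. This is a simplified, sequential illustration under stated assumptions, not the authors' asynchronous algorithm: it writes the fixed-point system as x = Gx + z, stores G as a hypothetical dictionary of sparse rows, and always updates the coordinate with the largest residual. It maintains the invariant x_i = estimate + sum_u residual[u] * x[u], so the estimate becomes accurate as the residual shrinks.

```python
def single_component(G, z, i, tol=1e-12, max_updates=100000):
    """Estimate x[i] where x = G x + z, touching only rows reached from i."""
    estimate = 0.0
    residual = {i: 1.0}  # sparse residual vector, initially the unit vector e_i
    for _ in range(max_updates):
        if sum(abs(v) for v in residual.values()) < tol:
            break
        # Assumed update rule: pick the coordinate with the largest residual.
        u = max(residual, key=lambda k: abs(residual[k]))
        r_u = residual.pop(u)
        # Move the mass at u into the estimate and spread it along row u of G.
        estimate += r_u * z[u]
        for v, g_uv in G[u].items():
            residual[v] = residual.get(v, 0.0) + r_u * g_uv
    return estimate

# Hypothetical sparse system on a 4-node ring: each row has two nonzeros
# whose absolute values sum to 0.5, so the Neumann series converges.
n = 4
G = {u: {(u + 1) % n: 0.25, (u - 1) % n: 0.25} for u in range(n)}
z = [1.0, 2.0, 3.0, 4.0]
x0 = single_component(G, z, 0)
```

Because the residual is stored sparsely, each update touches only the nonzero entries of one row, which is why row sparsity matters for the cost of estimating one component.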