Evaluating the Efficiency of Asynchronous Systems with FASE
In this paper, we present FASE (Faster Asynchronous Systems Evaluation), a
tool for evaluating the worst-case efficiency of asynchronous systems. The tool
is based on some well-established results in the setting of a timed process
algebra (PAFAS: a Process Algebra for Faster Asynchronous Systems). To show the
applicability of FASE to concrete meaningful examples, we consider three
implementations of a bounded buffer and use FASE to automatically evaluate
their worst-case efficiency. We finally contrast our results with previous ones
where the efficiency of the same implementations has already been considered.
Comment: 14 pages, 5 figures. A preliminary version was presented as an
extended abstract in Pre-Proc. of the 1st Int. Workshop on Quantitative
Formal Methods, pp. 101-106, Technische Universiteit Eindhoven, 200
A Framework for the Evaluation of Worst-Case System Efficiency
In this paper we present FASE (Fast Asynchronous Systems Evaluation), a tool
for evaluating worst-case efficiency of asynchronous systems. This tool
implements some well-established results in the setting of a timed CCS-like
process algebra: PAFAS (a Process Algebra for Faster Asynchronous Systems).
Moreover, we discuss some new solutions that are useful to improve the
applicability of FASE to concrete meaningful examples. We finally use FASE to
evaluate the efficiency of three different implementations of a bounded buffer
and compare our results with previous ones, obtained when the same
implementations were contrasted according to an efficiency preorder.
Comment: 5 pages. In ICTCS 2010: 12th Italian Conference on Theoretical
Computer Science, University of Camerino, Camerino, 201
Asynchronous ADMM for Distributed Non-Convex Optimization in Power Systems
Large scale, non-convex optimization problems arising in many complex
networks such as the power system call for efficient and scalable distributed
optimization algorithms. Existing distributed methods are usually iterative and
require synchronization of all workers at each iteration, which is hard to
scale and could result in the under-utilization of computation resources due to
the heterogeneity of the subproblems. To address those limitations of
synchronous schemes, this paper proposes an asynchronous distributed
optimization method based on the Alternating Direction Method of Multipliers
(ADMM) for non-convex optimization. The proposed method only requires local
communications and allows each worker to perform local updates with information
from a subset of but not all neighbors. We provide sufficient conditions on the
problem formulation, the choice of algorithm parameters, and the network delay, and
show that under those mild conditions, the proposed asynchronous ADMM method
asymptotically converges to the KKT point of the non-convex problem. We
validate the effectiveness of asynchronous ADMM by applying it to the Optimal
Power Flow problem in multiple power systems and show that the convergence of
the proposed asynchronous scheme could be faster than its synchronous
counterpart in large-scale applications.
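The update pattern described above, local primal and dual steps taken against a possibly stale copy of the shared variable, can be sketched on a toy consensus problem. Everything below is illustrative: the scalar objective, the penalty rho, and the bounded-delay model are my own choices, not the paper's formulation.

```python
import random

def async_admm(targets, rho=1.0, iters=600, max_delay=2, seed=0):
    """Toy asynchronous consensus ADMM on f_i(x) = 0.5 * (x - a_i)^2,
    whose consensus solution is the mean of the a_i. One worker fires
    per tick and reads a possibly stale copy of the global variable z,
    mimicking bounded communication delay."""
    rng = random.Random(seed)
    n = len(targets)
    x = [0.0] * n            # local primal variables
    u = [0.0] * n            # scaled dual variables
    z_hist = [0.0]           # history of the global variable (for stale reads)
    for _ in range(iters):
        i = rng.randrange(n)                           # one worker updates
        delay = rng.randrange(min(max_delay, len(z_hist))) + 1
        z = z_hist[-delay]                             # possibly stale z
        # x-update: argmin_x 0.5*(x - a_i)^2 + (rho/2)*(x - z + u_i)^2
        x[i] = (targets[i] + rho * (z - u[i])) / (1.0 + rho)
        u[i] += x[i] - z                               # dual update
        # z-update: average of primal + dual, as in consensus ADMM
        z_hist.append(sum(xv + uv for xv, uv in zip(x, u)) / n)
    return z_hist[-1]
```

With two workers holding targets 0 and 2, the global variable settles near the consensus optimum 1 despite the stale reads, matching the paper's message that bounded delay does not prevent convergence.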
Asynchronous Decentralized Parallel Stochastic Gradient Descent
Most commonly used distributed machine learning systems are either
synchronous or centralized asynchronous. Synchronous algorithms like
AllReduce-SGD perform poorly in a heterogeneous environment, while asynchronous
algorithms using a parameter server suffer from 1) a communication bottleneck at
the parameter servers when there are many workers, and 2) significantly worse convergence
when the traffic to parameter server is congested. Can we design an algorithm
that is robust in a heterogeneous environment, while being communication
efficient and maintaining the best-possible convergence rate? In this paper, we
propose an asynchronous decentralized stochastic gradient descent algorithm
(AD-PSGD) satisfying all above expectations. Our theoretical analysis shows
AD-PSGD converges at the same optimal rate as SGD and has linear
speedup w.r.t. the number of workers. Empirically, AD-PSGD outperforms the best of
decentralized parallel SGD (D-PSGD), asynchronous parallel SGD (A-PSGD), and
standard data parallel SGD (AllReduce-SGD), often by orders of magnitude in a
heterogeneous environment. When training ResNet-50 on ImageNet with up to 128
GPUs, AD-PSGD converges (w.r.t. epochs) similarly to AllReduce-SGD, but each
epoch can be up to 4-8x faster than its synchronous counterpart in a
network-sharing HPC environment. To the best of our knowledge, AD-PSGD is the
first asynchronous algorithm that achieves a similar epoch-wise convergence
rate as AllReduce-SGD, at an over 100-GPU scale.
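The AD-PSGD recipe, where each worker averages its model with one randomly chosen neighbor and takes a local stochastic gradient step with no global barrier, can be sketched in a serialized toy form. The scalar least-squares objective, the gossip schedule, and the fixed step size below are illustrative assumptions, not the paper's setup.

```python
import random

def ad_psgd(shards, lr=0.1, steps=2000, seed=0):
    """Toy serialized AD-PSGD: each of n workers keeps its own model copy,
    averages it with a random neighbor's copy (gossip), and takes one SGD
    step on a sample from its local data shard. The pooled objective is
    the mean over all data d of 0.5 * (x - d)^2, minimized at the global
    mean of the data."""
    rng = random.Random(seed)
    n = len(shards)
    x = [0.0] * n                       # one model copy per worker
    for _ in range(steps):
        i = rng.randrange(n)            # asynchronous: one worker per tick
        j = rng.randrange(n)            # random gossip partner
        x[i] = x[j] = 0.5 * (x[i] + x[j])   # pairwise model averaging
        d = rng.choice(shards[i])       # stochastic sample from local shard
        x[i] -= lr * (x[i] - d)         # gradient step on 0.5 * (x - d)^2
    return sum(x) / n                   # consensus estimate
```

With shards [0, 1] and [3, 4] the pooled optimum is 2; the gossip averaging keeps the copies close while each worker's noisy local steps drive the consensus toward that optimum, even though the shards have very different local minima.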
Fast quantum Monte Carlo on a GPU
We present a scheme for the parallelization of quantum Monte Carlo on
graphical processing units, focusing on bosonic systems and variational Monte
Carlo. We use asynchronous execution schemes with shared memory persistence,
and obtain excellent acceleration. Compared with single-core execution, the
GPU-accelerated code runs over 100x faster. The CUDA code is provided along
with the package that is necessary to execute variational Monte Carlo for a
system representing liquid helium-4. The program was benchmarked on several
models of Nvidia GPU, including Fermi GTX560 and M2090, and the latest Kepler
architecture K20 GPU. Kepler-specific optimization is discussed.
Comment: Version two has improved figures and text changes in response to the
peer-review process
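For readers unfamiliar with the underlying algorithm, a minimal serial variational Monte Carlo loop (the building block such GPU schemes parallelize over many walkers) looks as follows. This sketch uses a textbook 1-D harmonic oscillator with trial wavefunction exp(-alpha * x^2), not the paper's helium-4 system or its CUDA code.

```python
import math
import random

def vmc_energy(alpha, steps=20000, seed=0):
    """Minimal variational Monte Carlo for a 1-D harmonic oscillator
    (hbar = m = omega = 1) with trial wavefunction exp(-alpha * x**2).
    Metropolis sampling of |psi|^2; returns the average local energy
    E_L(x) = alpha + x**2 * (0.5 - 2 * alpha**2)."""
    rng = random.Random(seed)
    x, e_sum = 0.0, 0.0
    for _ in range(steps):
        x_new = x + rng.uniform(-1.0, 1.0)          # trial move
        # Metropolis acceptance with ratio |psi(x_new)|^2 / |psi(x)|^2
        if rng.random() < math.exp(-2.0 * alpha * (x_new ** 2 - x ** 2)):
            x = x_new
        e_sum += alpha + x * x * (0.5 - 2.0 * alpha * alpha)
    return e_sum / steps
```

At alpha = 0.5 the trial function is the exact ground state, so the local energy is constant at 0.5 and the variance vanishes; other alpha values give higher average energies, as the variational principle requires. The walkers in this loop are independent, which is what makes the asynchronous GPU parallelization in the paper effective.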
Omnivore: An Optimizer for Multi-device Deep Learning on CPUs and GPUs
We study the factors affecting training time in multi-device deep learning
systems. Given a specification of a convolutional neural network, our goal is
to minimize the time to train this model on a cluster of commodity CPUs and
GPUs. We first focus on the single-node setting and show that by using standard
batching and data-parallel techniques, throughput can be improved by at least
5.5x over state-of-the-art systems on CPUs. This ensures an end-to-end training
speed directly proportional to the throughput of a device regardless of its
underlying hardware, allowing each node in the cluster to be treated as a black
box. Our second contribution is a theoretical and empirical study of the
tradeoffs affecting end-to-end training time in a multiple-device setting. We
identify the degree of asynchronous parallelization as a key factor affecting
both hardware and statistical efficiency. We see that asynchrony can be viewed
as introducing a momentum term. Our results imply that tuning momentum is
critical in asynchronous parallel configurations, and suggest that published
results that have not been fully tuned might report suboptimal performance for
some configurations. For our third contribution, we use our novel understanding
of the interaction between system and optimization dynamics to provide an
efficient hyperparameter optimizer. Our optimizer involves a predictive model
for the total time to convergence and selects an allocation of resources to
minimize that time. We demonstrate that the most popular distributed deep
learning systems fall within our tradeoff space, but do not optimize within the
space. By doing this optimization, our prototype runs 1.9x to 12x faster than
the fastest state-of-the-art systems.
Papaya: Practical, Private, and Scalable Federated Learning
Cross-device Federated Learning (FL) is a distributed learning paradigm with
several challenges that differentiate it from traditional distributed learning;
variability in the system characteristics of each device and millions of
clients coordinating with a central server are primary among them. Most FL systems
described in the literature are synchronous - they perform a synchronized
aggregation of model updates from individual clients. Scaling synchronous FL is
challenging since increasing the number of clients training in parallel leads
to diminishing returns in training speed, analogous to large-batch training.
Moreover, stragglers hinder synchronous FL training. In this work, we outline a
production asynchronous FL system design. Our work tackles the aforementioned
issues, sketches some of the system design challenges and their solutions,
and touches upon principles that emerged from building a production FL system
for millions of clients. Empirically, we demonstrate that asynchronous FL
converges faster than synchronous FL when training across nearly one hundred
million devices. In particular, in high concurrency settings, asynchronous FL
is 5x faster and has nearly 8x less communication overhead than synchronous FL.
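One common recipe for the asynchronous aggregation described above is to apply each client's update as it arrives, damped by how stale its base model has become. The sketch below is a generic staleness-damped scheme on a toy scalar model; the damping rule, delays, and objective are illustrative assumptions, not Papaya's actual aggregation.

```python
import heapq
import random

def async_fed_avg(targets, total_events=300, client_lr=0.5, seed=0):
    """Toy asynchronous FL loop. Each client pulls the current server
    model, takes one local step on 0.5 * (w - t_i)^2 toward its own
    target t_i, and returns the delta after a random delay. The server
    applies each delta immediately, damped by 1 / (1 + staleness),
    where staleness counts server updates since the client's pull."""
    rng = random.Random(seed)
    w = 0.0                       # server model
    version = 0                   # server update counter
    events = []                   # (finish_time, client_id, pulled_w, pulled_version)
    for cid in range(len(targets)):
        heapq.heappush(events, (rng.random(), cid, w, version))
    for _ in range(total_events):
        now, cid, w0, v0 = heapq.heappop(events)   # next client to finish
        delta = client_lr * (targets[cid] - w0)    # result of local training
        w += delta / (1.0 + (version - v0))        # staleness-damped apply
        version += 1
        heapq.heappush(events, (now + rng.random(), cid, w, version))
    return w
```

Because updates are applied as they arrive, fast clients never wait for stragglers; the staleness damping keeps updates computed on old model versions from dominating, and the server model settles near the average of the client optima.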
Handover Control in Wireless Systems via Asynchronous Multi-User Deep Reinforcement Learning
In this paper, we propose a two-layer framework to learn the optimal handover
(HO) controllers in possibly large-scale wireless systems supporting mobile
Internet-of-Things (IoT) users or traditional cellular users, where the user
mobility patterns could be heterogeneous. In particular, our proposed framework
first partitions the user equipments (UEs) with different mobility patterns
into clusters, where the mobility patterns are similar in the same cluster.
Then, within each cluster, an asynchronous multi-user deep reinforcement
learning scheme is developed to control the HO processes across the UEs in each
cluster, with the goal of lowering the HO rate while ensuring a certain system
throughput. In this scheme, we use a deep neural network (DNN) as an HO
controller learned by each UE via reinforcement learning in a collaborative
fashion. Moreover, we use supervised learning in initializing the DNN
controller before the execution of reinforcement learning to exploit what we
already know with traditional HO schemes and to mitigate the negative effects
of random exploration at the initial stage. Furthermore, we show that the
adopted global-parameter-based asynchronous framework enables us to train
faster with more UEs, which could nicely address the scalability issue to
support large systems. Finally, simulation results demonstrate that the
proposed framework can achieve better performance than state-of-the-art online
schemes in terms of HO rates.
Comment: 12 pages, 10 figures and 1 table
A2BCD: An Asynchronous Accelerated Block Coordinate Descent Algorithm With Optimal Complexity
In this paper, we propose the Asynchronous Accelerated Nonuniform Randomized
Block Coordinate Descent algorithm (A2BCD), the first asynchronous
Nesterov-accelerated algorithm that achieves optimal complexity. This parallel
algorithm solves the unconstrained convex minimization problem, using p
computing nodes which compute updates to shared solution vectors, in an
asynchronous fashion with no central coordination. Nodes in asynchronous
algorithms do not wait for updates from other nodes before starting a new
iteration, but simply compute updates using the most recent solution
information available. This allows them to complete iterations much faster than
traditional ones, especially at scale, by eliminating the costly
synchronization penalty of traditional algorithms.
We first prove that A2BCD converges linearly to a solution with a fast
accelerated rate that matches the recently proposed NU_ACDM, so long as the
maximum delay is not too large. Somewhat surprisingly, A2BCD pays no complexity
penalty for using outdated information. We then prove lower complexity bounds
for randomized coordinate descent methods, which show that A2BCD (and hence
NU_ACDM) has optimal complexity to within a constant factor. We confirm with
numerical experiments that A2BCD outperforms NU_ACDM, which is the current
fastest coordinate descent algorithm, even at small scale. We also derive and
analyze a second-order ordinary differential equation, which is the
continuous-time limit of our algorithm, and prove it converges linearly to a
solution with a similar accelerated rate.
Comment: 33 pages, 6 figures
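The asynchronous coordinate descent setting can be illustrated without the Nesterov acceleration machinery: workers repeatedly minimize one coordinate using a possibly stale snapshot of the shared vector. The quadratic problem and delay model below are illustrative assumptions, and the acceleration that gives A2BCD its optimal complexity is deliberately omitted to keep the sketch small.

```python
import random

def async_rcd(A, b, iters=4000, max_delay=3, seed=0):
    """Toy asynchronous randomized coordinate descent on
    f(x) = 0.5 * x^T A x - b^T x, whose minimizer solves A x = b.
    Each update picks a random coordinate and takes an exact
    coordinate-minimization step, but computed from a possibly stale
    snapshot of the shared vector, as with uncoordinated
    shared-memory workers."""
    rng = random.Random(seed)
    n = len(b)
    x = [0.0] * n
    history = [list(x)]                 # snapshots for stale reads
    for _ in range(iters):
        i = rng.randrange(n)
        delay = rng.randrange(min(max_delay, len(history))) + 1
        snap = history[-delay]          # stale view of x
        grad_i = sum(A[i][k] * snap[k] for k in range(n)) - b[i]
        x[i] -= grad_i / A[i][i]        # coordinate step (stale gradient)
        history.append(list(x))
    return x
```

On a small well-conditioned system the stale reads perturb individual steps but not the fixed point, echoing the abstract's observation that outdated information need not cost anything asymptotically.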
Revisiting Distributed Synchronous SGD
Distributed training of deep learning models on large-scale training data is
typically conducted with asynchronous stochastic optimization to maximize the
rate of updates, at the cost of additional noise introduced from asynchrony. In
contrast, the synchronous approach is often thought to be impractical due to
idle time wasted on waiting for straggling workers. We revisit these
conventional beliefs in this paper, and examine the weaknesses of both
approaches. We demonstrate that a third approach, synchronous optimization with
backup workers, can avoid asynchronous noise while mitigating the worst
stragglers. Our approach is empirically validated and shown to converge faster
and to better test accuracies.
Comment: 10 pages
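The backup-workers idea is easy to quantify with order statistics: launch N gradient tasks per step but aggregate only the first N - b to finish, discarding the b slowest. The simulation below uses assumed exponential task durations as a stand-in for straggler behavior; it is not the paper's measured workload.

```python
import random

def round_times(num_workers=10, backups=2, rounds=1000, seed=0):
    """Simulated per-round wait time: plain synchronous SGD waits for all
    N workers, while the backup-worker variant aggregates the first
    N - b gradients and drops the b slowest."""
    rng = random.Random(seed)
    wait_all, wait_backup = 0.0, 0.0
    for _ in range(rounds):
        times = sorted(rng.expovariate(1.0) for _ in range(num_workers))
        wait_all += times[-1]                            # slowest of N
        wait_backup += times[num_workers - backups - 1]  # (N - b)-th finisher
    return wait_all / rounds, wait_backup / rounds
```

For exponential durations the expected maximum of N tasks is the harmonic number H_N, while waiting only for the (N - b)-th finisher removes the largest 1/1, ..., 1/b terms of that sum, so dropping even two stragglers out of ten roughly halves the expected wait in this toy model.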