72,209 research outputs found

    Evaluating the Efficiency of Asynchronous Systems with FASE

    In this paper, we present FASE (Faster Asynchronous Systems Evaluation), a tool for evaluating the worst-case efficiency of asynchronous systems. The tool is based on well-established results in the setting of a timed process algebra (PAFAS: a Process Algebra for Faster Asynchronous Systems). To show the applicability of FASE to concrete, meaningful examples, we consider three implementations of a bounded buffer and use FASE to automatically evaluate their worst-case efficiency. We finally contrast our results with previous ones in which the efficiency of the same implementations has already been considered.
    Comment: 14 pages, 5 figures. A preliminary version has been presented as an extended abstract in Pre-Proc. of the 1st Int. Workshop on Quantitative Formal Methods, pp. 101-106, Technische Universiteit Eindhoven, 200

    A Framework for the Evaluation of Worst-Case System Efficiency

    In this paper we present FASE (Fast Asynchronous Systems Evaluation), a tool for evaluating the worst-case efficiency of asynchronous systems. The tool implements well-established results in the setting of a timed CCS-like process algebra, PAFAS (a Process Algebra for Faster Asynchronous Systems). Moreover, we discuss some new solutions that improve the applicability of FASE to concrete, meaningful examples. We finally use FASE to evaluate the efficiency of three different implementations of a bounded buffer and compare our results with previous ones obtained when the same implementations were contrasted according to an efficiency preorder.
    Comment: 5 pages. In ICTCS 2010: 12th Italian Conference on Theoretical Computer Science, University of Camerino, Camerino, 201

    Asynchronous ADMM for Distributed Non-Convex Optimization in Power Systems

    Large-scale, non-convex optimization problems arising in complex networks such as power systems call for efficient and scalable distributed optimization algorithms. Existing distributed methods are usually iterative and require synchronization of all workers at each iteration, which is hard to scale and can leave computation resources under-utilized because the subproblems are heterogeneous. To address these limitations of synchronous schemes, this paper proposes an asynchronous distributed optimization method based on the Alternating Direction Method of Multipliers (ADMM) for non-convex optimization. The proposed method only requires local communications and allows each worker to perform local updates with information from a subset of, but not all, neighbors. We provide sufficient conditions on the problem formulation, the choice of algorithm parameters, and the network delay, and show that under these mild conditions the proposed asynchronous ADMM method asymptotically converges to a KKT point of the non-convex problem. We validate the effectiveness of asynchronous ADMM by applying it to the Optimal Power Flow problem in multiple power systems and show that the convergence of the proposed asynchronous scheme can be faster than its synchronous counterpart in large-scale applications.
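    The worker update pattern described above can be illustrated with a minimal sketch. The consensus problem, the closed-form local solve, and the locking scheme below are illustrative assumptions, not the paper's formulation or implementation; the sketch only shows how each worker updates its local primal and dual variables from a possibly stale view of the shared variable without waiting for the others.

```python
import numpy as np
import threading

# Toy consensus problem: minimize sum_i (x - a_i)^2 over a shared scalar x.
# Each worker i keeps a local copy x_i and a dual variable y_i; the shared
# variable z is refreshed from whatever local updates have arrived, without
# a barrier across workers (the asynchronous pattern described above).

rho = 1.0
a = np.array([1.0, 2.0, 4.0, 7.0])            # local data per worker
x = a.copy()                                   # local primal variables
y = np.zeros_like(a)                           # local dual variables
z = float(np.mean(a))                          # shared consensus variable
lock = threading.Lock()

def worker(i, iters=50):
    global z
    for _ in range(iters):
        with lock:
            z_snapshot = z                     # possibly stale view of z
        # x-update: argmin_x (x - a_i)^2 + y_i (x - z) + (rho/2)(x - z)^2
        x[i] = (2 * a[i] - y[i] + rho * z_snapshot) / (2 + rho)
        y[i] += rho * (x[i] - z_snapshot)      # dual ascent step
        with lock:
            z = float(np.mean(x + y / rho))    # z-update from the latest local info

threads = [threading.Thread(target=worker, args=(i,)) for i in range(len(a))]
for t in threads: t.start()
for t in threads: t.join()
print("consensus estimate:", z, "optimum:", np.mean(a))
```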

    Asynchronous Decentralized Parallel Stochastic Gradient Descent

    Most commonly used distributed machine learning systems are either synchronous or centralized asynchronous. Synchronous algorithms like AllReduce-SGD perform poorly in a heterogeneous environment, while asynchronous algorithms using a parameter server suffer from 1) a communication bottleneck at the parameter servers when workers are many, and 2) significantly worse convergence when the traffic to the parameter server is congested. Can we design an algorithm that is robust in a heterogeneous environment, while being communication efficient and maintaining the best-possible convergence rate? In this paper, we propose an asynchronous decentralized stochastic gradient descent algorithm (AD-PSGD) satisfying all of the above expectations. Our theoretical analysis shows AD-PSGD converges at the optimal O(1/\sqrt{K}) rate, as SGD does, and has linear speedup w.r.t. the number of workers. Empirically, AD-PSGD outperforms the best of decentralized parallel SGD (D-PSGD), asynchronous parallel SGD (A-PSGD), and standard data parallel SGD (AllReduce-SGD), often by orders of magnitude in a heterogeneous environment. When training ResNet-50 on ImageNet with up to 128 GPUs, AD-PSGD converges (w.r.t. epochs) similarly to AllReduce-SGD, but each epoch can be up to 4-8X faster than its synchronous counterparts in a network-sharing HPC environment. To the best of our knowledge, AD-PSGD is the first asynchronous algorithm that achieves a similar epoch-wise convergence rate as AllReduce-SGD at an over 100-GPU scale.
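    A rough sketch of the decentralized, gossip-style update that AD-PSGD-like methods use: a worker wakes up, averages its model with a randomly chosen neighbor, and applies a local stochastic gradient, with no global barrier. The ring topology, toy quadratic objective, and serialized loop below are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_workers, dim, steps, lr = 4, 5, 200, 0.1
target = rng.normal(size=dim)                     # toy objective: ||w - target||^2 / 2
models = [rng.normal(size=dim) for _ in range(n_workers)]
neighbors = {i: [(i - 1) % n_workers, (i + 1) % n_workers]
             for i in range(n_workers)}           # ring topology (an assumption)

def stochastic_grad(w):
    return (w - target) + 0.01 * rng.normal(size=w.shape)   # noisy gradient

for _ in range(steps):
    i = int(rng.integers(n_workers))              # worker that wakes up (asynchrony proxy)
    j = int(rng.choice(neighbors[i]))             # random neighbor to gossip with
    avg = 0.5 * (models[i] + models[j])           # pairwise model averaging
    models[i] = avg - lr * stochastic_grad(models[i])
    models[j] = avg                               # neighbor adopts the average

print("mean distance to optimum:",
      np.mean([np.linalg.norm(w - target) for w in models]))
```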

    Fast quantum Monte Carlo on a GPU

    We present a scheme for the parallelization of quantum Monte Carlo on graphical processing units, focusing on bosonic systems and variational Monte Carlo. We use asynchronous execution schemes with shared memory persistence and obtain an excellent acceleration: compared with single-core execution, the GPU-accelerated code runs over 100x faster. The CUDA code is provided along with the package necessary to execute variational Monte Carlo for a system representing liquid helium-4. The program was benchmarked on several models of Nvidia GPU, including the Fermi GTX560 and M2090, and the latest Kepler-architecture K20 GPU. Kepler-specific optimization is discussed.
    Comment: Version two has improved figures and text changes in response to the peer-review process
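    The variational Monte Carlo being parallelized amounts to running many independent Metropolis walkers; below is a toy NumPy sketch of that sampling loop for a 1D harmonic oscillator with a Gaussian trial wave function. The paper's system is liquid helium-4 and its implementation is CUDA, so everything here is an illustrative stand-in; the vectorization over walkers only mirrors the data parallelism a GPU exploits.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.45                         # variational parameter of trial psi(x) = exp(-alpha x^2)
n_walkers, n_steps, step = 4096, 2000, 1.0
x = rng.normal(size=n_walkers)       # one coordinate per independent walker

def log_psi_sq(x):                   # log |psi|^2 for the Gaussian trial function
    return -2.0 * alpha * x**2

def local_energy(x):                 # E_L = -psi''/(2 psi) + x^2/2 for the 1D oscillator
    return alpha + x**2 * (0.5 - 2.0 * alpha**2)

energies = []
for _ in range(n_steps):
    prop = x + step * rng.uniform(-1, 1, size=n_walkers)    # propose moves for all walkers at once
    accept = np.log(rng.uniform(size=n_walkers)) < log_psi_sq(prop) - log_psi_sq(x)
    x = np.where(accept, prop, x)                            # Metropolis accept/reject
    energies.append(local_energy(x).mean())

print("variational energy:", np.mean(energies[500:]), "(exact ground state: 0.5)")
```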

    Omnivore: An Optimizer for Multi-device Deep Learning on CPUs and GPUs

    We study the factors affecting training time in multi-device deep learning systems. Given a specification of a convolutional neural network, our goal is to minimize the time to train this model on a cluster of commodity CPUs and GPUs. We first focus on the single-node setting and show that by using standard batching and data-parallel techniques, throughput can be improved by at least 5.5x over state-of-the-art systems on CPUs. This ensures an end-to-end training speed directly proportional to the throughput of a device regardless of its underlying hardware, allowing each node in the cluster to be treated as a black box. Our second contribution is a theoretical and empirical study of the tradeoffs affecting end-to-end training time in a multi-device setting. We identify the degree of asynchronous parallelization as a key factor affecting both hardware and statistical efficiency. We see that asynchrony can be viewed as introducing a momentum term. Our results imply that tuning momentum is critical in asynchronous parallel configurations, and suggest that published results that have not been fully tuned might report suboptimal performance for some configurations. For our third contribution, we use our novel understanding of the interaction between system and optimization dynamics to provide an efficient hyperparameter optimizer. Our optimizer involves a predictive model for the total time to convergence and selects an allocation of resources to minimize that time. We demonstrate that the most popular distributed deep learning systems fall within our tradeoff space, but do not optimize within the space. By doing this optimization, our prototype runs 1.9x to 12x faster than the fastest state-of-the-art systems.
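    The observation that asynchrony acts like momentum can be illustrated with a toy, purely illustrative comparison (not taken from the paper): on a one-dimensional quadratic, heavy-ball momentum and gradient descent applied to stale parameters both produce updates that depend on past iterates rather than only the current one.

```python
# Hypothetical illustration, not the paper's optimizer: on f(w) = w^2 / 2,
# compare (a) SGD with heavy-ball momentum and (b) gradient descent where each
# applied gradient was computed on parameters that are `delay` steps stale.
lr, mu, delay, steps = 0.1, 0.5, 3, 40

# (a) heavy-ball momentum
w, v, traj_momentum = 5.0, 0.0, []
for _ in range(steps):
    v = mu * v - lr * w              # gradient of w^2/2 is w
    w += v
    traj_momentum.append(w)

# (b) gradient descent with stale gradients
hist = [5.0] * (delay + 1)           # parameter history, newest last
traj_async = []
for _ in range(steps):
    stale_w = hist[-1 - delay]       # gradient computed on stale parameters
    hist.append(hist[-1] - lr * stale_w)
    traj_async.append(hist[-1])

print("momentum tail:", [round(x, 3) for x in traj_momentum[-3:]])
print("stale    tail:", [round(x, 3) for x in traj_async[-3:]])
```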

    Papaya: Practical, Private, and Scalable Federated Learning

    Cross-device Federated Learning (FL) is a distributed learning paradigm with several challenges that differentiate it from traditional distributed learning; variability in the system characteristics of each device and millions of clients coordinating with a central server are the primary ones. Most FL systems described in the literature are synchronous: they perform a synchronized aggregation of model updates from individual clients. Scaling synchronous FL is challenging, since increasing the number of clients training in parallel leads to diminishing returns in training speed, analogous to large-batch training. Moreover, stragglers hinder synchronous FL training. In this work, we outline a production asynchronous FL system design. Our work tackles the aforementioned issues, sketches some of the system design challenges and their solutions, and touches upon principles that emerged from building a production FL system for millions of clients. Empirically, we demonstrate that asynchronous FL converges faster than synchronous FL when training across nearly one hundred million devices. In particular, in high-concurrency settings, asynchronous FL is 5x faster and has nearly 8x less communication overhead than synchronous FL.
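    A minimal sketch of the general asynchronous aggregation pattern the abstract refers to: the server applies each client delta as it arrives and down-weights stale deltas. The toy objective, the polynomial staleness discount, and the single-threaded simulation are assumptions, not Papaya's actual design.

```python
import numpy as np

rng = np.random.default_rng(2)
dim, n_clients, rounds, server_lr = 10, 20, 200, 0.5
global_model = np.zeros(dim)
client_data = [rng.normal(loc=1.0, size=(5, dim)) for _ in range(n_clients)]

def client_update(model, data, lr=0.1, local_steps=5):
    # Toy local objective: squared distance to the client's data mean.
    w = model.copy()
    for _ in range(local_steps):
        w -= lr * (w - data.mean(axis=0))
    return w - model                               # delta sent back to the server

model_history = [global_model.copy()]              # lets us simulate clients holding stale models
for _ in range(rounds):
    c = int(rng.integers(n_clients))
    staleness = int(rng.integers(0, min(5, len(model_history))))
    base = model_history[-1 - staleness]           # the (possibly stale) model this client pulled
    delta = client_update(base, client_data[c])
    weight = 1.0 / (1.0 + staleness)               # polynomial staleness discount (an assumption)
    global_model = global_model + server_lr * weight * delta
    model_history.append(global_model.copy())

print("distance to the population mean:", np.linalg.norm(global_model - 1.0))
```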

    Handover Control in Wireless Systems via Asynchronous Multi-User Deep Reinforcement Learning

    In this paper, we propose a two-layer framework to learn the optimal handover (HO) controllers in possibly large-scale wireless systems supporting mobile Internet-of-Things (IoT) users or traditional cellular users, where the user mobility patterns can be heterogeneous. In particular, our proposed framework first partitions the user equipments (UEs) with different mobility patterns into clusters, so that the mobility patterns within a cluster are similar. Then, within each cluster, an asynchronous multi-user deep reinforcement learning scheme is developed to control the HO processes across the UEs in that cluster, with the goal of lowering the HO rate while ensuring a certain system throughput. In this scheme, we use a deep neural network (DNN) as an HO controller learned by each UE via reinforcement learning in a collaborative fashion. Moreover, we use supervised learning to initialize the DNN controller before the execution of reinforcement learning, both to exploit what we already know from traditional HO schemes and to mitigate the negative effects of random exploration at the initial stage. Furthermore, we show that the adopted global-parameter-based asynchronous framework enables us to train faster with more UEs, which nicely addresses the scalability issue of supporting large systems. Finally, simulation results demonstrate that the proposed framework can achieve better performance than state-of-the-art online schemes in terms of HO rates.
    Comment: 12 pages, 10 figures and 1 table
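    A minimal sketch of the global-parameter-based asynchronous training pattern (in the spirit of A3C-style updates): each UE pulls the latest shared controller parameters, computes a policy-gradient update on its own experience, and applies it without waiting for the other UEs. The linear softmax controller, the toy reward, and the environment below are hypothetical placeholders, not the paper's HO model or DNN.

```python
import numpy as np
import threading

n_ues, dim, n_actions, episodes = 4, 6, 2, 300
global_theta = np.zeros((dim, n_actions))          # shared controller parameters
lock = threading.Lock()

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ue_worker(ue_id):
    local_rng = np.random.default_rng(ue_id)
    for _ in range(episodes):
        state = local_rng.normal(size=dim)         # toy observation (an assumption)
        with lock:
            theta = global_theta.copy()            # pull the latest global parameters
        probs = softmax(state @ theta)
        action = int(local_rng.choice(n_actions, p=probs))
        reward = 1.0 if action == int(state[0] > 0) else 0.0   # hypothetical reward
        grad = np.outer(state, -probs)             # REINFORCE gradient of log pi(a|s)
        grad[:, action] += state
        with lock:                                 # asynchronous apply, no barrier across UEs
            global_theta += 0.05 * reward * grad

threads = [threading.Thread(target=ue_worker, args=(i,)) for i in range(n_ues)]
for t in threads: t.start()
for t in threads: t.join()
print("learned parameter norm:", np.linalg.norm(global_theta))
```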

    A2BCD: An Asynchronous Accelerated Block Coordinate Descent Algorithm With Optimal Complexity

    In this paper, we propose the Asynchronous Accelerated Nonuniform Randomized Block Coordinate Descent algorithm (A2BCD), the first asynchronous Nesterov-accelerated algorithm that achieves optimal complexity. This parallel algorithm solves the unconstrained convex minimization problem using p computing nodes which compute updates to shared solution vectors in an asynchronous fashion with no central coordination. Nodes in asynchronous algorithms do not wait for updates from other nodes before starting a new iteration, but simply compute updates using the most recent solution information available. This allows them to complete iterations much faster than traditional algorithms, especially at scale, by eliminating the costly synchronization penalty. We first prove that A2BCD converges linearly to a solution with a fast accelerated rate that matches the recently proposed NU_ACDM, so long as the maximum delay is not too large. Somewhat surprisingly, A2BCD pays no complexity penalty for using outdated information. We then prove lower complexity bounds for randomized coordinate descent methods, which show that A2BCD (and hence NU_ACDM) has optimal complexity to within a constant factor. We confirm with numerical experiments that A2BCD outperforms NU_ACDM, the current fastest coordinate descent algorithm, even at small scale. We also derive and analyze a second-order ordinary differential equation, which is the continuous-time limit of our algorithm, and prove that it converges linearly to a solution with a similar accelerated rate.
    Comment: 33 pages, 6 figures
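    A sketch of the basic asynchronous block coordinate descent pattern that A2BCD builds on, without the Nesterov acceleration: each node repeatedly picks a coordinate, computes its partial gradient from a possibly stale shared vector, and writes the update back without synchronization. The least-squares objective and the conservative step size are assumptions; note also that CPython threads serialize under the GIL, so this only illustrates the access pattern, not real parallel speedup.

```python
import numpy as np
import threading

rng = np.random.default_rng(4)
n, p_nodes = 40, 4
A = rng.normal(size=(60, n))
b = rng.normal(size=60)
Q, c = A.T @ A, A.T @ b                  # minimize 0.5 x^T Q x - c^T x  (least squares)
L = np.linalg.norm(Q, axis=0)            # crude per-coordinate step-size bound (an assumption)
x = np.zeros(n)                          # shared solution vector, updated without locks

def node(iters=2000):
    local_rng = np.random.default_rng(threading.get_ident() % 2**32)
    for _ in range(iters):
        j = int(local_rng.integers(n))   # pick a random coordinate (block of size 1)
        grad_j = Q[j] @ x - c[j]         # partial gradient read from possibly stale x
        x[j] -= grad_j / L[j]            # coordinate step, written back asynchronously

threads = [threading.Thread(target=node) for _ in range(p_nodes)]
for t in threads: t.start()
for t in threads: t.join()
print("residual:", np.linalg.norm(A @ x - b),
      "vs least squares:", np.linalg.norm(A @ np.linalg.lstsq(A, b, rcond=None)[0] - b))
```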

    Revisiting Distributed Synchronous SGD

    Distributed training of deep learning models on large-scale training data is typically conducted with asynchronous stochastic optimization to maximize the rate of updates, at the cost of additional noise introduced by asynchrony. In contrast, the synchronous approach is often thought to be impractical due to idle time wasted waiting for straggling workers. We revisit these conventional beliefs in this paper and examine the weaknesses of both approaches. We demonstrate that a third approach, synchronous optimization with backup workers, can avoid asynchronous noise while mitigating the worst stragglers. Our approach is empirically validated and shown to converge faster and to better test accuracies.
    Comment: 10 pages
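    The backup-worker idea can be sketched directly: with N + b workers per step, the server aggregates the first N gradients to finish and drops the stragglers' results. The simulated runtimes and toy objective below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
dim, n_needed, n_backup, steps, lr = 8, 8, 2, 100, 0.2
w = rng.normal(size=dim)
target = np.ones(dim)                              # optimum of the toy objective ||w - target||^2 / 2

def simulated_worker(w):
    compute_time = rng.exponential(1.0)            # simulated (possibly straggling) runtime
    grad = (w - target) + 0.05 * rng.normal(size=dim)
    return compute_time, grad

for _ in range(steps):
    results = [simulated_worker(w) for _ in range(n_needed + n_backup)]
    results.sort(key=lambda r: r[0])               # order workers by finish time
    fastest = [g for _, g in results[:n_needed]]   # keep the first N gradients to arrive
    w = w - lr * np.mean(fastest, axis=0)          # stragglers' gradients are simply dropped

print("distance to optimum:", np.linalg.norm(w - target))
```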