
    Breaking (Global) Barriers in Parallel Stochastic Optimization with Wait-Avoiding Group Averaging

    Deep learning at scale is dominated by communication time. Distributing samples across nodes usually yields the best performance, but poses scaling challenges due to global information dissemination and load imbalance across uneven sample lengths. State-of-the-art decentralized optimizers mitigate the problem, but require more iterations to achieve the same accuracy as their globally-communicating counterparts. We present Wait-Avoiding Group Model Averaging (WAGMA) SGD, a wait-avoiding stochastic optimizer that reduces global communication via subgroup weight exchange. The key insight is a combination of algorithmic changes to the averaging scheme and the use of a group allreduce operation. We prove the convergence of WAGMA-SGD, and empirically show that it retains convergence rates similar to Allreduce-SGD. For evaluation, we train ResNet-50 on ImageNet; Transformer for machine translation; and deep reinforcement learning for navigation at scale. Compared with state-of-the-art decentralized SGD variants, WAGMA-SGD significantly improves training throughput (e.g., 2.1x on 1,024 GPUs for reinforcement learning), and achieves the fastest time-to-solution (e.g., the highest score using the shortest training time for Transformer). Comment: Published in IEEE Transactions on Parallel and Distributed Systems (IEEE TPDS), vol. 32, no. 7, pp. 1725-1739, 1 July 2021
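The subgroup weight exchange described in the abstract can be illustrated with a minimal sequential simulation: workers take independent SGD steps, then average their weights only within fixed subgroups rather than globally. The toy quadratic loss, group size, and all hyperparameters below are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

def sgd_step(w, rng, lr=0.1):
    """One noisy gradient step on the toy loss f(w) = 0.5 * ||w||^2."""
    grad = w + rng.normal(scale=0.01, size=w.shape)
    return w - lr * grad

def group_average(weights, group_size):
    """Average weights within fixed subgroups (a 'group allreduce')
    instead of a global allreduce over all workers."""
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        mean = np.mean(group, axis=0)
        out.extend([mean.copy() for _ in group])
    return out

rng = np.random.default_rng(0)
workers = [rng.normal(size=4) for _ in range(8)]
for step in range(50):
    workers = [sgd_step(w, rng) for w in workers]
    workers = group_average(workers, group_size=4)  # subgroup exchange only

# All replicas shrink toward the optimum w* = 0 of the toy loss.
assert all(np.linalg.norm(w) < 1.0 for w in workers)
```

In the real algorithm the group exchange is additionally wait-avoiding (workers do not block on stragglers); this sketch only shows the communication pattern, not the asynchrony.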

    SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient

    Many deep learning applications benefit from using large models with billions of parameters. Training these models is notoriously expensive due to the need for specialized HPC clusters. In this work, we consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions. We analyze the performance of existing model-parallel algorithms in these conditions and find configurations where training larger models becomes less communication-intensive. Based on these findings, we propose SWARM parallelism, a model-parallel training algorithm designed for poorly connected, heterogeneous and unreliable devices. SWARM creates temporary randomized pipelines between nodes that are rebalanced in case of failure. We empirically validate our findings and compare SWARM parallelism with existing large-scale training approaches. Finally, we combine our insights with compression strategies to train a large Transformer language model with 1B shared parameters (approximately 13B before sharing) on preemptible T4 GPUs with less than 200Mb/s network. Comment: Accepted to International Conference on Machine Learning (ICML) 2023. 25 pages, 8 figures
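The "temporary randomized pipelines" idea can be sketched as routing: each pipeline stage is served by a pool of interchangeable worker replicas, and every microbatch picks a random live worker per stage, so a failed or preempted node is simply skipped on the next routing decision. The stage count, worker names, and failure set below are assumptions for illustration only.

```python
import random

class Stage:
    def __init__(self, name, workers):
        self.name = name
        self.workers = set(workers)   # interchangeable replicas for this stage
        self.failed = set()

    def route(self, rng):
        """Pick a random live worker; a fresh pipeline forms per microbatch."""
        live = sorted(self.workers - self.failed)
        if not live:
            raise RuntimeError(f"stage {self.name} has no live workers")
        return rng.choice(live)

rng = random.Random(0)
stages = [Stage(f"s{i}", [f"s{i}w{j}" for j in range(3)]) for i in range(4)]
stages[1].failed.add("s1w0")          # simulate a preempted instance

# Build a fresh randomized pipeline for each of 5 microbatches.
pipelines = [[st.route(rng) for st in stages] for _ in range(5)]
assert all("s1w0" not in p for p in pipelines)  # failed node is avoided
```

The actual system also rebalances workers across stages by throughput; this sketch only shows why random per-batch routing tolerates node loss without a global restart.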

    Improved asynchronous parallel optimization analysis for stochastic incremental methods

    As datasets continue to increase in size and multi-core computer architectures are developed, asynchronous parallel optimization algorithms become more and more essential to the field of Machine Learning. Unfortunately, conducting the theoretical analysis of asynchronous methods is difficult, notably due to the introduction of delay and inconsistency in inherently sequential algorithms. Handling these issues often requires resorting to simplifying but unrealistic assumptions. Through a novel perspective, we revisit and clarify a subtle but important technical issue present in a large fraction of the recent convergence rate proofs for asynchronous parallel optimization algorithms, and propose a simplification of the recently introduced "perturbed iterate" framework that resolves it. We demonstrate the usefulness of our new framework by analyzing three distinct asynchronous parallel incremental optimization algorithms: Hogwild (asynchronous SGD), KROMAGNON (asynchronous SVRG) and ASAGA, a novel asynchronous parallel version of the incremental gradient algorithm SAGA that enjoys fast linear convergence rates. We are able both to remove problematic assumptions and to obtain better theoretical results. Notably, we prove that ASAGA and KROMAGNON can obtain a theoretical linear speedup on multi-core systems even without sparsity assumptions. We present results of an implementation on a 40-core architecture illustrating the practical speedups as well as the hardware overhead. Finally, we investigate the overlap constant, an ill-understood but central quantity in the theoretical analysis of asynchronous parallel algorithms. We find that it encompasses much more complexity than suggested in previous work, and is often orders of magnitude bigger than traditionally thought. Comment: 67 pages, published in JMLR, can be found online at http://jmlr.org/papers/v19/17-650.html. arXiv admin note: substantial text overlap with arXiv:1606.0480
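The delay and inconsistency that make this analysis hard come from lock-free shared-memory updates, as in Hogwild: several threads read and write one parameter vector without synchronisation, so each gradient may be computed from slightly stale values. A minimal sketch on a toy quadratic loss (the loss, step counts, and noise scale are assumptions, not the paper's experiments):

```python
import threading
import numpy as np

w = np.full(4, 10.0)   # shared parameters, updated without locks

def worker(steps=200, lr=0.05):
    rng = np.random.default_rng()
    for _ in range(steps):
        # Gradient of f(w) = 0.5 * ||w||^2 plus noise, read from a
        # possibly stale copy of the shared vector.
        grad = w + rng.normal(scale=0.01, size=w.shape)
        w[:] -= lr * grad          # unsynchronised in-place write

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Despite races and staleness, the iterates still converge toward w* = 0.
assert np.linalg.norm(w) < 1.0
```

The "perturbed iterate" framework formalises exactly this gap between the value a thread reads and the value currently in memory; the sketch only makes the race concrete.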

    The Quantum Adiabatic Algorithm applied to random optimization problems: the quantum spin glass perspective

    Among the various algorithms designed to exploit the specific properties of quantum computers with respect to classical ones, the quantum adiabatic algorithm is a versatile proposal for finding the minimal value of an arbitrary cost function (the ground state energy). Random optimization problems provide a natural testbed for comparing its efficiency with that of classical algorithms. These problems correspond to mean-field spin glasses, which have been extensively studied in the classical case. This paper reviews recent analytical works that extended these studies to incorporate the effect of quantum fluctuations, and also presents some original results in this direction. Comment: 151 pages, 21 figures

    Autonomous grid scheduling using probabilistic job runtime scheduling

    Computational Grids are evolving into a global, service-oriented architecture: a universal platform for delivering future computational services to a range of applications of varying complexity and resource requirements. The thesis focuses on developing a new scheduling model for general-purpose utility clusters based on the concept of user-requested job completion deadlines. In such a system, a user would be able to request that each job finish by a certain deadline, and possibly at a certain monetary cost. Implementing deadline scheduling depends on the ability to predict the execution time of each queued job, and on an adaptive scheduling algorithm able to use those predictions to maximise deadline adherence. The thesis proposes novel solutions to these two problems and documents their implementation in a largely autonomous and self-managing way. The starting point of the work is an extensive analysis of a representative Grid workload, revealing consistent workflow patterns, usage cycles, and correlations between the execution times of jobs and the properties commonly collected by the Grid middleware for accounting purposes. An automated approach is proposed to identify these dependencies and use them to partition the highly variable workload into subsets of more consistent and predictable behaviour. A range of time-series forecasting models, applied in this context for the first time, were used to model job execution times as a function of their historical behaviour and associated properties. Based on the resulting runtime predictions, a novel scheduling algorithm estimates the latest start time at which each job can still meet its requested deadline, and sorts the queue accordingly to minimise the amount of deadline overrun. The proposed approach was tested using an actual job trace collected from a production Grid facility.
The best-performing execution time predictor (the auto-regressive moving average method), coupled with workload partitioning based on three simultaneous job properties, returned a median absolute percentage error centroid of only 4.75%. This level of prediction accuracy enabled the proposed deadline scheduling method to reduce the average deadline overrun time ten-fold compared to the benchmark batch scheduler. Overall, the thesis demonstrates that deadline scheduling of computational jobs on the Grid is achievable using statistical forecasting of job execution times based on historical information. The proposed approach is easily implementable, substantially self-managing, and better matched to the human workflow, making it well suited for implementation in the utility Grids of the future.
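The queue-ordering step described above (latest feasible start time = deadline minus predicted runtime, sort earliest first) can be sketched in a few lines. The job names, runtimes, and deadlines are invented for illustration; the real predictor is the thesis's time-series forecaster, not a fixed number.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    predicted_runtime: float  # would come from the runtime forecaster
    deadline: float           # user-requested completion time

    @property
    def latest_start(self):
        """Latest time the job can start and still meet its deadline."""
        return self.deadline - self.predicted_runtime

def deadline_order(queue):
    """Run the job whose latest feasible start comes soonest first,
    to minimise total deadline overrun (a least-laxity ordering)."""
    return sorted(queue, key=lambda j: j.latest_start)

queue = [Job("a", 30.0, 100.0), Job("b", 90.0, 110.0), Job("c", 5.0, 40.0)]
ordered = [j.name for j in deadline_order(queue)]
assert ordered == ["b", "c", "a"]   # latest starts: 20, 35, 70
```

Note that "b" jumps the queue despite having the latest deadline, because its long predicted runtime leaves it the least slack.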

    Computational Methods in Science and Engineering : Proceedings of the Workshop SimLabs@KIT, November 29 - 30, 2010, Karlsruhe, Germany

    In this proceedings volume we provide a compilation of contributed articles, covering in equal measure applications from different research fields and ranging from capacity up to capability computing. Besides classical computing aspects such as parallelization, the focus of these proceedings is on multi-scale approaches and methods for tackling algorithm and data complexity. Practical aspects regarding the use of the HPC infrastructure and the tools and software available at the SCC are also presented.

    Progressive load balancing of asynchronous algorithms

    Massively parallel supercomputers are susceptible to variable performance due to factors such as differences in chip manufacturing, heat management and network congestion. As a result, the same code with the same input can have a different execution time from run to run. Synchronisation under these circumstances is a key challenge that prevents applications from scaling to large problems and machines. Asynchronous algorithms offer a partial solution. In these algorithms fast processes are not forced to synchronise with slower ones. Instead, they continue computing updates, and moving towards the solution, using the latest data available to them, which may have become stale (i.e. the data is a number of iterations out of date compared to the most recent version). While this allows for high computational efficiency, the convergence rate of asynchronous algorithms tends to be lower than synchronous algorithms due to the use of stale values. A large degree of performance variability can eliminate the performance advantage of asynchronous algorithms or even cause the results to diverge. To address this problem, we use the unique properties of asynchronous algorithms to develop a load balancing strategy for iterative convergent asynchronous algorithms in both shared and distributed memory. The proposed approach – Progressive Load Balancing (PLB) – aims to balance progress levels over time, rather than attempting to equalise iteration rates across parallel workers. This approach attenuates noise without sacrificing performance, resulting in a significant reduction in progress imbalance and improving time to solution. The developed method is evaluated in a variety of scenarios using the asynchronous Jacobi algorithm. In shared memory, we show that it can essentially eliminate the negative effects of a single core in a node slowed down by 19%. Work stealing, an alternative load balancing approach, is shown to be ineffective. 
In distributed memory, the method reduces the impact of up to 8 slow nodes out of 15, each slowed down by 40%, resulting in a 1.03×–1.10× reduction in time to solution and a 1.11×–2.89× reduction in runtime variability. Furthermore, we successfully apply the method in a scenario with real faulty components running 75% slower than normal. The broader applicability of progressive load balancing is established by emulating its application to asynchronous stochastic gradient descent, where it is found to improve both training time and the learned model’s accuracy. Overall, this thesis demonstrates that enhancing asynchronous algorithms with PLB is an effective method for tackling performance variability in supercomputers.
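The core PLB idea, balancing progress levels rather than iteration rates, can be shown with a toy simulation: workers advance at different speeds, and work is periodically migrated from the worker with the least progress to the one with the most. The speeds, row counts, and rebalance rule below are illustrative assumptions, not the thesis's actual policy for asynchronous Jacobi.

```python
def simulate(steps, speeds, rows, rebalance_every=10, plb=True):
    """Return the final progress imbalance (max - min iterations)."""
    progress = [0.0] * len(speeds)       # iterations completed per worker
    for t in range(1, steps + 1):
        for i, s in enumerate(speeds):
            # A worker advances in proportion to its speed over its work.
            progress[i] += s / rows[i]
        if plb and t % rebalance_every == 0:
            slow = min(range(len(speeds)), key=lambda i: progress[i])
            fast = max(range(len(speeds)), key=lambda i: progress[i])
            if rows[slow] > 1:           # migrate one row: slow -> fast
                rows[slow] -= 1
                rows[fast] += 1
    return max(progress) - min(progress)

speeds = [1.0, 1.0, 1.0, 0.6]            # one worker slowed down
gap_plb = simulate(500, speeds, [8, 8, 8, 8], plb=True)
gap_static = simulate(500, speeds, [8, 8, 8, 8], plb=False)
assert gap_plb < gap_static              # PLB narrows the progress gap
```

Equalising progress rather than speed is what keeps all parts of the solution vector comparably up to date, which is exactly the staleness that hurts asynchronous convergence.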

    Phases of Polymers and Biopolymers

    In this thesis we develop coarse-grained models aimed at understanding physical problems arising from phase transitions that occur at the single-molecule level. The thesis consists of two parts, broadly related to and motivated by the two subjects discussed above. In the first half, we focus on critical phenomena in stretching experiments, namely DNA unzipping and polymer stretching in a bad solvent. In the second part, we develop a model of thick polymers, with the goal of understanding the origin of protein folds and the physics underlying the folding ‘transition’, as well as with the hope of shedding some light on some of the fundamental questions highlighted in this Introduction. In the first part of the thesis we introduce a simple model of self-avoiding walks for DNA unzipping. In this way we can map out the phase diagram in the force vs. temperature plane. This reveals the presence of an interesting cold unzipping transition. We then go on to study the dynamics of this coarse-grained model. The main result we discuss is that the unzipping dynamics below the melting temperature obeys different scaling laws from the opening above thermal denaturation, which is governed by temperature-induced fluctuating bubbles. Motivated by this, and by recent results from other theoretical groups, we move on to study the relation of DNA unzipping to the stretching of a homopolymer below the theta point. Though a cold unzipping is also present in this phase diagram, the situation is richer from the theoretical point of view because the physics depends crucially on dimension: the underlying phase transition is second order in two dimensions and first order in three. This is shown to be intimately linked to the failure of mean field theory in this phenomenon, unlike for DNA unzipping. In particular, the globule unfolds via a series (hierarchy) of minima. In two dimensions these minima survive in the thermodynamic limit, whereas for dimension d greater than 2 there is a crossover and, for very long polymers, the intermediate minima disappear. We find it intriguing that an intermediate step in this minima hierarchy for polymers of finite length in the three-dimensional case is a regular mathematical helix, followed by a zig-zag structure. This is found to be general and almost independent of the details of the interaction potential. It suggests that a helix, one of the well-known protein secondary structures, is a natural choice for the ground state of a hydrophobic protein that has to withstand an effective pulling force.
    In the second part, we follow the inverse route and ask for a minimal model able to account for the basic aspects of folding. By this we mean a model containing a suitable potential whose ground state is a protein-like structure and which can account for the known thermodynamic properties of the folding transition. The existing potentials able to do this [32] are usually constructed ‘ad hoc’ from knowledge of the native state. We stress that our procedure here is completely different: the model we propose is built up starting from minimal assumptions. Our main result is the following. If we discard the usual view of a polymer as a sequence of hard spheres tethered together in a chain (see also Chapter 1) and substitute for it the notion of a flexible tube with a given thickness, then upon compaction our ‘thick polymer’, or ‘tube’, displays a rich secondary structure with protein-like helices and sheets, in sharp contrast with the degenerate and messy crumpled collapsed phase found with a conventional bead-and-link or bead-and-spring homopolymer model. Sheets and helices show up as the polymer gets thinner and passes from the swollen to the compact phase. In this sense the most interesting regime is a ‘twilight’ zone consisting of tubes at the edge of the compact phase, which we thus identify as ‘marginally compact structures’. Note the analogy with the result on stretching, in which the helices were likewise the ‘last compact’ structures, or the ‘first extended’ ones, as the polymer is unwound by a force. After discussing this property of the ground states, we proceed to characterise the thermodynamics of a flexible thick polymer with attraction. The resulting phase diagram is shown to have many of the properties usually required of protein effective models: for thin polymers there is a second-order collapse transition (theta collapse) followed, as the temperature is lowered, by a first-order transition to a semicrystalline phase in which the compact phase orders, forming long strands all aligned preferentially along some direction. For thicker polymers the transition to this latter phase occurs directly from the swollen phase, upon lowering T, through a first-order transition resembling the folding transition of short proteins.
