
    Breaking (Global) Barriers in Parallel Stochastic Optimization with Wait-Avoiding Group Averaging

    Deep learning at scale is dominated by communication time. Distributing samples across nodes usually yields the best performance, but poses scaling challenges due to global information dissemination and load imbalance across uneven sample lengths. State-of-the-art decentralized optimizers mitigate the problem, but require more iterations to achieve the same accuracy as their globally-communicating counterparts. We present Wait-Avoiding Group Model Averaging (WAGMA) SGD, a wait-avoiding stochastic optimizer that reduces global communication via subgroup weight exchange. The key insight is a combination of algorithmic changes to the averaging scheme and the use of a group allreduce operation. We prove the convergence of WAGMA-SGD, and empirically show that it retains convergence rates similar to Allreduce-SGD. For evaluation, we train ResNet-50 on ImageNet; Transformer for machine translation; and deep reinforcement learning for navigation at scale. Compared with state-of-the-art decentralized SGD variants, WAGMA-SGD significantly improves training throughput (e.g., 2.1x on 1,024 GPUs for reinforcement learning), and achieves the fastest time-to-solution (e.g., the highest score using the shortest training time for Transformer). Comment: Published in IEEE Transactions on Parallel and Distributed Systems (IEEE TPDS), vol. 32, no. 7, pp. 1725-1739, 1 July 2021
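The subgroup weight exchange described in the abstract can be illustrated with a minimal sequential simulation: workers take independent SGD steps, then average their weights only within fixed subgroups rather than globally. The toy quadratic loss, group size, and all hyperparameters below are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

def sgd_step(w, rng, lr=0.1):
    """One noisy gradient step on the toy loss f(w) = 0.5 * ||w||^2."""
    grad = w + rng.normal(scale=0.01, size=w.shape)
    return w - lr * grad

def group_average(weights, group_size):
    """Average weights within fixed subgroups (a 'group allreduce')
    instead of a global allreduce over all workers."""
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        mean = np.mean(group, axis=0)
        out.extend([mean.copy() for _ in group])
    return out

rng = np.random.default_rng(0)
workers = [rng.normal(size=4) for _ in range(8)]
for step in range(50):
    workers = [sgd_step(w, rng) for w in workers]
    workers = group_average(workers, group_size=4)  # subgroup exchange only

# All replicas shrink toward the optimum w* = 0 of the toy loss.
assert all(np.linalg.norm(w) < 1.0 for w in workers)
```

In the real algorithm the group exchange is additionally wait-avoiding (workers do not block on stragglers); this sketch only shows the communication pattern, not the asynchrony.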

    SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient

    Many deep learning applications benefit from using large models with billions of parameters. Training these models is notoriously expensive due to the need for specialized HPC clusters. In this work, we consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions. We analyze the performance of existing model-parallel algorithms in these conditions and find configurations where training larger models becomes less communication-intensive. Based on these findings, we propose SWARM parallelism, a model-parallel training algorithm designed for poorly connected, heterogeneous and unreliable devices. SWARM creates temporary randomized pipelines between nodes that are rebalanced in case of failure. We empirically validate our findings and compare SWARM parallelism with existing large-scale training approaches. Finally, we combine our insights with compression strategies to train a large Transformer language model with 1B shared parameters (approximately 13B before sharing) on preemptible T4 GPUs with less than 200Mb/s network. Comment: Accepted to International Conference on Machine Learning (ICML) 2023. 25 pages, 8 figures
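The "temporary randomized pipelines" idea can be sketched as routing: each pipeline stage is served by a pool of interchangeable worker replicas, and every microbatch picks a random live worker per stage, so a failed or preempted node is simply skipped on the next routing decision. The stage count, worker names, and failure set below are assumptions for illustration only.

```python
import random

class Stage:
    def __init__(self, name, workers):
        self.name = name
        self.workers = set(workers)   # interchangeable replicas for this stage
        self.failed = set()

    def route(self, rng):
        """Pick a random live worker; a fresh pipeline forms per microbatch."""
        live = sorted(self.workers - self.failed)
        if not live:
            raise RuntimeError(f"stage {self.name} has no live workers")
        return rng.choice(live)

rng = random.Random(0)
stages = [Stage(f"s{i}", [f"s{i}w{j}" for j in range(3)]) for i in range(4)]
stages[1].failed.add("s1w0")          # simulate a preempted instance

# Build a fresh randomized pipeline for each of 5 microbatches.
pipelines = [[st.route(rng) for st in stages] for _ in range(5)]
assert all("s1w0" not in p for p in pipelines)  # failed node is avoided
```

The actual system also rebalances workers across stages by throughput; this sketch only shows why random per-batch routing tolerates node loss without a global restart.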

    Improved asynchronous parallel optimization analysis for stochastic incremental methods

    As datasets continue to increase in size and multi-core computer architectures are developed, asynchronous parallel optimization algorithms become more and more essential to the field of Machine Learning. Unfortunately, conducting the theoretical analysis of asynchronous methods is difficult, notably due to the introduction of delay and inconsistency in inherently sequential algorithms. Handling these issues often requires resorting to simplifying but unrealistic assumptions. Through a novel perspective, we revisit and clarify a subtle but important technical issue present in a large fraction of the recent convergence rate proofs for asynchronous parallel optimization algorithms, and propose a simplification of the recently introduced "perturbed iterate" framework that resolves it. We demonstrate the usefulness of our new framework by analyzing three distinct asynchronous parallel incremental optimization algorithms: Hogwild (asynchronous SGD), KROMAGNON (asynchronous SVRG) and ASAGA, a novel asynchronous parallel version of the incremental gradient algorithm SAGA that enjoys fast linear convergence rates. We are able both to remove problematic assumptions and to obtain better theoretical results. Notably, we prove that ASAGA and KROMAGNON can obtain a theoretical linear speedup on multi-core systems even without sparsity assumptions. We present results of an implementation on a 40-core architecture illustrating the practical speedups as well as the hardware overhead. Finally, we investigate the overlap constant, an ill-understood but central quantity in the theoretical analysis of asynchronous parallel algorithms. We find that it encompasses much more complexity than suggested in previous work, and is often orders of magnitude bigger than traditionally thought. Comment: 67 pages, published in JMLR, can be found online at http://jmlr.org/papers/v19/17-650.html. arXiv admin note: substantial text overlap with arXiv:1606.0480
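The delay and inconsistency that make this analysis hard come from lock-free shared-memory updates, as in Hogwild: several threads read and write one parameter vector without synchronisation, so each gradient may be computed from slightly stale values. A minimal sketch on a toy quadratic loss (the loss, step counts, and noise scale are assumptions, not the paper's experiments):

```python
import threading
import numpy as np

w = np.full(4, 10.0)   # shared parameters, updated without locks

def worker(steps=200, lr=0.05):
    rng = np.random.default_rng()
    for _ in range(steps):
        # Gradient of f(w) = 0.5 * ||w||^2 plus noise, read from a
        # possibly stale copy of the shared vector.
        grad = w + rng.normal(scale=0.01, size=w.shape)
        w[:] -= lr * grad          # unsynchronised in-place write

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Despite races and staleness, the iterates still converge toward w* = 0.
assert np.linalg.norm(w) < 1.0
```

The "perturbed iterate" framework formalises exactly this gap between the value a thread reads and the value currently in memory; the sketch only makes the race concrete.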

    The Quantum Adiabatic Algorithm applied to random optimization problems: the quantum spin glass perspective

    Among the various algorithms designed to exploit the specific properties of quantum computers with respect to classical ones, the quantum adiabatic algorithm is a versatile proposal for finding the minimal value of an arbitrary cost function (the ground state energy). Random optimization problems provide a natural testbed for comparing its efficiency with that of classical algorithms. These problems correspond to mean-field spin glasses, which have been extensively studied in the classical case. This paper reviews recent analytical works that extended these studies to incorporate the effect of quantum fluctuations, and also presents some original results in this direction. Comment: 151 pages, 21 figures

    Autonomous grid scheduling using probabilistic job runtime scheduling

    Computational Grids are evolving into a global, service-oriented architecture: a universal platform for delivering future computational services to a range of applications of varying complexity and resource requirements. The thesis focuses on developing a new scheduling model for general-purpose utility clusters based on the concept of user-requested job completion deadlines. In such a system, a user would be able to request that each job finish by a certain deadline, and possibly at a certain monetary cost. Implementing deadline scheduling depends on the ability to predict the execution time of each queued job, and on an adaptive scheduling algorithm able to use those predictions to maximise deadline adherence. The thesis proposes novel solutions to these two problems and documents their implementation in a largely autonomous and self-managing way. The starting point of the work is an extensive analysis of a representative Grid workload, revealing consistent workflow patterns, usage cycles, and correlations between the execution times of jobs and the properties commonly collected by the Grid middleware for accounting purposes. An automated approach is proposed to identify these dependencies and use them to partition the highly variable workload into subsets of more consistent and predictable behaviour. A range of time-series forecasting models, applied in this context for the first time, were used to model job execution times as a function of their historical behaviour and associated properties. Based on the resulting runtime predictions, a novel scheduling algorithm estimates the latest start time at which each job can still meet its requested deadline, and sorts the queue accordingly to minimise the amount of deadline overrun. The proposed approach was tested using an actual job trace collected from a production Grid facility.
The best-performing execution time predictor (the auto-regressive moving average method), coupled with workload partitioning based on three simultaneous job properties, returned a median absolute percentage error centroid of only 4.75%. This level of prediction accuracy enabled the proposed deadline scheduling method to reduce the average deadline overrun time ten-fold compared to the benchmark batch scheduler. Overall, the thesis demonstrates that deadline scheduling of computational jobs on the Grid is achievable using statistical forecasting of job execution times based on historical information. The proposed approach is easily implementable, substantially self-managing, and better matched to the human workflow, making it well suited for implementation in the utility Grids of the future.
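The queue-ordering step described above (latest feasible start time = deadline minus predicted runtime, sort earliest first) can be sketched in a few lines. The job names, runtimes, and deadlines are invented for illustration; the real predictor is the thesis's time-series forecaster, not a fixed number.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    predicted_runtime: float  # would come from the runtime forecaster
    deadline: float           # user-requested completion time

    @property
    def latest_start(self):
        """Latest time the job can start and still meet its deadline."""
        return self.deadline - self.predicted_runtime

def deadline_order(queue):
    """Run the job whose latest feasible start comes soonest first,
    to minimise total deadline overrun (a least-laxity ordering)."""
    return sorted(queue, key=lambda j: j.latest_start)

queue = [Job("a", 30.0, 100.0), Job("b", 90.0, 110.0), Job("c", 5.0, 40.0)]
ordered = [j.name for j in deadline_order(queue)]
assert ordered == ["b", "c", "a"]   # latest starts: 20, 35, 70
```

Note that "b" jumps the queue despite having the latest deadline, because its long predicted runtime leaves it the least slack.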

    Computational Methods in Science and Engineering : Proceedings of the Workshop SimLabs@KIT, November 29 - 30, 2010, Karlsruhe, Germany

    In this proceedings volume we provide a compilation of contributed articles, covering in equal measure applications from different research fields and ranging from capacity up to capability computing. Besides classical computing aspects such as parallelization, the focus of these proceedings is on multi-scale approaches and methods for tackling algorithm and data complexity. Practical aspects regarding the use of the HPC infrastructure and the tools and software available at the SCC are also presented.

    Progressive load balancing of asynchronous algorithms

    Massively parallel supercomputers are susceptible to variable performance due to factors such as differences in chip manufacturing, heat management and network congestion. As a result, the same code with the same input can have a different execution time from run to run. Synchronisation under these circumstances is a key challenge that prevents applications from scaling to large problems and machines. Asynchronous algorithms offer a partial solution. In these algorithms fast processes are not forced to synchronise with slower ones. Instead, they continue computing updates, and moving towards the solution, using the latest data available to them, which may have become stale (i.e. the data is a number of iterations out of date compared to the most recent version). While this allows for high computational efficiency, the convergence rate of asynchronous algorithms tends to be lower than synchronous algorithms due to the use of stale values. A large degree of performance variability can eliminate the performance advantage of asynchronous algorithms or even cause the results to diverge. To address this problem, we use the unique properties of asynchronous algorithms to develop a load balancing strategy for iterative convergent asynchronous algorithms in both shared and distributed memory. The proposed approach – Progressive Load Balancing (PLB) – aims to balance progress levels over time, rather than attempting to equalise iteration rates across parallel workers. This approach attenuates noise without sacrificing performance, resulting in a significant reduction in progress imbalance and improving time to solution. The developed method is evaluated in a variety of scenarios using the asynchronous Jacobi algorithm. In shared memory, we show that it can essentially eliminate the negative effects of a single core in a node slowed down by 19%. Work stealing, an alternative load balancing approach, is shown to be ineffective. 
In distributed memory, the method reduces the impact of up to 8 slow nodes out of 15, each slowed down by 40%, resulting in a 1.03×–1.10× reduction in time to solution and a 1.11×–2.89× reduction in runtime variability. Furthermore, we successfully apply the method in a scenario with real faulty components running 75% slower than normal. The broader applicability of progressive load balancing is established by emulating its application to asynchronous stochastic gradient descent, where it is found to improve both training time and the learned model’s accuracy. Overall, this thesis demonstrates that enhancing asynchronous algorithms with PLB is an effective method for tackling performance variability in supercomputers.
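The core PLB idea, balancing progress levels rather than iteration rates, can be shown with a toy simulation: workers advance at different speeds, and work is periodically migrated from the worker with the least progress to the one with the most. The speeds, row counts, and rebalance rule below are illustrative assumptions, not the thesis's actual policy for asynchronous Jacobi.

```python
def simulate(steps, speeds, rows, rebalance_every=10, plb=True):
    """Return the final progress imbalance (max - min iterations)."""
    progress = [0.0] * len(speeds)       # iterations completed per worker
    for t in range(1, steps + 1):
        for i, s in enumerate(speeds):
            # A worker advances in proportion to its speed over its work.
            progress[i] += s / rows[i]
        if plb and t % rebalance_every == 0:
            slow = min(range(len(speeds)), key=lambda i: progress[i])
            fast = max(range(len(speeds)), key=lambda i: progress[i])
            if rows[slow] > 1:           # migrate one row: slow -> fast
                rows[slow] -= 1
                rows[fast] += 1
    return max(progress) - min(progress)

speeds = [1.0, 1.0, 1.0, 0.6]            # one worker slowed down
gap_plb = simulate(500, speeds, [8, 8, 8, 8], plb=True)
gap_static = simulate(500, speeds, [8, 8, 8, 8], plb=False)
assert gap_plb < gap_static              # PLB narrows the progress gap
```

Equalising progress rather than speed is what keeps all parts of the solution vector comparably up to date, which is exactly the staleness that hurts asynchronous convergence.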

    Phases of Polymers and Biopolymers

    In this thesis we develop coarse-grained models aimed at understanding physical problems arising from phase transitions that occur at the single-molecule level. The thesis consists of two parts, broadly related to and motivated by the two subjects discussed above. In the first half, we focus on critical phenomena in stretching experiments, namely DNA unzipping and polymer stretching in a bad solvent. In the second part, we develop a model of thick polymers, with the goal of understanding the origin of protein folds and the physics underlying the folding ‘transition’, as well as with the hope of shedding some light on some of the fundamental questions highlighted in this Introduction. In the first part of the thesis we introduce a simple model of self-avoiding walks for DNA unzipping. In this way we can map out the phase diagram in the force vs. temperature plane. This reveals the presence of an interesting cold unzipping transition. We then go on to study the dynamics of this coarse-grained model. The main result we discuss is that the unzipping dynamics below the melting temperature obeys different scaling laws from the opening above thermal denaturation, which is governed by temperature-induced fluctuating bubbles. Motivated by this, and by recent results from other theoretical groups, we move on to study the relation of DNA unzipping to the stretching of a homopolymer below the theta point. Though a cold unzipping is also present in this phase diagram, the situation is richer from the theoretical point of view because the physics depends crucially on dimension: the underlying phase transition is second order in two dimensions and first order in three. This is shown to be intimately linked to the failure of mean field theory in this phenomenon, unlike for DNA unzipping. In particular, the globule unfolds via a series (hierarchy) of minima. In two dimensions these minima survive in the thermodynamic limit, whereas for dimension d greater than 2 there is a crossover and, for very long polymers, the intermediate minima disappear. We find it intriguing that an intermediate step in this minima hierarchy for polymers of finite length in the three-dimensional case is a regular mathematical helix, followed by a zig-zag structure. This is found to be general and almost independent of the details of the interaction potential. It suggests that a helix, one of the well-known protein secondary structures, is a natural choice for the ground state of a hydrophobic protein that has to withstand an effective pulling force.
    In the second part, we follow the inverse route and ask for a minimal model able to account for the basic aspects of folding. By this we mean a model containing a suitable potential whose ground state is a protein-like structure and which can account for the known thermodynamic properties of the folding transition. The existing potentials able to do this [32] are usually constructed ‘ad hoc’ from knowledge of the native state. We stress that our procedure here is completely different: the model we propose is built up starting from minimal assumptions. Our main result is the following. If we discard the usual view of a polymer as a sequence of hard spheres tethered together in a chain (see also Chapter 1) and substitute for it the notion of a flexible tube with a given thickness, then upon compaction our ‘thick polymer’, or ‘tube’, displays a rich secondary structure with protein-like helices and sheets, in sharp contrast with the degenerate and messy crumpled collapsed phase found with a conventional bead-and-link or bead-and-spring homopolymer model. Sheets and helices show up as the polymer gets thinner and passes from the swollen to the compact phase. In this sense the most interesting regime is a ‘twilight’ zone consisting of tubes at the edge of the compact phase, which we thus identify as ‘marginally compact structures’. Note the analogy with the result on stretching, in which the helices were likewise the ‘last compact’ structures, or the ‘first extended’ ones, as the polymer is unwound by a force. After discussing this property of the ground states, we proceed to characterise the thermodynamics of a flexible thick polymer with attraction. The resulting phase diagram is shown to have many of the properties usually required of protein effective models: for thin polymers there is a second-order collapse transition (theta collapse) followed, as the temperature is lowered, by a first-order transition to a semicrystalline phase in which the compact phase orders, forming long strands all aligned preferentially along some direction. For thicker polymers the transition to this latter phase occurs directly from the swollen phase, upon lowering T, through a first-order transition resembling the folding transition of short proteins.
