Breaking (Global) Barriers in Parallel Stochastic Optimization with Wait-Avoiding Group Averaging
Deep learning at scale is dominated by communication time. Distributing
samples across nodes usually yields the best performance, but poses scaling
challenges due to global information dissemination and load imbalance across
uneven sample lengths. State-of-the-art decentralized optimizers mitigate the
problem, but require more iterations to achieve the same accuracy as their
globally-communicating counterparts. We present Wait-Avoiding Group Model
Averaging (WAGMA) SGD, a wait-avoiding stochastic optimizer that reduces global
communication via subgroup weight exchange. The key insight is a combination of
algorithmic changes to the averaging scheme and the use of a group allreduce
operation. We prove the convergence of WAGMA-SGD, and empirically show that it
retains convergence rates similar to Allreduce-SGD. For evaluation, we train
ResNet-50 on ImageNet; Transformer for machine translation; and deep
reinforcement learning for navigation at scale. Compared with state-of-the-art
decentralized SGD variants, WAGMA-SGD significantly improves training
throughput (e.g., 2.1x on 1,024 GPUs for reinforcement learning), and achieves
the fastest time-to-solution (e.g., the highest score using the shortest
training time for Transformer).
Comment: Published in IEEE Transactions on Parallel and Distributed Systems (IEEE TPDS), vol. 32, no. 7, pp. 1725-1739, 1 July 202
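The central mechanism, averaging model replicas only within subgroups rather than globally, can be sketched in a few lines (an illustrative NumPy sketch with hypothetical names; the actual WAGMA-SGD additionally makes the group exchange wait-avoiding):

```python
import numpy as np

def group_average(params, group_size, rng):
    """One subgroup-averaging round: partition workers into random
    groups and replace each member's parameters by the group mean
    (the 'group allreduce'). Illustrative sketch with hypothetical
    names; WAGMA-SGD additionally makes this exchange wait-avoiding."""
    n = len(params)
    order = rng.permutation(n)  # fresh random grouping each round
    out = [None] * n
    for start in range(0, n, group_size):
        group = order[start:start + group_size]
        mean = np.mean([params[i] for i in group], axis=0)
        for i in group:
            out[i] = mean.copy()
    return out
```

Because each round regroups the workers at random, information still mixes across the whole population over time, which is what lets the scheme track the convergence of global allreduce averaging.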
SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient
Many deep learning applications benefit from using large models with billions
of parameters. Training these models is notoriously expensive due to the need
for specialized HPC clusters. In this work, we consider alternative setups for
training large models: using cheap "preemptible" instances or pooling existing
resources from multiple regions. We analyze the performance of existing
model-parallel algorithms in these conditions and find configurations where
training larger models becomes less communication-intensive. Based on these
findings, we propose SWARM parallelism, a model-parallel training algorithm
designed for poorly connected, heterogeneous and unreliable devices. SWARM
creates temporary randomized pipelines between nodes that are rebalanced in
case of failure. We empirically validate our findings and compare SWARM
parallelism with existing large-scale training approaches. Finally, we combine
our insights with compression strategies to train a large Transformer language
model with 1B shared parameters (approximately 13B before sharing) on
preemptible T4 GPUs with less than 200Mb/s network.
Comment: Accepted to International Conference on Machine Learning (ICML) 2023. 25 pages, 8 figures
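The fault-handling idea, temporary pipelines between nodes that are rebalanced when a device drops out, can be illustrated with a toy scheduler (hypothetical names and data layout; SWARM's actual rebalancing logic is considerably more involved):

```python
def rebalance(stages, failed):
    """Remove a failed worker, then shift one worker from the most
    populated pipeline stage to the least populated one so that every
    stage stays served. Illustrative sketch with hypothetical structure:
    stages maps stage id -> list of worker ids."""
    for workers in stages.values():
        if failed in workers:
            workers.remove(failed)
    donor = max(stages, key=lambda s: len(stages[s]))
    needy = min(stages, key=lambda s: len(stages[s]))
    if len(stages[donor]) - len(stages[needy]) > 1:
        stages[needy].append(stages[donor].pop())
    return stages
```

The key design point is that stage membership is fluid: any surviving worker can serve any stage, so a failure degrades throughput instead of stalling the pipeline.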
Improved asynchronous parallel optimization analysis for stochastic incremental methods
As datasets continue to increase in size and multi-core computer
architectures are developed, asynchronous parallel optimization algorithms
become more and more essential to the field of Machine Learning. Unfortunately,
conducting the theoretical analysis of asynchronous methods is difficult, notably
due to the introduction of delay and inconsistency in inherently sequential
algorithms. Handling these issues often requires resorting to simplifying but
unrealistic assumptions. Through a novel perspective, we revisit and clarify a
subtle but important technical issue present in a large fraction of the recent
convergence rate proofs for asynchronous parallel optimization algorithms, and
propose a simplification of the recently introduced "perturbed iterate"
framework that resolves it. We demonstrate the usefulness of our new framework
by analyzing three distinct asynchronous parallel incremental optimization
algorithms: Hogwild (asynchronous SGD), KROMAGNON (asynchronous SVRG) and
ASAGA, a novel asynchronous parallel version of the incremental gradient
algorithm SAGA that enjoys fast linear convergence rates. We are able to both
remove problematic assumptions and obtain better theoretical results. Notably,
we prove that ASAGA and KROMAGNON can obtain a theoretical linear speedup on
multi-core systems even without sparsity assumptions. We present results of an
implementation on a 40-core architecture illustrating the practical speedups as
well as the hardware overhead. Finally, we investigate the overlap constant, an
ill-understood but central quantity for the theoretical analysis of
asynchronous parallel algorithms. We find that it encompasses much more
complexity than suggested in previous work, and often is order-of-magnitude
bigger than traditionally thought.
Comment: 67 pages, published in JMLR, available online at http://jmlr.org/papers/v19/17-650.html. arXiv admin note: substantial text overlap with arXiv:1606.0480
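Hogwild, the first of the three analyzed algorithms, is simply SGD with unsynchronised threads writing to a shared weight vector. A minimal sketch (illustrative only; in CPython the GIL largely serialises these updates, so this shows the scheme rather than real lock-free parallelism):

```python
import threading
import numpy as np

def hogwild_sgd(X, y, n_threads=4, epochs=20, lr=0.1):
    """Hogwild!-style SGD for least squares: threads update the shared
    weight vector without any locks, so reads may see stale values.
    Illustrative sketch, not the paper's implementation."""
    w = np.zeros(X.shape[1])

    def worker(indices):
        for _ in range(epochs):
            for i in indices:
                grad = (X[i] @ w - y[i]) * X[i]  # per-sample gradient
                w[:] -= lr * grad                # unsynchronised update

    chunks = np.array_split(np.arange(len(X)), n_threads)
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w
```

The "perturbed iterate" framework discussed above exists precisely because the value of `w` read by one thread may be a mixture of several other threads' partial updates.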
The Quantum Adiabatic Algorithm applied to random optimization problems: the quantum spin glass perspective
Among various algorithms designed to exploit the specific properties of
quantum computers with respect to classical ones, the quantum adiabatic
algorithm is a versatile proposition to find the minimal value of an arbitrary
cost function (ground state energy). Random optimization problems provide a
natural testbed to compare its efficiency with that of classical algorithms.
These problems correspond to mean field spin glasses that have been extensively
studied in the classical case. This paper reviews recent analytical works that
extended these studies to incorporate the effect of quantum fluctuations, and
also presents some original results in this direction.
Comment: 151 pages, 21 figures
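For reference, the quantum adiabatic algorithm's interpolation between a transverse-field mixer and the problem Hamiltonian takes the textbook form (standard notation, not specific to this review):

```latex
H(s) \;=\; (1-s)\,H_B \;+\; s\,H_P, \qquad s = t/T, \qquad
H_B \;=\; -\sum_{i=1}^{N} \sigma^{x}_{i},
```

where the adiabatic theorem requires the total runtime $T$ to be large compared with $\Delta_{\min}^{-2}$, the inverse square of the minimum spectral gap of $H(s)$; the mean-field spin-glass analysis mentioned above is largely a study of how this gap closes on random instances.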
Autonomous grid scheduling using probabilistic job runtime scheduling
Computational Grids are evolving into a global, service-oriented architecture:
a universal platform for delivering future computational services to a range of
applications of varying complexity and resource requirements. The thesis focuses
on developing a new scheduling model for general-purpose, utility clusters
based on the concept of user requested job completion deadlines. In such a
system, a user would be able to request each job to finish by a certain deadline,
and possibly at a certain monetary cost. Implementing deadline scheduling is
dependent on the ability to predict the execution time of each queued job, and
on an adaptive scheduling algorithm able to use those predictions to maximise
deadline adherence. The thesis proposes novel solutions to these two problems
and documents their implementation in a largely autonomous and self-managing
way.
The starting point of the work is an extensive analysis of a representative
Grid workload, revealing consistent workflow patterns, usage cycles and correlations between the execution times of jobs and their properties commonly collected
by the Grid middleware for accounting purposes. An automated approach is
proposed to identify these dependencies and use them to partition the highly
variable workload into subsets of more consistent and predictable behaviour.
A range of time-series forecasting models, applied in this context for the first
time, were used to model the job execution times as a function of their historical
behaviour and associated properties. Based on the resulting predictions of job
runtimes a novel scheduling algorithm is able to estimate the latest job start
time necessary to meet the requested deadline and sort the queue accordingly to
minimise the amount of deadline overrun.
The testing of the proposed approach was done using the actual job trace
collected from a production Grid facility. The best performing execution time
predictor (the auto-regressive moving average method) coupled to workload
partitioning based on three simultaneous job properties returned the median
absolute percentage error centroid of only 4.75%. This level of prediction
accuracy enabled the proposed deadline scheduling method to reduce the average deadline overrun time ten-fold compared to the benchmark batch scheduler.
Overall, the thesis demonstrates that deadline scheduling of computational
jobs on the Grid is achievable using statistical forecasting of job execution times
based on historical information. The proposed approach is easily implementable,
substantially self-managing and better matched to the human workflow, making
it well suited for implementation in the utility Grids of the future.
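The queue-ordering idea, sorting jobs by their latest feasible start time (deadline minus predicted runtime), can be sketched as follows (hypothetical field names; the thesis scheduler additionally handles prediction error and adaptivity):

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    deadline: float       # requested completion time
    predicted_run: float  # forecast execution time from the workload model

def deadline_order(jobs):
    """Order the queue by latest feasible start time (deadline minus
    predicted runtime): the job that must start soonest runs first.
    Illustrative sketch of the idea, not the thesis implementation."""
    return sorted(jobs, key=lambda j: j.deadline - j.predicted_run)
```

A job with a distant deadline but a long predicted runtime can thus correctly jump ahead of a short job whose deadline is nominally earlier.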
Computational Methods in Science and Engineering : Proceedings of the Workshop SimLabs@KIT, November 29 - 30, 2010, Karlsruhe, Germany
In this proceedings volume we provide a compilation of article contributions, equally covering applications from different research fields and ranging from capacity up to capability computing. Besides classical computing aspects such as parallelization, the focus of these proceedings is on multi-scale approaches and methods for tackling algorithm and data complexity. Practical aspects regarding the usage of the HPC infrastructure and the tools and software available at the SCC are also presented.
Progressive load balancing of asynchronous algorithms
Massively parallel supercomputers are susceptible to variable performance due to
factors such as differences in chip manufacturing, heat management and network congestion. As a result, the same code with the same input can have a different execution
time from run to run. Synchronisation under these circumstances is a key challenge
that prevents applications from scaling to large problems and machines.
Asynchronous algorithms offer a partial solution. In these algorithms fast processes
are not forced to synchronise with slower ones. Instead, they continue computing updates, and moving towards the solution, using the latest data available to them, which
may have become stale (i.e. the data is a number of iterations out of date compared
to the most recent version). While this allows for high computational efficiency, the
convergence rate of asynchronous algorithms tends to be lower than synchronous algorithms due to the use of stale values. A large degree of performance variability can
eliminate the performance advantage of asynchronous algorithms or even cause the
results to diverge.
To address this problem, we use the unique properties of asynchronous algorithms
to develop a load balancing strategy for iterative convergent asynchronous algorithms
in both shared and distributed memory. The proposed approach, Progressive Load
Balancing (PLB), aims to balance progress levels over time, rather than attempting to
equalise iteration rates across parallel workers. This approach attenuates noise without
sacrificing performance, resulting in a significant reduction in progress imbalance and
improving time to solution.
The developed method is evaluated in a variety of scenarios using the asynchronous
Jacobi algorithm. In shared memory, we show that it can essentially eliminate the
negative effects of a single core in a node slowed down by 19%. Work stealing, an
alternative load balancing approach, is shown to be ineffective. In distributed memory,
the method reduces the impact of up to 8 slow nodes out of 15, each slowed down
by 40%, resulting in a 1.03× to 1.10× reduction in time to solution and a 1.11× to 2.89×
reduction in runtime variability. Furthermore, we successfully apply the method in
a scenario with real faulty components running 75% slower than normal. Broader
applicability of progressive load balancing is established by emulating its application
to asynchronous stochastic gradient descent, where it is found to improve both training
time and the learned model's accuracy.
Overall, this thesis demonstrates that enhancing asynchronous algorithms with
PLB is an effective method for tackling performance variability in supercomputers.
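The core PLB idea, equalising progress levels rather than iteration rates, can be illustrated with a toy rebalancing step (hypothetical names; not the thesis implementation):

```python
def rebalance_work(progress, work):
    """One progressive-load-balancing step (illustrative, hypothetical
    names): shift a unit of work from the worker furthest behind in
    progress (iteration count) to the one furthest ahead, so that
    progress levels, not iteration rates, converge over time."""
    behind = min(range(len(progress)), key=lambda i: progress[i])
    ahead = max(range(len(progress)), key=lambda i: progress[i])
    if progress[ahead] > progress[behind] and work[behind] > 1:
        work[behind] -= 1   # the slow worker computes less per sweep
        work[ahead] += 1    # the fast worker absorbs the difference
    return work
```

Because the fast worker keeps iterating on whatever data it has while the transfer happens, this rebalancing exploits the asynchrony itself instead of fighting it.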
Phases of Polymers and Biopolymers
In this thesis we develop coarse grained models aiming at understanding physical
problems arising from phase transitions which occur at the single molecule level. The
thesis will consist of two parts, broadly related to and motivated by the two subjects
dealt with above. In the first half, we will focus on critical phenomena in stretching
experiments, namely in DNA unzipping and polymer stretching in a bad solvent. In
the second part, we will develop a model of thick polymers, with the goal of understanding
the origin of protein folds and the physics underlying the folding 'transition',
as well as with the hope of shedding some light on some of the fundamental
questions highlighted in this Introduction.
In the first part of the thesis we will introduce a simple model of self-avoiding
walks for DNA unzipping. In this way we can map out the phase diagram in the
force vs. temperature plane. This reveals the presence of an interesting cold unzipping
transition. We then go on to study the dynamics of this coarse grained model. The
main result which we will discuss is that the unzipping dynamics below the melting
temperature obeys scaling laws different from those governing the opening above thermal
denaturation, which is driven by temperature-induced fluctuating bubbles.
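The combinatorial object behind such lattice models, the self-avoiding walk, can be enumerated directly for short chains (a generic illustration of the walk statistics, not the thesis's specific unzipping model):

```python
def count_saws(n, pos=(0, 0), visited=None):
    """Count n-step self-avoiding walks on the square lattice by
    depth-first enumeration. Known counts are 4, 12, 36, 100, ...
    for n = 1, 2, 3, 4, ... Generic illustration only."""
    if visited is None:
        visited = {pos}
    if n == 0:
        return 1
    x, y = pos
    total = 0
    for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
        if nxt not in visited:       # self-avoidance constraint
            visited.add(nxt)
            total += count_saws(n - 1, nxt, visited)
            visited.remove(nxt)      # backtrack
    return total
```

Exhaustive enumeration is only feasible for short walks; phase-diagram studies of the kind described above rely on transfer-matrix or Monte Carlo methods instead.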
Motivated by this and by recent results from other theoretical groups, we move on
to study the relation to DNA unzipping of the stretching of a homopolymer below the
theta point. Although a cold unzipping is also present in the phase diagram in this case,
the situation is richer from the theoretical point of view because the physics depends
crucially on dimension: the underlying phase transition is indeed second order in two
dimensions and first order in three. This is shown to be intimately linked to the failure
of mean-field theory in this phenomenon, unlike for DNA unzipping. In particular, the globule
unfolds via a series (hierarchy) of minima. In two dimensions they survive in the thermodynamic
limit whereas if the dimension, d, is greater than 2, there is a crossover
and for very long polymers the intermediate minima disappear. We deem it intriguing
that an intermediate step in this minima hierarchy for polymers of finite length in the
three-dimensional case is a regular mathematical helix, followed by a zig-zag structure.
This is found to be general and almost independent of the interaction potential
details. It suggests that a helix, one of the well-known protein secondary structures, is
a natural choice for the ground state of a hydrophobic protein which has to withstand
an effective pulling force.
In the second part, we will follow the inverse route and ask for a minimal model
which is able to account for the basic aspects of folding. By this, we mean a model
which contains a suitable potential which has as its ground state a protein-like structure
and which can account for the known thermodynamical properties of the folding
transition. The existing potentials which are able to do that [32] are usually constructed
'ad hoc' from knowledge of the native state. We stress that our procedure here is
completely different and the model which we propose should be built up starting
from minimal assumptions. Our main result is the following. If we throw away the
usual view of a polymer as a sequence of hard spheres tethered together by a chain
(see also Chapter 1) and substitute it with the notion of a flexible tube with a given
thickness, then upon compaction our 'thick polymer' or 'tube' will display a rich secondary structure with protein-like helices and sheets, in sharp contrast with the
degenerate and messy crumpled collapsed phase which is found with a conventional
bead-and-link or bead-and-spring homopolymer model. Sheets and helices show up
as the polymer gets thinner and passes from the swollen to the compact phase. In this
sense the most interesting regime is a 'twilight' zone, which consists of tubes that
are at the edge of the compact phase, and we thus identify them as 'marginally compact
structures'. Note the analogy with the result on stretching, in which the helices
were in the same way the 'last compact' structures or the 'first extended' ones when
the polymer is being unwound by a force.
After this property of ground states is discussed, we proceed to characterize the
thermodynamics of a flexible thick polymer with attraction. The resulting phase diagram
is shown to have many of the properties which are usually required from protein
effective models: for thin polymers there is a second-order collapse transition
(θ collapse) followed, as the temperature is lowered, by a first-order transition
to a semicrystalline phase where the compact phase orders forming long strands all
aligned preferentially along some direction. For thicker polymers the transition to
this latter phase occurs directly from the swollen phase, upon lowering T, through a
first-order transition resembling the folding transition of short proteins.
Kinetics of Brownian Transport
The rate of progress of Brownian processes is not easily quantifiable. An important measure
of the 'speed' of Brownian motion is the mean first-passage time (FPT) to a given
distance. FPTs exist in various flavours, including exit- and transition-path times, which,
for instance, can be used to quantify the length of reaction paths in folding transitions
in molecules such as DNA. Due to their inherently stochastic nature, measurements of
any FPTs require repeated experiments under controlled conditions. In my thesis, I systematically
explore FPTs in various contexts using a custom-built automated holographic
optical tweezers (HOT) setup. More precisely, I investigate transition- and exit-path-time
symmetries in equilibrium systems and demonstrate the breakdown of the symmetry in
out-of-equilibrium systems. Experimental data from folding DNA hairpins show that the
principles established on the mesoscale extend well into the molecular regime.
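A mean FPT of the kind measured here can also be estimated numerically; below is a minimal Monte Carlo sketch for free 1D Brownian motion with symmetric absorbing boundaries (illustrative parameters, not the experimental setup):

```python
import numpy as np

def mean_fpt(L=1.0, D=1.0, dt=1e-3, n_paths=400, seed=0):
    """Monte Carlo estimate of the mean first-passage time of 1D
    Brownian motion (diffusion constant D) from 0 to +/-L.
    Exact answer for symmetric absorbing boundaries: L**2 / (2*D)."""
    rng = np.random.default_rng(seed)
    step = np.sqrt(2 * D * dt)     # Gaussian increment scale per step
    times = np.empty(n_paths)
    for k in range(n_paths):
        x, t = 0.0, 0.0
        while abs(x) < L:          # run until first exit
            x += step * rng.standard_normal()
            t += dt
        times[k] = t
    return times.mean()
```

As the lead-in to the thesis notes, the FPT is itself a random variable, so such an estimate only stabilises over many repeated paths, mirroring the need for repeated experiments.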
In Kramers' escape problem, the reciprocal of the escape rate corresponds to the time
of first-passage to leave the initial state. A lower bound for the achievable FPT, e.g. of
the reaction coordinate of a folding molecule, therefore corresponds to a speed-limit
of the ensemble reaction rate. Using my setup, I show that certain barrier shapes can
substantially lower the escape time across the barrier without changing the overall energy
balance. This result has deep implications for reaction kinetics, e.g. in protein folding.
Furthermore, I investigate the role of entropic forces in Brownian transport, show that
hydrodynamic drag plays a crucial role in Brownian motion in confined systems, and give
an experimental realisation of Fick-Jacobs theory.
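The escape-time picture can likewise be sketched numerically: an overdamped Langevin particle climbing a barrier of tunable slope (an illustrative sketch with unit friction and a reflecting wall at the well bottom; not the experimental protocol):

```python
import numpy as np

def escape_time(barrier_slope, kT=1.0, dt=1e-3, x_exit=1.0,
                n_paths=150, seed=0):
    """Mean escape time of an overdamped Langevin particle (unit
    friction) pushed back by a barrier slope U'(x), reflected at
    x = 0 and absorbed at x_exit. Illustrative sketch of the Kramers
    first-passage picture."""
    rng = np.random.default_rng(seed)
    noise = np.sqrt(2 * kT * dt)
    times = []
    for _ in range(n_paths):
        x, t = 0.0, 0.0
        while x < x_exit:
            x += -barrier_slope(x) * dt + noise * rng.standard_normal()
            x = abs(x)              # reflecting wall at the well bottom
            t += dt
        times.append(t)
    return float(np.mean(times))
```

A steeper barrier of the same width takes markedly longer to cross; comparing shapes at fixed barrier height, as the thesis does, is the more subtle experiment.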
The thermodynamic applications of HOTs considered here necessitate the creation
of fine-tuned optical landscapes, which requires precise phase-retrieval to compute the
necessary holograms. In order to address this problem, I explore novel algorithms based
on deep conditional generative models and test whether such models can assist in finding
holograms for a given desired light distribution. I compare several different models,
including conditional generative-adversarial networks and conditional variational autoencoders,
which are trained on data sets sampled on the HOT setup. Furthermore, I propose
a novel forward-loss-minimising architecture and demonstrate its excellent performance
on both validation and artificially created test data sets.
European Training Network (ETN) Grant No. 674979-NANOTRANS
Winton Programme for the Physics of Sustainability
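As a point of comparison for the learned models above, classical phase retrieval for holograms is commonly done with the Gerchberg-Saxton algorithm; a minimal sketch (a generic baseline, which the thesis's generative approach replaces rather than implements):

```python
import numpy as np

def gerchberg_saxton(target_amp, n_iter=50, seed=0):
    """Classical Gerchberg-Saxton phase retrieval: iterate between the
    hologram plane (unit amplitude, free phase) and the far field (FFT),
    imposing the target amplitude there. Generic baseline sketch."""
    rng = np.random.default_rng(seed)
    phase = rng.uniform(0, 2 * np.pi, target_amp.shape)
    for _ in range(n_iter):
        far = np.fft.fft2(np.exp(1j * phase))          # propagate to far field
        far = target_amp * np.exp(1j * np.angle(far))  # impose target amplitude
        near = np.fft.ifft2(far)                       # propagate back
        phase = np.angle(near)                         # keep phase only (phase-only SLM)
    return phase
```

Each iteration is two FFTs, so the cost per hologram is what a trained generative model's single forward pass competes against.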