Progressive load balancing of asynchronous algorithms
Massively parallel supercomputers are susceptible to variable performance due to
factors such as differences in chip manufacturing, heat management and network congestion. As a result, the same code with the same input can have a different execution
time from run to run. Synchronisation under these circumstances is a key challenge
that prevents applications from scaling to large problems and machines.
Asynchronous algorithms offer a partial solution. In these algorithms, fast processes
are not forced to synchronise with slower ones. Instead, they continue computing updates, and moving towards the solution, using the latest data available to them, which
may have become stale (i.e. the data is a number of iterations out of date compared
to the most recent version). While this allows for high computational efficiency, the
convergence rate of asynchronous algorithms tends to be lower than that of synchronous algorithms due to the use of stale values. A large degree of performance variability can
eliminate the performance advantage of asynchronous algorithms or even cause the
results to diverge.
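To make the staleness model concrete, the following is a minimal sketch (illustrative only, not the thesis implementation) of Jacobi iteration in which each update may read component values that are several iterations old. On a diagonally dominant system, both the fresh and the stale variants still converge; the stale one simply makes less progress per step.

```python
import numpy as np

# Toy diagonally dominant system A x = b, solved with Jacobi iteration.
# To mimic an asynchronous run, each update may read component values
# that are `stale` iterations out of date.
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 4.0, 1.0],
              [0.0, 1.0, 4.0]])
b = np.array([1.0, 2.0, 3.0])

def jacobi_with_staleness(A, b, iters=200, stale=0):
    n = len(b)
    history = [np.zeros(n)]          # history[k] is the iterate at step k
    for k in range(iters):
        old = history[max(0, len(history) - 1 - stale)]  # possibly stale read
        x_new = np.empty(n)
        for i in range(n):
            s = A[i] @ old - A[i, i] * old[i]   # sum over off-diagonal terms
            x_new[i] = (b[i] - s) / A[i, i]
        history.append(x_new)
    return history[-1]

x_sync  = jacobi_with_staleness(A, b, stale=0)   # classic synchronous Jacobi
x_stale = jacobi_with_staleness(A, b, stale=3)   # reads 3 iterations old
exact = np.linalg.solve(A, b)
```

With staleness 3, each step effectively advances only every fourth iteration's worth of progress, which is the convergence-rate penalty described above.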
To address this problem, we use the unique properties of asynchronous algorithms
to develop a load balancing strategy for iterative convergent asynchronous algorithms
in both shared and distributed memory. The proposed approach – Progressive Load
Balancing (PLB) – aims to balance progress levels over time, rather than attempting to
equalise iteration rates across parallel workers. This approach attenuates noise without
sacrificing performance, significantly reducing progress imbalance and improving time
to solution.
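As a hypothetical illustration of the idea (the function name, threshold, and chunk size are assumptions, not the thesis API), one progressive balancing step might compare the workers' progress counters and shift a small unit of work from the worker that is behind to the one that is ahead, so that the slower worker's progress can catch up over time:

```python
def rebalance_step(progress, shares, threshold=5, chunk=1):
    """Return new work shares after one progressive balancing step.

    progress: iterations completed by each worker on its slice
    shares:   units of work (e.g. matrix rows) owned by each worker
    """
    lead = max(range(len(progress)), key=lambda w: progress[w])
    lag = min(range(len(progress)), key=lambda w: progress[w])
    shares = list(shares)
    # only act on a persistent imbalance, and never empty a worker
    if progress[lead] - progress[lag] > threshold and shares[lag] > chunk:
        shares[lag] -= chunk   # the slow worker gives up work...
        shares[lead] += chunk  # ...so its progress can catch up
    return shares

# e.g. worker 0 is 12 iterations ahead of worker 1:
print(rebalance_step([120, 108], [50, 50]))  # → [51, 49]
```

Repeating such small steps gradually balances progress levels rather than trying to equalise instantaneous iteration rates in one shot.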
The developed method is evaluated in a variety of scenarios using the asynchronous
Jacobi algorithm. In shared memory, we show that it can essentially eliminate the
negative effects of a single core in a node slowed down by 19%. Work stealing, an
alternative load balancing approach, is shown to be ineffective. In distributed memory,
the method reduces the impact of up to 8 slow nodes out of 15, each slowed down
by 40%, resulting in a 1.03×–1.10× reduction in time to solution and a 1.11×–2.89×
reduction in runtime variability. Furthermore, we successfully apply the method in
a scenario with real faulty components running 75% slower than normal. Broader
applicability of progressive load balancing is established by emulating its application
to asynchronous stochastic gradient descent where it is found to improve both training
time and the learned model’s accuracy.
Overall, this thesis demonstrates that enhancing asynchronous algorithms with
PLB is an effective method for tackling performance variability in supercomputers.

TensorFlow as a DSL for stencil-based computation on the Cerebras Wafer Scale Engine
The Cerebras Wafer Scale Engine (WSE) is an accelerator that combines
hundreds of thousands of AI-cores onto a single chip. Whilst this technology
has been designed for machine learning workloads, the significant amount of
available raw compute means that it is also a very interesting potential target
for accelerating traditional HPC computational codes. Many of these algorithms
are stencil-based, where update operations involve contributions from
neighbouring elements, and in this paper we explore the suitability of this
technology for such codes from the perspective of an early adopter of the
technology, compared to CPUs and GPUs. Using TensorFlow as the interface, we
explore the performance and demonstrate that, whilst there is still work to be
done around exposing the programming interface to users, performance of the WSE
is impressive as it outperforms four V100 GPUs by two and a half times and two
Intel Xeon Platinum CPUs by around 114 times in our experiments. There is
significant potential therefore for this technology to play an important role
in accelerating HPC codes on future exascale supercomputers.

Comment: This preprint has not undergone any post-submission improvements or
corrections. Preprint of paper submitted to Euro-Par DSL-HPC workshop.
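A generic sketch of the five-point stencil pattern the paper targets (plain NumPy for illustration; the paper itself expresses the computation through TensorFlow operations mapped onto the WSE):

```python
import numpy as np

# Five-point Jacobi stencil: each interior element is replaced by the
# average of its four neighbours, the classic update pattern for
# structured-grid HPC codes such as heat diffusion or Laplace solvers.
def five_point_step(u):
    v = u.copy()
    v[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                            u[1:-1, :-2] + u[1:-1, 2:])
    return v

u = np.zeros((8, 8))
u[0, :] = 1.0                 # fixed boundary condition on one edge
for _ in range(100):
    u = five_point_step(u)
```

Because each update touches only nearest neighbours, the pattern maps naturally onto a mesh of cores with local communication, which is what makes architectures like the WSE an interesting target for such codes.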