Towards larger scale collective operations in the Message Passing Interface
Supercomputers continue to expand in both size and complexity as we reach the beginning of the exascale era. Networks have evolved from simple mechanisms that transport data into subsystems that fulfil a significant fraction of the workload computers are tasked with. Inevitably with this change, assumptions made at the beginning of the last major shift in computing are becoming outdated.
We introduce a new latency-bandwidth model which captures the characteristics of sending multiple small messages in quick succession on modern networks. Unlike other models that represent the same effects, the pipelining latency-bandwidth model is simple and physically based. In addition, we develop a discrete-event simulation, Fennel, to capture communication effects that analytical models cannot express.
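As an illustration of the kind of effect such a model captures, the sketch below contrasts a classic Hockney-style latency-bandwidth cost, in which every message pays the full latency, with a simplified pipelined cost in which back-to-back messages share a single latency term. The formulas and parameter values are illustrative assumptions, not the thesis's exact model.

def classic_cost(m, alpha, beta, n):
    # Classic latency-bandwidth (Hockney-style) estimate: each of the m
    # messages of n bytes pays the full latency alpha plus n * beta of
    # serialisation time on the link.
    return m * (alpha + beta * n)

def pipelined_cost(m, alpha, beta, n):
    # Illustrative pipelining variant: messages injected in quick succession
    # share one latency term, and only their bodies serialise on the link.
    return alpha + m * beta * n

# Eight 64-byte messages sent back to back: latency dominates the classic estimate.
print(classic_cost(8, alpha=1e-6, beta=1e-9, n=64))    # ~8.5e-06 s
print(pipelined_cost(8, alpha=1e-6, beta=1e-9, n=64))  # ~1.5e-06 s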
AllReduce operations on small messages are common throughout supercomputing, particularly in iterative methods, and the performance of network operations is crucial to an application's overall time-to-solution. The Message Passing Interface standard was introduced to abstract complex communication away from application-level development. The underlying algorithms used to implement the specified behaviour, such as the recursive doubling algorithm for AllReduce, have to evolve with the computers on which they are used.
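For context, recursive doubling performs an AllReduce over a power-of-two number of ranks in log2(P) rounds, with each rank exchanging its partial result with a partner whose rank differs in one bit. The following minimal sketch, written with mpi4py for brevity and assuming a power-of-two communicator size, shows the communication pattern; it is not the thesis's implementation.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()  # size assumed to be a power of two

value = np.array([float(rank)])  # each rank's local contribution
dist = 1
while dist < size:
    partner = rank ^ dist        # partner differs from this rank in exactly one bit
    recv = np.empty_like(value)
    comm.Sendrecv(value, dest=partner, recvbuf=recv, source=partner)
    value += recv                # fold in the partner's partial sum
    dist *= 2
# After log2(size) rounds, every rank holds sum(range(size)).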
We introduce the recursive multiplying algorithm as a generalisation of recursive doubling. By exploiting the pipelining capability of modern networks, we lower the latency of AllReduce operations and enable a greater choice of schedules. A heuristic based on the pipelining latency-bandwidth model is used to quickly generate a near-optimal schedule.
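A rough way to see why generalising the radix helps under pipelining: if groups of k ranks combine their partials in each round, only log_k(P) rounds are needed, at the price of k - 1 messages per rank per round, which a pipelining network can overlap. The sketch below compares round counts and total cost under an assumed pipelined per-round cost; the cost formula and parameter values are illustrative and do not reproduce the thesis's heuristic.

import math

def rounds(p, k):
    # Number of rounds if groups of k ranks combine their partials each round.
    # Recursive doubling is the k = 2 case.
    assert p == k ** round(math.log(p, k)), "p must be a power of k in this sketch"
    return round(math.log(p, k))

def round_cost(k, alpha, beta, n):
    # Assumed pipelined cost of one round: a single latency, then the k - 1
    # message bodies injected back to back.
    return alpha + (k - 1) * beta * n

def allreduce_cost(p, k, alpha, beta, n):
    return rounds(p, k) * round_cost(k, alpha, beta, n)

# 4096 ranks, 8-byte payloads: higher radices trade messages per round for fewer rounds.
for k in (2, 4, 8, 16):
    print(k, rounds(4096, k), allreduce_cost(4096, k, alpha=1e-6, beta=1e-9, n=8))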
Alongside recursive multiplying, the endpoints of collective operations must be able to handle larger numbers of incoming messages. Typically this is done by duplicating receive queues for remote peers, but this requires memory that grows linearly with the size of the application. We introduce a single-consumer multiple-producer queue, designed to be used with MPI as a protocol for inserting messages remotely with minimal contention on shared receive queues.
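One common way to build such a remotely writable queue on top of MPI is with one-sided operations: a producer atomically claims a slot index by a fetch-and-add on the consumer's tail counter, then puts its payload into that slot. The mpi4py sketch below illustrates that general pattern only; the window layout, names, and synchronisation shown are assumptions and deliberately omit the flow control and contention-avoidance details the thesis addresses.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

CAPACITY = 1024
CONSUMER = 0  # the single consumer owns the shared receive queue

# The consumer exposes a tail counter and a slot array through RMA windows.
tail = np.zeros(1, dtype='i8') if rank == CONSUMER else None
slots = np.zeros(CAPACITY, dtype='i8') if rank == CONSUMER else None
tail_win = MPI.Win.Create(tail, comm=comm)
slots_win = MPI.Win.Create(slots, disp_unit=8, comm=comm)

if rank != CONSUMER:
    # Producer protocol: atomically claim a slot index, then write the payload.
    one = np.ones(1, dtype='i8')
    claimed = np.zeros(1, dtype='i8')
    tail_win.Lock(CONSUMER)
    tail_win.Fetch_and_op(one, claimed, CONSUMER, 0, MPI.SUM)
    tail_win.Unlock(CONSUMER)

    payload = np.array([rank], dtype='i8')
    slots_win.Lock(CONSUMER)
    slots_win.Put(payload, CONSUMER, target=int(claimed[0]) % CAPACITY)
    slots_win.Unlock(CONSUMER)

comm.Barrier()  # each put is complete at the target once its unlock returns
if rank == CONSUMER:
    print("received payloads:", slots[: comm.Get_size() - 1])

tail_win.Free()
slots_win.Free()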
An elastic, parallel and distributed computing architecture for machine learning
Machine learning is a powerful tool that allows us to make better and faster decisions in a data-driven fashion based on training data. Neural networks are especially popular in the context of supervised learning due to their ability to approximate arbitrary functions. However, building these models is typically computationally intensive and can take significant time on a conventional CPU-based computer. Such long turnaround times can make using these models infeasible for business and research. This research seeks to accelerate the training process through parallel and distributed computing using High-Performance Computing (HPC) resources.
To understand machine learning on HPC platforms, the theoretical performance analysis in this thesis identifies four key factors for data-parallel machine learning: convergence, batch size, and computational and communication efficiency. It shows that, for a fixed experimental setup, there is a maximum computational speed-up attainable through parallel and distributed computing.
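The existence of a maximum speed-up can be seen with a back-of-the-envelope model: the per-iteration gradient computation divides across workers while the cost of synchronising gradients grows with the worker count, so adding workers eventually stops paying off. The model and numbers below are an illustrative assumption, not the thesis's analysis.

import math

def data_parallel_speedup(workers, compute_time, alpha):
    # Compute divides across workers; the gradient allreduce is modelled here
    # as log2(workers) rounds of latency alpha, so it grows with scale.
    comm_time = alpha * math.log2(workers) if workers > 1 else 0.0
    return compute_time / (compute_time / workers + comm_time)

# 100 ms of compute per batch and 2 ms per allreduce round: the speed-up
# peaks (here around 64 workers) and then declines as communication dominates.
for w in (1, 8, 64, 512, 4096):
    print(w, round(data_parallel_speedup(w, 0.100, 0.002), 1))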
The primary focus of this thesis is convolutional neural network applications on the Apache Spark platform. The work presented directly addresses the computational and communication inefficiencies associated with the Spark platform through improvements to the Resilient Distributed Dataset (RDD) and the introduction of an elastic non-blocking all-reduce. In addition to these implementation optimisations, computational performance is further improved by overlapping computation with communication and by using large batch sizes through fine-grained control. The impact of these improvements grows with the rise of massively parallel processors and high-speed networks.
With all the techniques combined, it is predicted that training the ResNet50 model on the ImageNet dataset for 100 epochs at an effective batch size of 16K will take under 20 minutes on an NVIDIA Tesla P100 cluster, in contrast to 26 months on a single Intel Xeon E5-2660 v3 2.6 GHz processor.
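Taking the abstract's own figures at face value, the implied end-to-end speed-up is roughly five orders of magnitude (assuming 30-day months for the conversion):

single_cpu_minutes = 26 * 30 * 24 * 60   # 26 months on one Xeon E5-2660 v3
cluster_minutes = 20                     # predicted time on the P100 cluster
print(single_cpu_minutes / cluster_minutes)  # roughly 56,000x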
Due to its similarities to scientific computing, the resulting computing model serves as an exemplar of integrating high-performance computing and elastic computing with dynamic workloads, laying the foundation for future research in emerging computational steering applications such as interactive physics simulations and data assimilation in weather forecasting and research.
Scaling-up reinforcement learning using parallelization and symbolic planning
EThOS - Electronic Theses Online Service (United Kingdom)