Optimizing Collective Communication for Scalable Scientific Computing and Deep Learning

Abstract

In the realm of distributed computing, collective operations involve coordinated communication and synchronization among multiple processing units, enabling efficient data exchange and collaboration. Scientific applications, such as simulations, computational fluid dynamics, and scalable deep learning, require complex computations that can be parallelized across multiple nodes in a distributed system. These applications often involve data-dependent communication patterns, where collective operations are critical for achieving high performance in data exchange. Optimizing collective operations for scientific applications and deep learning involves improving the algorithms, communication patterns, and data distribution strategies to minimize communication overhead and maximize computational efficiency. Within the context of this dissertation, the specific focus is on optimizing the alltoall operation in 3D Fast Fourier Transform (FFT) applications and the allreduce operation in parallel deep learning, particularly on High-Performance Computing (HPC) systems. Advanced communication algorithms and methods are explored and implemented to improve communication efficiency, consequently enhancing the overall performance of 3D FFT applications. Furthermore, this dissertation investigates the identification of performance bottlenecks during collective communication over Horovod on distributed systems. These bottlenecks are addressed by proposing an optimized parallel communication pattern specifically tailored to alleviate the aforementioned limitations during the training phase in distributed deep learning. The objective is to achieve faster convergence and improve the overall training efficiency. Moreover, this dissertation proposes fault tolerance and elastic scaling features for distributed deep learning by leveraging the User-Level Failure Mitigation (ULFM) from Message Passing Interface (MPI). By incorporating ULFM MPI, the dissertation aims to enhance the elastic capabilities of distributed deep learning systems. This approach enables graceful and lightweight handling of failures while facilitating seamless scaling in dynamic computing environments

    Similar works