
    Optimizing Multi-GPU Parallelization Strategies for Deep Learning Training

    Deploying deep learning (DL) models across multiple compute devices to train large and complex models continues to grow in importance because of the demand for faster and more frequent training. Data parallelism (DP) is the most widely used parallelization strategy, but as the number of devices in data-parallel training grows, so does the communication overhead between devices. Additionally, a larger aggregate batch size per step leads to a loss of statistical efficiency, i.e., a larger number of epochs is required to converge to a desired accuracy. These factors affect overall training time, and beyond a certain number of devices the speedup from leveraging DP begins to scale poorly. In addition to DP, each training step can be accelerated by exploiting model parallelism (MP). This work explores hybrid parallelization, in which each data-parallel worker consists of more than one device, across which the model dataflow graph (DFG) is split using MP. We show that at scale, hybrid training is more effective at minimizing end-to-end training time than exploiting DP alone. We project that for Inception-V3, GNMT, and BigLSTM, the hybrid strategy provides an end-to-end training speedup of at least 26.5%, 8%, and 22%, respectively, compared to what DP alone can achieve at scale.
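
    As a rough illustration of the hybrid strategy described in this abstract, the PyTorch sketch below splits a hypothetical two-stage model across two GPUs inside each data-parallel worker (model parallelism) while DistributedDataParallel all-reduces gradients across workers (data parallelism). The layer sizes, the two-GPUs-per-worker mapping, and the torchrun-style launch are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    class TwoStageModel(nn.Module):
        """Hypothetical model whose dataflow graph is split across two GPUs (model parallelism)."""
        def __init__(self, dev0, dev1):
            super().__init__()
            self.dev0, self.dev1 = dev0, dev1
            self.stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to(dev0)
            self.stage1 = nn.Linear(4096, 10).to(dev1)

        def forward(self, x):
            h = self.stage0(x.to(self.dev0))
            return self.stage1(h.to(self.dev1))   # activations hop between GPUs inside one worker

    def main():
        # One process per data-parallel worker; a torchrun launch supplies the env:// settings.
        dist.init_process_group("nccl")
        rank = dist.get_rank()
        # Assumes 2 GPUs per worker on a single 8-GPU node (ranks 0..3 -> devices 0..7).
        dev0, dev1 = f"cuda:{2 * rank}", f"cuda:{2 * rank + 1}"
        # device_ids is omitted because the wrapped module already spans multiple devices;
        # DDP still all-reduces gradients across the data-parallel workers.
        model = DDP(TwoStageModel(dev0, dev1))
        opt = torch.optim.SGD(model.parameters(), lr=0.01)
        for _ in range(10):                        # toy loop with synthetic data
            x = torch.randn(32, 1024)
            y = torch.randint(0, 10, (32,), device=dev1)
            opt.zero_grad()
            loss = nn.functional.cross_entropy(model(x), y)
            loss.backward()
            opt.step()

    if __name__ == "__main__":
        main()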

    Efficient Asynchronous GCN Training on a GPU Cluster

    A common assumption in traditional synchronous parallel training of Graph Convolutional Networks (GCNs) on multiple GPUs is that the load is perfectly balanced among all GPUs. However, this assumption does not hold in real-world scenarios, where workloads can be imbalanced across GPUs for various reasons. In a synchronous parallel implementation, a single straggler can limit the overall speed-up of parallel training. To address these issues, this research investigates approaches to asynchronous decentralized parallel training of GCNs. The techniques investigated are based on graph clustering and gossiping. Specifically, the research adapts the approach of Cluster-GCN, which uses graph partitioning for SGD-based training, and combines it with a novel gossip algorithm designed for a GPU cluster that periodically exchanges gradients among randomly chosen partners. In addition, it incorporates a work-pool mechanism for load balancing among GPUs. The gossip algorithm is proven to be deadlock-free. The implementation runs on a GPU cluster with 8 Tesla V100 GPUs per compute node, using PyTorch and DGL as the software platforms. Experiments are conducted on different benchmark datasets. The results demonstrate superior performance, at the cost of a minor accuracy loss in some runs, compared to traditional synchronous training, which uses all-reduce to synchronously accumulate parallel training results.
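
    A minimal sketch of the periodic gossip exchange idea, using PyTorch's broadcast and point-to-point primitives: the random pairing scheme, the 0.5 mixing weight, and the exchange period below are illustrative assumptions, not the deadlock-free algorithm or work-pool mechanism developed in the thesis.

    import torch
    import torch.distributed as dist

    def gossip_exchange(grads, step, period=10):
        """Periodically average flattened gradients with one randomly chosen partner.

        Simplified decentralized gossip sketch; with the NCCL backend the tensors
        should live on each worker's GPU.
        """
        if step % period != 0:
            return
        world, rank = dist.get_world_size(), dist.get_rank()

        # Rank 0 draws a random pairing of workers and broadcasts it, so every worker
        # agrees on who talks to whom this round and no send is left unmatched.
        perm = torch.randperm(world) if rank == 0 else torch.empty(world, dtype=torch.int64)
        dist.broadcast(perm, src=0)
        perm = perm.tolist()
        pos = perm.index(rank)
        if pos == world - 1 and world % 2 == 1:
            return                                   # odd worker out skips this round
        partner = perm[pos + 1] if pos % 2 == 0 else perm[pos - 1]

        flat = torch.cat([g.reshape(-1) for g in grads])
        recv = torch.empty_like(flat)
        reqs = dist.batch_isend_irecv([dist.P2POp(dist.isend, flat, partner),
                                       dist.P2POp(dist.irecv, recv, partner)])
        for req in reqs:
            req.wait()

        mixed = 0.5 * (flat + recv)                  # pairwise averaging of the two partners
        offset = 0
        for g in grads:
            g.copy_(mixed[offset:offset + g.numel()].view_as(g))
            offset += g.numel()

    In a training loop this would be called as gossip_exchange([p.grad for p in model.parameters()], step) after backward(), so that each worker mixes its gradients with a different random partner every few steps instead of waiting on a global all-reduce.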

    Deep Learning-based Speech Enhancement for Real-life Applications

    Speech enhancement is the process of improving speech quality and intelligibility by suppressing noise. Inspired by the outstanding performance of deep learning approaches to speech enhancement, this thesis aims to add to this research area through the following contributions. The thesis presents an experimental analysis of different deep neural networks for speech enhancement, comparing their performance and investigating the factors and approaches that improve it. The outcomes of this analysis facilitate the development of better speech enhancement networks in this work. Moreover, this thesis proposes a new deep convolutional denoising autoencoder-based speech enhancement architecture, in which strided and dilated convolutions are applied to improve performance while keeping network complexity to a minimum. Furthermore, a two-stage speech enhancement approach is proposed that reduces distortion by performing a speech denoising first stage in the frequency domain, followed by a speech reconstruction second stage in the time domain. This approach is shown to reduce speech distortion, leading to better overall quality of the processed speech in comparison to state-of-the-art speech enhancement models. Finally, the work presents two deep neural network speech enhancement architectures for hearing aids and automatic speech recognition, as two real-world speech enhancement applications. A smart speech enhancement architecture is proposed for hearing aids, as an integrated hearing aid and alert system. This architecture enhances both speech and important emergency sounds, and eliminates only undesired noise. The results show that this idea is applicable to improving the performance of hearing aids. On the other hand, the architecture proposed for automatic speech recognition addresses the mismatch between speech enhancement and automatic speech recognition systems, leading to a significant reduction in the word error rate of a baseline automatic speech recognition system provided by Intelligent Voice for research purposes. In conclusion, the results presented in this thesis show promising performance of the proposed architectures for real-time speech enhancement applications.
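
    As a loose illustration of the strided/dilated convolutional denoising autoencoder idea mentioned above, the PyTorch sketch below uses illustrative layer sizes and a toy spectrogram shape; it is not the architecture proposed in the thesis. Strided convolutions downsample along time in the encoder, while dilated convolutions widen the temporal receptive field in the bottleneck without adding many parameters.

    import torch
    import torch.nn as nn

    class ConvDenoisingAutoencoder(nn.Module):
        """Minimal sketch: denoise magnitude spectrograms of shape (batch, freq_bins, frames)."""
        def __init__(self, n_freq_bins=257):
            super().__init__()
            self.encoder = nn.Sequential(             # strided convs downsample along time
                nn.Conv1d(n_freq_bins, 128, kernel_size=5, stride=2, padding=2), nn.ReLU(),
                nn.Conv1d(128, 64, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            )
            self.bottleneck = nn.Sequential(           # dilated convs enlarge the receptive field
                nn.Conv1d(64, 64, kernel_size=3, dilation=2, padding=2), nn.ReLU(),
                nn.Conv1d(64, 64, kernel_size=3, dilation=4, padding=4), nn.ReLU(),
            )
            self.decoder = nn.Sequential(              # transposed convs undo the striding
                nn.ConvTranspose1d(64, 128, kernel_size=5, stride=2, padding=2, output_padding=1), nn.ReLU(),
                nn.ConvTranspose1d(128, n_freq_bins, kernel_size=5, stride=2, padding=2, output_padding=1),
            )

        def forward(self, noisy_spec):
            return self.decoder(self.bottleneck(self.encoder(noisy_spec)))

    # Denoising objective: reconstruct the clean spectrogram from the noisy input.
    model = ConvDenoisingAutoencoder()
    noisy = torch.randn(4, 257, 100)                   # toy batch: 4 utterances, 100 STFT frames
    clean = torch.randn(4, 257, 100)
    loss = nn.functional.mse_loss(model(noisy), clean)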