Multi-Tenant Virtual GPUs for Optimising Performance of a Financial Risk Application
Graphics Processing Units (GPUs) are becoming popular accelerators in modern
High-Performance Computing (HPC) clusters. Installing GPUs on each node of the
cluster is inefficient, resulting in high costs and power consumption as well
as underutilisation of the accelerator. The research reported in this paper
aims to use few physical GPUs by giving cluster nodes on-demand access to
remote GPUs for a financial risk application. We
hypothesise that sharing GPUs between several nodes, referred to as
multi-tenancy, reduces the execution time and energy consumed by an
application. Two data transfer modes between the CPU and the GPUs, namely
concurrent and sequential, are explored. The key result from the experiments is
that multi-tenancy with few physical GPUs using sequential data transfers
lowers the execution time and the energy consumed, thereby improving the
overall performance of the application.
Comment: Accepted to the Journal of Parallel and Distributed Computing (JPDC),
10 June 201
Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation
TensorFlow has been the most widely adopted Machine/Deep Learning framework.
However, little exists in the literature that provides a thorough understanding
of the capabilities which TensorFlow offers for the distributed training of
large ML/DL models that need computation and communication at scale. The most
commonly used distributed training approaches for TF can be categorized as
follows: 1) Google Remote Procedure Call (gRPC), 2) gRPC+X: X=(InfiniBand
Verbs, Message Passing Interface, and GPUDirect RDMA), and 3) No-gRPC: Baidu
Allreduce with MPI, Horovod with MPI, and Horovod with NVIDIA NCCL. In this
paper, we provide an in-depth performance characterization and analysis of
these distributed training approaches on various GPU clusters including the Piz
Daint system (#6 on Top500). We perform experiments to gain novel insights along
the following vectors: 1) Application-level scalability of DNN training, 2)
Effect of Batch Size on scaling efficiency, 3) Impact of the MPI library used
for no-gRPC approaches, and 4) Type and size of DNN architectures. Based on
these experiments, we present two key insights: 1) Overall, No-gRPC designs
achieve better performance compared to gRPC-based approaches for most
configurations, and 2) The performance of No-gRPC is heavily influenced by the
gradient aggregation using Allreduce. Finally, we propose a truly CUDA-Aware
MPI Allreduce design that exploits CUDA kernels and pointer caching to perform
large reductions efficiently. Our proposed designs offer 5-17X better
performance than NCCL2 for small and medium messages, and reduce latency by
29% for large messages. The proposed optimizations help Horovod-MPI to achieve
approximately 90% scaling efficiency for ResNet-50 training on 64 GPUs.
Further, Horovod-MPI achieves 1.8X and 3.2X higher throughput than the native
gRPC method for ResNet-50 and MobileNet, respectively, on the Piz Daint
cluster.
Comment: 10 pages, 9 figures, submitted to IEEE IPDPS 2019 for peer-review
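The "scaling efficiency" figure quoted in the abstract above is conventionally the measured multi-GPU throughput divided by the ideal linear-scaling throughput. The sketch below shows only that arithmetic; the function name and the throughput numbers are hypothetical and are not taken from the paper.

```python
def scaling_efficiency(throughput_n, throughput_1, n_workers):
    """Fraction of ideal linear speedup achieved on n_workers.

    throughput_n: measured throughput on n_workers (e.g. images/sec)
    throughput_1: measured throughput on a single worker
    """
    if n_workers <= 0 or throughput_1 <= 0:
        raise ValueError("n_workers and throughput_1 must be positive")
    ideal = n_workers * throughput_1  # perfect linear scaling
    return throughput_n / ideal


# Hypothetical measurements, for illustration only.
single_gpu = 100.0   # images/sec on 1 GPU (assumed)
on_64_gpus = 5760.0  # images/sec on 64 GPUs (assumed)

eff = scaling_efficiency(on_64_gpus, single_gpu, 64)
print(f"scaling efficiency: {eff:.0%}")
```

Under this definition, an efficiency of about 90% on 64 GPUs means the aggregate throughput is roughly 0.9 × 64 times the single-GPU throughput.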
Neonatal Diagnostics: Toward Dynamic Growth Charts of Neuromotor Control
© 2016 Torres, Smith, Mistry, Brincker and Whyatt. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY).
The current rise of neurodevelopmental disorders poses a critical need to detect risk early in order to rapidly intervene. One of the tools pediatricians use to track development is the standard growth chart. The growth charts are somewhat limited in predicting possible neurodevelopmental issues. They rely on linear models and assumptions of normality for physical growth data, obscuring key statistical information about possible neurodevelopmental risk in growth data that actually has accelerated, non-linear rates-of-change and variability encompassing skewed distributions. Here, we use new analytics to profile growth data from 36 newborn babies that were tracked longitudinally for 5 months. By switching to incremental (velocity-based) growth charts and combining these dynamic changes with underlying fluctuations in motor performance (as the transition from spontaneous random noise to a systematic signal), we demonstrate a method to detect very early stunting in the development of voluntary neuromotor control and to flag risk of neurodevelopmental derail.
Peer reviewed
Final Published version
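As a rough illustration of what an incremental (velocity-based) view of growth data means, the sketch below converts longitudinal measurements into per-day rates of change between consecutive visits. It covers only that one step and is not the authors' analytics; the function name and the sample values are hypothetical.

```python
def incremental_growth(ages_days, values):
    """Return per-day rates of change between consecutive visits.

    ages_days: strictly increasing ages at each visit, in days
    values:    measurement at each visit (e.g. weight in kg)
    """
    if len(ages_days) != len(values):
        raise ValueError("ages_days and values must have the same length")
    rates = []
    for i in range(1, len(values)):
        dt = ages_days[i] - ages_days[i - 1]
        if dt <= 0:
            raise ValueError("ages_days must be strictly increasing")
        rates.append((values[i] - values[i - 1]) / dt)
    return rates


# Hypothetical visits for one infant, for illustration only.
ages = [0, 30, 60, 90]          # days since birth (assumed)
weights = [3.0, 3.9, 5.1, 6.0]  # kg (assumed)

for rate in incremental_growth(ages, weights):
    print(f"{rate:.3f} kg/day")
```

Plotting these increments rather than the raw measurements exposes changes in growth rate that a cumulative chart tends to smooth over.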
Hand classification of fMRI ICA noise components
We present a practical "how-to" guide to help determine whether single-subject fMRI independent components (ICs) characterise structured noise or not. Manual identification of signal and noise after ICA decomposition is required for efficient data denoising: to train supervised algorithms, to check the results of unsupervised ones or to manually clean the data. In this paper we describe the main spatial and temporal features of ICs and provide general guidelines on how to evaluate these. Examples of signal and noise components are provided from a wide range of datasets (3T data, including examples from the UK Biobank and the Human Connectome Project, and 7T data), together with practical guidelines for their identification. Finally, we discuss how the data quality, data type and preprocessing can influence the characteristics of the ICs and present examples of particularly challenging datasets.