
    Multi-Tenant Virtual GPUs for Optimising Performance of a Financial Risk Application

    Graphics Processing Units (GPUs) are becoming popular accelerators in modern High-Performance Computing (HPC) clusters. Installing GPUs on every node of a cluster is inefficient, resulting in high costs and power consumption as well as underutilisation of the accelerators. The research reported in this paper is motivated by using a few physical GPUs and providing cluster nodes with on-demand access to remote GPUs for a financial risk application. We hypothesise that sharing GPUs between several nodes, referred to as multi-tenancy, reduces the execution time and energy consumed by an application. Two data transfer modes between the CPU and the GPUs, namely concurrent and sequential, are explored. The key result from the experiments is that multi-tenancy with a few physical GPUs using sequential data transfers lowers both the execution time and the energy consumed, thereby improving the overall performance of the application.
    Comment: Accepted to the Journal of Parallel and Distributed Computing (JPDC), 10 June 201
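The two data transfer modes the abstract contrasts can be sketched schematically. This is a toy model only, not the paper's implementation: `gpu_copy` is a hypothetical stand-in for a host-to-device copy, and the thread pool merely mimics tenants issuing transfers at the same time.

```python
# Schematic contrast of the two CPU-to-GPU transfer modes: sequential
# (tenants copy one after another) vs concurrent (tenants issue copies
# simultaneously and contend for the link to the shared remote GPU).
from concurrent.futures import ThreadPoolExecutor

def gpu_copy(chunk):
    # Illustrative stand-in for a host-to-device copy; "processes" the data.
    return sum(chunk)

def sequential_transfers(chunks):
    # Sequential mode: one tenant's transfer completes before the next begins.
    return [gpu_copy(c) for c in chunks]

def concurrent_transfers(chunks, workers=4):
    # Concurrent mode: all tenants issue their transfers at once.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(gpu_copy, chunks))

tenant_data = [[1, 2], [3, 4], [5, 6]]
assert sequential_transfers(tenant_data) == concurrent_transfers(tenant_data)
```

Both modes move the same data; the paper's finding concerns which scheduling of the shared link is cheaper in time and energy.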

    Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation

    TensorFlow has been the most widely adopted Machine/Deep Learning framework. However, little exists in the literature that provides a thorough understanding of the capabilities which TensorFlow offers for the distributed training of large ML/DL models that require computation and communication at scale. The most commonly used distributed training approaches for TF can be categorized as follows: 1) Google Remote Procedure Call (gRPC), 2) gRPC+X: X=(InfiniBand Verbs, Message Passing Interface, and GPUDirect RDMA), and 3) No-gRPC: Baidu Allreduce with MPI, Horovod with MPI, and Horovod with NVIDIA NCCL. In this paper, we provide an in-depth performance characterization and analysis of these distributed training approaches on various GPU clusters, including the Piz Daint system (No. 6 on Top500). We perform experiments to gain novel insights along the following vectors: 1) application-level scalability of DNN training, 2) effect of batch size on scaling efficiency, 3) impact of the MPI library used for No-gRPC approaches, and 4) type and size of DNN architectures. Based on these experiments, we present two key insights: 1) overall, No-gRPC designs achieve better performance than gRPC-based approaches for most configurations, and 2) the performance of No-gRPC is heavily influenced by the gradient aggregation using Allreduce. Finally, we propose a truly CUDA-Aware MPI Allreduce design that exploits CUDA kernels and pointer caching to perform large reductions efficiently. Our proposed designs offer 5-17X better performance than NCCL2 for small and medium messages, and reduce latency by 29% for large messages. The proposed optimizations help Horovod-MPI achieve approximately 90% scaling efficiency for ResNet-50 training on 64 GPUs. Further, Horovod-MPI achieves 1.8X and 3.2X higher throughput than the native gRPC method for ResNet-50 and MobileNet, respectively, on the Piz Daint cluster.
    Comment: 10 pages, 9 figures, submitted to IEEE IPDPS 2019 for peer-review
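The gradient-aggregation step the abstract identifies as the dominant cost of No-gRPC designs is an Allreduce: every worker contributes a gradient vector and every worker ends up with the element-wise sum. The bandwidth-optimal ring algorithm (the basis of Baidu Allreduce and NCCL) can be sketched in pure Python. This is an illustrative toy, not the paper's CUDA-Aware MPI design:

```python
# Toy ring allreduce: n workers, vector split into n segments.
# Phase 1 (scatter-reduce): after n-1 steps, worker r holds the fully
# reduced segment (r + 1) % n. Phase 2 (allgather): completed segments
# circulate for n-1 more steps until every worker has the full sum.

def ring_allreduce(worker_grads):
    n = len(worker_grads)
    size = len(worker_grads[0])
    chunks = [list(g) for g in worker_grads]       # each worker's buffer
    bounds = [(r * size) // n for r in range(n + 1)]
    seg = lambda s: range(bounds[s], bounds[s + 1])

    # Scatter-reduce: at each step, worker r sends segment (r - step) % n
    # to its ring neighbour, which accumulates it.
    for step in range(n - 1):
        for r in range(n):
            s = (r - step) % n
            for i in seg(s):
                chunks[(r + 1) % n][i] += chunks[r][i]

    # Allgather: worker r forwards its completed segment around the ring.
    for step in range(n - 1):
        for r in range(n):
            s = (r + 1 - step) % n
            for i in seg(s):
                chunks[(r + 1) % n][i] = chunks[r][i]
    return chunks

grads = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0, 3.0]]
assert ring_allreduce(grads) == [[6.0, 6.0, 6.0]] * 3
```

Each worker sends and receives only 2(n-1)/n of the vector in total, which is why the efficiency of this step (and of the underlying MPI/NCCL implementation) dominates scaling, as the paper's experiments show.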

    Neonatal Diagnostics: Toward Dynamic Growth Charts of Neuromotor Control

    © 2016 Torres, Smith, Mistry, Brincker and Whyatt. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY).
    The current rise of neurodevelopmental disorders poses a critical need to detect risk early in order to intervene rapidly. One of the tools pediatricians use to track development is the standard growth chart. Growth charts are somewhat limited in predicting possible neurodevelopmental issues: they rely on linear models and assumptions of normality for physical growth data, obscuring key statistical information about possible neurodevelopmental risk in growth data that actually has accelerated, non-linear rates of change and variability encompassing skewed distributions. Here, we use new analytics to profile growth data from 36 newborn babies who were tracked longitudinally for 5 months. By switching to incremental (velocity-based) growth charts and combining these dynamic changes with underlying fluctuations in motor performance – as the transition from spontaneous random noise to a systematic signal – we demonstrate a method to detect very early stunting in the development of voluntary neuromotor control and to flag risk of neurodevelopmental derailment.
    Peer reviewed. Final Published version
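The switch the abstract describes, from cumulative growth curves to incremental (velocity-based) ones, amounts to differencing longitudinal measurements over the time between visits. A minimal stdlib-only sketch; the sample weights below are made up for illustration:

```python
# Convert a longitudinal series of (day, value) measurements into
# per-day rates of change between consecutive visits - the incremental
# (velocity-based) view of growth rather than the cumulative one.

def growth_velocity(measurements):
    rates = []
    for (d0, v0), (d1, v1) in zip(measurements, measurements[1:]):
        rates.append((v1 - v0) / (d1 - d0))  # units per day over the interval
    return rates

weights_kg = [(0, 3.2), (30, 4.1), (60, 5.0), (150, 6.8)]
velocities = growth_velocity(weights_kg)  # one rate per between-visit interval
```

On such velocity series, non-linear accelerations and skewed variability become visible as changes in the rates themselves, which the cumulative chart smooths over.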

    Hand classification of fMRI ICA noise components

    We present a practical "how-to" guide to help determine whether single-subject fMRI independent components (ICs) characterise structured noise or not. Manual identification of signal and noise after ICA decomposition is required for efficient data denoising: to train supervised algorithms, to check the results of unsupervised ones, or to manually clean the data. In this paper we describe the main spatial and temporal features of ICs and provide general guidelines on how to evaluate them. Examples of signal and noise components are provided from a wide range of datasets (3T data, including examples from the UK Biobank and the Human Connectome Project, and 7T data), together with practical guidelines for their identification. Finally, we discuss how the data quality, data type and preprocessing can influence the characteristics of the ICs, and present examples of particularly challenging datasets.
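One temporal feature commonly used when hand-classifying ICs is the shape of the component's power spectrum: noise components often carry most of their power at high frequencies, while signal of interest is dominated by low frequencies. A toy illustration of that single heuristic (not the paper's full guidelines; the naive DFT and the 0.5 threshold are illustrative choices only):

```python
# Flag a component timeseries as "likely noise" when most of its
# non-DC spectral power sits in the upper half of the frequency range.
import math

def high_freq_fraction(ts):
    """Fraction of non-DC spectral power in the upper half of the
    spectrum, computed with a naive O(n^2) DFT (toy data only)."""
    n = len(ts)
    mean = sum(ts) / n
    centred = [x - mean for x in ts]
    power = []
    for k in range(1, n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * t / n) for t, x in enumerate(centred))
        im = sum(x * math.sin(2 * math.pi * k * t / n) for t, x in enumerate(centred))
        power.append(re * re + im * im)
    total = sum(power)
    return sum(power[len(power) // 2:]) / total if total else 0.0

def looks_like_noise(ts, threshold=0.5):
    return high_freq_fraction(ts) > threshold

slow = [math.sin(2 * math.pi * 2 * t / 64) for t in range(64)]   # low-freq "signal"
fast = [math.sin(2 * math.pi * 28 * t / 64) for t in range(64)]  # high-freq "noise"
assert not looks_like_noise(slow) and looks_like_noise(fast)
```

In practice this temporal cue is weighed together with the spatial features (edge/vein patterns, ring artefacts, white-matter or CSF overlap) that the guide describes.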