3,104 research outputs found

    Topology-aware GPU scheduling for learning workloads in cloud environments

    Get PDF
    Recent advances in hardware, such as systems with multiple GPUs and their availability in the cloud, are enabling deep learning in various domains including health care, autonomous vehicles, and Internet of Things. Multi-GPU systems exhibit complex connectivity among GPUs and between GPUs and CPUs. Workload schedulers must consider hardware topology and workload communication requirements in order to allocate CPU and GPU resources for optimal execution time and improved utilization in shared cloud environments. This paper presents a new topology-aware workload placement strategy to schedule deep learning jobs on multi-GPU systems. The placement strategy is evaluated with a prototype on a Power8 machine with Tesla P100 cards, showing speedups of up to ≈1.30x compared to state-of-the-art strategies; the proposed algorithm achieves this result by allocating GPUs that satisfy workload requirements while preventing interference. Additionally, a large-scale simulation shows that the proposed strategy provides higher resource utilization and performance in cloud systems.This project is supported by the IBM/BSC Technology Center for Supercomputing collaboration agreement. It has also received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 639595). It is also partially supported by the Ministry of Economy of Spain under contract TIN2015-65316-P and Generalitat de Catalunya under contract 2014SGR1051, by the ICREA Academia program, and by the BSC-CNS Severo Ochoa program (SEV-2015-0493). We thank our IBM Research colleagues Alaa Youssef and Asser Tantawi for the valuable discussions. We also thank SC17 committee member Blair Bethwaite of Monash University for his constructive feedback on the earlier drafts of this paper.Peer ReviewedPostprint (published version

    Best bang for your buck: GPU nodes for GROMACS biomolecular simulations

    Full text link
    The molecular dynamics simulation package GROMACS runs efficiently on a wide variety of hardware from commodity workstations to high performance computing clusters. Hardware features are well exploited with a combination of SIMD, multi-threading, and MPI-based SPMD/MPMD parallelism, while GPUs can be used as accelerators to compute interactions offloaded from the CPU. Here we evaluate which hardware produces trajectories with GROMACS 4.6 or 5.0 in the most economical way. We have assembled and benchmarked compute nodes with various CPU/GPU combinations to identify optimal compositions in terms of raw trajectory production rate, performance-to-price ratio, energy efficiency, and several other criteria. Though hardware prices are naturally subject to trends and fluctuations, general tendencies are clearly visible. Adding any type of GPU significantly boosts a node's simulation performance. For inexpensive consumer-class GPUs this improvement equally reflects in the performance-to-price ratio. Although memory issues in consumer-class GPUs could pass unnoticed since these cards do not support ECC memory, unreliable GPUs can be sorted out with memory checking tools. Apart from the obvious determinants for cost-efficiency like hardware expenses and raw performance, the energy consumption of a node is a major cost factor. Over the typical hardware lifetime until replacement of a few years, the costs for electrical power and cooling can become larger than the costs of the hardware itself. Taking that into account, nodes with a well-balanced ratio of CPU and consumer-class GPU resources produce the maximum amount of GROMACS trajectory over their lifetime

    RewardProfiler: A Reward Based Design Space Profiler on DVFS Enabled MPSoCs

    Get PDF
    Resource mapping on a heterogeneous multi-processor system-on-chip (MPSoC) imposes enormous challenges such as identifying important design points for appropriate resource mapping for improved efficiency or performance, time consumption of exploring all the important design points for each profiled applications, etc. Moreover, incorporating a profiler into integrated development environments (IDEs) in order to achieve more detailed and accurate profiling information? on the application being targeted during runtime such that improved efficiency or performance while executing the application is achieved, the runtime resource management decision to achieve such improved "reward" has to be utilized in a certain way. In this paper, we propose a hybrid approach of resource mapping technique on DVFS enabled MPSoC, which is suitable for IDE integration due to the reduced design points in our methodology resulting in significant reduction in profiling time. We coined our approach as "RewardProfiler" (a Reward based design space Profiler), which is well capable of reducing the design space exploration without losing most of the important design points based on our heuristic approach. In our strategy, an application has to be mapped onto the available resources in such a way so that the "reward" obtained can be maximized. Our approach can also be utilized to maximize multiple "rewards" (Multivariate Reward Maximization) while executing an application. Implementation of our RewardProfiler on the Exynos 5422 MPSoC reveals the efficacy of our proposed approach under various experimental test cases and has a potential of saving 170× more time in profiling for our chosen MPSoC compared to the state-of-the-art methodologies

    Towards Real-time Remote Processing of Laparoscopic Video

    Get PDF
    Laparoscopic surgery is a minimally invasive technique where surgeons insert a small video camera into the patient\u27s body to visualize internal organs and use small tools to perform these procedures. However, the benefit of small incisions has a disadvantage of limited visualization of subsurface tissues. Image-guided surgery (IGS) uses pre-operative and intra-operative images to map subsurface structures and can reduce the limitations of laparoscopic surgery. One particular laparoscopic system is the daVinci-si robotic surgical vision system. The video streams generate approximately 360 megabytes of data per second, demonstrating a trend toward increased data sizes in medicine, primarily due to higher-resolution video cameras and imaging equipment. Real-time processing this large stream of data on a bedside PC, single or dual node setup, may be challenging and a high-performance computing (HPC) environment is not typically available at the point of care. To process this data on remote HPC clusters at the typical 30 frames per second rate (fps), it is required that each 11.9 MB (1080p) video frame be processed by a server and returned within the time this frame is displayed or 1/30th of a second. The ability to acquire, process, and visualize data in real time is essential for the performance of complex tasks as well as minimizing risk to the patient. We have implemented and compared performance of compression, segmentation and registration algorithms on Clemson\u27s Palmetto supercomputer using dual Nvidia graphics processing units (GPUs) per node and compute unified device architecture (CUDA) programming model. We developed three separate applications that run simultaneously: video acquisition, image processing, and video display. The image processing application allows several algorithms to run simultaneously on different cluster nodes and transfer images through message passing interface (MPI). Our segmentation and registration algorithms resulted in an acceleration factor of around 2 and 8 times respectively. To achieve a higher frame rate, we also resized images and reduced the overall processing time. As a result, using high-speed network to access computing clusters with GPUs to implement these algorithms in parallel will improve surgical procedures by providing real-time medical image processing and laparoscopic data
    • 

    corecore