2 research outputs found

    Performance analysis of an HPC cluster and manycore and multicore architectures

    Currently, the trend for obtaining large amounts of computing power is parallel computing; a clear example is that the fastest computers in the world are high-performance computing (HPC) clusters built from accelerators and multiprocessors. HPC clusters provide the computational capacity needed for demanding workloads in areas such as weather prediction and machine learning. However, achieving optimal performance in parallel applications requires analyzing requirements such as communication load and parallelization platforms. The objective of this study is to analyze the scalability of the WRF climate prediction model on HPC clusters as a function of inter-node communication and MPI processes, and the performance of the Horizontal Diffusion algorithm on Xeon Phi and Tesla Kepler accelerators using OpenCL and CUDA. The results show that WRF scalability depends on the interconnect: the maximum speedup over sequential execution was 25.9x with InfiniBand FDR, 21.42x with InfiniBand QDR, and 6.4x with Ethernet. The Tesla K40m accelerator performs about 6 times better than the Xeon Phi, because the algorithm does not exploit Intel's vectorization efficiently and Intel's OpenCL drivers for its manycore architecture are deprecated. OpenCL has a steeper learning curve than CUDA because of its multi-platform nature; in terms of performance, CUDA is about 6% faster, since it targets NVIDIA GPUs and offers configurations that improve performance. OpenCL, on the other hand, is not greatly penalized by being multi-platform and is an option to consider when the requirements are oriented towards portability.
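    For context, a minimal sketch of the kind of horizontal-diffusion stencil kernel that could be benchmarked on a Tesla K40m is shown below. It is written in CUDA; the names (hdiff_kernel, NX, NY, the coefficient kh) and the 5-point stencil itself are illustrative assumptions, not the thesis's actual code or tuning.

    // Hypothetical sketch of a horizontal-diffusion-style 5-point stencil in CUDA.
    // Kernel name, grid sizes, and coefficient are illustrative assumptions.
    #include <cstdio>
    #include <cuda_runtime.h>

    #define NX 433
    #define NY 308

    __global__ void hdiff_kernel(const float* in, float* out, float kh) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i > 0 && i < NX - 1 && j > 0 && j < NY - 1) {
            int idx = j * NX + i;
            // 5-point horizontal Laplacian weighted by the diffusion coefficient.
            float lap = in[idx - 1] + in[idx + 1] + in[idx - NX] + in[idx + NX]
                        - 4.0f * in[idx];
            out[idx] = in[idx] + kh * lap;
        }
    }

    int main() {
        size_t bytes = (size_t)NX * NY * sizeof(float);
        float *d_in, *d_out;
        cudaMalloc(&d_in, bytes);
        cudaMalloc(&d_out, bytes);
        cudaMemset(d_in, 0, bytes);

        dim3 block(16, 16);
        dim3 grid((NX + block.x - 1) / block.x, (NY + block.y - 1) / block.y);
        hdiff_kernel<<<grid, block>>>(d_in, d_out, 0.1f);
        cudaDeviceSynchronize();
        printf("kernel finished: %s\n", cudaGetErrorString(cudaGetLastError()));

        cudaFree(d_in);
        cudaFree(d_out);
        return 0;
    }

    The same stencil can be expressed as an OpenCL kernel with an equivalent NDRange launch, which is the kind of side-by-side setup on which a CUDA-versus-OpenCL performance difference would be measured.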

    Multi-GPU implementation of the horizontal diffusion method of the weather research and forecast model

    The Weather Research and Forecasting (WRF) model, a next-generation mesoscale numerical weather prediction system, has received a considerable amount of work regarding GPU acceleration. However, the number of works exploiting multi-GPU systems is limited. This work is an effort to apply GPU computing to the WRF model and focuses on a computationally intensive portion of WRF: the Horizontal Diffusion method. In particular, it presents the enhancements that enable a single-GPU implementation to exploit the parallelism of multi-GPU systems. The performance of the multi-GPU and single-GPU implementations is compared on a computational domain of 433x308 horizontal grid points with 35 vertical levels, and the resulting speedup of the kernel is 3.5x relative to one GPU. The experiments were carried out on a multi-core computer with two NVIDIA Tesla K40m GPUs.
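    As a rough illustration of the multi-GPU strategy described above, the sketch below splits the horizontal domain along one axis across the available GPUs. The partitioning scheme, kernel name, and the omission of halo exchange are assumptions made for illustration only; they are not the paper's implementation.

    // Hedged sketch: distributing a horizontal-diffusion-style stencil across
    // multiple GPUs by slicing the domain along the y-axis. Illustrative only.
    #include <cstdio>
    #include <cuda_runtime.h>

    #define NX 433
    #define NY 308

    __global__ void hdiff_slice(const float* in, float* out, int nx, int ny, float kh) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i > 0 && i < nx - 1 && j > 0 && j < ny - 1) {
            int idx = j * nx + i;
            float lap = in[idx - 1] + in[idx + 1] + in[idx - nx] + in[idx + nx]
                        - 4.0f * in[idx];
            out[idx] = in[idx] + kh * lap;
        }
    }

    int main() {
        int ngpus = 0;
        cudaGetDeviceCount(&ngpus);
        if (ngpus < 1) return 1;

        // Split the y-extent evenly across the available GPUs; a real
        // implementation would also exchange halo rows between slices.
        int rows_per_gpu = (NY + ngpus - 1) / ngpus;
        for (int dev = 0; dev < ngpus; ++dev) {
            cudaSetDevice(dev);
            int ny = (dev == ngpus - 1) ? NY - dev * rows_per_gpu : rows_per_gpu;
            size_t bytes = (size_t)NX * ny * sizeof(float);
            float *d_in, *d_out;
            cudaMalloc(&d_in, bytes);
            cudaMalloc(&d_out, bytes);
            cudaMemset(d_in, 0, bytes);

            dim3 block(16, 16);
            dim3 grid((NX + 15) / 16, (ny + 15) / 16);
            hdiff_slice<<<grid, block>>>(d_in, d_out, NX, ny, 0.1f);
            cudaDeviceSynchronize();
            cudaFree(d_in);
            cudaFree(d_out);
        }
        printf("processed %d GPU slice(s)\n", ngpus);
        return 0;
    }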