10 research outputs found

    Improving Memory Hierarchy Utilisation for Stencil Computations on Multicore Machines

    Although modern supercomputers are composed of multicore machines, many scientists still run legacy applications that were developed for monocore clusters, where the memory hierarchy is dedicated to a single core. The main objective of this paper is to propose and evaluate an algorithm that identifies an efficient block size for MPI stencil computations on multicore machines. In the light of an extensive experimental analysis, this work shows the benefits of identifying block sizes that divide the data across the various cores, and suggests a methodology that exploits the memory hierarchy available in modern machines.
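    As a rough illustration of the kind of block-size heuristic the paper evaluates, the sketch below (in C) picks a tile edge for a 3D Jacobi-style stencil so that the tile's working set fits in one core's share of the last-level cache. The cache size, the per-core split, and the three-plane working-set estimate are our illustrative assumptions, not the paper's actual algorithm.

#include <stdio.h>
#include <math.h>

/* Pick a square block edge so that ~3 planes of doubles
 * (z-1, z, z+1 for a 7-point 3D stencil) fit in the cache
 * share available to a single core. */
static size_t pick_block_edge(size_t cache_per_core_bytes)
{
    size_t doubles = cache_per_core_bytes / sizeof(double);
    return (size_t)sqrt((double)doubles / 3.0);
}

int main(void)
{
    /* Hypothetical machine: 32 MiB LLC shared by 16 cores. */
    size_t per_core = (32UL << 20) / 16;
    printf("suggested block edge: %zu\n", pick_block_edge(per_core));
    return 0;
}

    Compiled with "cc -lm", this prints a block edge near 295 for the assumed 2 MiB per-core share; the paper's methodology instead derives the value from an experimental exploration of the memory hierarchy.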

    EPSILOD: efficient parallel skeleton for generic iterative stencil computations in distributed GPUs

    Iterative stencil computations are widely used in numerical simulations. They present a high degree of parallelism, high locality, and mostly-coalesced memory access patterns. Therefore, GPUs are good candidates to speed up their computation. However, developing stencil programs that can work with huge grids on distributed systems with multiple GPUs is not straightforward, since it requires solving problems related to the partition of the grid across nodes and devices, and to the synchronization and data movement across remote GPUs. In this work, we present EPSILOD, a high-productivity parallel programming skeleton for iterative stencil computations on distributed multi-GPU systems, of the same or different vendors, that supports any type of n-dimensional geometric stencil of any order. It uses an abstract specification of the stencil pattern (neighbors and weights) to internally derive the data partition, synchronizations, and communications. Computation is split to better overlap with communications. This paper describes the underlying architecture of EPSILOD and its main components, and presents an experimental evaluation to show the benefits of our approach, including a comparison with another state-of-the-art solution. The experimental results show that EPSILOD is faster and shows good strong and weak scalability on platforms with both homogeneous and heterogeneous types of GPUs.

    Funding: Junta de Castilla y León, Ministerio de Economía, Industria y Competitividad, and the European Regional Development Fund (FEDER): project PCAS (TIN2017-88614-R) and project PROPHET-2 (VA226P20). Ministerio de Ciencia e Innovación, Agencia Estatal de Investigación, and European Union NextGenerationEU/PRTR (MCIN/AEI/10.13039/501100011033), grant TED2021-130367B-I00. CTE-POWER and Minotauro machines and technical support provided by the Barcelona Supercomputing Center (RES-IM-2021-2-0005, RES-IM-2021-3-0024, RES-IM-2022-1-0014). Open-access publication funded by the Consorcio de Bibliotecas Universitarias de Castilla y León (BUCLE), under Operational Programme 2014ES16RFOP009 FEDER 2014-2020 of Castilla y León, action 20007-CL - Apoyo Consorcio BUCLE.
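    To make the abstract-specification idea concrete, here is a minimal C sketch, with names of our own invention rather than EPSILOD's API, in which the skeleton sees only neighbor offsets and weights and derives from them the halo (ghost-cell) width it must exchange per dimension:

#include <stdio.h>
#include <stdlib.h>

/* One point of a 2D stencil pattern: offset plus weight. */
typedef struct { int dx, dy; double weight; } StencilPoint;

/* The halo width in a dimension is the largest absolute offset. */
static int halo_width(const StencilPoint *s, int n, int dim)
{
    int h = 0;
    for (int i = 0; i < n; ++i) {
        int off = abs(dim == 0 ? s[i].dx : s[i].dy);
        if (off > h) h = off;
    }
    return h;
}

int main(void)
{
    /* Classic 5-point Laplacian: first-order halo in x and y. */
    StencilPoint five[] = {
        {  0,  0, -4.0 }, { -1, 0, 1.0 }, { 1, 0, 1.0 },
        {  0, -1,  1.0 }, {  0, 1, 1.0 }
    };
    int n = (int)(sizeof five / sizeof five[0]);
    printf("halo: x=%d y=%d\n",
           halo_width(five, n, 0), halo_width(five, n, 1));
    return 0;
}

    In EPSILOD this kind of information additionally drives the data partition and the communication schedule across nodes and devices; the sketch only shows the first step, deriving halo extents from the pattern.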

    Tiling Optimization for Nested Loops on GPUs

    Optimizing nested loops has long been an important and widely studied topic in parallel programming. With the development of GPU architectures, the performance of these computations can be significantly boosted by the massively parallel hardware. General matrix-matrix multiplication is a typical example where executing such an algorithm on GPUs outperforms multicore CPUs. However, achieving ideal performance on GPUs usually requires substantial human effort to manage the massively parallel computing resources, so the efficient implementation of nested-loop optimizations on GPUs has become a popular topic in recent years. This dissertation presents work based on the tiling strategy that addresses three kinds of popular problems. Different kinds of computations raise different latency issues: dependencies in a computation may result in insufficient parallelism, while computations without dependencies may be degraded by intensive memory accesses. In this thesis, we tackle the challenges of each kind of problem and believe that other computations performed in nested loops can also benefit from the presented techniques.

    First, we improve a parallel approximation algorithm for the problem of scheduling jobs on parallel identical machines to minimize makespan, using a high-dimensional tiling method. The algorithm is designed and optimized for solving this kind of problem efficiently on GPUs. Because the algorithm is based on a higher-dimensional dynamic programming approach, where dimensionality refers to the number of variables in the dynamic programming equation characterizing the problem, the existing implementation suffers from the curse of dimensionality and cannot fully utilize GPU resources. We design a novel data-partitioning technique to accelerate the higher-dimensional dynamic programming component of the algorithm. Our GPU solution addresses both load imbalance and the problem of exceeding memory capacity. We present performance results that demonstrate how the proposed design improves GPU utilization and makes it possible to solve large higher-dimensional dynamic programming problems within the limited GPU memory. Experimental results show that the GPU implementation achieves up to 25X speedup over the best existing OpenMP implementation.

    Second, we focus on optimizing wavefront parallelism on GPUs. Wavefront parallelism is a well-known technique for exploiting the concurrency of applications that execute nested loops with uniform data dependencies. Recent research on such applications, which range from sequence alignment tools to partial differential equation solvers, has used GPUs to benefit from their massively parallel computing resources. Wavefront parallelism suffers from load imbalance because the available parallelism varies along the diagonals. Tiling has been introduced as a popular solution to this issue; however, the use of hyperplane tiles increases the cost of synchronization and leads to poor data locality. We present a highly optimized implementation of the wavefront parallelism technique that harnesses the GPU architecture: a balanced workload and maximum resource utilization are achieved with extremely low synchronization overhead. We design the kernel configuration to significantly reduce the minimum number of synchronizations required, and we introduce an inter-block lock to minimize the overhead of each synchronization. A sketch of the diagonal traversal that this technique parallelizes is shown below.
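    The following toy C program shows the diagonal (wavefront) traversal that such GPU kernels parallelize: for a table whose cells depend on their left and upper neighbors, all cells on one anti-diagonal are independent, so the inner loop is the parallel work one synchronization-delimited phase would perform. The recurrence is a placeholder of our own, not one of the evaluated applications.

#include <stdio.h>
#define N 8

int main(void)
{
    static int t[N][N];
    for (int i = 0; i < N; ++i) t[i][0] = i;   /* boundary column */
    for (int j = 0; j < N; ++j) t[0][j] = j;   /* boundary row    */

    for (int d = 2; d <= 2 * (N - 1); ++d) {   /* one diagonal per phase */
        for (int i = 1; i < N; ++i) {          /* independent cells      */
            int j = d - i;
            if (j < 1 || j >= N) continue;
            int m = t[i-1][j] < t[i][j-1] ? t[i-1][j] : t[i][j-1];
            t[i][j] = m + 1;                   /* toy recurrence */
        }
    }
    printf("t[%d][%d] = %d\n", N - 1, N - 1, t[N-1][N-1]);
    return 0;
}

    The load-imbalance problem is visible here: early and late diagonals contain few cells, so a naive mapping leaves most GPU threads idle, which is what the tiled kernel configuration and inter-block lock described above are designed to mitigate.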
    We evaluate the performance of our proposed technique on four different applications: Sequence Alignment, Edit Distance, Summed-Area Table, and 2DSOR. The performance results demonstrate that our method achieves speedups of up to six times over the previous best-known hyperplane-tiling-based GPU implementation.

    Finally, we extend hyperplane tiling to high-order 2D stencil computations. Unlike wavefront parallelism, which has dependences in the spatial dimensions, stencil computations carry dependences only across two adjacent time steps along the temporal dimension. Although this property significantly increases the parallelism available in the spatial dimensions, full parallelism may not be efficient on GPUs: due to the limited cache capacity of each streaming multiprocessor, full parallelism can be obtained only through global memory, which has high access latency. Tiling can therefore be applied to improve memory efficiency by caching small tiled blocks. Because the widely studied tiling methods, such as overlapped tiling and split tiling, incur considerable computation overhead from load imbalance or extra operations, we propose a time-skewed tiling method designed for the GPU architecture. We work around the serialized-computation issue and coordinate intra-tile and inter-tile parallelism to minimize the load imbalance caused by pipelined processing. Moreover, our development addresses high-order stencil computations, which have not been comprehensively studied. The proposed method achieves up to 3.5X performance improvement when the stencil computation is performed on a Moore neighborhood pattern.
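    As a hedged illustration of time skewing, the 1D C sketch below fuses T steps of a 3-point stencil inside parallelogram-shaped tiles: processing tiles left to right satisfies the dependence of a[t+1][i] on a[t][i-1..i+1], so several time steps of a tile can be computed while its data stays in fast memory. For clarity every time level is kept in memory; the dissertation's GPU method is far more elaborate (2D, high-order, pipelined across thread blocks).

#include <stdio.h>
#define N 16
#define T 4
#define B 8   /* spatial tile base width */

int main(void)
{
    static double a[T + 1][N];
    for (int i = 0; i < N; ++i) a[0][i] = (double)i;

    for (int k = 0; k * B < N + T; ++k)      /* tiles, left to right */
        for (int t = 0; t < T; ++t)          /* fused time steps */
            for (int i = k*B - t; i < (k+1)*B - t; ++i) {
                if (i < 1 || i >= N - 1) {   /* skew runs past the edges */
                    if (i >= 0 && i < N) a[t+1][i] = a[t][i];
                    continue;
                }
                a[t+1][i] = (a[t][i-1] + a[t][i] + a[t][i+1]) / 3.0;
            }

    printf("a[%d][%d] = %f\n", T, N / 2, a[T][N / 2]);
    return 0;
}

    Shifting each tile one cell left per time step is what makes the sequential tile order legal; on a GPU the same shape instead creates a pipeline between neighboring tiles, which is the source of the load imbalance the proposed method coordinates.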

    Dynamic load distribution and data redistribution in parallel applications

    Nowadays, high performance computing is the technique used to solve large problems in many research areas (science, engineering, etc.), owing to the performance increase that supercomputers provide. Heterogeneous computing adapts a system to a wider range of applications by integrating components of different natures into the computing system, taking advantage of the hardware resources of each device. Supercomputers with heterogeneous architectures are currently among the most powerful in the world. Tiling is a method used to improve the performance of parallel systems that consists of splitting the data space of a problem among the processes. To balance the execution time of each process, and thereby improve the total execution time of the program, load balancing can be applied: an irregular tiling in which the size assigned to each process depends on its computational capacity. The capacity can be estimated before the execution of the program or at runtime. Adaptive load balancing allows the load to be re-estimated and the distribution of work among the processes to be modified throughout the execution of a program, but this is a task the programmer must implement for each particular application: there is no standard function for dynamic, adaptive load balancing in parallel applications. The Hitmap library provides tools to manage the tiling and mapping of arrays in a simple and efficient way in an SPMD parallel model. It features several partition types and isolates the communication from the data partitioning, automatically adapting its functions to the chosen partition type through the use of abstractions. This work proposes a dynamic, adaptive, and transparent load-balancing function for parallel computing environments built on Hitmap's resources. The experimental results show that using it improves program performance, reducing the total execution time compared to an equal distribution of the load among the processes, without incurring a time or resource overhead.
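    A minimal C sketch of the capacity-proportional partitioning idea: each process receives a share of the n rows proportional to its estimated speed. The interface is hypothetical (Hitmap's real API differs); in the adaptive scheme the speeds would be re-estimated from timings of previous iterations and the partition recomputed.

#include <stdio.h>

/* Split n rows among nprocs processes proportionally to speed[]. */
static void partition(int n, int nprocs, const double *speed, int *count)
{
    double total = 0.0;
    for (int p = 0; p < nprocs; ++p) total += speed[p];

    int assigned = 0;
    for (int p = 0; p < nprocs; ++p) {
        count[p] = (int)(n * speed[p] / total);
        assigned += count[p];
    }
    count[nprocs - 1] += n - assigned;  /* hand the remainder to one rank */
}

int main(void)
{
    double speed[4] = { 1.0, 1.0, 2.0, 4.0 };  /* e.g. rows/s per process */
    int count[4];
    partition(1000, 4, speed, count);
    for (int p = 0; p < 4; ++p)
        printf("rank %d: %d rows\n", p, count[p]);
    return 0;
}

    With the speeds above, the fastest process gets half of the rows (500) and the slowest ones 125 each, which equalizes per-process execution time under the stated capacity estimates.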

    Lattice Quantum Chromodynamics on Intel Xeon Phi based supercomputers

    Preface: The aim of this master's thesis project was to expand the QPhiX library for twisted-mass fermions with and without clover term. To this end, I continued work initiated by Mario Schröck et al. [63]. In writing this thesis, I was following two main goals. Firstly, I wanted to stress the intricate interplay of the four pillars of High Performance Computing: algorithms, hardware, software, and performance evaluation. Surely, algorithmic development is utterly important in scientific computing, in particular in LQCD, where it even outweighed the improvements made in hardware architecture in the last decade (cf. the section about computational costs of LQCD). It is strongly influenced by the available hardware (think of the advent of parallel algorithms), but in turn also influenced the design of hardware itself; the IBM BlueGene series is only one of many examples in LQCD. Furthermore, there will be no benefit from the best algorithms if one cannot implement the ideas in correct, performant, user-friendly, readable, and maintainable (sometimes over several decades) software code. But again, truly outstanding HPC software cannot be written without a profound knowledge of its target hardware. Lastly, an HPC software architect and computational scientist has to be able to evaluate and benchmark the performance of a software program in the often very heterogeneous environment of supercomputers with multiple software and hardware layers. My second goal in writing this thesis was to produce a self-contained introduction to the computational aspects of LQCD and, in particular, to the features of QPhiX, so that the reader would be able to compile, read, and understand the code of one truly amazing pearl of HPC [40]. It is a pleasure to thank S. Cozzini, R. Frezzotti, E. Gregory, B. Joó, B. Kostrzewa, S. Krieg, T. Luu, G. Martinelli, R. Percacci, S. Simula, M. Ueding, C. Urbach, M. Werner, the Intel company for providing me with a copy of [55], and the Jülich Supercomputing Center for granting me access to their KNL test cluster DEE