6 research outputs found

    Facilitating the development of stencil applications using the Heterogeneous Programming Library

    Get PDF
    [Abstract] Stencil computations are very common in scientific codes. Heterogeneous systems achieve good results solving these problems, but their programming is complex because of the ghost regions required in multi-device implementations and the difficulty to properly exploit their hardware. The Heterogeneous Programming Library (HPL) is a recent framework that improves the programmability of heterogeneous devices. This paper describes two extensions of HPL focused on stencil computations. The first one allows to automatically update the ghost regions they involve. The second one automates the implementation of the computational kernels of these algorithms. In our evaluation, the first mechanism reduces on average the number of lines of code and the Halstead programming effort of the host code of comparable HPL baselines by 34% and 64.2%, respectively, while the second contribution reduces these metrics by 72% and 79% in the computational kernels, respectively. Also, the first technique has negligible performance overheads, while the second one matches the performance of manually developed kernels. As an added benefit, the facilitation of the development of these codes thanks to these techniques helps programmers experiment with optimizations suited for this applications such as the ghost cell expansion technique, which provides speedups of up to 13% in our experiments.Ministerio de Economía y Competitividad de España; TIN2013-42148-PMinisterio de Economía y Competitividad de España; TIN2016-75845-PXunta de Galicia; ED431G/0

    Compiler-Directed Transformation for Higher-Order Stencils

    Full text link
    As the cost of data movement increasingly dominates performance, developers of finite-volume and finite-difference solutions for partial differential equations (PDEs) are exploring novel higher-order stencils that increase numerical accuracy and computational intensity. This paper describes a new compiler reordering transformation applied to stencil operators that performs partial sums in buffers, and reuses the partial sums in computing multiple results. This optimization has multiple effect son improving stencil performance that are particularly important to higher-order stencils: exploits data reuse, reduces floating-point operations, and exposes efficient SIMD parallelism to backend compilers. We study the benefit of this optimization in the context of Geometric Multigrid (GMG), a widely used method to solvePDEs, using four different Jacobi smoothers built from 7-, 13-, 27-and 125-point stencils. We quantify performance, speedup, andnumerical accuracy, and use the Roofline model to qualify our results. Ultimately, we obtain over 4× speedup on the smoothers themselves and up to a 3× speedup on the multigrid solver. Finally, we demonstrate that high-order multigrid solvers have the potential of reducing total data movement and energy by several orders of magnitude

    Doctor of Philosophy in Computer Science

    Get PDF
    dissertationStencil computations are operations on structured grids. They are frequently found in partial differential equation solvers, making their performance critical to a range of scientific applications. On modern architectures where data movement costs dominate computation, optimizing stencil computations is a challenging task. Typically, domain scientists must reduce and orchestrate data movement to tackle the memory bandwidth and latency bottlenecks. Furthermore, optimized code must map efficiently to ever increasing parallelism on a chip. This dissertation studies several stencils with varying arithmetic intensities, thus requiring contrasting optimization strategies. Stencils traditionally have low arithmetic intensity, making their performance limited by memory bandwidth. Contemporary higher-order stencils are designed to require smaller grids, hence less memory, but are bound by increased floating-point operations. This dissertation develops communication-avoiding optimizations to reduce data movement in memory-bound stencils. For higher-order stencils, a novel transformation, partial sums, is designed to reduce the number of floating-point operations and improve register reuse. These optimizations are implemented in a compiler framework, which is further extended to generate parallel code targeting multicores and graphics processor units (GPUs). The augmented compiler framework is then combined with autotuning to productively address stencil optimization challenges. Autotuning explores a search space of possible implementations of a computation to find the optimal code for an execution context. In this dissertation, autotuning is used to compose sequences of optimizations to drive the augmented compiler framework. This compiler-directed autotuning approach is used to optimize stencils in the context of a linear solver, Geometric Multigrid (GMG). GMG uses sequences of stencil computations, and presents greater optimization challenges than isolated stencils, as interactions between stencils must also be considered. The efficacy of our approach is demonstrated by comparing the performance of generated code against manually tuned code, over commercial compiler-generated code, and against analytic performance bounds. Generated code outperforms manually optimized codes on multicores and GPUs. Against Intel's compiler on multicores, generated code achieves up to 4x speedup for stencils, and 3x for the solver. On GPUs, generated code achieves 80% of an analytically computed performance bound

    Improving the programmability of heterogeneous systems by means of libraries

    Get PDF
    Programa Oficial de Doutoramento en Investigación en Tecnoloxías da Información. 524V01[Resumo] O emprego de dispositivos heteroxéneos coma co-procesadores en entornos de computación de altas prestacións (HPC) medrou ininterrompidamente nos últimos anos debido ás súas excelentes propiedades en termos de rendemento e consumo de enerx:ía. A ma.ior dispoñibilidade de sistemas HPC híbridos conlevou de forma natural a necesidade de desenrolar ferra.mentas de programación adecuadas para eles, sendo CUDA e OpenCL as máis a.mplamente empregadas na actualidade. Desafortunadamente, estas ferramentas son relativamente de baixo nivel, o cal emparellado co ma.ior número de detalles que deben de ser controlados cando se programan aceleradoras, fa.i da programación destes sistemas mediante elas, moito roáis complexa que a. programación tradicional de CPUs. Isto levou á. proposta de alternativas de roáis alto nivel para facilitar a programación de dispositivos heteroxéneos. Esta tesis contribúe neste campo presentando dúas libreríe.<i que mellora.n ampla.mente a programabilidade de sistemas heteroxéneos en C++, permitindo aos usuarios centrarse no que hai que facer en vez de nas tarefas de baixo nivel. As nosas propostas, a librería. Heterogeneous Progromming Libmry (HPL) e a. librería Heterogene.ous Hiemrchically Tiled Arrays (H2TA), están deseñadas para nodos con unha ou má.is aceleradoras, e para clusters heteroxéneos, respectivamente. Ambas librerías, demostraron ser capaces de incrementar a. productividade dos usuarios mellora.ndo a programabilidade dos sem; códigos, e ó mesmo tempo, lograr un rendemento semella.nte ó de solucións de roáis baixo nivel.[Abstract] The usage of heterogeneous devices as co-processors in high performance computing (HPC) environments has steadily grown during the last years due to their excellent properties in terms of perfonnance and energy consumption. The larger a.vailability of hybrid HPC systems naturally led to the need to develop suitable programming tools for them, being the most widely tL'ied nowadays CUDA and OpenCL. Unfortlmatciy, these tools are relativcly low leve), which coupled with the large DUlllber of deta.ils that must be monaged when programming accelerators, makes the programm.ing of these systems using them much more complex thon that of trad.itional CPUs. This has led to the proposal of higher leve) alternatives that facilitate the progranuning of heterogeneous devices. This thesis contri bu tes to this field presenting two libraries that largely improve the programma.bility of heterogeneous systeins in C++, helping users to focus on what todo rather thtlJl onlow leve) tasks. These two libraries, the Heterogeneous Programming Library (HPL) and the Heterogeneous Hierarch.ically Tiled Arrays (H2TA), are well suited to nodes with one or more accelerators, a.nd to heterogeneous clusters, respectively. Both libraries have proveo to be able to incresse the productivity of the users improving the progro. mmability of their codes, and at the s8llle time, achieving performance similar to that of lower leve) solutions.[Resumen] El empleo de dispositivos heterogéneos como co-procesadores en entornos de computación de altas prestaciones (HPC) ha. crecido ininterrumpidamente durante los últimos años debido a. sus excelentes propiedades en términos de rendimiento y consumo de energía. La mayor disponibilidad de sistemas HPC híbridos conllevó de forma natural la necesidad de desarrollar herramientas de programación adecuadas para. ellos, siendo CUDA y OpenCL las más ampliamente utilizadas en la actualidad. Desafortunadamente, estas herramientas son relativamente de bajo nivel, lo cual emparejado con el mayor número de detalles que han de ser controlados cuando se programan aceleradoras, hacen de la programación de estos sistemas mediante ell8S mucho más compleja que la programación tradicional de CPUs. Esto ha llevado a la propuesta de alternativ8S de más alto nivel para facilitar la programación de dispositivos heterogéneos. Esta tesis contribuye a este campo presentando dos librerías que mejoran ampliamente la programabilidad de sistemas heterogéneos en C++, permitiendo a los usuarios centrarse en lo que hay que hacer en vez de en las tareas de bajo nivel. Nuestras propuestas, la librería Heterogeneous Progromming Librory (HPL) y la librería Heterogeneous Hierorchíoolly Tíled Arrays (H2TA), están diseñadas para nodos con una o más aceleradoras, y para clusters heterogéneos, respectivamente. Ambas librerías, han demostrado ser capaces de incrementar la productividad de los usuarios mejorando la programabilidad de sus códigos, y al mismo tiempo, lograr un rendimiento similar al de soluciones de más bajo nivel

    Memorias del Congreso Argentino en Ciencias de la Computación - CACIC 2021

    Get PDF
    Trabajos presentados en el XXVII Congreso Argentino de Ciencias de la Computación (CACIC), celebrado en la ciudad de Salta los días 4 al 8 de octubre de 2021, organizado por la Red de Universidades con Carreras en Informática (RedUNCI) y la Universidad Nacional de Salta (UNSA).Red de Universidades con Carreras en Informátic
    corecore