28 research outputs found

    Temporal blocking of finite-difference stencil operators with sparse "off-the-grid" sources

    Get PDF
    Stencil kernels dominate a range of scientific applications, including seismic and medical imaging, image processing, and neural networks. Temporal blocking is a performance optimization that aims to reduce the required memory bandwidth of stencil computations by re-using data from the cache for multiple time steps. It has already been shown to be beneficial for this class of algorithms. However, applying temporal blocking to practical applications' stencils remains challenging. These computations often consist of sparsely located operators not aligned with the computational grid (“off-the-grid”). Our work is motivated by modelling problems in which source injections result in wavefields that must then be measured at receivers by interpolation from the grided wavefield. The resulting data dependencies make the adoption of temporal blocking much more challenging. We propose a methodology to inspect these data dependencies and reorder the computation, leading to performance gains in stencil codes where temporal blocking has not been applicable. We implement this novel scheme in the Devito domain-specific compiler toolchain. Devito implements a domain-specific language embedded in Python to generate optimized partial differential equation solvers using the finite-difference method from high-level symbolic problem definitions. We evaluate our scheme using isotropic acoustic, anisotropic acoustic, and isotropic elastic wave propagators of industrial significance. After auto-tuning, performance evaluation shows that this enables substantial performance improvement through temporal blocking over highly-optimized vectorized spatially-blocked code of up to 1.6x

    Automated cache optimisations of stencil computations for partial differential equations

    Get PDF
    This thesis focuses on numerical methods that solve partial differential equations. Our focal point is the finite difference method, which solves partial differential equations by approximating derivatives with explicit finite differences. These partial differential equation solvers consist of stencil computations on structured grids. Stencils for computing real-world practical applications are patterns often characterised by many memory accesses and non-trivial arithmetic expressions that lead to high computational costs compared to simple stencils used in much prior proof-of-concept work. In addition, the loop nests to express stencils on structured grids may often be complicated. This work is highly motivated by a specific domain of stencil computations where one of the challenges is non-aligned to the structured grid ("off-the-grid") operations. These operations update neighbouring grid points through scatter and gather operations via non-affine memory accesses, such as {A[B[i]]}. In addition to this challenge, these practical stencils often include many computation fields (need to store multiple grid copies), complex data dependencies and imperfect loop nests. In this work, we aim to increase the performance of stencil kernel execution. We study automated cache-memory-dependent optimisations for stencil computations. This work consists of two core parts with their respective contributions.The first part of our work tries to reduce the data movement in stencil computations of practical interest. Data movement is a dominant factor affecting the performance of high-performance computing applications. It has long been a target of optimisations due to its impact on execution time and energy consumption. This thesis tries to relieve this cost by applying temporal blocking optimisations, also known as time-tiling, to stencil computations. Temporal blocking is a well-known technique to enhance data reuse in stencil computations. However, it is rarely used in practical applications but rather in theoretical examples to prove its efficacy. Applying temporal blocking to scientific simulations is more complex. More specifically, in this work, we focus on the application context of seismic and medical imaging. In this area, we often encounter scatter and gather operations due to signal sources and receivers at arbitrary locations in the computational domain. These operations make the application of temporal blocking challenging. We present an approach to overcome this challenge and successfully apply temporal blocking.In the second part of our work, we extend the first part as an automated approach targeting a wide range of simulations modelled with partial differential equations. Since temporal blocking is error-prone, tedious to apply by hand and highly complex to assimilate theoretically and practically, we are motivated to automate its application and automatically generate code that benefits from it. We discuss algorithmic approaches and present a generalised compiler pipeline to automate the application of temporal blocking. These passes are written in the Devito compiler. They are used to accelerate the computation of stencil kernels in areas such as seismic and medical imaging, computational fluid dynamics and machine learning. \href{www.devitoproject.org}{Devito} is a Python package to implement optimised stencil computation (e.g., finite differences, image processing, machine learning) from high-level symbolic problem definitions. Devito builds on \href{www.sympy.org}{SymPy} and employs automated code generation and just-in-time compilation to execute optimised computational kernels on several computer platforms, including CPUs, GPUs, and clusters thereof. We show how we automate temporal blocking code generation without user intervention and often achieve better time-to-solution. We enable domain-specific optimisation through compiler passes and offer temporal blocking gains from a high-level symbolic abstraction. These automated optimisations benefit various computational kernels for solving real-world application problems.Open Acces

    Architecture and performance of Devito, a system for automated stencil computation

    Get PDF
    Stencil computations are a key part of many high-performance computing applications, such as image processing, convolutional neural networks, and finite-difference solvers for partial differential equations. Devito is a framework capable of generating highly-optimized code given symbolic equations expressed in Python, specialized in, but not limited to, affine (stencil) codes. The lowering process -- from mathematical equations down to C++ code -- is performed by the Devito compiler through a series of intermediate representations. Several performance optimizations are introduced, including advanced common sub-expressions elimination, tiling and parallelization. Some of these are obtained through well-established stencil optimizers, integrated in the back-end of the Devito compiler. The architecture of the Devito compiler, as well as the performance optimizations that are applied when generating code, are presented. The effectiveness of such performance optimizations is demonstrated using operators drawn from seismic imaging applications

    Leveraging performance of 3D finite difference schemes in large scientific computing simulations

    Get PDF
    Gone are the days when engineers and scientists conducted most of their experiments empirically. During these decades, actual tests were carried out in order to assess the robustness and reliability of forthcoming product designs and prove theoretical models. With the advent of the computational era, scientific computing has definetely become a feasible solution compared with empirical methods, in terms of effort, cost and reliability. Large and massively parallel computational resources have reduced the simulation execution times and have improved their numerical results due to the refinement of the sampled domain. Several numerical methods coexist for solving the Partial Differential Equations (PDEs). Methods such as the Finite Element (FE) and the Finite Volume (FV) are specially well suited for dealing with problems where unstructured meshes are frequent. Unfortunately, this flexibility is not bestowed for free. These schemes entail higher memory latencies due to the handling of irregular data accesses. Conversely, the Finite Difference (FD) scheme has shown to be an efficient solution for problems where the structured meshes suit the domain requirements. Many scientific areas use this scheme due to its higher performance. This thesis focuses on improving FD schemes to leverage the performance of large scientific computing simulations. Different techniques are proposed such as the Semi-stencil, a novel algorithm that increases the FLOP/Byte ratio for medium- and high-order stencils operators by reducing the accesses and endorsing data reuse. The algorithm is orthogonal and can be combined with techniques such as spatial- or time-blocking, adding further improvement. New trends on Symmetric Multi-Processing (SMP) systems -where tens of cores are replicated on the same die- pose new challenges due to the exacerbation of the memory wall problem. In order to alleviate this issue, our research is focused on different strategies to reduce pressure on the cache hierarchy, particularly when different threads are sharing resources due to Simultaneous Multi-Threading (SMT). Several domain decomposition schedulers for work-load balance are introduced ensuring quasi-optimal results without jeopardizing the overall performance. We combine these schedulers with spatial-blocking and auto-tuning techniques, exploring the parametric space and reducing misses in last level cache. As alternative to brute-force methods used in auto-tuning, where a huge parametric space must be traversed to find a suboptimal candidate, performance models are a feasible solution. Performance models can predict the performance on different architectures, selecting suboptimal parameters almost instantly. In this thesis, we devise a flexible and extensible performance model for stencils. The proposed model is capable of supporting multi- and many-core architectures including complex features such as hardware prefetchers, SMT context and algorithmic optimizations. Our model can be used not only to forecast execution time, but also to make decisions about the best algorithmic parameters. Moreover, it can be included in run-time optimizers to decide the best SMT configuration based on the execution environment. Some industries rely heavily on FD-based techniques for their codes. Nevertheless, many cumbersome aspects arising in industry are still scarcely considered in academia research. In this regard, we have collaborated in the implementation of a FD framework which covers the most important features that an HPC industrial application must include. Some of the node-level optimization techniques devised in this thesis have been included into the framework in order to contribute in the overall application performance. We show results for a couple of strategic applications in industry: an atmospheric transport model that simulates the dispersal of volcanic ash and a seismic imaging model used in Oil & Gas industry to identify hydrocarbon-rich reservoirs.Atrás quedaron los días en los que ingenieros y científicos realizaban sus experimentos empíricamente. Durante esas décadas, se llevaban a cabo ensayos reales para verificar la robustez y fiabilidad de productos venideros y probar modelos teóricos. Con la llegada de la era computacional, la computación científica se ha convertido en una solución factible comparada con métodos empíricos, en términos de esfuerzo, coste y fiabilidad. Los supercomputadores han reducido el tiempo de las simulaciones y han mejorado los resultados numéricos gracias al refinamiento del dominio. Diversos métodos numéricos coexisten para resolver las Ecuaciones Diferenciales Parciales (EDPs). Métodos como Elementos Finitos (EF) y Volúmenes Finitos (VF) están bien adaptados para tratar problemas donde las mallas no estructuradas son frecuentes. Desafortunadamente, esta flexibilidad no se confiere de forma gratuita. Estos esquemas conllevan latencias más altas debido al acceso irregular de datos. En cambio, el esquema de Diferencias Finitas (DF) ha demostrado ser una solución eficiente cuando las mallas estructuradas se adaptan a los requerimientos. Esta tesis se enfoca en mejorar los esquemas DF para impulsar el rendimiento de las simulaciones en la computación científica. Se proponen diferentes técnicas, como el Semi-stencil, un nuevo algoritmo que incrementa el ratio de FLOP/Byte para operadores de stencil de orden medio y alto reduciendo los accesos y promoviendo el reuso de datos. El algoritmo es ortogonal y puede ser combinado con técnicas como spatial- o time-blocking, añadiendo mejoras adicionales. Las nuevas tendencias hacia sistemas con procesadores multi-simétricos (SMP) -donde decenas de cores son replicados en el mismo procesador- plantean nuevos retos debido a la exacerbación del problema del ancho de memoria. Para paliar este problema, nuestra investigación se centra en estrategias para reducir la presión en la jerarquía de cache, particularmente cuando diversos threads comparten recursos debido a Simultaneous Multi-Threading (SMT). Introducimos diversos planificadores de descomposición de dominios para balancear la carga asegurando resultados casi óptimos sin poner en riesgo el rendimiento global. Combinamos estos planificadores con técnicas de spatial-blocking y auto-tuning, explorando el espacio paramétrico y reduciendo los fallos en la cache de último nivel. Como alternativa a los métodos de fuerza bruta usados en auto-tuning donde un espacio paramétrico se debe recorrer para encontrar un candidato, los modelos de rendimiento son una solución factible. Los modelos de rendimiento pueden predecir el rendimiento en diferentes arquitecturas, seleccionando parámetros suboptimos casi de forma instantánea. En esta tesis, ideamos un modelo de rendimiento para stencils flexible y extensible. El modelo es capaz de soportar arquitecturas multi-core incluyendo características complejas como prefetchers, SMT y optimizaciones algorítmicas. Nuestro modelo puede ser usado no solo para predecir los tiempos de ejecución, sino también para tomar decisiones de los mejores parámetros algorítmicos. Además, puede ser incluido en optimizadores run-time para decidir la mejor configuración SMT. Algunas industrias confían en técnicas DF para sus códigos. Sin embargo, no todos los aspectos que aparecen en la industria han sido sometidos a investigación. En este aspecto, hemos diseñado e implementado desde cero una infraestructura DF que cubre las características más importantes que una aplicación industrial debe incluir. Algunas de las técnicas de optimización propuestas en esta tesis han sido incluidas para contribuir en el rendimiento global a nivel industrial. Mostramos resultados de un par de aplicaciones estratégicas para la industria: un modelo de transporte atmosférico que simula la dispersión de ceniza volcánica y un modelo de imagen sísmica usado en la industria del petroleo y gas para identificar reservas ricas en hidrocarburo

    Runtime-guided management of stacked DRAM memories in task parallel programs

    Get PDF
    Stacked DRAM memories have become a reality in High-Performance Computing (HPC) architectures. These memories provide much higher bandwidth while consuming less power than traditional off-chip memories, but their limited memory capacity is insufficient for modern HPC systems. For this reason, both stacked DRAM and off-chip memories are expected to co-exist in HPC architectures, giving raise to different approaches for architecting the stacked DRAM in the system. This paper proposes a runtime approach to transparently manage stacked DRAM memories in task-based programming models. In this approach the runtime system is in charge of copying the data accessed by the tasks to the stacked DRAM, without any complex hardware support nor modifications to the application code. To mitigate the cost of copying data between the stacked DRAM and the off-chip memory, the proposal includes an optimization to parallelize the copies across idle or additional helper threads. In addition, the runtime system is aware of the reuse pattern of the data accessed by the tasks, and can exploit this information to avoid unworthy copies of data to the stacked DRAM. Results on the Intel Knights Landing processor show that the proposed techniques achieve an average speedup of 14% against the state-of-the-art library to manage the stacked DRAM and 29% against a stacked DRAM architected as a hardware cache.This work has been supported by the RoMoL ERC Advanced Grant (GA 321253), by the European HiPEAC Network of Excellence, by the Spanish Ministry of Economy and Competitiveness (contract TIN2015-65316-P), by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272) and by the European Union’s Horizon 2020 research and innovation programme (grant agreement 779877). M. Moreto has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramon y Cajal fellowship number RYC-2016-21104.Peer ReviewedPostprint (author's final draft
    corecore