72 research outputs found

    Tiling Optimizations for Stencil Computations Using Rewrite Rules in Lift

    Get PDF
    Stencil computations are a widely used type of algorithm, found in applications from physical simulations to machine learning. Stencils are embarrassingly parallel, therefore fit on modern hardware such as Graphic Processing Units perfectly. Although stencil computations have been extensively studied, optimizing them for increasingly diverse hardware remains challenging. Domain-specific Languages (DSLs) have raised the programming abstraction and offer good performance; however, this method places the burden on DSL implementers to write almost full-fledged parallelizing compilers and optimizers. Lift has recently emerged as a promising approach to achieve performance portability by using a small set of reusable parallel primitives that DSL or library writers utilize. Lift’s key novelty is in its encoding of optimizations as a system of extensible rewrite rules which are used to explore the optimization space. This article demonstrates how complex multi-dimensional stencil code and optimizations are expressed using compositions of simple 1D Lift primitives and rewrite rules. We introduce two optimizations that provide high performance for stencils in particular: classical overlapped tiling for multi-dimensional stencils and 2.5D tiling specifically for 3D stencils. We provide an in-depth analysis on how the tiling optimizations affects stencils of different shapes and sizes across different applications. Our experimental results show that our approach outperforms existing compiler approaches and hand-tuned codes

    High Performance Stencil Code Generation with LIFT

    Get PDF
    Stencil computations are widely used from physical simulations to machine-learning. They are embarrassingly parallel and perfectly fit modern hardware such as Graphic Processing Units. Although stencil computations have been extensively studied, optimizing them for increasingly diverse hardware remains challenging. Domain Specific Languages (DSLs) have raised the programming abstraction and offer good performance. However, this places the burden on DSL implementers who have to write almost full-fledged parallelizing compilers and optimizers. Lift has recently emerged as a promising approach to achieve performance portability and is based on a small set of reusable parallel primitives that DSL or library writers can build upon. Lift’s key novelty is in its encoding of optimizations as a system of extensible rewrite rules which are used to explore the optimization space. However, Lift has mostly focused on linear algebra operations and it remains to be seen whether this approach is applicable for other domains. This paper demonstrates how complex multidimensional stencil code and optimizations such as tiling are expressible using compositions of simple 1D Lift primitives. By leveraging existing Lift primitives and optimizations, we only require the addition of two primitives and one rewrite rule to do so. Our results show that this approach outperforms existing compiler approaches and hand-tuned codes

    AN5D: Automated Stencil Framework for High-Degree Temporal Blocking on GPUs

    Full text link
    Stencil computation is one of the most widely-used compute patterns in high performance computing applications. Spatial and temporal blocking have been proposed to overcome the memory-bound nature of this type of computation by moving memory pressure from external memory to on-chip memory on GPUs. However, correctly implementing those optimizations while considering the complexity of the architecture and memory hierarchy of GPUs to achieve high performance is difficult. We propose AN5D, an automated stencil framework which is capable of automatically transforming and optimizing stencil patterns in a given C source code, and generating corresponding CUDA code. Parameter tuning in our framework is guided by our performance model. Our novel optimization strategy reduces shared memory and register pressure in comparison to existing implementations, allowing performance scaling up to a temporal blocking degree of 10. We achieve the highest performance reported so far for all evaluated stencil benchmarks on the state-of-the-art Tesla V100 GPU

    Architectural explorations for streaming accelerators with customized memory layouts

    Get PDF
    El concepto básico de la arquitectura mono-nucleo en los procesadores de propósito general se ajusta bien a un modelo de programación secuencial. La integración de multiples núcleos en un solo chip ha permitido a los procesadores correr partes del programa en paralelo. Sin embargo, la explotación del enorme paralelismo disponible en muchas aplicaciones de alto rendimiento y de los datos correspondientes es difícil de conseguir usando unicamente multicores de propósito general. La aparición de aceleradores tipo streaming y de los correspondientes modelos de programación han mejorado esta situación proporcionando arquitecturas orientadas al proceso de flujos de datos. La idea básica detrás del diseño de estas arquitecturas responde a la necesidad de procesar conjuntos enormes de datos. Estos dispositivos de alto rendimiento orientados a flujos permiten el procesamiento rapido de datos mediante el uso eficiente de computación paralela y comunicación entre procesos. Los aceleradores streaming orientados a flujos, igual que en otros procesadores, consisten en diversos componentes micro-arquitectonicos como por ejemplo las estructuras de memoria, las unidades de computo, las unidades de control, los canales de Entrada/Salida y controles de Entrada/Salida, etc. Sin embargo, los requisitos del flujo de datos agregan algunas características especiales e imponen otras restricciones que afectan al rendimiento. Estos dispositivos, por lo general, ofrecen un gran número de recursos computacionales, pero obligan a reorganizar los conjuntos de datos en paralelo, maximizando la independiencia para alimentar los recursos de computación en forma de flujos. La disposición de datos en conjuntos independientes de flujos paralelos no es una tarea sencilla. Es posible que se tenga que cambiar la estructura de un algoritmo en su conjunto o, incluso, puede requerir la reescritura del algoritmo desde cero. Sin embargo, todos estos esfuerzos para la reordenación de los patrones de las aplicaciones de acceso a datos puede que no sean muy útiles para lograr un rendimiento óptimo. Esto es debido a las posibles limitaciones microarquitectonicas de la plataforma de destino para los mecanismos hardware de prefetch, el tamaño y la granularidad del almacenamiento local, y la flexibilidad para disponer de forma serial los datos en el interior del almacenamiento local. Las limitaciones de una plataforma de streaming de proposito general para el prefetching de datos, almacenamiento y demas procedimientos para organizar y mantener los datos en forma de flujos paralelos e independientes podría ser eliminado empleando técnicas a nivel micro-arquitectonico. Esto incluye el uso de memorias personalizadas especificamente para las aplicaciones en el front-end de una arquitectura streaming. El objetivo de esta tesis es presentar exploraciones arquitectónicas de los aceleradores streaming con diseños de memoria personalizados. En general, la tesis cubre tres aspectos principales de tales aceleradores. Estos aspectos se pueden clasificar como: i) Diseño de aceleradores de aplicaciones específicas con diseños de memoria personalizados, ii) diseño de aceleradores con memorias personalizadas basados en plantillas, y iii) exploraciones del espacio de diseño para dispositivos orientados a flujos con las memorias estándar y personalizadas. Esta tesis concluye con la propuesta conceptual de una Blacksmith Streaming Architecture (BSArc). El modelo de computación Blacksmith permite la adopción a nivel de hardware de un front-end de aplicación específico utilizando una GPU como back-end. Esto permite maximizar la explotación de la localidad de datos y el paralelismo a nivel de datos de una aplicación mientras que proporciona un flujo mayor de datos al back-end. Consideramos que el diseño de estos procesadores con memorias especializadas debe ser proporcionado por expertos del dominio de aplicación en la forma de plantillas.The basic concept behind the architecture of a general purpose CPU core conforms well to a serial programming model. The integration of more cores on a single chip helped CPUs in running parts of a program in parallel. However, the utilization of huge parallelism available from many high performance applications and the corresponding data is hard to achieve from these general purpose multi-cores. Streaming accelerators and the corresponding programing models improve upon this situation by providing throughput oriented architectures. The basic idea behind the design of these architectures matches the everyday increasing requirements of processing huge data sets. These high-performance throughput oriented devices help in high performance processing of data by using efficient parallel computations and streaming based communications. The throughput oriented streaming accelerators ¿ similar to the other processors ¿ consist of numerous types of micro-architectural components including the memory structures, compute units, control units, I/O channels and I/O controls etc. However, the throughput requirements add some special features and impose other restrictions for the performance purposes. These devices, normally, offer a large number of compute resources but restrict the applications to arrange parallel and maximally independent data sets to feed the compute resources in the form of streams. The arrangement of data into independent sets of parallel streams is not an easy and simple task. It may need to change the structure of an algorithm as a whole or even it can require to write a new algorithm from scratch for the target application. However, all these efforts for the re-arrangement of application data access patterns may still not be very helpful to achieve the optimal performance. This is because of the possible micro-architectural constraints of the target platform for the hardware pre-fetching mechanisms, the size and the granularity of the local storage and the flexibility in data marshaling inside the local storage. The constraints of a general purpose streaming platform on the data pre-fetching, storing and maneuvering to arrange and maintain it in the form of parallel and independent streams could be removed by employing micro-architectural level design approaches. This includes the usage of application specific customized memories in the front-end of a streaming architecture. The focus of this thesis is to present architectural explorations for the streaming accelerators using customized memory layouts. In general the thesis covers three main aspects of such streaming accelerators in this research. These aspects can be categorized as : i) Design of Application Specific Accelerators with Customized Memory Layout ii) Template Based Design Support for Customized Memory Accelerators and iii) Design Space Explorations for Throughput Oriented Devices with Standard and Customized Memories. This thesis concludes with a conceptual proposal on a Blacksmith Streaming Architecture (BSArc). The Blacksmith Computing allow the hardware-level adoption of an application specific front-end with a GPU like streaming back-end. This gives an opportunity to exploit maximum possible data locality and the data level parallelism from an application while providing a throughput natured powerful back-end. We consider that the design of these specialized memory layouts for the front-end of the device are provided by the application domain experts in the form of templates. These templates are adjustable according to a device and the problem size at the device's configuration time. The physical availability of such an architecture may still take time. However, simulation framework helps in architectural explorations to give insight into the proposal and predicts potential performance benefits for such an architecture

    Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators

    Get PDF
    In this paper, we evaluate the error criticality of radiation-induced errors on modern High-Performance Computing (HPC) accelerators (Intel Xeon Phi and NVIDIA K40) through a dedicated set of metrics. We show that, as long as imprecise computing is concerned, the simple mismatch detection is not sufficient to evaluate and compare the radiation sensitivity of HPC devices and algorithms. Our analysis quantifies and qualifies radiation effects on applications’ output correlating the number of corrupted elements with their spatial locality. Also, we provide the mean relative error (dataset-wise) to evaluate radiation-induced error magnitude. We apply the selected metrics to experimental results obtained in various radiation test campaigns for a total of more than 400 hours of beam time per device. The amount of data we gathered allows us to evaluate the error criticality of a representative set of algorithms from HPC suites. Additionally, based on the characteristics of the tested algorithms, we draw generic reliability conclusions for broader classes of codes. We show that arithmetic operations are less critical for the K40, while Xeon Phi is more reliable when executing particles interactions solved through Finite Difference Methods. Finally, iterative stencil operations seem the most reliable on both architectures.This work was supported by the STIC-AmSud/CAPES scientific cooperation program under the EnergySFE research project grant 99999.007556/2015-02, EU H2020 Programme, and MCTI/RNP-Brazil under the HPC4E Project, grant agreement n° 689772. Tested K40 boards were donated thanks to Steve Keckler, Timothy Tsai, and Siva Hari from NVIDIA.Postprint (author's final draft

    Architectural explorations for streaming accelerators with customized memory layouts

    Get PDF
    El concepto básico de la arquitectura mono-nucleo en los procesadores de propósito general se ajusta bien a un modelo de programación secuencial. La integración de multiples núcleos en un solo chip ha permitido a los procesadores correr partes del programa en paralelo. Sin embargo, la explotación del enorme paralelismo disponible en muchas aplicaciones de alto rendimiento y de los datos correspondientes es difícil de conseguir usando unicamente multicores de propósito general. La aparición de aceleradores tipo streaming y de los correspondientes modelos de programación han mejorado esta situación proporcionando arquitecturas orientadas al proceso de flujos de datos. La idea básica detrás del diseño de estas arquitecturas responde a la necesidad de procesar conjuntos enormes de datos. Estos dispositivos de alto rendimiento orientados a flujos permiten el procesamiento rapido de datos mediante el uso eficiente de computación paralela y comunicación entre procesos. Los aceleradores streaming orientados a flujos, igual que en otros procesadores, consisten en diversos componentes micro-arquitectonicos como por ejemplo las estructuras de memoria, las unidades de computo, las unidades de control, los canales de Entrada/Salida y controles de Entrada/Salida, etc. Sin embargo, los requisitos del flujo de datos agregan algunas características especiales e imponen otras restricciones que afectan al rendimiento. Estos dispositivos, por lo general, ofrecen un gran número de recursos computacionales, pero obligan a reorganizar los conjuntos de datos en paralelo, maximizando la independiencia para alimentar los recursos de computación en forma de flujos. La disposición de datos en conjuntos independientes de flujos paralelos no es una tarea sencilla. Es posible que se tenga que cambiar la estructura de un algoritmo en su conjunto o, incluso, puede requerir la reescritura del algoritmo desde cero. Sin embargo, todos estos esfuerzos para la reordenación de los patrones de las aplicaciones de acceso a datos puede que no sean muy útiles para lograr un rendimiento óptimo. Esto es debido a las posibles limitaciones microarquitectonicas de la plataforma de destino para los mecanismos hardware de prefetch, el tamaño y la granularidad del almacenamiento local, y la flexibilidad para disponer de forma serial los datos en el interior del almacenamiento local. Las limitaciones de una plataforma de streaming de proposito general para el prefetching de datos, almacenamiento y demas procedimientos para organizar y mantener los datos en forma de flujos paralelos e independientes podría ser eliminado empleando técnicas a nivel micro-arquitectonico. Esto incluye el uso de memorias personalizadas especificamente para las aplicaciones en el front-end de una arquitectura streaming. El objetivo de esta tesis es presentar exploraciones arquitectónicas de los aceleradores streaming con diseños de memoria personalizados. En general, la tesis cubre tres aspectos principales de tales aceleradores. Estos aspectos se pueden clasificar como: i) Diseño de aceleradores de aplicaciones específicas con diseños de memoria personalizados, ii) diseño de aceleradores con memorias personalizadas basados en plantillas, y iii) exploraciones del espacio de diseño para dispositivos orientados a flujos con las memorias estándar y personalizadas. Esta tesis concluye con la propuesta conceptual de una Blacksmith Streaming Architecture (BSArc). El modelo de computación Blacksmith permite la adopción a nivel de hardware de un front-end de aplicación específico utilizando una GPU como back-end. Esto permite maximizar la explotación de la localidad de datos y el paralelismo a nivel de datos de una aplicación mientras que proporciona un flujo mayor de datos al back-end. Consideramos que el diseño de estos procesadores con memorias especializadas debe ser proporcionado por expertos del dominio de aplicación en la forma de plantillas.The basic concept behind the architecture of a general purpose CPU core conforms well to a serial programming model. The integration of more cores on a single chip helped CPUs in running parts of a program in parallel. However, the utilization of huge parallelism available from many high performance applications and the corresponding data is hard to achieve from these general purpose multi-cores. Streaming accelerators and the corresponding programing models improve upon this situation by providing throughput oriented architectures. The basic idea behind the design of these architectures matches the everyday increasing requirements of processing huge data sets. These high-performance throughput oriented devices help in high performance processing of data by using efficient parallel computations and streaming based communications. The throughput oriented streaming accelerators ¿ similar to the other processors ¿ consist of numerous types of micro-architectural components including the memory structures, compute units, control units, I/O channels and I/O controls etc. However, the throughput requirements add some special features and impose other restrictions for the performance purposes. These devices, normally, offer a large number of compute resources but restrict the applications to arrange parallel and maximally independent data sets to feed the compute resources in the form of streams. The arrangement of data into independent sets of parallel streams is not an easy and simple task. It may need to change the structure of an algorithm as a whole or even it can require to write a new algorithm from scratch for the target application. However, all these efforts for the re-arrangement of application data access patterns may still not be very helpful to achieve the optimal performance. This is because of the possible micro-architectural constraints of the target platform for the hardware pre-fetching mechanisms, the size and the granularity of the local storage and the flexibility in data marshaling inside the local storage. The constraints of a general purpose streaming platform on the data pre-fetching, storing and maneuvering to arrange and maintain it in the form of parallel and independent streams could be removed by employing micro-architectural level design approaches. This includes the usage of application specific customized memories in the front-end of a streaming architecture. The focus of this thesis is to present architectural explorations for the streaming accelerators using customized memory layouts. In general the thesis covers three main aspects of such streaming accelerators in this research. These aspects can be categorized as : i) Design of Application Specific Accelerators with Customized Memory Layout ii) Template Based Design Support for Customized Memory Accelerators and iii) Design Space Explorations for Throughput Oriented Devices with Standard and Customized Memories. This thesis concludes with a conceptual proposal on a Blacksmith Streaming Architecture (BSArc). The Blacksmith Computing allow the hardware-level adoption of an application specific front-end with a GPU like streaming back-end. This gives an opportunity to exploit maximum possible data locality and the data level parallelism from an application while providing a throughput natured powerful back-end. We consider that the design of these specialized memory layouts for the front-end of the device are provided by the application domain experts in the form of templates. These templates are adjustable according to a device and the problem size at the device's configuration time. The physical availability of such an architecture may still take time. However, simulation framework helps in architectural explorations to give insight into the proposal and predicts potential performance benefits for such an architecture.Postprint (published version

    Automated cache optimisations of stencil computations for partial differential equations

    Get PDF
    This thesis focuses on numerical methods that solve partial differential equations. Our focal point is the finite difference method, which solves partial differential equations by approximating derivatives with explicit finite differences. These partial differential equation solvers consist of stencil computations on structured grids. Stencils for computing real-world practical applications are patterns often characterised by many memory accesses and non-trivial arithmetic expressions that lead to high computational costs compared to simple stencils used in much prior proof-of-concept work. In addition, the loop nests to express stencils on structured grids may often be complicated. This work is highly motivated by a specific domain of stencil computations where one of the challenges is non-aligned to the structured grid ("off-the-grid") operations. These operations update neighbouring grid points through scatter and gather operations via non-affine memory accesses, such as {A[B[i]]}. In addition to this challenge, these practical stencils often include many computation fields (need to store multiple grid copies), complex data dependencies and imperfect loop nests. In this work, we aim to increase the performance of stencil kernel execution. We study automated cache-memory-dependent optimisations for stencil computations. This work consists of two core parts with their respective contributions.The first part of our work tries to reduce the data movement in stencil computations of practical interest. Data movement is a dominant factor affecting the performance of high-performance computing applications. It has long been a target of optimisations due to its impact on execution time and energy consumption. This thesis tries to relieve this cost by applying temporal blocking optimisations, also known as time-tiling, to stencil computations. Temporal blocking is a well-known technique to enhance data reuse in stencil computations. However, it is rarely used in practical applications but rather in theoretical examples to prove its efficacy. Applying temporal blocking to scientific simulations is more complex. More specifically, in this work, we focus on the application context of seismic and medical imaging. In this area, we often encounter scatter and gather operations due to signal sources and receivers at arbitrary locations in the computational domain. These operations make the application of temporal blocking challenging. We present an approach to overcome this challenge and successfully apply temporal blocking.In the second part of our work, we extend the first part as an automated approach targeting a wide range of simulations modelled with partial differential equations. Since temporal blocking is error-prone, tedious to apply by hand and highly complex to assimilate theoretically and practically, we are motivated to automate its application and automatically generate code that benefits from it. We discuss algorithmic approaches and present a generalised compiler pipeline to automate the application of temporal blocking. These passes are written in the Devito compiler. They are used to accelerate the computation of stencil kernels in areas such as seismic and medical imaging, computational fluid dynamics and machine learning. \href{www.devitoproject.org}{Devito} is a Python package to implement optimised stencil computation (e.g., finite differences, image processing, machine learning) from high-level symbolic problem definitions. Devito builds on \href{www.sympy.org}{SymPy} and employs automated code generation and just-in-time compilation to execute optimised computational kernels on several computer platforms, including CPUs, GPUs, and clusters thereof. We show how we automate temporal blocking code generation without user intervention and often achieve better time-to-solution. We enable domain-specific optimisation through compiler passes and offer temporal blocking gains from a high-level symbolic abstraction. These automated optimisations benefit various computational kernels for solving real-world application problems.Open Acces
    • …
    corecore