19 research outputs found

    High Performance Stencil Code Generation with LIFT

    Get PDF
    Stencil computations are widely used from physical simulations to machine-learning. They are embarrassingly parallel and perfectly fit modern hardware such as Graphic Processing Units. Although stencil computations have been extensively studied, optimizing them for increasingly diverse hardware remains challenging. Domain Specific Languages (DSLs) have raised the programming abstraction and offer good performance. However, this places the burden on DSL implementers who have to write almost full-fledged parallelizing compilers and optimizers. Lift has recently emerged as a promising approach to achieve performance portability and is based on a small set of reusable parallel primitives that DSL or library writers can build upon. Lift’s key novelty is in its encoding of optimizations as a system of extensible rewrite rules which are used to explore the optimization space. However, Lift has mostly focused on linear algebra operations and it remains to be seen whether this approach is applicable for other domains. This paper demonstrates how complex multidimensional stencil code and optimizations such as tiling are expressible using compositions of simple 1D Lift primitives. By leveraging existing Lift primitives and optimizations, we only require the addition of two primitives and one rewrite rule to do so. Our results show that this approach outperforms existing compiler approaches and hand-tuned codes

    Temporal blocking of finite-difference stencil operators with sparse "off-the-grid" sources

    Get PDF
    Stencil kernels dominate a range of scientific applications, including seismic and medical imaging, image processing, and neural networks. Temporal blocking is a performance optimization that aims to reduce the required memory bandwidth of stencil computations by re-using data from the cache for multiple time steps. It has already been shown to be beneficial for this class of algorithms. However, applying temporal blocking to practical applications' stencils remains challenging. These computations often consist of sparsely located operators not aligned with the computational grid (“off-the-grid”). Our work is motivated by modelling problems in which source injections result in wavefields that must then be measured at receivers by interpolation from the grided wavefield. The resulting data dependencies make the adoption of temporal blocking much more challenging. We propose a methodology to inspect these data dependencies and reorder the computation, leading to performance gains in stencil codes where temporal blocking has not been applicable. We implement this novel scheme in the Devito domain-specific compiler toolchain. Devito implements a domain-specific language embedded in Python to generate optimized partial differential equation solvers using the finite-difference method from high-level symbolic problem definitions. We evaluate our scheme using isotropic acoustic, anisotropic acoustic, and isotropic elastic wave propagators of industrial significance. After auto-tuning, performance evaluation shows that this enables substantial performance improvement through temporal blocking over highly-optimized vectorized spatially-blocked code of up to 1.6x

    Code generation for room acoustics simulations with complex boundary conditions using LIFT

    Get PDF

    EPSILOD: efficient parallel skeleton for generic iterative stencil computations in distributed GPUs

    Get PDF
    Producción CientíficaIterative stencil computations are widely used in numerical simulations. They present a high degree of parallelism, high locality and mostly-coalesced memory access patterns. Therefore, GPUs are good candidates to speed up their computa- tion. However, the development of stencil programs that can work with huge grids in distributed systems with multiple GPUs is not straightforward, since it requires solv- ing problems related to the partition of the grid across nodes and devices, and the synchronization and data movement across remote GPUs. In this work, we present EPSILOD, a high-productivity parallel programming skeleton for iterative stencil computations on distributed multi-GPUs, of the same or different vendors that sup- ports any type of n-dimensional geometric stencils of any order. It uses an abstract specification of the stencil pattern (neighbors and weights) to internally derive the data partition, synchronizations and communications. Computation is split to better overlap with communications. This paper describes the underlying architecture of EPSILOD, its main components, and presents an experimental evaluation to show the benefits of our approach, including a comparison with another state-of-the-art solution. The experimental results show that EPSILOD is faster and shows good strong and weak scalability for platforms with both homogeneous and heterogene- ous types of GPUJunta de Castilla y León, Ministerio de Economía, Industria y Competitividad, y Fondo Europeo de Desarrollo Regional (FEDER): Proyecto PCAS (TIN2017-88614-R) y Proyecto PROPHET-2 (VA226P20).Ministerio de Ciencia e Innovación, Agencia Estatal de Investigación y “European Union NextGenerationEU/PRTR” : (MCIN/ AEI/10.13039/501100011033) - grant TED2021-130367B-I00CTE-POWER and Minotauro and the technical support provided by Barcelona Supercomputing Center (RES-IM-2021-2-0005, RES-IM-2021-3-0024, RES- IM-2022-1-0014).Publicación en abierto financiada por el Consorcio de Bibliotecas Universitarias de Castilla y León (BUCLE), con cargo al Programa Operativo 2014ES16RFOP009 FEDER 2014-2020 DE CASTILLA Y LEÓN, Actuación:20007-CL - Apoyo Consorcio BUCL

    P3 problem and Magnolia language: Specializing array computations for emerging architectures

    Get PDF
    The problem of producing portable high-performance computing (HPC) software that is cheap to develop and maintain is called the P3 (performance, portability, productivity) problem. Good solutions to the P3 problem have been achieved when the performance profiles of the target machines have been similar. The variety of HPC architectures is, however, large and can be expected to grow larger. Software for HPC therefore needs to be highly adaptable, and there is a pressing need to provide developers with tools to produce software that can target machines with vastly different profiles. Multi-dimensional array manipulation constitutes a core component of numerous numerical methods, such as finite difference solvers of Partial Differential Equations (PDEs). The efficiency of these computations is tightly connected to traversing and distributing array data in a hardware-friendly way. The Mathematics of Arrays (MoA) allows for formally reasoning about array computations and enables systematic transformations of array-based programs, e.g., to use data layouts that fit to a specific architecture. This paper presents a programming methodology aimed for tackling the P3 problem in domains that are well-explored using Magnolia, a language designed to embody generic programming. The Magnolia programmer can restrict the semantic properties of abstract generic types and operations by defining so-called axioms. Axioms can be used to produce tests for concrete implementations of specifications, for formal verification, or to perform semantics-preserving program transformations. We leverage Magnolia's semantic specification facilities to extend the Magnolia compiler with a term rewriting system. We implement MoA's transformation rules in Magnolia, and demonstrate through a case study on a finite difference solver of PDEs how our rewriting system allows exploring the space of possible optimizations.publishedVersio

    Tiling Optimizations for Stencil Computations Using Rewrite Rules in Lift

    Get PDF
    Stencil computations are a widely used type of algorithm, found in applications from physical simulations to machine learning. Stencils are embarrassingly parallel, therefore fit on modern hardware such as Graphic Processing Units perfectly. Although stencil computations have been extensively studied, optimizing them for increasingly diverse hardware remains challenging. Domain-specific Languages (DSLs) have raised the programming abstraction and offer good performance; however, this method places the burden on DSL implementers to write almost full-fledged parallelizing compilers and optimizers. Lift has recently emerged as a promising approach to achieve performance portability by using a small set of reusable parallel primitives that DSL or library writers utilize. Lift’s key novelty is in its encoding of optimizations as a system of extensible rewrite rules which are used to explore the optimization space. This article demonstrates how complex multi-dimensional stencil code and optimizations are expressed using compositions of simple 1D Lift primitives and rewrite rules. We introduce two optimizations that provide high performance for stencils in particular: classical overlapped tiling for multi-dimensional stencils and 2.5D tiling specifically for 3D stencils. We provide an in-depth analysis on how the tiling optimizations affects stencils of different shapes and sizes across different applications. Our experimental results show that our approach outperforms existing compiler approaches and hand-tuned codes

    Full-System Simulation of Mobile CPU/GPU Platforms

    Get PDF
    Graphics Processing Units (GPUs) critically rely on a complex system software stack comprising kernel- and userspace drivers and Just-in-time (JIT) compilers. Yet, existing GPU simulators typically abstract away details of the software stack and GPU instruction set. Partly, this is because GPU vendors rarely release sufficient information about their latest GPU products. However, this is also due to the lack of an integrated CPU/GPU simulation framework, which is complete and powerful enough to drive the complex GPU software environment. This has led to a situation where research on GPU architectures and compilers is largely based on outdated or greatly simplified architectures and software stacks, undermining the validity of the generated results. In this paper we develop a full-system system simulation environment for a mobile platform, which enables users to run a complete and unmodified software stack for a state-of-the-art mobile Arm CPU and Mali-G71 GPU powered device. We validate our simulator against a hardware implementation and Arm’s stand-alone GPU simulator, achieving 100% architectural accuracy across all available toolchains. We demonstrate the capability of our GPU simulation framework by optimizing an advanced Computer Vision application using simulated statistics unavailable with other simulation approaches or physical GPU implementations. We demonstrate that performance optimizations for desktop GPUs trigger bottlenecks on mobile GPUs, and show the importance of efficient memory use.Postprin
    corecore