
    Search-based Model-driven Loop Optimizations for Tensor Contractions

    Complex tensor contraction expressions arise in accurate electronic structure models in quantum chemistry, such as the coupled cluster method. The Tensor Contraction Engine (TCE) is a high-level program synthesis system that facilitates the generation of high-performance parallel programs from tensor contraction equations. We are developing a new software infrastructure for the TCE that is designed to allow experimentation with optimization algorithms for modern computing platforms, including heterogeneous architectures employing general-purpose graphics processing units (GPGPUs). In this dissertation, we present improvements and extensions to the loop fusion optimization algorithm, which can be used with cost models, e.g., for minimizing memory usage or for minimizing data movement costs under a memory constraint. We show that our data structure and pruning improvements to the loop fusion algorithm result in significant performance improvements that enable complex cost models to be used for large input equations. We also present an algorithm for optimizing the fused loop structure of handwritten code. It determines the regions of handwritten code that are safe to optimize and then runs the loop fusion algorithm on the dependency graph of the code. Finally, we develop an optimization framework for generating GPGPU code, consisting of loop fusion optimization with a novel cost model, tiling optimization, and layout optimization. Depending on the memory available on the GPGPU and the sizes of the tensors, our framework decides which processor (CPU or GPGPU) should perform an operation and where the result should be moved. We present extensive measurements for tuning the loop fusion algorithm, for validating our optimization framework, and for measuring the performance characteristics of GPGPUs. Our measurements demonstrate that our optimization framework outperforms existing general-purpose optimization approaches both on multi-core CPUs and on GPGPUs.
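    To illustrate the effect loop fusion targets (a minimal NumPy sketch of the idea, not the TCE's algorithm), fusing the loops of two chained contractions shrinks an O(n^2) intermediate to a single O(n) row:

        import numpy as np

        def unfused(A1, A2, v):
            B = A1 @ A2               # the full n-by-n intermediate B is materialized
            return B @ v

        def fused(A1, A2, v):
            # Fusing the loops over i means only one row of B is ever live,
            # shrinking intermediate storage from O(n^2) to O(n).
            n = A1.shape[0]
            out = np.empty(n)
            for i in range(n):
                row = A1[i, :] @ A2   # row i of the intermediate, consumed immediately
                out[i] = row @ v
            return out

    A search-based fusion algorithm explores such loop structures systematically under a cost model instead of relying on a single hand-picked fusion.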

    Model-driven search-based loop fusion optimization for handwritten code

    The Tensor Contraction Engine (TCE) is a compiler that translates high-level, mathematical tensor contraction expressions into efficient, parallel Fortran code. A pair of optimizations in the TCE, the fusion and tiling optimizations, have proven successful for minimizing disk-to-memory traffic for dense tensor computations. While other optimizations are specific to tensor contraction expressions, these two model-driven, search-based optimization algorithms could also be useful for optimizing handwritten dense array computations to minimize disk-to-memory traffic. In this thesis, we show how to apply the loop fusion algorithm to handwritten code in a procedural language. While in the TCE the loop fusion algorithm operated on high-level expression trees, in a standard compiler it needs to operate on abstract syntax trees. For simplicity, we use the fusion algorithm only for memory minimization instead of for minimizing disk-to-memory traffic. Also, we limit ourselves to handwritten, dense array computations in which loop bound expressions are constant, subscript expressions are simple loop variables, and there are no common subexpressions. After type checking, we canonicalize the abstract syntax tree to move side effects and loop-invariant code out of larger expressions. Using dataflow analysis, we then compute reaching definitions and add use-def chains to the abstract syntax tree. After undoing any partial loop fusion, a generalized loop fusion algorithm traverses the abstract syntax tree together with the use-def chains. Finally, the abstract syntax tree is rewritten to reflect the loop structure found by the loop fusion algorithm. We outline how the constraints on loop bound expressions and array index expressions could be removed in the future using an algebraic cost model and an analysis of the iteration space using a polyhedral model.
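    As a sketch of the source-level transformation such a pipeline performs (our own illustrative functions, not the thesis's implementation), fusing two loops whose only dependence is a per-iteration use-def chain lets an intermediate array shrink to a scalar:

        # Before fusion: the intermediate array t is live across both loops,
        # so its full length-n storage must exist at once.
        def before(a):
            n = len(a)
            t = [0.0] * n
            for i in range(n):
                t[i] = a[i] * a[i]
            s = 0.0
            for i in range(n):
                s += t[i]
            return s

        # After fusion: the use-def chains show t[i] is defined and used in the
        # same iteration, so the loops fuse and t reduces to a scalar.
        def after(a):
            s = 0.0
            for x in a:
                t_i = x * x
                s += t_i
            return s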

    Performance Optimization With An Integrated View Of Compiler And Application Knowledge

    Compiler optimization is a long-standing research field that enhances program performance with a set of rigorous code analyses and transformations. Traditional compiler optimization focuses on general programs or program structures without taking much high-level knowledge of application operations or data structures into account. In this thesis, we claim that an integrated view of the application and the compiler can further improve program performance. In particular, we study integrated optimization opportunities for three kinds of applications: irregular tree-based query processing systems such as the B+ tree, security enhancements such as buffer overflow protection, and tensor/matrix-based linear algebra computation. The performance of B+ tree query processing is important for many applications, such as file systems and databases. Latch-free B+ tree query processing is efficient since the queries are processed in batches without locks. To avoid long latency, the batch size cannot be very large. However, modern processors provide opportunities to process larger batches in parallel with acceptable latency. From studying real-world data, we find that there are many redundant and unnecessary queries, especially when the real-world data is highly skewed. We develop a query sequence transformation framework, Qtrans, that reduces redundancy in queries by applying classic dataflow analysis to them. To further confirm its effectiveness, we integrate Qtrans into an existing BSP-based B+ tree query processing system, PALM tree. The evaluations show that throughput can be improved by up to 16X. Heap overflows are still the most common vulnerabilities in C/C++ programs. Common approaches incur high overhead since they check every memory access. By analyzing dozens of bugs, we find that all heap overflows are related to arrays, so only array-related memory accesses need to be checked. We propose Prober to efficiently detect and prevent heap overflows. It contains Prober-Static, which identifies array-related allocations, and Prober-Dynamic, which protects objects at runtime. In this thesis, our contributions lie on the Prober-Static side. The key challenge is to correctly identify the array-related allocations. We propose a hybrid method: some objects can be identified as array-related (or not) by static analysis; for the remaining ones, we instrument the basic allocation type size statically and then determine the real allocation size at runtime. The evaluations show that Prober-Static is effective. Tensor algebra is widely used in many applications, such as machine learning and data analytics. Tensors representing real-world data are usually large and sparse. There are many sparse tensor storage formats, and the computational kernels differ across formats, which makes performance optimization for sparse tensor algebra challenging. We propose SPACe, a tensor algebra domain-specific language and compiler that automatically generates kernels for sparse tensor algebra computations. This compiler supports a wide range of sparse tensor formats. To further improve performance, we integrate data reordering into SPACe to improve data locality. The evaluations show that the code generated by SPACe outperforms state-of-the-art sparse tensor algebra compilers.
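    As a rough illustration of dataflow-style redundancy elimination on a query batch (the rule and names below are our assumptions in the spirit of Qtrans, not its actual transformations), a duplicate read with no intervening write can be pruned because the earlier query's result can be shared:

        def prune_batch(queries):
            """queries: ordered list of ('GET', key) or ('PUT', key, value) in one batch."""
            kept, live_reads = [], set()
            for q in queries:
                op, key = q[0], q[1]
                if op == 'GET':
                    if key in live_reads:
                        continue             # redundant duplicate read: prune it
                    live_reads.add(key)
                else:                        # a PUT invalidates earlier reads of key
                    live_reads.discard(key)
                kept.append(q)
            return kept

        batch = [('GET', 5), ('GET', 5), ('PUT', 5, 'x'), ('GET', 5)]
        print(prune_batch(batch))  # [('GET', 5), ('PUT', 5, 'x'), ('GET', 5)]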

    Design and Code Optimization for Systems with Next-generation Racetrack Memories

    With the rise of computationally expensive application domains such as machine learning, genomics, and fluid simulation, the quest for performant and energy-efficient computing has gained unprecedented momentum. The significant increase in computing and memory devices in modern systems has resulted in an unsustainable surge in energy consumption, a substantial portion of which is attributed to the memory system. The scaling of conventional memory technologies and their suitability for next-generation systems are also questionable. This has led to the emergence and rise of nonvolatile memory (NVM) technologies. Today, several NVM technologies in different development stages are competing for rapid access to the market. Racetrack memory (RTM) is one such nonvolatile memory technology that promises SRAM-comparable latency, reduced energy consumption, and unprecedented density compared to other technologies. However, RTM is sequential in nature: data in an RTM cell needs to be shifted to an access port before it can be accessed. These shift operations incur performance and energy penalties. An ideal RTM, requiring at most one shift per access, can easily outperform SRAM; in the worst-case shifting scenario, however, RTM can be an order of magnitude slower than SRAM. This thesis presents an overview of RTM device physics, its evolution, strengths and challenges, and its application in the memory subsystem. We develop tools that enable the programming and modeling of RTM-based systems. For shift minimization, we propose a set of techniques including optimal, near-optimal, and evolutionary algorithms for efficient scalar and instruction placement in RTMs. For array accesses, we explore schedule and layout transformations that eliminate the longer overhead shifts in RTMs. We present an automatic compilation framework that analyzes static control flow programs and transforms the loop traversal order and memory layout to maximize accesses to consecutive RTM locations and minimize shifts. We develop a simulation framework called RTSim that models various RTM parameters and enables accurate architectural-level simulation. Finally, to demonstrate the RTM potential in non-von-Neumann in-memory computing paradigms, we exploit its device attributes to implement logic and arithmetic operations. As a concrete use case, we implement an entire hyperdimensional computing framework in RTM to accelerate the language recognition problem. Our evaluation shows considerable performance and energy improvements compared to conventional von Neumann models and state-of-the-art accelerators.
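    The following sketch shows why placement matters under a deliberately simplified RTM cost model (one track, one access port; our own illustration, not the thesis's model): the shift cost of an access sequence is the total port movement, so placing frequently co-accessed data in adjacent cells minimizes it.

        def shift_cost(accesses, port=0):
            """Total shifts needed to serve `accesses` (cell indices on one track)
            with a single access port."""
            cost = 0
            for cell in accesses:
                cost += abs(cell - port)  # shift the nanowire until cell is at the port
                port = cell
            return cost

        # The same logical access pattern over two physical layouts:
        print(shift_cost([0, 60, 0, 60, 0]))  # hot variables far apart: 240 shifts
        print(shift_cost([0, 1, 0, 1, 0]))    # hot variables adjacent:    4 shifts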

    Optimizing the Four-Index Integral Transform Using Data Movement Lower Bounds Analysis

    The four-index integral transform is a fundamental and computationally demanding calculation used in many computational chemistry suites such as NWChem. It transforms a four-dimensional tensor from one basis to another. This transformation is most efficiently implemented as a sequence of four tensor contractions that each contract a four-dimensional tensor with a two-dimensional transformation matrix. Differing degrees of permutation symmetry in the intermediate and final tensors in the sequence of contractions cause intermediate tensors to be much larger than the final tensor and limit the number of electronic states in the modeled systems. Loop fusion, in conjunction with tiling, can be very effective in reducing the total space requirement, as well as data movement. However, the large number of possible choices for loop fusion and tiling, and data/computation distribution across a parallel system, make it challenging to develop an optimized parallel implementation for the four-index integral transform. We develop a novel approach to address this problem, using lower bounds modeling of data movement complexity. We establish relationships between available aggregate physical memory in a parallel computer system and ineffective fusion configurations, enabling their pruning and consequent identification of effective choices and a characterization of optimality criteria. This work has resulted in the development of a significantly improved implementation of the four-index transform that enables higher performance and the ability to model larger electronic systems than the current implementation in the NWChem quantum chemistry software suite.
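    Concretely, the four-contraction factorization described above has the following mathematical structure (a plain NumPy sketch that ignores the permutation symmetry, fusion, tiling, and distribution the paper addresses):

        import numpy as np

        def four_index_transform(A, C):
            """B[p,q,r,s] = sum_{a,b,c,d} C[a,p] C[b,q] C[c,r] C[d,s] A[a,b,c,d],
            factored into four contractions with the transformation matrix C."""
            T1 = np.einsum('abcd,ap->pbcd', A, C)    # intermediates T1..T3 are the
            T2 = np.einsum('pbcd,bq->pqcd', T1, C)   # large temporaries that fusion
            T3 = np.einsum('pqcd,cr->pqrd', T2, C)   # and tiling aim to shrink
            return np.einsum('pqrd,ds->pqrs', T3, C)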

    Multilayered abstractions for partial differential equations

    How do we build maintainable, robust, and performance-portable scientific applications? This thesis argues that, in the context of the finite element method, the answer to this software engineering question is the use of layers of Domain-Specific Languages (DSLs) to separate the various concerns in the engineering of such codes. Performance-portable software achieves high performance on multiple diverse hardware platforms without source code changes. We demonstrate that finite element solvers written in a low-level language are not performance-portable, and that code must therefore be specialised to the target architecture by a code generation framework. A prototype compiler for finite element variational forms that generates CUDA code is presented and used to explore how good performance can be achieved in automatically generated finite element applications on many-core platforms. The differing code generation requirements of multi- and many-core platforms motivate the design of an additional abstraction, called PyOP2, that enables unstructured mesh applications to be performance-portable. We present a runtime code generation framework comprising the Unified Form Language (UFL), the FEniCS Form Compiler, and PyOP2. This toolchain separates the succinct expression of a numerical method from the selection and generation of efficient code for local assembly. This is further decoupled from the selection of data formats and algorithms for efficient parallel implementation on a specific target architecture. We establish the successful separation of these concerns by demonstrating the performance portability of code generated from a single high-level source written in UFL across sequential C, CUDA, MPI, and OpenMP targets. The performance of the generated code exceeds that of comparable alternative toolchains on multi-core architectures.
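    For a sense of the "succinct expression" layer, a Poisson problem in legacy UFL syntax looks like the following (newer UFL releases attach elements to a mesh, so details vary by version; this is a sketch, not code from the thesis):

        from ufl import (FiniteElement, TrialFunction, TestFunction, Coefficient,
                         triangle, grad, inner, dx)

        element = FiniteElement("Lagrange", triangle, 1)
        u = TrialFunction(element)     # unknown
        v = TestFunction(element)      # test function
        f = Coefficient(element)       # right-hand side data

        a = inner(grad(u), grad(v)) * dx   # bilinear form: stiffness operator
        L = f * v * dx                     # linear form: load

    The form compiler and PyOP2 then take over: the same source can be lowered to C, CUDA, MPI, or OpenMP code without the author changing a line.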

    The Tensor Networks Anthology: Simulation techniques for many-body quantum lattice systems

    We present a compendium of numerical simulation techniques, based on tensor network methods, aimed at addressing problems of many-body quantum mechanics on a classical computer. The core setting of this anthology is lattice problems in low spatial dimension at finite size, a physical scenario where tensor network methods, both Density Matrix Renormalization Group and beyond, have long proven to be winning strategies. Here we explore in detail the numerical frameworks and methods employed to deal with low-dimensional physical setups, from a computational physics perspective. We focus on symmetries and closed-system simulations in arbitrary boundary conditions, while discussing the numerical data structures and linear algebra manipulation routines involved, which form the core libraries of any tensor network code. At a higher level, we put the spotlight on loop-free network geometries, discussing their advantages and presenting in detail algorithms to simulate low-energy equilibrium states. Accompanied by discussions of data structures, numerical techniques, and performance, this anthology serves as a programmer's companion, as well as a self-contained introduction to and review of the basic and selected advanced concepts in tensor networks, including examples of their applications.
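    For a flavour of the linear algebra routines such codes are built from, here is a minimal NumPy sketch (our own illustration, not from the anthology) that contracts a matrix product state with its conjugate to compute its squared norm:

        import numpy as np

        def mps_norm_sq(tensors):
            """<psi|psi> for a matrix product state given as a list of rank-3
            tensors with shape (D_left, d, D_right); boundary dimensions are 1."""
            E = np.ones((1, 1))  # running transfer-matrix contraction
            for A in tensors:
                # absorb one site: E'[b,B] = sum_{a,A',s} E[a,A'] A[a,s,b] conj(A)[A',s,B]
                E = np.einsum('aA,asb,AsB->bB', E, A, A.conj())
            return E[0, 0].real

        # Two-site random MPS with bond dimension 3 and physical dimension 2:
        sites = [np.random.rand(1, 2, 3), np.random.rand(3, 2, 1)]
        print(mps_norm_sq(sites))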

    Automated cache optimisations of stencil computations for partial differential equations

    This thesis focuses on numerical methods that solve partial differential equations. Our focal point is the finite difference method, which solves partial differential equations by approximating derivatives with explicit finite differences. These partial differential equation solvers consist of stencil computations on structured grids. Stencils for computing real-world practical applications are patterns often characterised by many memory accesses and non-trivial arithmetic expressions, leading to high computational costs compared to the simple stencils used in much prior proof-of-concept work. In addition, the loop nests that express stencils on structured grids may often be complicated. This work is highly motivated by a specific domain of stencil computations in which one of the challenges is operations that are not aligned to the structured grid ("off-the-grid" operations). These operations update neighbouring grid points through scatter and gather operations via non-affine memory accesses, such as A[B[i]]. In addition to this challenge, these practical stencils often include many computation fields (requiring multiple grid copies to be stored), complex data dependencies, and imperfect loop nests. In this work, we aim to increase the performance of stencil kernel execution. We study automated cache-memory-dependent optimisations for stencil computations. This work consists of two core parts with their respective contributions. The first part of our work tries to reduce the data movement in stencil computations of practical interest. Data movement is a dominant factor affecting the performance of high-performance computing applications. It has long been a target of optimisations due to its impact on execution time and energy consumption. This thesis tries to relieve this cost by applying temporal blocking optimisations, also known as time-tiling, to stencil computations. Temporal blocking is a well-known technique to enhance data reuse in stencil computations. However, it is rarely used in practical applications, appearing instead in theoretical examples that prove its efficacy, since applying temporal blocking to scientific simulations is more complex. More specifically, in this work we focus on the application context of seismic and medical imaging. In this area, we often encounter scatter and gather operations due to signal sources and receivers at arbitrary locations in the computational domain. These operations make the application of temporal blocking challenging. We present an approach to overcome this challenge and successfully apply temporal blocking. In the second part of our work, we extend the first part into an automated approach targeting a wide range of simulations modelled with partial differential equations. Since temporal blocking is error-prone, tedious to apply by hand, and highly complex to assimilate theoretically and practically, we are motivated to automate its application and automatically generate code that benefits from it. We discuss algorithmic approaches and present a generalised compiler pipeline to automate the application of temporal blocking. These passes are written in the Devito compiler. They are used to accelerate the computation of stencil kernels in areas such as seismic and medical imaging, computational fluid dynamics, and machine learning. Devito (www.devitoproject.org) is a Python package for implementing optimised stencil computations (e.g., finite differences, image processing, machine learning) from high-level symbolic problem definitions. Devito builds on SymPy (www.sympy.org) and employs automated code generation and just-in-time compilation to execute optimised computational kernels on several computer platforms, including CPUs, GPUs, and clusters thereof. We show how we automate temporal blocking code generation without user intervention and often achieve better time-to-solution. We enable domain-specific optimisation through compiler passes and offer temporal blocking gains from a high-level symbolic abstraction. These automated optimisations benefit various computational kernels for solving real-world application problems.
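    The level of abstraction Devito operates at looks roughly like this (a minimal sketch using the public API; the stencil and coefficient are our own illustrative choices, and enabling the thesis's temporal-blocking passes may require compiler options not shown here):

        from devito import Grid, TimeFunction, Eq, Operator

        # A 2D diffusion-style stencil written symbolically; Devito generates
        # and JIT-compiles optimised C for it behind the scenes.
        grid = Grid(shape=(256, 256))
        u = TimeFunction(name='u', grid=grid, space_order=2)

        step = Eq(u.forward, u + 0.1 * u.laplace)  # u^{n+1} = u^n + 0.1 * lap(u^n)

        op = Operator([step])
        op.apply(time_M=100)  # run 100 time steps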

    Kinetically Consistent Thermal Lattice Boltzmann Models

    The lattice Boltzmann (LB) method has developed into a numerically robust and efficient technique for simulating a wide variety of complex fluid flows. Unlike conventional CFD methods, the LB method is based on microscopic models and mesoscopic kinetic equations in which the collective long-term behavior of pseudo-particles is used to simulate the hydrodynamic limit of a system. Due to its kinetic basis, the LB method is particularly useful in applications involving interfacial dynamics and complex boundaries, such as multiphase or multicomponent flows. However, most LB models, both single and multiphase, do not satisfy the energy conservation principle, limiting their ability to provide quantitatively accurate predictions for cases with substantial heat transfer rates. To address this issue, this dissertation focuses on developing kinetically consistent and energy-conserving LB models for single-phase flows in particular. Firstly, through a procedure similar to the Galerkin method, we present a mathematical formulation of the LB method based on the concept of projecting the distributions onto a Hermite-polynomial basis and systematically truncating them. This formulation is shown to be capable of approximating the near-incompressible, weakly compressible, and fully compressible (thermal) limits of the continuous Boltzmann equation, thus obviating the previous low-Mach-number assumption. Physically, this means that the formulation allows a kinetically accurate description of flows involving large heat transfer rates. The various higher-order discrete-velocity sets (lattices) that follow from this formulation are also compiled. The resulting higher-order thermal model is validated for benchmark thermal flows, such as Rayleigh-Benard convection and thermal Couette flow, in an off-lattice framework. Our tests indicate that the D2Q39-based thermal models are capable of accurately modeling incompressible and weakly compressible thermal flows. In the validation process, through a finite-difference-type boundary treatment, we also extend the applicability of higher-order lattices to flow domains with solid boundaries, which was previously restricted. Secondly, we present various off-lattice time-marching schemes for solving the discrete Boltzmann equation. Specifically, the various temporal schemes are analyzed with respect to their numerical stability as a function of the maximum allowable time-step. We show that the characteristics-based temporal schemes offer the best numerical stability among all comparable schemes. Due to this enhanced numerical stability, the usual time-step restriction no longer applies, enabling larger time-steps and thereby reducing the computational run-time. The off-lattice schemes were also successfully extended to higher-order LB models. Finally, we present the algorithm and single-core optimization techniques for an off-lattice, higher-order LB code. Using simple cache optimization techniques and a proper choice of data structure, we obtain a 5-7X improvement in performance compared to a naive, unoptimized code. Thereafter, the optimized code is parallelized using OpenMP. Scalability tests indicate a parallel efficiency of 80% on shared-memory systems with up to 50 cores (strong scaling). An analysis of the higher-order LB models also shows that they are less memory-bound if the off-lattice temporal schemes are used, making them more scalable than stream-collide-type schemes.
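    For orientation, a standard isothermal D2Q9 BGK collide-and-stream step looks like the following NumPy sketch (our own illustration; the dissertation's higher-order thermal lattices such as D2Q39 and its off-lattice temporal schemes replace both the velocity set and the streaming step shown here):

        import numpy as np

        # D2Q9 lattice: discrete velocities and the standard weights
        c = np.array([[0,0],[1,0],[0,1],[-1,0],[0,-1],[1,1],[-1,1],[-1,-1],[1,-1]])
        w = np.array([4/9] + [1/9]*4 + [1/36]*4)

        def equilibrium(rho, ux, uy):
            cu = 3.0 * (c[:, 0, None, None] * ux + c[:, 1, None, None] * uy)
            usq = 1.5 * (ux**2 + uy**2)
            return rho * w[:, None, None] * (1.0 + cu + 0.5 * cu**2 - usq)

        def collide_stream(f, tau):
            rho = f.sum(axis=0)                              # density
            ux = (f * c[:, 0, None, None]).sum(axis=0) / rho  # momentum / density
            uy = (f * c[:, 1, None, None]).sum(axis=0) / rho
            f = f - (f - equilibrium(rho, ux, uy)) / tau      # BGK collision
            for i in range(9):                                # periodic streaming
                f[i] = np.roll(f[i], shift=(c[i, 0], c[i, 1]), axis=(0, 1))
            return f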

    Optimization and validation of discontinuous Galerkin Code for the 3D Navier-Stokes equations

    Thesis (S.M.), Massachusetts Institute of Technology, Department of Aeronautics and Astronautics, 2011. From residual and Jacobian assembly to the linear solve, the components of a high-order Discontinuous Galerkin Finite Element Method (DGFEM) for the Navier-Stokes equations in 3D are presented. Emphasis is given to residual and Jacobian assembly, since these are rarely discussed in the literature; in particular, this thesis focuses on code optimization. Performance properties of DG methods are identified, including key memory bottlenecks. A detailed overview of the memory hierarchy on modern CPUs is given, along with a discussion of optimization suggestions for utilizing the hierarchy efficiently. Other programming suggestions are also given, including the process for rewriting residual and Jacobian assembly using matrix-matrix products. Finally, a validation of the performance of the 3D viscous DG solver is presented through a series of canonical test cases.
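    The matrix-matrix rewriting mentioned above exploits the fact that the basis is shared across elements of the same type and order; a minimal NumPy sketch of the idea (our own illustration, not the thesis's code):

        import numpy as np

        def evaluate_states(Phi, U):
            """Phi: basis values at quadrature points, shape (n_quad, n_basis),
            identical for every element. U: modal coefficients, one column per
            element, shape (n_basis, n_elems)."""
            # Naive: one matrix-vector product per element (many small BLAS-2 calls):
            #   np.stack([Phi @ U[:, e] for e in range(U.shape[1])], axis=1)
            # Rewritten: one matrix-matrix product covers all elements at once,
            # a single cache-friendly BLAS-3 call.
            return Phi @ U  # shape (n_quad, n_elems)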