51 research outputs found

    Automated cache optimisations of stencil computations for partial differential equations

    Get PDF
    This thesis focuses on numerical methods that solve partial differential equations. Our focal point is the finite difference method, which solves partial differential equations by approximating derivatives with explicit finite differences. These partial differential equation solvers consist of stencil computations on structured grids. Stencils for computing real-world practical applications are patterns often characterised by many memory accesses and non-trivial arithmetic expressions that lead to high computational costs compared to simple stencils used in much prior proof-of-concept work. In addition, the loop nests to express stencils on structured grids may often be complicated. This work is highly motivated by a specific domain of stencil computations where one of the challenges is non-aligned to the structured grid ("off-the-grid") operations. These operations update neighbouring grid points through scatter and gather operations via non-affine memory accesses, such as {A[B[i]]}. In addition to this challenge, these practical stencils often include many computation fields (need to store multiple grid copies), complex data dependencies and imperfect loop nests. In this work, we aim to increase the performance of stencil kernel execution. We study automated cache-memory-dependent optimisations for stencil computations. This work consists of two core parts with their respective contributions.The first part of our work tries to reduce the data movement in stencil computations of practical interest. Data movement is a dominant factor affecting the performance of high-performance computing applications. It has long been a target of optimisations due to its impact on execution time and energy consumption. This thesis tries to relieve this cost by applying temporal blocking optimisations, also known as time-tiling, to stencil computations. Temporal blocking is a well-known technique to enhance data reuse in stencil computations. However, it is rarely used in practical applications but rather in theoretical examples to prove its efficacy. Applying temporal blocking to scientific simulations is more complex. More specifically, in this work, we focus on the application context of seismic and medical imaging. In this area, we often encounter scatter and gather operations due to signal sources and receivers at arbitrary locations in the computational domain. These operations make the application of temporal blocking challenging. We present an approach to overcome this challenge and successfully apply temporal blocking.In the second part of our work, we extend the first part as an automated approach targeting a wide range of simulations modelled with partial differential equations. Since temporal blocking is error-prone, tedious to apply by hand and highly complex to assimilate theoretically and practically, we are motivated to automate its application and automatically generate code that benefits from it. We discuss algorithmic approaches and present a generalised compiler pipeline to automate the application of temporal blocking. These passes are written in the Devito compiler. They are used to accelerate the computation of stencil kernels in areas such as seismic and medical imaging, computational fluid dynamics and machine learning. \href{www.devitoproject.org}{Devito} is a Python package to implement optimised stencil computation (e.g., finite differences, image processing, machine learning) from high-level symbolic problem definitions. Devito builds on \href{www.sympy.org}{SymPy} and employs automated code generation and just-in-time compilation to execute optimised computational kernels on several computer platforms, including CPUs, GPUs, and clusters thereof. We show how we automate temporal blocking code generation without user intervention and often achieve better time-to-solution. We enable domain-specific optimisation through compiler passes and offer temporal blocking gains from a high-level symbolic abstraction. These automated optimisations benefit various computational kernels for solving real-world application problems.Open Acces

    Task-based Runtime Optimizations Towards High Performance Computing Applications

    Get PDF
    The last decades have witnessed a rapid improvement of computational capabilities in high-performance computing (HPC) platforms thanks to hardware technology scaling. HPC architectures benefit from mainstream advances on the hardware with many-core systems, deep hierarchical memory subsystem, non-uniform memory access, and an ever-increasing gap between computational power and memory bandwidth. This has necessitated continuous adaptations across the software stack to maintain high hardware utilization. In this HPC landscape of potentially million-way parallelism, task-based programming models associated with dynamic runtime systems are becoming more popular, which fosters developers’ productivity at extreme scale by abstracting the underlying hardware complexity. In this context, this dissertation highlights how a software bundle powered by a task-based programming model can address the heterogeneous workloads engendered by HPC applications., i.e., data redistribution, geospatial modeling and 3D unstructured mesh deformation here. Data redistribution aims to reshuffle data to optimize some objective for an algorithm, whose objective can be multi-dimensional, such as improving computational load balance or decreasing communication volume or cost, with the ultimate goal of increasing the efficiency and therefore reducing the time-to-solution for the algorithm. Geostatistical modeling, one of the prime motivating applications for exascale computing, is a technique for predicting desired quantities from geographically distributed data, based on statistical models and optimization of parameters. Meshing the deformable contour of moving 3D bodies is an expensive operation that can cause huge computational challenges in fluid-structure interaction (FSI) applications. Therefore, in this dissertation, Redistribute-PaRSEC, ExaGeoStat-PaRSEC and HiCMA-PaRSEC are proposed to efficiently tackle these HPC applications respectively at extreme scale, and they are evaluated on multiple HPC clusters, including AMD-based, Intel-based, Arm-based CPU systems and IBM-based multi-GPU system. This multidisciplinary work emphasizes the need for runtime systems to go beyond their primary responsibility of task scheduling on massively parallel hardware system for servicing the next-generation scientific applications

    Practical synthesis from real-world oracles

    Get PDF
    As software systems become increasingly heterogeneous, the ability of compilers to reason about an entire system has decreased. When components of a system are not implemented as traditional programs, but rather as specialised hardware, optimised architecture-specific libraries, or network services, the compiler is unable to cross these abstraction barriers and analyse the system as a whole. If these components could be modelled or understood as programs, then the compiler would be able to reason about their behaviour without concern for their internal implementation details: a homogeneous view of the entire system would be afforded. However, it is not often the case that such components ever corresponded to an original program. This means that to facilitate this homogenenous analysis, programmatic models of component behaviour must be learned or constructed automatically. Constructing these models is an inductive program synthesis problem, albeit a challenging one that is largely beyond the ability of existing implementations. In order for the problem to be made tractable, information provided by the underlying context (i.e. the real component behaviour to be matched) must be integrated. This thesis presents three program synthesis approaches that integrate contextual information to synthesise programmatic models for real, existing components. The first, Annote, exploits informally-encoded information about a component's interface (e.g. from documentation) by weaving that information into an extended type-and-attribute system for component interfaces. The second, Presyn, learns a pair of cooperating probabilistic models from prior syntheses, that aim to predict likely program structure based on a component's interface. Finally, Haze uses observations of common side-effects of component executions to bias the search for programs. These approaches are each evaluated against comparable synthesisers from the literature, on a set of benchmark problems derived from real components. Learning models for component behaviour is only a partial solution; the compiler must also have some mechanism to use those models for program analysis and transformation. This thesis additionally proposes a novel mechanism for context-sensitive automatic API migration based on synthesised programmatic models, and evaluates the effectiveness of doing so on real application code. In summary, this thesis proposes a new framing for program synthesis problems that target the behaviour of real components, and demonstrates three different potential approaches to synthesis in this spirit. The success of these approaches is evaluated against implementations from the literature, and their results used to drive a novel API migration technique

    A domain-extensible compiler with controllable automation of optimisations

    Get PDF
    In high performance domains like image processing, physics simulation or machine learning, program performance is critical. Programmers called performance engineers are responsible for the challenging task of optimising programs. Two major challenges prevent modern compilers targeting heterogeneous architectures from reliably automating optimisation. First, domain specific compilers such as Halide for image processing and TVM for machine learning are difficult to extend with the new optimisations required by new algorithms and hardware. Second, automatic optimisation is often unable to achieve the required performance, and performance engineers often fall back to painstaking manual optimisation. This thesis shows the potential of the Shine compiler to achieve domain-extensibility, controllable automation, and generate high performance code. Domain-extensibility facilitates adapting compilers to new algorithms and hardware. Controllable automation enables performance engineers to gradually take control of the optimisation process. The first research contribution is to add 3 code generation features to Shine, namely: synchronisation barrier insertion, kernel execution, and storage folding. Adding these features requires making novel design choices in terms of compiler extensibility and controllability. The rest of this thesis builds on these features to generate code with competitive runtime compared to established domain-specific compilers. The second research contribution is to demonstrate how extensibility and controllability are exploited to optimise a standard image processing pipeline for corner detection. Shine achieves 6 well-known image processing optimisations, 2 of them not being supported by Halide. Our results on 4 ARM multi-core CPUs show that the code generated by Shine for corner detection runs up to 1.4× faster than the Halide code. However, we observe that controlling rewriting is tedious, motivating the need for more automation. The final research contribution is to introduce sketch-guided equality saturation, a semiautomated technique that allows performance engineers to guide program rewriting by specifying rewrite goals as sketches: program patterns that leave details unspecified. We evaluate this approach by applying 7 realistic optimisations of matrix multiplication. Without guidance, the compiler fails to apply the 5 most complex optimisations even given an hour and 60GB of RAM. With the guidance of at most 3 sketch guides, each 10 times smaller than the complete program, the compiler applies the optimisations in seconds using less than 1GB

    Partial aggregation for collective communication in distributed memory machines

    Get PDF
    High Performance Computing (HPC) systems interconnect a large number of Processing Elements (PEs) in high-bandwidth networks to simulate complex scientific problems. The increasing scale of HPC systems poses great challenges on algorithm designers. As the average distance between PEs increases, data movement across hierarchical memory subsystems introduces high latency. Minimizing latency is particularly challenging in collective communications, where many PEs may interact in complex communication patterns. Although collective communications can be optimized for network-level parallelism, occasional synchronization delays due to dependencies in the communication pattern degrade application performance. To reduce the performance impact of communication and synchronization costs, parallel algorithms are designed with sophisticated latency hiding techniques. The principle is to interleave computation with asynchronous communication, which increases the overall occupancy of compute cores. However, collective communication primitives abstract parallelism which limits the integration of latency hiding techniques. Approaches to work around these limitations either modify the algorithmic structure of application codes, or replace collective primitives with verbose low-level communication calls. While these approaches give fine-grained control for latency hiding, implementing collective communication algorithms is challenging and requires expertise knowledge about HPC network topologies. A collective communication pattern is commonly described as a Directed Acyclic Graph (DAG) where a set of PEs, represented as vertices, resolve data dependencies through communication along the edges. Our approach improves latency hiding in collective communication through partial aggregation. Based on mathematical rules of binary operations and homomorphism, we expose data parallelism in a respective DAG to overlap computation with communication. The proposed concepts are implemented and evaluated with a subset of collective primitives in the Message Passing Interface (MPI), an established communication standard in scientific computing. An experimental analysis with communication-bound microbenchmarks shows considerable performance benefits for the evaluated collective primitives. A detailed case study with a large-scale distributed sort algorithm demonstrates, how partial aggregation significantly improves performance in data-intensive scenarios. Besides better latency hiding capabilities with collective communication primitives, our approach enables further optimizations of their implementations within MPI libraries. The vast amount of asynchronous programming models, which are actively studied in the HPC community, benefit from partial aggregation in collective communication patterns. Future work can utilize partial aggregation to improve the interaction of MPI collectives with acclerator architectures, and to design more efficient communication algorithms

    The Paragraph: Design and Implementation of the STAPL Parallel Task Graph

    Get PDF
    Parallel programming is becoming mainstream due to the increased availability of multiprocessor and multicore architectures and the need to solve larger and more complex problems. Languages and tools available for the development of parallel applications are often difficult to learn and use. The Standard Template Adaptive Parallel Library (STAPL) is being developed to help programmers address these difficulties. STAPL is a parallel C++ library with functionality similar to STL, the ISO adopted C++ Standard Template Library. STAPL provides a collection of parallel pContainers for data storage and pViews that provide uniform data access operations by abstracting away the details of the pContainer data distribution. Generic pAlgorithms are written in terms of PARAGRAPHs, high level task graphs expressed as a composition of common parallel patterns. These task graphs define a set of operations on pViews as well as any ordering (i.e., dependences) on these operations that must be enforced by STAPL for a valid execution. The subject of this dissertation is the PARAGRAPH Executor, a framework that manages the runtime instantiation and execution of STAPL PARAGRAPHS. We address several challenges present when using a task graph program representation and discuss a novel approach to dependence specification which allows task graph creation and execution to proceed concurrently. This overlapping increases scalability and reduces the resources required by the PARAGRAPH Executor. We also describe the interface for task specification as well as optimizations that address issues such as data locality. We evaluate the performance of the PARAGRAPH Executor on several parallel machines including massively parallel Cray XT4 and Cray XE6 systems and an IBM Power5 cluster. Using tests including generic parallel algorithms, kernels from the NAS NPB suite, and a nuclear particle transport application written in STAPL, we demonstrate that the PARAGRAPH Executor enables STAPL to exhibit good scalability on more than 10410^4 processors

    Parallel Program Composition with Paragraphs in Stapl

    Get PDF
    Languages and tools currently available for the development of parallel applications are difficult to learn and use. The Standard Template Adaptive Parallel Library (STAPL) is being developed to make it easier for programmers to implement a parallel application. STAPL is a parallel programming library for C++ that adopts the generic programming philosophy of the C++ Standard Template Library. STAPL provides collections of parallel algorithms (pAlgorithms) and containers (pContainers) that allow a developer to write their application without reimplementing the algorithms and data structures commonly used in parallel computing. pViews in STAPL are abstract data types that provide generic data access operations independently of the type of pContainer used to store the data. Algorithms and applications have a formal, high level representation in STAPL. A computation in STAPL is represented as a parallel task graph, which we call a PARAGRAPH. A PARAGRAPH contains a representation of the algorithm's input data, the operations that are used to transform individual data elements, and the ordering between the application of operations that transform the same data element. Just as programs are the result of a composition of algorithms, STAPL programs are the result of a composition of PARAGRAPHs. This dissertation develops the PARAGRAPH program representation and its compositional methods. PARAGRAPHs improve the developer's difficult situation by simplifying what she must specify when writing a parallel algorithm. The performance of the PARAGRAPH is evaluated using parallel generic algorithms, benchmarks from the NAS suite, and a nuclear particle transport application that has been written using STAPL. Our experiments were performed on Cray XT4 and Cray XE6 massively parallel systems and an IBM Power5 cluster, and show that scalable performance beyond 16,000 processors is possible using the PARAGRAPH

    Code generation for 3D partial differential equation models from a high-level functional intermediate language

    Get PDF
    Partial Differential Equation (PDE) modelling is an important tool in scientific domains for bridging theory with reality; however, they can be complex to program and even more difficult to abstract. The evolving parallel computing landscape is also making it increasingly difficult to write and maintain codes (such as PDE models) which retain performance across different parallel platforms. Computational scientists should be able to focus on their science instead of also having to become high performance computing experts in order to take advantage of faster parallel hardware. Current methods targeting this problem either concentrate on very niche applications, are too simplistic for real world problems or are too low-level to be easily programmable. Domain Specific Languages (DSLs) are a popular approach, but they have two opposing goals: improving programmability, while also providing high performance. This thesis presents a solution for developing performance portable 3D PDE models, using room acoustics simulations as a case study, by raising the abstraction level in the existing hardware-agnostic, intermediary language LIFT. This functional language and compiler is designed for DSLs to compile into and provides a separation of concerns for developing parallel applications. This separation enables DSL writers to focus on developing high-level abstractions providing productivity to the user, while LIFT turns the intermediary parallel representation these abstractions compile down to into hardware-optimised code. A suite of composable, algorithmic primitives enables LIFT to reuse functionality across domains and an exploratory search space provides a way to find the best optimisations for a given platform. As this thesis shows, room acoustic simulations are expressible in LIFT with only a few small changes to the framework. These expressions are able to achieve comparable or better performance to original hand-written benchmarks. Furthermore, such expressions enable room acoustics models to run across multiple platforms and easily swap in optimisations. Being able to test out what optimisations give the best performance for a given platform — without rewriting or retuning — allows computational scientists to focus on their own work. Optimisations previously inaccessible in LIFT are developed that target 3D stencils generally, including 3D PDE models. In particular, 2.5D Tiling and compiler passes to inline private arrays and structs are added to the LIFT ecosystem, giving high performance to various 3D stencil codes. The 2.5D Tiling optimisation is coded functionally for the first time in LIFT and is selected automatically by additional rewrite rules. These rewrite rules, such as the one for 2.5D Tiling, are explored in a search space to find the best set of optimisations for an application on a given platform. Building on previous work, LIFT is extended to enable complex boundary conditions and room shapes for room acoustics models. This is the first intermediate representation in a high-level code generator to do so. Additionally, it is also the first high-level framework to support frequency-dependent boundary handling for room acoustics simulations. Combined, these contributions show that high-level abstractions for 3D PDE models are possible, enabling computational scientists to optimise and parallelise their codes more easily across different parallel platforms

    High-Performance Modelling and Simulation for Big Data Applications

    Get PDF
    This open access book was prepared as a Final Publication of the COST Action IC1406 “High-Performance Modelling and Simulation for Big Data Applications (cHiPSet)“ project. Long considered important pillars of the scientific method, Modelling and Simulation have evolved from traditional discrete numerical methods to complex data-intensive continuous analytical optimisations. Resolution, scale, and accuracy have become essential to predict and analyse natural and complex systems in science and engineering. When their level of abstraction raises to have a better discernment of the domain at hand, their representation gets increasingly demanding for computational and data resources. On the other hand, High Performance Computing typically entails the effective use of parallel and distributed processing units coupled with efficient storage, communication and visualisation systems to underpin complex data-intensive applications in distinct scientific and technical domains. It is then arguably required to have a seamless interaction of High Performance Computing with Modelling and Simulation in order to store, compute, analyse, and visualise large data sets in science and engineering. Funded by the European Commission, cHiPSet has provided a dynamic trans-European forum for their members and distinguished guests to openly discuss novel perspectives and topics of interests for these two communities. This cHiPSet compendium presents a set of selected case studies related to healthcare, biological data, computational advertising, multimedia, finance, bioinformatics, and telecommunications

    Minimising mathematical anxiety in teaching mathematics and assessing student’s work

    Get PDF
    This paper builds up a theoretical perspective and supports a possibility of creating a special assessment environment for students, where mathematical knowledge and understanding can be assessed with a reduced number of external psychological factors that may affect such assessment. A concept of a zone with minimal effect of anxiety is introduced and described. Students’ successful work on extending the zone by means of a carefully selected chain of questions, where some questions only are part of a real assessment, allows students to reconsider their attitudes towards mathematics and assist teachers to identify some students’ main learning difficulties as of psychological character. Further suggestions about developing and investigating special assessment environments are outlined and discussed
    • …
    corecore