655 research outputs found

    Acceleration of a Full-scale Industrial CFD Application with OP2

    Get PDF

    Redesigning OP2 Compiler to Use HPX Runtime Asynchronous Techniques

    Full text link
    Maximizing parallelism level in applications can be achieved by minimizing overheads due to load imbalances and waiting time due to memory latencies. Compiler optimization is one of the most effective solutions to tackle this problem. The compiler is able to detect the data dependencies in an application and is able to analyze the specific sections of code for parallelization potential. However, all of these techniques provided with a compiler are usually applied at compile time, so they rely on static analysis, which is insufficient for achieving maximum parallelism and producing desired application scalability. One solution to address this challenge is the use of runtime methods. This strategy can be implemented by delaying certain amount of code analysis to be done at runtime. In this research, we improve the parallel application performance generated by the OP2 compiler by leveraging HPX, a C++ runtime system, to provide runtime optimizations. These optimizations include asynchronous tasking, loop interleaving, dynamic chunk sizing, and data prefetching. The results of the research were evaluated using an Airfoil application which showed a 40-50% improvement in parallel performance.Comment: 18th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2017

    Compiler and Runtime Optimization Techniques for Implementation Scalable Parallel Applications

    Get PDF
    The compiler is able to detect the data dependencies in an application and is able to analyze the specific sections of code for parallelization potential. However, all of these techniques provided by a compiler are usually applied at compile time, so they rely on static analysis, which is insufficient for achieving maximum parallelism and desired application scalability. These compiler techniques should consider both the static information gathered at compile time and dynamic analysis captured at runtime about the system to generate a safe parallel application. On the other hand, runtime information is often speculative. Solely relying on it doesn\u27t guarantee maximal parallel performance. So collecting information at compile time could significantly improve the runtime techniques performance. The goal is achieved in this research by introducing new techniques proposed for both compiler and runtime system that enable them to contribute with each other and utilize both static and dynamic analysis information to maximize application parallel performance. In the proposed framework, a compiler can implement dynamic runtime methods in its parallelization optimizations and a runtime system can apply static information in its parallelization methods implementation. The proposed techniques are able to use high-level programming abstractions and machine learning to relieve the programmer of difficult and tedious decisions that can significantly affect program behavior and performance

    Analysis and Optimization of Scientific Applications through Set and Relation Abstractions

    Get PDF
    Writing high performance code has steadily become more challenging since the design of computing systems has moved toward parallel processors in forms of multi and many-core architectures. This trend has resulted in exceedingly more heterogeneous architectures and programming models. Moreover, the prevalence of distributed systems, especially in fields relying on supercomputers, has caused the programming of such diverse environment more difficulties. To mitigate such challenges, an assortment of tools and programming models have been introduced in the past decade or so. Some efforts focused on the characteristics of the code, such as polyhedral compilers (e.g. Pluto, PPCG, etc.) while others took in consideration the aspects of the application domain and proposed domain specific languages (DSLs). DSLs are developed either in the form of a stand-alone language, like Halide for image processing, or as a part of a general purpose language (e.g., Firedrake- a DSL embedded in Python for solving PDEs using FEM.) called embedded. All these approaches attempt to provide the best input to the underlying common programming models like MPI and OpenMP for distributed and shared memory systems respectively. This dissertation introduces Kaashi, a high-level run-time system, embedded in C++ language, designed to manage memory and execution order of programs with large input data and complex dependencies. Kaashi provides a uniform front-end to multiple back-ends focusing on distributed systems. Kaashi abstractions allows the programmer to define the problem’s data domain as a collection of sets and relations between pairs of such sets. The aforesaid level of abstraction could enable series of optimizations which, otherwise, are very expensive to detect or not feasible at all. Furthermore, Kaashi’s API helps novice programmers to write their code more structurally without getting involved in details of data management and communication

    Multilayered abstractions for partial differential equations

    Get PDF
    How do we build maintainable, robust, and performance-portable scientific applications? This thesis argues that the answer to this software engineering question in the context of the finite element method is through the use of layers of Domain-Specific Languages (DSLs) to separate the various concerns in the engineering of such codes. Performance-portable software achieves high performance on multiple diverse hardware platforms without source code changes. We demonstrate that finite element solvers written in a low-level language are not performance-portable, and therefore code must be specialised to the target architecture by a code generation framework. A prototype compiler for finite element variational forms that generates CUDA code is presented, and is used to explore how good performance on many-core platforms in automatically-generated finite element applications can be achieved. The differing code generation requirements for multi- and many-core platforms motivates the design of an additional abstraction, called PyOP2, that enables unstructured mesh applications to be performance-portable. We present a runtime code generation framework comprised of the Unified Form Language (UFL), the FEniCS Form Compiler, and PyOP2. This toolchain separates the succinct expression of a numerical method from the selection and generation of efficient code for local assembly. This is further decoupled from the selection of data formats and algorithms for efficient parallel implementation on a specific target architecture. We establish the successful separation of these concerns by demonstrating the performance-portability of code generated from a single high-level source code written in UFL across sequential C, CUDA, MPI and OpenMP targets. The performance of the generated code exceeds the performance of comparable alternative toolchains on multi-core architectures.Open Acces

    Vectorizing unstructured mesh computations for many-core architectures.

    Get PDF
    Achieving optimal performance on the latest multi-core and many-core architectures increasingly depends on making efficient use of the hardware's vector units. This paper presents results on achieving high performance through vectorization on CPUs and the Xeon-Phi on a key class of irregular applications: unstructured mesh computations. Using single instruction multiple thread (SIMT) and single instruction multiple data (SIMD) programming models, we show how unstructured mesh computations map to OpenCL or vector intrinsics through the use of code generation techniques in the OP2 Domain Specific Library and explore how irregular memory accesses and race conditions can be organized on different hardware. We benchmark Intel Xeon CPUs and the Xeon-Phi, using a tsunami simulation and a representative CFD benchmark. Results are compared with previous work on CPUs and NVIDIA GPUs to provide a comparison of achievable performance on current many-core systems. We show that auto-vectorization and the OpenCL SIMT model do not map efficiently to CPU vector units because of vectorization issues and threading overheads. In contrast, using SIMD vector intrinsics imposes some restrictions and requires more involved programming techniques but results in efficient code and near-optimal performance, two times faster than non-vectorized code. We observe that the Xeon-Phi does not provide good performance for these applications but is still comparable with a pair of mid-range Xeon chips

    Devito: Towards a generic Finite Difference DSL using Symbolic Python

    Full text link
    Domain specific languages (DSL) have been used in a variety of fields to express complex scientific problems in a concise manner and provide automated performance optimization for a range of computational architectures. As such DSLs provide a powerful mechanism to speed up scientific Python computation that goes beyond traditional vectorization and pre-compilation approaches, while allowing domain scientists to build applications within the comforts of the Python software ecosystem. In this paper we present Devito, a new finite difference DSL that provides optimized stencil computation from high-level problem specifications based on symbolic Python expressions. We demonstrate Devito's symbolic API and performance advantages over traditional Python acceleration methods before highlighting its use in the scientific context of seismic inversion problems.Comment: pyHPC 2016 conference submissio
    • …
    corecore