6,213 research outputs found

    Tiling Optimizations for Stencil Computations Using Rewrite Rules in Lift

    Get PDF
    Stencil computations are a widely used type of algorithm, found in applications from physical simulations to machine learning. Stencils are embarrassingly parallel, therefore fit on modern hardware such as Graphic Processing Units perfectly. Although stencil computations have been extensively studied, optimizing them for increasingly diverse hardware remains challenging. Domain-specific Languages (DSLs) have raised the programming abstraction and offer good performance; however, this method places the burden on DSL implementers to write almost full-fledged parallelizing compilers and optimizers. Lift has recently emerged as a promising approach to achieve performance portability by using a small set of reusable parallel primitives that DSL or library writers utilize. Lift’s key novelty is in its encoding of optimizations as a system of extensible rewrite rules which are used to explore the optimization space. This article demonstrates how complex multi-dimensional stencil code and optimizations are expressed using compositions of simple 1D Lift primitives and rewrite rules. We introduce two optimizations that provide high performance for stencils in particular: classical overlapped tiling for multi-dimensional stencils and 2.5D tiling specifically for 3D stencils. We provide an in-depth analysis on how the tiling optimizations affects stencils of different shapes and sizes across different applications. Our experimental results show that our approach outperforms existing compiler approaches and hand-tuned codes

    Using the High Productivity Language Chapel to Target GPGPU Architectures

    Get PDF
    It has been widely shown that GPGPU architectures offer large performance gains compared to their traditional CPU counterparts for many applications. The downside to these architectures is that the current programming models present numerous challenges to the programmer: lower-level languages, explicit data movement, loss of portability, and challenges in performance optimization. In this paper, we present novel methods and compiler transformations that increase productivity by enabling users to easily program GPGPU architectures using the high productivity programming language Chapel. Rather than resorting to different parallel libraries or annotations for a given parallel platform, we leverage a language that has been designed from first principles to address the challenge of programming for parallelism and locality. This also has the advantage of being portable across distinct classes of parallel architectures, including desktop multicores, distributed memory clusters, large-scale shared memory, and now CPU-GPU hybrids. We present experimental results from the Parboil benchmark suite which demonstrate that codes written in Chapel achieve performance comparable to the original versions implemented in CUDA.NSF CCF 0702260Cray Inc. Cray-SRA-2010-016962010-2011 Nvidia Research Fellowshipunpublishednot peer reviewe

    High Performance Stencil Code Generation with LIFT

    Get PDF
    Stencil computations are widely used from physical simulations to machine-learning. They are embarrassingly parallel and perfectly fit modern hardware such as Graphic Processing Units. Although stencil computations have been extensively studied, optimizing them for increasingly diverse hardware remains challenging. Domain Specific Languages (DSLs) have raised the programming abstraction and offer good performance. However, this places the burden on DSL implementers who have to write almost full-fledged parallelizing compilers and optimizers. Lift has recently emerged as a promising approach to achieve performance portability and is based on a small set of reusable parallel primitives that DSL or library writers can build upon. Lift’s key novelty is in its encoding of optimizations as a system of extensible rewrite rules which are used to explore the optimization space. However, Lift has mostly focused on linear algebra operations and it remains to be seen whether this approach is applicable for other domains. This paper demonstrates how complex multidimensional stencil code and optimizations such as tiling are expressible using compositions of simple 1D Lift primitives. By leveraging existing Lift primitives and optimizations, we only require the addition of two primitives and one rewrite rule to do so. Our results show that this approach outperforms existing compiler approaches and hand-tuned codes

    PYDAC: A DISTRIBUTED RUNTIME SYSTEM AND PROGRAMMING MODEL FOR A HETEROGENEOUS MANY-CORE ARCHITECTURE

    Get PDF
    Heterogeneous many-core architectures that consist of big, fast cores and small, energy-efficient cores are very promising for future high-performance computing (HPC) systems. These architectures offer a good balance between single-threaded perfor- mance and multithreaded throughput. Such systems impose challenges on the design of programming model and runtime system. Specifically, these challenges include (a) how to fully utilize the chip’s performance, (b) how to manage heterogeneous, un- reliable hardware resources, and (c) how to generate and manage a large amount of parallel tasks. This dissertation proposes and evaluates a Python-based programming framework called PyDac. PyDac supports a two-level programming model. At the high level, a programmer creates a very large number of tasks, using the divide-and-conquer strategy. At the low level, tasks are written in imperative programming style. The runtime system seamlessly manages the parallel tasks, system resilience, and inter- task communication with architecture support. PyDac has been implemented on both an field-programmable gate array (FPGA) emulation of an unconventional het- erogeneous architecture and a conventional multicore microprocessor. To evaluate the performance, resilience, and programmability of the proposed system, several micro-benchmarks were developed. We found that (a) the PyDac abstracts away task communication and achieves programmability, (b) the micro-benchmarks are scalable on the hardware prototype, but (predictably) serial operation limits some micro-benchmarks, and (c) the degree of protection versus speed could be varied in redundant threading that is transparent to programmers

    Safe programming Languages for ABB Automation System 800xA

    Get PDF
    More than 90 % of all computers are embedded in different types of systems, for example mobile phones and industrial robots. Some of these systems are real-time systems; they have to produce their output within certain time constraints. They can also be safety critical; if something goes wrong, there is a risk that a great deal of damage is caused. Industrial Extended Automation System 800xA, developed by ABB, is a realtime control system intended for industrial use within a wide variety of applications where a certain focus on safety is required, for example power plants and oil platforms. The software is currently written in C and C++, languages that are not optimal from a safety point of view. In this master's thesis, it is investigated whether there are any plausible alternatives to using C/C++ for safety critical real-time systems. A number of requirements that programming languages used in this area have to fulfill are stated and it is evaluated if some candidate languages fulfill these requirements. The candidate languages, Java and Ada, are compared to C and C++. It is determined that the Java-to-C compiler LJRT (Lund Java-based Real Time) is a suitable alternative. The practical part of this thesis is concerned with the introduction of Java in 800xA. A module of the system is ported to Java and executed together with the original C/C++ solution. The functionality of the system is tested using a formal test suite and the performance and memory footprint of our solution is measured. The results show that it is possible to gradually introduce Java in 800xA using LJRT, which is the main contribution of this thesis

    Leveraging Semantics Attached to Function Calls to Isolate Applications from Hardware

    Get PDF
    International audienceTo improve performance, computer systems are forcing more microarchitectural and parallel hardware details to be directly exploited by application programmers, exposing limitations in existing compiler and OS infras- tructure, which is failing to maintain the software productivity of the past. In this paper we propose a prag- matic approach, motivated by our experience with BLIS [11], for building applications that tolerate changing hardware, delivering good performance from the same source across diverse parallel targets. Applications are coded in terms of generic parallel patterns using a “piggy back” language, embedded into a base sequential lan- uage by attaching semantics to function calls. Our approach allows programmers to leverage multiple pro- cessor-specific and domain-specific toolchains encapsulated in specialization modules, which extract their input information from the semantics of the function calls, creating an isolation layer bewteen application and target platforms. Developers use existing sequential development tools and languages to code and debug, performing specialization as a separate step when shipping the code. We show this approach can successfully specialize a single source to diverse and evolving heterogeneous multi-core targets and enable aggressive compiler optimiza- tions
    • 

    corecore