8 research outputs found

    Using the High Productivity Language Chapel to Target GPGPU Architectures

    Get PDF
    It has been widely shown that GPGPU architectures offer large performance gains compared to their traditional CPU counterparts for many applications. The downside to these architectures is that the current programming models present numerous challenges to the programmer: lower-level languages, explicit data movement, loss of portability, and challenges in performance optimization. In this paper, we present novel methods and compiler transformations that increase productivity by enabling users to easily program GPGPU architectures using the high productivity programming language Chapel. Rather than resorting to different parallel libraries or annotations for a given parallel platform, we leverage a language that has been designed from first principles to address the challenge of programming for parallelism and locality. This also has the advantage of being portable across distinct classes of parallel architectures, including desktop multicores, distributed memory clusters, large-scale shared memory, and now CPU-GPU hybrids. We present experimental results from the Parboil benchmark suite which demonstrate that codes written in Chapel achieve performance comparable to the original versions implemented in CUDA. NSF CCF 0702260; Cray Inc. Cray-SRA-2010-01696; 2010-2011 Nvidia Research Fellowship. Unpublished; not peer reviewed.
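
    For context, and not taken from the paper itself, the explicit data movement and lower-level boilerplate the abstract refers to looks roughly like the following CUDA sketch; the cited compiler work aims to let a single high-level Chapel loop replace all of it. Names such as vecAdd are hypothetical.

        // Hypothetical vector-add example (illustrative only): the explicit
        // allocation, host-device copies, and kernel launch that hand-written
        // CUDA requires and that the Chapel compiler transformations hide.
        #include <cuda_runtime.h>
        #include <cstdio>
        #include <cstdlib>

        __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) c[i] = a[i] + b[i];
        }

        int main() {
            const int n = 1 << 20;
            const size_t bytes = n * sizeof(float);
            float *ha = (float *)malloc(bytes);
            float *hb = (float *)malloc(bytes);
            float *hc = (float *)malloc(bytes);
            for (int i = 0; i < n; i++) { ha[i] = 1.0f; hb[i] = 2.0f; }

            float *da, *db, *dc;                                 // device buffers
            cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
            cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);   // explicit movement
            cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

            vecAdd<<<(n + 255) / 256, 256>>>(da, db, dc, n);     // explicit launch
            cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);

            printf("c[0] = %f\n", hc[0]);
            cudaFree(da); cudaFree(db); cudaFree(dc);
            free(ha); free(hb); free(hc);
            return 0;
        }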

    Compilation techniques and language support to facilitate dependence-driven computation

    Get PDF
    As the demand increases for high performance and power efficiency in modern computer runtime systems and architectures, programmers are left with the daunting challenge of fully exploiting these systems for efficiency, high-level expressiveness, and portability across different computing architectures. Emerging programming models such as the task-based runtime StarPU and many-core architectures such as GPUs force programmers to choose between low-level programming languages and putting complete faith in the compiler. As has been previously studied in extensive detail, both development approaches have their own respective trade-offs. The goal of this thesis is to help make parallel programming easier. It addresses these challenges by providing new compilation techniques for high-level programming languages that conform to commonly accepted paradigms in order to leverage these emerging runtime systems and architectures. In particular, this dissertation makes several contributions to these challenges by leveraging the high-level programming language Chapel in order to efficiently map computation and data onto both the task-based runtime system StarPU and GPU-based accelerators. Different loop-based parallel programs and experiments are evaluated in order to measure the effectiveness of the proposed compiler algorithms and their optimizations, while also providing programmability metrics when leveraging high-level languages. In order to exploit additional performance when mapping onto shared memory systems, this thesis proposes a set of compiler- and runtime-based heuristics that determine profitable processor tile shapes and sizes when mapping multiply-nested parallel loops. Finally, a new benchmark suite named P-Ray is presented. It provides machine characteristics in a portable manner that can be used by a compiler, an auto-tuning framework, or the programmer when optimizing applications.
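
    As a rough illustration, not drawn from the thesis, of what "tile shapes and sizes" means for a multiply-nested parallel loop, the sketch below tiles a doubly nested loop with a hypothetical TILE_I x TILE_J tile; the thesis's compiler and runtime heuristics are what choose such parameters, rather than the programmer hard-coding them.

        // Illustrative loop tiling (host-side code, compilable as CUDA source).
        // TILE_I and TILE_J together form the tile shape; their product is the
        // tile size. The values below are hypothetical placeholders.
        #include <cstdio>

        const int N = 1024;
        static float A[N][N];

        int main() {
            const int TILE_I = 64, TILE_J = 128;            // hypothetical tile shape

            for (int ii = 0; ii < N; ii += TILE_I)          // iterate over tiles
                for (int jj = 0; jj < N; jj += TILE_J)
                    for (int i = ii; i < ii + TILE_I; i++)  // iterate within a tile
                        for (int j = jj; j < jj + TILE_J; j++)
                            A[i][j] = (float)(i + j);

            printf("A[1][2] = %f\n", A[1][2]);
            return 0;
        }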

    Dynamic thread block launch: A lightweight execution mechanism to support irregular applications on GPUs

    No full text
    GPUs have been proven effective for structured applications that map well to the rigid 1D-3D grid of threads in modern bulk synchronous parallel (BSP) programming languages. However, less success has been encountered in mapping data-intensive irregular applications such as graph analytics, relational databases, and machine learning. The recently introduced nested device-side kernel launching functionality in the GPU is a step in the right direction, but still falls short of being able to effectively harness the GPU's performance potential. We propose a new mechanism called Dynamic Thread Block Launch (DTBL) to extend the bulk synchronous parallel model underlying the current GPU execution model by supporting dynamic spawning of lightweight thread blocks. This mechanism supports the nested launching of thread blocks rather than kernels to execute dynamically occurring parallel work elements. This paper describes the execution model of DTBL, device-runtime support, and microarchitecture extensions to track and execute dynamically spawned thread blocks. Experiments with a set of irregular, data-intensive CUDA applications executing on a cycle-level simulator show that DTBL achieves an average 1.21x speedup over the original flat implementation and an average 1.40x speedup over the implementation with device-side kernel launches using CUDA Dynamic Parallelism.
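
    For reference, and not part of the paper's contribution, the nested device-side kernel launching it compares against is CUDA Dynamic Parallelism, sketched below with hypothetical kernel names; DTBL instead spawns lightweight thread blocks through device-runtime and microarchitecture extensions rather than through this API.

        // Minimal CUDA Dynamic Parallelism sketch (illustrative only): a parent
        // kernel launches a child kernel from device code. This is the baseline
        // that DTBL's dynamically spawned thread blocks are compared against.
        // Build with relocatable device code, e.g. nvcc -rdc=true -lcudadevrt cdp.cu
        #include <cuda_runtime.h>
        #include <cstdio>

        __global__ void childKernel(int parentBlock) {
            printf("child block %d spawned by parent block %d\n",
                   (int)blockIdx.x, parentBlock);
        }

        __global__ void parentKernel() {
            // One thread per parent block discovers nested work and launches a
            // child grid for it; DTBL would enqueue lightweight thread blocks
            // here instead of a full kernel launch.
            if (threadIdx.x == 0)
                childKernel<<<2, 32>>>(blockIdx.x);
        }

        int main() {
            parentKernel<<<4, 64>>>();
            cudaDeviceSynchronize();   // waits for parents and their children
            return 0;
        }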

    Performance Portability with the Chapel Language

    No full text
    It has been widely shown that high-throughput computing architectures such as GPUs offer large performance gains compared to their traditional low-latency counterparts for many applications. The downside to these architectures is that the current programming models present numerous challenges to the programmer: lower-level languages, loss of portability across different architectures, explicit data movement, and challenges in performance optimization. In this paper, we present novel methods and compiler transformations that increase programmer productivity by enabling users of the language Chapel to provide just a single code implementation that the compiler can then use to target not only conventional multiprocessors but also high-throughput and hybrid machines. Rather than resorting to different parallel libraries or annotations for a given parallel platform, we leverage a language that has been designed from first principles to address the challenge of programming for parallelism and locality. This also has the advantage of providing portability across different parallel architectures. Finally, we present experimental results from the Parboil benchmark suite which demonstrate that codes written in Chapel achieve performance comparable to the original versions implemented in CUDA on both GPUs and multicore platforms.

    The Pierre Auger Cosmic Ray Observatory

    Get PDF
    See paper for full list of authors. Paper submitted to NIM A. International audience. The Pierre Auger Observatory, located on a vast, high plain in western Argentina, is the world's largest cosmic ray observatory. The objectives of the Observatory are to probe the origin and characteristics of cosmic rays above 10^17 eV and to study the interactions of these, the most energetic particles observed in nature. The Auger design features an array of 1660 water-Cherenkov particle detector stations spread over 3000 km^2 overlooked by 24 air fluorescence telescopes. In addition, three high-elevation fluorescence telescopes overlook a 23.5 km^2, 61-detector infill array. The Observatory has been in successful operation since completion in 2008 and has recorded data from an exposure exceeding 40,000 km^2 sr yr. This paper describes the design and performance of the detectors, related subsystems and infrastructure that make up the Auger Observatory.

    The Pierre Auger Cosmic Ray Observatory

    Get PDF
    The Pierre Auger Observatory, located on a vast, high plain in western Argentina, is the world's largest cosmic ray observatory. The objectives of the Observatory are to probe the origin and characteristics of cosmic rays above 10^17 eV and to study the interactions of these, the most energetic particles observed in nature. The Auger design features an array of 1660 water-Cherenkov particle detector stations spread over 3000 km^2 overlooked by 24 air fluorescence telescopes. In addition, three high-elevation fluorescence telescopes overlook a 23.5 km^2, 61-detector infilled array with 750 m spacing. The Observatory has been in successful operation since completion in 2008 and has recorded data from an exposure exceeding 40,000 km^2 sr yr. This paper describes the design and performance of the detectors, related subsystems and infrastructure that make up the Observatory.

    The Pierre Auger Cosmic Ray Observatory

    No full text