703 research outputs found

    Using the High Productivity Language Chapel to Target GPGPU Architectures

    Get PDF
    It has been widely shown that GPGPU architectures offer large performance gains compared to their traditional CPU counterparts for many applications. The downside to these architectures is that the current programming models present numerous challenges to the programmer: lower-level languages, explicit data movement, loss of portability, and challenges in performance optimization. In this paper, we present novel methods and compiler transformations that increase productivity by enabling users to easily program GPGPU architectures using the high productivity programming language Chapel. Rather than resorting to different parallel libraries or annotations for a given parallel platform, we leverage a language that has been designed from first principles to address the challenge of programming for parallelism and locality. This also has the advantage of being portable across distinct classes of parallel architectures, including desktop multicores, distributed memory clusters, large-scale shared memory, and now CPU-GPU hybrids. We present experimental results from the Parboil benchmark suite which demonstrate that codes written in Chapel achieve performance comparable to the original versions implemented in CUDA.NSF CCF 0702260Cray Inc. Cray-SRA-2010-016962010-2011 Nvidia Research Fellowshipunpublishednot peer reviewe

    Code Generation and Global Optimization Techniques for a Reconfigurable PRAM-NUMA Multicore Architecture

    Full text link

    Device level communication libraries for high‐performance computing in Java

    Get PDF
    This is the peer reviewed version of the following article: Taboada, G. L., Touriño, J. , Doallo, R. , Shafi, A. , Baker, M. and Carpenter, B. (2011), Device level communication libraries for high‐performance computing in Java. Concurrency Computat.: Pract. Exper., 23: 2382-2403. doi:10.1002/cpe.1777, which has been published in final form at https://doi.org/10.1002/cpe.1777. This article may be used for non-commercial purposes in accordance with Wiley Terms and Conditions for Use of Self-Archived Versions.[Abstract] Since its release, the Java programming language has attracted considerable attention from the high‐performance computing (HPC) community because of its portability, high programming productivity, and built‐in multithreading and networking support. As a consequence, several initiatives have been taken to develop a high‐performance Java message‐passing library to program distributed memory architectures, such as clusters. The performance of Java message‐passing applications relies heavily on the communications performance. Thus, the design and implementation of low‐level communication devices that support message‐passing libraries is an important research issue in Java for HPC. MPJ Express is our Java message‐passing implementation for developing high‐performance parallel Java applications. Its public release currently contains three communication devices: the first one is built using the Java New Input/Output (NIO) package for the TCP/IP; the second one is specifically designed for the Myrinet Express library on Myrinet; and the third one supports thread‐based shared memory communications. Although these devices have been successfully deployed in many production environments, previous performance evaluations of MPJ Express suggest that the buffering layer, tightly coupled with these devices, incurs a certain degree of copying overhead, which represents one of the main performance penalties. This paper presents a more efficient Java message‐passing communications device, based on Java Input/Output sockets, that avoids this buffering overhead. Moreover, this device implements several strategies, both in the communication protocol and in the HPC hardware support, which optimizes Java message‐passing communications. In order to evaluate its benefits, this paper analyzes the performance of this device comparatively with other Java and native message‐passing libraries on various high‐speed networks, such as Gigabit Ethernet, Scalable Coherent Interface, Myrinet, and InfiniBand, as well as on a shared memory multicore scenario. The reported communication overhead reduction encourages the upcoming incorporation of this device in MPJ ExpressMinisterio de Ciencia e Innovación; TIN2010-16735

    PySke: Algorithmic Skeletons for Python

    Get PDF
    International audiencePySke is a library of parallel algorithmic skeletons in Python designed for list and tree data structures. Such algorithmic skeletons are high-order functions implemented in parallel. An application developed with PySke is a composition of skeletons. To ease the write of parallel programs, PySke does not follow the Single Program Multiple Data (SPMD) paradigm but offers a global view of parallel programs to users. This approach aims at writing scalable programs easily. In addition to the library, we present experiments performed on a high-performance computing cluster (distributed memory) on a set of example applications developed with PySke

    Flexible Management on BSP Process Rescheduling: Offering Migration at Middleware and Application Levels

    Get PDF
    This article describes the rationales for developing jMigBSP - a Java programming library that offers object rescheduling. It was designed to work on grid computing environments and offers an interface that follows the BSP (Bulk Synchronous Parallel) style. jMigBSP’s main contribution focuses on the rescheduling facility in two different ways: (i) by using migration directives on the application coded irectly and (ii) through automatic load balancing at middleware level. Especially, this second idea is feasible thanks to the Java’s inheritance feature, in which transforms a simple jMigBSP application in amigratable one only by changing a single line of code. In addition, the presented library makes the object interaction easier by providing one-sided message passing directives and hides network latency through asynchronous communications. Finally, we developed three BSP applications: (i) Prefix Sum; (ii) Fractal Image Compression (FIC) and; (iii) Fast Fourier Transform (FFT).They show our library as viable solution to offer load balancing on BSP applications. Specially, the FIC results present gains up to 37% when applying migration directives inside the code. Finally, the FFT tests emphasize strength of jMigBSP. In this situation, it outperforms a native library denoted BSPlib when migration facilities take place.Keywords: Bulk Synchronous Parallel, rescheduling, Java, adaptation, object migration, grid computing

    HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi

    Get PDF
    This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms for multicore with Intel Xeon Phi coprocessors. In particular, we consider algorithms for solving linear systems. Further, we give an overview of the MAGMA MIC library, an open source, high performance library, that incorporates the developments presented here and, more broadly, provides the DLA functionality equivalent to that of the popular LAPACK library while targeting heterogeneous architectures that feature a mix of multicore CPUs and coprocessors. The LAPACK-compliance simplifies the use of the MAGMA MIC library in applications, while providing them with portably performant DLA. High performance is obtained through the use of the high-performance BLAS, hardware-specific tuning, and a hybridization methodology whereby we split the algorithm into computational tasks of various granularities. Execution of those tasks is properly scheduled over the heterogeneous hardware by minimizing data movements and mapping algorithmic requirements to the architectural strengths of the various heterogeneous hardware components. Our methodology and programming techniques are incorporated into the MAGMA MIC API, which abstracts the application developer from the specifics of the Xeon Phi architecture and is therefore applicable to algorithms beyond the scope of DLA
    corecore