409 research outputs found

    UPIR: Toward the Design of Unified Parallel Intermediate Representation for Parallel Programming Models

    Full text link
    The complexity of heterogeneous computing architectures, as well as the demand for productive and portable parallel application development, have driven the evolution of parallel programming models to become more comprehensive and complex than before. Enhancing the conventional compilation technologies and software infrastructure to be parallelism-aware has become one of the main goals of recent compiler development. In this paper, we propose the design of unified parallel intermediate representation (UPIR) for multiple parallel programming models and for enabling unified compiler transformation for the models. UPIR specifies three commonly used parallelism patterns (SPMD, data and task parallelism), data attributes and explicit data movement and memory management, and synchronization operations used in parallel programming. We demonstrate UPIR via a prototype implementation in the ROSE compiler for unifying IR for both OpenMP and OpenACC and in both C/C++ and Fortran, for unifying the transformation that lowers both OpenMP and OpenACC code to LLVM runtime, and for exporting UPIR to LLVM MLIR dialect.Comment: Typos corrected. Format update

    SemCache: Semantics-Aware Caching for Efficient GPU Offloading

    Get PDF
    Graphical Processing Units (GPUs) offer massive, highly-efficient parallelism, making them an attractive target for computation-intensive applications. However, GPUs have a separate memory space which introduces the complexity of manually handling explicit data movements between GPU and CPU memory spaces. Although GPU kernels/libraries have made it easy to improve application performance by offloading computation to GPUs, unfortunately it is very difficult to manually optimize CPU-GPU communication between multiple kernel invocations to avoid redundant communication when using these kernels with complex applications. ^ In this thesis, we introduce SemCache, a semantics-aware GPU cache that automatically manages CPU-GPU communication in addition to optimizing communication by eliminating redundant transfers using caching. It uses library semantics to determine the appropriate caching granularity for a given offloaded library (e.g., matrices). Our caching technique is efficient; it only tracks matrices instead of tracking every memory access at fine granularity. We applied SemCache to Basic Linear Algebra Subprograms (BLAS) library to provide a GPU drop-in replacement library which requires no program rewriting or annotations. ^ SemCache++ extends SemCache to support offloading to multiple GPUs. SemCache++ is used to build the first multi-GPU drop-in replacement library that (a) uses the virtual memory to automatically manage and optimize multi-GPU communication and (b) requires no program rewriting or annotations. SemCache++ also enables new features like asynchronous transfers, parallel execution and overlapping communication with computation. Experimental results show that our system can dramatically reduce redundant communication for real-world computational science application and deliver significant performance improvements, beating GPU-based implementations like MAGMA, CULA, CUBLAS, StarPU and CUBLASX

    JACC: An OpenACC Runtime Framework with Kernel-Level and Multi-GPU Parallelization

    Get PDF
    The rapid development in computing technology has paved the way for directive-based programming models towards a principal role in maintaining software portability of performance-critical applications. Efforts on such models involve a least engineering cost for enabling computational acceleration on multiple architectures while programmers are only required to add meta information upon sequential code. Optimizations for obtaining the best possible efficiency, however, are often challenging. The insertions of directives by the programmer can lead to side-effects that limit the available compiler optimization possible, which could result in performance degradation. This is exacerbated when targeting multi-GPU systems, as pragmas do not automatically adapt to such systems, and require expensive and time consuming code adjustment by programmers. This paper introduces JACC, an OpenACC runtime framework which enables the dynamic extension of OpenACC programs by serving as a transparent layer between the program and the compiler. We add a versatile code-translation method for multi-device utilization by which manually-optimized applications can be distributed automatically while keeping original code structure and parallelism. We show in some cases nearly linear scaling on the part of kernel execution with the NVIDIA V100 GPUs. While adaptively using multi-GPUs, the resulting performance improvements amortize the latency of GPU-to-GPU communications.Comment: Extended version of a paper to appear in: Proceedings of the 28th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC), December 17-18, 202

    Data-centric Performance Measurement and Mapping for Highly Parallel Programming Models

    Get PDF
    Modern supercomputers have complex features: many hardware threads, deep memory hierarchies, and many co-processors/accelerators. Productively and effectively designing programs to utilize those hardware features is crucial in gaining the best performance. There are several highly parallel programming models in active development that allow programmers to write efficient code on those architectures. Performance profiling is a very important technique in the development to achieve the best performance. In this dissertation, I proposed a new performance measurement and mapping technique that can associate performance data with program variables instead of code blocks. To validate the applicability of my data-centric profiling idea, I designed and implemented a profiler for PGAS and CUDA. For PGAS, I developed ChplBlamer, for both single-node and multi-node Chapel programs. My tool also provides new features such as data-centric inter-node load imbalance identification. For CUDA, I developed CUDABlamer for GPU-accelerated applications. CUDABlamer also attributes performance data to program variables, which is a feature that was not found in any previous CUDA profilers. Directed by the insights from the tools, I optimized several widely-studied benchmarks and significantly improved program performance by a factor of up to 4x for Chapel and 47x for CUDA kernels

    Heterogeneous computing with an algorithmic skeleton framework

    Get PDF
    The Graphics Processing Unit (GPU) is present in almost every modern day personal computer. Despite its specific purpose design, they have been increasingly used for general computations with very good results. Hence, there is a growing effort from the community to seamlessly integrate this kind of devices in everyday computing. However, to fully exploit the potential of a system comprising GPUs and CPUs, these devices should be presented to the programmer as a single platform. The efficient combination of the power of CPU and GPU devices is highly dependent on each device’s characteristics, resulting in platform specific applications that cannot be ported to different systems. Also, the most efficient work balance among devices is highly dependable on the computations to be performed and respective data sizes. In this work, we propose a solution for heterogeneous environments based on the abstraction level provided by algorithmic skeletons. Our goal is to take full advantage of the power of all CPU and GPU devices present in a system, without the need for different kernel implementations nor explicit work-distribution.To that end, we extended Marrow, an algorithmic skeleton framework for multi-GPUs, to support CPU computations and efficiently balance the work-load between devices. Our approach is based on an offline training execution that identifies the ideal work balance and platform configurations for a given application and input data size. The evaluation of this work shows that the combination of CPU and GPU devices can significantly boost the performance of our benchmarks in the tested environments, when compared to GPU-only executions

    Transactional memory on heterogeneous architectures

    Get PDF
    Tesis Leida el 9 de Marzo de 2018.Si observamos las necesidades computacionales de hoy, y tratamos de predecir las necesidades del mañana, podemos concluir que el procesamiento heterogéneo estará presente en muchos dispositivos y aplicaciones. El motivo es lógico: algoritmos diferentes y datos de naturaleza diferente encajan mejor en unos dispositivos de cómputo que en otros. Pongamos como ejemplo una tecnología de vanguardia como son los vehículos inteligentes. En este tipo de aplicaciones la computación heterogénea no es una opción, sino un requisito. En este tipo de vehículos se recolectan y analizan imágenes, tarea para la cual los procesadores gráficos (GPUs) son muy eficientes. Muchos de estos vehículos utilizan algoritmos sencillos, pero con grandes requerimientos de tiempo real, que deben implementarse directamente en hardware utilizando FPGAs. Y, por supuesto, los procesadores multinúcleo tienen un papel fundamental en estos sistemas, tanto organizando el trabajo de otros coprocesadores como ejecutando tareas en las que ningún otro procesador es más eficiente. No obstante, los procesadores tampoco siguen siendo dispositivos homogéneos. Los diferentes núcleos de un procesador pueden ofrecer diferentes características en términos de potencia y consumo energético que se adapten a las necesidades de cómputo de la aplicación. Programar este conjunto de dispositivos es una tarea compleja, especialmente en su sincronización. Habitualmente, esta sincronización se basa en operaciones atómicas, ejecución y terminación de kernels, barreras y señales. Con estas primitivas de sincronización básicas se pueden construir otras estructuras más complejas. Sin embargo, la programación de estos mecanismos es tediosa y propensa a fallos. La memoria transaccional (TM por sus siglas en inglés) se ha propuesto como un mecanismo avanzado a la vez que simple para garantizar la exclusión mutua

    Parallel evaluation strategies for lazy data structures in Haskell

    Get PDF
    Conventional parallel programming is complex and error prone. To improve programmer productivity, we need to raise the level of abstraction with a higher-level programming model that hides many parallel coordination aspects. Evaluation strategies use non-strictness to separate the coordination and computation aspects of a Glasgow parallel Haskell (GpH) program. This allows the specification of high level parallel programs, eliminating the low-level complexity of synchronisation and communication associated with parallel programming. This thesis employs a data-structure-driven approach for parallelism derived through generic parallel traversal and evaluation of sub-components of data structures. We focus on evaluation strategies over list, tree and graph data structures, allowing re-use across applications with minimal changes to the sequential algorithm. In particular, we develop novel evaluation strategies for tree data structures, using core functional programming techniques for coordination control, achieving more flexible parallelism. We use non-strictness to control parallelism more flexibly. We apply the notion of fuel as a resource that dictates parallelism generation, in particular, the bi-directional flow of fuel, implemented using a circular program definition, in a tree structure as a novel way of controlling parallel evaluation. This is the first use of circular programming in evaluation strategies and is complemented by a lazy function for bounding the size of sub-trees. We extend these control mechanisms to graph structures and demonstrate performance improvements on several parallel graph traversals. We combine circularity for control for improved performance of strategies with circularity for computation using circular data structures. In particular, we develop a hybrid traversal strategy for graphs, exploiting breadth-first order for exposing parallelism initially, and then proceeding with a depth-first order to minimise overhead associated with a full parallel breadth-first traversal. The efficiency of the tree strategies is evaluated on a benchmark program, and two non-trivial case studies: a Barnes-Hut algorithm for the n-body problem and sparse matrix multiplication, both using quad-trees. We also evaluate a graph search algorithm implemented using the various traversal strategies. We demonstrate improved performance on a server-class multicore machine with up to 48 cores, with the advanced fuel splitting mechanisms proving to be more flexible in throttling parallelism. To guide the behaviour of the strategies, we develop heuristics-based parameter selection to select their specific control parameters

    Towards Intelligent Runtime Framework for Distributed Heterogeneous Systems

    Get PDF
    Scientific applications strive for increased memory and computing performance, requiring massive amounts of data and time to produce results. Applications utilize large-scale, parallel computing platforms with advanced architectures to accommodate their needs. However, developing performance-portable applications for modern, heterogeneous platforms requires lots of effort and expertise in both the application and systems domains. This is more relevant for unstructured applications whose workflow is not statically predictable due to their heavily data-dependent nature. One possible solution for this problem is the introduction of an intelligent Domain-Specific Language (iDSL) that transparently helps to maintain correctness, hides the idiosyncrasies of lowlevel hardware, and scales applications. An iDSL includes domain-specific language constructs, a compilation toolchain, and a runtime providing task scheduling, data placement, and workload balancing across and within heterogeneous nodes. In this work, we focus on the runtime framework. We introduce a novel design and extension of a runtime framework, the Parallel Runtime Environment for Multicore Applications. In response to the ever-increasing intra/inter-node concurrency, the runtime system supports efficient task scheduling and workload balancing at both levels while allowing the development of custom policies. Moreover, the new framework provides abstractions supporting the utilization of heterogeneous distributed nodes consisting of CPUs and GPUs and is extensible to other devices. We demonstrate that by utilizing this work, an application (or the iDSL) can scale its performance on heterogeneous exascale-era supercomputers with minimal effort. A future goal for this framework (out of the scope of this thesis) is to be integrated with machine learning to improve its decision-making and performance further. As a bridge to this goal, since the framework is under development, we experiment with data from Nuclear Physics Particle Accelerators and demonstrate the significant improvements achieved by utilizing machine learning in the hit-based track reconstruction process

    Automatic Translation of Data Parallel Programs for Heterogeneous Parallelism Through OpenMP Offloading

    Get PDF
    Heterogeneous multicores like GPGPUs are now commonplace in modern computing systems. Although heterogeneous multicores offer the potential for high performance, programmers are struggling to program such systems. This paper presents OAO, a compiler-based approach to automatically translate shared-memory OpenMP data-parallel programs to run on heterogeneous multicores through OpenMP offloading directives. Given the large user base of shared memory OpenMP programs, our approach allows programmers to continue using a single-source-based programming language that they are familiar with while benefiting from the heterogeneous performance. OAO introduces a novel runtime optimization scheme to automatically eliminate unnecessary host–device communication to minimize the communication overhead between the host and the accelerator device. We evaluate OAO by applying it to 23 benchmarks from the PolyBench and Rodinia suites on two distinct GPU platforms. Experimental results show that OAO achieves up to 32×× speedup over the original OpenMP version, and can reduce the host–device communication overhead by up to 99% over the hand-translated version
    • …
    corecore