UPIR: Toward the Design of Unified Parallel Intermediate Representation for Parallel Programming Models
The complexity of heterogeneous computing architectures, as well as the
demand for productive and portable parallel application development, have
driven the evolution of parallel programming models to become more
comprehensive and complex than before. Enhancing the conventional compilation
technologies and software infrastructure to be parallelism-aware has become one
of the main goals of recent compiler development. In this paper, we propose the
design of unified parallel intermediate representation (UPIR) for multiple
parallel programming models and for enabling unified compiler transformation
for the models. UPIR specifies three commonly used parallelism patterns (SPMD,
data and task parallelism), data attributes and explicit data movement and
memory management, and synchronization operations used in parallel programming.
We demonstrate UPIR via a prototype implementation in the ROSE compiler for
unifying IR for both OpenMP and OpenACC and in both C/C++ and Fortran, for
unifying the transformation that lowers both OpenMP and OpenACC code to LLVM
runtime, and for exporting UPIR to LLVM MLIR dialect.
SemCache: Semantics-Aware Caching for Efficient GPU Offloading
Graphics Processing Units (GPUs) offer massive, highly efficient parallelism, making them an attractive target for computation-intensive applications. However, GPUs have a separate memory space, which introduces the complexity of manually handling explicit data movements between GPU and CPU memory spaces. Although GPU kernels and libraries have made it easy to improve application performance by offloading computation to GPUs, it is very difficult to manually optimize CPU-GPU communication across multiple kernel invocations to avoid redundant communication when these kernels are used in complex applications.
In this thesis, we introduce SemCache, a semantics-aware GPU cache that automatically manages CPU-GPU communication and optimizes it by eliminating redundant transfers through caching. It uses library semantics to determine the appropriate caching granularity for a given offloaded library (e.g., matrices). Our caching technique is efficient; it tracks only matrices instead of tracking every memory access at fine granularity. We applied SemCache to the Basic Linear Algebra Subprograms (BLAS) library to provide a GPU drop-in replacement library that requires no program rewriting or annotations.
SemCache++ extends SemCache to support offloading to multiple GPUs. SemCache++ is used to build the first multi-GPU drop-in replacement library that (a) uses virtual memory to automatically manage and optimize multi-GPU communication and (b) requires no program rewriting or annotations. SemCache++ also enables new features such as asynchronous transfers, parallel execution, and overlapping communication with computation. Experimental results show that our system can dramatically reduce redundant communication for real-world computational science applications and deliver significant performance improvements, beating GPU-based implementations such as MAGMA, CULA, CUBLAS, StarPU and CUBLASX.
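The matrix-granularity caching described in this abstract can be illustrated with a minimal sketch. It is not the SemCache implementation: the class `MatrixCache`, the three states in `Where`, and the explicit `gpu_call`/`cpu_read`/`cpu_write` hooks are assumptions made for illustration (the abstract indicates that the multi-GPU version relies on virtual memory rather than explicit calls). The sketch only counts the transfers that a per-matrix validity state would allow a runtime to skip.

```python
from enum import Enum


class Where(Enum):
    """Hypothetical per-matrix state: where the up-to-date copy lives."""
    HOST = "host"      # only the CPU copy is current
    DEVICE = "device"  # only the GPU copy is current
    BOTH = "both"      # both copies are current


class MatrixCache:
    """Tracks one state per whole matrix instead of per memory access."""

    def __init__(self):
        self.where = {}     # matrix name -> Where
        self.transfers = 0  # number of host<->device copies performed

    def _state(self, name):
        # Matrices never seen before are assumed to live on the host.
        return self.where.get(name, Where.HOST)

    def gpu_call(self, reads, writes):
        """Model an offloaded library call, e.g. a matrix multiply."""
        for name in reads:
            if self._state(name) is Where.HOST:
                self.transfers += 1          # copy host -> device
                self.where[name] = Where.BOTH
        for name in writes:
            self.where[name] = Where.DEVICE  # result stays on the GPU

    def cpu_read(self, name):
        if self._state(name) is Where.DEVICE:
            self.transfers += 1              # copy device -> host on demand
            self.where[name] = Where.BOTH

    def cpu_write(self, name):
        self.cpu_read(name)                  # fetch first: the write may be partial
        self.where[name] = Where.HOST        # device copy is now stale


# C = A * B, then D = C * B, then the host reads D twice.
cache = MatrixCache()
cache.gpu_call(reads=["A", "B"], writes=["C"])
cache.gpu_call(reads=["C", "B"], writes=["D"])
cache.cpu_read("D")
cache.cpu_read("D")
print(cache.transfers)  # 3
```

Under these assumptions, a scheme that copies every input in and every output out on each call would perform six transfers for the same two multiplications; keeping `B` and `C` resident on the device reduces that to three.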
JACC: An OpenACC Runtime Framework with Kernel-Level and Multi-GPU Parallelization
The rapid development in computing technology has paved the way for
directive-based programming models towards a principal role in maintaining
software portability of performance-critical applications. Efforts on such
models involve minimal engineering cost for enabling computational acceleration
on multiple architectures, since programmers are only required to add
meta-information to sequential code. Optimizations for obtaining the best possible
efficiency, however, are often challenging. The insertion of directives by the
programmer can lead to side effects that limit the available compiler
optimizations, which can result in performance degradation. This is
exacerbated when targeting multi-GPU systems, as pragmas do not automatically
adapt to such systems and require expensive and time-consuming code adjustment
by programmers.
This paper introduces JACC, an OpenACC runtime framework which enables the
dynamic extension of OpenACC programs by serving as a transparent layer between
the program and the compiler. We add a versatile code-translation method for
multi-device utilization by which manually-optimized applications can be
distributed automatically while keeping original code structure and
parallelism. In some cases we show nearly linear scaling of kernel
execution on NVIDIA V100 GPUs. When multiple GPUs are used adaptively, the
resulting performance improvements amortize the latency of GPU-to-GPU
communications.
Comment: Extended version of a paper to appear in: Proceedings of the 28th
IEEE International Conference on High Performance Computing, Data, and
Analytics (HiPC), December 17-18, 202
Data-centric Performance Measurement and Mapping for Highly Parallel Programming Models
Modern supercomputers have complex features: many hardware threads, deep memory hierarchies, and many co-processors/accelerators. Productively and effectively designing programs to utilize those hardware features is crucial in gaining the best performance. There are several highly parallel programming models in active development that allow programmers to write efficient code on those architectures. Performance profiling is a very important technique in the development to achieve the best performance.
In this dissertation, I propose a new performance measurement and mapping technique that associates performance data with program variables instead of code blocks. To validate the applicability of this data-centric profiling idea, I designed and implemented profilers for PGAS and CUDA. For PGAS, I developed ChplBlamer, for both single-node and multi-node Chapel programs. The tool also provides new features such as data-centric inter-node load imbalance identification. For CUDA, I developed CUDABlamer for GPU-accelerated applications. CUDABlamer also attributes performance data to program variables, a feature not found in any previous CUDA profiler. Directed by the insights from the tools, I optimized several widely studied benchmarks and improved program performance by a factor of up to 4x for Chapel and 47x for CUDA kernels.
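The difference between code-centric and data-centric attribution can be shown with a small sketch. It does not reproduce ChplBlamer or CUDABlamer: the sample format, the variable names, and the rule that a sample is charged in full to every variable it touches are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical profiling samples: (code block, variables involved, cost).
samples = [
    ("loop_1", ["A", "B"], 30),
    ("loop_1", ["A"], 10),
    ("loop_2", ["C"], 5),
    ("loop_2", ["A", "C"], 15),
]


def by_code_block(samples):
    """Conventional view: total cost per code block."""
    totals = defaultdict(int)
    for block, _variables, cost in samples:
        totals[block] += cost
    return dict(totals)


def by_variable(samples):
    """Data-centric view: charge each sample to every variable it involves."""
    totals = defaultdict(int)
    for _block, variables, cost in samples:
        for name in variables:
            totals[name] += cost
    return dict(totals)


print(by_code_block(samples))  # {'loop_1': 40, 'loop_2': 20}
print(by_variable(samples))    # {'A': 55, 'B': 30, 'C': 20}
```

In this toy data the code-centric view points at `loop_1`, while the data-centric view shows that variable `A` is involved in most of the cost across both loops.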
Heterogeneous computing with an algorithmic skeleton framework
The Graphics Processing Unit (GPU) is present in almost every modern-day personal
computer. Despite their special-purpose design, GPUs have been increasingly used for general
computations with very good results. Hence, there is a growing effort from the community
to seamlessly integrate this kind of device into everyday computing. However, to
fully exploit the potential of a system comprising GPUs and CPUs, these devices should
be presented to the programmer as a single platform.
The efficient combination of the power of CPU and GPU devices is highly dependent
on each device’s characteristics, resulting in platform specific applications that cannot
be ported to different systems. Also, the most efficient work balance among devices is
highly dependent on the computations to be performed and the respective data sizes.
In this work, we propose a solution for heterogeneous environments based on the
abstraction level provided by algorithmic skeletons. Our goal is to take full advantage of
the power of all CPU and GPU devices present in a system, without the need for different
kernel implementations or explicit work distribution. To that end, we extended Marrow,
an algorithmic skeleton framework for multi-GPUs, to support CPU computations and
efficiently balance the workload between devices. Our approach is based on an offline
training execution that identifies the ideal work balance and platform configurations for
a given application and input data size.
The evaluation of this work shows that the combination of CPU and GPU devices can
significantly boost the performance of our benchmarks in the tested environments when
compared to GPU-only executions.
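The idea of an offline training run that picks a CPU/GPU work balance can be sketched as a search over candidate splits. This is not Marrow's algorithm: the function `best_split` and the constant per-item costs are assumptions that stand in for timings a real training execution would measure on the target platform.

```python
def best_split(total_items, cpu_time_per_item, gpu_time_per_item, steps=100):
    """Try GPU shares of 0%, 1%, ..., 100% and keep the fastest one.

    Both devices are assumed to run concurrently, so the elapsed time of a
    split is the time of the slower device.
    Returns (elapsed, gpu_items, cpu_items).
    """
    best = None
    for i in range(steps + 1):
        gpu_items = total_items * i // steps
        cpu_items = total_items - gpu_items
        elapsed = max(cpu_items * cpu_time_per_item,
                      gpu_items * gpu_time_per_item)
        if best is None or elapsed < best[0]:
            best = (elapsed, gpu_items, cpu_items)
    return best


# Hypothetical costs: the GPU processes an item four times faster than the CPU.
elapsed, gpu_items, cpu_items = best_split(1000, 4.0, 1.0)
print(elapsed, gpu_items, cpu_items)  # 800.0 800 200
```

With these made-up costs, a GPU-only execution would take 1000 time units, so adding the CPU shortens the run to 800. The best split changes with the cost ratio, which is consistent with the abstract's point that the ideal balance depends on the computation and the input size.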
Transactional memory on heterogeneous architectures
Thesis defended on March 9, 2018. If we look at today's computational needs and try to predict
tomorrow's, we can conclude that heterogeneous processing
will be present in many devices and applications.
The reason is logical: different algorithms and data of a different nature fit better
on some computing devices than on others. Take as an example a
cutting-edge technology such as intelligent vehicles. In this type of
application, heterogeneous computing is not an option but a requirement.
These vehicles collect and analyze images, a task for which
graphics processors (GPUs) are very efficient.
Many of these vehicles use simple algorithms
with demanding real-time requirements, which must be
implemented directly in hardware using FPGAs.
And, of course, multicore processors play a
fundamental role in these systems, both organizing the work of other coprocessors
and executing tasks for which no other processor
is more efficient. However, processors are no longer
homogeneous devices either. The different cores of a processor can
offer different characteristics in terms of computing power and energy
consumption that adapt to the computing needs of the application.
Programming this set of devices is a complex task, especially
with regard to their synchronization.
Usually, this synchronization is based on atomic operations, kernel launch and
termination, barriers, and signals. With these basic synchronization
primitives, other more complex structures can be built.
However, programming these
mechanisms is tedious and error-prone. Transactional memory
(TM) has been proposed as a mechanism that is both
advanced and simple for guaranteeing mutual exclusion
Parallel evaluation strategies for lazy data structures in Haskell
Conventional parallel programming is complex and error-prone. To improve programmer
productivity, we need to raise the level of abstraction with a higher-level
programming model that hides many parallel coordination aspects. Evaluation
strategies use non-strictness to separate the coordination and computation aspects
of a Glasgow parallel Haskell (GpH) program. This allows the specification of high
level parallel programs, eliminating the low-level complexity of synchronisation and
communication associated with parallel programming.
This thesis employs a data-structure-driven approach for parallelism derived through
generic parallel traversal and evaluation of sub-components of data structures. We
focus on evaluation strategies over list, tree and graph data structures, allowing
re-use across applications with minimal changes to the sequential algorithm.
In particular, we develop novel evaluation strategies for tree data structures, using
core functional programming techniques for coordination control, achieving more
flexible parallelism. We use non-strictness to control parallelism more flexibly. We
apply the notion of fuel as a resource that dictates parallelism generation, in particular,
the bi-directional flow of fuel, implemented using a circular program definition,
in a tree structure as a novel way of controlling parallel evaluation. This is the first
use of circular programming in evaluation strategies and is complemented by a lazy
function for bounding the size of sub-trees.
We extend these control mechanisms to graph structures and demonstrate performance
improvements on several parallel graph traversals. We combine circularity
for control for improved performance of strategies with circularity for computation
using circular data structures. In particular, we develop a hybrid traversal strategy
for graphs, exploiting breadth-first order for exposing parallelism initially, and
then proceeding with a depth-first order to minimise overhead associated with a full
parallel breadth-first traversal.
The efficiency of the tree strategies is evaluated on a benchmark program, and
two non-trivial case studies: a Barnes-Hut algorithm for the n-body problem and
sparse matrix multiplication, both using quad-trees. We also evaluate a graph search
algorithm implemented using the various traversal strategies.
We demonstrate improved performance on a server-class multicore machine with
up to 48 cores, with the advanced fuel splitting mechanisms proving to be more
flexible in throttling parallelism. To guide the behaviour of the strategies, we develop
heuristics-based parameter selection to select their specific control parameters.
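The notion of fuel as a resource that limits how much parallelism a tree traversal generates can be illustrated with a simplified sketch. It is written in Python rather than Haskell, and it shows only a one-way, top-down split; it is not the bi-directional, circular-program mechanism the thesis describes. The `Node` type, the function `mark_parallel`, and the even-split policy are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)


def mark_parallel(node, fuel, out=None):
    """Return the names of nodes that would be evaluated as parallel tasks.

    Each marked node burns one unit of fuel; the rest is divided as evenly
    as possible among its children. A node that receives no fuel, and
    everything below it, is left for sequential evaluation.
    """
    if out is None:
        out = []
    if fuel <= 0:
        return out
    out.append(node.name)
    if node.children:
        share, extra = divmod(fuel - 1, len(node.children))
        for i, child in enumerate(node.children):
            # The first `extra` children receive one additional unit.
            mark_parallel(child, share + (1 if i < extra else 0), out)
    return out


tree = Node("root", [
    Node("a", [Node("a1"), Node("a2")]),
    Node("b", [Node("b1")]),
])
print(mark_parallel(tree, 4))  # ['root', 'a', 'a1', 'b']
```

With four units of fuel, exactly four nodes become parallel tasks and the remaining two are evaluated sequentially. In a one-way scheme like this, fuel handed to a small sub-tree can go unused, which is the kind of limitation that flowing fuel back up the tree would address.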
Towards Intelligent Runtime Framework for Distributed Heterogeneous Systems
Scientific applications strive for increased memory and computing performance, requiring massive amounts of data and time to produce results. Applications utilize large-scale, parallel computing platforms with advanced architectures to accommodate their needs. However, developing performance-portable applications for modern, heterogeneous platforms requires considerable effort and expertise in both the application and systems domains. This is more relevant for unstructured applications whose workflow is not statically predictable due to their heavily data-dependent nature. One possible solution for this problem is the introduction of an intelligent Domain-Specific Language (iDSL) that transparently helps to maintain correctness, hides the idiosyncrasies of low-level hardware, and scales applications. An iDSL includes domain-specific language constructs, a compilation toolchain, and a runtime providing task scheduling, data placement, and workload balancing across and within heterogeneous nodes. In this work, we focus on the runtime framework. We introduce a novel design and extension of a runtime framework, the Parallel Runtime Environment for Multicore Applications. In response to the ever-increasing intra/inter-node concurrency, the runtime system supports efficient task scheduling and workload balancing at both levels while allowing the development of custom policies. Moreover, the new framework provides abstractions supporting the utilization of heterogeneous distributed nodes consisting of CPUs and GPUs and is extensible to other devices. We demonstrate that by utilizing this work, an application (or the iDSL) can scale its performance on heterogeneous exascale-era supercomputers with minimal effort. A future goal for this framework (out of the scope of this thesis) is to be integrated with machine learning to improve its decision-making and performance further.
As a bridge to this goal, since the framework is under development, we experiment with data from Nuclear Physics Particle Accelerators and demonstrate the significant improvements achieved by utilizing machine learning in the hit-based track reconstruction process.
Automatic Translation of Data Parallel Programs for Heterogeneous Parallelism Through OpenMP Offloading
Heterogeneous multicores like GPGPUs are now commonplace in modern computing systems. Although heterogeneous multicores offer the potential for high performance, programmers struggle to program such systems. This paper presents OAO, a compiler-based approach to automatically translate shared-memory OpenMP data-parallel programs to run on heterogeneous multicores through OpenMP offloading directives. Given the large user base of shared-memory OpenMP programs, our approach allows programmers to continue using a single-source-based programming language that they are familiar with while benefiting from the heterogeneous performance. OAO introduces a novel runtime optimization scheme that automatically eliminates unnecessary host–device communication to minimize the communication overhead between the host and the accelerator device. We evaluate OAO by applying it to 23 benchmarks from the PolyBench and Rodinia suites on two distinct GPU platforms. Experimental results show that OAO achieves up to 32× speedup over the original OpenMP version, and can reduce the host–device communication overhead by up to 99% over the hand-translated version.