Hardware acceleration of reaction-diffusion systems: a guide to optimisation of pattern formation algorithms using OpenACC
Reaction Diffusion Systems (RDS) have widespread applications in computational ecology, biology, computer graphics and the visual arts. For ecological and biological applications, a major barrier to the development of effective simulation models is their computational complexity: it takes a great deal of processing power to simulate enough replicates for reliable conclusions to be drawn. Optimizing the computation is thus highly desirable in order to obtain more results with fewer resources. Existing optimizations of RDS tend to be low-level and GPGPU based. Here we apply the higher-level OpenACC framework to two case studies: a simple RDS to learn the ‘workings’ of OpenACC, and a more realistic and complex example. Our results show that simple parallelization directives and minimal data transfer can produce a useful performance improvement. The relative simplicity of porting OpenACC code between heterogeneous hardware is a key benefit to the scientific computing community in terms of speed-up and portability.
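To make the abstract's point about parallelization directives concrete, here is a minimal sketch of one explicit time step of a reaction-diffusion system. The abstract does not say which RDS the authors use, so the Gray-Scott model, the parameter values, the periodic boundary and the function names below are all assumptions chosen for illustration. The sketch is plain Python rather than OpenACC; its purpose is only to show the loop structure that a directive would target.

```python
def laplacian(f, i, j, n):
    """Five-point Laplacian on an n x n grid with periodic boundaries (assumed)."""
    return (f[(i - 1) % n][j] + f[(i + 1) % n][j]
            + f[i][(j - 1) % n] + f[i][(j + 1) % n]
            - 4.0 * f[i][j])


def gray_scott_step(u, v, du=0.16, dv=0.08, feed=0.035, kill=0.065, dt=1.0):
    """One forward-Euler step of the Gray-Scott model (illustrative parameters)."""
    n = len(u)
    # Results go into fresh arrays, so no cell update reads a value
    # written during the same step.
    new_u = [[0.0] * n for _ in range(n)]
    new_v = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            uvv = u[i][j] * v[i][j] ** 2
            new_u[i][j] = u[i][j] + dt * (
                du * laplacian(u, i, j, n) - uvv + feed * (1.0 - u[i][j]))
            new_v[i][j] = v[i][j] + dt * (
                dv * laplacian(v, i, j, n) + uvv - (feed + kill) * v[i][j])
    return new_u, new_v
```

Because every iteration of the double loop is independent, a C or Fortran version of the same loop nest is the kind of code that a single directive such as OpenACC's `parallel loop` can distribute across accelerator threads. Keeping `u` and `v` resident on the device between steps is what limits data transfer.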
UPIR: Toward the Design of Unified Parallel Intermediate Representation for Parallel Programming Models
The complexity of heterogeneous computing architectures, as well as the
demand for productive and portable parallel application development, have
driven the evolution of parallel programming models to become more
comprehensive and complex than before. Enhancing the conventional compilation
technologies and software infrastructure to be parallelism-aware has become one
of the main goals of recent compiler development. In this paper, we propose the
design of unified parallel intermediate representation (UPIR) for multiple
parallel programming models and for enabling unified compiler transformation
for the models. UPIR specifies three commonly used parallelism patterns (SPMD,
data and task parallelism), data attributes and explicit data movement and
memory management, and synchronization operations used in parallel programming.
We demonstrate UPIR via a prototype implementation in the ROSE compiler for
unifying IR for both OpenMP and OpenACC and in both C/C++ and Fortran, for
unifying the transformation that lowers both OpenMP and OpenACC code to LLVM
runtime, and for exporting UPIR to LLVM MLIR dialect.
Massively Parallel Algorithm for Solving the Eikonal Equation on Multiple Accelerator Platforms
The research presented in this thesis investigates parallel implementations of the Fast Sweeping Method (FSM) for Graphics Processing Unit (GPU)-based computational platforms and proposes a new parallel algorithm for distributed computing platforms with accelerators. Hardware accelerators such as GPUs and co-processors have emerged as general-purpose processors in today’s high performance computing (HPC) platforms, thereby increasing platforms’ performance capabilities. This trend has allowed greater parallelism and substantial acceleration of scientific simulation software. In order to leverage the power of new HPC platforms, scientific applications must be written in specific lower-level programming languages, which used to be platform specific. Newer programming models such as OpenACC simplify implementation and assure portability of applications across GPUs from different vendors and multi-core processors.
The distance field is a representation of a surface geometry or shape required by many algorithms within the areas of computer graphics, visualization, computational fluid dynamics and more. It can be calculated by solving the eikonal equation using the FSM. The parallel FSMs explored in this thesis have not been implemented on GPU platforms and do not scale to a large problem size. This thesis addresses this problem by designing a parallel algorithm that utilizes a domain decomposition strategy for multi-accelerated distributed platforms. The proposed algorithm first applies coarse-grained parallelism, using MPI to distribute subdomains across multiple nodes, and then fine-grained parallelism to optimize performance by utilizing accelerators. The parallel implementations of FSM for GPU-based platforms showed a speedup greater than 20× over the serial version for some problems, and the newly developed parallel algorithm removes the limitation of current algorithms in solving large-memory problems, with comparable runtime efficiency.
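As an illustration of the method named above, the following is a minimal serial sketch of the Fast Sweeping Method for the eikonal equation |∇u| = 1 on a uniform 2D grid. It is not the thesis's parallel algorithm. The function name, the grid layout, the unit speed and the fixed number of sweep rounds are assumptions made for the example.

```python
import math


def fast_sweep(n, sources, h=1.0, rounds=4):
    """Approximate the distance to `sources` on an n x n grid with spacing h."""
    inf = float("inf")
    src = set(sources)
    u = [[inf] * n for _ in range(n)]
    for i, j in src:
        u[i][j] = 0.0
    fwd = list(range(n))
    bwd = list(range(n - 1, -1, -1))
    # Four Gauss-Seidel sweep orderings, one per diagonal direction.
    orderings = [(fwd, fwd), (bwd, fwd), (bwd, bwd), (fwd, bwd)]
    for _ in range(rounds):
        for rows, cols in orderings:
            for i in rows:
                for j in cols:
                    if (i, j) in src:
                        continue
                    # Smallest neighbour value along each axis (upwind values).
                    a = min(u[i - 1][j] if i > 0 else inf,
                            u[i + 1][j] if i < n - 1 else inf)
                    b = min(u[i][j - 1] if j > 0 else inf,
                            u[i][j + 1] if j < n - 1 else inf)
                    if a == inf and b == inf:
                        continue  # no information has reached this cell yet
                    if abs(a - b) >= h:
                        cand = min(a, b) + h  # one-sided update
                    else:
                        # Two-sided update: root of the discretised quadratic.
                        cand = 0.5 * (a + b + math.sqrt(2.0 * h * h - (a - b) ** 2))
                    if cand < u[i][j]:
                        u[i][j] = cand
    return u
```

For example, `fast_sweep(5, [(0, 0)])` returns a 5 × 5 approximate distance field from the corner cell. Each sweep carries information along one family of directions, which is why the four orderings are alternated. The update at a cell depends on neighbours already visited in the same sweep, and that dependency is what makes parallelising FSM harder than parallelising a plain stencil loop.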
Loo.py: From Fortran to performance via transformation and substitution rules
A large amount of numerically-oriented code is written and is being written
in legacy languages. Much of this code could, in principle, make good use of
data-parallel throughput-oriented computer architectures. Loo.py, a
transformation-based programming system targeted at GPUs and general
data-parallel architectures, provides a mechanism for user-controlled
transformation of array programs. This transformation capability is designed to
not just apply to programs written specifically for Loo.py, but also those
imported from other languages such as Fortran. It eases the trade-off between
achieving high performance, portability, and programmability by allowing the
user to apply a large and growing family of transformations to an input
program. These transformations are expressed in and used from Python and may be
applied from a variety of settings, including a pragma-like manner from other
languages.
Comment: ARRAY 2015 - 2nd ACM SIGPLAN International Workshop on Libraries,
Languages and Compilers for Array Programming (ARRAY 2015)
JACC: An OpenACC Runtime Framework with Kernel-Level and Multi-GPU Parallelization
The rapid development in computing technology has paved the way for
directive-based programming models towards a principal role in maintaining
software portability of performance-critical applications. Efforts on such
models involve minimal engineering cost for enabling computational acceleration
on multiple architectures, since programmers are only required to add
meta-information to sequential code. Optimizations for obtaining the best possible
efficiency, however, are often challenging. The insertion of directives by the
programmer can lead to side-effects that limit the compiler optimizations
available, which could result in performance degradation. This is
exacerbated when targeting multi-GPU systems, as pragmas do not automatically
adapt to such systems, and require expensive and time consuming code adjustment
by programmers.
This paper introduces JACC, an OpenACC runtime framework which enables the
dynamic extension of OpenACC programs by serving as a transparent layer between
the program and the compiler. We add a versatile code-translation method for
multi-device utilization by which manually-optimized applications can be
distributed automatically while keeping original code structure and
parallelism. We show in some cases nearly linear scaling on the part of kernel
execution with the NVIDIA V100 GPUs. While adaptively using multi-GPUs, the
resulting performance improvements amortize the latency of GPU-to-GPU
communications. Comment: Extended version of a paper to appear in: Proceedings of the 28th
IEEE International Conference on High Performance Computing, Data, and
Analytics (HiPC), December 17-18, 202
Accelerating Face Anti-Spoofing Algorithms
This thesis focuses on accelerating an algorithm from the field of face-based anti-spoofing, using graphics hardware as a platform for data-parallel processing. OpenCL is used as the framework. It allows execution on devices ranging from powerful desktop computers to hand-held devices, and on different kinds of processing units, from accelerators such as GPUs or ASICs to x86 CPUs, without being tied to a particular hardware vendor or operating system. The author presents an analysis and an accelerated implementation of a widely used algorithm, and the impact of the resulting improvement in execution time.
Techniques for Improving the Ease of Programming in OpenCL
Thesis (Ph.D.) -- Seoul National University Graduate School: Department of Electrical and Computer Engineering, February 2016. Jaejin Lee.
OpenCL is one of the major programming models for heterogeneous systems. This thesis presents two limitations of OpenCL, the complicated nature of programming in OpenCL and the lack of support for a heterogeneous cluster, and proposes a solution for each of them for ease of programming.
The first limitation is that it is complicated to write a program using OpenCL. In order to lower this programming complexity, this thesis proposes a framework that translates a program written in a high-level language (OpenMP) to OpenCL at the source level. This thesis achieves both ease of programming and high performance by employing two techniques: data transfer minimization (DTM) and performance portability enhancement (PPE). This thesis shows the effectiveness of the proposed translation framework by evaluating benchmark applications, and its practicality by comparing it with the commercial PGI compiler.
The second limitation of OpenCL is the lack of support for a heterogeneous cluster. In order to extend OpenCL to a heterogeneous cluster, this thesis proposes a framework called SnuCL-D that is able to execute a program written only in OpenCL on a heterogeneous cluster. Unlike previous approaches that apply a centralized approach, the proposed framework applies a decentralized approach, which gives a chance to reduce three kinds of overhead occurring in the execution path of commands.
With the ability to analyze and reduce three kinds of overhead, the proposed framework shows good scalability for a large-scale cluster system. The proposed framework proves its effectiveness and practicality through comparison with the representative centralized approach (SnuCL) and with MPI, using benchmark applications.
This thesis proposes solutions for the two limitations of OpenCL for ease of programming on heterogeneous clusters. It is expected that application developers will be able to easily execute not only an OpenMP program on various accelerators but also a program written only in OpenCL on a heterogeneous cluster.
Chapter I. Introduction
I.1 Motivation and Objectives
I.1.1 Programming Complexity
I.1.2 Lack of Support for a Heterogeneous Cluster
I.2 Contributions
Chapter II. Background and Related Work
II.1 Background
II.1.1 OpenCL
II.1.2 OpenMP
II.2 Related Work
II.2.1 Programming Complexity
II.2.2 Support for a Heterogeneous Cluster
Chapter III. Lowering the Programming Complexity
III.1 Motivating Example
III.1.1 Device Constructs
III.1.2 Needs for Data Transfer Optimization
III.2 Mapping OpenMP to OpenCL
III.2.1 Architecture Model
III.2.2 Execution Model
III.3 Code Translation
III.3.1 Translation Process
III.3.2 Translating OpenMP to OpenCL
III.3.3 Example of Code Translation
III.3.4 Data Transfer Minimization (DTM)
III.3.5 Performance Portability Enhancement (PPE)
III.4 Performance Evaluation
III.4.1 Evaluation Methodology
III.4.2 Effectiveness of Optimization Techniques
III.4.3 Comparison with Other Implementations
Chapter IV. Support for a Heterogeneous Cluster
IV.1 Problems of Previous Approaches
IV.2 The Approach of SnuCL-D
IV.2.1 Overhead Analysis
IV.2.2 Remote Device Virtualization
IV.2.3 Redundant Computation and Data Replication
IV.2.4 Memory-read Commands
IV.3 Consistency Management
IV.4 Deterministic Command Scheduling
IV.5 New API Function: clAttachBufferToDevice()
IV.6 Queueing Optimization
IV.7 Performance Evaluation
IV.7.1 Evaluation Methodology
IV.7.2 Evaluation with a Microbenchmark
IV.7.3 Evaluation on the Large-scale CPU Cluster
IV.7.4 Evaluation on the Medium-scale GPU Cluster
Chapter V. Conclusion and Future Work
Bibliography
Korean Abstract
Directive-based Approach to Heterogeneous Computing
The world of high-performance computing is undergoing major changes that markedly increase its complexity. The inability of single-processor, or even multiprocessor, systems to keep increasing computing power to meet the needs of the scientific community has forced the emergence of massively parallel hardware architectures and of units specialized for particular operations. A good example of such devices is the GPU (graphics processing unit). These devices, traditionally dedicated to graphics programming, have recently become an ideal platform for implementing massively parallel computations. The combination of GPUs for compute-intensive tasks with multiprocessors for less intensive tasks that involve more complex control logic has become, in recent years, one of the most common platforms for low-cost scientific computing, since the power delivered can in many cases match that of small or medium-sized clusters, at a notably lower initial and maintenance cost. Incorporating GPUs into clusters has also increased the capacity of those clusters. However, the complexity of GPU programming, and of its integration with existing codes, greatly hinders the adoption of these technologies by less expert users. In this thesis we explore the use of directive-based programming models for environments of this kind (multi-core, many-core, GPUs and clusters), where the average user's productivity is markedly reduced by the difficulty of programming. To explore the best way of applying directives in these environments, we have developed a set of highly flexible software tools (a compiler and a runtime) that make it possible to explore various techniques with relatively little effort.
The emergence of the OpenACC directive-based programming standard allowed us to demonstrate the capability of these tools by producing an experimental implementation of the standard (accULL) in very little time and with far from negligible performance. The computational results provided allow us to demonstrate: (a) the reduction in programming effort that directive-based approaches allow, (b) the capability and flexibility of the tools designed during this thesis for exploring these approaches, and finally (c) the potential for future development of accULL as an experimental OpenACC tool, based on the performance currently obtained compared with that of other, commercial approaches.