
    Hardware acceleration of reaction-diffusion systems: a guide to optimisation of pattern formation algorithms using OpenACC

    Reaction-diffusion systems (RDS) have widespread applications in computational ecology, biology, computer graphics and the visual arts. For the former applications, a major barrier to the development of effective simulation models is their computational complexity: it takes a great deal of processing power to simulate enough replicates for reliable conclusions to be drawn. Optimizing the computation is thus highly desirable in order to obtain more results with fewer resources. Existing optimizations of RDS tend to be low-level and GPGPU-based. Here we apply the higher-level OpenACC framework to two case studies: a simple RDS used to learn the ‘workings’ of OpenACC, and a more realistic and complex example. Our results show that simple parallelization directives and minimal data transfer can produce a useful performance improvement. The relative ease of porting OpenACC code between heterogeneous hardware is a key benefit to the scientific computing community, in terms of both speed-up and portability.
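
    A directive-based port of this kind typically comes down to annotating the main stencil update loop. The sketch below assumes a two-species Gray-Scott model on an n-by-n grid with a five-point Laplacian; the function name, parameters, and the choice of Gray-Scott itself are illustrative assumptions, not taken from the paper.

        /* One explicit-Euler step of a Gray-Scott reaction-diffusion model,
         * offloaded with a single OpenACC directive. Illustrative sketch. */
        void rd_step(int n, const double *restrict u, const double *restrict v,
                     double *restrict un, double *restrict vn,
                     double Du, double Dv, double F, double K, double dt)
        {
            #pragma acc parallel loop collapse(2) \
                copyin(u[0:n*n], v[0:n*n]) copy(un[0:n*n], vn[0:n*n])
            for (int i = 1; i < n - 1; i++) {
                for (int j = 1; j < n - 1; j++) {
                    int c = i * n + j;
                    /* five-point Laplacian of each species */
                    double lu = u[c-1] + u[c+1] + u[c-n] + u[c+n] - 4.0 * u[c];
                    double lv = v[c-1] + v[c+1] + v[c-n] + v[c+n] - 4.0 * v[c];
                    double uvv = u[c] * v[c] * v[c];   /* reaction term */
                    un[c] = u[c] + dt * (Du * lu - uvv + F * (1.0 - u[c]));
                    vn[c] = v[c] + dt * (Dv * lv + uvv - (F + K) * v[c]);
                }
            }
        }

    In a full simulation the arrays would live inside an enclosing #pragma acc data region so they stay resident on the device across time steps; keeping transfers out of the time loop is the 'minimal data transfer' the abstract credits for much of the speed-up.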

    UPIR: Toward the Design of Unified Parallel Intermediate Representation for Parallel Programming Models

    The complexity of heterogeneous computing architectures, as well as the demand for productive and portable parallel application development, has driven the evolution of parallel programming models to become more comprehensive and complex than before. Enhancing conventional compilation technologies and software infrastructure to be parallelism-aware has become one of the main goals of recent compiler development. In this paper, we propose the design of a unified parallel intermediate representation (UPIR) for multiple parallel programming models, enabling unified compiler transformation across them. UPIR specifies three commonly used parallelism patterns (SPMD, data, and task parallelism), data attributes, explicit data movement and memory management, and the synchronization operations used in parallel programming. We demonstrate UPIR via a prototype implementation in the ROSE compiler: unifying the IR for both OpenMP and OpenACC, in both C/C++ and Fortran; unifying the transformation that lowers both OpenMP and OpenACC code to the LLVM runtime; and exporting UPIR to an LLVM MLIR dialect.
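
    To make the unification concrete: the two directive dialects below express the same data-parallel loop, and a UPIR-style compiler would map both onto a single parallel IR node. The vector-add example is ours; UPIR's actual IR syntax is not reproduced here.

        /* The same loop in OpenMP offload form and OpenACC form; under a
         * unified parallel IR both lower to one SPMD/data-parallel node. */
        void vadd_omp(int n, const float *a, const float *b, float *c)
        {
            #pragma omp target teams distribute parallel for \
                map(to: a[0:n], b[0:n]) map(from: c[0:n])
            for (int i = 0; i < n; i++)
                c[i] = a[i] + b[i];
        }

        void vadd_acc(int n, const float *a, const float *b, float *c)
        {
            #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
            for (int i = 0; i < n; i++)
                c[i] = a[i] + b[i];
        }

    The map/copy clauses carry the same information (data attributes and explicit movement), which is exactly the kind of commonality UPIR is designed to capture once instead of twice.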

    Massively Parallel Algorithm for Solving the Eikonal Equation on Multiple Accelerator Platforms

    The research presented in this thesis investigates parallel implementations of the Fast Sweeping Method (FSM) for Graphics Processing Unit (GPU)-based computational platforms and proposes a new parallel algorithm for distributed computing platforms with accelerators. Hardware accelerators such as GPUs and co-processors have emerged as general-purpose processors in today's high-performance computing (HPC) platforms, increasing those platforms' performance capabilities. This trend has allowed greater parallelism and substantial acceleration of scientific simulation software. To leverage the power of new HPC platforms, scientific applications had to be written in specific lower-level programming languages, which used to be platform-specific. Newer programming models such as OpenACC simplify implementation and ensure that applications run portably across GPUs from different vendors and across multi-core processors. The distance field is a representation of a surface geometry or shape required by many algorithms in computer graphics, visualization, computational fluid dynamics and other areas. It can be calculated by solving the eikonal equation using the FSM. The parallel FSM variants explored in this thesis had not previously been implemented on GPU platforms and do not scale to large problem sizes. This thesis addresses the problem by designing a parallel algorithm built on a domain-decomposition strategy for multi-accelerator distributed platforms. The proposed algorithm first applies coarse-grained parallelism, using MPI to distribute subdomains across multiple nodes, and then fine-grained parallelism within each subdomain, utilizing accelerators to optimize performance. The parallel GPU implementations of the FSM showed speedups greater than 20× over the serial version for some problems, and the newly developed parallel algorithm removes the limitation of current algorithms, solving large-memory problems with comparable runtime efficiency.
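
    The heart of the FSM is a one-node Godunov update applied repeatedly under four sweep orderings (in 2D). A minimal sketch of that local solver for unit speed follows; the names and the speed-1 simplification are ours, not the thesis's.

        #include <math.h>

        /* Godunov local solver for the 2D eikonal equation |grad u| = 1 on a
         * grid with spacing h. a and b are the smaller of the two neighbor
         * values in x and in y; each sweep keeps min(u[node], update). */
        static double eikonal_update(double a, double b, double h)
        {
            double lo = fmin(a, b);
            if (fmax(a, b) - lo >= h)
                return lo + h;            /* information arrives from one axis only */
            /* two-sided case: solve (u - a)^2 + (u - b)^2 = h^2 */
            return 0.5 * (a + b + sqrt(2.0 * h * h - (a - b) * (a - b)));
        }

    Within a sweep, nodes on the same anti-diagonal do not depend on each other, which is the fine-grained parallelism GPU implementations typically exploit; the thesis's coarse grain then distributes whole subdomains across cluster nodes with MPI.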

    Loo.py: From Fortran to performance via transformation and substitution rules

    A large amount of numerically oriented code has been, and continues to be, written in legacy languages. Much of this code could, in principle, make good use of data-parallel, throughput-oriented computer architectures. Loo.py, a transformation-based programming system targeting GPUs and general data-parallel architectures, provides a mechanism for user-controlled transformation of array programs. This transformation capability is designed to apply not just to programs written specifically for Loo.py, but also to those imported from other languages such as Fortran. It eases the trade-off between high performance, portability, and programmability by allowing the user to apply a large and growing family of transformations to an input program. These transformations are expressed in and used from Python and may be applied in a variety of settings, including in a pragma-like manner from other languages. (ARRAY 2015, 2nd ACM SIGPLAN International Workshop on Libraries, Languages and Compilers for Array Programming.)

    JACC: An OpenACC Runtime Framework with Kernel-Level and Multi-GPU Parallelization

    The rapid development of computing technology has paved the way for directive-based programming models to take a principal role in maintaining the software portability of performance-critical applications. Such models aim at minimal engineering cost for enabling computational acceleration on multiple architectures: programmers are only required to annotate sequential code with meta-information. Obtaining the best possible efficiency, however, is often challenging. Directives inserted by the programmer can have side effects that limit the compiler optimizations available, which can degrade performance. This is exacerbated when targeting multi-GPU systems, since pragmas do not automatically adapt to such systems and instead require expensive, time-consuming code adjustment by programmers. This paper introduces JACC, an OpenACC runtime framework that enables the dynamic extension of OpenACC programs by serving as a transparent layer between the program and the compiler. We add a versatile code-translation method for multi-device utilization, by which manually optimized applications can be distributed automatically while keeping the original code structure and parallelism. We show in some cases nearly linear scaling of the kernel-execution portion on NVIDIA V100 GPUs. When adaptively using multiple GPUs, the resulting performance improvements amortize the latency of GPU-to-GPU communication. (Extended version of a paper in Proceedings of the 28th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC), December 17-18, 2021.)
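
    A rough picture of the multi-device distribution JACC automates, written here by hand: split the iteration space of one OpenACC kernel across all visible GPUs. The function and the even block split are illustrative assumptions; JACC performs this translation dynamically and also manages data consistency across devices, which this sketch does not.

        #include <openacc.h>

        /* Manually distribute one OpenACC kernel over every visible NVIDIA
         * GPU (device type chosen to match the abstract's V100s). */
        void vscale_multi_gpu(int n, float *x, float alpha)
        {
            int ndev = acc_get_num_devices(acc_device_nvidia);
            for (int d = 0; d < ndev; d++) {
                acc_set_device_num(d, acc_device_nvidia);
                int lo = (int)((long long)n * d / ndev);        /* block split */
                int hi = (int)((long long)n * (d + 1) / ndev);
                #pragma acc parallel loop copy(x[lo:hi-lo]) async(1)
                for (int i = lo; i < hi; i++)
                    x[i] *= alpha;
            }
            for (int d = 0; d < ndev; d++) {   /* drain each device's queue */
                acc_set_device_num(d, acc_device_nvidia);
                #pragma acc wait(1)
            }
        }

    Doing this by hand for every kernel is exactly the 'expensive, time-consuming code adjustment' the abstract describes; JACC's contribution is performing the same distribution without changing the original single-GPU source.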

    Accelerating Face Anti-Spoofing Algorithms

    This thesis focuses on accelerating an algorithm from the field of face-based anti-spoofing, using graphics hardware as a platform for data-parallel processing. OpenCL is used as the framework: it allows execution on anything from powerful desktop computers to hand-held devices, on accelerators such as GPUs or ASICs as well as on x86 CPUs, without being tied to a particular hardware vendor or operating system. The author presents an analysis and an accelerated implementation of a widely used algorithm, and the impact of the acceleration on execution time.
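
    The abstract does not name the accelerated algorithm, so as a generic illustration of the data-parallel style such image pipelines are built on, here is a minimal OpenCL C kernel that maps one work-item to one pixel; the kernel name and the grayscale operation are illustrative assumptions only.

        /* Minimal OpenCL C kernel: one work-item per pixel. Illustrative only;
         * not taken from the thesis. */
        __kernel void rgb_to_gray(__global const uchar4 *rgba,
                                  __global uchar *gray,
                                  const int width, const int height)
        {
            int x = get_global_id(0);
            int y = get_global_id(1);
            if (x >= width || y >= height)
                return;                       /* guard for a padded NDRange */
            uchar4 p = rgba[y * width + x];
            /* integer approximation of BT.601 luma: (77R + 150G + 29B) / 256 */
            gray[y * width + x] = (uchar)((77 * p.x + 150 * p.y + 29 * p.z) >> 8);
        }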

    Fast-coding robust motion estimation model in a GPU


    Techniques for Improving the Ease of Programming in OpenCL

    Ph.D. dissertation, Seoul National University, Department of Electrical and Computer Engineering, February 2016; advisor Jaejin Lee. OpenCL is one of the major programming models for heterogeneous systems. This thesis identifies two limitations of OpenCL, the complicated nature of programming in OpenCL and the lack of support for heterogeneous clusters, and proposes a solution to each for ease of programming. The first limitation is that writing a program in OpenCL is complicated. To lower this programming complexity, the thesis proposes a framework that translates a program written in a high-level language (OpenMP) to OpenCL at the source level. It achieves both ease of programming and high performance by employing two techniques: data transfer minimization (DTM) and performance portability enhancement (PPE). The thesis shows the effectiveness of the proposed translation framework by evaluating benchmark applications, and its practicality by comparing it with the commercial PGI compiler. The second limitation of OpenCL is the lack of support for heterogeneous clusters. To extend OpenCL to a heterogeneous cluster, the thesis proposes SnuCL-D, a framework able to execute a program written only in OpenCL on such a cluster. Unlike previous, centralized approaches, the proposed framework is decentralized, which makes it possible to reduce three kinds of overhead on the command execution path. With the ability to analyze and reduce these overheads, the framework shows good scalability on a large-scale cluster system. Its effectiveness and practicality are demonstrated through comparison with the representative centralized approach (SnuCL) and with MPI on benchmark applications. In sum, the thesis proposes solutions to two limitations of OpenCL for ease of programming on heterogeneous clusters: application developers can easily run an OpenMP program on various accelerators, and a program written only in OpenCL on a heterogeneous cluster.
    Table of contents: Chapter I, Introduction: I.1 Motivation and Objectives (I.1.1 Programming Complexity; I.1.2 Lack of Support for a Heterogeneous Cluster); I.2 Contributions. Chapter II, Background and Related Work: II.1 Background (II.1.1 OpenCL; II.1.2 OpenMP); II.2 Related Work (II.2.1 Programming Complexity; II.2.2 Support for a Heterogeneous Cluster). Chapter III, Lowering the Programming Complexity: III.1 Motivating Example (III.1.1 Device Constructs; III.1.2 Needs for Data Transfer Optimization); III.2 Mapping OpenMP to OpenCL (III.2.1 Architecture Model; III.2.2 Execution Model); III.3 Code Translation (III.3.1 Translation Process; III.3.2 Translating OpenMP to OpenCL; III.3.3 Example of Code Translation; III.3.4 Data Transfer Minimization (DTM); III.3.5 Performance Portability Enhancement (PPE)); III.4 Performance Evaluation (III.4.1 Evaluation Methodology; III.4.2 Effectiveness of Optimization Techniques; III.4.3 Comparison with Other Implementations). Chapter IV, Support for a Heterogeneous Cluster: IV.1 Problems of Previous Approaches; IV.2 The Approach of SnuCL-D (IV.2.1 Overhead Analysis; IV.2.2 Remote Device Virtualization; IV.2.3 Redundant Computation and Data Replication; IV.2.4 Memory-read Commands); IV.3 Consistency Management; IV.4 Deterministic Command Scheduling; IV.5 New API Function: clAttachBufferToDevice(); IV.6 Queueing Optimization; IV.7 Performance Evaluation (IV.7.1 Evaluation Methodology; IV.7.2 Evaluation with a Microbenchmark; IV.7.3 Evaluation on the Large-scale CPU Cluster; IV.7.4 Evaluation on the Medium-scale GPU Cluster). Chapter V, Conclusion and Future Work. Bibliography. Korean Abstract.
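
    The first half of the thesis rests on a source-level mapping from OpenMP device constructs to OpenCL. The sketch below pairs an OpenMP input loop with the kind of OpenCL kernel a translator could emit for it; the names are ours, and the thesis's actual generated code may differ.

        /* Input: an OpenMP device construct. The map clauses state which
         * buffers must move, the raw material for data transfer minimization. */
        void saxpy(int n, float a, const float *x, float *y)
        {
            #pragma omp target teams distribute parallel for \
                map(to: x[0:n]) map(tofrom: y[0:n])
            for (int i = 0; i < n; i++)
                y[i] = a * x[i] + y[i];
        }

        /* Plausible output: an OpenCL kernel with one work-item per iteration,
         * compiled in its own translation unit. */
        __kernel void saxpy_kernel(const int n, const float a,
                                   __global const float *x, __global float *y)
        {
            int i = get_global_id(0);
            if (i < n)
                y[i] = a * x[i] + y[i];
        }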

    Directive-based Approach to Heterogeneous Computing

    The world of high-performance computing is undergoing major changes that notably increase its complexity. The inability of single-processor, and even multiprocessor, systems to sustain the growth in computing power the scientific community requires has forced the emergence of massively parallel hardware architectures and of units specialized for particular operations. A good example of this kind of device is the GPU (graphics processing unit). These devices, traditionally dedicated to graphics programming, have recently become an ideal platform for massively parallel computation. Combining GPUs for compute-intensive tasks with multiprocessors for lighter tasks with more complex control logic has in recent years become one of the most common platforms for low-cost scientific computing, since the computing power deployed can in many cases match that of small or medium-sized clusters, at a notably lower initial and maintenance cost. Incorporating GPUs into clusters has also increased cluster capacity. However, the complexity of GPU programming, and of integrating GPU code with existing codes, severely hinders the adoption of these technologies by less expert users. In this thesis we explore the use of directive-based programming models in these environments (multi-core, many-core, GPUs and clusters), where the average user's productivity is notably reduced by the difficulty of programming. To explore the best way to apply directives in these environments, we developed a set of highly flexible software tools (a compiler and a runtime) that allow diverse techniques to be explored with relatively little effort. The arrival of the OpenACC directive-based programming standard allowed us to demonstrate the capability of these tools by producing an experimental implementation of the standard (accULL) in very little time and with far from negligible performance. The computational results presented demonstrate: (a) the reduction in programming effort that directive-based approaches enable, (b) the capability and flexibility of the tools designed during this thesis for exploring such approaches, and finally (c) the potential of accULL for future development as an experimental OpenACC tool, based on its current performance relative to commercial alternatives.