
    A portable platform for accelerated PIC codes and its application to GPUs using OpenACC

    We present a portable platform, called PIC_ENGINE, for accelerating Particle-In-Cell (PIC) codes on heterogeneous many-core architectures such as Graphics Processing Units (GPUs). The aim of this development is to enable efficient simulations on future exascale systems by allowing different parallelization strategies depending on the application problem and the specific architecture. To this end, the platform contains the basic steps of the PIC algorithm and has been designed as a test bed for different algorithmic options and data structures. Among the architectures that this engine can explore, particular attention is given here to systems equipped with GPUs. The study demonstrates that our portable PIC implementation based on the OpenACC programming model can achieve performance closely matching theoretical predictions. Using the Cray XC30 system, Piz Daint, at the Swiss National Supercomputing Centre (CSCS), we show that PIC_ENGINE running on an NVIDIA Kepler K20X GPU can outperform the same code on an Intel Sandy Bridge 8-core CPU by a factor of 3.4.
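    As a concrete illustration of this style of offloading, the following is a minimal, hedged C++ sketch (not the PIC_ENGINE code) of an OpenACC particle push, with field interpolation reduced to a uniform field for brevity. With an OpenACC compiler (e.g. nvc++ -acc) the loop runs on the GPU; otherwise the pragma is ignored and the loop runs serially.

        // Minimal OpenACC sketch of a PIC particle push (illustrative only).
        #include <cstdio>
        #include <vector>

        int main() {
            const int n = 1 << 20;           // number of particles (illustrative)
            const double dt = 1e-3, E = 0.5; // time step and a uniform-field stand-in
            std::vector<double> x(n, 0.0), v(n, 1.0);
            double* xp = x.data();
            double* vp = v.data();

            // Each particle is independent, so the push maps directly onto GPU threads.
            #pragma acc parallel loop copy(xp[0:n], vp[0:n])
            for (int i = 0; i < n; ++i) {
                vp[i] += E * dt;     // accelerate in the (constant) field
                xp[i] += vp[i] * dt; // move the particle
            }

            std::printf("x[0] = %f\n", x[0]);
            return 0;
        }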

    Porting and tuning of the Mont-Blanc benchmarks to the multicore ARM 64bit architecture

    This project is about porting and tuning the Mont-Blanc benchmarks to the multicore ARM 64-bit architecture. The Mont-Blanc benchmarks are part of the Mont-Blanc European project and have been developed internally at BSC (Barcelona Supercomputing Center). The project will explore the possibilities that an ARM architecture can offer in an HPC (High Performance Computing) setup, which includes learning how to tune and adapt a parallelized program and how to analyze its execution behavior. As part of the project, we will analyze the performance of each benchmark using instrumentation tools such as Extrae and Paraver. Each benchmark will be adapted, tuned, and executed mainly on the three new Mont-Blanc mini-clusters, Thunder (ARMv8 custom), Merlin (ARMv8 custom), and Jetson TX (ARMv8 Cortex-A57), using the OmpSs programming model. The evolution of the performance obtained will be shown, followed by a brief analysis of the results after each optimization.
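    The OmpSs model mentioned above expresses parallelism as tasks with data dependences. Below is a hedged, illustrative C++ sketch (not one of the Mont-Blanc benchmarks) of that style: each block of an array becomes a task, and the array-section syntax in the inout() clause follows OmpSs conventions. It would be compiled with the Mercurium compiler; a plain compiler ignores the pragmas and runs the loop sequentially.

        // Hedged sketch of OmpSs-style block tasking (illustrative only).
        #include <cstdio>
        #include <vector>

        static void scale_block(double* blk, int len, double f) {
            for (int i = 0; i < len; ++i) blk[i] *= f;
        }

        int main() {
            const int n = 1 << 16, bs = 1 << 12; // array and block sizes (illustrative)
            std::vector<double> a(n, 1.0);

            for (int i = 0; i < n; i += bs) {
                double* blk = a.data() + i;
                // Each block is an independent task; inout() declares the dependence.
                #pragma omp task inout(blk[0;bs])
                scale_block(blk, bs, 2.0);
            }
            #pragma omp taskwait // wait for all block tasks before reading results

            std::printf("a[0] = %f\n", a[0]);
            return 0;
        }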

    Scratchpad Sharing in GPUs

    GPGPU applications exploit the on-chip scratchpad memory available in Graphics Processing Units (GPUs) to improve performance. The amount of thread-level parallelism present in the GPU is limited by the number of resident threads, which in turn depends on the availability of scratchpad memory in its streaming multiprocessor (SM). Since scratchpad memory is allocated at thread-block granularity, part of the memory may remain unutilized. In this paper, we propose architectural and compiler optimizations to improve scratchpad utilization. Our approach, Scratchpad Sharing, addresses scratchpad under-utilization by launching additional thread blocks in each SM. These thread blocks use unutilized scratchpad and also share scratchpad with other resident blocks. To improve the performance of scratchpad sharing, we propose Owner Warp First (OWF) scheduling, which schedules warps from the additional thread blocks effectively. The performance of this approach, however, is limited by the availability of the shared part of the scratchpad. We propose compiler optimizations to improve the availability of shared scratchpad. We describe a scratchpad allocation scheme that helps allocate scratchpad variables such that shared scratchpad is accessed for a short duration. We introduce a new instruction, relssp, that, when executed, releases the shared scratchpad. Finally, we describe an analysis for optimal placement of relssp instructions such that shared scratchpad is released as early as possible. We implemented the hardware changes using the GPGPU-Sim simulator and the compiler optimizations in the Ocelot framework. We evaluated the effectiveness of our approach on 19 kernels from 3 benchmark suites: CUDA-SDK, GPGPU-Sim, and Rodinia. The kernels that underutilize scratchpad memory show an average improvement of 19% and a maximum improvement of 92.17% over the baseline approach.
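    To make the under-utilization concrete, here is a back-of-the-envelope C++ illustration with hypothetical sizes (the numbers are assumptions, not figures from the paper): allocation at thread-block granularity strands a remainder of scratchpad, which an additional, sharing thread block could recover.

        // Hypothetical occupancy arithmetic for scratchpad sharing (not from the paper).
        #include <cstdio>

        int main() {
            const int sm_scratchpad = 48 * 1024; // bytes of scratchpad per SM (assumed)
            const int block_scratch = 20 * 1024; // bytes requested per thread block (assumed)

            // Baseline: a block is resident only if its full allocation fits.
            int resident = sm_scratchpad / block_scratch;            // 2 blocks
            int unused = sm_scratchpad - resident * block_scratch;   // 8 KB idle

            // Scratchpad sharing: one extra block owns the idle 8 KB and
            // shares the remaining 12 KB with a resident "owner" block.
            int shared_part = block_scratch - unused;

            std::printf("resident blocks: %d, idle scratchpad: %d bytes\n", resident, unused);
            std::printf("extra block shares %d bytes with an owner block\n", shared_part);
            return 0;
        }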

    Decoupling algorithms from schedules for easy optimization of image processing pipelines

    Using existing programming tools, writing high-performance image processing code requires sacrificing readability, portability, and modularity. We argue that this is a consequence of conflating the computations that define the algorithm with decisions about storage and the order of computation. We refer to these latter two concerns as the schedule, including choices of tiling, fusion, recomputation vs. storage, vectorization, and parallelism. We propose a representation for feed-forward imaging pipelines that separates the algorithm from its schedule, enabling high performance without sacrificing code clarity. This decoupling simplifies the algorithm specification: images and intermediate buffers become functions over an infinite integer domain, with no explicit storage or boundary conditions. Imaging pipelines are compositions of functions. Programmers separately specify scheduling strategies for the various functions composing the algorithm, which allows them to efficiently explore different optimizations without changing the algorithmic code. We demonstrate the power of this representation by expressing a range of recent image processing applications in an embedded domain-specific language called Halide, and compiling them for ARM, x86, and GPUs. Our compiler targets SIMD units, multiple cores, and complex memory hierarchies. We demonstrate that it can handle algorithms such as a camera raw pipeline, the bilateral grid, fast local Laplacian filtering, and image segmentation. The algorithms expressed in our language are both shorter and faster than state-of-the-art implementations. Funding: National Science Foundation (U.S.) (Grants 0964004, 0964218, 0832997); United States Dept. of Energy (Award DE-SC0005288); Cognex Corporation; Adobe Systems.
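    The algorithm/schedule split is easiest to see in Halide itself. The following short C++ example (a 3x3 box blur in the style of the standard Halide tutorials, assuming the Halide library is installed) states the algorithm once, then layers tiling, vectorization, and parallelism on top without touching it.

        // Halide: the algorithm is stated once; the schedule is swapped freely.
        #include "Halide.h"
        using namespace Halide;

        int main() {
            Func blur_x("blur_x"), blur_y("blur_y");
            Var x("x"), y("y"), xi("xi"), yi("yi");
            ImageParam input(UInt(16), 2);

            // Algorithm: *what* is computed, as functions over an infinite domain.
            blur_x(x, y) = (input(x - 1, y) + input(x, y) + input(x + 1, y)) / 3;
            blur_y(x, y) = (blur_x(x, y - 1) + blur_x(x, y) + blur_x(x, y + 1)) / 3;

            // Schedule: *how* it is computed -- tiling, vectorization, parallelism --
            // changed here without modifying the algorithm above.
            blur_y.tile(x, y, xi, yi, 256, 32).vectorize(xi, 8).parallel(y);
            blur_x.compute_at(blur_y, x).vectorize(x, 8);

            blur_y.compile_jit(); // or compile ahead-of-time for ARM, x86, or GPUs
            return 0;
        }

    Trying a different fusion or tile size means editing only the two schedule lines; the algorithmic code above them stays untouched.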

    Scalability Validation of Parallel Sorting Algorithms

    As single-core processor performance is no longer improving significantly, the computer industry is moving towards increasing the number of cores per processor or, in the case of large-scale computers, installing more processors per computer. Applications now need to scale in accordance with this increase in parallel computing power, and software developers need to take advantage of this shift. Parallel sorting algorithms are basic building blocks for many complex applications. In this thesis, we will validate the expected execution-time complexities of five state-of-the-art parallel sorting algorithms, implemented in C using MPI for parallelization, by using a scalability validation framework based on Score-P and Extra-P. For each of the parallel sorting algorithms, we will create a performance model. These models will allow us to compare their scalability behaviour to the expectations. Furthermore, we will attempt to parallelize the local sorting step of the splitter-based parallel sorting algorithms via C++11 threads, OpenMP tasks, and CUDA acceleration. We construct the performance models, on which we base our evaluations, using uniformly randomly generated data. For most of the parallel sorting algorithms, we show that the given expectations match the created models. We will discuss any discrepancies in detail.
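    As a flavor of the local-sort parallelization attempted in the thesis, here is a hedged, self-contained C++11 sketch (illustrative, not the thesis code): two threads sort the halves of a uniformly random array concurrently, and std::inplace_merge combines them.

        // C++11 threads for the local sorting step (illustrative only).
        #include <algorithm>
        #include <cstdio>
        #include <random>
        #include <thread>
        #include <vector>

        int main() {
            std::vector<int> data(1 << 20);
            std::mt19937 rng(42);               // uniformly random input, as used
            for (int& v : data) v = (int)rng(); // for the performance models

            auto mid = data.begin() + data.size() / 2;
            std::thread left([&] { std::sort(data.begin(), mid); });
            std::sort(mid, data.end());         // sort the second half on this thread
            left.join();
            std::inplace_merge(data.begin(), mid, data.end());

            std::printf("sorted: %d\n", (int)std::is_sorted(data.begin(), data.end()));
            return 0;
        }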

    Pro++: A Profiling Framework for Primitive-based GPU Programming

    Parallelizing software applications through the use of existing optimized primitives is a common trend that mediates the complexity of manual parallelization and the use of less efficient directive-based programming models. Parallel primitive libraries allow software engineers to map any sequential code to a target many-core architecture by identifying the most computationally intensive code sections and mapping them onto one or more existing primitives. On the other hand, the spread of such a primitive-based programming model and the diversity of GPU architectures have led to a large and increasing number of third-party libraries, which often provide different implementations of the same primitive, each one optimized for a specific architecture. From the developer's point of view, this shifts the problem from parallelizing the software application to selecting, among the several implementations, the most efficient primitives for the target platform. This paper presents Pro++, a profiling framework for GPU primitives that allows measuring the implementation quality of a given primitive by considering the target architecture's characteristics. The framework collects the information provided by a standard GPU profiler and combines it into optimization criteria. The criteria evaluations are weighted to distinguish the impact of each optimization on the overall quality of the primitive implementation. The paper shows how the tuning of the different weights was conducted through the analysis of five of the most widespread existing primitive libraries, and how the framework was eventually applied to improve the implementation performance of two standard and widespread primitives.
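    The weighting idea can be sketched as follows; the criteria names, normalization, and weights below are hypothetical stand-ins for illustration, not Pro++'s actual API or tuned values.

        // Hypothetical weighted-criteria score in the spirit of Pro++ (illustrative only).
        #include <cstdio>

        struct Criteria {              // normalized optimization criteria in [0,1] (assumed)
            double coalescing;         // memory-access coalescing
            double occupancy;          // achieved SM occupancy
            double branch_efficiency;  // divergence-free fraction of branches
        };

        double quality(const Criteria& c) {
            // Weights rank how much each criterion matters on the target GPU;
            // Pro++ tunes such weights against existing primitive libraries.
            const double w_coal = 0.5, w_occ = 0.3, w_br = 0.2; // assumed values
            return w_coal * c.coalescing + w_occ * c.occupancy + w_br * c.branch_efficiency;
        }

        int main() {
            Criteria reduce_impl{0.9, 0.7, 0.8}; // metrics from a standard GPU profiler
            std::printf("quality score: %.2f\n", quality(reduce_impl));
            return 0;
        }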