242 research outputs found

    Optimizing the Performance of Directive-based Programming Model for GPGPUs

    Get PDF
    Accelerators have been deployed on most major HPC systems. They are considered to improve the performance of many applications. Accelerators such as GPUs have an immense potential in terms of high compute capacity but programming these devices is a challenge. OpenCL, CUDA and other vendor-specific models for accelerator programming definitely offer high performance, but these are low-level models that demand excellent programming skills; moreover, they are time consuming to write and debug. In order to simplify GPU programming, several directive-based programming models have been proposed, including HMPP, PGI accelerator model and OpenACC. OpenACC has now become established as the de facto standard. We evaluate and compare these models involving several scientific applications. To study the implementation challenges and the principles and techniques of directive- based models, we built an open source OpenACC compiler on top of a main stream compiler framework (OpenUH as a branch of Open64). In this dissertation, we present the required techniques to parallelize and optimize the applications ported with OpenACC programming model. We apply both user-level optimizations in the applications and compiler and runtime-driven optimizations. The compiler optimization focuses on the parallelization of reduction operations inside nested parallel loops. To fully utilize all GPU resources, we also extend the OpenACC model to support multiple GPUs in a single node. Our application porting experience also revealed the challenge of choosing good loop schedules. The default loop schedule chosen by the compiler may not produce the best performance, so the user has to manually try different loop schedules to improve the performance. To solve this issue, we developed a locality-aware auto-tuning framework which is based on the proposed memory access cost model to help the compiler choose optimal loop schedules and guide the user to choose appropriate loop schedules.Computer Science, Department o

    Evaluation of GPU Acceleration for WRF–SFIRE

    Get PDF
    WRF–SFIRE is an open source, atmospheric–wildfire model that couples the WRF model with the level set fire spread model to simulate wildfires in real time. This model has many applications and more scientific questions can be asked and answered if the model can be run faster. Nvidia has put a lot of effort into easing the barrier of entry for accelerating applications with their tools to be run on GPUs. Various physical simulations have been successfully ported to utilize GPUs and have benefited from the speed increase. In this research, we take a look at WRF-SFIRE and try to use the Nvida tools to accelerate portions of code. We were successful in offloading work to the GPU. However, the WRF-SFIRE codebase contains too many data dependencies, deeply nested function calls and I/O to effectively utilize the GPU’s resources. We look at specific examples and try to run them on a Titan V GPU. In the end, the compute intensive portions of WRF-SFIRE need to be rewritten to avoid data dependencies in order to leverage GPUs to improve the execution time

    Multiple target task sharing support for the OpenMP accelerator model

    Get PDF
    The use of GPU accelerators is becoming common in HPC platforms due to the their effective performance and energy efficiency. In addition, new generations of multicore processors are being designed with wider vector units and/or larger hardware thread counts, also contributing to the peak performance of the whole system. Although current directive–based paradigms, such as OpenMP or OpenACC, support both accelerators and multicore-based hosts, they do not provide an effective and efficient way to concurrently use them, usually resulting in accelerated programs in which the potential computational performance of the host is not exploited. In this paper we propose an extension to the OpenMP 4.5 directive-based programming model to support the specification and execution of multiple instances of task regions on different devices (i.e. accelerators in conjunction with the vector and heavily multithreaded capabilities in multicore processors). The compiler is responsible for the generation of device-specific code for each device kind, delegating to the runtime system the dynamic schedule of the tasks to the available devices. The new proposed clause conveys useful insight to guide the scheduler while keeping a clean, abstract and machine independent programmer interface. The potential of the proposal is analyzed in a prototype implementation in the OmpSs compiler and runtime infrastructure. Performance evaluation is done using three kernels (N-Body, tiled matrix multiply and Stream) on different GPU-capable systems based on ARM, Intel x86 and IBM Power8. From the evaluation we observe speed–ups in the 8–20% range compared to versions in which only the GPU is used, reaching 96 % of the additional peak performance thanks to the reduction of data transfers and the benefits introduced by the OmpSs NUMA-aware scheduler.This work is partially supported by the IBM/BSC Deep Learning Center Initiative, by the Spanish Government through Programa Severo Ochoa (SEV-2015-0493), by the Spanish Ministry of Science and Technology through TIN2015-65316-P project and by the Generalitat de Catalunya (contract 2014-SGR-1051).Peer ReviewedPostprint (author's final draft

    A Novel Set of Directives for Multi-device Programming with OpenMP

    Get PDF
    This work was supported by MEEP project, which has received funding from the European High-Performance Computing Joint Undertaking (JU) under grant agreement No 946002. The JU receives support from theEuropean Union’s Horizon 2020 research and innovation programme and Spain, Croatia, Turkey.Peer ReviewedPostprint (author's final draft

    The readying of applications for heterogeneous computing

    Get PDF
    High performance computing is approaching a potentially significant change in architectural design. With pressures on the cost and sheer amount of power, additional architectural features are emerging which require a re-think to the programming models deployed over the last two decades. Today's emerging high performance computing (HPC) systems are maximising performance per unit of power consumed resulting in the constituent parts of the system to be made up of a range of different specialised building blocks, each with their own purpose. This heterogeneity is not just limited to the hardware components but also in the mechanisms that exploit the hardware components. These multiple levels of parallelism, instruction sets and memory hierarchies, result in truly heterogeneous computing in all aspects of the global system. These emerging architectural solutions will require the software to exploit tremendous amounts of on-node parallelism and indeed programming models to address this are emerging. In theory, the application developer can design new software using these models to exploit emerging low power architectures. However, in practice, real industrial scale applications last the lifetimes of many architectural generations and therefore require a migration path to these next generation supercomputing platforms. Identifying that migration path is non-trivial: With applications spanning many decades, consisting of many millions of lines of code and multiple scientific algorithms, any changes to the programming model will be extensive and invasive and may turn out to be the incorrect model for the application in question. This makes exploration of these emerging architectures and programming models using the applications themselves problematic. Additionally, the source code of many industrial applications is not available either due to commercial or security sensitivity constraints. This thesis highlights this problem by assessing current and emerging hard- ware with an industrial strength code, and demonstrating those issues described. In turn it looks at the methodology of using proxy applications in place of real industry applications, to assess their suitability on the next generation of low power HPC offerings. It shows there are significant benefits to be realised in using proxy applications, in that fundamental issues inhibiting exploration of a particular architecture are easier to identify and hence address. Evaluations of the maturity and performance portability are explored for a number of alternative programming methodologies, on a number of architectures and highlighting the broader adoption of these proxy applications, both within the authors own organisation, and across the industry as a whole

    The ESCAPE project : Energy-efficient Scalable Algorithms for Weather Prediction at Exascale

    Get PDF
    In the simulation of complex multi-scale flows arising in weather and climate modelling, one of the biggest challenges is to satisfy strict service requirements in terms of time to solution and to satisfy budgetary constraints in terms of energy to solution, without compromising the accuracy and stability of the application. These simulations require algorithms that minimise the energy footprint along with the time required to produce a solution, maintain the physically required level of accuracy, are numerically stable, and are resilient in case of hardware failure. The European Centre for Medium-Range Weather Forecasts (ECMWF) led the ESCAPE (Energy-efficient Scalable Algorithms for Weather Prediction at Exascale) project, funded by Horizon 2020 (H2020) under the FET-HPC (Future and Emerging Technologies in High Performance Computing) initiative. The goal of ESCAPE was to develop a sustainable strategy to evolve weather and climate prediction models to next-generation computing technologies. The project partners incorporate the expertise of leading European regional forecasting consortia, university research, experienced high-performance computing centres, and hardware vendors. This paper presents an overview of the ESCAPE strategy: (i) identify domain-specific key algorithmic motifs in weather prediction and climate models (which we term Weather & Climate Dwarfs), (ii) categorise them in terms of computational and communication patterns while (iii) adapting them to different hardware architectures with alternative programming models, (iv) analyse the challenges in optimising, and (v) find alternative algorithms for the same scheme. The participating weather prediction models are the following: IFS (Integrated Forecasting System); ALARO, a combination of AROME (Application de la Recherche a l'Operationnel a Meso-Echelle) and ALADIN (Aire Limitee Adaptation Dynamique Developpement International); and COSMO-EULAG, a combination of COSMO (Consortium for Small-scale Modeling) and EULAG (Eulerian and semi-Lagrangian fluid solver). For many of the weather and climate dwarfs ESCAPE provides prototype implementations on different hardware architectures (mainly Intel Skylake CPUs, NVIDIA GPUs, Intel Xeon Phi, Optalysys optical processor) with different programming models. The spectral transform dwarf represents a detailed example of the co-design cycle of an ESCAPE dwarf. The dwarf concept has proven to be extremely useful for the rapid prototyping of alternative algorithms and their interaction with hardware; e.g. the use of a domain-specific language (DSL). Manual adaptations have led to substantial accelerations of key algorithms in numerical weather prediction (NWP) but are not a general recipe for the performance portability of complex NWP models. Existing DSLs are found to require further evolution but are promising tools for achieving the latter. Measurements of energy and time to solution suggest that a future focus needs to be on exploiting the simultaneous use of all available resources in hybrid CPU-GPU arrangements
    • …
    corecore