9 research outputs found

    Scalable Heterogeneous Supercomputing: Programming Methodologies and Automated Code Generation

    No full text
    Many-core processors such as graphics processing units (GPUs) and the Xeon Phi deliver enormous computational power at low power consumption. This property has quickly made such processing units very popular for scientific high-performance computing. A growing number of modern supercomputers and HPC clusters are equipped with such processing units in addition to traditional processors (CPUs). The challenge with many-core processors, however, is that they are considerably harder to program than CPUs, and well-suited programming languages and models are lacking, which leaves large parts of the hardware unused. The consequences are most visible in the largest clusters, where low performance and poor scaling halt the development of more complex simulations. The goal of this thesis is to contribute new and more efficient programming methods that improve resource utilization in supercomputers equipped with GPUs. By distributing computations between GPU and CPU, we show that such heterogeneous computations are substantially more efficient than the more established computational models in which the computations run exclusively on either the GPU or the CPU. To reduce the complexity that comes with heterogeneous computing, we have developed a framework consisting of a programming model and a compiler that makes it easy to translate sequential computations into distributed, parallel heterogeneous computations that can run on a supercomputer equipped with both GPUs and CPUs.
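
    The CPU-GPU work division described above can be sketched as a proportional split of a loop's iteration space. This is a simplified illustration with hypothetical throughput numbers, not the thesis framework itself:

```python
# Sketch: split one loop's iteration range between a GPU and a CPU in
# proportion to their (assumed, pre-measured) throughputs, so that both
# devices finish their share at roughly the same time.

def partition(n_items, gpu_throughput, cpu_throughput):
    """Return (gpu_range, cpu_range) that together cover 0..n_items."""
    split = round(n_items * gpu_throughput / (gpu_throughput + cpu_throughput))
    return range(0, split), range(split, n_items)

# With a GPU three times faster than the CPU on this kernel, the GPU
# receives 750 of 1000 iterations and the CPU the remaining 250.
gpu_part, cpu_part = partition(1000, gpu_throughput=3.0, cpu_throughput=1.0)
```

    In a real heterogeneous run, each range would then be processed concurrently on its own device; the point of the split is that neither device sits idle waiting for the other.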

    A Parallel Front Propagation Method : Simulating geological folds on parallel architectures

    Get PDF
    Static non-linear Hamilton-Jacobi equations are often used to describe a propagating front. Advanced numerical algorithms are needed to efficiently compute solutions to these non-linear equations. In geological modelling, layers of rock can be described as the position of a propagating front at different times. A fast simulation of such layers is a key component in exploration software developed by Kalkulo AS for Statoil AS. Developing fast algorithms and solvers is essential in this application, since faster solvers enable users to test more geological scenarios, leading to a better understanding of the inner earth. Front propagation is also used in other applications, such as reservoir simulation, seismic processing and medical imaging, making a fast algorithm highly versatile. The recent rise of parallel architectures has made substantial computational resources available. One way to obtain faster solvers is therefore to develop algorithms that are able to exploit the increasing parallelism that these architectures offer. In this thesis, a novel three-dimensional anisotropic front propagation algorithm for simulating geological folds on parallel architectures is presented. The algorithm's abundant parallelism is demonstrated on multi-core CPU and GPU architectures. Implementation on multi-core architectures is achieved using the OpenMP API, while the Mint programming model is used to facilitate the GPU programming. We demonstrate speedups ranging from 2x to 7x running on the Nvidia GeForce GTX 590 GPU, compared with a multi-threaded implementation on a NUMA machine using two interconnected 12-core AMD Opteron processors. These results point to an enormous potential for performance advances of our algorithm on parallel architectures.
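
    To make the front-propagation idea concrete, here is a minimal sketch of an iterative-sweeping solver for the isotropic 2D eikonal equation |grad T| = 1 (arrival time of a unit-speed front). The thesis algorithm is 3D, anisotropic, and parallel; the alternating sweep directions below are only the serial core idea such methods build on:

```python
import math

INF = float("inf")

def solve_eikonal(n, source, h=1.0, n_sweeps=8):
    """Arrival times T on an n x n grid, unit front speed, one source point."""
    T = [[INF] * n for _ in range(n)]
    T[source[0]][source[1]] = 0.0
    fwd, bwd = range(n), range(n - 1, -1, -1)
    for _ in range(n_sweeps):
        # Alternate the four sweep orders so the front information can
        # propagate across the grid in every direction.
        for rows, cols in ((fwd, fwd), (bwd, fwd), (fwd, bwd), (bwd, bwd)):
            for i in rows:
                for j in cols:
                    if (i, j) == source:
                        continue
                    a = min(T[i - 1][j] if i > 0 else INF,
                            T[i + 1][j] if i < n - 1 else INF)
                    b = min(T[i][j - 1] if j > 0 else INF,
                            T[i][j + 1] if j < n - 1 else INF)
                    if min(a, b) == INF:
                        continue  # no upwind neighbour reached yet
                    if abs(a - b) >= h:
                        t = min(a, b) + h  # one-sided Godunov update
                    else:
                        t = (a + b + math.sqrt(2 * h * h - (a - b) ** 2)) / 2
                    T[i][j] = min(T[i][j], t)
    return T
```

    On a 5x5 grid with the source in a corner, the direct neighbour gets arrival time 1 and the diagonal neighbour (2 + sqrt(2))/2, matching the Godunov two-sided update.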

    Multi-GPU Implementations of Parallel 3D Sweeping Algorithms with Application to Geological Folding

    Get PDF
    This paper studies the CUDA programming challenges of using multiple GPUs inside a single machine to carry out plane-by-plane updates in parallel 3D sweeping algorithms. In particular, care must be taken to mask the overhead of various data movements between the GPUs. Multiple OpenMP threads on the CPU side should be combined with multiple CUDA streams per GPU to hide the data transfer cost related to the halo computation on each 2D plane. Moreover, the technique of peer-to-peer data motion can be used to reduce the impact of 3D volumetric data shuffles that have to be done between mandatory changes of the grid partitioning. We have investigated the performance improvement of 2- and 4-GPU implementations that are applicable to 3D anisotropic front propagation computations related to geological folding. In comparison with a straightforward multi-GPU implementation, the overall performance improvement due to masking of data movements on four GPUs of the Fermi architecture was 23%. The corresponding improvement obtained on four Kepler GPUs was 47%.
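
    The overlap pattern hinted at above can be made explicit as a per-plane operation schedule. This is a schematic sketch (the operation names are ours, not the paper's API): the halo rows of each plane are computed first so their exchange can be posted on a separate stream and hidden behind the interior computation:

```python
# Sketch: the per-plane pipeline one GPU follows so that the inter-GPU
# halo exchange overlaps with the interior computation of the same plane.

def plane_schedule(n_planes):
    """Return the ordered operation list for one GPU in a 3D sweep."""
    ops = []
    for k in range(n_planes):
        ops.append(("compute_halo", k))         # boundary rows needed by peer GPUs
        ops.append(("exchange_halo_async", k))  # posted on a copy stream ...
        ops.append(("compute_interior", k))     # ... and masked by this work
        ops.append(("sync_streams", k))         # both streams done before plane k+1
    return ops
```

    The saving comes from the two middle steps running concurrently: the exchange is asynchronous, so the interior update proceeds while data is in flight, and only the final synchronization orders plane k before plane k+1.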

    Scalable Heterogeneous CPU-GPU Computations for Unstructured Tetrahedral Meshes

    No full text
    A recent trend in modern high-performance computing environments is the introduction of powerful, energy-efficient hardware accelerators such as GPUs and Xeon Phi coprocessors. These specialized computing devices coexist with CPUs and are optimized for highly parallel applications. In regular, compute-intensive applications with predictable data access patterns, these devices often far outperform CPUs and thus relegate the latter to pure control functions instead of computations. For irregular applications, however, the performance gap can be much smaller and is sometimes even reversed. Thus, maximizing the overall performance on heterogeneous systems requires making full use of all available computational resources, including both accelerators and CPUs.

    On the performance and energy efficiency of the PGAS programming model on multicore architectures

    Get PDF
    Accepted manuscript version. Published version at https://doi.org/10.1109/HPCSim.2016.7568416

    Towards Fine-Grained Dynamic Tuning of HPC Applications on Modern Multi-Core Architectures

    No full text
    There is a consensus that exascale systems should operate within a power envelope of 20 MW. Consequently, energy conservation is still considered the most crucial constraint if such systems are to be realized. So far, most research on this topic has focused on strategies such as power capping and dynamic power management. Although these approaches can reduce power consumption, we believe that they might not be sufficient to reach the exascale energy-efficiency goals. Hence, we aim to adopt techniques from embedded systems, where energy efficiency has always been the fundamental objective. A successful energy-saving technique used in embedded systems is to integrate fine-grained autotuning with dynamic voltage and frequency scaling. In this paper, we apply a similar technique to a real-world HPC application. Our experimental results on an HPC cluster indicate that such an approach can save up to 19% of energy compared to the baseline configuration, with negligible performance loss.
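
    The selection step at the heart of such frequency tuning can be sketched as follows: per application region, pick the CPU frequency that minimizes measured energy (runtime times average power). The numbers below are invented for illustration, not taken from the paper:

```python
# Sketch: choose the energy-optimal frequency for a code region from
# hypothetical per-frequency measurements {frequency_GHz: (runtime_s, avg_power_W)}.

def best_frequency(measurements):
    """Return the frequency minimizing energy = runtime * power."""
    return min(measurements, key=lambda f: measurements[f][0] * measurements[f][1])

# Memory-bound region: downclocking barely hurts runtime, so a reduced
# frequency gives the lowest energy (90 J vs 68.25 J vs 65 J here).
memory_bound = {2.4: (1.00, 90.0), 1.8: (1.05, 65.0), 1.2: (1.30, 50.0)}

# Compute-bound region: runtime scales with the clock, so downclocking
# does not pay off (100 J vs 120 J here).
compute_bound = {2.4: (1.00, 100.0), 1.2: (2.00, 60.0)}
```

    This contrast is exactly why fine-grained, per-region tuning beats a single whole-application frequency: memory-bound and compute-bound regions have opposite optima.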

    The READEX formalism for automatic tuning for energy efficiency

    Get PDF
    Energy efficiency is an important aspect of future exascale systems, mainly due to rising energy costs. Although high-performance computing (HPC) applications are compute-centric, they still exhibit varying computational characteristics in different regions of the program, such as compute-, memory-, and I/O-bound code regions. Some of today's clusters already offer mechanisms to adjust the system to the resource requirements of an application, e.g., by controlling the CPU frequency. However, manually tuning for improved energy efficiency is a tedious and painstaking task that is often neglected by application developers. The European Union's Horizon 2020 project READEX (Runtime Exploitation of Application Dynamism for Energy-efficient eXascale computing) aims at developing a tools-aided approach for improved energy efficiency of current and future HPC applications. To reach this goal, the READEX project combines technologies from two ends of the compute spectrum, embedded systems and HPC, constituting a split design-time/runtime methodology. From the HPC domain, the Periscope Tuning Framework (PTF) is extended to perform dynamic auto-tuning of fine-grained application regions using the systems scenario methodology, which was originally developed for improving energy efficiency in embedded systems. This paper introduces the concepts of the READEX project, its envisioned implementation, and preliminary results that demonstrate the feasibility of this approach.
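
    The split design-time/runtime idea can be illustrated in miniature (our simplification, not the READEX implementation): design-time analysis stores the best configuration per instrumented region, and a lightweight runtime merely looks it up whenever a region is entered:

```python
# Sketch: design-time tuning produces a region -> best-frequency table;
# the runtime hook applies the stored setting on region entry.
# Region names and frequencies are hypothetical.

tuning_model = {"assemble_matrix": 2.4, "sparse_solve": 1.8, "write_output": 1.2}
DEFAULT_FREQ = 2.4  # fallback for regions not seen at design time

def on_region_enter(region, apply_frequency):
    """Runtime hook: switch the CPU to the region's tuned frequency (GHz)."""
    apply_frequency(tuning_model.get(region, DEFAULT_FREQ))

# Simulate entering three regions; record what would be applied.
applied = []
for region in ("assemble_matrix", "sparse_solve", "write_output"):
    on_region_enter(region, applied.append)
```

    Keeping the expensive search at design time is what makes the runtime part cheap enough to invoke at fine-grained region boundaries.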