9 research outputs found

    Scalable Heterogeneous Supercomputing: Programming Methodologies and Automated Code Generation

    No full text
    Many-core processors such as graphics processing units (GPUs) and the Xeon Phi deliver enormous computational power at low power consumption. This property has quickly made such processing units very popular for scientific high-performance computing. A growing number of modern supercomputers and HPC clusters are equipped with such processing units in addition to traditional processors (CPUs). The challenge with many-core processors, however, is that they are considerably harder to program than CPUs, and well-suited programming languages and models are lacking, which leaves large parts of the hardware unused. The consequences are most visible in the largest clusters, where low performance and poor scaling halt the development of more complex simulations. The goal of this thesis is to contribute new and more efficient programming methods that improve resource utilization in supercomputers equipped with GPUs. By distributing computations between GPU and CPU, we show that such heterogeneous computations are substantially more efficient than the more established computational models in which the computations run exclusively on either the GPU or the CPU. To reduce the complexity that comes with heterogeneous computing, we have developed a framework consisting of a programming model and a compiler that makes it easy to translate sequential computations into distributed, parallel heterogeneous computations that can run on a supercomputer equipped with both GPUs and CPUs.
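
    The CPU-GPU work division described above can be sketched as a proportional split of a loop's iteration space. This is a simplified illustration with hypothetical throughput numbers, not the thesis framework itself:

```python
# Sketch: split one loop's iteration range between a GPU and a CPU in
# proportion to their (assumed, pre-measured) throughputs, so that both
# devices finish their share at roughly the same time.

def partition(n_items, gpu_throughput, cpu_throughput):
    """Return (gpu_range, cpu_range) that together cover 0..n_items."""
    split = round(n_items * gpu_throughput / (gpu_throughput + cpu_throughput))
    return range(0, split), range(split, n_items)

# With a GPU three times faster than the CPU on this kernel, the GPU
# receives 750 of 1000 iterations and the CPU the remaining 250.
gpu_part, cpu_part = partition(1000, gpu_throughput=3.0, cpu_throughput=1.0)
```

    In a real heterogeneous run, each range would then be processed concurrently on its own device; the point of the split is that neither device sits idle waiting for the other.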

    A Parallel Front Propagation Method : Simulating geological folds on parallel architectures

    Get PDF
    Static non-linear Hamilton-Jacobi equations are often used to describe a propagating front. Advanced numerical algorithms are needed to efficiently compute solutions to these non-linear equations. In geological modelling, layers of rock can be described as the position of a propagating front at different times. A fast simulation of such layers is a key component in exploration software developed by Kalkulo AS for Statoil AS. Developing fast algorithms and solvers is essential in this application, since faster solvers enable users to test more geological scenarios, leading to a better understanding of the inner earth. Front propagation is also used in other applications, such as reservoir simulation, seismic processing and medical imaging, making a fast algorithm highly versatile. The recent rise of parallel architectures has made substantial computational resources available. One way to obtain faster solvers is therefore to develop algorithms that are able to exploit the increasing parallelism that these architectures offer. In this thesis, a novel three-dimensional anisotropic front propagation algorithm for simulating geological folds on parallel architectures is presented. The algorithm's abundant parallelism is demonstrated on multi-core CPU and GPU architectures. Implementation on multi-core architectures is achieved using the OpenMP API, while the Mint programming model is used to facilitate the GPU programming. We demonstrate speedups ranging from 2x to 7x running on the Nvidia GeForce GTX 590 GPU, compared with a multi-threaded implementation on a NUMA machine using two interconnected 12-core AMD Opteron processors. These results point to an enormous potential for performance advances of our algorithm on parallel architectures.
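
    To make the front-propagation idea concrete, here is a minimal sketch of an iterative-sweeping solver for the isotropic 2D eikonal equation |grad T| = 1 (arrival time of a unit-speed front). The thesis algorithm is 3D, anisotropic, and parallel; the alternating sweep directions below are only the serial core idea such methods build on:

```python
import math

INF = float("inf")

def solve_eikonal(n, source, h=1.0, n_sweeps=8):
    """Arrival times T on an n x n grid, unit front speed, one source point."""
    T = [[INF] * n for _ in range(n)]
    T[source[0]][source[1]] = 0.0
    fwd, bwd = range(n), range(n - 1, -1, -1)
    for _ in range(n_sweeps):
        # Alternate the four sweep orders so the front information can
        # propagate across the grid in every direction.
        for rows, cols in ((fwd, fwd), (bwd, fwd), (fwd, bwd), (bwd, bwd)):
            for i in rows:
                for j in cols:
                    if (i, j) == source:
                        continue
                    a = min(T[i - 1][j] if i > 0 else INF,
                            T[i + 1][j] if i < n - 1 else INF)
                    b = min(T[i][j - 1] if j > 0 else INF,
                            T[i][j + 1] if j < n - 1 else INF)
                    if min(a, b) == INF:
                        continue  # no upwind neighbour reached yet
                    if abs(a - b) >= h:
                        t = min(a, b) + h  # one-sided Godunov update
                    else:
                        t = (a + b + math.sqrt(2 * h * h - (a - b) ** 2)) / 2
                    T[i][j] = min(T[i][j], t)
    return T
```

    On a 5x5 grid with the source in a corner, the direct neighbour gets arrival time 1 and the diagonal neighbour (2 + sqrt(2))/2, matching the Godunov two-sided update.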

    Multi-GPU Implementations of Parallel 3D Sweeping Algorithms with Application to Geological Folding

    Get PDF
    This paper studies the CUDA programming challenges of using multiple GPUs inside a single machine to carry out plane-by-plane updates in parallel 3D sweeping algorithms. In particular, care must be taken to mask the overhead of various data movements between the GPUs. Multiple OpenMP threads on the CPU side should be combined with multiple CUDA streams per GPU to hide the data transfer cost related to the halo computation on each 2D plane. Moreover, the technique of peer-to-peer data motion can be used to reduce the impact of 3D volumetric data shuffles that have to be done between mandatory changes of the grid partitioning. We have investigated the performance improvement of 2- and 4-GPU implementations that are applicable to 3D anisotropic front propagation computations related to geological folding. In comparison with a straightforward multi-GPU implementation, the overall performance improvement due to masking of data movements on four GPUs of the Fermi architecture was 23%. The corresponding improvement obtained on four Kepler GPUs was 47%.
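
    The overlap pattern hinted at above can be made explicit as a per-plane operation schedule. This is a schematic sketch (the operation names are ours, not the paper's API): the halo rows of each plane are computed first so their exchange can be posted on a separate stream and hidden behind the interior computation:

```python
# Sketch: the per-plane pipeline one GPU follows so that the inter-GPU
# halo exchange overlaps with the interior computation of the same plane.

def plane_schedule(n_planes):
    """Return the ordered operation list for one GPU in a 3D sweep."""
    ops = []
    for k in range(n_planes):
        ops.append(("compute_halo", k))         # boundary rows needed by peer GPUs
        ops.append(("exchange_halo_async", k))  # posted on a copy stream ...
        ops.append(("compute_interior", k))     # ... and masked by this work
        ops.append(("sync_streams", k))         # both streams done before plane k+1
    return ops
```

    The saving comes from the two middle steps running concurrently: the exchange is asynchronous, so the interior update proceeds while data is in flight, and only the final synchronization orders plane k before plane k+1.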

    Scalable Heterogeneous CPU-GPU Computations for Unstructured Tetrahedral Meshes

    No full text
    A recent trend in modern high-performance computing environments is the introduction of powerful, energy-efficient hardware accelerators such as GPUs and Xeon Phi coprocessors. These specialized computing devices coexist with CPUs and are optimized for highly parallel applications. In regular, compute-intensive applications with predictable data access patterns, these devices often far outperform CPUs and thus relegate the latter to pure control functions instead of computations. For irregular applications, however, the performance gap can be much smaller and is sometimes even reversed. Thus, maximizing the overall performance on heterogeneous systems requires making full use of all available computational resources, including both accelerators and CPUs.

    On the performance and energy efficiency of the PGAS programming model on multicore architectures

    Get PDF
    Accepted manuscript version. Published version at https://doi.org/10.1109/HPCSim.2016.7568416

    Towards Fine-Grained Dynamic Tuning of HPC Applications on Modern Multi-Core Architectures

    No full text
    There is a consensus that exascale systems should operate within a power envelope of 20 MW. Consequently, energy conservation is still considered the most crucial constraint if such systems are to be realized. So far, most research on this topic has focused on strategies such as power capping and dynamic power management. Although these approaches can reduce power consumption, we believe that they might not be sufficient to reach the exascale energy-efficiency goals. Hence, we aim to adopt techniques from embedded systems, where energy efficiency has always been the fundamental objective. A successful energy-saving technique used in embedded systems is to integrate fine-grained autotuning with dynamic voltage and frequency scaling. In this paper, we apply a similar technique to a real-world HPC application. Our experimental results on an HPC cluster indicate that such an approach can save up to 19% of energy compared to the baseline configuration, with negligible performance loss.
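
    The selection step at the heart of such frequency tuning can be sketched as follows: per application region, pick the CPU frequency that minimizes measured energy (runtime times average power). The numbers below are invented for illustration, not taken from the paper:

```python
# Sketch: choose the energy-optimal frequency for a code region from
# hypothetical per-frequency measurements {frequency_GHz: (runtime_s, avg_power_W)}.

def best_frequency(measurements):
    """Return the frequency minimizing energy = runtime * power."""
    return min(measurements, key=lambda f: measurements[f][0] * measurements[f][1])

# Memory-bound region: downclocking barely hurts runtime, so a reduced
# frequency gives the lowest energy (90 J vs 68.25 J vs 65 J here).
memory_bound = {2.4: (1.00, 90.0), 1.8: (1.05, 65.0), 1.2: (1.30, 50.0)}

# Compute-bound region: runtime scales with the clock, so downclocking
# does not pay off (100 J vs 120 J here).
compute_bound = {2.4: (1.00, 100.0), 1.2: (2.00, 60.0)}
```

    This contrast is exactly why fine-grained, per-region tuning beats a single whole-application frequency: memory-bound and compute-bound regions have opposite optima.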

    The READEX formalism for automatic tuning for energy efficiency

    Get PDF
    Energy efficiency is an important aspect of future exascale systems, mainly due to rising energy costs. Although high-performance computing (HPC) applications are compute-centric, they still exhibit varying computational characteristics in different regions of the program, such as compute-, memory-, and I/O-bound code regions. Some of today's clusters already offer mechanisms to adjust the system to the resource requirements of an application, e.g., by controlling the CPU frequency. However, manually tuning for improved energy efficiency is a tedious and painstaking task that is often neglected by application developers. The European Union's Horizon 2020 project READEX (Runtime Exploitation of Application Dynamism for Energy-efficient eXascale computing) aims at developing a tools-aided approach for improved energy efficiency of current and future HPC applications. To reach this goal, the READEX project combines technologies from two ends of the compute spectrum, embedded systems and HPC, constituting a split design-time/runtime methodology. From the HPC domain, the Periscope Tuning Framework (PTF) is extended to perform dynamic auto-tuning of fine-grained application regions using the systems scenario methodology, which was originally developed for improving energy efficiency in embedded systems. This paper introduces the concepts of the READEX project, its envisioned implementation, and preliminary results that demonstrate the feasibility of this approach.
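
    The split design-time/runtime idea can be illustrated in miniature (our simplification, not the READEX implementation): design-time analysis stores the best configuration per instrumented region, and a lightweight runtime merely looks it up whenever a region is entered:

```python
# Sketch: design-time tuning produces a region -> best-frequency table;
# the runtime hook applies the stored setting on region entry.
# Region names and frequencies are hypothetical.

tuning_model = {"assemble_matrix": 2.4, "sparse_solve": 1.8, "write_output": 1.2}
DEFAULT_FREQ = 2.4  # fallback for regions not seen at design time

def on_region_enter(region, apply_frequency):
    """Runtime hook: switch the CPU to the region's tuned frequency (GHz)."""
    apply_frequency(tuning_model.get(region, DEFAULT_FREQ))

# Simulate entering three regions; record what would be applied.
applied = []
for region in ("assemble_matrix", "sparse_solve", "write_output"):
    on_region_enter(region, applied.append)
```

    Keeping the expensive search at design time is what makes the runtime part cheap enough to invoke at fine-grained region boundaries.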