24 research outputs found

Hybrid Use of OmpSs for a Shock Hydrodynamics Proxy Application

Author: Jan Christian Meyer

The LULESH proxy application models the behavior of the ALE3D multi-physics code with an explicit shock hydrodynamics problem. It was created to evaluate interactions between programming models and architectures using a representative code that is significantly less complex than the application it models. As identified in the PRACE deliverable D7.2.1 [1], the OmpSs programming model specifically targets programming at the exascale, and this whitepaper investigates the effectiveness of its support for development on hybrid architectures.

ZENODO

On the benefits of tasking with OpenMP

Author: A Duran
A Rico
E Ayguadé
M Garcia-Gasulla
MJ Berger
P Atkinson
P Virouleau
R Vidal
T Gautier
X Teruel
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2019

Tasking promises a model to program parallel applications that provides intuitive semantics. In the case of tasks with dependences, it also promises better load balancing by removing global synchronizations (barriers), and potential for improved locality. Still, the adoption of tasking in production HPC codes has been slow. Despite OpenMP supporting tasks, most codes rely on worksharing-loop constructs alongside MPI primitives. This paper provides insights on the benefits of tasking over the worksharing-loop model by reporting on the experience of taskifying an adaptive mesh refinement proxy application: miniAMR. The performance evaluation shows the taskified implementation being 15–30% faster than the loop-parallel one for certain thread counts across four systems, three architectures and four compilers thanks to better load balancing and system utilization. Dynamic scheduling of loops narrows the gap but still falls short of tasking due to serial sections between loops. Locality improvements are incidental due to the lack of locality-aware scheduling. Overall, the introduction of asynchrony with tasking lives up to its promises, provided that programmers parallelize beyond individual loops and across application phases.
Peer Reviewed. Postprint (author's final draft)

Crossref

UPCommons. Portal del coneixement obert de la UPC

Evaluating Performance of OpenMP Tasks in a Seismic Stencil Application

Author: Araya-Polo Mauricio
Chapman Barbara
Meng Jie
Raut Eric
Publication venue: Academic Commons
Publication date: 01/09/2020

Simulations based on stencil computations (widely used in geosciences) have been dominated by the MPI+OpenMP programming paradigm. Little effort has been devoted to experimenting with task-based parallelism in this context. We address this by introducing OpenMP task parallelism into the kernel of an industrial seismic modeling code, Minimod. We observe that even for these highly regular stencil computations, taskified kernels are competitive with traditional OpenMP-augmented loops, and in some experiments tasks even outperform loop parallelism. This promising result sets the stage for more complex computational patterns. Simulations involve more than just the stencil calculation: a collection of kernels is often needed to accomplish the scientific objective (e.g., I/O, boundary conditions). These kernels can often be computed simultaneously; however, implementing this simultaneous computation with traditional programming models is not trivial. The presented approach will be extended to cover simultaneous execution of several kernels, where we expect to fully exploit the benefits of task-based programming.

Stony Brook University - SUNY

Task scheduling techniques for asymmetric multi-core systems

Author: Ayguadé Parra Eduard
Badia Sala Rosa Maria
Casas Marc
Chronaki Kallia
Labarta Mancho Jesús José
Moreto Planas Miquel
Rico Alejandro
Valero Cortés Mateo
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2017

As performance and energy efficiency have become the main challenges for next-generation high-performance computing, asymmetric multi-core architectures can provide solutions to tackle these issues. Parallel programming models need to suit the needs of such systems and keep increasing the application’s portability and efficiency. This paper proposes two task scheduling approaches that target asymmetric systems. These dynamic scheduling policies reduce total execution time either by detecting the longest or the critical path of the dynamic task dependency graph of the application, or by finding the earliest executor of a task. They use dynamic scheduling and information discoverable during execution, which makes them implementable and functional without the need for off-line profiling. In our evaluation we compare these scheduling approaches with two existing state-of-the-art heterogeneous schedulers and we track their improvement over a FIFO baseline scheduler. We show that the heterogeneous schedulers improve the baseline by up to 1.45× in a real 8-core asymmetric system and up to 2.1× in a simulated 32-core asymmetric chip.
This work has been supported by the Spanish Government (SEV2015-0493), by the Spanish Ministry of Science and Innovation (contract TIN2015-65316-P), by Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), by the RoMoL ERC Advanced Grant (GA 321253) and the European HiPEAC Network of Excellence. The Mont-Blanc project receives funding from the EU’s Seventh Framework Programme (FP7/2007-2013) under grant agreement no 610402 and from the EU’s H2020 Framework Programme (H2020/2014-2020) under grant agreement no 671697. M. Moretó has been partially supported by the Ministry of Economy and Competitiveness under Juan de la Cierva postdoctoral fellowship number JCI-2012-15047. M. Casas is supported by the Secretary for Universities and Research of the Ministry of Economy and Knowledge of the Government of Catalonia and the Cofund programme of the Marie Curie Actions of the 7th R&D Framework Programme of the European Union (Contract 2013 BP B 00243).
Peer Reviewed. Postprint (author's final draft)

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Digital.CSIC

A Parallel Algorithm and Implementation to Compute Spatial Autocorrelation (Hotspot) Using MATLAB

Author: Li Mingjun
Publication venue: e-Publications@Marquette
Publication date: 01/04/2020

Hotspot analysis, a tool for visualizing spatial autocorrelation, has in recent years been used in various fields, such as disease analysis, crime analysis, and the analysis and prediction of weather conditions in a given area. Most research on hot spot analysis applies the concept to a variety of fields to gain insight into the statistical significance of clustering in the data. Only a few studies discuss the efficiency and optimization of the algorithm. Commonly, these analyses are based on huge spatio-temporal datasets, and the conventional algorithm takes too long to produce results. This paper mainly discusses whether the algorithm can be processed in parallel with MATLAB and how to further optimize it to shorten the calculation time and obtain accurate outcomes faster. I use ‘parpool’ in MATLAB on a multi-core node to parallelize the conventional algorithm, and then take advantage of the basic idea of the 'R-tree' to further optimize the parallel algorithm. The results are satisfactory: the conventional serial algorithm can be parallelized in MATLAB, and the parallel version runs about five times faster than the original algorithm. With the further optimization, it runs about ten times faster. This paper will be helpful in saving time when doing similar computations and analyses in the future.

epublications@Marquette

Exploiting commutativity to reduce the cost of updates to shared data in cache-coherent systems

Author: Horn Webb H
Sanchez Daniel
Zhang Guowei
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/12/2015

We present Coup, a technique to lower the cost of updates to shared data in cache-coherent systems. Coup exploits the insight that many update operations, such as additions and bitwise logical operations, are commutative: they produce the same final result regardless of the order they are performed in. Coup allows multiple private caches to simultaneously hold update-only permission to the same cache line. Caches with update-only permission can locally buffer and coalesce updates to the line, but cannot satisfy read requests. Upon a read request, Coup reduces the partial updates buffered in private caches to produce the final value. Coup integrates seamlessly into existing coherence protocols, requires inexpensive hardware, and does not affect the memory consistency model. We apply Coup to speed up single-word updates to shared data. On a simulated 128-core, 8-socket system, Coup accelerates state-of-the-art implementations of update-heavy algorithms by up to 2.4×.
Funding: Center for Future Architectures Research; National Science Foundation (U.S.) (CAREER-1452994); Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science (Grier Presidential Fellowship); Microelectronics Advanced Research Corporation; United States. Defense Advanced Research Projects Agenc
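
The buffering idea in the Coup abstract above can be illustrated with a toy software analogy. This is only a sketch under stated assumptions, not the hardware coherence protocol the paper describes: the class name `CommutativeCounter`, its methods, and the per-core dictionary are hypothetical names chosen for illustration. Each "core" accumulates additions in a private buffer, and a read reduces all buffered partial updates into the final value.

```python
from collections import defaultdict


class CommutativeCounter:
    """Toy model of update-only buffering for a commutative operation (addition).

    Hypothetical illustration only; it does not model caches or coherence states.
    """

    def __init__(self, initial=0):
        self.shared = initial             # value visible to readers
        self.partial = defaultdict(int)   # per-core buffered partial sums

    def add(self, core_id, delta):
        # Update-only access: coalesce the update in the core's private buffer
        # without touching the shared value or other cores' buffers.
        self.partial[core_id] += delta

    def read(self):
        # A read forces a reduction of all buffered partial updates.
        self.shared += sum(self.partial.values())
        self.partial.clear()
        return self.shared


def demo():
    counter = CommutativeCounter(initial=10)
    for core_id, delta in [(0, 1), (1, 5), (0, 2), (2, -3)]:
        counter.add(core_id, delta)
    return counter.read()
```

Because addition is commutative and associative, the value returned by `read()` does not depend on the order in which the cores issued their updates, which is the property the abstract relies on.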

DSpace@MIT

Crossref

Towards dynamic threading support for OpenMP

Author: Stadler Jacques
Publication venue: ETH, Swiss Federal Institute of Technology, Laboratory for Software Technology
Publication date: 01/01/2009

Repository for Publications and Research Data