29 research outputs found

    Resilient Optimistic Termination Detection for the Async-Finish Model

    Driven by increasing core counts and decreasing mean time to failure in supercomputers, HPC runtime systems must improve support for dynamic task-parallel execution and resilience to failures. The async-finish task model, adapted for distributed systems as the asynchronous partitioned global address space programming model, provides a simple way to decompose a computation into nested task groups, each managed by a ‘finish’ that signals the termination of all tasks within the group. For distributed termination detection, maintaining a consistent view of task state across multiple unreliable processes requires additional bookkeeping when creating or completing tasks and finish scopes. Runtime systems that perform this bookkeeping pessimistically, i.e. synchronously with task state changes, add a high communication overhead compared to non-resilient protocols. In this paper, we propose optimistic finish, the first message-optimal resilient termination detection protocol for the async-finish model. By avoiding the communication of certain task and finish events, this protocol allows uncertainty about the global structure of the computation, which can be resolved correctly at failure time, thereby reducing the overhead of failure-free execution. Performance results using micro-benchmarks and the LULESH hydrodynamics proxy application show significant reductions in resilience overhead with optimistic finish compared to pessimistic finish. Our optimistic finish protocol is applicable to any task-based runtime system that offers automatic termination detection for dynamic graphs of non-migratable tasks.
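
    To make the construct concrete, here is a minimal, non-resilient sketch of the async-finish idiom the protocol builds on. The Finish class and its async/wait methods are hypothetical shorthand, not the paper's API or the optimistic protocol itself; the sketch only shows the per-scope bookkeeping (a live-task count) that a resilient protocol must keep consistent across unreliable processes.

```cpp
#include <atomic>
#include <condition_variable>
#include <cstdio>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical Finish scope: counts live tasks; wait() returns only after
// every task spawned with async() has terminated.
class Finish {
    std::atomic<int> live_{0};
    std::mutex m_;
    std::condition_variable cv_;
    std::vector<std::thread> threads_;
public:
    void async(std::function<void()> task) {          // spawn a task in scope
        live_.fetch_add(1, std::memory_order_relaxed);
        threads_.emplace_back([this, task = std::move(task)] {
            task();
            if (live_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
                std::lock_guard<std::mutex> lk(m_);   // last task signals
                cv_.notify_all();
            }
        });
    }
    void wait() {                                     // termination detection
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return live_.load() == 0; });
        lk.unlock();                                  // release before joining
        for (auto& t : threads_) t.join();
    }
};

int main() {
    Finish f;
    for (int i = 0; i < 4; ++i)
        f.async([i] { std::printf("task %d done\n", i); });
    f.wait();  // blocks until all four tasks have terminated
}
```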

    The cooperative parallel: A discussion about run-time schedulers for nested parallelism

    Nested parallelism is a well-known parallelization strategy for exploiting irregular parallelism in HPC applications. This strategy also fits critical real-time embedded systems composed of a set of concurrent functionalities; in this case, nested parallelism can be used to further exploit the parallelism of each functionality. However, current run-time implementations of nested parallelism can produce inefficiencies and load imbalance. Moreover, in critical real-time embedded systems, it may lead to incorrect executions due to, for instance, a non-work-conserving scheduler. In both cases, the reason is that the teams of OpenMP threads are a black box for the scheduler, i.e., the scheduler that assigns OpenMP threads and tasks to the set of available computing resources is agnostic to the internal execution of each team. This paper proposes a new run-time scheduler that considers dynamic information about the OpenMP threads and tasks running within several concurrent teams, i.e., concurrent parallel regions. This information may include the existence of OpenMP threads waiting at a barrier and the priority of tasks ready to execute. By making the concurrent parallel regions cooperate, the shared computing resources can be better controlled, and a work-conserving, priority-driven scheduler can be guaranteed.
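
    The black-box situation described above can be reproduced with a few lines of standard OpenMP. In the sketch below (assuming a compiler with OpenMP support, e.g. g++ -fopenmp), each thread of an outer team opens its own inner team, and a conventional runtime schedules the two inner teams with no visibility into, for instance, threads idling at an inner barrier.

```cpp
#include <omp.h>
#include <cstdio>

int main() {
    omp_set_max_active_levels(2);           // permit one level of nesting
    #pragma omp parallel num_threads(2)     // two concurrent functionalities
    {
        int outer = omp_get_thread_num();
        #pragma omp parallel num_threads(4) // each opens its own team
        {
            // Work of this functionality. A barrier inside this region is
            // invisible to whatever is scheduling the other team.
            std::printf("team %d, inner thread %d\n",
                        outer, omp_get_thread_num());
        }
    }
}
```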

    A high-performance matrix-matrix multiplication methodology for CPU and GPU architectures

    Current compilers cannot generate code that competes with hand-tuned code in efficiency, even for a kernel as simple as matrix–matrix multiplication (MMM). A key step in program optimization is the estimation of optimal values for parameters such as tile sizes and the number of levels of tiling. Selecting scheduling parameter values is a difficult and time-consuming task, since the values depend on each other; this is why they are typically found by search methods and empirical techniques. To overcome this problem, the scheduling sub-problems must be optimized together, as one problem and not separately. In this paper, an MMM methodology is presented in which the optimal scheduling parameters are found by theoretically reducing the search space, while the major scheduling sub-problems are addressed together as one problem, according to the hardware architecture parameters and the input size; for different hardware architecture parameters and/or input sizes, a different implementation is produced. This is achieved by fully exploiting the software characteristics (e.g., data reuse) and hardware architecture parameters (e.g., data cache sizes and associativities), giving high-quality solutions and a smaller search space. The methodology applies to a wide range of CPU and GPU architectures.
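
    As a concrete illustration of the scheduling parameters in question, the sketch below applies one level of loop tiling to MMM. The tile size T is exactly the kind of parameter the methodology derives from cache sizes and associativities instead of empirical search; the default of 64 here is an arbitrary placeholder, not a value from the paper, and C is assumed to be zero-initialized.

```cpp
#include <vector>
#include <algorithm>

// One level of tiling for C = A * B on n x n row-major matrices. The three
// T x T tiles touched by the inner loops should fit in the L1 data cache.
void mmm_tiled(const std::vector<double>& A, const std::vector<double>& B,
               std::vector<double>& C, int n, int T = 64) {
    for (int ii = 0; ii < n; ii += T)
        for (int kk = 0; kk < n; kk += T)
            for (int jj = 0; jj < n; jj += T)
                for (int i = ii; i < std::min(ii + T, n); ++i)
                    for (int k = kk; k < std::min(kk + T, n); ++k) {
                        double a = A[i * n + k]; // data reuse: A(i,k) loaded once
                        for (int j = jj; j < std::min(jj + T, n); ++j)
                            C[i * n + j] += a * B[k * n + j]; // stride-1 access
                    }
}
```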

    Hoard: A scalable memory allocator for multithreaded applications

    Parallel, multithreaded C and C++ programs such as web servers, database managers, news servers, and scientific applications are becoming increasingly prevalent. For these applications, the memory allocator is often a bottleneck that severely limits program performance and scalability on multiprocessor systems. Previous allocators suffer from problems that include poor performance and scalability, and heap organizations that introduce false sharing. Worse, many allocators exhibit a dramatic increase in memory consumption when confronted with a producer-consumer pattern of object allocation and freeing; this increase can range from a factor of P (the number of processors) to unbounded.
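
    The producer-consumer blowup is easy to picture with a small stress pattern: one thread allocates every object and another frees them, so an allocator with purely per-thread heaps lets freed memory pile up on the consumer's side while the producer keeps requesting fresh memory from the OS. The program below is a hypothetical illustration of that pattern, not code from the paper.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

int main() {
    std::queue<char*> q;
    std::mutex m;
    std::condition_variable cv;
    const int N = 100000;

    std::thread producer([&] {
        for (int i = 0; i < N; ++i) {
            char* obj = new char[256];  // always allocated by this thread
            { std::lock_guard<std::mutex> lk(m); q.push(obj); }
            cv.notify_one();
        }
    });
    std::thread consumer([&] {
        for (int i = 0; i < N; ++i) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return !q.empty(); });
            char* obj = q.front(); q.pop();
            lk.unlock();
            delete[] obj;               // always freed by this thread
        }
    });
    producer.join();
    consumer.join();
}
```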

    Distributed Computing for Enumeration

    No full text

    Weighted adaptive concurrency control for software transactional memory

    No full text

    Upper Bounds on Number of Steals in Rooted Trees

    No full text

    A survey on optimizations towards best-effort hardware transactional memory

    No full text