Extending the Nested Parallel Model to the Nested Dataflow Model with Provably Efficient Schedulers

Author: Dinh David
Simhadri Harsha Vardhan
Tang Yuan
Publication date: 14/02/2016

The nested parallel (a.k.a. fork-join) model is widely used for writing parallel programs. However, the two composition constructs, i.e. "$\parallel$" (parallel) and "$;$" (serial), are insufficient in expressing "partial dependencies" or "partial parallelism" in a program. We propose a new dataflow composition construct "$\leadsto$" to express partial dependencies in algorithms in a processor- and cache-oblivious way, thus extending the Nested Parallel (NP) model to the Nested Dataflow (ND) model. We redesign several divide-and-conquer algorithms ranging from dense linear algebra to dynamic programming in the ND model and prove that they all have optimal span while retaining optimal cache complexity. We propose the design of runtime schedulers that map ND programs to multicore processors with multiple levels of possibly shared caches (i.e., Parallel Memory Hierarchies) and provide theoretical guarantees on their ability to preserve locality and load balance. For this, we adapt space-bounded (SB) schedulers for the ND model. We show that our algorithms have increased "parallelizability" in the ND model, and that SB schedulers can use the extra parallelizability to achieve asymptotically optimal bounds on cache misses and running time on a greater number of processors than in the NP model. The running time for the algorithms in this paper is $O\left(\frac{\sum_{i=0}^{h-1} Q^{*}({\mathsf t};\sigma\cdot M_i)\cdot C_i}{p}\right)$, where $Q^{*}$ is the cache complexity of task ${\mathsf t}$, $C_i$ is the cost of a cache miss at the level-$i$ cache, which is of size $M_i$, $\sigma\in(0,1)$ is a constant, and $p$ is the number of processors in an $h$-level cache hierarchy.
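To make the running-time expression in the abstract above concrete, the following minimal sketch evaluates the quantity inside the O(...), i.e. the sum over cache levels of (cache complexity at level i) times (miss cost at level i), divided by the number of processors. The function name `running_time_bound` and all numeric inputs are hypothetical and chosen only for illustration. The per-level cache complexities are passed in as plain numbers, since computing Q* for a real task is outside the scope of this sketch.

```python
from typing import Sequence


def running_time_bound(q_star: Sequence[float],
                       miss_cost: Sequence[float],
                       p: int) -> float:
    """Evaluate sum_i Q*(t; sigma*M_i) * C_i / p for an h-level hierarchy.

    q_star[i]    -- assumed cache complexity of the task at level i
    miss_cost[i] -- assumed cost C_i of one miss at level i
    p            -- number of processors
    Constant factors hidden by the O-notation are ignored.
    """
    if len(q_star) != len(miss_cost):
        raise ValueError("need one cache complexity and one miss cost per level")
    if p <= 0:
        raise ValueError("p must be positive")
    return sum(q * c for q, c in zip(q_star, miss_cost)) / p


# Hypothetical 3-level hierarchy: fewer misses but higher cost at deeper levels.
example = running_time_bound(q_star=[1000, 100, 10],
                             miss_cost=[1, 10, 100],
                             p=4)
print(example)  # 750.0
```

With these made-up numbers each level contributes 1000 cost units, so the total of 3000 divided over 4 processors gives 750.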

arXiv.org e-Print Archive

Crossref

Parallelizing with BDSC, a resource-constrained scheduling algorithm for shared and distributed memory systems

Author: Ancourt Corinne
Jouvelot Pierre
Khaldi Dounia
Publication venue: Elsevier BV
Publication date: 01/01/2015

We introduce a new parallelization framework for scientific computing based on BDSC, an efficient automatic scheduling algorithm for parallel programs in the presence of resource constraints on the number of processors and their local memory size. BDSC extends Yang and Gerasoulis's Dominant Sequence Clustering (DSC) algorithm; it uses sophisticated cost models and addresses both shared and distributed parallel memory architectures. We describe BDSC, its integration within the PIPS compiler infrastructure and its application to the parallelization of four well-known scientific applications: Harris, ABF, equake and IS. Our experiments suggest that BDSC's focus on efficient resource management leads to significant parallelization speedups on both shared and distributed memory systems, improving upon DSC results, as shown by the comparison of the sequential and parallelized versions of these four applications running on both OpenMP and MPI frameworks.

HAL Descartes

HAL-MINES ParisTech

Autotuning for Automatic Parallelization on Heterogeneous Systems

Author: Pfaffe Philip
Publication venue: KIT-Bibliothek, Karlsruhe
Publication date: 01/01/2020

KITopen

Performance Estimation for Task Graphs Combining Sequential Path Profiling and Control Dependence Regions

Author: A. Tumeo
C. Pilato
F. Ferrandi
M. Lattuada
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2009

The speed-up estimation of parallelized code is crucial to efficiently compare different parallelization techniques or task graph transformations. Unfortunately, most of the time, during the parallelization of a specification, the information that can be extracted by profiling the corresponding sequential code (e.g. the most executed paths) is not properly taken into account. In particular, correlating sequential path profiling with the corresponding parallelized code can help in the identification of code hot spots, opening new possibilities for automatic parallelization. For this reason, starting from a well-known profiling technique, Efficient Path Profiling, we propose a methodology that estimates the speed-up of a parallelized specification using only the corresponding hierarchical task graph representation and the information coming from the dynamic profiling of the initial sequential specification. Experimental results show that the proposed solution outperforms existing approaches.
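As a rough illustration of estimating speed-up from a task graph annotated with profiled costs, the sketch below computes the classic work-over-critical-path ratio. This is a generic textbook estimate, not the methodology proposed in the paper: it ignores path profiling, control dependence regions, hierarchy and processor limits. The function name `estimate_speedup`, the task names and the costs are all hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def estimate_speedup(cost, deps):
    """Upper-bound speed-up of a task graph with unlimited processors.

    cost -- dict: task -> profiled sequential execution time
    deps -- dict: task -> set of tasks that must finish first
    Returns total work divided by the critical-path length.
    """
    graph = {task: deps.get(task, ()) for task in cost}
    finish = {}
    # static_order() yields every task after all of its predecessors.
    for task in TopologicalSorter(graph).static_order():
        start = max((finish[d] for d in graph[task]), default=0.0)
        finish[task] = start + cost[task]
    total_work = sum(cost.values())
    critical_path = max(finish.values())
    return total_work / critical_path


# Hypothetical diamond-shaped graph: a -> {b, c} -> d
cost = {"a": 2.0, "b": 3.0, "c": 4.0, "d": 1.0}
deps = {"b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
print(round(estimate_speedup(cost, deps), 3))  # 1.429
```

Here the total work is 10 time units and the longest dependency chain (a, c, d) takes 7, so no schedule can be more than about 1.43 times faster than the sequential run under these assumed costs.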

Archivio istituzionale della ricerca - Politecnico di Milano

Crossref

05101 Abstracts Collection -- Scheduling for Parallel Architectures: Theory, Applications, Challenges

Author: Altman Erik
Dehnert James
Kessler Christoph W.
Knoop Jens
Publication venue: Dagstuhl Seminar Proceedings. 05101 - Scheduling for Parallel Architectures: Theory, Applications, Challenges
Publication date: 01/01/2005

From 06.03.05 to 11.03.05, the Dagstuhl Seminar 05101 "Scheduling for Parallel Architectures: Theory, Applications, Challenges" was held in the International Conference and Research Center (IBFI), Schloss Dagstuhl. During the seminar, several participants presented their current research, and ongoing work and open problems were discussed. Abstracts of the presentations given during the seminar as well as abstracts of seminar results and ideas are put together in this paper. The first section describes the seminar topics and goals in general.

Dagstuhl Research Online Publication Server

Safety verification of asynchronous pushdown systems with shaped stacks

Author: A. Bouajjani
A. Finkel
A. Heußner
A. Lal
C. Flanagan
D. Brand
G. Ramalingam
J. Esparza
K. Sen
M. Mukund
P. Ganty
R. Chadha
S. Torre La
T.A. Henzinger
W. Czerwiński
Publication date: 01/01/2013

In this paper, we study the program-point reachability problem of concurrent pushdown systems that communicate via unbounded and unordered message buffers. Our goal is to relax the common restriction that messages can only be retrieved by a pushdown process when its stack is empty. We use the notion of partially commutative context-free grammars to describe a new class of asynchronously communicating pushdown systems with a mild shape constraint on the stacks for which the program-point coverability problem remains decidable. Stacks that fit the shape constraint may reach arbitrary heights; further, a process may execute any communication action (be it process creation, message send or retrieval) whether or not its stack is empty. This class extends previous computational models studied in the context of asynchronous programs, and enables the safety verification of a large class of message-passing programs.

arXiv.org e-Print Archive

CiteSeerX

Crossref

Oxford University Research Archive