1,148 research outputs found
GLB: Lifeline-based Global Load Balancing library in X10
We present GLB, a programming model and an associated implementation that can
handle a wide range of irregular paral- lel programming problems running over
large-scale distributed systems. GLB is applicable both to problems that are
easily load-balanced via static scheduling and to problems that are hard to
statically load balance. GLB hides the intricate syn- chronizations (e.g.,
inter-node communication, initialization and startup, load balancing,
termination and result collection) from the users. GLB internally uses a
version of the lifeline graph based work-stealing algorithm proposed by
Saraswat et al. Users of GLB are simply required to write several pieces of
sequential code that comply with the GLB interface. GLB then schedules and
orchestrates the parallel execution of the code correctly and efficiently at
scale. We have applied GLB to two representative benchmarks: Betweenness
Centrality (BC) and Unbalanced Tree Search (UTS). Among them, BC can be
statically load-balanced whereas UTS cannot. In either case, GLB scales well--
achieving nearly linear speedup on different computer architectures (Power,
Blue Gene/Q, and K) -- up to 16K cores
A Three-Level Parallelisation Scheme and Application to the Nelder-Mead Algorithm
We consider a three-level parallelisation scheme. The second and third levels
define a classical two-level parallelisation scheme and some load balancing
algorithm is used to distribute tasks among processes. It is well-known that
for many applications the efficiency of parallel algorithms of the second and
third level starts to drop down after some critical parallelisation degree is
reached. This weakness of the two-level template is addressed by introduction
of one additional parallelisation level. As an alternative to the basic solver
some new or modified algorithms are considered on this level. The idea of the
proposed methodology is to increase the parallelisation degree by using less
efficient algorithms in comparison with the basic solver. As an example we
investigate two modified Nelder-Mead methods. For the selected application, a
few partial differential equations are solved numerically on the second level,
and on the third level the parallel Wang's algorithm is used to solve systems
of linear equations with tridiagonal matrices. A greedy workload balancing
heuristic is proposed, which is oriented to the case of a large number of
available processors. The complexity estimates of the computational tasks are
model-based, i.e. they use empirical computational data
Performance driven distributed scheduling of parallel hybrid computations
AbstractExascale computing is fast becoming a mainstream research area. In order to realize exascale performance, it is necessary to have efficient scheduling of large parallel computations with scalable performance on a large number of cores/processors. The scheduler needs to execute in a pure distributed and online fashion, should follow affinity inherent in the computation and must have low time and message complexity. Further, it should also avoid physical deadlocks due to bounded resources including space/memory per core. Simultaneous consideration of these factors makes affinity driven distributed scheduling particularly challenging. We attempt to address this challenge for hybrid parallel computations which contain tasks that have pre-specified affinity to a place and also tasks that can be mapped to any place in the system. Specifically, we address two scheduling problems of the type Pm|Mj,prec|Cmax. This paper presents online distributed scheduling algorithms for hybrid parallel computations assuming both unconstrained and bounded space per place. We also present the time and message complexity for distributed scheduling of hybrid computations. To the best of our knowledge, this is the first time that distributed scheduling algorithms for hybrid parallel computations have been presented and analyzed for time and message bounds under both unconstrained space and bounded space
Adaptive runtime-assisted block prefetching on chip-multiprocessors
Memory stalls are a significant source of performance degradation in modern processors. Data prefetching is a widely adopted and well studied technique used to alleviate this problem. Prefetching can be performed by the hardware, or be initiated and controlled by software. Among software controlled prefetching we find a wide variety of schemes, including runtime-directed prefetching and more specifically runtime-directed block prefetching. This paper proposes a hybrid prefetching mechanism that integrates a software driven block prefetcher with existing hardware prefetching techniques. Our runtime-assisted software prefetcher brings large blocks of data on-chip with the support of a low cost hardware engine, and synergizes with existing hardware prefetchers that manage locality at a finer granularity. The runtime system that drives the prefetch engine dynamically selects which cache to prefetch to. Our evaluation on a set of scientific benchmarks obtains a maximum speed up of 32 and 10 % on average compared to a baseline with hardware prefetching only. As a result, we also achieve a reduction of up to 18 and 3 % on average in energy-to-solution.Peer ReviewedPostprint (author's final draft
Locality-aware parallel block-sparse matrix-matrix multiplication using the Chunks and Tasks programming model
We present a method for parallel block-sparse matrix-matrix multiplication on
distributed memory clusters. By using a quadtree matrix representation, data
locality is exploited without prior information about the matrix sparsity
pattern. A distributed quadtree matrix representation is straightforward to
implement due to our recent development of the Chunks and Tasks programming
model [Parallel Comput. 40, 328 (2014)]. The quadtree representation combined
with the Chunks and Tasks model leads to favorable weak and strong scaling of
the communication cost with the number of processes, as shown both
theoretically and in numerical experiments.
Matrices are represented by sparse quadtrees of chunk objects. The leaves in
the hierarchy are block-sparse submatrices. Sparsity is dynamically detected by
the matrix library and may occur at any level in the hierarchy and/or within
the submatrix leaves. In case graphics processing units (GPUs) are available,
both CPUs and GPUs are used for leaf-level multiplication work, thus making use
of the full computing capacity of each node.
The performance is evaluated for matrices with different sparsity structures,
including examples from electronic structure calculations. Compared to methods
that do not exploit data locality, our locality-aware approach reduces
communication significantly, achieving essentially constant communication per
node in weak scaling tests.Comment: 35 pages, 14 figure
Runtime support for load balancing of parallel adaptive and irregular applications
Applications critical to today\u27s engineering research often must make use of the increased memory and processing power of a parallel machine. While advances in architecture design are leading to more and more powerful parallel systems, the software tools needed to realize their full potential are in a much less advanced state. In particular, efficient, robust, and high-performance runtime support software is critical in the area of dynamic load balancing. While the load balancing of loosely synchronous codes, such as field solvers, has been studied extensively for the past 15 years, there exists a class of problems, known as asynchronous and highly adaptive , for which the dynamic load balancing problem remains open. as we discuss, characteristics of this class of problems render compile-time or static analysis of little benefit, and complicate the dynamic load balancing task immensely.;We make two contributions to this area of research. The first is the design and development of a runtime software toolkit, known as the Parallel Runtime Environment for Multi-computer Applications, or PREMA, which provides interprocessor communication, a global namespace, a framework for the implementation of customized scheduling policies, and several such policies which are prevalent in the load balancing literature. The PREMA system is designed to support coarse-grained domain decompositions with the goals of portability, flexibility, and maintainability in mind, so that developers will quickly feel comfortable incorporating it into existing codes and developing new codes which make use of its functionality. We demonstrate that the programming model and implementation are efficient and lead to the development of robust and high-performance applications.;Our second contribution is in the area of performance modeling. In order to make the most effective use of the PREMA runtime software, certain parameters governing its execution must be set off-line. Optimal values for these parameters may be determined through repeated executions of the target application; however, this is not always possible, particularly in large-scale environments and long-running applications. We present an analytic model that allows the user to quickly and inexpensively predict application performance and fine-tune applications built on the PREMA platform
- …