4,799 research outputs found
A scheduling theory framework for GPU tasks efficient execution
Concurrent execution of tasks in GPUs can reduce the computation time of a workload by
overlapping data transfer and execution commands.
However it is difficult to implement an efficient run-
time scheduler that minimizes the workload makespan
as many execution orderings should be evaluated. In
this paper, we employ scheduling theory to build a
model that takes into account the device capabili-
ties, workload characteristics, constraints and objec-
tive functions. In our model, GPU tasks schedul-
ing is reformulated as a flow shop scheduling prob-
lem, which allow us to apply and compare well known
methods already developed in the operations research
field. In addition we develop a new heuristic, specif-
ically focused on executing GPU commands, that
achieves better scheduling results than previous tech-
niques. Finally, a comprehensive evaluation, showing
the suitability and robustness of this new approach,
is conducted in three different NVIDIA architectures
(Kepler, Maxwell and Pascal).Proyecto TIN2016- 0920R, Universidad de Málaga (Campus de Excelencia Internacional Andalucía Tech) y programa de donación de NVIDIA Corporation
Locality-aware parallel block-sparse matrix-matrix multiplication using the Chunks and Tasks programming model
We present a method for parallel block-sparse matrix-matrix multiplication on
distributed memory clusters. By using a quadtree matrix representation, data
locality is exploited without prior information about the matrix sparsity
pattern. A distributed quadtree matrix representation is straightforward to
implement due to our recent development of the Chunks and Tasks programming
model [Parallel Comput. 40, 328 (2014)]. The quadtree representation combined
with the Chunks and Tasks model leads to favorable weak and strong scaling of
the communication cost with the number of processes, as shown both
theoretically and in numerical experiments.
Matrices are represented by sparse quadtrees of chunk objects. The leaves in
the hierarchy are block-sparse submatrices. Sparsity is dynamically detected by
the matrix library and may occur at any level in the hierarchy and/or within
the submatrix leaves. In case graphics processing units (GPUs) are available,
both CPUs and GPUs are used for leaf-level multiplication work, thus making use
of the full computing capacity of each node.
The performance is evaluated for matrices with different sparsity structures,
including examples from electronic structure calculations. Compared to methods
that do not exploit data locality, our locality-aware approach reduces
communication significantly, achieving essentially constant communication per
node in weak scaling tests.Comment: 35 pages, 14 figure
A Domain Specific Approach to High Performance Heterogeneous Computing
Users of heterogeneous computing systems face two problems: firstly, in
understanding the trade-off relationships between the observable
characteristics of their applications, such as latency and quality of the
result, and secondly, how to exploit knowledge of these characteristics to
allocate work to distributed computing platforms efficiently. A domain specific
approach addresses both of these problems. By considering a subset of
operations or functions, models of the observable characteristics or domain
metrics may be formulated in advance, and populated at run-time for task
instances. These metric models can then be used to express the allocation of
work as a constrained integer program, which can be solved using heuristics,
machine learning or Mixed Integer Linear Programming (MILP) frameworks. These
claims are illustrated using the example domain of derivatives pricing in
computational finance, with the domain metrics of workload latency or makespan
and pricing accuracy. For a large, varied workload of 128 Black-Scholes and
Heston model-based option pricing tasks, running upon a diverse array of 16
Multicore CPUs, GPUs and FPGAs platforms, predictions made by models of both
the makespan and accuracy are generally within 10% of the run-time performance.
When these models are used as inputs to machine learning and MILP-based
workload allocation approaches, a latency improvement of up to 24 and 270 times
over the heuristic approach is seen.Comment: 14 pages, preprint draft, minor revisio
Revisiting Actor Programming in C++
The actor model of computation has gained significant popularity over the
last decade. Its high level of abstraction makes it appealing for concurrent
applications in parallel and distributed systems. However, designing a
real-world actor framework that subsumes full scalability, strong reliability,
and high resource efficiency requires many conceptual and algorithmic additives
to the original model.
In this paper, we report on designing and building CAF, the "C++ Actor
Framework". CAF targets at providing a concurrent and distributed native
environment for scaling up to very large, high-performance applications, and
equally well down to small constrained systems. We present the key
specifications and design concepts---in particular a message-transparent
architecture, type-safe message interfaces, and pattern matching
facilities---that make native actors a viable approach for many robust,
elastic, and highly distributed developments. We demonstrate the feasibility of
CAF in three scenarios: first for elastic, upscaling environments, second for
including heterogeneous hardware like GPGPUs, and third for distributed runtime
systems. Extensive performance evaluations indicate ideal runtime behaviour for
up to 64 cores at very low memory footprint, or in the presence of GPUs. In
these tests, CAF continuously outperforms the competing actor environments
Erlang, Charm++, SalsaLite, Scala, ActorFoundry, and even the OpenMPI.Comment: 33 page
- …