Deterministic Consistency: A Programming Model for Shared Memory Parallelism
The difficulty of developing reliable parallel software is generating
interest in deterministic environments, where a given program and input can
yield only one possible result. Languages or type systems can enforce
determinism in new code, and runtime systems can impose synthetic schedules on
legacy parallel code. To parallelize existing serial code, however, we would
like a programming model that is naturally deterministic without language
restrictions or artificial scheduling. We propose "deterministic consistency" (DC), a parallel programming model as easy to understand as the "parallel assignment" construct in sequential languages such as Perl and JavaScript, in which concurrent threads always read their inputs before writing shared outputs. DC supports
common data- and task-parallel synchronization abstractions such as fork/join
and barriers, as well as non-hierarchical structures such as producer/consumer
pipelines and futures. A preliminary prototype suggests that software-only
implementations of DC can run programs written for popular parallel environments such as OpenMP with low (<10%) overhead for some applications.
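The read-before-write discipline can be pictured with explicit double buffering. Below is a minimal sketch using a hypothetical stencil-style update; it illustrates the semantics only and is not the DC runtime, which enforces the discipline implicitly.

#include <stdio.h>
#include <string.h>
#include <omp.h>

#define N 8

/* Illustration only: mimic the read-before-write discipline of
 * deterministic consistency with explicit double buffering. */
int main(void) {
    int in[N]  = {1, 2, 3, 4, 5, 6, 7, 8};
    int out[N];

    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        /* Every thread reads only the immutable input snapshot... */
        int left  = in[(i + N - 1) % N];
        int right = in[(i + 1) % N];
        /* ...and writes only its own slot of the output buffer, so
         * the result is the same under any thread schedule. */
        out[i] = left + right;
    }
    /* The swap publishes the outputs, as in a parallel assignment
     * (a, b) = (b, a) in Perl or JavaScript. */
    memcpy(in, out, sizeof in);

    for (int i = 0; i < N; i++) printf("%d ", in[i]);
    printf("\n");
    return 0;
}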
An efficient multi-core implementation of a novel HSS-structured multifrontal solver using randomized sampling
We present a sparse linear system solver that is based on a multifrontal
variant of Gaussian elimination, and exploits low-rank approximation of the
resulting dense frontal matrices. We use hierarchically semiseparable (HSS)
matrices, which have low-rank off-diagonal blocks, to approximate the frontal
matrices. For HSS matrix construction, a randomized sampling algorithm is used
together with interpolative decompositions. The combination of the randomized
compression with a fast ULV HSS factorization leads to a solver with lower
computational complexity than the standard multifrontal method for many
applications, resulting in speedups of up to 7x for problems in our test
suite. The implementation targets many-core systems by using task parallelism
with dynamic runtime scheduling. Numerical experiments show performance
improvements over state-of-the-art sparse direct solvers. The implementation
achieves high performance and good scalability on a range of modern shared
memory parallel systems, including the Intel Xeon Phi (MIC). The code is part
of a software package called STRUMPACK -- STRUctured Matrices PACKage, which
also has a distributed-memory component for dense rank-structured matrices.
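The key property exploited by the randomized construction is that it only needs products of the frontal matrix with blocks of random vectors. A minimal sketch of that sampling step, with hypothetical names and a naive loop standing in for a BLAS dgemm (STRUMPACK's actual interfaces differ):

#include <stdlib.h>

/* Sketch of the sampling step behind randomized HSS construction:
 * the compression only needs products of the frontal matrix F
 * (n x n, row-major) with a random block Omega, i.e. S = F * Omega,
 * where k = target rank + a few oversampling columns. */
void sample(int n, int k, const double *F, double *S) {
    double *Omega = malloc((size_t)n * k * sizeof *Omega);
    for (int i = 0; i < n * k; i++)
        Omega[i] = 2.0 * rand() / RAND_MAX - 1.0;  /* uniform in [-1,1] */

    /* Naive dense product; a real solver would call BLAS dgemm. */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < k; j++) {
            double s = 0.0;
            for (int l = 0; l < n; l++)
                s += F[i * n + l] * Omega[l * k + j];
            S[i * k + j] = s;
        }
    free(Omega);
}

Interpolative decompositions are then applied to such sample matrices to obtain the low-rank HSS generators consumed by the ULV factorization.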
An Efficient OpenMP Runtime System for Hierarchical Architectures
Exploiting the full computational power of increasingly deep hierarchical multiprocessor machines requires a careful distribution of threads and
data among the underlying non-uniform architecture. The emergence of multi-core
chips and NUMA machines makes it important to minimize the number of remote
memory accesses, to favor cache affinities, and to guarantee fast completion of
synchronization steps. By using the BubbleSched platform as a threading backend
for the GOMP OpenMP compiler, we can easily translate the affinities of thread teams into scheduling hints through abstractions called bubbles. We then
propose a scheduling strategy suited to nested OpenMP parallelism. The
resulting preliminary performance evaluations show a significant improvement in speedup on a typical NAS OpenMP benchmark application.
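The team hierarchy such a scheduler works with can be sketched with plain nested OpenMP. This is an assumed illustration, not the BubbleSched API: each inner team below is the kind of unit a bubble scheduler can capture as a bubble and keep together on one cache or NUMA domain.

#include <stdio.h>
#include <omp.h>

/* Illustration only: nested OpenMP teams form a natural hierarchy
 * that a bubble scheduler can map onto the machine topology. */
int main(void) {
    omp_set_max_active_levels(2);          /* allow nested parallelism */
    #pragma omp parallel num_threads(2)    /* outer team, e.g. one per NUMA node */
    {
        int outer = omp_get_thread_num();
        #pragma omp parallel num_threads(4)  /* inner team, e.g. one per core */
        {
            printf("bubble %d, thread %d\n", outer, omp_get_thread_num());
        }
    }
    return 0;
}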
An Efficient Thread Mapping Strategy for Multiprogramming on Manycore Processors
The emergence of multicore and manycore processors is set to change the
parallel computing world. Applications are shifting towards increased
parallelism in order to utilise these architectures efficiently. This leads to
a situation where every application creates its desired number of threads, based on its degree of parallelism and the available system resources. Task
scheduling in such a multithreaded multiprogramming environment is a
significant challenge. In task scheduling, not only the order of execution but also the mapping of threads to execution resources is of great importance. In this paper we state and discuss some fundamental rules based on results obtained from selected applications of the BOTS benchmark suite on the
64-core TILEPro64 processor. We demonstrate how previously efficient mapping
policies such as those of the SMP Linux scheduler become inefficient when the
number of threads and cores grows. We propose a novel, low-overhead technique,
a heuristic based on the amount of time each CPU spends doing useful work, to fairly distribute the workload amongst the cores in a
multiprogramming environment. Our novel approach could be implemented as a
pragma similar to those in the new task-based OpenMP versions, or can be
incorporated as a distributed thread mapping mechanism in future manycore
programming frameworks. We show that our thread mapping scheme can outperform
the native GNU/Linux thread scheduler in both single-programming and
multiprogramming environments.
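An assumed illustration of such a useful-work heuristic (all names hypothetical; the paper's actual mechanism may differ): track accumulated busy time per core and map each new thread to the core that has done the least useful work so far.

#include <stdint.h>

#define NUM_CORES 64   /* e.g., the TILEPro64 */

/* Hypothetical sketch: each core accumulates the time it has spent
 * doing useful work; the runtime maps a new thread to the core with
 * the smallest accumulated busy time. */
static uint64_t busy_ns[NUM_CORES];   /* updated elsewhere by the runtime */

int pick_core(void) {
    int best = 0;
    for (int c = 1; c < NUM_CORES; c++)
        if (busy_ns[c] < busy_ns[best])
            best = c;
    return best;   /* bind the thread here, e.g. via sched_setaffinity */
}

A linear scan of 64 counters per mapping decision keeps the cost small, consistent with the low-overhead goal stated above.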
Some Experiments and Issues to Exploit Multicore Parallelism in a Distributed-Memory Parallel Sparse Direct Solver
MUMPS is a parallel sparse direct solver that uses message passing (MPI) for parallelism. In this report we investigate how thread parallelism can help take advantage of recent multicore architectures. The work consists of testing multithreaded BLAS libraries and inserting OpenMP directives into the routines that profiling revealed to be costly, with the objective of avoiding any deep restructuring or rewriting of the code. We report on various aspects of this work, present some of the benefits and difficulties, and show that 4 threads per MPI process is generally a good compromise. We then discuss various issues that appear to be critical in a mixed MPI-OpenMP environment.
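A minimal sketch of the resulting hybrid configuration (illustrative, not MUMPS code; the 4-thread setting follows the compromise reported above):

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

/* Illustrative hybrid MPI+OpenMP setup: several MPI processes, each
 * running 4 OpenMP threads (also usable by a multithreaded BLAS). */
int main(int argc, char **argv) {
    int provided, rank;
    /* FUNNELED: only the main thread makes MPI calls, which matches
     * OpenMP regions and threaded BLAS confined within each process. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    omp_set_num_threads(4);               /* 4 threads per MPI process */
    #pragma omp parallel
    {
        printf("rank %d, thread %d\n", rank, omp_get_thread_num());
    }

    MPI_Finalize();
    return 0;
}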