A Note on (Parallel) Depth- and Breadth-First Search by Arc Elimination
This note recapitulates an algorithmic observation for ordered Depth-First
Search (DFS) in directed graphs that immediately leads to a parallel algorithm
with linear speed-up for a range of processors for non-sparse graphs. The note
extends the approach to ordered Breadth-First Search (BFS). With
processors, both DFS and BFS algorithms run in time steps on a
shared-memory parallel machine allowing concurrent reading of locations, e.g.,
a CREW PRAM, and have linear speed-up for . Both algorithms need
synchronization steps.
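For reference only, a minimal sequential baseline for ordered DFS and BFS on a directed graph in adjacency-list form (out-arcs visited in list order) might look as follows; this is not the note's parallel arc-elimination algorithm, and the graph representation and function names are illustrative assumptions.

    #include <cstddef>
    #include <queue>
    #include <stack>
    #include <utility>
    #include <vector>

    // Directed graph in adjacency-list form; g[u] lists u's out-neighbors
    // in the fixed order that "ordered" DFS/BFS must respect.
    using Graph = std::vector<std::vector<int>>;

    // Ordered DFS from root r: out-arcs are explored in adjacency-list order;
    // returns the vertices in DFS preorder.
    std::vector<int> ordered_dfs(const Graph& g, int r) {
        std::vector<int> order;
        std::vector<char> visited(g.size(), 0);
        std::stack<std::pair<int, std::size_t>> st;  // (vertex, next arc index)
        visited[r] = 1;
        order.push_back(r);
        st.emplace(r, 0);
        while (!st.empty()) {
            auto& [u, i] = st.top();
            if (i == g[u].size()) { st.pop(); continue; }
            int v = g[u][i++];
            if (!visited[v]) {
                visited[v] = 1;
                order.push_back(v);
                st.emplace(v, 0);
            }
        }
        return order;
    }

    // Ordered BFS from root r: vertices are discovered level by level,
    // scanning each vertex's out-arcs in adjacency-list order.
    std::vector<int> ordered_bfs(const Graph& g, int r) {
        std::vector<int> order;
        std::vector<char> visited(g.size(), 0);
        std::queue<int> q;
        visited[r] = 1;
        q.push(r);
        while (!q.empty()) {
            int u = q.front(); q.pop();
            order.push_back(u);
            for (int v : g[u])
                if (!visited[v]) { visited[v] = 1; q.push(v); }
        }
        return order;
    }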
The Shortest Path Problem with Edge Information Reuse is NP-Complete
We show that the following variation of the single-source shortest path
problem is NP-complete. Let a weighted, directed, acyclic graph $G=(V,E)$
with source and sink vertices $s$ and $t$ be given. Let in addition a mapping
on $E$ be given that associates information with the edges (e.g., a
pointer), such that edges mapped to the same value carry the same
information; for such edges it is required that they also have the same weight. The length of a
simple $s$-$t$ path is the sum of the weights of the edges on the path, but edges
carrying the same information are counted only once. The problem is to determine a shortest
such path. We call this problem the \emph{edge information reuse shortest
path problem}. It is NP-complete by reduction from 3SAT.
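The objective itself is easy to evaluate for a given simple path; the hardness lies entirely in finding a shortest such path. A minimal sketch of the length computation, assuming a hypothetical edge representation in which the shared information is modelled as an integer key:

    #include <cstdint>
    #include <unordered_set>
    #include <vector>

    // Hypothetical edge representation: a weight plus an "information" key;
    // edges with the same key carry the same information and, by the problem
    // definition, also the same weight.
    struct Edge {
        double weight;
        std::uint64_t info;   // e.g., a pointer value or other identifier
    };

    // Length of a simple path under edge information reuse: weights are summed,
    // but edges sharing the same information key contribute only once.
    double reuse_path_length(const std::vector<Edge>& path) {
        std::unordered_set<std::uint64_t> seen;
        double length = 0.0;
        for (const Edge& e : path)
            if (seen.insert(e.info).second)   // first edge with this key only
                length += e.weight;
        return length;
    }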
On Optimal Trees for Irregular Gather and Scatter Collectives
We study the complexity of finding communication trees with the lowest
possible completion time for rooted, irregular gather and scatter collective
communication operations in fully connected, $k$-ported communication networks
under a linear-time transmission cost model. Consecutively numbered processors
specify data blocks of possibly different sizes to be collected at or
distributed from some (given) root processor where they are stored in processor
order. Data blocks can be combined into larger segments consisting of blocks
from or to different processors, but individual blocks cannot be split. We
distinguish between ordered and non-ordered communication trees depending on
whether segments of blocks are maintained in processor order. We show that
ordered communication trees with lowest completion time under one-ported
communication can be found in polynomial time by giving simple, but costly,
dynamic programming algorithms. In contrast, we show that it is an NP-complete
problem to construct cost-optimal, non-ordered communication trees. We have
implemented the dynamic programming algorithms for homogeneous networks to
evaluate the quality of different types of communication trees, in particular
to analyze a recent, distributed, problem-adaptive tree construction algorithm.
Model experiments show that this algorithm is close to the optimum for a
selection of block size distributions. A concrete implementation for specially
structured problems shows that optimal, non-binomial trees can possibly yield
further practical advantages.
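For illustration, the completion time of a given scatter tree under a linear-time transmission cost model (a segment of m units costs alpha + beta*m) could be evaluated as in the following sketch, which assumes one-ported, serialized sends at every node and store-and-forward of whole segments; the tree representation and parameter names are assumptions of this sketch, not the paper's dynamic programs.

    #include <algorithm>
    #include <vector>

    // Linear transmission cost model (an assumption of this sketch):
    // sending a segment of m units costs alpha + beta * m time units.
    struct CostModel { double alpha, beta; };

    // Node of a given scatter tree: the block destined for this processor and
    // the children, listed in the order in which they are served (one-ported).
    struct Node {
        double block;                  // size of this processor's own block
        std::vector<Node> children;    // served one after another
    };

    // Total amount of data that has to travel into the subtree rooted at n.
    double subtree_size(const Node& n) {
        double s = n.block;
        for (const Node& c : n.children) s += subtree_size(c);
        return s;
    }

    // Completion time of a scatter along the tree: a node may start sending to
    // its children (in order) once it has received its whole segment, and the
    // completion time is the latest time at which any block reaches its processor.
    double completion_time(const Node& n, const CostModel& cm, double start = 0.0) {
        double t = start;       // time at which n can start its next send
        double finish = start;  // latest arrival within this subtree
        for (const Node& c : n.children) {
            t += cm.alpha + cm.beta * subtree_size(c);   // serialized sends
            finish = std::max(finish, completion_time(c, cm, t));
        }
        return finish;
    }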
VieM v1.00 -- Vienna Mapping and Sparse Quadratic Assignment User Guide
This paper serves as a user guide to the mapping framework VieM (Vienna
Mapping and Sparse Quadratic Assignment). We give a rough overview of the
techniques used within the framework and describe the user interface as well as
the file formats used.
Simplified, stable parallel merging
This note makes an observation that significantly simplifies a number of
previous parallel, two-way merge algorithms based on binary search and
sequential merge in parallel. First, it is shown that the additional merge step
of distinguished elements as found in previous algorithms is not necessary,
thus simplifying the implementation and reducing constant factors. Second, by
fixing the requirements on the binary search, the merge algorithm becomes
stable, provided that the sequential merge subroutine is stable. The stable,
parallel merge algorithm can easily be used to implement a stable, parallel
merge sort.
For ordered sequences with and elements, , the simplified
merge algorithm runs in operations using processing
elements. It can be implemented on an EREW PRAM, but since it requires only a
single synchronization step, it is also a candidate for implementation on other
parallel, shared-memory computers.
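A possible shape of such a simplified, stable parallel merge is sketched below, assuming that the first sequence is cut into nearly equal chunks, that each cut value is located in the second sequence with a lower-bound binary search, and that the (stable) std::merge is applied to the resulting pairs of subranges with the first sequence always as its first argument; the function name and the use of std::thread are illustrative only.

    #include <algorithm>
    #include <cstddef>
    #include <thread>
    #include <vector>

    // Stable parallel merge of two sorted vectors: a is cut into p nearly equal
    // chunks, each cut value is located in b with std::lower_bound, and the
    // resulting pairs of subranges are merged independently by the stable
    // std::merge, with a always as the first range.
    std::vector<int> parallel_stable_merge(const std::vector<int>& a,
                                           const std::vector<int>& b,
                                           unsigned p) {
        std::vector<int> out(a.size() + b.size());
        if (a.empty() || p <= 1) {       // trivial cases: one sequential merge
            std::merge(a.begin(), a.end(), b.begin(), b.end(), out.begin());
            return out;
        }
        std::vector<std::size_t> ca(p + 1), cb(p + 1), co(p + 1);
        for (unsigned k = 0; k <= p; ++k) {
            ca[k] = k * a.size() / p;    // cut positions in a
            if (k == 0)       cb[k] = 0;
            else if (k == p)  cb[k] = b.size();
            else              cb[k] = static_cast<std::size_t>(
                                  std::lower_bound(b.begin(), b.end(), a[ca[k]]) -
                                  b.begin());
            co[k] = ca[k] + cb[k];       // corresponding cut in the output
        }
        std::vector<std::thread> threads;
        for (unsigned k = 0; k < p; ++k)
            threads.emplace_back([&, k] {
                std::merge(a.begin() + ca[k], a.begin() + ca[k + 1],
                           b.begin() + cb[k], b.begin() + cb[k + 1],
                           out.begin() + co[k]);
            });
        for (std::thread& t : threads) t.join();   // single synchronization step
        return out;
    }

Using lower_bound (rather than upper_bound) sends equal elements of the second sequence to the later chunk, which together with std::merge's tie-breaking in favor of the first sequence is what keeps the overall merge stable.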
Stamp-it: A more Thread-efficient, Concurrent Memory Reclamation Scheme in the C++ Memory Model
We present Stamp-it, a new, concurrent, lock-less memory reclamation scheme
with amortized, constant-time (thread-count independent) reclamation overhead.
Stamp-it has been implemented and proved correct in the C++ memory model, using
memory-consistency assumptions that are as weak as possible. We have likewise
(re)implemented six other comparable reclamation schemes. We give a detailed
performance comparison, showing that Stamp-it compares favorably with most of
these other schemes (sometimes performing better, and at least as well), while
being able to reclaim free memory nodes earlier.
On the State and Importance of Reproducible Experimental Research in Parallel Computing
Computer science is also an experimental science. This is particularly the
case for parallel computing, which is in a total state of flux, and where
experiments are necessary to substantiate, complement, and challenge
theoretical modeling and analysis. Here, experimental work is as important as
advances in theory, which are indeed often driven by the experimental
findings. In parallel computing, scientific contributions presented in research
articles are therefore often based on experimental data, with a substantial
part devoted to presenting and discussing the experimental findings. As in all
of experimental science, experiments must be presented in a way that makes
reproduction by other researchers possible, at least in principle. Despite
appearances to the contrary, we contend that reproducibility plays only a small
role, and is typically not achieved. As can readily be observed, articles often do not have a
sufficiently detailed description of their experiments, and do not make
available the software used to obtain the claimed results. As a consequence,
parallel computational results are most often impossible to reproduce, often
questionable, and therefore of little or no scientific value. We believe that
the description of how to reproduce findings should play an important part in
every serious, experiment-based parallel computing research article. We aim to
initiate a discussion of the reproducibility issue in parallel computing, and
elaborate on the importance of reproducible research for (1) better and sounder
technical/scientific papers, (2) a sounder and more efficient review process,
and (3) more effective collective work. This paper expresses our current view
on the subject and should be read as a position statement for discussion and
future work. We do not consider the related (but no less important) issue of
the quality of the experimental design.
A new and five older Concurrent Memory Reclamation Schemes in Comparison (Stamp-it)
Memory management is a critical component in almost all shared-memory,
concurrent data structures and algorithms, consisting in the efficient
allocation and the subsequent reclamation of shared memory resources. This
paper contributes a new, lock-free, amortized constant-time memory reclamation
scheme called \emph{Stamp-it}, and compares it to five well-known, selectively
efficient schemes from the literature, namely Lock-free Reference Counting,
Hazard Pointers, Quiescent State-based Reclamation, Epoch-based Reclamation,
and New Epoch-based Reclamation. An extensive, experimental evaluation with
both new and commonly used benchmarks is provided, on four different
shared-memory systems with hardware supported thread counts ranging from 48 to
512, showing Stamp-it to be competitive with, and in many cases and aspects
outperforming, the other schemes.
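For orientation, the compared schemes all expose some variant of a "protect while reading, retire when unlinked" interface to the data structures that use them. The hypothetical interface below is only meant to illustrate that common shape; it is not Stamp-it's actual API, and the names are invented for this sketch.

    #include <functional>

    // Hypothetical, minimal shape of the per-thread interface a concurrent
    // memory reclamation scheme offers to lock-free data structures. This is
    // NOT Stamp-it's actual API; it only illustrates the pattern that schemes
    // such as hazard pointers or epoch/quiescent-state-based reclamation all
    // have to support in some form.
    class ReclamationScheme {
    public:
        virtual ~ReclamationScheme() = default;

        // Enter/leave a region in which shared nodes may be dereferenced; nodes
        // retired by other threads must not be freed while a thread is inside
        // such a region (epoch-like schemes) or while individually protected
        // (hazard-pointer-like schemes).
        virtual void enter_region() = 0;
        virtual void leave_region() = 0;

        // Announce that a node has been unlinked from the data structure and may
        // be freed once no other thread can still hold a reference to it.
        virtual void retire(void* node, std::function<void(void*)> deleter) = 0;
    };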
More Parallelism in Dijkstra's Single-Source Shortest Path Algorithm
Dijkstra's algorithm for the Single-Source Shortest Path (SSSP) problem is
notoriously hard to parallelize in $o(n)$ depth, $n$ being the number of
vertices in the input graph, without increasing the required parallel work
unreasonably. Crauser et al.\ (1998) presented observations that allow
identifying more than a single vertex at a time as correct and correspondingly
more edges to be relaxed simultaneously. Their algorithm runs in parallel
phases, and for certain random graphs they showed that the number of phases is
$O(n^{1/3})$ with high probability. A work-efficient CRCW PRAM algorithm with this depth
was given, but no implementation on a real, parallel system.
In this paper we strengthen the criteria of Crauser et al., and discuss
tradeoffs between work and number of phases in their implementation. We present
simulation results with a range of common input graphs for the depth that an
ideal parallel algorithm that can apply the criteria at no cost and parallelize
relaxations without conflicts can achieve. These results show that the number
of phases is indeed a small root of $n$, but still off from the shortest path
length lower bound that can also be computed.
We give a shared-memory parallel implementation of the most work-efficient
version of Dijkstra's algorithm running in parallel phases, which we compare
to our own implementation of the well-known $\Delta$-stepping algorithm. We can
show that the work-efficient SSSP algorithm applying the criteria of Crauser et
al. is competitive with, and often better than, $\Delta$-stepping on our chosen
input graphs. Despite not providing a guarantee on the number of
required phases, criteria allowing concurrent relaxation of many correct
vertices may be a viable approach to practically fast, parallel SSSP
implementations.
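As an illustration of the phase structure only (not of the strengthened criteria or of the parallel implementation described above), the following sequential sketch settles, in each phase, every reached vertex that passes the OUT criterion of Crauser et al.; the graph representation and function names are assumptions, and the sketch favors simplicity over work-efficiency.

    #include <algorithm>
    #include <cstddef>
    #include <limits>
    #include <vector>

    struct Arc { int to; double w; };
    using Graph = std::vector<std::vector<Arc>>;   // out-adjacency lists

    // Sequential simulation of phase-wise SSSP with the OUT criterion of
    // Crauser et al.: in each phase, every reached, unsettled vertex v with
    // tent(v) <= min_u (tent(u) + minout(u)) is settled and its out-arcs are
    // relaxed. Returns the distances and reports the number of phases.
    std::vector<double> sssp_out_criterion(const Graph& g, int s, int& phases) {
        const double INF = std::numeric_limits<double>::infinity();
        const std::size_t n = g.size();
        std::vector<double> dist(n, INF), minout(n, INF);
        std::vector<char> settled(n, 0);
        for (std::size_t u = 0; u < n; ++u)
            for (const Arc& a : g[u]) minout[u] = std::min(minout[u], a.w);
        dist[s] = 0.0;
        phases = 0;
        for (;;) {
            // Threshold L over all reached, unsettled vertices u.
            double L = INF;
            bool any = false;
            for (std::size_t u = 0; u < n; ++u)
                if (!settled[u] && dist[u] < INF) {
                    any = true;
                    L = std::min(L, dist[u] + minout[u]);
                }
            if (!any) break;
            ++phases;
            // Settle every vertex passing the OUT criterion, then relax their arcs.
            std::vector<std::size_t> frontier;
            for (std::size_t u = 0; u < n; ++u)
                if (!settled[u] && dist[u] < INF && dist[u] <= L) {
                    settled[u] = 1;
                    frontier.push_back(u);
                }
            for (std::size_t u : frontier)
                for (const Arc& a : g[u])
                    dist[a.to] = std::min(dist[a.to], dist[u] + a.w);
        }
        return dist;
    }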
Memory Models for C/C++ Programmers
The memory model is the crux of the concurrency semantics of shared-memory
systems. It defines the possible values that a read operation is allowed to
return for any given set of write operations performed by a concurrent program,
thereby defining the basic semantics of shared variables. It is therefore
impossible to meaningfully reason about a program or any part of the
programming language implementation without an unambiguous memory model.
This note provides a brief introduction to the topic of memory models,
explaining why the memory model is essential for concurrent programs and covering well-known
memory models from sequential consistency to those of the x86 and ARM/POWER
CPUs. Section 4 is fully dedicated to the C++11 memory model, explaining how it
can be used to write concurrent code that is not only correct and portable, but
also efficient, by utilizing the relaxed memory models of modern architectures.
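As a small, standard illustration of the kind of reasoning Section 4 enables, the classic release/acquire message-passing idiom under the C++11 memory model is sketched below; it is a textbook example, not taken from the note.

    #include <atomic>
    #include <cassert>
    #include <thread>

    int payload = 0;                    // ordinary, non-atomic data
    std::atomic<bool> ready{false};

    // The release store to `ready` synchronizes-with the acquire load that reads
    // true, so the write to `payload` happens-before the read in the consumer.
    // With memory_order_relaxed instead, the assertion could fail on weakly
    // ordered CPUs such as ARM or POWER.
    void producer() {
        payload = 42;                                   // plain write
        ready.store(true, std::memory_order_release);   // publish
    }

    void consumer() {
        while (!ready.load(std::memory_order_acquire))  // wait for publication
            ;
        assert(payload == 42);                          // guaranteed to hold
    }

    int main() {
        std::thread t1(producer), t2(consumer);
        t1.join();
        t2.join();
    }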