An Experimental Study of Parallel Biconnected Components Algorithms on Symmetric Multiprocessors (SMPs)
We present an experimental study of parallel biconnected components algorithms
employing several fundamental parallel primitives, e.g., prefix sum, list ranking, sorting, connectivity, spanning tree, and tree computations. Previous experimental studies
of these primitives demonstrate reasonable parallel speedups. However, when these
algorithms are used as subroutines to solve higher-level problems, there are two factors that hinder fast parallel implementations. One is parallel overhead, i.e., the large
constant factors hidden in the asymptotic bounds; the other is the discrepancy among
the data structures used in the primitives, which incurs a non-negligible conversion cost.
We present various optimization techniques and a new parallel algorithm that significantly improve the performance of finding biconnected components of a graph
on symmetric multiprocessors (SMPs). Finding biconnected components has applications in fault-tolerant network design and is also used in graph planarity testing.
Our parallel implementation achieves speedups up to 4 using 12 processors on a Sun
E4500 for large, sparse graphs, and the source code is freely available at our web site
http://www.ece.unm.edu/~dbader.
This work was supported in part by NSF Grants CAREER ACI-00-93039, ITR ACI-00-81404, DEB-99-10123, ITR EIA-01-21377, Biocomplexity DEB-01-20709, DBI-0420513, and ITR EF/BIO 03-31654; and DARPA Contract NBCH30390004.
An Empirical Analysis of Parallel Random Permutation Algorithms on SMPs
We compare parallel algorithms for random permutation generation on symmetric multiprocessors
(SMPs). Algorithms considered are the sorting-based algorithm, Anderson's shuffling
algorithm, the dart-throwing algorithm, and Sanders' algorithm. We investigate the impact
of synchronization method, memory access pattern, cost of generating random numbers
and other parameters on the performance of the algorithms. Within the range of inputs used and
processors employed, Anderson's algorithm is preferable due to its simplicity when random
number generation is relatively costly, while Sanders' algorithm has superior performance due
to good cache performance when a fast random number generator is available. There is no definite
winner across all settings. In fact, we predict our new dart-throwing algorithm performs
best when synchronization among processors becomes costly and memory access is relatively
fast.
We also compare the performance of our parallel implementations with the sequential implementation.
It is unclear without extensive experimental studies whether fast parallel algorithms
beat efficient sequential algorithms due to mismatch between model and architecture.
Our implementations achieve speedups up to 6 with 12 processors on the Sun E4500.
This work was supported in part by NSF Grants CAREER ACI-00-93039, NSF DBI-0420513, ITR ACI-00-81404, DEB-99-10123, ITR EIA-01-21377, Biocomplexity DEB-01-20709, and ITR EF/BIO 03-31654; and DARPA Contract NBCH30390004.
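For orientation, two of the approaches named in the abstract above can be sketched sequentially in Python. This is only an illustrative sketch, not the authors' parallel implementations: the function names, the `slack` parameter, and the use of Python's `random.Random` are assumptions made here.

```python
import random


def sorting_based_permutation(n, rng):
    """Attach a random key to each element, then sort by key."""
    keyed = [(rng.random(), i) for i in range(n)]
    keyed.sort()
    return [i for _, i in keyed]


def dart_throwing_permutation(n, rng, slack=2):
    """Throw each element at a random slot of an oversized board.

    A throw that hits an occupied slot is retried. Reading the
    occupied slots in board order yields the permutation.
    `slack` (hypothetical parameter) sets the board size to slack * n.
    """
    board = [None] * (slack * n)
    for item in range(n):
        while True:
            slot = rng.randrange(len(board))
            if board[slot] is None:
                board[slot] = item
                break
    return [x for x in board if x is not None]


if __name__ == "__main__":
    rng = random.Random(2024)
    print(sorting_based_permutation(8, rng))
    print(dart_throwing_permutation(8, rng))
```

In a parallel setting the dart throws would need an atomic claim on each slot, which is presumably where the synchronization cost discussed in the abstract enters; the sketch above sidesteps that by running on one thread.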
A parallel edge orientation algorithm for quadrilateral meshes
One approach to achieving correct finite element assembly is to ensure that
the local orientation of facets relative to each cell in the mesh is consistent
with the global orientation of that facet. Rognes et al. have shown how to
achieve this for any mesh composed of simplex elements, and deal.II contains a
serial algorithm to construct a consistent orientation of any quadrilateral
mesh of an orientable manifold.
The core contribution of this paper is the extension of this algorithm for
distributed memory parallel computers, which facilitates its seamless
application as part of a parallel simulation system.
Furthermore, our analysis establishes a link between the well-known
Union-Find algorithm and the construction of a consistent orientation of a
quadrilateral mesh. As a result, existing work on the parallelisation of the
Union-Find algorithm can be easily adapted to construct further parallel
algorithms for mesh orientations.
Comment: Second revision: minor change
Theoretically Efficient Parallel Graph Algorithms Can Be Fast and Scalable
There has been significant recent interest in parallel graph processing due
to the need to quickly analyze the large graphs available today. Many graph
codes have been designed for distributed memory or external memory. However,
today even the largest publicly-available real-world graph (the Hyperlink Web
graph with over 3.5 billion vertices and 128 billion edges) can fit in the
memory of a single commodity multicore server. Nevertheless, most experimental
work in the literature reports results on much smaller graphs, and the ones for
the Hyperlink graph use distributed or external memory. Therefore, it is
natural to ask whether we can efficiently solve a broad class of graph problems
on this graph in memory.
This paper shows that theoretically-efficient parallel graph algorithms can
scale to the largest publicly-available graphs using a single machine with a
terabyte of RAM, processing them in minutes. We give implementations of
theoretically-efficient parallel algorithms for 20 important graph problems. We
also present the optimizations and techniques that we used in our
implementations, which were crucial in enabling us to process these large
graphs quickly. We show that the running times of our implementations
outperform existing state-of-the-art implementations on the largest real-world
graphs. For many of the problems that we consider, this is the first time they
have been solved on graphs at this scale. We have made the implementations
developed in this work publicly-available as the Graph-Based Benchmark Suite
(GBBS).
Comment: This is the full version of the paper appearing in the ACM Symposium
on Parallelism in Algorithms and Architectures (SPAA), 201
A scalable parallel union-find algorithm for distributed memory computers
The Union-Find algorithm is used for maintaining a number of nonoverlapping sets from a finite universe of elements. The algorithm has applications in a number of areas, including the computation of spanning trees and image processing. Although the algorithm is inherently sequential, there have been some previous efforts at constructing parallel implementations. These have mainly focused on shared memory computers. In this paper we present the first scalable parallel implementation of the Union-Find algorithm suitable for distributed memory computers. Our new parallel algorithm is based on an observation of how the Find part of the sequential algorithm can be executed more efficiently. We show the efficiency of our implementation through a series of tests to compute spanning forests of very large graphs.
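As background for the abstract above, the classic sequential Union-Find structure and its use for computing a spanning forest can be sketched as follows. This is a textbook-style sketch with path halving and union by rank, not the distributed-memory algorithm the paper proposes; the class and function names are assumptions chosen for illustration.

```python
class UnionFind:
    """Disjoint-set forest with path halving and union by rank."""

    def __init__(self, n):
        self.parent = list(range(n))  # each element starts as its own root
        self.rank = [0] * n

    def find(self, x):
        # Path halving: point every visited node at its grandparent.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        """Merge the sets of a and b; return False if already merged."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra  # attach the shallower tree under the deeper
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
        return True


def spanning_forest(n, edges):
    """Keep exactly the edges that join two previously separate sets."""
    uf = UnionFind(n)
    return [(u, v) for u, v in edges if uf.union(u, v)]


if __name__ == "__main__":
    print(spanning_forest(5, [(0, 1), (1, 2), (0, 2), (3, 4)]))
```

An edge is kept only when its endpoints lie in different sets, so the kept edges never form a cycle and together span every connected component.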
Efficient techniques to provide scalability for token-based cache coherence protocols
Cache coherence protocols based on tokens can provide low latency without relying on non-scalable interconnects thanks to the use of efficient requests that are unordered. However, when these unordered requests contend for the same memory block, they may cause protocol races. To resolve the races and ensure
the completion of all the cache misses, token protocols use a starvation prevention mechanism that is inefficient and non-scalable in terms of required storage structures and generated traffic. Besides, token protocols use
non-silent invalidations which increase the latency of write misses proportionally to the system size. All these problems make token protocols non-scalable.
To overcome the main problems of token protocols and increase their scalability, we propose a new starvation prevention mechanism named Priority Requests. This mechanism resolves contention by an efficient, elegant, and flexible method based on ordered requests. Furthermore, thanks to Priority Requests, efficient
techniques can be applied to limit the storage requirements of the starvation prevention mechanism, to reduce the total traffic generated for managing protocol races, and to reduce the latency of write misses. Thus, the main problems of token protocols can be solved, which, in turn, contributes to improving their efficiency and scalability.
Cuesta Sáez, BA. (2009). Efficient techniques to provide scalability for token-based cache coherence protocols [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/6024