Using PRAM algorithms on a uniform-memory-access shared-memory architecture by Bader, D. A. et al.
Using PRAM Algorithms on a
Uniform-Memory-Access Shared-Memory
Architecture
David A. Bader1, Ajith K. Illendula2, Bernard M.E. Moret3, and
Nina R. Weisse-Bernstein4
1 Department of Electrical and Computer Engineering, University of New Mexico
Albuquerque, NM 87131 USA
dbader@eece.unm.edu‡
2 Intel Corporation, Rio Rancho, NM 87124 USA
ajith.k.illendula@intel.com
3 Department of Computer Science, University of New Mexico
Albuquerque, NM 87131 USA
moret@cs.unm.edu§
4 University of New Mexico, Albuquerque, NM 87131 USA
nina@unm.edu¶
Abstract. The ability to provide uniform shared-memory access to a
signiﬁcant number of processors in a single SMP node brings us much
closer to the ideal PRAM parallel computer. In this paper, we develop
new techniques for designing a uniform shared-memory algorithm from
a PRAM algorithm and present the results of an extensive experimental
study demonstrating that the resulting programs scale nearly linearly
across a signiﬁcant range of processors (from 1 to 64) and across the
entire range of instance sizes tested. This linear speedup with the number
of processors is, to our knowledge, the ﬁrst ever attained in practice for
intricate combinatorial problems. The example we present in detail here
is a graph decomposition algorithm that also requires the computation of
a spanning tree; this problem is not only of interest in its own right, but
is representative of a large class of irregular combinatorial problems that
have simple and eﬃcient sequential implementations and fast PRAM
algorithms, but have no known eﬃcient parallel implementations. Our
results thus oﬀer promise for bridging the gap between the theory and
practice of shared-memory parallel algorithms.
1 Introduction
Symmetric multiprocessor (SMP) architectures, in which several processors op-
erate in a true, hardware-based, shared-memory environment and are packaged
‡ Supported in part by NSF CAREER 00-93039, NSF ITR 00-81404, NSF DEB 99-
10123, and DOE CSRI-14968
§ Supported in part by NSF ITR 00-81404
¶ Supported by an NSF Research Experience for Undergraduates (REU)
G. Brodal et al. (Eds.): WAE 2001, LNCS 2141, pp. 129–144, 2001.
c© Springer-Verlag Berlin Heidelberg 2001
130 David A. Bader et al.
as a single machine, are becoming commonplace. Indeed, most of the new high-
performance computers are clusters of SMPs, with from 2 to 64 processors per
node. The ability to provide uniform-memory-access (UMA) shared-memory for
a signiﬁcant number of processors brings us much closer to the ideal parallel
computer envisioned over 20 years ago by theoreticians, the Parallel Random
Access Machine (PRAM) (see [22,41]) and thus may enable us at long last to
take advantage of 20 years of research in PRAM algorithms for various irregular
computations. Moreover, as supercomputers increasingly use SMP clusters, SMP
computations will play a signiﬁcant role in supercomputing. For instance, much
attention has been devoted lately to OpenMP [35], which provides compiler di-
rectives and runtime support to reveal algorithmic concurrency and thus take
advantage of the SMP architecture; and to mixed-mode programming, which
combines message-passing style between cluster nodes (using MPI [31]) and
shared-memory style within each SMP (using OpenMP or POSIX threads [36]).
While an SMP is a shared-memory architecture, it is by no means the PRAM
used in theoretical work—synchronization cannot be taken for granted and the
number of processors is far smaller than that assumed in PRAM algorithms.
The signiﬁcant feature of SMPs is that they provide much faster access to their
shared-memory than an equivalent message-based architecture. Even the largest
SMP to date, the 64-processor Sun Enterprise 10000 (E10K), has a worst-case
memory access time of 600ns (from any processor to any location within its 64GB
memory); in contrast, the latency for access to the memory of another processor
in a distributed-memory architecture is measured in tens of µs. In other words,
message-based architectures are two orders of magnitude slower than the largest
SMPs in terms of their worst-case memory access times.
The largest SMP architecture to date, the Sun E10K [8], uses a combination
of data crossbar switches, multiple snooping buses, and sophisticated cache han-
dling to achieve UMA across the entire memory. Of course, there remains a large
diﬀerence between the access time for an element in the local processor cache
(around 15ns) and that for an element that must be obtained from memory (at
most 600ns)—and that diﬀerence increases as the number of processors increases,
so that cache-aware implementations are, if anything, even more important on
large SMPs than on single workstations. Figure 1 illustrates the memory access
behavior of the Sun E10K (right) and its smaller sibling, the E4500 (left), using
a single processor to visit each node in a circular array. We chose patterns of ac-
cess with a ﬁxed stride, in powers of 2 (labelled C, stride), as well as a random
access pattern (labelled R). The data clearly show the eﬀect of addressing outside
the on-chip cache (the ﬁrst break, at a problem size of 213 words, or 32KB) and
then outside the L2 cache (the second break, at a problem size of 221 words,
or 8MB). The uniformity of access times was impressive—standard deviations
around our reported means are well below 10 percent. Such architectures make
it possible to design algorithms targeted speciﬁcally at SMPs.
In this paper, we present promising results for writing eﬃcient implemen-
tations of PRAM-based parallel algorithms for UMA shared-memory machines.
As an example of our methodology, we look at ear decomposition in a graph
PRAM Algorithms on a Shared-Memory Architecture 131
log2 (Problem Size)
5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
Ti
m
e
pe
r
m
e
m
o
ry
re
a
d
(ns
)
10 ns
100 ns
1000 ns
C, 1
C, 2
C, 4
C, 8
C, 16
C, 32
C, 64
R
Sun Enterprise 4500
log2 (Problem Size)
5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
Ti
m
e
pe
r
m
e
m
o
ry
re
a
d
(ns
)
10 ns
100 ns
1000 ns
C, 1
C, 2
C, 4
C, 8
C, 16
C, 32
C, 64
R
Sun Enterprise 10000
Fig. 1. Memory access (read) time using one MHz processor of a Sun E4500
(left) and an E10K (right) as a function of array size for various strides
and show that our implementation achieves near linear speedups, as illustrated
in Figure 2. Our main contributions are
1. A new methodology for designing practical shared-memory algorithms on
UMA shared-memory machines,
2. A fast and scalable shared-memory implementation of ear decomposition (in-
cluding spanning tree construction) demonstrating the ﬁrst ever signiﬁcant—
in our case, nearly optimal—parallel speedup for this class of problems.
3. An example of experimental performance analysis for a nontrivial parallel
implementation.
2 Related Work
Several groups have conducted experimental studies of graph algorithms on par-
allel architectures [19,20,26,39,40,44]. Their approach to producing a parallel
program is similar to ours (especially that of Ramachandran et al. [17]), but
their test platforms have not provided them with a true, scalable, UMA shared-
memory environment or have relied on ad hoc hardware [26]. Thus ours is the
ﬁrst study of speedup over a signiﬁcant range of processors on a commercially
available platform.
3 Methodology
3.1 Approach
Our methodology has two principal components: an approach to the conversion
of PRAM algorithms into parallel programs for shared-memory machines and a
matching complexity model for the prediction of performance. In addition, we
132 David A. Bader et al.
Problem Size (log
2
n)
8 9 10 11 12 13 14 15 16 17
T
im
e
 (
m
s
)
0.01
0.1
1
10
100
 
a) varying graph and machine size with ﬁxed random sparse graph.
Number of Processors (p)
1 2 4 8 16 32
T
im
e
 (
m
s
)
0.1
1
10
 
Random Graph ( Count = n )
Random Graph ( Count = 2*n )
Random Graph ( Count = n
2
/2 )
Random Graph ( Count = n
2
 )
Regular Triangulation 
Regular Mesh
Constrained Delaunay Triangulation
b) varying number of processors and sparse graphs with ﬁxed size n = 8192.
Fig. 2. Running times of ear decomposition on the NPACI Sun E10K with 1 to
32 processors a) on varying problem sizes (top) and b) diﬀerent sparse graph
models with n = 8192 (bottom)
make use of the best precepts of algorithm engineering [34] to ensure that our
implementations are as eﬃcient as possible.
Converting a PRAM algorithm to a parallel program requires us to address
three problems: (i) how to partition the tasks (and data) among the very limited
number of processors available; (ii) how to optimize locally as well as globally
the use of caches; and (iii) how to minimize the work spent in synchronization
(barrier calls). The ﬁrst two are closely related: good data and task partitioning
will ensure good locality; coupling such partitioning with cache-sensitive coding
(see [27,28,29,34] for discussions) provides programs that take best advantage of
PRAM Algorithms on a Shared-Memory Architecture 133
the architecture. Minimizing the work done in synchronization barriers is a fairly
simple exercise in program analysis, but turns out to be far more diﬃcult in prac-
tice: a tree-based gather-and-broadcast barrier, for instance, will guarantee syn-
chronization of processes at fairly minimal cost (and can often be split when only
one processor should remain active), but may not properly synchronize the caches
of the various processors on all architectures, while a more onerous barrier that
forces the processors’ caches to be ﬂushed and resynchronized will be portable
across all architectures, but unnecessarily expensive on most. We solve the prob-
lem by providing both a simple tree-based barrier and a heavy-weight barrier
and placing in our libraries architecture-speciﬁc information that can replace the
heavy barrier with the light-weight one whenever the architecture permits it.
3.2 Complexity Model for Shared-Memory
Various cost models have been proposed for SMPs [1,2,3,4,6,16,18,38,45]; we
chose the Helman and Ja´Ja´ model [18] because it gave us the best match between
our analyses and our experimental results. Since the number of processors used in
our experiments is relatively small (not exceeding 64), contention at the memory
location is negligible compared to the contention at the processors. Other models
that take into account only the number of reads/writes by a processor or the con-
tention at the memory are thus unsuitable here. Performance on most computer
systems dictates the reliance on several levels of memory caches, and thus, cost
beneﬁt should be given to an algorithm that optimizes for contiguous memory
accesses over non-contiguous access patterns. Distinguishing between contiguous
and non-contiguous memory accesses is the ﬁrst step towards capturing the ef-
fects of the memory hierarchy, since contiguous memory accesses are much more
likely to be cache-friendly. The Queuing Shared Memory [15,38] model takes into
account both the number of memory accesses and contention at the memory, but
does not distinguish between between contiguous versus non-contiguous accesses.
In contrast, the complexity model of Helman and Ja´Ja´ [18] takes into account
contention at both the processors and the memory. In the Helman-Ja´Ja´ model,
we measure the overall complexity of an algorithm by the triplet (MA, ME , TC),
where MA is the maximum number of (noncontiguous) accesses made by any
processor to main memory, ME is the maximum amount of data exchanged by
any processor with main memory, and TC is an upper bound on the local com-
putational work of any of the processors. TC is measured in standard asymptotic
terms, while MA and ME are represented as approximations of the actual values.
In practice, it is often possible to focus on either MA or ME when examining the
cost of algorithms. Because the number of required barrier synchronizations is
less than the local computational work on each processor, the cost of synchroniza-
tion is dominated by the term TC and thus is not explicitly included in the model.
134 David A. Bader et al.
4 Example: Ear Decomposition
4.1 The Problem
Of the many basic PRAM algorithms we have implemented and tested, we chose
ear decomposition to present in this study, for three reasons. First, although the
speedup observed with our implementation of ear decomposition is no better
than what we observed with our other implementations of basic PRAM algo-
rithms, ear decomposition is more complex than such problems as preﬁx sum,
pointer doubling, symmetry breaking, etc. Secondly, ear decomposition is typi-
cal of problems that have simple and eﬃcient sequential solutions, have known
fast or optimal PRAM algorithms, but yet have no practical parallel implemen-
tation. One of its component tasks is ﬁnding a spanning tree, a task that was
also part of the original DIMACS challenge on parallel implementation many
years ago, in which sequential implementations had proved signiﬁcantly faster
than even massively parallel implementations (using a 65,536-processor CM2).
Finally, ear decomposition is interesting in its own right, as it is used in a variety
of applications from computational geometry [7,10,11,24,25], structural engineer-
ing [12,13], to material physics and molecular biology [14].
The eﬃcient parallel solution of many computational problems often requires
approaches that depart completely from those used for sequential solutions. In
the area of graph algorithms, for instance, depth-ﬁrst search is the basis of many
eﬃcient algorithms, but no eﬃcient PRAM algorithm is known for depth-ﬁrst
search (which is P-complete). To compensate for the lack of such methods, the
technique of ear decomposition, which does have an eﬃcient PRAM algorithm,
is often used [25].
Let G = (V,E) be a connected, undirected graph; set n = |V | and m = |E|.
An ear decomposition of G is a partition of E into an ordered collection of simple
paths (called ears), Q0, Q1, . . . , Qr, obeying the following properties:
– Each endpoint of Qi, i > 0, is contained in some Qj, j < i.
– No internal vertex of Qi, (i > 0), is contained in any Qj, for j < i.
Thus a vertex may belong to more than on ear, but an edge is contained in ex-
actly one ear [21]. If the endpoints of the ear do not coincide, then the ear is open;
otherwise, the ear is closed. An open ear decomposition is an ear decomposition
in which every ear is open. Figure 5 in the appendix illustrates these concepts.
Whitney ﬁrst studied open ear decomposition and showed that a graph has an
open ear decomposition if and only if it is biconnected [46]. Lova´sz showed that
the problem of computing an open ear decomposition in parallel is in NC [30]. Ear
decomposition has also been used in designing eﬃcient sequential and parallel
algorithms for triconnectivity [37] and 4-connectivity [23]. In addition to graph
connectivity, ear decomposition has been used in graph embeddings (see [9]).
The sequential algorithm: Ramachandran [37] gave a linear-time algorithm
for ear decomposition based on depth-ﬁrst search. Another sequential algorithm
that lends itself to parallelization (see [22,33,37,42]) ﬁnds the labels for each edge
PRAM Algorithms on a Shared-Memory Architecture 135
as follows. First, a spanning tree is found for the graph; the tree is then arbi-
trarily rooted and each vertex is assigned a level and parent. Each non-tree edge
corresponds to a distinct ear, since an arbitrary spanning tree and the non-tree
edges form a cycle basis for the input graph. Each non-tree edge is then exam-
ined and uniquely labeled using the level of the least common ancestor (LCA)
of the edge’s endpoints and a unique serial number for that edge. Finally, the
tree edges are assigned ear labels by choosing the smallest label of any non-tree
edge whose cycle contains it. This algorithm runs in O((m+ n) logn) time.
4.2 The PRAM Algorithm
The PRAM algorithm for ear decomposition [32,33] is based on the second se-
quential algorithm. The ﬁrst step computes a spanning tree in O(logn) time,
using O(n+m) processors. The tree can then be rooted and levels and parents
assigned to nodes by using the Euler-tour technique. Labelling the nontree edges
uses an LCA algorithm, which runs within the same asymptotic bounds as the
ﬁrst step. Next, the labels of the tree edges can be found as follows. Denote the
graph by G, the spanning tree by T , the parent of a vertex v by p(v), and the la-
bel of an edge (u, v) by label(u, v). For each vertex v, set f(v) = min{label(v, u) |
(v, u) ∈ G−T } and g = (x, y) ∈ T , where we have y = p(x). label(g) is the min-
imum value in the subtree rooted at x. These two substeps can be executed in
O(logn) time. Finally, labelling the ears involves sorting the edges by their labels,
which can be done in O(logn) time using O(n+m) processors. Therefore the
total running time of this CREW PRAM algorithm is O(logn) using O(n+m)
processors. This is not an optimal PRAM algorithm (in the sense of [22]), because
the work (processor-time product) is asymptotically greater than the sequential
complexity; however, no known optimal parallel approaches are known.
In the spanning tree part of the algorithm, the vertices of the input graph G
(held in shared memory) are initially assigned evenly among the processors—
although the entire input graph G is of course available to every processor
through the shared memory. Let D be the function on the vertex set V deﬁning
a pseudoforest (a collection of trees). Initially each vertex is in its own tree, so
we set D(v) = v for each v ∈ V . The algorithm manipulates the pseudoforest
through two operations.
– Grafting: Let Ti and Tj be two distinct trees in the pseudoforest deﬁned
by D. Given the root ri of Ti and a vertex v of Tj, the operation D(ri)← v
is called grafting Ti onto Tj .
– Pointer jumping: Given a vertex v in a tree T , the pointer-jumping op-
eration applied to v sets D(v)← D(D(v)).
Initially each vertex is in a rooted star. The ﬁrst step will be several grafting
operations of the same tree. The next step attempts to graft the rooted stars
onto other trees if possible. If all vertices then reside in rooted stars, the algo-
rithm stops. Otherwise pointer jumping is applied at every vertex, reducing the
diameter of each tree, and the algorithm loops back to the ﬁrst step. Figure 6
136 David A. Bader et al.
in the appendix shows the grafting and pointer-jumping operations performed
on an example input graph. The ear decomposition algorithm ﬁrst computes the
spanning tree, then labels non-tree edges using independent LCA searches, and
ﬁnally labels tree edges in parallel.
4.3 SMP Implementation and Analysis
Spanning Tree: Grafting subtrees can be carried out in O(m/p) time, with two
noncontiguous memory accesses to exchange approximately np elements; grafting
rooted stars onto other trees takes O(m/p) time; pointer-jumping on all vertices
takes O(n/p) time; and the three steps are repeated O(logn) times. Note that
the memory is accessed only once even though there are up to logn iterations.
Hence the total running time of the algorithm is
T (n, p) = O
(
1, np , ((m+ n)/p) logn
)
. (1)
Ear Decomposition: Equation (1) gives us the running time for spanning
tree formation. Computing the Euler tour takes time linear in the number of
vertices per processor or O(n/p), with np noncontiguous memory accesses to ex-
change np elements. The level and parent of the local vertices can be found in
O(n/p) time with np noncontiguous memory accesses. Edge labels (for tree and
nontree edges) can be found in O(n/p) time. The two labeling steps need np non-
contiguous memory accesses to exchange approximately np elements. The total
running time of the algorithm is O
(
n
p ,
n
p ,
m+n
p logn
)
. Since MA and ME are of
the same order and since noncontiguous memory accesses are more expensive
(due to cache faults), we can rewrite our running time as
T (n, p) = O
(
n
p ,
m+n
p log n
)
. (2)
This expression decreases with increasing p; moreover, the algorithm scales down
eﬃciently as well, since we have T (n, p) = T
∗(n)
p/ logn , where T
∗(n) is the sequential
complexity of ear decomposition.
5 Experimental Results
We study the performance of our implementations using a variety of input graphs
that represent classes typically seen in high-performance computing applications.
Our goals are to conﬁrm that the empirical performance of our algorithms is
realistically modeled and to learn what makes a parallel shared-memory imple-
mentation eﬃcient.
PRAM Algorithms on a Shared-Memory Architecture 137
5.1 Test Data
We generate graphs from seven input classes of planar graphs (2 regular and
5 irregular) that represent a diverse collection of realistic inputs. The ﬁrst two
classes are regular meshes (lattice RL and triangular RT ); the next four classes
are planar graphs generated through a simple random process, two very sparse
(GA and GB) and two rather more dense (GC and GD)—since the graphs are
planar, they cannot be dense in the usual sense of the word, but GD graphs
are generally fully triangulated. The last graph class generates the constrained
Delaunay triangulation (CD) on a set of random points [43]. For the random
graphs GA, GB, GC, and GD, the input graph on n = |V | vertices is generated
as follows. Random coordinates are picked in the unit square according to a uni-
form distribution; a Euclidean minimum spanning tree (MST) on the n vertices
is formed to ensure that the graph is connected and serves as the initial edge set
of the graph. Then for count times, two vertices are selected at random and a
straight edge is added between them if the new edge does not intersect with any
existing edge. Note that a count of zero produces a tree, but that the expected
number of edges is generally much less than the count used in the construction,
since any crossing edges will be discarded. Table 1 summarizes the seven graph
families. Figure 7 in the appendix shows some example graphs with various val-
Table 1. Diﬀerent classes of input graphs
Key Name Description
RL Regular Lattice Regular 4-connected mesh of √n× √n ver-
tices
RT Regular Triangulation RL graph with an additional edge connecting a
vertex to its down-and-right neighbor, if any
GA Random Graph A Planar, very sparse graph with count = n
GB Random Graph B Planar, very sparse graph with count = 2n
GC Random Graph C Planar graph with count = n2/2
GD Random Graph D Planar graph with count = n2
CD Constrained Delaunay
Triangulation
Constrained Delaunay triangulation of n ran-
dom points in the unit square
ues of count. Note that, while we generate the test input graphs geometrically,
only the vertex list and edge set of each graph are used in our experiments.
5.2 Test Platforms
We tested our shared-memory implementation on the NPACI Sun E10K at the
San Diego Supercomputing Center. This machine has 64 processors and 64GB
of shared memory, with 16KB of on-chip direct-mapped data cache and 8MB of
external cache for each processor [8].
138 David A. Bader et al.
Our practical programming environment for SMPs is based upon the SMP
Node Library component of SIMPLE [5], which provides a portable framework
for describing SMP algorithms using the single-program multiple-data (SPMD)
program style. This framework is a software layer built from POSIX threads
that allows the user to use either already developed SMP primitives or direct
thread primitives. We have been continually developing and improving this li-
brary over the past several years and have found it to be portable and eﬃcient
on a variety of operating systems (e.g., Sun Solaris, Compaq/Digital UNIX, IBM
AIX, SGI IRIX, HP-UX, Linux). The SMP Node Library contains a number of
SMP node algorithms for barrier synchronization, broadcasting the location of
a shared buﬀer, replication of a data buﬀer, reduction, and memory manage-
ment for shared-buﬀer allocation and release. In addition to these functions, we
have control mechanisms for contextualization (executing a statement on only
a subset of processors), and a parallel do that schedules n independent work
statements implicitly to p processors as evenly as possible.
5.3 Experimental Results
Due to space limitations, we present only a few graphs illustrating our results
and omit a discussion of our timing methodology. (We used elapsed times on
a dedicated machine; we cross-checked our timings and ran suﬃcient tests to
verify that our measurements did not suﬀer from any signiﬁcant variance.) Fig-
ure 2 shows the execution time of the SMP algorithms and demonstrates that
the practical performance of the SMP approach is nearly invariant with respect
to the input graph class.
Number of Processors (p)
2 4 8 16 32
E
ff
ic
ie
n
c
y
0.0
0.2
0.4
0.6
0.8
1.0
 
log2n = 8 
log2n = 9 
log2n = 10 
log2n = 11 
log2n = 12 
log2n = 13 
log2n = 14 
log2n = 15 
log2n = 16 
log2n = 17 
 
Fig. 3. Eﬃciency of ear decomposition for ﬁxed inputs, each a sparse random
graph with from 256 to 128K vertices, on the NPACI Sun E10K with 2 to 32
processors
PRAM Algorithms on a Shared-Memory Architecture 139
The analysis for the shared-memory algorithm given in Equation (2) shows
that a practical parallel algorithm is possible. We experimented with the SMP
implementation on problems ranging in size from 256 to 128K vertices on the
Sun E10K using p = 1 to 32 processors. Clearly, the nearly linear speedup with
the number of processors predicted by Equation (2) may not be achievable due
to synchronization overhead, serial work, or contention for shared resources. In
fact, our experimental results, plotted in Figures 2, conﬁrm the cost analysis and
provide strong evidence that our shared-memory algorithm achieves nearly lin-
ear speedup with the number of processors for each ﬁxed problem size. Figures 3
and 4 present the eﬃciency of our implementation when compared with the se-
quential algorithm for ear decomposition. In Figure 3 for a ﬁxed problem size, as
Problem Size (log2n)
8 9 10 11 12 13 14 15 16 17
E
ff
ic
ie
n
c
y
0.0
0.2
0.4
0.6
0.8
1.0
 
P = 2
P = 4
P = 8
P = 16
P = 32
 
Fig. 4. Eﬃciency of ear decomposition for ﬁxed machine sizes, from 2 to 32
processors, on the NPACI Sun E10K for a sparse random graph with 256 to
128K vertices
expected, the eﬃciency decreases as we add more processors—caused by increas-
ing parallel overheads. Another viewpoint is that in Figure 4 for a ﬁxed number
of processors, eﬃciency increases with the problem size. Note that the eﬃciency
results conﬁrm that our implementation scales nearly linearly with the number
of processors and that, as expected, larger problem sizes show better scalability.
6 Conclusions
We have implemented and analyzed a PRAM algorithm for the ear decomposi-
tion problem. We have shown both theoretically and practically that our shared-
memory approach to parallel computation is eﬃcient and scalable on a variety
of input classes and problem sizes. In particular, we have demonstrated the
ﬁrst ever near-linear speedup for a nontrivial graph problem using a large range
of processors on a shared-memory architecture. As our example shows, PRAM
140 David A. Bader et al.
algorithms that once were of mostly theoretical interest, now provide plausible
approaches for real implementations on UMA shared-memory architectures such
as the Sun E10K. Our future work includes determining what algorithms can
be eﬃciently parallelized in this manner on these architectures, building a basic
library of eﬃcient implementations, and using them to tackle diﬃcult optimiza-
tion problems in computational biology.
References
1. A. Aggarwal, B. Alpern, A. Chandra, and M. Snir. A Model for Hierarchical Mem-
ory. In Proceedings of the 19th Annual ACM Symposium of Theory of Computing
(STOC), pages 305–314, New York City, May 1987. 133
2. A. Aggarwal and J. Vitter. The Input/Output Complexity of Sorting and Related
Problems. Communications of the ACM, 31:1116–1127, 1988. 133
3. B. Alpern, L. Carter, E. Feig, and T. Selker. The Uniform Memory Hierarchy
Model of Computation. Algorithmica, 12:72–109, 1994. 133
4. N. M. Amato, J. Perdue, A. Pietracaprina, G. Pucci, and M. Mathis. Predicting
performance on SMPs. a case study: The SGI Power Challenge. In Proceedings of
the International Parallel and Distributed Processing Symposium (IPDPS 2000),
pages 729–737, Cancun, Mexico, May 2000. 133
5. D. A. Bader and J. Ja´Ja´. SIMPLE: A Methodology for Programming High Per-
formance Algorithms on Clusters of Symmetric Multiprocessors (SMPs). Journal
of Parallel and Distributed Computing, 58(1):92–108, 1999. 138
6. G. E. Blelloch, P. B. Gibbons, Y. Matias, and M. Zagha. Accounting for Memory
Bank Contention and Delay in High-Bandwidth Multiprocessors. IEEE Transac-
tions on Parallel and Distributed Systems, 8(9):943–958, 1997. 133
7. M. H. Carvalho, C. L. Lucchesi, and U. S. R. Murty. Ear Decompositions of
Matching Covered Graphs. Combinatorica, 19(2):151–174, 1999. 134
8. A. Charlesworth. Starﬁre: extending the SMP envelope. IEEE Micro, 18(1):39–49,
1998. 130, 137
9. J. Chen and S. P. Kanchi. Graph Ear Decompositions and Graph Embeddings.
SIAM Journal on Discrete Mathematics, 12(2):229–242, 1999. 134
10. P. Crescenzi, C. Demetrescu, I. Finocchi, and R. Petreschi. LEONARDO: A Soft-
ware Visualization System. In Proceedings of the First Workshop on Algorithm
Engineering (WAE’97), pages 146–155, Venice, Italy, sep 1997. 134
11. D. Eppstein. Parallel Recognition of Series Parallel Graphs. Information & Com-
putation, 98:41–55, 1992. 134
12. D. S. Franzblau. Combinatorial Algorithm for a Lower Bound on Frame Rigidity.
SIAM Journal on Discrete Mathematics, 8(3):388–400, 1995. 134
13. D. S. Franzblau. Ear Decomposition with Bounds on Ear Length. Information
Processing Letters, 70(5):245–249, 1999. 134
14. D. S. Franzblau. Generic Rigidity of Molecular Graphs Via Ear Decomposition.
Discrete Applied Mathematics, 101(1-3):131–155, 2000. 134
15. P. B. Gibbons, Y. Matias, and V. Ramachandran. Can shared-memory model serve
as a bridging model for parallel computation? In Proceedings of the 9th annual
ACM symposium on parallel algorithms and architectures, pages 72–83, Newport,
RI, June 1997. 133
16. P. B. Gibbons, Y. Matias, and V. Ramachandran. The Queue-Read Queue-Write
PRAM Model: Accounting for Contention in Parallel Algorithms. SIAM Journal
on Computing, 28(2):733–769, 1998. 133
PRAM Algorithms on a Shared-Memory Architecture 141
17. B. Grayson, M. Dahlin, and V. Ramachandran. Experimental evaluation of QSM,
a simple shared-memory model. In Proceedings of the 13th International Parallel
Processing Symposium and 10th Symposium on Parallel and Distributed Processing
(IPPS/SPDP), pages 1–7, San Juan, Puerto Rico, April 1999. 131
18. D. R. Helman and J. Ja´Ja´. Designing Practical Eﬃcient Algorithms for Symmetric
Multiprocessors. In Algorithm Engineering and Experimentation (ALENEX’99),
pages 37–56, Baltimore, MD, January 1999. 133
19. T.-S. Hsu and V. Ramachandran. Eﬃcient massively parallel implementation of
some combinatorial algorithms. Theoretical Computer Science, 162(2):297–322,
1996. 131
20. T.-S. Hsu, V. Ramachandran, and N. Dean. Implementation of parallel graph
algorithms on a massively parallel SIMD computer with virtual processing. In
Proceedings of the 9th International Parallel Processing Symposium, pages 106–
112, Santa Barbara, CA, April 1995. 131
21. L. Ibarra and D. Richards. Eﬃcient Parallel Graph Algorithms Based on Open
Ear Decomposition. Parallel Computing, 19(8):873–886, 1993. 134
22. J. Ja´Ja´. An Introduction to Parallel Algorithms. Addison-Wesley Publishing Com-
pany, New York, 1992. 130, 134, 135
23. A. Kanevsky and V. Ramachandran. Improved Algorithms for Graph Four-
Connectivity. Journal of Computer and System Sciences, 42(3):288–306, 1991.
134
24. D. J. Kavvadias, G. E. Pantziou, P. G. Spirakis, and C. D. Zaroliagis. Hammock-
On-Ears Decomposition: A Technique for the Eﬃcient Parallel Solution of Shortest
Paths and Other Problems. Theoretical Computer Science, 168(1):121–154, 1996.
134
25. A. Kazmierczak and S. Radhakrishnan. An Optimal Distributed Ear Decompo-
sition Algorithm with Applications to Biconnectivity and Outerplanarity Testing.
IEEE Transactions on Parallel and Distributed Systems, 11(1):110–118, 2000. 134
26. J. Keller, C. W. Keßler, and J. L. Tra¨ﬀ. Practical PRAM Programming. John
Wiley & Sons, 2001. 131
27. R. Ladner, J. D. Fix, and A. LaMarca. The cache performance of traversals and
random accesses. In Proc. 10th Ann. ACM/SIAM Symposium on Discrete Algo-
rithms (SODA-99), pages 613–622, Baltimore, MD, 1999. 132
28. A. LaMarca and R. E. Ladner. The Inﬂuence of Caches on the Perfor-
mance of Heaps. ACM Journal of Experimental Algorithmics, 1(4), 1996.
http://www.jea.acm.org/1996/LaMarcaInfluence/. 132
29. A. LaMarca and R. E. Ladner. The Inﬂuence of Caches on the Performance of
Heaps. In Proceedings of the Eighth ACM/SIAM Symposium on Discrete Algo-
rithms, pages 370–379, New Orleans, LA, 1997. 132
30. L. Lova´sz. Computing Ears and Branchings in Parallel. In Proceedings of the 26th
Annual IEEE Symposium on Foundations of Computer Science (FOCS 85), pages
464–467, Portland, Oregon, October 1985. 134
31. Message Passing Interface Forum. MPI: A Message-Passing Interface Standard.
Technical report, University of Tennessee, Knoxville, TN, June 1995. Version 1.1.
130
32. G. L. Miller and V. Ramachandran. Eﬃcient parallel ear decomposition with
applications. Manuscript, UC Berkeley, MSRI, January 1986. 135
33. Y. Moan, B. Schieber, and U. Vishkin. Parallel ear decomposition search (EDS)
and st-numbering in graphs. Theoretical Computer Science, 47(3):277–296, 1986.
134, 135
142 David A. Bader et al.
34. B. M. E. Moret. Towards a discipline of experimental algorithmics. In
DIMACS Monographs in Discrete Mathematics and Theoretical Computer
Science. American Mathematical Society, 2001. To appear. Available at
www.cs.unm.edu/∼moret/dimacs.ps. 132
35. OpenMP Architecture Review Board. OpenMP: A Proposed Industry Standard
API for Shared Memory Programming. http://www.openmp.org/, October 1997.
130
36. Portable Applications Standards Committee of the IEEE. Information technol-
ogy – Portable Operating System Interface (POSIX) – Part 1: System Application
Program Interface (API), 1996-07-12 edition, 1996. ISO/IEC 9945-1, ANSI/IEEE
Std. 1003.1. 130
37. V. Ramachandran. Parallel Open Ear Decomposition with Applications to Graph
Biconnectivity and Triconnectivity. In J. H. Reif, editor, Synthesis of Parallel
Algorithms, pages 275–340. Morgan Kaufman, San Mateo, CA, 1993. 134
38. V. Ramachandran. A General-Purpose Shared-Memory Model for Parallel Com-
putation. In M. T. Heath, A. Ranade, and R. S. Schreiber, editors, Algorithms
for Parallel Processing, volume 105, pages 1–18. Springer-Verlag, New York, 1999.
133
39. M. Reid-Miller. List ranking and list scan on the Cray C-90. In Proceedings
Symposium on Parallel Algorithms and Architectures, pages 104–113, Cape May,
NJ, June 1994. 131
40. M. Reid-Miller. List ranking and list scan on the Cray C-90. Journal of Computer
and System Sciences, 53(3):344–356, December 1996. 131
41. J. H. Reif, editor. Synthesis of Parallel Algorithms. Morgan Kaufmann Publishers,
1993. 130
42. C. Savage and J. Ja´Ja´. Fast, Eﬃcient Parallel Algorithms for Some Graph Prob-
lems. SIAM Journal on Computing, 10(4):682–691, 1981. 134
43. J. R. Shewchuk. Triangle: Engineering a 2D Quality Mesh Generator and Delau-
nay Triangulator. In M. C. Lin and D. Manocha, editors, Applied Computational
Geometry: Towards Geometric Engineering, volume 1148 of Lecture Notes in Com-
puter Science, pages 203–222. Springer-Verlag, May 1996. From the First ACM
Workshop on Applied Computational Geometry. 137
44. J. Sibeyn. Better trade-oﬀs for parallel list ranking. In Proceedings of the 9th
annual ACM symposium on parallel algorithms and architectures, pages 221–230,
Newport, RI, June 1997. 131
45. J. Vitter and E. Shriver. Algorithms for Parallel Memory I: Two-Level Memories.
Algorithmica, 12:110–147, 1994. 133
46. H. Whitney. Non-Separable and Planar Graphs. Transactions of the American
Mathematical Society, 34:339–362, 1932. 134
PRAM Algorithms on a Shared-Memory Architecture 143
Appendix: Illustrations
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
e0
e1
e2
e3
e4
e5
e6
e7
e8 e9
e10
e11
e12
e13
e14
e15
 1
 5  2
 6  3
10  8
11  9
e0e2
e1 e3e8
e11
 e12
 e13
e14
      e15
 4
e4
 7
e9
e7 e6
e10
e5
edge type label
e0 non-tree (0,1)
e1 tree (0,1)
e2 tree (0,7)
e3 tree (0,1)
e4 tree (0,1)
e5 tree (2,6)
e6 non-tree (2,6)
e7 non-tree (0,7)
e8 tree (0,7)
e9 non-tree (0,9)
e10 non-tree (0,10)
e11 tree (2,6)
e12 tree (0,10)
e13 non-tree (7,13)
e14 tree (7,13)
e15 tree (0,10)
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11Q4
Q4
Q4
Q0
Q2
Q0
Q1
Q1 Q5
Q5
Q1Q0Q0
Q2
Q3
Q2
the original graph
G = (V, E)
spanning tree of G edge labels
ear decomposition
of G
Fig. 5. A graph, a spanning tree for that graph, edge classiﬁcation and labels,
and an ear decomposition for that graph
6
 2  4
 5
 6
 7
 8
 9
10
11
12
 1
 3
1
1
2
2
3
3
4
4
5
5
6
7
7
8
8
9
9
10
10
11
11
12
12
1 4 6 10
2 3 5 71
2    3 5  7
 4
6
 8
9
9
 8
10 11
1112
12
1 4 6
3 3
5
5 4 7
7 6
9 8
9
8
10
12 11
102 2
2
2
  1 3 3
5
5 4 7
7
1
6
6
9 8
9
8
12
11
12
10
11
12
11
2
2
1
3
3 4
5 5
4
7
7
6
9
8 9
  8 10
10
11
11
12
12
/
/ /
6
/ / / /
/ / / / /  /
/
 /
/
/
/
/
    /
    /
   /
 /
  1
 /  /
 /
 /
   /
   /
   /
   /
  /    /
 /
 /    /
 /
 /
 /
 /
   /
   /
     /     /
    /
  /
 /
  /
   /
  /
   /
    /
    /
   /
   /
/ /
  /
  /
1
10
4
Fig. 6. Grafting and pointer-jumping operations applied to a sample graph (top
left) and steps of the algorithm
144 David A. Bader et al.
 0 1 2 3
4 5
6 7
8 9 10 11
12 13 14
15
regular lattice RL
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
regular triangulation RT
 9
 7
 2
 15
6
13
 10
  4
  1
  0
5
11
  3
  8
 14
 12
random sparse GA
(count = n)
 2
15
13
1
 0
 11
 5
9
14
10
12
 6
 4
 8
3
 7
random sparse GB
(count = 2n)
2
 0
 13
 7
12
 4
 15
 10
 1
 8
5
14
11
3
 6
9
random GC
(count = n2/2)
  12
 3
 1
 10
6
 14
 4
 7 11
15
 2
13
 8
5
 0
 9
random GD
(count = n2)
constrained Delaunay trian-
gulation CD (count = 100)
constrained Delaunay trian-
gulation CD (count = 1000)
Fig. 7. Examples of our seven planar graph families
