List Ranking and List Scan on the CRAY C90, by Margaret Reid-Miller
Journal of Computer and System Sciences 53, 344-356 (1996)
List Ranking and List Scan on the CRAY C90*
Margaret Reid-Miller-
School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213-3891
Received December 22, 1995
Although parallel algorithms using linked lists, trees, and graphs have
been studied extensively by the research community, implementations
have met with limited success, even for the simplest algorithms. In this
paper we present our results of a careful implementation study of
parallel list ranking (and the related list-scan operation) and show that
it can have substantial speed up over fast workstations. Obtaining good
parallel performance for list ranking is a challenge for two reasons. First,
although there are many asymptotically work-efficient algorithms, it is hard
to keep the constants comparable to those of the sequential algorithm.
In this paper we introduce a new parallel algorithm that is both work
efficient and has small constants but sacrifices logarithmic time; it
achieves only O(log^2 n) time. We contend, however, that work
efficiency and small constants are more important, given that multipro-
cessor machines are used for problems that are much larger than the
number of processors and, therefore, the optimal O(log n) time is never
achieved in practice. Second, list ranking is highly communication
intensive and its memory access patterns are dynamic. We show,
however, that by using high memory bandwidth multiprocessors, such
as CRAY C90 computers, programmed with virtual processors to hide
latency we can ameliorate these problems. To the best of our
knowledge, our implementation of list ranking and list scan on the
CRAY C90 is the fastest to date and is the first implementation that
substantially outperforms fast workstations. The success of our algorithm
is due to its moderate grain size and simplicity; the success of the
implementation is due to pipelining reads and writes through vectorization
to hide latency and optimizing performance by analyzing the expected
execution time of the algorithm. © 1996 Academic Press, Inc.
1. INTRODUCTION
As production parallel and vector machines become
faster and commonplace, it becomes feasible to solve larger
problems. However, large problems that have irregular
sparsity structure or are dynamic are often most efficiently
represented and manipulated using pointer-based data
structures, such as linked lists, trees, and graphs. Use of
such data structures has become natural and common on
sequential machines, but their use on parallel machines has
been avoided in practice. However, theory indicates that
pointer-based data structures can significantly reduce a
problem’s size and, therefore, can improve asymptotic
performance. Many parallel random access machine
(PRAM) algorithms for such data structures have been
developed. But are these PRAM algorithms practical? Can
we perform even the most primitive operations used by
PRAM algorithms efficiently? In this paper we consider one
of the simplest of these pointer-based operations, list ranking
(and the related list scan), and its parallel implementation.
Our aim was to understand the problems involved in
implementing parallel pointer-based algorithms and their
possible solutions, and to see how well these kinds of
algorithms perform in practice.
Implementations of parallel pointer-based algorithms
have been conspicuously sparse. Lumetta et al. [22]
implemented a hybrid parallelserial connected components
algorithm for distributed memory machines. Their study
showed that for certain classes of probabilistic meshes their
algorithm performs reasonably well. But its performance
was quite poor on other meshes and is likely to be worse on
arbitrary graphs. Hsu et al. [17-19] built a library of
pointer-based algorithms, including connected components,
open ear decomposition, list ranking, etc.; Greiner [13]
compared several parallel connected-components algorithms;
Narayannan [27] implemented a single source shortest
path algorithm; and Hillis and Steele [15] implemented list
ranking. These studies showed little or no speed up over
implementations on fast workstations.
List ranking finds for each vertex in a linked list the
number of vertices that precede it in the list. We introduce
a new list-ranking algorithm and its implementation on a
Cray C90 vector multiprocessor. We chose the Cray C90,
because compared to other commercial architectures, it has
very high memory bandwidth and fine-grained access to
memory, and we wanted to see how well list ranking could
perform under the best of circumstances. Because list
ranking makes great demands of the memory system, it
would certainly have much worse performance on other
architectures. To hide memory latency we used virtual
processing. To avoid memory bank contention we used
randomization. To get the best performance possible we
developed a cost model of the algorithm, empirically
determined the execution time of each parallel loop, and then
analytically tuned the parameters.

Copyright © 1996 by Academic Press, Inc.
All rights of reproduction in any form reserved.
* This research was sponsored in part by the Wright Laboratory,
Aeronautical Systems Center, Air Force Materiel Command, USAF, and
the Advanced Research Projects Agency (ARPA) under Grant
F33615-93-1-1330 and in part by the Pittsburgh Supercomputing Center (Grant
ASC890018P) which provided Cray Y-MP C90 time.
The views and conclusions contained in this document are those of the
author and should not be interpreted as necessarily representing the official
policies or endorsements, either expressed or implied, of the Wright
Laboratory or the U. S. Government.
- E-mail: mrmiller@cs.cmu.edu.

TABLE I
Comparison of the Asymptotic Execution Time (ns) per Vertex of Our Vector Parallel List-Scan and List-Ranking Algorithms on the
CRAY C90 and Serial Algorithms on the CRAY C90 and a DEC 3000/600 Alpha Workstation. Times for the Alpha Depend on Whether
the Data Are Already in the Cache or Not, as Indicated.

               DEC Alpha                        Cray C90
Algorithm    Cache   Memory    Serial   Vectorized   2 Proc.   4 Proc.   8 Proc.
List rank       98      690       177         21.3      10.9       5.8       3.1
List scan      200      990       183         30.8      16.1       8.5       4.6
Our implementation of list ranking and list scan (parallel
prefix) on the Cray C90 is the fastest implementation to
date, to the best of our knowledge. In addition, it is the first
implementation of which we are aware that significantly
outperforms fast workstations. Table I shows that on 8
processors our list ranking is 200 times faster than on a DEC
3000/600 Alpha workstation. On one processor it is over
eight times faster than the serial algorithm on the Cray
C90, which is about the best one can expect for any pointer-
based algorithm that does about twice the work of the serial
algorithm (for example, by having contraction and expansion
phases, as ours does). On 8 processors we achieve a 50-fold speedup
over the serial implementation.
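As a quick check, the ratios quoted above can be recomputed from the list-rank row of Table I. This is a worked calculation added for illustration; it assumes the 200-fold figure refers to the Alpha run whose data are not already in cache (against the cached Alpha time the ratio would be about 98/3.1, roughly 32).

```latex
\begin{align*}
\text{Alpha (memory) vs. C90, 8 processors:} \quad & 690 / 3.1 \approx 223 \\
\text{C90 serial vs. C90 vectorized, 1 processor:} \quad & 177 / 21.3 \approx 8.3 \\
\text{C90 serial vs. C90, 8 processors:} \quad & 177 / 3.1 \approx 57
\end{align*}
```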
Although list ranking is simple, it typifies the kinds of
problems for which it is hard to get good vector or parallel
performance. In particular, it uses an irregular data structure,
it is communication intensive, and its communication
patterns are data dependent and dynamic. From an algo-
rithmic point of view it is interesting because it has features
common to many problems: symmetry breaking and load
balancing. Furthermore, because the serial algorithm is so
simple, the overhead associated with a parallel implementa-
tion must be kept small in order to compare well with a
serial implementation.
TABLE II
Comparison of Several List-Ranking Algorithms, Where n Is the Length of the Linked List and p Is the Number of Processors. The
Table Also Shows the Space Required beyond What Is Needed for the List.

Algorithm            Time                       Work         Constants    Space
Serial               O(n)                       O(n)         small        c
Wyllie [35]          O((n log n)/p + log n)     O(n log n)   small        n + c
Randomized [25, 3]   O(n/p + log n)             O(n)         medium       > 2n
Optimal [8, 9, 2]    O(n/p + log n)             O(n)         very large   > n
Ours                 O(n/p + log^2 n)           O(n)         small        5p + c
List ranking has been studied extensively by the theory
community and is used as a primitive for many tree and
graph algorithms [1, 11, 12, 20, 21, 25, 26, 29, 32]. Table II
compares several list-ranking algorithms. The serial
algorithm takes O(n) time and has small constants. The first
parallel algorithm was developed by Wyllie [35] and is
based on pointer jumping. It is not work efficient since it
takes O(n log n) operations. But because Wyllie’s algorithm
is very simple, it works well for short lists. To obtain work-
efficient algorithms one approach is to use randomized
pointer jumping. These algorithms suffer from having to
take multiple trials on average before being able to perform
a pointer jump and, therefore, result in larger constants
than Wyllie’s. On the other hand, optimal deterministic
parallel PRAM list-ranking algorithms have even larger
constants, which makes their implementation impractical.
The problem is that none of these algorithms is simultaneously
work efficient and has small constants.
Given that current supercomputers and massively
parallel processor machines are usually used for problems
that are much larger than the number of processors, the
running time of an algorithm is dominated by the total work
and its associated constants and not so much by the
asymptotic time. Therefore, our goal was to design an
algorithm that both is work efficient and has small
constants, even if it meant sacrificing optimal time. To
evaluate the performance of our algorithm, we implemented
the list-ranking algorithms that are likely to be the most
competitive: serial, Wyllie, Miller and Reif [25], and
Anderson and Miller [3]. The latter two use randomized
pointer jumping. Figure 1 shows the list-scan execution
times on one Cray C90 vector processor. Wyllie’s algorithm
is faster than ours for lists shorter than 1000 vertices. But for
longer lists, our algorithm outperforms the other algorithms. In
addition, the space required by our algorithm, beyond what
is needed for the list, depends on the number of (virtual)
processors, which can be substantially less than the space
required by other algorithms (see Table II).

FIG. 1. Execution times per vertex for several list-scan algorithms on
one processor of the Cray C90. The times for Wyllie’s algorithm and our
algorithm were obtained on a dedicated machine. The sawtooth shape of
the Wyllie curve is due to the discontinuity of $\lceil \log(n-1) \rceil$, which is the
number of rounds of pointer jumping done by the algorithm.
In the next section we describe vector multiprocessors
and their close relationship to the PRAM model. In
Section 2 we define the problems and highlight the five list-
ranking algorithms we implemented. In Section 3 we
describe the implementation of our algorithm in more detail
and give timing data. Then, in Section 4 we give a cost
model of our algorithm, analyze its expected performance,
and describe how we tuned the parameters. In Section 5 we
describe the multiprocessor version of the algorithm and its
performance. In Section 6 we review some other PRAM list-
ranking algorithms. Finally, in Section 7 we discuss our
conclusions and future directions.
1.1. Vector Multiprocessors as PRAMs
We chose to implement list ranking on a vector multi-
processor because these machines, such as the Cray family
of vector computers, closely approximate the abstract
EREW PRAM machine (see Fig. 2). They use a shared
memory model, have fine-grain access to memory, have
high global communication bandwidth, and can hide func-
tional and memory latencies through vectorization. The
most important features that distinguish these machines
from massively parallel processor machines are the high
global communication bandwidth and the pipelined
memory access. Processors communicate to memory via a
multistage butterfly-like interconnection network. As long
as there are no memory bank conflicts, the network can
service one memory request per clock cycle for each
memory pipe. Thus, the PRAM model assumption that
often is cited as unrealistic, namely unit-time memory
access, holds on vector multiprocessors as long as we can
avoid memory bank conflicts and hide latencies.
FIG. 2. Vector multiprocessors viewed as a PRAM. Numbers are
those for the Cray C90.
Zagha [36] proposes several vector multiprocessing
programming techniques for avoiding bank conflicts and
hiding latencies. To address bank conflicts he proposes a
data distribution technique to manage the memory system
explicitly. To address memory and functional unit latencies
he proposes that programs be written for sufficiently more
virtual processors than physical processors. For vector
multiprocessors, virtual processing allows for vectorization
so that computation and communication can be pipelined
to hide latencies. Virtual processing has been a common
approach for hiding latencies [33, 24].
Using Zagha’s programming model we implement
PRAM algorithms by treating a vector processor as a
SIMD (distributed memory) multiprocessor. The L elements
of a vector register of length L act as the L element
processors of the SIMD machine (see Fig. 2). We call these
processors element processors to distinguish them from a full
vector processor. Because processors in data parallel algo-
rithms do not use the results of another processor in the
same time step, there are no recurrences to worry about in
the corresponding vectorized implementation. To amortize
latencies, we attempt to keep the vector lengths long, close
to the length of the vector registers. When the work load is
imbalanced so that processors finish at different time steps
we can use strip-mining [28] or loop-raking [37, 5] to
assign the work of several virtual processors to a single
element processor. That is, element processor $i$ is assigned
virtual processors $j \cdot l + i$ in strip-mining and $i \lfloor n/(l-1) \rfloor + j$
in loop-raking, where $i = 0, \ldots, l-1$, $j = 0, \ldots,
\lfloor n/(l-1) \rfloor - 1$, and $l \le L$. As processors complete, we can
reassign the virtual processors to the element processors.
Extending the vectorized algorithm to vector multi-
processors is straightforward if the machine is SIMD;
simply treat the p vector processors as an L × p SIMD
multiprocessor and apply the vectorized algorithm. If the
machine is MIMD, it can be treated the same way except
that, for efficiency, we should minimize the frequency of
load balancing across physical processors and processor
synchronization.
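The two assignment schemes described above can be sketched in C as follows. This is an illustrative sketch only: the function names are hypothetical, and the per-element block size is taken to be ceil(n / l), which is an assumption of the sketch rather than the exact bound given in the text. Indices that fall beyond the n virtual processors are treated as padding.

```c
#include <assert.h>
#include <stddef.h>

/* Number of virtual processors handled by each element processor.
 * Assumption of this sketch: the block size is ceil(n / l). */
static size_t block_size(size_t n, size_t l)
{
    assert(l > 0);
    return (n + l - 1) / l;
}

/* Strip-mining: element processor i handles virtual processors
 * i, i + l, i + 2l, ...; this returns the j-th of them. */
size_t strip_mine_vp(size_t i, size_t j, size_t l)
{
    return j * l + i;
}

/* Loop-raking: element processor i handles one contiguous block of
 * virtual processors; this returns the j-th of them. */
size_t loop_rake_vp(size_t i, size_t j, size_t n, size_t l)
{
    return i * block_size(n, l) + j;
}

/* Count how often each of the n virtual processors is assigned under
 * the chosen scheme (rake != 0 selects loop-raking).  Indices >= n are
 * padding and are skipped.  counts must have room for n entries. */
void count_assignments(size_t n, size_t l, int rake, unsigned *counts)
{
    size_t b = block_size(n, l);
    for (size_t k = 0; k < n; k++)
        counts[k] = 0;
    for (size_t j = 0; j < b; j++)           /* one time step per j     */
        for (size_t i = 0; i < l; i++) {     /* vectorizable inner loop */
            size_t vp = rake ? loop_rake_vp(i, j, n, l)
                             : strip_mine_vp(i, j, l);
            if (vp < n)
                counts[vp]++;
        }
}
```

With n = 10 and l = 4, for instance, both schemes visit every virtual processor exactly once; strip-mining gives element processor 1 the indices 1, 5, 9, while loop-raking gives it 3, 4, 5.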
2. THE LIST-RANKING AND LIST-SCAN ALGORITHMS
List ranking finds the number of links from the head of a
linked list to each vertex in the list. This information, for
example, can be used to reorder the vertices of a linked list
into an array in one parallel step. List scan computes, for
each vertex in the linked list, the ‘‘sum’’ of the values of all
prior vertices in the list, where ‘‘sum’’ is a binary associative
operator. List ranking and list scan are related in that list
ranking is the list scan where integer addition is the
operator and the values to be summed are all equal to one.
Because list ranking needs to read the link data only,
whereas list scan needs to read both the link and value data,
list ranking is usually faster than list scan (see Table I). In
addition, the scan operator can be more costly to compute
than the increment operator used in list ranking. List-rank-
ing and list-scan algorithms are otherwise the same.
To evaluate the performance of our parallel algorithm we
implemented several list-ranking algorithms. We chose
algorithms that have small constants because they are the
most likely to have the best performance for practical
linked-list lengths. In Section 6 we conclude that other work
efficient algorithms have much larger constants than the
ones we implemented and, therefore, would have worse
performance. Below we describe the main features of the five
algorithms we implemented. Since list-ranking and list-scan
algorithms are essentially the same we generally describe the
list-scan algorithm only. For simplicity we use integer
addition as the ‘‘sum’’ operator.
2.1. The Serial Algorithm
The serial list-scan algorithm simply walks down the list
storing the accumulated values of the previous vertices until
it reaches the end of the list. From Table I we see that the
Cray C90 serial list-scan and list-ranking times are very
nearly the same. The times are similar because the Cray
C90 has two input ports that can bring in both the link and
value data simultaneously. On workstations the list-ranking
execution times are substantially faster than the list-scan
times and both times depend on whether all the linked-list
data can be placed in the cache or not.
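A minimal C sketch of the serial walk, added for illustration. It assumes the list is stored as a value array and a link array of successor indices with the tail marked by a self-loop (the representation the paper describes in Section 3); the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Serial list scan (exclusive prefix sum along the links).
 * value[v] is the value at vertex v and link[v] the index of its
 * successor; the tail is marked by a self-loop, link[tail] == tail.
 * On return scan[v] holds the sum of the values of all vertices that
 * precede v in the list.  Returns the sum of all values. */
long list_scan_serial(const long *value, const size_t *link,
                      size_t head, long *scan)
{
    long sum = 0;
    size_t v = head;
    for (;;) {
        scan[v] = sum;          /* values strictly before v */
        sum += value[v];
        if (link[v] == v)       /* reached the tail */
            break;
        v = link[v];
    }
    return sum;
}

/* List ranking is list scan with every value equal to one, so only
 * the link array has to be read. */
void list_rank_serial(const size_t *link, size_t head, size_t n, long *rank)
{
    assert(n > 0);
    long r = 0;
    size_t v = head;
    for (size_t k = 0; k < n; k++) {
        rank[v] = r++;
        v = link[v];
    }
}
```

The ranking variant touches one array per step and the scan variant two, which is the difference in memory traffic discussed above.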
2.2. Wyllie’s Algorithm
The first parallel list-ranking algorithm was introduced
by Wyllie [35]. The algorithm uses a technique common to
most parallel list-ranking algorithms, ‘‘pointer jumping’’ or
‘‘shortcutting.’’ A processor is associated with every vertex
and each processor repeatedly replaces its successor pointer
with its successor’s successor pointer in unison with the
other processors. The new value for the vertex is its old
value plus the value of its successor. For a list of length n,
after $\lceil \log_2(n-1) \rceil$ rounds of pointer jumping, every
vertex points to the tail of the list and its value is the sum of
the values from the vertex to the tail of the list. Although this
algorithm is simple and has an O(log n) running time, it is
not work efficient since the total number of operations is
O(n log n), whereas the serial algorithm takes O(n) opera-
tions.
Figure 1 shows the vectorized execution times of Wyllie’s
algorithm on one processor of the Cray C90. The sawtooth
shape of the curves is due to the addition of another round
of pointer jumping whenever the value of $\lceil \log(n-1) \rceil$
changes. The negative slope between a pair of teeth is due to
the amortization of the additive constant terms over larger
size lists. As one can see from Fig. 1, Wyllie’s algorithm
quickly degrades in performance as the list lengths grow.
However, it does scale almost linearly with the number of
processors and so on multiple processors is faster than the
serial algorithm for moderately long lists.
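Pointer jumping can be sketched in C as below. This is a serial simulation written for illustration, not the vectorized Cray code: each round reads the old arrays and writes fresh copies, which mimics all processors acting in unison. The function name and the double-buffering are assumptions of the sketch. It computes the number of links from each vertex to the tail, so the rank from the head is (n - 1) minus that value.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Wyllie-style pointer jumping, simulated serially.
 * link[v] is the successor of v; the tail has link[tail] == tail.
 * On return dist[v] is the number of links from v to the tail.
 * Returns the number of rounds performed, or -1 if allocation fails. */
int wyllie_rank(const size_t *link, size_t n, long *dist)
{
    assert(n > 0);
    size_t *next = malloc(n * sizeof *next);
    size_t *next2 = malloc(n * sizeof *next2);
    long *dist2 = malloc(n * sizeof *dist2);
    if (!next || !next2 || !dist2) {
        free(next); free(next2); free(dist2);
        return -1;
    }
    for (size_t v = 0; v < n; v++) {
        next[v] = link[v];
        dist[v] = (link[v] == v) ? 0 : 1;   /* the tail holds the identity */
    }
    int rounds = 0;
    /* After r rounds every pointer spans 2^r links (or has reached the
     * tail), so ceil(log2(n - 1)) rounds are enough. */
    for (size_t span = 1; span < n - 1; span *= 2) {
        for (size_t v = 0; v < n; v++) {    /* one parallel step */
            dist2[v] = dist[v] + dist[next[v]];
            next2[v] = next[next[v]];
        }
        memcpy(dist, dist2, n * sizeof *dist);
        memcpy(next, next2, n * sizeof *next);
        rounds++;
    }
    free(next); free(next2); free(dist2);
    return rounds;
}
```

Every round touches all n vertices, including those that already point at the tail, which is where the O(n log n) work comes from.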
2.3. MillerReif Random Mate
One of the simplest work-efficient parallel algorithms was
devised by Miller and Reif [25, 31]. It uses randomization
for symmetry breaking so that processors at neighboring
vertices do not attempt to dereference their successor
pointers simultaneously. Each processor flips an unbiased
malefemale coin. If it is a female and its successor is a male
(a ‘‘random mate’’) then it ‘‘splices out’’ its successor. The
processor for the successor vertex becomes idle. On each
round processors splice out only 14 of the remaining
vertices on average. With high probability after O(log n)
rounds all the vertices of the list are spliced out. Finally,
there is a reconstruction phase, in which spliced-out vertices
are reintroduced in reverse order from which they were
removed. The constants for this algorithm are greater than
for Wyllie’s algorithm because it needs to generate random
numbers and perform extra communication to establish
random mates with successor vertices, takes on average 4
attempts to splice out each vertex, requires load balancing,
and has both a reduction phase and a reconstruction phase.
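The factor of 4 follows from the coin flips. As a short worked step (added here, assuming independent unbiased coins and ignoring the head of the list, which has no predecessor), a given vertex is spliced out in a round exactly when its predecessor draws female and it draws male:

```latex
\Pr[\text{spliced out in a round}]
  = \Pr[\text{predecessor female}] \cdot \Pr[\text{vertex male}]
  = \tfrac{1}{2} \cdot \tfrac{1}{2} = \tfrac{1}{4},
\qquad
\mathbb{E}[\text{attempts per vertex}] = \frac{1}{1/4} = 4 .
```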
We implemented this algorithm using the vector units on
a single processor of the Cray C90. Our version removes
idle processors on every round by compressing the remain-
ing vertices into contiguous vector elements (an operation
we call ‘‘pack’’). This algorithm is 20 times slower than our
algorithm and 3.5 times slower than the serial algorithm for
long linked lists.
2.4 AndersonMiller Random Mate
Anderson and Miller [3, 31] modified the above
algorithm so that it avoids load balancing (packing). The p
processors are assigned the work of splicing out a queue of
np vertices. At each round a processor attempts to remove
one vertex in its vertex queue. After a processor splices out
a vertex, on the next round it attempts to splice out the next
vertex in its queue. In this simple way processors remain
347LIST RANKING AND LIST SCAN
File: 571J 146005 . By:CV . Date:12:12:96 . Time:12:32 LOP8M. V8.0. Page 01:01
Codes: 6251 Signs: 5567 . Length: 56 pic 0 pts, 236 mm
busy without load balancing. When there are fewer than p
vertices left, it compresses these vertices in memory and
applies Wyllie’s algorithm. Finally, there is reconstruction
phase to reintroduce spliced out vertices.
To determine if a processor can splice out a vertex, all the
vertices are set to female, except those at the top of the
queue. These vertices are assigned male or female by a
random toss of a coin. A vertex can be spliced out if it is a
male pointed to by a female. If the coin is unbiased then in
the initial rounds the processors remove on average up to
1/2 of the vertices each round. But as the processing
proceeds, the average number of processors that can remove
a vertex each round drops to 1/4 of those left. Anderson and
Miller show that if the number of processors, p, is n/log n
then after a little over 4 log n rounds p vertices are left.
We attempted to optimize the AndersonMiller algorithm
to see how well it could perform. The most important
optimization was changing the coin bias so that the prob-
ability of assigning male was 0.9. The effect was that almost
900 of the active processors could splice out on every
round, even when the total number of remaining vertices
was small. This high rate continued because as some
processor queues completed and the remaining queues
continued to have several (female) vertices on them. The
result was to reduce the number of rounds and the run time
by about 400. Switching to Wyllie’s algorithm was not
useful because the number of rounds needed to complete
without Wyllie’s was not much greater than with Wyllie’s.
However, we did switch to the serial algorithm when only a
few queues remained. This result is in contrast to that found
by Hsu and Ramachandran [17] on the MasPar MP-1.
They found that switching to Wyllie’s algorithm greatly
reduced the number of rounds needed. The difference is that
they had 16,383 processors and were using an unbiased coin
whereas we had only 128 element processors and a biased
coin. For long lists the AndersonMiller algorithm is 3 times
faster than the MillerReif algorithm, but still 7 times slower
than our algorithm and 350 slower than the serial algorithm.
However, because it scales almost linearly, for long lists it
is faster on multiple physical processors than the serial
algorithm or Wyllie’s algorithm.
2.5. Our Parallel Algorithm
Many other work-efficient PRAM algorithms have been
developed for list ranking. Most use contract-scan-expand
phases and address two considerations. One is how to find
vertices on which to work to keep all the processors busy
and the second is how to break symmetry so that two
processors are not working on neighboring vertices [2]. We
break symmetry by randomly dividing up the linked list of
length n into m+1 sublists that can be processed independ-
ently and in parallel. We briefly describe the three phases of
the algorithm below.
Phase 1. Randomly divide the list into m+1 sublists.
Traverse each sublist computing the ‘‘sum’’ of the values at
each vertex. Form a new linked list of length m+1 that links
the sublist sums in the order the sublists appear in the
original list.
Phase 2. Find the list scan of the reduced list found in
Phase 1. These values become the scan values for the heads
of the sublists.
Phase 3. Traverse each sublist computing the list scan
of each vertex as the sum of the value and list scan of the
previous vertex.
Phases 1 and 3 are parallel. We pick m ≤ n/log n so that
Phase 1 reduces the problem size by at least a factor of log n.
The list scan in Phase 2 can be done recursively for large m,
using Wyllie’s pointer jumping technique for moderate size
m, or serially for small m. For small m serial list scan works
best because it avoids the overhead associated with the
parallel versions (see Fig. 1). Wyllie’s algorithm performs
best on moderate size lists where it can take advantage of
vectorization and multiprocessing and where log n is small.
For large m we use our algorithm recursively, until the
number of sublists becomes small enough to use either the
serial or Wyllie’s algorithm. We empirically determined
when we should switch between algorithms.
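The three phases can be sketched as a serial simulation in C. This is a sketch under stated assumptions, not the paper's implementation: the cut positions are passed in as `split` rather than drawn at random, cuts are recorded in a side array instead of modifying the list, Phase 2 is always done serially, and all names are hypothetical. The loops over sublists in Phases 1 and 3 are the ones that would run in parallel.

```c
#include <assert.h>
#include <stdlib.h>

#define NO_SUBLIST ((size_t)-1)

/* Exclusive list scan in three phases, simulated serially.
 * value/link describe the list (the tail has link[tail] == tail).
 * split[0..m-1] are distinct non-tail vertices after which the list is
 * cut (taken as input here; the paper chooses them at random).
 * Returns 0 on success, -1 if memory allocation fails. */
int list_scan_sublists(const long *value, const size_t *link, size_t n,
                       size_t head, const size_t *split, size_t m, long *scan)
{
    size_t nsub = m + 1;
    size_t *owner = malloc(n * sizeof *owner);   /* sublist starting after v */
    size_t *sub_head = malloc(nsub * sizeof *sub_head);
    size_t *succ = malloc(nsub * sizeof *succ);  /* links of the reduced list */
    long *sub_val = malloc(nsub * sizeof *sub_val);
    int ok = owner && sub_head && succ && sub_val;

    if (ok) {
        /* Initialization: sublist 0 starts at the head of the list,
         * sublist k starts right after split[k-1]. */
        for (size_t v = 0; v < n; v++)
            owner[v] = NO_SUBLIST;
        sub_head[0] = head;
        for (size_t k = 1; k <= m; k++) {
            size_t r = split[k - 1];
            assert(r < n && link[r] != r && owner[r] == NO_SUBLIST);
            owner[r] = k;
            sub_head[k] = link[r];
        }
        /* Phase 1: sum each sublist independently and link the sums. */
        for (size_t k = 0; k < nsub; k++) {
            size_t v = sub_head[k];
            long sum = value[v];
            while (owner[v] == NO_SUBLIST && link[v] != v) {
                v = link[v];
                sum += value[v];
            }
            sub_val[k] = sum;
            succ[k] = owner[v];   /* NO_SUBLIST for the last sublist */
        }
        /* Phase 2: scan the reduced list; sub_val[k] becomes the scan
         * value of the head of sublist k. */
        long acc = 0;
        for (size_t k = 0; k != NO_SUBLIST; k = succ[k]) {
            long s = sub_val[k];
            sub_val[k] = acc;
            acc += s;
        }
        /* Phase 3: expand each sublist independently. */
        for (size_t k = 0; k < nsub; k++) {
            size_t v = sub_head[k];
            long run = sub_val[k];
            for (;;) {
                scan[v] = run;
                run += value[v];
                if (owner[v] != NO_SUBLIST || link[v] == v)
                    break;
                v = link[v];
            }
        }
    }
    free(owner); free(sub_head); free(succ); free(sub_val);
    return ok ? 0 : -1;
}
```

Each vertex is visited once in Phase 1 and once in Phase 3, which is the "about twice the work of the serial algorithm" mentioned in the introduction.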
There are two problems with our algorithm that make it
theoretically inferior:
• The sublist lengths vary widely, from a small fraction
of the mean n/m, namely (n/m) ln((m+1)/(m+0.5)) on
average, to a large factor of the mean, namely (n/m)
ln(2m+2) on average. Thus, the processors’ work is very
imbalanced.
• Since the expected length of the longest sublist is
approximately (n/m) ln(2m+2), the parallel running time
can be no better than that, i.e., O((n/p)+(n/m) log m),
m ≤ n/log n. In contrast, there are many parallel algorithms
that have an O((n/p)+log n) running time.
We ameliorate both problems by requiring m to be much
greater than the number of processors, p. In this way a
processor is responsible for several sublists, namely
(m+1)/p. Periodically we perform load balancing to
regroup the lists, which addresses the first problem. When
p < m/log m the running time is dominated by n/p and the
length of the longest sublist is not a problem.
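The threshold on p can be read off from the two terms of the running-time bound. The following short derivation is added for illustration; it uses the expected longest-sublist length quoted above and ignores constant factors.

```latex
\frac{n}{m}\,\ln(2m+2) \;\le\; \frac{n}{p}
\quad\Longleftrightarrow\quad
p \;\le\; \frac{m}{\ln(2m+2)} \;=\; \Theta\!\left(\frac{m}{\log m}\right).
```

For example, with m = 1000 we have ln(2002) ≈ 7.6, so the longest sublist is expected to be about 7.6 times the mean length, and the n/p term dominates for p up to roughly 130.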
The primary advantage of our algorithm is that it is both
work efficient and has small constants. The algorithm is
fully vector parallel and scales almost linearly with the
number of processors. We expect, however, some degrada-
tion in performance as the number of processors increases,
because the available memory bandwidth per processor
decreases. Figure 3 shows the speedup relative to one
processor for various size lists.
FIG. 3. Relative speedups of our list-scan algorithm on the Cray C90.
3. OUR ALGORITHM ON A VECTOR PROCESSOR
This section describes our vector implementation of list
scan (for a more detailed description see [30]). We used the
C programming language and the standard Cray C
compiler on a Cray C90. The time equations we give in this
section are in Cray C90 clock cycles (4.2 nsec). Because
many of our vector operations use indirect addressing, we
needed to give compiler directives in order to get the
compiler to vectorize the loops; the only portion that is not
vectorizable is the serial list scan in Phase 2. We attempted
to reorder the statements within a loop in order to fill the
multiple functional units for concurrent operations, to
avoid contention between input/output memory ports and
the gather/scatter hardware, and to avoid write-after-read
dependencies [14]. With nested loops we unrolled the outer
loop up to eight times to avoid unnecessary loading and storing
of vector registers on each execution of the inner loop.
Chaining is also possible within loops. We made no attempt
to avoid memory bank conflicts. However, since we are
choosing random positions for the heads of the sublists,
systematic memory bank conflicts are unlikely.

FIG. 4. The top of the figure shows the initial linked list with its values at each vertex. The bottom of the figure shows the results of the initialization
step. The linked list is divided into 3 sublists, one for each processor, and each sublist is terminated with a self loop. Processor P0 sets its sublist head, H,
to the head of the whole list. Each remaining processor, P1 and P2, saves two values: its chosen random position, R, and the successor of the random position
in the original linked list, which becomes the head of its sublist, H. Each processor also initializes its sublist sum, S, to zero, the identity of the scan operator.
In the following description we assume that there is one
virtual processor for every sublist. By using strip-mining, the
element processors are assigned the work of an equal number
of virtual processors. We represent the linked list as a pair of
arrays. The value array contains the value of each vertex of
the list and the link array contains the index of the next vertex
in the list. The tail of the list is a self-loop, i.e., the link at the
tail is the index of the tail vertex.
Initialization. Each virtual processor except a designated
one, P0, picks a random vertex in the linked list to be the
tail of a sublist and makes the successor vertex the head of its
sublist. P0 takes the head of the whole list as its sublist
head. Then each virtual processor sets the sublist tail to a self
loop and initializes its sublist sum to zero, where zero is the
identity of the list-sum operator. Figure 4 shows the linked
list that is the input of the list-scan algorithm and the result
of the initialization step.
It is possible that two virtual processors pick the same random
position at which to break the list. Then the two processors
duplicate each other’s actions and cause contention.
We can remove duplicate random numbers by having a
competition among the processors. Each processor writes its
index at its random location and waits for all the processors
to complete their writes. Then each processor reads the
index at its random location. If the index is not its own, it
knows that it is a duplicate processor and can drop out of
the computation.

FIG. 5. The figure shows the results of computing the sum of each sublist. Each processor traverses its sublist until it reaches the sublist tail, T, and
accumulates the ‘‘sum’’ of the values along the sublist, S.
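The competition and the initialization step can be sketched in C as follows. This is an illustration under stated assumptions: the random positions are passed in as an array, concurrent writes are modeled by letting the last writer win, positions that happen to be the list tail are also discarded (the text does not say how that case is handled), and the names `remove_duplicates`, `init_sublists`, `R`, `H` and `S` are chosen to match the figure labels rather than taken from the paper's code.

```c
#include <assert.h>
#include <stddef.h>

#define NO_POS ((size_t)-1)

/* Duplicate removal by competition.  pos[k] is the position chosen by
 * virtual processor k; mark is scratch space with one entry per vertex.
 * Survivors are compacted to the front of pos; returns their count. */
size_t remove_duplicates(size_t *pos, size_t m, const size_t *link,
                         size_t *mark)
{
    assert(pos && link && mark);
    /* Step 1: every processor writes its index at its position.  With
     * concurrent writes some writer wins; here it is the last one. */
    for (size_t k = 0; k < m; k++)
        mark[pos[k]] = k;
    /* Step 2: after all writes, each processor reads its position back
     * and stays only if it finds its own index there. */
    size_t kept = 0;
    for (size_t k = 0; k < m; k++) {
        size_t r = pos[k];
        if (mark[r] == k && link[r] != r)   /* assumption: skip the tail */
            pos[kept++] = r;
    }
    return kept;
}

/* Initialization: P0 takes the head of the whole list; processor k+1
 * saves its position R and the successor H, then cuts the list by
 * turning R into a self-loop.  Sublist sums S start at the identity. */
void init_sublists(size_t *link, size_t head, const size_t *pos, size_t m,
                   size_t *R, size_t *H, long *S)
{
    R[0] = NO_POS;
    H[0] = head;
    S[0] = 0;
    for (size_t k = 0; k < m; k++) {   /* vectorizable: positions distinct */
        size_t r = pos[k];
        R[k + 1] = r;
        H[k + 1] = link[r];
        link[r] = r;
        S[k + 1] = 0;
    }
}
```

Because the surviving positions are distinct, each processor reads and then overwrites only its own link entry, so the initialization loop has no dependence between iterations.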
The standard model [16] for the performance of vector
operations on vectors of length n is:
$$T(n) = t_e\,(n + n_{1/2}),$$
where $t_e$ is the incremental time per vector element and $n_{1/2}$
is the vector half-performance length (the vector length that
achieves half the peak performance). Based on this model we
found that the number of clock cycles needed for Initialize
to set up m+1 sublists is:
$$T_{\text{Initialize}}(m+1) = 22(m+1) + 1800.$$
FIG. 6. The figure shows creating the reduced list of sublist sums during Phase 1. Each processor writes its index at its random position in the linked
list, R, and reads the index written at the tail of its sublist, T. This index is the index of the processor with the successor sublist. P1 finds no index at
the tail of its sublist and therefore is the tail sublist.
Phase 1. Phase 1 alternates between summing the
values along the links of the sublists and load balancing.
Because the sublist lengths are predictable, we
can determine how many links to traverse between load
balancing steps and adjust this number as the procedure
progresses (see Section 4). Figure 5 shows the status after
the processors have found their sublist sums.
To illustrate how streamlined the list traversal is, we show
the code written in pseudo C.
InitialScan(vp, ll, nlinks) {
    /* vp:     virtual processor data */
    /* ll:     linked list */
    /* nlinks: number of links to traverse */
    for (j = 0; j < nlinks; j++)        /* Traverse nlinks links of each sublist */
        for (i = 0; i < vp.n; i++) {    /* Vectorized loop */
            vp.sum[i] += ll.value[vp.next[i]];   /* Gather value and increment sum */
            vp.next[i] = ll.next[vp.next[i]];    /* Gather successor link */
        }
}
The number of clock cycles needed for the inner loop is:
$$T_{\text{InitialScan}}(x) = 3.4x + 35,$$
where x is the number of sublists remaining in the computation.
We use the variable x, as opposed to m+1, to indicate
that the value of x changes during the course of the execution
of the algorithm.
Notice that all the loop does is gather the data necessary
to compute the sum and store the results; there are no
conditional tests or additional computation. To avoid
conditional tests we destructively set the sublist tail values
to zero in the initialization step. In this way, we can
repeatedly add the tail value without changing the sum.
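A compilable version of the traversal loop might look like the sketch below. The struct layouts are assumptions made for illustration, as are two conventions the text implies but does not spell out: each sublist tail points to itself, and each virtual processor starts at the head of its sublist with a sum of zero.

```c
#include <stddef.h>

/* Linked list stored as parallel arrays (layout assumed for this sketch). */
typedef struct { const long *value; const size_t *next; } linked_list;
/* Per-virtual-processor state: running sum and current vertex. */
typedef struct { size_t n; long *sum; size_t *next; } vproc_data;

/* Traverse nlinks links of every sublist with no conditional tests.
 * Tails carry value 0 and point to themselves, so stepping past the
 * end of a sublist leaves both sum and position unchanged. */
void initial_scan(vproc_data vp, linked_list ll, size_t nlinks)
{
    for (size_t j = 0; j < nlinks; j++)
        for (size_t i = 0; i < vp.n; i++) {          /* vectorizable loop */
            vp.sum[i] += ll.value[vp.next[i]];       /* gather value */
            vp.next[i] = ll.next[vp.next[i]];        /* gather successor */
        }
}
```

With two sublists 0, 1, 2 and 3, 4, 5 (tails 2 and 5 zeroed), traversing four links gives the same sums as traversing two, which is the point of zeroing the tails.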
Because this loop (and the corresponding one in Phase 3)
traverses every link in the linked list, the time of this loop
dominates the overall time and every economy is critical.
For list ranking, we are able to improve the performance of
the loop further by reducing the number of gather operations
to one, which is important because the Cray C90 can
perform only one gather or scatter operation at a time. One
gather is sufficient because we encode the link and value
data for a vertex into a w-bit integer value, which we can do
as long as the list length (and therefore the maximum rank)
is no more than $2^{w/2}$.
FIG. 7. Phase 2 finds the list scan on the reduced list of sublist sums.
To load balance, the sum and tail indices for the
completed sublists are saved and the incomplete virtual
processor data are packed into contiguous array locations.
The number of clock cycles needed to load balance x
sublists is:
$$T_{\text{InitialPack}}(x) = 8.2x + 1200.$$
Once the virtual processors have reached the tails of their
sublists, they create the reduced list of sublist sums. Each
processor knows the tail of the previous sublist because it is
the random position the processor chose during initializa-
tion. By writing the virtual processor’s index into the tail of
the previous sublist and then reading the index at the tail
of its own sublist, the processor determines the virtual
processor index of its successor’s sublist. This index
becomes the successor link from its sublist sum to its
successor’s sublist sum and forms a new shorter linked list
(see Fig. 6). For example, consider the tail of the first sublist
in Fig. 6. Processor 2 writes 2 in the tail of the first sublist.
Then Processor 0 reads the 2 at the tail of its sublist. Thus,
Processor 0 links its sum to the sum at Processor 2.
A processor can determine whether its sublist is the tail
sublist because no processor wrote its index in the tail. Thus,
the tail sublist processor, P1 in Fig. 6, sets its successor link
to its own index.
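The linking step can be sketched as a scatter followed by a gather. In this illustration, prev_tail[i] is the random position processor i chose (the tail of the preceding sublist), own_tail[i] is the tail it reached in Phase 1, and NONE is a hypothetical sentinel for "nothing written"; the processor that owns the head of the whole list is assumed to have prev_tail equal to NONE.

```c
#include <stddef.h>

#define NONE ((size_t)-1)

/* Build the successor links of the reduced list of sublist sums.
 * mark[] is scratch space with one slot per list position. */
void link_sublists(size_t nvp, const size_t *prev_tail,
                   const size_t *own_tail, size_t *mark, size_t *succ_vp)
{
    for (size_t i = 0; i < nvp; i++)      /* clear the slots that are read */
        mark[own_tail[i]] = NONE;
    for (size_t i = 0; i < nvp; i++)      /* write my index at the tail of */
        if (prev_tail[i] != NONE)         /* the previous sublist          */
            mark[prev_tail[i]] = i;
    for (size_t i = 0; i < nvp; i++) {    /* read the index at my own tail */
        size_t s = mark[own_tail[i]];
        succ_vp[i] = (s != NONE) ? s : i; /* tail sublist links to itself  */
    }
}
```

For the configuration of Fig. 6 (sublist order P0, P2, P1), this yields successor links 0 to 2, 2 to 1, and 1 to itself.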
The number of clock cycles needed to create the reduced
list of length m+1 is:
$$T_{\text{FindSublistList}}(m+1) = 11(m+1) + 650.$$
FIG. 8. The scan values for the heads of the sublists are the scan values of the reduced list found in Phase 2. Phase 3 finds the remaining scan values.
Phase 2. Depending on the size of this new linked list,
the algorithm finds the list scan of the reduced linked list
recursively, using Wyllie’s algorithm or serially. Note that in
this phase we find the list scan, for both list ranking and list
scan, and not the sum as in Phase 1. Figure 7 shows the
result of Phase 2 of the algorithm.
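For readers unfamiliar with Wyllie's algorithm, the following is a minimal sketch of pointer jumping in its classic form, which accumulates values toward the tail; the paper's Phase 2 computes a scan from the head and may differ in detail. The function name and the conventions (tail points to itself and has value 0) are assumptions of this sketch. The inner loop is conceptually parallel, so results are written to temporary arrays before being copied back.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* On entry next[tail] == tail and val[tail] == 0. On exit val[i] is the
 * sum of the original values from vertex i up to the tail; with all
 * values 1 this is the number of links from i to the tail.
 * Returns 0 on success, -1 if memory allocation fails. */
int wyllie_rank(size_t n, size_t *next, long *val)
{
    if (n == 0) return 0;
    size_t *next2 = malloc(n * sizeof *next2);
    long *val2 = malloc(n * sizeof *val2);
    if (!next2 || !val2) { free(next2); free(val2); return -1; }

    for (size_t span = 1; span < n; span *= 2) {   /* about log2(n) rounds */
        for (size_t i = 0; i < n; i++) {           /* all i in parallel */
            val2[i] = val[i] + val[next[i]];       /* absorb successor's sum */
            next2[i] = next[next[i]];              /* jump over successor */
        }
        memcpy(val, val2, n * sizeof *val);
        memcpy(next, next2, n * sizeof *next);
    }
    free(next2);
    free(val2);
    return 0;
}
```

Every round touches all n vertices, so the total work is proportional to n log n, which is why the text reserves this method for short reduced lists.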
Phase 3. The scan value found in Phase 2 becomes the
scan value for the head of a virtual processor's sublist (see
Fig. 8).
As in Phase 1 each virtual processor alternates between
traversing its sublist and load balancing. But this time it
finds the scan of each vertex as the sum of the previous
vertex scan and value. The number of clock cycles needed to
traverse one link of the x sublists remaining in the computation
is:
$$T_{\text{FinalScan}}(x) = 4.6x + 28,$$
and to load balance the x sublists is:
$$T_{\text{FinalPack}}(x) = 7.2x + 950.$$
Restoration. Finally, each virtual processor reconnects
the sublists to form the original linked list, using the values
saved during initialization. The number of clock cycles
required to reconnect m+1 sublists is:
$$T_{\text{RestoreList}}(m+1) = 4.2(m+1) + 300.$$
4. ANALYSIS OF THE ALGORITHM
In Phases 1 and 3 of the algorithm we periodically per-
form load balancing so that completed sublists are removed
from the computation. If we load balance too frequently we
remove none or only a few sublists, and when there are
many sublists load balancing is expensive. If we do not load
balance often enough we may have many processors performing
needless work repeatedly chasing completed sublists'
tails. To determine good times to load
balance we first need a better understanding of the
expected distribution of the sublist lengths. We derive
the expected distribution in Section 4.1. Next, in Section 4.2
we determine the overall cost of performing the algorithm,
given the timing data in Section 3. In Section 4.3 we use the
expected distribution to minimize the execution time of the
algorithm, given the number of sublists and number of times
to load balance. Finally, we explain how we compute the
parameter values to minimize the overall execution time.
The main theorem is:
Theorem 1. The list-ranking algorithm in this paper has
expected time $O(n/p + (n/m) \log m)$ on p processors, when
$m < n/\log n$.
4.1. Analysis of Sublist Lengths
In this section we show that the distribution of the lengths
of the sublists is approximately a negative exponential
distribution, when n and m are large. We first consider the
following situation. Let $X_1, \ldots, X_m$ be m random numbers in
the range (0, 1). For truly random numbers $\mathrm{Prob}(X_j = X_k) = 0$
for $j \neq k$. Therefore, the numbers partition (0, 1) into m+1
subintervals. Let $X_{(1)}, \ldots, X_{(m)}$ denote the Xs ordered by their
sizes from smallest to largest.
Proposition 2 (Feller [10]). If $X_1, \ldots, X_m$ are independent
and uniformly distributed over the range (0, 1) then
as $m \to \infty$ the successive intervals in our partition behave
as though they are mutually independent exponentially
distributed variables with $\mathrm{Exp}(X_{(j+1)} - X_{(j)}) = 1/m$.
In our case, we are choosing m random positions in a list
of length n. If $n \gg m$, $n \to \infty$, and $m \to \infty$ then the lengths
of the sublists tend to behave as mutually independent
exponential variates with expectation n/m. That is, if L is a
sublist length, then
$$\mathrm{Prob}[L > x] \approx e^{-mx/n} = a(x). \qquad (1)$$
We can estimate the expected length of the jth shortest sublist
by setting $a(x) = (m - j + 0.5)/(m + 1)$ and solving for x.
This estimate seems to be reasonable for n and m as small as
n > 1000 and m > 100. The expected length of the shortest
sublist is:
$$\mathrm{Exp}(L_{(0)}) \approx \frac{n}{m} \ln\left(\frac{m+1}{m+0.5}\right),$$
and the longest sublist is:
$$\mathrm{Exp}(L_{(m)}) \approx \frac{n}{m} \ln(2m+2).$$
FIG. 9. The function curves are the expected length of the j th shortest
sublist when n=10000 for several values of m, the number of sublists. The
observed lengths were found by taking 20 samples of dividing a list of size
n=10000 into m sublists randomly. The error bars show the minimum,
average, and maximum lengths of the jth shortest sublists of the 20 samples.
Figure 9 shows the expected length of the j th shortest sublist
for several values of m when n=10000 and compares it to
some actual data averaged over 20 samples. Notice that as
m increases the expected length of the longest sublist
decreases and there is less variation in list lengths. There-
fore, to reduce the parallel running time we want to make m
large. However, as m increases so do the costs due to load
balancing, initialization, and Phase 2.
4.2. Cost of the Algorithm
Using the timing equations of each piece of the algorithm
given in Section 3 we can determine the cost of the complete
algorithm, assuming we know the exact lengths of the
sublists and when to perform load balancing. Let $S_i$ be the
total number of links traversed in each list before load
balancing for the ith time. Let g(x) be the expected number
of sublists that have length greater than x. From Eq. (1) we
get
$$g(x) = (m+1) \times \mathrm{Prob}(L > x) \approx (m+1)\, e^{-mx/n}. \qquad (2)$$
The dotted line in Fig. 10 shows g(x) when n=10000 and
m=200. The x-axis is the sublist length and the y-axis is the
number of sublists longer than that length. You can think of each
sublist as being laid out from left to right and placed one
above the other from longest to shortest, each starting at
x=0. That is, the y-axis is the number of sublists that are
still active in the computation, namely the vector lengths of
the computations, while the x-axis is the number of links
traversed in each list. As we proceed from left to right, we
are traversing links on a vector of length equal to the height
of the step function. Every time we load balance (at the
corner of a step) the vector length decreases. The area under
the step function in Fig. 10 is the expected total number of
links traversed in either Phase 1 or Phase 3. If $S_i = i$ then the
area under the step function would be n, the area under the
curve g(x). Our aim is to minimize the area under the step
function that is above the dotted line, while keeping the cost
of load balancing down. The cost of load balancing is
proportional to the sum of the heights of the step function.
FIG. 10. The dotted function is g(x), the expected number of sublists
that have length greater than x, where n=10000 and m=199. If we load
balance 11 times the expected execution time on the Cray C90 is minimized
by load balancing at the vertical lines. The step function shows the
expected number of sublists that are currently active at every link traversal.
The drop in step size is the expected number of sublists that completed
since the last time load balancing was done.
From the equations for each part of the algorithm given
in Section 3, the expected number of Cray C90 cycles for
Phases 1 and 3 of the algorithm is:
$$T_{P1+P3} = \left( \sum_{i=0}^{l-1} (S_{i+1} - S_i)\bigl(a \cdot g(S_i) + b\bigr) \right) + \left( \sum_{i=0}^{l-1} \bigl(c \cdot g(S_i) + d\bigr) \right) + e(m+1) + f, \qquad (3)$$
where $T_{\text{Scan}}(x) = ax + b$, $T_{\text{Pack}}(x) = cx + d$, and $T_{\text{Other}}(x) = ex + f$,
and $g(S_i)$ is the expected number of sublists remaining
after traversing $S_i$ links in each list. The first summation
is the total time to traverse the links in Phases 1 and 3, the
second summation is the time for load balancing in Phases
1 and 3, and the final $e(m+1) + f$ is the time for creating the
sublists, forming the reduced linked lists from the sublist
sums, and restoring the linked list to its original form after
the final phase.
4.3. Minimizing the Time Given Fixed Parameters
Consider n, m, and l fixed. We want to minimize the
execution time with respect to $S_0, S_1, S_2, \ldots, S_l$, where
$S_0 = 0$ and $S_i$, $i > 0$, is the number of links traversed on each
sublist before load balancing for the ith time. We can minimize
Eq. (3) by taking the partial derivative with respect to each $S_i$ and
setting it to zero to obtain a set of l simultaneous
equations:
$$S_{i+1} = S_i - \frac{g(S_{i-1}) - g(S_i)}{g'(S_i)} - \frac{c}{a}
= S_i + \frac{g(S_{i-1}) - g(S_i)}{(m/n)\, g(S_i)} - \frac{c}{a}, \qquad (4)$$
using $g'(S_i) = -(m/n)\, g(S_i)$. That is, if we know $S_{i-1}$ and $S_i$
then we can determine $S_{i+1}$. Since we know $S_0 = 0$, if we
know $S_1$ we can compute $S_2, \ldots, S_l$ iteratively.
Note in Fig. 10 that the $S_i$ become increasingly farther
apart for larger i because the rate at which sublists complete slows
down over time. The factor c/a in Eq. (4) reflects the relative
cost of load balancing and traversing links. If we increase c
and keep the number of times we load balance constant,
load balancing would occur less frequently during the initial
iterations and more frequently during later iterations.
Initially load balancing is expensive because the vector
lengths are long; it becomes less expensive as vector
lengths shorten. As we increase c relative to a, eventually we
find that the execution time is reduced by decreasing the
number of times we load balance even though we increase
the total number of links we traverse. In the next section we
consider how to determine the best value for $S_1$, which in
turn determines the number of times to load balance.
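We can simplify Eq. (3) by using Eq. (4) to substitute for
$S_{i+1} - S_i$. That is,
$$\begin{aligned}
\sum_{i=0}^{l-1} (S_{i+1} - S_i)\bigl(a \cdot g(S_i) + b\bigr)
&= a S_1 g(S_0) + a \sum_{i=1}^{l-1} (S_{i+1} - S_i)\, g(S_i) + b \sum_{i=0}^{l-1} (S_{i+1} - S_i) \\
&= a S_1 (m+1) + a \sum_{i=1}^{l-1} \left[ \frac{n}{m} \bigl(g(S_{i-1}) - g(S_i)\bigr) - \frac{c}{a}\, g(S_i) \right] + b S_l \\
&\approx a S_1 (m+1) + a n - c \sum_{i=1}^{l-1} g(S_i) + b S_l .
\end{aligned}$$
Thus, the expected time for Phases 1 and 3 is:
$$T_{P1+P3}(S_1, \ldots, S_l) \approx a n + b\, \frac{n}{m} \ln m + (a S_1 + c + e)(m+1) + l d + f,$$
since $\mathrm{Exp}(S_l) \approx (n/m) \ln m$.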
Because the time for the second phase is no worse than
the serial time (44 cycles/vertex) and given the timing data
in Section 3, the expected number of clock cycles on a
one-processor Cray C90 for the algorithm is:
$$T(n) \approx 8n + 62\, \frac{n}{m} \ln m + (8 S_1 + 96)(m+1) + 2150\, l + 2750, \qquad (5)$$
where m+1 is the number of sublists, $S_1$ is the number of
links traversed before load balancing the first time, and l is the
number of times load balancing is done, which is a function of
$S_1$, m, and n.
4.4. Overall Vector Performance
From the previous discussion we have a way to determine
when we should load balance, as long as we know the length
of the whole linked list, n, the number of sublists, m, and the
number of links traversed before load balancing the first
time, $S_1$. However, all we know is the length of the whole
linked list, namely n. We need to find good choices for m
and $S_1$. Our approach is to estimate the running time of the
algorithm using Eq. (3) for various values of m, $S_1$, and n,
and use Eq. (4) to determine the corresponding $S_2, \ldots, S_l$
values. Then, for each value of n we find values of m and $S_1$
that minimize the running time to within about two percent.
Finally, we fit functions to m vs. n and $S_1$ vs. n. It appears
that m and $S_1$ are approximately cubic polynomials of log n.
We use these fitted polylog functions in our implementation
to determine m and $S_1$ given n, and use Eq. (4) to find
successive values of $S_i$. When we use these estimates of m
and $S_1$, we find that Eq. (3) accurately predicts and Eq. (5)
overestimates the actual execution time on one Cray C90
vector processor.
5. VECTOR MULTIPROCESSOR LIST SCAN
To extend the algorithm to multiple vector processors we
divide the virtual processors equally among the physical
processors and let vectorization proceed on the data
assigned to the physical processors. The Cray C compiler
makes parallelizing relatively easy. Loops are modified to be
tasked loops with compiler directives that direct the
compiler to divide the loops into equal size chunks, one
chunk per processor, and to vectorize the chunks within the
processors. Ignoring the time for synchronization, from
Eq. (3) we see that the vector-parallel time for the algorithm
is:
$$\begin{aligned}
T(n) &\approx \sum_{i=0}^{l-1} (S_{i+1} - S_i)\left(\frac{a\, g(S_i)}{p} + b\right)
 + \sum_{i=0}^{l-1} \left(\frac{c\, g(S_i)}{p} + d\right) + \frac{e(m+1)}{p} + f + T_{P2}(m+1) \\
&= O\left(\frac{n}{p} + \frac{n}{m} \log m\right), \quad \text{if } m < n/\log n. \qquad (6)
\end{aligned}$$
We derive Eq. (6) using an analysis similar to that used in
the previous section and the fact that the time for Wyllie's algorithm
in Phase 2 is $T_{P2} = O(n/p + \log n)$ if $m < n/\log n$. From
Eq. (6) we get Theorem 1.
FIG. 11. Execution times per vertex on 1, 2, 4, and 8 processors of the
Cray C90 for our list-scan algorithm.
Data parallel algorithms assume that each data parallel
step is synchronized, whether or not it is necessary. However,
the most we need to synchronize is after every innermost
vectorized loop. If we treat the vector multiprocessor
strictly as an $l \times p$ multiprocessor then, in particular, we need
to synchronize each time we load balance in Phases 1 and
3 so that load balancing can proceed globally across the
physical processors. Instead we assign virtual processors to
physical processors once at the beginning and only load
balance locally within each physical processor. In this
way each processor completes all of Phase 1 and Phase 3
independently of the other processors. In effect, we
synchronize only a constant number of times and do no load
balancing across processors. Eliminating synchronization
avoids needless delays at each synchronization point.
Avoiding global load balancing across processors is impor-
tant because most compilers do not know how to parallelize
a pack operation across processors and global load balancing
requires extra communication.
Because we use randomization, we do not expect a signifi-
cant load imbalance when we only load balance locally.
Even if an imbalance should become a problem as the
procedure progresses, only one across-processor load
balance should be necessary. Our results are excellent
without any global load balancing.
Unfortunately, to tune the parameters m and $S_1$ we need
to minimize for every possible number of processors. For
highly or massively parallel machines, tuning the parameters
for every number of processors would not be practical. But
since the Cray C90 has 16 processors there are only 16 sets
of equations. We tuned the parameters for 1, 2, 4, and 8
processors, resulting in the list-scan asymptotic performance
of 7.4, 3.9, 2.0, and 1.1 clock cycles per vertex, respectively,
where a clock cycle is 4.2 nsec. The asymptotic times for list
ranking are 5.1, 2.6, 1.4, and 0.75 clock cycles per vertex,
respectively. Fig. 11 shows a graph of the list scan execution
times in nanoseconds.
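For orientation, the cycle counts above convert to time per vertex and to speedup as follows, using only the figures quoted in the text (4.2 ns per clock cycle):

```latex
\begin{aligned}
\text{list scan, 1 processor:}  \quad & 7.4 \times 4.2\ \text{ns} = 31.08\ \text{ns per vertex} \\
\text{list scan, 8 processors:} \quad & 1.1 \times 4.2\ \text{ns} = 4.62\ \text{ns per vertex},
  \qquad \text{speedup } 7.4 / 1.1 \approx 6.7 \\
\text{list ranking, 1 processor:}  \quad & 5.1 \times 4.2\ \text{ns} = 21.42\ \text{ns per vertex} \\
\text{list ranking, 8 processors:} \quad & 0.75 \times 4.2\ \text{ns} = 3.15\ \text{ns per vertex},
  \qquad \text{speedup } 5.1 / 0.75 = 6.8
\end{aligned}
```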
6. OTHER WORK EFFICIENT LIST-RANKING
ALGORITHMS
On one Cray C90 vector processor our algorithm takes
about 7.4 clock cycles per vertex asymptotically to find the
scan of a linked list. If other algorithms are to be
competitive, they must be able to use no more than 7.4
cycles per vertex on average. Below we discuss some other
deterministic algorithms that have been described in the
literature. Except for Wyllie's pointer jumping algorithm on
short linked lists, we conclude that other algorithms are
unlikely to be competitive.
Cole and Vishkin devised a parallel deterministic coin
tossing technique [6] which they used to develop an
optimal deterministic parallel list-ranking algorithm [7, 4].
This algorithm breaks the linked list into sublists
two or three vertices long (the heads of the sublists are called
2-ruling sets); reduces the sublists to single vertices; and
then compacts these single vertices into contiguous memory
to create a new linked list. It recursively applies the
algorithm to the new linked list until the resulting linked list
is shorter than $n/\log n$, at which point it applies Wyllie's
algorithm. In the final phase it reconstructs the linked list by
unraveling the recursion in the first phase in order to fill in
the rank values of the removed vertices. The algorithm runs
in O(log n log log n) parallel time and uses O(n) steps. Later
they modified their algorithm to give an O(log n) time
optimal deterministic algorithm [9, 34]. The problem is that
algorithms for finding 2-ruling sets that give either of
these time bounds are quite complicated and have large
constants.
Cole and Vishkin also developed the first O(log n) optimal
deterministic list-ranking algorithm based on assigning
processors to jobs and using expander graphs [8]. Its
constants are far too large to be practical. In addition, they
give a much simpler 2-ruling set algorithm that is not work
efficient but has smaller constants (see [4]). Because it is
not work efficient and its constants are larger than Wyllie's
or ours, we chose not to implement it.
Anderson and Miller combined their randomized
algorithm with the Cole-Vishkin deterministic coin tossing to
get an optimal O(log n) time deterministic list-ranking
algorithm [2]. As with their randomized algorithm, the
processors are each assigned log n vertices to process. On each
round each processor executes a case statement that either
breaks symmetry, splices out a vertex in its queue, or splices
out a vertex in another processor's queue. To break
symmetry it finds a log log n ruling set. Finding log log n
ruling sets is much simpler (O(1) time) than finding
2-ruling sets (O(log n) time with large constants). But
because each round involves a nonparallel three-way case
statement, where each case needs to be completed by all the
processors before the next case can be executed, its constants
are also much larger than ours.
The problem with these work-efficient algorithms is that
on each of the O(log n) rounds, they attempt to find a large
number, at least $n/\log n$ and as many as $n/2$, of nonadjacent
elements in the linked list in order to maximize the
parallelism in the algorithm. In contrast our algorithm has
only a few rounds and finds only a moderate number of such
elements each round. Thus, we greatly reduce the overhead
associated with symmetry breaking.
7. CONCLUSIONS AND FUTURE DIRECTIONS
In this paper we described a new parallel algorithm for
list ranking and list scan and its implementation. The
implementation is surprisingly fast, especially when com-
pared with the limited speedups of other implementations of
pointer-based algorithms. The key to the success is twofold.
First, we were willing to sacrifice fast asymptotic times for
reduced parallel work, only a small constant factor greater
than the serial work. By reducing the constants in the work,
our algorithm performs better than other work-efficient
algorithms, which have larger constants. Second, we used
hardware with the highest-performance memory system
available, and programmed it with virtual processing to
hide latency. Because list ranking is so memory bound, its
performance is directly related to the bandwidth of the
memory system. Even though vector supercomputers,
such as the Cray C90, are becoming less common, there is
evidence that multiprocessor systems are moving to higher
bandwidths and reduced overhead. Both small constants
and high bandwidth are necessary for good performance.
High bandwidth is not sufficient because algorithms
with even moderate size constants are not much better
than the serial implementation, as shown in Fig. 1. Small
constants are not sufficient because reduced bandwidth
results in longer parallel times, as indicated in [23] and seen
in the reduced speedup of our algorithm for larger numbers
of processors (see Fig. 3).
As with any implementation, there are a multitude of
possible modifications and enhancements that could
improve its performance. A large part of the performance
loss is due to short vector lengths. As sublists drop out of the
computation the vector lengths shorten. Not only are the
vector lengths short, the number of iterations remaining
with short vector lengths can be relatively large, since the
longest sublists can be much longer than the other sublists.
Short vectors are inefficient because of the latency due to
filling the vector pipelines. John Reif suggested we use over-
sampling to further subdivide the remaining long sublists
when the vector lengths become short. The cost, however, of
maintaining which subdivisions remain relevant would slow
down the two major list-scan loops of the algorithm and
likely slow down the overall performance.
Finally, the question still remains whether having a fast
list-ranking implementation helps in making other pointer-
based applications practical. If so, we may have opened
up major classes of PRAM algorithms that can have
reasonable implementations. In addition, it also would be
interesting to see whether our analysis and techniques can
be applied to other pointer-based algorithms.
ACKNOWLEDGMENTS
Guy Blelloch originated the basic list-ranking algorithm presented here
and suggested that I implement it on the Cray C90. I thank him for his
continued support and numerous insightful conversations. In addition,
I thank Marco Zagha who helped me understand the Cray Assembly
Language code produced by the Cray C compiler, Alan Frieze who noted
that the sublist lengths approximately followed an exponential
distribution, Gary Miller for many helpful conversations, and Marco Zagha,
John Greiner, and the referees for their careful and detailed reviews of the
paper.
REFERENCES
1. K. Abrahamson, N. Dadoun, D. G. Kirkpatrick, and T. Przytycka, A simple parallel tree contraction algorithm, J. Algorithms 10, No. 2 (1989), 287-302.
2. R. Anderson and G. L. Miller, Deterministic parallel list ranking, in "VLSI Algorithms and Architectures: 3rd Aegean Workshop on Computing, AWOC 88" (J. H. Reif, Ed.), pp. 81-90, Lecture Notes in Computer Science, Vol. 319, Springer-Verlag, New York/Berlin, 1988.
3. R. Anderson and G. L. Miller, A simple randomized parallel algorithm for list-ranking, Inform. Process. Lett. 33, No. 5 (1990), 269-273.
4. S. Baase, Introduction to parallel connectivity, list ranking, and Euler tour techniques, in "Synthesis of Parallel Algorithms" (J. Reif, Ed.), pp. 61-114, Morgan Kaufmann, 1993.
5. G. E. Blelloch, S. Chatterjee, and M. Zagha, Solving linear recurrences with loop raking, in "Proceedings Sixth International Parallel Processing Symposium, Mar. 1992," pp. 416-424.
6. R. Cole and U. Vishkin, Deterministic coin tossing and accelerating cascades: Micro and macro techniques for designing parallel algorithms, in "Proceedings ACM Symposium on Theory of Computing, 1986," pp. 206-219.
7. R. Cole and U. Vishkin, Deterministic coin tossing with applications to optimal parallel list ranking, Inform. and Control 70, No. 1 (1986), 31-53.
8. R. Cole and U. Vishkin, Approximate parallel scheduling. I. The basic technique with applications to optimal parallel list ranking in logarithmic time, SIAM J. Comput. 17, No. 1 (1988), 128-142.
9. R. Cole and U. Vishkin, Faster optimal parallel prefix sums and list ranking, Inform. and Comput. 81, No. 3 (1989), 334-352.
10. W. Feller, "An Introduction to Probability Theory and Its Applications," Vol. 2, Wiley Series in Probability and Mathematical Statistics, Wiley, New York, 1971.
11. H. Gazit, Optimal EREW parallel algorithms for connectivity, ear decomposition and st-number of planar graphs, in "Proceedings International Conference on Parallel Processing, Aug. 1991," pp. 84-91.
12. H. Gazit, G. L. Miller, and S.-H. Teng, Optimal tree contraction in the EREW model, in "Concurrent Computations" (S. K. Tewksbury, B. W. Dickinson, and S. C. Schwartz, Eds.), pp. 139-156, Plenum, New York, 1988.
13. J. Greiner, A comparison of data-parallel algorithms for connected components, in "Proceedings Sixth Annual Symposium on Parallel Algorithms and Architectures, Cape May, NJ, June 1994," pp. 16-25.
14. A. R. Hainline, S. R. Thompson, and L. L. Halcomb, Vector performance estimation for Cray X-MP/Y-MP supercomputers, J. Supercomputing 6 (1992), 49-70.
15. W. D. Hillis and G. L. Steele Jr., Data parallel algorithms, Comm. ACM 29, No. 12 (Dec. 1986), 1170-1183.
16. R. W. Hockney, Characterization of parallel computers and algorithms, Comput. Phys. Comm. 26 (1982), 285-291.
17. T.-s. Hsu and V. Ramachandran, Efficient implementation of virtual processing for some combinatorial algorithms on the MasPar MP-1, in "Proceedings of the Seventh IEEE Symposium on Parallel and Distributed Computing, San Antonio, TX, Oct. 1995," pp. 154-159.
18. T.-s. Hsu, V. Ramachandran, and N. Dean, Implementation of parallel graph algorithms on the MasPar, in "Computational Support for Discrete Mathematics," DIMACS Series in Discrete Mathematics and Theoretical Computer Science, Vol. 15, pp. 165-198, Amer. Math. Soc., 1994; also available as TR-92-38, Dept. of Comp. Sci., University of Texas at Austin, Feb. 1992.
19. T.-s. Hsu, V. Ramachandran, and N. Dean, Implementation of parallel graph algorithms on a massively parallel SIMD computer with virtual processing, in "Proceedings of the International Parallel Processing Symposium, Santa Barbara, CA, Apr. 1995," pp. 106-112.
20. S. R. Kosaraju and A. L. Delcher, Optimal parallel evaluation of tree-structured computation by ranking (extended abstract), in "VLSI Algorithms and Architectures: 3rd Aegean Workshop on Computing, AWOC 88" (J. H. Reif, Ed.), pp. 101-110, Lecture Notes in Computer Science, Vol. 319, Springer-Verlag, New York/Berlin, June/July 1988.
21. C. P. Kruskal, L. Rudolph, and M. Snir, Efficient parallel algorithms for graph problems, Algorithmica 5, No. 1 (1990), 43-64.
22. S. S. Lumetta, A. Krishnamurthy, and D. E. Culler, Towards modeling the performance of a fast connected components algorithm on parallel machines, in "Proceedings Supercomputing '95," 1995.
23. Y. Mansour, N. Nisan, and U. Vishkin, Trade-offs between communication throughput and parallel time, in "Proceedings ACM Symposium on Theory of Computing, Montreal, Quebec, May 1994," pp. 372-381.
24. K. Mehlhorn and U. Vishkin, Randomized and deterministic simulations of PRAMs by parallel machines with restricted granularity of parallel memory, Acta Inform. 21 (1984), 339-374.
25. G. L. Miller and J. H. Reif, Parallel tree contraction and its application, in "Proceedings Symposium on Foundations of Computer Science, Oct. 1985," pp. 478-489.
26. G. L. Miller and J. H. Reif, Parallel tree contraction. 2. Further applications, SIAM J. Computing 20, No. 6 (Dec. 1991), 1128-1147.
27. P. Narayanan, Single source shortest path problem on processor arrays, in "Proceedings Frontiers of Massively Parallel Computation, McLean, VA, Oct. 1992," pp. 553-556.
28. D. A. Padua and M. Wolfe, Advanced compiler optimizations for supercomputers, Comm. ACM 29, No. 12 (Dec. 1986), 1184-1201.
29. V. Ramachandran, Parallel open ear decomposition with applications to graph biconnectivity and triconnectivity, in "Synthesis of Parallel Algorithms" (J. Reif, Ed.), pp. 275-340, Morgan Kaufmann, 1993.
30. M. Reid-Miller and G. E. Blelloch, "List Ranking and List Scan on the Cray C-90," Tech. Rep. CMU-CS-94-101, School of Computer Science, Carnegie Mellon University, Feb. 1994.
31. M. Reid-Miller, G. L. Miller, and F. Modugno, List ranking and parallel tree contraction, in "Synthesis of Parallel Algorithms" (J. Reif, Ed.), pp. 115-194, Morgan Kaufmann, 1993.
32. B. Schieber, Parallel lowest common ancestor computation, in "Synthesis of Parallel Algorithms" (J. Reif, Ed.), pp. 259-274, Morgan Kaufmann, 1993.
33. L. G. Valiant, A bridging model for parallel computation, Comm. ACM 33, No. 8 (1990), 103-111.
34. U. Vishkin, Advanced parallel prefix-sums, list ranking and connectivity, in "Synthesis of Parallel Algorithms" (J. Reif, Ed.), pp. 215-257, Morgan Kaufmann, 1993.
35. J. C. Wyllie, "The Complexity of Parallel Computations," Tech. Rep. TR-79-387, Department of Computer Science, Cornell University, Ithaca, NY, Aug. 1979.
36. M. Zagha, "Efficient Irregular Computation on Pipelined-Memory Multiprocessors," Ph.D. thesis, School of Computer Science, Carnegie Mellon University, in preparation.
37. M. Zagha and G. E. Blelloch, Radix sort for vector multiprocessors, in "Proceedings Supercomputing '91, Nov. 1991," pp. 712-721.
