Efficient massively parallel implementation of some combinatorial algorithms  by Hsu, Tsan-sheng & Ramachandran, Vijaya
ELSEVIER Theoretical Computer Science 162 (1996) 297-322 
Theoretical 
Computer Science 
Efficient massively parallel implementation 
of some combinatorial algorithms ’
Tsan-sheng Hsu a,*,2, Vijaya Ramachandran b,3 
a Institute of Information Science, Academia Sinica, Nankany 11529, Taipei, Taiwan. ROC 
b Department of Computer Sciences, University of Texas at Austin, Austin, TX 78712, USA 
Abstract 
We describe our implementation of several efficient parallel algorithms on the massively par- 
allel SIMD machine MasPar MP-1 with virtual processing. The MPL language that we used on 
the MasPar MP-1 does not support virtual processing. In this paper, we describe the implemen- 
tation of virtual processing for several combinatorial algorithms using the MPL language. We 
present our data allocation scheme for virtual processing and code rewriting rules for converting 
a code that uses no virtual processors into a code with virtual processing. We then describe the 
implementation of virtual processing and the fine-tuning of a set of commonly used routines. In 
coding these routines, we tried different underlying (deterministic and randomized) algorithms. 
We present the performance data for our different implementations. We also compared the perfor- 
mance of several of the parallel routines with their sequential implementations. The performance 
of our code tracks theoretical predictions quite well for the range of values for virtual processing, 
that we tested. 
We used techniques presented in this paper to convert non-virtual processing code for undi-- 
rected graph algorithms into virtual processing code. Our experimental data suggests that by using 
our techniques, one can implement parallel algorithms with virtual processing quite effectiveI> 
on the MasPar MP-1 using the MPL language. 
1. Introduction 
This paper describes an ongoing project for implementing parallel algorithms on 
the massively parallel single-instruction-multiple-data (SIMD) machine MasPar MP- 1. 
There are important problems that can be solved on massively parallel SIMD machines 
* Corresponding author. 
’ Supported by NSF Grant CCR-90-23059 and Texas Advanced Research Projects Grant 003658480. A 
preliminary version of this paper appears in Proc. 7th IEEE Symp. on Parallel and Distributed Processing 
(1995) 15&159. This work is also reported in T.-s. Hsu’s Ph.D. Thesis [18]. 
‘Also supported by an IBM graduate fellowship. Part of this work was done while this author was with 
Department of Computer Sciences, University of Texas at Austin, 
3 Also supported by Texas Advanced Research Projects Grant 0003658386. 
SO304-3975/96/$15.00 @ 1996-Elsevier Science B.V. All rights reserved 
PZI SO304-3975(96)00034-5 
298 T-s. Hsu, K Ramachandranl Theoretical Computer Science 162 (1996) 297-322 
with virtual processing [4,5,13,15,22,27,28,39,40]. In [ 191, we reported the imple- 
mentation of several parallel graph algorithms on the MasPar MP-1 using the parallel 
language MPL [30,3 l] which is an extension of the C language [25]. The MPL lan- 
guage provides an efficient way of using the MasPar MP-1. However, it requires the 
user to specify the physical organization of the processors used in the program, and 
provides no support for virtual processing. As a result, our implementation in [ 191 
could only handle the case when the size of the input is no more than the number of 
physical processors. 
In this paper, we describe our implementation of virtual processing using MPL. Our 
current implementation includes a set of coding techniques for writing code with vir- 
tual processing and fine-tuned routines for several primitives such as (i) exchanging 
data between neighboring virtual processors, (ii) the SCUM routine [26] with several 
combining operators, (iii) routing routines, (iv) list ranking routines [ll], (v) routines 
for ordering data, (vi) concurrent write routines with several combining operators, and 
(vii) range minima routines (see e.g. [23, Section 3.4.31). For many of our primitives, 
we implemented several different algorithms and we present a comparison of their per- 
formances. Previous work on implementing commonly used parallel primitives includes 
work reported in [7,8,16,17,35,38,41,43,44]. 
The rest of the paper is organized as follows. Section 2 gives a high-level descrip- 
tion of our implementation. Section 3 describes the programming environment on the 
MasPar MP-1. Section 4 describes general issues for coding virtual processing code. 
Section 5 describes the implementation of parallel primitives and optimization tech- 
niques used, and presents performance results. Section 6 gives a comparison between 
our parallel implementation of primitives and our sequential implementation. Section 7 
presents performance data for the parallel graph algorithms library that we have imple- 
mented using the approach provided here. Section 8 provides some concluding remarks. 
2. High-level description of our implementation 
We used the following techniques for implementing parallel algorithms with virtual 
processing using the MPL language. First, we implemented a parallel algorithm using 
the MPL language assuming the size of the input is no more than the number of 
available physical processors, i.e., without virtual processing. We then defined a virtual 
machine according to the number of virtual processors that were needed, based on 
the size of the input. We mapped this virtual machine on to the real machine by 
evenly distributing the data allocated on virtual processors among available physical 
processors. 
After properly allocating data (using Rewriting Rule l(i) which will be given in Sec- 
tion 4), we implemented our code with virtual processors using the following method. 
We translated the original non-virtual processing code line by line (except system and 
library routines) by using rewriting rules given in Section 4 to handle virtual process- 
ing. Statements for performing arithmetic and logic operations on the data allocated 
T.-s. Hsu. V. Ramachandranl Theoretical Computer Science 162 (1996) 297-322 299 
among virtual processors were transformed into several statements executed on physi- 
cal processors (Rewriting Rule 2). Conditional branching statements were transformed 
using a similar strategy (Rewriting Rule 3). To handle procedure call, we maintained 
an active flag for each virtual processor (Rewriting Rule l(ii)). 
After translating our code using the rewriting rules, we replaced the system rou- 
tines used in the code. (These system routines were provided by MPL without virtual 
processing.) We implemented virtual processing and fine-tuned these system routines 
and a set of other commonly used routines. In our implementation, we used various 
techniques to optimize our code, and also tried different underlying algorithms. 
For the rest of this paper, we use the term non-up code to denote the code that uses 
no virtual processing scheme. 
3. Programming environment 
The MasPar MP-1 [6,32,34] is a fine-grained massively parallel SIMD computer. 
All of its parallel processors synchronously execute the same instruction at the same 
time. The MasPar MP-1 that we used has 16 384 processors. Each physical processor 
(PE) consists of a 4-bit CPU with 64 kbyte of main memory and a unique ID ranging 
from 0 to 16 383. Through the emulation of microcode, each PE can perform opera- 
tions on g-bit, 16-bit, 32-bit, and 64-bit data. There are two types of inter-processor 
communication available. First, PE’s are organized as a 2-D wrap-around mesh. Each 
PE can communicate with its 8 nearest neighbors. Second, 16 PE’s (a 4 x 4 sub-mesh) 
are grouped into a cluster. The 1024 clusters are organized as a 10-D hypercube with 
each cluster representing a node in the hypercube. Processors can communicate with 
each other by using the hypercube connection. Mesh communications are about 200 
times faster than global routing requests for transmitting 32-bit data [31,37]. The Mas- 
Par MP-1 also has an array control unit (ACU) that controls the PE’s and executes 
sequential instructions. The MasPar MP-1 PE is a very slow processor. For comparison, 
SUN SPARC 10141 is more than 230 times faster than an MP-1 PE. The ACU is at 
least 10 times faster than an MP-1 PE. 
We used the MPL high-level programming language for coding our programs. The 
current version of the MPL language [30,3 1,361 is an extension of the ANSI C lan- 
guage [25] with data parallel constructs and a library of parallel primitives. In addition 
to all of the standard C language features, it allows the user to program a set of PE’s 
to execute the same instruction on their own local data. MPL leaves the responsibility 
of processor allocation to the programmer. Using MPL, we can declare a variable with 
the attribute plural. A plural variable has a local copy (possibly with different values) 
in each PE, while a variable without the plural attribute has only one copy in the 
ACU. During each computation step, each PE decides whether or not to participate in 
the current computation, based on the value of a local active flag. Any expression that 
involves a plural variable is computed on each active PE. An expression that contains 
no plural variables is computed on the ACU. It is important to note that the current 
300 T.-s. Hsu, V. RamachandranlTheoretical Computer Science 162 (1996) 297-322 
version of the MPL language does not support the usage of virtual processors. We had 
to design and implement our own virtual processing scheme. 
4. General coding issues for virtual processing 
We first handled the allocation of data and virtual processors on to the physical 
processors. Then we incorporated virtual processing into computation. The MPL pro- 
gramming language provides two ways of performing SIMD parallel computations. 
First, the MPL language provides (without virtual processing) basic arithmetic and 
logic operations among plural data, conditional branching statements, and procedure 
call statements for performing parallel computations. In this section, we discuss the 
rules for rewriting these statements to handle virtual processing. Second, MPL also 
provides subroutines for some fundamental parallel primitives such as computing the 
prefix sums of an array, sorting and interprocessor communication (again without vir- 
tual processing). In Section 5, we will discuss how to incorporate virtual processing 
into these system-provided subroutines and the commonly used routines given in [ 191. 
4.1. Mapping of virtual processors and data allocation 
Let nproc be the number of physical processors. For the MasPar MP-1 we used, 
nproc = 16 384, and these processors are organized as an nxproc x nyproc mesh, 
where nxproc = nyproc = 128. In our programs, each virtual processor (or VPE) is 
given a unique ID ranging from 0 to vnproc - 1, where vnproc is the number of virtual 
processors needed. We let the number of virtual processors per physical processor be 
vpr = [vnproc/nproc], and used VP = vpr.nproc virtual processors for the computation 
(note that VP can be larger than vnproc). The VP virtual processors are arranged in a 2- 
D vnxproc x vnyproc mesh. Under the above mapping scheme, each physical processor 
simulates a roughly equal number of the vnproc virtual processors needed, except that 
the last VP-vnproc virtual processors have their local active flag set to 0 throughout 
the computation. 
For our implementation, we used the so-called hierarchical partitioning scheme [29]. 
Each physical processor simulated a vpr x 1 sub-mesh of virtual processors. Thus given 
an nxproc x nyproc 2-D mesh, the virtual machine being simulated is an (nxproc . 
vpr) x nyproc 2-D mesh. (This is the same mapping scheme as the one used in the 
implementation of bitonic sort with virtual processing [38].) The reason for our choice 
is that, in our implementation, we frequently need to use operations that utilize locality 
of data, and this locality is preserved by our data partitioning scheme. 
Once our code decided the vpr value, we transformed the plural variables used in a 
non-vp code into a plural array of vpr elements using the following rewriting rule to 
handle virtual processing. 
Rewriting Rule 1. (i) Each plural variable allocated in an MPL non-vp code is trans- 
formed into a plural array of vpr elements in our new code with virtual processing. 
T.-S. Hsu, V. RamachandranITheoretical Computer Science 16.2 (1996) 297-322 301 
Program 1. MPL code segments for data allocation (Rewriting Rule I). The variable a is plural and the 
variable b is singular. A flag active is allocated to each virtual processor in our code. 
The ith element in the jth physical processor corresponds to the local copy of virtual 
processor (j - 1). vpr + i. Variables used in an A4PL non-vp code without the plural 
attribute are not changed in the new code. 
(ii) An extra flag (called active) in each virtual processor is allocated in the new 
code to indicate whether its corresponding virtual processor is active during each step 
of computation. 
Thus given a plural variable data and a VPE with ID w, the local copy of data 
is stored in the (w mod vpr)th element of the local array data in the PE with ID 
Lw/vpr] . An example is shown in Program 1. 
4.2. Arithmetic and logic operations 
To translate arithmetic operations that use plural variables to handle virtual process- 
ing, we transformed each statement involving computations on virtual processors to 
statements involving computations on physical processors. We used the following rule 
to rewrite statements containing only arithmetic and logic operations. 
Rewriting Rule 2. An A4PL statement that contains only arithmetic and logic op- 
erations can be simulated with virtual processing by vpr sub-steps using physical 
processors. In sub-step i, physical processor PEj computes the value for virtual pro- 
cessor VPEcj_1).“pr+(i_1). The transformed code puts a for loop around the original 
statement and converts the plural variables into their array format as described in 
Rewriting Rule l(i) (Section 4.1). A group of statements can be grouped into a for 
loop as long as each of them consists of only arithmetic and logic operations. 
An example is shown in Program 2. 
4.3. Conditional branching statements and procedure calls 
An MPL conditional branching statement consists of a conditional test on the value 
of an expression ?I, a list of statements that are executed if the value of 4 is true and 
an optional list of statements that are executed otherwise. If there is no procedure call 
in the conditional branching statement, we can rewrite the statement to handle virtual 
302 T.-S. Hsu, i? Ramachandran I Theoretical Computer Science 162 (1996) 297-322 
/* without virtual processing */ 
plural int a, b, sum, p; 
int c; 
p=a*b; 
sum = p + c; 
W)) 
i* with virtual processing *I 
plural int a[vpr], b[vpr], sum[upr], p[vpr]; 
int c; 
register int i; 
for (i = 0; i < upr; i++){ 
p[i] = a[i] * b[i]; 
sum[i] = p[i] + c; 
P(b)) 
Program 2. MPL code segments for arithmetic and logic operations (Rewriting Rule 2). 
processing using Rewriting Rule 2. Otherwise, additional work is needed to take care 
of procedure calls within an if statement. The details are in the following rewriting 
rule. 
Rewriting Rule 3. A sub-statement inside the body of an if statement hat does not 
call any subroutine can be rewritten using Rewriting Rule 2. Let n be the expres- 
sion that computes the conditional value. If procedure calls are used in the ex- 
pression y, then each such procedure call should be evaluated before the if state- 
ment. The results of the procedure calls are saved in temporary variables which 
are then used to evaluate n. If procedure calls are used inside the body of an 
if statement, we use the active flag, defined in Rewriting Rule l(ii) (Section 4.1), 
to store the information of which processors are active. Before changing the active 
flag for subroutine calls, the current value of the active flag is saved into a stack. 
We restore the value of the active flag from the saved stack after the subroutine 
call. 
We show an example in Program 3 
T.-s. Hsu, F Ramachandranl Theoretical Computer Science 162 (1996) 297-322 303 
I* without virtual processing *I 
plural int a; 
int b, c; 
if (a > 0) 
b = reduceAdd32(a); 
else { 
c = reduceAdd32(a); 
c = -c; 
1 
Ma)) 
I* with virtual processing *I 
plural int a[upv]; 
int b, c; 
plural char active[yu]; 
code for saving the current “active” values (omitted) 
{ register int i; 
for (i = 0; i < vpr; i++) 
if (u[i] > 0 and latest saved value of uctiue[i] is 1) 
uctive[i] = 1; 
else uctive[i] = 0; 
b = vreduceAdd32(u, active); 
{ register int i; 
for (i = 0; i < upr; i++) 
if (uctive[i] == 0 and latest saved value of uctive[i] is 1) 
uctive[i] = 1; 
else uctive[i] = 0; 
c = vreduceAdd32(u, active); 
c = -c; 
code for retrieving the previous “active” values (omitted) 
Mb)) 
Program 3. MPL code segment for a conditional branching statement (Rewriting Rule 3). The MPL library 
routine reduceAdd32(a) returns the summation of all a values in active processors. After the execution of 
the code in Program 3(a), b contains the summation of all local values of a’s that are positive. The variable 
c contains the absolute value of the summation of all local values of a’s that were not positive. The rewritten 
code with virtual processing is shown in Program 3(b). 
304 T.-s. Hsu, V. Ramachandranl Theoretical Computer Science 162 (1996) 297-322 
5. Implementation of parallel primitives 
In this section, we describe our implementation of virtual processing for a set of 
library subroutines. The library routines include several parallel primitives provided 
by the system and several parallel primitives in the graph algorithms kernel given in 
[19,21]. 
We classify the parallel primitives into three different categories according to the 
method we used to implement them with virtual processing. They will be discussed 
in detail in the following subsections. Every routine in each category was fine-tuned 
for speed and small memory space usage. For some of the routines, we tried different 
implementations, and performance data for all of them are reported. We often found 
that there is a trade-off between the running time and the amount of additional memory 
used. One can choose the implementation that best suits one’s needs. 
5.1. Category 1: locality preserving routines 
In this category are those parallel primitives that could be implemented with virtual 
processing by exactly one execution of their corresponding original non-vp code and 
some local computations performed in each physical processor. The following parallel 
primitives are in this category. 
_ The neighbor operator which retrieves data from the processor whose ID is adjacent 
to the ID of the current processor. This operator is described in the kernel routines 
given in [19]. For example, the kernel routine Lneighbor32 retrieves a 32-bit integer 
from processor i - 1 for each processor i, i > 0. We implemented this operator 
without virtual processing in the work described in [19] by using two MPL system 
mesh-communication operations (i.e., xnet). 
- The reduce operator which applies an associative operator on a plural variable and 
outputs its result. For example, the MPL library routine reduceAdd 
returns the summation of 32-bit integers in each active processor. This routine has 
been provided in the system library without virtual processing. 
- The segmented rotation operator which rotates data within a set of processors with 
continuous ID’s. A set of processors with continuous ID’s is called a segment of 
processors. Let the next processor to the right of the processor with ID i in a 
segment of processors be the processor whose ID is i + 1 if i is not the largest 
ID in the segment; otherwise its next processor to the right is the processor in 
the segment with the least ID. This operator is described in the kernel routines 
given in [19]. For example, the kernel routine segRshift32 rotates a 32-bit integer 
in each active processor to the processor on the right in the same segment. We 
implemented this operator without virtual processing in the work described in [ 191 
by using several MPL mesh-communication operations and global routings (i.e., 
router). 
- The scan operator which applies prefix summing using a given associative operator 
on a plural variable. For example, the MPL library routine scanAdd computes all 
T.-s. Hsu, V. Ramachandranl Theoretical Computer Science I62 (1996) 297-322 305 
/* return the summation of a values in all active VPE’s */ 
int vreduceAdd32(a, active) 
plural int a[upr]; 
plural char active[vpr]; 
1 
plural int tsum; 
register int i; 
/* compute the partial sum in each physical PE */ 
tsum = 0; 
for (i = 0; i < vpr; i++) 
if (actiue[i]) tsum = tsum + a[i]; 
/* compute the summation of all partial sums */ 
return (reduceAdd32(tsum)); 
> 
Program 4. MPL Code for adding an array of 32-bit integers with virtual processing (Category 1 ). 
Table 1 
Performance data for performing four basic parallel primitives on 32-bit integers. The “ratio” columns give 
the ratio of the time used by the routines using cpr virtual processors per physical processor to the time 
used by the non-vp routines. 
Lneighbor32 
Time Ratio 
(ms) 
reduceAdd 
Time Ratio 
(ms) 
segRshift32 
Time Ratio 
(ms) 
scanAdd 
Time Ratio 
(ms) 
w/o VPE 0.017 0.068 0.747 0.934 
UP’ 
1 
2 
4 
8 
16 
32 
64 
128 
0.034 2.0 0.116 1.7 1.063 1.4 1.079 1.2 
0.035 2.1 0.153 2.3 1.146 1.5 1.145 1.2 
0.059 3.5 0.227 3.3 1.379 1.8 I.285 1.4 
0.108 6.4 0.375 5.5 1.881 2.5 1.560 1.7 
0.204 12.0 0.671 9.8 2.882 3.9 2.122 2.3 
0.398 23.4 1.264 18.6 5.163 6.9 3.235 3.5 
0.785 46.2 2.449 36.0 9.777 13.1 5.454 5.8 
1.558 91.6 4.836 71.1 19.087 25.6 9.901 10.6 
of the prefix sums of an array of 32-bit integers for active processors. This routine 
has been provided in the system library without virtual processing. 
Our code for implementing the reduce operator on 32-bit integers is shown in Program 
4, as an example of our code for Category 1. 
We list the performance data for our parallel code for the four representative exam- 
ples in this category in Table 1. 
The running time for operators in this category could be formulated as follows. Let 
t,,(S?, upr) be the time to run the algorithm 9 with virtual processing using p vpr 
VPE’s and let +(?I) be the time to run .S? without virtual processing using p physical 
306 T-s. Hsu, V. Ramachandran I Theoretical Computer Science 162 (1996) 297-322 
PE’s. Then 
where c(2) is the time to initialize and to wrap-up the simulation of virtual processors 
for algorithm 2 and e,(2) is the unit time used in performing local computations in 
the simulation. Since ~~(2) > e,(2) for the MP-1, routines in this category had super- 
linear speedups when using virtual processors (i.e., $,(Z?, vpr)/+,(9) < vpr). In Table 1, 
we also show the ratio between the performance of our current implementation with 
virtual processing and the performance of a previous implementation in [ 191 without 
virtual processing using p = 16 384 physical processors. Note that we achieved super- 
linear speedups on all four routines when vpr 32. 
One may conjecture that the value of +,(Z?) would be smaller if p was smaller. 
Thus the best way to compute algorithm Z? with virtual processing on an input of size 
k is not to use all of the available physical processors. Instead, one should use only 
r physical processors such that c,.(S?,d) plus the time to evenly distribute the data is 
minimized, where k = r. d. However, we found that ~~(2) remained roughly the same 
no matter how many active physical processors were used in the MP-1 for all the 
routines in this category. (This fact was also confirmed by experiments conducted in 
[37].) Thus we were unable to further fine-tune our code using this approach. 
5.2. Category 2: locality insensitive routines 
In this category are those parallel primitives that could be implemented with virtual 
processing by exactly vpr executions of the original non-vp parallel primitive and 
some local computations performed in each physical processor. The following parallel 
primitives are in this category. 
_ The global routing operator. The router operators are used for performing global 
routing. For example, the MPL system routine router[addr]duta returns the value 
of the plural variable data that is stored in the processor with ID addr for each 
active processor. 
- The list ranking operator. Given a linear linked list, each element in the list is 
stored in a processor. For each element, compute the summation of all the data that 
are ahead of it in the linked list. This operator is described in the kernel routines 
given in [19]. For example, the kernel routine listrank32(ptr, data) performs list 
ranking on the plural variable data for each active processor, where ptr specifies 
the location of the next element. 
We describe our implementations of these primitives and their performance data in the 
following subsections. 
5.2.1. Routing 
The MPL language provides an efficient implementation of the routing operator if 
the data needed to be routed is any one of the fundamental data types allowed by 
the MPL language (e.g., not a member of an array element). However, to implement 
T.-s. Hsu, V. Ramachandranl Theoretical Computer Science 162 (1996) 297-322 307 
I* without virtual processing *I 
result = router[addr].duta; 
(5(a)) 
I* with virtual processing *I 
for (i = 0; i < vpr; i++) 
ps_rfetch( &dutu[uddr[i] % vpr], &resuZt[i], sizeof (data)); 
(5(b)) 
Program 5. MPL code segments for routing statements. Note that “%” is the module operator in the MPL 
language and “&” is an unary operator to compute the address of a variable. Note also that the data allocated 
to the VPE with ID addr[i] was stored in the (addr[i] % upr)th element of the plural array data in the 
physical processor with ID Laddr[i]/uprl 
a routing operator using virtual processing, we needed to access different indices of 
a plural array within each physical processor. To do this, we used the MPL system 
routine rfetch(from, st, to, sz) to retrieve sz bytes of data starting from the local 
address st of the physical processor with ID from and store it in the local address to 
for each active processor. We tested two versions of implementation (exclusive read 
version and concurrent read version) using different data distribution and compared 
their performance. We also tested our code on performing permutation routing. 
5.2.1.1. Implementution. An example of our implementation of routing is shown in 
Program 5. In Program 5(a), each processor fetched the data stored in the processor 
with ID uddr and stored it in the local variable result. The size of the data was 
sizeof (data) bytes. (Note that sizeof is an MPL system routine.) To implement this 
statement with virtual processing, we had to access the proper data element allocated 
to each virtual processor in our implementation. We show the implementation of a 
routing statement with virtual processing in Program 5(b). After testing our program, 
we found that the above implementation did not work properly when vpr = 1 because 
the MPL system routine rfetch does not function correctly when its input parameters 
are arrays of size one. Thus most of our programs reported in this paper that used the 
router did not work for vpr = 1. 
5.2.1.2. Further jine-tuning. The performance of a routing statement in the MasPar 
MP-1 is determined by the following three factors: the size of data transmitted by each 
active processor, the number of virtual processors that are active, and the maximum 
degree of concurrency among all physical processors. The first two factors are related 
to the bandwidth of the communication network and the third factor is related to the 
sequentialization of concurrent requests on physical processors. The first two factors 
308 T.-s. Hsu, K RamachandranlTheoretical Computer Science 162 (1996) 297-322 
are determined by the nature of the routing request and are hard to be changed. Though 
the effect of the third factor is not considered in the PRAM model, it is addressed in 
the QRQW PRAM model [14] and can be improved by algorithmic approaches. To 
improve the performance of a routing statement, we could reduce the maximum degree 
of concurrency (i.e., the third factor). There is a well-known PRAM algorithm to sim- 
ulate a concurrent read (write) statement by using sorting and exclusive read (write) 
statements [ 12,24,45]. Although this algorithm does not eliminate all concurrency in 
reading (writing) when virtual processors are used (i.e., reading (writing) to VPE’s 
with different ID’s might be mapped into different locations in the same physical pro- 
cessor), one would expect the maximum degree of concurrency to be greatly reduced. 
Note that we need to pay the overhead of performing a sorting routine to use this 
algorithm. 
We implemented the above algorithm for routing and tested our program when vpr = 
16 while varying the expected degree of concurrency. We also tested Program 5 with 
virtual processing and vpr = 16 and the system non-vp routing routine. Let con be the 
expected degree of concurrency in our testing (i.e., the expected number of physical 
(virtual) processors that read data from a physical (virtual) processor without (with) 
virtual processing). We obtained the desired expected degree of concurrency by draw- 
ing the destination addresses from a system pseudo-random number generator in the 
range from 0 to nproclcon - 1 without virtual processing, and in the range from 0 
to vpr . nproclcon - 1 with virtual processing, where 1 d con Gnproc. Note that it is 
well known (for example, it is fairly easy to derive from [ 1,9]) that with high prob- 
ability the maximum number of requests on any physical processor is some constant 
times con when con > log(nproc/con) without virtual processing and is some constant 
times vpr f con when vpr . con > log(nproc/con) with virtual processing. (Note that we 
use logarithms to base 2 throughout this paper.) Thus con is the expected degree of 
concurrency per virtual processor if it is sufficiently large. 
The performance data is shown in Table 2. Table 2 suggested that the performance of 
the system routine for performing concurrent read on a physical processor did not grow 
linearly in the number of expected concurrent requests. Instead, the running time fell 
into the repeated patterns of growing linearly for some values of con and then staying 
unchanged for other values of con. We also notice that the naive implementation with 
virtual processing shown in Program 5 ran faster than its exclusive read version when 
con was within 128. However, for programs that expect a high degree of concurrency, 
the exclusive read version should be used. 
5.2.1.3. Permutation Routing. We also tested the performance of a permutation routing 
with and without virtual processing. The performance data is shown in Table 3. Note 
that the expected degree of concurrency on a physical processor is proportional to the 
value of vpr when virtual processors are used. Thus when the value of vpr doubled, we 
not only had twice the number of virtual processors, but also had twice the expected 
degree of concurrency on each physical processor. The performance could be 4 times 
as bad when we doubled the value of vpr. However in Table 3, we notice that the cost 
T.-s. Hsu, V. Ramachandranl Theoretical Computer Science 162 (1996) 297-322 309 
Table 2 
Comparison of performance data for reading 32-bit integers from different processors without virtual pro- 
cessing and with virtual processing when opr = 16 on the MasPar MP-1. The number of physical processors 
used is nproc = 16 384. We tested two implementations of inter-processor memory accessing with virtual 
processing. The destination addresses were drawn from a system pseudo-random number generator in the 
range from 0 to nproclcon - 1 without virtual processing, and in the range from 0 to vpr nproclcon - 1 
with virtual processing 
Expected concurrency Time (ms) for routing 32-bit integers 
(con ) No virtual processing 
upr = 16 
Naive implementation 
cpr = 16 
Convert to 
exclusive read 
1 0.45 18.00 236.97 
2 0.67 26.95 239.00 
4 1.13 45.03 244.18 
8 1.56 63.56 238.61 
16 2.86 115.71 237.72 
32 5.33 218.57 236.51 
64 5.44 216.33 232.18 
128 5.49 218.23 230.30 
256 10.59 415.99 233.74 
512 20.60 819.08 229.88 
1024 41.28 1624.66 229.56 
2048 82.94 3250.92 229.30 
4096 144.84 5673.66 23 1.47 
8192 144.84 5675.19 23 1.47 
16384 144.84 5676.73 234.77 
Table 3 
Performance data. for performing a permutation rout- 
ing of 32-bit integers. The “ratio” column gives the 
ratio of the time used by the routine using vpr virtual 
processors per physical processor to the time used by 
the non-vp routine 
Permutation routing 
Time (ms) Ratio 
system (no VPE) 
rpr 
2 
4 
8 
16 
32 
64 
0.45 
2.28 5.07 
4.57 10.16 
9.25 20.56 
18.12 40.27 
36.53 81.18 
72.33 160.73 
for performing a permutation routing was linear in the value of opr instead of being 
quadratic. 
From the data obtained in this section, we can conclude that if the degree of con- 
currency is not too large, it is better to use the naive implementation of routing given 
in Program 5. 
310 7: -s. Hsu, V. Ramachandran I Theoretical Computer Science I62 (1996) 297-322 
5.2.2. List ranking 
Our code without virtual processing for list ranking was for the simple EREW list 
ranking algorithm using pointer jumping [47]. The algorithm runs in logn iterations 
by repeatedly changing the current pointer of each active processor Q to the pointer 
stored in the processor that is currently pointed to by Q. A processor becomes inactive 
when it reaches the end of the list. The MPL code for this algorithm is shown in 
Program 6. 
In rewriting the MPL code for list ranking to handle virtual processing, we noticed 
that we need additional temporary variables to store the intermediate results when 
router or xnet are used in a statement. The MPL compiler treats router and xnet 
as a function that can be used as an expression in a statement when the results of these 
two operators are basic data types (e.g., integer). However, when virtual processors 
are used, the results returned by router and xnet are elements inside an array. As 
described in Section 5.2.1, for technical reasons, the MPL compiler does not allow the 
values returned by the above two operators to be used as an expression. Thus if there 
is a router or xnet operator Q in a statement @‘, we have to compute the results 
returned by Q and store them into temporary variables. Using the temporary variables, 
we then compute the statement @. Thus we expect the amount of memory used by our 
programs with virtual processing to be more than vpr times the amount of memory in 
the programs for non-vp routines. 
5.2.2.1. Pipelined requests. In steps 1 and 2 of Program 6(b), each VPE retrieved 
data and pointer from the same VPE. We observed that, on the MP-1, it is often the 
case that the total time required for two interprocessor communications of el and & 
bytes, respectively, is more than the time required for one interprocessor communication 
of ei + e2 bytes. Since each VPE read from the same VPE in steps 1 and 2, we 
could pipeline these two requests. By doing this, we were able to save time, though 
programming pipelined operations required additional temporary space to pack data 
together. We list the performance of our list ranking algorithms (both pipelined and 
non-pipelined) in Table 4. 
We noticed that the usage of pipelined instructions made our program run about 
1.6 times faster than the non-pipelined version. Since the amount of temporary space 
used in implementing pipelined requests was relatively small, we decided to use the 
pipelined version for our code. 
Note that we did not implement the optimal deterministic O(log n)-time O(n/logn)- 
processor list ranking algorithm presented in [2, IO], where n is the number of ele- 
ments in the list. This is because we felt that the coding would be tedious and the 
overhead would be too large. Let nproc be the number of physical processors. If 
nproc an, the simple deterministic algorithm is optimal and should run faster than 
the algorithms described in [2, lo] because of simplicity. We expect the simple de- 
terministic algorithm to run slower when II is much larger than nproc. However, we 
did implement two randomized optimal list ranking algorithms which we describe 
below. 
T-s. Hsu. V. Ramachandran I Theoretical Computer Science 162 (1996) 297-322 311 
/* without virtual processing *I 
/* Perform list ranking on the linked list specified by 
the successor pointer ptr and the element data in each processor. 
The pointer value of the last element is negative. 
The result will be stored in rank. */ 
int listrank32(runk, ptr, data) 
plural int rank, ptr, data; 
{ plural int pointer;; 
rank = data; pointer = ptr; 
while there is a value in pointer that is positive{ 
1. rank = rank + router[pointer].rank; 
2. pointer = router[pointer].pointer; } 
(6(a)) 
I* with virtual processing */ 
int vlistrank32(runk, ptr, data, active) 
plural int runk[vpr], ptr[vpr], duta[vpr]; 
plural char active[vpr]; /* active flag for each VPE */ 
{ register int i; /* Temporary variable for virtual processing */ 
plural int tmpptr[upr], temp[vpr], pointer[upr]; 
/* The last element in the list has a negative pointer value. 
Every pointer value is positive. */ 
for (i = 0; i < vpr; i++) 
if (active[i]){ pointer[i] = ptr[i]; rank[i] = duta[i]; 
}else pointer[i] = -1; 
while there is a value in pointer that is positive{ 
for (i = 0; i < upr; i++) 
1. if (pointer[i] > 0) 
pa-rf etch( 1 y , &rank[pointer[i] % vpr], &temp[i], 1 
sizeof (rank)); 
for (i = 0; i < vpr; i++) 
2. if (pointer[i] > 0) 
ps-rf etch( 1-1, &pointer[pointer[i] % vpr], &tmpptr[i], 
sizeof (pointer)); 
3. for (i = 0; i < vpr; i++) 
if (active[i]){ pointer[i] = tmpptr[i]; rank[i] += temp[i]; }} 
1 
(6(b)) 
I 
Program 6. The MPL code for a simple sub-optimal deterministic list ranking algorithm [47] with virtual 
processing. 
312 T.-S. Hsu, V. Ramachandran 1 Theoretical Computer Science 162 (1996) 297-322 
Table 4 
Performance data for performing the list ranking operation on 
32-bit integers (listrank32) 
use no VPE 
“Pr 
2 
4 
8 
16 
32 
64 
Time Tl (ms) Time T2 (ms) 
(not pipelined) (pipelined) 
72.05 44.16 61% 
154.55 94.41 61% 
330.37 201.43 61% 
704.30 429.10 61% 
1494.33 910.17 61% 
3164.83 1927.21 61% 
Tz/Tl 
13.91 
5.2.2.2. A simple optimal randomized algorithm. In addition to the fairly complicated 
deterministic optimal algorithms in [2, lo], there are several simple randomized optimal 
algorithms for list ranking (see e.g. [3,11,24,46]). A simple randomized optimal list 
ranking algorithm [ 1 l] is shown in Algorithm 1. The implementation of step 1.1 on 
the MasPar using MPL needed to use one global routing of 8-bit flags. Step 1.2 used 
one global routing of 8-bit flags and two pipelined global routings of 32-bit integers. 
Step 2 used only one global routing of 32-bit integers. All global routings used in this 
algorithm were performed among physical processors. 
The performance data for our MPL code for the simple deterministic sub-optimal 
algorithm and for Algorithm 1 (as well as a revised randomized algorithm described 
below) is given in Table 5. In our testing, we terminated our program if more than 512 
iterations were needed to finish step 1 of Algorithm 1. We marked those experiments 
that needed more than 512 iterations as failure because both the time spent and the 
extra space used were too large. Although the average time needed for each successful 
trial was smaller than the time needed for the simple sub-optimal algorithm when 
vpr 2 16, the probability of success for Algorithm 1 was fairly small. The reason might 
be that the size of the input is not large enough for the probabilistic analysis in [ 1 l] 
to hold with high probability. 
By tracing our code, we observed the following. If the total number of elements 
was significantly smaller than the number of physical processors, then only a fairly 
small percentage of elements were eliminated during each iteration. However, if the 
total number of elements was much larger than the number of physical processors, 
then almost nproc/2 elements were eliminated during each iteration. (Note that we 
could eliminate at most nproc elements during each iteration, and we expect about half 
of these elements to generate random bit 0, and hence become ineligible for elimi- 
nation.) We conjecture that the reason for only a fairly small percentage of elements 
being eliminated when there are very few elements left might come from the MPL 
system pseudo-random number generator p-random used in our code. The system rou- 
tine p-random generates a pseudo-random number for each active PE. In [33, p. 51, it 
T.-s. Hsu, V. Ramachandranl Theoretical Computer Science 162 (1996) 297-322 313 
/* An optimal randomized list ranking algorithm.*/ 
1. I* Shrink the size of the list *I 
mark all elements “un-eliminated”; 
assume vpr = nlnproc is an integer; 
assigned vpr elements to a PE; 
while the total number of “un-eliminated” elements > 1 do 
for each PEi execute in parallel do 
pick an “un-eliminated” element ei; 
generate a random bit b(e, ) for ei; 
let s(ei) be the successor of e, in the current list; 
1.1. if b(ei) = 1 and (b(s(e,)) = 0 or no random bit 
assigned for s(e,)) 
1.2. then eliminate ei from the list 
end if 
end for 
end while; 
2. compute the rank of the eliminated element by “unraveling” 
the computation of step 1; 
Algorithm 1. An optima1 randomized list ranking algorithm. 
Table 5 
Performance data for the three algorithms we implemented for performing list ranking with virtual processing. 
The “%” column is the average percentage of runs finished in 512 iterations. The Y” column is the average 
number of iterations needed for successM runs to finish 
Deterministic Randomized algorithms 
algorithm 
up’ Program 6(b) 
(pipelined) 
Time (ms) # % Time (ms) # % Time (ms) # 
2 44.16 15 42.5 179.96 150.7 100.0 31.63 3.6 
4 94.41 16 45.0 260.00 218.2 100.0 54.06 9.9 
8 201.43 17 41.5 266.58 174.8 100.0 92.14 15.7 
16 429.10 18 70.0 409.08 207.6 100.0 180.19 37.1 
32 910.17 19 62.5 642.37 237.5 100.0 371.36 74.0 
64 1927.2 1 20 40.0 1134.99 287.7 100.0 808.53 145.5 
Algorithm 1 Algorithm 2 
says “Although the values of p-random are random sequences from each PE’s point 
of view, you might see some discernible pattern in the PE array as a whole.” 
5.2.2.3. A revised optimal randomized algorithm. In order to remedy the above prob- 
lem we modified Algorithm 1 as follows. Our randomized algorithm checked for the 
number of elements remaining before the execution of each iteration. We ran the ran- 
314 T.-s. Hsu, V. Ramachandranl Theoretical Computer Science 162 (1996) 297-322 
/* A revised optimal randomized list ranking algorithm. */ 
1. I* Shrink the size of the list *I 
perform the while loop in step 1 of Algorithm 1 until 
the total number of remaining elements is less than or equal to 
the number of physical processors; 
/* Load Balancing */ 
2. /* compute the rank of the current list by using the simple 
deterministic algorithm */ 
2.1. distribute the remaining “un-eliminated” elements evenly among 
physical processors; 
2.2. compute the rank using Program 6(a); 
2.3. send results back to original processors; 
3. compute the rank of the eliminated element by “unraveling” 
the computation of step 1; 
Algorithm 2. A revised optimal randomized list ranking algorithm. 
domized algorithm until the first time the number of elements remaining was no more 
than the number of physical processors. Then we ran the deterministic algorithm with- 
out virtual processing given in [19] after evenly distributing the remaining elements 
among available physical processors. Each physical processor would get at most one 
of the remaining elements. We finally “unraveled” the computation and obtained the 
result for each virtual processor. The revised algorithm is shown in Algorithm 2. 
According to our test data, the number of iterations needed to execute our revised 
algorithm, is much smaller than the number of iterations needed to execute Algorithm 
1. All of our test cases for our revised algorithm succeeded in i . upr iterations. We 
also note that on the largest input, our revised randomized algorithm was more than 
twice faster than the sub-optimal deterministic version (Program 6). 
For a comparison with other work, Reid-Miller [41] implemented a list ranking algo- 
rithm on a vector super computer CRAY C-90. Using 8 processors on a 16-processor 
machine, their implementation is currently the fastest parallel implementation known. 
However, their code requires a careful manual adjusting of parameters to achieve good 
speedup each time a different number of processors is used, and does not seem suitable 
for use in a massively parallel computing environment. Our implementation is tailored 
to be used on massively parallel computers. 
5.3. Category 3: locality sensitive routines 
In this category are those parallel primitives which could not be implemented with 
virtual processing by the simple strategies given in Sections 5.1 and 5.2. For example, 
an operator that assigns ranks to nproc elements cannot be applied upr times to assign 
ranks to upr . nproc elements. We have to look for other methods to solve a problem 
T.-s. Hsu, V. RamachandranITheoretical Computer Science 162 (1996) 297-322 315 
of this type with virtual processing instead of iteratively applying the basic parallel 
primitive. 
The following parallel primitives are in this category. 
_ The sort operator which sorts elements in parallel. For example, the MPL system 
routine psort(result, data) sorts data in each processor and stores them in result. 
_ The rank operator which computes, for each processor, the number of elements 
among all local copies of a plural variable d that are smaller than the processor’s 
local value of d (break ties arbitrary). The value computed is its rank. For example, 
the MPL system routine prank(result, data) stores the rank of data in result for 
each processor. 
_ The sendwith operator which combines data sent to each processor. To use the 
sendwith operator, one specifies an associative operator, a source plural variable s, 
and a plural variable d for destination address. Each active processor sends the local 
value of s to the processor whose ID is specified in the local value of d. If there are 
several values sent to one processor, all incoming values (this does not include the 
original value in the destination processor) are combined using the given associative 
operator before storing into the destination processor. 
For example, the MPL system routine sendwithAdd32(resuft, data, addr) sends 
the local value of data in each processor to the local variable result in the destination 
processor addr. If several data are sent to one processor, then the summation of these 
data is stored. 
_ The range minima (maxima) operator. Given local variables v, s and t in each pro- 
cessor, we build data structures such that we can return the minimum (maximum) 
value of all v’s from the processor with ID s to the processor with ID t for each pro- 
cessor. This operator is described in the kernel routines given in [ 191. For example, 
the kernel routine buildTMin32(v) builds such data structures for 32-bit integers 
and the kernel routine queryMin32(s, t) returns the minimum (maximum) values 
queried. We implemented these two routines in the work reported in [ 191 without 
virtual processing. 
We describe these primitives in the following subsections. 
5.3. I. Sort, rank, and sendwith 
The first three routines in this category could be implemented with virtual processing 
by calling a sorting routine that could handle virtual processing. For this, we used the 
package developed in [38]. Since the sorting package in [38] can handle only the case 
when the number of virtual processors simulated by each physical processor is a power 
of two, our code inherited the same restriction. The rank operator can be implemented 
easily by sorting the plural data created by concatenating the input data value and the 
processor ID. To compute the ranks of 32-bit integers, we had to sort 64-bit data. The 
implementation of the sendwith operator with virtual processing is as follows. We 
first sorted all requests from active VPE’s according to their destination addresses. 
We then grouped requests to the same destination in a segment and performed a 
316 T.-S. Hsu. V. Ramachandran I Theoretical Computer Science 162 (1996) 297-322 
Table 6 
Performance data for performing the sort operator [38] and the rank operator on 32-bit integers, and for 
performing the sendwith operator on 64-bit integers by using maximum as the combining operator. The 
“ratio” columns give the ratio of the time used by the routine using upr virtual processors per physical 
processor to the time used by the non-vu routine 
psort32 
system (no VPE) 
vpr 
2 
4 
8 
16 
32 
64 
Time (ms) 
7.69 
17.87 
31.07 
59.64 
118.99 
241.90 
496.62 
Ratio 
2.3 
4.0 
7.8 
15.5 
31.5 
64.6 
prank32 
Time (ms) Ratio 
sendwithMax64 
Time (ms) Ratio 
9.44 9.6 
24.76 2.6 34.4 3.6 
48.75 5.2 66.9 7.0 
98.57 10.4 134.4 14.0 
201.68 21.4 273.8 28.5 
415.41 44.0 559.8 58.3 
860.70 91.2 1149.9 119.8 
segmented scan operation [7] using the associative operator specified by the sendwith 
operator. In each segment, we picked the last VPE to write the result to the destination. 
Note that to send 32-bit integers using the sendwith operator, we had to sort 64-bit 
data. 
Performance data for these three routines are shown in Table 6. 
5.3.2. Range minima (maxima) 
In our code without virtual processing [19], the table we built in each physical pro- 
cessor for the range minima (maxima) queries was a 1-D array W such that W[i] 
is the minimum (maximum) value of all elements starting from the current physical 
processor to a physical processor whose ID is 2’ - 1 larger than the ID of the current 
physical processor. To retrieve the minimum (maximum) value starting from the phys- 
ical processor with ID s to the physical processor with ID t, we returned the minimum 
(maximum) value of W[k] in the physical processor with ID s and W[k] in the pro- 
cessorwithIDt-2k+1,where2k<t--+l ~2 kf*. We could implement this with 
virtual processing using a straightforward approach of building such an array for each 
virtual processor. However, the size of the array in each physical processor would be 
upr . [log(upr . nproc)l . 
Instead of using the above strategy, we used the following approach. For each phys- 
ical processor that simulated virtual processors with ID’s from x to x+upr - 1, we built 
a suffix array ST and a prefix array PT such that ST[i] contained the minimum (max- 
imum) value of elements from x + i to x + vpr - 1 and PT[i] contained the minimum 
(maximum) value of elements from x to x + i - 1. We also built a 2-D inner array IT 
such that IT[i,j] contained the minimum (maximum) value of all elements in virtual 
processors with ID’s from x + i to x + i + 2j - 1, for all j such that 2j < vpr. Let u 
be the minimum (maximum) value of all elements in virtual processors simulated by 
one physical processor. (The variable u is plural.) We built a global array W for the 
v’s which was the same as the array we built without virtual processing. 
T.-s. Hsu, V. Ramachandran I Theoretical Computer Science 162 (1996) 297-322 317 
Table 7 
Performance data for building the data structure to sup- 
port range minima queries of 32-bit integers. The “ra- 
tio” column gives the ratio of the time used by the 
routine using up’ virtual processors per physical pro- 
cessor to the time used by the non-vp routine 
buildMinT 
system (no VPE) 
VP’ 
2 
4 
8 
16 
32 
64 
Time (ms) 
0.036 
0.039 
0.041 
0.048 
0.070 
0.135 
0.361 
Ratio 
I .08 
1.14 
1.33 
1.94 
3.75 
10.02 
To query the minimum (maximum) value of a range of elements allocated on virtual 
processors that were simulated by one physical processor, we used the inner array IT. 
To query the value of a range of elements allocated on virtual processors simulated 
by more than one physical processor, we used the suffix array ST, the prefix array 
PT and the global array W. The total size of arrays in each physical processor was 
[log(nproc)l + vpr (2 + [log(upr)] ). Although our implementation issued four routing 
requests per virtual processor, two more than the naive approach, to perform the queries, 
our approach used less memory than the naive approach when nproc 34. For large 
values of nproc, the above approach used about l/vpr of the amount of memory used 
by the naive approach. 
Thus for parallel programs that require a lot of memory, our approach is better 
than the naive approach. We show the performance data for buildMinT in Table 7. 
Note that the performance of our range minima query routine depends on the data 
distribution of where the queries occur. Thus we did not obtain its performance data. 
6. Comparisons between sequential and parallel implementations 
In this section, we compare the performance of sequential and parallel implementa- 
tions of three basic parallel primitives, one from each of the three categories described 
in Section 5. The three basic parallel primitives are prefix summing, list ranking, and 
sorting. 
We implemented sequential linear time algorithms for prefix summing and list rank- 
ing using the C programming language [25] on a UNIX system [42] (SunOS 4.1.3). 
We used the quick sort routine provided by the system library for sorting. The perfor- 
mance data for the sequential algorithms was obtained by running our programs on a 
SPARC lo/41 machine with 32 Mbyte of main memory and about 80 Mbyte of swap- 
ping space. As indicated in [21], the SPARC IO/41 is at least 230 times faster than a 
318 T-s. Hsu, K Ramachandran I Theoretical Computer Science 162 (1996) 297-322 
single MP-1 PE. Since the MP-1 that we used has 16 384 PE’s, the raw computational 
power of the MP-1 was at least 63 times larger than a SPARC 10/41. 
All of our performance data for sequential programs was obtained when the ma- 
chine had only one active job. We measured two quantities in performing sequential 
computations. The first quantity was the user time, which is the amount of the CPU 
time used directly by a computation. The second quantity was the total time, which is 
the user time plus the amount of time used by the system to handle events generated 
while executing a computation, e.g., the swapping between the cache system and the 
main memory when there is a cache miss, and the swapping between main memory 
and the secondary storage when there is a page fault. On a single user system, the 
total time of a sequential computation is roughly equal to its turn-around time, i.e., the 
wall clock time during the computation. 
We found that when we ran our sequential programs on inputs that required less 
than 24Mbyte of memory (or 80% of the main memory on our system), the total 
time was roughly equal to the user time. When the amount of memory used was 
more than 32Mbyte, the effect of slow down due to swapping started to appear. For 
inputs that required 24-32Mbyte of memory, the behavior of our programs varied. 
One advantage in using a massively parallel computer, in addition to the enormous 
raw computation power offered, is the huge total amount of main memory available 
for a computation. For example, although each physical processor in the MP-1 only 
had 64 kbyte of memory, the total available memory for a 16 384-processor machine 
is 1 Gbyte (or 32 times larger than the main memory of our SPARC 10/41). Thus we 
would expect that our parallel implementations can run larger inputs faster than our 
sequential implementations, since sequential implementations would spend most of the 
time swapping when the input could not fit into the main memory. 
The performance data for performing prefix summing, list ranking and sorting are 
shown in Figs. l-3. 
We find that for performing prefix summing, our parallel implementation ran about 
70 times faster than our sequential implementation when all data could fit in the main 
10 
8 
z 6 
z! 
: 4 
scanwithAdd32 (SPARC IO/411 
Tj :;:i 
i- 
: 2 / 
cpu time 0 - 
0.001 t 0.000042 X ‘...‘.. ./ 
l  0.07 ,.  
...a ..i . 4 0.06 /f--J * 8 0.05 ,..- .I P 111 0.04 . ,.” .:’ 0.03 
ol-” ’ J oL 1 0.02 0.01 
,. .’ 41 
.. 
..: 
,. .’ 
.m . . . 
. . . 
. . .‘. 
. ..’ 
,,.o” 
..* .. 
.’ 
0 400 800 1200 0 400 800 1200 1600 
input size (in units of 10000) input size (in units of 10000) 
scanwithAdd32 WasPar MP-1) 
Fig. 1. Performance data for performing prefix summing sequentially and in parallel. 
T.-s. Hsu, Y Ramachandranl Theoretical Computer Science 162 (1996) 297-322 319 
200 
2 150 
g 
h: 100 
II) 
50 
0 
user time 0 
0.040 x - 
total time + 
0 50 100 150 200 250 300 350 
input size (in units of 10000) 
List Ranking (SPARC 10/41) 
50 
45 
40 
35 
$ 30 
8 25 
P 
61 20 
15 
10 
5 
0 rm 
Fig. 2. Performance data for performing list ranking sequentially and in parallel 
Quick Sort (SPARC lo/411 
I 
user time 0 
0.22 + 0.17 x + 0.022 xlog(x) - 
total time * A : 
0 400 800 1200 
input size (in units of 10000) 
List Ranking (MasPar MP-11 
cpu time 0 
0.012 x + 0.00096 X log(X) 
300 600 900 1200 1500 
input size (in units of 10000) 
Bitonic Sort (MasPar MP-1) 
cpu time c 
fIXI 
0 400 800 1200 1600 
input size (in units of 10000) 
Fig. 3. Performance data for performing sorting sequentially and in parallel. The sequential implementation 
used a quick sort algorithm. The parallel implementation which is given [38] uses a bitonic sort algorithm. 
The function f(X) in the right figure is 0.004 + 0.004X + 0.001X log X + 0.0000064X log2 X. 
memory of our SPARC 10141. When data had to be placed out of the core in our 
sequential code, our parallel program ran about 100 times faster than our sequential 
implementation on the largest input that we have tested. For performing list ranking, 
our parallel implementation was only twice faster than our sequential implementation if 
all cells in the linked list were in the main memory. However, when heavy swapping 
was needed in the sequential implementation, its performance degraded dramatically. 
On the other hand, our parallel implementation performed predictably well for the 
size of the the input that was more than 5 times larger than the largest input that 
we have tested on the sequential implementation. For performing sorting, the parallel 
implementation of the bitonic sort algorithm we used [38] was about 45 times faster 
than the sequential implementation of the quick sort algorithm on the largest input that 
we have tested. 
320 T.-s. Hsu, V. Ramachandranl Theoretical Computer Science 162 (1996) 297-322 
Table 8 
Performance data for our parallel programs with and without virtual processing. Note that n and m are the 
number of the vertices and edges in the input graph, respectively. The input size of non-vp programs is 
m = 8192. The input size for programs with virtual processing is 32 times larger than the input size for 
non-vp programs. The data for parallel programs without virtual processing is from [19] 
in = 3n/2 m = n31= m = n2/4 
No VP VP No VP VP No VP VP 
(s) (s) (s) (s) (s) (s) 
Spanning forest 1.01 74.86 0.41 7.23 0.39 5.35 
Minimum spanning forest 1.05 51.97 0.73 18.58 0.70 19.69 
All cut edges 1.17 83.36 0.61 17.92 0.57 15.85 
Ear decomposition 1.19 72.54 0.60 11.32 0.58 8.71 
Strong orientation 1.20 75.35 0.63 11.61 0.60 9.06 
Open ear decomposition 1.47 90.35 1.11 33.69 0.94 33.50 
7. Performance for implemented parallel graph algorithms 
The techniques presented in this paper can be used to rewrite a general MPL 
code without virtual processing to a code that can handle virtual processing. We 
have used this approach to convert the non-vp code for six parallel graph algorithms 
given in [ 191 into code that solves the same set of graph algorithms with virtual 
processing. 
Our coding approach is summarized as follows. Using our original code when no 
virtual processors are used [19] as a blueprint and the extended mapping as a guideline, 
we transformed our code to handle the allocation of virtual processors. Since the MPL 
language does not support virtual processing, we had to implement our own scheme for 
virtual processing. To do this, we recoded and fine-tuned the set of parallel primitives 
identified in [19] and several system library routines to handle the allocation of virtual 
processors efficiently. This was done using the techniques described in the present 
paper. Then we implemented a set of parallel graph algorithms by calling these parallel 
primitives and system routines. 
The performance data is shown in Table 8. Table 8 shows that our techniques for 
implementing virtual processing worked quite well on the code for graph algorithms. 
Details of our experiences in implementing parallel graph algorithms with virtual pro- 
cessing are given in [20,21]. 
8. Concluding remarks 
We have described techniques for implementing virtual processing on the MP-1 
using the MPL language. We have described our data allocation and code rewriting 
rules for writing MPL programs with virtual processing. We have also described the 
implementation of virtual processing and the fine-tuning of a set of parallel primitives. 
The performance of our code tracks theoretical predictions quite well for the range of 
T.-s. Hsu, V. Ramachandranl Theoretical Computer Science 162 (1996) 297-322 321 
vpr, the number of virtual processors simulated by each physical processor, that we 
have tested. 
References 
[I] N. Alon, J.H. Spencer and P. ErdGs, The Probabilistic Method (Wiley, New York, 1992). 
[2] R.J. Anderson and G.L. Miller, Deterministic parallel list ranking, in: Proc. 3rd Aegean Workshop on 
Computing, Lecture Notes in Computer Science, Vol. 319, (Springer, Berlin, 1988) 81-90. 
[3] R.J. Anderson and G.L. Miller, A simple randomized parallel algorithm for list-ranking, Inform. Process. 
Lett., 33 (5) (1990) 269-273. 
[4] J.D. Becher and R.V. Thanakij, Massively parallel processing for fingerprint classification, Tech. Report 
TR-MPiIPISP-36.93, MasPar Computer Corporation, 1993. 
[5] M. Berry, J. Comiskey and K. Minser, Parallel map analysis on 2-D grids, in: Proc. 6th SIAM Conf 
Parallel Processing for Scientijic Computing (1993) 312-319. 
[6] T. Blank, The MasPar MP-1 architecture, in: Proc. COMPCON Spring 90 -- 35th IEEE Computer 
Society Internat. Conf (1990) 20-40. 
[7] G.E. Blelloch, Scan primitives and parallel vector models, Ph.D. Thesis, MIT, 1989. 
[S] G.E. Blelloch, C.E. Leiserson, B.M. Maggs, C.G. Plaxton, S.J. Smith and M. Zagha, A comparison of 
sorting algorithms for the connection machine CM-2, in: Proc. 3rd ACM Symp. on Parallel Algorithms 
and Architectures (1991) 3-16. 
[9] H. Chemoff, A measure of the asymptotic efficiency for tests of a hypothesis based on the sum of 
observations, Ann. Math. Statist. 23 (1952) 493-509. 
[lo] R. Cole and U. Vi&kin, Approximate parallel scheduling. Part I: The basic technique with applications 
to optimal parallel list ranking in logarithmic time, SIAM J. Comput. 17 (1988) 128-142. 
[ll] T.H. Cormen, C.E. Leiserson and R.L. Rivest, Introduction to Algorithms (MIT Press, Cambridge, 
MA, 1990). 
[12] D.M. Eckstein, Simultaneous memory access, Tech. Report, Computer Science Dept., Iowa State Umv., 
Ames, IA, 1979. 
[13] T. Feder, A.G. Greenberg, V. Ramachandran, M. Rauch and L.-C. Wang, Circuit switched link 
simulation: Algorithms, complexity and implementation, Draft manuscript, 1992. 
[14] P.B. Gibbons, Y. Matias and V. Ramachandran, The QRQW PRAM: Accounting for contention in 
parallel algorithms, in: Proc. 5th ACM-SIAM Symp. on Discrete Algorithms (1994) 638448. 
[15] A.G. Greenberg, B. D. Lubachevsky and L.-C. Wang, Experience in massively parallel discrete event 
simulation, in: Proc. 5th ACM Symp. on Parallel Algorithms and Architectures (1993) 193-202. 
[16] W. Hightower, J. Prins and J. Reif, Implementations of randomized sorting on large parallel machines, 
in: Proc. 4th ACM Symp. on Parallel Algorithms and Architectures (1992) 158-167. 
[17] W.D. Hillis and G. L. Steele Jr., Data parallel algorithms, Comm. ACM 29 (1986) 117&l 183. 
[18] T.-s. Hsu. Graph augmentation and related problems: theory and practice, Ph.D. Thesis, Univ. of Texas 
at Austin, 1993. 
[19] T.-s. Hsu, V. Ramachandran and N. Dean, Implementation of parallel graph algorithms on the MasPar. 
in: DIMACS Series in Discrete Mathematics and Theoretical Computer Science, Vol. 15 (American 
Mathematical Society, Providence, RI, 1994) 165-198. 
[20] T.-s. Hsu, V. Ramachandran and N. Dean, Parallel implementation of algorithms for finding connected 
components, presented at the 3rd DIMACS Implementation Challenge Workshop, 1994. 
[21] T.-s. Hsu, V. Ramachandran and N. Dean, Implementation of parallel graph algorithms on a massively 
parallel SIMD computer with virtual processing, in: Proc. 9th International Parallel Processing Symp. 
(1995) 106112. 
[22] W.M. Hsu, Segmented ray casting for data parallel volume rendering, Tech. Report TR-MPIIPISP-39.93. 
MasPar Computer Corporation, 1993. 
[23] J. J&J& An Introduction to Parallel Algorithms (Addison-Wesley, Reading, MA, 1992). 
[24] R.M. Karp and V. Ramachandran, Parallel algorithms for shared-memory machines, in: J. van Leeuwen, 
ed., Handbook of Theoretical Computer Science (North-Holland, Amsterdam, 1990) 869-941. 
[25] B.W. Kemighan and D.M. Ritchie, The C Programming language (Prentice-Hall, Englewood Cliffs, 
NJ, 2nd ed., 1988). 
322 T-s. Hsu, V. Ramachandran I Theoretical Computer Science 162 (1996) 297-322 
[26] R.E. Ladner and M.J. Fischer, Parallel prefix computation, J. ACM, 27 (1980) 831-838. 
[27] S. Lanteri and C. Farhat, Viscous flow computations on MPP systems: implementational issues and 
performance results for unstructured grids, in: Proc. 6th SIAM Conf Parallel Processing for Scienttfic 
Computing (1993) 65-70. 
[28] A.K. Len&a, Factoring integers using SIMD sieves, Tech. Report TR-MP/PA-28.94, MasPar Computer 
Corporation, 1993. 
[29] MasPar Computer Co. MasPar Data Display Library (MPDDL) Reference manual, version 3.0, rev. 
a6 ed., July 1992. 
[30] MasPar Computer Co., MasPar Parallel Application Language (MPL) Reference manual, version 
3.0, rev. a3 ed., July 1992. 
[31] MasPar Computer Co., MasPar Parallel Application Language (MPL) User Guide, version 3.1, rev. 
a3 ed., November 1992. 
[32] MasPar Computer Co., MasPar System Oueruiew, rev. a5 ed., July 1992. 
[33] MasPar Computer Co., Release Notes for MasPar System Software, version 3.1, rev. b4 ed., November 
1992. 
[34] J.R. Nickolls and J. Reusch, Autonomous SIMD flexibility in the MP-1 and MP-2, in: Proc. 5th ACM 
Symp. on Parallel Algorithms and Architectures (1993) 98-99. 
[35] P.M. Pardalos, M.G.C. Resende and K.G. Ramakrishnan, eds., Parallel Processing of Discrete 
Optimization Problems, DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 
Vol. 22 (American Mathematical Society, Providence, RI, 1995). 
1361 R. Pickering and J. Cook, A first course in programming the DECmppiSx, Tech. Report, Pam/lab, 
Dept. of Informatics, Univ. of Bergen, N-5020 Bergen, Norway, 1993. Series of Parallel Processing: A 
Self-Study Introduction. 
[37] L. Prechelt, Measurements of MasPar MP-1216A communication operations, Tech. Report 01/93, 
Institute tilr Programmstrukturen und Datenorganisation, Fakult;it flir Informatik, Universitat Karlsruhe, 
Germany, 1993 
[38] J.F. Prins and J.A. Smith, Parallel sorting of large arrays on the MasPar MP-1, in: Proc. 3rd Symp. 
on the Frontiers of Massively Parallel Computation (1990) 59-64. 
[39] A. Purkayastha and J. Seguel, Efficient algorithms for some specific FFTs on a massively parallel 
computer, in: Proc. 7th SIAM Cant Parallel Processing for ScientiJic Computing (1995) 21-26. 
[40] V. Ramachandran and L.-C. Wang, Parallel algorithms and complexity results for telephone link 
simulation, in: Proc. 3rd IEEE Symp. on Parallel and Distributed Processing (1991) 378-385. 
[41] M. Reid-Miller, List ranking and list scan on the CRAY C-90, in Proc. 6th ACM Symp. on Parallel 
Algorithms and Architectures ( 1994) 104113. 
[42] D.M. Ritchie and K. Thompson, The Unix timesharing system, Comm. ACM 17 (1974) 365-375. 
[43] J.T. Schwartz, Ultracomputers, ACM Trans. Programming Languages and Systems 2 (1980) 484521. 
[44] T.J. Sheffler, Implementing the multiprefix operation on parallel and vector computers, in: Proc. 5th 
ACM Symp. on Parallel Algorithms and Architectures (1993) 377-386. 
[45] U. Vishkin, Implementation of simultaneous memory address access in models that forbid it, J. 
Algorithms 4 (1983) 45-50. 
[46] U. Vi&kin, Randomized speed-ups in parallel computation, in: Proc. 16th ACM Symp. on Theory of 
Computing (1984) 230-239. 
[47] J.C. Wyllie, The complexity of parallel computations, Tech. Rep. TR-79-387, Dept. of Computer 
Science, Cornell Univ., Ithaca, NY, 1979. 
