On XRAM and PRAM models, and on data-movement-intensive problems  by Fraigniaud, Pierre




Pierre Fraigniaud *
Laboratoire de l'Informatique du Parallélisme, CNRS, École Normale Supérieure de Lyon,
69364 Lyon Cedex 07, France
Received October 1996 
Communicated by M. Nivat 
Abstract 
In this paper, we deal with the XRAM model introduced in Cosnard and Ferreira (1991).
We show that the original definition of the XRAM model is not accurate enough, and that
it must be slightly modified. We therefore modify the definition of the XRAM model to make
it consistent, and we study the consequences of this modification for the complexity theory
developed in this model. In particular, the new model changes the definition of a problem on
an XRAM (and thus on a PRAM and on a RAM, since these two models are particular cases of
the XRAM). However, we show that, though theoretically important, this modification has no
practical consequence for the complexity theory developed on the XRAM model. Only results
based on the use of data-movement-intensive problems (Akl et al., 1992) must be carefully
reconsidered.
Keywords: Parallel computing; Model; PRAM; Communication problem 
1. Introduction 
This paper deals with the XRAM model introduced by Cosnard and Ferreira in [3].
The XRAM model generalizes the PRAM model [10] by taking into account the various
possible interconnection topologies of existing distributed memory parallel computers
(which are not fully connected in general).
* E-mail: pfraign@lip.ens-lyon.fr.
1 Supported by the research programs PRS and ANM of the CNRS, and by the DRET of the DGA.
0304-3975/98/$19.00 © 1998 Elsevier Science B.V. All rights reserved
PII S0304-3975(97)00189-S
P. Fraigniaud / Theoretical Computer Science 194 (1998) 225-237

A random access machine (RAM) [11] consists of
• a memory with a potentially infinite number of locations, and
• a processor capable of loading and storing data from and into the memory, executing
arithmetic and logical operations using a finite number of internal registers, and
operating under the control of a program stored in a control unit.
In one step requiring a unit of time, the processor can 
1. read a datum from an arbitrary location in memory into one of its internal registers, 
2. perform a computation on the content of one or two registers, and 
3. write the content of one register into an arbitrary memory location. 
The parallel RAM (PRAM) [10] consists of an arbitrarily large number n of RAMs, all
sharing the same common memory. Every step of a PRAM consists of three phases
(throughout the paper, we restrict ourselves to the exclusive read, exclusive write (EREW)
model):
1. all processors read simultaneously from n different locations in the shared memory 
(one for each processor), and each processor stores the obtained value in one of its 
internal registers; 
2. all processors perform a computation on the content of one or two local registers; 
3. all processors write simultaneously into n different locations in the shared memory 
(one for each processor). 
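The three phases above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `erew_step` and `shift_step`, and the representation of the shared memory as a Python list, are assumptions made here and do not appear in the paper.

```python
from typing import Callable, List


def erew_step(memory: List[int],
              read_addr: List[int],
              compute: Callable[[int, int], int],
              write_addr: List[int]) -> None:
    """Apply one EREW PRAM step to `memory` in place (hypothetical helper)."""
    n = len(read_addr)
    # Exclusive read / exclusive write: the n addresses must be pairwise distinct.
    if len(set(read_addr)) != n or len(set(write_addr)) != n:
        raise ValueError("EREW violation: two processors share a location")
    # Phase 1: every processor reads one location into a local register.
    registers = [memory[a] for a in read_addr]
    # Phase 2: every processor computes on its own register.
    registers = [compute(i, r) for i, r in enumerate(registers)]
    # Phase 3: every processor writes its register into one location.
    for i, a in enumerate(write_addr):
        memory[a] = registers[i]


def shift_step(memory: List[int]) -> None:
    """Processor i copies location (i - 1) mod n into location i, in one step."""
    n = len(memory)
    erew_step(memory,
              read_addr=[(i - 1) % n for i in range(n)],
              compute=lambda i, r: r,      # identity: no arithmetic needed
              write_addr=list(range(n)))
```

Because all reads happen before any write, a step that moves every datum to a neighbouring location does not overwrite unread data.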
Cosnard and Ferreira [3] generalized the PRAM model by defining the XRAM
model:
Definition 1. Let $X_i$, $i = 0, \dots, n-1$, be a collection of subsets of $\{0, \dots, n-1\}$. An
XRAM(P, M, X) is an undirected bipartite graph such that $P = \{P_i, i = 0, \dots, n-1\}$
and $M = \{M_i, i = 0, \dots, n-1\}$ are the two partitions (representing the processors and
the memory locations, respectively), and such that $P_i$ is connected to $M_j$ if and only if
$j \in X_i$. X is the corresponding interconnection network. Each computation step of an
XRAM satisfies the same constraints as a PRAM step, except that the memory locations
that processor $P_i$ can access are limited to $M_j$, $j \in X_i$, $i = 0, \dots, n-1$.
For instance, the hypercube RAM is defined by the sets $X_i = \{j \in \{0, \dots, n-1\} \mid$ the
binary expressions of i and j differ in at most one bit position$\}$, $i = 0, \dots, n-1$. The
hypercube RAM is denoted by HRAM in the following. The PRAM is a special case of
Definition 1 where $X_i = \{0, \dots, n-1\}$ for every $i = 0, \dots, n-1$ (although the number
of memory locations is then limited to n instead of being infinite, this can easily be
solved by allowing $|P|$ and $|M|$ to be different, and by letting $|M|$ be arbitrarily
large).
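As a small illustration of Definition 1, the sets $X_i$ of the HRAM and of the PRAM can be built as follows. The function names are hypothetical, and n is assumed to be a power of 2 for the hypercube.

```python
from typing import List, Set


def hypercube_sets(n: int) -> List[Set[int]]:
    """X_i for the HRAM: labels differing from i in at most one bit position."""
    d = n.bit_length() - 1
    if n <= 0 or (1 << d) != n:
        raise ValueError("n must be a power of 2")
    # {i} covers the 'zero bits differ' case, i ^ (1 << b) flips exactly bit b.
    return [{i} | {i ^ (1 << b) for b in range(d)} for i in range(n)]


def pram_sets(n: int) -> List[Set[int]]:
    """X_i for the PRAM: every processor may access every memory location."""
    return [set(range(n)) for _ in range(n)]
```

With n = 8, processor 5 (binary 101) gets access to locations 5, 4, 7 and 1, that is, d + 1 = 4 locations instead of all 8.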
Though this model is quite attractive, and draws a bridge between PRAMs and
distributed memory computers, it suffers from a major flaw: two virtually equivalent
topologies are not comparable. This flaw is pointed out in the next section,
and a new definition which corrects it is proposed. The main consequence of this new
definition is the freedom of the initial and final placements of data and results. This
also forces a new definition of the complexity of a problem, and we can then prove
that two isomorphic topologies do have the same computational power. Fortunately,
we also show that our new model does not modify the hierarchy of the complexity
classes, since all results that were previously derived from Definition 1 are still
valid up to an additive term corresponding to the time of the permutation routing
problem. In Section 3, we discuss several properties of our new model. In particular,
we show that a problem can be naturally decomposed into subproblems, whereas
such a formal decomposition was not easy in the former model. We also discuss
separation theorems, and show that most of the separation theorems proved in [3] still
hold in our new model. We also revisit the speed-up folk theorem and the simulation
theorem. We prove that a speed-up of 2p - 1 is possible on a p-processor PRAM, even
if there is no constraint on the memory locations of the data and the results. We
generalize in this way the result obtained in [1], where input and output memory locations
were part of the problem. Finally, Section 4 contains some concluding remarks.
2. A new definition of the XRAM model 
2.1. Comparability must be reflexive
We adopt the same terminology as in [3]: given a problem $\mathcal{P}$, and two models of
computation $M_1$ and $M_2$, we denote by $M_1(\mathcal{P}) \le M_2(\mathcal{P})$ (resp. $M_1(\mathcal{P}) < M_2(\mathcal{P})$) the
fact that the complexity of $\mathcal{P}$ in the model $M_2$ is smaller than or equal to (resp. strictly
smaller than) the complexity of $\mathcal{P}$ in the model $M_1$. We say that the model $M_1$ is less
powerful than the model $M_2$ if $M_1(\mathcal{P}) \le M_2(\mathcal{P})$ for every problem $\mathcal{P}$. This is denoted by $M_1 \le M_2$.
Moreover, if $M_1 \le M_2$, and if there exists a problem $\mathcal{P}$ such that $M_1(\mathcal{P}) < M_2(\mathcal{P})$, then
we say that $M_1$ is strictly less powerful than $M_2$, which is denoted by $M_1 < M_2$.
The two models PRAM and HRAM are separated in [3] as follows. (Of course
$\mathrm{HRAM} \le \mathrm{PRAM}$.) Let us consider the cyclic shift problem defined by
$C[M_i] \leftarrow C[M_{(i-1) \bmod n}]$, $i = 0, \dots, n-1$, where $C[M_i]$ denotes the content of the ith memory
location. This problem can be solved in one step on a PRAM. On the other hand,
solving this problem on an HRAM requires $\Omega(\log n)$ steps since $C[M_{n-1}]$ must
be "sent" to processor $P_0$, which is at distance $\Omega(\log n)$ from $M_{n-1}$ in the hypercubic
network induced by the HRAM.
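The distance argument can be checked numerically. The sketch below (helper names are assumptions, not from the paper) measures, for each i, the Hamming distance between the labels (i - 1) mod n and i. Since a datum can only cross a bounded number of hypercube dimensions per step, a Hamming distance of log n between source and destination yields the $\Omega(\log n)$ lower bound.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which the binary expressions of a and b differ."""
    return bin(a ^ b).count("1")


def cyclic_shift_max_distance(n: int) -> int:
    """Largest Hamming distance between source (i-1) mod n and destination i."""
    return max(hamming((i - 1) % n, i) for i in range(n))
```

For n a power of 2, the maximum is reached for i = 0, whose source n - 1 is the all-ones label.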
Even if this proof is correct, one can argue against it because it also proves that the
HRAM is strictly less powerful than... itself! Indeed, let us consider two isomorphic
copies $G_1$ and $G_2$ of a Hamiltonian graph G. For instance, graphs (a) and (b) of
Fig. 1 are two isomorphic copies of the 2-dimensional hypercube $Q_2$. Assume that the
vertices of the two copies are arbitrarily labeled. These two graphs induce two XRAMs.
For instance, XRAMs (c) and (d) of Fig. 1 are obtained from the graphs (a) and (b),
respectively. Now, in the same way as (0,1,3,2) is a Hamiltonian cycle of graph (a)
in Fig. 1 but not of graph (b), it is likely that one can find a permutation $\sigma_1$ of $\{0, \dots, n-1\}$
such that the ordered set $\{\sigma_1(i), i = 0, \dots, n-1\}$ is a Hamiltonian cycle in $G_1$ but
not in $G_2$, and a permutation $\sigma_2$ of $\{0, \dots, n-1\}$ such that the ordered set $\{\sigma_2(i), i = 0, \dots, n-1\}$
is a Hamiltonian cycle in $G_2$ but not in $G_1$. Hence, following the same arguments, the
XRAMs obtained from two isomorphic copies of the same graph G are incomparable:
$G_1\mathrm{RAM}(\mathcal{P}) < G_2\mathrm{RAM}(\mathcal{P})$ and $G_2\mathrm{RAM}(\mathcal{P}') < G_1\mathrm{RAM}(\mathcal{P}')$,
where $\mathcal{P}$ and $\mathcal{P}'$ are two different versions of the cyclic shift problem adapted to the
corresponding Hamiltonian cycles of $G_1$ and $G_2$.
[Figure 1: panels (a) and (b) show two labeled copies of the graph; panels (c) and (d) show the XRAMs obtained from them.]
Fig. 1. Two isomorphic XRAMs obtained from two isomorphic copies of $Q_2$.
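The dependence on the labeling can be made concrete on $Q_2$. In the sketch below, `Q2_A` uses the usual binary labeling, for which (0,1,3,2) is a Hamiltonian cycle as stated in the text. `Q2_B` is a hypothetical second labeling obtained by exchanging labels 2 and 3; it is chosen here for illustration only and is not claimed to be the labeling of graph (b) in Fig. 1.

```python
from typing import List, Sequence, Tuple

Edge = Tuple[int, int]


def is_hamiltonian_cycle(order: Sequence[int], edges: List[Edge]) -> bool:
    """True if visiting `order` cyclically uses only edges and covers each vertex once."""
    n = len(order)
    if sorted(order) != list(range(n)):
        return False
    undirected = {frozenset(e) for e in edges}
    return all(frozenset((order[k], order[(k + 1) % n])) in undirected
               for k in range(n))


def relabel(edges: List[Edge], perm: Sequence[int]) -> List[Edge]:
    """Rename vertex v into perm[v]; the resulting graph is isomorphic to the original."""
    return [(perm[u], perm[v]) for u, v in edges]


Q2_A = [(0, 1), (1, 3), (3, 2), (2, 0)]   # usual binary labeling of Q_2
Q2_B = relabel(Q2_A, [0, 1, 3, 2])        # hypothetical relabeling: swap 2 and 3
```

The cycle (0,1,3,2) is Hamiltonian in `Q2_A` but not in `Q2_B`, while (0,1,2,3) behaves the other way round, so each labeling favours its own version of the cyclic shift.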
Therefore, the classification based on the comparator "$\le$" defined above does not
produce a partial order because it is not reflexive. As we will see later, this classification
may also produce some other strange results. In the following section, we first modify
the definition of the XRAM so that "$\le$" produces a partial order. This will be enough
to avoid the inconsistent results which can be obtained with the use of Definition 1.
2.2. A new definition of the XRAM model 
We propose the following new definition for the XRAM model. To make a distinction 
between the definition of Cosnard and Ferreira, and the new definition, we denote our 
model by io-XRAM (for input-output XRAM). 
Definition 2 (New definition of the XRAM: io-XRAM). Let G be any graph of p
vertices. An io-XRAM of topology G consists of a set P (for processors) of p RAMs,
a set M (for memory) of p memory blocks, each block being potentially infinite as
is the memory of a RAM, and two sets $\mathcal{I}$ (for input) and $\mathcal{O}$ (for output) of n
memory locations. The p RAMs of P and the p memory blocks of M are connected
as the incident bipartite graph of G.
Computations on an io-XRAM are performed as follows:
1. Input the data. Data are initially stored in $\mathcal{I}$. They are "loaded" into M using an
input function $\phi_{\mathcal{I}} = (\phi_{\mathcal{I}}^1, \phi_{\mathcal{I}}^2) : \{0, \dots, n-1\} \to \{0, \dots, p-1\} \times \mathbb{N}$ which maps $\mathcal{I}$
to M. The mapping $\phi_{\mathcal{I}}$ depends on the problem solved (but not on the values of
the data): the ith datum, that is the one stored in position i of $\mathcal{I}$, is stored in the
memory block $M_{\phi_{\mathcal{I}}^1(i)}$ at the address specified by $\phi_{\mathcal{I}}^2(i)$.
2. Computation. This is done exactly in the same way as for the PRAM or the XRAM:
computation proceeds step by step, each step being composed of the three phases
described in Section 1 where, for each processor, data can be loaded and stored
from/to adjacent memory blocks following the connections defined by the graph G.
3. Output the results. Results must be placed in $\mathcal{O}$. They are loaded from M using
an output function $\phi_{\mathcal{O}} = (\phi_{\mathcal{O}}^1, \phi_{\mathcal{O}}^2) : \{0, \dots, n-1\} \to \{0, \dots, p-1\} \times \mathbb{N}$ which maps
$\mathcal{O}$ to M. Like $\phi_{\mathcal{I}}$, the mapping $\phi_{\mathcal{O}}$ depends on the problem solved (but not on the
values of the results): the ith result, that is the one that must be placed in position
i of $\mathcal{O}$, is stored in memory block $M_{\phi_{\mathcal{O}}^1(i)}$ at the address $\phi_{\mathcal{O}}^2(i)$.
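The input and output phases can be sketched as follows. This is a minimal sketch under assumptions: the memory is modelled as a list of p dictionaries (one per block, mapping addresses to values), and the placement `block_cyclic` is a hypothetical example of an input or output function, not one prescribed by the paper.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Placement = Callable[[int], Tuple[int, int]]   # i -> (memory block, address)


def load_input(data: Sequence, phi_in: Placement, p: int) -> List[Dict[int, object]]:
    """Phase 1: place the ith datum in block phi_in(i)[0] at address phi_in(i)[1]."""
    memory: List[Dict[int, object]] = [dict() for _ in range(p)]
    for i, value in enumerate(data):
        block, addr = phi_in(i)
        if addr in memory[block]:
            raise ValueError("two data mapped to the same memory location")
        memory[block][addr] = value
    return memory


def read_output(memory: List[Dict[int, object]], phi_out: Placement, n: int) -> list:
    """Phase 3: the ith result is read from block phi_out(i)[0], address phi_out(i)[1]."""
    return [memory[block][addr] for block, addr in (phi_out(i) for i in range(n))]


def block_cyclic(p: int) -> Placement:
    """Hypothetical placement: item i goes to block i mod p, address i div p."""
    return lambda i: (i % p, i // p)
```

Both functions depend only on the positions i, never on the values stored, which mirrors the requirement that $\phi_{\mathcal{I}}$ and $\phi_{\mathcal{O}}$ depend on the problem but not on its instances.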
The two functions $\phi_{\mathcal{I}}$ and $\phi_{\mathcal{O}}$ make it possible to take into account that two XRAMs defined
from two isomorphic copies of the same graph are the same: even if the two sets
of nodes are labeled in a different way, the choice of suitable functions $\phi_{\mathcal{I}}$ and
$\phi_{\mathcal{O}}$ makes it possible to execute the same code for solving the same problem on the two
machines. We will formally prove this fact soon but, before that, we need to define
the complexity of a problem in the io-XRAM model.
2.3. Complexity of a problem 
An instance of a problem on an io-XRAM is defined as a function from $\mathcal{I}$ to $\mathcal{O}$,
whereas it was defined as a function from M to M on an XRAM. We are free to choose
the best adapted input and output functions $\phi_{\mathcal{I}}$ and $\phi_{\mathcal{O}}$, but this choice is actually of
no help because it depends on the problem, and not on its instances. For instance, one
cannot choose the functions $\phi_{\mathcal{I}}$ and $\phi_{\mathcal{O}}$ such that $\phi_{\mathcal{O}}^{-1} \circ \phi_{\mathcal{I}}$ systematically sorts any set
of keys.
Of course, loading the data from the input set $\mathcal{I}$ into the memory, and storing
the results from the memory into the output set $\mathcal{O}$, are both virtual operations. It is
simply a way to say where the data are initially, and where the results can be obtained.
Therefore, in the computation process, phases 1 and 3 are free, and only phase 2
is costly.
More precisely, given a problem $\mathcal{P}$, and given $\phi_{\mathcal{I}}$ and $\phi_{\mathcal{O}}$, let $\mathcal{A}$ be an algorithm
solving $\mathcal{P}$. That is, for any instance of $\mathcal{P}$, $\mathcal{A}$ transforms the contents of the memory
locations according to the rules of the io-XRAM computation such that, assuming that
the ith component of the data is placed in memory block $M_{\phi_{\mathcal{I}}^1(i)}$ at the address $\phi_{\mathcal{I}}^2(i)$
for every i, then, for every j, the jth component of the result is placed in $M_{\phi_{\mathcal{O}}^1(j)}$ at the
address $\phi_{\mathcal{O}}^2(j)$. As usual, the complexity of the algorithm $\mathcal{A}$ is the maximum, taken
over all the instances of $\mathcal{P}$, of the number of steps of $\mathcal{A}$ required to solve the given
instance of $\mathcal{P}$. Given $\phi_{\mathcal{I}}$ and $\phi_{\mathcal{O}}$, the complexity of a problem $\mathcal{P}$ is the minimum,
taken over all the algorithms $\mathcal{A}$ solving $\mathcal{P}$, of the complexity of $\mathcal{A}$. It is denoted by
$\mathrm{comp}_{\phi_{\mathcal{I}},\phi_{\mathcal{O}}}(\mathcal{P})$.
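Definition 2 introduces a new degree of freedom, and solving a problem $\mathcal{P}$ on an
io-XRAM consists in:
1. finding $\phi_{\mathcal{I}}$ and $\phi_{\mathcal{O}}$;
2. given $\phi_{\mathcal{I}}$ and $\phi_{\mathcal{O}}$, finding the fastest algorithm $\mathcal{A}$ solving $\mathcal{P}$.
Therefore, the complexity of a problem $\mathcal{P}$ is denoted by $\mathrm{comp}(\mathcal{P})$, and satisfies
$\mathrm{comp}(\mathcal{P}) = \min_{\phi_{\mathcal{I}},\phi_{\mathcal{O}}} \mathrm{comp}_{\phi_{\mathcal{I}},\phi_{\mathcal{O}}}(\mathcal{P})$.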
Now, we can prove the following result which was not true with Definition 1: 
Theorem 1. Let $G_1$ and $G_2$ be two isomorphic copies of a graph G, and let $X_1$ and
$X_2$ be the two io-XRAMs obtained from $G_1$ and $G_2$ respectively. $X_1$ and $X_2$ have the
same computational power.
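Proof. Let $\ell_1$ and $\ell_2$ be two arbitrary labelings of the nodes of $G_1$ and $G_2$, respectively.
($\ell_1$ and $\ell_2$ then also label the processors and the memory locations of $X_1$ and $X_2$.)
These labelings, plus the isomorphism $\psi$ between $G_1$ and $G_2$, induce a permutation
$\sigma = \ell_2 \circ \psi \circ \ell_1^{-1}$ of the labels. Let $\phi_{\mathcal{I}}$ and $\phi_{\mathcal{O}}$ be the "best" input and output functions for
solving a problem $\mathcal{P}$ on $X_1$, and let $\mathcal{A}$ be the "best" algorithm used to solve $\mathcal{P}$ on $X_1$,
given $\phi_{\mathcal{I}}$ and $\phi_{\mathcal{O}}$. Then choose the input function $(\sigma \circ \phi_{\mathcal{I}}^1, \phi_{\mathcal{I}}^2)$ and the output function
$(\sigma \circ \phi_{\mathcal{O}}^1, \phi_{\mathcal{O}}^2)$ for $X_2$, and apply the algorithm $\mathcal{A}'$ on $X_2$, where $\mathcal{A}'$ is obtained from
$\mathcal{A}$ by replacing each instruction "$P_i$ accesses $M_j$ at the address k" by "$P_{\sigma(i)}$ accesses
$M_{\sigma(j)}$ at the address k". $\mathcal{A}$ and $\mathcal{A}'$ have the same complexity. □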
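The relabeling used in the proof can be sketched as follows. An algorithm is abstracted here as a list of accesses (i, j, k), meaning "processor i accesses memory block j at address k"; this representation, the helper names, and the example data (`X1`, `SIGMA`, `PROGRAM`) are assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Set, Tuple

Access = Tuple[int, int, int]   # (processor i, memory block j, address k)


def relabel_program(program: List[Access], sigma: Sequence[int]) -> List[Access]:
    """Replace 'P_i accesses M_j at address k' by 'P_sigma(i) accesses M_sigma(j) at k'."""
    return [(sigma[i], sigma[j], k) for (i, j, k) in program]


def relabel_topology(adjacency: Dict[int, Set[int]],
                     sigma: Sequence[int]) -> Dict[int, Set[int]]:
    """Processor-to-block adjacency of the isomorphic copy labeled through sigma."""
    return {sigma[i]: {sigma[j] for j in blocks} for i, blocks in adjacency.items()}


def respects_topology(program: List[Access], adjacency: Dict[int, Set[int]]) -> bool:
    """True if every access goes from a processor to one of its adjacent blocks."""
    return all(j in adjacency[i] for (i, j, _k) in program)


# Example on the 2-dimensional hypercube with its usual labeling.
X1 = {0: {0, 1, 2}, 1: {1, 0, 3}, 2: {2, 0, 3}, 3: {3, 1, 2}}
SIGMA = [0, 1, 3, 2]                      # hypothetical relabeling: swap 2 and 3
X2 = relabel_topology(X1, SIGMA)
PROGRAM = [(0, 1, 0), (1, 3, 0), (3, 2, 0), (2, 0, 0)]
```

`PROGRAM` is legal on `X1` but not on `X2`, whereas `relabel_program(PROGRAM, SIGMA)` is legal on `X2` and has exactly as many accesses, which is the content of the last sentence of the proof.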
Remark. The execution of the algorithm $\mathcal{A}'$ in the proof of Theorem 1 can also be
done using the XRAM model (Definition 1), except that the data are not initially placed
at their correct positions, and therefore $\mathcal{A}'$ will not produce the correct answer.
Note also that, roughly speaking, the io-PRAM model and the PRAM model are
identical because two labelings of the vertices of the complete graph cannot be
distinguished. The only difference lies in the statement of problems in these two models:
in the io-PRAM, a problem is defined in terms of input and output, and not in terms
of memory locations.
To make a definitive case that the input and output functions must be included in the
definition of the XRAM, let us consider the following example: let $C_n$ be the cycle
of n vertices, and let $Q_{\log n}$ be the hypercube of n vertices (we assume n is a power
of 2). Label the vertices of $C_n$ from 0 to n - 1 in the clockwise direction. Label the
vertices of the hypercube as usual, that is with the labeling obtained using the recursive
construction of the cube: vertex i is joined by an edge to vertex j if and only if
the binary expressions of i and j differ in exactly one bit position. Now, consider the
cyclic shift problem $\mathcal{P}$ as defined in Section 2.1 under the XRAM model. It allows
us to prove that $Q_{\log n}(\mathcal{P}) < C_n(\mathcal{P})$! Does it mean that the cycle is more powerful than
the hypercube? Of course not. Again, the several ways of labeling the vertices are not
taken into account in Definition 1. This induces inconsistent results. In fact, the new
definition of the XRAM model allows us to prove the following theorem, which sounds
quite natural but which was not true using the former definition:
Theorem 2. Let G = (V, E) be any graph, and let G' = (V, E') be a subgraph of G,
$E' \subset E$. Then the io-XRAM of topology G' is less powerful than the io-XRAM of
topology G: io-G'RAM $\le$ io-GRAM.
Proof. Let $\ell$ and $\ell'$ be two arbitrary labelings of the nodes of G and G', respectively.
Since G' is a subgraph of G, one can define $\psi = \ell \circ \ell'^{-1}$. Let $(\phi'_{\mathcal{I}}, \mathcal{A}', \phi'_{\mathcal{O}})$ be the
input function, the algorithm, and the output function solving a given problem $\mathcal{P}$
on G'. Using the relabeling function $\psi$ as we did in the proof of Theorem 1, one can
construct an algorithm $\mathcal{A}$ and two functions $\phi_{\mathcal{I}}$ and $\phi_{\mathcal{O}}$ which directly apply to G.
Therefore io-G'RAM($\mathcal{P}$) $\le$ io-GRAM($\mathcal{P}$). □
Remark. Why did this straightforward proof not apply in the model of Definition 1?
Simply because the labeling of the RAMs and the memory locations is forced in
Definition 1, whereas it is not considered in Definition 2.
Note also that there exist many conditions under which io-G'RAM < io-GRAM, where
G' is a subgraph of G. For instance, it might be the case if the diameter or the girth
of G' turns out to be much larger than that of G. However, such conditions must
be studied in detail because one must also find a problem for which these structural
modifications really induce an increase in the problem complexity.
2.4. XRAM versus io-XRAM 
It is known that sorting on the hypercube is in $\Omega(\log n)$, and in $O(\log n (\log\log n)^2)$ [7].
Now, can we prove that the complexity of sorting on an io-HRAM is in this range?
Such a question is meaningful because a problem on an io-XRAM does not map
the memory to itself, but an input set $\mathcal{I}$ to an output set $\mathcal{O}$, where $\mathcal{I}$ and $\mathcal{O}$ are both
isomorphic to the memory space M, and where the choice of the isomorphisms $\mathcal{I} \leftrightarrow M$
and $\mathcal{O} \leftrightarrow M$ is free. Of course, the answer to this question is yes, though up to the
price of routing a permutation on the machine. More precisely, one can easily get the
following:
Theorem 3. Let us consider an arbitrary p-processor io-XRAM of topology G. For
any problem $\mathcal{P}$, we have
where $\mathcal{P}_\sigma$ is the problem which consists in permuting any array A stored in $\mathcal{I}$ (A[i] in
position i) according to $\sigma$, and in getting the result in $\mathcal{O}$ (A[i] in position $\sigma(i)$).
This theorem shows that, although the virtual spaces $\mathcal{I}$ and $\mathcal{O}$, and the functions
which map these spaces to the memory, must be introduced to keep the formal
definition of an abstraction of a distributed memory computer consistent, the complexity
of a problem can be computed in practice by fixing the input and output
positions of the data arbitrarily.
Note that the bound of Theorem 3 is tight because, for every permutation $\sigma$, $\mathrm{comp}(\mathcal{P}_\sigma)$ =
O(1).
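One way to see why a permutation problem costs essentially nothing when the placements are free is sketched below. The assumption made here (it is not spelled out in the text) is that the output function is chosen as $\phi_{\mathcal{O}} = \phi_{\mathcal{I}} \circ \sigma^{-1}$, so that no datum has to move inside the memory; the placement `phi_in` and all function names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Location = Tuple[int, int]   # (memory block, address)


def phi_in(i: int, p: int) -> Location:
    """Hypothetical input placement: item i in block i mod p at address i div p."""
    return (i % p, i // p)


def make_phi_out(sigma: Sequence[int], p: int) -> Callable[[int], Location]:
    """Output placement phi_in o sigma^{-1}: output position sigma(i) reads where A[i] sits."""
    inverse = [0] * len(sigma)
    for i, s in enumerate(sigma):
        inverse[s] = i
    return lambda k: phi_in(inverse[k], p)


def permute_without_moving(data: Sequence, sigma: Sequence[int], p: int) -> List:
    """Solve the permutation problem with zero data movement inside the memory."""
    memory = {phi_in(i, p): value for i, value in enumerate(data)}   # input phase
    phi_out = make_phi_out(sigma, p)
    return [memory[phi_out(k)] for k in range(len(data))]            # output phase
```

For instance, with data a, b, c, d and $\sigma$ = (2, 0, 3, 1), the output holds b, d, a, c, that is, A[i] in position $\sigma(i)$. Under fixed placements, by contrast, the same problem costs the routing time of $\sigma$ on the network.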
Example. On a p-processor hypercube, any permutation can be off-line routed in
O(log p) steps [12]. Therefore all the results for the hypercube which were previously
derived are valid in the io-HRAM model up to an additive logarithmic term.
3. General properties of the io-XRAM model
3.1. Decomposition of a problem in subproblems 
As we have seen, the functions $\phi_{\mathcal{I}}$ and $\phi_{\mathcal{O}}$ were introduced to ensure the reflexivity of
the comparability of interconnection networks. This was done by taking into account
the possible graph isomorphisms. As we said, such functions $\phi_{\mathcal{I}}$ and $\phi_{\mathcal{O}}$ cannot be
used for solving a problem because they depend on the problem only, and not on
its instances. However, one may be tempted to cheat by decomposing a problem into
subproblems. For instance, consider the problem of adding matrices in the following
order:
1. $C \leftarrow A + B$;
2. $D \leftarrow A^t + B$ ($A^t$ denotes the transpose of A),
where A and B are stored in $\mathcal{I}$ in row-major order, and C and D are stored in $\mathcal{O}$ in
row-major order. This problem requires transposing A. This cannot be done using the
input and output functions: once the data have been loaded, intermediate results cannot
be output during the computation in order to be loaded again in different memory
locations afterwards. Indeed, the complexity of a problem is evaluated once the data are
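loaded, and before they are output. Therefore, if a problem $\mathcal{P}$ can be decomposed into
two successive subproblems $\mathcal{P}_1$ and $\mathcal{P}_2$, then
$\mathrm{comp}_{\phi_{\mathcal{I}},\phi_{\mathcal{O}}}(\mathcal{P}) \le \mathrm{comp}_{\phi_{\mathcal{I}},\phi}(\mathcal{P}_1) + \mathrm{comp}_{\phi,\phi_{\mathcal{O}}}(\mathcal{P}_2)$, (1)
where $\phi$ denotes the placement of the intermediate data between the two subproblems.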
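However, it could be interesting to redistribute the data between the execution of $\mathcal{P}_1$
and $\mathcal{P}_2$. This redistribution might be costly, but may also allow the data to be placed in
the right positions so that $\mathcal{P}_2$ can be executed rapidly. For instance, if $\mathrm{comp}(\mathcal{P}_1) =
\mathrm{comp}_{\phi'_{\mathcal{I}},\phi'_{\mathcal{O}}}(\mathcal{P}_1)$, and $\mathrm{comp}(\mathcal{P}_2) = \mathrm{comp}_{\phi''_{\mathcal{I}},\phi''_{\mathcal{O}}}(\mathcal{P}_2)$, then
$\mathrm{comp}(\mathcal{P}) \le \mathrm{comp}(\mathcal{P}_1) + \mathrm{comp}(\phi''_{\mathcal{I}} \circ {\phi'_{\mathcal{O}}}^{-1}) + \mathrm{comp}(\mathcal{P}_2)$. (2)
In Eq. (2), the middle term $\mathrm{comp}(\phi''_{\mathcal{I}} \circ {\phi'_{\mathcal{O}}}^{-1})$ is the time necessary to perform, inside the memory, the permutation
of the data from their positions after the execution of $\mathcal{P}_1$ to the positions chosen to
perform $\mathcal{P}_2$ optimally. It is not clear whether or not the upper bound of Eq. (2) is
better than the one of Eq. (1). In fact, there is a tradeoff between, on one hand, the
time to perform $\mathcal{P}_1$ and $\mathcal{P}_2$ given the input and output positions of the data, and, on
the other hand, the time to permute the data between $\mathcal{P}_1$ and $\mathcal{P}_2$. Therefore, we can
state the following upper bound: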
More generally, let us denote by $\mathcal{P} = \mathcal{P}_1 | \mathcal{P}_2 | \dots | \mathcal{P}_k$ the fact that problem $\mathcal{P}$ can be
decomposed into a succession of k subproblems $\mathcal{P}_1, \mathcal{P}_2, \dots, \mathcal{P}_k$, $k \ge 2$. We get:




The reader may find it interesting to refer to practical experiments where redistributing
the data between the several phases of a problem yields better results than the direct
algorithm [5, 13]. This is typically the case in the parallel implementations of the
ScaLAPACK subroutines for linear algebra [2].
Remark. Such a decomposition into subproblems was not so clear in the former XRAM
model. Let us take an example: finding the eigenvalues of a matrix is a well-defined
problem. However, nobody would consider the sentence "finding the eigenvalues
of a matrix which is stored on a hypercube such that row 1 is stored on processor 4,
block 2...7, 3...5 is stored on processor 13, column 2 is stored on processor 31, ..."
as a problem. Indeed, the problem is "finding the eigenvalues of a matrix", and the
last part of the sentence is just an indication about the initial storage of the data. Such
a distinction between problem and storage formally appears in the io-XRAM model.
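For the matrix example of this section, the data movement implied by the transposition can be made explicit. The sketch below, with hypothetical function names, computes the permutation of row-major positions that turns A into its transpose, and counts how many entries are not already in place. Whatever single placement of A is fixed by the input function, these are the entries whose position differs between the two additions.

```python
from typing import List


def transpose_permutation(rows: int, cols: int) -> List[int]:
    """perm[s] = row-major position in the transpose of the entry at row-major position s of A."""
    # Entry (r, c) of a rows x cols matrix sits at r * cols + c;
    # in the cols x rows transpose it sits at c * rows + r.
    return [c * rows + r for r in range(rows) for c in range(cols)]


def moved_entries(rows: int, cols: int) -> int:
    """Number of entries whose row-major position changes under transposition."""
    perm = transpose_permutation(rows, cols)
    return sum(1 for src, dst in enumerate(perm) if src != dst)
```

For a square matrix of order m, only the m diagonal entries stay in place, so m(m - 1) entries have to be routed through the network between the two additions.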
3.2. About separation theorems 
We have seen that the HRAM and the PRAM can be separated in the
XRAM model. This result still holds in the io-XRAM model. Indeed, let us consider
the permutation problem defined by $C[\mathcal{O}_i] \leftarrow C[\mathcal{I}_{\sigma(i)}]$, $i = 0, \dots, p-1$, where $\sigma$ is an
arbitrary permutation of $\{0, \dots, p-1\}$ stored in $\mathcal{I}$ between positions p and 2p - 1. This problem
can be solved in one step on a PRAM. However, whatever the choice of the input
and output functions, there exists a permutation $\sigma$ such that the memory block $M_{\phi_{\mathcal{O}}^1(i)}$
and the memory block $M_{\phi_{\mathcal{I}}^1(\sigma(i))}$ are at unbounded distance in the hypercube. (Indeed,
only a constant amount of data can be stored in each block, otherwise it would already
take an unbounded time just to access the data locally.) Therefore, it is true that
io-HRAM < PRAM
when we restrict our study to the EREW model. Other separation results have been
proved in [3]. Most of them separate not only topologies but also memory access
constraints (EREW, CREW, CRCW). They remain true in the io-XRAM model because
their proofs use arguments based on problems not defined in terms of memory locations, but
in terms of input and output (like searching or prefix computation).
Tom Leighton investigates in depth in [12] the computational power of several topologies,
including cycles (linear arrays), meshes, meshes of trees, and hypercubes. We refer
the reader to his book for the several simulation and separation results which link
these topologies. He showed in particular that the butterfly network is universal in
the sense that it can emulate every bounded degree network with a constant slowdown
in the computation time. The XRAM model gives a general framework for such
results.
3.3. Speed-up and simulation
Definition 2 applies to the io-XRAM of topology $K_p$ (the complete graph of p vertices),
and therefore to the PRAM model. Of course it does not imply any modification
of the PRAM theory because, as we said, two labelings of the vertices of $K_p$ cannot
be distinguished. However, in some cases, we need to go through the proofs of theorems
when they are based on problems described in terms of data movement inside
the memory (the memory locations of the data and the results are specified as part
of the problem). As an example, we consider the speed-up folk theorem, which says
that the speed-up of a parallel algorithm using p processors cannot be greater than
O(p). Of course, superlinear speed-ups can be obtained in practice (i.e. on real parallel
machines). Indeed, a processor dealing with less data may avoid problems, such
as cache misses for instance, which strongly slow down the sequential computation on
a large amount of data. However, it is often said that a speed-up larger than p cannot
be achieved on a PRAM. Akl, Cosnard and Ferreira [1] have shown that this is not true,
and that a speed-up of 2p - 1 can be achieved on a PRAM of p processors. This result
holds mainly because one must keep in mind that each RAM has a finite number
of registers, and therefore a PRAM of p processors has p times more registers than
a single RAM. This is why a p-processor PRAM can be more than p times faster than
a RAM.
The proof in [1] relies on two arguments (in the following, we assume that each
processor has a unique register; the generalization to an arbitrary number of registers
can be found in [1]):
1. there exists a problem which can be solved in one step on a p-processor PRAM,
and which cannot be solved in fewer than 2p - 1 steps on a single RAM (Theorem 3.2
in [1]);
2. each step of a p-processor PRAM can be simulated by 2p - 1 steps on a RAM
(Theorem 6.1 in [1]).
The second argument remains true even under the model of Definition 2. However, the
first argument uses a problem of the class named data-movement-intensive problems,
which is defined in terms of memory locations as follows:
Problem 1. Let $I_1, \dots, I_p$ be p distinct integers in the range $]-\infty, p]$ stored in an
array A in such a way that $A[i] = I_i$, $i = 1, \dots, p$. It is required to modify A so that
it satisfies the following condition:
$A[i] = i$ if there exists j such that $I_j = i$;
$A[i] = I_i$ otherwise.
(3)
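Condition (3) can be written down directly, together with one possible reading of how p processors could satisfy it in a single step. The source attributes the one-step PRAM solution to [1] without detailing it, so `problem1_one_parallel_step` below is only an illustrative assumption: conceptual processor j reads $I_j$ and, if $1 \le I_j \le p$, writes $I_j$ into location $I_j$. Since the $I_j$ are distinct, no two processors write to the same location.

```python
from typing import List, Sequence


def problem1_spec(values: Sequence[int]) -> List[int]:
    """Direct transcription of condition (3); values[i - 1] plays the role of I_i."""
    p = len(values)
    present = set(values)
    return [i if i in present else values[i - 1] for i in range(1, p + 1)]


def problem1_one_parallel_step(values: Sequence[int]) -> List[int]:
    """All reads happen before all writes, as in one PRAM step (illustrative only)."""
    p = len(values)
    registers = list(values)            # phase 1: processor j reads I_j
    result = list(values)               # unwritten locations keep I_i
    for v in registers:                 # phase 3: conceptually simultaneous writes
        if 1 <= v <= p:
            result[v - 1] = v           # distinct I_j => exclusive writes
    return result
```

Both functions agree on every valid input; a sequential RAM with a single register, on the other hand, risks overwriting a location before it has been read, which is what the lower bound exploits.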
Problem 1 requires 1 step on a p-processor PRAM, whereas it requires at least 2p - 1
steps on a RAM. Indeed, the memory locations of the input and the output are imposed:
the datum A[i] is given in memory location i, and the result A[i] must be returned
in memory location i, with a risk of overwriting an unread datum. We could now imagine
storing the results elsewhere to avoid this problem. Indeed, what is important is that
we know where the result is, but why should the memory location of the result
be specified in advance? In fact, this does not correspond to Definition 2. Problem 1
under the PRAM model translates into the following problem in the io-PRAM model:
Problem 2. We are given an array A stored in $\mathcal{I}$ (A[i] in position i). We want to
modify it according to the rules of Problem 1, and we want the result stored in $\mathcal{O}$
(A[i] in position i).
The two phrases "stored in $\mathcal{I}$" and "stored in $\mathcal{O}$" just mean "we give you the
data" and "we want the result", but the positions where the data are stored and loaded
in the memory are not part of the problem; they are part of the algorithm solving the
problem. Anyway, one can still prove that Problem 2 cannot be solved in fewer than
2p - 1 steps on an io-RAM, that is, even with total freedom on the memory locations
of the data and the results.
Lemma 1. Problem 2 requires at least 2p - 1 steps on an io-RAM.
Proof. Let i, $1 \le i \le p$, be any integer, and let $i^* = \phi_{\mathcal{O}}(i)$. More precisely, $i^*$ denotes
the memory location where we can find the result A[i] after the modifications specified
by Eq. (3) in Problem 1: $C[M_{i^*}] = i$ if there exists j such that $I_j = i$, and $C[M_{i^*}] = I_i$
otherwise. It means that the last "write" instruction at the address $i^*$ must follow at
least p "read" instructions because p reads are necessary to check whether or not
there exists j such that $I_j = i$. Therefore, for every i, $1 \le i \le p$, each final write at the
address $\phi_{\mathcal{O}}(i)$ must be preceded by p reads. That is, a total of at least 2p - 1 steps
are necessary (one step can be saved because one can read and then write in the
same step). □
Therefore, we get the following result: 
Theorem 5. The speed-up of a p-processor io-PRAM over an io-RAM cannot
exceed 2p - 1, and this bound is tight.
Proof. The tightness of the bound is given by Lemma 1. The simulation Theorem 6.1
in [1] shows that any step of a p-processor PRAM can be simulated on a RAM in
at most 2p - 1 steps. □
4. Conclusion 
In this paper, we have modified the XRAM model defined in [3] in order to correct
some looseness induced by graph isomorphisms in the former definition. We used the
terminology io-XRAM (for input-output XRAM) to denote our new model. Although
replacing the definition of the XRAM model by the definition of the io-XRAM model
was dictated by formal reasons, we have shown that most of the results derived for
the XRAM model also hold for the io-XRAM model. This is particularly the case for
separation theorems, speed-up theorems [1], etc. In fact, the io-XRAM model is the
XRAM model, and its definition in [3] must simply be replaced by Definition 2.
The fact that the formal mathematical environment provided by the input and output
sets, and by the input and output functions, can be relaxed in general implies
that models like the one defined in [12] (Chapter 1 .lS) are certainly preferable
for practical use. The XRAM model is just a formal framework for the definition of
such a model, and provides a bridge between PRAMs and parallel computing on
distributed architectures. There exist many other models for parallel computation, such as the BSP
model [14], the LogP model [6], the pr model [4,8,9], etc. It would be of major interest to
provide simulation theorems between these different models and the XRAM model.
References 
[1] S. Akl, M. Cosnard, A. Ferreira, Data-movement-intensive problems: two folk theorems in parallel
computation revisited, Theoret. Comput. Sci. 95 (1992) 323-337.
[2] E. Anderson, A. Benzoni, J. Dongarra, S. Moulton, S. Ostrouchov, B. Tourancheau, R. Van de Geijn,
LAPACK for distributed memory architecture, in: 5th SIAM Conf. on Parallel Processing for Scientific
Computing, USA, 1991.
[3] M. Cosnard, A. Ferreira, Designing parallel non numerical algorithms, in: J. Evans, Liddell (Eds.),
Parallel Computing '89, Elsevier Science, Amsterdam, 1991, pp. 3-18.
[4] M. Cosnard, P. Fraigniaud, Analysis of synchronous polynomial root finding methods on a distributed
memory multicomputer, IEEE Trans. Parallel Distrib. Systems 5 (1994) 639-648.
[5] M. Cosnard, M. Loi, B. Tourancheau, A framework for data migrations on the hypercube, in: NATO
Advanced Research Workshop - Software for Parallel Computation (Cetraro), 1992.
[6] D. Culler, R. Karp, D. Patterson, A. Sahay, K. Schauser, E. Santos, R. Subramonian, T. von Eiken,
LogP: toward a realistic model of parallel computation, in: 4th ACM Conf. on Principles and Practice
of Parallel Programming, 1993.
[7] R. Cypher, G. Plaxton, Deterministic sorting in nearly logarithmic time on the hypercube and related
computers, in: Twenty-second Annual ACM Symp. on Theory of Computing, 1990, pp. 193-203.
[8] F. Desprez, J.J. Dongarra, B. Tourancheau, Performance study of LU factorization with low
communication overhead on multiprocessors, Parallel Process. Lett. 5 (2) (1995) 157-169.
[9] P. Fraigniaud, E. Lazard, Methods and problems of communication in usual networks, Discrete Appl.
Math. 53 (1994) 79-133.
[10] A. Gibbons, W. Rytter, Efficient Parallel Algorithms, Cambridge University Press, Cambridge, 1988.
[11] J. Hopcroft, J. Ullman, Introduction to Automata, Languages and Computation, Addison-Wesley,
Reading, MA, 1979.
[12] T. Leighton, Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes, Morgan
Kaufmann, Los Altos, CA, 1992.
[13] L. Prylli, B. Tourancheau, Efficient block cyclic data redistribution, Research Report RR 2766,
INRIA/LIP, Laboratoire de l'Informatique du Parallélisme, ENS-Lyon, France, 1996.
[14] L.G. Valiant, A bridging model for parallel computation, Comm. ACM 33 (8) (1990) 103-111.
