Algorithms for optimal self-simulation of some restricted reconfigurable meshes by Murshed, M. Manzur & Brent, Richard P
TR-CS-97-16
Algorithms for Optimal
Self-Simulation of Some
Restricted Reconfigurable Meshes
M. Manzur Murshed and Richard P. Brent
July 1997
Joint Computer Science Technical Report Series
Department of Computer Science
Faculty of Engineering and Information Technology
Computer Sciences Laboratory
Research School of Information Sciences and Engineering
This technical report series is published jointly by the Department of
Computer Science, Faculty of Engineering and Information Technology,
and the Computer Sciences Laboratory, Research School of Information
Sciences and Engineering, The Australian National University.
Algorithms for Optimal Self-Simulation of Some
Restricted Reconfigurable Meshes
M. Manzur Murshed

Richard P. Brent
Computer Sciences Lab, Research School of Information Sciences & Engg.
The Australian National University, Canberra ACT 0200, Australia
e-mail: {murshed, rpb}@cslab.anu.edu.au
Tel: +61 6 279 8636, Fax: +61 6 279 8651
July 15, 1997
Abstract
There has recently been interest in introducing reconfigurable buses to
existing parallel architectures. Among them, the Reconfigurable Mesh (RM)
draws much attention because of its simplicity. However, the wide acceptance
of the RM depends on its scalability through self-simulation. This paper
presents a simple self-simulation algorithm which can self-simulate the
monotonic RM model optimally and the piecewise-monotonic RM model
asymptotically optimally. We claim here that our algorithm preserves the
essence of configurational computation and uses fewer broadcasts than
simulation by the contraction and linear-connected component computation
methods [1].
Keywords: Reconfigurable mesh; Simulation; Parallel algorithms; Parallel
architectures
1 Introduction
It is well known that interprocessor communications and simultaneous memory
accesses often act as bottlenecks in present-day parallel machines. Bus systems
have been introduced recently to a number of parallel machines to address this
problem. Examples include the Bus Automaton [15], the Reconfigurable Mesh (RM)
[9], the content addressable array processor [17], and the Polymorphic torus
[8]. A bus system is called reconfigurable if it can be dynamically changed
according to either global or local information.

Corresponding author.
Can these models be the basis for the design of the next generation of
massively parallel computers? Perhaps the answer depends on the most
fundamental related issue of virtual parallelism or self-simulation: given an
algorithm which is designed for a large RM, can it be executed efficiently on a
smaller RM?
In [1] Ben-Asher et al. present optimal self-simulation algorithms for the
HV-RN and LRN models (defined in Section 2). They also present a
self-simulation algorithm for the RN model with an extra slowdown which is
polylogarithmic in the size of the simulated mesh. In self-simulating the HV-RN
model they apply a standard simulation technique, known as the contraction
method, where a single processing element (PE) simulates a submesh. This method
destroys the beauty of configurational computation [16], a key strength of
RM-algorithms, as most of the bus segments are configured virtually in a single
PE. Self-simulation of the LRN model is done by windowing the simulating mesh
over the simulated mesh in a snakelike order while computing linear-connected
components. This introduces extra broadcasts in addition to the necessary
windowing broadcasts.
In this paper we present a self-simulation algorithm SIMPLE where the
simulating mesh is used as a window over the simulated mesh. The key issue in
Algorithm SIMPLE is to determine a suitable sequence of windowing so that after
a finite number of steps correct self-simulation is achieved. We show that
Algorithm SIMPLE can self-simulate the monotonic RM model optimally. We also
show that Algorithm SIMPLE can self-simulate the piecewise-monotonic RM model,
and that the optimal slowdown can be achieved if the simulating mesh is small
compared to the simulated mesh, which is a desirable property of
self-simulation.
This paper is organized as follows. In the next section we present the basic
issues of RM and introduce the monotonic and piecewise-monotonic RM models. The
section also includes definitions associated with the self-simulation problem.
In Section 3.1 the mapping of the simulated mesh into the simulating mesh is
described in detail. Algorithm SIMPLE is presented in Section 3.2. In Section
3.3 we show that Algorithm SIMPLE can self-simulate the monotonic and
piecewise-monotonic RM models optimally. Section 4 concludes the paper.
2 Preliminaries
For the sake of completeness, we briefly define the reconfigurable mesh and
give definitions of the problem of self-simulation.
2.1 Reconfigurable Mesh
The reconfigurable mesh is primarily a two-dimensional mesh of PEs connected by
reconfigurable buses. In this parallel architecture, a PE is placed at each
grid point as in the usual mesh connected computers. Each PE is connected to at
most four neighbouring PEs through fixed bus segments connected to four I/O
ports, N & S along dimension x and E & W along dimension y. These fixed bus
segments are building blocks of larger bus components which are formed through
switching, determined entirely by local data, of the internal connectors (see
Figure 1) between the I/O ports of each processor. The fifteen possible
interconnections of I/O ports through switching, also known as local
configurations, are shown in Figure 2. Like all bus systems, the behaviour of
RM relies on the assumption that the transmission time of a message along a bus
is independent of the length of the bus [2].
Figure 1: A reconfigurable mesh of size 3 × 4. (Legend: internal switches
between the I/O ports N, W, S and E of each PE.)
Figure 2: Possible fifteen local configurations of a PE. The figure numbers
them 1 to 15; the port groupings drawn are [ESW,N], [NES,W], [ENW,S], [NWS,E],
[EN,WS], [ES,WN], [ES,W,N], [EN,W,S], [WS,E,N], [WN,E,S], [EW,NS], [E,W,NS],
[EW,N,S], [E,W,N,S] and [EWNS].
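The way local switching turns fixed segments into larger buses can be sketched with a union-find over ports. This is only an illustrative model, not code from the report: the function name `bus_components`, the dictionary-based `config` argument, and the string encoding of port groups (for example `["EW", "N", "S"]`) are all assumptions made for this sketch.

```python
from itertools import product

def bus_components(rows, cols, config):
    """Group the ports of a rows x cols mesh into buses.

    config maps (r, c) to a local configuration written as a list of port
    groups, e.g. ["EW", "N", "S"]; ports in one group are switched together.
    Returns a function mapping a port (r, c, name) to its bus representative.
    """
    parent = {(r, c, p): (r, c, p)
              for r, c in product(range(rows), range(cols)) for p in "NESW"}

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a, b):
        parent[find(a)] = find(b)

    for r, c in product(range(rows), range(cols)):
        # Internal switches: join the ports inside each group.
        for group in config[(r, c)]:
            for p in group[1:]:
                union((r, c, group[0]), (r, c, p))
        # Fixed bus segments to the southern and eastern neighbours.
        if r + 1 < rows:
            union((r, c, "S"), (r + 1, c, "N"))
        if c + 1 < cols:
            union((r, c, "E"), (r, c + 1, "W"))
    return find

# A 1 x 3 mesh in which every PE connects its E and W ports forms one row bus.
find = bus_components(1, 3, {(0, c): ["EW", "N", "S"] for c in range(3)})
same_bus = find((0, 0, "W")) == find((0, 2, "E"))
```

Two ports lie on the same bus exactly when `find` returns the same representative for both, which mirrors the idea that buses are built from fixed segments joined by local switch settings.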
A reconfigurable mesh operates in the single instruction multiple data (SIMD)
mode. Besides the reconfigurable switches, each PE has a computing unit with a
fixed number of local registers. A single time step of an RM is composed of the
following four substeps:
BUS substep. Every processor switches the internal connectors between I/O ports
by local decision.
WRITE substep. Along each bus, one or more processors on the bus transmit
a message of length bounded by the bandwidth of the fixed bus segments as
well as the switches. These processors are called the speakers. It is assumed
that a collision between several speakers will be detected by all the
processors connected to the bus and the transmitted message will be discarded.
READ substep. Some or all the processors connected to a bus read the message
transmitted by a single speaker. These processors are called the readers. In
this paper we assume that each reader can detect whether the designated port
carries any signal or not. A reader is allowed to read only when it detects a
signal in the associated port.
COMPUTE substep. A constant-time local computation is done by each processor.
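The WRITE and READ rules for a single bus can be illustrated with a small helper. This is a sketch under stated assumptions: the function name `resolve_bus` and the three status strings are invented here, and the report does not say what a reader senses on a bus with colliding speakers beyond the fact that the collision is detected and the message is discarded.

```python
def resolve_bus(speaker_messages):
    """Outcome of one WRITE substep on a single bus.

    speaker_messages: list of messages written by the speakers on this bus.
    Returns (status, message), where status is "idle", "message" or
    "collision" (hypothetical labels used only in this sketch).
    """
    if len(speaker_messages) == 0:
        return ("idle", None)            # no signal: readers must not read
    if len(speaker_messages) == 1:
        return ("message", speaker_messages[0])
    return ("collision", None)           # detected by all, message discarded

def read_port(outcome):
    """A reader takes the value only when a single speaker's message is present."""
    status, message = outcome
    return message if status == "message" else None
```

For example, `read_port(resolve_bus([7]))` yields the written value, while an idle bus or a collision yields nothing to read.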
The general reconfigurable mesh model, as presented above, does not specify the
exact operation of the switches. The following basic variants are proposed in
[1]:
Horizontal-Vertical RM (HV-RN Model). Buses are formed along either rows
or columns, but may not contain building blocks from both dimensions. This
model supports local configurations 1 to 4 in Figure 2.
Linear RM (LRN Model). A bus may consist of any connected path of edges, not
only vertical or only horizontal. The buses are however only linear, i.e., a
fixed bus segment is attached to at most one other fixed bus segment at each
end. Local configurations 1 to 10 in Figure 2 are supported by this model.
General RM (RN Model). A configuration of buses is any partition of the
network into edge-disjoint subgraphs, so buses are not necessarily linear. This
model supports all the fifteen local configurations in Figure 2.
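The counts 4, 10 and 15 quoted for the three variants can be checked by enumerating all partitions of the four ports. The sketch below is an illustration only: the helper names are invented, and the two predicates encode one reading of the definitions (HV-RN joins only N with S or E with W; LRN joins at most two ports per group).

```python
def set_partitions(items):
    """Yield every partition of the list items as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into one of the existing blocks ...
        for k in range(len(partition)):
            yield partition[:k] + [[first] + partition[k]] + partition[k + 1:]
        # ... or give it a block of its own.
        yield [[first]] + partition

def is_linear(partition):
    """LRN reading: every port is switched to at most one other port."""
    return all(len(block) <= 2 for block in partition)

def is_hv(partition):
    """HV-RN reading: only straight row (E-W) or column (N-S) connections."""
    allowed = ({"N", "S"}, {"E", "W"})
    return all(len(block) == 1 or set(block) in allowed for block in partition)

configs = list(set_partitions(["N", "E", "S", "W"]))
counts = (sum(map(is_hv, configs)), sum(map(is_linear, configs)), len(configs))
```

Under these readings `counts` gives the number of local configurations available to the HV-RN, LRN and RN models respectively.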
The variants defined above depend solely on the local configurations of ports.
Now we present two additional models where restrictions are imposed on the
global characteristics of the buses.
Definition 1 A function $f(x)$ is called positive monotonic w.r.t. $x$ if
$f(x_1) \le f(x_2)$ whenever $x_1 \le x_2$. Similarly a function $f(x)$ is
called negative monotonic w.r.t. $x$ if $f(x_1) \ge f(x_2)$ whenever
$x_1 \le x_2$. A function $f(x)$ is monotonic w.r.t. $x$ if it is either
positive or negative monotonic w.r.t. $x$.
Definition 2 A function $f(x)$ is called piecewise-monotonic w.r.t. $x$ if the
x-axis can be divided into successive ranges such that $f(x)$ is positive
monotonic in alternate ranges and negative monotonic in the remaining ranges.
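The two definitions can be made concrete for a function sampled at successive integer points. The sketch below is illustrative only: the function names are invented, "positive monotonic" is read as non-decreasing and "negative monotonic" as non-increasing, and the splitting into ranges is one greedy choice among several valid ones.

```python
def is_positive_monotonic(values):
    """Non-decreasing samples f(0), f(1), ..."""
    return all(a <= b for a, b in zip(values, values[1:]))

def is_negative_monotonic(values):
    """Non-increasing samples f(0), f(1), ..."""
    return all(a >= b for a, b in zip(values, values[1:]))

def monotonic_ranges(values):
    """Greedily split a non-empty sample list into alternating monotonic ranges.

    Returns (start, end, direction) triples with inclusive indices;
    direction is +1 for a positive and -1 for a negative monotonic range.
    Adjacent ranges share their turning point.
    """
    ranges, start, direction = [], 0, 0   # 0 means "not decided yet"
    for i in range(1, len(values)):
        step = (values[i] > values[i - 1]) - (values[i] < values[i - 1])
        if step and direction and step != direction:
            ranges.append((start, i - 1, direction))
            start, direction = i - 1, step
        elif step:
            direction = step
    ranges.append((start, len(values) - 1, direction or 1))
    return ranges

pieces = monotonic_ranges([0, 1, 2, 2, 1, 0, 3])
```

A sampled function is monotonic exactly when `monotonic_ranges` returns a single range, and the number of ranges counts how often the direction turns.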
Monotonic RM Model. Each bus represents a monotonic function w.r.t. the row
and/or column index within a range.
Piecewise-Monotonic RM Model. Every bus represents a piecewise-monotonic
function w.r.t. either the row or the column index within a range. Moreover, in
any step all buses represent functions w.r.t. the same index.
Observe that both models are included in the LRN model. Also observe that
the HV-RN model is included in the monotonic RM model, which is in turn
included in the piecewise-monotonic RM model.
We believe that the monotonic and piecewise-monotonic RM models are defined
here for the first time, but many published algorithms for the LRN model can
readily be used in these models with little or no modification. Among them,
the PARITY algorithms [7, 14], the algorithms for conversion between number
representations [6], the prefix-sums algorithm [11], and the sorting algorithms
[12, 13] can be adapted to both the monotonic and the piecewise-monotonic RM
models, while the prefix-remainders algorithm [11], the integer summing
algorithms [6, 11], the integer multiplication algorithm [5], the sorting
algorithm [6], the algorithms based on function decomposition [3], and the
HISTOGRAM algorithm [4] can be applied to the piecewise-monotonic RM model
only. Moreover, all the algorithms suitable for the HV-RN model are applicable
to both models.
2.2 Problem Definition
Let $\mathrm{RM}^{A \times B}_{C}$ denote a reconfigurable mesh of $A$ rows and
$B$ columns with each PE having $C$ registers.
Definition 3 The self-simulation problem of RM is to step-by-step simulate
$\mathrm{RM}^{M \times N}_{R}$ by
$\mathrm{RM}^{P \times Q}_{\Theta(R \lceil M/P \rceil \lceil N/Q \rceil)}$
where $P \le M$, $Q \le N$, and the computing power of the PEs and the bus
bandwidth (not less than $\log MN$) are assumed to be equivalent in both the
meshes.
To simplify the exposition, $\frac{M}{P}$ and $\frac{N}{Q}$ are assumed to be
integers. If the memory requirement of the simulating RM is bounded as in the
above definition, then the slowdown remains the key issue.
Definition 4 We say that reconfigurable mesh $R_1$ is simulated by $R_2$ with
slowdown $S$ if the result for any algorithm $A_1$ on $R_1$ is achieved through
the execution of a step-by-step simulation algorithm $A_2$ on $R_2$ in which
each step of $A_1$ is simulated with slowdown at most $S$.
The self-simulation of $\mathrm{RM}^{M \times N}_{R}$ by
$\mathrm{RM}^{P \times Q}_{\Theta(R \frac{M}{P} \frac{N}{Q})}$, $P \le M$ and
$Q \le N$, is said to be optimal if the slowdown is
$\Theta\left(\frac{M}{P} \cdot \frac{N}{Q}\right)$. A smaller slowdown would
lead to a serial algorithm that contradicts a known lower bound.
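A quick arithmetic check of these bounds for concrete sizes can be sketched as follows. The function name and the returned dictionary keys are invented for illustration, and the constants hidden in the $\Theta$ notation are taken to be 1, which is an assumption of this sketch rather than a statement from the report.

```python
def self_simulation_budget(M, N, P, Q, R):
    """Submesh count, registers per simulating PE and the optimal slowdown factor.

    Assumes, as the text does, that M/P and N/Q are integers; the Theta
    constants are taken to be 1 purely for illustration.
    """
    assert P <= M and Q <= N and M % P == 0 and N % Q == 0
    submeshes = (M // P) * (N // Q)
    return {
        "submeshes": submeshes,                # simulated PEs hosted per simulating PE
        "registers_per_pe": R * submeshes,     # memory bound of Definition 3
        "optimal_slowdown_factor": submeshes,  # slowdown of an optimal simulation
    }

budget = self_simulation_budget(M=64, N=64, P=8, Q=8, R=4)
```

For a 64 × 64 mesh simulated on an 8 × 8 mesh, each simulating PE stands in for 64 simulated PEs, so an optimal simulation spends on the order of 64 steps per simulated step.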
3 Self-Simulation Algorithm
Let the PEs of the simulated mesh $\mathrm{RM}^{M \times N}_{R}$ and the
simulating mesh $\mathrm{RM}^{P \times Q}_{\Theta(R \frac{M}{P} \frac{N}{Q})}$
be denoted by the matrices $R[0 : M-1,\ 0 : N-1]$ and $S[0 : P-1,\ 0 : Q-1]$
respectively. Let $R(x, y)$, $0 \le x < M$ and $0 \le y < N$, denote the PE at
the intersection of row $x$ and column $y$ of the simulated mesh. Similarly let
the PE at the intersection of row $x$ and column $y$ of the simulating mesh be
denoted by $S(x, y)$, $0 \le x < P$ and $0 \le y < Q$.
Figure 3: Mapping of the simulated RM into the simulating RM. (The figure shows
a simulated RM of size $4P \times 4Q$ whose submeshes $R_{0,0}, \ldots,
R_{3,3}$ are stacked along the register axis of a simulating RM of size
$P \times Q$, with extra registers for each submesh; $R_{0,0}$ is mapped to the
origin of the simulating RM.)
We first develop the necessary mapping techniques for the simulation. Algorithm
SIMPLE is then presented, and finally restrictions are imposed on the general
RM to make Algorithm SIMPLE optimal.
3.1 Mapping of the Simulated RM into the Simulating RM
In this paper the following two functions play an important role in mapping
meshes:
$$\mathrm{FOLD}(a, b) = \begin{cases} a \bmod b & \text{if } a \operatorname{div} b \text{ is even} \\ b - 1 - (a \bmod b) & \text{otherwise} \end{cases}$$
$$\mathrm{UNFOLD}(a, b, c) = \begin{cases} bc + a & \text{if } c \text{ is even} \\ b(c + 1) - 1 - a & \text{otherwise} \end{cases}$$
Let the simulated RM be divided into $\frac{M}{P} \cdot \frac{N}{Q}$
nonoverlapping submeshes $R_{i,j}$ of size $P \times Q$ containing the
processing elements $R[iP : (i+1)P - 1,\ jQ : (j+1)Q - 1]$ for
$0 \le i < \frac{M}{P}$ and $0 \le j < \frac{N}{Q}$. Now the simulated mesh is
mapped into the simulating mesh in such a way that the processing element
$R(x, y)$ is simulated by $S(\mathrm{FOLD}(x, P), \mathrm{FOLD}(y, Q))$ for
$0 \le x < M$ and $0 \le y < N$. This ensures a one-to-one PE mapping of each
submesh $R_{i,j}$ into the simulating RM, and whenever the simulating RM
simulates the submesh $R_{i,j}$ the processing element $S(x, y)$ simulates the
processing element $R(\mathrm{UNFOLD}(x, P, i), \mathrm{UNFOLD}(y, Q, j))$ for
$0 \le i < \frac{M}{P}$, $0 \le j < \frac{N}{Q}$, $0 \le x < P$ and
$0 \le y < Q$. The same benefits can also be achieved through a
straightforward mapping without any folding. But the mapping presented here has
a unique characteristic: the external neighbours of a boundary PE $p$ of the
submesh $R_{i,j}$, $0 \le i < \frac{M}{P}$ and $0 \le j < \frac{N}{Q}$, are
mapped to the same simulating PE to which $p$ itself is mapped. This keeps the
broadcasts of simulation data low, at the expense of introducing a mapping of
the ports because the direction of the x- and/or y-axis changes in some of the
mapped submeshes. For the submesh $R_{i,j}$, $0 \le i < \frac{M}{P}$ and
$0 \le j < \frac{N}{Q}$, the ports are mapped as follows:
$$\mathrm{MAPPORT}(E, i, j) = \begin{cases} E & \text{if } i \text{ is even} \\ W & \text{otherwise} \end{cases}$$
$$\mathrm{MAPPORT}(W, i, j) = \begin{cases} W & \text{if } i \text{ is even} \\ E & \text{otherwise} \end{cases}$$
$$\mathrm{MAPPORT}(N, i, j) = \begin{cases} N & \text{if } j \text{ is even} \\ S & \text{otherwise} \end{cases}$$
$$\mathrm{MAPPORT}(S, i, j) = \begin{cases} S & \text{if } j \text{ is even} \\ N & \text{otherwise} \end{cases}$$
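The folding functions and the port mapping translate almost directly into code. The following is a minimal sketch: the lowercase function names and the sample sizes `P = 4`, `M = 16` are choices made here for illustration, and `mapport` follows the parity rules exactly as they are stated in the text.

```python
def fold(a, b):
    """FOLD(a, b): position of index a inside a window of width b, mirrored in odd windows."""
    return a % b if (a // b) % 2 == 0 else b - 1 - (a % b)

def unfold(a, b, c):
    """UNFOLD(a, b, c): index in the simulated mesh of window position a in window c."""
    return b * c + a if c % 2 == 0 else b * (c + 1) - 1 - a

def mapport(port, i, j):
    """MAPPORT as stated: E/W swap when i is odd, N/S swap when j is odd."""
    swap = {"E": "W", "W": "E", "N": "S", "S": "N"}
    flipped = (i % 2 == 1) if port in "EW" else (j % 2 == 1)
    return swap[port] if flipped else port

P, M = 4, 16  # illustrative sizes only
# UNFOLD undoes FOLD when given the submesh index x div P.
round_trip = all(unfold(fold(x, P), P, x // P) == x for x in range(M))
# The last index of one submesh and the first index of the next share a simulating PE.
shared = all(fold(k * P - 1, P) == fold(k * P, P) for k in range(1, M // P))
```

The `shared` check is the property the text highlights: a boundary PE and its external neighbour in the adjacent submesh land on the same simulating PE, so boundary data need not be broadcast across the window.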
Now each PE of the simulating mesh simulates $\frac{M}{P} \cdot \frac{N}{Q}$
PEs of the simulated mesh. We assume that the $k$-th register of the simulated
processing element $R(x, y)$ is mapped into the
$\left(\left(x \frac{N}{Q} + y\right)(R + \varepsilon) + k\right)$-th register
of the corresponding simulating processing element $S(x \bmod P, y \bmod Q)$
where $0 \le x < M$, $0 \le y < N$, $0 \le k < R$, and $\varepsilon$ is a small
integer. If the register index is considered as the third axis, then the above
register mapping stacks the submeshes $R_{i,j}$ over the simulating RM in
column-major order (Figure 3), and each submesh is allotted an extra
$\varepsilon$ registers per PE for simulation purposes.
3.2 SIMPLE: a Self-Simulation Algorithm
Let $B$ denote the set of all the boundary PEs of the simulating RM, i.e.,
$B = \{S(x, y) \mid x = 0 \lor x = P - 1 \lor y = 0 \lor y = Q - 1\}$. Let a
port $t$ of a boundary PE $p$ be called a *port if $t$ is not connected to any
port external to $p$. Every boundary PE has exactly one *port except $S(0, 0)$,
$S(0, Q - 1)$, $S(P - 1, 0)$, and $S(P - 1, Q - 1)$, which have two *ports
each. Whenever the submesh $R_{i,j}$ is simulated, for each boundary PE, two
registers from the $\varepsilon$ extra registers are allocated for each *port.
Let these special registers be called *reg1 and *reg2.
For each step $s$ of an RM-algorithm $A$, let $b(s)$, $w(s)$, $r(s)$, and
$c(s)$ denote the BUS, WRITE, READ, and COMPUTE substeps respectively. In the
remainder, whenever we mention that some steps or substeps are executed in the
simulating RM while simulating a specific submesh, it is assumed that the
references to any register, to any port and to the coordinates of any PE are
mapped accordingly.
We now present a self-simulation algorithm without any specific model in mind.
In Section 3.3 we show that this algorithm can optimally self-simulate some
classes of RM where restrictions are imposed on the global characteristics of
bus reconfigurations.
ALGORITHM: SIMPLE( RM-algorithm: A )
1 For each step $s \in A$ do the following
  1.1 For each boundary PE $\in B$ do the following in parallel
        For each *port $t$ do the following
          For each mapped submesh $R_{i,j}$, $0 \le i < \frac{M}{P}$ and
          $0 \le j < \frac{N}{Q}$, set *reg1 to 0;
  1.2 Generate a finite sequence of pairs $(i_1, j_1), (i_2, j_2), \ldots,
      (i_L, j_L)$ of length $L$ where $\forall k : 0 \le i_k < \frac{M}{P}$
      and $0 \le j_k < \frac{N}{Q}$.
  1.3 For each pair $(i_k, j_k)$ do the following on the mapped submesh
      $R_{i_k, j_k}$
    1.3.1  Execute b(s);
    1.3.2a Execute w(s);
    1.3.2b For each boundary PE $\in B$ do the following in parallel
             For each *port $t$ do the following
               if *reg1 = 1 then write *reg2 to port t;
    1.3.3a Execute r(s);
    1.3.3b For each boundary PE $\in B$ do the following in parallel
             For each *port $t$ do the following
               if t senses signal then set *reg1 to 1 else set *reg1 to 0;
               if *reg1 = 1 then read port t into *reg2;
    1.3.4a Execute c(s);
    1.3.4b For each boundary PE $\in B$ do the following in parallel
             For each *port $t$ do the following
               Copy *reg1 and *reg2 into the similar registers, allocated
               for t, of the neighbouring mapped submeshes;
Step 1.2 is the most crucial part of the above algorithm. Generating a sequence
of length $L$ which leads to correct self-simulation depends on many factors,
which are discussed in the next section.
The order of step 1.1 of Algorithm SIMPLE is
$\Theta\left(\frac{M}{P} \cdot \frac{N}{Q}\right)$. Let the order of step 1.2
be $O_{1.2}$. Steps 1.3.2a and 1.3.2b can be done in a single WRITE substep.
Similarly steps 1.3.3a and 1.3.3b can be done in a single READ substep, while
steps 1.3.4a and 1.3.4b can be done in a single COMPUTE substep. So all the
substeps of step 1.3 can be executed in $O(1)$ time. Hence the order of step
1.3 is $O(L)$.
Lemma 1 The slowdown of Algorithm SIMPLE is
$\max\left\{\Theta\left(\frac{M}{P} \cdot \frac{N}{Q}\right),\ O_{1.2},\ O(L)\right\}$.
3.3 Optimal Self-Simulation of Some Restricted RM
Let RSEQ($i$) and CSEQ($j$) denote the sequences of pairs
$(i, 0), (i, 1), \ldots, (i, \frac{N}{Q} - 1)$ and
$(0, j), (1, j), \ldots, (\frac{M}{P} - 1, j)$ respectively. Let $\overline{S}$
denote the sequence $S$ in reverse order, $S_1 + S_2 + \cdots + S_n$ denote the
concatenation of sequences $S_1, S_2, \ldots, S_n$ in order, and $S^k$ denote
the sequence $\underbrace{S + S + \cdots + S}_{k \text{ times}}$.
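The sequence notation maps naturally onto Python lists of pairs. In this sketch the function names and the parameters `rows` and `cols` (standing for $\frac{M}{P}$ and $\frac{N}{Q}$) are assumptions introduced for illustration.

```python
def rseq(i, cols):
    """RSEQ(i): the pairs (i, 0), (i, 1), ..., (i, cols - 1)."""
    return [(i, j) for j in range(cols)]

def cseq(j, rows):
    """CSEQ(j): the pairs (0, j), (1, j), ..., (rows - 1, j)."""
    return [(i, j) for i in range(rows)]

def reverse(seq):
    """The overlined sequence: seq in reverse order."""
    return seq[::-1]

def power(seq, k):
    """S^k: seq concatenated with itself k times."""
    return seq * k

# Concatenation S1 + S2 is plain list concatenation.
example = cseq(0, 2) + reverse(cseq(1, 2))
```

Here `example` visits column 0 of submeshes from top to bottom and then column 1 from bottom to top.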
Theorem 1 Any monotonic RM-algorithm can be self-simulated optimally by the
Algorithm SIMPLE.
Proof. Let us consider the sequences of pairs
$S^+ = \mathrm{CSEQ}(0) + \mathrm{CSEQ}(1) + \cdots + \mathrm{CSEQ}(\frac{N}{Q} - 1)$
and
$S^- = \overline{\mathrm{CSEQ}(0)} + \overline{\mathrm{CSEQ}(1)} + \cdots + \overline{\mathrm{CSEQ}(\frac{N}{Q} - 1)}$.
Let $S = S^+ + S^-$. Let $A$ be an arbitrary monotonic RM-algorithm written for
the simulated mesh. We now show that the sequence
$S + \overline{S} = S^+ + S^- + \overline{S^-} + \overline{S^+}$ of length
$4 \frac{M}{P} \frac{N}{Q}$, which is
$\Theta\left(\frac{M}{P} \cdot \frac{N}{Q}\right)$, can be used in step 1.2 of
Algorithm SIMPLE for each step of $A$ to achieve a correct self-simulation.
First consider only the positive monotonic buses. Let an arbitrary positive
monotonic bus $u$ terminate in the submeshes $R_{a,b}$ and $R_{c,d}$. Now
assume $a \le c$; this implies $b \le d$ as $u$ is positive monotonic. Based on
the characteristics of a positive monotonic bus we can say that the trajectory
of the bus through the various submeshes $R_{i,j}$ follows the sequence of
pairs $S_u = (a, b), (a + 1, b), \ldots, (k_b, b), (k_b, b + 1),
(k_b + 1, b + 1), \ldots, (k_{b+1}, b + 1), \ldots, (k_{d-1}, d),
(k_{d-1} + 1, d), \ldots, (c, d)$ where $0 \le k_b \le c$ and
$k_{l-1} \le k_l \le c$, $b < l < d$.
It is easy to show that $S_u$ and $\overline{S_u}$ are contained in $S^+$ and
$\overline{S^+}$ respectively, preserving the order.
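The containment claim can be checked mechanically for a small case. The sketch below is illustrative: the grid of 4 × 4 submeshes and the sample trajectories are made up, and $S^-$ is taken to be the concatenation of the reversed column sequences, which is the reading assumed here for the proof's notation.

```python
def cseq(j, rows):
    return [(i, j) for i in range(rows)]

def s_plus(rows, cols):
    """S+: column sequences in order, each traversed with increasing row index."""
    return [pair for j in range(cols) for pair in cseq(j, rows)]

def s_minus(rows, cols):
    """S- (assumed reading): column sequences in order, each traversed in reverse."""
    return [pair for j in range(cols) for pair in cseq(j, rows)[::-1]]

def is_subsequence(short, long):
    """True if short occurs in long with its order preserved (gaps allowed)."""
    it = iter(long)
    return all(item in it for item in short)

rows, cols = 4, 4                      # illustrative M/P and N/Q
S = s_plus(rows, cols) + s_minus(rows, cols)
window_order = S + S[::-1]             # the sequence S followed by its reverse

# Hypothetical trajectories of one positive and one negative monotonic bus.
s_u_pos = [(0, 1), (1, 1), (1, 2), (2, 2), (3, 2), (3, 3)]
s_u_neg = [(3, 0), (2, 0), (2, 1), (1, 1), (0, 1)]
```

With these definitions, `is_subsequence(s_u_pos, s_plus(rows, cols))` and the corresponding check for the reversed trajectory against the reversed sequence express the two containments used in the proof.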
Suppose the processor that writes on the bus $u$ resides in the submesh
$R_{p,q}$, $(p, q) \in S_u$. Then after using the sequence $S^+$ completely in
step 1.2 of Algorithm SIMPLE, the portion of the bus $u$ residing in the
submeshes $R_{i,j}$, where $(i, j) \in S_u$ and the index of the pair $(i, j)$
in $S_u$ is greater than or equal to that of the pair $(p, q)$, is simulated
completely. Similarly, after using the sequence $\overline{S^+}$ completely,
the remaining portion of the bus $u$ is also completely simulated.
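The two-pass argument can be imitated with a toy model in which a bus is reduced to the list of submeshes it crosses and a signal spreads from the writer's submesh whenever the window visits an adjacent portion. This is a deliberate simplification, not the authors' algorithm: the function name `simulate_bus`, the sample trajectory and the abstraction of the *reg1 flags into a set of reached positions are all assumptions of this sketch.

```python
def simulate_bus(trajectory, writer, window_order):
    """Return the positions of trajectory that carry the signal afterwards.

    trajectory:   submesh pairs crossed by the bus, in order along the bus
    writer:       index into trajectory of the submesh holding the speaker
    window_order: submesh pairs visited one after another by the simulating mesh
    """
    reached = set()
    for pair in window_order:
        for t, submesh in enumerate(trajectory):
            if submesh != pair:
                continue
            # The portion in the visited submesh gets the signal from the
            # speaker itself or from an adjacent portion whose boundary flag
            # was recorded during an earlier visit.
            if t == writer or t - 1 in reached or t + 1 in reached:
                reached.add(t)
    return reached

rows, cols = 4, 4
s_plus = [(i, j) for j in range(cols) for i in range(rows)]
s_u = [(0, 1), (1, 1), (1, 2), (2, 2), (3, 2), (3, 3)]   # hypothetical bus

forward_only = simulate_bus(s_u, writer=2, window_order=s_plus)
both_passes = simulate_bus(s_u, writer=2, window_order=s_plus + s_plus[::-1])
```

In this toy run the forward pass covers the bus from the writer onward, and only the reversed pass reaches the portion that precedes the writer, which is why the windowing sequence is followed by its reverse.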
The sequences of pairs $S^-$ and $\overline{S^-}$ play a similar role in
simulating negative monotonic buses correctly.
Now the generation of the sequence of pairs $S + \overline{S}$ is independent
of any step of the RM-algorithm $A$. So the order of step 1.2 of Algorithm
SIMPLE, $O_{1.2}$, can be considered as $O(1)$, and thus by Lemma 1 the
slowdown of Algorithm SIMPLE in self-simulating any monotonic RM-algorithm is
$\Theta\left(\frac{M}{P} \cdot \frac{N}{Q}\right)$, which is optimal. □
Theorem 2 Any piecewise-monotonic RM-algorithm can be self-simulated by the
Algorithm SIMPLE.
Figure 4: A step in the algorithm [6] of adding $k$ integers of $n$ bits each
on an RM of size $2n \times 2nk$ where $\delta = n$ (figure generated by the
serial simulator RMSIM [10]).
Proof. Let $A$ be an arbitrary piecewise-monotonic RM-algorithm written for the
simulated mesh. Without loss of generality we assume that the buses of any
particular step of $A$ are piecewise-monotonic w.r.t. the column index.
Let $\delta^+_u$ denote the minimum of the minimum PE distance along the row
axis of any two successive positive monotonic segments of bus $u$. Similarly
let $\delta^-_u$ denote the minimum of the minimum PE distance along the row
axis of any two successive negative monotonic segments of bus $u$. Let
$\delta = \min_{\forall u}\left(\min(\delta^+_u, \delta^-_u)\right)$ and
$$K = \begin{cases} 1 + \left(2 \left\lceil \frac{Q}{\delta} \right\rceil + 1\right) & \text{if } \delta \text{ can be computed} \\ 3 & \text{otherwise.} \end{cases}$$
An example of $\delta$ is given in Figure 4.
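Now consider the sequence of pairs
$$S = \begin{cases} \sum_{j=0}^{\frac{N}{Q} - 1} \left(\mathrm{CSEQ}(j) + \overline{\mathrm{CSEQ}(j)}\right)^{K \operatorname{div} 2} & \text{if } K > 3 \\ \sum_{j=0}^{\frac{N}{Q} - 1} \left(\mathrm{CSEQ}(j) + \overline{\mathrm{CSEQ}(j)} + \mathrm{CSEQ}(j)\right) & \text{otherwise,} \end{cases}$$
where the sum denotes concatenation in order of increasing $j$. We now show
that the sequence $S + \overline{S}$ of length $2K \frac{M}{P} \frac{N}{Q}$ can
be used in step 1.2 of Algorithm SIMPLE for each step of $A$ to achieve a
correct self-simulation.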
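The length claim can be checked by building the sequence for small parameters. The sketch below rests on several assumptions: the minimum segment distance is called `delta` here, the value of $K$ is read as $1 + (2\lceil Q/\delta \rceil + 1)$, every column sequence alternates between forward and reversed traversal, and the concatenation runs over the $\frac{N}{Q}$ columns $0, \ldots, \frac{N}{Q} - 1$. The function names are invented for illustration.

```python
def window_passes(Q, delta=None):
    """K from the proof; delta=None stands for 'delta cannot be computed'."""
    if delta is None:
        return 3
    return 1 + (2 * -(-Q // delta) + 1)   # -(-Q // delta) is the integer ceiling

def cseq(j, rows):
    return [(i, j) for i in range(rows)]

def piecewise_sequence(rows, cols, K):
    """Concatenate, column by column, K alternating passes over the column."""
    S = []
    for j in range(cols):
        up, down = cseq(j, rows), cseq(j, rows)[::-1]
        S += (up + down) * (K // 2) if K > 3 else up + down + up
    return S

rows, cols = 2, 3                     # illustrative M/P and N/Q
K = window_passes(Q=8, delta=4)       # 1 + (2 * 2 + 1) = 6
S = piecewise_sequence(rows, cols, K)
full_order = S + S[::-1]
```

Because $K$ is even whenever it exceeds 3 under this reading, each column contributes exactly $K$ passes, and `full_order` has the stated length $2K \frac{M}{P} \frac{N}{Q}$.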
Let an arbitrary piecewise-monotonic bus $u$ terminate in the submeshes
$R_{a,b}$ and $R_{c,d}$. Now assume $a \le c$. Let the trajectory of the bus
through the various submeshes $R_{i,j}$ follow the sequence of pairs $S_u$. As
this trajectory $S_u$ passes through any submesh $R_{i,j}$ at most $K - 1$
times, it is easy to show that $S_u$ and $\overline{S_u}$ are contained in $S$
and $\overline{S}$ respectively, preserving the order.
Suppose the processor that writes on the bus $u$ resides in the submesh
$R_{p,q}$, $(p, q) \in S_u$. Then after using the sequence $S$ completely in
step 1.2 of Algorithm SIMPLE, the portion of the bus $u$ residing in the
submeshes $R_{i,j}$, where $(i, j) \in S_u$ and the index of the pair $(i, j)$
in $S_u$ is greater than or equal to that of the pair $(p, q)$, is simulated
completely. Similarly, after using the sequence $\overline{S}$ completely, the
remaining portion of the bus $u$ is also completely simulated. □
In Theorem 2 nothing is stated about the slowdown of the self-simulation. In
general the slowdown will not be optimal. However, we can achieve optimal
slowdown for instances where the following conditions are met:
1. For each step of the piecewise-monotonic RM-algorithm $A$, $\delta$ is known
a priori.
2. $\frac{N}{Q}$ is large, a desirable property in self-simulation, so that $K$
can be considered as a constant.
4 Conclusion
In this paper we have presented a self-simulation algorithm SIMPLE for the
monotonic and piecewise-monotonic RM models. Through Algorithm SIMPLE we have
achieved optimal self-simulation for the monotonic RM model and asymptotically
optimal self-simulation for the piecewise-monotonic RM model. We believe that
Algorithm SIMPLE preserves the essence of configurational computation and uses
fewer broadcasts than the algorithms in [1].
References
[1] Yosi Ben-Asher, Dan Gordon, and Assaf Schuster. Efficient self-simulation
algorithms for reconfigurable arrays. Journal of Parallel and Distributed
Computing, 30:1-22, 1995.
[2] Y. Ben-Asher, D. Peleg, R. Ramaswami, and A. Schuster. The power of
reconfiguration. Journal of Parallel and Distributed Computing, 13:139-153,
1991.
[3] Gen-Huey Chen, Biing-Feng Wang, and Hungwen Li. Deriving algorithms on
reconfigurable networks based on function decomposition. Theoretical Computer
Science, 120:215-227, 1993.
[4] Ju-Wook Jang, Heonchul Park, and Viktor K. Prasanna. A fast algorithm for
computing a histogram on reconfigurable mesh. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 17:97-106, 1995.
[5] Ju-Wook Jang, Heonchul Park, and Viktor K. Prasanna. An optimal
multiplication algorithm on reconfigurable mesh. IEEE Transactions on Parallel
and Distributed Systems, 8:521-532, 1997.
[6] Ju-Wook Jang and Viktor K. Prasanna. An optimal sorting algorithm on
reconfigurable mesh. Journal of Parallel and Distributed Computing, 25:31-41,
1995.
[7] Philip D. Mackenzie. A separation between reconfigurable mesh models.
Parallel Processing Letters, 5:15-22, 1995.
[8] Massimo Maresca. Polymorphic processor arrays. IEEE Transactions on Parallel
and Distributed Systems, 4:490-506, 1993.
[9] Russ Miller, V. K. Prasanna Kumar, Dionisios I. Reisis, and Quentin F. Stout.
Data movement operations and applications on reconfigurable VLSI arrays. In
Proc. International Conference on Parallel Processing, pages 205-208, 1988.
[10] M. Manzur Murshed and Richard P. Brent. RMSIM: a serial simulator for
reconfigurable mesh parallel computers. Technical Report TR-CS-97-06, Joint
Computer Science Tech. Report Series, The Australian National University, April
1997.
[11] Koji Nakano. Prefix-sums algorithms on reconfigurable meshes. Parallel
Processing Letters, 5:23-35, 1995.
[12] Madhusudan Nigam and Sartaj Sahni. Sorting n numbers on n × n
reconfigurable meshes with buses. Journal of Parallel and Distributed
Computing, 23:37-48, 1994.
[13] Stephan Olariu and James L. Schwing. A novel deterministic sampling scheme
with applications to broadcast-efficient sorting on the reconfigurable mesh.
Journal of Parallel and Distributed Computing, 32:215-222, 1996.
[14] Stephan Olariu, James L. Schwing, and Jingyuan Zhang. On the power of
two-dimensional processor arrays with reconfigurable bus systems. Parallel
Processing Letters, 1:29-34, 1991.
[15] J. Rothstein. Bus automata, brains, and mental models. IEEE Trans. Syst. Man
Cybern, 18:522-531, 1988.
[16] B. F. Wang. Configurational computation: A new algorithm design strategy on
processor arrays with reconfigurable bus systems. PhD thesis, National Taiwan
University, 1991.
[17] C. C. Weems et al. The image understanding architecture. Internat. J. of
Comput. Vision, 2:251-282, 1989.
