Scaling Simulations of Reconfigurable Meshes. by Fernandez zepeda, Jose Alberto
Louisiana State University
LSU Digital Commons
LSU Historical Dissertations and Theses Graduate School
1999
Scaling Simulations of Reconfigurable Meshes.
Jose Alberto Fernandez zepeda
Louisiana State University and Agricultural & Mechanical College
Follow this and additional works at: https://digitalcommons.lsu.edu/gradschool_disstheses
This Dissertation is brought to you for free and open access by the Graduate School at LSU Digital Commons. It has been accepted for inclusion in
LSU Historical Dissertations and Theses by an authorized administrator of LSU Digital Commons. For more information, please contact
gradetd@lsu.edu.
Recommended Citation
Fernandez zepeda, Jose Alberto, "Scaling Simulations of Reconfigurable Meshes." (1999). LSU Historical Dissertations and Theses.
7081.
https://digitalcommons.lsu.edu/gradschool_disstheses/7081
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
SCALING SIMULATIONS OF 
RECONFIGURABLE MESHES
A Dissertation
Submitted to the Graduate Faculty of the 
Louisiana State University and
Agricultural and Mechanical College 
in partial fulfillment of the 
requirements for the degree of 
Doctor of Philosophy
in
The Department of Electrical and Computer Engineering
by
José Alberto Fernandez Zepeda 
B.S., Universidad Nacional Autonoma de Mexico, 1991 
M.S., Universidad Nacional Autonoma de Mexico, 1994 
December 1999
UMI Number 9960052
UMI Microform 9960052
Copyright 2000 by Bell & Howell Information and Learning Company.
All rights reserved. This microform edition is protected against
unauthorized copying under Title 17, United States Code.
Bell & Howell Information and Learning Company
300 North Zeeb Road
P.O. Box 1346
Ann Arbor, MI 48106-1346
Acknowledgments
I sincerely thank my advisors Dr. Jerry L. Trahan and Dr. Ramachandran Vaidyanathan
for their supervision, guidance, patience, and for all the time that they dedicated to
me during my studies at Louisiana State University.
I am grateful for the financial support provided by Consejo Nacional de Ciencia
y Tecnología (CONACYT), the program Fulbright-IEE, the National Science Foundation,
and the Department of Electrical and Computer Engineering at LSU.
I dedicate this dissertation to my mother Ana María Zepeda for her love and
encouragement throughout my life.
Finally, I would like to thank all my friends for a memorable time at LSU,
especially Anu Bourgeois for her sincere friendship.
Table of Contents
Acknowledgments
List of Figures
Abstract
Chapter
1 Introduction
1.1 Reconfiguration
1.2 Scaling Simulations and Previous Work
1.3 Scope of the Dissertation
1.4 Contributions of This Work
1.5 Organization of the Dissertation
2 Definitions and Terminology
2.1 The R-Mesh
2.2 The FR-Mesh
2.3 The LR-Mesh
2.4 Concurrent Writes
2.5 Contraction and Windows Mappings
3 FR-Mesh Scaling Simulation
3.1 Scaling Simulation Terminology
3.2 Mapping for FR-Mesh
3.3 General Description of the Simulation
3.4 Component Determination
3.4.1 Horizontal Prefix Assimilation
3.4.2 Vertical Prefix Assimilation
3.4.3 Component Numbering
3.4.4 Second Vertical Component Sweep
3.4.5 Second Horizontal Component Sweep
3.5 Data Delivery
3.5.1 Window Homogenization
3.5.2 Second Vertical Data Sweep
3.5.3 Slice Homogenization
3.5.4 Second Horizontal Data Sweep
3.6 Other Write Rules
3.6.1 Simulation 1
3.6.2 Simulation 2
3.6.3 Simulation 3
3.6.4 Simulation 4
3.7 Improved Scaling Simulation of the R-Mesh
3.7.1 Existing R-Mesh Scalability Simulation
3.7.2 The New Simulation
4 Bus Linearization
4.1 Definitions
4.1.1 Graph of an R-Mesh
4.1.2 Mapping R-Mesh Processors to LR-Mesh Processors
4.1.3 Leader Election
4.2 Bus Linearization
4.2.1 Simulation of R-Mesh by LRN-Mesh
4.2.2 Simulation Running Time
4.2.3 Allowing Other Write Rules in Q
4.2.4 Reducing the Size of Z
4.2.5 Exclusive Write for Z
4.3 Scaling Simulations
4.3.1 R-Mesh Scaling Simulation
4.3.2 FR-Mesh Scaling Simulation
4.4 Simulation of R-Mesh by PR-Mesh
5 Simulation of DR-Mesh by LR-Mesh
5.1 The DR-Mesh
5.2 DR-Mesh Simulation Terminology
5.3 DR-Mesh Simulation Description
5.4 Algorithm Going_Out
5.4.1 Procedure Find_A*
5.4.2 Procedure Find_Dout
5.4.3 Procedure Find_A^
5.5 Algorithm Going_In
5.6 Algorithm Correctness
5.7 Simulation Improvements
6 Summary and Future Work
Bibliography
Vita
List of Figures
1.1 A reconfigurable linear array computing the OR function
1.2 Summation of eight binary bits on an R-Mesh
1.3 Example of a graph algorithm
1.4 Simulating an R-Mesh using a smaller R-Mesh
2.1 Internal connections of a $3 \times 5$ R-Mesh
2.2 $3 \times 5$ FR-Mesh
2.3 $3 \times 5$ LR-Mesh
2.4 Mappings of $6 \times 9$ R-Mesh to $3 \times 3$ R-Mesh
3.1 Slices and windows of the simulated $N \times N$ FR-Mesh
3.2 Contraction mapping for an FR-Mesh
3.3 Pseudo-code for component determination
3.4 An illustration of horizontal prefix assimilation
3.5 Example of vertical component sweep
3.6 Pseudo-code for data delivery
3.7 Configurations for window homogenization
3.8 Example showing need for slice homogenization
3.9 Ben-Asher et al. procedure to calculate connected components
3.10 Decomposition of R-Mesh $\mathcal{S}$ into $\mathcal{S}_1$, $\mathcal{S}_2$, $\mathcal{S}_3$, and $\mathcal{S}_4$
3.11 Embedding the incidence matrix in an FR-Mesh
4.1 Type of buses
4.2 Port partitions of an R-Mesh
4.3 Graph of the R-Mesh
4.4 Equivalent group configurations for R-Mesh processors
4.5 Leader election examples
4.6 Linearization procedure
4.7 LR-Mesh simulating an R-Mesh (first part)
4.8 Raking chains of linear nodes in Step 3
4.9 LR-Mesh simulating an R-Mesh (second part)
4.10 Grafting operation
4.11 The worst case scenario for grafting trees
4.12 $1 \times 4$ PR-Mesh
5.1 $3 \times 5$ DR-Mesh
5.2 Representation of connections of a DR-Mesh processor
5.3 Tile $r(1)$, its four sub-tiles $r_1(0), \ldots, r_4(0)$
5.4 Pseudo-code for algorithm Going_Out
5.5 Moving matrices i4^(i - 1) and 4^ 4 ( 2 - 1 )
5.6 Algorithm Going_Out propagates bus data $D_{out}(i)$
5.7 Pseudo-code for algorithm Going_In
5.8 Procedure Find_Din
5.9 Algorithm Going_Out
5.10 Algorithm Going_In
Fernandez Zepeda, José Alberto, B.S., Universidad Nacional Autonoma de Mexico, 1991
M.S., Universidad Nacional Autonoma de Mexico, 1994
Doctor of Philosophy, Fall Commencement, 1999
Major: Electrical Engineering; Minor: Mathematics
Scaling Simulations of Reconfigurable Meshes
Thesis directed by Associate Professor Jerry L. Trahan and Associate Professor
Ramachandran Vaidyanathan
Pages in thesis, 128. Words in abstract, 312.
ABSTRACT
This dissertation deals with reconfigurable bus-based models, a new type of parallel machine
that uses dynamically alterable connections between processors to allow efficient
communication and to perform fast computations. We focus this work on the Reconfigurable Mesh
(R-Mesh), one of the most widely studied reconfigurable models.
We study the ability of the R-Mesh to adapt an algorithm instance of an arbitrary
size to run on a given smaller model size without significant loss of efficiency. A scaling
simulation achieves this adaptation, and the simulation overhead expresses the efficiency of
the simulation. We construct a scaling simulation for the Fusing-Restricted Reconfigurable
Mesh (FR-Mesh), an important restriction of the R-Mesh. The overhead of this simulation
depends only on the simulating machine size and not on the simulated machine size. The
results of this scaling simulation extend to a variety of concurrent write rules and also
translate to an improved scaling simulation of the R-Mesh itself.
We present a bus linearization procedure that transforms an arbitrary non-linear bus
configuration of an R-Mesh into an equivalent acyclic linear bus configuration implementable
on a Linear Reconfigurable Mesh (LR-Mesh), a weaker version of the R-Mesh. This
procedure gives the algorithm designer the liberty of using buses of arbitrary shape, while
automatically translating the algorithm to run on a simpler platform. We illustrate our
bus linearization method through two important applications. The first leads to a faster
scaling simulation of the R-Mesh. The second application adapts algorithms designed for
R-Meshes to run on models with pipelined optical buses.
We also present a simulation of a Directional Reconfigurable Mesh (DR-Mesh) on an
LR-Mesh. This simulation has a much better efficiency compared to previous work. In
addition to the LR-Mesh, this simulation also runs on models that use pipelined optical
buses.
Chapter 1
Introduction
In recent years, reconfigurable models have drawn considerable interest and numerous
fast algorithms have been proposed for them. These models use dynamically
alterable connections between processors not only to allow efficient communication,
but also to perform computation faster than on conventional “non-reconfigurable”
models. Researchers have proposed a number of reconfigurable models including
the Reconfigurable Mesh (R-Mesh) [20, 26, 33], Reconfigurable Network (RN) [6],
Polymorphic Processor Array (PPA) [30], Processor Array with Reconfigurable Bus
System (PARBS) [54], Reconfigurable Multiple Bus Machine (RMBM) [50],
Reconfigurable Buses with Shift Switching (REBSIS) [28], and Distributed Memory Bus
Computer (DMBC) [42]. Nakano [37] presented a bibliography of published research
on reconfigurable models.
This dissertation deals with the ability of a reconfigurable model to adapt an
algorithm instance of an arbitrary size to run on a given smaller model size without
significant loss of efficiency. A scaling simulation achieves this adaptation, and the
simulation overhead expresses the efficiency of the simulation. For most conventional
models, such a scaling simulation is trivial. For reconfigurable models, however, the
problem presents several challenges as explained later.
In this dissertation, we focus our attention on the R-Mesh (one of the most widely
studied reconfigurable models) and some of its variants. We present new and faster
scaling simulations for these models using various write rules. We also demonstrate
how some of these models can simulate each other, and describe some important
applications of these results.
We organize the remainder of this chapter as follows. Section 1.1 illustrates the
main features of reconfigurable models through some examples. Section 1.2 defines
the concept of scaling simulations for models of parallel computation and describes
previous work on this aspect of reconfigurable models. Section 1.3 describes the scope
of the dissertation and Section 1.4 details the main contributions of this work. Finally,
Section 1.5 outlines the organization of the dissertation.
1.1 Reconfiguration
A reconfigurable (bus-based) model operates by creating elaborate patterns of “buses”
between processors. Some of the most important features of such a model are the
following.
1. It uses an internal port connection mechanism to segment or fuse buses.
2. Each processor can independently change its internal port connections at each
step.
3. It assumes a constant propagation delay on buses.
4. It uses its buses as a computational resource.
We now present an example that illustrates these features and their use in
constructing a very fast algorithm. Figure 1.1 shows an eight-processor reconfigurable
linear array (RLA). Each processor connects directly to its neighboring processors
through two ports (left and right). Each processor is permitted to internally connect
or disconnect its ports.
Figure 1.1: A reconfigurable linear array computing the OR function: a) initial
configuration and input data for each processor; b) processors holding ‘1’ split the bus and
write ‘1’ to their left bus; c) processor α broadcasts the result to all the processors.
In this example, the RLA calculates the OR function of eight bits. Each processor
holds an input bit and assumes the port connection depicted in Figure 1.1(a). The
algorithm proceeds as follows.
Each processor that holds a ‘1’ splits the bus by disconnecting its ports; otherwise,
it keeps the bus intact (see Figure 1.1(b)). Each processor that holds a ‘1’ writes to
the bus through its left port and the leftmost processor, α, reads the result of the OR
function from its left port. If all processors hold ‘0’s, then there is no write to the bus
and processor α reads a “null value” indicating that the result is ‘0’. If at least one
processor holds a ‘1’, then processor α reads a ‘1’ from the bus. This value is written
by the processor nearest to α that holds a ‘1’ (processor β in Figure 1.1(b)).
Processors holding ‘0’s just provide an unbroken bus, so the value written by processor β
reaches processor α. Finally, processors connect their ports (as in Figure 1.1(c)) and
α broadcasts the result to all processors.
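The OR algorithm above can be sketched as a small sequential simulation. This is only an illustration under stated assumptions: the function name `rla_or`, the port numbering, and the use of a union-find structure to stand in for electrically fused bus segments are choices made here, not part of the original text.

```python
from typing import Dict, List


def rla_or(bits: List[int]) -> int:
    """Simulate the OR computation on a reconfigurable linear array (RLA).

    Assumed encoding: port 2*i is the left port of processor i and
    port 2*i + 1 is its right port. Ports joined in the union-find
    structure lie on the same bus segment.
    """
    n = len(bits)
    parent = list(range(2 * n))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a: int, b: int) -> None:
        parent[find(a)] = find(b)

    for i in range(n):
        if i + 1 < n:
            # Fixed external link between neighboring processors.
            union(2 * i + 1, 2 * (i + 1))
        if bits[i] == 0:
            # A processor holding 0 keeps the bus intact;
            # a processor holding 1 leaves its ports disconnected.
            union(2 * i, 2 * i + 1)

    # Each processor holding 1 writes a 1 through its left port.
    bus_value: Dict[int, int] = {}
    for i, bit in enumerate(bits):
        if bit == 1:
            bus = find(2 * i)
            # Writers sit on different segments, so writes are exclusive.
            assert bus not in bus_value, "two writers on one bus segment"
            bus_value[bus] = 1

    # The leftmost processor reads its left port; no write means 0.
    return bus_value.get(find(0), 0)
```

The final broadcast of Figure 1.1(c) is omitted; the returned value is what the leftmost processor would broadcast. The assertion inside the write loop checks the exclusive-write property on every input.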
This example illustrates most of the basic features of a reconfigurable model. The
method of this example readily generalizes to a constant-time OR algorithm for $N$
bits on an $N$-processor RLA. Notice how each processor can disconnect or connect its
ports to split the bus or fuse bus segments. Also notice how each processor can change
its port connections at each step. This is a local decision (based only on input data
or data read from the bus) and is independent of the decision taken by neighboring
processors. In general, reconfigurable models are Single-Instruction, Multiple-Data
(SIMD) machines, where all processors execute the same program, the program for
computing the OR function in this case. These models operate synchronously, so
all processors change port connections, write to buses, read from buses, and perform
computations at predefined cycles of a master clock. Notice that in this example we
assume that the propagation delay is constant, so processor α broadcasts the result
to all the processors connected to the bus in a single step. (This assumption is a
good approximation for medium sized machines [26, 30, 33, 42].) In contrast, in an
$N$-processor (non-reconfigurable) linear array, a broadcast takes $O(N)$ time. Notice
how the data paths play an important role in determining the answer to the problem
in question. The Parallel Random Access Machine (PRAM), a very popular model
of parallel computation, requires $\Omega(\log N)$ time to compute the OR function on $N$
bits, if only exclusive writes are allowed. In the above example, the RLA solves the
problem in constant time using only exclusive writes; notice how writing processors
(such as β and γ in Figure 1.1(b)) write to different buses.
The one-dimensional RLA readily extends to a two-dimensional Reconfigurable
Mesh (R-Mesh). An R-Mesh processor has four ports connecting it to its neighbors
to the North, South, East, and West. (We formally define the R-Mesh in Chapter 2.)
In our next illustration, we use an R-Mesh to compute the sum of 8 bits (Figure 1.2);
for simplicity, we show only the bottom four rows of the R-Mesh.
Each processor in the bottom row of the R-Mesh holds one input bit. The
algorithm proceeds as follows. The processor at the bottom of each column broadcasts its
bit to all the processors in its column. Each processor reading a ‘1’ from its vertical
bus connects its West port to its North port, and its East port to its South port;
otherwise, it just connects its West and East ports (see Figure 1.2).
Figure 1.2: Summation of eight binary bits on an R-Mesh.
This internal port connection, combined with the external links between processors,
creates a staircase-like bus structure. In fact, the bus originating at the bottom left
processor (shown in bold in Figure 1.2) steps up by one row for each ‘1’ in the input.
The processor on the bottom left corner writes a signal on its West port and each
processor of the rightmost column reads from its East port. The sum of the input
bits equals $x$ ($x = 2$ in our example) if and only if the signal arrives at the East port
of the rightmost processor of row $x$. In general, this method sums $N$ bits in constant
time on an $(N + 1) \times N$ R-Mesh.
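The staircase construction can be checked with a short sequential sketch. It assumes rows are numbered from 0 at the bottom and models fused ports with a union-find structure; the name `rmesh_sum` and the port encoding `(row, column, side)` are illustrative choices, not taken from the original text.

```python
from typing import Dict, List, Tuple

Port = Tuple[int, int, str]  # (row, column, side) with side in "NESW"


def rmesh_sum(bits: List[int]) -> int:
    """Sum N bits on a simulated (N + 1) x N R-Mesh (row 0 at the bottom)."""
    n = len(bits)
    rows = n + 1
    parent: Dict[Port, Port] = {}

    def find(p: Port) -> Port:
        parent.setdefault(p, p)
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    def union(a: Port, b: Port) -> None:
        parent[find(a)] = find(b)

    for r in range(rows):
        for c in range(n):
            if bits[c] == 1:
                # Column holds a 1: connect West to North, East to South.
                union((r, c, "W"), (r, c, "N"))
                union((r, c, "E"), (r, c, "S"))
            else:
                # Column holds a 0: connect West straight through to East.
                union((r, c, "W"), (r, c, "E"))
            # Fixed external links to the East and North neighbors.
            if c + 1 < n:
                union((r, c, "E"), (r, c + 1, "W"))
            if r + 1 < rows:
                union((r, c, "N"), (r + 1, c, "S"))

    # The bottom-left processor writes a signal on its West port.
    signal_bus = find((0, 0, "W"))
    # Each processor of the rightmost column reads its East port.
    hits = [r for r in range(rows) if find((r, n - 1, "E")) == signal_bus]
    assert len(hits) == 1, "the signal must arrive at exactly one row"
    return hits[0]
```

The input `[0, 0, 0, 1, 0, 0, 0, 1]` holds two ones, so the signal arrives at row 2, in line with the $x = 2$ example in the text.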
The main idea of this algorithm is to configure buses so that a signal sent at a
fixed point of the R-Mesh “arrives at the answer.” Similar techniques can be used to
solve a variety of fundamental problems, such as prefix sums, multiplication, Boolean
matrix multiplication, and sorting [16, 21, 22, 29, 35, 36, 38].
Figure 1.3: Example of a graph algorithm: a) Graph with 2 components; b) Adjacency
matrix; c) Embedding on the R-Mesh. The “bold” buses in (c) correspond to the
component shown in bold in (a).
Another class of algorithms where reconfiguration is advantageous is graph
algorithms. We illustrate one such algorithm in Figure 1.3, where the $N \times N$ adjacency
matrix of an $N$-node graph is directly embedded into an $N \times N$ R-Mesh. This
embedding has the property that nodes $i$ and $j$ have a path between them in the graph if and
only if diagonal R-Mesh processors $(i, i)$ and $(j, j)$ are incident on the same bus. This
property allows the R-Mesh to solve problems like $s$-$t$ connectivity, connected
components, and transitive closure in constant time [54]. Similar techniques have been used
to construct constant time algorithms for other graph problems, such as spanning
trees, biconnected components, Euler tour, and tree traversal [2, 9, 25, 31, 49].
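The embedding property can be illustrated with a brief sketch. The text does not spell out the port configuration, so the one used here is an assumption: processor $(i, j)$ fuses all four ports when $i = j$ or entry $(i, j)$ of the adjacency matrix is 1, and otherwise connects North to South and East to West separately. The function names and the two-node-per-processor encoding are likewise illustrative.

```python
from typing import List


def component_labels(adj: List[List[int]]) -> List[int]:
    """Return a bus label for each diagonal processor (i, i).

    Assumed configuration (not stated in the text): processor (i, j)
    fuses all four ports if i == j or adj[i][j] == 1; otherwise it keeps
    its vertical (N-S) and horizontal (E-W) connections separate.
    Under that assumption every processor has a horizontal port pair and
    a vertical port pair, each modeled as one union-find node.
    """
    n = len(adj)
    parent = list(range(2 * n * n))

    def horiz(i: int, j: int) -> int:
        return 2 * (i * n + j)

    def vert(i: int, j: int) -> int:
        return 2 * (i * n + j) + 1

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a: int, b: int) -> None:
        parent[find(a)] = find(b)

    for i in range(n):
        for j in range(n):
            if j + 1 < n:
                union(horiz(i, j), horiz(i, j + 1))  # row bus
            if i + 1 < n:
                union(vert(i, j), vert(i + 1, j))    # column bus
            if i == j or adj[i][j] == 1:
                union(horiz(i, j), vert(i, j))       # fuse row and column
    return [find(horiz(i, i)) for i in range(n)]


def connected(adj: List[List[int]], s: int, t: int) -> bool:
    """Nodes s and t are connected iff (s, s) and (t, t) share a bus."""
    labels = component_labels(adj)
    return labels[s] == labels[t]
```

On the R-Mesh the comparison of bus labels is replaced by a single write and read on the fused bus, which is what makes the connectivity test constant time; the sequential union-find here only mimics which ports end up on a common bus.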
The planar topology of the R-Mesh also makes it suitable for problems in image
processing (where each processor represents a pixel of the image) and planar
computational geometry. These include image labeling, template matching, histogramming,
convex hull, dominance counting, Voronoi diagrams, and other proximity problems
[1, 10, 11, 18, 20, 23].
1.2 Scaling Simulations and Previous Work
Let $\mathcal{M}(N)$ denote an $N$-processor instance of a model, $\mathcal{M}$, of parallel computation.
A scaling simulation for $\mathcal{M}$ is an algorithm that simulates an arbitrary step of $\mathcal{M}(N)$
on a smaller instance $\mathcal{M}(P)$, for any $P < N$. In general, this simulation runs in
$O\left(\frac{N}{P} f(N, P)\right)$ steps. Clearly, the work of $N$ processors on $P$ processors takes
$\Omega\left(\frac{N}{P}\right)$ steps; therefore, $f(N, P)$, a non-decreasing function, is the simulation
overhead. (The scaling simulation serves to establish that any algorithm designed to run in
$T$ steps on $\mathcal{M}(N)$ can run in $O\left(\frac{N}{P} f(N, P) \cdot T\right)$ steps on $\mathcal{M}(P)$.)
Definition 1 For any $P < N$, let $\mathcal{M}(P)$ simulate a step of $\mathcal{M}(N)$ in
$O\left(\frac{N}{P} f(N, P)\right)$ time.
(i) Model $\mathcal{M}$ has an optimal scaling simulation iff $f(N, P) = O(1)$.
(ii) Model $\mathcal{M}$ has a strong scaling simulation iff $f(N, P)$ is independent of $N$ and
$f(N, P) = o(P)$.
(iii) Model $\mathcal{M}$ has a weak scaling simulation iff it does not have an optimal or strong
scaling simulation. ■
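As an illustration of Definition 1, the following worked instance plugs three overhead functions into the bound. The particular functions and the numbers $N = 1024$, $P = 64$, $T = 10$ are assumptions chosen only for this example; they are not results of this work.

```latex
% Hypothetical overheads, classified according to Definition 1.
\begin{align*}
f(N, P) = 1      &:\quad O\!\left(\tfrac{N}{P}\right)
                   \text{ steps per simulated step (optimal)} \\
f(N, P) = \log P &:\quad O\!\left(\tfrac{N}{P} \log P\right)
                   \text{ (strong: independent of } N \text{ and } o(P)\text{)} \\
f(N, P) = \log N &:\quad O\!\left(\tfrac{N}{P} \log N\right)
                   \text{ (weak: the overhead depends on } N\text{)}
\end{align*}
% Numeric instance with f(N, P) = 1, ignoring constant factors:
\[
\frac{N}{P} \, f(N, P) \cdot T = \frac{1024}{64} \cdot 1 \cdot 10 = 160 \text{ steps on } \mathcal{M}(64).
\]
```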
If a model possesses an optimal scaling simulation, then a programmer need not
be concerned with the actual size of the machine on which a program is to run. In
this case, the best algorithm serves well on all model and problem sizes. On a model
with a strong scaling simulation, a single algorithm will serve all problem sizes as
the simulation overhead is independent of $N$, the simulated model size. A compiler,
that is in any case local to the model or machine instance, can map logical processors
(defined by the algorithm and problem instance) to physical processors. On a model
with a weak scaling simulation, however, the fastest algorithm for given problem
and model sizes may not be the fastest (after scaling) for other problem and model
sizes; it may have to be fine-tuned or possibly even replaced by another algorithm for
different problem sizes. (This approach is taken in practice in some parallel machines,
for example [8, 16].) With a weak scaling simulation, however, different problem sizes
would call for different algorithms, and so different programs, to run in the best
possible time on the same available machine. For a model with an optimal or strong
scaling simulation, algorithmic results have significance whether or not problem size
matches machine size. Algorithm development itself may be easier on these models
(for example, using the work-time framework [19] for PRAM algorithms).
In traditional “non-reconfigurable” models, a large model instance $\mathcal{M}(N)$ can be 
simulated optimally by a small model instance $\mathcal{M}(P)$, simply by letting a processor 
of $\mathcal{M}(P)$ simulate $\frac{N}{P}$ processors of $\mathcal{M}(N)$. Thus these models have optimal scaling 
simulations. The difficulty in scaling algorithms for reconfigurable models stems from 
the fundamentally different way in which they perform computation. A sequence of 
steps typical to many algorithms is as follows: (a) processors configure themselves 
locally to establish a global pattern of buses interconnecting the processors; (b) a 
designated processor issues a special signal at a fixed position in the bus structure;
(c) the processors deduce an answer depending on where the signal arrives. (The 
summing algorithm of Figure 1.2 is an example of such an algorithm.) In this setting, 
consider a problem that has $N$ possible answers. If a model instance is not large 
enough to accommodate $N$ distinct answers (positions for signal arrival), then the 
above method will not work.
Figure 1.4: Simulating an R-Mesh using a smaller R-Mesh (shaded square).
The great variety of bus shapes in an R-Mesh, especially branches and cycles, makes 
it difficult to design an efficient scaling simulation. To illustrate this problem, consider 
Figure 1.4, which shows a 4 × 4 R-Mesh (shown as a shaded square) simulating a 
“window” of an 8 × 12 R-Mesh. In the window, the simulating R-Mesh detects four 
buses that are separated within the window, but are, in fact, part of a single bus. 
The simulating machine has to label each bus, keep track of the buses through the entire 
simulation, and update their labels when they fuse to other buses. This task is not 
trivial because of the large number of possible bus configurations and is the main 
cause of the simulation overhead.
This notion of scaling simulation can be generalized to that of scaling simulations 
between different models. Specifically, we will consider scenarios where a single step 
of $\mathcal{M}_1(N)$, a model of size $N$, is simulated by $\mathcal{M}_2(P)$, a smaller instance of a different 
model, in $O\left(\frac{N}{P} f(N, P)\right)$ time.
Reconfigurable models possess a large body of fast algorithms, yet only a handful 
of results exist for scaling algorithms on these models. Previously, Maresca [30] 
established that the Polymorphic Processor Array (PPA) possesses an optimal scaling 
simulation. The PPA restricts the pattern of buses that can be created, severely 
curtailing the power of the model [50]. Ben-Asher et al. [4] proved that the Linear 
Reconfigurable Mesh (LR-Mesh), a restriction of the R-Mesh, has an optimal scaling 
simulation. The LR-Mesh admits only certain patterns of buses, making it unsuitable 
for some fundamental problems such as graph connectivity. Murshed and Brent [34] 
defined certain global restrictions on bus configurations and designed simpler optimal 
scaling simulations for LR-Meshes under these restrictions. Ben-Asher et al. [4] 
developed a (weak) scaling simulation for an $N \times N$ (unrestricted) R-Mesh on a 
$P \times P$ R-Mesh that has a simulation overhead of $\log N \log \frac{N}{P}$. Matias and Schuster 
[32] proposed a randomized scaling simulation for the unrestricted R-Mesh on the 
LR-Mesh; their method has a constant (with high probability) simulation overhead, 
only when P  < iog'A/i^i^giv» and uses the “A r b i t r a r y ” concurrent-write rule, a 
rule not easily implementable on a bus. On other reconfigurable models, Trahan et 
al. [48] developed an $O(\log N)$ simulation of an $N \times N$ Directional R-Mesh on an 
0{ N ^  X N ‘^ ) LR-Mesh. Trahan and Vaidyanathan [51] have shown that, for certain 
restrictions of local connections, the Reconfigurable Multiple Bus Machine (RMBM) 
has a strong scaling simulation. Trahan et al. [47] developed a number of algorithms 
that scale with optimal overhead on the Linear Array with Reconfigurable Pipelined
Bus System (LARPBS), a reconfigurable model that uses pipelined optical buses for 
communication.
1.3 Scope of the Dissertation
All prior approaches to scaling reconfigurable models either severely restrict the 
simulated model and/or grant extra capabilities to the simulating model (in order to 
achieve constant overhead), or incur a high simulation overhead. We consider a 
restriction of the R-Mesh, called the Fusing Restricted R-Mesh (FR-Mesh), for which 
we construct a strong scaling simulation. The FR-Mesh is as “powerful” as the 
unrestricted R-Mesh [45], though it allows only two of the fifteen internal connections 
possible on the R-Mesh (see Section 2.2). Further, the FR-Mesh admits constant 
time algorithms for fundamental problems (such as s-t connectivity, connected 
components, transitive closure, and cycle detection [2, 25, 54]) that are unlikely to be 
solvable in constant time on the LR-Mesh; many such problems are fundamental to 
algorithm development in general [53].
In Chapter 3, we construct a strong scaling simulation of the FR-Mesh in which 
the simulation overhead is logarithmic in the simulating machine size, and entirely 
independent of the simulated machine size. More precisely, we establish that for any 
$P < N$, a step of an $N \times N$ FR-Mesh (that has $N^2$ processors) can be simulated by 
a $P \times P$ FR-Mesh in $O\left(\frac{N^2}{P^2} \log P\right)$ time.
Additionally, we identify the bottleneck producing the simulation overhead of 
the FR-Mesh scaling simulation as “leader election.” Thus, any improvement in 
techniques for leader election will immediately translate to a further reduction of the 
overheads for scaling simulations of both the FR-Mesh and the R-Mesh. Indeed, an
FR-Mesh that can resolve concurrent writes by the “PRIORITY” rule has an optimal 
scaling simulation (Section 3.6).
Although most of the dissertation uses the “COMMON” concurrent write rule, 
we also consider other rules such as “COLLISION” and “COLLISION$^+$” (Section 3.6) 
that are well known in the context of PRAM algorithms [19, 24]. For these rules, the 
FR-Mesh still has a strong scaling simulation with a logarithmic simulation overhead.
The strong scaling simulation of the FR-Mesh also leads to an improved (weak) 
scaling simulation of the R-Mesh (Section 3.7). The simulation overhead for this 
simulation is $\log P \log \frac{N}{P}$; the previous fastest scaling simulation for the R-Mesh [4] 
had a simulation overhead of $\log N \log \frac{N}{P}$ (see Table 1.1).
The R-Mesh can create bus structures of many different shapes (see Figure 2.1). 
On the one hand, flexibility in shaping buses facilitates algorithm design and can 
reduce running time, but on the other hand, complex bus shapes complicate 
implementation of these models. In Chapter 4, we present a procedure called bus linearization 
that transforms a bus of any shape allowed by the R-Mesh into one with an equivalent 
linear (non-branching) structure.
Specifically, we prove that an $N \times N$ LR-Mesh (an R-Mesh restriction that permits 
only linear buses) can simulate an arbitrary step of an $N \times N$ R-Mesh in $O(\log N)$ time. 
We illustrate the use of bus linearization through two important applications. The 
first constructs the best known deterministic scaling simulation for the R-Mesh. This 
approach has $\log N$ simulation overhead, which improves on the simulation overhead 
of $\log P \log \frac{N}{P}$ (Chapter 3). The second application adapts algorithms designed for 
the R-Mesh to run on reconfigurable models that use optical buses [40, 44, 45].
Bus linearization also improves the FR-Mesh scaling simulation of Chapter 3 to 
a weaker simulating model (see Table 1.1); the simulation in Chapter 3 requires a
“CRCW” FR-Mesh (with the ability to perform concurrent writes), while the 
simulation of Chapter 4 needs only a “CREW” LR-Mesh (without the need to perform 
concurrent writes).
In Chapter 5, we present the simulation of a Directed Reconfigurable Mesh 
(DR-Mesh) on an LR-Mesh. The DR-Mesh has directed buses with the ability to 
restrict data propagation to only one direction. We simulate an $N \times N$ DR-Mesh 
in O(log^ N ) time on an 0(iV x iV x  1^ 7 ) (three-dimensional) R-Mesh and on an 
^  (two-dimensional) R-Mesh. This result is a substantial improve­
ment on the only previous work on this problem [48], where an x N*) R-Mesh
achieves $O(\log N)$ time. Furthermore, we show that the simulating machine can 
be a Pipelined Reconfigurable Mesh (PR-Mesh) or an equivalent pipelined optical 
model, thereby extending the scope of the simulation.
1.4 Contributions of This Work
On the whole, this dissertation provides a better understanding of the scalability of 
reconfigurable models in general, and of the R-Mesh and its variants in particular. It 
presents new approaches to problems in scalability, many of which could be useful for 
scaling simulations on other reconfigurable models as well. We present several 
simulations among reconfigurable models. Tables 1.1 and 1.2 summarize the contribution 
of these simulations in the context of existing results.
Some algorithms run more efficiently on LR-Meshes than on FR-Meshes and vice 
versa. The class of separable R-Mesh algorithms comprises algorithms in which each 
step runs on either an LR-Mesh or an FR-Mesh. Separable algorithms provide an 
extremely rich array of fast and efficient algorithmic building blocks. The main result 
in Chapter 3 is a strong scaling simulation of the FR-Mesh. This result, coupled with
the optimal scaling simulation for the LR-Mesh [4], allows separable algorithms to 
scale with an overhead that depends only on the simulating machine size.
Table 1.1: Summary of scaling simulations
Simulated model   Simulating model            Simulation overhead            Reference
CRCW LR-Mesh      CRCW LR-Mesh                $O(1)$                         [4]
CRCW FR-Mesh      CRCW FR-Mesh                $O(\log P)$                    This work
                  CREW LR-Mesh                $O(\log P)$                    This work
                  CREW PR-Mesh                $O(\log P)$                    This work
CRCW R-Mesh       CRCW R-Mesh                 $O(\log N \log \frac{N}{P})$   [4]
                  CRCW R-Mesh                 $O(\log P \log \frac{N}{P})$   This work
                  COLLISION CRCW LR-Mesh†     $O(\log P)$ w.h.p.             [32]
                  ARBITRARY CRCW LR-Mesh†     $O(1)$ w.h.p.                  [32]
                  CREW LR-Mesh                $O(\log N)$                    This work
                  CREW PR-Mesh                $O(\log N)$                    This work
The sizes of the simulated and simulating models are $N \times N$ and $P \times P$, respectively. 
†The simulating model is randomized and its simulation overhead holds with high probability.
The main result of Chapter 4 is the construction of a bus linearization algorithm. 
This algorithm transforms any R-Mesh algorithm to run on the optimally scalable 
LR-Mesh. Bus linearization gives an algorithm designer the liberty of using buses 
of arbitrary shape, while automatically translating the algorithm to run on a more 
implementable platform. Bus linearization runs on an LR-Mesh with only 
exclusive writes, whereas the simulating machine of all prior scaling simulations required 
concurrent writes (see Table 1.1).
Furthermore, bus linearization also transforms FR-Mesh algorithms to run on an 
LR-Mesh. This feature automatically allows any separable algorithm to run on an 
LR-Mesh while maintaining a simulation overhead of $\log P$. Bus linearization also 
facilitates the simulation of the R-Mesh and FR-Mesh on reconfigurable pipelined optical 
models, which require linear acyclic buses. Thus, the importance of bus linearization
lies in its use in translating the vast body of R-Mesh and FR-Mesh algorithms to run 
on the LR-Mesh and reconfigurable pipelined optical models.
Table 1.2: Summary of DR-Mesh simulations
Simulated model Simulating model size Simulating model Time Reference
0{N* X N*) CRCW R-Mesh O(logiV) [48]
CRCW DR-Mesh CREW  LR-Mesh 0 ( l o ^ N ) This work
o {n ^ X E p r ) CREW  PR-Mesh O(log^iV) This work
The size of the simulated model is $N \times N$.
The main result of Chapter 5 is a simulation of a CRCW DR-Mesh on a CREW 
LR-Mesh, which can be extended to reconfigurable pipelined optical models. Ben-Asher 
et al. [5] established that a constant time simulation of an $N \times N$ DR-Mesh on an 
R-Mesh bounded with a polynomial number of processors is not likely, while Trahan 
et al. [48] designed a simulation of each DR-Mesh step on an 0 {N ^  x  iV^) R-Mesh in 
$O(\log N)$ time. The target of our simulation is to reduce the number of processors. Its 
contribution is a useful technique that dramatically reduces the size of the simulating 
model by a factor of 0 {N ^  log^ N ). Table 1.2 shows these results.
In addition to these broad results, this work has also generated several tools 
and techniques that may be of independent interest, such as prefix assimilation 
(Section 3.4.1), double bus structure (Section 4.1.2), and a terse representation for 
connectivity within R-Mesh “tiles” (Section 5.4).
1.5 Organization of the Dissertation
Chapter 2 defines the R-Mesh, some of its variants, and basic concepts such as 
concurrent write rules and leader election. Chapter 3 describes the FR-Mesh simulation
and its use in obtaining an improved scaling simulation for the R-Mesh. Chapter 4 
presents bus linearization and its applications. Chapter 5 deals with the simulation 
of the directed R-Mesh on an LR-Mesh. Finally, Chapter 6  summarizes this work 
and presents some directions for future work in the area.
Chapter 2
Definitions and Terminology
This chapter describes the reconfigurable models we use in this work. Also discussed 
are the definitions of several concurrent write rules and some mapping techniques 
used in scaling simulations.
2.1 The R-Mesh
An $R \times C$ Reconfigurable Mesh (R-Mesh) is a two-dimensional array of processors 
connected in an $R \times C$ grid. Each processor in the R-Mesh has direct connections 
to adjacent processors through its North, South, West, and East input/output ports. 
A processor can internally partition its set of four ports so that all ports in the 
same block of a partition are fused. This allows the R-Mesh to construct various 
bus patterns and to change them dynamically according to the requirements of the 
problem at hand. Figure 2.1 shows a 3 × 5 R-Mesh, depicting the fifteen possible port 
partitions of a processor. These partitions, along with external connections between 
processors, define a global bus structure consisting of a set of buses that weave through 
the ports. A component is a set of buses and ports that have a common connection. 
The R-Mesh and all its restricted versions assume a constant propagation delay on 
buses [26, 33, 42].
Figure 2.1: Internal connections of a 3 × 5 R-Mesh.
2.2 The FR-Mesh
The Fusing Reconfigurable Mesh (FR-Mesh) is a restricted version of the R-Mesh. 
Trahan et al. [45] proved that the class of languages accepted by an FR-Mesh is 
equivalent to the class $SL$ (the same as the R-Mesh) of languages accepted in symmetric 
logarithmic space on a Turing machine. That is, the FR-Mesh is as “powerful” as 
the R-Mesh though it allows only two of the fifteen internal connections possible on 
the R-Mesh, a fusing and a cross-over connection. A fusing connection joins all four 
ports (processor (0,2) in Figure 2.2), and a cross-over connection joins the North 
port with the South port and the West port with the East port (processor (0,0) in 
Figure 2.2). Because of the FR-Mesh connections, assume without loss of generality 
that each processor in an FR-Mesh has only two ports, the vertical port (North and 
South ports) and the horizontal port (East and West ports). The connections in an 
FR-Mesh allow a processor to directly connect to any other processor in its row (or 
column) via a horizontal (or vertical) bus. Figure 2.2 shows a 3 × 5 FR-Mesh with 
one component shown in bold. This component consists of buses in row 1 and column 
3 and all the ports connected to these buses.
Figure 2.2: 3 × 5 FR-Mesh.
2.3 The LR-Mesh
The Linear Reconfigurable Mesh (LR-Mesh) is a restricted version of the R-Mesh. 
Each port in the LR-Mesh can connect to at most one other port in the same 
processor, so the LR-Mesh allows ten of the fifteen connections of the R-Mesh (all of 
the configurations of Figure 2.1 except those of processors (0,3), (1,4), (2,1), (2,2), 
and (2,3)). Ben-Asher et al. [4] constructed an optimal scaling simulation for the 
LR-Mesh. Ben-Asher et al. [5] also proved that the class of languages accepted by 
an LR-Mesh (resp., R-Mesh) is equivalent to the class $L$ (resp., $SL$) of languages 
accepted in logarithmic space (resp., symmetric logarithmic space [39]) on a 
Turing machine. Although it has not been proved, the class $L$ is conjectured to be a 
proper subset of the class $SL$, so the LR-Mesh is likely to be a weaker model than 
the R-Mesh. Figure 2.3 shows a 3 × 5 LR-Mesh with some typical bus configurations. 
The figure also shows a component in bold.
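The connection counts quoted for the three models can be checked by enumerating the partitions of a processor's four ports. The following sketch does so; the port names and the helper names `set_partitions` and `canonical` are choices made for this illustration only.

```python
PORTS = ("N", "E", "S", "W")

def set_partitions(items):
    """Yield every partition of `items` as a list of blocks (lists)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Place `first` into each existing block in turn ...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ... or give it a block of its own.
        yield [[first]] + partition

def canonical(partition):
    """Order-independent form of a partition, usable as a set element."""
    return frozenset(frozenset(block) for block in partition)

# R-Mesh: every partition of the four ports is allowed.
R_MESH = {canonical(p) for p in set_partitions(PORTS)}
# LR-Mesh: a port connects to at most one other port, so blocks have size <= 2.
LR_MESH = {p for p in R_MESH if all(len(block) <= 2 for block in p)}
# FR-Mesh: only the fusing and the cross-over connections.
FUSING = canonical([["N", "E", "S", "W"]])
CROSS_OVER = canonical([["N", "S"], ["E", "W"]])
FR_MESH = {FUSING, CROSS_OVER}

if __name__ == "__main__":
    print(len(R_MESH), len(LR_MESH), len(FR_MESH))
```

The cross-over connection is itself linear, whereas the fusing connection is one of the five partitions that the LR-Mesh excludes.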
Figure 2.3: 3 × 5 LR-Mesh.
2.4 Concurrent Writes
Most of this dissertation deals with concurrent read, concurrent write (CRCW) 
reconfigurable models, in which several processors may simultaneously read from or
write to the same bus. Concurrent writes are resolved by the COMMON, COLLISION, 
COLLISION$^+$, PRIORITY, or ARBITRARY rules. These rules are well known in the 
context of PRAM algorithms [19, 24]. The COMMON rule allows concurrent writes 
only if all the values written to a component are equal. Under the COLLISION rule, if 
more than one processor attempts to write to a component, then a collision symbol 
is written. The COLLISION$^+$ rule behaves like the COMMON rule when all processors 
attempt to write the same value to a component, and like COLLISION otherwise. In 
the PRIORITY rule, the processor with highest priority among those attempting to 
write (usually the lowest indexed processor) wins the write conflict and writes its 
value. In the ARBITRARY rule, an arbitrary processor wins the write conflict and 
writes its value.
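The five rules can be summarized as a single resolution function over the writes that reach one component. This is a minimal sequential sketch: the function name `resolve`, the `(processor_index, value)` representation, and the collision marker `"#"` are assumptions of the example, not part of the model definitions.

```python
COLLISION_SYMBOL = "#"  # hypothetical marker for a detected collision

def resolve(rule, writes):
    """Return the bus data produced by `writes` under `rule`.

    `writes` is a list of (processor_index, value) pairs for one component.
    Returns None (null bus data) when no processor writes.
    """
    if not writes:
        return None
    values = [value for _, value in writes]
    if rule == "COMMON":
        if len(set(values)) != 1:
            raise ValueError("COMMON requires all written values to be equal")
        return values[0]
    if rule == "COLLISION":
        # More than one writer always yields the collision symbol.
        return values[0] if len(writes) == 1 else COLLISION_SYMBOL
    if rule == "COLLISION+":
        # Agreeing writers behave as under COMMON; otherwise a collision.
        return values[0] if len(set(values)) == 1 else COLLISION_SYMBOL
    if rule == "PRIORITY":
        # Lowest-indexed processor is taken to have the highest priority.
        return min(writes, key=lambda w: w[0])[1]
    if rule == "ARBITRARY":
        # Any writer may win; this sketch simply picks the first one listed.
        return writes[0][1]
    raise ValueError(f"unknown rule: {rule}")
```

For example, `resolve("COLLISION+", [(3, 7), (1, 7)])` yields 7, while the same writes under `"COLLISION"` yield the collision marker.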
On a distributed shared resource such as a bus, only certain write rules such as 
COMMON, COLLISION, and COLLISION$^+$ [4] are feasible. Other rules, such as 
ARBITRARY and PRIORITY, are not feasible to implement physically on a bus, 
but they are nevertheless a very useful algorithmic abstraction 
that simplifies algorithm design. In Sections 3.6 and 4.2.3, we present procedures to 
simulate models with different write rules, and in Section 4.2.5, we present a procedure 
to remove the need for concurrent writes on the LR-Mesh.
In the following discussion, we refer to the process of choosing an element from 
a candidate pool by the PRIORITY (resp., ARBITRARY) rule as priority resolution 
(resp., arbitrary selection).
Leader election: This is a procedure that, for each component, selects a leader 
among a set of marked processors. (Marked processors may, for example, be the 
processors attempting to write.) The following procedure performs leader election by 
priority resolution. Let each processor have a unique $O(\log P)$-bit key (the key may be 
its index). By examining the keys bit-by-bit, the procedure reduces the set of potential 
candidates for the leader so that the elected leader is one with a highest (or lowest) 
key. Consequently, a COMMON CRCW R-Mesh can perform leader election among 
$P$ processors in $O(\log P)$ time. Section 3.6.3 describes procedures to perform leader 
election by priority resolution using the COMMON, COLLISION, and COLLISION$^+$ 
rules.
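The bit-by-bit narrowing described above can be sketched sequentially as follows. This is only an illustration of the idea for a single component, not the procedure of Section 3.6.3; the function name `elect_leader` and the convention that the highest key wins are assumptions of the example.

```python
def elect_leader(keys, bits):
    """Return the largest key among `keys`, using one round per key bit.

    `keys` must be a non-empty collection of distinct integers in
    the range 0 .. 2**bits - 1 (the marked processors of one component).
    """
    candidates = set(keys)
    for b in reversed(range(bits)):  # most significant bit first
        # Every candidate whose bit b is 1 writes the same signal to the bus.
        # All writers agree, so the COMMON rule permits the concurrent write.
        signal = any((key >> b) & 1 for key in candidates)
        if signal:
            # Candidates that read the signal but hold a 0 in bit b withdraw.
            candidates = {key for key in candidates if (key >> b) & 1}
    (leader,) = candidates  # distinct keys leave exactly one candidate
    return leader
```

With keys of O(log P) bits the loop runs O(log P) rounds, matching the bound stated in the text. For instance, `elect_leader([5, 9, 12, 3], 4)` returns 12.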
Figure 2.4: Mappings of 6 × 9 R-Mesh to 3 × 3 R-Mesh, where numbers indicate 
processors of the 3 × 3 R-Mesh: (a) Contraction mapping; (b) Windows mapping.
2.5 Contraction and Windows Mappings
Let $\mathcal{Q}$ be an $N \times N$ R-Mesh and $\mathcal{R}$ a $P \times P$ R-Mesh, where $P < N$. A scaling 
simulation for an R-Mesh involves the simulation of a step of $\mathcal{Q}$ on $\mathcal{R}$. For this, 
processors of $\mathcal{Q}$ must map to processors of $\mathcal{R}$.
The most obvious mapping is to let each processor of $\mathcal{R}$ simulate an $\frac{N}{P} \times \frac{N}{P}$ 
“sub-R-Mesh” of $\mathcal{Q}$. Ben-Asher et al. [4] called this the contraction mapping (see 
Figure 2.4(a), where the bold processors map to processor 1 of $\mathcal{R}$).
The windows mapping [4] divides $\mathcal{Q}$ into $\frac{N^2}{P^2}$ “windows”, each a $P \times P$ sub-R-Mesh. 
In the simulation, each processor of $\mathcal{R}$ simulates the same processor from each window 
(see Figure 2.4(b), where the bold processors map to processor 1 of $\mathcal{R}$).
Ben-Asher et al. [4] proved that the contraction mapping will not allow an optimal 
scaling simulation for the LR-Mesh. Using a similar argument, an FR-Mesh scaling 
simulation that uses the contraction mapping has an overhead of $\Omega(\sqrt{N})$ (proved in
Section 3.2). Hence, we use the windows mapping to perform the scaling simulation 
of the FR-Mesh (see Chapter 3).
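The two mappings differ only in how a processor index of the simulated mesh is reduced to one of the simulating mesh. The sketch below states both as index functions, assuming row-column coordinates that start at 0 and that P divides N; the function names and the helper `load` are hypothetical.

```python
from collections import Counter

def contraction(i, j, N, P):
    """Contraction mapping: each (N/P) x (N/P) block of the simulated mesh
    collapses onto one processor of the simulating mesh."""
    k = N // P
    return (i // k, j // k)

def windows(i, j, N, P):
    """Windows mapping: the simulated mesh is cut into P x P windows, and a
    simulating processor handles the same position in every window."""
    return (i % P, j % P)

def load(mapping, N, P):
    """Count how many simulated processors each simulating processor receives."""
    return Counter(mapping(i, j, N, P) for i in range(N) for j in range(N))

if __name__ == "__main__":
    print(contraction(1, 1, 6, 3), windows(3, 4, 6, 3))
    print(set(load(contraction, 6, 3).values()), set(load(windows, 6, 3).values()))
```

Both mappings assign (N/P)² simulated processors to each simulating processor; they differ in which processors share a host, which is what the lower-bound argument for the contraction mapping exploits.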
Chapter 3
FR-Mesh Scaling Simulation
This chapter presents a strong scaling simulation for the FR-Mesh [13, 14]. The 
following theorem summarizes the main result of this chapter.
Theorem 3.1 For any $P < N$, any step of an $N \times N$ COMMON CRCW FR-Mesh 
can be simulated on a $P \times P$ COMMON CRCW FR-Mesh in $O\left(\frac{N^2}{P^2} \log P\right)$ time. ■
The main objective is to simulate an arbitrary step of an $N \times N$ FR-Mesh, $\mathcal{Q}$, on 
a $P \times P$ FR-Mesh, $\mathcal{R}$. In the simulation, each processor of $\mathcal{R}$ simulates $\frac{N^2}{P^2}$ processors 
of $\mathcal{Q}$. Therefore, $\mathcal{R}$ can simulate local actions of processors of $\mathcal{Q}$ in $O\left(\frac{N^2}{P^2}\right)$ time. The
global structure of $\mathcal{Q}$ is more difficult to simulate.
Chapter 3 is organized as follows. Section 3.2 establishes a lower bound for the 
FR-Mesh scaling simulation using the contraction mapping. Section 3.3 gives a 
general description of the FR-Mesh scaling simulation. Sections 3.4 and 3.5 describe the 
two main parts into which we have divided the FR-Mesh scaling simulation and that 
constitute the proof of Theorem 3.1. Section 3.6 explains how to modify the FR-Mesh 
scaling simulation to accommodate different write rules. Finally, Section 3.7 describes 
a new R-Mesh scaling simulation using the FR-Mesh.
Figure 3.1: Slices and windows of the simulated $N \times N$ FR-Mesh.
3.1 Scaling Simulation Terminology
Let $\mathcal{Q}$ be the simulated machine, an $N \times N$ CRCW FR-Mesh. Let $\mathcal{R}$ be the simulating 
machine, a $P \times P$ CRCW FR-Mesh, where $P < N$. Without loss of generality, assume 
that $\frac{N}{P}$ is an integer. We now define the terminology used in presenting a scaling 
simulation.
Slice: The simulated FR-Mesh, $\mathcal{Q}$, contains $\frac{N}{P}$ slices, each an $N \times P$ sub-FR-Mesh 
(see Figure 3.1). Denote slice $v$ by $S_v$, where $0 \le v < \frac{N}{P}$.
Window: Each slice, $S_v$, contains $\frac{N}{P}$ windows, each a $P \times P$ sub-FR-Mesh (see 
Figure 3.1). For $0 \le u, v < \frac{N}{P}$, denote window $u$ of $S_v$ by $W_{u,v}$.
Bus index: This is an identifier assigned to each horizontal and vertical bus in 
$\mathcal{Q}$ according to its position. The horizontal bus in row $i$ has bus index $i$, while the 
vertical bus in column $j$ has bus index $j + N$, where $0 \le i, j < N$. For $0 \le b < 2N$, 
let $bus(b)$ denote the bus with index $b$.
Bus data: This is the value available on a bus at the end of a write cycle. If there 
is no write on a bus at the current cycle, then the bus data on the bus is said to be 
null; otherwise, it is non-null. Since the FR-Mesh permits resolution of concurrent 
writes by rules such as COLLISION, the data a port reads from the bus may not be 
the same as the value the same port wrote to the bus.
Component number: This is an identifier assigned to a component. In the 
simulation presented in this chapter, the component number is equal to the largest 
bus index among all buses in the component. Initially, when the simulation is unaware 
of any connections between buses of $\mathcal{Q}$, it assigns to each bus its bus index as its 
component number.
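To make the bus index and component number definitions concrete, the following sequential sketch computes component numbers for a whole FR-Mesh with a union-find structure. It is a stand-in for illustration only and is not the window-by-window procedure developed in this chapter; the function name `component_numbers` and its input format are assumptions.

```python
def component_numbers(N, fusing):
    """Map each bus index 0 .. 2N-1 of an N x N FR-Mesh to its component number.

    `fusing` lists the (row, column) positions that hold a fusing connection;
    every other processor is assumed to hold a cross-over connection.
    """
    parent = list(range(2 * N))  # initially each bus is its own component

    def find(b):
        while parent[b] != b:
            parent[b] = parent[parent[b]]  # path halving
            b = parent[b]
        return b

    for i, j in fusing:
        h, v = find(i), find(j + N)  # horizontal bus i, vertical bus j + N
        if h != v:
            # Keep the larger index as the root, so that the root is always
            # the largest bus index in the component.
            parent[min(h, v)] = max(h, v)
    return [find(b) for b in range(2 * N)]

if __name__ == "__main__":
    # 3 x 3 FR-Mesh: row 1 fuses with columns 0 and 2.
    print(component_numbers(3, [(1, 0), (1, 2)]))
```

With no fusing connections the result is simply the list of bus indices, which corresponds to the initial assignment described above.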
3.2 Mapping for FR-Mesh
Ben-Asher et al. [4] established that simulating the LR-Mesh using the contraction 
mapping (see Section 2.5) requires $\Omega(N)$ overhead, which does not allow an optimal 
or even a strong scaling simulation for the LR-Mesh.
By a similar argument, the required overhead to simulate an FR-Mesh using the 
contraction mapping is $\Omega(\sqrt{N})$, and so we cannot use the contraction mapping to 
design a strong scaling simulation for the FR-Mesh.
Lemma 3.2 For any $P < N$, simulating any step of an $N \times N$ FR-Mesh on a $P \times P$ 
FR-Mesh using the contraction mapping requires $\Omega(\sqrt{N})$ time.
Proof: We construct an example of a bus configuration in an $N \times N$ FR-Mesh, $\mathcal{Q}$, 
that requires $\Omega(\sqrt{N})$ steps to simulate using a $P \times P$ FR-Mesh, $\mathcal{R}$. First, we describe
this bus configuration.
Assign each vertical and horizontal bus in $\mathcal{Q}$ a label between 1 and $\sqrt{N}$; the labels 
for vertical buses, starting at the left, are as follows:
$1, 2, 1, 3, 1, 4, \ldots, 1, \sqrt{N}, 2, 3, 2, 4, \ldots, 2, \sqrt{N}, 3, 4, 3, 5, \ldots, 3, \sqrt{N}, \ldots, \sqrt{N} - 1, \sqrt{N}$. 
The labels for horizontal buses, starting at the top, are as follows:
$1, 2, 3, \ldots, \sqrt{N}, 1, 2, 3, \ldots, \sqrt{N}$, and so on.
Each processor at the intersection of a vertical bus and a horizontal bus that have the 
same label sets a fusing connection; all other processors set cross-over connections. 
Figure 3.2 shows this bus configuration.
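The labelling used in the proof can be generated mechanically, which makes the pairing pattern easier to see. The sketch below assumes N is a perfect square; the function names are hypothetical. The listed vertical sequence contains one pair of adjacent columns for every pair of distinct labels, which gives N − √N labels in total, and the sketch generates exactly those.

```python
import math

def vertical_labels(N):
    """Labels of the vertical buses, left to right: every pair a < b of labels
    in 1 .. sqrt(N) appears once as two adjacent columns labelled a, b."""
    s = math.isqrt(N)
    if s * s != N:
        raise ValueError("this sketch assumes N is a perfect square")
    labels = []
    for a in range(1, s):
        for b in range(a + 1, s + 1):
            labels += [a, b]
    return labels

def horizontal_labels(N):
    """Labels of the horizontal buses, top to bottom: 1 .. sqrt(N), repeated."""
    s = math.isqrt(N)
    return [(i % s) + 1 for i in range(N)]

def fusing_positions(N):
    """(row, column) positions where the two bus labels coincide."""
    rows, cols = horizontal_labels(N), vertical_labels(N)
    return [(i, j) for i, r in enumerate(rows) for j, c in enumerate(cols) if r == c]

if __name__ == "__main__":
    print(vertical_labels(16))
    print(horizontal_labels(16)[:5])
```

Because every two labels occupy some pair of adjacent columns, a simulating processor column that covers such a pair under the contraction mapping must serve both labels, which is the conflict the proof goes on to exploit.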
Figure 3.2: Contraction mapping for an FR-Mesh prevents a strong scaling simulation.
Assume that $P = \frac{N}{2}$. By using the contraction mapping, a single processor of 
the simulating machine, $\mathcal{R}$, performs the work of four processors of $\mathcal{Q}$ (dotted lines 
in Figure 3.2). Vertical buses with label 1 share processors with vertical buses with
labels $2, 3, 4, \ldots, \sqrt{N}$. Each column of processors of $\mathcal{R}$ can simulate only one vertical 
bus of $\mathcal{Q}$ at a time. When $\mathcal{R}$ simulates the component with label 1 (bold bus in 
Figure 3.2), it cannot simulate components with labels $2, 3, 4, \ldots, \sqrt{N}$ at the same 
time. Similarly, vertical buses with label 2 share processors with vertical buses with 
labels $1, 3, 4, \ldots, \sqrt{N}$. When $\mathcal{R}$ simulates the component with label 2, it cannot 
simulate components with labels $1, 3, 4, \ldots, \sqrt{N}$ at the same time, and so on.
In general, given the above configuration, $\mathcal{R}$ can simulate at most one component 
of $\mathcal{Q}$ at a time. Consequently, the simulation of $\mathcal{Q}$ by $\mathcal{R}$ using the contraction mapping 
takes $\Omega(\sqrt{N})$ time. ■
We will therefore use the windows mapping [4] (see Figure 2.4(b)).
3.3 General Description of the Simulation
Ben-Asher et al. [4] developed an optimal scaling simulation for the LR-Mesh that 
uses the windows mapping. This algorithm uses the fact that each bus has only two 
end-points (as each bus in the LR-Mesh is linear); consequently, it is possible to track 
buses across windows by considering at most two ports per bus. On the other hand, 
a bus in the FR-Mesh can have $O(P)$ end-points in a window (because of fusing 
connections) and necessitates an entirely different approach.
The simulation of a step of $\mathcal{Q}$, an $N \times N$ FR-Mesh, on $\mathcal{R}$, a $P \times P$ FR-Mesh, 
progresses in two phases. During the first phase, component determination, the 
simulating machine $\mathcal{R}$ labels buses of $\mathcal{Q}$ with their component numbers. This allows 
$\mathcal{R}$ to treat ports with the same component number as if they were connected by a 
common bus, without the need to physically configure its buses exactly as in $\mathcal{Q}$. $\mathcal{R}$ 
ascertains the connection pattern of $\mathcal{Q}$ gradually by performing a window-by-window 
vertical sweep down each slice of $\mathcal{Q}$, and performing a slice-by-slice horizontal sweep
across $\mathcal{Q}$. The biggest obstacle is that fusing connections outside a window can affect 
the component numbers of buses that are not connected within the window. Our 
main approach to overcoming this problem is assimilating into the current window 
the effect of connections in previously examined windows.
After assigning a component number to each bus, $\mathcal{R}$ proceeds to the second phase, 
data delivery. This phase conveys data written to one or more ports of a component to 
all ports of the component (this data depends on the concurrent write rule). During 
this phase, $\mathcal{R}$ sweeps over each window of $\mathcal{Q}$, detecting and recording processors' 
attempts to write to each component. During a second sweep, $\mathcal{R}$ delivers the data that 
appears on each component to all processors that simulate processors of $\mathcal{Q}$ reading 
that component. Section 3.4 describes component determination and Section 3.5 
describes data delivery.
3.4 Component Determination
During the simulation, $\mathcal{R}$ treats $\mathcal{Q}$ as a series of $\frac{N}{P}$ slices. Component 
determination includes two horizontal sweeps across these slices to label the components. In 
the first sweep, $\mathcal{R}$ applies the following procedures to each slice of $\mathcal{Q}$ in turn. 
Horizontal prefix assimilation embeds in the current slice the effect of bus fusings of all 
preceding slices by selectively adding fusing connections to the current slice. 
After horizontal prefix assimilation, $\mathcal{R}$ can consider the current slice in isolation. To 
simulate the current slice, $\mathcal{R}$ now moves downwards, applying a sequence of vertical 
prefix assimilation and component numbering procedures to each window in the slice. 
Like horizontal prefix assimilation, vertical prefix assimilation embeds in the current 
window the effect of bus fusings of all previous windows in the slice. After applying 
vertical prefix assimilation to a window, component numbering uses the leader 
election procedure (Section 2.4) to update the component numbers for that window. The 
update includes the effect of fusing connections in all previously visited windows and 
slices. The sequence of calls to the vertical prefix assimilation and component 
numbering procedures concludes in the bottom window of the slice. Component numbers 
of vertical buses in the bottom window include the effects of the entire slice, so $\mathcal{R}$ 
broadcasts these to vertical buses in upper windows, and from there to any horizontal 
buses connected to them.
At this point in the simulation, for each bus in the slice, R holds the component
number incorporating the effects of that slice and all slices to its left in Q, but none
of the slices to the right. After completing the first horizontal sweep across all slices
in Q, R holds the final component number of each horizontal bus in Q. In the
second horizontal sweep, R broadcasts these values to each vertical bus connected
to a horizontal bus. The pseudo-code in Figure 3.3 describes the organization of the
component determination phase.
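The loop organization of component determination can be sketched as a small Python generator. This is only an illustrative skeleton, assuming that P divides N; the function name and the tuple format are hypothetical and not part of the original algorithm description.

```python
def component_determination_schedule(N, P):
    """Yield the procedure calls of component determination in order.

    Each item is (procedure name, slice index v, window index u);
    an index is None when it does not apply. Hypothetical helper.
    """
    assert N % P == 0, "sketch assumes P divides N"
    slices = N // P    # Q is treated as N/P slices of size N x P
    windows = N // P   # each slice holds N/P windows of size P x P
    for v in range(slices):
        yield ("horizontal prefix assimilation", v, None)
        for u in range(windows):
            yield ("vertical prefix assimilation", v, u)
            yield ("component numbering", v, u)
        yield ("second vertical component sweep", v, None)
    yield ("second horizontal component sweep", None, None)


schedule = list(component_determination_schedule(N=8, P=2))
```

With N = 8 and P = 2 there are four slices of four windows each, so the schedule contains one horizontal prefix assimilation, eight window-level calls, and one vertical sweep per slice, followed by a single final horizontal sweep.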
Sections 3.4.1 to 3.4.5 explain horizontal prefix assimilation, vertical prefix
assimilation, component numbering, the second vertical component sweep, and the second
horizontal component sweep.
3.4.1 Horizontal Prefix Assimilation
Horizontal prefix assimilation is key to the scaling simulation. It embeds in S_v the
effect of fusing connections in slices S_0, ..., S_{v-1}. R accomplishes this task by
strategically adding fusing connections to the slice. Let C be a component such that at
least one bus in C is fused within the slice. The essence of horizontal prefix
assimilation is to select for each such component C a unique vertical bus b that can be
used to connect horizontal buses in component C. That is, R fuses all horizontal buses
known to be in component C (because of fusing to the left of the current slice) to
vertical bus b within the current slice.

for v ← 0 to N/P - 1 do                (for slice S_v in Q)
    horizontal prefix assimilation
    for u ← 0 to N/P - 1 do            (for window W_{u,v} in S_v)
        vertical prefix assimilation
        component numbering
    end
    second vertical component sweep
end
second horizontal component sweep

Figure 3.3: Pseudo-code for component determination.
Example: Figure 3.4(a) shows the effect of fusings in previous slices on the current
slice. Figure 3.4(b) shows how horizontal prefix assimilation embeds these fusings in
the current slice by adding some fusing connections (shown circled). R chooses
vertical bus B to fuse the horizontal buses of component {2,6} within the slice. Similarly,
R chooses vertical bus A for component {3,5}. On the other hand, components {4}
and {1,7} have no fusings within the slice, so they need no extra fusings because
nothing in the slice changes their component numbers. ■

Figure 3.4: An illustration of horizontal prefix assimilation: a) Components and fusing
connections in a slice before horizontal prefix assimilation; b) Slice after horizontal
prefix assimilation, where added fusing connections are circled.

The following is a description of variables and registers used during horizontal
prefix assimilation.
Let the horizontal fusing index of horizontal bus k be the index of the rightmost
vertical bus (if any) in W_{u,v} that is fused to bus k. Denote this index as hfi(k).
Each processor connected to horizontal bus k holds this value (if any) in register hfi.
Denote as hcomp(k) (vcomp(ℓ), resp.) the component number of horizontal bus k
(vertical bus ℓ, resp.). Each processor connected to horizontal bus k holds this value
in register hcomp. Each processor in R possesses a set of 2N/P registers reg_m, where
0 ≤ m < 2N/P. Register reg_m in each processor of column ℓ, where 0 ≤ ℓ < P, holds
the horizontal fusing index for component ℓ(2N/P) + m.
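The assignment of components to registers is a plain quotient and remainder computation. The following sketch assumes that components are numbered 0 to 2N - 1, that P divides 2N, and that column ℓ is responsible for components ℓ(2N/P) + m; the helper names are hypothetical.

```python
def register_for_component(c, N, P):
    """Return (column, m) such that column * (2N/P) + m == c."""
    assert (2 * N) % P == 0, "sketch assumes P divides 2N"
    assert 0 <= c < 2 * N, "component numbers range over 0 .. 2N - 1"
    registers_per_processor = (2 * N) // P   # reg_0 .. reg_{2N/P - 1}
    return divmod(c, registers_per_processor)


def component_for_register(column, m, N, P):
    """Inverse mapping: the component whose data lives in reg_m of this column."""
    return column * ((2 * N) // P) + m


column, m = register_for_component(13, N=8, P=4)   # 4 registers per processor
```

For N = 8 and P = 4, each processor has four registers, so component 13 is handled by register reg_1 of the processors in column 3.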
Assume that each processor in R holds the configuration of the processor it
simulates in each of the windows of Q. Also, assume that each processor in R holds
the component numbers of its horizontal and vertical buses.
Horizontal prefix assimilation comprises three stages; each of these stages involves
O(N/P) iterations. As we explain them, we will illustrate key ideas through the example
of Figure 3.4.
Stage 1. In this stage, R determines by leader election (by priority resolution) the
fusing index, hfi(k), for each horizontal bus k in S_v and stores it in an appropriate
processor and register. Stage 1 proceeds as follows.
for u ← 0 to N/P - 1 perform the following steps in window W_{u,v}:
1. Each processor of R configures its ports as cross-over.
2. In each row k of R, each processor, writing to its horizontal bus, uses leader
election (by priority resolution) to find its hfi(k) (if any).
3. In each row k of R, the processor in column ℓ that satisfies the relation
ℓ(2N/P) + m = hcomp(uP + k) for some m, where 0 ≤ m < 2N/P, stores hfi(k) (if any) in
its register reg_m, overwriting any previously stored value.
R executes the leader election (by priority resolution) of Step 2 in O(log P) time,
and Steps 1 and 3 in O(1) time. R performs the steps above in N/P windows, so it
executes Stage 1 in O((N/P) log P) time.
Example: In Figure 3.4(a), buses 2, 3, and 5 have fusing connections within the
slice. The fusing indices for these buses are B, B, and A, respectively. Registers
responsible for storing the fusing indices of component {2,6} hold only index B. On
the other hand, registers for component {3,5} hold indices A and B.
Stage 2. Stage 1 places the set of possible fusing indices for a component in
processors of the same column. Stage 2 uses leader election (by priority resolution)
to select one fusing index for each of the 2N/P components per column of R. Stage 2
proceeds as follows.
for m ← 0 to 2N/P - 1 perform the following steps:
1. Each processor of R configures its ports as cross-over.
2. For each vertical bus of R, use leader election (by priority resolution) to select a
processor (if any) that has a non-null fusing index in register reg_m.
3. The selected processor broadcasts such a fusing index (if any) to all the processors
in its column. Each processor in the column stores this index (if any) in register reg_m,
overwriting any previously stored value.
R performs 2N/P leader elections (by priority resolution) in each column. Therefore,
R executes Stage 2 in O((N/P) log P) time.
Example: In Figure 3.4, this step selects index B (the only one available) for
component {2,6}. Although it is not clear in Figure 3.4, assume that this step
chooses index A (from the set {A, B}) for component {3,5}.
Stage 3. This stage completes horizontal prefix assimilation by adding a fusing
connection between a horizontal bus and the vertical bus given by the fusing index
selected in Stage 2 for its component. Stage 3 proceeds as follows.
for u ← 0 to N/P - 1 perform the following steps in window W_{u,v}:
1. Each processor of R configures its ports as cross-over.
2. In each row k of R, the processor in column ℓ that satisfies the relation
ℓ(2N/P) + m = hcomp(uP + k) for some m, where 0 ≤ m < 2N/P, broadcasts the contents
(if any) of register reg_m to all processors in its row.
3. Each processor reads the horizontal bus. If the column index of some processor
matches the value on the bus, then that processor replaces the original port
configuration of the processor it simulates in Q by a fusing connection.
R performs these operations in each window in constant time, so R executes
Stage 3 in O(N/P) time.
Example: In Figure 3.4, notice the circled fusing connections on vertical bus B
(to embed component {2,6}) and on vertical bus A (to embed component {3,5}).
Altogether, R performs horizontal prefix assimilation in O((N/P) log P) time.
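The net effect of the three stages on one slice can be sketched sequentially in Python. This is not the bus-based procedure itself: dictionaries stand in for the registers, and the sketch assumes that keeping the smallest candidate index is an acceptable stand-in for the leader election of Stage 2 (any single choice per component works). All names are hypothetical.

```python
def horizontal_prefix_assimilation(hcomp, hfi):
    """Sequential sketch of the effect of Stages 1 to 3 on one slice.

    hcomp: dict, horizontal bus -> component number known from slices to the left.
    hfi:   dict, horizontal bus -> index of the rightmost vertical bus fused to it
           inside the current slice (buses with no fusing are absent).
    Returns the set of (horizontal bus, vertical bus) fusings set in Stage 3.
    """
    # Stages 1 and 2: gather the candidate fusing indices of each component and
    # keep exactly one (assumption: the smallest index is kept).
    chosen = {}
    for bus, index in hfi.items():
        comp = hcomp[bus]
        chosen[comp] = min(chosen.get(comp, index), index)
    # Stage 3: fuse every horizontal bus of such a component to the chosen
    # vertical bus. Components with no fusing in the slice receive nothing.
    return {(bus, chosen[comp]) for bus, comp in hcomp.items() if comp in chosen}


# Data modeled on the example of Figure 3.4 (vertical buses A = 0, B = 1).
A, B = 0, 1
hcomp = {1: "c17", 7: "c17", 2: "c26", 6: "c26", 3: "c35", 5: "c35", 4: "c4"}
hfi = {2: B, 3: B, 5: A}
fusings = horizontal_prefix_assimilation(hcomp, hfi)
```

Here component {2,6} is tied together through bus B and component {3,5} through bus A, while components {4} and {1,7} are left untouched, as in the example above.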
Because horizontal prefix assimilation exploits the continuity of horizontal and
vertical buses, this method does not extend to a scaling simulation of the
(unrestricted) R-Mesh. For example, in Figure 3.4, if bus B were broken between buses
2 and 6, then placing a fusing connection at the intersection of buses 6 and B would no
longer accurately embed the effects of buses 2 and 6 being in the same component.
Since both buses are in different components of Q, one of them would be assigned a
wrong component number, and a new problem arises: the segment of bus
B below the break should be in a different component than bus 6. Furthermore, the
number of possible components would be O(N^2) instead of 2N.
3.4.2 Vertical Prefix Assimilation
Vertical prefix assimilation embeds in W_{u,v} the effects of bus fusings in upper windows,
W_{0,v}, ..., W_{u-1,v}. The procedure is a special case of horizontal prefix assimilation
that uses P × P windows rather than N × P slices.
Vertical prefix assimilation includes an initial stage, described below, that
provisionally replaces the component number of each vertical bus in window W_{u,v} by
another that is log P bits long. This transformation maintains the size relation among
the transformed component numbers. This stage proceeds as follows.
Small Component Numbers. This stage assigns a component number of length
log P bits to each of the P vertical buses in the window (the original component
number is log N + 1 bits long). This stage consists of the following steps.
1. Each processor of R configures its ports as cross-over.
2. The topmost processor in each column of R broadcasts its component number to
all processors in its column.
3. Each main diagonal processor of R broadcasts the number it read to all processors
in its row.
4. Each processor of R compares these two numbers. If both are equal, then the
processor sets a provisional fusing connection in its ports.
5. The first row processors of R, writing on their vertical buses, use leader election
(by priority resolution) to find the largest column index for each component. Call the
resulting index the small component number.
R executes Steps 1-4 in O(1) time and Step 5 in O(log P) time (since the length
of a column index ℓ is log P bits). Therefore, R executes this stage in O(log P) time.
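The outcome of this stage can be sketched in a few lines of Python. The sketch computes, for every vertical bus of a window, the largest column index among the buses that share its component number; it replaces the bus operations with a dictionary, and the function name is hypothetical.

```python
def small_component_numbers(vcomp):
    """vcomp[j] is the (log N + 1)-bit component number of vertical bus j.

    Returns a list whose entry j is the largest column index whose vertical
    bus has the same component number as bus j.
    """
    last = {}
    for j, c in enumerate(vcomp):
        last[c] = j          # later (larger) columns overwrite earlier ones
    return [last[c] for c in vcomp]


small = small_component_numbers([17, 4, 17, 9])
```

In this made-up window of four columns, buses 0 and 2 share component 17 and both receive small component number 2, which fits in log P bits.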
From this point, R applies a variation of horizontal prefix assimilation to window
W_{u,v}. We will describe vertical prefix assimilation by pointing out the differences
with respect to horizontal prefix assimilation.
Notice that the prefix assimilation proceeds in a vertical fashion rather than
horizontal. The number of possible small component numbers is at most P, so each
processor in R has only one register (reg_1) (rather than the 2N/P registers reg_m) for
storing the fusing index that corresponds to the small component number of its
vertical bus.
Stage 1. In this stage, R determines by leader election (by priority resolution) the
fusing index for each vertical bus ℓ in W_{u,v}. Assume that vertical bus ℓ has small
component number a, where 0 ≤ a < P. Then, R stores the fusing index of column
ℓ (if any) in register reg_1 of the processor located in row a and column ℓ.
Stage 2. The processors in each row a, writing on their horizontal bus, use leader
election (by priority resolution) to select one fusing index for all vertical buses with
small component number a. Each processor in row a stores this index in its register reg_1.
Stage 3. This stage completes vertical prefix assimilation by adding the respective
fusing connections as in horizontal prefix assimilation.
Component numbering also uses the small component number of each vertical
bus. R performs vertical prefix assimilation in O(log P) time.
3.4.3 Component Numbering
The component numbering procedure assigns component numbers to buses (ports) in
the current window. It incorporates the embedded effects of previous windows
(gathered by the horizontal and vertical prefix assimilation procedures). The component
numbering procedure works as follows.
Stage 1. Each processor of R configures its ports according to the configuration of
the processor it simulates in Q (as altered by the prefix assimilation phases).
Stage 2. The processors at the top of each column, writing on the vertical buses, use
leader election (by priority resolution) to find the vertical bus (column) with largest
index in each component.
Stage 3. The processor at the top of a column with largest index in its component
writes its original (log N + 1)-bit component number on its vertical bus. Each
processor reads its vertical (horizontal, resp.) bus and stores the value in its register
vcomp (hcomp, resp.).
The values in registers vcomp and hcomp are the new component numbers for
vertical and horizontal buses, respectively, at this point in the simulation.
R performs the leader election (by priority resolution) of Stage 2 in O(log P) time
and the remaining stages in O(1) time. Therefore, R executes component numbering
in O(log P) time.
Remark: If R used leader election (by priority resolution) to directly identify
component numbers instead of Stages 2 and 3, then this would take O(log N) time,
as the component numbers in Q could be O(log N) bits long.
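The result of component numbering on one window can be sketched sequentially. The sketch below replaces the bus configuration of Stage 1 with a union-find structure over the 2P buses of the window, which is an implementation choice of this illustration and not part of the FR-Mesh procedure; the function name and the example data are hypothetical.

```python
def component_numbering(P, fused, vcomp, hcomp):
    """Sequential sketch of component numbering in one P x P window.

    fused: set of (row, col) pairs whose simulated processor fuses its
           horizontal and vertical bus (after prefix assimilation).
    vcomp, hcomp: lists of length P with the current component numbers.
    Returns the updated (vcomp, hcomp).
    """
    # Stage 1 stand-in: vertical bus j is node j, horizontal bus k is node P + k.
    parent = list(range(2 * P))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for row, col in fused:
        parent[find(P + row)] = find(col)

    # Stage 2: the leader of a component is its vertical bus of largest index.
    leader = {}
    for col in range(P):
        leader[find(col)] = col             # larger columns overwrite smaller ones

    # Stage 3: the leader's original number labels the whole component.
    new_v = [vcomp[leader[find(col)]] for col in range(P)]
    new_h = [vcomp[leader[find(P + row)]] if find(P + row) in leader else hcomp[row]
             for row in range(P)]
    return new_v, new_h


new_v, new_h = component_numbering(
    P=3, fused={(0, 1), (2, 0), (2, 1)}, vcomp=[10, 11, 12], hcomp=[0, 1, 2])
```

In this made-up window, vertical buses 0 and 1 and horizontal buses 0 and 2 form one component and take the number of column 1, while horizontal bus 1, which is fused to no vertical bus, keeps its previous number.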
Example: Figure 3.5 illustrates component numbering. For simplicity, it is
designed to not require vertical prefix assimilation. Figure 3.5(a) shows the
configuration of S_v after horizontal prefix assimilation. After applying component numbering
to window W_{0,v} (Figure 3.5(b)), all buses in the same component receive the same
component number; buses 0 and 2 are numbered B and A, respectively. The same
effect occurs in windows W_{1,v} and W_{2,v} (Figures 3.5(c),(d)). Note that as the first
vertical sweep advances, component numbers of vertical buses change (from A, B, C
in W_{0,v}, to B, B, C in W_{1,v}, and finally to C, C, C in W_{2,v}). These changes do not
manifest themselves in upper windows until the second vertical component sweep.
3.4.4 Second Vertical Component Sweep
The second vertical component sweep propagates component numbers to each window
in slice S_v. After the first vertical sweep, all buses in window W_{N/P-1,v} include the
effect of all fusing connections in the slice S_v and all previous slices (see window
W_{2,v} in Figure 3.5(d)). During the second vertical component sweep, vertical buses
convey these updated component numbers to horizontal buses in each window (see
Figure 3.5(e)). For simplicity, this sweep follows the same direction as the first vertical
sweep, starting in window W_{0,v} and moving down to window W_{N/P-1,v}. Because
vertical buses have the same component number in each window of the slice, the
sweep over the windows can follow any order. R performs the following steps.

Figure 3.5: Example of vertical component sweep: a) Original slice connections and
component numbers; b-d) First vertical component sweep; e) Second vertical
component sweep.
for u ← 0 to N/P - 1 perform the following steps in window W_{u,v}:
1. Each processor of R configures its ports according to the configuration of the
simulated window in Q, plus the additional fusing connections obtained during the
prefix assimilation sub-phases.
2. The topmost processor of each column writes its component number (the value of
its register vcomp) on its vertical bus.
3. Each processor reads this value (if any) from its horizontal bus and stores it in its
register hcomp, overwriting any previous value.
If some processor does not read any value in Step 3, then its horizontal bus is
not fused to any vertical bus. In this case, each processor in that row retains the
component number of its horizontal bus (in register hcomp) it had at the beginning
of the simulation of the present slice.
R executes the second vertical component sweep in O(N/P) time.
3.4.5 Second Horizontal Component Sweep
This phase propagates final component numbers to vertical buses in all slices in Q.
After the first horizontal sweep, all vertical and horizontal buses in the rightmost
slice, S_{N/P-1}, have the correct component number while vertical buses in other slices
may not. Since each horizontal bus passes unbroken through all slices, R simply
copies the final component number of a horizontal bus from the rightmost slice.
For each slice S_v of Q (the order of this sweep does not matter, but for simplicity
start the sweep in S_0 and finish it in S_{N/P-1}), follow the next procedure:
for u ← 0 to N/P - 1 perform the following steps in window W_{u,v}:
1. Each processor of R configures its ports according to the configuration of the
simulated window W_{u,v}, plus the additional fusing connections obtained during the
prefix assimilation sub-phases.
2. The leftmost processor in each row of the window writes its component number on
its horizontal bus, then each processor in the window reads its vertical bus and stores
this value (if any) in its register vcomp, overwriting any previous value.
The second horizontal component sweep runs in O(N^2/P^2) time. The time to execute
the component determination phase is O((N^2/P^2) log P). Now, the conditions are set
to apply the data delivery phase.
3.5 Data Delivery
So far, R holds the final component number of each bus in Q. Using this
information, data delivery ensures that each processor port receives its appropriate data.
A component of Q must have the same data on all of its buses. Call the actions
of R to ensure this property in its simulation of Q data homogenization. (Each
processor in an FR-Mesh holds the bus data of its horizontal and vertical bus in
registers hdata and vdata, respectively.) Data delivery employs two procedures,
window homogenization and slice homogenization. In addition, data delivery uses the
second horizontal/vertical data sweeps that parallel their counterparts in component
determination.
The pseudo-code in Figure 3.6 describes the organization of data delivery. R
applies window homogenization to all windows of the slice. After finishing with
the bottom window, R applies the second vertical data sweep (similar to the second
vertical component sweep) to broadcast any possible writes on lower windows to ports
in upper windows. Because window homogenization acts locally at a window level,
it cannot detect the relation between two buses with the same component number
lying in different windows if they do not have an explicit connection between them in
the slice. Therefore, after applying the second vertical data sweep, R performs slice
homogenization to correct possible data inconsistencies that window homogenization
cannot solve. After applying slice homogenization to all slices, R performs the second
horizontal data sweep (similar to the second horizontal component sweep) across all
slices to ensure that writes in a slice reach slices to its left. We now describe the
procedures in data delivery.

for v ← 0 to N/P - 1 do                (for slice S_v in Q)
    for u ← 0 to N/P - 1 do            (for window W_{u,v} in S_v)
        window homogenization
    end
    second vertical data sweep
    slice homogenization
end
second horizontal data sweep

Figure 3.6: Pseudo-code for data delivery.
3.5.1 Window Homogenization
Window homogenization performs data homogenization within a window by
integrating data entering a window through its borders with data written within the window,
according to the concurrent write rule (COMMON, in this case). The main step of
window homogenization is to physically connect within the window all buses in the
same component. Once components are physically connected within R, a write by
the processors allows data homogenization through the window.
Window homogenization uses three different connection patterns to achieve data
homogenization in every connected component of the window. The first connection
pattern, the HV configuration, handles connected components that have both
horizontal and vertical buses in the window (see buses a and d in Figure 3.7(b)).
Similarly, the VV (HH, resp.) configuration handles connected components that have
only vertical (horizontal, resp.) buses in the window. Figures 3.7(c) and 3.7(d) show
examples of VV and HH configurations. Each component in R can easily determine
its component type by checking whether or not it possesses fusing connections. The
following routine sets configuration HV.
1. Each processor in R sets its port partition as cross-over.
2. The topmost processor in each column and the leftmost processor in each row
broadcast their component numbers to their column and row processors, respectively.
3. Each processor in the window compares the two numbers it receives. If they are
equal, then it sets a fusing connection in its ports.
The following routine sets configuration VV (HH, resp.).
1. Each processor in R sets its port partition as cross-over.
2. The topmost (leftmost, resp.) processor in each column (row, resp.) broadcasts its
component number to its column (row, resp.) processors. Then each main diagonal
processor rebroadcasts the received number to its row (column, resp.) processors.
3. Each processor (including those in the main diagonal) in the window compares
the two numbers it receives. If they are equal, then it sets a fusing connection in its
ports.
Each processor in R that simulates a writer processor of Q holds data to be
written on the bus. Additionally, each first row and first column processor holds data
generated in adjacent windows. Window homogenization proceeds as follows.
Figure 3.7: Configurations for window homogenization: a) Window after component
determination; b-d) Configurations HV, VV, and HH for the same window.
1. R sets configuration HV (VV, HH, resp.). Only processors having ports attached
to an HV (VV, HH, resp.) connected component take part in the following step.
2. Each processor that simulates a writing processor of Q writes its data on its
respective bus or buses. Simultaneously, each border processor holding non-null bus
data writes this value to the bus. Each processor reads the bus and stores the bus
data.
The resulting data on each bus at the end of this step may be null or non-null. R
performs window homogenization in constant time.
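The effect of window homogenization on the bus data of one window can be sketched as follows. The sketch classifies components by comparing component numbers (a stand-in for the check on fusing connections described above) and then applies a COMMON write per component; the function name and example values are hypothetical.

```python
def window_homogenization(hcomp, vcomp, hdata, vdata):
    """Sequential sketch of window homogenization under the COMMON rule.

    hcomp, vcomp: component numbers of the horizontal and vertical buses.
    hdata, vdata: datum on each bus (None if nothing was written to it).
    Returns (kinds, new_hdata, new_vdata).
    """
    # Classify each component by the bus orientations it has in the window.
    kinds = {}
    for c in set(hcomp) | set(vcomp):
        in_h, in_v = c in hcomp, c in vcomp
        kinds[c] = "HV" if in_h and in_v else ("HH" if in_h else "VV")
    # COMMON write: all writers to a component must agree, so any non-null
    # datum found on the component is its datum.
    value = {}
    for c, d in list(zip(hcomp, hdata)) + list(zip(vcomp, vdata)):
        if d is not None and value.setdefault(c, d) != d:
            raise ValueError("COMMON rule violated")
    return (kinds,
            [value.get(c) for c in hcomp],
            [value.get(c) for c in vcomp])


kinds, new_h, new_v = window_homogenization(
    hcomp=["a", "b", "d", "b"], vcomp=["c", "a", "d", "c"],
    hdata=[None, "x", None, None], vdata=["y", None, None, None])
```

In this made-up window, the two horizontal buses of component b end up with datum x and the two vertical buses of component c with datum y, while components a and d, to which nobody wrote, stay null.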
Example: Figure 3.7 illustrates window homogenization. Figure 3.7(a) shows the
window after component determination, where the letters represent component
numbers. Figure 3.7(b) shows the HV configuration. Notice how horizontal and vertical
buses a form a connected component, as well as buses d. After applying a common
write to each component, each processor port attached to the same component reads
the same data. Figure 3.7(c) shows a VV configuration. Only buses with c and f
take part in this operation. Observe that horizontal buses c' connect vertical buses
c. The function of f' is similar but unnecessary in this case, since there is only one
vertical bus f. Figure 3.7(d) shows the HH configuration with buses b' working as
auxiliary buses to connect horizontal buses b.
3.5.2 Second Vertical Data Sweep
The second vertical data sweep propagates bus data to each horizontal bus in slice S_v.
After applying window homogenization and component numbering to each window
in the slice, all vertical buses include the effect of data writings in the slice. The bus
data of horizontal buses in upper windows, however, may be altered by writings in
lower windows. The second vertical data sweep solves these inconsistencies for those
horizontal buses that have a fusing connection with some vertical bus in the slice;
otherwise, we use slice homogenization. The second vertical data sweep works in the
same way as the second vertical component sweep; the only difference is that each
processor writes the contents of vdata (rather than vcomp) to its vertical bus, and
each processor reading its horizontal bus stores the value in its register hdata (rather
than hcomp).
R performs the second vertical data sweep in O(N/P) time. At this point, some
data discrepancies may appear in horizontal buses of components without fusing
connections in the slice. R applies slice homogenization to correct them.
3.5.3 Slice Homogenization
After applying window homogenization to all the windows in the slice, it is possible
that some buses with the same component number have different bus data. Figure 3.8
illustrates this case. Since the two horizontal buses fuse to the vertical bus in slice
S_{v-1}, after executing connected component determination the three buses have the
same component number (a in this case). The two horizontal buses do not fuse to any
other bus in the remaining slices (from S_v to S_{N/P-1}). Let some processor in slice S_v
write α on the upper bus, while no processors write on the lower bus. After applying
window homogenization to the entire slice, both buses will still have different bus data
because they lie in different windows and S_v contains no explicit connection between
them.

Figure 3.8: Example showing need for slice homogenization.
Slice homogenization acts over the entire slice, broadcasting data to buses placed
in different windows (even when there is no direct connection between them in the
slice) and correcting any remaining data discrepancy according to the concurrent write
rule. Slice homogenization chooses one non-null datum (if any) for each component
and broadcasts it to all the buses in its corresponding component (recall that the
COMMON rule is assumed). This parallels the horizontal prefix assimilation algorithm
that selects one fusing index (if any). Thus, the algorithm of Section 3.4.1 with
minor modifications serves for slice homogenization. Slice homogenization performs
the following steps.
Each row in R holds 2N registers (2N/P registers per processor), as in horizontal
prefix assimilation. Register reg_m, in each processor of column ℓ, where 0 ≤ ℓ < P,
holds the bus data for component ℓ(2N/P) + m.
Stage 1. This stage groups the bus data of all horizontal buses according to their
component numbers, and stores data for the same component in processors in the
same column of R. Stage 1 consists of the following steps.
for u ← 0 to N/P - 1 perform the following steps in window W_{u,v}:
1. Each processor of R configures its port connections as cross-over.
2. Each processor of R writes the contents (if any) of register hdata to its horizontal
bus.
3. In each row k of R, the processor in column ℓ that satisfies the relation
ℓ(2N/P) + m = hcomp(uP + k) for some m, where 0 ≤ m < 2N/P, stores the data on the
bus (if any) in its register reg_m, overwriting any previously stored value.
Remark: Since only common concurrent writes are allowed, we have only two possible
classes of values, null and non-null.
R executes Stage 1 in O(N/P) time.
Stage 2. Since the bus data of buses having the same component number are in
the same column, a common write among these values gives the final bus data for
these buses.
for m ← 0 to 2N/P - 1 perform the following steps:
1. Each processor of R configures its port connections as cross-over.
2. In each column ℓ, each processor writes the contents (if any) of register reg_m to its
vertical bus.
3. In each column ℓ, each processor reads its vertical bus and stores that value in its
register reg_m.
R executes Stage 2 in O(N/P) time.
Stage 3. This stage returns the final bus data to each processor in the slice.
for u ← 0 to N/P - 1 perform the following steps in window W_{u,v}:
1. Each processor of R configures its ports as cross-over.
2. In each row k of R, the processor in column ℓ that satisfies the relation
ℓ(2N/P) + m = hcomp(uP + k) for some m, where 0 ≤ m < 2N/P, broadcasts the contents
(if any) of register reg_m to all processors in its row.
3. Each processor reads the horizontal bus, and stores that value in register hdata.
R executes Stage 3 in O(N/P) time.
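The three stages amount to choosing one non-null datum per component and handing it back to every horizontal bus of that component. A minimal sequential sketch, assuming the COMMON rule (so all non-null data of a component agree) and using a dictionary in place of the registers, is given below; the function name and the example values are hypothetical.

```python
def slice_homogenization(hcomp, hdata):
    """Sequential sketch of slice homogenization under the COMMON rule.

    hcomp[k]: final component number of horizontal bus k of the slice.
    hdata[k]: datum on bus k after window homogenization and the second
              vertical data sweep (None if the bus carries no datum).
    Returns the corrected list of bus data.
    """
    # Stages 1 and 2: collect one non-null datum (if any) per component.
    datum = {}
    for c, d in zip(hcomp, hdata):
        if d is not None:
            datum.setdefault(c, d)
    # Stage 3: return the chosen datum to every bus of the component.
    return [datum.get(c) for c in hcomp]


# Modeled on Figure 3.8: buses 0 and 2 share component "a", only bus 0 was written.
fixed = slice_homogenization(["a", "b", "a"], ["alpha", None, None])
```

The lower bus of component a, which no processor wrote to within the slice, now carries the same datum as the upper bus.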
After applying window homogenization and slice homogenization to each window
and slice in Q, only the rightmost slice has its set of final bus data. R performs
a reverse sweep, in which it broadcasts the data gathered during the first sweep, in
order that each processor of R that simulates a reading processor of Q can read the
final value from its vertical or horizontal bus or from both. This sweep starts in slice
S_{N/P-1} and ends in slice S_0.
Remark: We note that the actions of slice homogenization for all slices of Q can be
deferred to a single slice homogenization on the last slice before the second horizontal
data sweep. Our presentation of slice homogenization as a part of the data delivery
phase allows each slice to be completely processed before the simulation moves on to
the next slice.
3.5.4 Second Horizontal Data Sweep
The second horizontal data sweep propagates bus data to each vertical bus that is
connected to a horizontal bus in Q. After sweeping the N/P slices of Q with slice
homogenization, each port connected to a horizontal bus can read its final bus data
while ports connected to vertical buses may not. Since each horizontal bus passes
unbroken through all slices, R simply copies the final bus data of each horizontal bus
from the rightmost slice. The second horizontal data sweep works similarly to the
second horizontal component sweep; the only difference is that, in each slice, each
processor writes the contents of hdata (rather than hcomp) to its horizontal bus, and
each processor reading its vertical bus stores the value in its register vdata (rather
than vcomp). So each processor in Q receives its correct final bus data.
R performs the second horizontal data sweep in O(N^2/P^2) time. R also performs
data delivery in O(N^2/P^2) time.
This completes the simulation of an arbitrary step of an N × N COMMON CRCW
FR-Mesh, Q, on a P × P COMMON CRCW FR-Mesh, R. On the whole, this
simulation takes O((N^2/P^2) log P) time; that is, it has a log P simulation overhead.
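The overall bound can be checked by adding up the per-procedure costs given in Sections 3.4 and 3.5. The derivation below is a sketch that assumes N/P slices, N/P windows per slice, and the stage costs as stated for each procedure; the names T_slice, T_components, T_delivery, and T_step are introduced here only for illustration.

```latex
\begin{align*}
T_{\text{slice}}
  &= \underbrace{O\!\left(\tfrac{N}{P}\log P\right)}_{\text{horizontal prefix assimilation}}
   + \tfrac{N}{P}\cdot\underbrace{O(\log P)}_{\text{vertical prefix assimilation and numbering}}
   + \underbrace{O\!\left(\tfrac{N}{P}\right)}_{\text{second vertical component sweep}}
   = O\!\left(\tfrac{N}{P}\log P\right), \\
T_{\text{components}}
  &= \tfrac{N}{P}\cdot T_{\text{slice}}
   + \underbrace{O\!\left(\tfrac{N^{2}}{P^{2}}\right)}_{\text{second horizontal component sweep}}
   = O\!\left(\tfrac{N^{2}}{P^{2}}\log P\right), \\
T_{\text{delivery}}
  &= \tfrac{N}{P}\left[\tfrac{N}{P}\cdot O(1)
   + O\!\left(\tfrac{N}{P}\right) + O\!\left(\tfrac{N}{P}\right)\right]
   + O\!\left(\tfrac{N^{2}}{P^{2}}\right)
   = O\!\left(\tfrac{N^{2}}{P^{2}}\right), \\
T_{\text{step}}
  &= T_{\text{components}} + T_{\text{delivery}}
   = O\!\left(\tfrac{N^{2}}{P^{2}}\log P\right).
\end{align*}
```

The log P factor comes only from the leader elections in component determination; data delivery contributes no overhead beyond the N^2/P^2 windows that must be visited.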
Remark: The FR-Mesh scaling simulation uses only memory per processor,
which is optimal. Also, the simulation applies to any N \ x FR-Mesh on a P  x P  
FR-Mesh, with P  < N i, N 2 .
3.6 Other Write Rules
The simulation explained in previous sections assumes the COMMON rule in Q and 
R. In this section, we simulate Q on R and allow both of them to have any of 
the following write rules [12]: COMMON, COLLISION, COLLISION+, ARBITRARY, or 
PRIORITY. We obtain the following results.
Theorem 3.3 For any P < N, any step of an N x N COMMON, COLLISION, 
COLLISION+, ARBITRARY, or PRIORITY CRCW FR-Mesh can be simulated on a 
P x P COMMON, COLLISION, or COLLISION+ CRCW FR-Mesh in $O\left(\frac{N^2}{P^2}\log P\right)$ time. ■
Theorem 3.4 For any P < N, any step of an N x N COMMON, COLLISION, 
COLLISION+, or ARBITRARY CRCW FR-Mesh can be simulated on a P x P 
ARBITRARY CRCW FR-Mesh in $O\left(\frac{N^2}{P^2}\right)$ time. ■
Theorem 3.5 For any P < N, any step of an N x N COMMON, COLLISION, 
COLLISION+, ARBITRARY, or PRIORITY CRCW FR-Mesh can be simulated on a 
P x P PRIORITY CRCW FR-Mesh in $O\left(\frac{N^2}{P^2}\right)$ time. ■
Though we view ARBITRARY and PRIORITY as too powerful for a bus-based 
model, the simulations indicate that the FR-Mesh can scale an ARBITRARY or 
PRIORITY algorithm with the same cost as scaling a COMMON algorithm (see Theorems 3.3, 
3.4, and 3.5). Hence, algorithm development can be more flexible and efficient if the 
ARBITRARY or PRIORITY rules are permitted for algorithms that will be scaled and 
run on a model with a feasible concurrent write rule.
The cases where R uses the ARBITRARY or PRIORITY rules have a constant 
simulation overhead (see Theorems 3.4 and 3.5). This pinpoints leader election as the 
bottleneck in this scaling simulation. Thus any improvement in the technique for 
leader election immediately translates to a lower simulation overhead.
We prove Theorems 3.3 to 3.5 by performing the following simulations:
1. N x N COMMON, COLLISION, or COLLISION+ CRCW FR-Mesh on an N x N 
ARBITRARY or PRIORITY CRCW FR-Mesh.
2. N x N PRIORITY CRCW FR-Mesh on a P x P PRIORITY CRCW FR-Mesh.
3. N x N ARBITRARY CRCW FR-Mesh on a P x P ARBITRARY CRCW FR-Mesh.
4. P x P PRIORITY CRCW FR-Mesh on a P x P COMMON, COLLISION, or 
COLLISION+ CRCW FR-Mesh.
Simulations 1, 2, and 4 will establish Theorem 3.3, Simulations 1 and 3 will establish 
Theorem 3.4, and Simulations 1 and 2 will establish Theorem 3.5.
3.6.1 Simulation 1
An FR-Mesh using ARBITRARY or PRIORITY writes can simulate in constant time 
each step of an FR-Mesh of the same size using COMMON, COLLISION, or COLLISION+. 
(The proof of this assertion mirrors the corresponding proof for PRAMs [19, 24].) 
Consequently, a simulation of an ARBITRARY or PRIORITY CRCW FR-Mesh with 
simulation overhead X implies simulations of FR-Meshes with any of the COMMON, 
COLLISION, or COLLISION+ rules with the same simulation overhead X.
Lemma 3.6 Any step of an N x N COMMON, COLLISION, or COLLISION+ CRCW 
FR-Mesh can be simulated on an N x N ARBITRARY or PRIORITY CRCW FR-Mesh 
in O(1) time.
Proof: To prove Lemma 3.6, it is enough to show how to implement the COMMON, 
COLLISION, and COLLISION+ rules on a CRCW bus that uses the ARBITRARY rule, 
since a CRCW bus that uses the PRIORITY rule simulates the ARBITRARY rule in 
constant time.
COMMON on ARBITRARY: Under the COMMON rule, all the values written on the 
bus are equal. So, the bus using the ARBITRARY rule chooses any one of the instances 
of the same value.
COLLISION on ARBITRARY: In this simulation, accompany each write with the index 
of the writing processor. Then, each writing processor compares its own index against 
the index it reads from the bus; if they differ, then the processor writes a collision 
symbol to the bus. If no processor writes a collision symbol, then the data on the bus 
is the final data.
COLLISION+ on ARBITRARY: Each writing processor writes its data to the bus. Then, 
each writing processor compares the data it wrote to the bus against the data it 
reads from the bus; if they differ, then the processor writes a collision symbol. If no 
processor writes a collision symbol, then the data on the bus is the final data. ■
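The three cases in the proof of Lemma 3.6 can be sketched as follows. This is an illustrative model only, not part of the original simulation: the names `arbitrary_bus`, `COLLISION`, and the use of a pseudo-random generator to stand in for the ARBITRARY choice are assumptions made for the example.

```python
import random

COLLISION = "#"  # hypothetical collision symbol (assumption for this sketch)


def arbitrary_bus(writes, rng):
    """One write cycle on an ARBITRARY bus: one write survives (None if no writer)."""
    return rng.choice(writes) if writes else None


def common_on_arbitrary(values, rng):
    """COMMON guarantees equal values, so whichever write survives is correct."""
    return arbitrary_bus(values, rng)


def collision_on_arbitrary(writes, rng):
    """writes: list of (processor_index, value) pairs; the index travels with the value."""
    first = arbitrary_bus(writes, rng)
    if first is None:
        return None
    # Second cycle: every writer that did not read its own index writes the collision symbol.
    complaints = [COLLISION for index, _ in writes if index != first[0]]
    return COLLISION if arbitrary_bus(complaints, rng) else first[1]


def collision_plus_on_arbitrary(writes, rng):
    """Same idea, but writers compare the data they wrote with the data they read."""
    first = arbitrary_bus([value for _, value in writes], rng)
    if first is None:
        return None
    complaints = [COLLISION for _, value in writes if value != first]
    return COLLISION if arbitrary_bus(complaints, rng) else first
```

Each function uses at most two bus cycles, which is the constant-time behavior the lemma claims; the outcome does not depend on which write the ARBITRARY bus happens to keep.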
3.6.2 Simulation 2
Simulation 2 scales down the size of an FR-Mesh where both Q and R use the 
PRIORITY rule. Lemma 3.7 summarizes the result of Simulation 2.
Lemma 3.7 Any step of an N x N PRIORITY CRCW FR-Mesh can be simulated 
on a P x P PRIORITY CRCW FR-Mesh in $O\left(\frac{N^2}{P^2}\right)$ time, that is, with a constant 
simulation overhead.
Proof: We identify portions of the COMMON rule simulation that must change to 
accommodate the PRIORITY rule.
PRIORITY on PRIORITY: In the component determination phase (Section 3.4), R 
performs concurrent writes only during the following parts of the simulation:
1. Horizontal prefix assimilation (Section 3.4.1) in Stages 1 and 2,
2. Vertical prefix assimilation (Section 3.4.2) in Stages 1, 2, and to generate the 
small component numbers, and
3. Component numbering (Section 3.4.3) in Stage 2.
In these parts of the simulation, R uses concurrent writes to find a leader among 
a set of processors. In all these cases, the leader is the processor with the 
highest index (we use the highest index for convenience, but the simulation can be 
easily modified to accept the lowest index). Using the COMMON rule, R performs 
leader election (by priority resolution) in O(log P) time. If R uses the PRIORITY 
rule, then it performs leader election in constant time. So, using the PRIORITY rule, 
R performs component determination in $O\left(\frac{N^2}{P^2}\right)$ time.
During the data delivery phase, R selects bus data for each bus. Under the 
PRIORITY rule, the data on the bus at the end of a writing cycle is the data written 
by the processor with highest priority connected to that bus. This simulation requires 
each bus datum written to have a tag attached. This tag is the index of the writing 
processor. Now, we identify the portions of the data delivery phase that change.
Window homogenization: By assigning the indices of processors in R in a way 
analogous to those of Q, we ensure that the priority of writing processors in R reflects 
the priority of the corresponding writing processors in Q. So using a concurrent 
write, R finds, for each component, the written data with highest priority within the 
window in constant time.
Next, R compares the data generated within the window against the data arriving 
from the top and left windows. Each border holds a unique bus datum per component. 
Processors of R perform three writes (data generated within the window, data from 
the left border, and data from the top border) in sequence for each component. Finally, each 
processor in R compares the tags to select the final bus data on its buses.
Second vertical data sweep: In this phase, R does not have to make any 
comparison; it only has to broadcast data. The broadcast data is the same for each 
component in the slice, so processors in R write their data and let the bus resolve 
the concurrent writes.
Slice homogenization: Recall that each horizontal bus in the slice has a 
component number and a bus datum with a tag. Also, in each row of processors of R, 
there is a register where R stores a potential final bus datum for each component. 
We modify the slice homogenization algorithm of Section 3.5.3 in the following way. 
First, R stores the bus datum (and its tag) of each horizontal bus in its respective 
register (for each component that is present in the first window). Then, for each 
component, R compares the bus data tag in the first window against the one in the 
second window. R keeps in each register the bus datum with the higher tag and repeats 
this procedure to cover all the windows in the slice.
Since this procedure takes $O\left(\frac{N}{P}\right)$ time per window, R executes slice 
homogenization in $O\left(\frac{N^2}{P^2}\right)$ time. For this reason, we defer slice homogenization to execute only 
once before the second horizontal data sweep, as mentioned in the remark at the end 
of Section 3.5. This phase proceeds as follows.
Stage 1. This stage is a combination of Stages 1 and 2 of slice homogenization 
(see Section 3.5.3). The first loop includes Steps 2 to 5; the second loop includes only 
Steps 4 and 5.
1. Each processor of R configures its port connections as cross-over.
for u ← 0 to N/P − 1 perform Steps 2 to 5 in window W_{u,v}
2. The leftmost processor in each row of R writes the contents (if any) of register 
hdata (and its tag) to its horizontal bus.
3. In each row k of R, the processor in column ℓ that satisfies the relation [...] + m = 
comp(uP + k) for some m, where 0 ≤ m < N/P, compares the tag of its bus data 
(if any) against the tag of the data stored in register reg_m. The processor keeps in 
register reg_m the data with the higher-priority tag.
for m ← 0 to N/P − 1 perform Steps 4 and 5 in window W_{u,v}
4. In each column ℓ, each processor writes the contents (if any) of register reg_m to its 
vertical bus only if that register changed its contents in Step 3.
5. In each column ℓ, each processor reads its vertical bus and stores that value in its 
register reg_m.
Stage 2. This stage distributes the bus data to each bus in the slice. It is the same 
as Stage 3 of slice homogenization (Section 3.5.3).
Stage 1 determines the execution time of slice homogenization. R executes Stage 1 
in $O\left(\frac{N}{P}\right)$ time per window, so R performs slice homogenization in $O\left(\frac{N^2}{P^2}\right)$ time.
Second horizontal data sweep: In this phase, R proceeds in the same way 
as in the second vertical data sweep. R broadcasts data and lets the bus resolve 
concurrent writes.
R performs data delivery in $O\left(\frac{N^2}{P^2}\right)$ time. On the whole, this simulation takes 
$O\left(\frac{N^2}{P^2}\right)$ time; that is, it has a constant simulation overhead.
3.6.3 Simulation 3
In the FR-Mesh simulation, we use leader election based on priority resolution. 
Priority resolution is a convenient and easy-to-implement method for choosing a leader. 
In fact, however, any leader election method will work in the FR-Mesh simulation, not 
only priority resolution. So, by using a procedure similar to the one in Simulation 2, 
we obtain the following result.
Corollary 3.8 Any step of an N x N ARBITRARY CRCW FR-Mesh can be simulated 
on a P x P ARBITRARY CRCW FR-Mesh in $O\left(\frac{N^2}{P^2}\right)$ time, that is, with a constant 
simulation overhead.
3.6.4 Simulation 4
Now we prove that if R uses COMMON, COLLISION, or COLLISION+, then R simulates 
a step of Q with a simulation overhead of log P. Lemma 3.9 summarizes the result 
of Simulation 4.
Lemma 3.9 Any step of a P x P PRIORITY CRCW FR-Mesh can be simulated on 
a P x P COMMON, COLLISION, or COLLISION+ CRCW FR-Mesh in O(log P) time.
Proof: To prove Lemma 3.9, it is enough to show the simulation of a CRCW 
PRIORITY bus writing cycle on each of the COMMON, COLLISION, and COLLISION+ 
simulating models. These procedures find the writing processor with highest index 
from the set of writing processors. Assume that each processor has a unique O(log P)-bit 
key (the key may be its index). (To find the processor with lowest index, perform 
the procedures below, but with each processor using the 1's complement of its index rather 
than its real index.)
PRIORITY on COMMON: In the first step, each processor writes a '1' on its bus if the 
most significant bit of its index is '1'.
If a processor with a '0' in this bit reads:
a) a '1', then it remains idle for the rest of the procedure, or
b) a null value, then it remains active in the following iteration.
In the second iteration, the remaining processors repeat the process but now 
using the second most significant bit, and so on. As the algorithm progresses, fewer 
processors remain active in the contest. Finally, after 2 log P iterations, only the 
processor with highest index remains active in the contest. This winner processor 
writes its data to the bus.
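The bit-by-bit elimination just described can be sketched as below. This is a sequential model written for illustration; `common_bus`, `priority_on_common`, and the `key_bits` parameter (the number of index bits, 2 log P in the text) are names assumed for the example, not taken from the original.

```python
def common_bus(writes):
    """COMMON bus: concurrent writers must agree; returns the value, or None if idle."""
    assert len(set(writes)) <= 1, "COMMON rule violated"
    return writes[0] if writes else None


def priority_on_common(writers, key_bits):
    """writers maps a unique processor index to its data; returns (index, data)."""
    active = set(writers)
    for bit in reversed(range(key_bits)):        # most significant bit first
        ones = [i for i in active if (i >> bit) & 1]
        if common_bus([1] * len(ones)) == 1:     # some active processor wrote a '1'
            active = set(ones)                   # processors with a '0' go idle
    if not active:
        return None
    (winner,) = active                           # unique indices leave one survivor
    return winner, writers[winner]
```

For instance, with writers 5, 9, and 12 and four index bits, processor 5 drops out at the most significant bit and processor 9 at the next one, leaving processor 12 to write its data.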
PRIORITY on COLLISION: In the first step, each processor writes a '1' on its bus if the 
most significant bit of its index is '1'.
If a processor with a '0' in this bit reads:
a) a '1', then there is only one writer, which wins the contest,
b) a collision symbol, then the processor remains idle for the rest of the procedure, or
c) a null value, then the processor remains active in the following iteration.
If a processor with a '1' in this bit reads:
a) a '1', then this processor wins the contest, and writes its data to the bus, or
b) a collision symbol, then that processor remains active in the following iteration.
In the second iteration, the remaining processors repeat the process but now using 
the second most significant bit, and so on. This process continues for at most 2 log P 
iterations before finding a winner. The winner processor writes its data to the bus.
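A minimal sketch of the COLLISION variant follows, assuming unique indices that fit in `key_bits` bits. The names `collision_bus`, `priority_on_collision`, and `COLLISION` are hypothetical; the only difference from the COMMON case is that a lone writer is recognized immediately, so the contest may end early.

```python
COLLISION = "#"  # hypothetical collision symbol (assumption for this sketch)


def collision_bus(writes):
    """COLLISION bus: None if idle, the value for one writer, else the collision symbol."""
    if not writes:
        return None
    return writes[0] if len(writes) == 1 else COLLISION


def priority_on_collision(writers, key_bits):
    """Returns (winning index, its data, iterations used), or None without writers."""
    active, rounds = set(writers), 0
    for bit in reversed(range(key_bits)):        # most significant bit first
        rounds += 1
        ones = [i for i in active if (i >> bit) & 1]
        seen = collision_bus([1] * len(ones))
        if seen == 1:                            # a single writer: it wins at once
            active = set(ones)
            break
        if seen == COLLISION:                    # several writers: '0' processors go idle
            active = set(ones)
    if not active:
        return None
    (winner,) = active
    return winner, writers[winner], rounds
```

With writers 5, 9, and 12 and four index bits, the first iteration produces a collision between 9 and 12, and the second iteration has 12 as its only writer, so the contest ends after two iterations.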
PRIORITY on COLLISION+: The COMMON rule is a restricted case of the COLLISION+ 
rule. Since the COMMON rule simulates the PRIORITY rule in 2 log P steps, so does 
the COLLISION+ rule.
By combining Lemmas 3.6, 3.7, and 3.9, we obtain Theorem 3.3. By combining 
Lemma 3.6 and Corollary 3.8, we obtain Theorem 3.4. Similarly, by combining 
Lemma 3.6 and Lemma 3.7, we obtain Theorem 3.5.
3.7 Improved Scaling Simulation of the R-Mesh
This section describes an algorithm to simulate an arbitrary step of an N x N 
R-Mesh, Q, on a P x P R-Mesh, R, in $O\left(\frac{N^2}{P^2}\log P\log\frac{N}{P}\right)$ steps; this establishes a 
log P log(N/P) simulation overhead. Our method applies our previous FR-Mesh scaling 
result (Theorem 3.3) to the R-Mesh scaling simulation of Ben-Asher et al. [4].
Ben-Asher et al. [4] proposed a scaling simulation for the unrestricted R-Mesh 
that simulates a step of an N x N R-Mesh on a P x P R-Mesh in $O\left(\frac{N^2}{P^2}\log N\log\frac{N}{P}\right)$ 
time. The log N factor in the simulation overhead is the contribution of a connected 
components algorithm used in the scaling simulation. This algorithm obtains the 
connected components of an N-node graph in O(log N) time using an N x N LR-Mesh, 
and in $O\left(\frac{N^2}{P^2}\log N\right)$ time after scaling it down.
We improve this simulation by replacing the connected components algorithm 
by one (based on the incidence matrix of the graph) that runs in O(1) time on an 
N x N ARBITRARY (or PRIORITY) FR-Mesh, and in $O\left(\frac{N^2}{P^2}\log P\right)$ time after scaling 
it down to a P x P COMMON FR-Mesh. Thus, the improved scaling simulation has 
a simulation overhead of log P log(N/P). The following theorem summarizes this result.
Theorem 3.10 For any P < N, any step of an N x N COMMON, COLLISION, or 
COLLISION+ CRCW R-Mesh can be simulated on a P x P COMMON, COLLISION, or 
COLLISION+ CRCW R-Mesh in $O\left(\frac{N^2}{P^2}\log P\log\frac{N}{P}\right)$ time. ■
Remark: If leader election can be done in T = o(log P) time, then the R-Mesh 
simulation overhead reduces to T log(N/P). In particular, if the simulating R-Mesh is 
allowed to use the ARBITRARY (or PRIORITY) rule, then the overhead is only log(N/P).
We now briefly describe the existing R-Mesh scalability simulation [4] that we will 
subsequently modify.
3.7.1 Existing R-Mesh Scalability Simulation
The method of Ben-Asher et al. [4] simulates a step of an N x N R-Mesh, Q, on a 
P x P R-Mesh, R. The most time-consuming part of this algorithm is the recursive 
procedure of Figure 3.9. This procedure identifies the components of Q and spends 
$O\left(\frac{N^2}{P^2}\log N\log\frac{N}{P}\right)$ time. The remainder of the simulation runs within this bound.
Procedure leaders(S, X, P)
/* Chooses a leader for each bus of an X x X R-Mesh, S, using */
/* a P x P R-Mesh. The output is the set, L, of leaders.      */
if X > P then
    Divide S into four X/2 x X/2 sub-R-Meshes, S_1, S_2, S_3, and S_4
    for j ← 1 to 4 do
        L_j ← leaders(S_j, X/2, P)
    components(L_1, L_2, L_3, L_4)
end
Figure 3.9: Ben-Asher et al. procedure to calculate connected components.
To find a leader processor for each bus of an X x X R-Mesh S, R calculates the 
connected components of an 8X-node graph with at most 2X edges using the procedure 
components(L_1, L_2, L_3, L_4). R obtains this graph from the connected components 
information generated on the four sub-R-Meshes of S, each of size X/2 x X/2. That 
is, each sub-R-Mesh contributes 4(X/2) nodes (border ports), and there are at most 
4(X/2) edges between the border ports of these sub-R-Meshes. When X ≤ P, this 
problem reduces to finding a representative among O(P) border ports for each bus, 
which R can accomplish in O(log P) steps.
Figure 3.10: Decomposition of S into S_1, S_2, S_3, and S_4 (four X/2 x X/2 quadrants).
For m ≤ n, let t_c(n, m) denote the time required for an m x m R-Mesh to find the 
connected components of an 8n-node graph with at most 2n edges. If T(N, P) denotes 
the time to simulate an N x N R-Mesh on a P x P R-Mesh, then from the above 
discussion:
$$T(P,P) = O(\log P) \quad\text{and}\quad T(N,P) = 4\,T\!\left(\tfrac{N}{2},P\right) + t_c(N,P), \text{ for } P < N.$$
This gives
$$T(N,P) = O\!\left(\frac{N^2}{P^2}\left[\log P + \sum_{i=1}^{\log\frac{N}{P}} 4^{-i}\, t_c(2^i P, P)\right]\right).$$
Ben-Asher et al. [4] showed that an N x N LR-Mesh can find the connected 
components of an 8N-node, 2N-edge graph in O(log N) time. Since an LR-Mesh 
is completely scalable, a P x P LR-Mesh can find components(L_1, L_2, L_3, L_4) in 
$t_c(N,P) = O\left(\frac{N^2}{P^2}\log N\right)$ steps.
With this value of t_c(N, P) in the equation for T(N, P), we have
$$T(N,P) = O\!\left(\frac{N^2}{P^2}\log N\log\frac{N}{P}\right).$$
That is, the scalability factor of the simulation is log N log(N/P).
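As a check on this bound, the recurrence can be unrolled explicitly. The derivation below is a sketch that assumes N/P is a power of two, writes k = log(N/P), and uses the scaled LR-Mesh bound t_c(n, P) = O((n^2/P^2) log n); the symbol k is introduced here only for the derivation.

```latex
\begin{align*}
T(N,P) &= 4^{k}\,T(P,P) + \sum_{i=1}^{k} 4^{\,k-i}\, t_c(2^{i}P, P)
        = \frac{N^2}{P^2}\Bigl[\,O(\log P) + \sum_{i=1}^{k} 4^{-i}\, t_c(2^{i}P, P)\Bigr],\\[2pt]
4^{-i}\, t_c(2^{i}P, P) &= 4^{-i}\cdot O\bigl(4^{i}\log(2^{i}P)\bigr) = O(i + \log P),\\[2pt]
\sum_{i=1}^{k} O(i + \log P) &= O\bigl(k^{2} + k\log P\bigr)
        = O\bigl(k\,(k + \log P)\bigr) = O\Bigl(\log\tfrac{N}{P}\,\log N\Bigr).
\end{align*}
```

The last step uses k + log P = log(N/P) + log P = log N. Multiplying by N^2/P^2 and absorbing the O(log P) term gives the stated O((N^2/P^2) log N log(N/P)).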
3.7.2 The New Simulation
Our idea is to replace the LR-Mesh connected components algorithm used by 
components(L_1, L_2, L_3, L_4) with a faster, scalable FR-Mesh algorithm.
Lemma 3.11 For any P < N, a P x P FR-Mesh (using the COMMON or COLLISION 
rule) can find the connected components of a graph with O(N) nodes and O(N) edges 
in $O\left(\frac{N^2}{P^2}\log P\right)$ time.
Proof: Since the FR-Mesh scales with overhead O(log P) (Theorem 3.3), it suffices 
to prove that, by embedding the incidence matrix of the c_1 N-node, c_2 N-edge graph 
(c_1 and c_2 are constants) into a c_1 N x c_2 N ARBITRARY (or PRIORITY) FR-Mesh, the 
FR-Mesh can find the connected components of the graph in constant time. Let the vertical 
bus in column i of the FR-Mesh represent node i of the graph, and let the horizontal 
bus in row j of the FR-Mesh represent edge j of the graph. Now, embed the incidence 
matrix of the graph on the FR-Mesh as follows. Processor p_{i,j} of the FR-Mesh sets 
a fusing connection if edge j is incident on node i; otherwise, it sets the cross-over 
connection.
Let (u, x_1), (x_1, x_2), ..., (x_k, v) be a set of edges that connects node u to node v 
in the graph. In the FR-Mesh, the vertical buses in columns u and x_1 are connected 
through the horizontal bus that represents edge (u, x_1). Similarly, the vertical buses 
in columns x_1 and x_2 are connected through the horizontal bus that represents edge 
(x_1, x_2), and so on. Consequently, all nodes in the same component of the graph 
are also in the same component of the FR-Mesh. To choose a single label in each 
component, each processor on the first row now writes its column index (which is 
the identity of the node that column represents) on its vertical bus, then it reads the 
resulting component number from the same bus. ■
Incidence matrix of Figure 3.11(b) (rows: edges a to d; columns: nodes 1 to 6):
        1 2 3 4 5 6
    a   0 1 1 0 0 0
    b   1 0 0 1 0 0
    c   0 0 1 0 1 0
    d   0 0 0 0 1 1
Figure 3.11: Embedding the incidence matrix in an FR-Mesh: a) Two-component 
graph; b) Incidence matrix; c) The embedding.
Example: Figure 3.11(a) shows a two-component graph with six nodes and four 
edges. Figure 3.11(b) shows the incidence matrix of this graph, and Figure 3.11(c) 
shows the incidence matrix embedded in an FR-Mesh. Figure 3.11(c) shows the 
existence of a path between vertical buses 1 and 4 through horizontal bus b (this 
component is shown as solid bold lines). Also notice the path between vertical buses 
2 and 6 through buses a, 3, c, 5, d (this component is indicated with dotted lines). 
When processors on the first row resolve concurrent writes, the top processors in 
columns 1 and 4 read '1', and the top processors in columns 2, 3, 5, and 6 read '2'.
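The effect of the embedding can be reproduced with a short sequential sketch. It is not the constant-time bus mechanism itself: a union-find structure stands in for the fused buses, and picking the smallest column index as the label is an assumption chosen to match the labels in the example (a PRIORITY-style resolution). The function name `component_labels` is hypothetical.

```python
def component_labels(incidence):
    """incidence[j][i] == 1 iff edge j (a row) is incident on node i (a column)."""
    rows, cols = len(incidence), len(incidence[0])
    parent = list(range(cols + rows))    # vertical buses 0..cols-1, then horizontal buses

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x

    for j, row in enumerate(incidence):
        for i, bit in enumerate(row):
            if bit:                          # fusing connection joins row bus j and column bus i
                parent[find(cols + j)] = find(i)

    label = {}
    for i in range(cols):                    # ascending scan: first column seen is the smallest
        label.setdefault(find(i), i + 1)     # 1-based node numbers, as in Figure 3.11
    return [label[find(i)] for i in range(cols)]


FIGURE_3_11 = [
    [0, 1, 1, 0, 0, 0],  # edge a
    [1, 0, 0, 1, 0, 0],  # edge b
    [0, 0, 1, 0, 1, 0],  # edge c
    [0, 0, 0, 0, 1, 1],  # edge d
]
print(component_labels(FIGURE_3_11))  # [1, 2, 2, 1, 2, 2]
```

Columns 1 and 4 receive label 1 and columns 2, 3, 5, and 6 receive label 2, in agreement with the values read by the top-row processors in the example.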
Notice the importance of assuming the ARBITRARY (or PRIORITY) rule in the 
simulating machine to choose the component label. By using either of these write 
rules, the algorithm runs in O(1) time. If we scale it down by using the FR-Mesh 
self-simulation (Theorem 3.3), then we get an overhead of O(log P), running on a P x P 
FR-Mesh with the COMMON or COLLISION rule. Otherwise, by initially assuming 
the COMMON or COLLISION rules, the connected components algorithm would run 
in O(log N) time; scaling it down then results in an overhead of O(log N log P).
Lemma 3.11 establishes that $t_c(N,P) = O\left(\frac{N^2}{P^2}\log P\right)$. With this value for t_c(N, P) 
in the equation for T(N, P), we obtain
$$T(N,P) = O\!\left(\frac{N^2}{P^2}\log P\log\frac{N}{P}\right).$$
That is, the scalability factor is log P log(N/P).
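The two bounds can be compared numerically by evaluating the recurrence T(N, P) = 4 T(N/2, P) + t_c(N, P) directly. The sketch below sets all hidden constants to 1 and assumes N/P is a power of two; the function names `tc_lr` and `tc_fr` are hypothetical labels for the scaled LR-Mesh and FR-Mesh connected components costs.

```python
import math


def T(n, p, tc):
    """Evaluate T(n, p) = 4 T(n/2, p) + tc(n, p) with T(p, p) = log2(p)."""
    if n <= p:
        return math.log2(p)
    return 4 * T(n // 2, p, tc) + tc(n, p)


def tc_lr(n, p):
    """Scaled LR-Mesh algorithm: (n^2 / p^2) log n."""
    return (n / p) ** 2 * math.log2(n)


def tc_fr(n, p):
    """Scaled FR-Mesh algorithm of Lemma 3.11: (n^2 / p^2) log p."""
    return (n / p) ** 2 * math.log2(p)


N, P = 1024, 16
k = int(math.log2(N // P))
print(T(N, P, tc_lr) / (N / P) ** 2)   # log P (k + 1) + k (k + 1) / 2
print(T(N, P, tc_fr) / (N / P) ** 2)   # log P (k + 1)
```

After dividing by N^2/P^2, the FR-Mesh version grows as log P times log(N/P), while the LR-Mesh version carries an additional term quadratic in log(N/P), which is where the log N factor of the earlier bound comes from.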
As noted before, the FR-Mesh scalability simulation of Theorem 3.3 has an O(log P) 
overhead due to leader election. Any improvement in leader election implies 
corresponding improvements in Lemma 3.11 and Theorem 3.10.
Chapter 4
Bus Linearization
The unrestricted R-Mesh can create bus structures of many different shapes (see 
Figure 4.2). On the one hand, flexibility in shaping buses facilitates algorithm design 
and can reduce running time, but on the other hand, certain bus shapes such as 
branches and cycles require more complicated hardware to implement these models. 
In this chapter, we present a procedure called bus linearization [15] that transforms a 
bus of any shape allowed by the R-Mesh into one with an equivalent linear structure. 
This procedure gives an algorithm designer the liberty of using buses of arbitrary 
shape, while automatically translating the algorithm to run on a more implementable 
platform.
We illustrate the use of bus linearization through two important applications. The 
first constructs a faster "scaling simulation" for the R-Mesh. The second application 
adapts algorithms designed for the R-Mesh to run on models that use optical buses 
[40, 44, 45].
The objective of bus linearization is to transform any "non-linear bus" (see 
Figure 4.1(a)) allowed by the R-Mesh into an "acyclic linear bus" (see Figure 4.1(b)). 
The LR-Mesh [4] can then realize the resulting bus structure. Specifically, we 
prove that an N x N LR-Mesh (with only acyclic buses) can simulate an arbitrary 
step of an N x N R-Mesh in O(log N) time.
Figure 4.1: Types of buses: a) Non-linear; b) Acyclic linear; c) Cyclic linear.
This procedure iteratively grows a spanning tree for each bus in the R-Mesh. 
Simultaneously, it creates a “pseudo-Euler” tour for each tree; this is an equivalent 
acyclic linear bus representation of the spanning tree.
Matias and Schuster [32] also designed an algorithm that simulates an R-Mesh 
using an LR-Mesh. Although their algorithm targets a scaling simulation, one can 
use it for bus linearization. Their approach differs fundamentally from ours, however. 
They used a randomized simulation with (among others) the COLLISION rule for 
concurrently writing on buses; the overhead of their method (with both machines of size 
N x N) is log N log log N. Our method does not require concurrent writes, is 
deterministic, and has a smaller overhead of log N for the same problem. Section 4.3.1 
gives more details contrasting their algorithm with ours.
Scaling Simulations. In this chapter, we use bus linearization to construct a 
new deterministic scaling simulation for the unrestricted R-Mesh. This approach has 
a log N simulation overhead, which further improves the best previous 
deterministic simulation overhead (Section 3.7.2) of log P log(N/P). Furthermore, the simulating 
LR-Mesh uses only exclusive writes, whereas all prior scaling simulations required 
concurrent writes (see Table 1.1).
We also use bus linearization to allow the FR-Mesh scaling simulation of Section 3 
to run on a weaker simulating model (see Table 1.1); the previous simulation requires 
a CRCW FR-Mesh, while the new simulation needs only a CREW LR-Mesh.
Optical Models. Reconfigurable models with pipelined optical buses have 
attracted research interest because of their ability to handle communication-intensive 
algorithms efficiently. The Pipelined Reconfigurable Mesh (PR-Mesh) [45] is a 
variation of the R-Mesh that uses pipelined optical buses. Other optical models include 
the Array with Reconfigurable Optical Buses (AROB) [41] and the Array of 
Processors with Pipelined Buses using Switches (APPBS) [7, 17]. One can view the 
PR-Mesh as a version of the acyclic LR-Mesh with pipelined optical buses, or as a 
two-dimensional extension of the Linear Array with Reconfigurable Pipelined Bus 
System (LARPBS) [27, 40]. Due to the structure of its transmitting and 
receiving connections, the PR-Mesh (like other optical reconfigurable models) allows only 
acyclic linear connections.
We use bus linearization to translate the vast body of R-Mesh algorithms to run 
on the above optical models. Because of its similarity to the LR-Mesh, the PR-Mesh 
is the best suited to use bus linearization; in fact, bus linearization can run on a 
restricted version of the PR-Mesh. Table 1.1 shows our results for the PR-Mesh. 
Using results from Bourgeois and Trahan [7], we extend the PR-Mesh results to also 
include the AROB and APPBS.
The organization of this chapter is as follows. Section 4.1 gives some basic 
definitions. Section 4.2 details bus linearization and Section 4.3 uses bus linearization to 
construct new scaling simulations for the R-Mesh and FR-Mesh. Finally, Section 4.4 
presents scaling simulations of the R-Mesh and FR-Mesh on reconfigurable pipelined 
optical models.
4.1 Definitions
This section introduces some basic concepts, such as the graph of an R-Mesh, the 
processor mapping used by the simulation, and two procedures to perform leader 
election on linear buses.
Figure 4.2(a) shows two buses in bold. Each bus induces a "component" in the 
R-Mesh. A component is the set of ports connected by the bus. It is possible for 
different buses to induce the same component. Two buses that induce the same 
component are said to be equivalent. The corresponding bold buses in Figures 4.2(a) 
and 4.2(b) are equivalent, though different in shape.
Figure 4.2: Port partitions of an R-Mesh.
A bus is linear iff it connects its ports only as allowed by the LR-Mesh; otherwise, 
the bus is non-linear. A linear bus can be acyclic or cyclic, as shown in Figures 4.1(b) 
and 4.1(c), respectively.
4.1.1 Graph of an R-Mesh
The graph, G, of an R-Mesh configuration is a graphical representation of the 
connections between its ports. Each block in the port partition of a processor generates 
a node of G. Thus, one can view each node of G as a set of ports internally connected 
within a processor of the R-Mesh. Let v_1 and v_2 be nodes of G. An edge exists 
in G between v_1 and v_2 iff an edge exists in the R-Mesh between ports p_1 and p_2, 
where ports p_1 and p_2 are elements of the partition blocks that generate nodes v_1 
and v_2, respectively. Figure 4.3(a) shows an R-Mesh configuration with its graph G 
in Figure 4.3(b); the dotted squares represent the corresponding processors for each 
partition of ports, but they are not part of the graph. Clearly each node has degree 
0, 1, 2, 3, or 4. Let the term terminal node refer to a degree-1 node; linear node, to 
a degree-2 node; and non-linear node, to a degree-3 or degree-4 node.
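The construction of G can be sketched in a few lines. The representation below is an assumption made for illustration: each processor's port partition is a list of blocks over the port names 'N', 'E', 'S', 'W', every port appears in exactly one block, and a node is identified by the triple (row, column, block index). The label 'isolated' for degree-0 nodes is not a term from the text.

```python
from collections import defaultdict


def rmesh_graph(partitions):
    """partitions[r][c] is a list of blocks (sets of port names) for processor (r, c).
    Returns (nodes, edges), where each node is (r, c, block_index)."""
    rows, cols = len(partitions), len(partitions[0])
    node_of = {}
    for r in range(rows):
        for c in range(cols):
            for b, block in enumerate(partitions[r][c]):
                for port in block:
                    node_of[(r, c, port)] = (r, c, b)   # each block generates one node
    edges = []
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:    # external link between the E port and the neighbor's W port
                edges.append((node_of[(r, c, "E")], node_of[(r, c + 1, "W")]))
            if r + 1 < rows:    # external link between the S port and the neighbor's N port
                edges.append((node_of[(r, c, "S")], node_of[(r + 1, c, "N")]))
    return sorted(set(node_of.values())), edges


def classify(nodes, edges):
    """Name each node by its degree: terminal (1), linear (2), non-linear (3 or 4)."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    names = {0: "isolated", 1: "terminal", 2: "linear"}
    return {n: names.get(degree[n], "non-linear") for n in nodes}
```

For example, in a 3 x 3 R-Mesh where every processor fuses all four ports, the center node has degree 4 (non-linear), each corner node has degree 2 (linear), and each remaining border node has degree 3 (non-linear).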
Figure 4.3: Graph of the R-Mesh: a) Configuration of an R-Mesh; b) Graph G of the 
R-Mesh.
4.1.2 Mapping R-Mesh Processors to LR-Mesh Processors
During the simulation of the R-Mesh by the LR-Mesh (Section 4.2), a group of four 
processors of the LR-Mesh simulates a single processor of the R-Mesh. For each 
port partition of the R-Mesh, there is an equivalent group configuration assumed by a 
group of processors of the LR-Mesh. Figure 4.4 shows representative configurations 
of an R-Mesh processor and its equivalent group configuration.
Figure 4.4: Equivalent group configurations for R-Mesh processors: a-d) Linear 
processors; e-f) Non-linear processors; g) Terminal processor.
Each group of four processors has eight ports through which it connects to neighboring
groups. Assign a direction (incoming or outgoing) to each of these ports as
shown in Figure 4.4. This assignment is only for ease of explanation and does not
require the more powerful directional model [5], which can restrict information flow to
only one direction on a bus, because the LR-Mesh processors will read only
from incoming ports and segment and write only to outgoing ports. These oppositely
“directed” buses, combined with the equivalent group configurations of the R-Mesh
processors, create a double bus structure that is fundamental for some of the
procedures presented in this chapter.
As is clear from Figure 4.4, each pair of incoming/outgoing ports of a group of
processors in the LR-Mesh corresponds to a port of an R-Mesh processor. Since each
node of graph $G$ of the R-Mesh configuration corresponds to a set of ports, we will
say that a node reads or writes to refer to reads and writes at the corresponding ports
of the group.
4.1.3 Leader Election
Let $S$ be a set of processors in the R-Mesh connected by a linear bus. Let $C \subseteq S$ be a
set of candidates for leader. Leader election is the problem of selecting any one element
from $C$. Leader election is a fundamental part of the R-Mesh simulation in Section 4.2.
Furthermore, the ability of the LR-Mesh to perform leader election among processors
connected by acyclic linear buses in constant time allows a fast R-Mesh simulation
without the use of concurrent writes. We present two solutions to leader election,
corresponding to cyclic and acyclic linear buses. In these solutions we use a group of
four processors to simulate a single processor of $S$ (each square in Figure 4.5 is a group
of four processors). We also use the equivalent group configurations of linear nodes
(Figure 4.4(a) to 4.4(d)) to embed the connections among ports and to create the
double bus structure described in Section 4.1.2. Assume that each group of processors
has identified whether the linear bus to which it connects is cyclic or acyclic (the
first step of the simulation of Section 4.2 explains how to make this determination).
Figure 4.5: Leader election examples: a) Acyclic bus; b) Cyclic bus.
Leader Election on Acyclic Linear Buses: This leader election method consists
of two steps: (1) select one of the terminal groups of the acyclic linear bus as a
reference group; (2) select the candidate closest to the reference group as the leader.

To select the reference group, the two terminal groups exchange their indices
(using the double bus structure described in Section 4.1.2) and decide upon the lower
indexed group as the reference. To elect the leader, we use a procedure called neighbor
localization that works as follows. If the reference group is a candidate, then it is
elected as the leader; otherwise, each candidate segments its portion of linear bus,
while non-candidate groups leave the bus unsegmented. A write by the reference
group reaches the candidate closest to the reference. This candidate is elected leader.
Notice that this procedure runs without concurrent writes and takes constant time.

Example: In Figure 4.5(a), the two terminal groups exchange their indices, 0 and
6. They select group 0 as the reference group. Then, group 0 writes on the bus, and
group 1 (the closest candidate to the reference) reads the bus and declares itself the
leader.

Lemma 4.1 A CREW LR-Mesh can perform leader election on an acyclic linear bus
(using a double bus structure) in constant time.
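
The two steps above can be sketched sequentially in Python. This is only an illustrative model, not the mesh procedure itself: the function name `elect_leader_acyclic` and the list-based view of the bus are assumptions made here, and the loop stands in for a single bus write that stops at the first segmenting candidate.

```python
from typing import Optional, Sequence


def elect_leader_acyclic(indices: Sequence[int],
                         is_candidate: Sequence[bool]) -> Optional[int]:
    """Return the index of the elected group, or None if there is no candidate.

    indices[k] is the index of the k-th group along the acyclic bus, listed
    from one terminal group to the other; is_candidate[k] flags candidates.
    """
    if not indices:
        return None
    # Step 1: the two terminal groups compare indices; the lower one is the reference.
    n = len(indices)
    if indices[0] <= indices[-1]:
        order = range(n)                 # reference is the first group
    else:
        order = range(n - 1, -1, -1)     # reference is the last group
    # Step 2 (neighbor localization): the reference's write travels along the
    # bus until it meets the first candidate, which has segmented the bus.
    # If the reference is itself a candidate, it is first in this order.
    for k in order:
        if is_candidate[k]:
            return indices[k]
    return None
```

For a bus with groups 0 to 6 as in the example, and a hypothetical candidate set {1, 3, 4}, `elect_leader_acyclic(list(range(7)), [False, True, False, True, True, False, False])` returns 1, the candidate closest to reference group 0.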
Leader Election on Cyclic Buses: Finding a leader on a cyclic bus is more
involved than on an acyclic bus and compels a different approach. The procedure we
use selects the least indexed element of the set $C$ as the leader. Non-candidate groups
simply provide buses between adjacent candidate groups, while candidate groups split
their bus (see Figure 4.5(a)). The idea is to shrink $C$ until it contains only the leader.
Reduce set $C$ as follows.

Each candidate exchanges its index with its neighboring candidates (using the
double bus structure described in Section 4.1.2). Since two outgoing ports are never
adjacent, the exchange of indices is without conflict. Any candidate with an index
larger than either of its neighbors cannot be the leader, so it excludes itself from $C$ and
connects its internal ports as a non-candidate group. Since the procedure removes at
least one of each pair of neighboring candidates from $C$, the number of candidates at
least halves in each iteration. The procedure repeats on the reduced candidate set
and stops when only one candidate remains. To determine this condition, test if a
candidate is its own neighbor. Notice that this procedure does not require concurrent
writes and takes $O(\log |C|)$ time. Therefore, we have the following result.

Lemma 4.2 A CREW LR-Mesh can perform leader election on a cyclic bus (using
a double bus structure) with $X$ candidates in $O(\log X)$ time.
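
The shrinking of $C$ can be illustrated with a short sequential sketch. It assumes distinct candidate indices listed in ring order; the name `elect_leader_cyclic` and the returned round count are introduced here for illustration only. Each pass of the loop models one parallel round in which all candidates compare indices with both neighboring candidates at once.

```python
from typing import List, Sequence, Tuple


def elect_leader_cyclic(candidates: Sequence[int]) -> Tuple[int, int]:
    """Return (leader, rounds) for distinct candidate indices in ring order."""
    ring: List[int] = list(candidates)
    if not ring:
        raise ValueError("at least one candidate is required")
    rounds = 0
    # A lone candidate is its own neighbor, which is the stopping condition.
    while len(ring) > 1:
        k = len(ring)
        # A candidate survives only if it is smaller than both neighboring
        # candidates; the overall minimum always survives, so the ring never empties.
        ring = [c for i, c in enumerate(ring)
                if c < ring[i - 1] and c < ring[(i + 1) % k]]
        rounds += 1
    return ring[0], rounds
```

Since no two adjacent candidates can both survive a round, the candidate count at least halves each time; for the ring `[5, 2, 7, 1, 6, 3, 4, 0]` the survivors are `[2, 1, 3, 0]`, then `[1, 0]`, then `[0]`.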
4.2 Bus Linearization
Bus linearization is a procedure for transforming non-linear buses of an R-Mesh into
acyclic linear buses. This section describes a method for bus linearization that
simulates an R-Mesh (that uses non-linear buses) on an LR-Mesh (that uses only linear
buses). While linear buses are important because of their simple structure, a further
restriction to acyclic buses has the advantage that they admit constant time leader
election, an important procedure for eliminating concurrent writes. Pipelining on
optical buses also assumes acyclic buses.
The idea of bus linearization is to generate a spanning tree (equivalent acyclic bus)
of the graph $G$ of the R-Mesh and find a pseudo-Euler tour (equivalent acyclic linear
bus) of the spanning tree. The pseudo-Euler tour enables the LR-Mesh to handle
the spanning tree, hence, R-Mesh bus structure, as a linear bus. We construct the
spanning tree and pseudo-Euler tour iteratively, growing the spanning tree from an
initial linear subgraph. An iteration starts with a collection of partial spanning trees
and their pseudo-Euler tours. The iteration merges partial spanning trees and their
pseudo-Euler tours for the next iteration. This method captures connectedness of a
graph in a manner similar to the connected components algorithm of Shiloach and
Vishkin [43].
R-Mesh configuration → Spanning tree → Pseudo-Euler tour → LRN-Mesh configuration
Figure 4.6: Linearization procedure.
This section establishes the following result.
Theorem 4.3 (Bus linearization) Any step of an $N \times N$ COMMON, COLLISION,
COLLISION$^+$, ARBITRARY, or PRIORITY CRCW R-Mesh can be simulated on an
$N \times N$ CREW LR-Mesh in $O(\log N)$ time.
The first step in the proof of Theorem 4.3 is an $O(\log N)$ time simulation of an
$N \times N$ COMMON CRCW R-Mesh, $Q$, on a $2N \times 2N$ COMMON CRCW LR-Mesh,
$Z$. We will later reduce the size of $Z$ and describe the modifications required for
introducing other write rules in $Q$, and then eliminate the need for concurrent writes
in $Z$.
4.2.1 Simulation of R-Mesh by LRN-Mesh
We now describe the simulation with the running example of Figure 4.7. For clarity
the example shows only one component. Each group of four processors of $Z$ holds the
processor index, port configuration, data to be written on buses, and the computation
to be performed by the processor of $Q$ it simulates.
Step 1 - Identifying bus types: This step classifies each bus of the R-Mesh as
non-linear, cyclic linear, or acyclic linear. First, embed the graph $G$ of the R-Mesh $Q$
in $Z$ using the configurations of Figure 4.4(a)-(d) for linear nodes and without any
internal connection in the groups for terminal and non-linear nodes (see Figure 4.7(b)).

Each non-linear node now broadcasts a signal to all the nodes connected to it.
If a node receives the signal, then its ports are on a non-linear bus; otherwise, its
ports are on a linear bus. The next phase determines whether a linear bus is cyclic or
acyclic, so only nodes on linear buses participate. Each terminal node writes a signal
on its port. If a node receives this signal, then its ports are on an acyclic linear bus;
otherwise, they are on a cyclic linear bus.
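
The outcome of this classification can be reproduced sequentially on an adjacency-list view of graph $G$. The sketch below is an assumption-laden illustration: `classify_buses` and the dictionary representation are hypothetical names chosen here, a breadth-first search replaces the two broadcast phases, and an isolated degree-0 node (which the text does not discuss) is simply reported as acyclic.

```python
from collections import deque
from typing import Dict, Hashable, List


def classify_buses(adj: Dict[Hashable, List[Hashable]]) -> Dict[Hashable, str]:
    """Label every node of G with the type of the bus (component) it lies on."""
    label: Dict[Hashable, str] = {}
    for start in adj:
        if start in label:
            continue
        # Collect the connected component (one bus) that contains `start`.
        component = [start]
        seen = {start}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    component.append(v)
                    queue.append(v)
        degrees = [len(adj[u]) for u in component]
        if max(degrees) >= 3:
            kind = "non-linear"       # some non-linear node would broadcast
        elif min(degrees) <= 1:
            kind = "acyclic linear"   # some terminal node would write a signal
        else:
            kind = "cyclic linear"    # every node has degree 2: nobody signals
        for u in component:
            label[u] = kind
    return label
```

A three-node path is reported as acyclic linear, a triangle as cyclic linear, and any component containing a degree-3 or degree-4 node as non-linear.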
Step 2 - Eliminating cyclic linear buses: By Lemma 4.2, elect a leader in each
cycle and cut the bus at the leader. At this point, all linear buses are acyclic. This
step generates graph $G'$, which is equivalent to graph $G$ and is also embedded in $Z$.

Next, $Z$ performs the writing cycle on linear acyclic buses of graph $G'$. In the
writing cycle, writing nodes write their data to buses and reading nodes read and
store the data. The remaining steps deal with the non-linear buses of graph $G'$.
Figure 4.7: LR-Mesh simulating an R-Mesh (first part): a) R-Mesh non-linear graph;
b) LR-Mesh replication of edges of the non-linear graph; c) Raked graph; d) Distilled
graph.

Step 3 - Constructing the raked graph: This step partially contracts the
graph in a way analogous to raking (removing) the leaves of a tree. Specifically,
$Z$ constructs a raked graph by removing each chain of linear nodes that connects a
terminal node to a non-linear node (Figure 4.7(c)). The reason for this step is to
eliminate port configurations like those of processors (0,1) and (1,2) in Figure 4.3,
for which a $2 \times 2$ group lacks sufficient width to directly embed the connections for
constructing a pseudo-Euler tour in subsequent steps.
Figure 4.8: Raking chains of linear nodes in Step 3.
Consider a terminal node $t$, connected via linear nodes $e_1, e_2, \ldots, e_k$ to non-linear
node $x$ (see Figure 4.8). First, $Z$ configures groups as in Step 1. Then, node $t$
transmits a predefined signal (for example a ‘1’) to notify $e_1, e_2, \ldots, e_k$ that they will be
raked up. This signal also informs node $x$ that it needs to store the resulting bus data
after performing the write cycle on the bus that connects the chain of linear nodes.
During the write cycle, any writing port associated with nodes $t, e_1, e_2, \ldots, e_k, x$ writes
to the bus, the bus resolves the concurrent writes, and node $x$ (having been alerted by
the previous signal) picks up the resulting data from the bus. (After the pseudo-Euler
tour construction, Step 10 will incorporate this data while handling other writes.)
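
One round of raking, as described in Step 3, can be sketched on an adjacency-list graph. This is a sequential illustration under assumptions: `rake` is a hypothetical helper name, the walk from each terminal node stands in for the signal sent by $t$, and the storing of bus data at $x$ is not modeled.

```python
from typing import Dict, Hashable, List, Set, Tuple


def rake(adj: Dict[Hashable, List[Hashable]]
         ) -> Tuple[Dict[Hashable, List[Hashable]], Set[Hashable]]:
    """Remove every chain t, e_1, ..., e_k that joins a terminal node t to a
    non-linear node x. Returns the raked adjacency lists and the removed nodes."""
    removed: Set[Hashable] = set()
    for t in adj:
        if len(adj[t]) != 1:
            continue                      # only terminal nodes start a chain
        chain = [t]
        prev, cur = t, adj[t][0]
        while len(adj[cur]) == 2:         # walk through the linear nodes e_1..e_k
            chain.append(cur)
            a, b = adj[cur]
            prev, cur = cur, (b if a == prev else a)
        if len(adj[cur]) >= 3:            # the chain ends at a non-linear node x
            removed.update(chain)
    raked = {u: [v for v in nbrs if v not in removed]
             for u, nbrs in adj.items() if u not in removed}
    return raked, removed
```

Note that node $x$ itself stays in the raked graph, possibly with a smaller degree than it had before.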
Step 4 - Constructing the distilled graph: In this step, $Z$ transforms the
raked graph into the distilled graph by flagging maximal chains of linear nodes as
edges. This transformation allows $Z$ to speed up the construction of the spanning
tree, since long chains of linear nodes (now edges) can be flagged to be part of the
spanning tree in the following steps. Only the non-linear and the newly generated
terminal nodes of the raked graph can be nodes in the distilled graph. (Figure 4.7(d)
shows a distilled graph; notice how the linear node changed into an edge.)
Step 5 - Initiating spanning tree construction: The goal of Steps 5 to 8 is
to generate a spanning tree for each connected component of the distilled graph.
In Step 5, $Z$ starts the spanning tree construction by merging edges of the distilled
graph. This step generates a set of partial trees. Subsequent steps will merge these
trees iteratively to complete the spanning tree construction. For simplicity, we will
explain the procedure for a single component and refer to the partial trees in the
component as the forest.
Only groups of processors of $Z$ that represent nodes of the distilled graph are
active during Step 5; we refer to them as active nodes. Step 5 proceeds as follows.
The active nodes exchange their indices with each other if they are adjacent, that
is, if they are connected by an edge in the distilled graph (a chain of linear nodes in
the raked graph). Each active node chooses its neighboring active node with smallest
index as its parent, but only if this index is smaller than its own index. Then $Z$ flags or
selects the edge between each parent and “child” as part of the new spanning tree.
At this point of the simulation, the active nodes do not have any internal
connections in their ports. So, the edges selected in this step are not connected to each
other, even though two selected edges have a common active node. Figure 4.7(d)
shows the first two edges (in bold) of the spanning tree. The dotted edges remain
unselected in this step. Notice that each edge consists of one outgoing bus and one
incoming bus. An active node incident with a selected edge marks itself as a root if
its index is smaller than the indices of all its neighboring active nodes connected to
it through selected edges. Each root identifies itself by comparing its index against
the indices of its neighbors.
Remark: The terms “parent” and “child” do not indicate a directed tree. They are just
a convenient way to describe the direction in which $Z$ performs grafting.
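
The parent and root selection of Step 5 can be sketched as follows. The representation (a dictionary from each active node's index to the indices of its neighbors in the distilled graph) and the name `initial_forest` are assumptions of this illustration; on the mesh, all nodes make their choice in parallel after one exchange of indices.

```python
from typing import Dict, List, Set, Tuple


def initial_forest(adj: Dict[int, List[int]]) -> Tuple[Dict[int, int], Set[int]]:
    """Return (parent, roots) for the active nodes of the distilled graph."""
    parent: Dict[int, int] = {}
    for u, nbrs in adj.items():
        if nbrs:
            best = min(nbrs)
            if best < u:              # choose a parent only if it has a smaller index
                parent[u] = best
    # Selected edges are exactly the (child, parent) pairs.
    selected = {frozenset((u, p)) for u, p in parent.items()}
    roots: Set[int] = set()
    for u, nbrs in adj.items():
        tree_nbrs = [v for v in nbrs if frozenset((u, v)) in selected]
        # A root is incident with a selected edge and smaller than all
        # neighbors reached through selected edges.
        if tree_nbrs and all(u < v for v in tree_nbrs):
            roots.add(u)
    return parent, roots
```

Because every child points to a strictly smaller index, the selected edges cannot close a cycle. On the path 3, 1, 4, 0, 2 (a made-up example), nodes 3, 4, and 2 pick parents 1, 0, and 0, the roots are 0 and 1, and the edge between 1 and 4 stays unselected.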
Step 6 - Constructing the initial pseudo-Euler tour: This step constructs a
pseudo-Euler tour of each tree of the forest obtained in Step 5.
Claim: Embedding in $Z$ the equivalent group configuration of each active node
connected to selected edges (of the distilled graph) generates a set of Euler tours.

Proof: Each selected edge consists of two independent parallel buses (one outgoing
bus and one incoming bus). Notice that the resulting graph is connected, since each
partial tree in the distilled graph is connected and the configurations of Figure 4.4(e)-(g)
always connect together all the selected edges that represent a partial tree. Notice also
that the degree of each port (number of buses connected to a port) is two; therefore,
the bus that connects the ports must be a cycle. ■

Figure 4.9: LR-Mesh simulating an R-Mesh (second part): a) Growing a spanning
tree and constructing pseudo-Euler tours; b) Pseudo-Euler tour of the spanning tree
of the distilled graph; c) Pseudo-Euler tour of the spanning tree after incorporating
data of unselected edges; d) Broadcasting final values to raked segments.
Since each partial tree of the forest has only one root, the corresponding root
group can remove one of the internal port connections to avoid forming a cycle in
the Euler tour (see Figure 4.9(a)), thus forming a “pseudo-Euler tour”. The root
node broadcasts its index to all nodes of its tree; this index is the label of the tree.
Figure 4.9(a) shows the pseudo-Euler tours generated in Step 6.
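
The claim and the root's cut can be illustrated with the standard Euler tour construction on a tree whose edges are doubled into two opposite arcs. The sketch is an assumption of this illustration rather than the group configurations of Figure 4.4: `pseudo_euler_tour` is a hypothetical name, and the internal connection at each node is modeled by linking the arc arriving from the k-th neighbor to the arc leaving toward the (k+1)-th neighbor.

```python
from typing import Dict, Hashable, List, Tuple

Arc = Tuple[Hashable, Hashable]


def pseudo_euler_tour(tree: Dict[Hashable, List[Hashable]],
                      root: Hashable) -> List[Arc]:
    """Return the arcs of the tour, cut open at the root.

    tree maps each node to an ordered list of its neighbors (undirected tree).
    """
    if not tree[root]:
        return []

    def successor(arc: Arc) -> Arc:
        u, v = arc
        nbrs = tree[v]
        k = nbrs.index(u)
        # Internal connection at v: continue toward the next neighbor, cyclically.
        return (v, nbrs[(k + 1) % len(nbrs)])

    first = (root, tree[root][0])
    tour = [first]
    arc = successor(first)
    # The connections form one cycle through all arcs; stopping before the
    # first arc repeats corresponds to the one connection the root removes.
    while arc != first:
        tour.append(arc)
        arc = successor(arc)
    return tour
```

For a tree with n nodes the tour has 2(n - 1) arcs, one per direction of each edge, and it starts and ends at the root.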
Steps 7 and 8 below form the body of the iterative process. $Z$ repeats them
$2(\log N + 1)$ times to construct a spanning tree of the distilled graph. The input to
Step 7 is always a set of trees (pseudo-Euler tours). The output of Step 8 is a smaller
set of larger trees.
Step 7 - Grafting trees: This step merges partial trees (pseudo-Euler tours) in
the forest by a graft operation. The effect of this step is to connect subtrees of the
spanning tree by using unselected edges. Step 7 proceeds as follows.

Each active node in each partial tree of the forest looks for a potential new parent
into which to graft. The selecting active node and the new parent must (1) be in
different trees, (2) be connected by an unselected edge, and (3) have different tree
labels, and the label of the potential parent must be smaller than that of the selecting
active node. These conditions ensure an acyclic structure using only connections of
the distilled graph. Selecting a parent just involves information exchange on buses of
$Z$. By Lemma 4.1, the root of each tree (pseudo-Euler tour) determines whether any
active node in its tree has identified a potential parent, and, if so, chooses one such
active node (there may be several). This node will perform the grafting operation and
the root will close its cuts. Also, by Lemma 4.1, the root of a tree determines if some
other tree has grafted into it; the root needs this information to establish whether or
not its tree is a rejected tree.

The root of the tree resulting from the grafts above is the root of a tree with
smallest index among the trees that comprise the new tree. Figure 4.9(b) shows the
resulting pseudo-Euler tour after merging the two small pseudo-Euler tours. The
algorithm incorporates one of the two unselected edges into the spanning tree.
Example: Figure 4.10 shows a graft operation between two trees (pseudo-Euler
tours). Let $s$ be a selecting active node that has determined $p$ to be its new parent.
Let $r_s$ and $r_p$ be the roots of the trees of $s$ and $p$, respectively. The idea is to cut the
pseudo-Euler tours at $s$ and $p$ and join them at the cut. Active nodes $s$ and $p$ inform
their respective parents about the graft operation, then $r_s$ closes the break, and $r_p$
broadcasts its label to all the new nodes in its tree.

Figure 4.10: Grafting operation: a) Trees of selected edges represented by their
pseudo-Euler tours; b) Grafting the left tree onto the right tree.
Step 8 - Grafting rejected trees: Consider the situation in Figure 4.11 where
several low labeled trees have edges only to one high labeled tree. If $Z$ uses only Step 7
in the iterations, then it is possible for the high labeled tree to graft sequentially onto
the low labeled trees, starting with the tree with label $N$ and continuing with the trees
$N - 1$, $N - 2$, and so on. This could result in $O(N^2)$ simulation time. We avoid this
situation by grafting trees that have not been involved in a graft operation in Step 7.
Figure 4.11: The worst case scenario for grafting trees occurs when the high labeled
tree on the right grafts onto the low labeled trees on the left following a descending
order.
Step 8 forces each such rejected tree to graft onto some neighboring tree. For
example, in Figure 4.11, if the high labeled tree grafted onto the tree with label $N$,
then Step 8 forces trees 0 to $N - 1$ (the rejected trees) to graft onto the high labeled
tree. To graft the rejected trees, $Z$ performs the same procedure as Step 7, except
without requiring the new parent to have a lower tree label.
Claim: Step 8 does not create cycles.
Proof: Since the label of a rejected tree is always smaller than any of the labels of its
neighbor trees, two rejected trees cannot be neighbors. This implies that no rejected
tree can graft onto another rejected tree, so the grafting of Step 8 does not form
cycles. ■
Since each rejected tree always has at least one neighbor tree, it is always possible
to graft the rejected tree onto some non-rejected neighbor.

At this point, the $2(\log N + 1)$ iterations of Steps 7 and 8 are completed and there
is a spanning tree for each connected component of the distilled graph.
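
The effect of iterating Steps 7 and 8 can be explored with a tree-level model that ignores the buses entirely. Everything in this sketch is an assumption made for illustration: the name `graft_rounds`, the choice of the smallest-labeled neighbor as graft target, and the convention that a merged tree takes the label of the one tree in it that did not graft.

```python
from typing import Dict, Iterable, List, Set, Tuple


def graft_rounds(labels: Iterable[int],
                 edges: Iterable[Tuple[int, int]]) -> List[int]:
    """Return the number of trees before each iteration and at the end.

    labels are the initial tree labels; edges join trees that share an
    unselected edge of the distilled graph.
    """
    edge_list = list(edges)
    label_of: Dict[int, int] = {t: t for t in labels}  # original tree -> current label
    counts = [len(set(label_of.values()))]
    while True:
        # Neighbor relation between the current trees.
        nbrs: Dict[int, Set[int]] = {}
        for a, b in edge_list:
            la, lb = label_of[a], label_of[b]
            if la != lb:
                nbrs.setdefault(la, set()).add(lb)
                nbrs.setdefault(lb, set()).add(la)
        if not nbrs:
            return counts
        # Step 7: graft onto a neighbor tree with a smaller label, if any.
        target = {t: min(ns) for t, ns in nbrs.items() if min(ns) < t}
        grafted_onto = set(target.values())
        # Step 8: rejected trees neither grafted nor were grafted onto;
        # they graft onto some neighbor regardless of its label.
        for t, ns in nbrs.items():
            if t not in target and t not in grafted_onto:
                target[t] = min(ns)
        # Follow graft pointers to the tree that did not graft.
        new_label = {}
        for t in set(label_of.values()):
            r = t
            while r in target:
                r = target[r]
            new_label[t] = r
        label_of = {orig: new_label[cur] for orig, cur in label_of.items()}
        counts.append(len(set(label_of.values())))
```

In the scenario of Figure 4.11, modeled here with low labeled trees 0 to 7 all adjacent to one tree labeled 100, a single iteration suffices: tree 100 grafts onto tree 0 in Step 7, and trees 1 to 7 are rejected and graft onto tree 100 in Step 8.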
Step 9 - Handling writes on unselected edges: This step accounts for edges
not included in the spanning tree (pseudo-Euler tour). Using Lemma 4.1, $Z$ chooses
a leader between the two active nodes at the ends of the edge. $Z$ performs a write
cycle on the bus representing the edge; each group of processors (including the active
nodes on the ends) that simulates a writer processor writes its data to the bus, then
the leader reads from the bus and stores the value. The leader will write this value
to the pseudo-Euler tour during the writing cycle of Step 10.
Step 10 - Simulation on pseudo-Euler tour: In this step, $Z$ simulates the
communication among processors of $Q$. Each group of processors in the pseudo-Euler
tour that simulates a writer processor of $Q$ now writes its data (which includes the
effects of raked linear chains in Step 3 and unselected edges in Step 9) to the
pseudo-Euler tour bus, and all the groups on the pseudo-Euler tour obtain a single value by
letting the bus resolve the concurrent writes.
Step 11 - Conveying information to raked readers: Step 10 generated the
final bus data for each bus of $Z$. This information is available only to groups of
processors included in the spanning tree. In Step 11, $Z$ conveys the final bus data
to the remaining groups of processors (raked linear chains and unselected edges). $Z$
first broadcasts the bus data to unselected edges that were processed in Step 9, then
$Z$ repeats this action with the linear chains that were raked in Step 3.
This completes the transformation of all non-linear buses of the $N \times N$ COMMON
CRCW R-Mesh, $Q$, into acyclic linear buses on the $2N \times 2N$ COMMON CRCW
LR-Mesh, $Z$. We next derive the running time of the algorithm, then extend the
simulation to (1) permit other concurrent write rules for $Q$, (2) reduce the size of $Z$
to $N \times N$, and (3) modify $Z$ to use only exclusive writes.
4.2.2 Simulation Running Time
The leader election in Step 2 runs in $O(\log N)$ time (Lemma 4.2); all the remaining
steps run in constant time. The iterative procedure involving Steps 7 and 8 reduces
the number of possible trees (pseudo-Euler tours) by at least a factor of 2 in each
iteration (explained below). This is because Step 8 guarantees that each tree is involved
in at least one grafting. Since it is possible to have $O(N^2)$ trees for the same
component after Step 6, the algorithm executes Steps 7 and 8 $O(\log N)$ times. Overall, the
algorithm runs in $O(\log N)$ time.
Assume that tree $T_i$ has label $\alpha_i$, where $0 \le i < N^2$. Each tree $T_i$ must belong
to one and only one of the following three disjoint sets of trees:
1. Set of all $T_i$ whose neighbor trees have labels smaller than $\alpha_i$.
2. Set of all $T_i$ whose neighbor trees have labels smaller and greater than $\alpha_i$.
3. Set of all $T_i$ whose neighbor trees have labels greater than $\alpha_i$.
Step 7 assures that all the trees in the first and second sets graft into some tree
with a smaller label. Partition the third set into two disjoint subsets: the first subset
includes the trees that were grafted onto by trees with larger labels (these trees
contain the root group of the new tree); the second subset includes all the rejected
trees (these are the only trees that have not been subject to a graft operation in
Step 7). Step 8 ensures that each rejected tree grafts onto some non-rejected tree.
So, Steps 7 and 8 involve every tree in a grafting operation and successfully reduce
the number of trees in the forest by at least half.
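
The halving argument translates into an iteration bound as follows. Here $T_k$ denotes the number of trees of one component after $k$ iterations of Steps 7 and 8, and $c$ is an assumed constant such that at most $cN^2$ trees exist after Step 6; both symbols are introduced only for this worked bound.

```latex
\[
T_0 \le cN^2, \qquad T_{k+1} \le \tfrac{1}{2}\,T_k \quad (\text{while } T_k \ge 2)
\;\Longrightarrow\; T_k \le \frac{cN^2}{2^k},
\]
\[
T_k < 2 \ \text{ once } \ 2^k > \tfrac{1}{2}\,cN^2, \quad\text{so}\quad
k = \left\lceil \log_2 (cN^2) \right\rceil = \left\lceil 2\log_2 N + \log_2 c \right\rceil
\ \text{iterations suffice.}
\]
```

For instance, $c = 4$ would give $\log_2(4N^2) = 2(\log_2 N + 1)$, which is consistent with the number of iterations of Steps 7 and 8 used in the simulation.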
4.2.3 Allowing Other Write Rules in Q
In the simulation described above, $Z$ and $Q$ use the COMMON rule. If $Q$ uses the
COLLISION rule, $Z$ can simulate $Q$ by modifying Steps 3, 9, and 10. First, $Z$ uses
leader election (Lemma 4.1, since the buses are acyclic) to select one writer processor.
Then, the leader broadcasts its index to all the processors in the bus. If other writers
are present on the bus, all of them broadcast a collision symbol; otherwise, the leader
broadcasts its data. $Z$ performs a similar procedure if $Q$ uses the COLLISION$^+$ rule;
the difference is that the leader broadcasts its data rather than its index. If other
writers are present on the bus with different data, they broadcast a collision symbol.
When $Q$ uses the PRIORITY rule, $Z$ resolves concurrent writes in Steps 3, 9, and
10 using priority resolution in $O(\log N)$ time. This time does not alter the overall
execution time of the algorithm. The procedure for the ARBITRARY rule (which is
less restrictive than PRIORITY) is the same.
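
The value that each write rule leaves on a bus can be summarized in a small sketch. It models only the outcome of a write cycle, not the leader election and broadcasts that $Z$ uses to obtain it. The function `resolve_writes`, the `COLLISION_SYMBOL` marker, and the convention that PRIORITY favors the lowest-indexed writer are assumptions of this illustration.

```python
from typing import Any, List, Optional, Tuple

COLLISION_SYMBOL = "<collision>"  # hypothetical marker for the collision symbol


def resolve_writes(rule: str, writes: List[Tuple[int, Any]]) -> Optional[Any]:
    """Return the bus value after concurrent writes, or None if nobody writes.

    writes is a list of (writer_index, data) pairs on one bus.
    """
    if not writes:
        return None
    values = [data for _, data in writes]
    if rule == "COMMON":
        # Concurrent writers are required to write the same value.
        if any(v != values[0] for v in values):
            raise ValueError("COMMON requires all writers to agree")
        return values[0]
    if rule == "COLLISION":
        # Any second writer produces the collision symbol.
        return values[0] if len(writes) == 1 else COLLISION_SYMBOL
    if rule == "COLLISION+":
        # Only writers with different data produce the collision symbol.
        return values[0] if all(v == values[0] for v in values) else COLLISION_SYMBOL
    if rule in ("PRIORITY", "ARBITRARY"):
        # Assumed convention: the lowest-indexed writer has priority.
        # ARBITRARY accepts any single writer, so the same choice is valid.
        return min(writes, key=lambda w: w[0])[1]
    raise ValueError("unknown write rule: " + rule)
```

With writers `[(3, "a"), (1, "a")]` the result is `"a"` under COMMON, COLLISION$^+$, PRIORITY, and ARBITRARY, and the collision symbol under COLLISION.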
4.2.4 Reducing the Size of Z
$Z$ is four times bigger than $Q$. Using the scaling simulation of Ben-Asher et al. [4],
scale the $2N \times 2N$ COMMON CRCW LR-Mesh down to an $\frac{N}{2} \times \frac{N}{2}$ COLLISION$^+$ CRCW
LR-Mesh. This reduction in size is necessary to allow the simulating LR-Mesh to get
rid of the concurrent writes. Since the scaling simulation for the LR-Mesh uses the
COLLISION$^+$ rule, we need to perform this size reduction before removing from the
LR-Mesh the capability of using concurrent writes.
Remark: Since the buses of $Z$ are acyclic (except for the non-writing cycles of Step 1)
and the scaling simulation [4] does not create any cyclic linear bus, the buses of the
$\frac{N}{2} \times \frac{N}{2}$ LR-Mesh are also acyclic (except for those in Step 1).
4.2.5 Exclusive Write for Z
We now simplify the simulating LR-Mesh, $Z$, to use only exclusive writes. Trahan
et al. [50] proved for the RMBM reconfigurable model that the CREW version can
simulate the COMMON, COLLISION, or COLLISION$^+$ CRCW version in constant time,
utilizing the ability to perform leader election. We follow a similar procedure to show
that a CREW LR-Mesh can implement bus linearization. Specifically, we prove that
a CREW LR-Mesh can simulate in constant time the COLLISION$^+$ LR-Mesh that
uses only acyclic buses. This method is similar to the one discussed in Section 4.2.3
that simulates the COLLISION$^+$ rule on a CRCW COMMON LR-Mesh. Since we use
Lemmas 4.1 and 4.2, we increase the size of the simulating machine by a factor of
four to accommodate the double bus structure.
To simulate the COLLISION$^+$ rule, $Z$ uses Lemma 4.1 (which runs with exclusive
writes) to select a leader among the writer groups. The leader broadcasts its data to all
groups in the bus. If some writer groups hold different data, then $Z$ uses Lemma 4.1
again to find a leader to broadcast a collision symbol. During the simulation, Steps 3,
9, and 10 use concurrent writes, so $Z$ can handle them with the above procedure in
constant time. All the remaining steps, except Steps 1 and 2, execute on acyclic linear
buses and do not require concurrent writes. In Step 1, $Z$ may construct cyclic buses,
but no processor writes on them. In Step 2, $Z$ handles the cycles using Lemma 4.2,
which can also be implemented using exclusive writes. The $O(\log N)$ time to execute
Step 2 does not alter the overall running time of the algorithm, since all the other
steps also run in $O(\log N)$ time.
So, using exclusive writes, an $N \times N$ LR-Mesh can simulate an $N \times N$ R-Mesh
using the COMMON, COLLISION, COLLISION$^+$, PRIORITY, or ARBITRARY rules in
$O(\log N)$ time, proving Theorem 4.3.
4.3 Scaling Simulations
In Section 1.2, we introduced the notion of a scaling simulation, which adapts an
algorithm instance designed to run on a model of arbitrary size to run on a smaller
model without significant loss of efficiency. This section applies bus linearization to
construct improved scaling simulations for the R-Mesh and the FR-Mesh. In both
cases, the simulating machine is a CREW LR-Mesh. Before doing so, we briefly
recount current results for scaling different versions of the R-Mesh. The result we
present in the following section improves over all previous R-Mesh scaling simulations.
4.3.1 R-Mesh Scaling Simulation
For ease of explanation, we describe the scaling simulation as a sequence of three
phases, but, in fact, the three phases roll into one for execution:
Phase 1. Simulation of an $N \times N$ CRCW R-Mesh on an $N \times N$ CREW LR-Mesh,
Phase 2. Simulation of an $N \times N$ CREW LR-Mesh on a $\frac{P}{2} \times \frac{P}{2}$ CRCW LR-Mesh,
and
Phase 3. Simulation of a $\frac{P}{2} \times \frac{P}{2}$ CRCW LR-Mesh on a $P \times P$ CREW LR-Mesh.
The first phase uses bus linearization to get rid of non-linear buses. The second
phase (optimally) scales down the simulating LR-Mesh. Finally, the third phase
refines this result so that the simulating LR-Mesh uses only exclusive writes.
This simulation allows the powerful and flexible algorithm design permitted by
the CRCW model with arbitrary bus structure, while using a simple bus structure
that is more feasible to implement since it requires only exclusive writes.
Phase 1
Use the bus linearization procedure of Section 4.2 to simulate each step of an $N \times N$
COMMON, COLLISION, COLLISION$^+$, ARBITRARY, or PRIORITY CRCW R-Mesh on
an $N \times N$ CREW LR-Mesh in $O(\log N)$ time.
P h a s e  2
This phase uses the scaling simulation of Ben-Asher et al. [4] to scale the N  x 
N  CREW LR-Mesh of Phase 1 down to a y  x y  C o l l is io n ^ CRCW  LR-Mesh. 
Therefore, with Phase 1 , we obtain the following.
•  For any P  < N , any step of an AT x  W COMMON, C o l l is io n , C o l l is io n ^, 
A r b i t r a r y , or P r io r it y  CRCW R-Mesh can be simulated on a y  x  f  
COLLISION^ CRCW LR-Mesh in O (^ lo g iV ) time.
Phase 3
This phase simplifies the simulating $\frac{P}{2} \times \frac{P}{2}$ LR-Mesh to use only exclusive writes.
Use the procedure presented in Section 4.2.5 to transform the COLLISION$^+$ CRCW
buses to CREW buses. This procedure requires a $P \times P$ CREW LR-Mesh and runs in
constant time. Combining this result with the one of Phase 2, we obtain the following
theorem.
Theorem 4.4 For any $P < N$, any step of an $N \times N$ COMMON, COLLISION,
COLLISION$^+$, ARBITRARY, or PRIORITY CRCW R-Mesh can be simulated on a $P \times P$
CREW LR-Mesh in $O(\frac{N^2}{P^2} \log N)$ time. ■
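
The overhead in Theorem 4.4 can be read as the product of the per-step costs of the three phases. The display below is a sketch under two assumptions: the intermediate mesh of Phases 2 and 3 is taken to have side $P/2$, and the cost of the scaling simulation in Phase 2 is taken to be proportional to the ratio of the two mesh areas, which is what an optimal scaling simulation would give.

```latex
\[
\underbrace{O(\log N)}_{\text{Phase 1}}
\cdot
\underbrace{O\!\left(\frac{N^2}{(P/2)^2}\right)}_{\text{Phase 2}}
\cdot
\underbrace{O(1)}_{\text{Phase 3}}
= O\!\left(\frac{4N^2}{P^2}\,\log N\right)
= O\!\left(\frac{N^2}{P^2}\,\log N\right).
\]
```

The factor of 4 that comes from halving the side of the intermediate mesh is absorbed by the asymptotic notation.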
Although Matias and Schuster [32] also simulated the general R-Mesh via the
LR-Mesh, their simulation is randomized and quite different from the one proposed
in this paper. Their simulation computes connected components in two stages. In the
first stage, they obtained the connected components of each sub-mesh of size $P \times P$
of the $N \times N$ R-Mesh, and used it in the second stage to compute the connected
components on a graph with nodes and edges. The connected components
algorithm in the second stage uses an LR-Mesh simulation of a randomized PRAM
algorithm.
On the other hand, the main part of our scaling simulation is the bus linearization procedure, which transforms the simulated R-Mesh bus configuration into an equivalent LR-Mesh bus configuration of nearly the same size (that later can be scaled down optimally). Our simulation is deterministic and the input is a graph with $O(N^2)$ nodes and $O(N^2)$ edges. Another important difference is that, to attain the stated overhead in the simulation of Matias and Schuster, the write rule for the simulating machine must be ARBITRARY, which is difficult to implement in a bus; when its simulation
uses the COLLISION rule, the simulation overhead is not constant (see Table 1.1). In contrast, our simulation uses only exclusive writes.
4.3.2 FR-Mesh Scaling Simulation
In Chapter 3, we developed a scaling simulation for the FR-Mesh. Though most steps of this scaling simulation run on an LR-Mesh, some parts require processors to internally connect all four of their ports (this is not permitted on an LR-Mesh) while handling a $P \times P$ sized "window" of the FR-Mesh. Given Theorem 4.3, a CREW LR-Mesh can now simulate the connection pattern of the simulating FR-Mesh window in $O(\log P)$ time. Since the above FR-Mesh simulation requires the simulating LR-Mesh to find this equivalent configuration a constant number of times per window, the simulation overhead of this new simulation is still $O(\log P)$ (see Table 1.1).
The following corollary expresses this result.
Corollary 4.5 For any $P < N$, any step of an $N \times N$ COMMON, COLLISION, COLLISION$^+$, ARBITRARY, or PRIORITY CRCW FR-Mesh can be simulated on a $P \times P$ CREW LR-Mesh in $O(\frac{N^2}{P^2} \log P)$ time.
This matches the overhead of the previous scaling simulation that used the more powerful COMMON CRCW FR-Mesh as the simulating model.
4.4 Simulation of R-Mesh by PR-Mesh
The Pipelined Reconfigurable Mesh or PR-Mesh [45] is a special type of reconfigurable mesh that uses optical buses. It can configure its port partitions to form linear buses as in an LR-Mesh, but with no cycles. In addition, each optical bus can perform multiple one-to-one communications in constant time by using pipelining. Figure 4.12 shows a
$1 \times 4$ PR-Mesh. It has two optical waveguides for addressing and one for transmitting messages. Each of these waveguides consists of two bus segments (upper and lower) connected at one of the ends, forming a directional U-shaped bus. Processors use the upper bus segment to write and the lower segment to read. There are fixed and conditional delays in the addressing waveguides to allow processors to select the destinations of their messages. When more than one processor sends a message to the same destination, the destination accepts the first message that arrives, thus resolving the conflict by a PRIORITY rule, where the processor nearest to the "U-turn" has the highest priority.
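The PRIORITY resolution described above can be sketched in a few lines. This is only an illustration of the rule, not a model of the optical hardware: it assumes processors are indexed 0 to n-1 along the bus and that the U-turn lies just past processor n-1, so a larger index means "nearer to the U-turn". The function name `deliver` and the message tuple layout are hypothetical.

```python
def deliver(messages, n):
    """Resolve concurrent sends on one U-shaped bus.

    messages: list of (sender, destination, payload) tuples.
    Assumption: the sender with the largest index is nearest the U-turn,
    so its message arrives first and is the one the destination accepts.
    Returns a dict mapping each destination to the payload it accepts.
    """
    accepted = {}  # destination -> (sender, payload) of the winning message
    for sender, dest, payload in messages:
        if not (0 <= sender < n and 0 <= dest < n):
            raise ValueError("processor index out of range")
        best = accepted.get(dest)
        if best is None or sender > best[0]:
            accepted[dest] = (sender, payload)
    return {dest: payload for dest, (_, payload) in accepted.items()}


# Processors 0 and 3 both address processor 2; processor 3 wins.
example = deliver([(0, 2, "a"), (3, 2, "b"), (1, 0, "c")], n=4)
```

The CREW LR-Mesh simulation discussed in this section never triggers this conflict, since at most one processor writes on each bus at a time.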
Figure 4.12: $1 \times 4$ PR-Mesh. (Figure labels: processor, data waveguide, address waveguides, conditional delay.)
A $P \times P$ PR-Mesh can simulate each step of a $P \times P$ CREW LR-Mesh with no cycles in constant time as follows. Scale the LR-Mesh down by a constant factor in each dimension. Then, replicate the connections of this smaller LR-Mesh on the PR-Mesh, creating a double bus structure to decide which of the two end-processors connects the upper and lower segments. Finally, use only one of these two buses to broadcast the written value. (Earlier papers [41, 45] also used broadcasting on the linear buses of the
PR-Mesh to simulate the linear buses of the LR-Mesh, though they did not explicitly address the leader election problem.)
Corollary 4.6 Any step of an $N \times N$ CREW LR-Mesh can be simulated on an $N \times N$ PR-Mesh in $O(\log N)$ time.
Remark: The running time of the above simulation is the time to cut cycles of the LR-Mesh. For an acyclic LR-Mesh, the above simulation runs in constant time.
Combining Corollary 4.6 with Theorem 4.3, Theorem 4.4, and Corollary 4.5, we obtain the following.
Corollary 4.7 Any step of an $N \times N$ COMMON, COLLISION, COLLISION$^+$, ARBITRARY, or PRIORITY CRCW R-Mesh can be simulated on an $N \times N$ PR-Mesh in $O(\log N)$ time.
Corollary 4.8 For any $P < N$, any step of an $N \times N$ COMMON, COLLISION, COLLISION$^+$, ARBITRARY, or PRIORITY CRCW R-Mesh can be simulated on a $P \times P$ PR-Mesh in $O(\frac{N^2}{P^2} \log N)$ time.
Corollary 4.9 For any $P < N$, any step of an $N \times N$ COMMON, COLLISION, COLLISION$^+$, ARBITRARY, or PRIORITY CRCW FR-Mesh can be simulated on a $P \times P$ PR-Mesh in $O(\frac{N^2}{P^2} \log P)$ time.
Since the PR-Mesh is simulating a CREW LR-Mesh, at most one processor at a time broadcasts data on each bus. For this reason, a restricted model of the PR-Mesh, with no fixed or conditional delays and with only one addressing waveguide rather than two, suffices to simulate the CREW LR-Mesh. The reason for the existence of the second waveguide in the PR-Mesh is to be able to transmit messages to selected
destinations. Our simulation does not use this feature, so a simpler model is enough. 
Consequently, we can readily extend the simulation to work on other reconfigurable 
models with optical buses, as described below.
Bourgeois and Trahan [7] proved that the classes of languages accepted in constant 
time with polynomial number of processors by the PR-Mesh, the APPBS [17], and 
the AROB [41] are the same. Since the PR-Mesh is a restricted version of the AROB, 
the bus linearization method also works for the AROB with the same overhead as for 
the PR-Mesh.
The APPBS is different from the PR-Mesh; this model uses switches to connect processors to buses, and these switches allow only four configurations. As a result, the APPBS cannot end a bus in the middle of the mesh. Bourgeois and Trahan [7] presented a simulation of a cycle-free LR-Mesh using an APPBS that runs in constant time. Bus linearization applies to the APPBS with the same overhead as for the PR-Mesh.
Thus, the results of Corollaries 4.6, 4.7, 4.8, and 4.9 also apply to the AROB and APPBS.
Chapter 5 
Simulation of DR-Mesh by 
LR-Mesh
This chapter deals with the problem of running algorithms designed for directed reconfigurable models, specifically the directed R-Mesh (DR-Mesh), on undirected models, specifically the LR-Mesh. Ben-Asher et al. [3] proved that an $N \times N$ LR-Mesh can simulate each step of an $N \times N$ directed LR-Mesh (DLR-Mesh) in constant time. The reverse simulation also runs in constant time, but the DLR-Mesh uses $2N \times 2N$ processors to simulate an $N \times N$ LR-Mesh. In this simulation, they assumed a directed model that is more restricted than the one we use in this chapter (see Section 5.1). They also proved that a $2N \times 2N$ DR-Mesh can simulate each step of an $N \times N$ R-Mesh in constant time.
Ben-Asher et al. [5] proved that the class of languages accepted by a DR-Mesh (resp., R-Mesh) in constant time with a polynomial number of processors is equivalent to the class $NL$ (resp., $SL$) of languages accepted in non-deterministic logarithmic space (resp., symmetric logarithmic space) on a Turing machine. Since it is widely conjectured that $SL \subset NL$, it is not likely that the R-Mesh can simulate a step of the DR-Mesh in constant time even with a super-polynomial increase in the number
of processors. On the other hand, simulating an R-Mesh on a DR-Mesh is straightforward, since the DR-Mesh possesses all the features of an R-Mesh.
Trahan et al. [48] proved that each step of an $N \times N$ DR-Mesh can be simulated by a polynomially larger R-Mesh in $O(\log N)$ time. This simulation is fast, but requires too many processors. To perform the simulation, they constructed a graph where the ports are nodes; then, by finding the transitive closure, they determined the destinations of messages written to ports. We present a simulation of a CRCW DR-Mesh on a CRCW LR-Mesh that follows the same approach as theirs, but runs more efficiently. Both the simulated and simulating machines use the same concurrent write rule, one of COMMON, COLLISION, or COLLISION$^+$; later we will extend this result by restricting the simulating LR-Mesh to use only exclusive writes. In our simulation, the simulating LR-Mesh, a model weaker than the R-Mesh, is a three-dimensional mesh whose first two dimensions are $O(N) \times O(N)$ (it can also be laid out in two dimensions). The simulation runs in $O(\log^2 N)$ time.
Section 5.1 describes the DR-Mesh. Section 5.2 defines the basic terminology used in the simulation. Section 5.3 gives a general description of the simulation, Sections 5.4 and 5.5 detail its two phases, and Section 5.6 proves its correctness. Finally, Section 5.7 refines the DR-Mesh simulation to run on an exclusive write simulating model and on a model with pipelined optical buses.
5.1 The DR-Mesh
The structure of the DR-Mesh differs from that of the R-Mesh in that the DR-Mesh has two oppositely directed buses to connect each pair of neighboring processors (see Figure 5.1), rather than one undirected bus as in the R-Mesh. The data on a directed bus propagates in only one direction. Each processor has four output ports (black
circles in Figure 5.1) connected to outgoing buses and four input ports (white circles in Figure 5.1) connected to incoming buses. The internal connection between ports can be any partition of the set of ports. This assumption allows 4140 different port partitions (the R-Mesh has only 15). Figure 5.1 shows a $3 \times 5$ DR-Mesh.
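The two counts quoted above are Bell numbers: the number of ways to partition a set of 8 ports (four input plus four output) and a set of 4 ports, respectively. The following short check computes them with the Bell triangle; the function name `bell` is ours and is not part of the original text.

```python
def bell(n):
    """Number of partitions of an n-element set, computed with the Bell triangle."""
    row = [1]  # row 0 of the triangle; its first entry is B(0)
    for _ in range(n):
        nxt = [row[-1]]          # each row starts with the last entry of the previous row
        for x in row:
            nxt.append(nxt[-1] + x)
        row = nxt
    return row[0]


dr_mesh_partitions = bell(8)  # 8 ports per DR-Mesh processor
r_mesh_partitions = bell(4)   # 4 ports per R-Mesh processor
```

Here `bell(8)` evaluates to 4140 and `bell(4)` to 15, matching the figures stated in the text.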
Figure 5.1: $3 \times 5$ DR-Mesh. (Legend: black circles are output ports; white circles are input ports.)
Information propagation in the DR-Mesh: Consider the port configuration in the example of Figure 5.2(a). Assume that information $a$ arrives at port $W_i$. Since ports $W_i$, $E_o$, and $S_o$ are connected together, information $a$ propagates to ports $E_o$ and $S_o$. Notice in the transitive closure matrix of Figure 5.2(b) (which for this example is the same as the adjacency matrix) that columns $E_o$ and $S_o$ have '1' entries in their intersections with row $W_i$; this means that both $E_o$ and $S_o$ are reachable from $W_i$.
Now, assume that port $S_i$ receives information $p$. This information propagates only to port $N_o$ and not to port $E_i$, even though the three of them are in the same block of the partition. Notice that the transitive closure of Figure 5.2(b) indicates that port $E_i$ is not reachable from $S_i$ (there is no path between $S_i$ and $E_i$, as shown in the transitive closure graph of Figure 5.2(c)), since there is a '0' in the corresponding entry.
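The rule at work in this example (data entering an input port of a block leaves through the output ports of the same block, and never reaches another input port) can be sketched as follows. The two blocks $\{W_i, E_o, S_o\}$ and $\{S_i, E_i, N_o\}$ come from the text; treating the remaining ports $N_i$ and $W_o$ as singleton blocks is an assumption made only to complete the example, and the function name `reachability` is hypothetical. The sketch records input-to-output entries only and omits any reflexive entries the matrix of Figure 5.2(b) may contain.

```python
PORTS = ["Ni", "No", "Ei", "Eo", "Wi", "Wo", "Si", "So"]


def reachability(partition):
    """reach[p][q] == 1 iff data entering input port p leaves via output port q."""
    reach = {p: {q: 0 for q in PORTS} for p in PORTS}
    for block in partition:
        inputs = [p for p in block if p.endswith("i")]
        outputs = [p for p in block if p.endswith("o")]
        for p in inputs:
            for q in outputs:
                reach[p][q] = 1  # directed: inputs feed outputs, never the reverse
    return reach


# Blocks from the text, plus two assumed singleton blocks for the unused ports.
partition = [{"Wi", "Eo", "So"}, {"Si", "Ei", "No"}, {"Ni"}, {"Wo"}]
reach = reachability(partition)
```

With this partition, `reach["Wi"]["Eo"]` and `reach["Wi"]["So"]` are 1, `reach["Si"]["No"]` is 1, and `reach["Si"]["Ei"]` is 0, mirroring the behavior described for information $a$ and $p$.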
Figure 5.2: Representation of connections of a DR-Mesh processor: a) Port configuration for a DR-Mesh processor; b) Transitive closure matrix of the ports, with rows and columns indexed by $N_i$, $N_o$, $E_i$, $E_o$, $W_i$, $W_o$, $S_i$, $S_o$; c) Transitive closure graph of the ports.
5.2 DR-Mesh Simulation Terminology
Let $\mathcal{Q}$ denote an $N \times N$ CRCW DR-Mesh (the simulated machine) and $\mathcal{Z}$ a three-dimensional CRCW LR-Mesh whose first two dimensions are $16N \times 16N$ (the simulating machine). Let both $\mathcal{Z}$ and $\mathcal{Q}$ have the same concurrent write rule, one of COMMON, COLLISION, or COLLISION$^+$. Let $\tau(i)$ (a simulated tile) denote a $2^i \times 2^i$ sub-DR-Mesh of the simulated machine $\mathcal{Q}$. Let $T(i)$ (a simulating tile) denote the corresponding $16(2^i) \times 16(2^i)$ sub-LR-Mesh of the simulating machine $\mathcal{Z}$ that simulates $\tau(i)$. DR-Mesh $\mathcal{Q}$ and LR-Mesh $\mathcal{Z}$ partition into $(\frac{N}{2^i})^2$ tiles of size $2^i \times 2^i$ and $16(2^i) \times 16(2^i)$, respectively, for each $0 \le i \le \log N$. Note that $\tau(i)$ and $T(i)$ are generic symbols for tiles of size $2^i \times 2^i$ and $16(2^i) \times 16(2^i)$, rather than particular tiles of that size. A tile $\tau(i)$, where $1 \le i \le \log N$, contains four sub-tiles of size $2^{i-1} \times 2^{i-1}$. Denote these sub-tiles by $\tau_1(i-1)$, $\tau_2(i-1)$, $\tau_3(i-1)$, and $\tau_4(i-1)$ (see Figure 5.3). Let an internal port of tile $\tau(i)$ be a port that communicates between two sub-tiles within tile $\tau(i)$. Let an external port of tile $\tau(i)$ be a port that communicates with a neighboring tile $\tau(i)'$. Figure 5.3 shows tile $\tau(1)$ comprising four
sub-tiles $\tau_1(0)$, $\tau_2(0)$, $\tau_3(0)$, and $\tau_4(0)$. It also shows the internal and external ports of $\tau(1)$.
Figure 5.3: Tile $\tau(1)$, its four sub-tiles $\tau_1(0), \ldots, \tau_4(0)$, and its internal and external ports.
Let $A(i)$ denote the adjacency matrix of the internal and external ports of a tile $\tau(i)$. An entry in the adjacency matrix is '1' if and only if the two corresponding ports are neighbors. Different tiles $\tau(i)$ have different matrices $A(i)$. The size of matrix $A(i)$ is $16(2^i) \times 16(2^i)$, where the number of columns and rows corresponds to the total number of internal and external ports in tile $\tau(i)$. Let $A^*(i)$ denote the transitive closure matrix of $A(i)$. Let $A^+(i)$ denote the reduced transitive closure matrix of $A^*(i)$. Obtain $A^+(i)$ by removing from $A^*(i)$ the rows and columns assigned to internal ports, so the size of matrix $A^+(i)$ is $8(2^i) \times 8(2^i)$. We use this reduction to keep the size of the transitive closure proportional to the size of the tile; otherwise, by including all the ports within the tile in the transitive closure, its size would grow with the square of the size of the tile. Each processor $p_{k,l,0}$ of a tile $T(i)$, where $0 \le k, l < 16(2^i)$ (resp., $0 \le k, l < 8(2^i)$), holds the entry of row $k$ and column $l$ of the matrix $A^*(i)$ (resp., $A^+(i)$). Let $D_{out}(i)$ denote the set of bus data leaving the external output ports of tile $\tau(i)$. The elements of $D_{out}(i)$ are the results of the
contributions of all writes that originate within tile $\tau(i)$. Let $D_{in}(i)$ denote the set of bus data entering the external input ports of tile $\tau(i)$; this data could result from writes originating anywhere in the DR-Mesh (including within the tile in question).
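The matrix sizes quoted in this section follow from simple port counting, and the reduction from the transitive closure to the reduced transitive closure is a submatrix extraction. The sketch below illustrates both, assuming each border processor contributes one input and one output port per exposed side; the helper names `num_tiles`, `external_ports`, `tracked_ports`, and `reduce_closure` are hypothetical.

```python
def num_tiles(n, i):
    """Number of 2^i x 2^i tiles in an n x n mesh (n a power of two)."""
    return (n // 2**i) ** 2


def external_ports(i):
    """External ports of a 2^i x 2^i tile: 4 sides, 2^i processors per side,
    one input and one output port each, giving 8 * 2^i."""
    return 4 * 2**i * 2


def tracked_ports(i):
    """Ports indexed by A(i) for i >= 1: the external ports of the four sub-tiles."""
    return 4 * external_ports(i - 1)


def reduce_closure(a_star, external):
    """Keep only the rows and columns of a_star that belong to external ports."""
    return [[a_star[r][c] for c in external] for r in external]


# Tiny illustration: ports 0 and 2 are external, port 1 is internal.
a_star = [[1, 1, 1],
          [0, 1, 1],
          [0, 0, 1]]
a_plus = reduce_closure(a_star, [0, 2])
```

For every level `i >= 1`, `tracked_ports(i)` equals `16 * 2**i` and `external_ports(i)` equals `8 * 2**i`, which are the side lengths of $A(i)$ and the reduced matrix given above; in the tiny illustration, `a_plus` keeps the entry recording that port 0 reaches port 2 through the removed internal port.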
5.3 DR-Mesh Simulation Description
To simulate an $N \times N$ DR-Mesh, the simulating LR-Mesh uses a three-dimensional array of processors whose first two dimensions are $16N \times 16N$. In the first two dimensions, a group of $16 \times 16$ processors of the simulating machine, $\mathcal{Z}$, is responsible for a single processor of the simulated machine, $\mathcal{Q}$. The processors in the third dimension increase the speed of the simulation in operations such as matrix multiplication and data movement. The simulation reduces to the problem of determining connectivity of the ports of the simulated machine, that is, determining all possible destinations for data written at any output port of the simulated machine. Although the idea is similar to that of Trahan et al. [48], we use the processors much more efficiently by using a divide-and-conquer strategy. The simulation consists of two phases. The first phase uses the algorithm Going_Out, which we describe next.
The objective of algorithm Going_Out is to obtain the set of data $D_{out}(\log N)$ and the reduced transitive closure $A^+(\log N)$ for the simulated DR-Mesh, $\mathcal{Q}$. The algorithm divides $\mathcal{Q}$ into four sub-meshes of the same size. Then, it solves the problem recursively to obtain $D_{out}(\log N - 1)$ and $A^+(\log N - 1)$ for each sub-mesh. Finally, the algorithm combines the matrices $A^+(\log N - 1)$ of the four sub-meshes to generate the transitive closure matrix $A^*(\log N)$. Using this matrix and the set of data $D_{out}(\log N - 1)$ of each of the four sub-meshes, the algorithm calculates $D_{out}(\log N)$. It is straightforward to obtain $A^+(\log N)$ by removing specific columns and rows from $A^*(\log N)$. The set $D_{out}(\log N)$ represents the information that leaves $\mathcal{Q}$, which is
irrelevant for our simulation; on the other hand, the sets $D_{out}(i)$, for $i < \log N$, are fundamental to finding the final bus data at each port of $\mathcal{Q}$.
The second part of the simulation (algorithm Going_In) starts by splitting an $N \times N$ tile into four $\frac{N}{2} \times \frac{N}{2}$ sub-tiles. It employs the transitive closure matrix $A^*(\log N)$ that relates the ports at the borders of the four $\frac{N}{2} \times \frac{N}{2}$ sub-tiles to combine the data arriving at the external ports of the $N \times N$ tile with data $D_{out_j}(\log N - 1)$ generated within each sub-tile. The objective of this procedure is to determine the final data that reaches each input port of the sub-tiles of size $\frac{N}{2} \times \frac{N}{2}$. The algorithm iterates this procedure and stops when it reaches tiles of size $1 \times 1$. The bus data at the input ports of $1 \times 1$ tiles is the final bus data that includes the effect of all writes in the simulated machine.
5.4 Algorithm Going_Out
Figure 5.4 shows pseudocode for the recursive algorithm Going_Out. It consists of $\log N$ recursion levels. The inputs for the $i$-th level of recursion are the reduced transitive closure matrices $A_j^+(i-1)$ and the sets of bus data $D_{out_j}(i-1)$, for each $1 \le j \le 4$, from the four component sub-tiles $\tau_j(i-1)$ of tile $\tau(i)$. One of the outputs is the set $D_{out}(i)$, which represents the bus data at the external output ports of $\tau(i)$; the other output is the reduced transitive closure matrix $A^+(i)$. LR-Mesh $\mathcal{Z}$ uses matrix $A^*(i)$ to calculate $D_{out}(i)$. Using the configuration of a tile $\tau(0)$ (a single processor), the corresponding tile $T(0)$ can construct the transitive closure $A^*(0) = A^+(0)$ of $\tau(0)$ in constant time (for example, see Figure 5.2). Initially, $D_{out}(0)$ is just the data written by ports $N_o$, $S_o$, $W_o$, $E_o$. We next discuss the subroutines within this procedure.
Procedure Going_Out ($\tau(i)$)   /* Determines $D_{out}(i)$ and $A^+(i)$ */
if $i = 0$ then
    return $D_{out}(0)$ and $A^+(0)$
else
    Divide $\tau(i)$ into four sub-tiles $\tau_1(i-1)$, $\tau_2(i-1)$, $\tau_3(i-1)$, and $\tau_4(i-1)$
    for $j \leftarrow 1$ to 4 pardo
        Going_Out ($\tau_j(i-1)$)
    $A^*(i) \leftarrow$ Find_A* ($A_1^+(i-1), \ldots, A_4^+(i-1)$)
    $D_{out}(i) \leftarrow$ Find_Dout ($A^*(i), D_{out_1}(i-1), \ldots, D_{out_4}(i-1)$)
    $A^+(i) \leftarrow$ Find_A+ ($A^*(i)$)
    return $D_{out}(i)$ and $A^+(i)$
end
Figure 5.4: Pseudocode for algorithm Going_Out.
5.4.1 Procedure Find_A*
This procedure generates $A^*(i)$, the transitive closure of the internal and external ports of tile $\tau(i)$, given the matrices $A_j^+(i-1)$, for each $1 \le j \le 4$. Procedure Find_A* consists of three stages.
Stage 1 - Moving matrices $A_2^+(i-1)$ and $A_4^+(i-1)$: Move the matrices $A_2^+(i-1)$ and $A_4^+(i-1)$ from tiles $T_2(i-1)$ and $T_4(i-1)$, respectively, to the main diagonal of $T(i)$ (see Figure 5.5). These matrices are part of the adjacency matrix $A(i)$, which is completed in Stage 2. The movement of matrices in Stage 1 assigns the task of representing a port (internal or external) of $\tau(i)$ to each row and column of processors of $T(i)$.
Using the processors of the third dimension, $\mathcal{Z}$ performs Stage 1 in a number of steps that is constant if $1 \le i \le \log N - \log\log N$.
Figure 5.5: Moving matrices $A_2^+(i-1)$ and $A_4^+(i-1)$: a) Initial location of these matrices in $T(i)$; b) Final location of these matrices. (Figure labels: the transitive closure of the external ports of each $\tau_j(i-1)$ occupies $8(2^{i-1}) \times 8(2^{i-1})$ processors; each sub-tile has $16(2^{i-1}) \times 16(2^{i-1})$ processors; $T(i)$ has $16(2^i) \times 16(2^i)$ processors.)
Stage 2 - Constructing matrix $A(i)$: Each processor of $T(i)$ that does not have an entry of $A(i)$ generates an entry of the adjacency matrix $A(i)$ as explained below. The entry is '1' if and only if the processor is located at any of the following positions (see Figure 5.3).
1. The intersection of any row representing a north output port in tile $\tau_4(i-1)$ (resp., $\tau_3(i-1)$) and any column representing a south input port of tile $\tau_1(i-1)$ (resp., $\tau_2(i-1)$).
2. The intersection of any row representing a south output port in tile $\tau_1(i-1)$ (resp., $\tau_2(i-1)$) and any column representing a north input port of tile $\tau_4(i-1)$ (resp., $\tau_3(i-1)$).
3. The intersection of any row representing an east output port in tile $\tau_1(i-1)$ (resp., $\tau_4(i-1)$) and any column representing a west input port of tile $\tau_2(i-1)$ (resp., $\tau_3(i-1)$).
4. The intersection of any row representing a west output port in tile $\tau_2(i-1)$ (resp., $\tau_3(i-1)$) and any column representing an east input port of tile $\tau_1(i-1)$ (resp., $\tau_4(i-1)$).
These conditions are straightforward to check, so $\mathcal{Z}$ executes Stage 2 in constant time.
Stage 3 - Constructing matrix $A^*(i)$: Compute the transitive closure matrix $A^*(i) = (I + A(i))^{16(2^i)}$, where $I$ is the $16(2^i) \times 16(2^i)$ identity matrix [19]. Tile $T(i)$ (using the processors of the third dimension) applies the method of repeated squaring to calculate $A^*(i)$, where $1 \le i \le \log N$. This method uses $i + 4$ iterations, so tile $T(i)$ performs $i + 4$ Boolean matrix multiplications. Tile $T(i)$ computes a Boolean matrix multiplication in a number of steps that is constant if $i \le \log N - \log\log N$.
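The repeated squaring of Stage 3 can be illustrated with a sequential sketch. This is not the LR-Mesh implementation, which performs each Boolean matrix multiplication in parallel using the third dimension; it only shows why a matrix with $16(2^i)$ rows needs $i + 4$ squarings. The function names are hypothetical.

```python
def bool_matmul(a, b):
    """Boolean product of two square 0/1 matrices."""
    n = len(a)
    return [[int(any(a[r][k] and b[k][c] for k in range(n)))
             for c in range(n)] for r in range(n)]


def transitive_closure(adj):
    """Return ((I + A)^(2^s), s) where s is the number of squarings performed.

    Squaring continues until the exponent reaches the matrix size n, which
    is enough to cover every simple path (length at most n - 1).
    """
    n = len(adj)
    m = [[1 if r == c else adj[r][c] for c in range(n)] for r in range(n)]  # I + A
    power, squarings = 1, 0
    while power < n:
        m = bool_matmul(m, m)
        power *= 2
        squarings += 1
    return m, squarings


# A directed path 0 -> 1 -> 2 -> 3 on four nodes.
path = [[0, 1, 0, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 1],
        [0, 0, 0, 0]]
closure, steps = transitive_closure(path)
```

For a matrix with $n = 16(2^i)$ rows the loop runs $\log_2 n = i + 4$ times; for instance, a 32-row matrix ($i = 1$) needs 5 squarings.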
5.4.2 Procedure Find_Dout
Procedure Find_Dout obtains the bus data, $D_{out}(i)$, generated within the tile $\tau(i)$, and determines how this data propagates to the external output ports of $\tau(i)$. Figure 5.6 shows how Algorithm Going_Out propagates bus data $D_{out}(i)$ to external ports for three different levels of recursion. The inputs for this procedure are the transitive closure matrix, $A^*(i)$, and the bus data, $D_{out_j}(i-1)$, from each sub-tile $\tau_j(i-1)$, for each $1 \le j \le 4$.
Figure 5.6: Algorithm Going_Out propagates bus data $D_{out}(i)$ (shown by arrows) to external ports; Figures (a) to (c) show this propagation for different levels of recursion.
Procedure Find_Dout consists of the following steps.
1. Each processor in $T(i)$ configures its ports as crossover.
2. Each processor in the leftmost column of $T(i)$ whose row represents an output port (from some sub-tile $\tau_j(i-1)$) writes the data corresponding to that port from $D_{out_j}(i-1)$ on its horizontal bus.
3. Each processor in $T(i)$ that holds a '1' entry in the matrix $A^*(i)$ reads from its horizontal bus and writes that value to its vertical bus (forming "paths" between ports represented by horizontal and vertical buses). Concurrent writes may occur in this step. $\mathcal{Z}$ handles concurrent writes by using the same write rule (COMMON, COLLISION, or COLLISION$^+$) as $\mathcal{Q}$.
4. Each processor in the first row of $T(i)$ whose column represents an external output port reads its vertical bus and stores the value. These values represent the set $D_{out}(i)$.
$\mathcal{Z}$ executes Procedure Find_Dout in constant time.
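The four steps above amount to a row-to-column transfer gated by the entries of the transitive closure matrix. The following sequential sketch mimics that transfer. It rests on assumptions not spelled out in this section: the three write rules are modeled as they are commonly defined (COMMON requires agreeing writers, COLLISION replaces any concurrent write with a collision symbol, COLLISION$^+$ does so only when the writers disagree), a row with no datum is treated as silent, and the symbol `"#"` and all function names are hypothetical.

```python
COLLISION = "#"  # hypothetical symbol standing for a detected collision


def resolve(values, rule):
    """Combine the values written concurrently on one vertical bus."""
    if not values:
        return None
    if len(values) == 1:
        return values[0]
    agree = all(v == values[0] for v in values)
    if rule == "COMMON":
        if not agree:
            raise ValueError("COMMON requires identical concurrent writes")
        return values[0]
    if rule == "COLLISION":
        return COLLISION
    if rule == "COLLISION+":
        return values[0] if agree else COLLISION
    raise ValueError("unknown write rule: " + rule)


def find_dout(a_star, row_data, external_out_cols, rule):
    """row_data: {row: datum} placed on horizontal buses (step 2).
    Returns {column: datum} read from vertical buses (step 4)."""
    result = {}
    for c in external_out_cols:
        # Step 3: every processor with a 1 in column c forwards its row datum.
        written = [row_data[r] for r in range(len(a_star))
                   if a_star[r][c] == 1 and r in row_data]
        result[c] = resolve(written, rule)
    return result


a_star = [[1, 0, 1],
          [0, 1, 1],
          [0, 0, 1]]
out_collision = find_dout(a_star, {0: "x", 1: "y"}, [2], "COLLISION")
out_plus = find_dout(a_star, {0: "x", 1: "x"}, [2], "COLLISION+")
```

In this example ports 0 and 1 both reach port 2, so the vertical bus of column 2 sees two writers: under COLLISION the stored value is the collision symbol, while under COLLISION$^+$ with agreeing writers it is the common value.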
5.4.3 Procedure Find_A+
This procedure generates the matrix $A^+(i)$. Obtain this matrix from the transitive closure matrix $A^*(i)$ by removing entries in processors of columns and rows that represent internal ports of $\tau(i)$ and compacting the remaining columns and rows
to the first $8(2^i)$ column and row processors of $T(i)$. This is a routing problem similar to Stage 1 of procedure Find_A*.
Using the processors of the third dimension, $\mathcal{Z}$ performs procedure Find_A+ in a number of steps that is constant if $1 \le i \le \log N - \log\log N$.
Execution time. The execution time of algorithm Going_Out is due mainly to procedure Find_A*, which consists of three stages. For $1 \le i \le \log N - \log\log N$, $\mathcal{Z}$ performs Stages 1 and 2 in constant time and Stage 3 in $O(i)$ time. Thus, $\mathcal{Z}$ completes recursion levels 1 to $\log N - \log\log N$ in
$$\sum_{i=1}^{\log N - \log\log N} O(i) = O(\log^2 N) \text{ time.}$$
For $i > \log N - \log\log N$, $\mathcal{Z}$ still performs Stage 2 in constant time, but Stages 1 and 3 no longer meet the bounds above. Accounting for recursion levels $\log N - \log\log N$ to $\log N$ as well, $\mathcal{Z}$ executes algorithm Going_Out in $O(\log^2 N)$ time overall.
5.5 Algorithm Going_In
Algorithm Going_Out comprises Phase 1 of the simulation of a DR-Mesh by an LR-Mesh. Phase 1 collects information from individual processors in tiles $\tau(i)$ and propagates it outward to the whole DR-Mesh at the level of tile borders. Phase 2 distributes the information generated by Phase 1 down to individual processors. Phase 2 consists of Algorithm Going_In, which we discuss next.
Algorithm Going_In proceeds in a reverse fashion with respect to Going_Out. The inputs to the $i$-th level of recursion of Going_In are the following:
1) Tile $\tau(i)$,
2) Set of bus data $D_{in}(i)$,
3) Set of bus data $D_{out_j}(i-1)$, for each $1 \le j \le 4$, and
4) Transitive closure matrix $A^*(i)$.
Remark: Since algorithm Going_Out computed $D_{out_j}(i-1)$ and $A^*(i)$ for every possible $i$, each tile $\tau(i)$ holds these values.
Procedure Going_In ($\tau(i)$, $D_{in}(i)$)   /* Determines $D_{in_j}(i-1)$ for each sub-tile $\tau_j(i-1)$ */
if $i = 0$ then
    return
else
    $D_{in_1}(i-1), \ldots, D_{in_4}(i-1) \leftarrow$ Find_Din ($D_{in}(i), A^*(i), D_{out_1}(i-1), \ldots, D_{out_4}(i-1)$)
    for $j \leftarrow 1$ to 4 pardo
        Going_In ($\tau_j(i-1)$, $D_{in_j}(i-1)$)
end
Figure 5.7: Pseudocode for algorithm Going_In.
The outputs for each level of recursion are the sets of data $D_{in_j}(i-1)$, for each $1 \le j \le 4$. The final objective of algorithm Going_In is to determine the set of bus data $D_{in}(0)$ for each tile $\tau(0)$.
A fundamental part of algorithm Going_In is the procedure Find_Din, which we describe next. Initially, $i = \log N$ and each element of the set $D_{in}(\log N)$ is null, because tile $\tau(\log N)$ does not have any neighbors and no bus data enters the tile
through its external ports. The inputs $A^*(i)$ and $D_{out_j}(i-1)$, for $1 \le j \le 4$, were calculated by algorithm Going_Out, so each tile possesses these inputs. Procedure Find_Din consists of the following steps.
1. Each processor of $T(i)$ configures its ports as crossover.
2. Each processor in the leftmost column of $T(i)$ whose row represents an external input port or an internal output port of $\tau(i)$ writes on its horizontal bus the bus datum of that port. ($D_{in}(i)$ contains the bus data for external input ports of $\tau(i)$, and $D_{out_j}(i-1)$ for the internal output ports of $\tau(i)$.)
3. Each processor in $T(i)$ holding a '1' entry in the matrix $A^*(i)$ reads from its horizontal bus and writes that value to its vertical bus. Concurrent writes may occur in this step and $\mathcal{Z}$ handles them by the same write rule (COMMON, COLLISION, or COLLISION$^+$) as $\mathcal{Q}$.
4. Each processor of $T(i)$ whose column represents an external or internal input port reads its vertical bus and stores the value. (These values comprise the set of bus data $D_{in_j}(i-1)$.)
Figure 5.8(a) shows how $D_{in}(i)$ (represented by black arrows) enters tile $\tau(i)$. Figure 5.8(b) shows how Procedure Find_Din combines $D_{in}(i)$ (black arrows) with $D_{out}(i-1)$ (represented by white arrows) to find $D_{in}(i-1)$. Finally, Figure 5.8(c) shows $D_{in}(i-1)$ for each of the four sub-tiles.
$\mathcal{Z}$ executes procedure Find_Din in constant time and algorithm Going_In in $O(\log N)$ time.
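The key difference from Find_Dout is that Find_Din places two kinds of data on the horizontal buses: data entering from outside the tile and data produced inside the sub-tiles. A minimal sequential sketch of this combination follows. It fixes the COLLISION$^+$ rule purely for illustration, modeled under the common assumption that agreeing writers succeed and disagreeing writers produce a collision symbol; the symbol `"#"`, the port numbering, and the function names are hypothetical.

```python
COLLISION = "#"  # hypothetical symbol standing for a detected collision


def collision_plus(values):
    """Assumed COLLISION+ behavior: agreeing writers succeed, others collide."""
    if not values:
        return None
    first = values[0]
    return first if all(v == first for v in values) else COLLISION


def find_din(a_star, din, dout_sub, input_cols):
    """din: {row: datum} for external input ports of the tile.
    dout_sub: {row: datum} for internal output ports (from the sub-tiles).
    Returns {column: datum} for the input ports listed in input_cols."""
    # Step 2: both kinds of source rows drive their horizontal buses.
    on_row = {}
    on_row.update(dout_sub)
    on_row.update(din)
    result = {}
    for c in input_cols:
        # Step 3: processors holding a 1 in column c forward their row datum.
        written = [on_row[r] for r in range(len(a_star))
                   if a_star[r][c] == 1 and r in on_row]
        # Step 4: the reader of column c stores the resolved value.
        result[c] = collision_plus(written)
    return result


# Port 0: external input, port 1: internal output, port 2: internal input.
a_star = [[1, 0, 1],
          [0, 1, 1],
          [0, 0, 1]]
same = find_din(a_star, {0: "x"}, {1: "x"}, [0, 2])
clash = find_din(a_star, {0: "x"}, {1: "y"}, [0, 2])
```

In the example, port 2 is reachable both from the external input port 0 and from the internal output port 1, so its final datum reflects writes from inside and outside the tile, which is the property that Lemma 5.2 establishes for the full algorithm.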
Figure 5.8: Procedure Find_Din: a) Data $D_{in}(i)$ (black arrows) entering tile $\tau(i)$; b) Combining data $D_{in}(i)$ and $D_{out}(i-1)$ (white arrows); c) Resulting data $D_{in}(i-1)$ entering tiles $\tau(i-1)$.
5.6 Algorithm Correctness
This section proves the correctness of the DR-Mesh simulation algorithm. Specifically, we will prove the correctness of algorithms Going_Out and Going_In.
Lemma 5.1 During Algorithm Going_Out, the set of bus data $D_{out}(i)$, for any $0 \le i \le \log N$, represents the contribution of all writes generated within tile $\tau(i)$ that propagate to neighboring tiles.
Proof: We use induction on the level of recursion $i$. Our induction hypothesis is that for any $0 \le i \le \log N$, any value written by an output port $X$ inside (not necessarily on the border of) tile $\tau(i)$ appears in $D_{out}(i)$ for external output port $Z$ of tile $\tau(i)$, if $X$ has a path to $Z$ within tile $\tau(i)$.
The basis of the induction is when $i = 0$; at this point, tile $\tau(0)$ is a single processor and the set of bus data, $D_{out}(0)$, that leaves the tile comprises just the data written by the four output ports of the processor.
Assume that the induction hypothesis holds for all values of $i$, where $0 \le i < k$. We will show that the induction hypothesis also holds for $i = k$. By definition, all border ports have their data included in $D_{out}(k)$, so consider non-border port $X$. If $X$ has no path to a border port, then its data is irrelevant. Suppose that there is a path from port $X$ to border port $Z$ within tile $\tau(k)$.
Figure 5.9: Algorithm Going_Out ensures that any write produced by an arbitrary port $X$ inside tile $\tau_j(k-1)$ reaches the external output port $Z$ of tile $\tau(k)$ if there is a path between $X$ and $Z$: a) The path from $X$ to $Z$ crosses two or more sub-tiles; b) The path from $X$ to $Z$ never leaves the sub-tile.
Case 1: $X$ is inside sub-tile $\tau_r(k-1)$, $Z$ is in sub-tile $\tau_s(k-1)$ ($r$ may be equal to $s$), and the path from $X$ to $Z$ crosses two or more sub-tiles (see Figure 5.9(a)). Since there is a path between $X$ and $Z$ within tile $\tau(k)$, there must be a sub-path contained in $\tau_r(k-1)$ between $X$ and some output port $Y$ of sub-tile $\tau_r(k-1)$. By the induction hypothesis, a value written by $X$ appears in $Y$ and also in $D_{out_r}(k-1)$. Since there is a path from $Y$ to $Z$, the transitive closure matrix, $A^*(k)$, has a '1' entry in the intersection of row $Y$ and column $Z$. Procedure Find_Dout, using the information provided by $A^*(k)$, propagates data (from the set $D_{out_j}(k-1)$) from horizontal buses that represent output ports of sub-tiles $\tau_j(k-1)$ to vertical buses
that represent external output ports of tile r(fc). Top row processors read the vertical 
buses and store the bus d a ta  tha t represents the set D out(k). Any write th a t appears 
at port Y  also appears a t port Z  and, hence, in the set Dout(k).
Case 2: X, Z, and the path from X to Z are in the same sub-tile (Figure 5.9(b)).
By the induction hypothesis, a value written by X appears in Dout_j(k - 1) at port Z
for some j, where 1 ≤ j ≤ 4. The transitive closure matrix, A*(k), has a '1' entry in
the intersection of row Z and column Z. Using the same reasoning as in Case 1,
procedure Find-Dout propagates the data that leaves Z from the set Dout_j(k - 1) to
the set Dout(k). ■
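The propagation step used in both cases can be sketched sequentially in Python. This is only a minimal illustration under stated assumptions: ports are numbered 0..m-1, `adj` is a hypothetical 0/1 matrix of direct port-to-port connections inside tile τ(k), at most one value reaches any external port (so no write rule is applied), and the closure is computed by repeated Boolean squaring on a sequential machine rather than on LR-Mesh buses. The function names are ours, not the dissertation's.

```python
def bool_mat_mult(a, b):
    """Boolean product of two square 0/1 matrices."""
    n = len(a)
    return [[int(any(a[i][t] and b[t][j] for t in range(n))) for j in range(n)]
            for i in range(n)]


def transitive_closure(adj):
    """Reflexive-transitive closure A* by repeated Boolean squaring."""
    n = len(adj)
    # Start from (I + A): paths of length at most 1.
    closure = [[int(adj[i][j] or i == j) for j in range(n)] for i in range(n)]
    steps = 0
    while (1 << steps) < n:  # after s squarings, paths of length <= 2**s are covered
        closure = bool_mat_mult(closure, closure)
        steps += 1
    return closure


def find_dout(closure, sub_dout, external_ports):
    """Propagate sub-tile output data to the external output ports.

    sub_dout maps a sub-tile output port Y to its value in Dout_j(k-1);
    the result maps each reachable external port Z to the value arriving there.
    """
    dout = {}
    for z in external_ports:
        for y, value in sub_dout.items():
            if closure[y][z]:  # '1' in row Y, column Z of A*(k)
                dout[z] = value
    return dout
```

For instance, with connections 0 → 1 → 2 and an isolated port 3, a value held at port 0 reaches external port 2 (Case 1), and a value already held at port 2 stays at port 2 through the reflexive entry of the closure (Case 2).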
Lemma 5.2 The set of bus data, Din(i), generated by Algorithm Going-In, for any
1 ≤ i ≤ log N, represents the final bus data that arrive at the external input ports of
tile τ(i).
Proof: We use induction on the level of recursion i. Our induction hypothesis is
that for any 1 ≤ i ≤ log N, the values in Din(i) that appear at the external input
ports of tile τ(i) contain the effects of all writes inside and outside the tile.
The basis of the induction is when i = log N; for this case, the bus data at each
external input port of tile τ(log N) is null because tile τ(log N) has no neighboring
tiles.
Assume that the induction hypothesis holds for all values of i, where k ≤ i ≤
log N. We will show that the induction hypothesis also holds for i = k - 1. Tile τ(k)
comprises four sub-tiles τ_j(k - 1), for each 1 ≤ j ≤ 4. The bus data Din_s(k - 1)
at the external input ports of sub-tile τ_s(k - 1) must have originated in one of the
following cases.
Case 1: The bus datum comes from some output port of a neighboring tile τ(k)′
and arrives at an external input port, A, of tile τ(k) (see Figure 5.10(a)). Port A
is also an external input port of tile τ_r(k - 1). Assume that there exists a path in
τ(k) from port A to some input port C in sub-tile τ_s(k - 1) (see Figure 5.10(a)). By
the induction hypothesis, the bus datum at port A appears in Din(k). Since there
is a path between port A and port C, there is a '1' entry in the intersection of
the horizontal bus that represents port A and the vertical bus that represents port
C in the transitive closure matrix A*(k). Procedure Find-Din of algorithm Going-In
propagates the bus datum at port A (provided by Din(k)) through the horizontal
bus that represents port A, then through the vertical bus that represents port C, and
stores the bus datum in Din_s(k - 1).
Figure 5.10: Algorithm Going-In calculates the final bus data at input ports of each
tile τ_j(k - 1): a) A bus datum generated outside tile τ(k) arrives at a border port;
b) Bus data generated inside tile τ(k).
Case 2: The bus datum comes from some output port inside some sub-tile τ_r(k - 1),
where s ≠ r. An output port X inside (not necessarily on the border of) sub-tile
τ_r(k - 1) writes a bus datum to its bus. Assume that there is a path in τ(k) from
port X to some input port Z in sub-tile τ_s(k - 1) (see Figure 5.10(b)). By Lemma 5.1,
this bus datum propagates to some output port Y at the border of sub-tile τ_r(k - 1)
and shows up in Dout_r(k - 1). Since there is a path between Y and Z, using the
same argument as in Case 1, procedure Find-Din ensures that the bus datum written
by X shows up in Din_s(k - 1).
Case 3: The bus datum comes from some output port inside sub-tile τ_s(k - 1). An
output port U inside (not necessarily on the border of) sub-tile τ_s(k - 1) writes a bus
datum to its bus. Assume that there is a path that starts at port U, leaves sub-tile
τ_s(k - 1) at port V, and returns to sub-tile τ_s(k - 1) at port W (see Figure 5.10(b)). By
Lemma 5.1, the bus datum written by U propagates to output port V at the border
of sub-tile τ_s(k - 1) and shows up in Dout_s(k - 1). Since there is a path between V
and W, using the same argument as in Case 1, procedure Find-Din ensures that the
bus datum written by U shows up in Din_s(k - 1).
So, algorithm Going-In correctly calculates the final bus data at the input ports of
each tile τ(0). ■
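The three cases of the proof share one mechanism: a datum available either in Din(k) or in some Dout_j(k - 1) is forwarded to every sub-tile input port that the transitive closure marks as reachable. The following Python sketch illustrates that mechanism only. It assumes a precomputed 0/1 closure matrix indexed by port number and returns the set of values arriving at each port, leaving the choice of write rule open; all names and the data layout are hypothetical.

```python
def find_din(closure, din_parent, sub_dout, sub_input_ports):
    """Collect the values that reach each input port of each sub-tile.

    closure         -- 0/1 matrix; closure[p][c] == 1 if a path leads from p to c
    din_parent      -- {external input port: value}, playing the role of Din(k)
    sub_dout        -- {sub-tile output port: value}, the union of the Dout_j(k-1)
    sub_input_ports -- {sub-tile label s: list of its input ports}
    """
    sources = {**din_parent, **sub_dout}  # Case 1, plus Cases 2 and 3
    din = {}
    for s, ports in sub_input_ports.items():
        din[s] = {}
        for c in ports:
            arriving = {v for p, v in sources.items() if closure[p][c]}
            if arriving:
                din[s][c] = arriving  # a write rule would reduce this set to one value
    return din
```

With ports 0 (external input A), 1 (a sub-tile output port), and 2 and 3 (input ports of sub-tile s), and paths 0 → 2 and 1 → 3, the sketch delivers the datum of port 0 to port 2 and the datum of port 1 to port 3.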
Theorem 5.3 Any step of an N × N COMMON, COLLISION, or COLLISION+ CRCW
DR-Mesh can be simulated on an O(N × N × […]) COMMON, COLLISION, or
COLLISION+ CRCW LR-Mesh in O(log^[…] N) time. ■
5.7 Simulation Improvements
This section presents some improvements to the DR-Mesh simulation.
Two-dimensional LR-Mesh. Vaidyanathan and Trahan [52] designed a procedure
to transform an A × B × C three-dimensional LR-Mesh into a 6RC7 × (7A + AB) two-
dimensional LR-Mesh. We use this procedure to transform the simulating LR-Mesh.
Corollary 5.4 Any step of an N × N COMMON, COLLISION, or COLLISION+ CRCW
DR-Mesh can be simulated on an […] × […] COMMON, COLLISION, or COLLISION+
CRCW LR-Mesh in O(log^[…] N) time. ■
Exclusive Writes. The simulated DR-Mesh and the simulating LR-Mesh in the
present simulation assume the COMMON, COLLISION, or COLLISION+ rules. In
Chapter 4, we proved that a CREW LR-Mesh that uses only acyclic buses can simulate
a COMMON, COLLISION, or COLLISION+ LR-Mesh in constant time (Theorem 4.1).
In the present simulation, the simulating LR-Mesh uses only acyclic buses, so we can
replace it by one that uses only exclusive writes without altering the execution time
of the simulation.
Corollary 5.5 Any step of an N × N COMMON, COLLISION, or COLLISION+ CRCW
DR-Mesh can be simulated on an […] × […] CREW LR-Mesh in
O(log^[…] N) time. ■
DR-Mesh on PR-Mesh. In Chapter 4, we proved that an O(N × N) PR-Mesh
can simulate in constant time an N × N CREW LR-Mesh that uses only acyclic
buses (Corollary 4.6). In the simulation presented in this chapter, we use a three-
dimensional LR-Mesh to speed up the execution of certain operations, like data
movement and Boolean matrix multiplication. A PR-Mesh performs these types of
operations more efficiently because of its ability to transmit multiple messages on a
single bus by pipelining. Combining Theorem 5.3 and Corollary 5.5 with PR-Mesh data
movement abilities, we obtain the following.
Corollary 5.6 Any step of an N × N COMMON, COLLISION, or COLLISION+ CRCW
DR-Mesh can be simulated on an O(N × […]) PR-Mesh in O(log^[…] N) time. ■
Notice how the PR-Mesh uses processors in two dimensions to perform
the simulation. The LR-Mesh uses the same number of processors, but in three
dimensions.
Chapter 6
Summary and Future Work
The general aim of this research is to provide a better understanding of scaling
simulations of reconfigurable models, specifically of the R-Mesh and some of its
variants. Many of the techniques in this dissertation can be used for other reconfigurable
models as well.
In Chapter 3, we have identified a new R-Mesh restriction, called the FR-Mesh.
For this model, we have designed a strong scaling simulation with overhead log P
for a P × P simulating machine. By integrating the strong scaling simulation of the
FR-Mesh and the optimal scaling simulation of the LR-Mesh, we have identified a
new class of algorithms (called separable algorithms) with strong scaling simulations
that accommodate solutions to a wide range of problems.
We have also studied the effect of different concurrent write rules for the simulated
and simulating models, such as COMMON, COLLISION, COLLISION+, ARBITRARY,
and PRIORITY, and have established that the simulation overhead for the FR-Mesh
is due only to leader election.
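As a reminder of how these concurrent write rules differ, the following Python sketch resolves the writes made to a single bus in one step. It follows the definitions commonly used in the R-Mesh literature and is only an illustration: the collision marker `"#"`, the function name, and the convention that the lowest processor index has the highest priority are assumptions of this sketch.

```python
COLLISION_SYMBOL = "#"  # hypothetical marker for a detected collision


def resolve_write(rule, writes):
    """Return the value seen on a bus given (processor_index, value) writes."""
    if not writes:
        return None  # nobody wrote: the bus carries no value
    values = {value for _, value in writes}
    if rule == "COMMON":
        # All concurrent writers must write the same value.
        if len(values) != 1:
            raise ValueError("COMMON requires all writers to agree")
        return next(iter(values))
    if rule == "COLLISION":
        # Any concurrent write, even of equal values, yields the collision symbol.
        return writes[0][1] if len(writes) == 1 else COLLISION_SYMBOL
    if rule == "COLLISION+":
        # Equal values get through; differing values yield the collision symbol.
        return next(iter(values)) if len(values) == 1 else COLLISION_SYMBOL
    if rule == "ARBITRARY":
        # Any one writer may succeed; this sketch picks the first one listed.
        return writes[0][1]
    if rule == "PRIORITY":
        # Assumed convention: the lowest-indexed writer wins.
        return min(writes, key=lambda w: w[0])[1]
    raise ValueError(f"unknown rule: {rule}")
```

For example, two processors writing the same value 1 produce 1 under COMMON and COLLISION+, but the collision symbol under COLLISION.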
The FR-Mesh scaling simulation also leads to an improved (weak) scaling simulation
for the R-Mesh. Its simulation overhead is O(log P log […]) (the simulation overhead
of the previous fastest scaling simulation for the R-Mesh was O(log N log […])).
In this scaling simulation, a part of the simulation overhead is due to leader election,
as in the FR-Mesh simulation. Thus, any improvement in techniques to perform
leader election will immediately translate to a further reduction of the overheads for
scaling simulations of the FR-Mesh, the R-Mesh, and separable R-Mesh algorithms.
In Chapter 4, we have presented bus linearization, a procedure that transforms
non-linear buses into acyclic linear buses. Using bus linearization, we simulate a
CRCW R-Mesh (for various write rules) on a CREW LR-Mesh. This procedure gives
an algorithm designer the liberty of using buses of arbitrary shape, while automatically
translating the algorithm to run on a simpler platform.
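One way to picture how a bus of arbitrary shape can be replaced by an acyclic linear one is an Euler tour around a spanning tree of the bus. The Python sketch below shows only that general idea and is not claimed to be the construction of Chapter 4. It assumes the bus is given as a connected undirected graph in a hypothetical adjacency-list form, and it returns a linear sequence of ports in which consecutive entries are directly connected.

```python
def linearize_bus(adjacency, root):
    """Return an Euler-tour ordering of a spanning tree of a connected bus graph."""
    visited = {root}
    tour = [root]

    def visit(node):
        for neighbor in adjacency[node]:
            if neighbor not in visited:  # tree edge: go down, then come back up
                visited.add(neighbor)
                tour.append(neighbor)
                visit(neighbor)
                tour.append(node)

    visit(root)
    return tour
```

For a bus with ports 0 to 3 in which port 0 connects to ports 1, 2, and 3, and ports 1 and 2 are also connected (a cycle), the tour from port 0 is 0, 1, 2, 1, 0, 3, 0. Every port appears, the cycle edge between 2 and 0 is not used, and the sequence has 2n - 1 entries for n ports.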
We have presented two important applications for bus linearization. The first is
a further improvement in the simulation overhead for the R-Mesh. This overhead of
O(log N) is even smaller than the one presented in Chapter 3. Moreover, the simulating
model in this scaling simulation is an LR-Mesh, a model weaker than the R-Mesh.
Furthermore, the LR-Mesh uses only exclusive writes, while in all previous simulations
the simulating machine always used concurrent writes. The second application
of bus linearization transforms R-Mesh algorithms to run on reconfigurable models
with pipelined optical buses such as the PR-Mesh, APPBS, and AROB.
In Chapter 5, we have presented a simulation of a CRCW DR-Mesh on a CREW
LR-Mesh. This O(log^[…] N)-time simulation, which uses O(N × N × […]) processors
in three dimensions (or […] × […] processors in two dimensions), is a big improvement
over the previous best simulation that uses O(N^[…]) processors to run in O(log N)
time. Our simulation also runs in O(log^[…] N) time on models with pipelined optical
buses using only O([…] × […]) processors in two dimensions.
Future Work: Open problems include investigating whether or not the R-Mesh
and the FR-Mesh have optimal scaling simulations. So far, all approaches have resulted
in weak scaling simulations for the R-Mesh and in a strong scaling simulation for the
FR-Mesh.
Another open problem is to improve the time complexity for leader election on a
COMMON, COLLISION, or COLLISION+ CRCW R-Mesh or on a CREW R-Mesh. One
possible way to accomplish this goal is by using randomization.
Another research direction is to improve the overhead of specific algorithms, for
example, by designing R-Mesh algorithms that use a limited number of non-linear
connections at each step. Such an algorithm will scale down with low overhead, since the
overhead of the scaling simulation of Section 4.3.1 depends on the number of
non-linear connections.
An interesting problem is to identify new variants of the R-Mesh that admit scaling
simulations with low overhead. A simpler scaling simulation for the LR-Mesh will also 
be beneficial, since many of the results in this dissertation are based on the LR-Mesh 
scaling simulation.
Exploiting knowledge of the structure of a particular algorithm for scaling purposes
is another research direction. For example, Jang and Prasanna [22] have proved
that for any 1 ≤ T ≤ √N, an […] × […] R-Mesh can sort N elements in O(T) time.
Similar results may be found in [46].
Bibliography
[1] H. M. Alnuweiri, “Constant-Time Algorithms for Image Labeling on a Reconfigurable
Network of Processors,” IEEE Trans. Parallel Distrib. Systems, vol. 5, no. 3, (1994),
pp. 320-326.
[2] H. M. Alnuweiri, “Parallel Constant-Time Connectivity Algorithms on a Reconfigurable
Network of Processors,” IEEE Trans. Parallel Distrib. Systems, vol. 6, no. 1,
(June 1995), pp. 105-110.
[3] Y. Ben-Asher, D. Gordon, and A. Schuster, “Optimal Simulations in Reconfigurable
Arrays,” Technion Israel Institute of Technology, Technical Report 716, (Feb. 1992).
[4] Y. Ben-Asher, D. Gordon, and A. Schuster, “Efficient Self Simulation Algorithms for
Reconfigurable Arrays,” J. Parallel Distrib. Comput., vol. 30, no. 1, (1995), pp. 1-22.
[5] Y. Ben-Asher, K.-J. Lange, D. Peleg, and A. Schuster, “The Complexity of Reconfiguring
Network Models,” Info. and Comput., vol. 121, no. 1, (1995), pp. 41-58.
[6] Y. Ben-Asher, D. Peleg, R. Ramaswami, and A. Schuster, “The Power of Reconfiguration,”
J. Parallel Distrib. Comput., vol. 13, (1991), pp. 139-153.
[7] A. G. Bourgeois and J. L. Trahan, “Relating Two-Dimensional Reconfigurable Meshes
with Optically Pipelined Buses,” manuscript, (1999).
[8] J. Bruck, L. De Coster, N. Dewulf, C.-T. Ho, and R. Lauwereins, “On the Design
and Implementation of Broadcast and Global Combine Operations Using the Postal
Model,” IEEE Trans. Parallel Distrib. Systems, vol. 7, no. 3, (Mar. 1996), pp. 256-265.
[9] G.-H. Chen, B.-F. Wang, and C.-J. Lu, “On the Parallel Computation of the
Algebraic Path Problem,” IEEE Trans. Parallel Distrib. Systems, vol. 3, (1992),
pp. 251-256.
[10] K.-L. Chung, “Image Template Matching on Reconfigurable Meshes,” Parallel Proc.
Letters, vol. 6, no. 3, (1996), pp. 345-353.
[11] H. ElGindy and L. Wetherall, “A Simple Voronoi Diagram Algorithm for a Reconfigurable
Mesh,” IEEE Trans. Parallel Distrib. Systems, vol. 8, (1997), pp. 1133-1142.
[12] J. A. Fernández-Zepeda, J. L. Trahan, and R. Vaidyanathan, “Scaling the FR-Mesh
under Different Concurrent Write Rules,” Proc. World Multiconference on Systemics,
Cybernetics and Informatics, (1997), pp. 437-444.
[13] J. A. Fernández-Zepeda, R. Vaidyanathan, and J. L. Trahan, “Scalability of the
Fusing-Restricted Reconfigurable Mesh,” Proc. 8th IASTED Intl. Conf. Par. Distrib.
Comput. and Sys., (1996), pp. 467-471.
[14] J. A. Fernández-Zepeda, R. Vaidyanathan, and J. L. Trahan, “Scaling Simulation of
the Fusing-Restricted Reconfigurable Mesh,” IEEE Trans. Parallel Distrib. Systems,
vol. 9, no. 9, (Sep. 1998), pp. 861-871.
[15] J. A. Fernández-Zepeda, R. Vaidyanathan, and J. L. Trahan, “Improved Scalability
Simulations of the General Reconfigurable Mesh,” Proc. 6th Reconfigurable Architecture
Workshop, LNCS vol. 1586, (Apr. 1999), pp. 616-624.
[16] J. Gunnels, C. Lin, G. Morrow, and R. van de Geijn, “A Flexible Class of Parallel
Matrix Multiplication Algorithms,” Proc. 12th Intl. Par. Processing Symp. & 9th
IEEE Symp. Par. Distrib. Processing, (1998), pp. 110-116.
[17] Z. Guo, “Optically Interconnected Processor Arrays with Switching Capability,” J.
Parallel Distrib. Comput., vol. 23, (1994), pp. 314-329.
[18] T. Hayashi, K. Nakano, and S. Olariu, “An O((log log n)^2) Time Algorithm to
Compute the Convex Hull on Reconfigurable Meshes,” IEEE Trans. Parallel Distrib.
Systems, vol. 9, no. 12, (1998), pp. 1167-1179.
[19] J. JáJá, An Introduction to Parallel Algorithms, Addison-Wesley Publishing Co., 1992.
[20] J.-W. Jang, M. Nigam, V. K. Prasanna, and S. Sahni, “Constant Time Algorithms for
Computational Geometry on the Reconfigurable Mesh,” IEEE Trans. Parallel Distrib.
Systems, vol. 8, (1997), pp. 1-12.
[21] J.-W. Jang and V. K. Prasanna, “An Optimal Multiplication Algorithm on Reconfigurable
Mesh,” IEEE Trans. Parallel Distrib. Systems, vol. 8, no. 5, (1997), pp. 521-532.
[22] J.-W. Jang and V. K. Prasanna, “An Optimal Sorting Algorithm on Reconfigurable
Mesh,” J. Parallel Distrib. Comput., vol. 25, no. 1, (1995), pp. 31-41.
[23] T.-W. Kao, S.-J. Horng, and Y.-L. Wang, “An O(1) Time Algorithms for Computing
Histogram and Hough Transform on a Cross-bridge Reconfigurable Array of Processors,”
IEEE Trans. System, Man and Cybernetics, vol. 25, no. 4, (1995), pp. 681-687.
[24] R. M. Karp and V. Ramachandran, “Parallel Algorithms for Shared-Memory
Machines,” in Handbook of Theoretical Computer Science, vol. A: Algorithms and
Complexity, J. van Leeuwen, ed., MIT Press, (1990), pp. 869-941.
[25] T. H. Lai and M.-J. Sheng, “Constructing Euclidean Minimum Spanning Trees and All
Nearest Neighbors on Reconfigurable Meshes,” IEEE Trans. Parallel Distrib. Systems,
vol. 7, no. 8, (Aug. 1996), pp. 806-817.
[26] H. Li and Q. F. Stout, “Reconfigurable SIMD Massively Parallel Computers,” IEEE
Proceedings, vol. 79, no. 4, (Apr. 1991), pp. 429-443.
[27] K. Li, Y. Pan, and S. Q. Zheng, Parallel Computing Using Optical Interconnections,
Kluwer Academic Publishers, Boston, MA, 1998.
[28] R. Lin and S. Olariu, “Reconfigurable Buses with Shift Switching: Concepts and
Applications,” IEEE Trans. Parallel Distrib. Systems, vol. 6, no. 1, (Jan. 1995),
pp. 93-102.
[29] R. Lin, S. Olariu, J. L. Schwing, and J. Zhang, “Sorting in O(1) Time on an n × n
Reconfigurable Mesh,” Proc. Plenary Address Proc. EWPC, (1992), pp. 16-27.
[30] M. Maresca, “Polymorphic Processor Arrays,” IEEE Trans. Parallel Distrib. Systems,
vol. 4, no. 5, (May 1993), pp. 490-506.
[31] M. Maresca and P. Baglietto, “Transitive Closure and Graph Component Labeling on
Realistic Processor Arrays Based on Reconfigurable Mesh Network,” Proc. IEEE Intl.
Conf. on Comp. Design: VLSI in Comp. and Proc., (1991), pp. 229-232.
[32] Y. Matias and A. Schuster, “Fast, Efficient Mutual and Self Simulations for Shared
Memory and Reconfigurable Mesh,” Parallel Algorithms and Architectures, vol. 8,
(1996), pp. 195-221.
[33] R. Miller, V. K. Prasanna-Kumar, D. Reisis, and Q. Stout, “Parallel Computations
on Reconfigurable Meshes,” IEEE Trans. Comput., vol. 42, no. 6, (June 1993),
pp. 678-692.
[34] M. M. Murshed and R. P. Brent, “Algorithms for Optimal Self-Simulation of Some
Restricted Reconfigurable Meshes,” Proc. 2nd Intl. Conf. Computational Intelligence
and Multimedia Applic., (1998), pp. 734-744.
[35] K. Nakano, “Efficient Summing Algorithms for a Reconfigurable Mesh,” Proc. IPPS
'94 Workshop on Reconfigurable Architectures.
[36] K. Nakano, “Prefix-Sums Algorithms on Reconfigurable Meshes,” Parallel Proc.
Letters, vol. 5, no. 1, (1995), pp. 23-35.
[37] K. Nakano, “A Bibliography of Published Papers on Dynamically Reconfigurable
Architectures,” Parallel Proc. Letters, vol. 5, no. 1, (Mar. 1995), pp. 111-124.
[38] M. Nigam and S. Sahni, “Sorting n Numbers on n × n Reconfigurable Meshes with
Buses,” J. Parallel Distrib. Comput., vol. 23, (1994), pp. 37-48.
[39] N. Nisan and A. Ta-Shma, “Symmetric Logspace is Closed Under Complement,” Proc.
27th ACM Symp. Theory of Computing, (1995), pp. 140-146.
[40] Y. Pan and K. Li, “Linear Array with a Reconfigurable Pipelined Bus System:
Concepts and Applications,” Information Sciences - An International Journal,
vol. 106, (1998), pp. 237-258.
[41] S. Pavel and S. G. Akl, “On the Power of Arrays with Optical Pipelined Buses,” Proc.
Intl. Conf. Par. Distr. Proc. Techniques and Appl., (1996), pp. 1443-1454.
[42] S. Sahni, “Computing on Reconfigurable Bus Architectures,” in Computer Systems &
Education, Balakrishnan et al. (Eds.), Tata McGraw-Hill Publishing Co., New Delhi,
1994, pp. 386-398.
[43] Y. Shiloach and U. Vishkin, “An O(log N) Parallel Connectivity Algorithm,” Journal
of Algorithms, vol. 3, (1982), pp. 57-67.
[44] J. L. Trahan, A. G. Bourgeois, Y. Pan, and R. Vaidyanathan, “Optimally Scaling
Permutation Routing on Reconfigurable Arrays with Optically Pipelined Buses,” Proc.
13th Intl. Par. Process. Symp. & 10th Symp. Par. Distr. Process., (1999), pp. 233-237.
[45] J. L. Trahan, A. G. Bourgeois, and R. Vaidyanathan, “Tighter and Broader Complexity
Results for Reconfigurable Models,” Parallel Proc. Letters, vol. 8, (1998),
pp. 271-282.
[46] J. L. Trahan, C.-M. Lu, and R. Vaidyanathan, “Scalable Reconfigurable Mesh Algorithms
for Matrix Operations with Integer and Floating Point Inputs,” manuscript,
1998.
[47] J. L. Trahan, Y. Pan, R. Vaidyanathan, and A. G. Bourgeois, “Scalable Basic
Algorithms on a Linear Array with a Reconfigurable Pipelined Bus System,” Proc.
10th ISCA Intl. Conf. Par. Distr. Comput. Sys., (1997), pp. 564-569.
[48] J. L. Trahan, R. Vaidyanathan, and A. G. Bourgeois, “LRN Simulation of RN and
RN Simulation of DRN,” manuscript, June 1997.
[49] J. L. Trahan, R. Vaidyanathan, and C. P. Subbaraman, “Constant Time Graph Algorithms
on the Reconfigurable Multiple Bus Machine,” J. Parallel Distrib. Comput.,
vol. 46, (1997), pp. 1-14.
[50] J. L. Trahan, R. Vaidyanathan, and R. K. Thiruchelvan, “On the Power of Segmenting
and Fusing Buses,” J. Parallel Distrib. Comput., vol. 34, no. 1, (Apr. 1996), pp. 82-94.
[51] J. L. Trahan and R. Vaidyanathan, “Relative Scalability of the Reconfigurable
Multiple Bus Machine,” Proc. Workshop Reconfigurable Arch. and Algs., 1996.
[52] R. Vaidyanathan and J. L. Trahan, “Optimal Simulation of Multidimensional Reconfigurable
Meshes by Two-dimensional Reconfigurable Meshes,” Information Processing
Letters, vol. 47, no. 5, (Oct. 1993), pp. 267-273.
[53] U. Vishkin, “Structural Parallel Algorithmics,” Proc. Intl. Colloq. Automata, Languages
and Programming, (1991), pp. 363-380.
[54] B. F. Wang and G. H. Chen, “Constant Time Algorithms for the Transitive Closure
and Some Related Graph Problems on Processor Arrays with Reconfigurable Bus Systems,”
IEEE Trans. Parallel Distrib. Systems, vol. 1, no. 4, (Oct. 1990), pp. 500-507.
Vita
José Alberto Fernandez Zepeda was born in Mexico City on June 26, 1966. He
received the degree of Ingeniero Mecanico Electricista and the degree of Maestro en
Ingenieria Electrica from the Universidad Nacional Autonoma de Mexico in 1991
and 1994, respectively. He is currently a doctoral student in Computer Engineering
and a teaching assistant in the Department of Electrical and Computer Engineering
at Louisiana State University. His current research interests include reconfigurable
bus-based architectures and the design and analysis of parallel algorithms. He will receive
the degree of Doctor of Philosophy in December, 1999.
DOCTORAL EXAMINATION AND DISSERTATION REPORT
Candidate: Jose Alberto FERNANDEZ ZEPEDA
Major Field: Electrical Engineering
Title of Dissertation: Scaling Simulations of Reconfigurable Meshes
Approved:
EXAMINING COMMITTEE:
Date of Examination: October 12, 1999
