Louisiana State University

LSU Digital Commons
LSU Doctoral Dissertations

Graduate School

2003

On implementing dynamically reconfigurable architectures
Hatem Mahmoud El-Sayed El-Boghdadi
Louisiana State University and Agricultural and Mechanical College

Follow this and additional works at: https://digitalcommons.lsu.edu/gradschool_dissertations
Part of the Electrical and Computer Engineering Commons

Recommended Citation
El-Boghdadi, Hatem Mahmoud El-Sayed, "On implementing dynamically reconfigurable architectures"
(2003). LSU Doctoral Dissertations. 3900.
https://digitalcommons.lsu.edu/gradschool_dissertations/3900

This Dissertation is brought to you for free and open access by the Graduate School at LSU Digital Commons. It
has been accepted for inclusion in LSU Doctoral Dissertations by an authorized graduate school editor of LSU
Digital Commons. For more information, please contact gradetd@lsu.edu.

ON IMPLEMENTING DYNAMICALLY
RECONFIGURABLE ARCHITECTURES

A Dissertation
Submitted to the Graduate Faculty of the
Louisiana State University and
Agricultural and Mechanical College
in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy
in
The Department of Electrical and Computer Engineering

by
Hatem Mahmoud El-Sayed El-Boghdadi
B.Sc., Assiut University, Egypt, 1991
M.Sc., Assiut University, Egypt, 1994
May 2003

To My Parents & Brothers


Acknowledgments
I would like to thank the members of my committee: Dr. S. Rai, Dr. J. Trahan,
Dr. J. Ramanujam, Dr. S. Kundu, and Dr. R. Litherland.
I would like to express my gratitude to my advisor, Dr. R. Vaidyanathan, for his
guidance and technical advice during the course of this work. The initial ideas for
the E-SRGA and bends-cost LR-Mesh grew out of many meetings where we tried to
reconcile the benefits provided by both FPGAs and the R-Mesh model.
I would also like to thank Dr. M. Abdelrahman, who helped me to come to the
United States.
Finally, I would like to express my gratitude to my parents and brothers in Egypt.
Their unwavering support has made this work eventually possible. It is to them that
I dedicate the dissertation.


Table of Contents
Acknowledgments
List of Tables
List of Figures
Abstract

Chapter
1 Introduction
  1.1 The State of the Art
  1.2 Background
    1.2.1 Self-Reconfigurable Gate Array Architecture
    1.2.2 The Circuit Switched Tree
    1.2.3 Segmentable Bus
    1.2.4 Reconfigurable Mesh
  1.3 Scope of the Dissertation
    1.3.1 Communication Capability of the CST
    1.3.2 Configuring the CST
    1.3.3 Implementing R-Mesh-Type Models
    1.3.4 Cost-Benefit Tradeoff Study
  1.4 Contributions of this Work
  1.5 Organization of the Dissertation
2 Preliminaries
  2.1 The Circuit Switched Tree
  2.2 Segmentable Bus
  2.3 The Reconfigurable Mesh
  2.4 The LR-Mesh
  2.5 Bus Delay
3 CST Communication: Width Partitionable Sets
  3.1 Communicating over the CST
  3.2 Communication Sets with Disjoint Incompatibles
  3.3 Sets with Overlapping Incompatibles
    3.3.1 Combining Communication Sets
  3.4 Well-Nested Communication Sets
  3.5 Monotonic Communication Sets
  3.6 Segmentable Bus
  3.7 Concluding Remarks
4 CST Communication: Sets That Are Not Width Partitionable
  4.1 The Simplest Communication Sets That Are Not Width Partitionable
    4.1.1 Requirement of the Simplest Set
      4.1.1.1 Preliminary Results
      4.1.1.2 Number of Communications in a Simplest Set
      4.1.1.3 Number of Incompatibles in a Simplest Set
    4.1.2 Choices of the Simplest Sets
  4.2 A Bound on the Number of Extra Steps
  4.3 Non-Oriented, Well-Nested Sets
  4.4 Non-Oriented, Monotonic Sets
  4.5 Concluding Remarks
5 Configuring the CST
  5.1 CST Configuration: A Broad Outline
  5.2 Edge-Exclusive Communication Sets
  5.3 Edge-Exclusive Decomposition
  5.4 Concluding Remarks
6 Segmentable Bus Implementation
  6.1 Our Approaches
  6.2 Methods for Large Processors
    6.2.1 Implementing an Oriented Segmentable Bus
    6.2.2 Segmentable Bus with Exclusive Writes
    6.2.3 Segmentable Bus with Concurrent Writes
  6.3 Method for Small Processors
    6.3.1 Another Segmentable Bus Implementation
  6.4 Concluding Remarks
7 Implementing the Linear Reconfigurable Mesh
  7.1 Preliminaries
    7.1.1 Exploiting Features of the LR-Mesh
  7.2 The Bends-Cost Measure
  7.3 A Bends-Cost LR-Mesh Implementation
  7.4 Designing Implementable LR-Mesh Algorithms
  7.5 Simulating Semimonotonic Configurations
    7.5.1 Simulation Algorithm
    7.5.2 The Channel Assignment Problem
    7.5.3 Restricted Channel Assignment
      7.5.3.1 Applications
    7.5.4 General Channel Assignment
      7.5.4.1 Stage 1: Leader Determination
      7.5.4.2 Stage 2: List Creation
      7.5.4.3 Stage 3: Broadcasting in List
      7.5.4.4 Prefix Sums of Bits
    7.5.5 Special Cases
  7.6 Simulating General Configurations
  7.7 Concluding Remarks
8 Computational Power of the Bends-Cost LR-Mesh
  8.1 The Simulation Algorithm
  8.2 Concluding Remarks
9 The Enhanced-SRGA
  9.1 Architecture Overview
  9.2 Architectural Details
    9.2.1 Interconnection Network
    9.2.2 Switches
    9.2.3 Processing Elements
    9.2.4 Logic Cells
    9.2.5 Memory Block
    9.2.6 Registers
  9.3 Implementation
  9.4 Modeling
  9.5 Programming Model
    9.5.1 Com Step
    9.5.2 Sel Step
    9.5.3 Con step
    9.5.4 Relation between High and Low Level Commands
  9.6 Concluding Remarks
10 Conclusions
  10.1 Future Directions
Bibliography
Vita

List of Tables
3.1 Width partitionability of communication sets satisfying conditions from $\{cap, concat, inter\}$
9.1 Low level commands of the E-SRGA
9.2 Effect of array size and different features on clock
9.3 Effect of memory size on PE area
9.4 Translation between high and low level commands
9.5 Estimated time for high level commands

List of Figures
1.1 $4 \times 4$ PE array
1.2 A sample set of communications
1.3 Examples of oriented communication sets
1.4 Examples of non oriented communication sets
1.5 Examples of communication sets
1.6 An 8-processor segmentable bus
1.7 Examples of buses in a $3 \times 5$ R-Mesh and LR-Mesh
1.8 Counting $N$ bits
2.1 A sample set of communications
2.2 Some CST switch configurations
2.3 Internal structure of the CST switch
2.4 Structure of an 8-processor segmentable bus
2.5 A configuration of an 8-processor segmentable bus
2.6 Another representation of an $N$-processor segmentable bus
2.7 Example of buses in a $3 \times 5$ R-Mesh and LR-Mesh
2.8 Structure of a traditional bus
2.9 A segmentable bus with all processors connected
2.10 A segmentable bus with all processors disconnected
2.11 A bus represented as a combinational circuit
3.1 An example of a communication set
3.2 Communication set with disjoint incompatibles
3.3 Illustration of the proof of Lemma 3.1
3.4 Illustration of the proof of Lemma 3.3
3.5 Width-2 communication set requiring three steps
3.6 Incompatibility graph of the communication set of Figure 3.5
3.7 Illustration of the proof of Lemma 3.4
3.8 Constructing set $C_1$ for a set with disjoint incompatibles
3.9 Examples of oriented and non-oriented communication sets
3.10 Illustration of conditions for combining communication sets
3.11 A communication set satisfying the set $\{cap, inter\}$
3.12 Mapping communication set of Figure 3.11 on the CST
3.13 The incompatibility graph of the communication set of Figure 3.12
3.14 Examples of well-nested sets
3.15 Illustration of Lemma 3.14
3.16 Illustration of the proof of Theorem 3.15
3.17 Monotonic communication set
3.18 Illustration of the proof of Lemma 3.16
3.19 Illustration of the proof of Theorem 3.17
3.20 An ordered incompatibility graph that is not parallel
3.21 Illustration of the proof of the "if" part of Theorem 3.18
3.22 Illustration of the proof of the "only if" part of Theorem 3.18
3.23 Broadcasting on the CST
4.1 Width-2 communication set requiring three steps
4.2 Illustration of the proof of Lemma 4.1
4.3 Illustration of the proof outline of Lemma 4.2
4.4 Illustration of the proof of Lemma 4.3
4.5 Illustration of the proof of Lemma 4.5
4.6 Source incompatibles for Subcase 2.3.1 of Theorem 4.6
4.7 Incompatibility graph for Subcase 2.3.2 of Theorem 4.6
4.8 Possibilities for Subcase 3.1 of Theorem 4.6
4.9 Possibilities for Subcase 3.2 of Theorem 4.6
4.10 Illustration of the proof of Lemma 4.7
4.11 Illustration of the proof of Case 1 Lemma 4.8
4.12 Illustration of the proof of Case 2 Lemma 4.8
4.13 Illustration of the proof of Case 3 Lemma 4.8
4.14 Illustration of the proof Lemma 4.9
4.15 Relationship between source incompatibles for a simplest set
4.16 Relationships between disjoint incompatibles of a simplest set
4.17 The two forms of a smallest set
4.18 Simplest sets with disjoint destination incompatibles
4.19 Simplest sets with overlapping destination incompatibles
4.20 An $N$-extension of an incompatibility graph
4.21 Width-2, non-oriented well-nested set requiring three steps
4.22 Level-1 oriented well nested sets
4.23 Level-2 oriented well nested sets
4.24 Unoriented set and its oriented counterpart
4.25 Illustration of the proof of Lemma 4.14
4.26 The communication set $C''$
4.27 Level-1, non-oriented well-nested sets
4.28 Level-2, non-oriented well nested sets
4.29 The set $C_R$
4.30 Width-2, non-oriented, monotonic set requiring 3 steps
4.31 Separable monotonic sets
4.32 Illustration of the proof of Lemma 4.16
4.33 A separable monotonic communication set
5.1 Internal Structure of the Switch
5.2 Edge-exclusive communication set
5.3 A communication set that is not edge-exclusive
5.4 The function $f_s$ for edge-exclusive sets
5.5 The function $f_c$ for edge-exclusive sets
5.6 Illustration of the proof of Lemma 5.1
5.7 Illustration of the proof of Lemma 5.1
5.8 Incoming and outgoing edges of a switch
5.9 Edge-Exclusive Decomposition Procedure
5.10 Decomposition of width-1 communication set
6.1 Right oriented segmentable bus
6.2 The function $g_s$ for segmentable bus
6.3 The function $g_c$ for segmentable buses
6.4 Illustration of the proof of Lemma 6.1
6.5 Implementation of a segmentable bus with exclusive writes
6.6 Implementation of a segmentable bus with concurrent writes
6.7 Reversing the directions of data flow
6.8 Structure of a segmentable bus implementation $S(x)$
6.9 Structure of $S(3)$
6.10 An illustration of the functioning of $S(3)$
6.11 A balanced ternary ($k = 3$) tree of height 3
7.1 Examples of buses in a $3 \times 5$ LR-Mesh
7.2 Replacing a linear, acyclic bus by two "directional buses"
7.3 Detection of a column monotonic bus
7.4 Detection of a row monotonic bus
7.5 Illustration of the case $vw \neq 5$
7.6 Illustration of the case $vw = 5$
7.7 General scaling simulation do not work for semimonotonic buses
7.8 Using a restricted scaling simulation for semimonotonic buses
7.9 Buses with different numbers of bends for an $N \times N$ LR-Mesh ($N = 7$)
7.10 Structure of a bends-cost LR-Mesh implementation
7.11 Switching fabric of a bends-cost LR-Mesh processor
7.12 Dividing a slice into $x$ slices
7.13 Bus types
7.14 Handling Category 2 buses
7.15 Routing Type A and B buses in different tiers
7.16 Combining two tiers
7.17 Assignments of columns to buses
7.18 Counting bits on the LR-Mesh
7.19 An example of the channel assignment problem
7.20 Configuration for Stage 1
7.21 Configuration and result of Stage 2
7.22 Illustration of Stage 3
7.23 The result of channel assignment
7.24 Illustration of Stage 2
7.25 Examples of pointers in a window
7.26 Example of an oscillating configuration
7.27 Example of a parallel configuration
7.28 Bus types with U-turns
8.1 Some CST switch configurations
9.1 Associating CST switches with PEs
9.2 Overview of the E-SRGA architecture
9.3 $4 \times 4$ PE array
9.4 Structure of a CST switch
9.5 Structure of a PE
9.6 Logic cell structure
9.7 Memory Architecture
9.8 Details of a configuration word
9.9 Global registers
9.10 Interaction between controller and PE array
9.11 Effect of array size and different features on clock
9.12 Effect of memory size on PE area for different optimization options

Abstract
Dynamically reconfigurable architectures have the ability to change their structure at
each step of a computation. This dissertation studies various aspects of implementing
dynamic reconfiguration, ranging from hardware building blocks and low-level
architectures to modeling issues and high-level algorithm design.
First we derive conditions under which classes of communication sets can be optimally
scheduled on the circuit-switched tree (CST). Then we present a method to
configure the CST to perform in constant time all communications scheduled for a
step. This results in a constant time implementation of a step of a segmentable bus,
a fundamental dynamically reconfigurable structure.
We introduce a new bus delay measure (bends-cost) and define the bends-cost
LR-Mesh; the LR-Mesh is a widely used reconfigurable model. Unlike the (idealized)
LR-Mesh, which ignores bus delay, the bends-cost LR-Mesh uses the number of bends
in a bus to estimate its delay. We present an implementation for which the bends-cost
is an accurate estimate of the actual delay. We present algorithms to simulate various
LR-Mesh configuration classes on the bends-cost LR-Mesh. For "semimonotonic
configurations," a $\Theta(N) \times \Theta(N)$ bends-cost LR-Mesh with bus delay at most $D$ can
simulate a step of the idealized $N \times N$ LR-Mesh in
$O\left(\left(\frac{\log N}{\log D - \log \Delta}\right)^{2}\right)$ time (where
$\Delta$ is the delay of an $N$-element segmentable bus), while employing about the same
number of processors. For some special cases this time reduces to
$O\left(\frac{\log N}{\log D - \log \Delta}\right)$. If
$D = N^{\epsilon}$, for an arbitrarily small constant $\epsilon > 0$, then the running times of bends-cost
LR-Mesh algorithms are within a constant of their idealized counterparts. We also
prove that with a polynomial blowup in the number of processors and $D = N^{\epsilon}$, the
bends-cost LR-Mesh can simulate any step of an idealized LR-Mesh in constant time,
thereby establishing that these models have the same "power."

We present an implementation (in VHDL) of the "Enhanced Self-Reconfigurable
Gate Array" (E-SRGA) architecture and perform a cost-benefit study for different
dynamic reconfiguration features. This study shows our approach to be feasible.


Chapter 1
Introduction
Advances in technology and the need for more powerful and faster devices have produced
a range of computing devices varying in their efficiency and their flexibility. At
one extreme are Application Specific Integrated Circuits (ASICs) that are narrowly
tailored to solve a small suite of problems. At the other end of the spectrum are
programmable processors that can be programmed to solve any solvable problem. This
dissertation focuses on devices and models that occupy a middle ground that deals
with reconfigurable computing [13].
ASICs are devices that have dedicated hardware designed for one specific task.
The function of this hardware is fixed at the time of fabrication. Thus, one can
expect such devices to be fast and have an efficient use of chip area and power. Once
manufactured, however, these devices can only perform the tasks for which they are
designed. Thus, they lack flexibility.
On the other hand, programmable processors are devices that can execute a number
of different functions. The user can program these devices, after fabrication, to
perform any desired task. Therefore, such devices are very flexible but only at the
expense of efficiency.
Reconfigurable computing started as a new method that promised speeds not
possible on traditional models of computation. This initial thrust was centered around
models such as the R-Mesh that used dynamic reconfiguration, the ability to change
the structure of the architecture very rapidly, possibly at each step.
Subsequently, reconfiguration moved to Field Programmable Gate Arrays (FPGAs),
devices that can be configured as circuits suited to the problem at hand. Run-time
reconfiguration (RTR) deals with fast reconfiguration on FPGAs. Although
RTR is not as fast as dynamic reconfiguration, it allows the device to configure at
run time to suit the problem at hand.
This dissertation deals with implementing dynamic reconfiguration using ideas
from both the R-Mesh model and FPGA-type platforms.

1.1 The State of the Art
Dynamic reconfiguration has been shown to be a very powerful computing paradigm,
capable of extremely fast solutions to many problems. Models such as the R-Mesh [32]
have been extensively studied and solutions developed for a wide range of problems.
Nakano [34] provides an extensive bibliography of results in computing with dynamic
reconfiguration. Although models such as the R-Mesh provide an abstract platform
to develop reconfigurable algorithms, they are difficult to implement. This is because
most algorithms employ buses whose delay is proportional to the
problem size. On such buses, the constant bus-delay assumption that is central to all
R-Mesh algorithms does not hold.
Compared to the volume of results published for reconfigurable models, relatively
little work has been reported on implementing these models or specific algorithms
for them. One direction has been algorithmic, using an R-Mesh with restricted bus
lengths (that restrict the delay). Beresford-Smith et al. [12] developed a sorting
algorithm that runs on an $N \times N$ R-Mesh with bus delay bounded by $D$. This algorithm
incurs a $\Theta\left(\frac{N}{D}\right)$ overhead in time. Kunde and Gurtzig [23] designed an $hN + o(hdN)$
time algorithm for $h$-$h$ sorting and routing problems on a $d$-dimensional R-Mesh of
side length $N$ using constant delay buses. The fact that this algorithm runs at the
same speed as the one using buses spanning more processors is due to the properties
of the problem and its solution, and is not a general technique for implementing
buses spanning a large number of processors. Bertossi and Mei [6] showed that the
simulation of the Basic R-Mesh (a very restricted version of the LR-Mesh) reduces
to the segmented scan problem and proved that this problem can be solved on fixed
connection networks. Other approaches [3, 17] tried to scale down the size of the
R-Mesh, enabling the algorithms to run on a smaller array. These approaches simulate
the large model on the smaller model and hinge on computing the
connected components for the simulated array at every step of the simulated algorithm.
Murshed [33] introduced other simulation algorithms for the LR-Mesh with
monotonic configurations that do not require solving the connected components problem.
However, even for the smaller R-Mesh, the constant delay assumption for the
buses is difficult to realize. Other directions for implementing dynamic reconfiguration
are technology-based [28, 29]. Several prototypes were proposed [27, 39, 45] but
they have not kept pace with algorithmic advances on reconfigurable models.
FPGAs (for example, see Wakerly [51]), though very different from models such
as the R-Mesh, provide a practical platform that supports reconfiguration. These
devices have hardware that can be configured to suit the problem at hand. Generally
speaking, an FPGA consists of an array of logic blocks that can be connected using
horizontal and vertical channels. Switches that can be configured are located at the
intersection of the horizontal and vertical channels. By configuring the switches,
different connections can be established between logic blocks. FPGAs have the flexibility
of implementing different functions and the efficiency that comes from exploiting the
possible parallelism in the problem. They have also proven very useful for rapid
prototyping. Traditional FPGAs configure their switches using information generated
outside the chip. Consequently, pin limitation is one of the biggest hurdles for
run-time reconfiguration in these devices.
For FPGAs too, technological advances, coupled with ideas such as pre-loaded
contexts [9, 18, 38] and partial reconfiguration [2, 52], have reduced reconfiguration
time substantially. However, one cannot expect current FPGA-type devices to support
the "R-Mesh-type" reconfiguration, in which connections could be altered at
each step of the computation.
Perhaps one of the most promising first steps in implementing the R-Mesh-type
reconfiguration was due to Sidhu et al., who proposed the Self-Reconfigurable Gate
Array (SRGA) [40, 41, 42, 43]. This architecture augments an FPGA-type structure
with the ability to generate reconfiguration information from within the device
(self-reconfiguration). Consequently, it can change its configuration extremely fast (in a
few clock cycles). This self-reconfiguration feature has much in common with the
R-Mesh in that local information is used to generate a configuration with global
relevance.

1.2 Background
In this section we introduce some ideas that will assist in describing the scope and
contribution of this work. Specifically, we discuss four broad ideas: (1) the
Self-Reconfigurable Gate Array Architecture (SRGA), (2) the Circuit Switched (binary)
Tree (CST) interconnect, (3) the segmentable bus and (4) the Reconfigurable Mesh
(R-Mesh). We now discuss these topics briefly.
1.2.1 Self-Reconfigurable Gate Array Architecture
The Self-Reconfigurable Gate Array Architecture (SRGA) consists of an array of
processing elements (PEs) connected by rows and columns of trees, much like a
mesh-of-trees structure [24] (see Figure 1.1). Each PE consists of a flip-flop, look-up table
(LUT), local memory and a small amount of control hardware. One could view a
PE as a small 1-bit processor. Each PE is a leaf of a row tree and a column tree
that the PE uses to communicate with other PEs. As defined by Sidhu et al. [43],
the SRGA permits a tree to only connect one pair of PEs at a time. We adopt a
somewhat more general view of the tree and allow it to connect multiple processor
pairs simultaneously. This general tree, called the circuit switched tree, is discussed
next.
1.2.2 The Circuit Switched Tree
The Circuit Switched Tree (CST) is a balanced binary tree whose leaves are PEs and
whose internal nodes are switches that can be configured to establish various paths among
leaves. The edges of the CST represent full-duplex links (that allow simultaneous
communications in opposite directions). Thus, we will view the CST as a directed
graph with each tree edge replaced by two oppositely directed edges. For a communication to be performed on the CST, a directed shortest path must be established
between two leaves (PEs). (Since the underlying structure is a tree, there is a unique
shortest path between any pair of leaves.)

Figure 1.1: $4 \times 4$ PE array (legend: switch, PE)
A communication that has only one source and one destination is called a one-to-one communication (see Figure 1.2 for examples of one-to-one communications
on the CST). Two communications can be accommodated simultaneously on the
CST if they do not use a common directed edge. The width of a set of one-to-one communications is the maximum number of communications that use any given
directed edge. (The width of the communication set of Figure 1.2(a) is 1 because no two
communications use a common directed edge, whereas the communication set of Figure 1.2(b)
has width 2 because the communications labeled c1 and c2 share a common directed
edge.) Clearly, all communications in a width-1 communication set can be accommodated
simultaneously on the CST. If two communications use a common directed edge, then
they are said to form an incompatible. The incompatible is called a source (resp.,
destination) incompatible if the two communications use an edge going up (resp.,
down) the tree. Clearly, the size of the largest incompatible in a communication set
is the same as the width of the communication set. If a width-w communication set
can be scheduled on the CST in w steps, then the communication set is said to be
width partitionable.

Figure 1.2: A sample set of communications: (a) a width-1 communication set; (b) a width-2 communication set. Sources, si, and destinations, di, are
shown as white and black circles, respectively. A PE could be both a source and a
destination, or neither (shown shaded in grey).

Figure 1.3: Examples of oriented communication sets: (a) an oriented well-nested set; (b) an oriented monotonic set.
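The definition of width can be made concrete with a small sketch. It is only an illustration under assumed conventions that are not part of the dissertation: leaves are numbered 0 to N-1 from left to right, N is a power of two, tree nodes use heap-style numbering, and the function names `directed_edges` and `width` are hypothetical.

```python
from collections import Counter

def directed_edges(src, dst, n_leaves):
    """Directed CST edges on the unique path from leaf src to leaf dst.

    Assumed numbering: the root is node 1, node v has children 2v and 2v+1,
    and leaf i is node n_leaves + i (n_leaves is a power of two).
    """
    up, down = n_leaves + src, n_leaves + dst
    edges = []
    while up != down:                  # climb from both ends until they meet
        edges.append(("up", up))       # edge from node `up` to its parent
        edges.append(("down", down))   # edge from the parent of `down` to `down`
        up //= 2
        down //= 2
    return edges

def width(comms, n_leaves):
    """Maximum number of communications that share any one directed edge."""
    load = Counter()
    for src, dst in comms:
        load.update(directed_edges(src, dst, n_leaves))
    return max(load.values(), default=0)

print(width([(0, 1), (2, 3)], 8))  # 1: the two paths lie in different subtrees
print(width([(0, 4), (1, 5)], 8))  # 2: both paths climb the same upward edges
print(width([(0, 1), (1, 0)], 8))  # 1: opposite directions of full-duplex links
```

The last call shows why edges are treated as directed: two communications that cross the same link in opposite directions do not form an incompatible.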
A communication set is oriented if either (1) for each communication in it, the
source is a leaf that is to the left of the destination on the CST, or (2) for each
communication, the source is to the right of the destination (Figures 1.3(a) and
(b) show oriented communication sets while those in Figures 1.4(a) and (b) are not
oriented because some sources are to the right of their destinations while other sources
are to the left of their destinations.)
Another way to classify communication sets is by the pattern of communications they form, regardless of the source-destination orientation. In a well-nested
communication set the communications can be represented as well-nested parentheses
(see Figure 1.3(a)); i.e., each communication is entirely inside another communication
or concatenated with another well-nested communication set. A monotonic communication set forms a stride of communications (see Figure 1.3(b)). Yet another way
to classify communication sets is based on the properties of their incompatibles. For
example, the incompatibles of a communication set could be disjoint, if no two incompatibles have a common element (see Figure 1.5(a)), or non-disjoint (see Figure 1.5(b)).
The bipartite graphs shown in Figure 1.5 are called incompatibility graphs and represent communication sets. The source incompatibles and destination incompatibles
are shown encircled. We use these ideas to derive properties related to accommodating communication sets on the CST. In particular, we will address the question of
scheduling a communication set so that only a width-1 set is scheduled in a step.

Figure 1.4: Examples of non-oriented communication sets: (a) a non-oriented well-nested set; (b) a non-oriented monotonic set.

Figure 1.5: Examples of communication sets with (a) disjoint incompatibles and (b)
non-disjoint incompatibles.
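The oriented and well-nested classes described above can be tested mechanically. The sketch below is one possible reading of those definitions, with assumptions of its own: a communication is a (source, destination) pair of leaf indices counted from the left, each PE is an endpoint of at most one communication, and the function names are hypothetical. Monotonic sets are not covered.

```python
def is_oriented(comms):
    """True if every source is left of its destination, or every source is right of it."""
    return all(s < d for s, d in comms) or all(s > d for s, d in comms)

def is_well_nested(comms):
    """True if the leaf intervals spanned by the communications nest like parentheses."""
    spans = sorted((min(s, d), max(s, d)) for s, d in comms)
    open_rights = []                      # right ends of the intervals still open
    for left, right in spans:
        while open_rights and open_rights[-1] < left:
            open_rights.pop()             # that interval closed before this one starts
        if open_rights and right > open_rights[-1]:
            return False                  # the two intervals cross
        open_rights.append(right)
    return True

print(is_oriented([(0, 3), (1, 2)]))             # True
print(is_oriented([(0, 3), (2, 1)]))             # False
print(is_well_nested([(0, 3), (1, 2), (4, 5)]))  # True
print(is_well_nested([(0, 2), (1, 3)]))          # False
```

Orientation and nesting are checked independently, which mirrors the text: the pattern of a set is classified regardless of its source-destination orientation.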
The ability of the CST to accommodate a width-1 communication set does not
guarantee the ability to perform these communications. Performing the communications requires configuring the switches of the CST to physically establish the communication paths. We use the term configuring the CST to refer to configuring its
switches to establish the communication paths between sources and destinations.
The settings (configuration) of each switch needed to successfully establish the communications can be computed at run time or compile time. In run-time configuration,
different configuration information is generated at each step of the algorithm. This
information could be based on the particular input instance and on the results of the
previous steps of the algorithm. On the other hand, compile-time configuration information is instance-independent and is computed before the algorithm starts. One
of the derivatives of our method to accommodate width-1 communication sets on the
CST is an implementation of the segmentable bus, described below.
1.2.3 Segmentable Bus
The structure of an N-processor segmentable bus [48] is shown in Figure 1.6. Each
processor controls (opens/closes) a segment switch on the bus using local information. Opening or closing the switches transforms the segmentable bus into blocks
of contiguous processors (segments); that is, local information at each processor is
translated into information with global relevance. Each processor can write to its segment and all other processors incident on the segment can read the written data. A
segmentable bus can also be viewed as a one-dimensional R-Mesh (see Section 1.2.4).
A segmentable bus plays a vital role in our implementations of R-Mesh-type models.
We now describe these models.

Figure 1.6: An 8-processor segmentable bus (processors 0 to 7); bidirectional lines are data links between
the processors and the bus; dashed lines allow processors to control their segment
switches.
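As an illustration of how local switch settings determine global segments, the following sketch simulates one step of a segmentable bus in software. It assumes that the switch controlled by processor i sits immediately to its left (so the switch of processor 0 has no effect) and that writes are exclusive; the function names `bus_segments` and `broadcast` are hypothetical.

```python
def bus_segments(switch_open):
    """Group processors 0..N-1 into segments.

    switch_open[i] is True when processor i has opened (cut) the switch
    assumed to sit just to its left; switch_open[0] is ignored.
    """
    if not switch_open:
        return []
    segments, current = [], [0]
    for i in range(1, len(switch_open)):
        if switch_open[i]:
            segments.append(current)   # the open switch ends the current segment
            current = [i]
        else:
            current.append(i)
    segments.append(current)
    return segments

def broadcast(switch_open, writes):
    """One exclusive-write step: writes maps a writing processor to its value.

    Returns the value each processor reads (None if its segment has no writer).
    """
    seen = [None] * len(switch_open)
    for segment in bus_segments(switch_open):
        writers = [p for p in segment if p in writes]
        if len(writers) > 1:
            raise ValueError("more than one writer on a segment")
        if writers:
            for p in segment:
                seen[p] = writes[writers[0]]
    return seen

cuts = [False, False, False, True, False, True, False, False]
print(bus_segments(cuts))               # [[0, 1, 2], [3, 4], [5, 6, 7]]
print(broadcast(cuts, {1: "a", 5: "b"}))
```

Each processor sets only its own entry of `cuts`, yet the resulting segments are a global property of the bus, which is the point made in the text.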
1.2.4 Reconfigurable Mesh
The Reconfigurable Mesh (R-Mesh) is a two-dimensional array of processors connected
by an underlying mesh (see Figure 1.7). Each processor has four ports (called North,
South, East, and West ports in the obvious manner, and abbreviated N, S, E, and
W). Each processor can independently partition its ports so that ports in the same
block of a partition are connected to each other. As shown in Figure 1.7(a), fifteen
different port partitions are possible. The port partitions along with the underlying
mesh connections between neighboring processors form buses connecting processors.
Figure 1.7 shows buses in bold, dashed, and dotted. An R-Mesh that assumes constant
bus delay, regardless of the number of ports spanned by the bus, is called a unit-cost
R-Mesh. The Linear R-Mesh (LR-Mesh) [4] is a restricted version of the R-Mesh whose
buses are not allowed to branch (see Figure 1.7(b)). Numerous R-Mesh algorithms
run on the LR-Mesh without loss of speed. A Horizontal-Vertical Reconfigurable
Mesh (HV-R-Mesh) [4] is another restricted version of the R-Mesh whose buses are
not allowed to bend from a row to a column or vice versa. A bit-model R-Mesh [21]
is a fine-grained version of the R-Mesh with processors of constant size (like PEs of
the SRGA architecture).

Figure 1.7: Examples of buses in a $3 \times 5$ R-Mesh and LR-Mesh: (a) R-Mesh; (b) LR-Mesh.
The R-Mesh solves problems very differently from conventional models. Consider
the problem of counting the number of 1's among N input bits (see Figure 1.8). The
inputs are at the top row and available to the respective columns. The counting
algorithm constructs buses that start at the processors of the West border of the R-Mesh and move down one row for each 1 in the input.
The bus starting at processor $(0, 0)$ (top left corner) reaches processor $(\ell, N-1)$
(in row $\ell$ and column $N-1$) iff the input bits include $\ell$ 1's. If processor $(0, 0)$
sends a signal from its West port, it will reach processor $(\ell, N-1)$, where $\ell$ is the
number of 1's in the input.

Figure 1.8: Counting N bits (in the example, the inputs are 1 1 1 0 0 0 1 0 1 1 0 1 0 and the signal emerges at the row marked 7, the answer).
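The counting idea can be traced in software. The sketch below assumes a common way of realizing it, which the text does not spell out: every processor in a column whose input is 0 connects its E and W ports, and every processor in a column whose input is 1 connects W to S and N to E, so the bus drops one row in that column. The function name is hypothetical.

```python
def count_ones_on_rmesh(bits):
    """Follow a signal injected at the West port of processor (0, 0).

    Assumed port partitions per column: input 0 gives {E, W} (pass straight
    through); input 1 gives {W, S} and {N, E} (step down one row, then go on).
    Returns the row at which the signal leaves the last column.
    """
    if not bits:
        return 0
    row, col, port = 0, 0, "W"
    while True:
        if bits[col] == 0 or port == "N":
            exit_port = "E"     # straight through, or second half of a step down
        else:
            exit_port = "S"     # entered from the West in a column holding a 1
        if exit_port == "S":
            row, port = row + 1, "N"    # arrive at the North port one row below
        else:
            col, port = col + 1, "W"    # arrive at the West port of the next column
            if col == len(bits):
                return row

inputs = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0]   # the inputs of Figure 1.8
print(count_ones_on_rmesh(inputs))                  # 7
```

On an R-Mesh all processors set their ports simultaneously and the signal traverses the bus in a single step under the unit-cost assumption; the loop here only replays the path that the bus takes.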

1.3 Scope of the Dissertation
As mentioned earlier, little work has been reported on implementing dynamic reconfiguration. This dissertation addresses various aspects of implementing dynamic
reconfiguration. The work is in four main directions. The first direction analyzes the
communication capability of the CST (Section 1.2.2). The second direction examines
strategies for configuring the CST to perform a set of communications. The third
direction grapples with the issue of implementing dynamically reconfigurable models such as the R-Mesh and the LR-Mesh (see Section 1.2.4). The fourth direction
is a practical study of the cost-benefit tradeoff of various dynamic reconfiguration
features in the setting of an FPGA-like device.
1.3.1 Communication Capability of the CST
This direction studies the problem of scheduling communication sets on the CST so
that each step of this schedule accommodates a width-1 communication set. In Chapter 3 we first prove that a width-w communication set requires at least w steps to
schedule on the CST. In fact, this chapter deals primarily with width-w communication sets that can be scheduled in w steps (that is, width partitionable sets). Then we
show that three important classes of communication sets possess this property, namely (a) those with "disjoint
incompatibles" (see Figure 1.5(a)), (b) oriented well-nested sets (see Figure 1.3(a)),
and (c) oriented monotonic sets (see Figure 1.3(b)). As a special case of the second result, we show that the set of communications that can be performed in one step on a segmentable bus (see Section 1.2.3) can
be scheduled in two steps on the CST. This result implies that the communication
ability of the bit-model HV-R-Mesh [4], a special case of the bit-model R-Mesh [21],
can be emulated by an SRGA-like architecture without significant overhead. Also,
as a special case of the third result, we show that the communications of a uniform
hypercube [50] can be scheduled optimally on the CST.
Chapter 4 considers communication classes that are not necessarily width partitionable. We derive the minimum requirement for a communication set to be not
width partitionable. Specifically, we prove that for a communication set to be not
width partitionable, the set must be at least of width 2 and have at least five communications, three source incompatibles, and three destination incompatibles. Further,
we prove that only two such "simplest sets" are possible (to within isomorphism).
Figure 1.5(b) shows one of these sets. This set requires one extra step (beyond its
width) to schedule on the CST. In general, we show that there exists a width-w set
that is not width partitionable and which requires $\Omega(w)$ extra steps. Recall that in
Chapter 3 we prove that oriented well-nested sets and oriented monotonic sets are
width partitionable. If we allow these sets to be non-oriented (see Figure 1.4), then
they are not necessarily width partitionable. We establish this by constructing a
non-oriented well-nested set and a non-oriented monotonic set whose incompatibility
graphs are both the same as the one in Figure 1.5(b).
1.3.2 Configuring the CST
The work in Chapters 3 and 4 deals with converting a communication set into a series
of width-1 communication sets. In Chapter 5 we deal with the issue of configuring
the CST switches to accommodate any given communication set of width 1.
We first identify a class of communication sets (called edge-exclusive sets) for
which the CST can be configured in one step (at run time). The idea for configuring
the tree is to translate the local information at processors to global information that
represents the connectivity of the communication set. Next we present an algorithm
to decompose any width-1 communication set into at most three "edge-exclusive"
sets. (In general, the decomposition algorithm works at compile time.) Thus, any
width-1 communication set can be performed on the CST in at most three steps.
Since an edge-exclusive set can be accommodated on a CST with half-duplex links,
a half-duplex CST can simulate a full-duplex CST in at most three steps.
Chapter 6 deals with a particular communication set, namely that of a segmentable
bus. We give methods to dynamically configure CST switches to implement the functionality of a segmentable bus (see Section 1.2.3). As in Chapter 5, the idea is to
translate the local information at processors to global information that represents the
connectivity of a segmentable bus. We present two approaches. The first is suitable
for large processors of word-size $\Theta(\log N)$ bits in which one "step" (cycle) can accommodate $\Theta(\log N)$ gate delays. This approach emulates each step of a segmentable
bus in $O(1)$ steps. Although the main idea is similar to the one in Chapter 5, the
first approach exploits features specific to the segmentable bus and does not use any
decomposition algorithm. The second approach is suitable for smaller processors of
word-size $\Theta(\log k)$ bits, where $\log\log N \leq \log k \leq \log N$, and emulates a segmentable
bus step algorithmically using a normalized tree algorithm [24] in $O(\log_k N)$ steps.
1.3.3 Implementing R-Mesh-Type Models
As noted earlier, the main obstacle to implementing an R-Mesh (or other related
models) is the bus delay, which these models assume to be constant. In Chapter 7 we
introduce a new measure for the bus delay called the bends-cost measure. We show
that there exists an LR-Mesh implementation for which the bends-cost is a faithful
measure of the actual bus delay. This "bends-cost LR-Mesh" implementation uses
the segmentable bus derived in Chapter 6 as a building block. Then we describe
methods to use the bends-cost measure in algorithm design. Let $\Delta$ denote the delay
of an N-processor segmentable bus. We prove that for any delay $D \geq \Delta$, a $\Theta(N) \times \Theta(N)$ bends-cost
LR-Mesh can find the prefix sums of N bits or sort N elements in
$O\left(\frac{\log N}{\log D - \log \Delta}\right)$ time. A similar result for adding N numbers of $b = O(\log N)$ bits each runs
in the same time but on a $\Theta(Nb) \times \Theta(Nb)$ bends-cost LR-Mesh. These processor
resources are within a constant factor of the original (unit-cost) LR-Mesh algorithms
on which they are based. In particular, if $D = N^{\epsilon}$, $0 < \epsilon < 1$, then our algorithms
have the same running time as on the idealized LR-Mesh, but run on an implementable
platform. The ideas used to achieve these results apply to a large class of algorithms
that use "incremental buses" (defined formally in Section 7.1). That is, we establish
that any T-step (unit-cost) LR-Mesh algorithm using incremental buses runs on
the bends-cost LR-Mesh in $O\left(T \, \frac{\log N}{\log D - \log \Delta}\right)$ time using buses of delay at most D.
We then further generalize this result for "semi-monotonic buses," a large class of
bus configurations, and prove that any T-step (unit-cost) LR-Mesh algorithm using
semi-monotonic buses runs on the bends-cost LR-Mesh in $O\left(T \left(\frac{\log N}{\log D - \log \Delta}\right)^{2}\right)$ time using
buses of delay at most D.
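To see why a bus delay of $D = N^{\epsilon}$ is enough for a constant-factor slowdown, the bound for incremental buses can be worked out explicitly. This is only a sketch: it takes the bound to be $O\left(\frac{\log N}{\log D - \log \Delta}\right)$ per unit-cost step, and it makes the additional assumption, not stated above, that $\Delta$ grows slowly enough that $\log \Delta / \log N \to 0$ (for example, $\Delta$ polylogarithmic in N).

```latex
\[
\frac{\log N}{\log D - \log \Delta}
  \;=\; \frac{\log N}{\epsilon \log N - \log \Delta}
  \;=\; \frac{1}{\epsilon - \dfrac{\log \Delta}{\log N}}
  \;\longrightarrow\; \frac{1}{\epsilon}
  \qquad \text{as } N \to \infty .
\]
```

Under these assumptions the slowdown per step is $O(1/\epsilon)$, which is a constant for any fixed $\epsilon$ with $0 < \epsilon < 1$.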
In Chapter 8 we consider the computational power of the bends-cost LR-Mesh.
We show that an arbitrary configuration of an $N \times N$ unit-cost LR-Mesh can be

 2
simulated in O log Dlog Nlog  time on an O DN  O DN 2 bends-cost LR-Mesh whose
buses have a delay of at most D. In other words, if $D = N^{\epsilon}$ (where $0 < \epsilon < 1$), then
the bends-cost LR-Mesh can emulate any step of a unit-cost LR-Mesh in constant
time; that is, the bends-cost LR-Mesh is equal in power to the unit-cost LR-Mesh.
1.3.4 Cost-Benefit Tradeoff Study
In Chapter 9 we combine ideas from the bends-cost LR-Mesh (see Section 1.3.3), the
CST (see Section 1.2.2), and the SRGA (see Section 1.2.1) to construct the "Enhanced
SRGA" (E-SRGA) architecture. The E-SRGA adds dynamic reconfiguration features
to the SRGA platform and can be viewed as a possible implementation of a bit-model bends-cost LR-Mesh. These features include the ability to connect its PEs
in rows/columns as a segmentable bus using local data and the ability of each PE
to configure its switches directly on the basis of local data. These features are in
addition to the SRGA's ability to connect pairs of PEs.
We have coded the E-SRGA in VHDL and synthesized the architecture using a 0.5
micron library of standard cells from AMI. The Leonardo Spectrum synthesis tool
was used for the synthesis and optimization of the architecture. A C program was
written to automate the implementation of E-SRGAs of different sizes. We conducted
experiments to ascertain the cost-benefit tradeoff of these dynamic reconfiguration
features.

1.4 Contributions of this Work
Dynamic reconfiguration has provided platforms capable of very fast solutions to
many problems. However, models of dynamic reconfiguration such as the R-Mesh
are difficult to realize because of the constant delay assumption that is central to
such models. This work bridges the gap between theory and practice. Significant
contributions have been made towards translating theoretical algorithms to practical
solutions. Many aspects of dynamic reconfiguration have been examined as explained
below.
One important component of our work is the CST (see Section 1.2.2). Chapters 3
and 4 present a formal study of the communication capability of the CST and provide a better understanding of the capabilities and the limitations of this important
structure. In Chapter 3 we show that some interesting communication classes can
be scheduled optimally on the CST. In particular, we show that the communications
of a step of the segmentable bus can be accommodated in at most two steps on the
CST. This result is significant as the segmentable bus is one of the most fundamental
components of a dynamically reconfigurable architecture. Not all communication sets
are width partitionable (see Figure 1.4). Chapter 4 deals with such communication
sets. Our work here provides a clearer understanding of some conditions under which
a communication set is not width partitionable.
The analysis of the CST introduces new concepts and methodologies whose utility
extends far beyond the CST and the communication sets constructed. That is, the
concepts presented here are general enough so that they can be used in analyzing
other interconnection networks (not necessarily the CST) and communication sets.
In Chapter 5 we study the problem of configuring the CST. Configuring the CST
switches based on local data to reflect a global context is the essence of dynamic
reconfiguration. This important issue is not considered by Sidhu et al. [43], who originally proposed the SRGA, which is based on the CST. As we noted in Section 1.3.2,
we propose a method to decompose any width-1 communication set into at most three
edge-exclusive sets (for which CST switches can be configured at run time).
The results of Chapters 3, 4, and 5 provide a comprehensive set of results to
perform virtually any set of communications on the CST. If the communication set
is width partitionable, then Chapter 3 gives means to schedule the set; i.e., breaks
it up into width-1 sets. For non-width partitionable sets, Chapter 4 provides the
same means. The work of Chapter 5 allows the communications of these width-1
sets to be actually performed on the CST in at most three steps. Thus, for example,
communications of every width partitionable set of width w can be performed on the
CST in at most 3w steps.
Chapter 6 uses the CST to implement segmentable buses. As noted earlier, a
segmentable bus is a fundamental dynamically reconfigurable structure. We present
two approaches. One is suitable for large processors of size $\Theta(\log N)$ bits, while the other is
suitable for smaller processors of size $O(\log k)$ bits, where $\log\log N \leq \log k \leq \log N$.
Collectively, the two approaches allow all levels of processor granularity to adopt
the segmentable bus structure, varying from an FPGA-type structure to a mesh of
processors.
An important contribution of Chapter 7 is the new measure of bus delay called
bends-cost. This new measure allows the algorithm designer to estimate bus delay
accurately, yet abstract away from hardware details. It also provides a new approach
to designing R-Mesh algorithms in which the designer carefully factors in the number
of bends in buses used by the algorithm.
The bends-cost LR-Mesh implementation of the LR-Mesh validates the bends-cost measure. It has independent value as an LR-Mesh realization as well. Then, we
present algorithms to simulate the unit-cost LR-Mesh with semi-monotonic
configurations on the bends-cost LR-Mesh. Our main result shows that if $D = N^{\epsilon}$ for
an arbitrarily small constant $\epsilon > 0$, then the running times of the bends-cost LR-Mesh
algorithms are within a constant factor of their ideal (unit-cost) LR-Mesh counterparts.
This is the first general result that admits constant-time algorithms on reconfigurable
models without resorting to the use of the unit-cost measure for bus delay. Our
approach also opens the door to translating the large body of fast LR-Mesh algorithms
to run on a more practical platform.
Chapter 8, which deals with the computational power of the bends-cost LR-Mesh,
serves to show that with a delay of $D = N^{\epsilon}$ (where $0 < \epsilon < 1$) the bends-cost LR-Mesh
can compute anything the idealized unit-cost LR-Mesh can compute without significant loss of speed.
The practical study of Chapter 9 brings together many of the concepts explored
in previous chapters in a practical FPGA-type setting. The E-SRGA architecture
that we propose uses an interconnection network that occupies only about 6% of
the chip area while accommodating a rich array of communication patterns. Thus
it provides a higher functional density than typical FPGAs. While the E-SRGA is
well suited for algorithmic solutions to problems, it is not as nimble as an FPGA for
implementing circuits. Most importantly, the results here point to the feasibility of
the ideas proposed in previous chapters.


1.5 Organization of the Dissertation
In the next chapter we present some preliminary concepts and definitions. The communication capability of the CST is analyzed in Chapter 3 (for width partitionable
sets) and in Chapter 4 (for non-width-partitionable sets). In Chapter 5 we address
the issue of configuring the CST for any width-1 communication set. Chapter 6 deals
with segmentable bus implementations. Chapter 7 presents the bends-cost LR-Mesh
and its simulation of the unit-cost LR-Mesh. Chapter 8 addresses the computational
power of the bends-cost LR-Mesh. Chapter 9 describes the cost-benefit tradeoff of
the E-SRGA architecture. Chapter 10 summarizes our results and identifies several
open problems.

Chapter 2
Preliminaries
In this chapter we introduce some basic ideas and definitions used throughout the
dissertation. In the next section we present the circuit switched tree (CST). In Section 2.2 we describe the segmentable bus, a fundamental reconfigurable structure.
Section 2.3 presents the R-Mesh model. Section 2.4 introduces the LR-Mesh, an important restriction of the R-Mesh. Finally, Section 2.5 discusses the notion of bus
delay.

2.1 The Circuit Switched Tree
The circuit switched tree (CST) is a balanced binary tree whose leaves are PEs and
whose internal nodes are switches (see Figure 2.1). Each switch has a full-duplex
link to its parent (if any) and to each of its two children. (A full-duplex link can carry information
in both directions simultaneously.) The switch can be configured to connect its
parent and children in various ways. Figure 2.2 shows representative configurations.
Additional configurations can be obtained from those shown in the figure by symmetry
and rotation. Some of these configurations are simple extensions of those used in the
SRGA architecture to include broadcasting. Observe that while an incoming link
can connect to two outgoing links (for broadcasting), two incoming links cannot both
lead to the same outgoing link (concurrent writes are not permitted; we relax this
assumption in Section 6.2.3 and allow concurrent writes). Also, a switch
cannot connect an incoming link to an outgoing link on the same "side" of the switch.
This ensures that for a tree with N leaves (PEs), every communication will traverse
no more than $2 \log N$ switches. Any pair of leaves (PEs) connected by a dedicated
path through the switches can communicate in one step.

Figure 2.1: A sample set of communications, panels (a) and (b). Sources, si, and destinations, di, are
shown as white and black circles, respectively. A PE could be both a source and a
destination, or neither (shown shaded in grey).
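The two restrictions on switch configurations stated above can be written as a small validity check. The encoding is an assumption made for illustration only: a switch has three sides labeled P (parent), L (left child), and R (right child), and a configuration is a set of (incoming side, outgoing side) connections. The names are hypothetical and do not reflect the switch's actual control encoding.

```python
SIDES = ("P", "L", "R")   # parent, left child, right child (assumed labels)

def is_valid_switch_config(connections):
    """Check a set of (incoming_side, outgoing_side) pairs against the two rules.

    Rule 1: an outgoing link is driven by at most one incoming link
            (no concurrent writes), although one incoming link may drive
            two outgoing links (broadcast).
    Rule 2: an incoming link is never connected to the outgoing link on
            the same side of the switch.
    """
    driven = set()
    for incoming, outgoing in set(connections):
        if incoming not in SIDES or outgoing not in SIDES:
            return False
        if incoming == outgoing:
            return False          # violates rule 2
        if outgoing in driven:
            return False          # violates rule 1
        driven.add(outgoing)
    return True

print(is_valid_switch_config([("L", "P"), ("P", "R")]))  # True: two one-way paths
print(is_valid_switch_config([("L", "P"), ("L", "R")]))  # True: broadcast from L
print(is_valid_switch_config([("L", "P"), ("R", "P")]))  # False: two writers on P
print(is_valid_switch_config([("L", "L")]))              # False: same side
```

Because a path can never leave a switch on the side it entered, it goes up the tree and then down without revisiting a switch, which is what bounds its length by $2 \log N$ switches.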
In addition to the data links, there is a control line between each node and its
parent that conveys control symbols from the switch to its parent. Control symbols
are used to configure the switches.
Figure 2.3 shows the internal structure of the switch. The connection unit (the box
labeled C) is combinational logic that connects the appropriate data inputs to the
data outputs to achieve the configurations illustrated in Figure 2.2. The control unit
supplies an input to the connection unit that selects one of these configurations.
It should be pointed out that the CST is very different from a traditional point-to-point tree topology, where each node is a processor that stores and forwards packets
along the correct path. In contrast, the CST switches consist of combinational logic
and establish dedicated paths between leaves of the tree (as in circuit switching).
Two paths can be used simultaneously only if they have no directed tree edges in common.
In the next few chapters we will consider two broad issues regarding the CST.
The first addresses the ability of the CST to simultaneously accommodate many one-to-one communications (Chapters 3 and 4). For example, the communications of
Figure 2.1(a) can be accommodated simultaneously on the CST because no two communications use the same directed edge, whereas the communications of Figure 2.1(b)
cannot be accommodated simultaneously on the CST because communications c1 and
c2 use a common edge. However, the communications of Figure 2.1(b) can be accommodated in two steps (scheduled). We will use the term scheduling a communication
set to refer to the need for more than one step to accommodate the communications.
This issue of accommodating (scheduling) communication sets on the CST does not
reflect the complexity of generating control information to configure the connection
unit within each switch. The second issue addresses this point. Therefore we will
make a distinction between accommodating (or scheduling) a set of communications
and performing that set of communications.

Figure 2.2: Some CST switch configurations
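To make the notion of scheduling concrete, the sketch below splits a communication set into steps so that each step is a width-1 set. It uses a plain first-fit greedy rule, which is not the scheduling method developed in Chapters 3 and 4 and may use more steps than the width of the set. The leaf numbering (0 to N-1, N a power of two, heap-style node numbers) and the function names are assumptions for illustration.

```python
def directed_edges(src, dst, n_leaves):
    """Set of directed CST edges on the path from leaf src to leaf dst.

    Assumed numbering: root is node 1, children of v are 2v and 2v+1,
    leaf i is node n_leaves + i.
    """
    up, down = n_leaves + src, n_leaves + dst
    edges = set()
    while up != down:
        edges.add(("up", up))       # child-to-parent edge leaving node `up`
        edges.add(("down", down))   # parent-to-child edge entering node `down`
        up //= 2
        down //= 2
    return edges

def greedy_schedule(comms, n_leaves):
    """First-fit: place each communication in the earliest step it does not conflict with."""
    steps = []                      # list of (communications, directed edges in use)
    for comm in comms:
        edges = directed_edges(comm[0], comm[1], n_leaves)
        for step_comms, used in steps:
            if used.isdisjoint(edges):
                step_comms.append(comm)
                used |= edges       # in-place update of this step's edge set
                break
        else:
            steps.append(([comm], set(edges)))
    return [step_comms for step_comms, _ in steps]

print(greedy_schedule([(0, 4), (1, 5), (2, 3)], 8))
# [[(0, 4), (2, 3)], [(1, 5)]]
```

The output lists which communications are accommodated in each step; actually performing a step still requires configuring the switches, which is the second issue raised in the text.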

2.2 Segmentable Bus
The segmentable bus is one of the most fundamental structures in reconfigurable
computing. In Chapter 6 we present methods to implement segmentable buses using
binary trees. Functionally speaking, an N-element segmentable bus has the structure shown in Figure 2.4. Each processor is connected to a bus by a bidirectional
(read/write) port and controls a segment switch that is placed on the bus. Each
segment switch can be in the "open" or "closed" state. When open, a segment switch
cuts the bus at the point at which it is placed; otherwise, the switch is closed and the
bus passes through it seamlessly. A bus configuration is a set of segment switch states.
Each segment switch is controlled by a processor (the one to its right in Figure 2.4).
Since each segment switch can be controlled independently, numerous bus configurations are possible. An unsegmented portion of the bus between two open segment
switches is called a bus segment. Figure 2.5 shows an example bus configuration of
an 8-element segmentable bus with three bus segments.

Figure 2.3: Internal structure of the CST switch (a control unit, a connection unit C on the data paths, and control information lines)
Figure 2.4: Structure of an 8-processor segmentable bus (processors 0 to 7)
Figure 2.5: A configuration of an 8-processor segmentable bus
The segmentable bus architecture, like other reconfigurable architectures, is synchronous. At any given step each processor could perform the following actions:
(1) open or close its segment switch, (2) read from or write to its bus segment, and
(3) perform a local computation.
The decision to open or close the segment switch is based entirely on local information. Thus, each processor can independently control its segment switch. The reading
or writing on a segment could be exclusive or concurrent. In an exclusive read (resp.,
write), a segment can have only one reader (resp., writer). In a concurrent read (resp.,
write) segmentable bus, a segment can have multiple readers (resp., writers). As in a
PRAM [20], we use ER, CR, EW, and CW to denote exclusive read, concurrent read,
exclusive write, and concurrent write. Thus, for example, a segmentable bus with
concurrent read and exclusive write is called a CREW segmentable bus. Again as in
a PRAM with concurrent writes, the segmentable bus uses a rule to determine the
values written to the bus. In this dissertation we consider the Common, Collision,
Collision+, Priority, and Arbitrary rules. In the Common rule, all the values
written to any bus segment must be the same. Under the Collision rule, a collision
symbol is written on any segment with multiple writers. The Collision+ rule is the
same as Common if all the values written to the segment are the same; otherwise,
a collision symbol is written to the segment. The Priority rule assigns a fixed
priority (usually the index) to each processor and allows the highest priority writer
to write its value to the segment. Finally, the Arbitrary rule selects any one writer
to write to the segment.
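The five concurrent-write rules can be summarized as a single resolution function. This is a sketch under stated assumptions: the collision symbol is represented by a placeholder string, the Priority rule treats the lowest processor index as the highest priority, and the Arbitrary rule simply takes the first writer listed. None of these choices is fixed by the text, and the names are hypothetical.

```python
COLLISION = "#"   # placeholder for the collision symbol (an assumption)

def resolve_write(writes, rule):
    """Value that appears on one segment, given its writers.

    writes: non-empty list of (processor_index, value) pairs for the segment.
    rule: "COMMON", "COLLISION", "COLLISION+", "PRIORITY", or "ARBITRARY".
    """
    values = [value for _, value in writes]
    if len(writes) == 1:
        return values[0]                       # a single writer always succeeds
    all_equal = all(value == values[0] for value in values)
    if rule == "COMMON":
        if not all_equal:
            raise ValueError("Common rule requires all written values to agree")
        return values[0]
    if rule == "COLLISION":
        return COLLISION                       # any multiple write collides
    if rule == "COLLISION+":
        return values[0] if all_equal else COLLISION
    if rule == "PRIORITY":
        # assumed convention: the lowest index has the highest priority
        return min(writes, key=lambda write: write[0])[1]
    if rule == "ARBITRARY":
        return values[0]                       # any one writer is acceptable
    raise ValueError("unknown rule: " + rule)

print(resolve_write([(3, 5), (1, 7)], "PRIORITY"))    # 7
print(resolve_write([(3, 5), (1, 5)], "COLLISION+"))  # 5
print(resolve_write([(3, 5), (1, 7)], "COLLISION+"))  # #
```

The function resolves one segment at a time, since writers on different bus segments never interact.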
The segmentable bus can also be viewed as shown in Figure 2.6, where each processor has two ports (East and West) and an internal switch. By closing (resp.,
opening) its internal switch, each processor can connect (resp., disconnect) its two
ports, forming segments of ports. The structure shown in Figure 2.6 is also known as
a one-dimensional R-Mesh.
The segmentable bus assumes that two processors on the same bus segment can communicate in one step. In
other words, the segmentable bus assumes a unit-cost bus delay. As a reconfigurable
model, it could use any of the bus delay measures described in Section 2.5. As
we explained before, in Chapter 6 we present methods to implement segmentable
buses. If the implementation takes s steps to perform the functionality of a 1-step
segmentable bus, then we say that the implementation of the segmentable bus runs
in s steps.

Figure 2.6: Another representation of an N-processor segmentable bus (processors 0 to $N-1$, each with a West port, an East port, and a segment switch)
Figure 2.7: Example of buses in a $3 \times 5$ R-Mesh and LR-Mesh: (a) R-Mesh; (b) LR-Mesh

2.3 The Reconfigurable Mesh
An $R \times C$ reconfigurable mesh or R-Mesh [32] consists of an R-row, C-column array
of processors connected by an underlying mesh (see Figure 2.7). Number the rows
(resp., columns) $0, 1, \ldots, R-1$ (resp., $0, 1, \ldots, C-1$). Each processor has four ports
(called North, South, East, and West ports in the obvious manner, and abbreviated
N, S, E, and W). Each processor can independently partition its ports to connect
certain ports together leaving other ports unconnected. For example, the top left
processor of Figure 2.7(a) connects its N port to its S port, and its E port to its W
port. The corresponding partition is denoted by $\{\overline{NS}, \overline{EW}\}$. Figure 2.7(a) shows
the fifteen possible port partitions of the R-Mesh. An assignment of a port partition
to each R-Mesh processor is called a configuration. Figure 2.7 shows two different
configurations. The port partitions along with the underlying mesh connections between
neighboring processors form buses connecting processors. Figure 2.7(b) shows
buses in bold, dashed, and dotted. An assumption central to all traditional R-Mesh
algorithms is that buses have constant delay, regardless of the number of processors
they span. An R-Mesh making this assumption is called a "unit-cost R-Mesh." While
this assumption enables us to design very fast algorithms, it makes it very difficult
to implement such a model.
At each step of an R-Mesh algorithm, a PE could perform the following actions:
(1) configure (partition) its ports, (2) read from and write to its ports, and (3) perform
a local computation. As in a segmentable bus, an R-Mesh could permit concurrent
reads and writes. If more than one processor is allowed to write to a bus at the same
time, then the R-Mesh has concurrent write ability and the concurrent write rules
(described in the previous section) are used to resolve the values written to the bus.

2.4 The LR-Mesh
In Chapter 7 we consider a restricted version of the R-Mesh called the Linear R-Mesh
or LR-Mesh [3, 17] (see Figure 2.7(b)) whose buses are linear (non-branching); that
is, an LR-Mesh processor cannot use the five partitions $\{\overline{NSEW}\}$, $\{N, \overline{SEW}\}$,
$\{S, \overline{NEW}\}$, $\{E, \overline{NSW}\}$, and $\{W, \overline{NSE}\}$ in the shaded processors of Figure 2.7(a). Notwithstanding this restriction, the LR-Mesh can generate an exponential number of different buses among its processors and solve many problems extremely quickly. Indeed, most R-Mesh algorithms run on the LR-Mesh. The counting
algorithm presented in Section 1.2 is an example of an algorithm that runs on both
an R-Mesh and an LR-Mesh.
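
The port-partition counts quoted in Sections 2.3 and 2.4 can be checked with a small enumeration. The sketch below is only an illustration: the function name `set_partitions` and the test "every block has at most two ports" (used here as the reading of "linear, non-branching") are assumptions introduced for this example, not notation from the text.

```python
def set_partitions(items):
    """Yield every partition of `items` as a list of blocks (lists)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Either `first` forms a block of its own ...
        yield [[first]] + partition
        # ... or it joins one of the existing blocks.
        for i, block in enumerate(partition):
            yield partition[:i] + [[first] + block] + partition[i + 1:]


PORTS = ["N", "S", "E", "W"]

# All ways a processor can group its four ports (R-Mesh).
rmesh_partitions = list(set_partitions(PORTS))

# Assumed reading of "linear": no block connects three or more ports (LR-Mesh).
lrmesh_partitions = [p for p in rmesh_partitions
                     if all(len(block) <= 2 for block in p)]

print(len(rmesh_partitions), len(lrmesh_partitions))
```

Under these assumptions the enumeration yields fifteen partitions in total, of which five contain a block of three or four ports, matching the five partitions excluded on the LR-Mesh.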

2.5 Bus Delay
A traditional bus (see Figure 2.8) is a set of wires with multiple taps connecting
processors to it. Each processor incident on the bus has a read port and a write port
connecting it to the bus. Capacitive loading [44] due to a large number of taps causes
the bus to use a reduced data rate.

Figure 2.8: Structure of a traditional bus

Figure 2.9: A segmentable bus with all processors connected (processors 0 to 7)
On the other hand, a reconfigurable bus connects elements (processors) in different
ways at different steps. The set of elements it connects can change at each step.
The bus uses switches to change its structure to achieve the desired connectivity.
Therefore, switches are located on the data path, forming a combinational circuit
rather than that shown in Figure 2.8. Consider the simple case of a segmentable bus all
of whose processors are connected (see Figure 2.9) and a segmentable bus all of whose
processors are disconnected (see Figure 2.10). To achieve different configurations for
the segmentable bus, switches have to be closed using logic gates on the data path.
Figure 2.11 shows a data path between processors 0 and 5. The AND gates on the
data path represent a combinational circuit. Because there are relatively few taps
between two successive gates, capacitive loading is not a big problem. The primary
concern is the switch delay of the longest path of this circuit. A large delay forces a
reduction in the data rate as a bus should not be reconfigured before the current bus
cycle has been completed. A conventional implementation of a general reconfigurable
bus (see Figure 2.11) spanning N processors has a combinational path with $\Theta(N)$
gate delays.
For other technologies, for example with reconfigurable optical buses, switches
are placed on the optical path. Each switch attenuates the optical signal, requiring
detectors to be illuminated for longer periods of time to ensure reliable operation.
Once again the effect is a longer bus cycle. Thus the net effect of buses spanning a
large number of processors is a reduction in data rate. We use the term bus-delay to
capture this detrimental effect.

Figure 2.10: A segmentable bus with all processors disconnected (processors 0 to 7)

Figure 2.11: A bus represented as a combinational circuit (processors 0 to 7)
A good algorithm on a reconfigurable model should use buses with small delay.
Because buses can take numerous shapes and forms, it is very difficult to accurately
ascertain bus-delay. Consequently, bus-delay measures have to be used as approximations
of the actual delay. These include the linear-cost measure, unit-cost measure,
and logarithmic-cost measure. With the linear-cost measure, a bus spanning N processors
has a delay of $\Theta(N)$. While this measure is quite accurate, it renders most
reconfigurable models' algorithms too slow for practical use. Most work on reconfigurable
models assumes the unit-cost measure [32] that assumes a bus to have constant
delay, regardless of the number of processors it spans. Clearly, this measure does not
reflect reality. A more conservative log-cost measure [32] assigns a $\log N$ delay to a
bus spanning N processors. While this measure is reasonable for a fixed bus, it does
not capture the complexities arising from the ability of a reconfigurable model to
configure its buses in an exponential number of ways. In Chapter 7 we introduce a
new measure for bus delay called bends-cost that accurately represents the actual bus
delay and yet provides the abstraction needed for convenient design.

Chapter 3
CST Communication---Width Partitionable Sets
In this chapter we study the circuit switched tree (CST) interconnect of the SRGA
architecture of Sidhu et al. [40, 41, 42, 43] (see also Section 2.1). The CST is a
balanced binary tree with PEs (or processors) at its leaves and switches at its internal
nodes. These switches can be configured to establish dedicated directed paths between
pairs of leaves. At most one path may use a tree edge in any given direction (child
to parent or parent to child).
In this chapter, we consider sets of one-to-one communications between leaves of
the CST and study properties that allow communications from a set to be accommodated on the CST (see the distinction between "performing" and "accommodating/scheduling," page 20).
We first derive a condition under which pairs of processors can communicate
simultaneously on the CST. Then we introduce a quantity called the "width" of the
communication set and use it to derive a necessary condition for any set of k one-to-one
communications to be scheduled in t steps (where $1 \le t \le k$) on the CST.
This necessary condition is also sufficient if the communication set has a property
that we call width partitionability. We show that the class of communication sets
with disjoint incompatibles (see Section 1.2 for an intuitive definition) possesses this
property. We then identify three conditions that can be used to construct other
classes of communication sets: (1) capping, (2) concatenation, and (3) interleaving.
We use these conditions to construct two important classes of communication sets
called (a) oriented, well-nested sets, and (b) oriented, monotonic sets. We prove
these sets (of communications) to be width partitionable. The set of communications
that can be accommodated in one step on a segmentable bus (see Section 2.2) is a
special case of the "non-oriented," well-nested sets. We apply our results on oriented,
well-nested sets to show that a CST can emulate a step of a segmentable bus in two
steps. Oriented monotonic sets represent a rich array of communications, including
those of a uniform hypercube [50].
Although the work here is motivated by the interconnect structure of the SRGA
architecture, the results and techniques could be of interest in a general FPGA-type
setting in which the interconnection fabric can be configured to establish various
connection patterns. It should be noted that in this chapter the analysis of the
communication capability of the CST does not consider the switch configurations
needed to perform these communications. Chapter 5 deals with that issue (see also
Section 2.1).
In the next section we derive a lower bound on the number of steps needed to
schedule a set of communications on the CST and identify a property of the communication set for which this lower bound can be met. Sections 3.2--3.5 deal with three
classes of communication sets that possess the above property. Section 3.6 deals with
segmentable bus communications. Section 3.7 summarizes our results in this chapter
and makes some concluding remarks.

3.1 Communicating over the CST
In this section we formally define the notion of communication width and prove that
the CST requires at least t steps to schedule all communications from a width-t
communication set. Next, we identify a property of the communication set, called
width partitionability, that allows a width-t communication set to be scheduled on
the CST in t steps. That is, the lower bound imposed by the width can be achieved
for width partitionable communication sets. We first introduce some definitions.
Represent a CST as an N-leaf tree with A denoting its set of leaves (PEs). To
account for the full duplex links of the CST, replace each tree edge by two oppositely
directed edges. For the following definitions, let T denote this "directed tree." For
any internal node u of T, let $\ell evel(u)$ denote its level; the leaves are at level 0 and
the root is at level $\log_2 N$. For example, the node labeled v in Figure 3.1 is at level
2. For a set $S \subseteq A$ of sources and a set $D \subseteq A$ of destinations, a set of k one-to-one
communications, $(x, x')$ where $x \in S$ and $x' \in D$, is simply a pairing of the elements
of S and D (in Figure 3.1, $k = 5$, $S = \{a, b, c, d, e\}$ and $D = \{a', b', c', d', e'\}$). We
note that a leaf of T could be both a source and a destination (for example, see
Figure 2.1(a), page 19). Source $x$ and destination $x'$ of communication $(x, x')$ are
said to form a "matching" source-destination pair.

Figure 3.1: An example of a communication set. Each source-destination pair is labeled $(x, x')$, where $x \in \{a, b, c, d, e\}$. Sources are unshaded circles while destinations are shaded
For a set $X \subseteq A$ of leaves of T, let $\ell ca(X)$ denote the lowest common ancestor of
all elements of X, and let $\ell(X) = \ell evel(\ell ca(X))$ be the level of this lowest common
ancestor. For instance in Figure 3.1, if $X_1 = \{a, b, e\}$, then $\ell ca(X_1) = v$ and $\ell(X_1) = 2$.
As another example, if $X_2 = \{c', d, d'\}$, then $\ell(X_2) = 3$.
For small sets such as $\{a, b\}$, we will write $\ell ca(\{a, b\})$ and $\ell(\{a, b\})$ without braces
as $\ell ca(a, b)$ and $\ell(a, b)$.
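
The level of a lowest common ancestor is easy to compute once a leaf numbering is fixed. The sketch below assumes, purely for illustration, that N is a power of two, that leaves are numbered $0, 1, \ldots, N-1$ from left to right, and that a tree node is identified by the pair (level, index within that level); the function names are hypothetical.

```python
def lca_level(leaves):
    """Level of the lowest common ancestor of a non-empty collection of leaves.

    Leaves are at level 0. In a complete binary tree with leaves numbered
    left to right, the lowest common ancestor of a set is that of its
    smallest and largest members, and its level is the position of the
    highest bit in which those two leaf numbers differ.
    """
    return (min(leaves) ^ max(leaves)).bit_length()


def lca_node(leaves):
    """Lowest common ancestor as a (level, index) pair."""
    level = lca_level(leaves)
    return level, min(leaves) >> level


print(lca_level([0, 1]), lca_level([2, 5]), lca_node([4, 5, 7]))
```

With this numbering, neighboring leaves 0 and 1 meet at level 1, while leaves 2 and 5 lie in different halves of an 8-leaf tree and meet only at the root (level 3).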
For any communication $c = (x, x')$, the edges from node $x$ (the source of c) of
the directed tree T to node $\ell ca(x, x')$ are called upward edges of c. All these edges
are from a node to its parent. Similarly, the edges of T from node $\ell ca(x, x')$ to the
destination $x'$ are the downward edges. Upward (resp., downward) edges are shown
solid (resp., dashed) in Figure 3.1.

Definition 3.1 Let $X \subseteq S$ be any set of sources. Set X is called a source incompatible
if and only if $\ell(X) < \ell(x, x')$, for each communication $(x, x')$, $x \in X$. Similarly, a set
$Y \subseteq D$ of destinations is a destination incompatible iff $\ell(Y) < \ell(y, y')$, for each
communication $(y, y')$, $y' \in Y$.

In Figure 3.1 the set $\{a, b, e\}$ is a source incompatible. This is because $\ell(a, b, e) = 2$,
which is smaller than $\ell(a, a') = \ell(e, e') = 3$ and $\ell(b, b') = 4$. Similarly, set $\{c', b'\}$ is
a destination incompatible. On the other hand set $\{a, c\}$ is not a source incompatible
as $\ell(a, c) = 4 > \ell(a, a') = \ell(c, c') = 3$. We will use the term incompatible (without
the attribute source or destination) to refer to a source or a destination incompatible.
Intuitively, the CST cannot accommodate two communications simultaneously if their
sources and/or destinations are in the same incompatible; we prove this below in
Lemma 3.1.
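
Definition 3.1 translates directly into a predicate. The following sketch again assumes leaves numbered left to right in a complete binary tree; the 8-leaf communication set used in the example is hypothetical and is not the set of Figure 3.1.

```python
def lca_level(leaves):
    """Level of the lowest common ancestor of leaves numbered left to right."""
    return (min(leaves) ^ max(leaves)).bit_length()


def is_source_incompatible(X, comms):
    """X: list of sources; comms: dict mapping each source to its destination."""
    return all(lca_level(X) < lca_level([x, comms[x]]) for x in X)


def is_destination_incompatible(Y, comms):
    """Y: list of destinations; comms: dict mapping each source to its destination."""
    source_of = {dst: src for src, dst in comms.items()}
    return all(lca_level(Y) < lca_level([source_of[y], y]) for y in Y)


# Hypothetical example on an 8-leaf tree: three one-to-one communications.
comms = {0: 4, 1: 5, 2: 3}
print(is_source_incompatible([0, 1], comms))       # both paths leave the level-1 subtree
print(is_source_incompatible([0, 2], comms))       # (2, 3) turns around below lca(0, 2)
print(is_destination_incompatible([4, 5], comms))
```

In this example sources 0 and 1 share a parent at level 1, and both of their communications climb to the root, so $\{0, 1\}$ is a source incompatible; source 2 communicates within its own level-1 subtree and therefore conflicts with neither.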
Remarks: We say that communications $c_x = (x, x')$ and $c_y = (y, y')$ are incompatible
iff $\{x, y\}$ or $\{x', y'\}$ is an incompatible.

Definition 3.2 An incompatible I is maximal if no superset of I is an incompatible.
A maximal incompatible I is maximum if no incompatible has more elements than I.

Note that while a maximum incompatible is always maximal, a maximal incompatible could contain as few as a single element and, therefore, need not be maximum.
Definition 3.3 The width of a set of communications is the size of its maximum
incompatible.

The communication set of Figure 3.1 has source incompatibles $\{a, b, e\}$, $\{c\}$, $\{d\}$,
and destination incompatibles $\{a', e'\}$, $\{b', c'\}$, $\{d'\}$. The width of this communication
set is 3 because the maximum incompatible (the set $\{a, b, e\}$) is of size 3. Although
the incompatibles of this example do not contain any common elements, in general
incompatibles need not be disjoint.
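
For small communication sets the width can be found by brute force over subsets of sources and of destinations. This is a minimal sketch under the same assumed leaf numbering as before; it is exponential in the number of communications and is meant only to make Definition 3.3 concrete, not to suggest an efficient method.

```python
from itertools import combinations


def lca_level(leaves):
    """Level of the lowest common ancestor of leaves numbered left to right."""
    return (min(leaves) ^ max(leaves)).bit_length()


def width(comms):
    """Size of a maximum incompatible of comms, a list of (source, destination)."""
    best = 0
    for side in (0, 1):  # 0: source incompatibles, 1: destination incompatibles
        # Pair every endpoint with the lca level of its own communication.
        ends = [(c[side], lca_level(c)) for c in comms]
        # Only sizes larger than the best found so far are worth trying.
        for size in range(len(ends), best, -1):
            if any(all(lca_level([e for e, _ in group]) < lvl for _, lvl in group)
                   for group in combinations(ends, size)):
                best = size
                break
    return best


print(width([(0, 4), (1, 5), (2, 3)]))
print(width([(0, 7), (1, 6), (2, 5), (3, 4)]))
```

In the second (hypothetical) example all four sources sit in the left half of an 8-leaf tree and every communication crosses the root, so the four sources form a single incompatible.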
For convenience we will represent a communication set, C, as an annotated bipartite
graph called an incompatibility graph. The incompatibility graph $G = (V, E)$ has
the set of sources and destinations of C as its set of vertices. Note that since a leaf
(PE) of the CST can be both a source as well as a destination, a leaf could appear
twice in the set V. The edges of G connect source-destination pairs in accordance
with the given communication set C. Arrange nodes of G as sources and destinations
to form a bipartite graph and indicate incompatibles by encircling nodes in the
same incompatible. For example, Figure 3.2 shows the incompatibility graph of the
communication set of Figure 3.1.

Figure 3.2: Communication set with disjoint incompatibles
Since an incompatible is a collection of sources or destinations of communications
that interfere with each other, the width of a set of communications is an important
factor in the amount of time required to schedule the communications. We now derive
a necessary condition for the CST T to schedule k one-to-one communications in t
steps (where $1 \le t < k$). If $t \ge k$, then the communications can be trivially scheduled
one at a time.
Lemma 3.1 The CST requires at least t steps to schedule communications from a
set that has a t-element incompatible.

Proof: Let C denote a communication set. We prove that if C has an incompatible
with t elements, then the CST cannot accommodate the communications associated
with this incompatible in $t - 1$ steps. Without loss of generality, let $S'$ be a source
incompatible with t elements. Since $\ell(S') < \ell(x, x')$ for each $x \in S'$, and since
$\ell ca(S')$ is an ancestor of every $x \in S'$, each communication $(x, x')$ with $x \in S'$ has to
traverse the upward edge between $\ell ca(S')$ and its parent (see Figure 3.3). That is,
all t communications with sources in $S'$ require the link from $\ell ca(S')$ to its parent.
Consequently, they require at least t steps.
In a similar manner, the existence of a destination incompatible $D'$ with t elements
implies that the link from the parent of $\ell ca(D')$ to $\ell ca(D')$ would be used at least t
times.

Figure 3.3: Illustration of the proof of Lemma 3.1.
Corollary 3.2 A width-w set of communications requires at least w steps to be scheduled on a CST.

Only communications from a width-1 communication set can be accommodated
simultaneously on the CST. A width-w set (where $w > 1$) could be partitioned into
width-1 sets $C_1, C_2, \ldots, C_\alpha$ (for some $\alpha \ge w$) so that communications from different
$C_i$'s ($1 \le i \le \alpha$) are accommodated in different steps. Since each $C_i$ has width 1, all
communications in $C_i$ can be accommodated in the same step. Thus the partition
corresponds to an $\alpha$-step schedule for accommodating all communications in set C.
Corollary 3.2 implies that $\alpha \ge w$.
Lemma 3.3 A set S of elements is an incompatible if and only if for all $a, b \in S$,
$\{a, b\}$ is an incompatible.

Figure 3.4: Illustration of the proof of Lemma 3.3 (panels (a) and (b))
Proof: If S is an incompatible, then so is any subset of S. That is, for all $a, b \in S$,
$\{a, b\}$ is an incompatible.
In the other direction, let $S = \{x_0, x_1, \ldots, x_n\}$. We are given that
\[ \text{for all } 0 \le i < j \le n,\ \{x_i, x_j\} \text{ is an incompatible.} \qquad (3.1) \]
Equation 3.1 implies that $S - \{x_0\}$ is an incompatible (by the induction hypothesis).
Equation 3.1 also implies that $\{x_0, x\}$ is an incompatible for all $x \in S - \{x_0\}$.
Let $u = \ell ca(S - \{x_0\})$ and let $v = \ell ca(S)$. Clearly u is a descendant of v (including
v itself). Since $S - \{x_0\}$ is an incompatible, each communication $(x, x')$ (with
$x \in S - \{x_0\}$) has an upward edge from u to parent(u). That is, $\ell evel(u) \le \ell evel(v)$.
We consider two cases.
Case 1 [$\ell evel(u) < \ell evel(v)$]: The situation is shown in Figure 3.4(a). Clearly for
all $0 \le i \le n$, communication $(x_i, x_i')$ has upward edge $\langle v, parent(v) \rangle$. Thus, S is an
incompatible (see proof of Lemma 3.1).
Case 2 [$\ell evel(u) = \ell evel(v)$]: The situation is shown in Figure 3.4(b). For
$1 \le i \le n$, let $\ell ca(x_i, x_0) = w_i$ and let $\ell evel(w_i) = \ell_i$. Without loss of generality, let
$\ell_1 \le \ell_2 \le \cdots \le \ell_n$ (see Figure 3.4(b)). Then, $\ell ca(x_0, x_n) = u$, otherwise $\ell ca(S) \ne u$.
Then, again each communication $(x_i, x_i')$, $1 \le i \le n$, has an upward edge from u to
parent(u). Thus in either case, S is an incompatible.

The necessary condition of Corollary 3.2 applies to any set of one-to-one
communications and the CST. Is this condition sufficient for all one-to-one communications?
In general, the answer is "no" as it is possible for a width-w communication
set to require more than w steps.
Consider the communication set $C = \{(a, a'), (b, b'), \ldots, (e, e')\}$ of Figure 3.5,
whose incompatibility graph is shown in Figure 3.6. Clearly, the width of C is 2.
The only communication that can be scheduled simultaneously with $(b, b')$ is either
$(d, d')$ or $(e, e')$. Since each of $C - \{(b, b'), (d, d')\}$ and $C - \{(b, b'), (e, e')\}$ has width 2,
it follows that C cannot be scheduled in two steps.

Figure 3.5: Width-2 communication set requiring three steps
If a width-w communication set possesses certain properties, however, then it can
be scheduled on the CST in w steps. We present one such property in the following
discussion.
Figure 3.6: Incompatibility graph of the communication set of Figure 3.5

Lemma 3.4 The CST can accommodate communications $c_1 = (x, x')$ and $c_2 = (y, y')$
simultaneously if and only if sets $\{x, y\}$ and $\{x', y'\}$ are not incompatibles.
Proof: Clearly if either $\{x, y\}$ or $\{x', y'\}$ is an incompatible, then the width of the
communication set $\{c_1, c_2\}$ is 2. Lemma 3.1 implies that they cannot be accommodated
simultaneously.
If $\{x, y\}$ is not an incompatible, then both $\ell(x, x')$ and $\ell(y, y')$ cannot be strictly
larger than $\ell(x, y)$. Without loss of generality, let $\ell(x, x') \le \ell(y, y')$. This implies that
$\ell(x, x') \le \ell(x, y)$. Let $\ell ca(x, x') = u$ and $\ell ca(x, y) = v$. Since both u and v are ancestors
of x, either v is an ancestor of u, or $u = v$. We now consider these two cases.
Case 1 [$\ell(x, x') = \ell(x, y)$]: Here $\ell evel(u) = \ell evel(v)$. With the observation made
above, this implies that $u = v$ (see Figure 3.7(a)). Clearly, $y$ and $x'$ are leaves of a
subtree $T'$ rooted at a child w of node $u = v$. Every upward edge of communication
$c_2 = (y, y')$ is either incident on a node of subtree $T'$ or is on the path between w and
the root of the CST T. In contrast, every upward edge of $c_1 = (x, x')$ is on the path
between x and u. Clearly, $c_1$ and $c_2$ have no common upward edges.
Case 2 [$\ell(x, x') < \ell(x, y)$]: Here $\ell evel(u) < \ell evel(v)$, and hence u is a descendant
of v (see Figure 3.7(b)). All upward edges of $c_1 = (x, x')$ are confined to the subtree
$T''$ rooted at u. Since $\ell evel(v) > \ell evel(u)$, node y lies outside subtree $T''$. Consequently, all
upward edges of $(y, y')$ lie outside $T''$. In any case, communications $c_1$ and $c_2$ have
no common upward edges.
By an analogous argument we can use the fact that $\ell(x, x') \le \ell(x', y')$ to establish
that communications $c_1$ and $c_2$ have no common downward edges. Thus, the CST
can accommodate communications $c_1$ and $c_2$ simultaneously.

Figure 3.7: Illustration of the proof of Lemma 3.4 (panels (a) and (b))
An obvious consequence of Lemma 3.4 is the following result.
Corollary 3.5 For any $k \ge 1$, the CST can simultaneously accommodate a set C of k
one-to-one communications if and only if for any two communications $(x, x'), (y, y') \in C$,
sets $\{x, y\}$ and $\{x', y'\}$ are not incompatibles.
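
Lemma 3.4 can be checked numerically on a small tree by comparing two independent tests: whether two communications use a common directed edge, and whether their sources or destinations form an incompatible. The sketch below assumes leaves numbered left to right, nodes written as (level, index), and an edge named by its direction and its child endpoint; these encodings and the function names are illustrative assumptions.

```python
def lca_level(leaves):
    """Level of the lowest common ancestor of leaves numbered left to right."""
    return (min(leaves) ^ max(leaves)).bit_length()


def directed_edges(comm):
    """Directed tree edges used by comm = (source, destination).

    An edge is (direction, child), where child = (level, index) is the
    lower endpoint of the tree edge.
    """
    src, dst = comm
    top = lca_level(comm)
    up = {("up", (j, src >> j)) for j in range(top)}
    down = {("down", (j, dst >> j)) for j in range(top)}
    return up | down


def share_no_edge(c1, c2):
    """True if the two communications can be routed in the same step."""
    return directed_edges(c1).isdisjoint(directed_edges(c2))


def pair_is_incompatible(c1, c2):
    """True if the sources or the destinations of c1, c2 form an incompatible."""
    low = min(lca_level(c1), lca_level(c2))
    return (lca_level([c1[0], c2[0]]) < low) or (lca_level([c1[1], c2[1]]) < low)


print(share_no_edge((0, 1), (2, 3)), share_no_edge((0, 4), (1, 5)))
```

Looping over all pairs of communications with distinct sources and distinct destinations on an 8-leaf tree, the two tests agree in this model, which is the statement of the lemma.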

We now formalize the notion of scheduling a width-w communication set in w
steps.
Definition 3.4 A set, C, of communications with width w is width partitionable if
and only if
(a) C has only one communication,
or
(b) C satisfies the following two conditions:
(i) There exists a set $C_1 \subseteq C$ such that $C_1$ has width 1.
(ii) The set $C - C_1$ has width $w - 1$ and is width partitionable.

Remarks: For the recursive definition, the singleton set C forms the base case. For a
width-w set C (where $w > 1$), the set $C_1$ consists of a set of communications that can
be scheduled in one step such that $C - C_1$ has width $w - 1$. A similar width-1 subset
$C_2 \subseteq C - C_1$ reduces the width of $C - C_1 - C_2$ to $w - 2$. Thus, C can be partitioned into
w subsets $C_1, C_2, \ldots, C_w$, each of width 1. Since the width of C is w, no fewer than
w blocks are possible in this partition; in that sense it is width partitionable. Implicitly,
the above definition also specifies a w-step schedule for a width partitionable
set of communications with width w. Note that while the schedule $C_1, C_2, \ldots, C_w$
prescribes the communications that can be performed simultaneously, the order in
which communications of sets $C_1, C_2, \ldots, C_w$ are performed is not important.
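
A schedule in the sense of the remarks above is a partition of the communication set into blocks whose members are pairwise compatible (Corollary 3.5). The following sketch checks a proposed schedule; it does not search for one. The leaf numbering, the function names, and the example set are assumptions made for illustration.

```python
from itertools import combinations


def lca_level(leaves):
    """Level of the lowest common ancestor of leaves numbered left to right."""
    return (min(leaves) ^ max(leaves)).bit_length()


def incompatible(c1, c2):
    """True if the sources or the destinations of c1, c2 form an incompatible."""
    low = min(lca_level(c1), lca_level(c2))
    return (lca_level([c1[0], c2[0]]) < low) or (lca_level([c1[1], c2[1]]) < low)


def is_valid_schedule(comms, steps):
    """True if `steps` partitions `comms` into blocks of pairwise compatible pairs."""
    scheduled = sorted(c for step in steps for c in step)
    if scheduled != sorted(comms):
        return False  # some communication is missing or scheduled twice
    return all(not incompatible(c1, c2)
               for step in steps for c1, c2 in combinations(step, 2))


comms = [(0, 4), (1, 5), (2, 3)]          # hypothetical width-2 set on 8 leaves
print(is_valid_schedule(comms, [[(0, 4), (2, 3)], [(1, 5)]]))
print(is_valid_schedule(comms, [[(0, 4), (1, 5)], [(2, 3)]]))
```

The first schedule uses two steps for this width-2 set; the second is rejected because (0, 4) and (1, 5) have incompatible sources.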
Theorem 3.6 The CST can schedule the communications from a width partitionable
set in w steps if and only if the width of the set of communications is at most w.

In the next two sections we discuss width partitionable communication sets. Chapter 4 deals with sets that are not width partitionable.

3.2 Communication Sets with Disjoint Incompatibles
In this section we consider a particular class of communication sets and prove them
to be width partitionable. Consequently, Theorem 3.6 provides a tight bound on the
time for scheduling communications from this class on the CST.
Definition 3.5 A set C of communications has disjoint incompatibles if and only if
no source or destination appears in more than one incompatible.

Figure 3.2 shows an example of a communication set with disjoint incompatibles. Consider any width-w set, C , of communications with disjoint incompatibles.
Clearly, every subset of C also has disjoint incompatibles. To prove that C is width
partitionable (when the width of C is greater than 1), we need only show the existence of $C_1 \subseteq C$ that includes exactly one source or destination for each maximum
incompatible. We now show that this set $C_1$ can be constructed.

Broadly speaking, the idea is to represent the incompatibles of C as nodes of
a graph and the communications themselves as edges in this graph. The task of
selecting a set of communications (edges of this graph), subject to the restrictions
in the definition of a width partitionable set, will be shown to be that of finding a
matching in the graph.
To help us along with the proof, we first add some dummy communications to
C. To each source (resp., destination) incompatible, I, of C, add $w - |I|$ dummy
sources (resp., destinations); $|I|$ denotes the number of elements in set I. If the
number of source and destination incompatibles is different, then add dummy source
or destination incompatibles, each with w dummy elements, so that the number of
source and destination incompatibles is the same. Since C has an equal number of
sources and destinations, the number of dummy sources and destinations added is
equal. Pair each dummy source with a dummy destination to form a dummy one-to-one
communication. (It is not important for the pairing to ensure that the dummy
communications satisfy the membership requirements of their incompatibles.)
Figure 3.8(a) shows an example of a communication set with disjoint incompatibles. Figure 3.8(b) shows the set after adding the dummy communications.
Let the augmented set of communications be $\widehat{C}$. Clearly, $\widehat{C}$ has an equal number
of source and destination incompatibles, each of size w. We now represent $\widehat{C}$ as a bipartite
graph and show that it has a complete matching from the source incompatibles
to the destination incompatibles. Subsequently, we will use this to construct $C_1 \subseteq C$
that includes exactly one source or destination for each maximum incompatible of C.
Let $\widehat{I} = \{I_1, I_2, \ldots, I_z\}$ and $\widehat{J} = \{J_1, J_2, \ldots, J_z\}$ be the sets of source and destination
incompatibles of $\widehat{C}$. Construct a bipartite graph $\widehat{G} = (\widehat{I} \cup \widehat{J}, E)$ with an edge
$(I_i, J_j)$ iff there is a communication whose source is in $I_i$ and whose destination is in
$J_j$. Figure 3.8(c) shows a bipartite graph of the communication set of Figure 3.8(b).
A matching on a bipartite graph is a subset of its edges so that no two selected
edges share a common vertex. A matching is complete if for every vertex v of the
graph, the matching includes an edge incident on v. The following result is well
known.

Figure 3.8: Constructing set $C_1$ for a set with disjoint incompatibles: (a) original communication set, (b) adding dummy communications, (c) a bipartite graph, (d) a complete matching, (e) the set $C_1$
Theorem 3.7 (Hall's Theorem) [1, p. 667]: A bipartite graph with vertex set
$V = A \cup B$ has a complete matching if and only if for each $Q \subseteq A$, $|Q| \le |R(Q)|$,
where $R(Q) \subseteq B$ is that subset of B with edges to elements of Q.
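
Hall's condition can be tested directly on small graphs and compared against a matching found by augmenting paths. The sketch below is illustrative only: the graph encoding (a dict from each vertex of A to its set of neighbors in B), the function names, and the two example graphs are assumptions, and the matching routine is a standard augmenting-path search rather than anything prescribed by the text. When $|A| = |B|$, a matching that covers all of A also covers all of B.

```python
from itertools import combinations


def hall_condition(adj):
    """True if |Q| <= |R(Q)| for every non-empty subset Q of A."""
    left = list(adj)
    for size in range(1, len(left) + 1):
        for Q in combinations(left, size):
            R = set().union(*(adj[a] for a in Q))
            if len(Q) > len(R):
                return False
    return True


def max_matching(adj):
    """Size of a maximum matching, found with augmenting paths."""
    match = {}  # vertex of B -> vertex of A currently matched to it

    def augment(a, seen):
        for b in adj[a]:
            if b not in seen:
                seen.add(b)
                if b not in match or augment(match[b], seen):
                    match[b] = a
                    return True
        return False

    return sum(augment(a, set()) for a in adj)


good = {"s1": {"d1", "d2"}, "s2": {"d1"}, "s3": {"d2", "d3"}}
bad = {"s1": {"d1"}, "s2": {"d1"}, "s3": {"d2", "d3"}}
print(hall_condition(good), max_matching(good))
print(hall_condition(bad), max_matching(bad))
```

In the second graph the subset {s1, s2} has only one neighbor, so Hall's condition fails and no matching can cover all three vertices of A.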

We now use Hall's Theorem to prove that the graph $\widehat{G} = (\widehat{I} \cup \widehat{J}, E)$ has a complete
matching. Select any $Q \subseteq \widehat{I}$ with $|Q| = q$. Since each of the q incompatibles of Q
has w elements, Q represents $wq$ communications (both real and dummy). These
$wq$ communications are spread over at least q destination incompatibles, as each
destination incompatible holds only w elements. Therefore $|R(Q)| \ge q = |Q|$. Thus
$\widehat{G}$ has a complete matching $M \subseteq E$ (Theorem 3.7). Figure 3.8(d) shows one possible
matching between nodes of Figure 3.8(c).
Let $M = \{(I_i, J_{f(i)}) : 1 \le i \le z\}$ be a complete matching of graph $\widehat{G}$, where $I_i$ is a
source incompatible and $J_{f(i)}$ is the destination incompatible matched to it. Construct
set $C_1 \subseteq C$ as follows. If $I_i$ is a maximum source incompatible of C, then it contains
no dummy elements. Consequently, there is a (non-dummy) communication $(x, x')$
with $x \in I_i$ and $x' \in J_{f(i)}$. For each such $I_i$, select $(x, x')$ to be in $C_1$. Similarly,
if $J_{f(i)}$ is a maximum destination incompatible of C, then there is a (non-dummy)
communication $(y, y')$ with $y \in I_i$ and $y' \in J_{f(i)}$. Again for each such $J_{f(i)}$, select
$(y, y')$ to be in $C_1$. No other communication is selected to be in $C_1$. Figure 3.8(e)
shows the communications included in the set $C_1$.
Clearly, each communication of $C_1$ has its source and destination in a different
incompatible; this is because their selection is based on a matching on a graph with
incompatibles as vertices. Also for each maximum incompatible, there is a communication in C1 whose source or destination is in the maximum incompatible. This proves
that every set of communications with disjoint incompatibles is width partitionable.
With Theorem 3.6 we have the following result.
Theorem 3.8 The CST can schedule communications from a set with disjoint incompatibles in w steps if and only if the width of the set of communications is at most
w.

3.3 Sets with Overlapping Incompatibles
In the last section we proved that communication sets with disjoint incompatibles are
width partitionable. In general, however, two incompatibles may overlap (have some
common elements). In this section we consider some important classes of communication
sets whose incompatibles need not be disjoint. We first define three conditions
that establish building blocks for constructing complex communication sets. Subsequently,
we use these conditions to construct two important classes of communication
sets, which we then prove to be width partitionable.
In this section we restrict our discussions to communication sets that are "oriented."
For the purpose of our discussion, number the sources and destinations
(leaves of the CST) in ascending order from left to right. Thus for two leaves x and y,
we write $x < y$ to mean that x is to the left of y, and $x \le y$ to mean that x
is not to the right of y.

Figure 3.9: Examples of oriented (a) and non-oriented (b) communication sets
Definition 3.6 A communication set C is oriented if and only if either (i) for every
communication $(x, x') \in C$, $x < x'$ or (ii) for every communication $(x, x') \in C$,
$x > x'$.
Remark: Where there is no ambiguity we will drop the attribute \oriented" for
communication sets in this section.
Figure 3.9(a) shows an example of an oriented communication set as each source is
to the left of its destination. The communication set of Figure 3.9(b) is not oriented.
Considering oriented communication sets greatly simplifies the discussion without
giving up too much in the generality of the results. (This is because every non-oriented
communication set is a union of at most two oriented communication sets.)
Without loss of generality, we assume each communication to have its source to the
left of its destination at the leaves of the CST (as in Figure 3.9(a)).
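
The orientation test of Definition 3.6, and the observation that any communication set splits into at most two oriented sets, can be sketched as follows. Communications are written as (source, destination) pairs of leaf numbers increasing from left to right; the function names are hypothetical.

```python
def is_oriented(comms):
    """True if every source is left of its destination, or every source is right of it."""
    return (all(src < dst for src, dst in comms)
            or all(src > dst for src, dst in comms))


def split_by_orientation(comms):
    """Split comms into its left-to-right part and its right-to-left part."""
    left_to_right = [(src, dst) for src, dst in comms if src < dst]
    right_to_left = [(src, dst) for src, dst in comms if src > dst]
    return left_to_right, right_to_left


mixed = [(0, 3), (2, 1)]
print(is_oriented([(0, 3), (1, 2)]), is_oriented(mixed))
print(split_by_orientation(mixed))
```

Each of the two parts returned by `split_by_orientation` is oriented by construction, which is the decomposition used to justify restricting attention to oriented sets.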
3.3.1 Combining Communication Sets
The combination of two communication sets is simply their union. When combined
under certain conditions, the resulting communication set can be proved to have some
useful properties. In this section we identify three such conditions termed capping,
concatenation, and interleaving. Subsequently we use these conditions to express two
useful classes of communications called well-nested sets and monotonic sets, and prove
these sets to be width partitionable.
We now describe the conditions referred to above. Although we apply these conditions
to communications oriented from left to right, they can be adapted to sets
oriented from right to left and to non-oriented sets. For the following definitions let
$C'$ and $C''$ be communication sets and let $C = C' \cup C''$.

Figure 3.10: Illustration of conditions for combining communication sets: (a) Capping, (b) Concatenation, (c) Interleaving
Capping: The capping condition (see Figure 3.10(a)) requires that
\[ C'' = \{(y, y')\} \text{ and for all } (x, x') \in C',\ y < x \text{ and } x' < y'. \qquad (3.2) \]
Remark: Since $x < x'$, the capping condition implies that $y < x < x' < y'$. Intuitively,
the capping condition requires the singleton set $C''$ to span across (or cap)
the entire set $C'$.
To emphasize that sets $C'$ and $C''$ satisfy the capping condition we will express
set $C = C' \cup C''$ as $C = cap(C', C'')$. Since $C'' = \{c\}$, a singleton set, we will
write $cap(C', C'')$ simply as $cap(C', c)$ (rather than $cap(C', \{c\})$).
Concatenation: The concatenation condition (see Figure 3.10(b)) requires that
\[ \text{for all } (x, x') \in C',\ (y, y') \in C'',\quad x' < y. \qquad (3.3) \]
Remark: Again since $x < x'$ and $y < y'$, the concatenation condition implies that
$x < x' < y < y'$. Intuitively, all communications of $C'$ are to the left of those of $C''$.
Here we write $C = C' \cup C'' = concat(C', C'')$. In general, $|C''| \ge 1$. However, if
$C'' = \{c\}$, a single element set, then we write $concat(C', \{c\})$ as $concat(C', c)$.

Interleaving: The interleaving condition (see Figure 3.10(c)) requires that
\[ C'' = \{(y, y')\},\ \text{and } \forall (x, x') \in C',\ x < y \text{ and } x' < y',\ \text{and } \exists (z, z') \in C',\ y < z'. \qquad (3.4) \]
Remark: Here for some $(z, z')$, $z < y < z' < y'$. Intuitively, $C'$ is to the left of $C''$
but not entirely. At least one destination of $C'$ is to the right of y.
Here we write $C = C' \cup C'' = inter(C', C'')$. As in the capping condition, we will
only use singleton sets for $C''$ and use $inter(C', c)$ to mean $inter(C', \{c\})$.
The restriction that $|C''| = 1$ for the capping and interleaving conditions is necessary
for properly defining the well-nested and monotonic sets later in this section.
For a different setting however, our definitions of the conditions could be generalized
to $|C''| \ge 1$.
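
The three conditions can be written as predicates on left-to-right oriented sets. In this sketch a communication is a (source, destination) pair of leaf numbers, $C'$ is a list of such pairs, and the function names `is_cap`, `is_concat`, and `is_inter` are introduced here for illustration only.

```python
def is_cap(c_prime, c):
    """Capping (3.2): the single communication c spans every member of c_prime."""
    y, yp = c
    return all(y < x and xp < yp for x, xp in c_prime)


def is_concat(c_prime, c_dprime):
    """Concatenation (3.3): every destination of c_prime is left of every source of c_dprime."""
    return all(xp < y for _, xp in c_prime for y, _ in c_dprime)


def is_inter(c_prime, c):
    """Interleaving (3.4): c starts and ends to the right of every member of
    c_prime, and at least one destination of c_prime lies to the right of c's source."""
    y, yp = c
    return (all(x < y and xp < yp for x, xp in c_prime)
            and any(y < zp for _, zp in c_prime))


print(is_cap([(1, 2)], (0, 3)))         # (0, 3) caps (1, 2)
print(is_concat([(0, 1)], [(2, 3)]))    # (0, 1) lies entirely left of (2, 3)
print(is_inter([(0, 2)], (1, 3)))       # sources 0 < 1 and destinations 2 < 3, with 1 < 2
```

For a pair of communications with four distinct endpoints, source orders and destination orders either agree or disagree, and overlap either occurs or not, so the three predicates distinguish the three pictures of Figure 3.10.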
We now show how the conditions identified above could be used to define classes of
communication sets. We consider communication sets that satisfy a set of conditions.
Although some definitions are made in a more general setting, we implicitly assume
the conditions to be one of capping, concatenation, and interleaving discussed above.
Again we will use the symbol $\xi(C', C'')$ to denote the communication set $C' \cup C''$
(obtained from sets $C'$ and $C''$), emphasizing the fact that $C' \cup C''$ satisfies condition
$\xi$, where $\xi \in \{\mathrm{cap}, \mathrm{concat}, \mathrm{inter}\}$.
Definition 3.7 Inductively, define a communication set satisfying a set $S$ of conditions
as follows.

(a) A set with only one communication satisfies $S$.
(b) If sets $C'$ and $C''$ satisfy $S$, then for each condition $\xi \in S$, set $\xi(C', C'')$
satisfies $S$.
(c) No other communication set satisfies $S$.
The above definition allows a communication set $C$ to be constructed inductively
using only the conditions of $S$; all conditions of $S$ need not be used, however. The
sequence of conditions applied to construct a communication set gives a (condition)
expression for $C$. For example the expression for the communication set in Figure 3.11
is $\mathrm{inter}(\mathrm{cap}(\mathrm{inter}(\mathrm{cap}(c_1, c_2), c_3), c_4), c_5)$ where $c_1 = (v, v')$, $c_2 = (w, w')$, $c_3 = (x, x')$,
$c_4 = (y, y')$, $c_5 = (z, z')$. Note that this expression is not necessarily unique. (That is,
an expression may have different conditions in different order but the set of conditions
used is always the same.) We can think of a condition set $S$ as inducing a class $\mathcal{C}$ of
communication sets such that each $C \in \mathcal{C}$ can be constructed as described above.

Figure 3.11: A communication set satisfying the set $\{\mathrm{cap}, \mathrm{inter}\}$
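A condition expression can be evaluated bottom-up, checking each condition as it is applied. The sketch below assumes a nested-tuple encoding `(name, left, right)` for expressions and integer positions for sources and destinations. The coordinates in `FIG_3_11` are hypothetical values chosen only so that the expression quoted above is satisfied; they are not read off the original figure.

```python
def _cap(left, right):
    # Capping: the single communication in `right` spans all of `left`.
    if len(right) != 1:
        return False
    y, y_dst = right[0]
    return all(y < x and x_dst < y_dst for x, x_dst in left)


def _concat(left, right):
    # Concatenation: every destination of `left` is left of every source of `right`.
    return all(x_dst < y for _, x_dst in left for y, _ in right)


def _inter(left, right):
    # Interleaving: `left` is left of the single communication in `right`,
    # but at least one destination of `left` lies to the right of its source.
    if len(right) != 1:
        return False
    y, y_dst = right[0]
    return (all(x < y and x_dst < y_dst for x, x_dst in left)
            and any(y < z_dst for _, z_dst in left))


CHECKS = {"cap": _cap, "concat": _concat, "inter": _inter}


def build(expr):
    """Return (communications, conditions used) for a nested condition expression."""
    if isinstance(expr[0], int):          # leaf: one (source, destination) pair
        return [expr], set()
    name, left_expr, right_expr = expr
    left, used_left = build(left_expr)
    right, used_right = build(right_expr)
    if not CHECKS[name](left, right):
        raise ValueError(f"{name} condition violated by {left} and {right}")
    return left + right, used_left | used_right | {name}


# Hypothetical positions: y=0, w=1, v=2, v'=3, x=4, w'=5, x'=6, z=7, y'=8, z'=9.
c1, c2, c3, c4, c5 = (2, 3), (1, 5), (4, 6), (0, 8), (7, 9)
FIG_3_11 = ("inter", ("cap", ("inter", ("cap", c1, c2), c3), c4), c5)
```

Calling `build(FIG_3_11)` returns the five communications together with the set of conditions that this particular expression uses; by Lemma 3.10 any other valid expression for the same set would use the same conditions.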
Lemma 3.9 Let $S_1, S_2$ be sets of conditions inducing classes $\mathcal{C}_1, \mathcal{C}_2$ of communication
sets. If $S_1 \subseteq S_2$, then $\mathcal{C}_1 \subseteq \mathcal{C}_2$.
Proof: Since every condition $\xi \in S_1$ is also in $S_2$, it is always possible to construct
every set in the class $\mathcal{C}_1$ using the set of conditions $S_2$. Consequently, $\mathcal{C}_1 \subseteq \mathcal{C}_2$.
Let the base set of conditions of a communication set $C$ be the smallest set of
conditions that it satisfies. Observe that the three conditions described earlier in
this section combine two communication sets with different relative placement of
communications (see Figure 3.10). That is, one cannot replace another. For example,
if $C'$ containing $(x, x')$ and $C'' = \{c\} = \{(y, y')\}$ satisfy the capping condition, then
$y < x < x' < y'$. This implies that they cannot satisfy the concatenation condition
(requiring $x < x' < y < y'$) or the interleaving condition (requiring $x < y < x' < y'$,
assuming $(x, x')$ to satisfy the last clause of Equation 3.4). Thus capping cannot be
replaced by concatenation or interleaving. Similar assertions can be made about
concatenation and interleaving. Thus we have the following lemma.
Lemma 3.10 If a communication set satisfies a set of conditions, then it has a unique
base set of conditions.

We now consider oriented communication sets that can satisfy various subsets of
the set $\{\mathrm{cap}, \mathrm{concat}, \mathrm{inter}\}$ and examine whether these sets are width partitionable
(see Definition 3.4).
We first prove that there exists a communication set satisfying $\{\mathrm{cap}, \mathrm{inter}\}$ that
is not width partitionable. This implies that a set satisfying $\{\mathrm{cap}, \mathrm{concat}, \mathrm{inter}\}$ is
not width partitionable as well. Next in Lemma 3.12 we prove that a communication
set satisfying $\{\mathrm{concat}\}$ is width partitionable. We use this result in Sections 3.4
and 3.5 to establish that sets satisfying $\{\mathrm{cap}, \mathrm{concat}\}$ (called well-nested sets) or
$\{\mathrm{concat}, \mathrm{inter}\}$ (called monotonic sets) are width partitionable. These results, with
Lemma 3.12, imply that a communication set satisfying only one of the three conditions in
$\{\mathrm{cap}, \mathrm{concat}, \mathrm{inter}\}$ is width partitionable. Table 3.1 summarizes these results.
Table 3.1: Width partitionability of communication sets satisfying conditions from
{cap, concat, inter}

Base condition set       Width partitionable?   Remarks
{cap, concat, inter}     No                     implied by Lemma 3.11
{cap, concat}            Yes                    well-nested set; Section 3.4
{cap, inter}             No                     implied by Lemma 3.11
{concat, inter}          Yes                    monotonic set; Section 3.5
{cap}                    Yes                    implied by Theorem 3.15
{concat}                 Yes                    implied by Corollary 3.13
{inter}                  Yes                    implied by Theorem 3.17

Lemma 3.11 There exists a communication set satisfying $\{\mathrm{cap}, \mathrm{inter}\}$ that is not
width partitionable.

Proof: Consider the set $C = \{c_1, c_2, c_3, c_4, c_5\}$ of communications in Figure 3.11 where
$c_1 = (v, v')$, $c_2 = (w, w')$, $c_3 = (x, x')$, $c_4 = (y, y')$, $c_5 = (z, z')$. This communication
set has the condition expression $\mathrm{inter}(\mathrm{cap}(\mathrm{inter}(\mathrm{cap}(c_1, c_2), c_3), c_4), c_5)$
and its base set of conditions is $\{\mathrm{cap}, \mathrm{inter}\}$. By Lemma 3.10, this set of conditions
is unique; that is, the communication set cannot be constructed with a different set
of conditions. Figure 3.12 shows one possible mapping of these communications on
a CST and Figure 3.13 shows the incompatibility graph of this communication set.
Clearly, the communication set has width 2. As in Figure 3.6, the only communication
that can be scheduled with $(x, x')$ is either $(w, w')$ or $(y, y')$. Since $C - \{(x, x'), (w, w')\}$
and $C - \{(x, x'), (y, y')\}$ each have width 2, it follows that $C$ cannot be scheduled in two steps.

Figure 3.12: Mapping communication set of Figure 3.11 on the CST
We close this section with the following results that are used in subsequent sections.
Lemma 3.12 For any width partitionable, oriented communication sets $C'$ and $C''$ of
widths $w_1$ and $w_2$, respectively, $\mathrm{concat}(C', C'')$ is a width partitionable communication
set of width $\max\{w_1, w_2\}$.

Proof: If $C'$ and $C''$ share a common portion of a CST, then it must be at the right
(destination) end of $C'$ and the left (source) end of $C''$ (see Figure 3.10(b)). Since the
CST has full duplex links, upward edges from sources will not conflict with downward
edges to destinations. That is, the sets $C'$ and $C''$ do not interfere with each other.
Consequently, $\mathrm{concat}(C', C'')$ is width partitionable and has width $\max\{w_1, w_2\}$.
Corollary 3.13 Every communication set that satisfies the set $\{\mathrm{concat}\}$ is width
partitionable.
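The width claim of Lemma 3.12 can be checked numerically on small examples. The sketch below makes two assumptions that are not spelled out in this section: the CST is modeled as a complete binary tree whose leaves are numbered from 0, and the width of a set is taken to be the largest number of communications that use the same tree edge in the same direction. The node labels `(level, index)` and the function names are illustrative.

```python
from collections import Counter


def path_edges(src, dst):
    """Directed tree edges used by the communication src -> dst.

    Leaf i is node (0, i); the parent of node (l, j) is (l + 1, j // 2).
    The path climbs from src to the lowest common ancestor, then descends to dst.
    """
    up, down = [], []
    a, b, level = src, dst, 0
    while a != b:
        up.append(((level, a), (level + 1, a // 2)))      # child -> parent
        down.append(((level + 1, b // 2), (level, b)))    # parent -> child
        a, b, level = a // 2, b // 2, level + 1
    return up + down[::-1]


def width(comms):
    """Largest number of communications sharing one directed edge (0 for an empty set)."""
    load = Counter(edge for src, dst in comms for edge in path_edges(src, dst))
    return max(load.values(), default=0)


# Two oriented sets that satisfy the concatenation condition (all of LEFT precedes RIGHT).
LEFT = [(0, 4), (1, 5)]
RIGHT = [(6, 7)]
```

Here `width(LEFT)` is 2, `width(RIGHT)` is 1, and `width(LEFT + RIGHT)` equals the larger of the two, as the lemma states.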

Figure 3.13: The incompatibility graph of the communication set of Figure 3.12

3.4 Well-Nested Communication Sets
In this section we define a class of communications called "oriented well-nested
(communication) sets." A special case of a well-nested set is the set of communications on
a segmentable bus [48], a fundamental structure of reconfigurable computing.
Definition 3.8 An oriented well-nested communication set is one that satisfies the
conditions in the set $\{\mathrm{cap}, \mathrm{concat}\}$.

Figure 3.14 shows examples of oriented well-nested sets and their condition expressions.
The well-nested set derives its name from its similarity to a well-nested
parenthesis sequence. If each source (resp., destination) in a well-nested set is replaced
by a "(" (resp., ")"), then the resulting sequence is a well-nested (parenthesis)
sequence. For example the communication set of Figure 3.14(a) can be represented
as the well-nested sequence ((( )( ))) ((( ))).
Definition 3.9 The depth of an oriented well-nested set can be defined inductively
as follows:

(a) The depth of a singleton communication set is 1.
(b) If $C$ is an oriented well-nested set of depth $d$, then $\mathrm{cap}(C, c)$ has depth $d + 1$.
(c) If $C'$ and $C''$ are well-nested sets of depths $d_1$ and $d_2$, then $\mathrm{concat}(C', C'')$
has depth $\max\{d_1, d_2\}$.

Figure 3.14: Examples of well-nested sets, with the corresponding condition expressions
and parentheses depths. The letter next to each source represents the entire
communication.
(a) concat(cap(cap(concat(c, d), b), a), cap(cap(g, f), e)), of depth 3
(b) cap(concat(cap(concat(c, d), b), cap(cap(g, f), e)), a), of depth 4
For brevity, we will use the terms well-nested sequence and well-nested set interchangeably.
Recall that we consider only sets of communications that are oriented
from left to right. The condition expression may not be unique. For example the
condition expressions $\mathrm{concat}(a, \mathrm{concat}(b, c))$ and $\mathrm{concat}(\mathrm{concat}(a, b), c)$ represent the
same set of communications. However, the order in which the conditions apply is
unique.
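The parenthesis view gives a direct way to compute the depth of Definition 3.9. The sketch below assumes communications are `(source, destination)` pairs of distinct integer positions; the positions in `FIG_3_14A` are hypothetical and were chosen only to reproduce the sequence ((( )( ))) ((( ))) quoted above. Because a parenthesis string alone does not record which source pairs with which destination, the depth routine also matches each destination against the most recently opened source.

```python
def to_parentheses(comms):
    """Replace each source by '(' and each destination by ')', in position order."""
    marks = [(s, "(") for s, _ in comms] + [(d, ")") for _, d in comms]
    return "".join(symbol for _, symbol in sorted(marks))


def nesting_depth(comms):
    """Depth of an oriented well-nested set; raises ValueError if communications cross."""
    events = []
    for index, (src, dst) in enumerate(comms):
        events.append((src, 0, index))   # 0 marks a source
        events.append((dst, 1, index))   # 1 marks a destination
    stack, deepest = [], 0
    for _, kind, index in sorted(events):
        if kind == 0:
            stack.append(index)
            deepest = max(deepest, len(stack))
        elif not stack or stack.pop() != index:
            raise ValueError("communications cross, so the set is not well nested")
    return deepest


# Hypothetical positions for communications a..g of Figure 3.14(a).
FIG_3_14A = [(0, 7), (1, 6), (2, 3), (4, 5), (8, 13), (9, 12), (10, 11)]
```

With these positions, `to_parentheses(FIG_3_14A)` yields `((()()))((()))` and `nesting_depth(FIG_3_14A)` yields 3, matching the depth stated in the figure.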
Let $C$ be an oriented well-nested set. Let $(x, x') \in C$ be the communication with
the leftmost source $x$; i.e., for all $(y, y') \in C$, $x \le y$. If $x'$ is also the rightmost
destination (i.e., for all $(y, y') \in C$, $x' \ge y'$), then $C$ is terminally capped. In other
words, a terminally capped set has an outermost communication $(x, x')$.
Lemma 3.14 Let $C_1$ be a terminally capped oriented well-nested set with an outermost
communication $c'$. Let $c''$ be any communication such that $C_1$ and $c''$ satisfy the
capping condition. Then for any communication $c \in C_1$, the sources (or destinations)
of $c$ and $c''$ are in the same incompatible only if the sources (or destinations) of $c'$ and
$c''$ are in the same incompatible.

Proof: If $|C_1| = 1$, then $C_1 = \{c'\}$ and there is nothing to prove. So assume that
$C_1 = \mathrm{cap}(C_2, c')$, $c = (x, x')$, $c' = (y, y')$ and $c'' = (z, z')$. Clearly, $z < y < x < x' <
y' < z'$ (see Figure 3.15(a)). Let $\mathrm{lca}(x, z) = v$ and $\mathrm{lca}(y, y') = w$. Suppose that $x$
and $z$ are in the same incompatible but $y$ and $z$ are not (see Figure 3.15(b)). This
requires that $\mathrm{level}(w) < \mathrm{level}(v)$. Let $\alpha$ be the rightmost node of the subtree rooted
at $w$. So, $y' \le \alpha$ and $x' > \alpha$, which contradicts the fact that $x' < y'$.

Figure 3.15: Illustration of Lemma 3.14
Theorem 3.15 Every oriented well-nested set is width partitionable.

Proof: We proceed by induction on the depth of the well-nested set. A depth-1
well-nested set satisfies the set $\{\mathrm{concat}\}$. By Corollary 3.13, it is width partitionable.
Assume the theorem to hold for any well-nested set with depth at most $\delta$ and consider
an oriented set $C$ with depth $\delta + 1$. We have two cases corresponding to (a) $C =
\mathrm{concat}(C', C'')$ and (b) $C = \mathrm{cap}(C', c)$, the two parts in the recursive definition of an
oriented well-nested set.
Suppose that $C = \mathrm{concat}(C', C'')$, for well-nested sets $C'$ and $C''$ of widths $w_1$ and
$w_2$, respectively. By Lemma 3.12, $C$ is of width $\max\{w_1, w_2\}$ and can be scheduled
on the CST in $\max\{w_1, w_2\}$ steps. That is, $C$ is width partitionable.
Suppose now that $C = \mathrm{cap}(C', c)$, where $C'$ is an oriented set of width $w_1$. Let
$w$ denote the width of $C$. Observe that $C'$ has depth $\delta$, so the induction hypothesis
applies to it. Recall that proving a width-$w$ communication set width partitionable
is tantamount to scheduling its communications in $w$ steps. If $w = w_1 + 1$, then
schedule $C'$ in $w_1$ steps and communication $c$ all by itself in another step.

Suppose $w = w_1$. Here we only need identify a width-1 set $C_1$ such that $C - C_1$ has
width $w - 1$. Let $C'$ be the concatenation of $C'_1, C'_2, \ldots, C'_\alpha$, for some integer $\alpha \ge 1$
(see Figure 3.16) where for $1 \le i \le \alpha$, $C'_i$ is a terminally capped well-nested set.
Let the outermost communication of $C'_i$ be $c_i = (x_i, x'_i)$. Clearly $C'_i$ has depth $\le \delta$ and
the induction hypothesis applies to it. Thus $C'_i$ has an optimal schedule. Let $S_i$ be
the set of all communications that are scheduled at the same step as communication
$c_i$ in this optimal schedule.
Figure 3.16: Illustration of the proof of Theorem 3.15
Define set $C_1 = \bigcup_{i=1}^{\alpha} S_i$. By the argument used in Lemma 3.12, the width of $C_1$ is
$\max\{\text{width of } S_i : 1 \le i \le \alpha\}$. Since each $S_i$ has width 1, $C_1$ has width 1. By the
induction hypothesis, the set $C' - C_1$ has width $w - 1$. Set $C - C_1$ also has width
$w - 1$ for the following reason. Suppose $c$ and $c_i$ are in the same incompatible, then
an element of this incompatible has already been included in $C_1$. If $c$ is not in the
same incompatible as any of the $c_i$'s, then by Lemma 3.14, $c$ cannot be incompatible
with any communication in the set $C'$.
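For small sets, width partitionability can be confirmed by exhaustive search rather than by the constructive argument above. The sketch below is not the schedule built in the proof; it simply tries every assignment of communications to steps. It assumes a complete-binary-tree CST with leaves numbered from 0, takes the width to be the largest number of communications sharing a tree edge in one direction, and assumes a step can carry exactly those subsets whose width is at most 1. The example set and all names are illustrative.

```python
from collections import Counter
from itertools import product


def width(comms):
    """Largest number of communications sharing one directed tree edge."""
    load = Counter()
    for src, dst in comms:
        a, b, level = src, dst, 0
        while a != b:                      # climb both ends toward the common ancestor
            load[("up", level, a)] += 1    # edge from node (level, a) to its parent
            load[("down", level, b)] += 1  # edge from the parent down to node (level, b)
            a, b, level = a // 2, b // 2, level + 1
    return max(load.values(), default=0)


def min_steps(comms):
    """Fewest steps such that every step is a subset of width at most 1 (brute force)."""
    comms = list(comms)
    for steps in range(1, len(comms) + 1):
        for assignment in product(range(steps), repeat=len(comms)):
            groups = [[c for c, s in zip(comms, assignment) if s == g]
                      for g in range(steps)]
            if all(width(group) <= 1 for group in groups):
                return steps
    return 0


def is_width_partitionable(comms):
    return min_steps(comms) == width(comms)


# cap(concat(concat((1,2), (3,4)), (5,6)), (0,7)): an oriented well-nested set on 8 leaves.
WELL_NESTED = [(0, 7), (1, 2), (3, 4), (5, 6)]
```

The search is exponential in the number of communications, so it is only suitable for sanity checks on toy inputs such as `WELL_NESTED`, which has width 2 and a two-step schedule.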

3.5 Monotonic Communication Sets
In this section we consider another class of communication sets called "oriented monotonic
sets" (see Figure 3.17) and prove it to be width partitionable. This class has
many important communication sets including those of the uniform (or normal) hypercube
(in which only one dimension of the hypercube is used for communications).
Definition 3.10 An oriented monotonic communication set is one that satisfies the
conditions in the set $\{\mathrm{concat}, \mathrm{inter}\}$.


Figure 3.17: Monotonic communication set
The monotonic set derives its name from the nature of its incompatibility graph.
We provide more details at the end of this section.
Definition 3.11 The breadth of an oriented monotonic set can be defined recursively
as follows.
(a) The breadth of a singleton communication set is 1.
(b) If $C$ is an oriented monotonic set of breadth $b$, then $\mathrm{inter}(C, c)$ has breadth
$b + 1$.
(c) If $C'$ and $C''$ are monotonic sets of breadth $b_1$ and $b_2$, then $\mathrm{concat}(C', C'')$
has breadth $\max\{b_1, b_2\}$.
As for oriented well-nested sets, the condition expression of an oriented monotonic set
is unique.
Lemma 3.16 Let $C$ be an oriented monotonic set such that either $C = \{c'\}$ or it
is possible to express $C$ as $\mathrm{inter}(C - \{c'\}, c')$ for some monotonic set $C - \{c'\}$ and
communication $c'$. Let $c''$ be any communication such that $C$ and $c''$ satisfy the
interleaving condition. Then for any communication $c \in C$, the sources (or destinations)
of $c$ and $c''$ are in the same incompatible only if the sources (or destinations) of $c'$ and
$c''$ are in the same incompatible.

Proof: The proof is similar to the proof of Lemma 3.14. If $|C| = 1$, there is nothing to
prove. So assume that $C = \mathrm{inter}(C - \{c'\}, c')$, $c = (x, x')$, $c' = (y, y')$ and $c'' = (z, z')$.
Clearly $x' < y'$ (see Figure 3.18(a)). Here $(y, y')$ is the rightmost communication of
$C$. Let $\mathrm{lca}(x, z) = v$ and $\mathrm{lca}(y, y') = w$. Suppose that $z$ is incompatible with $x$ but
not incompatible with $y$ (see Figure 3.18(b)). This requires that $\mathrm{level}(w) < \mathrm{level}(v)$.
Let $\alpha$ be the rightmost node of the subtree rooted at $w$. So, $y' \le \alpha$ and $x' > \alpha$, which
contradicts the fact that $x' < y'$.


Figure 3.18: Illustration of the proof of Lemma 3.16
Theorem 3.17 Every oriented monotonic set is width partitionable.

Proof: The proof follows the same lines as the proof of Theorem 3.15. We proceed
by induction on the breadth of the oriented monotonic set. One step suffices for
a breadth-1 monotonic set. Assume the theorem to hold for a monotonic set with
breadth at most $\beta$ and consider a monotonic set $C$ with breadth $\beta + 1$. We have two
cases corresponding to: (a) $C$ can be expressed as $\mathrm{concat}(C', C'')$ and (b) $C$ cannot
be expressed as $\mathrm{concat}(C', C'')$.
As described before, if $C = \mathrm{concat}(C', C'')$, then $C$ can be scheduled in $\max\{w', w''\}$
steps and it is width partitionable (where $w', w''$ are the widths of $C', C''$ respectively).
If $C$ cannot be expressed as $\mathrm{concat}(C', C'')$ then $C = \mathrm{inter}(C', c)$, where $C'$ is a
monotonic set of width $w'$ (see Figure 3.19). Let $w$ denote the width of $C$. Observe
that $C'$ has breadth $\beta$, so the induction hypothesis applies to it.
If $w = w' + 1$, then schedule $C'$ in $w'$ steps and then schedule the communication
$c$ all by itself in another step. Suppose $w = w'$. Here we only need identify a width-1
set $C_1$ such that $C - C_1$ has width $w - 1$. Let the rightmost communication of $C'$ be
$c' = (y, y')$. Clearly $C'$ has breadth at most $\beta$ and the induction hypothesis applies
to it. Thus, $C'$ has an optimal schedule. Let $S'$ be the set of all communications of
$C'$ that are scheduled at the same time as communication $c'$.

Figure 3.19: Illustration of the proof of Theorem 3.17
Define set $C_1 = S'$. Clearly, $C_1$ has width 1. By the induction hypothesis the set
$C' - C_1$ has width $w - 1$. Set $C - C_1$ also has width $w - 1$ for the following reason.
Suppose $c$ and $c'$ are in the same incompatible, then an element of this incompatible
has already been included in $C_1$. If not, then, by Lemma 3.16, $c$ cannot be in the
same incompatible as any communication in the entire set $C'$.
Now we show that the incompatibility graph of an oriented monotonic communication
set has a special property from which its name derives.
Definition 3.12 An ordered incompatibility graph of an oriented communication set
is one in which the sources and destinations are arranged in increasing order of their
indices.

For example, Figure 3.6 (page 36) shows an ordered incompatibility graph if $e <
d < a < b < c$ and $e' < a' < b' < d' < c'$.
Definition 3.13 An ordered incompatibility graph of an oriented communication set
is parallel if and only if, for all communications $(x, x'), (y, y')$, if $x < y$, then $x' < y'$.

Intuitively, if the ordered incompatibility graph of an oriented communication set is
not parallel, then it has edges that intersect (see Figure 3.20).
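The parallel property of Definition 3.13 reduces to a simple order check. The sketch below assumes communications are `(source, destination)` pairs with distinct integer sources; the function name is illustrative.

```python
def is_parallel(comms):
    """True if x < y implies x' < y' for all communications (x, x'), (y, y')."""
    ordered = sorted(comms)  # sort by source position
    # With sources in increasing order, it suffices that destinations also increase.
    return all(a[1] < b[1] for a, b in zip(ordered, ordered[1:]))
```

For instance, `is_parallel([(0, 2), (1, 3), (4, 5)])` holds, while a capped pair such as `[(0, 3), (1, 2)]` fails because its two edges would intersect in the ordered graph.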
Theorem 3.18 A communication set is oriented monotonic if and only if its ordered
incompatibility graph is parallel.

Proof: Let C be an oriented monotonic set and let G be its incompatibility graph.
We now prove that G is parallel.


Figure 3.20: An ordered incompatibility graph that is not parallel
We proceed by induction on $|C|$. Clearly, if $|C| = 1$ then $G$ is parallel. Assume the
claim to hold for any oriented monotonic communication set of size at most $n$ and
consider the case where $|C| = n + 1$. Let $C = \xi(C', C'')$, where $\xi \in \{\mathrm{concat}, \mathrm{inter}\}$.
Both $C'$ and $C''$ are of size at most $n$ and the induction hypothesis applies to them
(i.e., their incompatibility graphs are parallel). Regardless of the identity of $\xi$, each
source of $C'$ is to the left of all source(s) of $C''$ and each destination of $C'$ is to the left
of all destination(s) of $C''$. Therefore, edges of the ordered incompatibility graphs of
$C'$ and $C''$ (when placed next to each other) do not intersect. That is, $C$ is parallel
(see Figure 3.21).
Figure 3.21: Illustration of the proof of the "if" part of Theorem 3.18

Now let $G$ be a parallel incompatibility graph of an oriented communication set
$C$. We prove that $C$ is monotonic.
We proceed by induction on the number of communications in $G$. Clearly, if $G$ has one
communication, then (by Definition 3.11) $C$ is monotonic. Assume the claim to hold
for any parallel incompatibility graph with at most $n$ communications, and consider
the case where the incompatibility graph $G$ of set $C$ has $n + 1$ communications. Let
communication $c$ be the rightmost communication of $G$. The incompatibility graph
of set $C - \{c\}$ (see Figure 3.21) has $n$ communications and the induction hypothesis
applies to it (i.e., $C - \{c\}$ is monotonic). Since the source of communication $c$ is to
the right of all sources of $C - \{c\}$ and the destination of communication $c$ is to the
right of all destinations of $C - \{c\}$, $C$ could be constructed as $\xi(C - \{c\}, \{c\})$
where $\xi \in \{\mathrm{concat}, \mathrm{inter}\}$. Since the set $\{\mathrm{concat}, \mathrm{inter}\}$ is used to construct oriented
monotonic sets, $C$ is monotonic.

Figure 3.22: Illustration of the proof of the "only if" part of Theorem 3.18

3.6 Segmentable Bus
Recall that, functionally, a segmentable bus has the structure shown in Figure 2.4
(page 21). Each processor controls (opens or closes) a segment switch on the bus
using local information, creating bus segments connecting consecutive processors.
Section 2.2 provides more details. For now assume that at most one processor writes
on any bus segment and at most one processor reads from a segment (exclusive
read, exclusive write model). Thus, every pair $(x, x'), (y, y')$ of communications is
on a different bus segment; that is, $x, x' < y, y'$. Consider a configuration of the
segmentable bus with $k$ segments (numbered $1, 2, \ldots, k$). Let $(x_i, x'_i)$ denote the
communication (if any) in segment $i$. Since each communication is confined to a
segment of contiguous processors, the set of communications on a segmentable bus
forms a depth-1, well-nested set. If for each $i$, $x_i < x'_i$ (or for each $i$, $x_i > x'_i$), then
the well-nested set is oriented and its width is 1. That is, the CST can accommodate
these oriented communications in one step (Corollary 3.5). If a well-nested set is not
oriented (the set has width at most 2), then we partition it into two oriented
sets, each of width 1, and schedule them in two steps. The width cannot exceed 2 for
a well-nested set of depth 1.
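The two-step schedule for a non-oriented set of bus communications amounts to splitting by direction. The sketch below assumes each communication is a `(source, destination)` pair of distinct processor indices, one per bus segment; the function name is illustrative.

```python
def two_step_schedule(comms):
    """Split segmentable-bus communications into at most two oriented steps."""
    left_to_right = [(s, d) for s, d in comms if s < d]
    right_to_left = [(s, d) for s, d in comms if s > d]
    # Drop an empty step so that an already oriented set needs only one step.
    return [step for step in (left_to_right, right_to_left) if step]
```

For example, `two_step_schedule([(0, 1), (3, 2), (4, 6)])` places `(0, 1)` and `(4, 6)` in the first step and `(3, 2)` in the second.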
So far, we have considered only one-to-one communications. The segmentable bus
permits broadcasting on its segments, however. If the source of a broadcast is the
leftmost (or rightmost) processor of a segment, then the CST simply connects all
switches in the tree that are "below" the path between the source and destination to
receive information from one "side" of the switch and transmit it out of the remaining
two sides (see Figure 3.23). If the source is in the middle of the segment, then simply
schedule two oriented broadcasts, one to the left and the other to the right. Thus we
have the following result.
Theorem 3.19 A CST can schedule the communications of a segmentable bus in at
most two steps.

3.7 Concluding Remarks
In this chapter, we have developed the idea of communication width, which provides a
lower bound on the time to schedule any set of one-to-one communications on the
CST. We have identified a property of the communication set, called width partitionability,
for which the above lower bound is tight. Then we showed two classes of
communication sets to possess this property. As a special case of one of these results,
we showed that the set of communications that can be accommodated in one step on
a segmentable bus [48] can be scheduled in two steps on the SRGA architecture.


Figure 3.23: Broadcasting on the CST with the source at the right end of a bus
segment
The results developed here have a simple generalization to CSTs whose edges
correspond to multiple full duplex links (we used only one full duplex link per edge
so far). Suppose that there are $k_i$ full duplex links at each edge $e_i$ of the tree, then
the "effective width" of a communication set of (actual) width $w$ is $\max\{\lceil w_i / k_i \rceil\}$; the
quantity $w_i$ is the number of communications traversing edge $e_i$ in any one direction.
One interesting case is to use a fat tree [14, 26] where edges between levels $l$ and
$l + 1$ (where $0 \le l < \log N$) have $2^l$ full duplex links. Then any set of one-to-one
communications can be scheduled in one step.
Another interesting case is when $k_i = k$ for all $i$; then the effective width is $\max\{\lceil w_i / k \rceil\}$.
For segmentable bus communications, setting $k = 2$ makes the effective width 1. That
is, if each CST edge has two full duplex links, then a step of the segmentable bus can
be accommodated in one step on the CST. As we noted at the start of this chapter,
we have only considered the issue of accommodating communications from a set in
the CST. Chapter 6 addresses the issue of setting CST switches to actually establish
the dedicated paths needed to accomplish the communications. It turns out that
for a segmentable bus the switches can be set at run-time with local information (as
required by the functional description of the segmentable bus).
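The effective-width formula is a per-edge ceiling division followed by a maximum. The sketch below assumes the per-edge loads $w_i$ and link counts $k_i$ are supplied as two parallel lists of positive integers; the function name and the sample values are illustrative.

```python
def effective_width(loads, links):
    """max over edges of ceil(w_i / k_i), for loads w_i and link counts k_i."""
    if len(loads) != len(links):
        raise ValueError("one load and one link count are needed per edge")
    # -(-w // k) is integer ceiling division.
    return max(-(-w // k) for w, k in zip(loads, links))
```

With two links on every edge, per-edge loads of at most 2 give `effective_width([2, 1, 2], [2, 2, 2]) == 1`, which is the segmentable bus case with $k = 2$. When the link count on every edge is at least the load it carries, as in the fat tree case, the effective width is also 1.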

Chapter 4
CST Communication: Sets That
Are Not Width Partitionable
In Chapter 3 we showed that some communication classes (oriented well-nested sets
and oriented monotonic sets) are width partitionable. In this chapter, we consider
communication classes that, in general, are not width partitionable. This study provides
a better understanding of the conditions under which a communication set is
not width partitionable. First, we show that the incompatibility graph of Figure 4.1
represents one of the simplest communication sets that is not width partitionable.
As we explained in Section 3.1, this incompatibility graph is of width 2, but requires
three steps on the CST; in other words, the communication set requires one "extra"
step beyond its width. We show that the number of extra steps for scheduling a
width-$w$ communication set can be as large as $\lceil w/4 \rceil$. In Chapter 3 we proved that
oriented well-nested sets and oriented monotonic sets are width partitionable. Here,
we show that the non-oriented counterparts of these communication sets are, in general,
not width partitionable. However, with some restrictions (that still keep the sets
non-oriented), these sets are width partitionable.
On the whole, this chapter provides a better understanding of sets that are not
width partitionable. Subsequent chapters build only on width partitionable sets (such
as those of a segmentable bus). Consequently, this chapter could be skipped by the
reader without loss of continuity.
In the next section we identify a "simplest" communication set that is not width
partitionable. Section 4.2 uses the idea of such a simplest set to find an upper bound
on the number of extra steps. Sections 4.3 and 4.4 consider non-oriented well-nested
and non-oriented monotonic communication sets.

Figure 4.1: Width-2 communication set requiring three steps

4.1 The Simplest Communication Sets That Are
Not Width Partitionable
In this section we explore the simplest set of requirements that a communication set
must have so that it is not width partitionable. The set of requirements we consider
includes size (number of communications), width, and number of incompatibles for
the communication set. We show that every communication set that is not width
partitionable must have at least five communications, at least a width of two, and
at least three source incompatibles and three destination incompatibles. This result
makes the communication set of Figure 4.1 a "simplest" set that is not width partitionable.
Further, we show that there are only two choices (to within isomorphism)
for such a simplest set. Figure 4.1 shows one of the two choices.
4.1.1 Requirement of the Simplest Set
We first derive a series of intermediate results that lead to the main result of Theorem 4.11
(page 75). In particular, we prove that a communication set with at most
four communications is always width partitionable (see Lemma 4.4 and Theorem 4.6).
Theorem 4.10 shows that the communication set must have at least three source
incompatibles and three destination incompatibles to be not width partitionable. This
sequence of arguments establishes the basic requirements for a simplest set that is
not width partitionable.
Clearly, reversing the direction of each communication in a communication set $C$
produces a "dual" communication set $\hat{C}$ and vice versa. This relationship between
$C$ and $\hat{C}$ also carries into the incompatibles. Each source incompatible of $C$ is a
destination incompatible of $\hat{C}$ and vice versa. For brevity, we will derive intermediate
results for either $C$ or $\hat{C}$ but not both. However, it should be understood that they
apply to both. For example, Lemma 4.1 talks of overlapping source incompatibles $I_1$
and $I_2$ and disjoint destination incompatibles $J_1$ and $J_2$. Clearly, the result applies to
overlapping destination incompatibles $J_1$ and $J_2$ and disjoint source incompatibles $I_1$
and $I_2$. To indicate that we are referring to this "dual" result, we will cite the "dual
of Lemma 4.1." Similar conventions are adopted for other results.
We organize this section as follows. The first part (Section 4.1.1.1) considers
some general results. Section 4.1.1.2 derives the simplest set from the number of
communications point of view. Section 4.1.1.3 derives the simplest set from the
number of incompatibles point of view. The simplest set from the width point of
view is straightforward, so no separate section is devoted to that. We put all these
results together in Theorem 4.11.
4.1.1.1 Preliminary Results

In this section we derive some general results that are used in later sections.
Lemma 4.1 Let $(x, x')$ and $(y, y')$ be two communications in any communication
set. Let $I_1, I_2$ (resp., $J_1, J_2$) be source (resp., destination) incompatibles such that
$x \in I_1 - I_2$, $y \in I_2 - I_1$, $x' \in J_1$ and $y' \in J_2$. If $I_1 \cap I_2 \ne \emptyset$, then $J_1 \cap J_2 = \emptyset$.

Proof: Let $v \in I_1 \cap I_2$ (see Figure 4.2(a)). Let $\mathrm{lca}(x, v) = p$ and let $\ell_1 = \mathrm{level}(p)$.
The directed CST link $\langle p, \mathrm{parent}(p) \rangle$ is used by communications $(x, x')$ and $(v, v')$
(see Figure 4.2(b)); note that since $x, v \in I_1$ (an incompatible), $p$ cannot be the root
of the CST. Let $\mathrm{lca}(v, y) = m$ and let $\ell_2 = \mathrm{level}(m)$. Again, the link $\langle m, \mathrm{parent}(m) \rangle$
is used by $(v, v')$ and $(y, y')$. Since $\{x, y\}$ is not an incompatible, communication
$(y, y')$ cannot traverse $p$ and $(x, x')$ cannot traverse $m$; that is, $\ell_1 \ne \ell_2$. Without loss
of generality, let $\ell_1 > \ell_2$. This guarantees that the communication originating at $y$
has no upward edges at level $\ell_1$ or higher (otherwise $y \in I_1$) (see Figure 4.2(b)). This
implies that $\{x', y'\}$ is not an incompatible. Also each destination $z'$ such that $\{y', z'\}$
is an incompatible has to be a leaf of subtree $T_1$ (see Figure 4.2(b)), whereas each
destination $w'$ such that $\{x', w'\}$ is an incompatible is a leaf of $T_2$. Since $T_1$ and $T_2$
have disjoint sets of leaves, $J_1 \cap J_2 = \emptyset$.

Figure 4.2: Illustration of the proof of Lemma 4.1
Remarks: Intuitively, Lemma 4.1 shows that if two source incompatibles overlap,
then their exclusive elements (elements that are not common to both) must have
destinations in disjoint destination incompatibles (see Figure 4.2). In effect, a destination
incompatible can have edges to only one source incompatible (i.e., the size of
a destination incompatible is no more than the size of the source incompatible that
it has edges to) and we have the following result.
Lemma 4.2 Let $I_1, I_2$ be two source incompatibles such that $|I_1 \cap I_2| \ge 2$. For any
communications $(x, x')$ and $(y, y')$ such that $x, y \in I_1 \cap I_2$, $x' \in J_1$, $y' \in J_2$,
there are no communications $(u, u')$ and $(v, v')$ such that $u \in I_1 - I_2$, $v \in I_2 - I_1$,
$u' \in J_1 - J_2$ and $v' \in J_2 - J_1$.

Proof outline: If the situation described by the lemma is possible, then Figure 4.3(a)
shows that situation. Figure 4.3(b) shows that it is impossible for the situation in
Figure 4.3(a) to occur on the CST.

Figure 4.3: Illustration of the proof outline of Lemma 4.2
Lemma 4.3 Let $C$ be a communication set with at most two maximal source incompatibles $I_1$ and $I_2$. If $I_1 \cap I_2 \ne \emptyset$, then $C$ is width partitionable.

Proof: If $C$ has only one source incompatible, then the width of $C$ is $|C|$; simply
schedule the communications one by one.
Let $C$ have two distinct maximal source incompatibles $I_1$ and $I_2$ with $I_1 \cap I_2 \ne \emptyset$.
Let $x \in I_1 - I_2$, $y \in I_1 \cap I_2$ and $z \in I_2 - I_1$ (see Figure 4.4). Since $C$ has exactly two
source incompatibles, $\{x, z\}$ cannot be an incompatible, as this would make $\{x, y, z\}$
an incompatible (Lemma 3.3, page 33), and hence $I_1$ and $I_2$ would not be maximal
as assumed.
Let $x'$ and $z'$ be the destinations corresponding to sources $x$ and $z$, respectively.
By Lemma 4.1, $x'$ and $z'$ must be in disjoint destination incompatibles $J_1$ and $J_2$,
respectively (say). This statement holds for any $x \in I_1 - I_2$ and $z \in I_2 - I_1$. Thus,
every element of destination incompatible $J_1$ (resp., $J_2$) must have its source in $I_1$
(resp., $I_2$). That is, if $J_1$ (resp., $J_2$) is a maximum incompatible, then so is $I_1$ (resp.,
$I_2$). This, in turn, implies that at least one of $I_1$ and $I_2$ must be maximum.
This observation gives us a simple method to schedule $C$. Simply schedule all the
elements in $I_1 \cap I_2$ first. After this we have a communication set with disjoint
incompatibles, which by Theorem 3.8 is width partitionable. Thus, $C$ is width partitionable.

Figure 4.4: Illustration of the proof of Lemma 4.3
4.1.1.2 Number of Communications in a Simplest Set

In this section we show that a communication set with at most four communications
is width partitionable.
Lemma 4.4 Every set of three communications is width partitionable.

Proof: For the set $C = \{(x, x'), (y, y'), (z, z')\}$ to be not width partitionable, there
must be overlapping source incompatibles and/or overlapping destination incompatibles
(Theorem 3.8). Because $C$ has only three communications, either there are only
two overlapping source incompatibles or there are only two overlapping destination
incompatibles. Without loss of generality, let $\{x, y\}$ and $\{y, z\}$ be incompatibles. By
Lemma 3.3 (page 33), $\{x, z\}$ cannot be an incompatible. By Lemma 4.3 and its dual,
$C$ is width partitionable.
Lemma 4.5 Let C = f(w; w0); (x; x0 ); (y; y 0); (z; z 0 )g be a communication set. If
fw; xg; fx; yg, fy; zg and fw; zg are source incompatibles, then C has at most two
maximal source incompatibles.

Proof: If fw; x; y; zg is an incompatible, there is nothing to prove. So by Lemma 3.3
(page 33) fw; yg or fx; zg is not an incompatible.

Figure 4.5: Illustration of the proof of Lemma 4.5
Case 1 $\{w, y\}$ is not an incompatible (see Figure 4.5(a)). Since $\{w, y\}$ is not an incompatible, $\mathrm{lca}(w, x) = r$ must be at a higher level than $\mathrm{lca}(x, y) = q$. Since $\{y, z\}$ and $\{w, z\}$ are incompatibles, so must be $\{x, y, z\}$ and $\{w, x, z\}$ (see Figure 4.5(a)), of which $\{w, x\}$, $\{x, y\}$, $\{y, z\}$ and $\{w, z\}$ are subsets.
Case 2 $\{x, z\}$ is not an incompatible (see Figure 4.5(b)). Since $\{x, z\}$ is not an incompatible, $\mathrm{lca}(x, y) = q$ must be at a higher level than $\mathrm{lca}(y, z) = p$. Since $\{w, x\}$ and $\{w, z\}$ are incompatibles, so must be $\{w, y, z\}$ and $\{w, x, y\}$ (see Figure 4.5(b)), of which $\{w, x\}$, $\{x, y\}$, $\{y, z\}$ and $\{w, z\}$ are subsets.
Theorem 4.6 Every set of four communications is width partitionable.

Proof: Consider a communication set, $C = \{(x, x'), (y, y'), (z, z'), (w, w')\}$, that has four communications. Recall that an overlap must exist for the set to be not width partitionable (Theorem 3.8). We have several cases. Subcases within a case are appropriately indented.
Case 1 If $C$ has width 4, then four steps suffice for scheduling (schedule one communication at each step).
Case 2 If $C$ has width 3, then there exists a 3-element source incompatible and/or a 3-element destination incompatible. We have the following cases.


Figure 4.6: Source incompatibles for Subcase 2.3.1 of Theorem 4.6

Figure 4.7: Incompatibility graph for Subcase 2.3.2 of Theorem 4.6
Subcase 2.1 Suppose there exists a 3-element source incompatible that overlaps
with another maximal incompatible. This implies that C has at most two
source incompatibles, and by Lemma 4.3, C is width partitionable. The
argument is similar if there exists a size 3 destination incompatible that
overlaps with another maximal destination incompatible.
Subcase 2.2 Suppose there is a 3-element destination incompatible that does not overlap with any maximal incompatible. Assume that there are overlapping source incompatibles of size at most 2; if this is not true, then Theorem 3.8 suffices to complete the proof. There are two cases.
Subcase 2.3.1 Let all sources be included in some overlapping incompatible (see Figure 4.6). If this is the situation at the source side, then, by Lemma 4.1, the destination side cannot have disjoint 3-element destination incompatibles, as assumed.
Subcase 2.3.2 Let one source be not included in an overlapping incompatible. Figure 4.7 shows the only possible case (within isomorphism and duality). If the communication shown in bold in the figure is scheduled in the first step, then the remaining communications form a set of width 2 and the incompatibles are disjoint. Therefore, set $C$ is width partitionable.
The argument is similar if there exists a 3-element source incompatible.

Figure 4.8: Possibilities for Subcase 3.1 of Theorem 4.6

Case 3 If C has width 2, then we have the following cases.
Subcase 3.1 Let all sources be included in some overlapping incompatible (see Figure 4.6). Figure 4.8 shows all possible cases for the incompatibility graph (within isomorphism and duality). If the communications shown in bold in the figure (for all cases) are scheduled in the first step, then the remaining communications form a set of width 1. Therefore, set $C$ is width partitionable.
Subcase 3.2 Let one source be not included in an overlapping incompatible. Figure 4.9 shows all possible cases here (within isomorphism and duality). If the communications shown in bold in the figure (for all cases) are scheduled in the first step, then the remaining communications form a width-1 set. Therefore, set $C$ is width partitionable.

Figure 4.9: Possibilities for Subcase 3.2 of Theorem 4.6; the "uncircled" destinations of the first two graphs could be together or separate.

4.1.1.3 Number of Incompatibles in a Simplest Set

Here we examine the minimum number of source and destination incompatibles in
a simplest set that is not width partitionable. First we develop some intermediate
results.
Lemma 4.7 Let communication set C have only two disjoint source incompatibles
I1 and I2 . Let G1 and G2 be the sets of destinations corresponding to sources in
incompatibles I1 and I2 , respectively. Then each destination of C could be in at most
two maximal destination incompatibles.

Proof: Suppose, for contradiction, that destination $x'$ is in three distinct maximal destination incompatibles $J_1$, $J_2$ and $J_3$. Since $x' \in J_1 \cap J_2$ and $J_1 \neq J_2$, there are destinations $y' \in J_1 - J_2$ and $z' \in J_2 - J_1$. Let $(y, y'), (z, z') \in C$. By the dual of Lemma 4.1, $y$ and $z$ are in different source incompatibles. Without loss of generality, let $y \in I_1$ and $z \in I_2$. Recall that $x' \in J_3$. We now consider two cases.
Case 1 Suppose there is a destination $w' \in J_3$ such that $w' \notin J_1 \cup J_2$. Let $(w, w') \in C$ and without loss of generality let $w \in I_1$. The fact that $w' \in J_3 - J_1$ and $y' \in J_1 - J_3$ while $w, y \in I_1$ is a contradiction of Lemma 4.1.
Case 2 Suppose there is no destination $w' \in J_3$ such that $w' \notin J_1 \cup J_2$. That is, either $w' \notin J_1$ or $w' \notin J_2$. Without loss of generality, let $w' \notin J_2$. Therefore we have the situation in Figure 4.10. By Lemma 4.2, this situation cannot happen; that is, $w$ and $x$ cannot be in $I_1$ and $I_2$, respectively.

Figure 4.10: Illustration of the proof of Lemma 4.7
Now we consider a special case of the communication set $C$ of Lemma 4.7, which has two disjoint source incompatibles. This represents one case in the proof of Theorem 4.11. For some communication $(x, x') \in C$, let $C' = C - \{(x, x')\}$. In the next two lemmas we prove that if $C'$ is width partitionable, then $C$ is also width partitionable. We consider two cases. The first (Lemma 4.8) considers the situation where $x'$ is in only one maximal destination incompatible. The second case (Lemma 4.9) considers the situation where $x'$ is in two maximal destination incompatibles. Recall that Lemma 4.7 has established that $x'$ cannot be in more than two maximal incompatibles. We now consider the first situation.
Lemma 4.8 Let communication set $C$ have two disjoint source incompatibles. For some communication $(x, x') \in C$, let $C' = C - \{(x, x')\}$. Let $x'$ be in only one maximal destination incompatible. If $C'$ is width partitionable, then $C$ is width partitionable.

Figure 4.11: Illustration of the proof of Case 1 of Lemma 4.8
Proof: Let the widths of $C$ and $C'$ be $w$ and $w'$, respectively. Clearly $w = w'$ or $w = w' + 1$. Note that $C'$ is width partitionable. If $w = w' + 1$, then $C$ is width partitionable; simply schedule $(x, x')$ after all communications of $C'$. If $w = w'$, then we only need to show the existence of a width-1 set $C_1 \subset C$ such that $C - C_1$ has width $w - 1$ (set $C_1$ has only two communications, one from each source incompatible of $C$). Let $I_1$ and $I_2$ be the two source incompatibles of $C$. Let $G_1$ and $G_2$ be the sets of destinations corresponding to sources in incompatibles $I_1$ and $I_2$ respectively. We have the following cases.
Case 1 Let $x'$ be in only one maximal incompatible, $J$, all of whose elements are from $G_2$ (see Figure 4.11). Here choose $C_1$ as the communications of any step of the schedule of $C'$ that contains a communication $(y, y')$ with source $y \in I_2 - \{x\}$. Since $C_1$ is a step of the schedule for $C'$, its width is 1. Since $|J| \leq |G_2| = |I_2|$, incompatible $J$ can be maximum only if $I_2$ is. Therefore scheduling communication $(y, y')$ with source $y \in I_2$ suffices to guarantee that destination incompatibles are taken care of.

Figure 4.12: Illustration of the proof of Case 2 of Lemma 4.8
Case 2 Let $x'$ be in only one maximal incompatible, $J$, all of whose elements (except $x'$) are from $G_1$ (see Figure 4.12). Here by Lemma 4.1, no destination in $J - \{x'\}$ is in the same incompatible as any destination in $G_2 - \{x'\}$ or any destination in $G_1 - J$. Choose $C_1$ as the communications of any step of the schedule of $C'$ that includes a communication $(y, y')$ such that $y \in I_1$ and $y' \in J$. If this step does not contain any communication with source in $I_2$, then add any one (other than $(x, x')$).
Case 3 Let $x'$ be in only one incompatible, $J$, whose elements are from both $G_1$ and $G_2$ (see Figure 4.13). Choose $C_1$ as any step of the schedule of $C'$ that contains a communication $(y, y') \neq (x, x')$ such that $y \in I_2$ and $y' \in J$. As before, this guarantees that $C - C_1$ has width one less than $C$.
We now consider the situation where $x'$ is in two destination incompatibles.
Lemma 4.9 Let a communication set, $C$, have two disjoint source incompatibles. For some communication $(x, x') \in C$, let $C' = C - \{(x, x')\}$. Let $x'$ be in two destination incompatibles. If $C'$ is width partitionable, then $C$ is width partitionable.

Proof outline: Let $I_1$ and $I_2$ be the two source incompatibles of $C$. Let $G_1$ and $G_2$ be the sets of destinations corresponding to sources in incompatibles $I_1$ and $I_2$, respectively. Since the main arguments here mirror those of Lemma 4.8, we only outline the proof. Again let the widths of $C$ and $C'$ be $w$ and $w'$, respectively. As before in the proof of Lemma 4.8, if $w = w' + 1$, then $C$ is width partitionable; therefore, let $w = w'$. Once again, we only need to show the existence of a width-1 set $C_1 \subset C$ such that $C - C_1$ has width $w - 1$. Let $x'$ be in two overlapping incompatibles, $J_1$ and $J_2$; that is, $x' \in J_1 \cap J_2$ (see Figure 4.14). Choose $C_1$ as follows; the argument for why this choice works is as in Lemma 4.8.
Case 1 If there exists a communication $(y, y') \neq (x, x')$ such that $y \in I_2$, $y' \in J_1 \cap J_2$, then choose $C_1$ as the set of communications of any step of the schedule of $C'$ that contains the communication $(y, y')$.
Case 2 If there exists a communication $(y, y')$ such that $y \in I_1$, $y' \in J_1 \cap J_2$, then choose $C_1$ as the communications of any step of the schedule of $C'$ that contains the communication $(y, y')$. If this step does not contain any communication with source in $I_2$, then add a communication $(z, z')$ such that $z \in I_2 - \{x\}$.
Case 3 If Case 1 and Case 2 do not apply, then choose $C_1$ as the communications of any step of the schedule of $C'$ that includes a communication $(y, y')$ such that $y \in I_1$ and $y' \in J_1$. If this step does not contain any communication with source in $I_2$, then add one (other than $(x, x')$).

Figure 4.13: Illustration of the proof of Case 3 of Lemma 4.8

Figure 4.14: Illustration of the proof of Lemma 4.9

Now we show that a communication set must have at least three source incompatibles and at least three destination incompatibles to be not width partitionable.
Theorem 4.10 Every communication set with fewer than three source incompatibles or fewer than three destination incompatibles is width partitionable.

Proof: We consider several cases.
Case 1 If the communication set C has only one source incompatible or only one
destination incompatible, then it is easy to see that C is width partitionable
(schedule one communication at each step).
Case 2 If C has two source incompatibles and two or more destination incompatibles,
then we have the following cases.
Subcase 2.1 If the two source incompatibles overlap, then by Lemma 4.3, C is
width partitionable.

Subcase 2.2 If the two source incompatibles are disjoint, then we have the
following cases.
Subcase 2.3.1 If the destination incompatibles are disjoint, then C is
width partitionable (Theorem 3.8).
Subcase 2.3.2 If the destination incompatibles overlap, then we proceed by induction on the number of communications, $n \geq 3$, in the communication set $C$. If $n = 3$, then by Lemma 4.4, $C$ is width partitionable. Assume the claim holds for any set with $n$ communications and consider an $(n + 1)$-element communication set, $C$, with two disjoint source incompatibles. Let $C$ have width $w$. For some communication $(x, x') \in C$, let $C' = C - \{(x, x')\}$. Clearly, $|C'| = n$ and by the induction hypothesis $C'$ is width partitionable.
The destination $x'$ could be in one destination incompatible, or in two overlapping destination incompatibles. By Lemma 4.7, it cannot be in three or more incompatibles. If $x'$ is in only one destination incompatible, then by Lemma 4.8, $C$ is width partitionable. If $x'$ is in two overlapping destination incompatibles, then by Lemma 4.9, $C$ is width partitionable.
Theorem 4.11 Let C be a communication set that is not width partitionable. The
following statements hold.

(i) The width of $C$ is at least two.
(ii) $C$ has at least five communications.
(iii) $C$ has at least three source incompatibles and at least three destination incompatibles.

Proof: Clearly, a communication set with width 1 is width partitionable since the set has disjoint incompatibles; this proves part (i). For part (ii), Lemma 4.4 and Theorem 4.6 show that the number of communications must be at least five for the set $C$ to be not width partitionable. For part (iii), Theorem 4.10 shows that if the number of source (or destination) incompatibles is less than three, then the set $C$ is width partitionable.


Figure 4.15: Relationship between source incompatibles for a simplest set

Figure 4.16: Relationships between disjoint incompatibles of a simplest set
4.1.2 Choices of the Simplest Sets
In this section we show that there are only two sets (to within isomorphism and source/destination duality) that satisfy the requirements of a simplest set in Theorem 4.11. Without loss of generality assume that source incompatibles overlap. (For a communication set to be not width partitionable, an overlap between source incompatibles and/or destination incompatibles must exist.) Destination incompatibles may or may not overlap. We now examine the relationships between source incompatibles and between destination incompatibles. Recall that the simplest set that is not width partitionable has at least five communications, a width of two, three source incompatibles, and three destination incompatibles.

Relationship between Source Incompatibles: Within the given constraints, the only possibility for source incompatibles is as shown in Figure 4.15.

Relationships between Destination Incompatibles: Destination incompatibles may or may not overlap. If destination incompatibles do not overlap, then the only possibility between destination incompatibles, while satisfying the simplest set conditions, is as shown in Figure 4.16. If destination incompatibles overlap, then by duality the only possibility is as shown in Figure 4.15.

The Simplest Set that is Not Width Partitionable: From the above discussion, the simplest set that is not width partitionable can have one of two general forms shown in Figure 4.17. At this stage we have not yet paired sources and destinations. The dashed lines indicate this situation. We now prove that only two pairings are possible (to within isomorphism and duality) so that the communication set is not width partitionable. Let the sources be $a, b, c, d, e$ and the destinations be $u, v, w, x, y$ (see Figure 4.17).

Figure 4.17: The two forms of a smallest set

By Lemma 4.1 and its dual, if two source (resp., destination) incompatibles overlap, then their exclusive elements (elements that are not common to both) must have destinations (resp., sources) in disjoint destination (resp., source) incompatibles. Also note that for any schedule of a communication set, every communication must be scheduled at some step. There is no loss of generality in assuming that a communication of our choosing is scheduled in the first step. We use this fact in ascertaining whether or not a set of communications is width partitionable. We have the following cases.
Case 1 Here we consider the case where destination incompatibles are disjoint (see
Figure 4.17(a)). We have two subcases.
Subcase 1.1 If $(b, y)$ is a communication, then Figure 4.18(a) shows the only possible mapping between sources and destinations. By Lemma 4.1, sources $a$ and $c$ cannot have destinations in the same incompatible. Without loss of generality, let $(a, v)$ and $(c, x)$ be communications. This implies that $(e, u)$ and $(d, w)$ (or $(e, w)$ and $(d, u)$) must be communications. As explained for Figure 3.6 (page 36), the communication set of Figure 4.18(a) is not width partitionable.
Subcase 1.2 Suppose $(b, y)$ is not a communication. Without loss of generality, let $(b, w)$ be a communication (see Figure 4.18(b)). By Lemma 4.1, a communication must exist between a source in incompatible $\{e, d\}$ and a destination in incompatible $\{u, v\}$ (say communication $(e, u)$). The communication set $C$ of the incompatibility graph of Figure 4.18(b) is width partitionable because communications $(b, w)$ and $(e, u)$ can be scheduled at the same step. Since $C - \{(b, w), (e, u)\}$ has width 1, it follows that $C$ can be scheduled in two steps.

Figure 4.18: Simplest sets with disjoint destination incompatibles

Case 2 Here we consider the case where destination incompatibles overlap (see Figure 4.17(b)). We have three subcases. The first subcase considers the situation where the overlapped source, $b$, is mapped to the overlapped destination $w$. In the second subcase the overlapped source, $b$, is mapped to a destination in incompatible $\{u, v\}$, and a source in the incompatible $\{e, d\}$ is mapped to the overlapped destination $w$. The third subcase examines the situation where $b$ and $w$ are mapped to elements that belong to an overlapping incompatible. The case where $b$ is mapped to either $y$ or $x$ and $w$ is mapped to an element in incompatible $\{e, d\}$ is not possible by Lemma 4.1.

Figure 4.19: Simplest sets with overlapping destination incompatibles
Subcase 2.1 If $(b, w)$ is a communication, then by Lemma 4.1 there must be a communication (say $(d, v)$) between source incompatible $\{e, d\}$ and destination incompatible $\{u, v\}$ (see Figure 4.19(a)). As in Subcase 1.2, $C - \{(b, w), (d, v)\}$ has width 1, so $C$ is width partitionable.
Subcase 2.2 Suppose that source $b$ corresponds to a destination in incompatible $\{u, v\}$ (say communication $(b, v)$), and that a communication exists (say communication $(d, w)$) between a source in $\{e, d\}$ and destination $w$ (see Figure 4.19(b)). Again, communication set $C$ of the incompatibility graph of Figure 4.19(b) is width partitionable: since $C - \{(b, v), (d, w)\}$ has width 1, it follows that $C$ can be scheduled in two steps.
Subcase 2.3 If $(b, y)$ (or $(b, x)$) is a communication and $(c, w)$ (or $(a, w)$) is a communication, then the communication set $C$ of the incompatibility graph of Figure 4.19(c) is not width partitionable because the only communication that can be scheduled simultaneously with $(b, y)$ is either $(e, u)$ or $(d, x)$. Since $C - \{(b, y), (e, u)\}$ or $C - \{(b, y), (d, x)\}$ has width 2, it follows that $C$ cannot be scheduled in two steps.
In summary, Subcase 1.1 and Subcase 2.3 are the only possibilities for the simplest sets that are not width partitionable.
Theorem 4.12 The simplest set of communications that is not width partitionable has an incompatibility graph whose form is (to within isomorphism and duality) one of the two shown in Figure 4.18(a) and Figure 4.19(c).

Figure 4.20: An $N$-extension of an incompatibility graph
Call the communication sets corresponding to these two graphs the basic simplest
sets.
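
The claim for Figure 4.18(a) is small enough to check mechanically. The following Python sketch is illustrative only and is not part of the original analysis. It encodes the five communications named in Subcase 1.1 and assumes, as a simplification of the Chapter 3 definitions, that a step is valid exactly when no two of its communications share a source incompatible or a destination incompatible. The helper names `conflicts`, `width` and `min_steps` are hypothetical.

```python
# Communications of Figure 4.18(a) as (source, destination) pairs (Subcase 1.1).
COMMS = [("a", "v"), ("b", "y"), ("c", "x"), ("d", "w"), ("e", "u")]
# Incompatibles as described in the text for the disjoint-destination form.
SOURCE_INCOMPATIBLES = [{"a", "b"}, {"b", "c"}, {"d", "e"}]
DEST_INCOMPATIBLES = [{"u", "v"}, {"y"}, {"w", "x"}]


def conflicts(c1, c2):
    """Assumed rule: two communications clash if their sources share a source
    incompatible or their destinations share a destination incompatible."""
    return (any(c1[0] in s and c2[0] in s for s in SOURCE_INCOMPATIBLES)
            or any(c1[1] in d and c2[1] in d for d in DEST_INCOMPATIBLES))


def width(comms):
    """Size of the largest incompatible, counted in communications."""
    src = max(sum(c[0] in s for c in comms) for s in SOURCE_INCOMPATIBLES)
    dst = max(sum(c[1] in d for c in comms) for d in DEST_INCOMPATIBLES)
    return max(src, dst)


def min_steps(comms):
    """Smallest number of conflict-free steps, found by backtracking."""
    def fits(k, step_of, i):
        if i == len(comms):
            return True
        for step in range(k):
            clash = any(step_of[j] == step and conflicts(comms[i], comms[j])
                        for j in range(i))
            if not clash:
                step_of[i] = step
                if fits(k, step_of, i + 1):
                    return True
        return False

    k = 1
    while not fits(k, [None] * len(comms), 0):
        k += 1
    return k


print(width(COMMS), min_steps(COMMS))
```

Under these assumptions the conflicts form a cycle on the five communications, so `width(COMMS)` is 2 while `min_steps(COMMS)` is 3, in agreement with the statement that this set needs one step beyond its width.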

4.2 A Bound on the Number of Extra Steps
The width-2 communication set of Figure 4.18(a) requires three steps for scheduling on the CST. In other words, it requires one extra "step" beyond its width. In this section we prove that for any $w \geq 2$, there exists a communication set of width $w$ that requires $\lceil w/4 \rceil$ extra steps.
Consider the incompatibility graph of Figure 4.20. It is identical to the graph of Figure 4.18(a) except that each communication is replaced with a group of $N$ communications. If $C$ and $C(N)$ denote the communication sets corresponding to the graphs in Figures 4.18(a) and 4.20 respectively, then $C(N)$ is called an $N$-extension of $C$.
Theorem 4.13 An $N$-extension of a basic simplest graph has width $w = 2N$ and requires a schedule of $w + \lceil w/4 \rceil$ steps on the CST.


Proof: We prove the theorem for an $N$-extension of the graph of Figure 4.18(a). The proof for the other basic simplest graph (Figure 4.19(c)) is similar. Let $C_1$ be the communications corresponding to the overlapped sources of $C(N)$. Any schedule of $C(N)$ must have each communication of $C_1$ in different steps. Without loss of generality, assume that the first $N$ steps of this schedule include communications of $C_1$. During these steps, let the schedule include $\alpha$ communications with destinations in $X$ and $\beta$ communications with destinations in $Y$ (see Figure 4.20). Clearly, their sources must be from $Z$. Therefore, any given step can include a communication with a destination in $X$ or with a destination in $Y$ (but not both).
Thus $0 \leq \alpha + \beta \leq N$. Without loss of generality, let $\alpha \geq \beta$. Then at the end of the first $N$ steps, we have the following situation. Incompatibles $U$ and $V$ have $N$ sources each. Incompatibles $X$ and $Y$ have $2N - \alpha$ and $2N - \beta$ destinations respectively. Since $\alpha \geq \beta$, $2N - \alpha \leq 2N - \beta$ and since $2N - \beta \geq N$, $Y$ is a maximum incompatible. Thus communication set $C(N)$ needs at least $N + 2N - \beta = 3N - \beta$ steps. The maximum value of $\beta$ minimizes the number of steps in the schedule. This maximum value is $\beta = \lfloor N/2 \rfloor$ and therefore the minimum number of steps is $3N - \lfloor N/2 \rfloor = 2N + \lceil N/2 \rceil = w + \lceil w/4 \rceil$.
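
The count in this proof can be sanity-checked for small $N$. The Python sketch below is an illustration under stated assumptions, not the dissertation's method: it treats the $N$-extension of Figure 4.18(a) as a conflict graph in which the five groups conflict cyclically (a, b, c, d, e, back to a) and all members of a group conflict with each other, and it finds the smallest schedule by exhaustive search. The function names are hypothetical.

```python
def predicted_steps(n):
    """Step count from the proof: 3N - floor(N/2)."""
    return 3 * n - n // 2


def width_plus_extra(n):
    """The same quantity written as w + ceil(w/4) with w = 2N."""
    w = 2 * n
    return w + -(-w // 4)


def extension_vertices(n):
    """One vertex (group, index) per communication; groups 0..4 stand for the
    five communications of Figure 4.18(a) in cyclic conflict order."""
    return [(g, i) for g in range(5) for i in range(n)]


def adjacent(u, v):
    """Assumed conflict rule: same group, or cyclically neighbouring groups."""
    return u != v and (u[0] - v[0]) % 5 in (0, 1, 4)


def min_steps(verts, lower):
    """Smallest k >= lower for which a conflict-free k-step schedule exists."""
    def place(i, step_of, k, used):
        if i == len(verts):
            return True
        # Symmetry breaking: open at most one previously unused step.
        for s in range(min(used + 1, k)):
            if all(step_of[j] != s or not adjacent(verts[i], verts[j])
                   for j in range(i)):
                step_of[i] = s
                if place(i + 1, step_of, k, max(used, s + 1)):
                    return True
        return False

    k = lower
    while not place(0, [None] * len(verts), k, 0):
        k += 1
    return k


for n in (1, 2):
    print(n, min_steps(extension_vertices(n), 2 * n), predicted_steps(n))
```

The two closed forms agree for every $N$ because $N - \lfloor N/2 \rfloor = \lceil N/2 \rceil = \lceil w/4 \rceil$ when $w = 2N$. The exhaustive search is only practical for very small $N$.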

4.3 Non-Oriented, Well-Nested Sets
In Section 3.4 we proved that all oriented, well-nested communication sets are width partitionable. In this section we consider non-oriented, well-nested sets. If some communications of an oriented, well-nested set change orientation, we call the resulting set a non-oriented, well-nested set. In general, such sets are not width partitionable. Figure 4.21(a) and (b) show an example of a width-2, non-oriented, well-nested set that requires three steps on the CST. To see that three steps are required, note that the incompatibility graph of Figure 4.21(b) is the same as that in Figure 4.1.
Even though non-oriented well-nested sets are not width partitionable in general, we identify a class of non-oriented well-nested sets that are.
Define a level-1 oriented well-nested set to have one of the two forms shown in Figure 4.22. That is, every level-1 oriented well-nested set has a condition expression that uses only cap (Figure 4.22(a)) or only concat (Figure 4.22(b)). Call a level-1 set that uses only cap (resp., only concat) a level-1 cap set (resp., level-1 concat set).

Figure 4.21: Width-2, non-oriented monotonic set requiring three steps. (a) Communications on the CST. (b) Incompatibility graph. In part (b), the incompatibility graph has been drawn differently to show the communications clearly.
A level-2 oriented well-nested set has the forms shown in Figure 4.23. That is, a level-2 set is either a concatenation of several level-1 cap sets or repeated capping of a level-1 concat set. All oriented sets other than the level-1 or level-2 sets described above are said to have level $\geq 3$.
Let $C$ be a non-oriented set. Construct the oriented counterpart $\tilde{C}$ of $C$ by replacing each communication $(x, x') \in C$ such that $x > x'$ by communication $(x', x)$. That is, $\tilde{C}$ contains the same communicating pairs as $C$, except that all sources are to the left of their destinations. For example, Figure 4.24(b) shows the oriented counterpart of the communication set of Figure 4.24(a).
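
The construction of the oriented counterpart can be written down directly. The sketch below is illustrative and assumes that processing elements are identified by integer positions increasing from left to right; the function name `oriented_counterpart` is hypothetical.

```python
def oriented_counterpart(comms):
    """Replace every (x, x') with x > x' by (x', x), so that each source
    lies to the left of its destination. Positions are assumed to be
    integers that increase from left to right."""
    return [(min(x, xp), max(x, xp)) for (x, xp) in comms]


# A non-oriented set: the second communication runs right to left.
print(oriented_counterpart([(1, 4), (6, 2)]))
```

Communications that already run left to right are returned unchanged, so applying the function twice gives the same result as applying it once.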
The level (1, 2, or $\geq 3$) of a non-oriented well-nested set is the same as that of its oriented counterpart. The level of a non-oriented, well-nested set appears to be an important factor in determining its width partitionability.

Figure 4.22: Level-1 oriented well nested sets

Figure 4.23: Level-2 oriented well nested sets


Lemma 4.14 Let $C$ be a level-2 oriented well-nested communication set formed by repeated capping of a level-1 concat set. Then every optimal schedule of $C$ can schedule all communications of the level-1 concat set in the same step.

Proof: Since $C$ is oriented, its optimal schedule has $w$ steps (where $w$ is the width of $C$). The only possible form for $C$ is shown in Figure 4.25. Let $C' = \{c_1, c_2, \ldots, c_k\}$ be the level-1 concat set and let $\gamma_1, \gamma_2, \ldots, \gamma_q$ be the capping communications in the order of their proximity to elements of $C'$ (see Figure 4.25). Let $w_1, w_2, \ldots, w_k$ be the widths of the incompatibles that include $c_1, c_2, \ldots, c_k$, respectively. By Lemma 3.12 (page 47), the source incompatibles are all different (that is, $\{c_i, c_j\}$ is not an incompatible for any $1 \leq i < j \leq k$). Observe that by Lemma 3.14 (page 49), for any $1 \leq i \leq k$, if $c_i$ is incompatible with $\gamma_h$ ($1 \leq h \leq q$), then $c_i$ is also incompatible with every $\gamma_g$ (where $1 \leq g \leq h$). Let $w_m = \max(w_1, w_2, \ldots, w_k)$ and consider the communication set $C'' = C - C' + c_m$ (see Figure 4.26 where the communications of $C''$ are in bold). Set $C''$ is an oriented well-nested set and therefore is width partitionable. Consider the step $s$ (say) in the optimal schedule of $C''$ that includes communication $c_m$. In step $s$, none of the $w_m - 1$ communications that are incompatible with $c_m$ are scheduled. This also implies that none of the communications that are incompatible with any of the $c_i$'s ($1 \leq i \leq k$) is scheduled in step $s$ (from our earlier observation based on Lemma 3.12). Thus, all communications in $C'$ can be scheduled in the step $s$.

Figure 4.24: Unoriented set and its oriented counterpart

Figure 4.25: Illustration of the proof of Lemma 4.14
Now we return to non-oriented sets.
Theorem 4.15 Every level-1 or level-2 non-oriented, well-nested communication set
is width partitionable.

Proof: First we consider level-1 non-oriented, well-nested sets (see Figure 4.27(a)
and (b)). Figure 4.27(a) uses only the cap condition and Figure 4.27(b) uses only the
concat condition.
Consider any level-1 cap set $C$. This set can be partitioned into two oriented sets $C'$, $C''$ (one in each direction). Let $w'$ and $w''$ be their widths. Clearly, the width of $C$ must be at least $\max(w', w'')$. No $c' \in C'$ is incompatible with any $c'' \in C''$. To see this, let $c' = (x, x')$ and $c'' = (y, y')$. Without loss of generality, let $x < y' < y < x'$. Thus, $\ell(x, y), \ell(x', y') \geq \ell(y, y')$, and hence $\{x, y\}$ and $\{x', y'\}$ are not incompatible. Thus, the communications of $C'$ and $C''$ are not restricted in any way by each other (only by themselves). Since $C'$ and $C''$ are width partitionable (Theorem 3.15), so is $C$.

Figure 4.26: The communication set $C''$

Figure 4.27: Level-1, non-oriented well-nested sets

A level-1 concat set can only be of width 1 or 2. If the width is 1, then no communication is incompatible with another; schedule in one step. If the width is 2, partition the set into two oriented width-1 sets and schedule the two directions in two steps.
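
The rule for a level-1 concat set amounts to splitting the communications by direction. The following Python sketch is only an illustration: it assumes integer positions increasing from left to right and takes the width as an input rather than computing it from the tree, and the function name `schedule_level1_concat` is hypothetical.

```python
def schedule_level1_concat(comms, width):
    """Return a list of steps for a level-1 concat set.

    comms: (source, destination) pairs with integer positions (assumed).
    width: 1 or 2, supplied by the caller (computing it needs the tree
           definitions of Chapter 3, which are outside this sketch).
    """
    if width == 1:
        # No communication is incompatible with another: one step suffices.
        return [list(comms)]
    rightward = [(s, d) for (s, d) in comms if s < d]
    leftward = [(s, d) for (s, d) in comms if s > d]
    # One step per direction; drop a direction that happens to be empty.
    return [step for step in (rightward, leftward) if step]


print(schedule_level1_concat([(0, 1), (3, 2), (4, 5)], 2))
```

The number of steps returned never exceeds the width passed in, which is what width partitionability requires here.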
We now consider level-2 sets (see Figure 4.28). We consider two cases corresponding to Figure 4.28(a) and (b).
Case 1 Communication set $C$ has the form shown in Figure 4.28(a). Let $C$ have width $w$. We proceed by induction on the width of $C$. Clearly, a width-1 communication set is width partitionable. Assume the assertion to hold for a non-oriented set with width at most $w - 1$ and consider a set $C$ with width $w$. We only need to show the existence of a set $C_1 \subset C$ of width 1 such that $C - C_1$ is of width $w - 1$.
Let $C$ be the concatenation of level-1 cap sets $P_1, P_2, \ldots, P_x$, for some integer $x \geq 1$ (see Figure 4.28(a)). For $1 \leq i \leq x$, let $P_i = L_i \cup R_i$, where $L_i$ (resp., $R_i$) is the set of communications of $P_i$ oriented to the left (resp., to the right). Define $C_L = \bigcup_{i=1}^{x} L_i$ and $C_R = \bigcup_{i=1}^{x} R_i$. Individually, $L_1, L_2, \ldots, L_x$, $R_1, R_2, \ldots, R_x$ are width partitionable (Theorem 3.15). First we select elements of $C_L$ that will be in $C_1$ and then deal with $C_R$. Let $l_i$ be the outermost communication of $L_i$. Let $S(i)$ be the set of communications that are scheduled at the same step as the communication $l_i$. Let $S = \bigcup_{i=1}^{x} S(i)$. By Lemma 3.12, $S$ has width 1. Include $S$ in $C_1$. Clearly the width of $C_L$ has been reduced by 1. Some of the communications of $C_R$ (oriented to the right) may be incompatible with some communications of $S$, while others may not.
Let $r_i$ be the outermost communication of $R_i$ that is not incompatible with any communication of $S$. Let $T(i)$ be the set of communications of $R_i$ that are scheduled in the same step as $r_i$ (in the optimal schedule of $R_i$). If $T(i)$ contains any communications that are incompatible with a communication in $S$, then simply exclude them from $T(i)$. Let $T = \bigcup_{i=1}^{x} T(i)$.
Clearly, $T$ has width 1 and $C_R - T$ has width one less than $C_R$. Let $C_1 = S \cup T$. Clearly, $S \cup T$ has width 1. To see that $C - C_1$ has width $w - 1$ observe that the only communications excluded from the $T(i)$'s above are those incompatible with communications of $S$. These incompatibles are clearly represented by communications of $S$.

Figure 4.28: Level-2, non-oriented well nested sets

Figure 4.29: The set $C_R$
Case 2 Communication set $C$ has the form shown in Figure 4.28(b). Partition $C$ into two oriented communication sets $C_L = C_{L1} \cup C_{L2}$ and $C_R = C_{R1} \cup C_{R2}$ (see Figure 4.29), oriented towards the left and right, respectively. Each of $C_L$ and $C_R$ is width partitionable, and all communications in the level-1 concat set $C_{L2}$ or $C_{R2}$ can be scheduled in one step (Lemma 4.14).
Note that for any $c' \in C_{R1}$ and $c'' \in C_{L1}$, $c'$ and $c''$ are not incompatible (as described for level-1 sets). If no $c' \in C_{R2}$ and $c'' \in C_{L2}$ are incompatible, then the same can be said for the entire sets $C_L$ and $C_R$. Thus $C_L$ and $C_R$ can be scheduled together, and hence $C$ is width partitionable.
If there is some $c' \in C_{R2}$ and $c'' \in C_{L2}$ such that $c'$ and $c''$ are incompatible, then the width of the level-1 concat set, $C_{R2} \cup C_{L2}$, is 2. Schedule $C_L$ such that all the communications of $C_{L2}$ are in one step (Lemma 4.14) and call this step $s_L$. Similarly, schedule $C_R$ such that all the communications of $C_{R2}$ are in one step $s_R$. Now, schedule $C_L$ and $C_R$ together such that $s_L \neq s_R$; this is possible because permuting the steps of an optimal schedule gives an optimal schedule.
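
The final merging step of Case 2 can be sketched as follows. This is an illustration under assumptions, not the author's procedure: schedules are represented as lists of steps, the only cross-direction conflicts are assumed to be those between the two concat steps, and the function name `merge_schedules` is hypothetical.

```python
def merge_schedules(sched_l, s_l, sched_r, s_r):
    """Combine a schedule of C_L with a schedule of C_R.

    sched_l, sched_r: lists of steps, each step a list of communications.
    s_l, s_r: indices of the steps holding the level-1 concat communications.
    Returns a combined schedule in which those two steps do not coincide.
    Assumes the combined set has width at least 2, as in this case.
    """
    w = max(len(sched_l), len(sched_r), 2)
    # Pad both schedules with empty steps so they have the same length.
    left = [list(step) for step in sched_l] + [[] for _ in range(w - len(sched_l))]
    right = [list(step) for step in sched_r] + [[] for _ in range(w - len(sched_r))]
    if s_l == s_r:
        # Permute the steps of the right schedule: swap s_r with another slot.
        other = (s_r + 1) % w
        right[s_r], right[other] = right[other], right[s_r]
    return [l + r for l, r in zip(left, right)]


merged = merge_schedules([["l1"], ["l2", "l3"]], 1, [["r1"], ["r2", "r3"]], 1)
print(merged)
```

Swapping two steps of the right schedule leaves its length unchanged, which reflects the remark that permuting the steps of an optimal schedule gives an optimal schedule.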

Figure 4.30: Width-2, non-oriented, monotonic set requiring three steps. (a) Communications on CST. (b) Incompatibility graph. In part (b) the incompatibility graph has been drawn differently to show the communications clearly.


4.4 Non-Oriented, Monotonic Sets
The definition of the oriented, monotonic set (Section 3.5) requires the communications
to be directed from left to right or vice versa (oriented). If some of the communications
of an oriented, monotonic communication set change orientation, we call the resulting set
a non-oriented, monotonic set. In other words, a non-oriented monotonic set is a set that
has communications in different directions, but whose oriented counterpart is monotonic.
In general, a non-oriented monotonic communication set is not width partitionable
(see Figure 4.30(a) and (b)). The communication set of Figure 4.30(b) has the same
incompatibility graph as Figure 4.1 and hence it is not width partitionable. As in the
non-oriented, well-nested case, a non-oriented monotonic set is width partitionable
under some restrictions. Consider the monotonic set of Figure 4.31(a), where each source
is to the left of all destinations. We call such a set a separable set. A non-oriented
monotonic set is separable if its oriented counterpart is separable (see Figure 4.31(b)).

Figure 4.31: Separable monotonic sets

Figure 4.32: Illustration of the proof of Lemma 4.16
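The separability condition can be illustrated with a minimal sketch. It assumes (this is not stated in the text in this form) that leaves are identified by integer positions increasing from left to right, and that the oriented counterpart of a communication is obtained by taking its left endpoint as the source. The function name `is_separable` is hypothetical, and the sketch does not check monotonicity.

```python
def is_separable(comms):
    """Check the separability condition on a non-empty communication set.

    comms: list of (source, destination) leaf positions, in any orientation.
    Assumption: the oriented counterpart treats the left endpoint of each
    communication as its source and the right endpoint as its destination.
    """
    lefts = [min(s, d) for s, d in comms]    # sources of the oriented counterpart
    rights = [max(s, d) for s, d in comms]   # destinations of the oriented counterpart
    # Separable: every source lies to the left of every destination.
    return max(lefts) < min(rights)
```

For instance, under these assumptions the set {(0, 5), (6, 1), (2, 7)} passes the check, whereas {(0, 1), (2, 3)} does not, because source 2 lies to the right of destination 1.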
Lemma 4.16 Let $C$ be a separable, non-oriented, monotonic set. Let $C_L$ (resp., $C_R$)
be the set of communications in $C$ that are oriented to the left (resp., right). Let $c_1$
(resp., $c_2$) be the rightmost (resp., leftmost) communication of $C_L$ (resp., $C_R$). The
communication $c_1$ is incompatible with any communication in $C_R$ if and only if $c_1$ is
incompatible with $c_2$.

Proof: If $c_1$ and $c_2$ are incompatible, then obviously $c_1$ is incompatible with some
communication of $C_R$. We now proceed in the only if direction. Let $(x, x'), (y, y') \in C_R$
and $(w, w'), (v, v') \in C_L$. Let $c_1 = (w, w')$ and $c_2 = (x, x')$ (see Figure 4.32(a)).
Let $\mathrm{lca}(x, x') = p$ and $\mathrm{lca}(y', w') = m$. Suppose that $\{w', y'\}$ is incompatible but
$\{w', x'\}$ is not (see Figure 4.32(b)). This requires that $\mathrm{level}(p) < \mathrm{level}(m)$. Let $\alpha$ be
the leftmost node of the subtree rooted at $p$. It follows that $x \geq \alpha$ and $y < \alpha$, which
contradicts the fact that $x < y$.

Remarks: Similarly, the communication $c_2$ is incompatible with any communication
in $C_L$ if and only if $c_2$ is incompatible with $c_1$.
Intuitively, Lemma 4.16 shows that the destinations of the oriented sets $C_L$ and
$C_R$ are not incompatible unless the destination, $w'$, of the rightmost communication
of $C_L$, and the destination, $x'$, of the leftmost communication of $C_R$ are incompatible.
By the same argument, a similar assertion could be made about the possible
interaction between the sources of $C_L$ and $C_R$. Thus, we have the following result.
Lemma 4.17 Let $C$ be a separable, non-oriented, monotonic set. Let $C_L$ (resp., $C_R$)
be the set of communications in $C$ that are oriented to the left (resp., right). Let $c_3$
(resp., $c_4$) be the leftmost (resp., rightmost) communication of $C_L$ (resp., $C_R$). The
communication $c_3$ is incompatible with any communication in $C_R$ if and only if $c_3$ is
incompatible with $c_4$.

Remarks: Similarly, the communication $c_4$ is incompatible with any communication
in $C_L$ if and only if $c_4$ is incompatible with $c_3$.
Theorem 4.18 Let $C$ be a separable, non-oriented, monotonic set. Let $C_L$ (resp.,
$C_R$) be the set of communications in $C$ that are oriented to the left (resp., right). Let
$c_1$ (resp., $c_3$) be the rightmost (resp., leftmost) communication of $C_L$. Let $c_2$ (resp.,
$c_4$) be the leftmost (resp., rightmost) communication of $C_R$. If at most one of $\{c_1, c_2\}$
and $\{c_3, c_4\}$ is incompatible, then $C$ is width partitionable.

Proof: If neither $\{c_1, c_2\}$ nor $\{c_3, c_4\}$ is incompatible, then $C_L$ and $C_R$ can be
scheduled independently. Without loss of generality, let $c_1, c_2$ be incompatible (see
Figure 4.33). We proceed as in the proof of Theorem 4.15. Let step $s$ in the schedule
for $C_L$ include communication $c_1$. Let $S$ be the set of all communications in step $s$.
Let $c'$ be the leftmost communication of $C_R$ such that $c'$ and $c_1$ are not incompatible.
Clearly, $c_2 \neq c'$. Let $t$ be the step of the schedule of $C_R$ that includes $c'$. Let $T$ be
the set of all communications of $C_R$ that are scheduled in step $t$. Exclude from $T$
any communication that is incompatible with communications of $S$. If $C_1 = S \cup T$,
then $C_1$ has width 1 and $C - C_1$ has width $w - 1$, where $w$ is the width of $C$. The
reasoning for this assertion is the same as that in the proof of Theorem 4.15.

Figure 4.33: A separable monotonic communication set. Letters next to sources
represent the communication.

4.5 Concluding Remarks
In this chapter, we showed that any communication set that is not width partitionable
has a width of at least 2, at least five communications, at least three source
incompatibles, and at least three destination incompatibles. We presented a "simplest
set" that has exactly these minimum requirements. Further, we showed that these simplest
sets are the only ones possible (to within isomorphism). We showed that there exists
a width-$w$ set requiring $w + \lceil w/4 \rceil$ steps to be scheduled on the CST. We also showed
that while non-oriented, well-nested and monotonic sets are not width partitionable
in general, they are under some restrictions.

Chapter 5
Configuring the CST
The communication capability of the CST (see also Section 2.1) has been analyzed
in Chapters 3 and 4 and methods have been devised to schedule many interesting
communication classes. Such a schedule partitions the communications into several
"width-1" communication sets; all communications from a width-1 set can be
simultaneously accommodated on the CST. In this chapter we consider only width-1
communication sets. The ability of the CST to accommodate communications of a
width-1 set does not mean that it can actually establish in one step the dedicated
paths between communicating pairs. Here we address the issue of configuring the CST
to perform any width-1 communication set. In other words, we discuss how the CST
generates information to configure switches (at its internal nodes) to establish the
required paths. Henceforth, "configuring the CST" refers to configuring its switches
(see Section 2.1). Once the CST is configured, it is straightforward to perform the
communications of a width-1 set. In this chapter we only discuss the configuration,
with the understanding that the communications follow in a straightforward manner.
Before we proceed, we note some assumptions used in this chapter and the next.
As described in Section 2.1, each pair of communicating PEs of the CST uses a
shortest path through the tree. Thus, each path can traverse $O(\log N)$ switches
(where $N$ is the number of leaves in the CST). We assume these $O(\log N)$ switch
delays to be a basic time unit and allow a "step" to have $O(\log N)$ switch delays.
This assumption has been justified by Sidhu et al. [43] and independently by us in
Chapter 7.
In this chapter we first present a general approach for configuring the CST in
one step. The basic idea of configuring the switches is to somehow reflect the global
information of connections among PEs based on limited local knowledge. We show
that this can be accomplished in one step if the communications possess certain
properties. Next we show a class of communications called an "edge-exclusive set"
that possesses these properties. This implies that for an edge-exclusive set, the CST
can be configured in one step. Then, we present a method to decompose any width-1
communication set into at most three edge-exclusive sets. This, in effect, proves that
any width-1 set of communications can be performed on the CST in at most three
steps.
Considering that Chapters 3 and 4 provide means to convert a set of communication
requirements into a sequence of width-1 sets, the results of Chapters 3, 4 and
5 provide a comprehensive solution to communicating on the CST. In Chapter 6 we
apply our techniques to communications on a segmentable bus.
In the next section we outline a general technique for configuring the CST. In
Section 5.2 we detail our technique and show that it can be applied to the
edge-exclusive sets. Section 5.3 proves that every width-1 set can be decomposed into
three edge-exclusive sets. In Section 5.4 we summarize our results.

5.1 CST Configuration: A Broad Outline
In this section, we describe the CST switches, the information flow through them,
and outline an approach to a 1-step configuration of the CST to establish the paths
of any given width-1 communication set. This configuration approach is based on the
idea of Sidhu et al. [43]; we extend it to include multiple communications.
The key aspect of this approach is that it uses only information locally available to
PEs and knowledge of the pattern of communications to be performed to appropriately
configure the CST. For some communication patterns, such as the oriented well-nested
sets, the local information is the source/destination status of each PE. In other
communication patterns such as those of a segmentable bus, the local information
is knowledge of whether a PE is a writer and if it cuts the bus. In general, the
information needed to configure a CST switch could come from the local information
for any set of PEs. Our approach restricts this switch information to come from only
the leaves of the subtree rooted at the switch. (Though this appears quite restricted,
we show (Section 5.3) that our approach works for virtually all width-1 sets.) Thus,
our approach only requires that local information from the PEs (leaves of the CST)
be fanned-in through their ancestors. The tree structure of the CST provides an ideal
platform to accomplish this.
We now describe the approach in detail. Recall that each CST switch has a full-duplex
data link to its parent (if any) and two children (see Section 2.1). In addition to
the data links, there is a control line from each node to its parent. These control
lines are used to carry control symbols (holding local information) from a switch (or
leaf) to its parent. The CST switch has two main blocks: (a) the communication
unit (labeled C in Figure 5.1) and (b) the control unit. The communication unit
establishes data paths between the three data inputs and the three data outputs
of the switch. (Figure 2.2 shows a sample of data path configurations of a
switch.) The control unit accumulates information from the descendants of the switch
(through two control input lines) and generates (i) accumulated information to pass
on to its parent (if any) and (ii) information to select the data configuration of the
communication unit of the switch.

Figure 5.1: Internal Structure of the Switch
Putting these ideas together, we now describe the actions performed during a
"CST configuration cycle." The leaves use local information to generate control symbols
and send these symbols to their parents using control lines. Based on the symbols
received, the control logic (combinational logic) in each switch decides on the
appropriate data configuration for the switch and passes a symbol (control information) to
its parent. This way, the control symbols flow up through the tree, setting switches
level by level until they reach the root of the CST. This process involves information
flow through $O(\log N)$ switches and, as explained earlier, runs in one step. Note that
if each PE can obtain the required local information at run-time, then this procedure
also configures each switch at run-time.
Based on the discussion so far, our approach imposes the following constraints.
1. Each switch configures its communication unit based on the control symbols it
receives from its children.
2. Each switch can generate a control symbol (that captures all relevant information
from its descendants) to send to its parent.
Further, to ensure that each control unit is of constant size, we impose the following
additional restriction.
3. The CST configuration algorithm must use a constant number of control symbols.
It should be noted that different control logic may be required for different
communication classes. For example, a width-1 segmentable bus may need a different
control logic compared to an edge-exclusive set (described in Section 5.2).

5.2 Edge-Exclusive Communication Sets
In this section we show that it is possible to configure the CST in one step for an
important class of communications called edge-exclusive communications.

Figure 5.2: Edge-exclusive communication set

Figure 5.3: A communication set that is not edge-exclusive

Definition 5.1 A set $C$ of communications is edge exclusive if and only if no two
communications of $C$ use the same CST edge (even in opposite directions).

For example, Figure 5.2 shows an edge-exclusive set, whereas the width-1 communication set of Figure 5.3 is not edge exclusive because communications c1 and
c2 share a common edge. Recall that the width of a set C of communications is
the maximum number of communications requiring the use of any one directed edge.
Clearly, the width of an edge-exclusive set is 1.
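The distinction between width and edge exclusivity can be made concrete with a small sketch. It assumes a complete binary tree with `n_leaves` leaves (a power of two) stored in heap order (root 1, children of node u at 2u and 2u+1, leaf i at position `n_leaves + i`), and it identifies each tree edge by its child endpoint. These conventions and the function names are illustrative assumptions, not part of the CST definition.

```python
from collections import Counter

def path_edges(src, dst, n_leaves):
    """Edges (named by their child endpoint) on the shortest path src -> dst.

    Returns (up, down): edges traversed toward the root and away from it.
    Assumes src != dst and heap numbering of the tree nodes.
    """
    a, b = src + n_leaves, dst + n_leaves
    up, down = [], []
    while a != b:            # climb from both leaves until the paths meet
        up.append(a)
        down.append(b)
        a //= 2
        b //= 2
    return up, down

def width(comms, n_leaves):
    """Maximum number of communications using any one directed edge."""
    use = Counter()
    for s, d in comms:
        up, down = path_edges(s, d, n_leaves)
        use.update(("up", e) for e in up)
        use.update(("down", e) for e in down)
    return max(use.values(), default=0)

def is_edge_exclusive(comms, n_leaves):
    """True if no two communications share a tree edge in either direction."""
    use = Counter()
    for s, d in comms:
        up, down = path_edges(s, d, n_leaves)
        use.update(up + down)
    return all(count == 1 for count in use.values())
```

With eight leaves, the set {(0, 7), (4, 3)} has width 1 under these conventions but is not edge exclusive, since the two communications use the edges below the root in opposite directions.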
Intuitively, our approach works for an edge-exclusive set because of the following
reason. Control information flows up the CST until information from a source meets
information from a destination at their lowest common ancestor. The fact that the
communication set is edge exclusive guarantees that this source-destination pair is a
matching pair (see Lemma 5.1). We now detail a method to configure the CST in
one step for any edge-exclusive set.
Assume each PE to only know whether it is a source, destination or neither. If
a PE (leaf of CST) is a source (resp., destination), then it passes control symbol $s$
(resp., $d$) to its parent (a CST switch). If the PE is neither a source nor a destination,
it passes symbol $n$ to its parent. For a similar approach involving one source and
one destination (a single element communication set), Sidhu et al. [43] used 2-bit
quantities with $s = 01$, $d = 10$, and $n = 00$.
Each switch (internal node) receives control symbols (from the set $S = \{s, d, n\}$)
from its children. It uses these symbols to produce a symbol (again from the set
$S$) for its parent. Let $\mathcal{C}$ be the set of configurations of the communication unit of
a switch. Then each CST switch can be viewed as two functions, $f_s : S \times S \to S$
providing a symbol (see Figure 5.4) and $f_c : S \times S \to \mathcal{C}$ providing a configuration (see
Figure 5.5).

 f_s |  s   d   n
 ----+------------
  s  |  -   n   s
  d  |  n   -   d
  n  |  s   d   n

Figure 5.4: The function $f_s$ for edge-exclusive sets (a dash marks a pair of
symbols that cannot occur)
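A configuration cycle for an edge-exclusive set can be sketched as follows. The symbol table follows the behavior of $f_s$ described in this section ((s, n) gives s, (d, n) gives d, (n, n) and (s, d) give n, while (s, s) and (d, d) cannot occur). The configuration labels such as `"left->parent"` are hypothetical names introduced here for illustration only, since the actual configurations are given graphically in Figure 5.5; the heap numbering of the tree is likewise an assumption of the sketch.

```python
# Symbol sent to the parent for each (left, right) pair of received symbols.
F_S = {("s", "n"): "s", ("n", "s"): "s",
       ("d", "n"): "d", ("n", "d"): "d",
       ("n", "n"): "n", ("s", "d"): "n", ("d", "s"): "n"}

# Hypothetical labels for the data configuration selected by each pair.
F_C = {("s", "n"): "left->parent", ("n", "s"): "right->parent",
       ("d", "n"): "parent->left", ("n", "d"): "parent->right",
       ("s", "d"): "left->right", ("d", "s"): "right->left",
       ("n", "n"): "idle"}

def configure(leaf_symbols):
    """Fan symbols in from the leaves to the root, configuring each switch.

    leaf_symbols: list of 's', 'd' or 'n', one per PE (length a power of two).
    Returns (config, root_symbol), where config maps each internal node
    (heap numbering, root = 1) to its configuration label.
    """
    n_leaves = len(leaf_symbols)
    sym = [None] * n_leaves + list(leaf_symbols)   # sym[u] = symbol u sends up
    config = {}
    for u in range(n_leaves - 1, 0, -1):           # children before parents
        pair = (sym[2 * u], sym[2 * u + 1])
        if pair not in F_S:                        # (s, s) or (d, d)
            raise ValueError(f"switch {u} received {pair}: not edge exclusive")
        config[u] = F_C[pair]
        sym[u] = F_S[pair]
    return config, sym[1]
```

For four PEs with symbols s, n, n, d, the sketch configures the left switch as `"left->parent"`, the right switch as `"parent->right"`, and the root as `"left->right"`, and the root itself produces symbol n.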
Let $u$ be any node of the CST. Let $T_u$ denote the subtree rooted at $u$.
Lemma 5.1 For the algorithm explained above,
1. If $u$ is an internal node, then it cannot receive symbols $s$, $s$ from its children or
$d$, $d$ from its children.
2. If $u$ receives symbols $s$, $d$ from its children, then these symbols correspond to a
matching source-destination pair.
3. Node $u$ sends symbol $s$ to its parent if and only if there is a source in $T_u$ and
the corresponding destination is outside $T_u$.
4. Node $u$ sends symbol $d$ to its parent if and only if there is a destination in $T_u$
and the corresponding source is outside $T_u$.
5. Node $u$ sends symbol $n$ to its parent if and only if each source (if any) in $T_u$ has
its corresponding destination in $T_u$.
Proof: Let the CST have $2^n$ leaves. Its internal nodes are arranged in $n$ levels
numbered $1, 2, \ldots, n$ (with the root at level $n$). We proceed by induction on the level
$l$ ($1 \leq l \leq n$) of an internal node $u$ of the CST. If $l = 1$, then $u$ is a parent of two
leaves (PEs). Suppose that $u$ receives symbols $s$, $s$ from its children (see Figure 5.6).
Clearly both sources must use the link from $u$ to its parent to communicate with their
corresponding destinations. This is not possible in an edge-exclusive set. Similarly,
$u$ cannot receive symbols $d$, $d$ from its children. For part 2, if $u$ receives symbols $s$,
$d$ from its children, then they form a matching pair. This is because, if they do not
form a matching pair, then they have to use the link from $u$ to its parent (even though
in opposite directions), which does not meet the requirement of an edge-exclusive set.
For part 3, if $u$ sends symbol $s$ to its parent, then it must have received symbols $s$, $n$
from its children (leaves); see Figure 5.5. So part 3 holds for the base case. Part 4
holds similarly. For part 5, if $u$ sends symbol $n$ to its parent, then it must have
received symbols $n$, $n$ from its children or it must have received symbols $s$, $d$
from its children (see Figure 5.5).

Figure 5.5: The function $f_c$ for edge-exclusive sets (the entries for $s$, $s$ and $d$, $d$
are marked "not possible")

Figure 5.6: Illustration of the proof of Lemma 5.1

Now assume the lemma to hold for any node at level $l$ (where $1 \leq l < n$) and
consider node $u$ at level $l + 1$. Let $v$ and $w$ be children of $u$. Let $T_v$ and $T_w$ denote
the subtrees rooted at $v$ and $w$, respectively. Nodes $v$ and $w$ are at level $l$ and
the induction hypothesis applies to them. Suppose $u$ receives symbols $s$, $s$ from its
children (see Figure 5.7). By the induction hypothesis, $T_v$ and $T_w$ contain sources
whose destinations are outside $T_v$ and $T_w$. As before, both sources must use the link
from $u$ to its parent, and this is not possible for an edge-exclusive set. The only
difference between the proof for part 1 in the base case and the induction step is that
the induction hypothesis permits us to treat $v$ and $w$ as we did with the leaves in the
base case. The remaining parts also use the induction hypothesis and use the same
argument employed by the base case.
Theorem 5.2 If each CST switch is configured using functions $f_s$ and $f_c$, then the
CST establishes the paths corresponding to the communications of the given
edge-exclusive set.

Proof: Let $u$ be any internal node of the CST and let $T_u$ be the subtree rooted at $u$.
We consider three cases based on the symbol $u$ sends to its parent.
Case 1 Suppose node $u$ sends $s$ to its parent. By part 3 of Lemma 5.1, $T_u$ contains
a source $x$ such that its corresponding destination $x'$ is outside $T_u$. Consider
any internal node $v$ on the path from $x$ to $u$. Since $x$ is in $T_v$ and $x'$ is not, by
Lemma 5.1, part 3, $v$ sends an $s$ to its parent. Let $w$ and $z$ be the children of
$v$; at least one of them sends $s$ to $v$. Without loss of generality, let $w$ send $s$.
Clearly, $z$ does not send $s$ (Lemma 5.1, part 1). If $z$ sends $d$, then $v$ sends an $n$
to its parent, which contradicts our assumption. So $z$ sends $n$. Thus every node
in the path from $x$ to $u$ receives symbols $s$, $n$ (with the $s$ from the subtree
containing $x$). From Figure 5.5, entries $f_c(s, n)$ and $f_c(n, s)$, it is clear that the
CST establishes a physical path from $x$ to the parent of $u$.

Figure 5.7: Illustration of the proof of Lemma 5.1
Case 2 Suppose node $u$ sends symbol $d$ to its parent. This is the dual of Case 1. An
identical argument proves that there is a path in the CST from the parent of $u$
to the destination $x'$ in $T_u$.
Case 3 Suppose $u$ sends symbol $n$ to its parent. Let $w$ and $z$ be the children of $u$.
From Figure 5.4, two subcases are possible.
Subcase 3.1 Suppose both $w$ and $z$ send symbol $n$ to $u$. By part 5 of Lemma 5.1,
each source in $T_w$ or $T_z$ has its destination within the same tree; that is,
there are no sources or destinations in $T_w$ (or $T_z$) left to match. Node $u$,
therefore, correctly does nothing (see $f_c(n, n)$ in Figure 5.5).
Subcase 3.2 Suppose the children of $u$ send symbols $s$, $d$ to it. Without loss of
generality, let $w$ send $s$ and $z$ send $d$. By Lemma 5.1, parts 2, 3, and 4,
there is a source $x$ in $T_w$ that matches a destination $x'$ in $T_z$. By Case 1
and Case 2, the CST correctly connects $x$ to $u$ and sets up a path from
$u$ to $x'$. In Figure 5.5, $f_c(s, d)$ shows that node $u$ correctly connects these
paths.
Note that edge-exclusive sets are also width-1 sets of a CST using half duplex
links. Therefore, the results of this section may hold independent interest for such
CSTs.

5.3 Edge-Exclusive Decomposition
In this section we present an algorithm to partition any width-1 communication set $C$
into at most three sets $C_1, C_2, C_3$ such that each of $C_1, C_2, C_3$ is edge-exclusive.
We will call such a partitioning an edge-exclusive decomposition of $C$.
Since communications from any edge-exclusive set can be established on the CST in
one step, it follows that all communications from any width-1 communication set can
be established on the CST in at most three steps.
Broadly speaking, the algorithm assigns a color to each communication in the set
$C$ such that no two communications with the same color share an edge of the tree.
Clearly, all communications of any one color form an edge-exclusive set. We show
that $C$ can be colored with three colors.
Each communication of a width-1 set corresponds to a directed path between a
source-destination pair (leaves) of the CST. Thus coloring the set of communications
amounts to assigning colors to the directed edges of the CST. A correct coloring
must:
• assign different colors to directed CST edges with the same end points
(corresponding to the same undirected edge),
• assign the same color to all directed edges of a communication.
Figure 5.8: Incoming and outgoing edges of a switch
We will use a palette $\{1, 2, 3\}$ of three "colors" to color the communications.
Edges that do not correspond to any communication will be assigned "color" 0 to
indicate their status. We assume that the correspondence between communications
and directed edges of the CST is known a priori. That is, a switch can match each
incoming edge with an outgoing edge in accordance with a communication (if any).
For ease of explanation we will use the following notation to distinguish between the
incoming and outgoing edges of an internal node $u$ of the CST (see Figure 5.8).
• $p_o(u)$ is an outgoing edge from $u$ to the parent of $u$.
• $p_i(u)$ is an incoming edge from the parent of $u$ to $u$.
• $l_o(u)$ is an outgoing edge from $u$ to its left child.
• $l_i(u)$ is an incoming edge from the left child of $u$ to $u$.
• $r_o(u)$ is an outgoing edge from $u$ to its right child.
• $r_i(u)$ is an incoming edge from the right child of $u$ to $u$.

We now detail the steps of the procedure and establish its correctness. Consider
the procedure in Figure 5.9.

Procedure Color(T, u, x, y)
/* The procedure assigns a correct coloring to the subtree of T rooted at node u,
given that the incoming edge p_i(u) and outgoing edge p_o(u) have been assigned
colors x and y, respectively, where x, y ∈ {0, 1, 2, 3} */
begin
1. If u is a leaf then return
   /* let v and w be the left and the right children, respectively, of u */
2. Assign correct colors x1 to l_o(u), y1 to l_i(u), x2 to r_o(u), y2 to r_i(u)
3. Color(T, v, x1, y1)
4. Color(T, w, x2, y2)
end.

Figure 5.9: Edge-Exclusive Decomposition Procedure

Step 1: This step defines the termination of the recursion. If $u$ is a leaf, then its
only edges are to its parent; these have already been correctly colored.

Step 2: The assignment of colors in this step depends on $x$, $y$ and the correspondence
between incoming and outgoing edges of $u$ (as dictated by communications
traversing $u$). For example, if $x = 1$ and $y = 2$ and the correspondence of $u$ is as shown
in Figure 5.8, then clearly $l_o(u)$ has to be colored 1 and $r_i(u)$ has to be colored 2.
This also implies that $l_i(u)$ and $r_o(u)$ have to be colored 3.
In general no more than three communications can traverse switch $u$ (with three
incoming and three outgoing edges) and at most two of these three communications
can traverse the edge between $u$ and its parent. Therefore, the edges of the third
communication can always be correctly colored with a color from $\{1, 2, 3\}$. If an edge
does not correspond to a communication, simply assign 0 to it.

Steps 3 and 4: These steps respectively color the subtrees at the children of $u$,
given a correct coloring for the edges between them and $u$.
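The top-down coloring can be sketched in a few lines. This is an illustrative variant rather than a transcription of procedure Color: instead of recursing over switches, it colors each communication when its lowest common ancestor is reached, processing shallower ancestors first, and it avoids the colors of already-colored communications traversing that switch. It assumes a width-1 input, heap numbering of the tree (root 1, leaf i at `n_leaves + i`), and edges named by their child endpoint; all function names are hypothetical.

```python
def undirected_edges(src, dst, n_leaves):
    """Return (edges, lca): edges named by child endpoint, and the turn node."""
    a, b = src + n_leaves, dst + n_leaves
    edges = set()
    while a != b:
        edges.update((a, b))
        a //= 2
        b //= 2
    return edges, a

def edge_exclusive_decomposition(comms, n_leaves):
    """Assign a color in {1, 2, 3} to each communication of a width-1 set.

    Returns a dict mapping the index of each communication to its color.
    Communications sharing a tree edge (in either direction) get
    different colors.
    """
    lca, switches = [], []
    for s, d in comms:
        edges, top = undirected_edges(s, d, n_leaves)
        lca.append(top)
        switches.append({e // 2 for e in edges})   # switches traversed
    # Shallowest lowest common ancestor first, mimicking the top-down recursion.
    order = sorted(range(len(comms)), key=lambda i: lca[i].bit_length())
    color = {}
    for i in order:
        # At most two other communications traverse a switch in a width-1 set.
        taken = {color[j] for j in color if lca[i] in switches[j]}
        color[i] = min({1, 2, 3} - taken)
    return color
```

For eight leaves and the width-1 set {(0, 7), (4, 3), (2, 1)}, every pair of communications shares an edge, and the sketch assigns the three communications three different colors.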

Theorem 5.3 Every width-1 set of communications can be decomposed into at most
three edge-exclusive sets.

Proof: Call procedure Color(T, root, 0, 0), where root is the root of the CST $T$.
This colors the entire tree correctly in accordance with the given communication set
$C$. Partition $C$ into sets $C_1$, $C_2$ and $C_3$ such that for $i = 1, 2, 3$, $C_i$ contains only
those communications of $C$ whose edges in $T$ have been colored $i$. By virtue of the
two conditions that make a coloring correct (stated earlier in this section), $C_i$ is
edge-exclusive.
Figure 5.10 shows an example of decomposing a width-1 communication set into
three edge-exclusive sets (the sets are represented using solid, dashed, and dotted
lines).

Figure 5.10: Decomposition of width-1 communication set into edge-exclusive sets.
Sets are shown in solid, dashed, and dotted lines.

5.4 Concluding Remarks
In this chapter, we presented a one-step method to configure the CST to establish
the communication paths of a width-1 communication set. We identified a class of
communication sets called edge-exclusive sets for which the above method applies.
Then, we showed that every width-1 communication set can be decomposed into at
most three edge-exclusive sets.
Theorem 5.3 shows that every width-1 set of communications can be performed
(including configuration of switches under local control) in at most three batches.
Together with the schedules implied by Theorems 3.8, 3.15, and 3.17, these results
provide a comprehensive approach to perform communications on the CST.

Chapter 6
Segmentable Bus Implementation
In Section 2.2 we described the segmentable bus and explained the importance of
implementing it with small bus delay. In this chapter, we build on the techniques
of Chapters 3 and 5 to derive a segmentable bus implementation with small bus
delay. In Chapter 3 we presented a segmentable bus implementation as a special
case of oriented well-nested sets. The treatment here addresses many issues (such as
concurrent writes and processor word-size) not considered in that result.
A good segmentable bus implementation immediately translates to a good implementation of the HVR-Mesh [4], Basic R-Mesh [35] and the polymorphic processor
array [30], models on which many algorithms have been designed. We also show (in
Chapter 7) that a segmentable bus can be used as a building block for implementing
an LR-Mesh (see Section 7.2).
We present two approaches for implementing segmentable buses; one is based on
a hardware solution that builds on the CST, while the other uses an algorithmic
approach. The problem of implementing a segmentable bus allows each processor,
$i$, to assume only the answers to the following questions: (a) How does processor $i$
want to configure its segment switch (open or closed; see Section 2.2)? (b) Does
processor $i$ want to write to its bus segment? This ensures that the solution matches
the functional description of the segmentable bus in Section 2.2.
In the next section we briefly describe the two approaches to implementing the
segmentable bus. Sections 6.2 and 6.3 detail the approaches. In Section 6.4 we
summarize our results and make some concluding remarks.

6.1 Our Approaches
We present two approaches for implementing a segmentable bus, both employing a
balanced tree. The first approach, based on the CST, is suitable for large processors
of word-size $\Theta(\log N)$ bits (where $N$ is the number of processors on the segmentable
bus). In such a processor, one step can accommodate $\Theta(\log N)$ gate delays (see
footnote 1). This approach draws upon the techniques presented in Chapter 5, at the same
time exploiting properties of communication patterns possible on a segmentable bus. Here
the main idea is to configure the CST to establish a dedicated path from each writer
to all readers of a bus segment. Once these paths are established, data communication
is seamless. Therefore we only detail the configuration phase. In general, we will
admit concurrent writes to a bus segment. But if the application guarantees exclusive
writes, then further improvement in cost and performance is possible.
The second approach is suitable for smaller processors of word-size $\Theta(w)$ bits
where $\log\log N \leq w \leq \log N$. This approach uses a normal $2^w$-ary tree algorithm [24]
and runs in $O\left(\frac{\log N}{w}\right)$ steps, each of $\Theta(w)$ delay. A normal tree algorithm proceeds
level by level in the tree. Here we use a normal tree algorithm to translate local
information from processors to a configuration of global relevance. We permit our
algorithm to use any implementation of a $2^w$-processor segmentable bus. (The first
approach provides one such implementation. A class of structures that is capable of
implementing normal tree algorithms efficiently are multiple bus networks (MBNs)
[15].)
The two methods collectively allow $\Theta(\log\log N)$-bit processors to use $\Theta\left(\frac{\log N}{\log\log N}\right)$
steps, each of $\Theta(\log\log N)$-delay, or larger $\Theta(\log N)$-bit processors to use a constant
number of $\Theta(\log N)$-delay steps, or all shades in between. In both approaches, the
idea is to translate the local information at processors to global information that
represents the connectivity of the segmentable bus.
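The trade-off between word size and number of steps can be illustrated numerically. The sketch below only counts the levels of a $2^w$-ary tree over $N$ leaves, which is the quantity behind the bound of order $\log N / w$; it assumes one step per level and ignores the constant factors hidden by the asymptotic notation. The function name `tree_levels` is hypothetical.

```python
def tree_levels(n_processors, word_size):
    """Number of levels of a 2**word_size-ary tree over n_processors leaves.

    Assumption: a normal tree algorithm spends a constant number of steps
    per level, so this count tracks the step bound up to constant factors.
    """
    bits = (n_processors - 1).bit_length()   # ceil(log2(n_processors))
    return -(-bits // word_size)             # ceiling division
```

For example, with $N = 2^{16}$ processors, a word size of 4 bits gives four levels under these assumptions, whereas a word size of 16 bits gives a single level.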
1 A processor of word-size $w$ can usually address a $2^{\Theta(w)}$-location memory in one step.
A decoder for such an addressing requires $\Theta(w)$ gate delays.


6.2 Methods for Large Processors
In this section we use the CST to implement a segmentable bus. We first present a
method to implement a special case of a segmentable bus (called the right-oriented
segmentable bus), in which we assume that for each bus segment only the leftmost
processor of the segment writes and all other processors read. A left-oriented
segmentable bus is similar, with the rightmost processor of each segment as its only writer.
Next we use oriented segmentable buses to derive an implementation of a (general)
segmentable bus with only exclusive writes. Finally, we augment this implementation
to support concurrent writes.
The main result of this section is that an $N$-processor segmentable bus can be
implemented on a CST to run in $\Theta(1)$ steps. The difference between the exclusive
write and concurrent write implementations is in the complexity of the hardware
and constants in the running time.
As noted earlier, the problem boils down to configuring the CST (using local
information at leaves) to establish communication paths. Our approach to configuring
the CST builds on the technique in Chapter 5. As in Chapter 5, the CST operates
as follows.
1. Each switch configures its communication unit based on the control symbols it
receives from its children.
2. Each switch generates a control symbol (that captures all relevant information
from its descendants) to send to its parent.
3. The CST configuration algorithm uses a constant number of control symbols.
6.2.1 Implementing an Oriented Segmentable Bus
Without loss of generality, let the $N$-processor segmentable bus be right oriented.
Therefore we assume that the leftmost processor of each bus segment writes to the
segment and all other processors read (see Figure 6.1(a)). Consider a configuration
of the segmentable bus with $k$ segments.

Figure 6.1: Right oriented segmentable bus
For $1 \leq i \leq k$, let the $i$th segment be $S_i = (w_i, e_i)$ where $w_i$ is the index of
the leftmost processor (writer) of the segment and $e_i$ is the index of the rightmost
processor of the segment. Clearly, $w_i \leq e_i$ for all $i$. Also observe that if $w_{i+1}$ exists,
then $e_i \leq w_{i+1}$. Since all communications on segment $S_i$ are from $w_i$ toward $e_i$, we
will say that the segment $S_i$ "starts at $w_i$" and "ends at $e_i$". For technical reasons,
we assume that for all $1 \leq i \leq k$, $e_i = w_{i+1}$; that is, segment $S_i$ ends at the same
processor as the one at which $S_{i+1}$ starts. Consequently, the writer $w_i$ of segment $S_i$
writes also to processor $w_{i+1}$ (see dashed communications in Figure 6.1(b)), which
simply ignores the value read. Note that this makes it possible for a processor to be
a writer of one segment and a reader of the previous segment.
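The segment convention above can be illustrated with a short sketch that derives the pairs $(w_i, e_i)$ from per-processor writer flags of a right-oriented bus. Two assumptions are made here that the text does not state explicitly: processor 0 is a writer (it is the leftmost processor of the first segment), and the last segment ends at the last processor. The function name `segments` is hypothetical.

```python
def segments(is_writer):
    """Return the list of segments (w_i, e_i) of a right-oriented bus.

    is_writer: list of booleans, one per processor, left to right.
    Uses the convention e_i = w_{i+1}; the last segment is assumed to end
    at the last processor.
    """
    writers = [i for i, flag in enumerate(is_writer) if flag]
    ends = writers[1:] + [len(is_writer) - 1]
    return list(zip(writers, ends))
```

With writers at processors 0, 3 and 5 of an 8-processor bus, the sketch yields the segments (0, 3), (3, 5) and (5, 7), so processor 3 is both the last reader of the first segment and the writer of the second.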
Leaves (processors) of the CST generate control symbols from the set $S = \{w, r\}$
(the symbol $w$ corresponds to a writer processor whereas symbol $r$ corresponds to a
non-writing processor). These symbols propagate up the tree, configuring switches
level-by-level. Each switch (internal node) receives control symbols (from set $S = \{w, r\}$)
from its children. It uses these symbols to produce a symbol (again from the set
$S$) for its parent. Let $\mathcal{C}$ be the set of configurations of the communication unit of
a switch. Then each CST switch can be viewed as two functions, $g_s : S \times S \to S$
providing a symbol (see Figure 6.2) and $g_c : S \times S \to \mathcal{C}$ providing a configuration (see
Figure 6.3).
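The symbol function $g_s$ can be sketched as a lookup table applied bottom-up. The table below encodes the behavior described for Figure 6.2: a switch sends $r$ only when both children send $r$, and sends $w$ otherwise. The heap numbering of tree nodes (root 1, leaf $i$ at position `n_leaves + i`) and the function name `propagate` are assumptions of this sketch; the configuration function $g_c$ is not modeled.

```python
# Symbol sent to the parent for each (left, right) pair of received symbols.
G_S = {("r", "r"): "r", ("r", "w"): "w",
       ("w", "r"): "w", ("w", "w"): "w"}

def propagate(leaf_symbols):
    """Return the symbol each tree node sends to its parent.

    leaf_symbols: list of 'w' (writer) or 'r' (non-writer), one per
    processor, with length a power of two. The result is a list indexed by
    heap position; entry 0 is unused.
    """
    n_leaves = len(leaf_symbols)
    sym = [None] * n_leaves + list(leaf_symbols)
    for u in range(n_leaves - 1, 0, -1):       # children before parents
        sym[u] = G_S[(sym[2 * u], sym[2 * u + 1])]
    return sym
```

In effect, under these assumptions a node sends $w$ exactly when its subtree contains at least one writer, which is the information the parent needs in order to decide where its readers obtain their data.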

 g_s |  r   w
 ----+--------
  r  |  r   w
  w  |  w   w

Figure 6.2: The function $g_s$ for segmentable bus

Figure 6.3: The function $g_c$ for segmentable buses
Recall that we assume the CST leaves to be labeled in increasing order from left
to right. Thus a statement such as $u \leq v < w$ is to be interpreted as leaf $u$ is not
to the right of $v$ and leaf $v$ is to the left of $w$. Let $u$ be any node of the CST. Let
$T_u$ denote the subtree rooted at $u$. If $u$ is an internal node, then its left (resp., right)
subtree is the subtree of $T_u$ rooted at the left (resp., right) child of $u$.
Lemma 6.1 For any internal node $u$ of the CST, the functions $g_s$ and $g_c$ establish paths as follows.
1. If $u$ sends symbol $r$ to its parent, then the algorithm connects the incoming edge from the parent of $u$ to all leaves of $T_u$.
2. If $u$ sends symbol $w$ to its parent, then let the writers in $T_u$ be $w_1, w_2, \ldots, w_\alpha$ such that $w_1 \le w_2 \le \cdots \le w_\alpha$.
(a) The incoming edge from the parent of $u$ is connected to all leaves $z \le w_1$ of $T_u$.
(b) For $1 \le i < \alpha$, writer $w_i$ is connected to each leaf $z$ such that $w_i < z \le w_{i+1}$.
(c) The last writer $w_\alpha$ is connected to each leaf $z > w_\alpha$ of $T_u$ and to the outgoing edge from $u$ to its parent.

Proof: We proceed by induction on the level $l \ge 1$ (where leaves of the CST are at level 0) of an internal node $u$. If $l = 1$, then $u$ is the parent of two leaves. If $u$ sends an $r$ to its parent, then both children of $u$ are readers, each of which sends $r$ to $u$. From Figure 6.3, it is clear that part 1 holds for the base case.
If $u$ sends $w$ to its parent, then the three cases correspond to $g_c(r, w)$, $g_c(w, r)$, and $g_c(w, w)$ in Figure 6.3. In the first two cases there is only one writer ($\alpha = 1$), so parts 2a and 2c apply to this writer. It is simple to verify that parts 2a, 2b, and 2c hold in each case.
Assume the lemma to hold for any node at level $l \ge 1$, and consider node $u$ at level $l + 1$. Let $v$ and $w$ be the left and right children of $u$.
If $u$ sends $r$ to its parent, then both $v$ and $w$ send $r$ to $u$ (see Figure 6.2). By the induction hypothesis, the CST establishes a path from $u$ to all leaves of $T_v$ and $T_w$. The switch configuration of $g_c(r, r)$ (Figure 6.4(a)) ensures that the parent of $u$ is connected to all leaves of $T_u$.
Suppose $u$ sends symbol $w$ to its parent; then at least one of $v$ or $w$ must send $w$ to $u$. We now consider three cases.
Case 1 Suppose $v$ sends $r$ to $u$ and $w$ sends $w$ to $u$. By the induction hypothesis, we have the situation depicted in Figure 6.4(b). We only need observe that all leaves $z \le w_1$ of $T_u$ are indeed connected from the parent of $u$, as required, and that $w_\alpha$ is connected to the parent of $u$.

Figure 6.4: Illustration of the proof of Lemma 6.1

Case 2 Suppose $v$ sends $w$ to $u$ and $w$ sends $r$ (see Figure 6.4(c)). Observe that the parent is connected to every leaf $z \le w_1$, and that $w_\alpha$ is connected to the parent and to every leaf $z > w_\alpha$, as required.
Case 3 Suppose both $v$ and $w$ send $w$ to $u$ (see Figure 6.4(d)). Let the writers of $T_v$ and $T_w$ be $w'_1, w'_2, \ldots, w'_\beta$ and $w''_1, w''_2, \ldots, w''_\gamma$, enumerated from left to right. Then for $T_u$, $w_1 = w'_1$ and $w_\alpha = w''_\gamma$. Notice (especially between $w'_\beta$ and $w''_1$) that all required connections are established.
Theorem 6.2 A CST configured by the functions $g_s$ and $g_c$ can perform all communications of a right-oriented segmentable bus in one step.
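The connection pattern guaranteed by Lemma 6.1 can be checked against a simple sequential model. The sketch below is not the CST itself; it only computes, for each leaf, the value that the established paths would deliver. The function name and the use of `None` for "no data" are assumptions made for illustration.

```python
def right_oriented_reads(symbols, values):
    """Value delivered to each leaf of a right-oriented segmentable bus.

    symbols[i] is "w" for a writer and "r" for a reader; values[i] is the
    value written by leaf i (ignored for readers). Following Lemma 6.1,
    leaf z receives the value of the nearest writer strictly to its left.
    Leaves with no writer to their left receive None.
    """
    received = []
    current = None  # value of the nearest writer seen so far
    for symbol, value in zip(symbols, values):
        # Deliver before updating: a writer also reads from the
        # previous segment (the spurious read it later discards).
        received.append(current)
        if symbol == "w":
            current = value
    return received


print(right_oriented_reads("rwrrwr", [None, "a", None, None, "b", None]))
```

In this example the second writer (leaf 4) receives the value of the first writer, which corresponds to the spurious read introduced by the modified definition.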

Recall that we modified the definition of the right-oriented segmentable bus to require writers to be readers as well. We now provide the intuition for this modification. As Theorem 6.2 establishes, the modified communication set requires only two symbols. As noted earlier, the spurious read from the previous segment can be discarded by the writer. On the other hand, with the original definition of a right-oriented segmentable bus (see Figure 6.1(a)), the algorithm would have to distinguish between switches whose leaves are all readers, all writers, and some readers/writers. The modification lumps the latter two cases into one.
Clearly the method of this section readily translates to one for a left-oriented segmentable bus.
6.2.2 Segmentable Bus with Exclusive Writes
In this section we use the oriented segmentable bus implementation of Section 6.2.1
to realize a segmentable bus in which each segment has at most one writer (not
necessarily at one end of the segment).
Generally speaking, the idea is to partition the given exclusive-write segmentable bus communications into two blocks as follows. Recall that processors are indexed in increasing order from left to right. Since only writes have to be exclusive, each segment $S_i$ has exactly one writer, $w_i$ (say). Call processor $j \ne w_i$ of segment $S_i$ a left reader (resp., a right reader) iff $j < w_i$ (resp., $j > w_i$). Partition the segmentable bus communications so that one block (the left block) contains only communications from the writers to their readers to the left and the other block (the right block) contains only the communications from the writers to their readers to the right. Figure 6.5(a) shows an example of a segmentable bus configuration and Figure 6.5(b) shows its communication patterns. Figure 6.5(c) shows the partitioning of the communications into two blocks (shown solid and dashed). Figure 6.5(d) shows the communications of the right block (shown solid) and some dummy communications (shown dashed). No data are sent on the dummy communications (they exist only to simplify the solution). Figure 6.5(e) shows the left block similarly. Note that the communications of the right (resp., left) block are the same as the communications of a right (resp., left) oriented segmentable bus. Here we implement the communications of the right block and the communications of the left block separately using the method presented in Section 6.2.1. Note that when implementing the communications of the right (resp., left) block, many paths are established (shown dashed in Figure 6.5(d) and (e)) but not used to transfer data.

Figure 6.5: Implementation of a segmentable bus with exclusive writes. (a) Segmentable bus configuration; (b) communications of the segmentable bus; (c) partitioning the communications; (d) right block communications; (e) left block communications.
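The partition into left and right blocks can be sketched as follows. The representation of a segment as a `(first, last, writer)` triple and the function name are assumptions made for illustration; dummy communications are not modeled.

```python
def partition_blocks(segments):
    """Split exclusive-write communications into left and right blocks.

    segments is a list of (first, last, writer) triples with
    first <= writer <= last (processor indices increase left to right).
    Each communication is returned as a (writer, reader) pair.
    """
    left_block, right_block = [], []
    for first, last, writer in segments:
        # Readers to the left of the writer go to the left block.
        left_block += [(writer, j) for j in range(first, writer)]
        # Readers to the right of the writer go to the right block.
        right_block += [(writer, j) for j in range(writer + 1, last + 1)]
    return left_block, right_block


left, right = partition_blocks([(0, 3, 2), (4, 5, 4)])
print("left block: ", left)
print("right block:", right)
```

Each block on its own has the shape of an oriented segmentable bus, which is why the method of Section 6.2.1 applies to it.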
With one full duplex link between each node and its parent, the right and the left
blocks will have to be scheduled separately, as they could form a width-2 communication set. If each link is replaced by two full duplex links, then the communications
from both blocks can be performed simultaneously in one step. (This amounts to
setting k = 2 in the remark at the end of Chapter 3.)
Theorem 6.3 A CST with two full duplex links per edge can perform all communications of a segmentable bus with exclusive writes in one step.

Figure 6.6: Implementation of a segmentable bus with concurrent writes. (a) Bus segments; (b) collection phase; (c) broadcast phase.
6.2.3 Segmentable Bus with Concurrent Writes
Here we present a method to implement a segmentable bus that admits concurrent
writes to a segment. As explained in Section 2.2, concurrent writes are resolved using
resolution rules. In this section we discuss the Common rule, in which all writers
write the same value (other concurrent write rules can also be implemented in a
similar way). In a concurrent writes implementation, each processor is a potential
writer. In other words, paths should be established from each processor to all other
processors. We accomplish this in two phases called the collection and broadcast
phases. The collection phase collects the result of all concurrent writes (bus-value) to
a segment into a fixed processor; we choose the leftmost processor of each segment as
the collector. The second phase broadcasts the bus-value from the collector of each
segment to the remaining processors of the segment. Consider the segmentable bus of
Figure 6.6(a). Figure 6.6(b) shows the communications of the collection phase, and
Figure 6.6(c) shows the communications of the broadcast phase. The broadcast phase
is simply the communications of a right-oriented segmentable bus, so the method of
Section 6.2.1 suffices for its implementation.

Figure 6.7: Reversing the directions of data flow
The communications of the collection phase are the "dual" of those of the broadcast phase. In other words, if each switch is configured the same way as in the broadcast phase, except for reversing the directions of information flow, we end up with the configuration of the collection phase (see Figure 6.7). The only point that requires further elaboration is the case where two incoming lines are "connected" together. For the Common rule, these lines could simply be ORed (assuming 0 for no writes). For other rules, use other simple functions requiring constant hardware.
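The two phases can be modeled sequentially as below. This is a sketch under stated assumptions: written values are integers with 0 meaning "no write", merging lines are ORed as suggested for the Common rule, and the function name and the `segment_starts` encoding are illustrative.

```python
def concurrent_write_bus(segment_starts, writes):
    """Two-phase Common-rule bus: collect by OR, then broadcast.

    segment_starts[i] is True if processor i is the leftmost processor of
    its segment (processor 0 always starts a segment); writes[i] is the
    integer written by processor i, with 0 meaning "no write".
    """
    collected = {}     # collector index -> OR of the writes on its segment
    collector_of = []  # collector (leftmost processor) of each processor
    collector = 0
    for i, value in enumerate(writes):
        if i == 0 or segment_starts[i]:
            collector = i
            collected[collector] = 0
        collector_of.append(collector)
        collected[collector] |= value                # collection phase
    return [collected[c] for c in collector_of]      # broadcast phase


print(concurrent_write_bus([True, False, False, True, False],
                           [0, 5, 5, 0, 0]))
```

Under the Common rule all writers on a segment write the same value, so the OR simply reproduces that value.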
Remarks: Connections (data paths) of both phases could be cascaded to provide
one seamless path from writers to readers.

Theorem 6.4 A CST with two full duplex links per edge can perform all communications of a segmentable bus with concurrent writes in one step.

Remarks: The difference between the results of Theorems 6.3 and 6.4 is that additional
resolution hardware is employed for concurrent writes.

6.3 Method for Small Processors
In this section we present an approach to implement segmentable buses with smaller processors of word-size $w$ bits, where $\log\log N \le w \le \log N$. This approach implements an $N$-processor segmentable bus to run in $\Theta\left(\frac{\log N}{w}\right)$ steps.
6.3.1 Another Segmentable Bus Implementation
This implementation of an $N$-processor segmentable bus uses a 1-step, $2^w$-processor segmentable bus as a building block. We will refer to this building block as the base segmentable bus. (Section 6.2 gives one implementation of the base segmentable bus.)
Construction: We now give a recursive description of this new implementation of an $N$-processor segmentable bus. Without loss of generality, let $N = 2^{x(w-1)+1}$, where $w$ reflects the processor word-size and relates to the number of processors in the base segmentable bus; $x \ge 1$ is an integer. Let $S(x)$ denote the $2^{x(w-1)+1}$-processor segmentable bus implementation.

• If $x = 1$, then $N = 2^w$. Here $S(1)$ is the base segmentable bus.
• If $x > 1$, then construct the segmentable bus as shown in Figure 6.8.

Divide the $2^{x(w-1)+1}$ processors into $\frac{N}{2^w} = 2^{(x-1)(w-1)}$ groups (or sets) $G(1), G(2), \ldots, G(2^{(x-1)(w-1)})$, each with $2^w$ contiguous processors. Connect the processors in each group by the base segmentable bus $S(1)$. Let $p_i$ and $q_i$ be the leftmost and the rightmost processors in group $G(i)$. Recursively connect the $2^{(x-1)(w-1)} \times 2 = 2^{(x-1)(w-1)+1}$ processors $p_i$ and $q_i$ (where $1 \le i \le 2^{(x-1)(w-1)}$) using a $2^{(x-1)(w-1)+1}$-processor segmentable bus, $S(x-1)$.
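The sizes involved at each level of the recursion follow directly from $N = 2^{x(w-1)+1}$. The short sketch below tabulates them; the function name is illustrative.

```python
def recursion_sizes(w, x):
    """(processors, groups) at each level S(x), S(x-1), ..., S(1).

    S(y) connects 2**(y*(w-1)+1) processors. For y > 1 they are split
    into groups of 2**w processors; S(1) is a single base bus.
    """
    sizes = []
    for y in range(x, 0, -1):
        processors = 2 ** (y * (w - 1) + 1)
        groups = processors // 2 ** w
        sizes.append((processors, groups))
    return sizes


# w = 2, x = 3 gives N = 16, as in the example of Figure 6.9.
for processors, groups in recursion_sizes(2, 3):
    print(processors, "processors in", groups, "group(s)")
```

At every level the two end processors of each group form the processor set of the next level, so twice the number of groups at one level equals the number of processors at the next.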

Figure 6.8: Structure of a segmentable bus implementation $S(x)$
An Example: We now illustrate this construction for $w = 2$ and $x = 3$, so $N = 16$. Figure 6.9 shows the structure of $S(3)$. We also illustrate the operation of $S(3)$ using this example. Suppose the function of $S(3)$ is as shown in Figure 6.10(a), where processors 4, 6, 9, and 15 open their segment switches. Suppose the segmentable bus uses the Collision writing rule (see also Section 2.2). Recall that under the Collision rule, a collision symbol, "#", is written to a segment if more than one processor attempts a write to that segment. Let processors 3, 4, 5, 7, 8, 14, and 15 write values $v_3$, $v_4$, $v_5$, $v_7$, $v_8$, $v_{14}$, and $v_{15}$, respectively. In Figure 6.10, we follow the convention that only writers have arrows to the segmentable bus. The symbol on a bus segment is the bus-value and a symbol on a write arrow is the value written. The symbol above a processor is the value read.

Figure 6.9: Structure of $S(3)$

Figure 6.10: An illustration of the functioning of $S(3)$
Figure 6.10(b) shows the first step. Four groups $G(1) = \{1, 2, 3, 4\}$, $G(2) = \{5, 6, 7, 8\}$, $G(3) = \{9, 10, 11, 12\}$, and $G(4) = \{13, 14, 15, 16\}$ are formed. Each processor opens or closes its switch based on local data, and writers write their data to the bus. The values on each segment (see Figure 6.10(b)) represent the value read by all processors incident on that segment. Note that only the leftmost processor and the rightmost processor of each group (i.e., processors 1, 4, 5, 8, 9, 12, 13, and 16) are included in the next step. These processors determine the settings of their segment switches for the next step as follows. All leftmost processors (i.e., processors 1, 5, 9, and 13) retain their original segment switch settings. Each rightmost processor of each group (i.e., processors 4, 8, 12, and 16) determines if it is on the same segment as the leftmost processor of that group. If it is, then the rightmost processor closes its switch in the next step. If not, then the rightmost processor opens its switch in the next step. Figure 6.10(c) shows the second step of the algorithm (processor groups are $G'(1) = \{1, 4, 5, 8\}$ and $G'(2) = \{9, 12, 13, 16\}$). Note that the rightmost processor of group $G(2)$ (processor 8 in the first step) opens its switch in the second step even though its switch was closed in the first step. This is because (in the first step) processor 8 is not on the same segment as processor 5. Again, all processors write their values, and the leftmost and the rightmost processors determine their new switch settings for the third step as explained before. This process is repeated until the number of leftmost and rightmost processors reduces to $2^w$, at which time a single base segmentable bus suffices. Now, a reverse process is applied, and the collected data is broadcast to the appropriate processors. Figure 6.10(d) shows the final value read by each processor.
Operation of $S(x)$: We now generalize the idea of the above example. Initially each processor has its local information; that is, each processor knows whether it is segmenting the bus and is aware of the value (if any) it is to write to the bus. As we saw in the example, the aim is to determine the bus-value for each processor on the segmentable bus. Let $P = \{1, 2, \ldots, N\}$ be the set of processors on $S(x)$. For any $R \subseteq P$ and any $i \in P$, let $v(R, i)$ be the bus-value at processor $i$, assuming that only the writers in subset $R$ write to the segmentable bus. For example, if $N = 10$ and processors 1, 5, 7, and 10 write values $v_1$, $v_5$, $v_7$, and $v_{10}$, then for $R = \{z_1, z_2, z_3, z_4, z_5\}$, the quantity $v(R, z_2)$ would be the value read by processor $z_2$ assuming only processors 1 and 5 write values $v_1$ and $v_5$; the segment switch settings are not altered by a choice of the set $R$. The procedure for $S(x)$ is as follows.

Phase 1: Processors of $G(j)$ (where $1 \le j \le 2^{(x-1)(w-1)}$) determine the following:

• bus-value $v(G(j), z)$ for each $z \in G(j)$.
• processor $q_j$ determines if it is in the same segment as processor $p_j$.

Phase 2: Let $\lambda_j = v(G(j), p_j)$ and $\rho_j = v(G(j), q_j)$ be the "local values" read by the end processors of the group. Processor $p_j$ cuts the bus to its left if and only if it was supposed to do so initially (at the start of Phase 1). Processor $q_j$ cuts the bus to its left if and only if in Phase 1 it determined that it was not in the same segment as $p_j$. Processors $p_j$, $q_j$ write $\lambda_j$, $\rho_j$, respectively. With the above "local information" as the write values and segment switch states of the processors $p_j$, $q_j$ (for all $1 \le j \le 2^{(x-1)(w-1)}$), recursively determine bus-values and segment switch states on $S(x-1)$. Let $\lambda'_j$ and $\rho'_j$ be the new bus-values of $p_j$ and $q_j$, respectively.

Phase 3: Processors $p_j$ and $q_j$ write $\lambda'_j$ and $\rho'_j$ on the local base segmentable bus of $G(j)$. If processor $z \in G(j)$ receives a value $\nu$, then $\nu$ is its final bus-value. Otherwise, it retains its bus-value from Phase 1.

In Phase 1, getting the bus-value $v(G(j), z)$ is simply a matter of segmenting the local $S(1)$ as specified and writing to and reading from the segmentable bus. Processor $q_j$ determines if it is on the same bus as $p_j$ by waiting for a step to receive a signal issued by $p_j$. So Phase 1 runs in two steps. Phase 3 again is a matter of using the local base segmentable bus and runs in one step.
The correctness of the procedure stems from the following facts.
• Phase 1 correctly determines the "local" bus-values within groups.
• Adjacent groups $G(j)$ and $G(j+1)$ can interact only through processors $q_j$ and $p_{j+1}$.
• Phase 2 recursively captures the final bus-value for processors $p_j$ and $q_j$.
• Phase 3 conveys the final values locally.
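The three phases can be exercised with a small sequential simulation. The sketch below is an illustration under stated assumptions, not the hardware procedure: written values are integers combined by OR with 0 meaning "no write" (a stand-in for the Common rule, chosen because it is unaffected when $p_j$ and $q_j$ both re-write the same local value), `cut_left[i]` records whether processor $i$ cuts the bus to its left, and the function names are hypothetical. Step counts follow the description above: two steps for Phase 1, the recursive cost for Phase 2, and one step for Phase 3.

```python
def direct_bus(cut_left, writes):
    """Reference bus-values: OR of the writes on each segment."""
    n = len(writes)
    values, start = [0] * n, 0
    for i in range(n + 1):
        if i == n or (i > 0 and cut_left[i]):   # segment [start, i) ends
            seg_val = 0
            for k in range(start, i):
                seg_val |= writes[k]
            values[start:i] = [seg_val] * (i - start)
            start = i
    return values


def recursive_bus(cut_left, writes, w):
    """Three-phase procedure; returns (bus-values, steps).

    Assumes w >= 2 and len(writes) == 2**(x*(w-1)+1) for an integer x >= 1.
    """
    n, size = len(writes), 2 ** w
    if n <= size:                                # base segmentable bus S(1)
        return direct_bus(cut_left, writes), 1
    local, top_cut, top_writes = [], [], []
    for g in range(0, n, size):                  # Phase 1: local bus-values
        cuts = cut_left[g:g + size]
        loc = direct_bus(cuts, writes[g:g + size])
        local += loc
        # p_j keeps its setting; q_j cuts iff it is not on p_j's segment.
        top_cut += [cut_left[g], any(cuts[1:])]
        top_writes += [loc[0], loc[-1]]
    top, steps = recursive_bus(top_cut, top_writes, w)      # Phase 2
    final = []
    for j, g in enumerate(range(0, n, size)):    # Phase 3: local broadcast
        ends = [0] * size
        ends[0], ends[-1] = top[2 * j], top[2 * j + 1]
        recv = direct_bus(cut_left[g:g + size], ends)
        # A processor that receives nothing keeps its Phase 1 value.
        final += [r if r else l for r, l in zip(recv, local[g:g + size])]
    return final, steps + 3


# The configuration of the example: processors 4, 6, 9, 15 (1-indexed) cut.
cuts = [i in (3, 5, 8, 14) for i in range(16)]
writes = [0] * 16
for index, value in [(2, 1), (3, 2), (4, 4), (6, 8), (7, 16), (13, 32), (14, 64)]:
    writes[index] = value
values, steps = recursive_bus(cuts, writes, w=2)
print(values == direct_bus(cuts, writes), steps)
```

Distinct powers of two are used as write values so that the OR on each segment reveals exactly which writers contributed to it.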
Let $S(x)$ require $T(x)$ steps. Clearly, $T(1) = 1$ and, from the explanation above, $T(x) = T(x-1) + 3$. Solving this recurrence gives $T(x) = 3x - 2 = \Theta(x) = \Theta\left(\frac{\log N}{w}\right)$.
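For completeness, the recurrence can be unrolled explicitly. The derivation below uses only $T(1) = 1$, $T(x) = T(x-1) + 3$, and $N = 2^{x(w-1)+1}$, and assumes $w \ge 2$ so that the division is defined.

```latex
\begin{align*}
T(x) &= T(x-1) + 3 = T(x-2) + 6 = \cdots = T(1) + 3(x-1) = 3x - 2, \\
\log N &= x(w-1) + 1 \quad\Longrightarrow\quad x = \frac{\log N - 1}{w - 1}, \\
T(x) &= \frac{3(\log N - 1)}{w - 1} - 2 = \Theta\!\left(\frac{\log N}{w}\right).
\end{align*}
```

For the example with $w = 2$ and $N = 16$, this gives $x = 3$ and $T(3) = 7$ steps.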

Figure 6.11: A balanced ternary ($k = 3$) tree of height 3

Given that a one-step $2^w$-processor segmentable bus (with each processor of word-size $w$) can be constructed (Section 2.2), we have the following result.
Theorem 6.5 For any $w$ where $\log\log N \le w \le \log N$, the proposed implementation of an $N$-processor segmentable bus (using $\Theta(w)$-bit processors) runs in $\Theta\left(\frac{\log N}{w}\right)$ steps.

Remark: The reason we require $w = \Omega(\log\log N)$ is that two processors of the segmentable bus are part of all $x$ levels of the recursion. These processors are connected to $x$ different base segmentable buses. Consequently, their word-sizes must be at least $\log x = \log\left(\frac{\log N}{w}\right)$. That is, if $w = \Omega(\log\log N)$, then the segmentable bus can operate as stated. If $w = o(\log\log N)$, the ideas presented by Vaidyanathan et al. [49] could be used.
Using k-ary Trees: For any $k \ge 2$, a balanced $k$-ary tree of height $h$ has $N = k^h$ leaves, each at a distance of $h$ from the root, and each internal node has $k$ children (Figure 6.11 shows a balanced ternary tree ($k = 3$) of height 3).
A $k$-ary tree algorithm on $N = k^h$ inputs proceeds level-by-level from the leaves to the root of a $k$-ary tree. Each node $u$ of the tree has a value $\sigma(u)$ associated with it. The value of a leaf is an input. The value $\sigma(u)$ of an internal node $u$ with children $u_1, u_2, \ldots, u_k$ is a function $f(\sigma(u_1), \sigma(u_2), \ldots, \sigma(u_k))$. The value $\sigma(r)$ of the root $r$ is the output of the algorithm. Reversing the direction of a $k$-ary tree algorithm generates outputs at the leaves. There is a clear parallel between a $k$-ary tree algorithm and the segmentable bus implementation described above (see Figure 6.10, for which $k = 4$). Thus any platform suitable for a $k$-ary tree algorithm works for a segmentable bus as well. Dharmasena [15] proposed a multiple bus network (MBN) to run a $k$-ary tree algorithm on $N = k^h$ nodes in $h$ steps. This MBN can also serve to implement a $\Theta\left(\frac{\log N}{w}\right)$-step segmentable bus connecting $N$ processors.
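A $k$-ary tree algorithm in the sense used here can be sketched in a few lines. The function name is illustrative, and the combining function `f` is supplied by the caller; the sketch assumes the number of inputs is an exact power of $k$.

```python
def kary_tree_algorithm(inputs, k, f):
    """Run a k-ary tree algorithm bottom-up; return (root value, height).

    Assumes len(inputs) == k**h for some h >= 0. f maps a list of k
    child values to the value of their parent.
    """
    level, height = list(inputs), 0
    while len(level) > 1:
        # One step: every internal node of the next level combines
        # the values of its k children.
        level = [f(level[i:i + k]) for i in range(0, len(level), k)]
        height += 1
    return level[0], height


# Sum of 27 inputs on a balanced ternary tree of height 3 (Figure 6.11).
print(kary_tree_algorithm(range(27), 3, sum))
```

The number of levels processed equals the height $h$, matching the $h$ steps quoted for the MBN.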

6.4 Concluding Remarks
We have presented two approaches for segmentable bus implementation using binary trees. The first is suitable for large word-size processors and has variations that accommodate different writing abilities. The second approach achieves the implementation as a $k$-ary tree.
A Horizontal-Vertical Reconfigurable Mesh (HV-R-Mesh) [4] is an R-Mesh with a segmentable bus in each row and column. The bit-model HV-R-Mesh [21] is a fine-grained version of the (word-model) HV-R-Mesh with processors of constant size (like the PEs of the SRGA architecture). Theorem 3.19 extends to the following result.
Theorem 6.6 If the SRGA architecture can support an $N$-leaf CST per row and column, then it can emulate any step of a bit-model $N \times N$ HV-R-Mesh in two steps.

Chapter 7
Implementing the Linear
Reconfigurable Mesh
As described in Chapter 1, most work on dynamically reconfigurable models such as the R-Mesh assumes "unit-cost" buses and entirely skirts the issue of bus delay. This makes the R-Mesh very difficult to implement. A more conservative "log-cost" measure [32] assigns a $\log N$ delay to a bus spanning $N$ processors. While this measure is reasonable for a fixed bus, it does not capture the complexities arising from the ability of the LR-Mesh to configure its buses in an exponential number of ways. In this chapter we introduce a new measure of bus delay called "bends-cost" that considers the delay of a bus to be proportional to the number of times it bends between rows and columns of the LR-Mesh. We show that there exists an LR-Mesh implementation for which bends-cost is a faithful measure of the actual bus delay. This implementation uses a segmentable bus implementation. Consequently, our results are expressed in terms of $\Delta$, the delay introduced by a segmentable bus spanning $N$ processors. It should be noted that the specific implementation of the segmentable bus presented in Chapter 6 bounds $\Delta$ to be $O(\log N)$. The method proposed in this chapter is general enough to accommodate future improvements in the value of $\Delta$, however.
We now describe our results in this chapter in a little more detail. Our results are primarily for LR-Meshes with "semimonotonic" buses. In any given step of such an LR-Mesh, all buses are laid out in some general orientation with respect to the underlying processor array (such as top to bottom, or left to right); Section 7.1 defines semimonotonic buses formally. Many fundamental algorithms (such as those for prefix sums, multiple addition and sorting) run on LR-Meshes with semimonotonic buses [22, 32, 36]. We prove that each step of an $N \times N$ (unit-cost) LR-Mesh can be run in $O\left(\frac{\log^2 N}{\log D - \log \Delta}\right)$ time on a $\Theta(N) \times \Theta(N)$ bends-cost LR-Mesh whose buses have a delay of at most $D$. For some special cases this time overhead can be reduced further to $O\left(\frac{\log N}{\log D - \log \Delta}\right)$. In particular, if $D = N^{\epsilon}$ for an arbitrarily small constant $\epsilon > 0$, then the running times of the bends-cost LR-Mesh algorithms are to within a constant of their ideal (unit-cost) LR-Mesh counterparts. One implication of this result is that with $N^{\epsilon}$ delay, a $\Theta(N) \times \Theta(N)$ bends-cost LR-Mesh can perform prefix sums of $N$ bits, add $N$ $b$-bit numbers and sort $N$ inputs in constant time. To our knowledge, this is the first general result to produce constant time algorithms on reconfigurable models without using the unit-cost assumption for bus delay.
We also present results for simulating the LR-Mesh (whose buses are not necessarily semimonotonic) and the more general R-Mesh on reconfigurable models with limited delay buses.
In the next section we introduce some definitions and properties of the LR-Mesh. In Section 7.2 we describe the bends-cost measure and an LR-Mesh implementation for which the bends-cost measure models bus delay accurately. Section 7.5 is devoted to the simulation of a unit-cost LR-Mesh with semimonotonic buses on a bends-cost LR-Mesh. Section 7.6 presents results for more general bus configurations. In Section 7.7 we summarize our results and make some concluding remarks.

7.1 Preliminaries
In this section we describe some properties of the LR-Mesh and define some terms. Recall that an $R \times C$ LR-Mesh consists of an $R$-row, $C$-column array of processors connected by an underlying mesh (see Figure 7.1). Each processor in an LR-Mesh has four ports (called North, South, East, and West ports in the obvious manner, and abbreviated N, S, E, and W).

Figure 7.1: Examples of buses in a $3 \times 5$ LR-Mesh

Linear Bus Types: A linear bus can be cyclic (see dotted bus in Figure 7.1) or acyclic. The row sequence (resp., column sequence) of an acyclic linear bus is the sequence of row numbers (resp., column numbers) traversed when one traces the path of the bus from one of its end points to the other. For example, the row sequence of the bus shown solid in Figure 7.1 is $\langle 0, 1, 0, 1 \rangle$, as the (left end of the) bus starts at a port of a processor in row 0, moves to a processor at row 1, comes back to row 0 and finally ends in a port of a processor in row 1. (Note that reversing the sequence also produces a valid row sequence of the bus.) The column sequence of the above bus is $\langle 1, 2, 3, 4 \rangle$. The row and column sequences of the bus shown dashed in Figure 7.1 are $\langle 0, 1, 2 \rangle$ and $\langle 1, 0, 1, 0 \rangle$, respectively. A bus is row monotonic (resp., column monotonic) if its row sequence (resp., column sequence) is monotonic (either non-increasing or non-decreasing). The solid bus of Figure 7.1 is column monotonic but not row monotonic, whereas the dashed bus is row monotonic but not column monotonic. A bus that is row monotonic or column monotonic is said to be semimonotonic. An incremental bus is a row (resp., column) monotonic bus for which any two consecutive elements of its column (resp., row) sequence differ by 1. An incremental column monotonic bus moves up or down by at most one row at a time. The above ideas also apply to pieces of a linear bus. For example, in Figure 7.1 the piece of the solid bus between columns 1 and 2 is column monotonic but not row monotonic.
An LR-Mesh configuration is said to be row monotonic if every bus in the configuration is row monotonic. A column monotonic configuration is defined similarly.
Definition 7.1 A configuration that is row monotonic or column monotonic is said to be semimonotonic. An LR-Mesh configuration is incremental if every bus in the configuration is incremental.
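The monotonicity definitions translate directly into checks on row and column sequences. The sketch below uses illustrative function names and tests them on the two buses described for Figure 7.1.

```python
def is_monotonic(seq):
    """True if seq is non-decreasing or non-increasing."""
    pairs = list(zip(seq, seq[1:]))
    return (all(a <= b for a, b in pairs)
            or all(a >= b for a, b in pairs))


def is_semimonotonic(row_seq, col_seq):
    """A bus is semimonotonic if it is row or column monotonic."""
    return is_monotonic(row_seq) or is_monotonic(col_seq)


solid = ([0, 1, 0, 1], [1, 2, 3, 4])    # (row sequence, column sequence)
dashed = ([0, 1, 2], [1, 0, 1, 0])
for name, (rows, cols) in [("solid", solid), ("dashed", dashed)]:
    print(name, is_monotonic(rows), is_monotonic(cols),
          is_semimonotonic(rows, cols))
```

Because a reversed sequence is also a valid sequence of the same bus, the check accepts both non-increasing and non-decreasing orders.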

7.1.1 Exploiting Features of the LR-Mesh
Here we first note some previous results on LR-Meshes. Then we use these to derive some properties of LR-Mesh algorithms that will simplify subsequent discussion.

Oriented bus: A linear acyclic bus is oriented iff each processor on the bus can determine which of its ports is closer to (say) the left end of the bus.

Lemma 7.1 Every linear acyclic bus of an $X \times Y$ LR-Mesh can be oriented on a $2X \times 2Y$ LR-Mesh.

Proof: Every linear acyclic bus has two end points. Call any one of these ends the left end and the other the right end. The LR-Mesh replaces each bus by two "oriented buses" as shown in Figure 7.2 [17] and assigns different orientations to each bus.

Figure 7.2: Replacing a linear, acyclic bus by two "directional buses"

An important operation on an oriented linear bus is "neighbor localization." Given an oriented linear bus with each processor $p_i$ on it flagged by a Boolean variable $f_i$, neighbor localization constructs a linked list of processors $p_i$ with $f_i = 1$ in the order in which flagged processors are placed on the bus. If the linear bus is oriented, then neighbor localization can be solved in constant time on the bus [17].
Write rules: If a constant blowup in size is permissible, then all concurrent writes (except with Priority) on an LR-Mesh can be replaced by exclusive writes.

Lemma 7.2 Every step of an $R \times C$ LR-Mesh with only acyclic buses and which does not use the Priority rule can be emulated on a $2R \times 2C$ CREW LR-Mesh in constant time.

Proof: The LR-Mesh replaces each bus by two "directional buses" as shown in Figure 7.2. Each bus is now oriented. Then flag each writer on the bus and apply the neighbor localization algorithm to determine the "leftmost" writer on the bus. For the Arbitrary and Common rules, this leftmost processor performs the exclusive write. For the Collision rule, if there is only one writer (with no left and right neighbors), then it writes its value to the bus. If not, then the leftmost writer writes the collision symbol to the bus. For the Collision+ rule, if there is only one writer, then it writes its value to the bus. If there is more than one writer, then each writer sends data to the next writer to its left (say). Each writer that detects a value different from its own flags itself with a 1. An OR operation is done on all writers (this again amounts to neighbor localization). If the result of the OR is 0, then all writers have the same value and the leftmost writer is chosen to write its value to the whole bus. If the OR is 1, then there are multiple writers with different data and the leftmost writer writes the collision symbol to the entire bus.
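The outcome of each write rule in the proof of Lemma 7.2 can be summarized by a small resolver. This sketch models only the resulting bus-value, not the neighbor localization that finds the leftmost writer; the function name, the rule strings, and the use of `None` for non-writers are assumptions made for illustration.

```python
COLLISION = "#"


def resolve_writes(writes, rule):
    """Bus-value after the leftmost writer performs the exclusive write.

    writes lists the processors on an oriented bus from its left end;
    each entry is the value written, or None for a non-writer.
    """
    writers = [v for v in writes if v is not None]
    if not writers:
        return None                      # nobody writes to this bus
    leftmost = writers[0]
    if rule in ("arbitrary", "common"):
        return leftmost
    if rule == "collision":
        return leftmost if len(writers) == 1 else COLLISION
    if rule == "collision+":
        same = all(v == leftmost for v in writers)
        return leftmost if same else COLLISION
    raise ValueError(f"unsupported rule: {rule}")


bus = [None, 7, None, 7, None]
for rule in ("common", "collision", "collision+"):
    print(rule, resolve_writes(bus, rule))
```

The example shows where Collision and Collision+ differ: two writers with equal data collide under the former but not under the latter.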
Lemma 7.3 An LR-Mesh using exclusive writes or concurrent writes under the Common, Collision, or Collision+ rules can assume, without loss of generality, that all buses are acyclic.

Proof: Cut the bus at each writer. This transforms every bus (cyclic or not) into pieces, each of which is linear. If the writes are exclusive, then the writer simply writes to the two segments of the bus (in two separate steps). Otherwise, each piece has exactly two writers, one at each end of that piece. (If a bus is cyclic, but with no writer on the bus at a certain step, then that bus can be ignored at that step.)
If the bus has writer(s), then for the Common rule, each writer cuts the bus and writes to both pieces on which it is incident. Clearly, all processors get the same value. For the Collision rule, each writer cuts the bus and writes to both pieces of the bus on which it is incident. If there is only one writer, then all processors on the bus get the value written. If there is more than one writer, then each writer receives a collision symbol (except on the piece of the bus of the leftmost processor and the piece of the bus of the rightmost processor). In another step the collision symbol is written to both of these pieces. For the Collision+ rule, each writer cuts the bus and both ends of each piece exchange data. Then, each writer that detects data different from its own cuts the bus and writes a collision symbol. If there are no such writers, then all writers conclude that all the data written are the same and all original writers write this data to all other processors.
Semimonotonic Configurations: A semimonotonic bus could be row or column monotonic. So a configuration composed of a set of semimonotonic buses could have both row and column monotonic buses. A semimonotonic configuration is only permitted to have either row monotonic or column monotonic buses. In the following lemma, we prove that a set of semimonotonic buses can be treated as a semimonotonic configuration.
Remarks: The CREW LR-Mesh buses have the same properties (such as semimonotonicity etc.) as their CRCW counterparts.
Before we proceed, we establish a preliminary result that may be of independent interest.
Define the gossiping problem on set $S$ as follows. Let each port $i$ of each processor hold a value $v_i \in S \cup \{\text{null}\}$; the null indicates that a port may not hold a value. The problem is for each port incident on a bus $b$ to determine the set $\{v_i : i \text{ is incident on } b\}$.

Lemma 7.4 A Common or Collision CRCW LR-Mesh can solve the gossiping problem on set $S$ in $|S|$ steps.

Proof: Without loss of generality, let $S = \{1, 2, \ldots, |S|\}$. Iterate $|S|$ times as follows. In iteration $j$ (where $1 \le j \le |S|$), each port $i$ holding value $v_i = j$ writes a signal to its bus and all ports incident on the bus read. A port receiving the signal in iteration $j$ concludes that some port on its bus holds value $j$. For the Common rule, each writing port writes $j$ in iteration $j$. For Collision, a signal of any value will be read by all ports incident on the bus.
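The iteration in the proof of Lemma 7.4 can be sketched for a single bus as follows. The function name is illustrative, `None` stands for the null value, and the bus is modeled simply as the list of values held by its ports.

```python
def gossip(port_values, universe):
    """Gossiping on one bus; returns (values present, iterations used).

    port_values holds the value at each port on the bus (None for null);
    universe is the set S, given as a sequence.
    """
    present = set()
    iterations = 0
    for j in universe:
        iterations += 1
        # Ports holding j write a signal; every port on the bus reads it.
        signal = any(v == j for v in port_values)
        if signal:
            present.add(j)
    return present, iterations


print(gossip([3, None, 1, 3], [1, 2, 3, 4, 5]))
```

The iteration count depends only on $|S|$, not on the number of ports on the bus.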

Lemma 7.5 Let $L$ be a Common or Collision CRCW LR-Mesh. If every bus of a configuration of $L$ is given to be semimonotonic, then $L$ can partition the configuration into two semimonotonic configurations in $O(1)$ time.

Proof: Let $C$ be the given configuration. The aim is to create two configurations $C_r$ and $C_c$ so that every bus of $C_r$ (resp., $C_c$) is row (resp., column) monotonic, and every bus of $C$ is in either $C_r$ or $C_c$. We explain the algorithm for one bus $b$. All other buses are handled in the same way.
Each processor $p$ in which a bus $b$ bends cuts the bus within that processor. Let $x$ and $y$ be the two ports of a processor through which bus $b$ traversed before it was cut. Assign values $v_x$ to $x$ and $v_y$ to $y$ as follows.

\[
v_x = \begin{cases}
1, & \text{if } y \in \text{North} \\
2, & \text{if } y \in \text{South} \\
3, & \text{if } y \in \text{East} \\
4, & \text{if } y \in \text{West}
\end{cases}
\qquad
v_y = \begin{cases}
1, & \text{if } x \in \text{North} \\
2, & \text{if } x \in \text{South} \\
3, & \text{if } x \in \text{East} \\
4, & \text{if } x \in \text{West}
\end{cases}
\]
Also assign value 5 to each processor holding an end of bus $b$; since $b$ is semimonotonic, it must be acyclic. All remaining ports hold value null. Now solve the gossiping problem on set $\{1, 2, 3, 4, 5\}$ for each segment of bus $b$. This requires three iterations; i.e., constant time. At this point, ports of each processor in which a bus bends know the value held by ports at neighboring bends.
A processor in which bus $b$ bends could determine the status (row monotonic or column monotonic) of bus $b$ as follows. Let $x$ and $w$ be two ports at the ends of a segment of bus $b$. Let $Q$ be the set of value(s) obtained by port $x$ after solving the gossiping problem on set $\{1, 2, 3, 4, 5\}$. Clearly, $1 \le |Q| \le 2$.
Case 1 $|Q| = 1$. Here $Q = \{v_x\}$. Therefore the other end of the bus segment starting at $x$ must also have value $v_x$. If $v_x = 1$ or $v_x = 2$, then we have the situation in Figure 7.3. Port $x$ determines that bus $b$ is column monotonic. If $v_w = 3$ or $v_w = 4$, then the situation in Figure 7.4 ensures that bus $b$ is row monotonic.
Figure 7.3: Detection of a column monotonic bus

Figure 7.4: Detection of a row monotonic bus
Case 2: $|Q| = 2$. Here $Q = \{v_x, v_w\}$. If $v_w \ne 5$, then we have the situation in
Figure 7.5, and port $x$ cannot tell whether bus $b$ is column monotonic or row
monotonic.
If $v_w = 5$ (see Figure 7.6), then again $x$ cannot determine definitely whether
bus $b$ is column monotonic or row monotonic.
At this point, each port has either determined its bus to be row monotonic ($R$),
column monotonic ($C$) or undecided (null). Processors of $L$ reconnect their ports
to reconstruct bus $b$. Solve the gossiping problem again on set $\{R, C\}$ for the entire
bus $b$. Since every bus is either row or column monotonic, the result of the gossiping
is a set with at most one element. If the result is $\{R\}$, then the bus is row monotonic. If
it is $\{C\}$, then the bus is column monotonic. If it is $\emptyset$, then the bus qualifies to
be both row monotonic and column monotonic. Assign it to be either row or column
monotonic.
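
The two-level decision in this proof (first per segment, then per bus) can be summarized in a short sketch. The function names and the string labels "R", "C", and "either" are illustrative assumptions, and the code runs sequentially rather than on an LR-Mesh.

```python
from typing import Iterable, Optional, Set


def classify_segment(q: Set[int]) -> Optional[str]:
    """Decision made at a port from the set Q gossiped on its segment.

    Values 1/2 (North/South) at both ends indicate a column monotonic
    bus; values 3/4 (East/West) at both ends indicate a row monotonic
    bus. Any other outcome (|Q| = 2, or an end-of-bus value 5) leaves
    the port undecided, represented here by None.
    """
    if len(q) == 1:
        (v,) = q
        if v in (1, 2):
            return "C"
        if v in (3, 4):
            return "R"
    return None


def classify_bus(segment_votes: Iterable[Optional[str]]) -> str:
    """Combine the per-segment decisions by gossiping on {R, C}."""
    votes = {v for v in segment_votes if v is not None}
    if len(votes) > 1:
        # Cannot happen if the bus really is semimonotonic.
        raise ValueError("bus is neither row nor column monotonic")
    # An empty result means the bus qualifies as both; the caller may
    # assign it to either configuration.
    return votes.pop() if votes else "either"
```

For instance, a bus whose segments report `["C", None, "C"]` is placed in the column monotonic configuration, while a bus whose segments are all undecided may be placed in either one.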

Figure 7.5: Illustration of the case $v_w \ne 5$

Figure 7.6: Illustration of the case $v_w = 5$
Scalability: In general an LR-Mesh has an optimal scaling simulation [3]. That is,
for any $R' < R$ and $C' < C$, an $R \times C$ LR-Mesh can be simulated in $O\left(\frac{RC}{R'C'}\right)$ steps on
an $R' \times C'$ LR-Mesh. This allows many LR-Mesh algorithms that use $\Theta(P)$ processors
to state results for $P$ processors. However, the scaling simulation algorithm used for
this result destroys the structural characteristics of the simulated LR-Mesh buses
(semimonotonicity etc.). Therefore we cannot use the LR-Mesh scalability freely
in our results, which rely on buses having these structural properties. Figure 7.7
illustrates the situation and shows why constants $c_1$ and $c_2$ cannot be removed in
general.
If the given LR-Mesh algorithm uses monotonic buses (both row and column
monotonic) then a different scaling simulation due to Murshed [33] can be used. This
simulation preserves the buses' monotonicity. As shown in Figure 7.8, constants can
now be removed.

Figure 7.7: General scaling simulations do not work for semimonotonic buses

7.2 The Bends-Cost Measure
As mentioned in Section 2.5, a reconfigurable bus is a combinational circuit that
establishes a data path from each potential writer to all processors connected to the
bus. Because there are relatively few taps between two successive gates (switches
used to configure the bus), capacitive loading [44] is not a predominant factor. The
primary concern is the gate delay of the longest path of this circuit. Thus, the bus
delay could be considered to be the gate delay in the longest path traversed by data
in a bus. In Section 2.5 we described several measures for the bus delay. In this
section we introduce a new measure for linear buses called bends-cost.
Figure 7.8: Using a restricted scaling simulation for semimonotonic buses

Bends-Cost: Under bends-cost, the delay of a bus is assumed to be roughly proportional to the number of "bends" in the bus. (Each transition of the bus from a
row of the LR-Mesh to a column, or vice versa, is called a bend.) Specifically, if a
linear bus snakes through $r$ rows and $c$ columns of the LR-Mesh, then its bends-cost
delay is $r + c$. Consider the $N \times N$ LR-Mesh (for $N = 7$) of Figure 7.9. Buses labeled
A and B have the same end points and both span $O(N)$ processors. However, bus A
has one bend and $O(1)$ bends-cost delay, while bus B has $\Theta(N)$ bends and therefore
$\Theta(N)$ delay. Bus C also has $\Theta(N)$ delay, even though it alternates between the same
two rows of the LR-Mesh.
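
To make the notion of a bend concrete, the following sketch counts the bends of a linear bus given as a sequence of grid positions. The path representation and the function name `count_bends` are assumptions made for illustration, and the two example paths are hypothetical shapes in the spirit of buses A and B, not a transcription of Figure 7.9.

```python
from typing import List, Tuple

Cell = Tuple[int, int]  # (row, column) of a processor on the bus


def count_bends(path: List[Cell]) -> int:
    """Count row-to-column (or column-to-row) transitions along a bus.

    `path` lists the processors visited by a linear bus in order; each
    consecutive pair must be horizontally or vertically adjacent.
    """
    bends = 0
    prev_dir = None
    for (r0, c0), (r1, c1) in zip(path, path[1:]):
        if r0 == r1 and abs(c0 - c1) == 1:
            direction = "H"  # moving within a row
        elif c0 == c1 and abs(r0 - r1) == 1:
            direction = "V"  # moving within a column
        else:
            raise ValueError("path cells must be mesh neighbors")
        if prev_dir is not None and direction != prev_dir:
            bends += 1
        prev_dir = direction
    return bends


N = 7
# An L-shaped bus: along row 0, then down the last column (one bend).
l_shaped = [(0, c) for c in range(N)] + [(r, N - 1) for r in range(1, N)]
# A staircase between the same end points: right, down, right, down, ...
staircase = [(0, 0)]
for i in range(N - 1):
    staircase += [(i, i + 1), (i + 1, i + 1)]
```

Both paths connect $(0, 0)$ to $(6, 6)$ and visit 13 processors, yet `count_bends(l_shaped)` is 1 while `count_bends(staircase)` is 11, which grows linearly with $N$.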
Lemma 7.6 For any $x \le N$, each bus of any column monotonic configuration of an
$N \times x$ LR-Mesh has at most $2x - 2$ bends.

Proof: A bus originating at the leftmost column can have at most $2x - 2$ bends
before it reaches the rightmost column. Since it is column monotonic, it cannot turn
back to a previously traversed column.

Figure 7.9: Buses with different numbers of bends for an $N \times N$ LR-Mesh ($N = 7$)

7.3 A Bends-Cost LR-Mesh Implementation
In this section, we outline an LR-Mesh implementation for which the bends-cost
measure is an accurate indicator of the actual bus delay. A segmentable bus is an
important building block of this implementation. Recall that a segmentable bus (see
Figure 2.4) consists of processors connected to a bus with each processor p capable
of controlling a switch that can segment the bus between p and the previous processor. The segmentable bus is similar in function to a 1-dimensional R-Mesh (see
Section 2.2). Let $S(N)$ denote an $N$-processor segmentable bus whose (gate) delay is
at most $\Delta$. By Theorem 6.4 (page 115) and the fact that each step on a CST with
$N$ leaves has $O(\log N)$ gate delay, $\Delta = O(\log N)$. We note, however, that the results
of this chapter are not premised on any particular segmentable bus implementation
and are general enough to accommodate future improvements in $\Delta$.
Construct an $N \times N$ bends-cost LR-Mesh as follows. Arrange processors as an
$N \times N$ array and connect each row and each column of processors by a segmentable
bus $S(N)$ (see Figure 7.10(a)). At this point, this structure is an implementation of
a special case of the LR-Mesh called the HVR-Mesh [3] (or Basic R-Mesh [6]) that
restricts all its buses to be without bends. Within each processor, additional switches
allow a bus segment to bend from a row to a column or vice versa (see Figure 7.10(b)).
The connections between the N, S or E, W ports are through the segmentable bus; Figure 7.10(b) shows these connections dashed. The switching fabric in each processor
has the form shown in Figure 7.11, requiring only four 2-input multiplexers, one for
each port. Thus, this additional switching fabric along with the CST implementation
of a segmentable bus could realize the entire bus as a combinational circuit.

Figure 7.10: Structure of a bends-cost LR-Mesh implementation

Figure 7.11: Switching fabric of a bends-cost LR-Mesh processor

Thus a bus with $\beta$ bends consists of $\beta + 1$ row and column bus segments connected
in tandem, each with at most $\Delta$ delay (the delay of one segmentable bus). Consequently, the actual delay of the bus is
$\Delta(\beta + 1)$, which is proportional to the quantity $\beta + 1$ predicted by the bends-cost
measure.
Thus, we can now state results for the bends-cost LR-Mesh using buses with
$\beta$ bends or, equivalently, an LR-Mesh result using buses whose delay $D$ is proportional to $\beta\Delta$.

7.4 Designing Implementable LR-Mesh Algorithms
In the last section we showed that the bends-cost measure of bus delay can be an
accurate model of the actual delay. Therefore one way to design an implementable
LR-Mesh algorithm is to ensure that the algorithm uses buses with limited numbers
of bends. One way to approach this task is to redesign the LR-Mesh algorithms.
The other approach is to design an automatic method to convert large classes of
the LR-Mesh algorithms (configurations) to run with bounded delay. We adopt the
latter approach as it offers the possibility of harnessing the large body of results for
the LR-Mesh.
In the next section we begin our discussion with semimonotonic configurations.
Many fundamental LR-Mesh algorithms (including counting, prefix sums, multiple
addition, and sorting) use semimonotonic configurations. We show that every semimonotonic configuration can be emulated quickly and efficiently on buses of bounded
delay (number of bends). Subsequently in Section 7.6, we consider more general
configurations.
In these sections we will consider LR-Meshes that use either the unit-cost or
the bends-cost measure of bus delay. In addition to the number of steps used, we
will characterize a bends-cost LR-Mesh algorithm by the maximum delay $D$ (or the
maximum number of bends) that any bus in any configuration of the algorithm may
have.

7.5 Simulating Semimonotonic Configurations
Without loss of generality, let the configuration be column monotonic (see Lemma 7.5).
Let $U$ be a column monotonic configuration of an $N \times N$ unit-cost LR-Mesh. We use
the symbol $U$ to denote both this configuration and the unit-cost LR-Mesh itself.
Let $B$ be a $c'N \times c''N$ bends-cost LR-Mesh, where $c'$ and $c''$ are constants whose values
will become apparent later. One could view each processor of $U$ as corresponding to
a unique $c' \times c''$ "cluster" of processors of $B$. Clearly, each sub-LR-Mesh of $U$ also
corresponds to a sub-LR-Mesh of $B$. Let $D \ge \Delta$ denote the maximum delay that a
bus of $B$ can incur, where $\Delta$ is the delay of an $N$-processor segmentable bus. The main result of this section is the following theorem.
Theorem 7.7 Let $\Delta$ be the delay of an $N$-processor segmentable bus. For any $D \ge \Delta$,
any semimonotonic configuration of an $N \times N$ unit-cost LR-Mesh can be simulated
in $O\left(\left(\frac{\log N}{\log D - \log \Delta}\right)^2\right)$ time on a $\Theta(N) \times \Theta(N)$ bends-cost LR-Mesh using buses with
at most $D$ delay.

The following corollary reflects an interesting special case of this result.
Corollary 7.8 For any $\epsilon > 0$, any semimonotonic configuration of an $N \times N$ unit-cost LR-Mesh can be simulated in $O(1)$ time on a $\Theta(N) \times \Theta(N)$ bends-cost LR-Mesh
using buses with at most $N^{\epsilon}$ delay.

Remark: This is the first general method to achieve constant time on a reconfigurable
model without resorting to the unit-cost measure of bus delay.
Most of the remainder of this section is devoted to establishing Theorem 7.7.
We organize the argument into three parts. Sections 7.5.1 and 7.5.2 reduce the
simulation to a "channel assignment" problem in an LR-Mesh. Section 7.5.3 first
solves a restricted version of this channel assignment problem and develops results
that find use later. Section 7.5.4 uses the results of Section 7.5.3 to solve the (general)
channel assignment problem and completes the simulation.
7.5.1 Simulation Algorithm
The purpose of the simulation is as follows. Suppose that writes during a step with
configuration $U$ result in value $v_p$ on some port $p$ of the slice. Then on the corresponding port $p'$ of $B$, the same value $v_p$ must appear at the end of the simulation.
Moreover, $B$ can only employ buses with $O(D)$ delay.
Let $x = cD/\Delta$, where $c < 1$ is a constant, $\Delta$ is the delay of an $N$-element segmentable
bus, and $D \ge \Delta$ is the maximum allowed delay of buses in $B$. Without loss of
generality, let $N = x^{\alpha}$, for integer $\alpha \ge 1$. At this point we note that the delay of
buses of $B$ will be proportional to the quantity $x\Delta$. Since we can select the constant $c$
in the definition of $x$ without constraint, we may assume that $B$ permits $O(D)$ delay,
rather than "at most $D$ delay," as required.
For $1 \le k \le \alpha = \log_x N$, we use the following recursive algorithm to simulate an
$N \times x^k$ sub-LR-Mesh $U_k$ of $U$ on the corresponding sub-LR-Mesh $B_k$ of $B$. Note that
since $U$ has a column monotonic configuration, so does $U_k$.
1. If $k = 1$, then we have an $N \times x$ LR-Mesh $U_1$. By Lemma 7.6, every bus
of $U_1$ has at most $O(D/\Delta)$ bends and, therefore, $O(D)$ delay. Consequently, $B_1$
can simulate $U_1$ by using the exact same bus configuration (without incurring
excessive bus delay).
2. If $k > 1$, then divide $U_k$ into $x$ slices, each consisting of $x^{k-1}$ contiguous columns;
that is, each slice is an $N \times x^{k-1}$ sub-LR-Mesh of $U_k$ (see Figure 7.12). Moreover,
each slice has a column monotonic configuration. Similarly, divide $B_k$ into
corresponding slices.
We will refer to the W ports of processors on the leftmost column of a slice
collectively as the left border of the slice. Similarly define the right, top, and
bottom borders of the slice (see Figure 7.13(a)). Adjacent slices touch only at
their left and right borders.
3. Recursively simulate (in parallel) the slices of $U_k$ on the corresponding slices of
$B_k$. Now, for each bus $b$ of each slice $S$ of $U_k$ the following statements hold.
• All processors of $b$ hold the value, if any, written to $b$ from within the slice.
• All processors of $b$ hold the end points of the bus.
The remaining phases serve to propagate bus values among slices. Once a slice
has received values (if any) that come from outside it, it is easy to reverse the
steps of the recursion to propagate these new values within the slice. So we
focus only on propagating values among slices.

Figure 7.12: Dividing a slice into $x$ slices
4. Consider the following classification of buses of slice $S$ of $U_k$ or the corresponding
slice $S'$ of $B_k$.
• Category 0: Neither end point of bus $b$ touches the left or right border of
$S$ (see buses marked C–F in Figure 7.13).
• Category 1: One end point of $b$ touches the left or right border of $S$ (see
buses G–K of Figure 7.13).
• Category 2: The end points of $b$ are on the left and right borders of $S$ (see
buses of Type A and B in Figure 7.13).
Note that for a column monotonic bus, both end points cannot be on the left
border (or both on the right border) of $B_k$. In this phase, $B_k$ identifies the
category of each bus $b$ as follows. Let the end points of $b$ be $r$ and $s$. If $r$ or $s$ is on the left
border, then $B_k$ writes the index of the port on the bus. Next, if $r$ or $s$ is on
the right border, then $B_k$ writes the port index on the bus.
It is easy to verify that bus $b$ is in Category $i$ ($0 \le i \le 2$) iff it receives $i$ values
in the above steps. In addition, a Category 2 bus can identify its Type (A or B)
as well by ascertaining whether its left end is higher than the right (for Type A)
or not (for Type B).

Figure 7.13: Bus types

5. In this phase, $B_k$ performs actions for a bus $b$ depending on its category.
• Category 0: Here $b$ has no effect on other slices, nor is it affected by other
slices. Therefore $b$ does not participate further in the simulation.
• Category 1: Let bus $b$ touch another bus $b'$ of an adjacent slice that
traverses a port on the left or right border of this adjacent slice. Let
this port of the adjacent slice be in processor $p'$. If the recursive step
computes the value of bus $b$ to be $v$ (that is, there is a write from within
slice $S$), then the end of bus $b$ touching bus $b'$ sends $v$ to processor $p'$ and
bus $b$ does not participate further in the simulation. If slice $S$ does not
generate a value for bus $b$, then it waits to hear from processor $p'$ about
its final bus value.
• Category 2: Let $p$ and $q$ be the ports at the left and right end points of
bus $b$. Construct a column monotonic bus $b'$ in the corresponding slice $S'$
of $B_k$ so that (i) the end points (clusters) $p'$ and $q'$ of $b'$ correspond to the
end points $p$ and $q$ of $b$, and (ii) bus $b'$ has a constant number of bends
(see Figure 7.14).
The action for Category 2 buses is undertaken after Category 1 buses have been
handled. Thus, the entire slice $B_k$ is available for Category 2.
6. Note that each bus in each of the $x$ slices of $U_k$ is either removed or replaced by
a column monotonic bus with a constant number of bends. Thus, the configuration of the bends-cost LR-Mesh $B_k$ is column monotonic, and each bus has
$O(x)$ bends, or $O(D)$ delay. Therefore, $B_k$ can use its buses without incurring
excessive delay and convey values between slices.
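
The classification performed in phase 4 of the list above can be sketched as follows. This is a sequential illustration only: the border labels, the function names, and the convention of passing end points as strings are assumptions, not part of the original algorithm. Rows are taken to be numbered from top to bottom, so a "higher" end point has a smaller row number.

```python
from typing import Tuple


def bus_category(end_points: Tuple[str, str]) -> int:
    """Return the category (0, 1, or 2) of a bus within a slice.

    Each end point is labeled 'left', 'right', or 'other' (top border,
    bottom border, or interior of the slice). The category equals the
    number of port indices written to the bus in phase 4: one if an end
    lies on the left border, plus one if an end lies on the right border.
    """
    if end_points in (("left", "left"), ("right", "right")):
        raise ValueError("impossible for a column monotonic bus")
    received = 0
    if "left" in end_points:
        received += 1  # the left-border end writes its port index
    if "right" in end_points:
        received += 1  # the right-border end writes its port index
    return received


def category2_type(left_row: int, right_row: int) -> str:
    """Type A if the left end is at or above the right end, else Type B."""
    return "A" if left_row <= right_row else "B"
```

A bus with end points `("left", "right")` is in Category 2; if its left end is in row 2 and its right end in row 5, `category2_type(2, 5)` reports Type A.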
Let $T(k)$ denote the time to perform the above simulation on an $N \times x^k$ LR-Mesh.
Let $t_{k-1}$ be the time to handle Category 2 buses for an $N \times x^{k-1}$ slice. Then, $T(1)$
is constant and for all $k > 1$, $T(k) = T(k-1) + t_{k-1} + \text{constant}$. Solving this
recurrence, we have
\[
T(k) = O\left(\sum_{q=1}^{k-1} t_q\right). \qquad (7.1)
\]
Figure 7.14: Handling Category 2 buses

Handling Category 2 Buses: The only step of this algorithm requiring further
elaboration is the handling of Category 2 buses. Figure 7.15(a) shows representative
buses of Types A and B of Category 2. Both these types of buses have one end point
on the left border and one end on the right border. The only difference is that for a
Type A bus the left end point is higher than the right, and for a Type B bus, the left
end point is lower.
Our solution for handling Category 2 buses of Type A will use a $c_1 N \times c_2 x^{k-1}$
sub-LR-Mesh of $B_k$ to handle a slice of $U_k$. Clearly, Type B buses can also be handled
similarly; that is, all Type B buses can also be laid out on a $c_1 N \times c_2 x^{k-1}$ sub-LR-Mesh
of $B_k$. We now describe how both types can be handled simultaneously on $B_k$. Call
the sub-LR-Mesh handling Type A (resp., Type B) buses Tier A (resp., Tier B).
The two tiers are interleaved into a $2c_1 N \times (2c_2 x^{k-1} + 2)$ sub-LR-Mesh so that their
buses can be laid out without interfering with each other. Figure 7.15(a) shows a set
of Type A and B buses. Figures 7.15(b) and (c) show these routed in separate tiers.
Figure 7.16 shows the tiers combined.

Figure 7.15: Routing Type A and B buses in different tiers

The basic idea is as follows. Divide the $2c_1 N \times (2c_2 x^{k-1} + 2)$ sub-LR-Mesh $L$ (say)
into three vertical strips. The first and third consist of the first and last columns of the
sub-LR-Mesh (shown shaded in Figure 7.16). The middle strip is a $2c_1 N \times 2c_2 x^{k-1}$
LR-Mesh $L'$ (say). The layout of Type A and Type B buses in $L'$ uses the following rules.
For Type A buses, all vertical (resp., horizontal) segments occupy only even columns
(resp., rows) of $L'$ (number rows and columns $0, 1, \dots$). For Type B buses, on the other
hand, all vertical (resp., horizontal) segments occupy only odd columns (resp., rows).
Thus the only way a Type A bus and a Type B bus traverse the same processor is in
different directions (one horizontal and the other vertical). Consequently, a Type A
bus will never get in the way of a Type B bus, and vice versa.
We now explain the function of the first and the last strips (columns in Figure 7.16)
of $L$. In the method explained above, Type A (resp., Type B) buses exit the left and
right borders of $L'$ at even (resp., odd) rows. However, a Type A bus on one slice
may be a Type B bus of the next; that is, the type of a bus in one slice does not
determine its type in the adjacent slice. The additional realignment columns (shown dashed in Figure 7.16) serve to
position buses on the same rows of adjacent slices (regardless of the bus type). Note
that since a processor on the left or right border of $U_k$ can have a Type A or a Type B
bus (but not both), and since this type is known before the layout begins, the realignment columns
can be configured appropriately.
Thus with $c' = 2c_1$ and $c'' < 3c_2$, the size of the simulating bends-cost LR-Mesh
$B$ is $c'N \times c''N$.
This idea of using two tiers to accommodate two classes of buses can be extended
to multiple tiers; $m$ tiers of an $R \times C$ LR-Mesh can be accommodated on an $mR \times
(mC + 2)$ LR-Mesh.

Figure 7.16: Combining two tiers

Since subsequent discussion is for Type A buses (Type B being handled analogously), we will refer to rows, columns, and processors of $B_k$ rather than pairs of
rows, columns, and clusters, to mean rows, columns, and processors of $B_k$ that handle
Type A buses.
7.5.2 The Channel Assignment Problem
All that remains now is the handling of Type A buses on a $c_1 N \times c_2 x^{k-1}$ LR-Mesh.
Specifically, let the end points of bus $b$ be at row $p$ of the left border and at row $q$
of the right border of a slice of $U_k$, where $p \le q$. The aim is to construct a column
monotonic bus $b'$ in $B_k$ starting from the left border at row $p'$ and ending at the right
border at row $q'$. The constructed bus $b'$ will have three segments (two bends, or a
delay of $3\Delta$, where $\Delta$ is the segmentable bus delay): the first segment is horizontal and runs in row $p'$ from the left border
to some column $m$; the third segment runs in row $q'$ from column $m$ to the right
border of the slice. The middle segment is a vertical bus on column $m$ between rows
$p'$ and $q'$. For example, for the Type A bus in bold in Figure 7.15(a), the algorithm
constructs the three-segment bus shown in bold in Figure 7.15(b). Since no other
bus of the given slice of $U_k$ can have $p$ and $q$ as end points, it is straightforward to
construct the first and third segments of bus $b'$. The challenge lies in selecting an
appropriate column $m$ for each bus so that no two buses overlap, yet all buses are
accommodated within $x^{k-1}$ columns.
The task now reduces to the following channel assignment problem. Denote by
$\{b_0, b_1, \dots, b_{y-1}\}$ a set of Type A buses of an $N \times X$ LR-Mesh (slice of $U_k$), where
$X = x^{k-1}$. Each of these buses has a delay of at most $D$. For each bus $b_i$, let $s_i$
and $e_i$ denote the row numbers where the bus touches the left and right borders of
the slice; we will call these rows the starting and ending rows, respectively, of bus $b_i$.
Since $b_i$ is a Type A bus, $s_i \le e_i$; recall that rows are numbered in increasing order
from the top to the bottom of the R-Mesh. The task is to use a $c_1 N \times c_2 X$ bends-cost
LR-Mesh to assign a column number $m_i$, where $0 \le m_i < X$, to each bus $b_i$ such that
if $m_i = m_j$ (for $i \ne j$), then either $s_i \ge e_j$ or $e_i \le s_j$. For example, if the input buses
are shown in Figure 7.17(a), then the buses could be assigned columns as shown in
Figure 7.17(b).

Figure 7.17: Assignments of columns to buses
In the next section we solve a restricted case of the channel assignment problem
and develop some tools used subsequently to solve the general (unrestricted) channel
assignment problem. In these sections we will often treat the simulating bends-cost
LR-Mesh, $V$ (say), as an $N \times X$ LR-Mesh so that its rows and columns are in one-to-one correspondence with the simulated $N \times X$ unit-cost LR-Mesh, $S$ (say). The
details necessary to tailor the discussion to the actual size of the simulating LR-Mesh
are tedious and will provide no additional insight. To distinguish the input buses
$b_0, b_1, \dots, b_{y-1}$ of $S$ from buses of $V$ used to solve the channel assignment problem,
we will refer to the $y$ input buses as "BUSES." That is, "buses" are physical buses
configured by $V$ during the simulation of $S$, whereas "BUSES" are simply inputs to
this simulation.
7.5.3 Restricted Channel Assignment
In this section we solve a simple case of the channel assignment problem where $e_i - s_i \le
X$, for each BUS $b_i$ (that is, each BUS has at most $X$ rows between its starting and
ending rows). The solution is quite straightforward. Simply assign $m_i = s_i \bmod X$
to BUS $b_i$. BUS $b_i$ occupies column $m_i$ from the S port of the processor in row $s_i$ to
the N port of the processor in row $e_i$. The next BUS $b_j$ that can use column $m_i$ has
$s_j \ge s_i + X \ge e_i$. Thus the restricted channel assignment problem can be solved in
constant time.
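
The rule $m_i = s_i \bmod X$ is simple enough to check with a short sequential sketch. The list-of-pairs representation and the names `assign_restricted` and `is_valid_assignment` are assumptions for illustration; the validity test is the non-overlap condition of the channel assignment problem stated in Section 7.5.2.

```python
from typing import List, Tuple

Bus = Tuple[int, int]  # (starting row s_i, ending row e_i) of a Type A BUS


def assign_restricted(buses: List[Bus], X: int) -> List[int]:
    """Assign column m_i = s_i mod X to each BUS.

    Assumes the restricted case: 0 <= e_i - s_i <= X for every BUS, and
    distinct starting rows (each left-border port carries one BUS).
    """
    starts = [s for s, _ in buses]
    if len(set(starts)) != len(starts):
        raise ValueError("starting rows must be distinct")
    for s, e in buses:
        if not 0 <= e - s <= X:
            raise ValueError("BUS spans more than X rows")
    return [s % X for s, _ in buses]


def is_valid_assignment(buses: List[Bus], cols: List[int]) -> bool:
    """Check that BUSES sharing a column do not overlap vertically."""
    for i in range(len(buses)):
        for j in range(i + 1, len(buses)):
            if cols[i] == cols[j]:
                (si, ei), (sj, ej) = buses[i], buses[j]
                if not (si >= ej or ei <= sj):
                    return False
    return True


X = 3
example = [(0, 3), (1, 2), (2, 5), (3, 6), (4, 4), (7, 9)]
columns = assign_restricted(example, X)
```

In this example the BUSES starting in rows 0 and 3 share column 0; the first ends in row 3, exactly where the second begins, so the two vertical segments do not overlap.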
Lemma 7.9 An instance of the channel assignment problem where $e_i - s_i \le X$ for
each $0 \le i < y$ can be solved in $\Theta(1)$ time.

Remarks: For only Type A buses an $N \times X$ simulating LR-Mesh suffices for the above
result (i.e., $c_1 = c_2 = 1$). If both Types A and B are possible in the configuration,
then $c_1 = c_2 = 2$.
Recall the definition of an incremental configuration in Section 2.4. By Lemma 7.5,
there is no loss of generality in assuming that a given incremental configuration is
column monotonic. In such a configuration a bus cannot cover more than $X$ rows while
traversing through $X$ columns. Clearly, the simulation of such a configuration will
require only the solution to the restricted channel assignment problem (Lemma 7.9).
From this observation and Equation (7.1), we have the following result (here, too,
$c_1 = c_2 = 2$).
Theorem 7.10 Let $\Delta$ denote the delay of an $N$-element segmentable bus. For any
$D \ge \Delta$, any incremental configuration of an $N \times N$ unit-cost LR-Mesh can be simulated in $O\left(\frac{\log N}{\log D - \log \Delta}\right)$ time on a $\Theta(N) \times \Theta(N)$ bends-cost LR-Mesh whose buses have
at most $D$ delay.

Proof: By Lemma 7.9, $t_q = O(1)$ for $1 \le q < k$. Therefore, by Equation (7.1),
$T(k) = O(k)$. Since $x^{\alpha} = N$, our simulation simulates an $N \times x^{\alpha}$ LR-Mesh. So the
total time is
\[
T(\alpha) = O(\alpha) = O\left(\frac{\log N}{\log x}\right) = O\left(\frac{\log N}{\log (D/\Delta)}\right) = O\left(\frac{\log N}{\log D - \log \Delta}\right).
\]
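
As a rough numerical illustration of this bound, the sketch below computes the recursion depth $\alpha$, the smallest integer with $x^{\alpha} \ge N$ for $x = cD/\Delta$. The function name and the default value $c = 0.5$ are assumptions chosen for the example; the text only requires $c < 1$.

```python
def recursion_depth(N: int, D: float, delta: float, c: float = 0.5) -> int:
    """Smallest integer alpha with x**alpha >= N, where x = c * D / delta.

    N     : side of the simulated N x N LR-Mesh
    D     : maximum bus delay permitted on the bends-cost LR-Mesh
    delta : delay of an N-processor segmentable bus
    c     : the constant in x = c*D/delta (0.5 is an arbitrary choice)
    """
    x = c * D / delta
    if x <= 1:
        raise ValueError("need c * D / delta > 1 for the recursion to shrink")
    alpha = 1
    while x ** alpha < N:
        alpha += 1
    return alpha
```

For example, with $N = 2^{20}$, $D = 2^{12}$ and $\Delta = 2^{4}$ the slice factor is $x = 128$ and the depth is 3, whereas shrinking the allowed delay to $D = 64$ gives $x = 2$ and a depth of 20. When $D = N^{\epsilon}$ the depth is bounded by a constant depending only on $\epsilon$, which matches the constant-time special case of Corollary 7.8.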

Remark: Note that even if a configuration is not incremental, it may be possible to
simulate it using the restricted channel assignment problem.
In the next few subsections, we show that counting $N$ bits and sorting $N$ numbers
can be performed on a $\Theta(N) \times \Theta(N)$ bends-cost LR-Mesh whose buses have at most $D$
delay in $O\left(\frac{\log N}{\log D - \log \Delta}\right)$ time. For adding $N$ $b$-bit numbers, the time is the same, but
a $\Theta(N) \times \Theta(Nb)$ bends-cost LR-Mesh is used. These results will be useful in the more
general technique for simulating any semimonotonic configurations (not necessarily
incremental).
7.5.3.1 Applications

In this section we apply Theorem 7.10 to some fundamental algorithms (counting N
bits, adding N b-bit numbers, and sorting N numbers).
Counting N bits: An $(N+1) \times N$ unit-cost R-Mesh can count the number of 1's
among $N$ input bits $b_0, b_1, \dots, b_{N-1}$ in constant time. Index the rows (resp. columns)
of the R-Mesh $0, 1, \dots, N$ (resp. $0, 1, \dots, N-1$). Initially, processor $(0, j)$ in row 0
and column $j$ holds input bit $b_j$. The algorithm involves an initial step to broadcast
$b_j$ along column $j$. This step uses buses with no bends and does not pose any problem
on a bends-cost R-Mesh implementation. Subsequent steps involve buses with $\Theta(N)$
bends and we focus our attention only on those.
The algorithm constructs incremental monotonic buses starting at the processors
of the left border of the R-Mesh and moving down one row each time a 1 is encountered
(see Figure 7.18).
The bus starting at processor $(0, 0)$ reaches processor $(z, N-1)$ iff the input
bits include $z$ 1's. Thus, if processor $(0, 0)$ sends a signal from its W port, the signal will reach
processor $(z, N-1)$, where $z$ is the number of 1's. Obviously all buses are of Type
A. All other bus types can be ignored as they do not carry the signal. Clearly,
a bus could bend $\Theta(N)$ times (as there could be $\Theta(N)$ 1's in the input). A direct
implementation of this algorithm on the bends-cost LR-Mesh could have buses with a
delay of $\Theta(N\Delta)$, where $\Delta$ is the delay of a segmentable bus, which is more than that of a naive LR-Mesh implementation using
linear-cost buses.
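
The behavior of the signal bus can be traced with a short sequential sketch. It assumes the usual staircase wiring (a processor holding a 1 routes the bus from its W port to its S port, and the processor below routes it from N to E, so each 1 contributes two bends); the function name and return convention are illustrative, and the code is not the R-Mesh algorithm itself.

```python
from typing import List, Tuple


def trace_signal_bus(bits: List[int]) -> Tuple[int, int]:
    """Follow the bus that starts at the W port of processor (0, 0).

    Returns (exit_row, bends): the row at which the bus leaves the last
    column, which equals the number of 1's, and the number of bends on
    the bus under the wiring assumed above.
    """
    row = 0
    bends = 0
    for b in bits:  # one column per input bit, left to right
        if b == 1:
            row += 1    # the bus drops one row in this column
            bends += 2  # row -> column, then column -> row
    return row, bends


example_bits = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0]
exit_row, bends = trace_signal_bus(example_bits)
```

For these 13 inputs the signal leaves at row 7, the number of 1's, after 14 bends. The bend count grows with the number of 1's, which is why the direct implementation needs the limited-delay simulation of Theorem 7.10.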
With Theorem 7.10, we have the following result.
Theorem 7.11 Let $\Delta$ denote the delay of an $N$-element segmentable bus. For any
$D \ge \Delta$, a $\Theta(N) \times \Theta(N)$ bends-cost LR-Mesh can count $N$ bits in $O\left(\frac{\log N}{\log D - \log \Delta}\right)$ time
using buses of at most $D$ delay.

Figure 7.18: Counting bits on the LR-Mesh (example inputs 1 1 1 0 0 0 1 0 1 1 0 1 0; the signal emerges at row 7, the answer)
Multiple Addition: The multiple addition problem involves adding $N$ $b$-bit integers (where $b = O(\log N)$). Jang and Prasanna [22] established that a $2N \times 2Nb$
LR-Mesh can solve this problem in constant time. For brevity, we do not go into the
details of this algorithm. The following observations distill structural aspects of their
algorithm relevant to a bends-cost LR-Mesh implementation.
The algorithm has steps that involve broadcasting within a row or a column
and using a segmentable bus within a row or column. These steps use buses with
no bends. The only step that does not fall in this category involves constructing
column-monotonic incremental buses and transmitting signals from the West and
North edges.
From Theorem 7.10, we have the following result.
Theorem 7.12 For any $D \ge \Delta$ and $b = O(\log N)$, a $\Theta(N) \times \Theta(Nb)$ bends-cost
LR-Mesh can add $N$ $b$-bit integers in $O\left(\frac{\log N}{\log D - \log \Delta}\right)$ time using buses of at most $D$
delay.

Sorting: The problem of sorting an array of elements $A = (a_0, a_1, \dots, a_{N-1})$ is to
arrange the elements of $A$ in increasing (or decreasing) order. The only assumption
about the elements of $A$ is that they are pairwise comparable. The constant time
sorting algorithm of Jang and Prasanna [22] runs on an $N \times N$ unit-cost LR-Mesh.
We now show that each step of this algorithm can be performed by an incremental
configuration.
The algorithm is based on Leighton's seven-step column sort [25] that requires
an $N \times N^{3/4}$ LR-Mesh to sort $N^{3/4}$ elements in four of the seven steps and an $N$-element
permutation routing in the remaining three steps. The following result is standard.

Fact 1: An $N \times N$ LR-Mesh can perform a permutation routing on $N$ elements in
a row in $O(1)$ time using buses with at most two bends.

Therefore we only need consider the algorithm to sort $N^{3/4}$ numbers on an $N \times N^{3/4}$
LR-Mesh. Jang and Prasanna [22] applied the algorithm of Theorem 7.12 a constant
number of times to design an $N^{1/4} \times N^{3/4}$ LR-Mesh algorithm to add $N^{3/4}$ bits. This
counting algorithm is used to design the $N^{3/4}$-element sorter as follows. Let $a'_i$ ($0 \le
i < N^{3/4}$) be the elements to be sorted.
Divide the $N \times N^{3/4}$ LR-Mesh into $N^{3/4}$ blocks each of size $N^{1/4} \times N^{3/4}$. The $i$th block
determines (by all possible comparisons and counting (Theorem 7.11)) the number
of inputs $h_i$ smaller than or equal to $a'_i$; the comparisons use buses with no bends. The
LR-Mesh routes (Fact 1) $a'_i$ to position $h_i$. Thus, we have the following fact.

Fact 2: An $N^{1/4} \times N^{3/4}$ bends-cost LR-Mesh can sort $N^{3/4}$ elements in $O\left(\frac{\log N}{\log D - \log \Delta}\right)$
steps using buses of at most $D$ delay.

Theorem 7.13 Let $\Delta$ denote the delay of an $N$-element segmentable bus. For any
$D \ge \Delta$, a $\Theta(N) \times \Theta(N)$ bends-cost LR-Mesh can sort $N$ elements in $O\left(\frac{\log N}{\log D - \log \Delta}\right)$
time using buses of at most $D$ delay.

We now have the tools to tackle the general channel assignment problem.
7.5.4 General Channel Assignment
In the last section we solved the restricted channel assignment problem on a (N ) 
(X ) bends-cost LR-Mesh with the restriction that the end points of each BUS be
no more than X rows apart. Here we remove this restriction. For completeness, we
de ne the channel assignment problem once again.
Let S be an N  X LR-Mesh with Type A BUSES b0 ; b1 ,    ; by . For 1  l < y, let
BUS bl touch the left border of S at row sl (starting row) and the right border at row
el (ending row). We will call the processors at the left border of row sl and the right
border of row el as the left and right ends, respectively, of BUS bl (see Figure 7.19
for an example). Without loss of generality, assume that for each bus bl , el sl > X ;
if not, bl can be processed separately as in Section 7.5.3. Let V be a c1N  c2X
LR-Mesh. The solution to the channel assignment problem is for V to assign to each
BUS bl a column index ml (where 0  ml < X ) satisfying the following condition:
for all 0  l < l0 < y, if ml = ml , then el < sl ; i.e., buses with the same column
index do not overlap.
Assume that no left end of a bus is in the same row as the right end of another
bus. This assumption is without loss of generality. For some l < l0, if el = sl , then
stretch S into a 2N  X LR-Mesh so that all end points are on even rows. If el = sl ,
then move el to the previous odd row. The e ect of stretching S can be incorporated
into the constant c1 in the size of V .
Even though S and V are di erent in size, there is clearly a correspondence between their rows and columns. For bus bl , we will reuse symbols sl and el to also
denote corresponding rows of V .
The algorithm for the general channel assignment problem has three main stages.
Stage 1 (Leader Determination): Identify the BUSES with the X smallest left ends. Call these BUSES the leaders. Without loss of generality, let the leaders be $b_0, b_1, \ldots, b_{X-1}$; that is, for all $0 \le j < X$ and $X \le k < y$, row numbers $s_j < s_k$. Assign index $m_j = j$ to leader $b_j$, for each $0 \le j < X$. If the number y of BUSES is at most X, then the problem is solved at this point. Therefore, assume that $y > X$.

Stage 2 (List Creation): For each $0 \le j < X$, construct a list $L_j$ of BUSES such that $L_j$ contains BUS $b_j$ and for any two BUSES $b_l, b_{l'}$, if $b_l$ precedes $b_{l'}$ in the list, then $e_l < s_{l'}$; that is, BUS $b_{l'}$ does not start before $b_l$ ends. Consequently, all BUSES in $L_j$ can be assigned the same column index as BUS $b_j$, namely j. Figure 7.21(b) illustrates the lists for the example in Figure 7.19.

Stage 3 (Broadcasting in List): For each list $L_j$, create a bus (in the simulating slice V) that traverses the left end of row $s_l$ for each bus $b_l$ in $L_j$ (the traversal is in the order of list $L_j$). Then broadcast the column index j to these left end processors.




Each of these stages runs in $O\left(\frac{\log N}{\log D - \log \Delta}\right)$ time on a $c_1 N \times c_2 X$ bends-cost LR-Mesh V using buses with at most D delay.
Recall that S is the simulated $N \times X$ unit-cost LR-Mesh and V is the simulating $c_1 N \times c_2 X$ bends-cost LR-Mesh. Recall also that the buses of S each have at most D delay, so V can simulate them directly. As before, we will treat the simulating LR-Mesh V as an $N \times X$ LR-Mesh for ease of explanation. We now detail the stages of the algorithm.
Figures 7.19 to 7.23 show a running example. For this example, X = 3 and the slice contains nine BUSES, $b_0, b_1, \ldots, b_8$ (see Figure 7.19).
7.5.4.1 Stage 1: Leader Determination

For each row i of V, set a flag start(i) to 1 iff there is a BUS $b_l$ with $s_l = i$. For each row i with start(i) = 1, configure each processor of that row with the partition $\{\overline{N,E}, \overline{S,W}\}$; i.e., connect the N and E ports together and the S and W ports together in the processor. Configure each processor in the remaining rows with the partition $\{\overline{N,S}, \overline{E,W}\}$ (see Figure 7.20). Observe that all buses in the configuration described above are row monotonic (column monotonic in the transposed $X \times N$ slice) and incremental. Suppose that the processor at row 0 and column 0 (top left processor) sends a signal through its N port. It is easy to see that for each row i with start(i) = 1, the processor at row i and column j receives the signal at its N port iff the BUS starting at row i has the jth smallest starting row index. Thus, this broadcast will not only identify leader BUSES $b_0, b_1, \ldots, b_{X-1}$, but also associate the column index j with each BUS $b_j$ (for $0 \le j < X$). By Theorem 7.10, LR-Mesh V can broadcast the signal described above in $O\left(\frac{\log N}{\log D - \log \Delta}\right)$ time.
For the running example, BUSES $b_0, b_1, b_2$ (starting at rows $s_0, s_1, s_2$) are selected as leaders (Figure 7.20).

[Figure 7.19: An example of the channel assignment problem]
[Figure 7.20: Configuration for Stage 1]
[Figure 7.21: Configuration and result of Stage 2; part (b) shows the list connecting starting rows of buses. We use these starting rows as identifiers for the buses. The pointers themselves are labeled with buses only for clarity.]
[Figure 7.22: Illustration of Stage 3; part (a) shows buses separated by class, part (b) shows buses separated by list.]
[Figure 7.23: The result of channel assignment]
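
The effect of the Stage 1 broadcast can be traced sequentially. The following is a minimal sketch, not the parallel bus operation itself: the function name `leader_columns` and the row numbers in the usage note are illustrative assumptions. Rows are taken to be numbered from 0 and the signal is followed one row at a time.

```python
def leader_columns(start_rows, num_rows, num_cols):
    """Trace the Stage 1 signal; return {starting row: column of N-port arrival}."""
    is_start = [False] * num_rows
    for r in start_rows:
        is_start[r] = True
    arrivals = {}
    col = 0  # the signal enters at the N port of processor (0, 0)
    for row in range(num_rows):
        if col >= num_cols:
            break  # the signal has left the slice through its right border
        if is_start[row]:
            arrivals[row] = col  # received at the N port of (row, col)
            col += 1  # N-E fused: exit east; next column has W-S fused: continue south
        # in every other row N-S is fused, so the signal passes straight down
    return arrivals
```

With starting rows 0, 1, 3, 4, 7, 8, 10, 12, 14 (an assumed numbering of the rows of the running example) and X = 3, only the three smallest starting rows receive the signal, in columns 0, 1 and 2.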
7.5.4.2 Stage 2: List Creation

First configure the simulating LR-Mesh V exactly as the simulated LR-Mesh S; thus each BUS of S is now a bus of V. Let the processors at the left and right ends of each BUS exchange information about each other. (Since the BUSES of S have at most D delay, V can perform this data exchange without excessive delay.) At this point, we may assume that each row of V is aware of all information about the BUS (if any) that starts or ends at that row.
The algorithm for Stage 2 has three broad steps. The purpose of Step 1 is as follows. Suppose BUS $b_l$ is to precede $b_{l'}$ in list $L_i$. Then this step establishes a bus from the left end of row $e_l$ to the left end of row $s_{l'}$. Figure 7.21(a) shows (in bold) the buses established for the list $L_0 = \langle s_0, s_3, s_5, s_7 \rangle$.

Step 1: Configure each row i as described below.
• If $i = s_l$ for some leader BUS $b_l$ (a leader starts at row i), then configure each processor of row i of V as $\{\overline{N,S}, \overline{E,W}\}$.
• If $i = s_l$ for some non-leader BUS $b_l$ (a non-leader starts at row i), then configure each processor of row i of V as $\{\overline{N,W}, \overline{S,E}\}$.
• If $i = e_l$ for some BUS $b_l$ (a bus ends at row i), then configure each processor of row i of V as $\{\overline{N,E}, \overline{S,W}\}$.
• If there is no BUS $b_l$ such that $i = s_l$ or $i = e_l$ (no bus starts or ends at row i), then configure each processor of row i of V as $\{\overline{N,S}, \overline{E,W}\}$.

Step 2: The leftmost processor in each non-leader starting row $s_{l'}$ writes $s_{l'}$ (the identifier for bus $b_{l'}$) to its W port and the leftmost processor in ending row $e_l$ reads from its W port.

Step 3: If $e_l$ reads $s_{l'}$ in Step 2, then BUS $b_l$ points to BUS $b_{l'}$ in its list. If $e_l$ does not receive anything on its W port, then BUS $b_l$ is the last element of its list. For our example, the end result of this stage is shown in Figure 7.21(b).

To see why this algorithm works, consider the sequence $\sigma'$ of rows $s_l$ and $e_l$ for all $0 \le l < y$ in ascending order. For our example $\sigma' = \langle s_0, s_1, e_0, s_2, s_3, e_3, e_2, s_4, s_5, e_1, s_6, e_5, s_7, e_6, s_8, e_4, e_7, e_8 \rangle$ (see Figure 7.21(a)). From this sequence remove the first X starting rows (of leaders) and the last X ending rows. Let the resulting sequence be $\sigma$. For the example, $\sigma = \langle e_0, s_3, e_3, e_2, s_4, s_5, e_1, s_6, e_5, s_7, e_6, s_8 \rangle$.
In general, let $\sigma = \langle \sigma_1, \sigma_2, \ldots, \sigma_{2z} \rangle$ where $\sigma_i$ is some $e_l$ or $s_l$. The sequence $\sigma$ has to have an even number of elements. This is because $\sigma'$ has matching $s_l$, $e_l$ pairs and $\sigma$ is derived from $\sigma'$ by removing X starting rows and X ending rows.

Lemma 7.14 For any $1 \le k \le 2z$, let $\sigma^{k} = \langle \sigma_1, \sigma_2, \ldots, \sigma_k \rangle$. Let $\sigma^{k}$ have $n_e$ ending rows and $n_s$ starting rows. Then, $n_e \ge n_s$.

Proof: Let
$$ {\sigma'}^{k} = \langle \underbrace{s_0, s_1, \ldots, s_{X-1}}_{X \text{ starts}}, \underbrace{\sigma_1, \sigma_2, \ldots, \sigma_k}_{n_s \text{ starts},\; n_e \text{ ends}} \rangle $$
Clearly, ${\sigma'}^{k}$ has $X + n_s$ starting rows and $n_e$ ending rows. If we examine the given slice S just after row $\sigma_k$, then $X + n_s$ Type A BUSES would have started, of which $n_e$ would have ended. Therefore $X + n_s - n_e$ BUSES cross row $\sigma_{k+1}$. Since S has X columns, $X + n_s - n_e \le X$; i.e., $n_s \le n_e$.
Lemma 7.14, together with the fact that sequence $\sigma$ has the same number of starting and ending rows, allows $\sigma$ to be viewed as a well-nested parentheses sequence (see Section 3.4) by simply replacing each ending row with an opening parenthesis and each starting row with a closing parenthesis. Thus each ending row $e_l$ has a matching starting row $s_{l'}$ in sequence $\sigma$. Since $s_{l'} > e_l$, $l' \ne l$.

Lemma 7.15 For each matching pair $(e_l, s_{l'})$ of sequence $\sigma$, Step 1 of Stage 2 establishes a bus between the W ports of the left end processors of rows $e_l$ and $s_{l'}$.

Proof outline: Let $p_l$ and $p_{l'}$ be the left end processors of rows $e_l$ and $s_{l'}$. The bus from the W port of $p_{l'}$ moves right by one column for each starting row (including itself) it traverses and left by one column for each ending row. Thus this bus can reach a W port on the left border only at processor $p_l$ in the matching row $e_l$.
Thus in Step 2 of Stage 2, the left end of $e_l$ reads a row number $s_{l'}$ iff $(e_l, s_{l'})$ is a matching pair. That is, in Step 3, bus $b_l$ points to bus $b_{l'}$ iff $(e_l, s_{l'})$ is a matching pair.
Lemma 7.16 The three step procedure of Stage 2 is correct.

Proof: By the same argument, each BUS $b_{l'}$ is pointed to from (at most) one BUS $b_l$. Thus the pointers of all BUSES constitute a set of lists. Since $e_l < s_{l'}$, it is clear that if $b_l$ points to $b_{l'}$, then both BUSES can occupy the same column (as required).
We now show that there is one list per leader. Since $b_l$ points to $b_{l'}$ in a list iff $(e_l, s_{l'})$ is a matching pair, and since $s_0, s_1, \ldots, s_{X-1}$ are absent from $\sigma$, no BUS can point to a leader. That is, each leader heads a list. We now show that no non-leader heads a list; i.e., each non-leader is an element of a list headed by a leader. For each non-leader BUS $b_{l'}$, its starting row $s_{l'}$ is in the sequence $\sigma$. Therefore, there is an $e_l$ in $\sigma$ such that $(e_l, s_{l'})$ is a matching pair and so $b_l$ points to $b_{l'}$.
As is also evident from Figure 7.21(a), the buses created in Stage 2 are row monotonic and incremental. By Theorem 7.10, Stage 2 runs in $O\left(\frac{\log N}{\log D - \log \Delta}\right)$ time.
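
The parenthesis-matching view of Stage 2 can be checked with a short sequential sketch. It is not the bus-based procedure: a stack stands in for the buses of Step 1, the function name `build_lists` is hypothetical, and the row numbers in the usage note are simply assumed to be the positions of the rows in the example sequence of starting and ending rows.

```python
def build_lists(starts, ends, X):
    """starts[l], ends[l]: starting and ending rows of bus l; X: number of columns.

    Returns {column index j: buses of list L_j, in list order}.
    """
    by_start = sorted(range(len(starts)), key=lambda l: starts[l])
    leaders = by_start[:X]  # the X smallest starting rows
    leader_set = set(leaders)
    events = sorted([(starts[l], "s", l) for l in range(len(starts))] +
                    [(ends[l], "e", l) for l in range(len(ends))])
    open_ends = []  # ending rows still waiting for a match ("open parentheses")
    successor = {}
    for _row, kind, l in events:
        if kind == "e":
            open_ends.append(l)
        elif l not in leader_set:
            # a non-leader starting row closes the most recent unmatched ending row
            successor[open_ends.pop()] = l
    lists = {}
    for j, leader in enumerate(leaders):
        chain = [leader]
        while chain[-1] in successor:
            chain.append(successor[chain[-1]])
        lists[j] = chain
    return lists
```

For the running example this yields the lists b0, b3, b5, b7 for column 0 (matching the list shown for $L_0$), b1, b6, b8 for column 1, and b2, b4 for column 2. Lemma 7.14 is what guarantees that the stack is never empty when a non-leader starting row is reached.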
7.5.4.3 Stage 3: Broadcasting in List

This stage first constructs a bus corresponding to each list $L_i$, where $0 \le i < X$; specifically, if list $L_i = \langle b_i, b_{i(1)}, b_{i(2)}, \ldots, b_{i(u)} \rangle$, then this stage constructs a bus that traverses the left end of rows $s_i, s_{i(1)}, s_{i(2)}, \ldots, s_{i(u)}$ in that order. In other words, each pointer in a list corresponds to a segment of the bus representing that list. This stage uses Theorem 7.10 on slice S to simulate the above buses in $O\left(\frac{\log N}{\log D - \log \Delta}\right)$ time and broadcasts the column index of the leader of the list to all other BUSES within the list. The only point requiring further elaboration is the construction of buses corresponding to lists $L_i$.
We start by dividing the simulating LR-Mesh V into $N/X$ "windows," each an $X \times X$ sub-LR-Mesh consisting of X contiguous rows of V. Call the topmost and bottommost rows of a window its borders. Recall that for each bus $b_l$, $e_l - s_l > X$. Therefore the lists are such that a BUS within a window points to a BUS outside the window. Let BUS $b_l$ point to BUS $b_{l'}$ in some list. All we need to do is construct a bus from the left end processor $p_l$ of row $s_l$ to the left end processor $p_{l'}$ of row $s_{l'}$ for all such BUS pairs $(b_l, b_{l'})$. The algorithm has two phases. In Phase 1, windows collect and record information about buses (pointers of lists) crossing them. In Phase 2, the windows use this information to independently route buses passing through them.
Phase 1: This phase has three broad steps.
• First configure V exactly as S to establish the input Type A BUSES.
• Broadcast on each bus $b_l$ of V the identifier $s_l$ and its pointer $s_{l'}$ (assuming $b_l$ points to $b_{l'}$). Each processor in a window border through which $b_l$ passes collects this information and records it. Bus $b_l$ may cross a window border several times. This could cause pointer $s_{l'}$ to be recorded multiple times in a border, whereas the algorithm requires the border to record each pointer only once. This can be done by sorting the (at most) X pointers crossing a window border and then selecting only the first occurrence of each pointer value. By Theorem 7.13, this part runs in $O\left(\frac{\log X}{\log D - \log \Delta}\right) = O\left(\frac{\log N}{\log D - \log \Delta}\right)$ time. Also BUS $b_l$ may go below row $e_l$ and then come up again to end at the right border of row $e_l$. Since we will establish a direct path between rows $s_l$ and $s_{l'}$, each border at row $i > s_{l'}$ (or row $i < s_l$) ignores the pointer (does not record it).
• Construct (exactly as in Step 1 of Stage 2) a bus from the left end processor of row $e_l$ to the left end processor of row $s_{l'}$. Broadcast on each bus the value of $s_{l'}$. As before, record the value of $s_{l'}$ once at each border crossing for a direct path.

[Figure 7.24: Illustration of Stage 2. The drawing shows a window W with its top, left and bottom borders, the sets In, End, Start and Out, and the bus classes I, II and III.]

At the end of Phase 1, each border has recorded information about each pointer that must cross it in a direct path between BUS $b_l$ and its successor $b_{l'}$ in the list. Each window also has information about all starting rows within that window.
Phase 2: Here we use the information recorded in Phase 1 to construct buses according to the list obtained in Stage 2.
Consider any window W that has a set In of incoming pointers (buses in V) recorded at its top border. Let Out be a set of outgoing pointers recorded at its bottom border. Let End be the set of ending rows within the window W. Let Start be the set of starting rows within the window W (see Figure 7.24).

[Figure 7.25: Examples of pointers in a window; parts (a) and (b)]
To illustrate the action of Phase 2, consider a window with In = $\{s_0, s_1, s_2\}$, Out = $\{s_0, s_2, s_3\}$, End = $\{s_1\}$, and Start = $\{s_3\}$ (see Figure 7.25(a)). That is, the pointers to buses $b_0, b_1, b_2$ enter the window from the top border. Of these, the pointer $s_1$ is to a row within the window, so this pointer exits at row $s_1$. Another row in the window starts a pointer $s_3$ (possibly continuing where $s_1$ ended) and this pointer exits the window along with $s_0$ and $s_2$ through the bottom border.
The task of Phase 2 is to create "corresponding buses" in accordance with these pointers. For our example, the window may create row monotonic buses as shown in Figure 7.25(b). This task is accomplished using the information gathered in Phase 1. Divide the pointers into three classes:
• Class I consists of pointers in set End = In $-$ Out. Their corresponding buses run between the top and left borders of the window.
• Class II consists of pointers in set Out $-$ Start. Their corresponding buses run between the top and bottom borders of the window.
• Class III consists of pointers in set Start. Their corresponding buses run between the left and bottom borders of the window.

164
It should be clear that each pointer entering or exiting the window falls in exactly
one class. Each window independently constructs the corresponding (row monotonic)
buses for its pointers.
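
The three pointer classes are plain set differences, which a few lines can make concrete. This is only an illustrative sketch: the function name `classify_pointers` and the use of strings as pointer identifiers are assumptions, not part of the algorithm's description.

```python
def classify_pointers(incoming, outgoing, starting):
    """Split a window's pointers into Classes I, II and III.

    incoming: pointers recorded at the top border (set In)
    outgoing: pointers recorded at the bottom border (set Out)
    starting: pointers that start inside the window (set Start)
    """
    class_1 = incoming - outgoing   # End = In - Out: top border to left border
    class_2 = outgoing - starting   # Out - Start: top border to bottom border
    class_3 = set(starting)         # Start: left border to bottom border
    return class_1, class_2, class_3
```

For the window with In = {s0, s1, s2}, Out = {s0, s2, s3} and Start = {s3}, the call returns ({s1}, {s0, s2}, {s3}); the first component equals End, as the definition of Class I requires.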
Each class of buses is routed on a different tier of processors so that all corresponding buses can be accommodated on three tiers (see discussion on page 144).
For the running example, consider the window shown in bold in Figure 7.22(a). Set In = $\{s_3, s_4, s_6\}$, set Out = $\{s_4, s_5, s_6\}$, set End = $\{s_3\}$ and set Start = $\{s_5\}$. The figure shows the corresponding buses, dashed, dotted and solid for Classes I, II and III, respectively.
Since each window constructs row monotonic buses corresponding to its pointers, the buses representing the lists in Stage 2 are row monotonic as well (see Figure 7.22(a)).


By Theorem 7.10, V completes Stage 3 in $O\left(\frac{\log N}{\log D - \log \Delta}\right)$ time.

Lemma 7.17 Let $\Delta$ be the delay of an N-processor segmentable bus. For any $D \ge \Delta$, the channel assignment problem on an $N \times X$ slice can be solved in $O\left(\frac{\log N}{\log D - \log \Delta}\right)$ time on a $\Theta(N) \times \Theta(X)$ bends-cost LR-Mesh using buses with at most D delay.

With Equation 7.1 and the above lemma, we have Theorem 7.7 and Corollary 7.8 stated at the start of this section.
In the next two subsections we show how the general channel assignment problem can be applied to efficiently computing the prefix sums of N bits.
7.5.4.4 Prefix Sums of Bits

Here we apply Theorem 7.7 to compute the prefix sums of N input bits. That is, for bits $a_0, a_1, \ldots, a_{N-1}$, we compute $b_0, b_1, \ldots, b_{N-1}$ where $b_i = \sum_{j=0}^{i} a_j$ for each $0 \le i < N$. The algorithm starts with an inefficient approach and progressively refines it.

Inefficient Prefix Sums: The R-Mesh counting algorithm also gives the prefix sums of the input bits. If the jth prefix sum ($0 \le j < N$) is $b_j$, then the signal reaches the E port of processor $(b_j, j)$. Since our transformation maintains only the end points of buses, the bends-cost R-Mesh algorithm will not directly yield the prefix sums. By reversing the recursion, the bends-cost R-Mesh can easily compute the prefix sums, however. Each processor within a slice holds the identity of the left and right ends of its bus. If the bus through processor $(b_j, j)$ has some processor p as its left end, then the jth prefix sum is $b_j$ iff p receives the signal. Reversing the steps of the counting algorithm (that goes from thin slices to wider slices) returns information from wider slices back to thin slices and ultimately to individual columns.

Lemma 7.18 For any $D \ge \Delta$, a $\Theta(N) \times \Theta(N)$ bends-cost LR-Mesh can find the prefix sums of N bits in $O\left(\frac{\log N}{\log D - \log \Delta}\right)$ time using buses of at most D delay.
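
The relation between the counting signal and the prefix sums can be sketched sequentially as follows. The function name `signal_exits` is hypothetical, and the loop only mimics where the signal would appear; it does not model buses or delay.

```python
def signal_exits(bits):
    """Return the processors (row, column) whose E port the counting signal reaches."""
    row = 0
    exits = []
    for j, a in enumerate(bits):
        row += a                # a 1 bit in column j moves the signal down one row
        exits.append((row, j))  # the row index equals the j-th prefix sum b_j
    return exits
```

For the bits 1, 0, 1, 1, 0 the signal leaves column j in rows 1, 1, 2, 3, 3, which are exactly the prefix sums $b_0, \ldots, b_4$.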
Modulo Prefix Sums: For any $m \ge 1$ the jth modulo m prefix sum of input bits $a_0, a_1, \ldots, a_{N-1}$ is $(a_0 + a_1 + \cdots + a_j) \pmod{m}$. The modulo m prefix sums can be computed on an $(m+1) \times 2N$ unit-cost R-Mesh in constant time [36]. This algorithm uses buses of Types A and B. While a Type A bus is incremental, Type B buses could extend between the first and last rows; that is, their left and right ends in an $m \times X$ slice could be more than X rows apart. From Theorem 7.7, we have the following result.

Lemma 7.19 Let $\Delta$ denote the delay of an N-element segmentable bus. For any $1 \le m \le N$ and $D \ge \Delta$, a $\Theta(m) \times \Theta(N)$ bends-cost LR-Mesh can compute the modulo m prefix sums of N bits in $O\left(\left(\frac{\log N}{\log D - \log \Delta}\right)^2\right)$ time using buses of at most D delay.
Olariu et al. [36] proved that using modulo m prefix summing, an $m \times N$ (unit-cost) LR-Mesh can compute the prefix sums of N bits in $O\left(\frac{\log N}{\log m}\right)$ time. With the result of Lemmas 7.18 and 7.19, we have the following result.

Theorem 7.20 Let $\Delta$ be the delay of an N-processor segmentable bus. For any $1 \le m \le N$ and $D \ge \Delta$, a $\Theta(m) \times \Theta(N)$ bends-cost LR-Mesh can compute the prefix sums of N bits in $O\left(\frac{\log N}{\log m} \left(\frac{\log N}{\log D - \log \Delta}\right)^2\right)$ steps using buses with delay of at most D.

Remark: Once again, if $D = N^{\epsilon_1}$ for constant $\epsilon_1 > 0$, then the time is $O\left(\frac{\log N}{\log m}\right)$. In addition, if $m = N^{\epsilon_2}$ for constant $\epsilon_2 > 0$, then the time is constant.
It is possible to modify the simulation of Lemma 7.19 to reduce the overhead to $O\left(\frac{\log N}{\log D - \log \Delta}\right)$ [16]. Hence, the time overhead of Theorem 7.20 can be reduced to $O\left(\frac{\log N}{\log m} \cdot \frac{\log N}{\log D - \log \Delta}\right)$. However, it should be stressed that this reduction in time comes from exploiting properties of the modulo prefix sums algorithm and does not translate to any improvement in the result of Theorem 7.7.
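
One plausible way to see why modulo m prefix sums lead to full prefix sums in about $\log N / \log m$ rounds is the base-m digit view sketched below. This is an assumption-laden sequential illustration, not necessarily the construction of [36]: the function name `prefix_sums_by_mod` is hypothetical, and the sketch requires $m \ge 2$. Each round computes modulo m prefix sums and passes the wrap-around positions, as a new bit sequence, to the next round.

```python
def prefix_sums_by_mod(bits, m):
    """Prefix sums of bits, built from repeated modulo-m prefix sums (m >= 2)."""
    if m < 2:
        raise ValueError("this sketch needs m >= 2")
    sums = [0] * len(bits)
    weight = 1            # m ** (round number)
    current = list(bits)
    while any(current):
        running = 0
        carries = []
        for j, a in enumerate(current):
            running += a
            wrapped = running == m   # the modulo-m count wraps at this position
            if wrapped:
                running = 0
            sums[j] += weight * running       # base-m digit of the j-th prefix sum
            carries.append(1 if wrapped else 0)
        current = carries  # prefix sums of the carries give floor(b_j / m)
        weight *= m
    return sums
```

Each round shrinks the number of 1 bits by a factor of at least m, so the loop runs roughly $\log N / \log m$ times for N input bits.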


7.5.5 Special Cases
In this section we show that the time overhead of Theorem 7.7 can be reduced for some special cases of a semimonotonic configuration. Let the simulated slice be an $N \times X$ unit-cost LR-Mesh. Recall that $x = D/\Delta$, which is the maximum number of bends that a bus of the simulating bends-cost LR-Mesh can have. We consider two special cases.

Oscillating Configurations: These configurations are semimonotonic configurations in which each Category 2 (Type A or Type B) bus has left and right ends s and e from two fixed x-element subsets $S_1, S_2$ of $\{0, 1, \ldots, N-1\}$ (see Figure 7.26). Thus every bus must either stay in the same subset $S_1$ or $S_2$, or oscillate between them. Although Figure 7.26 shows only buses starting from the leftmost corner of the LR-Mesh, the definition of an oscillating configuration admits the "mirror range" band of buses starting at the bottom left of the figure. In general, an oscillating configuration restricts end points of buses within slices to oscillate between two fixed ranges of x rows. These ranges need not be the topmost and bottommost x rows as shown in Figure 7.26. Since the number of buses is at most $x \le X$, the number of columns within the slice, the static channel assignment used in Section 7.5.3 can be used to assign each bus to a channel. This can be done in constant time and the entire simulation algorithm runs in $O\left(\frac{\log N}{\log D - \log \Delta}\right)$ time.

Theorem 7.21 Let $\Delta$ denote the delay of an N-element segmentable bus. For any $D \ge \Delta$, any oscillating configuration of an $N \times N$ unit-cost LR-Mesh can be simulated in $O\left(\frac{\log N}{\log D - \log \Delta}\right)$ time on a $\Theta(N) \times \Theta(N)$ bends-cost LR-Mesh whose buses have at most D delay.

[Figure 7.26: Example of an oscillating configuration]

Parallel Configurations: A parallel configuration of size y is a monotonic configuration in which each Category 2 (Type A or Type B) bus has left and right ends s and e that satisfy $|s - e| = y$, where $y \in \{0, 1, \ldots, N-1\}$ (see Figure 7.27).
Consider an $N \times N$ unit-cost LR-Mesh with a parallel configuration. Without loss of generality let $y/x$ and $N/x$ be integers. Divide this into $N \times x$ slices as before. Divide each slice into $x \times x$ windows. Number the windows of a slice $0, 1, \ldots, N/x - 1$. Let the modulo index of a window of (actual) index i be $i \pmod{y/x}$. For $0 \le j < y/x$, let $S_j$ be the set of all buses that cross a left border of a window with modulo index j. Figure 7.27 shows the buses with modulo index 0.
Notice that all buses of any fixed set $S_j$ can be assigned a channel (within slices) statically as in Section 7.5.3. Thus $S_j$ buses can be simulated with bounded delay D in $O\left(\frac{\log N}{\log D - \log \Delta}\right)$ time. To simulate all $y/x$ sets $S_j$, we need at most $y/x$ iterations. If $\frac{y}{x} > \frac{\log N}{\log D - \log \Delta}$, we can use the algorithm of Theorem 7.7.

Theorem 7.22 Let $\Delta$ denote the delay of an N-element segmentable bus. For any $D \ge \Delta$, any parallel configuration of size y of an $N \times N$ unit-cost LR-Mesh can be simulated in $O\left(\min\left\{\frac{\log N}{\log D - \log \Delta}, \frac{y}{x}\right\} \cdot \frac{\log N}{\log D - \log \Delta}\right)$ time on a $\Theta(N) \times \Theta(N)$ bends-cost LR-Mesh whose buses have at most D delay.

[Figure 7.27: Example of a parallel configuration]

7.6 Simulating General Con gurations
Here we present results for simulating the LR-Mesh (with not necessarily semimonotonic buses) and the general R-Mesh on bends-cost LR-Meshes. Recall the definitions of row and monotonic subsequence of a linear bus (see page 124).
Definition 7.2 Let b be an acyclic piece of a linear bus. The piece b is a row (or column) U-turn of the bus if and only if b has a row (or column) subsequence of the form $\langle i, j, i \rangle$.
In Figure 7.1 the portions of the dashed bus between rows 0 and 1, and between rows 1 and 2 are two column U-turns. Clearly a bus has no row (or column) U-turn iff it is row (or column) monotonic.
One way to quantify the amount by which a bus is "not semimonotonic" is by the number of U-turns (which cause it to lose its monotonicity). Let B be the set of buses in an LR-Mesh configuration. For any bus $b \in B$, let $\vartheta_r(b)$ (resp., $\vartheta_c(b)$) be the number of row (resp., column) U-turns in b. Let $\vartheta_r(B) = \max\{\vartheta_r(b) : b \in B\}$ and let $\vartheta_c(B) = \max\{\vartheta_c(b) : b \in B\}$. The number of U-turns in an LR-Mesh configuration with bus set B is $\vartheta(B) = \min\{\vartheta_r(B), \vartheta_c(B)\}$.
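
The U-turn count of a configuration can be illustrated with a small sketch. It rests on an assumption, since the definition of a row subsequence is given elsewhere (page 124): here the row (or column) subsequence of a bus is taken to be the sequence of row (or column) indices it visits, with runs of equal indices collapsed. The function names and the path representation are hypothetical.

```python
def count_u_turns(indices):
    """Count patterns <i, j, i> in a sequence of row (or column) indices."""
    # Collapse runs of equal indices (assumed reading of "subsequence").
    seq = [v for k, v in enumerate(indices) if k == 0 or v != indices[k - 1]]
    return sum(1 for k in range(2, len(seq)) if seq[k] == seq[k - 2])


def configuration_u_turns(buses):
    """buses: list of buses, each a list of (row, col) processors in bus order."""
    if not buses:
        return 0
    by_row = max(count_u_turns([r for r, _ in bus]) for bus in buses)
    by_col = max(count_u_turns([c for _, c in bus]) for bus in buses)
    return min(by_row, by_col)  # the smaller of the row and column maxima
```

A bus that snakes up and down while always moving right has row U-turns but no column U-turns, so a configuration made of such buses has a U-turn count of 0 under this measure.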
Theorem 7.23 Let $\Delta$ be the delay of an N-element segmentable bus. For integers $D, \vartheta$ such that $D \ge \vartheta\Delta$, a configuration of an $N \times N$ unit-cost LR-Mesh with $\vartheta$ U-turns can be simulated in $O\left(\frac{\log^2 N}{(\log D - \log \Delta)(\log D - \log \vartheta - \log \Delta)}\right)$ time on a $\Theta(N) \times \Theta(N)$ bends-cost LR-Mesh using buses with at most D delay.

Proof outline: This result follows on the same lines as the result of Theorem 7.7. The main difference is that $x = c\,\frac{D}{\vartheta\Delta}$ to guarantee that each slice uses buses with delay at most D. The number of levels of recursion is therefore $O\left(\frac{\log N}{\log D - \log \vartheta - \log \Delta}\right)$. Each level involves the solution to the channel assignment problem. This solution runs unaltered in $O\left(\frac{\log N}{\log D - \log \Delta}\right)$ time as it is based on end points of buses rather than their shapes.
The only other issue is that a slice could now have Type L and M buses (see Figure 7.28) in addition to Type A and B buses. The double bus scheme of Section 7.1.1 can be used to identify Type L and M buses and distinguish their end points. After that they can be handled exactly as Type A and B buses.
Remark: If $\frac{D}{\vartheta\Delta} = N^{\epsilon}$ for an arbitrarily small constant $\epsilon > 0$, then the simulation overhead is a constant.
Matsumae and Tokura [31] proved that an $N \times N$ HVR-Mesh can simulate any step of an R-Mesh in $O(\log^2 N)$ time. Since an HVR-Mesh uses only horizontal and vertical buses with no bends, we have the following result.
Theorem 7.24 Let $\Delta$ be the delay of an N-element segmentable bus. Any configuration of an $N \times N$ unit-cost R-Mesh can be simulated in $O(\log^2 N)$ time on an $N \times N$ bends-cost HVR-Mesh using buses with at most $\Delta$ delay.

[Figure 7.28: Bus types with U-turns. The drawing shows a slice with its top, right, bottom and left borders and buses of Types A, B, L and M.]

7.7 Concluding Remarks
We introduced the bends-cost measure of bus delay in linear reconfigurable meshes and showed this measure to be a faithful reflection of bus delay on an implementable platform. We also presented simulations for several classes of LR-Mesh configurations on the bends-cost model that uses limited delay buses. We showed that an important class of LR-Mesh algorithms can be implemented using limited delay buses. In particular, we showed that it is possible to design constant time algorithms on reconfigurable models without resorting to the unit-cost assumption.

Chapter 8
Computational Power of the
Bends-Cost LR-Mesh
Two models of computation $M_1$ and $M_2$ are said to have the same power if an arbitrary step of one can be simulated on the other in O(1) steps, allowing polynomial blowup in size for the simulating model. For R-Mesh type models, a model's size is the number of processors in it. In this chapter we prove that if the allowed delay for buses is polynomial in the number of processors, then the unit-cost LR-Mesh and the bends-cost LR-Mesh are equal in power. Specifically, we show that any step of an $N \times N$ unit-cost LR-Mesh can be simulated in constant time on an $N^{\Theta(1)} \times N^{\Theta(1)}$ bends-cost LR-Mesh whose buses have at most $N^{\epsilon}$ delay, for any constant $\epsilon > 0$.
Key to this result is a simulation of a step of an $N \times N$ unit-cost LR-Mesh on a $\Theta\left(\frac{DN^2}{\Delta}\right) \times \Theta\left(\frac{DN^2}{\Delta}\right)$ bends-cost LR-Mesh in $O\left(\frac{\log N}{\log D - \log \Delta}\right)$ time, using buses of at most D delay; $\Delta$ is the delay of an N-processor segmentable bus. Our approach is based on a well-known R-Mesh list ranking technique called distance embedding [19].
Generally speaking, the idea is as follows. Suppose that the buses of the simulating bends-cost LR-Mesh can have at most $\beta$ bends (to limit the delay). If a bus has $B > \beta$ bends, then cut the bus after every $\beta$ bends into $\lceil B/\beta \rceil$ segments. Then, proceeding along the lines of Chapter 7, replace each bus segment (that has at most $\beta$ bends) by another segment connecting the same end points but with a constant number c of bends. The new bus has at most $c\lceil B/\beta \rceil < B$ bends. Proceed recursively until the entire bus has at most $\beta$ bends, at which point it can be simulated directly. By reversing the recursion, the bus value can be propagated back to each port.
We use an R-Mesh technique for list ranking [47] to cut the bus correctly after a sequence of $\beta$ bends.
8.1 The Simulation Algorithm
By Lemma 7.3 (page 127), there is no loss of generality in assuming that the simulated LR-Mesh has no cyclic buses. By Lemma 7.1 (page 126), each bus is oriented. So, we may describe a bus as going from port u to port v.
Let $\delta = \frac{D}{2c\Delta}$. The simulation of an LR-Mesh S on a $4N^2\delta \times 4N^2\delta$ bends-cost LR-Mesh V has the following steps.
1. Divide V into a $4N^2 \times 4N^2$ grid of submeshes, each of size $\delta \times \delta$. Denote the submesh in row i and column j (where $0 \le i, j < 4N^2$) of this grid as $V_{i,j}$.
2. Number the $4N^2$ ports of the unit-cost LR-Mesh $0, 1, \ldots, 4N^2 - 1$. Each port i (where $0 \le i < 4N^2$) is represented by diagonal submesh $V_{i,i}$.
3. Let the diagonal processors of $V_{k,k}$ (where $0 \le k < 4N^2$) be $p_{k,u}$, where $0 \le u < \delta$. If an oriented bus goes from port i to port j in S, then connect processor $p_{i,u}$ to $p_{j,(u+1) \bmod \delta}$, for each $0 \le u < \delta$. If $i < j$, then use submeshes $V_{i,i}, V_{i+1,i}, \ldots, V_{j,i}, V_{j,i+1}, \ldots, V_{j,j}$ to establish the connection. Otherwise, use submeshes $V_{i,i}, V_{i,i+1}, \ldots, V_{i,j}, V_{i+1,j}, V_{i+2,j}, \ldots, V_{j,j}$. Figure 8.1 shows an example for connecting port u to port v (where $u < v$) and port v to port w (where $w < v$). At this point some processor $p_{i,u}$ is connected through a bus to another processor $p_{i',v}$ iff there are $(v - u) \pmod{\delta}$ ports on the bus between ports i and $i'$ of S. This is the standard technique for contracting a list on the R-Mesh.
4. Cut the bus traversing processor $p_{i,0}$, for each $0 \le i < 4N^2$.
5. Each processor $p_{i,0}$ writes its index in the direction of the bus oriented towards the next port of i in S. This write by $p_{i,0}$ traverses a bus of V with $2\delta$ bends (or D delay). This index reaches port $p_{i',0}$ iff on the bus of S, ports i and $i'$ are separated by exactly $\delta$ ports. Similarly, $p_{i',0}$ sends its index $i'$ to $p_{i,0}$.
6. Now V assumes port i to be connected directly to port $i'$. Accordingly, connect $V_{i,i}$ to $V_{i',i'}$ as in Step 3.
7. Repeatedly reduce the bus by a factor of $\delta$ in each iteration till the bus is of size $\delta$ or less.
Clearly, $\log_{\delta} 4N^2 = O\left(\frac{\log N}{\log D - \log \Delta}\right)$ iterations suffice.

[Figure 8.1: Some CST switch configurations. The drawing shows connections among port u, port v and port w.]
Theorem 8.1 Let $\Delta$ be the delay of an N-element segmentable bus. Any configuration of an $N \times N$ unit-cost R-Mesh can be simulated in $O\left(\frac{\log N}{\log D - \log \Delta}\right)$ time on an $O\left(\frac{DN^2}{\Delta}\right) \times O\left(\frac{DN^2}{\Delta}\right)$ bends-cost R-Mesh using buses with at most D delay.

Corollary 8.2 For any $\epsilon > 0$, any step of an $N \times N$ unit-cost LR-Mesh can be simulated in O(1) time on an $O\left(\frac{DN^2}{\Delta}\right) \times O\left(\frac{DN^2}{\Delta}\right)$ bends-cost LR-Mesh using buses with at most $N^{\epsilon}$ delay.

Theorem 8.3 The unit-cost and bends-cost LR-Meshes are equal in power if a polynomial delay is permitted.

8.2 Concluding Remarks
In this chapter we proved that LR-Meshes with sublinear (but polynomial) bus delay are as powerful as the unrestricted LR-Mesh with linear bus delay. This result places the role of bus delay in the context of the power hierarchy of reconfigurable models [5, 46, 48].

Chapter 9
The Enhanced-SRGA
In this chapter we introduce a reconfigurable architecture, the Enhanced Self Reconfigurable Gate Array (E-SRGA) Architecture, that is based on the SRGA architecture proposed by Sidhu et al. [40] (see also Section 1.2.1). The SRGA is an FPGA-type architecture with the additional ability to generate configuration information from within the chip (self reconfiguration), for instance, to connect two PEs. This eliminates the need to load the chip with configuration information through the limited number of input pins of the chip. Like the SRGA, the E-SRGA consists of an array of PEs. Each row and each column is connected by a CST. It also has the self reconfiguration feature of the SRGA. However, the E-SRGA possesses additional reconfiguration features known to be useful in the R-Mesh.
One addition in the E-SRGA architecture is the ability to operate the CSTs as segmentable buses; we use the implementation of Chapter 6. The E-SRGA also assigns each switch of a CST to a processing element (PE) in the row or column and enables the PE to control its switches directly. Specifically, each PE "owns" the row switch that succeeds it in the in-order traversal of the CST (see Figure 9.1). Consequently, each PE owns (at most) two switches, one each from its row and column CSTs. The architecture has been implemented in VHDL and we have conducted a cost-benefit tradeoff study of various dynamic reconfiguration features in the setting of an FPGA-like device. This study has shown our approach to be feasible. With algorithm design in mind, we have developed a programming model of the E-SRGA. This model abstracts away architectural details. This part of the research is work in progress.

[Figure 9.1: Associating CST switches with PEs. The CST nodes are numbered in inorder (nodes 1 to 15 in the drawing). Each switch has a dashed line to the PE associated with it.]
[Figure 9.2: Overview of the E-SRGA architecture. The drawing shows the controller sending low level commands and addresses to the PE array, which has data in and data out.]
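
The switch ownership rule can be made concrete for one CST. The sketch below assumes a complete binary tree whose leaves are the PEs of a row, with the number of PEs a power of two; the function names `inorder_labels` and `switch_owner` are hypothetical and the code only illustrates the in-order rule, not the VHDL implementation.

```python
def inorder_labels(num_leaves):
    """In-order labels (from 1) of the leaves (PEs) and internal nodes (switches)."""
    leaves, switches = [], []
    counter = 0

    def visit(size):
        nonlocal counter
        if size == 1:
            counter += 1
            leaves.append(counter)   # a PE
            return
        visit(size // 2)             # left subtree
        counter += 1
        switches.append(counter)     # the switch at the root of this subtree
        visit(size // 2)             # right subtree

    visit(num_leaves)
    return leaves, switches


def switch_owner(num_leaves):
    """Map each PE label to the label of the switch it owns (None for the last PE)."""
    leaves, switches = inorder_labels(num_leaves)
    switch_set = set(switches)
    # A PE owns the switch that immediately succeeds it in the in-order traversal.
    return {v: (v + 1 if v + 1 in switch_set else None) for v in leaves}
```

For a row of 8 PEs the leaves receive the odd labels 1 to 15 and the switches the even labels 2 to 14; the PE labelled 7 owns the root switch 8, and the last PE (label 15) owns no row switch.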
The next section gives an overview of the E-SRGA architecture and Section 9.2 provides details. Section 9.3 describes our VHDL implementation of the E-SRGA and presents simulation results. Sections 9.4 and 9.5 describe the programming model of the E-SRGA.

9.1 Architecture Overview
The Enhanced Self-Reconfigurable Gate Array architecture (E-SRGA) consists of an array of processing elements (PEs) and an external controller that is responsible for issuing low-level commands to the PE array (see Figure 9.2). Each row and column of the array is connected by a CST (see Section 2.1). That is, the basic interconnection structure is a binary tree whose leaves are PEs and whose internal nodes are switches. Figure 9.3 shows a $4 \times 4$ PE array. This array is suitable for VLSI implementation with the PEs laid out in a 2-dimensional mesh. The architecture also scales well as the collective area of the CSTs grows logarithmically with the array size.

[Figure 9.3: $4 \times 4$ PE array. The drawing marks the switches and the PEs.]
As described in Chapter 2, each switch of the CST has a full-duplex link to its
parent (if any) and two children. Each switch is owned by a PE that can con gure
it to connect to its parent and children in various ways. Figure 2.2 (page 20) shows
representative con gurations that are assumed in this work. Some of these con gurations are simple extensions of those used in the SRGA architecture to include
broadcasting.
Each PE consists of a logic cell and a memory block. The logic cell contains a 20bit look-up table (LUT) and 2 ip- ops, collectively capable of implementing many
2-input, 2-output Boolean functions. The memory block in a PE can hold data as well

178
as con guration contexts. Each con guration context contains bits that con gure the
logic cell and the two switches owned by the PE. This context can be changed by the
PE in one clock cycle (as in the SRGA). That is, the functionality of the PE and the
CST con guration can be changed in one clock cycle.
The controller is responsible for issuing commands to the PE array that specify
the operations to be performed in the next clock cycle. To solve a problem on the
E-SRGA a high level algorithm is rst designed, which is then translated to a low
level sequence of instructions understandable by the PE array. The controller is
responsible of issuing these low level commands to the PE array. We describe the
high level commands, and low level commands, and the correspondence between them
in Section 9.5.4. In a typical implementation of an algorithm on the E-SRGA, the
controller receives a high level command program, then PE array is loaded with an
initial context set and data, if any. Next the controller issues low level commands
causing the PE array to take the appropriate action. The important part to note is
that the only run-time interaction between the PE array and the controller is through
short commands requiring a few input pins (if the array is in a separate chip).

9.2 Architectural Details
In this section we describe the detailed architecture of each component of the PE
array.
9.2.1 Interconnection Network
The interconnection network of the E-SRGA is the CST, that is, a binary tree whose
leaves are PEs, whose internal nodes are switches, and whose edges are full duplex links
(in which information can flow in both directions simultaneously). This interconnection
network connects the PEs in a row (or column). The actual connections between
PEs could either be specified in a configuration context or result from a
configuration operation (described in Chapter 5). In Chapters 3-6, we showed this
interconnection fabric to be capable of implementing a variety of communication sets,
including those of a segmentable bus in at most 2 steps (clock cycles).

Figure 9.4: Structure of a CST switch. (a) The switch with its parent (P), left (L)
and right (R) input and output links; (b) the MUX-based implementation.
9.2.2 Switches
Each (three-sided) switch is an internal (non-leaf) node of the CST. It is connected
to its parent (if any) and two children through full duplex links (see Figure 9.4(a)).
Each switch is owned by a PE that can configure it to connect to its parent and
children in various ways (see Figure 2.2, page 20). Observe that a switch cannot
connect an incoming link to an outgoing link on the same "side" of the switch. This
ensures that, in a tree with N leaves (the N PEs of a row), every communication
traverses no more than 2 log N switches. Each switch has 3
inputs and 3 outputs (an input/output pair per side). Each output can be connected
to either of the 2 inputs on the other sides of the switch via multiplexers (MUXes). To
configure the switch, 3 bits are needed for the three MUXes (see Figure 9.4(b)).
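The MUX-based switch described above can be sketched in Python as follows. This is a minimal behavioral model, not the VHDL implementation: the class name `CstSwitch`, the side labels, and the assignment of select-bit values to inputs are assumptions made for illustration only.

```python
from dataclasses import dataclass

# For each output side, the two inputs its MUX can select, indexed by the
# select bit. This bit-to-input assignment is an assumption, not the
# encoding used in the dissertation.
MUX_CHOICES = {
    "P": ("L", "R"),  # parent output takes the left or the right input
    "L": ("P", "R"),  # left output takes the parent or the right input
    "R": ("P", "L"),  # right output takes the parent or the left input
}


@dataclass
class CstSwitch:
    sel_p: int = 0  # select bit of the MUX driving the parent output
    sel_l: int = 0  # select bit of the MUX driving the left output
    sel_r: int = 0  # select bit of the MUX driving the right output

    def route(self, inputs):
        """Map the three input values to the three output values."""
        sels = {"P": self.sel_p, "L": self.sel_l, "R": self.sel_r}
        return {side: inputs[MUX_CHOICES[side][sels[side]]] for side in "PLR"}


def demo():
    # Three configuration bits fully determine the switch state.
    switch = CstSwitch(sel_p=0, sel_l=1, sel_r=0)
    return switch.route({"P": "p", "L": "l", "R": "r"})
```

Because no MUX lists its own side among its choices, the model cannot connect an incoming link to the outgoing link on the same side, which is the property the text relies on.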
The E-SRGA has the ability to internally generate configuration information for
basic routing operations such as connecting two PEs. To generate this configuration
information, however, the granularity of the E-SRGA PEs is somewhat larger than
that of the logic blocks of typical FPGAs. The configuration is performed so that the
entire tree is configured in a single clock cycle. Chapter 5 discusses issues of
configuring the CST and the communication classes that can be accommodated on it.
Each switch (or the associated PE) has a logic module that enables the configuration
of the CST in accordance with certain communication classes. Each E-SRGA switch
contains logic modules to handle edge-exclusive communication sets (see Section 5.2)
and the communications of a segmentable bus (see Chapter 6). A switch also contains
logic to enable the associated PE to directly configure the switch (without resorting
to the techniques of Chapters 5 and 6).

Figure 9.5: Structure of a PE
9.2.3 Processing Elements
A block diagram of a PE is shown in Figure 9.5. Its main components are a logic
cell and a memory block. (The two switches owned by the PE are also shown in
the figure.) The memory block contains space for storing data and configuration
contexts. The logic cell can compute Boolean functions based on inputs from the row
tree, the column tree, and memory bits. Depending on the low level command issued by
the controller, the PE performs an operation on the local data or changes its
configuration using a configuration word stored in the memory block. A PE receives
control, data and address inputs. The control inputs are the low level commands from the
controller. They determine which operation will be performed by the PE in the next
clock cycle. The data inputs include data from the row and column trees and external data
(typically used only to load the initial data) that will be processed in the current cycle.
The PE in row i and column j of the array also receives address inputs (SCR_j, SRR_i
and CMAR) whose function is to enable/disable the PE or to address the memory.
More details appear in Section 9.2.6. The memory block stores configuration words
and also acts as scratch pad memory. It is essentially a regular RAM with the
ability to access individual bits of each memory word.

Figure 9.6: Logic cell structure
9.2.4 Logic Cells
The structure of a logic cell is shown in Figure 9.6. One of the most important
components of the logic cell is a 20-bit look-up table (LUT) that can implement a
Boolean function with 2 inputs, each 2 bits long. The LUT is actually two LUTs,
one of 16 × 1 bits and the other of 4 × 1 bits. This arrangement works well because,
in many useful functions, the least significant output depends only on the least
significant inputs. An accumulator (ACC) holds the logic cell output. The two data
inputs to the logic cell can be chosen from the row tree, the column tree or the
previous output of the logic cell (ACC). The output of the logic cell can be directed
to the row tree, the column tree or the accumulator.
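The split into a 16 × 1 and a 4 × 1 table can be illustrated with a small Python sketch. It assumes, as one reading of the text, that the 16-entry table produces the most significant output bit from all four input bits and that the 4-entry table produces the least significant output bit from the two least significant input bits. The function names and the table addressing order are hypothetical.

```python
def build_luts(func):
    """Fill the two tables for a function func(a, b) on 2-bit operands.

    Raises ValueError when the low output bit of func depends on the high
    input bits, since such a function does not fit the assumed split.
    """
    lut16 = [0] * 16   # high output bit, addressed by all four input bits
    lut4 = [None] * 4  # low output bit, addressed by the two low input bits
    for a in range(4):
        for b in range(4):
            out = func(a, b) & 0b11  # keep a 2-bit result
            lut16[(a << 2) | b] = out >> 1
            low_addr = ((a & 1) << 1) | (b & 1)
            low_bit = out & 1
            if lut4[low_addr] is None:
                lut4[low_addr] = low_bit
            elif lut4[low_addr] != low_bit:
                raise ValueError("low output bit depends on high input bits")
    return lut16, lut4


def eval_cell(lut16, lut4, a, b):
    """Look up the 2-bit result for 2-bit operands a and b."""
    high = lut16[(a << 2) | b]
    low = lut4[((a & 1) << 1) | (b & 1)]
    return (high << 1) | low


# Example: 2-bit addition modulo 4 fits in the 16 + 4 = 20 table bits.
ADD_LUTS = build_luts(lambda a, b: a + b)
```

Addition fits because the low bit of a sum is the exclusive-or of the low input bits. A function whose low output bit needs a high input bit is rejected by `build_luts`.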
9.2.5 Memory Block
The memory block has 8 words, each 46 bits wide, as shown in Figure 9.7 (in general
it could have n words of w bits each). Each word contains a configuration for the PE and
its two switches. The memory block could also be used as scratch pad memory; thus
access to a single bit in a word is also allowed. The memory is addressed by a 9-bit
address register, CMAR, the first three bits of which select a word of the memory,
and the remaining 6 bits select a bit within the word. The data for the selected bit
can come from the row switch, column switch, logic cell (ACC) or from outside the
chip (external data).
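The CMAR addressing scheme can be sketched as below. The sketch assumes that the "first three bits" are the most significant bits of the register; the helper names `decode_cmar` and `encode_cmar` are hypothetical.

```python
WORDS = 8        # number of memory words (contexts)
WORD_BITS = 46   # width of each word


def decode_cmar(cmar):
    """Split a 9-bit CMAR value into (context index, bit offset)."""
    if not 0 <= cmar < (1 << 9):
        raise ValueError("CMAR is a 9-bit register")
    context = cmar >> 6   # upper 3 bits select the word (assumed ordering)
    offset = cmar & 0x3F  # lower 6 bits select the bit within the word
    if offset >= WORD_BITS:
        raise ValueError("offset exceeds the 46-bit word")
    return context, offset


def encode_cmar(context, offset):
    """Build a CMAR value from a context index and a bit offset."""
    if not 0 <= context < WORDS or not 0 <= offset < WORD_BITS:
        raise ValueError("context or offset out of range")
    return (context << 6) | offset
```

Six offset bits can address 64 positions, so the 18 offsets from 46 to 63 are unused in a 46-bit word; the sketch rejects them.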
Figure 9.8 shows the detailed format of the configuration word. The selection lines
of the input and output MUXes and the MUXes of the logic cell are set using the part of
the word labeled M0, M1, ..., M7 in Figure 9.8. Bits 20-22 and 23-25 define the states
of the row and column switches of the PE. Bits 26-41 and 42-45 specify the contents
of the 16 × 1 and 4 × 1 LUTs.
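The field boundaries given in the text can be captured in a short packing and unpacking sketch. It assumes that bit 0 is the least significant bit of the 46-bit word and treats the MUX selection bits (bits 0-19) as a single field; the names `ConfigWord`, `pack` and `unpack` are hypothetical.

```python
from typing import NamedTuple


class ConfigWord(NamedTuple):
    mux_select: int  # bits 0-19: MUX selection lines (M0 to M7)
    row_switch: int  # bits 20-22: state of the row switch
    col_switch: int  # bits 23-25: state of the column switch
    lut16: int       # bits 26-41: contents of the 16 x 1 LUT
    lut4: int        # bits 42-45: contents of the 4 x 1 LUT


def _field(word, lo, hi):
    """Extract bits lo..hi (inclusive) of word, with bit 0 as the LSB."""
    width = hi - lo + 1
    return (word >> lo) & ((1 << width) - 1)


def unpack(word):
    """Split a 46-bit configuration word into its fields."""
    if not 0 <= word < (1 << 46):
        raise ValueError("configuration words are 46 bits wide")
    return ConfigWord(
        _field(word, 0, 19),
        _field(word, 20, 22),
        _field(word, 23, 25),
        _field(word, 26, 41),
        _field(word, 42, 45),
    )


def pack(cfg):
    """Assemble a 46-bit configuration word from its fields."""
    return (cfg.mux_select
            | cfg.row_switch << 20
            | cfg.col_switch << 23
            | cfg.lut16 << 26
            | cfg.lut4 << 42)
```

The field widths add up to 20 + 3 + 3 + 16 + 4 = 46 bits, matching the word width of the memory block.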
9.2.6 Registers
The E-SRGA contains several global and local registers. They are used to hold the
low level instructions (or the decoded instructions) issued by the controller. There is
one set of local registers per PE, whereas global registers are shared by all PEs (see
Figure 9.9).

1. Operation register (O-Register) and Assistant register (A-Register): These
registers are 4 and 3 bits long, respectively. Together they act as the op-code for
a low level command. The difference between them is that the O-Register is
decoded outside the processor array chip, while the A-Register is decoded
inside. Thus 2^4 + 3 = 19 bits of op-code enter the processor array chip. Table 9.1
shows the various low level commands and their corresponding op-codes.
2. Qualify register (Q-Register) and Don't-care register: Assume the
PE array to be of size X × Y. The Q-Register and the Don't-care register (collectively
called the Select or S-Registers) come in two sets, one for rows and the other for
columns. The row S-Registers are each log X bits long. Together they define a
subset of the X rows to select. One way to use these is as follows. The 2 log X
bits can be used to specify X^2 of the 2^X subsets of rows. These subsets can
be programmed into the decoder for the row S-Registers. In the same way, the
column S-Registers are together 2 log Y bits long. After decoding the contents of
these registers, SRR_i and SCR_j are used to store the decoded values (see Figure 9.10).
A PE is enabled if and only if its row and column are enabled. The N flags to select
rows and columns are called the select row register (SRR) and select column register
(SCR), respectively (see Figure 9.10).
3. The Context and Memory Address Register (CMAR) has been considered in
Sections 9.2.3 and 9.2.5.

Figure 9.7: Memory Architecture (an 8 × 46 memory addressed through a 3-bit context
field and a 6-bit offset field)

Figure 9.8: Details of a configuration word (fields M0 to M7 in bits 0-19, row switch
in bits 20-22, column switch in bits 23-25, LUT contents in bits 26-45)

Figure 9.9: Global registers (O-Register, 4 bits; A-Register, 3 bits; CMAR, with a
3-bit context field and a 6-bit offset field)
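The text leaves the row and column decoders programmable, so the following Python sketch shows only one hypothetical decoding rule: a line is selected when its index agrees with the qualify value on every bit position that the don't-care value does not mask. The function names and this matching rule are assumptions for illustration and are not taken from the dissertation.

```python
def select_lines(qualify, dont_care, count):
    """Return one flag per line (row or column).

    Line i is selected when i equals `qualify` on all bit positions where
    `dont_care` has a 0. This ternary-match rule is an assumed decoder.
    """
    return [((i ^ qualify) & ~dont_care) == 0 for i in range(count)]


def enabled_pes(srr, scr):
    """A PE (i, j) is enabled if and only if row i and column j are selected."""
    return {(i, j)
            for i, row_on in enumerate(srr) if row_on
            for j, col_on in enumerate(scr) if col_on}


def demo():
    # 4 x 4 array: select row 2 only, and every column.
    srr = select_lines(qualify=0b10, dont_care=0b00, count=4)
    scr = select_lines(qualify=0b00, dont_care=0b11, count=4)
    return enabled_pes(srr, scr)
```

Under this rule, the pair of log X-bit registers selects single rows, aligned blocks of rows, or strided groups such as all odd rows, which is one way the 2 log X bits could name a useful family of subsets.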

9.3 Implementation
The E-SRGA has been implemented in VHDL and synthesized using a 0.5 micron
library of standard cells from AMI. The Leonardo Spectrum synthesis tool was used
for the synthesis and optimization of the architecture. A C program was written
to automate the implementation of E-SRGAs of different sizes. Within the memory
restrictions on our server, we implemented and synthesized arrays of sizes 2 × 2, 4 × 4,
and 8 × 8. Based on our measurements, an array of size 1 × N appears to be a good
predictor of an N × N array in terms of speed. In one dimension, we could implement
arrays as large as 1 × 64. In our implementation we did not add the ability of a bus
to bend between columns and rows and vice versa (as in the bends-cost LR-Mesh of
Section 7.2), so that we could measure the clocking rate for a single row or column.

Table 9.1: Low level commands of the E-SRGA. X denotes a don't-care; * denotes
commands that also use addresses provided by the control unit; † marks commands
that are explained in Section 9.5.

O-Register  A-Register  Command Class          Function
0000        XXX         Continue               No change from current settings
0001        XX0 *       Memory access          write back configuration of row switch
0001        XX1 *       Memory access          write back configuration of column switch
0010        XX0 *       Memory access          write contents of ACC to memory
0010        XX1 *       Memory access          read memory to ACC
0011        X00         Communication sets     configure as edge exclusive sets
0011        X01         Communication sets     configure as segmentable bus
0100        000         Direct switch control  set switch to left to right
0100        001         Direct switch control  set switch to parent to left
0100        010         Direct switch control  set switch to right to parent
0100        011         Direct switch control  set switch to right to left
0100        100         Direct switch control  set switch to left to parent
0100        101         Direct switch control  set switch to parent to right
0101        X00         Set local FF           set to Zero
0101        X01         Set local FF           set to One
0101        X10         Set local FF           set to Zero flag
0101        X11         Set local FF           complement local FF
0110        XXX *       Switch context         Switch to another specified context
1000        X00 *       Set enable flag        Set enable flag type1 †
1000        X01 *       Set enable flag        Set enable flag type2 †
1000        X10         Set enable flag        Set enable flag on local FF
0111        X00         Set PE class           Set PEs as sources
0111        X01         Set PE class           Set PEs as segmenters
0111        X10         Set PE class           Set PEs as readers
1111        XXX *       Initial load           Load initial data

We implemented different versions of the architecture, each with a different set of
features, and compared the results of the simulation in terms of speed. We also varied
the size of the memory block of a PE to see its effect on the area. Our key findings
based on the simulation results are as follows.
1. The clock rate decreases logarithmically with the array size (see Table 9.2 and
Figure 9.11). In the figure, the horizontal axis is logarithmic in the array size.
Clearly, this is due to the logarithmic diameter of the tree. The curve labeled "all
features" represents the architecture with all the configuration features (segmentable
buses, edge-exclusive sets, and direct control of the switches). The curve labeled
"removing segmentable bus" represents the architecture with only the ability to
implement edge-exclusive sets and direct switch control by PEs. The curve
labeled "removing connect pairs" represents the architecture with only the ability
to implement segmentable buses and direct switch control. The curve labeled
"Architecture without configuration circuit" represents the architecture without
support for any of these features. The logic required to implement the feature
that enables PEs to set their switches directly has almost negligible cost.
2. The configuration hardware (needed to implement all configuration features)
at a switch reduces the system clock rate considerably. For example, for a 1 × 64
E-SRGA, the full-blown configuration hardware reduces the clock rate by almost
41%. The ability to implement only a segmentable bus reduces the clock rate by
about 30% (see Figure 9.11). The ability to implement only edge-exclusive sets
reduces the clock rate by about 10% (see Figure 9.11). Direct switch control
by the PEs reduces the clock rate by about 1%.
3. The configuration hardware increases the switch area by a factor of about 3.
This seems very costly in terms of the architecture area; however, the area of
the interconnection fabric including the switches (that have the configuration
hardware) is still a very small fraction (6%) of that of the entire architecture.
4. The context memory size is the dominant factor for the area of the PE (see
Figure 9.12 and Table 9.3). The figure shows that the area of the PE increases
almost linearly with the number of words in the memory block. This means
that if the number of words in the memory block is doubled, the area of the PE
will almost double.
5. The interconnect area is only about 6% of the whole architecture (we removed
all switches to establish this quantity). In an FPGA-like device, the interconnection
fabric (the routing channels) occupies almost 80-90% of the chip area.
This contrast points to the E-SRGA having a better functional density than
traditional FPGAs. This may be due to the different ways in which the E-SRGA
architecture and FPGAs solve problems. FPGA-like devices actually build a circuit
to solve the problem (hardware solution). On the other hand, the E-SRGA is
programmed to solve the problem; a sequence of instructions is issued by an
outside controller.

Figure 9.10: Interaction between controller and PE array

Figure 9.11: Effect of array size and different features on clock (clock in MHz
against the size of the array side, from 2 to 64)

Table 9.2: Effect of array size and different features on clock; the area is in number
of gates, and the clock is in MHz. Rows correspond to 1 × N arrays.

Array    Without           With              With              With
side N   Config. Hardware  Connect Pair      Seg. Bus          Both
         Area     Clock    Area     Clock    Area     Clock    Area     Clock
2        9829     92.5     10159    92.5     10287    92.5     10387    92.5
4        19658    92.5     20318    92.5     20574    92.5     20701    92.5
8        39316    92.5     40637    92.5     41647    79.9     41548    78.5
16       78631    85       81274    83.9     82259    72.1     82805    66.3
32       157262   77.8     162548   72.8     164590   58.9     166192   49.9
64       314525   71.7     325095   64.2     329181   49.7     332383   42.3
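The percentages quoted in the second finding can be checked against the clock rates that Table 9.2 reports for the largest array. The short Python sketch below recomputes them; the dictionary keys and the function name are hypothetical, while the numbers are the clock values (in MHz) from the last row of the table.

```python
# Clock rates in MHz for the largest array, from the last row of Table 9.2.
CLOCK_1X64 = {
    "none": 71.7,             # without configuration hardware
    "connect_pair": 64.2,     # edge-exclusive sets only
    "segmentable_bus": 49.7,  # segmentable bus only
    "both": 42.3,             # all configuration features
}


def reduction_percent(feature, clocks=CLOCK_1X64):
    """Percentage drop in clock rate relative to the bare architecture."""
    base = clocks["none"]
    return 100.0 * (base - clocks[feature]) / base


def all_reductions(clocks=CLOCK_1X64):
    """Reduction for every feature set other than the bare architecture."""
    return {name: reduction_percent(name, clocks)
            for name in clocks if name != "none"}
```

With these values the reductions come to roughly 41% for all features, 31% for the segmentable bus alone, and 10% for edge-exclusive sets alone, in line with the figures stated in the text.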

9.4 Modeling
Solving a problem on the E-SRGA would typically start with the design of a high level
algorithm for the controller. Then the algorithm is translated into a sequence of low
level instructions understandable by the E-SRGA architecture. Finally, the controller
issues the sequence of low level commands to the PE array. In this section, we abstract
away the architectural details of the E-SRGA and develop a programming model based on
the architecture. This model could facilitate the design of algorithms without the
need to know all architectural details. We specify the model of the E-SRGA in terms
of some model parameters, connectivity, PE structure and capabilities, and the
interconnection fabric as described below. Other modeling approaches have been proposed
before [7, 8, 9, 10, 11].

Figure 9.12: Effect of memory size on PE area for different optimization options (PE
area in number of gates against memory size in words, for the "preserved" and
"flatten" options)

Model Parameters: The computational model has the following parameters:

• X, Y: Dimensions of the PE array (X is the number of rows, Y is the number
  of columns).
• P: Processing element word-size.
• PE_area: Area of the PE.
• IF_area(N): Area of the interconnection fabric needed to connect N PEs in a
  row or column.
• C: Number of initially stored communication patterns (contexts).

Connectivity: Each PE is connected to the row/column interconnect through one
full duplex link (or two half duplex links), which can be used to connect pairs of
PEs (one-to-one communication) or broadcast data from one PE to several PEs in a
row/column. PEs communicate with each other through P-bit wide communication
links. The interconnect could be any network topology that is capable of implementing
a segmentable bus. However, in this work we adopt the CST implementation of
segmentable buses presented in Chapter 6.

Table 9.3: Effect of memory size on PE area

Number     All Preserved     All Flattened     Memory Flattened
of Words   Area     Clock    Area     Clock    Area     Clock
0          365      181.4    500      370      -        -
2          2076     89.4     2457     88.5     2524     92.5
3          3163     84.4     3089     81.6     3008     92.5
4          3715     83.6     3794     83.8     3458     92.5
6          4743     77.4     4691     81       4394     92.5
8          5827     73.7     5850     71       5273     92.5
12         7973     66.1     7942     67.7     7388     89.4
16         10498    77.3     9842     52       9328     89

PE Structure and Capabilities: The general structure of a PE is as follows.

• Each PE is connected to its row and column CSTs.
• Each PE has a constant size memory to hold the stored communication patterns
  (contexts). These contexts could be changed during execution and restored
  again.
• Each PE has one P-bit accumulator (ACC) to hold the current result of
  computations or an initialization value. The PE can write the ACC contents into
  the memory.
• In one unit of time, a PE can perform any binary operation on two P-bit
  operands to produce a P-bit result. The operands could come from any of the
  following: ACC, Row CST, or Column CST.
• Each PE has an Enable flip-flop. If set, the PE will participate in the current
  step.
• Each PE has a number of flags (such as a Zero flag) that reflect the status of
  the ACC.

Types of Communication: There are two types of communications.

1. PEs in a row/column can be connected in pairs (unicast or one-to-one
communication).
2. One PE can broadcast data to several other PEs in the same row/column.

Both one-to-one communications and broadcasting take 1 unit of time. One unit
of time is proportional to log N, where N is the number of leaves of the tree.
The CST connects pair(s) of PEs in a row (or a column) such that all connected
pairs in the row/column satisfy topological limitations. Such limitations on
communication on the CST are presented in Chapter 3.

9.5 Programming Model
The model is synchronous at the step level. At any time, all PEs are either performing
the same type of step (explained below) or idle. A step can be of three different types.
9.5.1 Com Step
A Com(municate/pute) step is a basic unit of computation or communication. This
step always takes 1 unit of time. A PE receives two operands (from ACC, row CST,
or column CST), performs a binary operation and stores the result in the ACC.
9.5.2 Sel Step
The Sel(ect) step selects (enables) a set of PEs for participation in the current step.
Each row (or column) of PEs has a specific address of length log X (or log Y) bits.
The programmer can define the PEs to be enabled by providing the row and column
addresses. Also, the programmer can enable a subset of rows or columns as explained
in Section 9.2.6. There are three types of Sel steps. All of them run in 1 unit of time.

Type 1: Select PEs based on Row/Column. The programmer defines a subset
of rows and a subset of columns to be active. PEs at the intersections of these rows
and columns are enabled.

Type 2: Select from already enabled PEs based on Row/Column. This
differs from a Type 1 Sel step only in that the enabled PEs are drawn from those
that were enabled in the previous step. This allows a stepwise refinement of the subset
of PEs enabled.

Type 3: Select PEs based on local flip-flop. The programmer selects the PEs
to be enabled based on the contents of the local flip-flops. The local flip-flops can be
set, reset, complemented or set to local data (based on the Zero latch, for example)
in a previous step.
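The three kinds of Sel step can be mimicked with a small Python model of the enable flags. This is an illustrative sketch of the programming model only; the class `PeArray` and its method names are hypothetical, and rows and columns are given directly as sets rather than through the select registers.

```python
class PeArray:
    """Tracks the Enable flip-flop and the local flip-flop of each PE."""

    def __init__(self, rows, cols):
        self.rows, self.cols = rows, cols
        self.enabled = [[False] * cols for _ in range(rows)]
        self.local_ff = [[False] * cols for _ in range(rows)]

    def sel_type1(self, row_set, col_set):
        """Enable exactly the PEs at the selected rows and columns."""
        for i in range(self.rows):
            for j in range(self.cols):
                self.enabled[i][j] = i in row_set and j in col_set

    def sel_type2(self, row_set, col_set):
        """Refine the selection: keep only already enabled PEs that match."""
        for i in range(self.rows):
            for j in range(self.cols):
                self.enabled[i][j] = (self.enabled[i][j]
                                      and i in row_set and j in col_set)

    def sel_type3(self):
        """Enable each PE according to its local flip-flop."""
        for i in range(self.rows):
            for j in range(self.cols):
                self.enabled[i][j] = self.local_ff[i][j]

    def enabled_set(self):
        return {(i, j) for i in range(self.rows)
                for j in range(self.cols) if self.enabled[i][j]}
```

For example, a Type 1 step on rows {0, 1} and all columns of a 4 × 4 array, followed by a Type 2 step on rows {1, 2} and columns {2, 3}, leaves only the PEs (1, 2) and (1, 3) enabled.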

9.5.3 Con step
The objective of the Con(figure) step is to make changes in the current CST settings
so that different connections between PEs are established. This step takes at most 4
units of time. There are three types of Con steps.

Type 1: Connect Pairs (one-to-one). At each selected row/column, this step
connects the same corresponding pair of PEs. The programmer selects a source and
a destination to be connected. This step allows an incremental change in the
communication pattern. At the end of this step, the enabled pair of PEs in a row/column
will be connected by a path from the source PE to the destination PE. By applying
this step k times, k source-destination pairs could be connected in each tree.

Type 2: Connect PEs as Segmentable Bus. The objective is to connect each
row/column as a segmentable bus. The PEs that are writers, segmenters, and readers
have to be defined first; then the configuration is done as described in Chapter 6.

Type 3: Configure Switch Directly. Each enabled PE sets its switches independently.

Any of the above three types of steps can either store the changes in the switch
settings back into the configuration memory or make the changes only in the
interconnect.

Table 9.4: Translation between high and low level commands

High Level Command               Equivalent Low Level Command(s)
Compute/Communicate              - Continue, or
                                 - Switch Context
                                 - Continue
Select PEs based on Row/Column   - Set Enable flags type1
Select Already enabled PEs       - Set Enable flags type2
Select PEs based on local data   - Set local FF
                                 - Set Enable flags
Connect pair                     - Select PEs as sources
(c is the original context)      - Select PEs as readers
                                 - Connect edge-exclusive sets
                                 - Write row/column configuration bits (3 clock cycles)
                                 - Switch context (to context c)
Connect PEs as segmentable bus   - Select PEs as sources
(c is the original context)      - Select PEs as segmenters
                                 - Select PEs as readers
                                 - Connect as segmentable bus
                                 - Switch context (to context c)
Configure switches directly      - Direct switch control (at most 3 clock cycles)
                                 - Write ACC to Memory (3 clock cycles)
                                 - Switch context (to context c)

9.5.4 Relation between High and Low Level Commands
The purpose of the model presented in Section 9.4 is to provide a high level of
abstraction for designing algorithms on the E-SRGA. However, for an algorithm to
actually be executed on the E-SRGA, the high level commands have to be translated into
low level commands that can be understood by the E-SRGA architecture. Table 9.4 shows
the translation between the commands.
Table 9.5 establishes the time needed to run the high level commands on the
E-SRGA.
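As an illustration of how a low level command word might be interpreted, the Python sketch below matches an O-Register and A-Register pair against the op-code patterns of Table 9.1, treating X as a don't-care bit. The table entries follow Table 9.1; the function names and the string representation of the registers are assumptions made for this sketch.

```python
# (O-Register, A-Register pattern, function); "X" marks a don't-care bit.
COMMANDS = [
    ("0000", "XXX", "continue"),
    ("0001", "XX0", "write back configuration of row switch"),
    ("0001", "XX1", "write back configuration of column switch"),
    ("0010", "XX0", "write contents of ACC to memory"),
    ("0010", "XX1", "read memory to ACC"),
    ("0011", "X00", "configure as edge-exclusive sets"),
    ("0011", "X01", "configure as segmentable bus"),
    ("0100", "000", "set switch left to right"),
    ("0100", "001", "set switch parent to left"),
    ("0100", "010", "set switch right to parent"),
    ("0100", "011", "set switch right to left"),
    ("0100", "100", "set switch left to parent"),
    ("0100", "101", "set switch parent to right"),
    ("0101", "X00", "set local FF to zero"),
    ("0101", "X01", "set local FF to one"),
    ("0101", "X10", "set local FF to Zero flag"),
    ("0101", "X11", "complement local FF"),
    ("0110", "XXX", "switch context"),
    ("1000", "X00", "set enable flag type1"),
    ("1000", "X01", "set enable flag type2"),
    ("1000", "X10", "set enable flag on local FF"),
    ("0111", "X00", "set PEs as sources"),
    ("0111", "X01", "set PEs as segmenters"),
    ("0111", "X10", "set PEs as readers"),
    ("1111", "XXX", "load initial data"),
]


def matches(pattern, bits):
    """True when bits agrees with pattern on every position that is not X."""
    return len(pattern) == len(bits) and all(
        p in ("X", b) for p, b in zip(pattern, bits))


def decode(o_reg, a_reg):
    """Return the function named by the op-code, or None if unassigned."""
    for o_bits, a_pattern, name in COMMANDS:
        if o_bits == o_reg and matches(a_pattern, a_reg):
            return name
    return None
```

Combinations that Table 9.1 does not list, such as O-Register 0100 with A-Register 111, decode to `None` in this sketch.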

Table 9.5: Estimated time for high level commands

High Level Command                               Estimated Time
Compute/communicate (any type of operation,      1 - 2 clock cycles
sending, or receiving data on already
established path)
Select PEs based on Row/Column                   1 clock cycle
Select Already enabled PEs based on Row/Column   1 clock cycle
Select PEs based on local data                   2 clock cycles
Connect pair                                     7 clock cycles
Connect PEs as segmentable bus                   8 clock cycles
Configure switches directly                      7 clock cycles

9.6 Concluding Remarks
In this chapter we presented the E-SRGA architecture, which has the ability to solve
problems algorithmically. The E-SRGA has a self-reconfiguration ability whereby the
configuration information for connecting pairs of PEs and connecting rows/columns
as segmentable buses can be generated from within the chip. Cost-benefit tradeoffs
for the different dynamic reconfiguration features were obtained. Also, a high
level abstraction (for designing algorithms) for the E-SRGA architecture has been
developed that abstracts away some of the architectural details.

Chapter 10
Conclusions
Although a very powerful computing paradigm, dynamic reconfiguration has proved
to be difficult to realize. This dissertation deals with different aspects of
implementing dynamic reconfiguration. Chapters 3-6 dealt with an important communication
structure called the CST. These chapters laid the foundation for developing primitive
communication mechanisms used in subsequent chapters. The segmentable bus
(Chapter 6) was used as a building block in an implementation of an LR-Mesh
(Chapter 7). The idea of edge-exclusive communications was used in the E-SRGA
architecture of Chapter 9. This work collectively addresses many facets of implementing
dynamic reconfiguration, ranging from hardware details and low level architectures
to modeling issues and high level algorithm design.

In Chapter 3 we analyzed the communication capability of the circuit switched tree
(CST). We identified a property of a communication set, called width partitionability,
that allows the communications to be scheduled efficiently on the CST. Then we
showed three classes of communication sets to possess this property. As a special
case of one of these results, we showed that the set of communications that can be
performed in one step on a segmentable bus [48] can be scheduled in two steps on the
CST.

In Chapter 4, we showed that any communication set that is not width partitionable
has to satisfy a minimum set of requirements. We presented two "simplest
sets" satisfying these minimum requirements and proved that these are the only two
possible. We then showed that a communication set of width w could require as
many as 5w/4 steps to schedule on the CST. We also proved that, in general,
non-oriented, well-nested and non-oriented, monotonic communication sets are not width
partitionable.

Chapter 5 presented a method to configure the full duplex CST to establish the
communication paths of a one step communication set in one step. We applied our
method to edge-exclusive communication sets. We showed that any one step
communication set can be decomposed into at most three edge-exclusive sets and hence can
be performed in at most three steps. Together with results of Chapters 3 and 4, this
establishes a comprehensive method to perform communications on the CST.

In Chapter 6 we presented two approaches for implementing segmentable buses.
The first is suitable for processors with large word-size using the CST. The second
approach uses a binary tree algorithm and is better suited for small word-size
processors.

Chapter 7 introduced the bends-cost measure of bus delay in linear reconfigurable
meshes and showed this measure to be a faithful reflection of the actual bus delay in
an implementation of the LR-Mesh called the bends-cost LR-Mesh. We also proved
that an important class of LR-Mesh algorithms can be implemented using limited
delay buses. In particular, we showed that it is possible to design constant time
algorithms on reconfigurable models without resorting to the unit-cost assumption.

In Chapter 8 we proved that if polynomial delays are admissible, then the unit-cost
LR-Mesh and the bends-cost LR-Mesh are equal in power. That is, for every T step
algorithm on a unit-cost LR-Mesh, there is an O(T) step algorithm on a bends-cost
LR-Mesh.

Chapter 9 presented the E-SRGA architecture. This architecture aims to exploit
the power of dynamic reconfiguration in an FPGA-like setting. We presented
cost-benefit tradeoffs for different dynamic reconfiguration features and developed an
algorithmic model for the architecture.

10.1 Future Directions
The work done in this dissertation has opened several directions for future
research. Here we organize these directions along the lines of the main topics of this
work, namely, (a) CST analysis, (b) CST configuration, (c) the bends-cost measure,
and (d) the E-SRGA.
CST Analysis

In Chapter 3 we have derived a lower bound on the number of steps for scheduling
a set of one-to-one communications on the CST. We showed that this lower bound
is tight for communication sets with disjoint incompatibles, oriented well-nested sets
and oriented monotonic communication sets. The natural question is \are there other
classes of communications for which this bound is tight as well?" In other words, are
there other classes that are width partitionable? Can the methods developed be used
for other communication structures (besides the CST)?
Chapter 4 characterizes the simplest communication sets that are not width
partitionable. How does this characterization relate to a characterization of larger sets
that are not width partitionable? Simply requiring a subset of a communication
set to not be width partitionable is not sufficient for the entire set to not be width
partitionable.
CST Configuration

In Chapter 5 we showed that the CST can accommodate any one-to-one communication
set of width 1. For some communication sets (such as edge-exclusive sets and
segmentable bus communications), the binary tree can be configured to establish the
paths of these communications in a single step. The following question arises: Are
there other classes of one-to-one communications of width 1 for which the tree can
be configured in a single step?
Also in Chapter 5 we presented an algorithm that decomposes any width-1
communication set into at most three edge-exclusive sets. This decomposition algorithm
requires compile time knowledge of the communications and so cannot be used for
run-time reconfiguration. Is it possible to perform this decomposition at run time?

Bends Cost Measure

In Chapter 7, we presented simulation algorithms for the unit-cost LR-Mesh on a
bends-cost LR-Mesh with semimonotonic con gurations. Are there other con gurations that could be simulated in the same manner?
Is it possible to extend the bends-cost measure to other recon gurable models,
for example the unrestricted R-Mesh? Cyclic buses are an important issue here as
they can cause a circuit with feedback or sequential circuits. Would these de nitions
change if a di erent technology, for instance, optical buses, were used?
All algorithms of Chapter 7 use a word-model, bends-cost LR-Mesh (as processors
need to handle indices). Can these algorithms run on a bit model (in which processors
cannot handle indices)? This requires the ability to transform the shapes of buses
without using the indices of the end points.
Our result on the relative powers of the unit-cost LR-Mesh and bends-cost LR-Mesh
hinges on the ability to tolerate polynomial delay. Can this condition be relaxed? Or
conversely, is it possible to show that the $O\left(\frac{\log N}{\log D - \log \Delta}\right)$ time overhead cannot be
avoided?
A deterministic method was used to cut a bus to size in Chapter 7. Can randomization help reduce the size of the simulating model? Randomization is possible only if one
can assume a mechanism to flag buses that are too long. All this may require a
change in the concept of power to include the cost of the bus delay.
E-SRGA

In Section 9.3 we presented some of our simulation results for the E-SRGA architecture. Based on these, one direction would be to optimize the architecture to
improve its clock rate and reduce its area. We showed that the configuration hardware contributes to lowering the clock rate. We also observed the need for different
hardware for each class of communication sets. Can we implement the configuration
hardware in a manner that reduces its effect on the clock rate? One possible approach
is to use configurable logic (possibly LUTs) to implement this hardware. Another approach is to use two different clocks in the architecture. The E-SRGA can operate
at one clock rate to configure the switches, while operating at a higher clock rate for
other operations [37].
Since the number of contexts is a dominant factor in the area of the E-SRGA, the
following questions arise. How many contexts are needed for a function/algorithm?
Can we reduce the width of each context so that the area is reduced? To load a new
context, do we really need to load the whole context, or only one part of it
(such as LUT contents)? Answers to these questions will lead to better use of the
chip area.
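As a rough way to frame these questions, the following sketch models configuration storage as (bits per context) x (number of contexts) x (number of cells), and compares loading a whole context with loading only its LUT contents. The layout fields, bit widths, and cell count are illustrative assumptions and are not taken from the E-SRGA design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextLayout:
    """Hypothetical per-cell context layout; all widths are assumed values."""
    lut_bits: int      # bits holding LUT contents
    switch_bits: int   # bits holding switch settings

    @property
    def width(self) -> int:
        # Width of one context for one cell
        return self.lut_bits + self.switch_bits


def storage_bits(layout: ContextLayout, contexts: int, cells: int) -> int:
    """Total configuration storage, assuming every cell stores every context."""
    return layout.width * contexts * cells


def load_bits(layout: ContextLayout, cells: int, lut_only: bool) -> int:
    """Bits transferred to load one context: all of it, or LUT contents only."""
    per_cell = layout.lut_bits if lut_only else layout.width
    return per_cell * cells


# Assumed example: 16 LUT bits and 24 switch bits per cell, 64 cells
layout = ContextLayout(lut_bits=16, switch_bits=24)
eight_contexts = storage_bits(layout, contexts=8, cells=64)
four_contexts = storage_bits(layout, contexts=4, cells=64)
```

Under this simple model, halving either the number of contexts or the context width halves the configuration storage, and a LUT-only load moves only the `lut_bits` fraction of a context. Whether real context memory scales this way is exactly what the questions above leave open.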
Another direction for the work on the E-SRGA is to implement a suite of primitive
functions on the architecture and compare them to known hardware solutions that target
FPGAs and ASICs. We expect that the implementation of these primitive functions
(individually or collectively) on the E-SRGA (which uses dynamic reconfiguration)
will have an advantage over hardware implementations using FPGAs and ASICs.

Bibliography
[1] J. A. Anderson, Discrete Mathematics with Combinatorics, Prentice Hall, New Jersey, 2001.
[2] Atmel Corp., "AT6000 Series Configuration," configuration guide, 1997.
[3] Y. Ben-Asher, D. Gordon and A. Schuster, "Efficient Self Simulation Algorithms for Reconfigurable Arrays," J. Parallel & Distributed Computing, vol. 30, 1995, pp. 1–22.
[4] Y. Ben-Asher, K.-J. Lange, D. Peleg and A. Schuster, "The Complexity of Reconfiguring Network Models," Information and Computation, vol. 121, 1995, pp. 41–58.
[5] Y. Ben-Asher, D. Peleg, R. Ramaswami and A. Schuster, "The Power of Reconfiguration," J. Parallel & Distributed Computing, vol. 13, 1991, pp. 139–153.
[6] A. A. Bertossi and A. Mei, "Optimal Segmented Scan and Simulation of Reconfigurable Architectures on Fixed Connection Networks," Proc. 7th IEEE/ACM Int. Conf. on High Performance Computing (HiPC), 2000, pp. 51–60.
[7] K. Bondalapati, P. Diniz, P. Duncan, J. Granacki, M. Hall, R. Jain, and H. Zeigler, "DEFACTO: A Design Environment for Adaptive Computing Technology," 6th Reconfigurable Architectures Workshop, Springer Verlag Lecture Notes in Computer Sc., vol. 1586, 1999, pp. 570–578.
[8] K. Bondalapati and V. K. Prasanna, "DRIVE: An Interpretive Simulation and Visualization Environment for Dynamically Reconfigurable Systems," 9th Int'l. Workshop Field-Programmable Logic and Applications, Springer Verlag Lecture Notes in Computer Sc., vol. 1673, 1999, pp. 31–40.
[9] K. Bondalapati and V. K. Prasanna, "Hardware Object Selection for Mapping Loops onto Reconfigurable Architectures," Proc. Int'l. Conf. Parallel and Distributed Processing Techniques and Applications, 1999.
[10] K. Bondalapati and V. K. Prasanna, "Loop Pipelining and Optimization for Run Time Reconfiguration," Springer Verlag Lecture Notes in Computer Sc., vol. 1800, 2000, pp. 906–915.
[11] E. Caspi, M. Chu, R. Huang, J. Yeh, J. Wawrzynek and A. DeHon, "Stream Computations Organized for Reconfigurable Execution (SCORE)," 10th International Workshop Field-Programmable Logic and Applications, Springer Verlag Lecture Notes in Computer Sc., vol. 1896, 2000, pp. 605–614.

[12] B. Beresford-Smith, O. Diessel, and H. ElGindy, "Optimal Algorithms for Constrained Reconfigurable Meshes," J. Parallel & Distributed Computing, 1996, pp. 74–78.
[13] K. Compton and S. Hauck, "Reconfigurable Computing: A Survey of Systems and Software," ACM Computing Surveys, vol. 34, no. 2, June 2002, pp. 171–210.
[14] A. DeHon, R. Huang and J. Wawrzynek, "Hardware-Assisted Fast Routing," Int. Symp. of Field-Programmable Custom Computing, Napa, CA, April 2002.
[15] H. P. Dharmasena, "Multiple-Bus Networks for Binary-Tree Algorithms," Ph.D. dissertation, Dept. of Electrical & Computer Eng., Louisiana State University, 2000.
[16] H. M. El-Boghdadi, R. Vaidyanathan, J. L. Trahan and S. Rai, "Implementing Prefix Sums and Multiple Addition Algorithms for the Reconfigurable Mesh on the Reconfigurable Tree Array," Proc. Int. Conf. on Parallel and Distributed Processing Techniques and Applications, vol. 3, 2002, pp. 1068–1074.
[17] J. A. Fernandez-Zepeda, R. Vaidyanathan, and J. L. Trahan, "Using Bus Linearization to Scale the Reconfigurable Mesh," J. of Parallel & Distributed Computing, vol. 62, 2002, 704, pp. 495–516.
[18] R. W. Hartenstein, M. Herz, T. Hoffman and U. Nageldinger, "On Reconfigurable Coprocessing Units," Reconfigurable Architectures Workshop, 1998, Springer Verlag Lecture Notes in Computer Sc., vol. 1388, pp. 67–72.
[19] T. Hayashi, K. Nakano and S. Olariu, "Efficient List Ranking on the Reconfigurable Mesh, with Applications," Theory of Computing Systems, vol. 31, no. 5, 1999, pp. 593–611.
[20] J. JaJa, An Introduction to Parallel Algorithms, Addison-Wesley Publishing Co., 1992.
[21] J. Jang, H. Park and V. K. Prasanna, "A Bit Model of Reconfigurable Mesh," Proc. 1st Reconfigurable Architectures Workshop, 1994.
[22] J. Jang and V. K. Prasanna, "An Optimal Sorting Algorithm on Reconfigurable Mesh," J. Parallel & Distributed Computing, vol. 25, no. 1, 1995, pp. 31–41.
[23] M. Kunde and K. Gurtzig, "Efficient Sorting and Routing on Reconfigurable Meshes Using Restricted Bus Length," Int. Parallel Processing Symp., 1997.
[24] F. T. Leighton, Introduction to Parallel Algorithms and Architectures: Arrays · Trees · Hypercubes, Morgan Kaufmann Publishers, San Mateo, CA, 1992.
[25] T. Leighton, "Tight Bounds on the Complexity of Parallel Sorting," IEEE Trans. Computers, vol. 34, 1985, pp. 344–354.
[26] C. E. Leiserson, "Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing," IEEE Trans. on Computers, vol. 34, 1985, pp. 892–901.
[27] H. Li and M. Maresca, "Polymorphic Torus Network," IEEE Trans. Computers, vol. 38, 1989, pp. 1345–1351.

[28] R. Lin, S. Olariu, J. L. Schwing, and B.-F. Wang, "The Mesh with Hybrid Buses: An Efficient VLSI Architecture for Digital Geometry," IEEE Trans. on Parallel & Distributed Systems, vol. 10, 1999, pp. 266–280.
[29] R. Lin and S. Olariu, "Reconfigurable Shift Switching Parallel Comparators," VLSI Design, vol. 9, 1999, pp. 83–90.
[30] M. Maresca, "Polymorphic Processor Arrays," IEEE Trans. Parallel & Distributed Systems, vol. 4, no. 5, 1993, pp. 490–506.
[31] S. Matsumae and N. Tokura, "Simulation Algorithms among Enhanced Mesh Models," IEICE Trans. Information & Systems, vol. E82-D, no. 10, Oct. 1999, pp. 1324–1337.
[32] R. Miller, V. Prasanna-Kumar, D. Reisis and Q. Stout, "Parallel Computing on Reconfigurable Meshes," IEEE Trans. Computers, vol. 42, no. 6, 1993, pp. 678–692.
[33] M. M. Murshed, "The Reconfigurable Mesh: Programming Model, Self-Simulation, Adaptability, Optimality and Applications," Ph.D. Dissertation, Australian National University, 1999.
[34] K. Nakano, "A Bibliography of Published Papers on Dynamically Reconfigurable Architectures," Parallel Processing Letters, vol. 5, 1995, pp. 111–124.
[35] K. Nakano and S. Olariu, "An Efficient Algorithm for Row Minima Computations on Basic Reconfigurable Meshes," IEEE Trans. Parallel & Distributed Systems, vol. 9, no. 6, 1998, pp. 561–569.
[36] S. Olariu, J. L. Schwing, and J. Zhang, "Fundamental Algorithms on Reconfigurable Meshes," Proc. Allerton Conf. on Communication, Control & Computing, 1991, pp. 811–820.
[37] "Reconfigurable Array Media Processor (RAMP)," Proc. IEEE Symp. FPGAs for Custom Computing Machines, 2000, pp. 287–288.
[38] S. M. Scalera, J. J. Murray and S. Lease, "A Mathematical Benefit Analysis of Context Switching Reconfigurable Computing," Reconfigurable Architectures Workshop, 1998, Springer Verlag Lecture Notes in Computer Sc., vol. 1388, pp. 73–78.
[39] D. B. Shu and J. G. Nash, "The Gated Interconnection Network for Dynamic Programming," in Concurrent Computations, S. K. Tewksbury et al., eds., Plenum Publishers, New York, 1988, pp. 645–658.
[40] R. Sidhu, A. Mei, and V. K. Prasanna, "Genetic Programming using Self-Reconfigurable FPGAs," Int. Workshop on Field Programmable Logic and Applications, Sept. 1999.
[41] R. Sidhu, A. Mei, and V. K. Prasanna, "String Matching on Multicontext FPGAs using Self-Reconfiguration," Int. Symp. on Field-Programmable Gate Arrays, Feb. 1999.

[42] R. Sidhu and V. K. Prasanna, "Efficient Metacomputation Using Self-Reconfiguration," Proc. Field Programmable Logic 2002, Springer Verlag Lecture Notes in Computer Sc., vol. 2438, 2002, pp. 698–709.
[43] R. Sidhu, S. Wadhwa, A. Mei, and V. K. Prasanna, "A Self-Reconfigurable Gate Array Architecture," Int. Conf. on Field Programmable Logic and Applications, 2000, Springer Verlag Lecture Notes in Computer Sc., vol. 1896, pp. 106–120.
[44] M. Slater, Microprocessor Based Design: A Comprehensive Guide to Hardware Design, Prentice Hall Inc., 1989.
[45] L. Snyder, "Introduction to the Configurable Highly Parallel Computer," IEEE Computer, vol. 15, 1982, pp. 47–56.
[46] "Tighter and Broader Complexity Results for Reconfigurable Models," Parallel Processing Letters, special issue on Bus-based Architectures, vol. 8, no. 3, 1998, pp. 271–282.
[47] "Constant Time Graph Algorithms on the Reconfigurable Multiple Bus Machine," J. Parallel & Distributed Computing, vol. 46, 1997, pp. 1–14.
[48] J. L. Trahan, R. Vaidyanathan and R. K. Thiruchelvan, "On the Power of Segmenting and Fusing Buses," J. Parallel and Distributed Computing, vol. 34, 1996, pp. 82–94.
[49] R. Vaidyanathan, C. R. P. Hartmann and P. K. Varshney, "Running ASCEND, DESCEND and PIPELINE Algorithms in Parallel Using Small Processors," Information Processing Letters, vol. 46, no. 1, April 1993, pp. 31–36.
[50] R. Vaidyanathan and A. Padmanabhan, "Bus-Based Networks for Fan-in and Uniform Hypercube Algorithms," Parallel Computing, vol. 21, 1995, pp. 1807–1821.
[51] J. F. Wakerly, Digital Design: Principles & Practices, Prentice Hall, New Jersey, 2001.
[52] M. J. Wirthlin and B. L. Hutchings, "DISC: The Dynamic Instruction Set Computer," Proc. Field Programmable Gate Arrays (FPGAs) for Fast Board Development and Reconfigurable Computing, J. Schewel, ed., Proc. SPIE, vol. 2607, 1995, pp. 92–103.

Vita
Hatem Mahmoud El-Boghdadi is a native of Egypt. He received his bachelor of science
in electrical engineering (Computers and Control) in 1991, with a grade of Distinction
with Honor, and his master of science in electrical engineering in 1994, both from
Assiut University, Egypt. Since 1992 he has been with the Electrical Engineering
Department, Assiut University, first as a demonstrator and, from 1994, as an assistant
lecturer. In 1998, he joined the Faculty of Computers and Informatics, Cairo University,
Egypt, as an assistant lecturer. In the Fall of 1999, he joined the graduate program
in the Department of Electrical and Computer Engineering at Louisiana State University, United States of America. He is expected to receive the degree of Doctor of
Philosophy in electrical and computer engineering in May 2003.


