Mapping of recursive algorithms onto multi-rate arrays by Kiaei, Sayfe
AN ABSTRACT OF THE DISSERTATION OF
Yue-Peng Zheng for the degree of Doctor of Philosophy in Electrical and Computer
Engineering presented on May 27, 1994.
Title: Mapping of Recursive Algorithms onto Multi-Rate Arrays
Abstract approved:
In this dissertation, multi-rate array (MRA) architecture and its synthesis are pro-
posed and developed. Using multi-coordinate systems (MCS), a unified theory for map-
ping algorithms from their original algorithmic specifications onto multi-rate arrays is
developed.
A multi-rate array is a grid of processors in which each interconnection may have its
own clock rate; operations with different complexities run at their own clock rate, thus
increasing the throughput and efficiency.
A class of algorithms named directional affine recurrence equations (DARE) is
defined. The dependence space of a DARE can be decomposed into uniform and non-uni-
form subspaces. When projected along the non-uniform subspace, the resultant array
structure is regular. Limitations and restrictions of this approach are investigated and a
procedure for mapping DARE onto MRA is developed.
To generalize this approach, synthesis theory is developed with initial specification
as affine direct input output (ADIO) which aims at removing redundancies from algo-
rithms. Most ADIO specifications are the original algorithmic specifications. A multi-
coordinate systems (MCS) is used to present an algorithm's dependence structures. In a
Redacted for PrivacyMCS system, the index spaces of the variables in an algorithm are defined relative to their 
own coordinate systems. Most traditionally considered irregular algorithms present regu­
lar dependence structures under MCS technique. Procedures are provided for transforming 
algorithms from original algorithmic specifications to their regular specifications. 
Multi-rate schedules and multi-rate timing functions are studied. The solution for 
multi-rate timing functions can be formulated as linear programming problems. Proce­
dures are provided for mapping ADIOs onto multi-rate VLSI systems. Examples are pro­
vided to illustrate the synthesis of MRAs from DAREs and ADIOs. 
The first major contribution of this dissertation is the development of the concrete, 
executable MRA architectures. The second is the introduction of MCS system and its 
application in the development of the theory for synthesizing MRAs from original algo­
rithmic specifications. C Copyright by Yuepeng Zheng
 
May 27, 1994
 
All Rights Reserved Mapping of Recursive Algorithms onto Multi-

Rate Arrays
 
By 
Yue-Peng Zheng 
A Dissertation
 
Submitted to
 
Oregon State University
 
In partial fulfillment of
 
the requirements for the
 
degree of
 
Doctor of Philosophy
 
in 
Electrical and Computer Engineering
 
Completed May 27, 1994
 
Commencement June 1995
 APPROVED:
essor of Electrical and Computer Engineering in c arge of major
Dean of Graduate Sc
Date thesis is presented May 27, 1994
Typed by Yue-Peng Zheng for Yue-Peng Zheng
Redacted for Privacy
Redacted for Privacy
Redacted for PrivacyCONTENTS 
Chapter 1 Introduction  1
 
1.1 An Overview of The Dissertation  4
 
1.2 Literature Review  7
 
Chapter 2 Multi-Rate Array and Directional Affine Recurrence Equations  11
 
2.1 Notations And Definitions  11
 
2.2 Motivation  13
 
2.3 Multi-Rate Arrays And Directional Affine Recurrence Equations  16
 
2.3.1  Properties of DARE  18
 
2.3.2  Mapping of DAREs onto MRA  26
 
2.4 Properties And Synthesis of Higher Dimensional DAREs  33
 
2.5 Limitations of The Multi-Rate Transformation  40
 
Chapter 3 Multi-Coordinate-Systems and Dependence Localization  48
 
3.1 Specification of Algorithms  50
 
3.2 Multi-Coordinate Systems  54
 
3.3 ADIO Dependence Structure  57
 
3.3.1  Constructing coordinate systems  65
 
3.4 Convertibility  78
 
3.4.1  Motivations  78
 
3.4.2  Sufficient condition  80
 
3.4.3  Sufficient and necessary condition  83
 
3.5 Computing A Partitioning  97
 
3.6 Converting ADIOs To Regular Specifications  102
 U 
3.6.1  The iteration space  104
 
3.6.2  Expanding an m-array to an n-array  106
 
3.6.3  Dependence localization for other ADIOs  122
 
3.7 Conclusions  127
 
Chapter 4 The Synthesis Of Multi-Rate VLSI Systems  129
 
4.1 Multi-Rate Affine Schedules  129
 
4.2 A Sufficient Condition For The Existence Of Multi-Rate Affine Schedules  134
 
4.3 The Synthesis Procedures  144
 
4.3.1  The synthesis of ADIOs without the phase component  144
 
4.3.2  The synthesis of ADIOs with phase components  155
 
4.4 Conclusions  171
 
Chapter 5 Conclusions  173
 
Chapter 6 Bibliography  180
 List of Figures 
Figure 1.1: A Single Rate Array where the all operations of the array are controlled by a 
single clock signal  2 
Figure 1.2: A possible supercomputer architecture consists of various application specific 
subsystems, each capable of delivering hundreds of Gflops  3 
Figure 1.3: A Multi-Rate Array architecture where different operations are assigned with 
different amounts of time by the use of different clock signals  5 
Figure 2.1: The DAG of the 10th order decimation filter with M = 3  15 
Figure 2.2: A possible one-dimensional MRA architecture with V variables.  17 
Figure 2.3: The DAG for example[2.2]. The dependencies are uniform only along the 
direction perpendicular to p  23 
Figure 2.4: The family of parallel hyperplanes with uniform dependencies and the 
symmetrical hyperplane Hs for Example[2.3]  25 
Figure 2.5: The DAG of the decimation filter with N = 10 and M= 3 as defined in 
Eq.(2.25)  28 
Figure 2.6: An MRA implementation of the decimation filter. Two clock signals are 
employed  30 
Figure 2.7: A VHDL simulation result of the MRA decimation filter shown in Figure 
2.6  31 
Figure 2.8: An MRA architecture for the 2-D decimation filter (a) and the internal block 
diagrams of the processing elements  39 
Figure 2.9: The block diagram of a three-band critically sampled subband coding system 
with decimation and interpolation filters  42 
Figure 2.10: A single rate array for the 10th order interpolation filter.  43 
Figure 2.11: Decimation filter with a rational decimation factor M/L.  45 
Figure 3.1: Synthesis procedure for mapping ADIOs onto multi-rate VLSI systems  50 
Figure 3.2: A vector is defined relative to different basis in R2.  56 
Figure 3.3: Dependence relationships among input and output variables and W(p)  59 61 
iv 
Figure 3.4: Direct dependence relationships between the output variable y the input 
variables x, h of the decimation filter 
Figure 3.5: Decomposing the dependencies down to the dependencies between the output 
y and W(p); and between W(p) and variables x and h  63 
Figure 3.6: Dependence Acyclic Graph under a single coordinate system.  70 
Figure 3.7: Directed Acyclic Graph under an MCS system.  71 
Figure 3.8: The dependencies of U0 (p)  U1 (D 1p) under a single coordinate system. 
The U1 array is embedded at the subspace z3 = 0  72 
Figure 3.9: The dependencies of U0 (p)  U1 (Dip) under a MCS system. The axes of 
CI are labeled as (z1, z2, z3) and of C?' for U1 as (4, z1)  74 
Figure 3.10: The DAG showing the dependencies at different layers.  75 
Figure 3.11: A possible 3-coordinate systems defined in R3  76 
Figure 3.12: The localized DAG for the IIR filter algorithm.  83 
Figure 3.13: When condition (3.26) is not satisfied, one coordinate system for a variable 
is not sufficient to present an ADIO with regular dependencies  85 
Figure 3.14: Different partitioning of the computational domain and the effect of using 
different coordinate systems for different regions  86 
Figure 3.15: Two coordinate systems are used to define the index regions 01 and a2 of 
the index space for each variable  87 
Figure 3.16: The DAGs for Example[3.9] under different MCS systems. (a) The DAG is 
non-uniform in which the variable U is indexed under C21 only at z3 = 0; (b) C21 
and C22 are employed for U (C  i .?_j and C2 for i <j) at z3 = 0; (c) DAG at 
z3 = 0 and z3 = 1  96 
Figure 3.17: The index points for variable W at. The two dimensional array U is 
embedded into this plane at according to its own coordinate system. The DAG 
has non-uniform data dependencies  98 
Figure 3.18: One possible partitioning of the computational domain.  100 
Figure 3.19: The coordinate systems for variable U and their index regions as expressed 
in them  101 V 
Figure 3.20: The dependencies at z3 = 0 and z3 = 1. All dependencies at z3 = 0 are
 
uniform (except those inside a node), while the dependencies at z3 = 1 are non­
uniform on the boundary of the regions but are uniform inside  102
 
Figure 3.21: Illustration of the vicinity domain SI; of fli, 1  c SZ.  109
 
Figure 3.22: The vicinity domain of 01 is defined as SIE1 = {0 5_ z3 5_ 2} . When
 
p E SIE1, then U0 (p) E- I/1 (D ip)  should be used  110
 
Figure 3.23: Several different possible coordinate systems for the indexing of the variable
 
array x  112
 
Figure 3.24: The MCS system for in R2 consists of Cs
2  and three other coordinate
 
systems C;, Cx, Ch  113
 
Figure 3.25: The dependence structure under an MCS system for in R2 consists of CS
 
and three other coordinate systems Cy, Cx, Ch  114
 
Figure 3.26: The DAG under MCS system for the decimation filter algorithm studied in
 
Example[3.12]. The dependencies of h and y variables have been localized by
 
index space expansion  117
 
Figure 3.27: Another single rate VLSI array architecture for the decimation filter derived
 
from the DAG in Figure 3.26b. Even though it has constant interconnections, the
 
length increases with the decimation factor M  118
 
Figure 3.28: Another MCS system for in R2 consists of CS and three other coordinate
 
systems Cy, Cx, Ch with different origins  119
 
Figure 3.29: The dependence structure another MCS system for in R2 consists of Cs
2 and
 
Cy, C
1, Ch with different origins  120
 
Figure 3.30: Another localized DAG for Eq.(3.39). From this DAG, a more efficient MRA
 
architecture can be derived  121
 
Figure 3.31: Another MRA architecture for the decimation filter. This array has less delay
 
elements in each PE, and its total computation time is less that the one given in
 
Figure 2.6  122
 
Figure 3.32: The subspaces for embedding the input and output arrays  124
 
Figure 3.33: The array for the matrix Lyapunov equation.  127 vi 
Figure 4.1: The dependencies of U0 (p)  U1 (Dip) . The time values assigned to the 
nodes pq of U1 must be less than the time values assigned those nodes p of U0 
which depend on them  138 
Figure 4.2: The convex hull defining the parameters of the schedules  143 
Figure 4.3: The MCS system for the 2-dimensional decimation filter. The 4-dimensional 
computational domain is decomposed down into two 2-dimensional subspaces. 
148 
Figure 4.4: The null spaces for propagating the elements of x variable array to localize the 
dependencies  150 
Figure 4.5: The projected DAG under an MCS system. (a) shows the dependence 
structure at z3 = 0 and one layer; and (b) shows the detailed dependencies at 
z4 = 0 and z4 = 1  152 
Figure 4.6: A three dimensional MRA implementation for the ADIO Eq.(4.13) with 
window size 4 x 4 and the input sample array x(i, j) with size N x N. The MRA 
array size is then 4 x 4 x (N+ 1) . (a) shows part of the 3-D array; (b) shows a 2­
dimensional array in more detail at each layer along z1  153 
Figure 4.7: An MRA architecture for the 2-dimensional decimation filter with decimation 
factors M1 = M2 = 2 (a); Diagrams (b), (c) and (d) illustrate the internal block 
diagram of the processing elements  155 
Figure 4.8: The three-band critically sampled subband coding system.  157 
Figure 4.9: The analysis filter bank of the three-band subband coding system.  158 
Figure 4.10: The DAG for the interpolation filter. The dark nodes and vectors represent 
DAG  = 0 , the gray ones for DAG =1 and  161 nd DAGc - 2 
Figure 4.11: A modified DAG (partial) of the DAG in Figure 4.10  162 
Figure 4.12: The convex hull defining where the parameters of the schedules can be 
selected from  163 
Figure 4.13: An MRA implementation of the interpolation filter with L = 3  164 
Figure 4.14: A two-level pipeline MRA implementation for the interpolation filter for 
applications require higher filtering speed  165 vii 
Figure 4.15: Another localized DAG which may yield a more efficient MRA
 
implementation of the interpolation filter  166
 
Figure 4.16:A corresponding MRA implementation for the interpolation filter from the
 
DAG shown in Figure 4.15  167
 
Figure 4.17:An MRA implementation for the interpolation filter  168
 
Figure 4.18:The three-band critically sampled subband coding system  168
 
Figure 4.19:A block diagram for a decimation filter with fractional decimation factor
 
M/L  169
 
2
 Figure 4.20:An MRA array of the decimation filter with M =  171
 MAPPING OF RECURSIVE ALGORITHMS ONTO MULTI-RATE
 
ARRAYS
 
Chapter 1 Introduction 
The first electronic computer ENIAC had a peak performance of about 103 floating 
-point operations per second (flops) in 1946. Today, there are supercomputers capable of 
delivering hundreds of Giga-flops peak performance. Yet the demand for higher perfor­
mance supercomputers continues to grow for applications such as weather forecasting, 
engineering, material science, plasma physics, economics, national defense, etc. [Mo193]. 
Performance is measured in two ways: peak performance and delivered performance. The 
second aspect is a more important criterion because most of today's supercomputers 
deliver only 5 to 15 percent of their peak performances to the users, except when hand 
optimization and assembly language programming are used on well suited programs 
[Kuc87, Mo193]. Higher performance computing can be achieved by using application 
specific computers that utilize Very-Large-Scale-Integrated (VLSI) circuit technology. 
Such application-specific computers are the key components for many hard real-
time applications such as in real-time speech and image codings, data communications, 
and robotic vision. Low design cost is an important criterion for application specific com­
puters which requires automatic mapping of algorithms onto VLSI systems. 
The systolic arrays shown in Figure 1.1 consist an important class of single rate 
arrays. The most distinguishing feature of the systolic architecture is that all operations in 
a system are controlled by a single clock signal. In other words, all the operations are per­2
 
formed at the same rate throughout the entire array, regardless of the different complexities 
of these operations. 
Clock signal 
i INPUT  OUTPUT 
SIGNALS  °  SIGNALS 
INPUT SIGNALS 
Figure 1.1: A Single Rate Array where the all operations of the array are 
controlled by a single clock signal. 
Systolic and other application-specific arrays are well suited for matrix manipula­
tions,  filtering  applications, and image codings,  etc.  Examples are FIR filters 
y (i) =  (j) x (i  j); and matrix multiplications c (i, j) =  a (i, k) b (k, j); etc. 
15k5n 
The systolic arrays can be a part of a larger supercomputing system shown in Figure 1.2. 3
 
High performance 
general purpose computers 
for sequential operations 
Communication channels 
T 
High performance  High performance  High performance 
parallel subsystem 1  parallel subsystem 2 
0  0  parallel subsystem N 
Figure 1.2: A possible supercomputer architecture consists of various appli­
cation specific subsystems, each capable of delivering hundreds of Gflops. 
Systolic array is a special parallel structure targeted for many applications in digital 
signal processing and matrix computation algorithms. Many synthesis tools are available 
for automatic mapping of algorithms onto systolic arrays. However, systolic arrays have 
limited scope of applications. In order to retain spatial regularity for VLSI implementa­
tions, redundancy often is introduced. Moreover, many algorithms cannot be synthesized 
onto systolic arrays using existing synthesis theories because they have irregular data 
dependence structures. For such algorithms, deriving a regular VLSI array is not a trivial 
task in any sense, nor is the task of performing design optimizations. 
There exists a strong demand for developing a general synthesis method for the 
automatic mapping of a larger class of algorithms onto regular VLSI arrays. The ultimate 
objective is to synthesize algorithms from their original algorithmic specifications instead 4
 
of from their lower level specifications such as uniform recurrence equations (URE), affine 
recurrence equations (ARE), or regular iterative algorithms (RIA). 
1.1 AN OVERVIEW OF THE DISSERTATION 
This dissertation is devoted to the developments of multi-rate array structures and a 
synthesis theory for the automatic derivation of MRA architectures from the given algo­
rithmic specifications. 
Chapter 2 shows the Multi-Rate Array (MRA) as the target architecture for the 
growing class of algorithms. The spatial regularity requirement is indispensable for effec­
tive VLSI implementations, however, the temporal locality restriction is not necessary. 
The MRA concept is motivated by the fact that different operations may consume different 
amounts of time in a hardware system. By allocating different clocks to different opera­
tions, the efficiency of a system can be improved. The operations of an MRA architecture 
are controlled by multiple clock signals, as shown in Figure 1.3. 
The ultimate objective in VLSI synthesis is to start the procedure from the original 
algorithmic specifications. The motivations are that algorithmic level optimization is the 
most important and effective step in application-specific VLSI system design. The selec­
tion of algorithms is an important decision affecting the degree of parallelism and the effi­
ciency of parallel VLSI implementations. Also the ability to explore the full solution space 
of a given algorithm is necessary for architectural design optimization. Architectural opti­
mization is also important for VLSI implementations. 5
 
Multiple clock signals Op ..., 4)v 
Inputs at the rate 4)  .. 
Inputs at the rate 4) v 
INPUT  OUTPUT 
SIGNALS  :  SIGNALS 
Inputs at the rate 4)1 
Inputs at the rate 4) v 
4)i : the ith clock signal  INPUT SIGNALS 
Figure 1.3: A Multi-Rate Array architecture where different operations are 
assigned with different amounts of time by the use of different clock signals. 
Chapter 3 is concerned with data dependence localizations. It defines a new class of 
algorithms, affine direct input output (ADIO), to be studied and synthesized. Section 3.2 
introduces the multi-coordinate systems (MCS). Definitions and the characterizations of 
MCS system will be provided. Under an MCS system, the index space of each variable 
may be defined relative to its own coordinate systems. The dependence structure of a given 
algorithm in general is not specified unless its index spaces have been defined. When a 
system of multi-coordinate systems is constructed for a given algorithm, the dependence 
structure of the algorithm has been specified. The uses of a single coordinate system for 
data dependence presentation and the use of a MCS system for dependence structure pre­
sentation of algorithms, in some sense, can be compared with the use of scalars and the use 
of vectors in number presentations. 6
 
Data dependence structure of ADIOs is investigated in section 3.3. The most impor­
tant dependence properties of ADIOs are identified. By MCS technique, the right null 
space can be used effectively for dependence localizations. No other directional propaga­
tions will be required for most algorithms. Procedures will be provided for the construc­
tion of multi-coordinate systems for a given algorithm. Necessary and sufficient conditions 
will be provided in section 3.4. These conditions can be used to determine if a given ADIO 
can be transformed to regular specifications using null-space propagation under a MCS 
system. Section 3.5 provides a procedure for partitioning the computation domain for a 
variable whose index space must be defined in more than one coordinate system. In sec­
tion 3.6, procedures are provided for converting ADIOs into their corresponding regular 
specifications. The chapter is concluded in section 3.7. 
Chapter 4 deals with the scheduling issues for multi-rate VLSI systems. For the 
multi-rate array synthesis, it is necessary to derive a set of multi-rate timing functions from 
a given algorithm. This chapter defines and characterizes multi-rate schedules and multi-
rate timing functions. It provides a procedure for constructing such schedules for a given 
algorithm. The derivation of a set of multi-rate schedules can be formulated as a linear 
programming problem. 
In chapter 4, we also provide procedures for mapping ADIOs onto VLSI multi-rate 
systems. A procedure for the synthesis of ADIOs without the phase components will be 
provided first. Examples will show how the procedure works. The procedure is extended 
to allow the synthesis of ADIOs with the phase components. Again, examples will be stud­
ied. 
Chapter 5 concludes the dissertation and points out some open problems. 7
 
1.2 LITERATURE REVIEW 
The design of parallel computers can be traced back to the works of Unger [Ung58], 
Slotnick, et la [S1o62], and Go lay [Go165] for image processing applications. With the 
development of VLSI technology, parallel computer design has become an intensive 
research area since the late seventies. Processor regularity and nearest neighboring inter­
connection property are among the most important properties for effective VLSI imple­
mentations. Iterative logical arrays [Hen61, Hen68, Wai67] and cellular logic arrays 
[Duf78, Bat80] are two types of cellular automata [Bur70, Pre84]. Cellular automata are 
structurally iterative and regular which suit them for VLSI implementations. Two other 
important classes of cellular automata are systolic arrays [KL78, Kun82] and wavefront 
arrays [Kun84]. 
Since their introduction in the late seventies and early eighties, the design and syn­
thesis of systolic/wavefront arrays has been one of the most active research area in the 
design of application-specific parallel computers, due to their structural regularity and 
local interconnection properties. Structural regularity can reduce the design cost while 
neighboring interconnections can improve system performance, especially in submicron 
VLSI systems where communication delays become the dominant factor. 
Most of the early systolic architectures were designed by heuristic approaches. Such 
an ad hoc design approach is slow and error prone. It may require extensive simulations 
and the resulting designs are not guaranteed to be correct or optimal. Methodology was 
needed to reduce the design time and to guarantee the resulting systolic array is correct by 
construction. In the period of 1983 to 1986, much work was done on developing synthesis 
theories for automatically mapping systolic algorithms onto systolic architectures. The 
computational structures of systems of uniform recurrence equations (URE) defined by 
Karp, Miller and Winograd [KMW67] map extremely well onto systolic architectures, 8 
This was presented by Quinton [Qui83, Qui84] and Chen and Mead [CM85, Che86], 
among others. Quinton developed a synthesis method where algorithm is initially repre­
sented as a set of UREs. An affine timing function is first determined which satisfies cau­
sality constraints imposed by the partial ordering in computation of the URE, and then an 
affine allocation function is selected to map the problem onto a systolic array. At the same 
time, others such as Cappello and Steiglitz [CS83, CS84], Miranker and Winkler [MW84] 
examined the systems. They pursued the linear mapping of uniform algorithms into space­
time domain. These early works have developed a complete synthesis procedure based on 
URE as the initial specification of an algorithm. 
Due to the restrictions arising from the URE specification of the initial algorithm, 
there has been an increasing effort towards the synthesis from a more general class of 
algorithms expressed in terms of Affine Recurrence Equations (ARE) and Regular Itera­
tive Algorithms (RIA). The methods developed by Fortes and Moldovan [FM85], 
Delosme and Ipsen [DI86a, DI86b], Rao [Rao85], Rajopadhye [Raj86], Yaacoby and Cap­
pello [YC88, YC89] focus on the initial specification of algorithms from AREs. AREs are 
transformed to UREs by localizing the global dependencies using null-space and direc­
tional propagations [FM84, Raj86, Roy88, Bu90,], or to Quasi-URE (QURE) [YC88] 
when the index mapping matrix D is the root of I, that is DL = I, where I denotes the 
identity matrix. However, as it was pointed out in [RTRK88], the null space propagation 
technique works only under the condition that the index mapping matrix D is idempotent, 
i.e., D2 = D. Otherwise, multiple directional propagation technique must be employed to 
propagate the variable first along a direction not in parallel with the range space of the 
index mapping matrix. Then the variable can be propagated along the null space. This 
multiple directional propagation localization technique works, but it suffers from low effi­
ciency in the sense that data must be propagated multiple times and through those index 
points where it is not used. 9 
Efficiency has been one of the most important issues in the design of parallel com­
puters. There has been a great deal of work on the systolic design optimizations [BK81, 
FP84, LW85, Rao85, DI85, CM85, Che86, LK88, SF88, WD85, WD89, WD92, BR90, 
Cap89, Cap92, SC92, ZK94]. These works all contribute to the design optimization of sys­
tolic arrays. 
From the algorithmic point of view, the initial specifications of the above mentioned 
synthesis theories, the algorithms must be transformed from their original algorithmic 
specifications to ARE or RIA by ad hoc methods. To fully explore the parallel properties 
of algorithms for optimal solutions, it is necessary to start the synthesis procedure at 
higher level specifications. Weak Single Assignment Codes (WSAC) and WSAC-like 
specifications, proposed by Roychowdhury et al [RTRK88] and Rosseel et al [RCM92], 
allow the use of global operators such as summation E and product H, in the initial speci­
fication of algorithms. However, additional communication time and interconnections are 
required for such a dependence localization technique, which would degrades the perfor­
mance of a system. Moreover, these works mainly focus on the synthesis of algorithms 
whose index mapping matrices are totally unimodular. 
The advances in digital signal and image processing techniques in recent years have 
resulted in the wide use of multi-rate signal processing systems [CR83, Ma189, Vai90]. 
Sampling rate conversions (up- and down-sampling) are the key components in such sys­
tems. They can be found in applications such as subband speech coding, subband image 
coding, and wavelet analysis systems. Index mapping matrices in such applications are 
generally not idempotent nor unimodular. They do not have regular data dependencies as 
viewed traditionally. Temporal/spatial locality may not be achieved in such applications 
and existing synthesis theories cannot be employed to synthesize these algorithms onto 
regular VLSI array architectures. 10 
In those earlier works done by many researchers, "all variables in the algorithm must 
have the same set of indices" as specified explicitly in [Rao85, p.272] or implicitly in 
[KMW67, Qui84, Qui89, Raj86, RTRK88, etc.]. In recent years, different index spaces for 
different variables have been used by some researchers [MQRS90]. However, the use of a 
coordinate system has been natural and has never become an issue in the development of 
synthesis theories. But in all research works published to date, the use of an implicit coor­
dinate system is apparent in the definition of index spaces and computational domains. 
The most serious problem for the development of synthesis theories for multi-rate 
(systolic) array architectures is the lack of a concrete multi-rate (systolic) array model and 
an executable MRA architecture which could show how such arrays might work. The con­
cept of multi-rate (systolic) array and the motivation for developing such architectures are 
easily understood. However, how to design such arrays was not well understood. Without a 
concrete target architecture and an understanding of how such arrays work, all efforts 
would be in vain for developing synthesis theories for multi-rate (systolic) arrays. 
The derivation of schedules for array processors has been an important issue in par­
allel computer design. Many works have been done on the scheduling of uniform algo­
rithms onto systolic arrays [Qui83], [Qui84], [FP84], [DI86], [SF88], [Roy88], [YC89], 
[WD92], among others. However, these works are all concerned with the design of single 
rate schedules (for single rate arrays) and can not be employed for the determination of 
multi-rate schedules. For the design of multi-rate arrays, it is necessary to develop multi-
rate schedules. 11 
Chapter 2 Multi-Rate Array and Directional Affine Recurrence Equations 
Many algorithms do not have uniform dependencies and are linear functions of the 
indices. In this case, temporal/spatial locality may not be achieved for such algorithms. By 
relaxing the locality criteria and allowing data to propagate in multi-rate fashion, these 
non-uniform data dependencies can be propagated without hampering the regularity and 
pipeline structure of the array. A class of algorithms termed Directional Affine Recurrence 
Equation (DARE) is defined. The dependence structure of DAREs can be decomposed 
into uniform and non-uniform subspaces. When projected along the non-uniform subspace 
into the uniform subspace, the resultant array has regular structure. 
2.1 NOTATIONS AND DEFINITIONS 
This section provides the notations and definitions needed to make this dissertation 
self contained. New notations and definitions are to be introduced when they are needed in 
the following chapters. 
The Euclidean n-space is denoted by R", the set of integers in le is denoted by r, and 
the set of rational numbers in Rn is denoted as Q". An integer lattice in r is denoted by 
L". p, qe r denote the index points. 
Ui, U./ denote variables in an algorithm. 
S2; denotes the computational domain of variable Ui. 
ri (p) denotes the index mapping function of the computed variable U (on the left 
hand side of an equation). The index mapping functions for the input variables (on the 
right hand side of an equation) are denoted as r (p) , 15 j 5 V. A variable U./ may 
have more than one index mapping functions in the i-th equation (1  i 5 s) . In this 
case, subscripts separated by coma will be used to distinguish these index mapping 
functions as ri4 1 (p) ,  2 (P) 12
 
Die rn", di E  1  M 5 n, denote the linear part and the translation part of the 
affine index mapping function Ii (p) for variable Ui (in the i-th equation). 
Diie  Zin,1 5 m 5 n denote the linear part and the translation part for index 
mapping function 51(p),15j Ss. If Ui has more than one index mapping function 
with different index mapping matrices in the i-th equation, subscripts separated by 
comma will be employed to distinguish them as: Di4 1, Dpi, 2. 
I denotes the identity matrix of appropriate dimensions.
 
ti (D) denotes the null space of matrix D, and X (D) denotes the range space of D,
 
Ui (p)  Ili(q) denotes that Ui (p) depends on 1.5 (q) , i.e., the evaluation of Ui (p)
 
uses the value of Ili(q). 
Definition 2.1: Recurrence Equations: A recurrence equation over a computational domain 
0 is an equation defined as: 
U(p) = f[...,u (r (p) ),  (2.1) 
where p, r (p) E S2, f is a single valued function which is strictly dependent on its argu­
ments. A system of s-recurrence equations (SRE) over 0 is a family of s-REs defined as: 
Ui(p) =  (2.2) 
Definition 2.2: Mine Recurrence Equations: An RE as defined in Eq.(2.1) is called an 
Affme Recurrence Equation (ARE) if all index mapping functions in it are affme functions, 
i.e. 51(p) = Diip+ dii;1 5 j 5s, where Diie z  ,dfie Z"', 1 5j 5 k, 1 5 mf5n. If 
Vi, 15 i 5 s, Eq.(2.2) are AREs, then the SRE is called a system of affine recurrence 
equations (SARE). Moreover, if V (i, j) , Dpi = I, then the system of equations specified 
becomes a system of uniform recurrence equations (URE). 
Graph theory has been widely used for defining algorithms and also as a tool for the 
analysis of algorithms' dependence structures. For computability analysis and scheduling 13
 
of algorithms onto arrays, algorithms represented as Directed Acyclic Graphs allow us to 
see their dependencies more clearly. Therefore, they can help us in deriving better array 
structures. 
Definition 2.3: Directed Acyclic Graph: The Directed Acyclic Graph (DAG) of an SRE is 
a directed graph defined by [N, A] where N is a set of nodes and A is a set of arcs. A pair of 
nodes, n1, n2 E N, are connected by an arc, a E A, from n1 to n2 if and only if the 
computation at n2 uses the value from n1. The DAG is related to a SRE so that the nodes 
are N= tadp) I adp) is in SRE, p E Cj and arcs A represent dependencies where A = MO) 
4- ni(q) I q = Mij(p); are dependencies in SRE). 
2.2 MOTIVATION 
Example[2.1]: Decimation and interpolation filters are the most important components in 
multi-rate digital signal processing applications [Cro83, Ma189, Vai90, Che93, Her93, 
Pii93]. An N-th order decimation filter with decimation factor M is defined as: 
N-1 
Y (i) =  h (j) x (Mi j)  (2.3) 
j = o 
where h(j)'s are the filter's coefficients and x(i)'s are the filter's inputs. The above can be 
expressed as an SARE as follows: 
[For(0540...j5N-1) 
Y (i,i) = Y (id 1) + h (j) x(Mi D  (2.4) 
y (4-1) =0 
The index point is p = HTE Z2 and the index mapping functions and index map­
ping matrices are: 14
 
ry (P) =  j  D  = I d  =[°
-11 1 Y Y 
ph =[°01  (2.5) rh (p) =  [(11 
r x(P)  =[11110.  Dx  [MO 01 
The dependencies of variable y (ry) are uniform, but dependencies of variables h 
(rh) and x (Tx) are not uniform. Since Dh is idempotent (Dh = Dh) ,  these global 
dependencies of variable h can be removed by propagating the h-variables along the null 
[space of Dh, X (Dh) =  1  .  However, the index mapping matrix D is not idempotent. 
Conventional null space propagation technique cannot be used to localize these global 
dependencies [RTRK88]. The localization technique proposed by [RTRK88] for this class 
of algorithms is the multiple directional propagation technique. It first propagates the ele­
ments of x along a direction i which is not in the range space of D, i.e., T e 9t (D) , then 
it propagates these elements along another subspace. However, this technique does not 
yield a satisfactory solution here, since the length of propagation along 1 is a linear func­
tion of the indices and increases with the indices in 9t (D) .  Moreover, this technique 
increases the number of interconnections among processors, which are expensive in VLSI 
implementations. This SARE is not convertible to QURE (DL * I). Therefore, existing 
synthesis techniques are insufficient for deriving regular specifications from this class of 
algorithms. Existing synthesis theories are mostly concerned with the synthesis of algo­
rithms with totally unimodular (de t (D) = ±1) index mapping matrices. 
For sake of simplicity, let N=10 and M=3. Moreover, assume that those non-uniform 
dependencies associated with variable h have already been localized by employing exist­
ing localization techniques. Then, Eq.(2.4) can be further specified as the following 
SARE: 15
 
For (0 5.4 0.5jS9) 
y (i, j) = y (i, j  1) + h (i, j) x (3i  j) 
h (i, j) = h (i  1, j)  (2.6) 
h (0, j) = h (j) 
y (i, 1) =0 
The DAG for Eq.(2.6) is given in Figure 2.1. The global dependencies due to the 
broadcasting of xi's can not be removed by the use of conventional localization techniques 
since the index mapping matrix Dx is not idempotent. Apparently, it is impossible to derive 
a regular array architecture from this DAG. 
YO Y1  Y2  Y3 
Figure 2.1: The DAG of the 10th order decimation filter with M = 3. 16
 
2.3 MULTI-RATE ARRAYS AND DIRECTIONAL AFFINE RECURRENCE 
EQUATIONS 
MRA: A Multi-Rate Array is a grid of array processors where interconnections 
among processors may have different clock rates. In other words, each variable may be 
propagated at different clock rates. 
Definition 2.4: A multi-rate array (MRA) is a grid of array processors characterized by the 
sets [P, CD, U, D N1, F; where: 
P is the processor space which is the set of all lattice points enclosed within a specified 
region in a P-dimensional Euclidean space. Each processor in this space may have more 
than one 1/0 link connecting it to other processors. 
CD = oil is the set of clock signals associated with each link i where the clock frequen­
cies could be different for different variables. If the set has only one element then the 
MRA is a systolic array. 
U:= ( U1, U2,.... (Iv) is the set of V variables processed at a given time. 
Dp is the set of processor displacements that defines the interconnection of the proces­
sor array. If de Dp an interconnection exists between processor p and (p+d) irrespec­
tive of the particular value of p, except possibly at the boundary processors. 
N1 is the number of delays in each processor associated with each link / connecting pro­
cessor p to (p + d) . If n E N1, then a variable produced at processor p will be passed to 
processor (p+d) after n clock beats (clock rate for each link is different), irrespective of 
the particular value of p except possibly at the boundary processors. 
F is the set of functional dependencies that relate the computation of a variable x at pro­
cessor p during beat (1),, as a function of the variables computed during the previous beat 
at the neighboring processors. 
The uniqueness of MRA architecture is that the temporal locality constraint does not 
exist and each data link may have its own clock frequency. Figure 2.2 shows a possible 
one-dimensional MRA architecture. 17 
4) 
UI  .11.0.. 
U2  .. "=...... P P P  E.1,  n 
Uv 
Figure 2.2: A possible one-dimensional MRA architecture with V variables. 
Definition 2.5: Directional Affine Recurrence Equations (DARE): A system of Directional 
Affme Recurrence Equations is a system of ARE whose index mapping matrices D are such 
that ID ± I] is singular (where / is the identity matrix). The rank of the matrix ID ± 1], 
m = p (D  I) is the dimension of the DARE, and the DARE is said to be an m-
dimensional DARE. 
A system of DARE is a system of ARE whose data dependencies can be decom­
posed into uniform and nonuniform components. An m-dimensional DARE implies that 
the dependence structure of this DARE has an m-dimensional non-uniform subspace and 
an (n  m)-dimensional uniform subspace. For a given 1-dimensional DARE, there exists 
a unique vector p. By projecting the dependence graph along this direction into an 
(n  1)-dimensional subspace (hyperplane), the resultant array will have constant inter­
connections. However, projecting the dependence structure along any other direction 
results in an array architecture with varying interconnection lengths. 
It should be noted that UREs, singular AREs, and QUREs are special cases of 
DAREs, because for UREs, singular ARE's and QURE the eigenvalues of the dependency 18 
matrix D are zero or ±1. However, for DARE only one of the eigenvalues of D must be 
zero or ±1 and the remaining eigenvalues could be any integer number defining the multi-
rate structure of the dependencies. For convenience, let A = D I and T(p,q) denote the 
dependence vector for  (p) 4 Ui  i(p)) . 
In this chapter, we will mainly investigate the properties of 1-dimensional DAREs, 
i.e., D I is rank one. Higher dimensional DAREs are simple generalizations of the 1­
dimensional cases. The synthesis technique to be developed below can be modified to syn­
thesize m-dimensional DAREs. For a given m-dimensional DARE, there exist m projection 
vectors, {p1, f.t2, ..., gm}, which can be used to project the dependencies m times into the 
(n  m)-dimensional subspace with regular structures. 
The decimation filter studied previously has a singular index mapping matrix 
Dx = 3 -1 . The global broadcasting dependency is due to the broadcasting of the vari­
0 0 
able x. It can be localized by propagating the variable x along the j-axis, resulting in 
D = 3 1 with the translation vector d = [o _3] T. It is easy to show that D I is sin­
1 
gular. Therefore it is a DARE. 
2.3.1  Properties of DARE 
This section studies the properties of one-dimensional DAREs and how to map 
DAREs onto the MRA structure based on these properties. 
Lemma 2.1: For a given DARE, there exists a family of parallel hyperplanes Hi such that 
the dependence vectors from hyperplane Hi to a parallel hyperplane Hi are constant, i.e., all 
dependence vectors Tip, q] starting on nodes pe Hi are a constant vector and end on nodes 
qe Hi. 
Proof. The vector T(p,q] is: 19 
T = [Dp + (I] p = [D f]p+d  (2.7)
 
Since ID-11 is singular and of rank one, its rows are linearly dependent and can 
be expressed as: 
[D  I] = [ aT  = ilocT  T =  aTp+ d  (2.8)
Ka 
whereµ is a constant column vector. The product aTp = C defines a Uniform 
HyperPlane (UHP) with constant dependencies in n-space. aT is the normal 
direction of the hyperplanes with uniform dependencies T = 11C + d, which is 
a constant vector. Therefore, all dependence vectors from hyperplane Hi to Hi are 
constant. 
Lemma 2.2: Among the family of parallel UHPs, there is a Symmetrical Hyperplane Hs 
on which the dependence vectors are invariant. This symmetrical UHP is defined as: 
{Hsi aTp = Cs = } for ell  0  (2.9)
aTd I.t
 
otherwise, Hs is at infinity. 
Proof. Assume there exist a symmetrical plane Hs. Any dependence vector  Ts on Hs 
(i.e., both its starting and end points p, q e Hs) should be orthogonal to aT, 
thus: 
aTT5 = 0  (2.10) 
From Eqs.(3.5-6) we have: 
aTT  aTµaTp +aTd  (2.11) 
ifTEH aTp =  Cs, then 20 
()ET d 
ar T = aTI.tC.+ GET d = 0 = Cs =  a.  (2.12) 
From Lemma.1 we know that the dependence vectors on a UHP are constant 
vectors, therefore, dependence vectors on H. are invariant. 
From the invariant dependency property of H. we have the following theorem for the 
allocation function. It provides the condition for deriving an allocation function for a given 
DARE, which can map the DARE onto a VLSI array with constant length of interconnec­
tions. 
Theorem 2.1: The processor space of a DARE belongs to the symmetrical hyperplane H., 
and there exists an affine allocation function a(p) which maps all the dependencies to this 
space domain, resulting in an array with constant interconnections. 
Proof. The objective is to show that the projection of all dependencies onto this space 
is a constant vector. Let the affine allocation function be: 
a (p) = X op + as  (2.13) 
where as is a constant vector and ki is the projection of dependency T onto H. 
along the directiong: 
{ XA T -*HS along iii}  (2.14) 
from the definition of projection it follows that: 
X.I.t = 0 > 11 e ti (X.) 
Using Eqs.(3.5) and (3.6), we have: 
X. T = (D  I) p = kW p + d) = X, jiCi+ A. ad = kd  (2.15) 21
 
which is a constant interconnection vector in the space domain. 
Lemma 2.3: For a given DARE the end points of the dependence vector [Tit: p  qj lie on 
the parallel hyperplanes defined as: 
Cs+ Ci, aTq aTp  Cs+ rC  (2.16) 
where r is a constant integer. 
Proof. Since q =  (p) = Dp + d, then: 
aTI'(p) = ar (Dp + d) = aT((DI)p+p+d)  (2.17) 
From (8-9) we have 
aTF (p) = ocT [µaTp + d] + Cs+ Ci= aTi.t (Cs +  + ard+ Cs+ Ci 
(2.18)
air (p) = aTt.tCs+ aT d + Cs+ (1 + aTp) Ci 
Substituting Cs from Eq.(2.9) results in: 
aTF (p) = Cs + (1 + ell) Ci = Cs+rCi*r= l+arg  (2.19) 
Therefore, the dependence vector T[p,q] has an index point on the hyperplane 
aTp = Cs + Ci. Its other index point q  =  (p) lies on the hyperplane parallel 
to Hs separated by distance rci as: a Tq = CS + rCi. 
For a given DARE, the dependence structure on either side of the symmetrical UHP 
Hs are mirror-images and can be folded along Hs. 
Lemma 2.4:  For a given DARE with the dependence vectors on a UHP defined as 
aTp = Cs+ Ci, there exist symmetrical dependencies (mirror images) that lie  on the 
hyperplane defined as aTp' = Cs Ci. 22
 
Proof. Assume ell.* 0, then there exists a symmetrical hyperplane (Hs: aTp = Cs). 
The dependence vector Ts on Hs can be presented as: 
Ts = qp = Ap+d = gaTp+d = 1.1.Cs+d 
aTT, = aTtics+aTd  0 =  aTd + aTd 
The dependence vector Ts is perpendicular to aT which is the normal direction 
of the hyperplane Hs. Furthermore, the dependency vector Ton the index point 
p of the UHP defined as aTp = Cs+ Ci has its starting point q on the UHP 
defined as aT q = Cs+ rCi. Thus we have a dependence vector TP defined as: 
71' = qp = Ap+d = gaTp+d = t.t(Cs+Cd+d  (2.20) 
Correspondingly, there exists another dependence vector 7 with an end point p' 
on the hyperplane defined as aTp' = Cs Ci, and its starting point q' is on the 
hyperplane defined as aTq° = CsrCi. Therefore, the dependence vector is 
presented as: 
Tn = II (C  C i) +d  (2.21) 
The  distance  between  the  first  two  UHP's  (aTp = Cs + Ci  and 
aTq = Cs + rCi) is dist(aTpeq) which is the same as the distance 
between the later two hyperplanes aTp' and aT q' . If the vector aT is used to 
project 7P onto H then the vector -aT should be used to project r onto H since 
7P and 7 are on the different sides of Hs. Therefore, the images of these vectors 
are presented below as: 
xrif  ell (Cs + Ci) + aT d  aT gc 
(2.22)
aTTn  _41,71 (Cs  Ci) aTd = aT pt 23
 
Since the images of 7P and r vertical to are H, equal to each other (the distances 
of the hyperplanes of each pair) and the images on H. are also equal to each other, 
therefore, r is the mirror image of 7P and vice versa. 
Example[2.2]: Consider a DARE with index mapping function F (p) = Dp + d where 
r D =  [2 1]  and d =  1j T. The matrix A = D-1 = 11 is singular. Decomposing A
12  11 
as: 
-=  = gaT A= 11  iL 
This DARE has a DAG as shown in Figure 2.3. The symmetrical hyperplane is 
defined as {Hsi i +j = 2 } and is shown in Figure 2.3 (along the direction of [_  T). 
The dependencies are invariant along this direction of E_ 1  fl T. The only valid projection 
vector which can give us an array with constant interconnections is 11 =  r 
1 j 
T 
Figure 2.3: The DAG for example[2.2]. The dependencies are uniform only 
along the direction perpendicular to t. 24
 
The projection vector 4 and the symmetrical hyperplane HS in this example are per­
pendicular to each other. This may not be true in general, as will be shown in the following 
example. Under such circumstances, oblique projection may be used to project a DAG 
onto H3. 
Example[2.3]: Given a DARE Ili (p)  Uj(Dp+d), where D =  [3  = 
0 1  d 3 
This DARE has the following dependence matrix: 
A = D -I =  [2  =  [2  = Re. o o  o 
The hyperplanes are defined by 2i j = Ci+ Cs and the symmetrical hyperplane is 
defined as {1/51 2i j = 3/2}  .  Figure 2.4 shows these hyperplanes (dotted lines) and 
the symmetrical hyperplane (wide heavy black line). The UHP's are parallel to each other. 
The dependence vectors from one hyperplane to another are constant vectors. These 
dependence vectors are symmetrical on both sides of H. From Tlun.2.1, in order to obtain 
an array with constant interconnections, we need to use the vector µ = 
T 
to project 
this DAG into the symmetrical hyperplane Hs. However, since such projections are not 
perpendicular, it may be easier to use orthogonal projections to obtain the allocation func­
tion. 25
 
O  0  0 / 0  o : 0 
o  0  O 0 o 
O 0  O (6 0 
0  o ;0 0 
0 0 
o :'0 0  0 
0;:0 0  0  0 
O 0 0 0 
o : 0  0  o o 0 0 0 0 
The symmetrical hyperplane Hs  '4P-he set of parallel hyperplanes 
Figure 2.4: The family of parallel hyperplanes with uniform dependencies 
and the symmetrical hyperplane H for Example[2.3]. 
Theorem 2.2: For a given DARE, a schedule for the multi-rate variable is a piecewise lin­
ear schedule divided into three regions by the symmetrical hyperplane Hs as: 
aTp < Cs  t1 = aT 
aTp  (2.23) Cs  Xt2 1 a T
 
xrp > C A,t3 = al
 s 
where the schedule function is t (p) = Xtp + at. 
Proof. 1) For aTp < 
Let p be a node on the dependent hyperplane aTp = C. This node depends on 
r (p) = Dp + d = (gar + I) p + d. A valid schedule for a recurrence equation 
must satisfy the causality requirement Alp > Xt8 (p) . 26
 
Xip > Xso (p)
 
air (p)  aT (tie + .0  + ard
 
aTgaTp + ceTp + aTd
 = 
(2.24)
= aTCi+ Ci+ aTd<aTp = Ci 
OCT d 
aTilCi+ aTd <0 = Ci < ---. = Cs 
a 11 
Therefore aT is a valid scheduling vector for the multi-rate variable for Ci< Cs. 
2) For aTp = Cs: 
In this case, both of the nodes p and q of the dependence vector are on the 
symmetrical hyperplane {1/31aTp = Co. = aT8(p) :J. Therefore, aT can not be 
used as a schedule vector because Xi, > ?  (p) . All other vectors which are not 
equal to aT can be used as the schedule vector Xs2. Here we choose Xt2 1 aT. 
3) For aTp > Cs: 
The proof for this case is similar to the case of i). aT is qualified to be the 
schedule function for this case. 
In general, there exist many vectors qualified as scheduling vectors. Therefore, the 
above theorem does not attempt to exclude other vectors to be used as scheduling vectors. 
The above selection is just a choice which reflects the computation ordering. 
2.3.2 Mapping of DAREs onto MRA 
Procedure 2.1. Mapping DAREs on MRA architectures. 
1. Represent the algorithm in terms of ARE and check if it is a DARE. 
2. Extract the projection vector .t from the matrix A = D I = gaT. 27
 
3. Compute the clock rate ratio r as r = aTµ + 1. 
4. Base on the value of r determine: 
a) r > 1. Use t as the projection vector. Projecting the DAG onto the 
(n  1) -dimensional space. The multi-rate schedule function has a ratio of 
r in respect to the schedule function for the uniform dependencies. 
b) Since r =1 implies aril= O. Thus, symmetrical hyperplane Hs is at infin­
ity. This is a URE. 
c) Since r = 0 = det(D)=0, use the null space propagation technique 
[Ray86]. 
d) r = -1. The ARE can be made uniform by folding the computation 
domains along the symmetrical hyperplane Hs. 
e)  r < -1. Fold the DAG along Hs and convert it to an ARE with r>1. 
Example[2.4]:  MRA implementation of the decimation filter: Apply the synthesis 
procedure to the mapping of decimation filter studied in example[2.1] onto MRA array 
architecture. Let N=10 and M=3, as mentioned before. Propagating the-variable x along the 
f-axis, Eq.(2.6) can be transformed into the DARE specifications as defined below: 
For (05i, 05j59) 
y(i,j) 
h (i, j) = h (i  1, j) 
h (0, j) = h (j)  (2.25) 
x(i,j) = x(i,j -1) 
x(i,j50) = x(i) 
y (i,  =0 
The computational domain of Eq.(2.25) is defined as SZ = {  0 5 i, 0  9} . 
Ci 
The index mapping functions and matrices of this DARE are defined as: 28 
y(P) =[, j iiDy= dy =[1;Th(p) =[j-11  Dh= I, dh=r1 
0 
(2.26) 
3i  [3 0 
(P)  3  Dx  0 1 dx 
In Eq.(2.26), the only non-uniform index mapping function is r, (p) ,  but 
det[Dx  = 0, therefore, the above synthesis procedure can be used for mapping this 
algorithm onto the MRA structure. Figure 2.5 shows the DAG of this algorithm. Since the 
y- and h-dependencies are uniform over the entire domain, we present them in gray lines in 
order to show the dependencies of variable x clearly. 
Yo Yi Y2 Y3 Y4 Ys Y6 
.9 ..
4  . .  . 
ts  * 
,  .....1111r4r/IIMMkt., e, ... 
,65  .rigilhala ...  : 
,4 eft ,,,
.  _07_9._ 
rt:  Willi i
 2 ,, 
whommottikK ...
'ammo  0..0. 1
  .,  0 
The depending and depended hyperplane-pairs 
Figure 2.5: The DAG of the decimation filter with N = 10 and M = 3 as 
defined in Eq.(2.25). 29 
Following the synthesis procedure presented above, we have: 
A  [2  r2 1  (2.27)
0 0  0 L  A-J 
Therefore, we have the allocation function a (p)  =  [o  , which give us a linear
1 
array structure as shown in Figure 2.6a. Now, let different variables travel at different 
clock rates and derive a structure for the algorithm. Let the clock signal for variables y and 
[ h have a rate 4)y. The clock rate ratio is determined as r = aT 4+ 1 = [i 6 2 + 1 = 3 
1 
for variable x. That is, the clock signal frequency for the variable x should be three times 
the other variables' clock signal. Thus, we have 4)., = 30y. 
A good system design engineer would produce a linear array structure as shown in 
Figure 2.6a with internal structure as given in Figure 2.6b. The most significant difference 
between MRA architecture and single rate array structures is that in MRA array different 
operations may be allocated with different amounts of time. Complicated operations (e.g., 
multiplication-accumulation operations in this example) are always granted with more 
time to perform. Simpler operations (e.g., shifting data from register to register) are always 
traveling at faster clock rate. The amount of time allocated for performing the multiply-
accumulate operation is three times the time given to the data shifting operations. This fea­
ture of MRA structure also will be illustrated later in the MRA implementation of an inter­
polation filter, a dual structure of the decimation filter. 30 
Ito  h1 h2  h3 h4 h5  h6  y 
(a) Multi-rate array for the decimation filter 
L: Latch. D: Delay. 
Ox=30y 
(b) Internal block diagram of the processing element 
Figure 2.6: An MRA implementation of the decimation filter. Two clock sig­
nals are employed. 
The operation of the array is as follows. The input terminal x (link-x) receives the 
input data from the signal source and propagates the data along link-x at the clock rate 4)x, 
while variable y is computed at each PE and propagated along link-y at the rate Oy. The fil­
ter's output is generated also at the rate 0y, where 4)x = 34y. Such a design can improve 
the performance of the array since the time required for latching and shifting operations 
generally takes less time than tin or; (multiplication and addition times). Assume that the 
shift registers and latches are rising edge triggered. Since the latch L is clocked by tOy and 
the frequency of Oy is only one third of 4)x, each PE captures only one third of the data 
propagating on link-x for the computation. This MRA array, in effect, performs the deci­
mation operation before the computation instead of post computation. 31
 
A VHDL model of this MRA architecture has been developed and simulated. Figure 
2.7 shows the simulation output, the waveform of each signal. It shows only the waveform 
of the first four PEs due to the space restriction. 
01,1 X  1
 
CLIC V  1 
H_SEL 
X IN 
O 
1 \H  CDCDCD011 
YO0 Y1I 
X00 X1X 
00\14 
0\H 
CDC) 1111:11E11111111111 CD CD CD CD CD CD CD
CMCDOIEDCDMTIO 
Y10 Y2 I 
X10 X2I 
00 \ H 
0 \H  1:11* 
CDC) CDC, fZ1 CD MED DOO 
Y20 1(31  00\H  M11111:111:111CD CD CD CD 
X2  X3 I  0\H  .11111111rCDCD0000000000000000000 
v ouP  00 \H  CD CD =CD CD 
185  100  200  300  460  11. St 
Figure 2.7: A VHDL simulation result of the MRA decimation filter shown 
in Figure 2.6. 
The detailed operations of the first four PEs of the MRA array are given in the snap­
shots in Table 2.1. Again, only the operations of the first four PEs are given here due to 
space restriction. 32
 
TABLE 2.1. Snap-shots of the operation of the MRA decimation filter. 
CO,  (1)),  PEO (h0) 
0  0  kfrro (x0) 
1  hoz° (xi xo) 
2  h0x0 (x2 xi xo) 
3  1  hox3 (x3 x2 x1 xo) 
4  hox3 (x4 x3 x2 x1) 
5  hox3(x5 x4 x3 x2) 
6  2  hox6 (x6 x5 x4 x3) 
7  hox6 (x7 x6 x5 x4) 
8  hox6 (xg x7 x6 x5) 
9  3  hox9 (x9 x8 x7 x6) 
10  h0r9 (x10 x9 x8 
X7) 
11  hox9 (x11 xio x9 
X8) 
12  4  hox12 (x12 x11 
xi° X9) 
13  h0x12 (x13 x12 
xii x10) 
14  horn (x14 x13 
X12 X11) 
15  5  h0x15 (x15 x14 
x13 x22) 
PE1 (h1) 
hox0+0 
hox0+0 
hoxo (x0) 
h0x0 (xi xo) 
h0X3+h1X2 (x2 
X1 X0) 
h0x3 +h1x2 (x3 
X2 x1 x0) 
hox3+h1x2 (x4 
X3 X2 Xi) 
hox61-171x5 (x5 
X4 x3 x2) 
hor6+h1x5 (x6 
x5 x4 x3) 
hox6+h1x5 (x7 
x6 x5 x4) 
hox9+h1x8 (x8 
X7 X6 X5) 
h0X0+h1x8 (x9 
x8 x7 X6) 
h0X9+hix8 (x10 
x9 x8 x7) 
h0x12 +h1x22 
(xii X10 X9 X8) 
PE2 (122) 
h0x0 
h0x0 
hoxo (xo) 
hox3+h1x2+h2x1 
(x1 x0) 
hox3+h1x2+h2x1 
(x2 x1 X0) 
h0x3+h1x2+h2x1 
(x3 x2 x1 x0) 
hc,x6+h1x5+h2x4 
(X4 X3 X2 Xi) 
hcx6+h1x5+h2x4 
(x5 X4 X3 X2) 
hox6+h1x5+h2x4 
(x6 x5 x4 x3) 
hox9+h1x8+h2x7 
(X7 X6 X5 X4) 
PE3 (h3) 
h,x0 
hoz° 
h0X0 
h0x3+h1x2+h2x1 
+h3x0 (Xo) 
h0x3+h1x2+h2x1 
+h3x0 (x1 xo) 
hox3-1-h1X24-h2X1 
+h3x0 (x2 x1 x0) 
hox6+h1x5+h2x4 
+ h3x3 
(x3x2xix0) 33
 
The advantages of this MRA implementation are: (1) regular array structure with 
nearest neighboring interconnections; (2) high hardware utilization, the efficiency of this 
MRA decimation filter is one hundred percent, no computed result is compressed; (3) high 
throughput rate can be achieved; (4) high achievable speed, the speed of the MRA array, is 
in general not limited by the slowest operation. 
2.4 PROPERTIES AND SYNTHESIS OF HIGHER DIMENSIONAL DARES 
In the last section, we considered one-dimensional DAREs, i.e., p (A) = 1. In this 
section, we will generalize this study to algorithms with higher rank numbers, or m-dimen­
sional DAREs (it is assumed that these algorithms themselves are n-dimensional). The 
dependence space of this class of algorithms can be decomposed into an m-dimensional 
non-uniform subspace and an (n  m)-dimensional uniform subspace. Let m (m<n) be the 
rank of A, m = p (A) = p (D  I', This means that there are only m vectors in the matrix 
A which are linearly independent. The other n m vectors are linearly dependent on these 
m vectors. Therefore, we can decompose A as: 
A =  af 1-112a72'  grnamT  (2.28) 
where all gi's and ai's are linearly independent respectively. The dependence vector 
T = (DI)p+dcan be expressed as: 
T = µl aTp +11.2aIp +  -1-1.tmaLp + d  (2.29) 
Lemma 2.5: For a given m-dimensional DARE there exists a family of intersections of 
(n  m)-dimensional space I formed by the m-sets of hyperplanes defined as: 
p = Cu, i = 1,  m  (2.30) 
Dependence vectors starting on one intersection Il have their end points on an parallel 
intersection i2. 34
 
Proof. Since we know that the index point p is on the intersection of the hyperplanes 
defined by Eq.(2.30), then Vi, 1 5 i 5 m, 3j A Cu such that index point p satis­
fies all these equations of hyperplanes. Thus the dependence vector T is an n-
dimensional constant vector defined as: 
T =  +112C2 +  +  + d  (2.31) 
This means that all dependence vectors with their ending points on this 
intersection have their starting points on another intersection parallel to it. 
Lemma 2.6:  Among the family of parallel intersections, there exists a Symmetrical 
Intersection is on which the dependence vectors are invariant. This symmetrical 
intersection is defined by the following hyperplanes: 
Hisiarp = es =  (  ocTi.t.aTp + ap) / (aD.ti)  (2.32)
ioj,15j5m 
for argi * 0;  1 5 i S m, otherwise is is at infinity. 
Proof. Assume that there exist a symmetrical plane is. Let Ts be a dependence vector 
on is,  i.e., both its starting and end points p, q E is. Moreover, since 
p, q E is c His, 15 i  m, therefore, any 7; on Is should be orthogonal to all 
vectors a', Vi, 15 i 5 m, that is: 
ocrTs = 0, Vi, 15 i 5 m  (2.33) 
From Eqs.(3.27-28), Vi, 1 5 i 5 m, we have: 
aTT = aT (11 larip +112ccIp +  +1.iniaLp + d) 
(2.34) aTipti afp  ocTR2a72:p  ocTp, alp +  17:d 
if Ts e Is c de 1 5 i  m, it is orthogonal to all the hyperplanes ocrp = 
therefore, we have the conclusion that: 35
 
aiTTs  T  T aJ s  + a. d = 
j*i,15j5m 
(2.35)
Cis= (  + aTA)/ 
i *i, 1 sism 
From Lemma 3.5, we know that the dependence vectors on an intersection are 
constant vectors, therefore, dependence vectors on is are invariant. 
From the invariant dependence property of is, we conclude that the regular proces­
sor space of a given DARE belongs to is. 
In order to implement such problems onto VLSI arrays with fixed length of intercon­
nections, the allocation function a (p) = Aap + ca must satisfy the condition Aa T = C, 
where C is an (n-m) dimensional column vector, A. is an ( (n - m) x n) matrix. Extend­
ing Thm.2.2 from 1-dimensional to m-dimensional, we have the following theorem for the 
solution of the allocation function. 
Theorem 2.3: For a given m-dimensional DARE, its processor space belongs to the sym­
metrical interconnection is. There exists an affine allocation function a(p) which maps all 
the dependencies to this space domain, resulting in an array with constant interconnec­
tions. Moreover, the linear part of the allocation function Aa E z(n m) 
x n must satisfy the 
conditions Aal.ti = 0,  Vi, 1 5 i 5 m. 
Proof. The objective is to show that the projection of all dependencies onto this space 
is a constant vector. Let the affine allocation function be: 
a (p) = Xap + as  (2.36) 
where ce,, is a constant vector, X. E Z(n-m) xn  is the allocation matrix. A 
dependence vector T can be mapped onto is by projecting it m times along the 
directions Ili, 1 5 i 5 m. In order to map the dependence vector Tonto the (n-m)­36 
dimensional processor space as a constant interconnection, Xagi = 0 must hold 
for i=1,2,...,m. This confirms that X,a lies in the null space of IA. The rest of the 
proof is similar to that of Theorem 3.1. 
The rank of A plays an important role in the synthesis of MRAs based on DAREs. It 
determines the maximum possible dimension of a VLSI array with constant interconnect­
ing links for a given problem. 
The procedure for synthesizing a DARE is: 
1) Determine m = p [A] and the number of different clock signals.
 
2) Decompose the matrix A as shown in Eq.(2.28).
 
3) Finding the allocation function Xa
 
4) Find the proper timing function.
 
Example[2.5]: Two-dimensional decimation filter. A 2-dimensional decimation filter is 
defined as: 
K-1L-1 
y(i,j) =  k,M21  (2.37) 
k= 01= 0 
where x(ij)'s and y(ies are the input and output signal arrays of dimensions Nx1 x Nx2 
and Ny1 x Nye respectively; h(k,l) is a K x L rectangular window array. M1 and M2 are the 
decimation factors. For the sake of simplicity, we will only consider the filter with param­
eters defined as: M1 M2 =2 and K=L=4: 
3 3 
y (i, j) =  h(k,l)x(2i k,2j l)  (2.38) 
k = 01= 0 
We can express the above algorithm as a system of ARE as follows: 37
 
For 0 5k53
 
For 05/52
 
y (i, j, k, 1) = y (I, j, k, 1  1) + h (k, 1) x (2i  k, 2j 
For 1= 3  (2.39) 
y (i, j, k, 1) = y (i, j, k, 1 1) + y (i, j, k  1,1) + h (k, 1) x (2i  k, 2j 
y (i, j, k, 1) =0 
y (i, j, 1, 3) = 0 
By augmenting the lower dimension arrays h and x to the full dimension, we have 
the following system of DARE: 
For 05k53
 
For 05/52
 
y (i, j, k, 1) = y (i, j, k, 1 1) + h (i, j, k, 1) x (2i  k, 2j  1, k 2,1-2 
For  1 = 3 
(2.40) y (i, j, k,  = y (i, j, k, 1  1) + y (i, j, k  1,1) 
+ h (i, j, k, 1) x (2i  k, 2j  1, k  2, 1 2)
 
y (i, j, k, 1) =0
 
y (i, j, 1,3) =0
 
The index mapping functions are given as follows: 
0 0  -1 
0 0  -1 Dy,1 = Dy, 2  = Dh = I, d  =  ' dy, 2  =  dh = 
0  -1  0 
-1  0 0 
20 -1 0  0 
02 0 1 D =  = [1 0
0 0 1  0  x
 
0 0 0 1  2
 
All dependencies are uniform except those of Dr It is a two-dimensional DARE 
with the dependence matrix defined as: 38 
1 o 1 0 
O1 0 1 0
1 
1 A = D-1 =  D 0 -1 d +  [o 1 o  = RA. +112(4 (2.41) 000 o 0  o 
fl 
000 o o  0 
From Tlun.2.3, the allocation function must be in the null space of P's and it is 
defined as K 0.0 =  r0 0 i 
. Even equipped with this information, the derivation of the 
0 0 0  1 1 
two-dimensional MRA architecture is not a trivial task. Instead of projecting the DAG 
along one direction, we need to project the DAG twice along both vectors p and P2. Or 
[0 0 1 1  [1 equivalently, the allocation function is defined as a (p)  re As a result, =  0 0 0  / 
each node p= Djk r in the computational domain is mapped onto a 2-dimensional 
processor array. 
The resulting multi-rate array is given in Figure 2.8. Three types of processing ele­
ments are used in the array. There are three clock rates in the operation of this MRA archi­
tecture. The first clock signal has the highest operating frequency denoted as 44). The 
second clock signal is 24), one half of the rate of the first one. The third is 4), the slowest 
clock frequency in the array, which is one-fourth of 44). The original data element x(i,j) 
enters the array at the highest rate 44) and the final computed result y(ij) leaves the array 
at the lowest rate at 4). The internal block diagrams of these PEs are given in Figure 
2.8b,c,d. 
The efficiency of this array is 100% and the throughput rate is 1. This array is capa­
ble of producing one result at each clock cycle 4) and receiving input samples at the clock 
rate of 44). The operation of this MRA array is similar to the one-dimensional MRA deci­
mation filter. It performs decimation before other computations. 39 
(a) The MRA system
architecture 
(b) Internal block diagram 
of PE(i, 0) 
KS. 
x 
24) (c) Internal structure of
 
non-boundary PEs  (d) Internal structure of PE(j, N)
 
Figure 2.8: An MRA architecture for the 2-D decimation filter (a) and the 
internal block diagrams of the processing elements. 40 
There are other algorithms such as the Fast Fourier Transform (FFT) which may also 
be considered as directional uniform algorithms. But the FFT algorithm is not a DARE 
because we cannot present it as a DARE. However, the concept of projecting the depen­
dence structure along the non-uniform subspace into the uniform subspace can still be 
used to produce spatially regular VLSI arrays. But the design procedure may be more 
involved and can only be performed case by case, instead of following a systematic proce­
dure. The internal structures of these processing elements may not be identical under such 
circumstances. For example, the N-point (N = 2n) FFT algorithm can be implemented on 
an array of n PEs with constant interconnections, but the internal structure of each PE dif­
fers from one another. However, these differences are minor. Most functional blocks in 
each PE are the same, except that the number of delays in each PE is different, one from 
the other. 
2.5 LIMITATIONS OF THE MULTI-RATE TRANSFORMATION 
CASE I: Requirements for low level specifications. 
Under most circumstances, DARE, like other low level specifications, is not the 
original algorithmic specification for many digital signal processing applications. More­
over, there does not exist a systematic procedure for transforming an algorithm from its 
original algorithmic specification into its corresponding DAREs. This has been performed 
by the use of ad hoc methods. Hence, it is impossible to determine if a given algorithm can 
be implemented onto regular VLSI architectures from its original algorithmic specifica­
tion, until its corresponding DARE specification is derived. This makes it difficult for 
algorithmic level optimization. Moreover, the use of the heuristic method for deriving 
DAREs from their algorithmic specification reduces the solution space of an application. 
This also decreases the ability to perform architectural level optimizations. These are corn­41
 
mon problems for all synthesis methodologies starting with initial specifications such as 
ARE, URE, and RIA. 
In the following, we present two examples in which the existence of the correspond­
ing DARE specifications is not known. How to transform them into corresponding DARE 
specifications is not clear. These examples are interpolation filter and decimation filter 
with fractional factors. 
Example[2.6]: Subband coding has found applications in speech coding, image coding, 
and other digital signal processing applications. There are two major sections consisting of 
a subband coding system, the analysis and the synthesis sections. Decimation filters are the 
main components of the analysis section, while at the synthesis section, interpolation filters 
play the role. Figure 2.9 shows a block diagram of a subband coding system. At the analysis 
section, the input signal is decomposed into three signal bands with different frequencies. 
At the synthesis section, these three subband signals are used to recover the original signal. 
In the diagram, x denotes the input signal sequence, y denotes the subband signals, X 
denotes the output sequence synthesized from the subband signals y. 
An interpolation filter increases the sampling rate of the input signal by a factor L, 
which is the dual system of a decimation filter. The synthesis section is composed of deci­
mation filter banks which can be implemented by using the MRA architecture shown in 
Figure 2.6. It is possible to optimize the design by combining all three decimation filters in 
one architecture. In the following, we are interested in the implementation of the synthesis 
section and the interpolation filter banks. The coder and decoder (or communication chan­
nel) are of no concern to us here. 42
 
Ho(z)  3 3  Coder/  Go(z) 
Decoder 
Or 
yl
 
H1(z)  Comm.  it 3  Gi(z)
Channel 
2 
H2(z)  G2(z) 3 
Synthesis Section Analysis Section 
Figure 2.9:  The block diagram of a three-band critically sampled subband 
coding system with decimation and interpolation filters. 
An interpolation filter with interpolation factor L will first expand the size of input 
data sequence y by L times. This is achieved by inserting (L  1) zeros between any pair of 
the input samples, y(i) and y(i+1) to form a new data sequence w. This new signal 
sequence w is filtered to obtain a smooth signal. An Nth order interpolation filter with 
L = 3 is defined as follows: 
rx(3 ), for  = integer
 
0, otherwise
  (2.42)
N-1 
y(i) =  g(j)w(ij) 
=o 
In the above expression, redundancy has been introduced through the intermediate 
variable w. The dependence structure of Eq.(2.42) is regular. A direct implementation of 
this algorithm is shown in Figure 2.10. This implementation suffers from low efficiency, 
low throughput rate, and low operating frequency, because all operations are performed, 43
 
including those multiplications and accumulations with zero valued operands. To improve 
the efficiency, different structures have been investigated by many researchers [Vai90]. 
Among them, polyphase implementation provides a fully efficient solution. Our interest 
here is to investigate how to express this algorithm with high efficiency, and if it is possible 
to express it in the format required for MRA implementations. 
w(i)= y(i13) 
y  - I3 
g0  82 83  g4  85 86 87  88 89 
-

y 
Figure 2.10:  A single rate array for the 10th order interpolation filter. 
To improve the efficiency of Eq.(2.42), it is necessary to eliminate those redundant 
computations, that is, those computations involve with zero-valued operands. We need to 
eliminate the intermediate variable w and to relate the output variable .7 to the input vari­
able y directly. The following equation presents this relation with some nonlinear opera­
tors: 
1 _I 
x(i)  =  g (jL + (i) modL)y (Lid j)  (2.43) 
1 =0 
where (i) modL is the modular operator, LqJ is the floor function which yields the great­
est integer value which is less than or equal to the given number q. The process of deriving 
a DARE format of this algorithm becomes difficult, if not impossible. Let us denote 44 
(i) modL as cp and name it as the phase component of the index quantization. The follow­
ing is an equation which directly relates the output signal (7) to the input signal (y): 
0 <_cp < L -1 
1  L  (2.44) 
x(1L+9)=  g (jL+ 9)y (i  j) 
l =0 
The implementation of this equation may yield a fully efficient architecture. The key 
concern here is whether we have an MRA architecture which can perform such algorithm. 
From this equation, we do not have a systematic approach for deriving the corresponding 
DARE specifications. Therefore, the synthesis procedure based on DARE can not be 
employed for the implementation of this algorithm. This example illustrates the limitations 
for synthesis procedures starting with low level initial specifications. We do not know how 
to start the synthesis procedure until we find a corresponding DARE specification. But it is 
not clear if there exists a DARE expression for it. 
Example[2.7]:  In digital audio, different sampling rates have been used for different 
media. Non-integral sampling rate converters are used for copying materials from one 
medium to another. This is performed through the use of decimation filters with a fractional 
decimation factor. Decimation filter with fractional decimation factor M/L can be 
described by the block diagram shown in Figure 2.11. The implementation of such filters 
has drawn much attention in recent DSP applications. It will be shown later in chapters 3 
and 4 that this class of algorithms can also be implemented onto the MRA architectures. 45 
v 
H(n) 
Y 
Figure 2.11:  Decimation filter with a rational decimation factor M/L. 
An Nth order decimation filter with fractional decimation factor MIL can be 
expressed as the following set of equations: 
x (  ) ,  Li  = integei
u (i) = 
(  0, otherwise 
N-1  (2.45)
V (i) = I h(j)u(ij)
j. o
 
y (i) = v (Mi)
 
Again, intermediate variables have been introduced in this expression. The algo­
rithm expressed in Eq.(2.45) has a regular dependence structure. Direct implementation of 
the block diagram in Figure 2.11 or Eq.(2.45) results in an array architecture with 
extremely low efficiency. Only one of (L x M) computed results must be computed. This 
efficiency is derived from two facts: first, there are (L-1) out of L computations 
involved with the multiplication of zero valued samples; secondly, only one of M com­
puted result is used in the output signal. Polyphase implementations [Vai90] present an 
efficient approach to improve efficiencies of these realizations, but the regular interconnec­
tion pattern has been greatly hampered by the continuous redrawing of the block diagrams. 
Moreover, these polyphase implementations are not programmable to suit different appli­
cations. 46 
To utilize the synthesis technique for this algorithm, we again face the problem of 
transforming Eq.(2.45) into its corresponding DARE specification. There does not exist a 
known DARE specification for this algorithm. Therefore, the DARE specification 
becomes the bather again. But, it is relatively easy to transform Eq.(2.45) into a high effi­
ciency specification as shown below: 
for (0S45L-1) r 
L  (2.46) 
y (Li + 4)) =  h (Li + 4))x (Mi j) 
= 
How to derive a highly efficient regular array architecture from this specification? 
The DARE approach can hardly help us in such applications. 
CASE  Conflicting Interests 
In the synthesis procedure, the vector 11 has been used as the projection vector to 
obtain a regular array architecture. However, in a given algorithm, if the computational 
domain has a ray y with direction different from the direction of .t, then the synthesis tech­
nique developed for DARE cannot be used to derive an array with regular structure and a 
finite number of processing elements. It is impossible to implement an array with an infi­
nite number of processing elements. Therefore, under such circumstances, if this synthesis 
technique is to be used, structural regularity must be sacrificed. This is an undesirable situ­
ation. Is it possible to develop a new synthesis theory which can produce regular array 
architectures with a finite number of processing elements for such algorithms? 
The second conflict occurs where a given algorithm has two or more affine index 
mapping functions in an equation, say r1 (p) = Dip + d1 and 1-2 (p)  = D2p + d2 
where both dependence matrices Al = Di  I and A2 = D2-/ are singular. That is, 47 
there are two DARE dependencies in the same equation. Let A1 = 'liar and A2 = 112a2 . 
If Ili 01.12, under such circumstances, the synthesis procedure based on DARE cannot be 
employed to derive a regular MRA architecture for this algorithm, because projecting the 
DAG along either direction µl or i.t2, the interconnections associated with the other 
dependency will always be irregular. 
Example[2.8]: Let the following equation be in a given algorithm: 
y(i,j) = y(i,j 1)w (2i + j, + 2j  1) + h (i, j) x (3i  j, j  3)  (2.47) 
where the values of each variable are not important. There are two non-uniform index 
mapping functions which are defined as: 
T Ds, = [1 ;1;  [  ,  =  [1] [1 1]  1-1,y  fl 
(2.48) 
D  = r3  dx.roi; Ax.rii[2q.p.x.[,[  r 
x  LO  L-3j  LOj 
Those conventional techniques cannot be employed to synthesize this problem, 
since the index mapping matrices are not totally unimodular. Neither can the multi-rate 
transformation technique developed in this chapter be used to obtain array architectures 
T
with regular interconnections because 1.t., =  , while µx =  (AT. No matter which 
projection vector is selected, there are always non-constant interconnections. 48 
Chapter 3 Multi-Coordinate-Systems and Dependence Localization 
Traditionally, the dependence structures of algorithms are often analyzed using a 
single coordinate system and synthesis theories are developed by expressing the computa­
tional domain and the index spaces by an n-dimensional orthogonal coordinate system 
[Jay91]. In many digital signal processing applications, the computational domain is 
defined in an n-dimensional Euclidean space Rn, but most variables are m-dimensional 
data arrays (m < n). Most existing synthesis theories are based on the initial specifications 
over the space Rn; thus the original algorithm index must be extended from m to n. Fur­
thermore, most synthesis methods require the dependence map D to be idempotent. For 
algorithms with non-idempotent index mapping matrices, the approach employed by con­
ventional techniques does not work well and more complicated control mechanisms. 
These problems are directly related to the use of a single coordinate system for 
defining the index spaces and can be solved by expressing the algorithms in a system of 
multi-coordinate systems (MCS) by defining the initial specification as an Affine Direct 
Input Output (ADIO). The ADIO specification resembles the weakly single assignment 
codes (WSAC) specifications but is more general due to the introduction of the phase com­
ponents and the index mapping matrices are m x n, m 5 n. This allows the synthesis pro­
cedure to start directly from original algorithms (in WSAC index mapping matrices are 
defined to be m x n dimensional [RTRK88], but the synthesis procedure starts by extend­
ing the specifications to n x n dimensional by pre-multiplying index mapping matrices). 
In an ADIO, the direct input output relationships in specifying algorithms are used 
to minimize the use of intermediate variables which would degrade the performance of 49 
VLSI implementations. The purpose of using these intermediate variables is to obtain uni­
form data dependence structures. Using MCS, the index space of each variable may be 
defined in relative to its own coordinate systems. When presented in MCS system, most 
non-uniform iterative algorithms will present uniform dependence structures after the 
removal of global broadcasting dependencies. It will be shown that there exists a close 
relationship between MCS system and MRA structure. Procedures will be provided for the 
construction of MCS systems for given algorithms, and transforming these algorithms into 
their corresponding specifications with regular dependence structures. The next step is to 
find a valid affine allocation function and timing to map a given algorithm onto a VLSI 
array architecture. In summary, this chapter will provide: 
A definition of the initial specification of algorithms in ADIO. 
A characterization of Multi-Coordinate Systems, and procedures for constructing MCS 
systems for a given algorithm. 
A technique for deciding the necessary and sufficient conditions to present algorithms 
with regular dependence structures in MCS systems. 
A procedure for the removal of global dependencies and transforming the ADIOs to 
regular dependence structures. 
Figure 3.1 shows the synthesis procedure for MRA architectures from given algo­
rithms. The thrust of this chapter will be on the development of a systematic procedure to 
transform a given algorithm into its corresponding specifications with uniform dependence 
structures. 50 
ADIO specification  Regular VLSI RMCS
of Algorithms  Architectures 
Figure 3.1:  Synthesis procedure for mapping ADIOs onto multi-rate VLSI 
systems 
3.1 SPECIFICATION OF ALGORITHMS 
Definition 3.1: Direct Input Output Specifications (DIO): A direct input output 
specification for a set of s-equations is defined as: 
Ui (Fi (p) +  i) =4111  (p) + p11] ,  Uvi[f (p) +  vi]] 
(3.1) 
pe 
where: 
Di c Ln is the computational domain for 
(Fi (p) + 9i) is the output variable of the array; Fi (p) is the index mapping func­
tion which maps point p to Fi (p) 
U. (T.. (p) +  1 5 j 5 V, are the input variables of the i-th array. Each input vari­
able able may be the output variable of the same or another array, e.g., the output variable Ui
 
may also appears as an input variable on the right hand side of Eq.(3.1).
 
f is a function.
 
srpi and vii are the phase components for the output and input variables respectively.
 
The computation domains SZi, Op_ need not be the same.
 
Note that the computed variable Ui (Fi (p) +  may may not appear on the right-hand 
side of the equation and in general, Eq.(3.1) may not be a recurrence equation. 51
 
Definition 3.2: Affine Direct Input Output (ADIO) Specifications: An ADIO is a DIO whose 
index mapping functions are affme (linear): 
(p) = Dp + d,  DE rn",de Z"`  (3.2) 
where r is an integer lattice point in Rm, D is an m x n index mapping matrix, d is the 
translation part, an m x 1 constant vector, and 1  m n. 
Example[3.1]: An UR filter is defined as: 
Y(i) =  a(l)x(ij)-1- (3.3)
osisN,  1 sisN, 
The above is the ADIO specification of the algorithm. The computational domains 
are defined as Oa = {pl 0 5 i, 0 5 j 5 N1 } and ab = fpl 0 s i, 1 si s N21 where the 
index point p = Dj e Z2. The index mapping functions are: 
ry(p) =  Dy  El 0]
 
r xy(P) = El flP,  Dyy =  10
 
rx, (p) =  Dxy =  El  (3.4)
 
Day = [1:1) Fay (P)  E0 
rby (P)  E0  Dby = E0 
There is no intermediate variable in the above specification. 
Example[3.2]:  Interpolation filters have many applications such as subband speech 
coding, subband image coding, wavelet analysis of images, etc. [Cro83, Ma189, Vai90]. An 
Nth order FIR interpolation filter with upsampling factor L can be expressed as follows: 52
 
i /3,(i), for  = integer w (i) =  L 
C 0, otherwise  (3.5)
N-1 
=  g (j)w (i 
j=0 
where w is an intermediate variable used to relate the input and output variables y and x. 
The direct VLSI implementation of Eq.(3.5) is not efficient since some of the variable 
computations involve zero operand multiplications and additions. The ADIO specification 
of Eq.(4.5) is: 
for 0 <T5L1 
[14,  (3.6) 
X(iLi-v) =  g(1L-FT)y(ij) 
1=0 
The computational domain is S2 = {pi 0 5. i, 0 5 j 5 N-1} ,p=  e Z2. The 
index mapping functions are: 
(p) = [L 6P,  (Pi= (P,  Di= [L 
(3.7) rg (P) = E0 4P'  (Pg  9,  D8= [oLI 
ry (p) =  lip,  = 09  Dy= 
Example[3.3]: An FIR decimation filter is defined as: 
iw (i) =  (j) x (i j) 
(3.8) 
(i) = w (Mi) 
where M is the decimation factor. The intermediate variable w is computed at every index 
point but only one out of every M computed samples is used as the output sample of the fil­53 
ter. The efficiency of this specification is 1/M. The ADIO specification of Eq.(3.8) is as 
follows: 
N-1 
y (i) =  h(j)x(Mij)  (3.9) 
j =o 
The computational domain is S2 = {pl 0 5 i, 0 5j5N-1},p=.j 
T E Z2. The 
index mapping functions are: 
y(P) =  (Pt  Dy= [i
 
rh (p) = [o flP,  Dh=  Co  1]  (3.10)
 
rx (p) = [M -1]P,  Dx=
 
The dependencies described by Eqs.(3.6) and (3.9) can not be localized by the use of 
conventional methodologies. The conventional approaches require the algorithms be ini­
tially expressed as URE, ARE or RIA and the index mapping matrices be unimodular. 
Transforming original algorithmic specifications to ARE and URE in general requires 
index space expansion. 
The affine recurrence equation (ARE) is the expanded version of an ADIO specifica­
tion where index space of the output variable has been expanded. For example, 
y (i, j) = y (i, j  1) + h (j) x (Mi j) is an ARE specification of the ADIO defined in 
Eq. (3.9), which is just one of many possible ARE formats derived from the ADIO specifi­
cation in Eq.(3.9), which is just one of many possible ARE formats that can be derived 
from the ADIO. 54
 
3.2  MULTI-COORDINATE SYSTEMS 
The concept of coordinate was first introduced by Descartes in 1637 [Ben58] and 
later expanded to rectangular cartesian coordinate, spherical polar coordinates, cylindrical 
coordinates, etc. It is possible to transform the specification of an algorithm from one 
coordinate system to another, however, representation of an algorithm in more than one 
coordinate systems is not common. 
The n-dimensional coordinate system was first introduced by an Italian geometer 
Enrico Betti [Ros88], "On spaces of an arbitrary number of dimensions" (1871) [Bet03, 
vol.2, pp. 273-290] as: 
"Let z1, z2,  zn be n variables that can take on all real values from 00 to 
+ 00. We will call the n-ply infinite field of systems of values of these vari­
ables a space of n dimensions and denote it by S A system (4, z2,  zn°) 
will define a point Lo of this space; we will call 4, z2, ..., z° the coordinates 
of this point:' 
It is well known that an n-dimensional coordinate system is defined by n indepen­
dent n x 1 base vectors and a reference point called the origin. Given a coordinate system 
and an n-tuple, or n-ply as in above, (z1,  zn), any point can be uniquely defined. The 
scale coordinate was introduced by W. Bender [Ben58, Ben75] for the representation of 
mathematical quantizations of Riemannian geometry for applications in modern quantum 
mechanics. The coordinate system used for the analysis of data dependence structures for 
DSP algorithms is not exactly the same scale coordinate defined in [Ben58] since the 
physical dimensions of the scales are not of concern. 
The simplest coordinate system in the n-space Rn is called the standard coordinate 
system Csn which is defined as: 55 
Definition 3.3: Standard Coordinate System C7: An n-dimensional standard coordinate 
system, C; is an n-dimensional orthogonal coordinate system defined by n base vectors 
I T
ei = [0 O... 0 1 0 ... Oj  ,  1 5 i 5 n, with 1 is at its ith location. The axes of CS are labelled 
as z1, z2,  zn respectively. 
Given a coordinate system, any vector v can be represented as a linear combination 
of its basis: 
v = ciei + c2e2 +  + cnen 
The n-tuple (v)  is the vector of v represented basis { el, ..., en} . Hereafter, any 
vector v or point p in r is defined using CS will be represented as v orp without the sub­
script. For other coordinate system 5 with the basis u the subscript (v)u will be used to 
notify the vector with the basis u. 
Example[3.4]: Let v = (2, 2) ; v e R2 in the standard coordinate C. Consider the basis 
u = {up 112}  with coordinate Cie R2, where ui = [1/2  u2 = [-1/2 
resulting in (v)  = (6, 2)  The new basis u can be represented by transformation T of The
the standard basis in the form [u1 u2] = T [e1 e2] as: 
uzi  1/2 -1/2 [el  [2]  (T-iM)
 
[ 0 1  2 2
 
(3.11) 
1/2 -1/2 6  2 
[ 0  1 [2 [2  [2  u 
Figure 3.2 shows this vector defined relative to different coordinate systems. 411:0j
ks?  lie  () p c)  .c' .:,b c,;%4 ,i,- _,6  c.)  '...,4)  .$  , *.''	  ,e 
4. A.., z? GI  ,,,,  .? '61  '''' (§, ..., 1;.?I/ "t'.sg"  t,  epcP  t 4)  ..  t, 
.4z,  k,e ..q'  t'S' crt  >.4  °,??#..#  .4  :P .3) 
`)  .**  "?  g ,,.4). Qt I, 
,,,. 4,-z)4' c$,)  Cs  .4  .0 .4, 4'  .$  t),* 0  ,0  t,0 
t?*  ..., .1.)  ,100  o  ,0os. ..,, ts, A,-N 
.0 AIe as,4) 4)491  t? -se
--.., b.° 4 -.9 v..  .$ ,ig  ....0  0.9S' 
,.	 
I I 
*	  ,.4, 4,",§,) c,4' 49  ...," *, 4.)  4 0.  ,s, .q.  4,0 
Ilk  . 
0  .%,* 4,  ti,,e.$ g?
co  4.  4"CP
f.)..&	  4'4'. 
Csle'1/4'q 
N	 #cs '...,  ,..  .....  ...  ...  : .`N oiS"  ''N 0o..NN --;
0*4. .te It' 4o .s.,k, '4 0.,  e' 4; 4) 4?  e Ito  co  e oo eci".-4) 
cD`i 'q'a" C4o  "" cke 't;:'0  c,04, dzk0:e  i  ct 
eCk 57 
nodes. However, for the synthesis, it is necessary to embed the variables and its index into 
a relevant n-dimensional computational domain (i.e., assigning dependency flow). The 
procedure of embedding a variable array into the computational domain f2 is called an 
embedding. 
An embedding of the variable index ili(q), q e r into the n-fractional space Qn is 
a mapping function a :  Qn which maps point q to point pq E Qn. 
Eq.(3.12) provides no information on how to embed variables into the n-space and 
how to assign nodes and dependence structure. Note that different variable may have dif­
ferent dimensions and embedding all variables into the n-space according to a single coor­
dinate system may not be desirable for dependence localization. 
3.3 ADIO DEPENDENCE STRUCTURE 
Let Uo(Dop+ do) = f[...,  ((Dip + di) , ...) be an n-dimensional ADIO where 
Uo and Uj designate the output and input variables respectively. Since the translation part 
d is a constant vector, it is not included in the analysis. 
For algorithm dependence structure analysis, it is necessary to know the set of nodes 
where the element U. (q) are used. This sets of nodes will be defined as: 
Ni (q) = {A tip E f2 A W (p)  Uf(q)}  (3.13) 
where some other variable W (p) , p e SI requires elements of U./ (q) .  Similarly, for the 
output variable, the set of nodes where the output nodes requires a value is denoted as: 
No (q) = {A Vp e Stn U0 (q) E W (p)}  (3.14) 58 
Thus, the element Ui (q) is sent to node p if and only if p E Ni (q) ; and Uo(q) 
requires an element from node p if and only if p E No (q)  . 
The objective is to find the node sets U (q) and Ui (q) , and to use the intermediate 
variable to perform the data transfer from Ui (q)  Uo (q) via W (p) , p E  a The objec­
tive is to find the relationship among W (p) and the elements of variables in an ADIO. 
Using W (p) , an ADIO can be expressed as follows: 
Uo (D op) = g (W (p)) = f[..., Ui ( (Dip) , ...)}  (3.15) 
The variable W (p) decomposes the dependence relationships between Uo and 1/./ 
into  the  dependence  relationships  between  first W (p) ÷- Ui (Dip)  and  then 
U0 (Dap)  W (p) and W (p) . The node sets Ni (q)  LI and No (q)  L/ are as shown 
in Figure 3.3, nodes Ni (q) = {p1, p2} require Ui (Dip) and nodes Ni (q) = {p3, p4} 
require Ui (Dip) and the output Uo(Dop) requires values from N (q) = {pi, p3} . 
Therefore: 
p1  Dip 
Di(pip2) = 0 = ,f;1p2e X (Di) 
P2  Dip 
p3  Dip 
P4 4- Dip 
} = Di (p3  p4) = 0  p3 p4 E  (Di) 
Dop < pi 
p3 } = Do(pip3) = 0  E X (D0) 59
 
(a) Direct input-output dependence relationships, Q0, Or 9/ c Si 
N (q)  Ni(q) 
Do(pip3) = 0; Di(pip2) = 0;  Di (p3 p4) = 0;  pi E LI 
(b) Decompose the input-output dependence relationships
down to two separated parts 
Figure 3.3: Dependence relationships among input and output variables and 
W (P) . 
The following lemmas provide the tools to determine Ni (q) for Ili (Dip + di) and 
No (q) for Uo (q) . 
Lemma 3.1:  If W (p) E- 1.5 (Dip + di) ,  W (pi)  and W (p2)  require  U./ (q) , 
3q  Sli,Vp E SI, if and only if pi p2 E X (Di) . 
Proof. If pip2 E X (Di) , from the definition of the right null space of a matrix, we 
have: 60 
Di (p1 p2) = 0  D  1  = Dip2Dipi+ di = Dip2+ di 
Therefore, W (p1)  and W (p2)  depends on the same data element 
Ui (Dip + di). 
Only if: if W (p1) and W (p2) depends on the same data element of variable Ui: 
Dipi + di = Dip2+ di  Di (pi  p2) = 0= p1  p2  (D1) . 
In other words, if 3131 e f2 such that W (pi) 4  (q) , then the node set Ni (q) is 
defined as: 
Ni (q) = {p1Vp e S2,p p1 e K (Di)  (3.16) 
Lemma 3.2:  If Uo (Dop + do) 4 W (p) ,  then Uo(q),Vqe f2o, Vp a f2, requires 
W (pi) and W (p2) , 3p1, p2 a SI, if and only if pi p2 X (Do) . 
Proof. The proof of this lemma is similar to lemma 3.1. 
In other words, if 3q a f2o A 3p1 a f2, U0 (q) 4 W (pi) , then the node set No (q) 
is defined as: 
No (q) = {pi Vp a f2, p pl e it (Do) }  (3.17) 
Lemma 1 and 2 describe an important data dependence relationship in an ADIO. 
These properties are similar to the null space propagation technique [Raj86, RTRK89] but 
without the idempotent restriction, or DoDi = Di (PQ = Q2 as given in [RTRK89]). 
The question now is how can the dependencies of an ADIO be localized by the use of null-

space propagation even if the idempotent condition is not satisfied?
 
Example[3.5]: Consider the dependence relationships of an ADIO:
 61 
6 
y (i) =  h (j) x (3i  j)
 
j =0
 
The index mapping matrices of this ADIO is given as: 
Dy  6, ph =  112 Dx = E3 1] 
where y is the output variable, h and x are the input variables. The computational 
domain of this ADIO is SI = { i 0 5 i, 0 5 j 5 N} .  Figure 3.4 shows the direct depen­
dencies among the output and input variables without W (p) .  Each line linking the ele­
ments y (i) , to the elements h (j) or x (k) represents that the evaluation of y (i) requires 
the values of h U) and x (k) . The dependence structure is not well organized in Figure 
3.4. 
h(0) h(1) h(2) h 3) h 4) h(5) h 6) 
*jV
14P14K 
YO)  ) y  ) y  ) y(4) y(5) Y(6) 
)  1)  2) x 3)  4)  5)  6) 
Figure 3.4: Direct dependence relationships between the output variable y the 
input variables x, h of the decimation filter. 
Let the sets of nodes be denoted as Ny (0, Nh ( U), Nx (i)) respectively. W (p) is 
W (i, j) , V (i, j) e O. To compute these node sets Ny (0, Nh ( U), Nx  ) , we can 62 
examine the computations performed in the ADIO directly. For example, let us compute a 
node set N  (2) for the output variable y:
Y 
6 
y (2) = yi h U) x (6 -j)  Ny(2) = { (2, DI (0 Sj 5 6)1 
j= o 
This can be repeated for the determination of all node sets. However, it is not neces­
sary to go through this procedure to determine the node sets. Let us use lemma 3.1 and 3.2 
to compute these node sets. The null spaces of the index mapping matrices are: 
X (Dy) = [ 0 fl, t t (Dh) = [i 6, X (D x) = [1 3] 
From lemma 3.2, y (i) 4-- W (pi), W (p2) if and only if pi -p2 E X (Di) = [o 
Thus, the node set of the output variable is defined as: 
Ny(i) = {(0)105j56},05i 
Similarly, from lemma 3.1, W (pi) , W (p2) E- h (j) iff pi -p2 E X (Dh ) = D d. 
Thus the node set of the input variable h is defined as: 
Nh (j) = { (0)1 0 5 i}, 0 5 j 56 
Again, from lemma 3.1, W (pi) , W (p2) +- x (i)  iff pi  p2 e N (Dx) = [1 3]. 
The node set Nx (q) is defined as: 
Nx(3i-j) = { (i,j)I 0 Si,05j56} 
If node (pi E N x(q)) A (pi ± [1 3] E SI) and pi ± [l 3] E Q, then these nodes 
(p1 ± [i 3]) E Nx (q). For example, the node pi = [1 3] E Nx (0) ,  then the node 
p2 = pi + b 31 = [2 d E Nx (0) . Similarly, p3 = pi - [i 3] = E0 d E Nx(0) is also 
true. The node set Nx (1) contains these nodes { [i A, [2 5] } and Nx (2) contains these 63
 
nodes { [2 1], [3 4] }. All other node sets can be determined similarly. Note that the ele­
ments of each node set belong to the null space 14t (D x) . 
The data dependence structure is shown in Figure 3.5. The circles denote the index 
points W (p) (or the nodes) in the computational domain. The node sets are either sur­
rounded by dashed lines (for Ny (i) and Nh (j)) or simply struck through by a straight 
line (for Nx (q) ) depending on the available space. From the node sets defined above, it is 
easy to see that the null space propagation technique can be employed even if the index 
mapping matrices are not idempotent. 
Nx(1)1V2) Nx(4)Nx(5) 
NO)  x(3)  (6) 
ne y r 
1/1  I Ii  v  Ii  fj Ij 6- 4 1,4 
/1 it 
1 
NSAIrifilWIANMr? I , Ari  Arm  / pi fo  A 
...,............

r A sr, 
... ...... r 
.....,.°:  _.,.f I? igigi
Ai,  , 0 / o...
.  *i'l
, : ,  , A INki .  4  tA o I, t/  G  i  I,i Ifift
.1 
q i  n rifigro:Aram,  Alifir A 
..  ..  ..  .... 
I­ 0  0  (..  fj  fp  Oi Nh(0).  - II  i 
1 
l k ,  i S i 
f f  1 I 
N (0)N (1) N (2) N (3) g (4) N (5) N (6))V (7AT (8)N (9) YY  YY Y  Y Y Y Y  Y 
Figure 3.5:  Decomposing the dependencies down to the dependencies 
between the output y and W (p) ; and between W (p) and variables x and h. 64 
The objective of embedding is to assign variables U./ (q) in an ADIO to points 
pq e Qn (i.e., assigning direction of data flow and index points in the DAG). For example, 
consider the decimation filter example in ADIO form: 
y (i) = X h (j) x (3i  j) 
For variable x (3i  j) , we need to assign point q = [3i j] = [3 _ lip to a set of 
nodes pq e Q2 in the computational domain. This set of points pq e Q2 lies in the null 
space of Di., X (Di) such that Dx (p  pq) = 0. For this example, 
D x(p pq) = [3 --1] (P-Pq)  = 0  p =  P l 31 
q  LO j 
For example, from the above example, we know that Nx (1)  = { Ei  ,  [2  1, 
Nx (2) = { [2  [3 zO, the elements x (1) and x (2) will be embedded into the nodes 
pqi =  [11 and pq2 =  [211 respectively. Such embeddings satisfy the requirement
0
 
W (p) 4 U i (q) = Di(p p q) = 0
 
Therefore, U (q) can be propagated to Vp e NJ (q) by using the null space propa­
gation technique. Idempotent condition is not required for Di. 
Lemma 3.3: For the variable II i (q) , q E in in an ADIO, its node q can be embedded to 
an index point pq e Qn iff pq lies on the right null space of Di such that Vp e Ni (q) 
D. I (p p
q) = 0. 
Proof. Vp e Nj (q) c Q, W (p) 4 U./ (q), if the condition Di (p  pq) = 0 is satis­
fied, then the null space X (D j) can be used to propagate Ili (q) from the 
index point pq to Vp e Ni (q) . 
On the other hand, if the condition is not satisfied, i.e., 3p1 e N (q) such that 
D j(p1 pq) *0  p pq = '11 
1 0 X (Dj) . Then, in order to propagate U.
J (q) 65 
from node pq to node p, the propagation must be along the vector vl e t (Di) . 
Moreover, from lemma 3.1, if 3pi, p2 E Q, W (p1), W (p2) E- II (q,, then 
Di (pi p2) = 0 = p2 = p1 +  E X (Di)  for some integer k. Thus, 
Di(pipq) # 0  p2pq =  + i 1 = .62 e X (Di) .  This requires that the 
element U./ (q) be propagated along v2 to the index point p2. Consequently, 
multiple directional propagation technique must be employed to propagate the 
variable. However, as pointed out before, dependence localization by the 
multiple directional propagation technique requires multiple interconnections 
and complicated control mechanisms for VLSI implementation. Therefore, only 
if the node q = Dip is embedded to the index point pq E Qn such that 
Vp e Nj (q) , W (p)  U (q) ,  pq) = 0, then the null space X (Di) 
can be used for dependence localization and no other directional propagation 
will be required. 
Obviously, the embedding function cri  Qn is defined as a (q) = pq, 
pq e  . The question now is how can we find pq E Qn for each variable (If (q) system­
atically? This can be achieved through the use of a system of multi-coordinate systems, 
i.e., by constructing a coordinate system for each Uj itself. 
3.3.1  Constructing coordinate systems 
This section is devoted to the development of a procedure which allows us to deter­
mine index points pq systematically for embedding. 
In this section, we are going to adopt the notations a, 13 given in [HJ85] (p.17) for 
defining submatrices. Let D e  ". "For index sets a c {1, .  ml and 13 c { 1, ..., n 
we denote the (sub)matrix that lies in the rows of D index by a the columns indexed by 13 
as D (a, (3) . For example, 66 
123 
[1 2 31 D = 456  D({1,3}, {1,2,3}) 
'7 8 9 789 
etc."  (Note, the notation for a matrix has been changed from A to D for consistency with 
the rest of the dissertation). 
Moreover, we will adopt the notation 1131 used in [CR81] for denoting the cardinal­
ity of the set (3, that is 101 = card (13) = the number of elements in 13. (Note, the set is 
denoted as A in [CR81] instead of 13). 
For simplicity, we will use the short hand notation D (43) to denote a submatrix of D 
which consists of f3 columns of D. For example, if 13i =  1, 21 and 132 = { 2, 3}, then 
the submatrices of D are defined as: 
[1 2  D 0.)  12)

1 0 2  10,D(132)  [02 -21 
For any given variable array U (q) in an ADIO, can it be guaranteed that there 
exists a systematic procedure to determine pq for U (q)? 
Theorem 3.1: For a given m-dimensional variable array U (q) , it is always possible to 
find a coordinate system Cm and an embedding function a :  Qn which maps 
U (q) to an index point pq e Qn such that Vp e N (q) ,D (p q- p) = 0. 
Proof. This theorem is proved by the following two cases. 
Case I: p (D) = n. The null space of this variable array is zero dimension and 
every element U (q)  will be used at most only at one location and no 
propagation of the elements will be required. To make regular dependencies, the 
linear transformation T = D-1 is used to get the base vectors of the new 
coordinate system  are defined as: 67
 
it);  = Tel;  1 5 15n 
If U (q) is an output variable array with coordinate 5, then the dependencies 
between  W (p)  and  U (p)  can  be  expressed  as 
U (q)  = U (p + D-1 d) = g (W (p)) . If U (q) is an input variable array, 
then g (W (p)) = U (q) = U (p + D-1 d) .  These dependencies am constant 
over the entire computational domain and can be presented as a dependence 
vector D-1 d. 
Case II: p (D) = m,D: mx n, m< n. 
For a rank m matrix A, the following statements are equivalent [HJ85]: 
(a) rank A = m; 
(b) There exist k, and no more than k, columns of A that constitute a linearly 
independent set; 
(c) There exist a k-by-k submatrix of A with nonzero determinant, but all (k +1)­
by-(k+1) submatrices of A have determinant 0. 
The right null space It (D) is used for dependence localization using W (p) . 
For an m-dimensional data array, we need an m-dimensional coordinate system 
to define its index space. To find the m-dimensional coordinate C7, we need to 
extract an m x m non-singular submatrix D (1) from the m x n index mapping 
matrix D, where 3 c { 1, 2, ..., n}  and  113i = m. Each selection of  13 
determines an m-subspace of Qn in which the coordinate system C7 is 
constructed. If k E 13, then the k-th dimension of the n-space is selected as one 
dimension in the m-space. The selection of 13 must be such that D ((3) is non-
singular. Otherwise, the space spanned by D ((I)  is not m-dimensional but 68 
(m k) -dimensional, where 1 S k S m. An (m  k)-space is not sufficient to 
define an array of m-dimensional. 
To construct the m-dimensional coordinate system C7, we need to find the 
transformation T. The procedure for deriving T consists of the following steps: 
1.	  Select a subspace to embed the variable array U, i.e., select 13; 
2.	  Select the data flow direction, or the iteration direction; 
3.	  Construct the submatrix D (0) which is non-singular, compute its inverse 
D-1 (0); 
4. Construct T as: V k	  13 n k e { 1, ..., n} , fill the k-th row and column with O's. 
Then Vk e 13, fill those unoccupied positions in the k-th column with the corre­
sponding elements of D-1 (0). 
Once the transformation T is determined, the coordinate system C7 is defined 
by the basis ui = Tei, 1 5 i S n. 
Note that in CM defined above, only those k-th dimension, Vk e 13, has nonzero 
base vector and the other dimensions reduce to zero. This puts the origin of C7 
at the same location as the origin of the standard coordinate. For different data 
flow directions, the origin of C7 may be moved to other locations. 
Embed the variable array U (q) to Qn relative to its own coordinate system C7 . 
The physical location of U (q) in relation to the standard coordinate system C's` 
is given as pq = Tql,  q1 e Zn A pq e Qn , where q1 is an expansion of q as 
follows: Vk e (3 A k e { 1, ..., n], fill the k-th position of q1 with zeros, then fill 
the  rest  positions  by  the  elements  of  q.  In  other  words,  let 
p'q E Qm n pig = D -1(13) q, we can expand p'q e Qm to pq e Qn by filling the 
k-th positions with 0, Vk E { 1, 2, ..., n} n k  13, and the other positions are 
filled by the elements of p'q. 69 
Since the k-th elements of pq E  Qn are 0, Vk e {1, 2, ..., n} Ake 13, and the 
relation q =  ((3) p'  pi E 0, thus we have q = Dip  p E  .  This 
means that the dependency W (pq)  U (Dp) exists. From lemmas 3.1 and 3.2, 
and W (p) 4- U (Dp) , we have D (p p q) = 0. Therefore, when D (0) is 
non-singular, the transformation T and the coordinate system constructed above 
can be used to fmd the embedding functions so that null space propagation can 
be used for dependence localization. 
It should be pointed out here that if C7 is allowed to be defined as a function of 
more than m-base vectors from Csn, then the number of possible coordinates is infinite. 
Example[3.6]: Let an ADIO be defined as y (i,j)  x (2i j, 2j) . The index mapping 
matrices are specified as: 
D  10  D =  (3.18) 
Y 0  0 2 
The dependencies of this ADIO cannot be localized using existing methodologies. 
Its DAG under a single coordinate system is given as in Figure 3.6. Apparently, this DAG 
has non-uniform dependencies and it can not be mapped on regular array architectures. 70 
0 
o  oe o 
O. 
'Sc  I 
420 
o  0 0 0 
ON 
0 9 
: nodes of x variable  o nodes of y variables 
Figure 3.6: Dependence Acyclic Graph under a single coordinate system. 
This ADIO has full rank index mapping matrices. Following the procedure of case-I 
the index space of variable y is defined relative to the standard coordinate. However, vari­
able x must have its own coordinate system C2. The transformation T = /).;1 is given as: 
Z.  [1/2 1/1  rr  r CZ  1 1_1/2 up L1/4 1/2] }  (3.19)
0  1/2 
The new dependence structure under MCS system is given in Figure 3.7. Note that 
there are many elements of x-array which have no dependence vectors originated from 
them, this means that they are not used in the computation. 71
 
4;1 
nodes of x variable  o nodes of y variables 
Figure 3.7: Directed Acyclic Graph under an MCS system. 
Example[3.7]: Consider a three dimensional filter as: 
!Jo (i, j, k) = f I (2i  k, kj) ,  U2 (i  2j k, i + 2k) ,  U3 (i  j + k) 
With the computational domain S2 = {1,1 (0 5 i, j, k S N) } . To embed this into the
 
computational domain for 0 5 i,j,k5N. Let p = DAT be the index point. This is the I/ 
0 specification (ADIO) and the index mapping matrices are: 
100
  [2 0 1  1 2 1  r
 
D®=  0 1 0 , D1 =  , D2 - , D3 = - -1 11
 0 -1 1  10  2
 0 0 1
 72 
The embedding of U 0(p) , ao : Z3 -4 Z3 is given  as a° (i, j, k) = (z1, z2, z3) 
where i, j, k are not relative to any specific coordinate system, but,  z1, z2, z3 are in the 
standard coordinate C. 
For U0 (p) 4- 111(D ip) , one  possible embedding is ai : Z2 -4 Q3 which is 
defined as al (i, j) = (z1, z2, 0) ,  that is, U1 (q) is embedded into the 2-dimensional 
subspace (z1, z2) with z3 = 0 in CS as shown in Figure 3.8. Since the matrix D1 is not 
unimodular, its dependencies are linear functions of the indices and grow with the increase 
of the indices at the rate of the eigenvalues in corresponding directions. A partial depen­
dence structure of this dependence relation is given in Figure 3.8. 
: Nodes of U1 array 
0 : Nodes of U0 array 
Figure 3.8:  The dependencies of U o(p) 4- U (Dip) under a single coordi­
nate system. The U1 array is embedded at the subspace z3 = 0. 
MCS representation: Following the procedure outlined above: 
Step 1. Select the subspace to embed U1. Let this subspace be (z1, z2) 73 
Step 2. Select the data flow direction. This is equivalent to the selection of the origin for 
the new coordinate system Cl. Let cl have the same origin as the standard coordi­
nate. 
Step 3. Construct the submatrix D1 ((3i) . Since the subspace is selected as (z1, Z2), there­
fore, 131 = {1, 2). Thus: 
[02  01 --11] (131)  [02 ..° j , and the inverse of Di ((3i) is given as: 
Dil  [1/02 Oid 
Step 4: Construct the transformation T1 e .Z3 x 3. First, fill the third column and the third 
row of T1; then fill columns one and two with the columns of the matrix D--,1 (p). 
Thus the transformation T1 is given as: 
0 
T1 =  0 1 0  (3.20) 
0  0 0 
with this transformation, the coordinate system CI, is defined as Cl = Tic: with basis 
{ u11, u2 }; where u1 = [1  0 6 T9 u2 = [CI -1  6T. C1 is shown in Figure 3.9b. The 
superscripts of ul, u2 denote that these base vectors are associated with the first coordi­
nate system C21 for variable U1. Using q, the embedding is: 
z1 0  z1
ai (i,j) = u2 z2 =  0 1 0 z2 = (-T, z2, 0) ,for i -4  z2  (3.21) 
0  z3  0 0 0 z3  i 
Z3 74 
The dependence graph is partially given in Figure 3.9c where the global broadcast­
ing dependencies have been removed by the null space propagation. Figure 3.10 shows 
this DAG with different layers. 
Z3 
N.* 
01,  .
:Pr 
ear 
Z2 
(b) The coordinate system for U1 
and the element array of U1 (a) The Standard coordinate system
 
and the element array of U0
 
0: nodes for the U0 variable
 
: nodes for the U1 variable
 
0: nodes where elements
 
of U0 and U1 overlap
 
z  (c) The localized DAG after embedding the U1 array 
zi  into the computational domain at z3=0 
Figure 3.9: The dependencies of U® (p)  U1 (Dip) under a MCS system. 
The axes of CS are labeled as (z1, z2, z3) and of Cl for U1 as (zip z1) . 75 
0: nodes for the U0 variable 
o: nodes for the U1 variable 
0: nodes where elements
 
of Uo and U1 overlap
 
Vow. 
z2 
The localized DAG after embedding the U1 array 
into the computational domain at z3=0 
1 
Figure 3.10: The DAG showing the dependencies at different layers. 
To complete the dependence analysis, it is necessary to define the coordinate sys­
tems for variables U2 and U3. Let the subspace for embedding U2 be the (z2, z3) and the 
subspace for U3 be the (z3). Correspondingly, following the same procedure, the coordi­
nate systems for U2 and U3 are defined by the transformations T =  T2, T3 
0 0  000 
T2 = 0 1/2 1/2 ,T3 = 0 0 0  (3.22) 
0 0 1  0 0 1 
=  ,  = [0 1/2 6, u2= [0 -1/2 11 
(3.23) 
=  {14},14 = [0 o 76 
These coordinate systems are illustrated in Figure 3.11. For clarity, each coordinate 
system is illustrated separately from IC:. 
I  3  2
2 
Cl
 
2 Z2  c2 
Zi 
NMI 
Ltip MAWR in iviviv 
.it mutitt 
1 
Figure 3.11:  A possible 3-coordinate systems defined in R3. 
To summarize, to derive a regular dependence structure from a given algorithm, we 
need to propagate the output dependencies using the intermediate variable W (p) into uni­
form dependencies and transform the dependencies between W (p) and the input vari­
ables into uniform dependencies. 
Assume that for each variable U. (q) in a given ADIO, there is one and only one 
index mapping matrix Di associated with it. Based on this assumption, the following is 77 
developed for constructing an MCS system for such ADIOs. This assumption is too 
restrictive for many applications. It will be removed in next section. 
Let m. = rank (D.) ' the coordinate system for variable U. be m.J -dimensional; and 1 1  1 
V be the number of variables in an ADIO. The construction of an MCS system for a given 
ADIO is described by Procedure 3.1: 
Procedure 3.1: 
For 1 S j 5 V, DO
 
If  = n, for 1 =1 to n, compute basis u =
 
else
 
If mi < n' DO 
select an m-subspace to embed the variable U. to the computational 
domain by selecting  13.  (f3)  must be nonsingular; 
For 1 = 1 to mi. DO 
compute basis uji = Di ((3)  er E 13; where e7 denotes the l -th base 
vector of the standard coordinate system with the other k-elements 
been deleted, where k E {1, 2, ..., n} Ake 13. 
End
 
End
 
End
 
In the above procedure we have restricted ourselves to the selection of coordinate 
systems which can be defined by m base vectors and ruled out the use of other (n - m) 
base vectors. By doing so, we have reduced the number of possible coordinate systems to: 
n n  1) ... [n - (mi - 1)] 
(3.24)
1  2 3  ... mi 78 
This number is derived from the possible combinations of m out of n elements. For 
most algorithms, such an MCS system would be adequate to present algorithms with regu­
lar dependence structures, using the null space propagation. However, if there are two 
dependence relationships associated with the same variable  lli  ( Ui (Dop) and 
Uj (Di, 2p), where Di, 1 * Di, 2), which Di should we use in Procedure 3.1? What are the 
characteristics of an ADIO which make it possible to be transformed into regular specifi­
cations using null space propagation? These questions are investigated in next section. 
3.4 CONVERTIBILITY 
This section studies if a given ADIO can be transformed into a system of specifica­
tions with regular dependence structures. First, the sufficient condition for convertibility is 
provided and then the necessary and sufficient condition is formulated. 
3.4.1  Motivations 
This section provides some motivation examples for removing the assumption made 
in last section, i.e., each variable Ui is associated with one and only one index mapping 
matrix Di. For many applications, this assumption does not hold. Therefore, it is necessary 
to remove this assumption to allow the synthesis of a larger class of algorithms. 
Example[3.8]: A Lyapunov matrix equation [Che84, Roy88] is given by: AX + XB  = C 
where A is an n x n non-singular lower triangular matrix, B is an n x n non-singular upper 
triangular matrix. The objective is to solve the n x n matrix X. the above equation can be 
represented in detail as: 
for  15.i,j.. 
i j
c (i, j) = X a (i, k) x (k, j) + X b (k, j) x (i, k) 
k =1  k =1 
This equation can be easily represented as an ADIO defined as follows: 79 
for 15i,j5n 
i -1  j -1 
(i, j)  a (i, k) x (k, j) +  b (k, j) x (i, k)) 
k = 1 x (i, j)  l  a (i, i) + b (j, j k)= 1 
The computational domain of this ADIO is SI = fpl 1 S i,j  n, 1 S k < i, jj. To the 
index mapping matrices are given as: 
D  =lOC  r0,D  [001i
al  xl  c  LO,Qf  0 0 if  0 1 Q1 
(3.25) 
D  [ 0 0 1 1  D  [ 1 0 6 1  D  100I  0 1 0 61 
bl  x2 a2 0 1 0 1  0 0 if  0 Or  LO 1 01 
The variables x, a and b have multiple index mapping matrices. 
Example[3.9]: The algebraic path problem (APP): 
is (i,j) ,  k = 0 
f(i,j,k 1),  i=j=k 
f (i, j, k) = f (i, k, k)  f (i, j, k  1) ,  i=knj*k  (3.26) 
f (i, j, k  1)  f (k, j, k)  j=kni*k 
j, k  1) elf (i, k, k) 0 f (k, j, k  1) ,  i*knj*i 
The index mapping matrices are given as:
 
100  100  0
 [1 0 0 D,  010,Df2 = 001 'Po = 010 Dar  0 1 0  f' i  001  0 0 1  0 0 1 
Another example in which a variable has more than one index mapping matrix is the 
Toeplitz factorization algorithm [13186]. The procedure provided in last section apparently 
cannot be used to construct MCS systems for these algorithms. For such algorithms, can 80 
the MCS technique be used to localize their dependencies using the null space propaga­
tion? 
3.4.2  Sufficient condition 
Using MCS, dependencies among Uj and Uk for j * k are not important to the anal­
ysis of the dependence structure because each of them may be indexed in relation to differ­
ent coordinate systems. However, if there are two dependencies for a given variable as 
Uj(D11p) and Uj(Dj,2p), the properties of every linear part is important to the others. 
Let Ai be the set of index mapping matrices for U./ (q) as Ai =  1, Dj 2, ... } 
Definition 3.5: Set of Index Mapping Matrices  The The set of index mapping matrices Ai 
is  defined  as  Ai = {D.,44v/  + dij,i) appears in the ADIO ,  i.e.,  the 
elements of A. are the distinct linear parts of all index mapping functions associated with 
Uj in an ADIO. IN denotes the cardinality of  or or the number of elements in set Ai. 
For example, for Ui (i, j) = f [..., U2 (3i j),U2( 3i +j), ...] ,  then these sets 
are defined as Al  =  [1  } and A2 =  [3  [.._3  . 
1 
For a given n-dimensional ADIO with V distinct variables Uj, 1 Sj S V, their corre­
sponding sets of index mappings are A 1 5 j 5 V. 
Theorem 3.2: (Sufficient condition for RMCS convertibility): Given a system of n-dimen­
sional ADIO with V distinct variable arrays  1 5 j 5 V. Let its ith equation be defined 
as: 
Ui (Dip + di) = 
(3.27) f [Ui (D  + di,j),Ui(DA1P+ di41) , Uj (Di42P + 4/42),  ] 81 
Let A. 1 5. j V be the sets of index mapping matrices and each set have IN elements 
D k E Z'njxn,1  k  IA  Vj, 1 Sj S V, if the following condition is satisfied:
A 
313  {1, 2, ..., n} A  = "if) A  (W, 2 5 k 
(3.28)
(DA l (0) ) -1D  k (0) = 
then the ADIO is RMCS convertible. 
Proof. The correctness of this theorem is apparent because if we construct a transfor­
mation Tj (or the coordinate system dini) from Di,  1  (13) for each variable 15, 
from theorem 3.1, we know that the dependencies of  can be localized using 
the null space propagation. Each index mapping function may have its own null 
space. Therefore, the elements of  may be propagated along IN directions 
each direction is determined as  k) ;1 <_ k S IO  . 
From this theorem, for a given algorithm with V variables, if Vj, 1 5 j 5 V, IN = 1, 
then the algorithm can be represented in specifications with regular dependence structures. 
However, this is only a special case depicted by Eq.(3.26). The class of algorithms defined 
in Eq.(3.26) is characterized by one property, that is, these algorithms can be transformed 
to regular specifications using one coordinate system for each variable. In other words, to 
present these algorithms as regular specifications, each variable in a given ADIO needs 
only one coordinate system to define its index space. This is not always true. In the more 
general class of algorithms, the index space of each variable may need to be partitioned 
and defined relative to different coordinate systems. 
Example[3.10]: An RR filter is defined as: 
3 3 
Y (i) =  (i) x  D  b (i) Y  (3.29) 
0  k = 1 82 
r (p)  =  Dy= El 0]
 
rxy (p)  El  DYY=  -fl
 
rxy (p) =  Dxy= El 11  (3.30)
 
ray (P) = Eo flp,  Day = [:1
 
rby (p) =  flp,  Dby =
 
The sets of index mapping matrices are defined as: 
=  6,  , Ax =  --1]} , A, =  [)  , Ab =  Eo  (3.31) 
All other variables have only one linear part in their sets except variable y. There­
fore, the convertibility of Eq.(3.27) depends on the dependence properties of variable y. Ay 
has two distinct linear parts which satisfy Eq.(3.26) if 13 =  1} c { 1, 2) is selected, thus 
Dy(0)  Dyy, (0) = [1]. Therefore, Eq.(3.27) can be transformed to regular specifica­
tions using the null-space propagation. 
Let the subspaces for embedding be: (z1, 4) for y; (z1, 0) for x; (0, z2) for a and 
0 0 with (0, z2) for b. The transformations are given as Ty = ;=[01 a 00 ,T  = Tb = 
the origins Oy = (0, 4) , Ox = (0, 0),  Oa  = (0, 0) , and Ob = (0, 4) . The localized 
DAG is given in Figure 3.12. 83
 
Figure 3.12: The localized DAG for the IIR filter algorithm. 
When an ADIO satisfies Eq.(3.26), it can be transformed to a system of RMCS 
using the null space propagation. However, if Eq.(3.26) can not be satisfied, can a given 
ADIO be localized using the null space propagation? This problem is studied in next sec­
tion. 
3.4.3 Sufficient and necessary condition 
When a given algorithm fails to satisfy Eq.(3.26), one coordinate system for each 
variable is insufficient to transform it into a regular specification using the null-space prop­
agation. Then, can it be transformed to a regular specification using the null space propa­
gation? This is investigated in this section. 84 
This following example illustrates the case where the use of a single coordinate sys­
tem for a variable is no longer sufficient to present it with uniform dependencies, and 
more than one coordinate systems must be used to define its index spaces in different 
regions in order to transform it into RMCS. 
Example[3.11]: Let an ADIO be U1( j) = f [ . . ., U2 (3i  j) , U2 (- 3 i + j), .... The 
index mapping matrices are D1 = 
[01 01  n 
1 =  [3  Dz 2 = E-1 fl and the sets of 
index mapping matrix are defined as: 
Al= 
0 1 11,A
2 = i[3 11[3  (3.32) 
In this example, Al has only one element which is the identity matrix, therefore, its 
dependencies can be expressed regularly. Actually, the index space of variable U1 can be 
defined in relative to the standard coordinate system since D1 is the identity matrix. How­
ever, A2 has two elements and the condition in Eq. (3.24) is not satisfied, thus the algo­
rithm can not be transformed into a regular specification if only one coordinate system is 
used to define the index space for variable U2. Let 13 = { 1} , then the coordinate system 
C2 constructed from D2,  1 (0) = [3] is defined by the basis u = [1/3  If this coor­
dinate system C2I is used to define the entire index space for U2, the resultant dependence 
structure is illustrated in Figure 3.9, which is not regular. 85
 
Z2 
Figure 3.13: When condition (3.26) is not satisfied, one coordinate system 
for a variable is not sufficient to present an ADIO with regular dependencies. 
Note that even though (D2,1 (13) ) -1D2,2 ((3) */,  (D2,1  -1D2, 2 On 2 = I. 
This means that (D2 
1 (0) ) -1 D2 2 (B) is a 2-root of I. This property is important since it , ­
has been proved that if a matrix is a root of I, then its dependencies can be represented as 
quasi-regular dependence structures [Yaa88]. By quasi-regular it is meant that the depen­
dencies are regular over the entire domain except possible at the boundaries. For these 
algorithms, the strategy is to use the "general fold" method to transform them into quasi-
regular specifications, which is equivalent to the use of different coordinate systems to 
define the index spaces in different regions. 
Instead of defining the entire index space relative to the coordinate system C21, let 
the index space of U2 be partitioned into two regions each to be defined in its own coordi­
nate system. How to partition the computational domain for each variable will be investi­
gated in next section. For this example, the domain is partitioned as into two regions, 
either SZ1 =  (0)1 0  i}  and S/2 =  (i,j)I 0 > i} ,  or ST, =  (i,j) I 0  and 
LT2 =  (i, j) I 0 > j} respectively, depending on how 13 is selected to construct the sub­86 
matrices for constructing the coordinate systems. When [3 = {1}  , D2, 1((3) = [3] and 
Dz 2 (13) = [-3] are selected as the constructing submatrices, then the first partitioning 
strategy must be employed, i.e., ill = { (0)10  and 02 =  (i, j) I 0 >  .  The 
corresponding coordinate systems for U2 can be constructed by using Procedure 3.1, and 
their basis are given as u2 1 = { [1 /3 d } and uz 1 = {[ 1/3(D for regions f/i and 
S22 respectively. If, on the other hand, 13 = {2), then [1] and [-1] are selected as the 
constructing submatrices, the coordinate systems are then defined by the basis  as 
uz  { Co fl } and u2 1 =  [o 1] } respectively for regions Si'l and n'  The corre­
sponding partitioning and the coordinate systems are illustrated in Figure 3.10. 
4Z2 
(b) 
Figure 3.14:  Different partitioning of the computational domain and the 
effect of using different coordinate systems for different regions. 87 
In Figure 3.14a and b, the computational domain is partitioned into two regions but 
it is defined in one coordinate system. In Figure 3.14c and d, these regions are defined in 
different coordinate systems whose actual effects are folding these regions along the z2­
and z1-axis respectively. It should be pointed out that not only the domain for variable U2 
is so partitioned and defined, but also the domains of the other variables. If the first parti­
tioning strategy is employed, the resultant dependence structure of the ADIO is as given in 
Figure 3.15. The new dependence structure is regular over the entire domain. 
E: Nodes p E K-22  0: Nodes p E Oi 
Figure 3.15: Two coordinate systems are used to define the index regions i/i 
and 02 of the index space for each variable. 88 
The question now is that, for a given ADIO, what are the conditions which it should 
satisfy to be transformed into a system of regular specifications or quasi-regular specifica­
tions? 
Let D. D. j, 2 e A be two different linear parts of the index mapping functions J,19 
associated with variable U. and both of them have the same rank m = rank (D) , this is 
reasonable because they both define the index vectors for the same m-dimensional variable 
array. However, if the rank of a matrix, say Dj, 2, is less than m, then Dj 2 should not be 
considered and Di, p (p  1) = m) should be used, because only the matrix of rank-m 
can be used to index an m-dimensional array. If all the index mapping matrices associated 
with variable U. have rank less than m, then U. should not be an m-dimensional array. 
If (Di11 (13) )  2 (13) */, but the condition ( (DA, (1) )  2 (13) )  = I 
is satisfied for some integer Lip the algorithm still can be transformed into a new specifi­
cation with regular dependencies in the entire domain, except possibly at boundaries of the 
those partitioned regions of the computational domain. The computational domain is parti­
tioned into L.. regions. This case is similar to the "general folding" technique proposed by 
Yaacoby [Yaa88] but it is more general because only the index mapping matrices associ­
ated with the same variable U. are considered. But in [Yaa88], all index mapping matrices 
in an equation must be considered. However, the method developed in [Yaa88] for com­
puting a fundamental region can be employed to compute a partitioning for a variable. 
This will be discussed later for the computation of partitioning. 
For a given ADIO, whether it can be transformed to a regular specification using 
null space propagation is determined as follows: 
Theorem 3.3: Necessary and Sufficient condition for RMCS convertibility: Let an ADIO 
have V variables as defined in Eq.(3.25),  1,  2 e Ai be two different linear parts 89 
associated with variable U./. and their rank be m1 . = rank (D). Vj, 1 5j 5V, if the follow­
ing condition is satisfied: 
313 e {1, ..., n} n (I f31 = mi) n (3Lii) = Vk, 2 5 k5IN, 
(3.33)
( (Dji,  (I3) )  2 (13)  = p A ( (Dji,  (3) )  k (13)  is integral) 
where Lei is an integer number, then the ADIO is RMCS convertible. Its dependencies can 
be localized using Lji coordinate systems to define the index space of Uf. And the compu­
tational domain fli is partitioned into Lji polytopes Pip 1515 Lji. 
Proof. Assume that we have already constructed a coordinate system C for Uf based 
on  1  0), therefore the dependencies related to DA 1 are uniform. As men­
tioned before, we restrict ourselves to the embeddings of an m-dimensional 
array into an m-dimensional subspace in  Qn  with the coordinates of other 
dimensions been set to some constants. For simplicity, let the notations for 
1 03) and DA 2 03) be denoted as D1 and D2 respectively (D1 and D2 are 
full rank m x m matrices). Now, let us study the dependence relation 
W (p)  U (Dip), U (D2p); p 
When L = 1, from Tlun.3.1, we know that one coordinate system is sufficient 
to make the dependencies uniform. Assume D1 * D2 and (Di D2) L = I, for 
1 < L. First, we will show that the use of a single coordinate system C; for the 
indexing of Uj is not sufficient. Let 
W(p1)  U  U(D2p1)
 
W (p2) 4- U (D1p2) , U (D2p2) ;  Daps = Dip2
 
(3.34) 
. 
W(pL+i)  U(D1pL+1), U (D2pL+ 1);  D2pL= D 90
 
The  dependence  vectors  associated  with  D2  are  defined  as 
(D2  I) p1; (D2  I) p2;  The objective is to transform these dependencies to 
uniform ones, that is 
(D2 I)p1 = (D2 /)p2 =  = (D2 I)pL = (D2 I)pL +l  (3.35) 
From Eq.(3.34), we have 
P2= D11D2P1
 
D11D2p2  (DV  2p
 
(3.36) 
PL+1= D  1) -11D2PL= (11D2)LP1 
Eq.(3.35) is translated into 
(D2  I) = (D2  I) D11D2 =  = (D2 I)  D 2) L  (3.37) 
This equation holds if and only if D2 = I or D11D2 = I. Following the same 
procedure, if D2 = I holds, we can easily derive that D1 = I, which is 
contradict to our assumption and this case can be easily presented with uniform 
dependencies. If the condition D11D2 = I holds, then this is also contradict to 
the assumption that D11D2* I. Therefore, the use of only one coordinate system 
is not sufficient to make the dependencies uniform. 
In order to have uniform dependencies, Eq.(3.37) must hold. This is possible 
only if p1, 2.5 j  L, is embedded to the same physical location as pl . This can 
be achieved using other coordinate systems e7, 2 5. j L to define the index 
spaces which contain nodes pi, 2 5 j 5 L: 91 
\j-1, -1
f (D-1 17-1 k 111:i =  1-. 21  11111;1  (3.38) 
When pj, 2 5 j 5 L, are indexed relative to their own coordinates, their physical 
locations are located at: 
r
i  (D-1  :1-1 [(D 1 D2)j-11-1P1  = [  D2 
-1 
1  D2)  P1= P1  (3.39) 
Substitute Eq.(3.39) into Eq.(3.37), the equation holds and the dependencies are 
constant. This proves that when ( (D11 (13) )  2 ((3) ) 
Lt  = I, Li coordinate 
systems can be used sufficiently to present the dependencies uniform. 
On the other hand, if this condition does not hold, i.e., there is no such an integer 
Li such that (  ((3)) -1Di 2 ((3) )  = I, or Li ) co, then Eq.(3.24) can not 
be made uniform, or we need infinite number of coordinate systems to index 
which is not a interesting case since by then all nodes in the entire domain are 
reduced to a single point and no parallel computation can be achieved. This 
proves that ( (D41 ((3) ) -1Di. 2 ((3) ) Ld  = I is a necessary condition to present 
these dependencies uniform under Li coordinate systems. 
For the dependencies over the entire n-dimensional computational domain, 
W (p) 4- U(Dii 1p), U (Dji, 2p) ; p e Z", from lemmas 3.1 and 3.2, nulkpaces 
X (Dii, 1) and X (Dii, 2) must be used to propagate a variable to the index 
points where it is required. Therefore, if  = K, then at least k-directional 
propagation (all along the null spaces X (D14 k),15k.K) must be used. 
Remarks: The conclusion drawn from theorem 3.4 can be considered as the generaliza­
tion of the "generalized fold" method reported by Yaacoby [Yaa88], where the index 
spaces of all variables are defined relative to a single coordinate system (implicitly 92 
defined). In contrast, Eq.(3.33) is concerned with each individual variable  with its 
index space defined relative to the coordinate system constructed from (Di  (13) ) -1. On 
this behalf, if we consider all D,1(13) (VD.41e Ai) as all the linear parts of dependence 
mappings in an SARE, then the theorems and procedures developed in [Yaa88] can be 
employed for the analysis and transformation of the dependencies of variable Ui. The 
number L has been determined as L = lcm(rj) (km: least common multiple), where 
(r1) 5 m1. Moreover, L 5 f (mi) , where f N --) N (Theorem 3.6, p47, [Yaa88]). For 
' the values that L can take are 1, 2. For m. = 2 and 3, the values that L can take 
are 1, 2, 3, 4, 6. For mi = 4 and 5, the values that L can take are: 1, 2, 3, 4, 5, 6, 8, 10, 12. 
Note that the number used to determine L in [Yaa88] is the dimension of the algorithm, n, 
which is generally greater than mi. In next section, a procedure will be provided which 
incorporates some procedures from [Yaa88] for the computation of a partitioning of the 
computational domain Or 
The following algorithm is a summary of the foregoing discussions. For a given 
algorithm with V distinct variables, it determines the necessary and sufficient condition for 
transforming it into an equivalent specification with uniform dependence structures. If the 
condition in Eq.(3.33) is satisfied, the given algorithm is RMCS convertible; otherwise, 
the algorithm can not be transformed to RMCS using null space propagation only. To 
transform it to a regular specification, other directional propagation may be required. This 
would require additional interconnections for VLSI implementation. 93
 
ALGORITHM 3.1: Decide RMCS Convertibility 
BEGIN 
FOR 1 <_j  V, 
Extract all index mapping matrices Dii from the algorithm, calculate the 
set A.; 
IF IA j.1 > 1, 
Determine if 313 e {1, 2,  n} and 3Li1 such that V/, 2  / S IA  , 
( (Dji,  (13) )  (f3) )  = IA ( (D( f3) ) -1DA1( 13) is integral) 
If this condition is not satisfied by any DA, (0) , then the algorithm is 
not RMCS convertible; 
RETURN; 
END 
END 
The algorithm is RMCS convertible; 
RETURN; 
END 
In practical applications, the computational domains are generally not higher than 
four dimensions while most variables are one- and two-dimensional arrays, such as speech 
and image signals are 1-dimensional and 2-dimensional variables respectively. This means 
i 
that most m. equal to one or two and Nn  is generally not greater than 6. Therefore, the 
computation of this algorithm should not become too complex. 
Procedure 3.1 constructs a system of MCS for algorithms which satisfy Eq.(3.26). 
For algorithms which do not satisfy Eq. (3.26) but satisfy Eq.(3.31), it is necessary to con­
struct multiple coordinate systems for  This procedure is given by Procedure 3.2 below: 94
 
Procedure 3.2: Construct MCS for variable U. 
1. Invoke Procedure 3.1 to compute a system of MCS of the given ADIO. 
2. For 1 5 j 5 V, if lAil > 1, then 
2.1. V/, 2 5 / 5 IN, if (Dii, 1 (0) )-1Dii,/ 03) = D1 (13) *1, 
invoke Procedure 3.3, compute a partitioning P11 for Or 
2.2. Corresponding to each region (polytopes) Pr, 1 5 k 5 Lji - 1 , apply 
the linear transformation T3,1 = (Di (0))--1 to base vectors of the coordi­
nate system C7, which constructs a new coordinate system C711 for defin­
ing the index region 13.41. 
Remarks: The operation of step 2.2 in the above procedure is equivalent to applying the 
linear transformation Ti,, = (DI (13) ) -1 to each polytopes Pilo, 1 5 k 5 Lii - 1, which 
maps all these regions to P. Therefore, this operation is equivalent to the "generalized 
folding" method given in [Yaa88]. 
The following example serves to illustrate the application of theorem 3.4 in depen­
dence localization where a variable requires more than one coordinate systems for the def­
inition of its index spaces in order to present its dependencies as uniform ones. 
Example[3.12]:  Let an ADIO be given as W (p) = f[U (D 1, 1p) , U (D1, 2p) , ...] , 
-1 1 D  [0 2 where D1  -11, and pT =  j id . Let 13 = {1,2} e {1, 2, 3} be 
1  20 1 0  2  0 -1
selected for defming the submatrices, which are given as: 
D1,1(13) =  [02 1  and D1,2 03) = 
[ol (2',1 
To construct an MCS system from these matrices, let D1,1(0) be used in Procedure 
((3)_i  1/0 2 0 and ( (D1(13) ) -11)2 ((3) ) 2  = I. This means 3.1. Thus we have D 1,1
 
that we can present the dependencies with uniform structure using two coordinates for U.
 95
 
The first coordinate system CI =  [1/2 0 6, [o 1 6} is derived from D1 (0). 
The dependencies of this algorithm indexed under this coordinate system is not uniform as 
illustrated in Figure 3.16a (note that the standard coordinate system is always used as a 
reference coordinate system, it is also used for indexing W). In order to have a uniform 
DAG, we need to partition the computational domain into two regions using two coordi­
nate systems C21 (for the half plane z1  z2) and Cl (for the other half plane z1 < z2) for 
the indexing of U, where  { LO 1/2 d, Ei 0  } Correspondingly, the domain of W 
is also partitioned into two regions and each requires a coordinate system of its own. For 
the index points of Win z1  z2, CI is used; while if z1 < z2 then the coordinate system is 
defined as CI = { [0 1 6, [1 0 d, [0 0 fl r Figure 3.16b shows the dependencies at 
z3 = 0 while Figure 3.16c shows the dependencies at both z3 = 0 and z3 = 1. 96
 
(a) DAG at z3=0 when U is indexed  (b) DAG at z3=0 when U is indexed under 
underC21  two coordinate systemC and C' 
W(p) for (i ?_ j).
 
W(p) for (i <j).
 
U(p) for (i  ./).
 
O U(p) for (i < j).
 
Z1 
Z3
 
....-------­
(c) MCS DAG at z3=0, 1 when U is indexed under C21 and C22 
All dependencies are constant vectors 
Figure 3.16: The DAGs for Example[3.9] under different MCS systems. (a) 
The DAG is non-uniform in which the variable U is indexed under Ci only at 
z3 = 0; (b) Cl and  are employed for U (Ci for i  and CI for i < j) at 
z3 = 0; (c) DAG at z3 = 0 and z3 = 1. 97 
How to compute a partitioning of the computational domain and construct a system 
of MCS for a given system of ADIO which satisfies condition Eq.(3.33)? These problems 
are to be studied in next section. 
3.5 COMPUTING A PARTITIONING 
When a given algorithm dose not satisfy condition Eq. (3.24) but satisfies condition 
Eq.(3.28), multiple coordinate systems must be employed to define the index space of each 
variable, each coordinate system is used to define one of several regions of the computa­
tional domain. The computational domain is partitioned into L non-overlap polytopes. 
The computation of a partitioning of the computational domain has been studied by Yaa­
coby [Yac88], even though the procedures developed were used for computing a funda­
mental region for SAREs with index mapping matrices D all are roots of I, they can be 
modified to compute a partitioning here for ADIOs. 
The following procedure outlines the steps required for computing a partitioning: 
Procedure 3.3: Compute a Partition for 
1. Construct A. from the algorithm. 
2. If IN = 1, there is only one linear part associated with Uf, no partitioning of SZi is 
required. Return. 
3. Determine the m.-subspace (of n-space) which U. is to be embedded in according to
.1 
specifications, and therefore the corresponding submatrix  1 (13) from Di4 1 E  Ai. 
4. For 2  / 5 'Ail, construct Di4  (3) and compute D1 =  1 ((3) )-1Dj41((3) . If 
D1 = I, compute next submatrix; otherwise, compute as follows: 
4.1 Let GI =  {Di} l° 0-1 be the group generated by Di, where Do = I. 
4.2 Construct point p e Rmj such that DTpop, VD E  I. 
4.3 Construct matrix Ao, whose rowi = pT (Di - I) , VD1 E 
4.4 VDi E G1, form the matrix Ai = AoDil 98 
4.5 For 0  < IG/I , define F =  xl Ai (.  0) } 
4.6 Construct q E Fio such that for 0  < I Gil ,  15 j  [Ai] fq  0. 
4.7 Construct the linear constraints for Pi from those of F  as follows. If 
[A ]iq < 0, then use  [Ai] ix 5 0; else use [Ad? < 0. These polytopes are the Pg. 
Example[3.13]:  Let W (p) = f[U (Dip) , U (D2p) , ...] ,  where p =  j /dr and 
2 0 Oil, D_  0 2 1 =  =  Let the variable array U be embedded in the (zp z2)
0 1  z 1 0 
subspace, that is, let  13 = { 1, 2} ,  and so D1 ((3) =  2 0  be used to construct the 
0 1 
coordinate system for variable U. The dependencies at z3 = 0 are given as in Figure 3.13. 
t Z2 
ES 
BioriEorlelleEE
 
0 e Index points for W  0: Node of W depends on node of U 
Index points for U  through the first dependence Dl. 
Figure 3,17: The index points for variable W at z3 = 0. The two dimen­
sional array U is embedded into this plane at z3 = 0 according to its own 
coordinate system CU  The The DAG has non-uniform data dependencies. 99 
In Figure 3.13, the elements of W are indexed under the Standard coordinate system 
and the elements of U are index under the coordinate system Cli21 which is defined as: 
42:)i  = { (  0 d, 1/2), ([0 1 0], 1) }  (3.40) 
The first dependence relation defined by D1 is local and illustrated by the connected 
squares and circles. However, the dependence relation defined by the second index map­
ping function D2 is not uniform and they are the function of the index points. To transform 
these non-uniform dependencies into uniform ones, the use of multiple coordinate systems 
for the variable array U is required. Thus the computational domain is to be partitioned 
into different regions. 
Let us select the first two columns for constructing the submatrices, that is, let 
2 =  1, 2} , then D1 ((3)  °1 2 for constructing the MCS sys­
O1  and D2 ((3) 
tern. Thus we have (D1 owl = [1/21  and  ( (Di ((3) )-1D2 ( (3) ) 4 = I which 
o 1 
means that we need to partition the domain into four regions and therefore four coordinate 
systems are needed for defining the index space of the variable array U. The computational 
domain SI is partitioned into the following four regions defined as in Eq.(3.36) with their 
coordinate systems Eq.(3.37): 
P1= {pli0,i?.j,i>j}
 
P2= {plj <0,i  i > j}
 
(3.41) P3 = {Pli<O,ji,j<i}
 
P4= {plj>0,j
 100 
=  E1/2  6, E0 1 6}
 
= {E0 1/2 d, E-1 0 on
 
(3.42) 
3 =  {E-1/2 0 6, E0  6} 
= { [C) 1/2 6, El 0 o]} 
These partitions of the computational domain are illustrated in the Figure 3.18. 
Z21 
P4 O 0 0 0 0 4 0 0 0 0 
o o  000
000  o o  oo 
o o oo  o 0 oo 
O 00  0  0000pi
P3 0000  o o o 0 
A...A...A. A.  es. 
O 0 0 0  0 0 0 0 
O 0 0 0  0  000 
O 00  o o o  oo 
o o  oo9oo o 00000000
0 
O  ' 
O 000000000 
P2 
Figure 3.18: One possible partitioning of the computational domain. 
These coordinate systems defined as in Eq.(3.37) are illustrated in Figure 3.19a with 
their associated indexing regions are presented in Figure 3.19b. 101
 
i 
P2 
(a) Four coordinate systems each a partition P, 
2 1  Z2 
1 4 
Z2 
(b) Each region is indexed under its own coordinate system 
Figure 3.19: The coordinate systems for variable U and their index regions as 
expressed in them. 
Corresponding to this partitioning, the computational domain for W is also parti­
tioned into four regions. These regions are similarly defined. Figure 3.20 illustrates the 
dependence structure after embedding the U and W variable arrays into the computational 
domain according to the MCS system. 102
 
0: Index points at z3=0 
Index points at z3=1 
Figure 3.20: The dependencies at z3 = 0 and z3 = 1, All dependencies at 
z3 = 0 are uniform (except those inside a node), while the dependencies at 
z3 = 1 are non-uniform on the boundary of the regions but are uniform 
inside. 
3.6 CONVERTING ADIOS TO REGULAR SPECIFICATIONS 
For a given algorithm, the ultimate objective is to map it onto a regular VLSI array 
architecture. This can be achieved if and only if the algorithm can be presented as specifi­
cations with regular data dependence structures. This section is concerned with how to 103 
transform an ADIO to its corresponding specifications with regular dependence structures. 
The structurally well defined specification in Eq.(3.12) provides all informations regarding 
how the index space of each variable is defined in relation to its own coordinate systems. 
However, in general, the dependencies in Eq.(3.12) are not uniform. Global broadcasting 
dependencies need to be localized to transform it into a specification with regular depen­
dence structures. 
Single assignment property is necessary for pipeline implementations. Only com­
putable algorithms are concerned here. 
Single Assignment Property: For each output variable Uo in a system of ADIOs, after 
its index space has been completely expanded from m- to n-dimensions, any instance of 
Uo, U0 (p) , p  r, appears at most once on the left hand side of a recurrence equa­
tion. 
Computability: After the index spaces of all variables in an algorithm have been com­
pletely expanded, there exists a partial ordering of the equations such that any variable 
instance appearing on the right hand side of an equation appears earlier on the left hand 
side of some equation in the partial ordering, except those variables input from outside 
of the computational domain of the algorithm. 
Thorough Distribution Property: A data distribution mechanism is said to possess thor­
ough distribution property if it can distribute all elements Ui (q) of a variable array U./ 
to their corresponding node sets Ni (q) , and it is said that U./ will be thoroughly distrib­
uted. 
Transforming ADIOs to their corresponding specifications with regular dependence 
structures consists of the following steps: 
1) Decide the convertibility of a given ADIO. 
2) Construct a system of multi-coordinate systems for the ADIO. 
3) Find the embedding functions for each variable in the ADIO 
4) Determine the iteration space for each variable. 104 
5) Perform index space expansion for each variable. 
Steps 1, 2 and 3 have already been studied in previous sections. This section will 
determine the iteration space and how to expand the index space of each variable in a 
given ADIO to localize the global dependencies. 
3.6.1 The iteration space 
In order to perform pipelining implementation for a given algorithm, single assign­
ment property is necessary for the algorithm specification. however, in ADIO specifica­
tions, multiple assignment code may exist. Therefore, a procedure must be performed to 
transform an ADIO with multiple assignment property into a corresponding specification 
with single assignment property. This requires the operation called index space expansion. 
The index space expansion transforms a lower dimensional m-array into a corresponding 
higher dimensional n-array by expanding the index space of the m-array from m- to n-
dimensions. However, before the index space expansion can be performed, it is necessary 
to determine subspace along which the expansion can be performed, which is called the 
iteration space, and the iteration orientation such as along the + 1 or 1 direction along 
some the iteration space. 
The iteration space is the (n  m) -subspace of SZ along which the m-array (Ii(q) 
(q E Zn) is expanded into an n-array IIi(p)  (p E r) . Formally, the iteration space is 
defined as: 
Definition 3.6: Iteration Space IS: The iteration space of a variable U./ (q) (qe in) is the 
(n  m) -subspace of the computational domain SI e r along which the index space of f././ 
is expanded to transform 1.5 into its corresponding iterative format with single assignment 
property specified as: 105 
Ui(p) = g(Ui(pd),...),pe S/ 
where g is a mapping function, d is the dependence vector along the iteration space. Fur­
thermore, d's must fill the entire iteration space such that U./ (q) is thoroughly distributed. 
Lemma 3.4: The iteration space ISM of variable Ui (Dp + d) in an ADIO belongs to the 
null space of D, i.e., IS1 c iZ (D) . 
Proof. Assume that IS` does not belong to X (D) , that is, let  e ISM A  e X (D) . If 
76 is used  as the iteration vector, then every node U1(p) is propagated to 
(p +  (here U has already been expanded so that p e LI rather than C/i). 
Let p2 = pi +  since  e X (D) , we have DS * O. Therefore, Dp2* Dpi, 
this means that U (Dp2) and U (Dpi) are different data elements in the Ui 
array. But, Vp e SZ A p = p2 +  0 < a e Z, then each node p will be 
assigned twice by U (Dp2) and U (Dpi) respectively, which violate the single 
assignment property. Therefore, the iteration space of U must belong to 
(D) 
Lemma 3.5:  The iteration space ISM of variable Ui (Dp + d) in an ADIO possesses 
thorough distribution property only if it contains the null space of D, i.e., It (D) 
Proof. Assume that ISM does not contain X (D) , that is, pi 0 ISi A pi E  (D) .  Let 
W (p)  U; (q) = Ui(Dp) and D (p  p1) = 0. Then we have also the 
dependence relation W (p1) 4- Ui(q) .  But pi  ISM, therefore, the element 
Ui(q) would not be sent to pi . This means that the iteration space IS does not 
possesses the thorough distribution property. Therefore, X (D) c "Si must be 
respected. 106 
From Lemmas 3.3 and 3.4, we have the following theorem for the determination of 
the iteration space for a variable in an ADIO. 
Theorem 3.4: Given an ADIO, any variable array U (Dp + d) , D e Zn", m S n, in it 
can be expanded to an ARE which satisfies the single assignment property and thorough 
distribution property if and only if the space of expansion is the null space of D. 
Proof. Lemma 3.3 provides the necessary condition which guarantees that the ARE 
has single assignment property. Lemma 3.4 ensures that each element of Ui, 
IIi (q) , is propagated to all the elements of Ni (q) . 
By expanding the output variable solely, an ADIO is transformed into a system of 
ARE. For example, let an ADIO be Y (i) = Ih (j) x (3i Pi. the iteration space for vari­
i 
able y is defined as /Sy = [o fl T. Let the coordinate system for variable y be defined as 
uy = [i d 
T 
and let the iteration orientation be [IL id . Then the ADIO is transformed into 
a system of ARE defined as: 
y (i, j) = y (i, j  1) + h (j) x (3i  j)
 
y (i, 1) =0
 
y(i) = y(i,N -1)
 
The index space of y is expanded from 1-dimensional to 2-dimensional which trans­
forms variable y into a two dimensional array from a 1-dimensional array. 
3.6.2 Expanding an m-array to an n-array 
Index space expansion generates a higher dimensional array from a lower dimen­
sional one, which is necessary for transforming a given ADIO to its ARE or regular speci­
fications. 
Definition 3.7: Index Space Expansion: Let Ui be an m-dimensional variable array in a 107 
given n-dimensional ADIO with domain Si (m < n) , and the index space of  be Llj a 
The index space expansion generates a new set of index vectors p E LI for Ui from the old 
index vectors q E Of such that U./ is expanded from an m-dimensional array to an n-
dimensional array. 
What is the format and characteristics of a regular specification under a MCS sys­
tem? The characteristics of a regular specification under an MCS system is that the specifi­
cation has uniform dependencies over the entire computational domain S2 characterized 
by a dependence vector or a set of dependence vectors die X (Di) for variable 
except that the norms of di may be less than others in a subdomain Si called the vicinity 
domain of SI, where Of c LI; c 0, while the orientation of all di should be the same over 
the entire domain U. Formally, the regular specification is defined as: 
Definition 3.8: Regular Specifications Multi-Coordinate-Systems (RMCS): A recurrence 
equation is said to be an RMCS if it has uniform dependencies throughout its entire 
computational domain under an MCS system as specified as: 
Ui(p) = f[Ui(p), ...,Ui(p + did , ...,Uv(p+ dvd],Vp E fi
 
Ui(p+ di) ,Vp  0;
  (3.43)
U. (p) =

1  Ui (Dip + d',)  Vp  SI;
 
uf 
where OE. denotes the vicinity domain of the computational domain Q.. Vp  SY means 
that index point p does not depend on any point q E  directly, or those points q which 
are on the other side of the domain crossing the domain  such that the dependence vec­
tor will not intersect with the domain Qi .  Ui (Dip + d',)  indicates that the index space 
of variable U. is defined in relative to the basis u. , or the coordinate system eni. 108 
0 
The vicinity domain is defined as follows. Let W (p) E Ui (Dip) ,  E in, and let 
the m-array U1(q'),qa E in be embedded into the m-subspace of Rn such that the other 
(n  m) coordinates are all zero, that is, zk = 0, Vk E {1, 2, ..., n} Ake 0. Moreover, 
without loss of generality, let the computational domain SI be defined in such a way that 
Vj E {1, 2, ..., n} . Such a assumption can be justified because, for all computable 
algorithms, their computational domains must be bounded or semi-infinite only along cer­
tain coordinate and bounded along others. 
Applying index space expansion to the m-array  (q') ,  E in will expand it into 
an n-array  (q) , q E Zn by propagation along the right null space of Di. Then we have 
all the dependencies characterized by die X (Di) and  (q) = Ui(q di) .  Then the 
vicinity domain 9; where S2 g SZ, g SI is defined as: 
= {pIW (p) 4 Ui(q), elf q  0; p,q e Zn,Vk e { 1, 2, ..., n} Ak0 
The vicinity domain  is illustrated by the diagram shown in Figure 3.17. Note 
that the domain O. may be just a subdomain of the entire computational domain and it is 
defined in relative to its own coordinate system C7i. The dependencies in an RMCS  are 
uniform in the sense that there is only a limited number of dependence vectors in the entire 
domain O. and this number is not a function of the domain size parameters. Whenever 
p E SI but p  9.;, then the dependence vector is always equal to di. However, ifp E il;, 
then the dependence vector may have different norms but its orientation should remain the 
same. 109 
Zk 
=0 
eTq =  `.
 
ekq  0
 
Figure 3.21: Illustration of the vicinity domain S2; of 9  , S2 c SI; 
The application of MCS technique for the analysis of algorithms' dependence struc­
tures can best be illustrated through examples. The following examples show dependence 
structures under single coordinate systems and under MCS systems. 
Example[3.14]: Let us now consider again the dependence Uo (p)  U1 (Dip) studied 
in Example[3.5] to see how the vicinity domain of S21 is defined. Let the computational 
domain be S2 = {pl 0 5 zi, 0 5 z2 S NI, 0 5 z3 5 N2} . Let U1 be again embedded into the 
-IT  r
subspace  (z1, z2)  in relative to the basis  tit  = [1/2 0  or
9  u2 =  -1 QJ T, 
coordinate system C21 = {ui, 14} .  The global dependencies can be removed by 
propagating the elements of U1 in the right null space of D1 to all index points where it is 
used. The dependence vector is the right null space of D1 which is given as: 
[2 0 1, d. =  (D )  = [1 2
1  o i 110 
Since 13 = {1, 2} c {1, 2, 3} and D1 (13) were selected to construct the coordi­
nate system for U1, Therefore, e3 is used to compute the vicinity domain of fix because 
3E {1, 2, 3} A3  {1, 2} . Accordingly, 
T e3di =  r  0 lj 2 = 2 1  z3 5 2
 
2
 
while the other parameters are defined the same as before, thus the vicinity domain 
is defined as L.2xr = { (zp z2, z3) I 0  zi, 0 S z2 5 N 0  z3 S 2} . The dependence graph 
is given in Figure 3.22 where the global broadcasting dependencies have been removed by 
pipelining. 
0: nodes of U0 at z3=2 
0: nodes of Uo at z3=1 
0: nodes for the U1 variable 
Z2 
Figure 3.22: The vicinity domain of f/i is defined as S21  =  0 5 z3  2}  . 
When p E OE then U0 (p)  U (Dip)  should be used.  u 111 
The uniqueness of MRA architecture is that the temporal locality constraint does not 
exist and each data link may have its own clock frequency. This feature makes ita suitable 
target array structure for the implementation of RMCS. When the different scale sizes in 
MCS system are used properly, they can be transformed into the definitions of different 
clock rates in MRA architectures. The following example briefly shows the synthesis pro­
cedure based on the MCS technique. 
Example[3.15]: The ADIO defined in Eq. (3.9) for the decimation filter is studied here to 
show the synthesis procedure. To simplify the analysis of this algorithm, let M  = 3 and 
N = 10, thus the ADIO is specified as: 
9
y(i) = Ih(j)x(3ij),05i  (3.44) 
=o 
where h(j)'s are the filter's coefficients and x(i)'s are the filter's inputs. The computational 
domain of this ADIO is defined as SZ = { 0 5 i, 0 5 j 5_ 9} . Let us use the MCS technique 
discussed in previous sections to synthesize this algorithm. 
Step 1. This algorithm is apparently convertible to regular specifications, since each vari­
able has only associated with one linear part. 
Step 2. Construct an MCS system for the ADIO. This algorithm has a 2-dimensional com­
putational domain, which is to be defined in accordance to a 2-dimensional standard coor­
dinate system C2, . The index mapping matrices are specified as: 
Dy  6, Dh = [)  = [3  (3.45) 
For the variables y and h, each of them has only one submatrix which is non-singu­
lar, thus their corresponding 13 must be selected as 13y  = { 1} and Ph = { 2} . Therefore, 112 
variables y and h each may have one set of coordinate system only where the element of 
the set is coordinate systems with different origin only but the others are the same. How­
ever, Dx has two non-singular submahices Dx( { 1 } ) = [3] and Dx( {2} ) = [ 1] , 
therefore, variable x has two sets of coordinate systems. Let the coordinate systems be 
defined by the following transformations: 
Figure 3.23a corresponding to the selection of CI =  [1/3  } with the same ori­
gin with C. Figure 3.23b shows the same coordinate system except that its origin is 
defined at [N/3 (N  1)1 for an Nth order filter. Figure 3.23c shows the coordinate sys­
tem defined as CI =  [o  } with the same origin with C. It should be point out here 
that the number of selections for embedding a lower dimensional array into a higher 
dimensional space is infinite if no restrictions are imposed. It is impossible to find all of 
them. Based on the fact that the coordinate systems are constructed from the selection of 
fl, we thus have limited ourselves to the selection of an m-standard space to embed an m-
array. But we should remember that we have the freedom to select other embedding strate­
gies and whenever we need this kind of freedom, we may still use it to optimize designs. 
(b) 
Figure 3.23: Several different possible coordinate systems for the indexing of 
the variable array x. 113 
Let us now consider the implementation of this algorithm. Let the transformations 
be defined as Ty =  Tx = [1/3], Th =  . The corresponding coordinate systems are 
then defined as: 
Cy =  D. 61, cxi = { [1/3 d },  =  {[01]} 
These coordinate systems are as shown in Figure 3.24, where the axis of each coor­
dinate system are labelled as zy, zx, zh respectively. The gray lines shows the standard 
coordinate system C. Firstly, let the origins of Clx, CI be located at the same location as 
the standard coordinate system Cs2, and let the origin of Cy be located at the location 
(zp z2) = (0, N 1) for an Nth order filter. This yields in the MCS system as given in 
Figure 3.24. Such selection will result in one MRA implementation. 
tZ2 
zY 
Zh 
Z1 
01 
zx 
Figure 3.24: The MCS system for in R2 consists of CS and three other coor­
dinate systems C.;, Cl el x, C. 
Step 3. Once the MCS system has been constructed for the algorithm, the dependence 
structure of the ADIO is also defined and Eq. (3.39) can be represented as the following 
structured equation: 114 
9
y(i). = Ih(j).x(3ij) 
VI;  (3.46) 
= 0  h Y 
All variable arrays can now be embedded into the computational domain SI accord­
ing to the following embedding functions: 
Z---> Q2, a (i) = (z1, N -1) 
ah ° Z  Q2' ah (i)  = (O, Z2)  (3.47) 
Z1
Z - Q2, ax  = (T, 0) 6x 
After such embeddings, the dependence structure is as shown in Figure 3.25 where 
the dependencies are not yet transformed into regular specifications. Broadcasting depen­
dencies need to be localized. Note that the dependence vectors associated with those input 
variables (It and x) are directed into the index points p while those associated with the out­
put variable y are directed toward the nodes of y. This shows how information flows. 
Figure 3.25: The dependence structure under an MCS system for in R2 con­
sists of CS and three other coordinate systems C;, 115 
Step 4. Index space expansion and dependence localization: The iterative property of Eq. 
(3.41) is not explicitly expressed in the equation and the dependence structure is still not 
uniform. The index space expansion aims at extracting the iterative property and the regu­
lar property of an equation. Let us first perform the index space expansion for variables y 
and h (Note, it is possible to expand the index spaces of all variables at the same time). 
These two arrays are expanded from 1-dimensional arrays where the variables are defined 
only in their 1-dimensional index spaces to 2-dimensional arrays where these variables are 
defined over the entire domain. After this expansion, Eq. (3.41) can be represented as a 
system of ARE as follows: 
1 For  0 .i, 0 54 5.9 
y (i, j) = y (i, j  1) + h (i, j) x (3i  j) 
h(i, j) = h (i  1,j)  (3.48) 
h (0, j) = h (j) 
y (i, 1) =0 
Please refer to chapter 2 for the dependence structure of Eq.(3.43) under a single 
coordinate system. We can further localize those global dependencies of x-variable by 
pipelining, and Eq.(3.43) can be expressed as a system of regular structures specifications 
as follows: 116
 
For(05i,29) 
y(i,j) = y(i,j- 1) +h(i,j)x(i- 1,j  3)
 
h (i, j) = h (i  1, j)
 
x (i, j) = x(i  1,j 3)
 
For(0 5..i, 0  ,j5 2)  (3.49) 
x (i,j) = x (3i j),... 
y(i,-1) =0 
h(-1,j) =h(j),,,, 
Under the MCS system defined above, Eq.(3.43) has a data dependence structure as 
given in Figure 3.26a with broadcasting global dependencies. The corresponding DAG for 
Eq.(3.44) is given in Figure 3.26b, whose dependencies are localized by pipelining. From 
this uniform DAG, it is straight forward to implement it onto regular array architectures. 117
 
Z2 
.Z1 
Yo  Y  Yo  Y  Y4 
h  4 
ma.i>  0.6  4Nc 
h 
;w1 
h h
 
h 
Apt
h 
ha  why 
h 
h
 
Noir 111
 
VIlfillifir 
1 
Mi  y;  YO  1  Y2 Y3 Y4 
3  4 5  7 8  1  12  0;  2  4 5 4 7 8  le 10.12 
(a) DAG with global dependencies  (b) DAG with uniform dependencies 
Figure 3.26: The DAG under MCS system for the decimation filter algorithm 
studied in Example[3.12]. The dependencies of h and y variables have been 
localized by index space expansion. 
Step 5. Mapping the uniform DAG in Figure 3.26b to regular array architecture. Since it is 
regularly structured, so it is easy to derive the MRA architecture from Figure 3.26b. Let 
the allocation function be defined as a (p) = [0 flp = z2, and the timing functions be 118
 
given as t1 (p) =  flp = zi+ z2, p E Z2, for the variables y and h (where  z1 and z2 
are the coordinates in relative to the standard basis), and t2 (q) = [3  q = 3 ( z + z2) , 
q E  Q2  = {12.p E Z2 } for variable x respectively, then resultant MRA architecture is the
3 
same as the one given in Figure 2.6. 
As for comparison, we provide an single rate array architecture derived from the 
DAG in Figure 3.26b for this algorithm. Since the DAG is uniform, it is possible to imple­
ment this algorithm onto SRA structures. One possible implementation is as shown in Fig­
ure 3.27 by projecting the DAG along the i-axis onto the f-axis, where 3 links for the 
propagation of x-variables are required. Each black strip in each link represents a delay 
while each processing element contributes another delay to each channel. This SRA array 
is not very suitable for VLSI implementations. 
X8 Xs X7 
I I  III 
X7 x4 x1  II 1 
11 
X6 x3 x0 
h0  h1  h2  h3  h4  h5  h6 
y 
y 
Figure 3.27: Another single rate VLSI array architecture for the decimation 
filter derived from the DAG in Figure 3.26b. Even though it has constant inter­
connections, the length increases with the decimation factor M. 
MRA solution 2. Now let us consider another embedding strategy for this algorithm. 
Let the coordinate systems be defined the same way as before except that the origins are 
relocated. For practical applications, it is always desirable to input data elements at the 119 
boundary processors of the array. Therefore, the coordinate systems for variable x is cho­
sen to be at the boundary of the computation domain as show in Figure 3.23a and b. Ele­
ments of x are propagated along its null space into the domain. Let be origins of Ch, C; be 
located at the same origin as the standard basis. Let the origin of Cx1 be located at 
(N  1, N 1) . The resultant MCS system is given in Figure 3.28. 
Zh  Z2	  I I  I I I I  I I I I I  I I 
Zx 
4 
0'
 
Figure 3.28:  Another MCS system for in R2 consists of CS and three other 
coordinate systems C)1,, Cz, CI with different origins. 
Now, for N = 10, the embedding functions are defined as follows: 
a : Z--4 Q2, a (i) =	  0)
 
Z  Q2, a h (i) =  z2)
  (3.50) 
z1+ 
ax Z --> Q2, ax (i)  (  9,9) 
The dependence structure is illustrated in Figure 3.29, where only a part of the 
dependence structure in the domain SI is given. 120
 
Figure 3.29: The dependence structure another MCS system for in R2  con­
sists of CsS and Cl CI Cl with different origins. Cy,  x'  h
By employing null space propagation to pipeline the data transmissions, the algo­
rithms can be represented as the following specification with regular dependence structure. 
Figure 3.30 shows the DAG with localized dependencies. This new RMCS is presented as 
follows: 
For(0 Si, 2 <j59)
 
y(i,j) =y(i,j+1) +h(i,j)x(i,j)
 
h(i,j) = h (i 1,j)
 
x(i,j) = x(i +1,j +3)
 
For (0540 5j52)  (3.51) 
x(i,j) = x (3i j); 
h(0,j) =h(j),,,, 
y (i,N) =0 
The DAG corresponding to Eq.(3.51) is given in Figure 3.30. 121
 
Figure 3.30: Another localized DAG for Eq.(3.39). From this DAG, a more 
efficient MRA architecture can be derived. 
Let the schedule for variable y and h be given as s (p) = z1 z2+ c. Then, from the 
relationship given in the definition, that is sx = rxs, we can easily derive the multi-rate 
schedule for variable x to be that sx(q) = 3z1  3z2 + cx. where c = 6 and cx = 18. 122 
With the allocation function a (p) = z2 and the timing functions as defined above, 
Figure 3.31 shows the new MRA array. The total computation time of this new array is 
less than the one shown in Figure 2.6. Moreover, this new MRA array uses less registers 
(two) than the previous one. So, this new array is a better design than the first one both in 
computational time and area. 
h9  h h  h6 h  h4  h h  hl  116 
-111.-y 
(a) Multi-rate array for the decimation filter 
4)x=34)y 
L: Latch. D: Delay. 
(b) Internal block diagram of the processing element 
Figure 3.31: Another MRA architecture for the decimation filter. This array 
has less delay elements in each PE, and its total computation time is less that 
the one given in Figure 2.6. 
3.6.3 Dependant* localization for other ADIOs 
The conditions given in theorems 3.2 and 3.3 provide the sufficient and necessary 
conditions for localizing the dependencies of a given algorithm by the use of null space 
propagation technique solely. However, algorithms which do not satisfy these conditions 
do not necessarily mean that they can not be transformed into regular specifications. This 
x 123 
simply means that the null space propagation technique is not sufficient to localize the 
dependencies of such algorithms, and therefore multiple directional propagation would be 
required. Even for this class of algorithms, the MCS technique developed in this chapter is 
still very useful in helping a designer to determine the spaces to embed each variable array 
into the computational domain. The following example shows how this can be achieved. 
Example[3.16]: A matrix Lyapunov equation is given by: 
AX+XB = C 
where A is an n x n non-singular lower triangular matrix, B is an n x n non-singular 
upper triangular matrix. The objective is to solve the n x n matrix X. the above equation 
can be represented in detail as: 
for 150 n 
c (i,j) = I a (i, k)x (k,j) +  b (k,j) x (i, k)
 
k=1  k=1
 
This equation can be easily represented as an ADIO defined as follows: 
for 15i,j5n 
i- .11-1 
x(i,j) = (c (i,j) - a (i, k)x (k,j) +  b(k,j) x (i, k))/ (a (i,  4-b(j,j)) 
k=1  k=1 
The computational domain of this ADIO is Q = {pi 1 5 i, j 5n, 1 5 k < i, j} .  To 
localize the dependencies of this ADIO, the first step is to extract all index mapping matri­
ces from it. They are given as: 
D  100 il  D  [0 il, D  [ 10 00 l 1O  001 D 
010 
001  100  100  010
Dbl  a2 0 1 0  2  001  100  010 
where Dx is the index mapping matrix for the output variable and the others are the index 
mapping matrices of the input variables (Dx1, Dx2 denote the different index mapping 124 
matrices for the input variable x). Even though the dependence properties of this ADIO are 
not exactly the same as described by the convertibility condition, the dependence analysis 
method developed in this chapter still can be of great help in localize the dependencies of 
this algorithm. Let the domain be defined in relative to the standard basis C: whose axes 
are labelled as z1, z2, z3 respectively. Since the dependence relations are so simple, we are 
not going to show the procedures for the derivation of each coordinate systems. The output 
variable array x (i, j) must be embedded into the (z1, z2) subspace since it is computed 
in this subspace. The c (i, j) array should be similarly embedded as for di e variable x. 
The array a (i, k) is embedded into the (z1, z3) subspace while b (k, j) is embedded into 
the ( z3, z2) subspace. These embeddings are shown in Figure 3.32. In such an embedding 
scheme, the only real conflict is the embedding of the x variable. The embeddings for vari­
ables a and b suit well for both index mapping matrices of each variable. 
Figure 3.32: The subspaces for embedding the input and output arrays. 125 
For the embedding of the x variable, the above embedding scheme can not satisfy 
the requirements of the other two index mapping matrices Dx1, Dx2 which require that the 
variable array be embedded into the (z3, z2) and (z1, z3) subspaces respectively. But 
such requirements can not be satisfied because they conflict with the requirement of the 
first index mapping matrix Dx, which is more important since it is the output variable and 
so it should be considered with the highest priority. Therefore, other directional propaga­
tions might be employed to transfer the x array from the (z1, z2) subspace to the (z3, z2) 
and (z1, z3) subspaces. However, such directional propagations may not be necessary if 
we study the dependence structure among the elements of x variable in more detail, which 
is given as x (i, j)  x (k, j) , x (i, 1) , 1 5 k < i, 1 51 < j. Since x (i, j) is not needed for 
the computations of x (k, j), k Si and x (i, 1) , 1 5j, it needs to be sent to those nodes at 
either z1 > i or z2 > j only. Therefore, it can be first sent to the point (i, j, k) and then sent 
to those nodes which require it along their corresponding iteration spaces. If the DAG is to 
be projected onto the (z1, z2) subspace, then this other directional propagation is not 
needed. 
After such embeddings, to localize the dependencies, the iteration spaces are identi­
fied as /Sx = z3, this is the iteration space for the output variable x; /Sxi  = z1 and 
/Sx2 = z2 these are for the input variable x respectively. The array c (i, j) has the itera­
tion space IS, = z3. However, since in the computation, the elements of c are not used 
until the iterations are at the boundary of the computational domain, the elements of c do 
not need to be distributed over the domain. IS  = z2, IS b1 = z1 are the iteration spaces 
for variables a and b due to Doi and Dbl respectively.  /.51,2 = ( z2, z3)  and 
/Sb2 = (z1, z3) are for Dal and Db2 (these iteration spaces are for the diagonal elements 
of variables a and b only). 126 
Instead of propagating the input variable x from the nodes (i, j, 0) to the node 
(i, j, k) for performing the computations a (i, k) x (k, j) and b (k, j) x (i, k) , we can 
propagate the variables a (i, k) and b (k, j) from z3 = k to z3 = 0 and then perform the 
computations. These do not change the input output relationships. However, the data 
dependence relations are greatly simplified. The localized algorithm is given as: 
for
 
a (i, j, 0) = a (i, j  1, 0)
 
a (i, 0, k) = a (40, k + 1)
 
b (i, j, 0) = b (i  1, j, 0)
 
b(0,j,l) = b(0,j,1+1)
 
c (i, j, 0) = c (i, j)  (3.52)
 
+a(i,j)x(i-1,j) +b(i,j,0)x(i,j-1),k<i,1<j 
x(i,j,0) +b(i,j,0)x(i,j 1),k= 4151<i x(i,j,0) =  x(i,j,0)+a(i,j)x(i-1,j),1k<i,1=j 
(c(i,j) x(i,j))/ (a(i,j,0) +b(i,j,0)),k=  = j 
The localized algorithm can be implemented on array architecture by projecting the 
dependence graph onto the (zi, z2) space, which results in the architecture as shown in 
Figure 3.33. b33 
b22  b23 
b11  b12  b13 
b
 
b34
 
b24 
/214 
127 
b55 
b45 
b35 
b25 
b15 
Figure 3.33: The array for the matrix Lyapunov equation. 
3.7 Conclusions 
In this chapter, a new initial specification ADIO was proposed which resembles 
most original algorithmic specifications for many applications. For example, most DSP 
and matrix operation algorithms are initially specified as ADIO. The roles played by coor­
dinate systems were then studied and the limitations of single coordinate system  were 
identified. ifrifelti-comiinate systems was introduced and defined, together with RMCS. 
Then the dependence structures of ADIOs were investigated and the iteration spaces for 
index space expanding were also identified. A procedure was developed for the transfor­
mation of ADIOs into RMCS specifications. The introduction of MCS technique makes it 
possible to localize global dependencies by the use of unidirectional data propagation 
solely, in contrast to the multiple directional propagation technique which requires multi­128 
ple communication channels for a variable. This unidirectional propagation technique can 
thus reduce the communication cost and the complexity of control mechanism of VLSI 
implementations. Moreover, MCS technique allows us to synthesize algorithms whose 
index mapping matrices are not totally unimodular, which was not possible for conven­
tional methodologies. Furthermore, the use of ADIO as the initial specification for synthe­
sis allows us to derive VLSI architectures for many applications systematically and also 
allows us to explore the full solution space of an algorithm. 
This chapter also provided a procedure for the construction of a system of MCS for 
a given algorithm and an algorithm to determine if a given algorithm can be transformed 
into a corresponding RMCS specification. A procedure was also given for the construction 
of an MCS system for a variable which requires more than one coordinate systems to 
define its index space. A procedure for computing a partitioning of the computational 
domain of a variable has also been provided. Examples have been provided to show the 
application of these procedures. 
It is very desirable to reduce the number of directional propagations for each vari­
able because each directional propagation may later require an interconnection in a VLSI 
implementation. The conditions given in this chapter for the determination of convertibil­
ity of regular specifications are necessary only for the use of null space propagation tech­
nique for localizing dependencies under MCS systems. However, for a given algorithm, it 
does not mean that a given algorithm can not be transformed into regular specifications if 
it does not satisfy these conditions. It does mean that the algorithm can not be localized by 
the use of null space propagation only if it does not satisfy these conditions, and other 
directional propagation may be required. However, even under such circumstances, the 
MCS technique developed in this chapter can still be very helpful for a designer on decid­
ing how to embed each variable array into the computational domain. 129
 
Chapter 4 The Synthesis of Multi-Rate VLSI Systems 
Chapter 2 discussed the mapping of DAREs onto multi-rate array architectures by 
the multi-rate transformation. Chapter 3 introduced the multi-coordinate systems, investi­
gated the dependence properties of ADIOs and developed procedures based on the MCS 
technique for data dependence localizations for ADIOs. This chapter first will investigate 
the scheduling issue for MRA architectures. It will introduce and define the multi-rate 
affine schedules and multi-rate timing functions. Sufficient and necessary conditions will 
be provided for the determination of multi-rate affine schedules. Then the synthesis of 
MRA systems based on the MCS technique will be investigated. A procedure will be 
developed for synthesizing MRAs from ADIOs without the phase components. Another 
synthesis procedure will be provided for ADIO with phase components. Examples will be 
studied to illustrated the applications of these procedures. 
4.1 MULTI-RATE AFFINE SCHEDULES 
Many works have been done by numerous researchers on the computation of sched­
ules for given uniform algorithms [Qui83, Qui84, Rao85, DI86, YC88, Roy88, SF89, 
SF90, WD92]. However, all these techniques are concerned with the computation of single 
rate schedules and their optimizations. As presented in previous chapters, multi-rate arrays 
are advantageous over single rate arrays in many applications. The computation of multi-
rate schedules is essential for designing MRA architectures. Synthesizing multi-rate arrays 
from regular specifications consists of two affine transformations, the multi-rate timing 
functions t (p) and the allocation function a (p) .  This section provides definitions of 
multi-rate schedules, multi-rate timing functions and their characterizations. In the design 130 
of single rate arrays, the computational partial ordering can be determined completely by 
counting the number of clock cycles. Therefore, in single-rate arrays, a schedule or a tim­
ing function can be used interchangeably for specifying the computational partial ordering 
of a given algorithm. However, in multi-rate arrays, in order to determine this partial 
ordering, not only the number of clock cycles, but also the period of each clock signal 
must be considered. In this chapter, the term "a schedule" will be used to define a function 
which computes the number of clock cycles; and the term "a timing function" will be 
employed define the function for computing the amount of time in relation to the common 
clock signal, the fundamental clock rate. 
In order to derive a set of affine schedules for a given ADIO, the dependence struc­
ture of the ADIO must be specified. 
Definition 4.1: Structurally Well Defined ADIO: The following specification: 
U (D + di)  = 
(4.1) f [U (D  + did  1p + dpi, 1)  (Dii, 2p + (IA 2) up  .] 
is said to be a structurally well defined specification of the ADIO. The subscripts ui and 
u1 denote that the index spaces of variables Ui and Ui are defined relative to the coordi­
nate systems dini and C'inj respectively. The origin of Cr will be the same as CS if it is not 
defined otherwise. 
In a regular specification, there exists a partial ordering cc for evaluating the nodes 
of all variables. 0. is pronounced as "precedes". If p1, p2 E SI, then p1 « p2 (p1 precedes 
p2) if and only if the evaluation of a variable at node p2 uses the value of that variable at 
node p 1 .  In other words, p1 a p2 if and only if there exists a direct path from node p1 to 
p2 in the DAG. A similar definition of the partial ordering can be found in [SF891. Com­
putations in an ADIO are carried out only at the integer index points p E SI c Zn of the 131 
standard coordinate system. Operations at those fractional index points pqe 11,7 c Qn  are 
data transferring operations. Before introducing the multi-rate affine schedule and timing 
function, let us define the fundamental schedule and timing function. A fundamental 
schedule and a fundamental timing function for a given algorithm are affine mappings 
which map the index set p E r into a one-dimensional integer number sequence and the 
real time space respectively. 
Definition 4.2: Fundamental affine schedule and timing function: A fundamental affine 
schedule sf(p) = nip+ cf for a given ADIO is a mapping sf : zn + Z which is strictly 
monotone increasing with respect to the ordering induced from the regular specification, 
where nfe Ql" and c f E Q are called the scheduling vector and the translation part 
respectively. The fundamental timing function tf (p) = sf (p)  = (nip + cf) T1 is a map­
ping function tf:  > R such that if 3pi, p2 E 12 A (pi 0. p2) , then tf (pi ) < tf (p2) . 
For convenience, let the period of the fundamental clock signal be defined as 
= 1. Therefore, the fundamental schedule sf(p) and the fundamental timing function 
t (p) will have the same format. Under this assumption, the timing function can be con­
sidered as a mapping tf :  > Z. However, the amount of time should not be confused 
with the number of clock cycles. The clock periods ti of other clock signals will be 
defined relative to tf = 1. 
Without loss of generality, icf is normalized, that is, the gcd (greatest common 
divider) of the elements of nf is 1. tf(p) = nip + cf defines the slowest clock rate in  an 
MRA architecture which in general involves with the most complicated operations. A 
multi-rate schedule si (p) = nip + cj assigns an integer number to each index point in its 
domain and a multi-rate affine timing ti (p) = (nip + cf) ti assigns each index point a 
time value for evaluating variable Ui such that the computation partial ordering is pre­
served. 132
 
Definition 4.3:  Multi-rate affine schedule and multi-rate timing function: A multi-rate 
affine schedule si (p) = nip + ci for U (p) (it has been expanded from an m-dimen­
sional variable Ui (Dip + di) to an n-dimensional array) in a regularly structured specifi­
cation is a mapping si :  > Z. si is monotone increasing with respect to the partial 
ordering from the corresponding regular specification. The multi-rate timing function 
ti (p) = (nip + ci) T./ is a mapping ti :  +R. For 1 5. j 5_ V, there exists a set of inte­
gers 7) so that ni = ran  ri is called the multi-rate clock-rate ratio for variable  If 
q E SZ and q « p, then  ti (q) <  (p) .  The period of the clock frequency for 
: Qn  Q will be defined as t =  1/ ri. 
The fundamental schedule and other multi-rate schedules for each variable form  a 
set of schedules for a given algorithm which assigns proper time instants to all variable 
instants in the computational domain. 
Definition 4.4: Sets of multi-rate affine schedules and timing functions: For a given ADIO 
with V variables, the set of multi-rate affine schedules As = {sr si,  s1,} is defined by 
a  set  of scheduling vectors  rl = {nt nk, ...,nv} ,  a set of clock rate  ratios 
r = {1, ri, r2,  rv} , and a set of translation parts c = {cpci,  . If the set of 
scheduling vector II has only one component, i.e., RI, then the schedule becomes a single 
rate schedule. The set of multi rate timing functions At = { tp t1,  tv} is defined as 
t i (p) = s i (p)  V j E if, 1, 2, ...,  . 
For example, in the decimation filter design discussed in Example[3.151, the funda­
mental affine schedule and timing function are defined as sf(p)  = tf(p) = i + j. The 
multi-rate schedule for the variable x is defined as sx (pq) = 3i + 3j; Vi,j E S-Zq  Q2, 
that is nf = Cl fl and nx = 3 [1  where the clock-rate ratio is rx = 3. However, the 
multi-rate timing function is defined as tx (p) = (3i + 3j) ix;  E Slq c Q2, where 133
 
the cycle period for tx (pq) is Tx = 1/3  .  In other words, three tx cycles have the same 
period as one tf period. 
Lemma 4.1: Given a system of regular specifications: 
(Ui(p) =  ...] 
At = { tp t1,  tv} defines a valid set of multi-rate timing functions for the algorithm if 
and only if it satisfies the following conditions: 
i  V, Ui (p)  Ui(qi) = 7Ci (pqi) >0  icidi> 0  (4.2) 
(2) Vi, j, 1  j < V, Ui (p)  Uj(qi)  (nip + c1) ti > (niqi+ ci)ti  (4.3) 
where p,qES/c 
Proof.  (1) Since Ui (p) depends on Ui (qi) , i.e., Vi,liV,Ui(p) 
therefore, the evaluation of variable Ui (qi) must be performed before the com­
putation of Ui(p). In other words, ti (p) > ti(qi) must be respected, which 
can be represented as nip+ ci>niqi+ ci  ni(pqi) >0 = icidi> O. More­
over, since the dependence relation happens among the different elements of the 
same variable, the clock rate is the same. Therefore, the clock period does not 
need to be considered. 
(2) This is similar to the proof of (1). However, since the dependence relation 
occurs between different variables and each variable may have its own clock 
rate, therefore the clock periods must be taken into account when considering 
which computation is performed first. Counting the number of clock cycles is not 
sufficient in multi-rate schedules. 134 
Mapping an n-dimensional algorithm into an (n  m) -dimensional array architec­
ture, the allocation function a (p) can be represented by an (n  m) x n matrix Aa or m 
normalized projection vectors gi, ..., gm, where gi, 1  i S m, are the right null spaces of 
Aa, that is, Aagi = 0, Vi, 1 5. i  m. In other words, we restrict ourselves to consider the 
case of perpendicular projection only (a space is projected into a subspace normal to the 
projection subspace). The other condition which a multi-rate timing function must satisfy 
is the conflict free condition, that is, if nodes pl, p2 E SI are assigned to the same process­
ing element, then pp p2 should not be evaluated at the same time instant. This condition is 
expressed as: 
p2 E 0, a (p1) = a (p2)  t  * t (p2) or 
\fill' P2 E  ti (Pi) = ti (P2) = a (P  a (P2) 
This condition is called the conflict-free condition. This condition is  necessary 
because each PE (or functional unit) can perform only one computation at any given time 
instant. If each PE contains more than one functional units which can operate indepen­
dently, than the assignment of these function units and the timing function must satisfy the 
conflict-free condition. 
4.2 A SUFFICIENT CONDITION FOR THE EXISTENCE OF MULTI-RATE AFFINE 
SCHEDULES 
Assume the evaluation of I  (q) take one clock period of the clock signal tJ, which 
has a period of tj = 1/ri. The derivation of a feasible schedule for a variable in general 
depends on how its elements are embedded into the computational domain. Let the embed­
ding function a./ :  Qn for variable 15 be a (q)  = pi where q E S2 c of and 
p E Q ". The following theorem relates each individual multi-rate schedule with the funda­
mental schedule. 135 
Theorem 4.1: Let a structurally well defined ADIO be represented as follows with the use 
of the intermediate variable W (p): 
Ui (Dip + di)  = g [W (p)] = f [..., Uj(Djip + dji)  ...] 
Let a (q) = pj,  E { 1, 2, ..., V} be the embedding functions for U1. Then the ADIO 
admits a set of multi-rate timing functions At = { tp t1,  tv} if and only if: 
(1) Vp E Ni(q), q = Dip + di  (nipi+ ci) ti  nip + cf 
(2) Vj, 1  V, Vp E  Nji(q), q = Dix + d3i  (xjpj+ QT./57R + cf 
(3). If y is a ray of SI, then iv > 0;
 
(4). For every vertices in the domain, Vv E  Cj
  0. 
Ui denotes the output variable; Uj denotes the input variables in the ADIO; Ni (q) 
denotes the node set where the output variable Ui (q) is computed from the values at these 
index points; and Nji(q) is the node set where the computations at these index points 
require the value of Uj (q) . 
Proof.  (1) Since Vp E Ni (q), Ui (q)  W (p) ,  therefore, the evaluation of Ui (q) 
cannot be finished until all the computations at nodes p, Vp E  Ni (q) , are fin­
ished. In other words, Vp E Ni(q), ti (pi) > tf(p) must be enforced under any 
circumstances. Since the clock period of ti (p) is ti relative to the period of 
t f (p) whose period ti = 1, therefore, the result is achieved. 
(2) Since Vp E Nji(q),W (p)  Uj (q) ,  therefore, the evaluation of Ui (q) 
must be completed before the computations at Vp E Nji(q) can be started. In 
other words, the values of Uf (q) must be available before starting the 
computations at nodes p, Vp E  Nji(q) .  Therefore, the schedule for input 
variable U./ (q) should satisfy the condition tf(p) > tj(p))  . 136 
(3). By the definition of a ray of a domain 0, we have: 
Va > 0, a E R, V p E SZ= p + ay  SZ 
Then there must exist a point p1 = p + ay 0 with a time value given as 
tJ (p1) = (n. (p + ay) + c.)  ..  Assume that this condition is not true, i.e., 
0, then for a sufficiently large a, ti (p1) =  (p + ay) + ci) J < 0 is 
expected. This is contradict to the definition of a (schedule) timing function, 
which cannot have negative values. On the other hand, if njy = 0, this means 
that all the points lying along the ray of SI are assigned the same timing value. 
Thus the number of points computed simultaneously is not bounded, which 
would require an infinite number of processor and is not feasible. 
(4). This condition is apparent since all point belong to the domain must have a 
non-negative time value. 
Example[4.1]: Consider the scheduling problem for the dependence relation in the 3­
dimensional ADIO studied in example [3.7]: 
Uo(i,j,k) = f[111(2ik,kj), 
The domain is defined as 0 = {pl 0 5 i,  N} where p = 
T 
is the index 
point. The index mapping matrices are: 
0 
Do' 0 10,D1 = [2 0 1
2 -1  1 0 0 1 
The index mapping matrix D1 is not a square matrix so it is not full rank, therefore, 
the scheduling technique given in [YC88] cannot be employ. 137 
1 
To use the multi-rate schedule for this problem, the embedding functions must be 
determined for each variable. As shown before, the embedding function ao : Z3 > Z3 for 
U0 (p) is given as a o(i, j, k) = (z1, z2, z3) , where i, j, k are not relative to any specific 
coordinate system, however, z1, z2, z3 are relative to the standard basis. Again, let 
_-, T  1  r  _71 T 
U1 = [1/2 0 q , u2 = Lc) 1 oj  be  the  basis  of the  coordinate  for  U1,  i.e., 
CI  = { ulp u21} . With the origin of Cl been located at the same location of the origin of 
Cs3, the embedding function for U1 is defined as: 
z1  a 
1 (i,j) = (-2 Z2' 0)  (4.4) ' 
Thus the data dependency is defined as: 
U0 (Zi, Z2, Z3) + U1 (2z12  Z3, Z2  Z3, 0) 
For any point p = [z1, z2, Z3] T, its time value t0 (p) must be greater than the time 
value assigned to the point  z3, z2  z3, 0) for variable U1. That is
2
 
I
 
2z1  z3 
Z1 
1Co Z2  + co  n1 
2  + Cl  T1  (4.5) 
Z2  Z3
 
Z3
 
0
 
The clock signal for U0 is the fundamental clock signal, whose period is defined as 
to = 1. At point p = [k k IdT e SI, Eq.(4.5) becomes 
k 
no  k  0  + ci t1  (4.6) 
k 0 
Since the value of the schedule s1 (p) must be integer numbers, therefore, the first 
k element of a1' Xl, 
1  must divide two, i.e., n11  =  = integer. Let it be x1, 1 '  21c0,r 
2-I T
At point p = [k 0 0]  E 0, we have: 138 
no[01+ co>  [ni [pi +c1  it  k+c > (2no 1 k+c  ) T  (4.7)
1  o, 1  0  1
 
0 0
 
To satisfy this inequality relation, T1  1/2 is required. Let it be Ti = 1/2. This 
procedure can be carried on to determine the other elements of the schedule vectors. As 
the relationship between the fundamental schedule vector and the multi-rate vector is 
defined as ni = nr/Ti, since Ti = 1/2, therefore, ni = 2no. Depending on other depen­
dence relations, no can be determined, thus  can also be determined. 
: nodes for the Ul variable 
A I VA WA I
 
AMMOSIMMI
 
2 
2  Mi11511M11111111 
Z2  111112113/111M1111 
1 
Z1 
Figure 4.1:  The dependencies of U (p)  U1 (D1p)  .  The time values 
assigned to the nodes pq of U1 must be less than the time values assigned 
those nodes p of Uo which depend on them. 139 
After every index point pq E SZ1 c Q2 (which is defined relative to the standard 
basis) has been assigned a time value, it is straight forward to further assign time values 
for each element of U1 relative to its own basis. For example, if pq = (z1, z2, 0) has 
been assigned the time value t1, then the element Uit (2z1, z2) is also assigned the value 
ti. 
Theorem 4.1 can be employed to compute a set of schedules for a structurally well 
defined ADIO, however, it is not a constructive way because the node sets are not explic­
itly defined in the algorithm. Recall that for every convertible ADIO, there exist corre­
sponding specifications with regular dependence structures defined as follows: 
U (p) = f[Ui (p  did ,  Uj (p  ...,Uv(p  dvd] ,Vp E Sli
 
U.i(p  di) ,Vp  11.;
  (4.8)
Uf(p) = 
Uf (Dip + di )u,Vp E 
In this specification, all variables have been expanded so that they are defined over 
the entire computational domain and the dependencies are constant. As it was shown in 
example[3.15], since this specification is regular, therefore, it is possible to derive single 
rate schedules for this algorithm. However, as it was shown in that example, the use of sin­
gle rate schedule may have many undesirable properties for VLSI implementations (e.g., 
multiple interconnections among processors and non-nearest neighboring interconnec­
tions). Therefore, multi-rate array and multi-rate schedule suit better for VLSI implemen­
tations. To derive a set of multi-rate schedules for the specification given in Eq.(4.8), we 
have the following theorem: 
Theorem 4.2: For a given system of regular recurrence equations defined as: 140 
U (p) = f [U (p  aid ,  Ui (p  did , ...,Uv(p  dvi)] , Vp
 
Ui  did ,  p  SI;  p e SI
  (4.9)
uf(p) =  + di )  Vp E O.; 
where d.E Zm is the translation part in the original index mapping function. Then 
As = {Sp si,  sv} defines a set of multi-rate schedules if and only if it satisfies the fol­
lowing conditions Vj E tf, 1, 2, ...,  : 
(1). VPq E  cr./ (q) = pq, si (pq) = integral;
 
(2).  1;
 
(3). If y is a ray of SI, then niy > 0;
 
(4). For every vertices in the domain, Vv E 0, nv +  0.
 
Proof. (1). Since all operations of Uj are synchronized by the clock signal 4) , non-
integral number cannot be used to define a clock signal. Moreover, pq corre­
sponds to an integer point q E Zm if it is defined in relative to the coordinate 
system C7 for variable Uj, any action happening to a variable can only happen 
at integer number of clock signals. Therefore, sj (pq) must be an integer num­
ber to be a valid timing function. 
(2). Recall that si (p) = nip + ci computes the clock cycles for index point p. 
Since Ui (p)  Ui (p  aid , the time value assigns to the point (p  did must 
be at least one clock cycle earlier, that is, the condition sj (p)  s (p  did 
must be respected. sj (p)  si (p  aid =  ?_ 1. The clock period is 
ti, therefore, at least ti time difference should be assigned to points p and 
(p  aid . 
(3). The same as the proof for theorem 4.1 (3). 
1 141 
(4). The same as the proof for theorem 4.1 (4). 
For a given localized algorithm, condition (1) of theorem 4.2 can be used to deter­
mine the clock rate ratio associated with each variable. Condition (2) can be used to for­
mulate a linear programming problem for solving a set of multi-rate timing functions for 
the algorithm. Moreover, if we consider the dependence vectors in the vicinity domain as 
different dependence vectors, then Condition (2) can also be used to decide the clock rate 
ratio. Condition (3) is used to ensure that the clock increases in the direction of the ray 
thus that the time value will not become negative. And Condition (4) is used to determine 
the translation parts ci of the schedules. 
Remarks: Conditions (2), (3) and (4) in theorem 4.2 are similar to the Quasi-Affine-Time-
Function (QATF) given in [Qui84]. However, the index point in [Qui84] is strictly integer 
vector and it is concerned about the single rate schedule only. Here we are dealing with 
fractional index vectors and multi-rate schedules. Therefore, Condition (1) is necessary. 
The number of clock cycles must be integer numbers, however, the time value relative to 
the fundamental time function can be fractional numbers. 
Geometrically, the clock rate ratios can be determined from the ratios of the scale 
sizes of different coordinate systems along the projection direction. This can be interpreted 
as follows: if the direction of projection is considered as the time axis, then the scale sizes 
of each coordinate along this time axis mark the clock periods for different clock signals. 
Example[4.2]: Consider the localized decimation filter algorithm represented as follows: 142 
(For(0.5.i,2<j5.9)
 
y(i,j)=y(i,j+1)+h(i,j)x(i,j)
 
h(i,j) = h (i  1,j)
 
x(i,j) = x(i +1,j +3)
 
For(0i3Oj2)  (4.10) 
x(i,j) = x (3i j)ux 
h(0,j) =h(j)uh 
y(i,N) =0 
The dependencies in the above specification are given as: 
d  [0], d  [1], d  d  [-1  d 
y  h  x xi  x2 
where dx1, dx2 are the dependence vectors of variable x in the vicinity domain. They have 
the same orientation as dx but their norms are different from dx's. The index spaces for 
variables y and h are defined in the integer lattice points p E Z2. However, the index 
space of x is defined as: 
pq E  (3, 9), (3  9), (3  9), ...} 
Let the schedule for x be sx (pq) = itxpq + cx and the schedule for variables y and 
h be defined as s (p) = icp+c. According to condition (1), if the elements of it and c 
are  integers,  then  the  schedule  s (p) = ap+c will be  integral. However, for 
sx (pq) = 7c..A + cx, the first element of xx = Prx1  must be dividable by 3. This 
determines the clock rate ration rx = 3. Higher ratios are possible, e.g., rx = 6. However, 
if rx = 6 is selected for this algorithm, then dummy variables must be introduced to 
ensure the correct timing relationships. Introducing redundancy is not what we want. 143 
This clock rate ratio can also be derived from dx1, dx2 by the application of Condi­
tion (2). However, the integer condition is still necessary here. Otherwise, the number 4 
will also satisfy Condition (2), but the number 4 cannot be used as the clock rate ratio for 
this problem. From condition (2), we have the following inequalities: 
pci Irj  = -R2 ?. 1  R2  -1 
Rdh = Pri 7tj 01 = n1 [  1  (4.11) 
axl. <-1
 
,,/rxdx = PCx1 itx2] [1 = -Xxl -31Cx2  1  ct  < 1/3

x2 
These inequalities define a convex hull on which the parameters for the schedule can 
be selected. This convex hull is represented by the shaded area in Figure 4.2. 
Figure 4.2: The convex hull defining the parameters of the schedules. 
If we select the vertex point  of the convex hull as the schedule vector, that is 
=  ,  then we have the fundamental schedule given as s (p) = z1 z2+ c. Then, 
from the relationship given in the definition, that is sx = rxs, we can easily derive the 
multi-rate schedule for variable x to be sx(q) = 3z1 3z2 + cx. This selection satisfies 144
 
Condition (3) since the ray of the domain is y =  T. From Condition (4) and a vertex 
T,
point of L/, v = [3  we can determine that c = 6 and cx = 18. 
4.3 THE SYNTHESIS PROCEDURES 
Chapter 3 provided procedures for dependence localization for ADIOs. Last section 
developed theorems for computing scheduling and timing functions for ADIOs and their 
corresponding regular specifications. This section provides systematic procedures for 
mapping ADIOs onto multi-rate VLSI arrays. 
4.3.1 The synthesis of ADIOs without the phase component 
This section outlines the synthesis procedure for the synthesis of MRAs from 
ADIOs without polyphase components. The synthesis of ADIOs with polyphase compo­
nents is studied in next section. The synthesis procedure consists of the following steps: 
1.	  Extract the sets of index mapping matrices from the given ADIO specification, 
using theorem 3.3 and 3.4 to determine if the given ADIO is regular specification 
convertible. If yes, go on to step 2; otherwise, use multiple directional propaga­
tion. 
2. Determine the iteration spaces for each variable. 
3.	  If IA) > 1, select an index mapping matrix D11,15.1 IN as the matrix for the 
construction of a coordinate system for Up 1 5 j 5 V. Invoke Procedure 3.1 to 
construct an MCS system for the ADIO. 
4.	  If the index mapping matrices of variable U3,15j 5 V do not satisfy the suffi­
cient condition Eq. (3.26), invoke Procedure 3.3 to compute a partition of its 
domain. Then invoke procedure 3.2 to construct a system of coordinate systems 
for each region of the domain. 
5.	  For each variable  1 5j 5 V, select an iteration orientation for it and decide 
the origins for each coordinate system. Find the embedding functions for each 
variable and embed each variable array into the computational domain Si 145
 
6.	  Transform the ADIO into a system of regular specification by removing broad­
casting variables using pipelining technique along their iteration spaces respec­
tively. 
7. Determine an allocation function a (p) = A + ca. Compute a set of multi-rate 
timing functions At =  tj, t1,  tv} . Apply these functions to the regular spec­
ification, derive a regular array architecture and design the internal PE structure. 
8.	  If the resultant array satisfies the specification, stop; otherwise, select another 
Di 1 and go back to Step 3 and continue to derive another implementation. 
Remarks: In the above procedure, step 1 examines if the dependencies of a given ADIO 
can be localized using null-space propagation. If it can be localized by null-space propaga­
tion, the procedure finds a proper MCS system for the localization. However, if it does not 
satisfy the conditions, then multiple directional propagation must be employed. If the 
index mapping matrices associated with the output variable do not satisfy the conditions, 
the priority should be given to finding the embedding functions of the output variables. If 
the variable is an input variable, then the priority should be given to finding the embedding 
functions which can embed the entire m-dimensional array into the computational domain. 
For algorithms with phase components in their index quantizations, such as the 
interpolation filter example, the synthesis procedure presented above needs to be modified. 
This will be discussed in next section. 
The following example illustrate how to use the synthesis procedure for mapping 
algorithms onto VLSI arrays. 
Example[4.3]: Two-dimensional decimation filter. A 2-dimensional decimation filter is 
defined as: 
IC  L 
y (i, j) =  h (k,  x (A 1  k, M2j  (4.12) 
k = 01= 0 146 
where x(ij)  is the input signal array  (let  its  size be N x N), h(k,l)  is a 
(K+ 1) x (L + 1) rectangular weight window, and y(i,j) is the output signal array, M1 and 
M2 are the decimation factors. For the sake of simplicity, in this example, we are going to 
consider the following specific filter with M1=M2=2 and K = L = 3 : 
K L 
0  N + 1  I y (i, j) =  h (k, 1) x (21  k, 2j  1),  (4.13) 
k = 01 = 0  L  2 j 
This algorithm has a four-dimensional computational domain which is defined as: 
= { 0  i,j  , 0 S k  K, 0  L} under a four-dimensional standard coor­
2 
dinate system q. Let us define a point in C as (i, j, k, 1) . The first step is to extract the 
index mapping functions for each variable in this ADIO. They are specified as follows: 
D  [ 1 0 0  D  [0 0  1 01  D  20 -1  0  (4.14) 
Y  0 1 0 0  h  0 0 0 1  x  0 2 0 
Since every variable in this ADIO has only one index mapping function, according 
to Theorem 3.3, Eq.(4.12) is RMCS convertible. The iteration spaces of each variable are 
determined as: 
(D  [00 0 1 IS 
[1 0 0 0 ISh =  tt(Dhi =  0100  (4.15) 
D  [1 0 2 ISx= 
From the index mapping matrices specified in Eq.(4.14) and following the Proce­
dure 3.1, we can easily determine that there is only one coordinate system for y and one for 
h, since there is only one possible submatrix extracted from either Dy or Dh which is not 
singular. These submatrices are defined as: 147
 
Dy({1,  =  [1 (1,D ({3 4}) =  [1 (4.16)
0 1 h  0 
The coordinate systems for variables y and h are defined as: 
uy2}, uyi 
[1 0 0 6, uy2 = [0 1 o 
(4.17) 
=  ill 1 ,  = [o o 1 o] , tit  = [o o o 
No other coordinate system for either y or h arrays is possible. (Here we are con­
cerned with different base vectors only, different origin has not been taken into account for 
different coordinate systems). In contrast, we have up to four different coordinate systems 
for the indexing of variable x. they can be derived from the following four nonsingular 
submatrices of Dx: 
[2 0 [2 (,D  ({1 4})
02 x  0 -1 
(4.18) 
ro 11D  r_i {3
24 j  x  LO 1 
Note that the total number of 2 x 2 submatrices of Dx is six but two of them are sin­
gular matrices (rank 1). These rank deficit matrices cannot be used to determine coordi­
nate systems for the 2-dimensional variable. Each valid submatrix will define a coordinate 
system for x. Following Procedure 3.1, we have the following corresponding coordinate 
systems: 
C21=  ,  = [1/2 o 0 6,  = [0 1/2 0 d 
=  {1.1x12,  ux12  [1/2 0 0 6, ILL = [0 0 0 
(4.19)
C23= { uL, uL} , ux3 = [0 1 /2 0  ux3 = E0 0 1 
Cx24 =  ful4,  , ux14 = [0 0 -1 6, uL = [0 0 0 I] 148 
Among these coordinate systems, only C11 derived from D x( { 1, 2} )  lies at the 
boundary of the computational domain Si It is used to imbed the input variable x array at 
the boundary of Si which is shown in Figure 4.3. Because we can present at most three 
dimensional subjects graphically, we must decompose the four-dimensional computational 
domain down to two subspaces, that is, the (z1, z2) and the (z3, z4) subspaces. The input 
variable x and the output variable y are imbedded into the (z1, z2) subspace. The h array is 
embedded into the (z3, z4) subspace. 
Z 2  Z4 
2 
limormilimasmos
 
Ifuliffiihrodalis
 
IIIIMIEMMI101111 
Z1
 
Z1  Z3 Z3 
Figure 4.3: The MCS system for the 2-dimensional decimation filter. The 4­
dimensional computational domain is decomposed down into two 2-dimen­
sional subspaces. 
The next step in the synthesis procedure is to find an embedding function for each 
variable. Iteration direction is selected for each variable for index space expansion Let the 
coordinate systems for x and h have the same origin as CI, i.e., at (0, 0, 0, 0) T. Let the 
origin of the coordinate for y be at (0, 0, K, L) T. Let the iteration orientation for y be 
[0 0 1  and [0 0 0 fl (along the positive directions z3 and z4 axes respectively), let the 
iteration orientation for h be [0 0 1  and [0 0 0 1] (along positive directions of z1 and 149
 
z2 respectively), and x along [1 0 2 0] and [0 1 0  directions, then the embedding func­
tions are given as: 
ay : Z2 > Z4  ay (  =  [Z  Z2 °  12] 
ah z2 + z4  ah (  ) = [0 0 z3 z,j 
(4.20) 
: Z2 -+ Q4  ax  i  ° 
After embedding all variables into the computational domain SI, we can perform 
index expansion for all variables to remove those global dependencies and transform the 
ADIO into a system of regular specifications. However, if we perform the index space 
expansion operation for the output variable y alone, then the ADIO in Eq.(4.13) is trans­
formed into a system of ARE as given below: 
For 05i,j<I N + 1
 
L  2 j
 
For(OSk5K)
 
For(0515L-1)
 
y (i, j, k, 1) = y (i, j, k, 1  1) +h (k, 1)x (21 k, 2j  1)  (4.21) 
For (1 = L)
 
y (i,j, k, 1) = y (i, j, k,  1) + y (i, j, k  1,1) + h (k, 1) x (21  k, 2j  1) 
y (i,j, k, 1) =0 
y (i, j, 1, L) = 0 
Eq.(4.21) is just one of many possible SARE specifications of the original algo­
rithm. By selecting different iteration orientations, we may arrive at different SAREs. Per­
forming index space expansion for h array is similar to that of the index space expansion 
for y except with different iteration space and orientation. 150 
The index space expansion and dependence localization for x dependencies deserve 
our special attention. Even though in multi-coordinate systems, they are treated by the 
same way as the other variables. After the embedding, all dependencies are bounded in 
each null space only. Then, we can localize those broadcast global data dependencies of x 
by propagating the data samples of x(i, j) along the null space of Dx, that is: 
T 
(D	 )  [1 0 2 0 1
 
x  0 1 0 2
 
This null space is 2-dimensional. The null space propagation can be carried out by 
propagating the x elements first along the r  0 2 OJ  direction, then along E0  1 0 
T, or 
vice versa. This process is illustrated in Figure 4.4, where we have decomposed the 4­
dimensional space down into two 3-dimensional subspaces. 
/ 
/ / I / I
Z1	  I 
1 
Figure 4.4: The null spaces for propagating the elements of x variable array 
to localize the dependencies. 
After the null-space propagation for dependence localization, the algorithm in 
Eq.(4.13) can be represented as a system of regular specifications in an MCS system as in 
Eq.(4.22). The window size parameters are given as (K+ 1) x (L+ 1  For simplicity, we 151 
will consider the special case with K = L = 3, i.e., a 2-dimensional decimation filter of 
order 4 x 4. 
i,j<1 N + 1 1) For (0 
L  2 j 
For(OkK)
 
For(0 l_L-1)
 
y (i,j, k, 1) = y (i,j, k, 1  1) + h (i,j, k, 1) x (i,j, k, 1)
 
For (1 = L)
 
y (i,j, k, 1) = y (i,j, k, 1) + y (i,j, k  1,1) + h (i,j, k, 1) x (i, j, k, 1)
 
y (i, j, k, 1) =0
 
y (i,j, 1, 3) = 0
 
For(2k5..K,1= 0) 
x(i,j, k, 1) = x(i  1,j, k  2,1) 
For(1 5.1 5.L)  (4.22) 
x(i, j, k, 1) = x (i,j 1, k, 1  2) 
For(0.k 5_1,1=0) 
x (i,j, k, 1) = x(21 k, 2j 1) 
For (i = 0,j = 0) 
h (i, j, k, 1) = h (k, i) uh 
For(j=0,0<i<LN 2+1) 
h (i, j, k, 1) = h (i  1, j, k, 1) 
For (0 <j <I N + 1 I)
 
L  2 j
 
h (i,j, k, 1) = h (i,j 1, k, 1) 152 
To see a more complete dependence structure of this algorithm, let us project the 
localized DAG along the z2 direction. Figure 4.5 shows this projected DAG partially. A 
small part of the dependencies is shown. After the projection, the entire subspace along the 
z2-axis is projected to those points where z2 = 0 and all the elements of each row x(i) of 
the input data array x(i, j) is concentrated to a single point in the projected DAG, which we 
present as x(i, *). It is a very involved process to express the complete projected DAG in a 
three dimensional space, and if we do so it may do more harm than good. Here we only 
show the dependencies for z3 = 0 because all other dependencies for z3  0 are similar to 
those at z3 = 0. 
Figure 4.5: The projected DAG under an MCS system. (a) shows the depen­
dence structure at z3 = 0 and one layer; and (b) shows the detailed dependen­
cies at z4 = 0 and z4 = 1. 
With the understanding of the VLSI implementations of the 1-dimensional decima­
tion filter MRA arrays, it is not difficult to derive an MRA architecture from this P_DAG. 153 
If we have the luxury to afford a 3-dimensional processor array, then Figure 4.6 shows a 3­
dimensional MRA array processor architecture. The time required for processing an 
N x N image frame is 0(N). The array has two clock frequencies 20 and 4. 20 is the clock 
signal controlling the operations of variables x and 4) is the clock signal for y. x(i, *) and 
y(i, *) represent row-i of x and y respectively in Figure 4.6. 
y(0,*)  Z3 
y(1,*) 
63  4:0 
a 
z1 
x
 
x(0, 13)
 
x(1,*)
 
x(2, *
 
x(3, *)
 
x(4, *
 
x(5, *
 
x(6,*
 
(a) 
Figure 4.6:  A three dimensional MRA implementation for the ADIO 
Eq.(4.13) with window size 4 x 4 and the input sample array x(i, j) with size 
N x N. The MRA array size is then 4 x 4 x (N + 1)  . (a) shows part of the 3-D 
array; (b) shows a 2-dimensional array in more detail at each layer along z1. 
In reality, for VLSI implementation, we can hardly afford a three-dimensional 
implementation. The number of processing elements required by this 3-dimensional array 154 
is tremendous even for a moderate image size of 512 x 512. Therefore, instead of a 3­
dimensional implementation, a 2-dimensional MRA implementation would be much more 
practical for many applications. To reduce the dimension of the array, we need to project 
the projected DAG once more along another direction. In order to obtain an array whose 
size is almost independent of the problem size (it cannot be completely independent of the 
problem size because the number of internal storage elements required by the array will be 
a function of the problem size, but the others are independent of the problem size), we 
would project the projected DAG along z1 direction. 
Let the allocation function a (p) = Aap be defined by Aa = 
° ° 1 ° As a result, 0001 
each node p = Ci j k T E Z¢ in the DAG is mapped onto a 2-dimensional processor array 
as a (p) = Lic 
T 
The resultant multi-rate array is given in Figure 4.7, which is the same as the one 
given in chapter 2 (Figure 2.8). 155
 
44) 
(b) Internal block diagram of PE(i, 0) 
y 
0 
x 
I I  I  x
 
(c) Internal structure of non-boundary PEs 
X00
 
xol
 
(a) The MRA system architecture 
(d) Internal structure of PE(j, N) 
Figure 4.7: An MRA architecture for the 2-dimensional decimation filter 
with decimation factors M1 = M2 = 2 (a); Diagrams (b), (c) and (d) illus­
trate the internal block diagram of the processing elements. 
4.3.2 The synthesis of ADIOs with phase components 
Last section presented the synthesis procedure for ADIOs without considering the 
phase components. This section will generalize the synthesis procedure to allow the syn­
thesis of ADIOs with polyphase phase components. 156 
Assume that a given ADIO with a phase component 0 5 q) 5 L 1, the synthesis 
procedure starts at setting the polyphase component to zero, i.e., let cp  = 0 and obtaining 
its dependence graph DAG9= 0 using Procedure 4.1. After obtaining the MCS system and 
the dependence graph DAG9= 0, the complete dependence graph can be obtained by 
shifting DAG, = o in corresponding directions as q) increases. This procedure is described 
as follows: 
Procedure 4.2: Synthesizing ADIOs with Phase Component 0 5  L  1 
1. Let (p = 0, invoke Procedure 4.1 to construct the dependence graph DAG,	  0 with a 
node called the starting node which is located at the origin of the standard coordinate 
systems. A corresponding regular specification of the ADIO with 9 = 0 also can be 
obtained. The above operations should have finished the construction of the MCS sys­
tem, the selection of embedding functions, and the index space expansion operation. 
2. Let() = (p+1,if(p<LDo 
2.1. for (1	  V) , for Uf (r (p) + (pi) , if (pi = p (i.e., (pi * 0 in the ADIO), let 
the starting node of DAGq,  0 be located at ((p)  change the corresponding labels 
in DAG, = 0 from U.(r. (p)) to U. or. (p) +  . This is the dependence graph for 1  1  .1 1 
the phase component at 9, denote it as DAG(. 
2.2. In DAG if the label U. (	 ) of a dependence vector pointing into a node p is the 
same as those labels in DAG  = 0 to DAG  1' in other words, variable U. does not 
have a polyphase component, pipeline the signal by drawing a dependence vector 
from the corresponding node in DA G9 _1 to node p and remove the original vector 
and label. 
3. In the DAG constructed in Step 2, index nodes with dependence vectors going into and 
leaving them (fractional index points relative to the standard basis are allowed) are the 
nodes where computations need to be performed. On the other hand, those nodes with 
dependence vectors cross over each other are the dummy nodes where no computations 
will occur, they may be removed. If the dummy nodes are removed, re-indexing and 
scaling may be required. This step is optional. 
4. Find an allocation function a (p) and a set of multi-rate timing functions. 157 
Recall the subband coding system presented in Chapter 2, where the decimation and 
interpolation filters are the key construction blocks. 
Example[4.4]: A three-bands subband coding system is depicted by Figure 4.8. 
YO
 
HO(z)  Coder/  3  GO(z)
Decoder 
or 
H- Comm. H1(z) Hi 3 
Y1
Gl (z) Channei""  3 
Y2 
H2(z)  3  G1(z) 
Analysis Section  Synthesis Section 
Figure 4.8: The three-band critically sampled subband coding system. 
In the analysis section, the decimation filter bank can be implemented using the dec­
imation filters derived in Chapter 3. Because the data dependencies of each decimation fil­
ter are identical except the values of the filter coefficients are different, the data path for 
propagating the x variable can be shared by all decimation filters. If the MRA architecture 
in Figure 3.31 is employed to implement this filter bank, the analysis section has a struc­
ture as shown below in Figure 4.9: 158
 
x  x 
Yo  hog  ho,8  ho,7  h0,5  h0,4  ha 3  h0,1  hao  yo 
Y1  h1,9  h1,8  h17  h1,5  h1,4  h1.3 .--0 hia  h1,1  h1,0  Y 
Y2 
hZ 9  h2,8  hz7  h25  h2,4  hz3  hz2  h2,1  Y2 
(a) The three-band decimation filter bank 
x x  11111. X 
Yo(j+1) 
Y1(j +1) 
,Y2(j +1) 
Y2(j) 
L: Latch. D: Delay. 
(b) Internal block diagram of the processing element 
Figure 4.9: The analysis filter bank of the three-band subband coding system. 
The synthesis section of the subband coding system consists of interpolation filter 
banks. An interpolation filter with up-sampling factor L can be represented as: 
0  L 1 
rli41  (4.23) 
x(iL+ 9) =  gUL-4-0y(ij) 
1 =0 159 
The computational domain is defined as Q = {  .j 10 S i, 0 .54 S N 11. For a 
three band subband coding system, we have L = 3. For the sake of simplicity, let 
N = 10. The synthesis procedure of this ADIO is as follows: 
Step 1: Set the phase component cp = 0. Eq.(4.23) is represented as: 
3 
x(3i) =  g (3j) y (i j)  (4.24) 
1 =0 
The null space of each index mapping matrix and the corresponding coordinate sys­
tems are defined as follows: 
k t (Di) =  ( Dg) 
Cz = {to,  = [1/3 6, cig =  [0 1/3] 1, ug = [0 1/3]
 
ug,  , ug, =  = {11g2}, 11g2 = [CI -1]
 
If we select the iteration orientations for variables x, g, and y to be along the direc­
tions [0 1], [1 Cl [1  respectively, then Eq.(4.24) can be transformed into the following 
system of regular specifications where 1  = {PI (p e  Z2) A (0  i, 0 Si S 3) 
V(i,j) E
 
.X(i,j) =i(i,j 1) +g (i, j)y (i  1, j  1)
 
g (i,j) = g (i  1,j)
 
y (i,j) = y (i  1, j  1)
  (4.25)
(i, 0) = (30 la
 
g (0,j) = g (3j)
 
Y  = Y (i) uy
 
Step 2: Construct the complete regular specification from Eq.(4.25). First introduce the 
phase component back into the index quantization, then increase the phase component cp 160 
from 0 to 1 and to 2 respectively. Graphically, relocate the starting node of the dependence 
graph DAG9= 0 from p =  [o  to p = [1/3 1/3] and to p = [2/3 2/3] respectively. 
These operations generate the following complete system of regular specifications for Eq. 
(4.23): 
flq = { kqiji (ig,4 E Q3) A (0  iq, 0 _,jq  3) A (iq jg = integer) } 
V (i, j)  E  Sig 
.Tc (i, j) =  (i, j  1) + g (i  1, j)y (i  j 
g (i,j) = g (i  1, j)  (4.26) 
1 .  1 
Y  Y (I­
(i, 0) = x (30 ui 
g (0, j) = g (3j) 
Y (i3O) = Y 
Where  Q3  denotes the set of fractional numbers with the denominator equals to 3, 
that is Q3 =  p E (Z2/3) } . The complete dependence graph is given in Figure 4.10. 
The deep dark nodes and dependence vectors represent those of DAG( = 0; the gray ones 
represent those of DAG, =1 and DAG9= 2  respectively. DAG9  and DAG". 2  are = 
obtained by relocating the starting node of DAG9 =0. 161
 
x6 70 71 72  73 74 75  77 78 79 
89 
88  111111111111111111Pr, 
1111111111PA 
87 
86  11111111111210 
1`5­ -pripr
 84. 
83  , AIIIIIIIIII A
\ham. 411111TV
. 
80 
ANEW
MS Ifir' 
fil 
AM 
... 
I  2  3_ 
X0  x2  x3  x4  X5  x6  x7  X8  x9 
Figure 4.10: The DAG for the interpolation filter. The dark nodes and vectors 
represent DAG, 0; the gray ones for DAG, =1 and DAG,  2. 
Step 3: This is an optional step. It is possible to derive the set of multi-rate schedules from 
the DAG constructed in step 2 as shown in Figure 4.10. Computation is to be performed at 
a node (iq  jq = integer',. There are dummy nodes in this DAG. Without changing the 
input-output relationships, we can eliminate these dummy nodes. This can be done by re-
indexing the variables and making the scale of the coordinate system for variable g be the 
same as the standard basis. Changing the scale size for the coordinate for g is because the 
index space of variable g is used as the processor space. The processor space is preferred 
to have the same scale size as the standard basis. By eliminating the dummy nodes in Fig­
ure 4.10, the new DAG is as given in Figure 4.11. 162 
X0 Xi x2 x3 x4 x5
 
Figure 4.11: A modified DAG (partial) of the DAG in Figure 4.10. 
The multi-rate timing functions can be determined as follows: 
ltdg= pri .2]  =n1 >_1 
ndy = [ni  ..1+ n2  1  n2 >_0  (4.27) 
1?. 1/2 
\nidi = [nil i r )72] [1= 27ril+ 3ici2  1 
x­
Gcc_
x2 > 1/3 
These inequalities define a convex hull on which the parameters for the schedule can 
be selected. This convex hull is represented by the shaded area in Figure 4.12. 163
 
Figure 4.12: The convex hull defining where the parameters of the schedules 
can be selected from. 
Projecting the DAG in Figure 4.11 along the z1-direction onto the z2-axis results in 
an MRA architecture as shown in Figure 4.13a, where each PE performs the computations 
involved with a single coefficient  0 54 5 9. The system level architecture of this MRA 
interpolation filter is the same as the decimation filter with the clock rate ratio r = 3. 
However, the data elements of the computed variable x travel at the higher clock rate 
= 30), instead of at a slower rate. Notice that despite that the computed variables prop­
agate at the higher clock rate, the multiplication-accumulation unit is still work at the 
slower clock rate. This is possible because we have eliminated the operations which 
involve with zero valued operands. However, depending on the actual timing function, the 
internal PE architecture may vary with different selections of timing functions. If the 
schedule functions are selected as ty (p) = i (which is equivalent to ty (p)  = i + 2j if it 
is determined from the DAG in Figure 4.10) for variables y and g; and ti (pq)  = 3iq 
(which is equivalent to ti (pq) = 3 (iq + 24) if it is determined from the DAG in Figure 
4.10) where the subscript  denotes that these coordinates can be fractional numbers). 
These schedules are shown in Figure 4.10 and Figure 11 respectively, where the  gray 
hyperplanes show the time instants for variables y, x and g, while the dashed hyperplanes 164 
are the time instants for 7c only. The resultant internal PE architecture is given in Figure 
4.13b. In this MRA interpolation filter, the amount of time allocated to performing an mul­
tiply-accumulate operation is (L  1) ti instead of Lti (where ti is the clock period of 
4)i), this is because the computed partial result is due to arrive at the next PE by (L  1) 017 
clock cycles. 
y 
go g1 g2 g3  g5 g6 g7 g8 g9 x
  x
 
(a) A multi-rate array architecture for the interpolation filter 
C-F-34)y  a  I s 
Mux 
'70)  Cri dm-DD 
L: Latch. D: Delay. Mux: Multiplexer 
(b) Internal block diagram of the processing element 
Figure 4.13: An MRA implementation of the interpolation filter with L  = 3. 
For some applications, if the amount of time (L  1) tix is not enough for perform­
ing a multiply--accumulate operation, then different schedule must be determined to allo­
cate more time for the multiply-accumulate operation. If we select the new schedules as 
ty(p) = i + j for variables y and g for the DAG in Figure 4.11 (or ty (p)  = i + 5j for the 165
 
DAG in  Figure  4.10),  and  ti (pq) = 3 (iq +jq)  for  i  in  Figure  4.11  (or 
tx (pq) = 3 (iq + 54 from the DAG in Figure 4.10), we can have a two level pipeline 
MRA implementation for this interpolation filter with the PE architecture as shown in Fig­
ure 4.14. Now the multiplication operation has been allocated with Lci amount of time 
and the accumulation operation has been allocated with (L-1) ti amount of time. 
4)y 
y
 y
 
D
 
D  DD D 
L: Latch. D: Delay. Mux: Multiplexer; S: Mux selection control signal 
Figure 4.14: A two-level pipeline MRA implementation for the interpolation 
filter for applications require higher filtering speed. 
The selections of different schedule functions can affect the speed requirement of a 
processing element, such as those two different designs shown above. While the selection 
of different iteration orientations for dependence localization may affect the total compu­
tational time and the area of each PE. For this interpolation algorithm, a more efficient 
array architecture can be derived from another DAG by selecting different iteration orien­
tations for .Tc, g, and y as [ci  ,  respectively. The resultant localized DAG 
is as shown in Figure 4.15. Projecting along the direction z1, the resultant MRA architec­166 
tore is given in figure 4.16. The schedule functions can bederived following the samepro­
cedure TheY are given as ty (p)  z...-- i.-4 / + 9 and ti. (pq)  :--- 3  ) + 27 
As compared to the implementation in Figure 4.13, this AIRA architecture in Figure 
4.16 allocates LT.i- amount of time for the	  operation instead of 
(L - 1) i.i.. for this operation. 
6  ..z..,  .ire 
-neure 4.1S: Another localized DAG which may yield a nlore efficient M.RA 
implementation  of theinterpolation filter. 167
 
y  y 
So  g1  82  83  Sa  gs  86  87  88  89 
x  x 
(a) Another MRA architecture for the interpolation filter 
L: Latch. D: Delay. Mux: Multiplexer 
h. 
4)2 
.14)7341y 
D  D D 76-4)  XU) 
(b) Internal block diagram of the processing element 
Figure 4.16: A corresponding MRA implementation for the interpolation fil­
ter from the DAG shown in Figure 4.15. 
The MRA architecture is Figure 4.16 can be used in a similar way to construct an 
interpolation filter banks as the decimation filter banks. Figure 4.17 shows this interpola­
tion filter bank. Figure 4.18 shows the complete subband coding system using these MRA 
filter banks. 168 
D D x(j-4)  lo, co 97, 
1(j-4)  HE  41 
h 
7
 
;C(j _4)  D  D 
Figure 4.17: An MRA implementation for the interpolation filter. 
=3 
x 
Ho(z) 
Hi(z) 
Yo 
Y1 
Coder/ 
Decoder 
Or 
Go(z) 
Gi(z) 
(1)x  =  3 0  112(z)  Comm.  G2(z) 
Y2  Channel 
Analysis MRA array  Synthesis MRA array 
Figure 4.18: The three-band critically sampled subband coding system. 169 
Example[4.5]: In digital audio, different sampling rates have been used for different 
media. Nonintegral sampling rate converters are used for copying materials from one 
medium to another. This is performed through the use of decimation filters with fractional 
decimation factors. Decimation filter with fractional decimation factor M/L can be 
described by the block diagram shown in Figure 4.18. The implementation of such filters 
has drawn a great attention in recent DSP applications. 
Figure 4.19: A block diagram for a decimation filter with fractional decima­
tion factor M/L 
An Nth order decimation filter with fractional factor decimation M/L can be 
expressed as the following set of equations: 
i i ru  (L) ) , z = integer 
0, otherwise 
N -1  (4.28) 
V (i) =  h(j)u(ij) 
j = o
 
y (i) = v (Mi)
 
Direct implementation of the block diagram in Figure 4.18 or Eq.(4.28) results in 
arrays of extremely low efficiency, only one out of (L x M) computed result is truly 
needed to be computed. This efficiency is derived as follows: first, there are (L  1) out of 
L computations involved with the multiplication of zero valued samples; second, only one 170 
out of M computed result is used as the output. Polyphase implementations [Vai90] 
present an efficient approach to improve efficiencies of these realizations, but regular 
interconnection pattern has been greatly hampered by the continuous redrawing of the 
block diagrams. Furthermore, these polyphase implementations cannot be programmed to 
suit for different applications. 
Eq.(4.28) can be transformed into an ADIO specification as follows: 
(0 __(1)5L- 1)
 
iNL-1
 
(4.29) ir y (Li + 4)) = 1 h (Lj + 4)) x (Mi j) 
j = 0 
For the ease of illustration, let us derive the MRA architecture for the decimation fil­
ter with L = 2, M = 3. This ADIO can be synthesized following the synthesis procedure 
to find out the MCS system, carry out the embeddings of variables and perform depen­
dence localization, and obtain a system of regular specifications. Then the multi-rate tim­
ing functions can be derived from these regular specifications. Similar to the designs of 
decimation and interpolation filters with integral factors, different schedules and iteration 
orientations will yield different processing element structures. 
Here we are not going through these steps again for deriving an MRA array for 
Eq.(4.29). After we have derived the MRA structures for decimation and interpolation fil­
ters with integral factors, it is not difficult to derive a structure for the filter with fractional 
factors. It can be constructed by combining the interpolation and decimation filters. There 
three clock rates are used for the propagation of different data. The multiply-accumulate 
unit operates at the slowest clock rate 4), the filter output propagates at the clock rate 24) 
and the input sequence propagates at the highest clock rate 34). The system MRA architec­
ture and the internal PE structure are illustrated in Figure 4.19. These structures share the 171 
basic structure of the decimation and interpolation filters. The operation of this MRA array 
is similar to the operation of the MRA decimation filter and the interpolation filter with 
integral decimation/interpolation factors. 
h0  h1  h2  h3 h4 h5 h6  y
 
(a) Multi-rate array for the decimation filter 
x x 
C=34) 
(b) Internal block diagram of the processing element 
Figure 4.20: An  MRA array of the decimation filter with M = 2/3. 
4.4 Conclusions 
In this chapter we first introduced the multi-rate schedules and multi-rate timing 
functions. Necessary and sufficient conditions have been provided for the determination of 
the existence of multi-rate schedules and multi-rate timing functions. Such conditions can 
be formulated into a linear programming problem to derive these schedules and timing 
functions. Then the synthesis procedure for the automatic mapping of ADIOs without 172 
phase component are presented. Using this procedure, a two-dimensional MRA decima­
tion filter structure has been derived. 
The synthesis procedure has been extended to allow the synthesis of ADIOs with 
phase components. Different applications have been synthesized from their algorithmic 
specifications onto MRA architectures. By controlling the clock rate ratios and by making 
the number of delays in each interconnection programmable, these MRA architectures 
will become programmable and therefore can be used for a larger class of applications. 173 
Chapter 5 
Conclusions 
In this dissertation, we have proposed and developed the multi-rate array as the tar­
get architecture for a larger class of algorithms. We have also developed the synthesis 
technique multi-rate transformations for mapping DAREs onto MRAs. However, beside 
the MRA architecture, the major contributions of this dissertation are the introduction of a 
new indexing mechanism, multi-coordinate systems and the development of a unified syn­
thesis theory based on the MCS technique. This theory enables a designer to start the syn­
thesis procedure from the initial algorithmic specifications, instead of other low level 
specifications. 
Single rate array architecture was discussed and the limitations of SRA architecture 
investigated. The multi-rate array architecture was proposed and developed as a more gen­
eral target model for the synthesis of DSP and other algorithms onto VLSI systems. Algo­
rithms rarely appear as uniform recurrence equations or regular iterative algorithms. These 
specifications must be derived from original algorithmic specifications. Most existing syn­
thesis theories start from these low level specifications and leave to the designer the hard 
task of deriving these low level specifications from their original algorithmic specifica­
tions. Moreover, most existing synthesis methodologies require that the index mapping 
matrices of algorithms be totally unimodular. This requirement has excluded a large class 
of algorithms to be synthesized onto VLSI systems. 
In this dissertation, the concept of multi-rate (systolic) array architecture has been 
revived and materialized by the designs of MRA decimation and interpolation filters. The 174 
introduction and application of the multi-coordinate systems have made it possible to syn­
thesize algorithms from their original algorithmic specifications, including those algo­
rithms whose index mapping matrices are not totally unimodular. Moreover, the synthesis 
procedures developed in this dissertation can be used to map a class of general algorithms 
onto VLSI systems with minimal interconnections. 
In Chapter 2, notations and some basic definitions used in this dissertation are pro­
vided. Then we give a definition for the new target array architecture, i.e., the multi-rate 
array (MRA) architecture. The MRA architecture was not really a new concept; many 
researchers had already mentioned the concept of multi-rate systolic array and multi-rate 
array. It has been believed that such MRA architectures are more realistic and would suit 
VLSI implementations better. They would have higher performances than their single rate 
counterparts [Lev84, Rao85, Kun88, Roy88, LK9O]. These advantages of MRA architec­
tures over SRA architectures have been recognized. However, there was no concrete 
model of MRA architectures, nor any executable MRA architecture showing how such 
architectures operate. 
The real challenges did not come from trying to understand the advantages of MRA 
architectures, but on the nature of a concrete model for MRA architectures, how they oper­
ate, and how to design them. The more challenging task is the development of a synthesis 
methodology for automatically deriving MRA architectures from a large class of algo­
rithms. A class of algorithms termed directional affine recurrence equations (DARE) was 
defined and their properties investigated. The most distinguishing feature of DAREs is that 
their dependence structures can be decomposed into uniform and non-uniform subspaces. 
When these dependence structures are projected along the non-uniform subspaces (g's) 
into their corresponding uniform subspaces, the resultant array architectures have constant 
interconnections. Regular structure is an important property for effective VLSI implemen­175 
tations. Other properties of DARES include families of parallel hyperplanes with constant 
dependence vectors. These hyperplanes are useful for scheduling the multi-rate variables. 
The MRA concept was developed in this chapter by the VLSI implementation of the 
multi-rate decimation filter architectures. MRA architecture is indeed superior to its single 
rate counterpart. It is a suitable target architecture for implementing algorithms, including 
those with non-totally unimodular index mapping matrices. The multi-rate transformation 
technique has been developed and can be used to map DAREs onto regular VLSI multi-
rate architectures. 
The synthesis procedure based on the DARE specifications shares some of the prob­
lems and limitations with existing methodologies. 1) DAREs are low level specifications 
like ARE and RIA, which must be transformed from their original algorithmic specifica­
tions by the heuristic method. 2) If in a given algorithm, there exist more than one DARE 
dependence relationship, but they have different uniform subspaces, or if 1..ti * i.t2, then 
there does not exist any projection vector 11 which can produce a regular array architecture. 
3) If the computational domain has a ray y which is different from the uniform projection 
vector, then the resultant array architecture either has an irregular interconnection pattern 
or an infinite number of processing elements. Both cases are not suitable for VLSI imple­
mentations. Moreover, this synthesis procedure cannot be of much help in the design of 
internal processor structures of MRA arrays. 
In the development of methodologies presented in the last chapter and in those 
reported by other researchers, there has been an implicit assumption that the index spaces 
of an algorithm are defined as relative to a single coordinate system, the standard coordi­
nate system. This assumption has a delicate impact on the analysis of the true dependence 
structures of algorithms. It is this assumption which makes the development of a unified 
synthesis methodology for all linearly (affine) indexed algorithms so frustrating. 176 
In Chapter 3, we first defined the affine direct input output (ADIO) specifications. 
ADIOs are the original algorithmic specifications for most applications. The ADIO speci­
fications aim at reducing redundant computations in algorithms introduced by the use of 
intermediate variables. Intermediate variables are usually employed to transform non-uni­
form algorithms into regular specifications. It was shown that most DSP algorithms, 
matrix manipulation algorithms and many others can be conveniently presented as ADIOs. 
The use of ADIOs as initial specifications allows a designer to explore the full solution 
spaces of given algorithms, which is important for design optimizations. However, the 
synthesis of VLSI systems from ADIOs is not a trivial task, because of the non-uniform 
dependence structures, the use of global operators such as summation and product opera­
tors, and the different dimensional variable arrays in ADIO specifications. However, the 
most challenging difficulty comes from the fact that the index mapping matrices in a given 
ADIO may not be idempotent nor totally unimodular, which are restrictions imposed by 
conventional synthesis methodologies. 
To overcome all of the difficulties in synthesizing MRA arrays from ADIOs, we 
introduced the multi-coordinate systems (MCS) for the data dependence analysis of algo­
rithms and dependence localization. Under a system of multi-coordinate systems, the 
index spaces of different variables in a given algorithm may be defined relative to different 
coordinates systems. Each variable may have its own coordinate systems. In an MCS sys­
tem, there is a standard coordinate system used to define the computational domain and as 
the reference system for other coordinate systems. Each coordinate system may have dif­
ferent dimensions and a different basis. The applications of the MCS technique for depen­
dence structure analysis were illustrated by examples. 
We then investigated the most important dependence properties of ADIOs. A con­
clusion was derived: the right null space data dependence property does not depend on 177 
whether the index mapping matrices are idempotent or not. Those index points lying in the 
same null space always depend on the same data element. This dependence relationship is 
not affected by the idempotent property of the index mapping matrices. However, if all 
index spaces in an algorithm are defined in relation to a single coordinate system, then the 
idempotent condition is necessary for using the null space propagation technique for 
dependence localization. This was derived in [RTRK88] where an implicit assumption was 
made of the use of a single coordinate system for defining the index spaces of an algo­
rithm. 
But, under the MCS system the right null spaces K (D) of index mapping matrices 
can always be used effectively for dependence localization. Such observation is important 
because then unidirectional propagation, instead of multiple directional propagation, can 
be used to localize most data dependencies. This has reduced the requirement for proces­
sor interconnections and simplified the control mechanisms in VLSI implementations. 
Procedures have been provided in this chapter for the construction of multi-coordinate sys­
tems from given algorithms. Constructing the proper MCS systems for given algorithms is 
important because only then data dependence localization can be done through the use of 
null space data propagation solely, and no other directional propagation is required. In 
contrast, those conventional techniques employ multiple directional propagation for data 
dependence localization when idempotent condition is not satisfied. Multiple directional 
propagation requires multiple interconnections for a variable in VLSI implementations. 
The ultimate objective is to map algorithms onto regular VLSI systems. Only algo­
rithms with regular dependence structures can be mapped onto regular array architectures. 
Given an algorithm, it is important to decide if it can be implemented onto regular array 
architectures. Sufficient and necessary conditions are provided in Chapter 3 to detect if an 
algorithm can be transformed into regular specifications. This is followed by the proce­178 
dures for transforming algorithm specifications with global dependencies into regular 
specifications. These transformations require the operation called index space expansion. 
Index space expansion retrieves the iterative and regular properties from an algorithm and 
expands an m-array to an n-array. Examples have been provided for the applications of 
MCS systems for dependence structure analysis and dependence localization. 
The convertibility condition investigated in Chapter 3 was necessary under the 
assumption that only the right null space propagation technique is used for dependence 
localizations. For some applications, this assumption may be too restrictive. If these condi­
tions cannot be satisfied, the multiple directional propagation technique still can be 
employed for dependence localizations. Even for such applications, the MCS technique 
will still be useful in helping a designer decide how to embed each variable array into the 
computational domain. The matrix Lyapunov algorithm was used to illustrate how this 
could be done. 
Chapter 4 dealt with the problems of scheduling algorithms onto MRA architec­
tures. Existing theories are concerned with single rate schedules only. Multi-rate schedules 
and multi-rate timing functions were introduced and defined. Sufficient and necessary con­
ditions were developed to determine if a given algorithm has a set of multi-rate timing 
functions. We have also developed theorems which can be used to formulate a linear pro­
gramming problem for the derivation of a set of timing functions. Procedures have been 
provided for synthesizing ADIOs onto VLSI multi-rate systems. First, a procedure was 
developed for ADIOs without the phase components in their index functions. This proce­
dure later was expanded to synthesize ADIOs with the polyphase components. Examples 
for synthesizing a two dimensional decimation filter, a one-dimensional interpolation fil­
ter, and a decimation filter with fractional decimation factor were presented. These filters 
can be easily used to construct subband coding systems for speech and image codings. 179 
There are several open topics remain to be investigated: 
1. What architecture is best for an application? 
Multi-rate array is a more general architecture suitable for a larger class of applica­
tions. However, for some applications, it is advantageous to use asynchronous implemen­
tations, and occasionally to use broadcasting or bus architectures. For example, to solve 
the matrix Lyapunov equation and the Algebraic Path Problem (APP), systems with broad­
cast ability seem to solve the problem with less time, but the possibly longer delays intro­
duced from this broadcast function must be considered. 
The asynchronous approach can also be used to improve the performance of such 
systems. However, the control overhead for the asynchronous operations may make the 
seemingly faster system too costly and may also be proved to be slower than expected. 
Multi-rate array can also be employed for the design for propagating the computed 
results at a faster rate so that a higher performance is achieved. But the synchronization of 
such an MRA architecture is believed to be more complex than the single rate architecture. 
The single rate array architecture might have the simplest control mechanism, but it is also 
the slowest. So, given an algorithm, is there any method to detect which implementation 
would be the best under certain specification? 
2. Other multi-rate schedules? 
The second interesting topic is the multi-rate schedule problem. If the restriction 
= ran is not imposed, will it produce a better set of multi-rate schedules than with the J f
restriction? 180 
Bibliography 
[AG89]  G. S. Ammar and W. B. Gragg, "Numerical experience with a superfast real 
Toeplitz solver," Naval Postgraduate School, NPS-53-89-008, Feb. 1989. 
[AK78]  J. Agnew and R. C. Knapp, LINEAR ALGEBRA WITH APPLICATIONS, 
Brooks/Cole Publishing Company, Monterey, CA, 1978. 
[Atr58]  A.J. Atrubin, A STUDY OF SEVERAL PLANAR ITERATIVE SWITCH­
ING CIRCUITS, S.M. Thesis, MIT, Dept. of Electr. Engr., 1958. 
[BA93]  D.G. Baltus and J. Allen, "Efficient exploration of non-uniform space-time 
transformations for optimal systolic array synthesis," in Proc. 1993 Appli­
cation Specific Array Processors, Ed. L. Dadda and B. Wah, pp. 428-441, 
IEEE Computer Society Press, 1993. 
[Ban90]  Thomas F. Banchoff, BEYOND THE THIRD DIMENSION, Geometry, 
Computer Graphics, and Higher Dimensions. Scientific American Library, 
New York, 1990. 
[Bat80]  K.E. Batcher, "Design of a massively parallel processor," IEEE Trans. 
Comput. C-29, pp.836-840, 1980. 
[Bet03]  Enrico Betti, Opree matematiche, Vol. 1-2, Milano, 1903-1913. 
[BK81]  R. P. brent and H.T. Kung, "The area-time complexity of binary multiplica­
tion," J. ACM, 28(3), July 1981. 
[BR90]  A. Bennaini and Y. Robert, "Spacetime minimal systolic arrays for Gauss­
ian elimination and the algebraic path problem," In Proc. Int. Conf.  on 
Application Specific Array Processors, pp. 746-757, Princeton, IEEE Com­
puter Society, Sept. 1990. 
[BSG91]  M.K. Birbas, D. J. Soudris,and C.E. Goutis, "Design methodology for direct 
mapping of iterative algorithms on array architectures," Intern. Symposium 
on Circuits and Systems, vol.5, pp.3058-3061, 1991. 
[Bu90]  J. Bu, Systematic Design of Regular VLSI Processor Arrays, Ph.D thesis, 
Delft University of Technology, Delft, The netherlands, May, 1990. 
[Bun85]  J. R. Bunch, "Stability of methods for solving Toeplitz systems of equa­
tions," SIAM J. Sci. Statist. Comput., pp. 349-364, Vol. 6, 1985. 
[Bur70]  A.W. Burks, Editor, ESSAYS ON CELLULAR AUTOMATA, University 181 
of Illinois Press, Urbana, IL, 1970. 
[Cap82]  P.R. Cappello, VLSI Architectures for Digital Signal Processing, Ph.D the­
sis, Princeton University, Princeton, NJ, Oct. 1982. 
[Cap89]  Peter R. Cappello, "A spacetime-minimal systolic array for matrix product," 
In J. V. McCanny, J. McWhirter, and E. E. Swartzlander Jr., editors, Systolic 
Array Processors, pp. 347-356, Prentice-Hall, Killarney, Ireland, May 
1989. 
[Cap92]  P. Cappello, "A processor-time-minimal systolic array for cubical mesh 
algorithms," IEEE Trans. on Parallel and Distributed Systems, Vol. 3, no. 
1, Jan. 1992. 
[Che86]  M. C. Chen, "A design methodology for synthesizing parallel algorithms 
and architectures," J. of Parallel and Distributed Computing, pp.461-491, 
Dec. 1986. 
[Che88]  M. C. Chen, "The generation of a class multipliers: synthesis highly parallel 
algorithms in VLSI," IEEE Trans. on Computers, Vol. 37, no. 3, pp.329­
338, March 1988. 
[CL88]  P. R. Cappello and A. J. Laub, "Systolic computation of multivariable fre­
quency response," IEEE Trans. Autom. Control. 33(6), pp.550-558, June, 
1988. 
[CM85]  M. C. Chen and C. Mead, "Concurrent algorithms as space-time recursion 
equations," In S.Y. Kung, H. J. Whitehouse, and T. Kailath, editors, VLSI & 
Modern Signal Pmcessing, Prentice-Hall, Englewood Cliffs, 1985. 
[CMP90]  Ph. Clauss, C. Mongenet, and G.R. Perrin, "Calculus of space-optimal map­
ping of systolic algorithms onto processor arrays," 1990 Int. Conf.  on 
Application  Specific  Array  Processors,  Sun-Yuan Kung,  Earl  E. 
Swartzlander,Jr; Jose A.B. Fortes and K. Wojtek Przytula, Ed.. IEEE Com­
puter Society Press. 
[Co169]  S. N. Cole, "Real-time computation by n-dimensional iterative arrays of 
finite state machines," IEEE Trans. Computers, 18, pp.349-365, 1969. 
[Cox91]  H. S. M. Coxeter, REGULAR COMPLEX POLYTOPES, Second edition, 
Cambridge University Press, 1991. (QA691. C66 1991). 
[CR81]  C. W. Curtis and I. Reiner, METHODS OF REPRESENTATION THE­
ORY, With Applications To Finite Groups and Orders, vol.1, John Wiley & 
Sons, New York, 1981. 182 
[CR83]  R.E. Crochiere and L.R. Rabiner, MULTIRATE DIGITAL SIGNAL PRO­
CESSING, Prentice Hall, Inc., Englewood, 1983. 
[CS83]  P. R. Cappello and K. Steiglitz, "Unifying VLSI array design with geomet­
ric transformations," in H.J. Siegel and L.Siegel, Ed. Proc. Int. Conf on 
Parallel Processing, pp.448-457, Belaire, MI, Aug. 1983. 
[CS84]  P. R. Cappello and K. Steiglitz, "Unifying VLSI array design with linear 
transformations of space-time," in F.P. Preparata, Ed. Advances in Comput­
ing Research, pp. 23-65, JAI Press, Inc. 1984. 
[CV93]  T. C Chen and P.P Vaidyanathan, "The role of integer matrices in multidi­
mensional multirate systems," IEEE Trans. on Signal Processing, Vol.41, 
No.3, March 1993. 
[CVV88]  J. P. Charlier, M. Vanbegin and P. Van Dooren, "Systolic algorithms for 
digital signal processing," Philips Journal of Research, Vol. 43, pp. 268­
290, Nos 3/4, 1988. 
[DD90]  Alain Dane and Jean Marc Delosme, "Partitioning for array processors, TR 
90-23. Lah de Informatigue 
[DI85]  Jean Marc Delosme and Ilse C. F. Ipsen, "Efficient systolic arrays for the 
solution of Toeplitz systems: An illustration of a methodology for the con­
struction of systolic architectures in VLSI," Research Report YALEU/DCS/ 
RR-370, June 1985. 
[DI86a]  Jean-Marc Delosme and I.C.F Ipsen, "An illustration of a methodology for 
the construction of efficient systolic architectures in VLSI," In Proc. 2nd 
Int. Symp. on VLSI Technology, Systems and Applications, pp. 268-273, 
Taipei, 1985. 
[DI86b]  J.M. Delosme and I.C.F Ipsen, "Systolic array synthesis: computability and 
time cones," in M. Cosnard, P. Quinton, Y. Robert and M. Tchuente, Ed. 
Parallel Algorithms & Architectures, Elsevier Science Publishers B.V. 
(North-Holland), pp. 295-312, 1986. 
[Don89]  V. V. Dongen, "Quasi-regular array: definition and design methodology," 
Proc. of Intern. Conf. on Systolic Arrays, 1989. 
[Duf78]  M.J.B. Duff, "Review of VLIP image processing system," Proc. Nat. Com­
put. Conf., pp.1055-1060, 1978. 
[FFW85]  J.A.B. Fortes, K.S. Fu, and B.W. Wah, "Systematic approaches to the 
design of algroithmically specifies systolic arrays," Proc. 1985 Intl Conf. 
Acoustics, Speech, and Signal Processing, IEEE  , pp. 8.9.1-8.9.5, Piscat­183 
away, NJ, 1985. 
[FM84]  J.A.B. Fortes and D.I. Moldovan, "Data broadcasting in linearly scheduled 
array processors," Proc. 1 1 th Ann. Symp. on Computer Architecture, pp. 
224-231, June, 1984. 
[FM85]  Jose A. B. Fortes and Dan I. Moldovan, "Parallel detection and algorithm 
transformation techniques useful for VLSI algorithms," Journal of Parallel 
and Distributed Computing, Vol.2, pp.277-301, 1985. 
[FP84]  Jose A. B. Fortes and F. Parisi-Presicce, "Optimal linear shedules for the 
parallel execution of algorithms," J. Parallel and Distributed Computing, 
pp.227-301, Aug. 1984. 
[GRU67]  Branko GRUNBAUM, CONVEX POLYTOPE, Interscience Publishers, 
London, 1967 
[Go165]  M. J. E. Goley, "Apparatus for counting bi-nucleate lymphocytes in blood," 
US Patent 3,214,574, 1965. 
[Hen61]  F. C. Hennie, ITERATIVE ARRAYS AND LOGICAL CIRCUITS, The 
M.I.T. Press and John Wiley & Sons, Inc., 1961. 
[Hen68]  F. C. Hennie, FINITE STATE MODELS FOR LOGICAL MACHINES, 
John Wiley & Sons, Inc., New York, 1968. 
[HJ85]  R. A. Horn and C. R. Johnson, MATRIX ANALYSIS, Cambridge Univer­
sity Press, Cambridge, 1985 
[HJ90]  J. N. Hwang and J. M. Jong, "Systolic architecture for 2-D rank order filter­
ing," pp.90-99, 1990 International Conference On Application Specific 
Array Processors", Edited by Sun-Yuan Kung; Earl E. Swartzlander,Jr; Jose 
A.B. Fortes and K. Wojtek Przytula. IEEE Computer Society Press. 
[Hoo87]  F. de Hoog, "A new algorithm for solving Toeplitz systems of equations," 
Lin. Algebra Appl., pp. 122-138, 1987. 
[JEH91]  J. A. K. S. Jayasinghe, F. M. El-Hadidy, and O.E. Herrmann, "An array pro­
cessor design methodology for hard real-time systems," ISCAS, vol.5, 
pp3062-3065, 1991. 
[HR93]  A. Halanay and V. Rasvan, APPPICATIONS OF LIAPUNOV METHODS 
IN STABILITY, Kluwer Academic Publishers, Dordrent, 1993. 
[HV930]  C. Herley and M. Vetterli, "Wavelets and recursive filter banks," IEEE 
Trans. on Signal Processing, Vol.41, No.8, pp.2536-2556, Aug. 1993. 184 
[KA89]  S. Kiaei and L. Aihua, "Analysis of Multirate/Bounded Broadcast," Techni­
cal Report, Elec. & Comp. Engr., Oregon State Univ., 1989. 
[Kat90]  Y. Kato, "Application-oriented high speed processors: experiences and per­
spectives," 1990 International Conference On Application Specific Array 
Processors", Edited by Sun-Yuan Kung; Earl E. Swartzlander,Jr; Jose A.B. 
Fortes and K. Wojtek Przytula. IEEE Computer Society Press. 
[KDLS87]  D. J. Kuck, E. S. Davison, D.H. Lawrie and A.H. Sameh, "Parallel super-
computing today and the cedar approach," pp.1-23, Experimental Parallel 
Computing Architectures, Edited by J.J Dongarra, 1987, Elsevier Science 
Publishing Co., North Holland. 
[KH83]  Sun-Yuan Kung and Yu Hen Hu, "A highly concurrent algorithm and pipe-
lined architecture for solving Toeplitz systems," IEEE Trans. on ASSP, Vol. 
31, No. 1, pp.66-76, Feb. 1983. 
[KL78]  H. T. Kung and C. E. Leiserson, "Systolic arrays for VLSI," Sparse Matrix 
Proceedings, SIAM, pp.245-282, 1978. 
[KL84]  H.T. Kung and M.S. Lam, "Wafer-scale integration and two level pipeline 
implementations of systolic arrays," J. Parallel Distributed Comput., void, 
pp.32-63, 1984. 
[KMW67]  R. M. Karp, R. E. Miller, and S. Winogard, "The organization of computa­
tions for uniform recurrence equations," Journal of ACM, vol.14, pp. 563­
590, July 1967. 
[Kun80]  H.T. Kung, "The structure of parallel algorithms," in Advances in Comput­
ers, Vol. 19. New York: Academic, 1980. 
[Kun82]  H.T. Kung, "Why systolic architectures," IEEE Computer, pp. 37-45, Jan. 
1982. 
[Kun84]  S. Y. Kung, "On supercomputing with systolic/wavefront array processors," 
Proc. IEEE, vol. 72, pp. 867-884, July, 1984. 
[Kun85]  H. T. Kung, "Systolic alagorithms for the CMU warp processor," in Systolic 
Signal Processing, E.Swartzlander, Ed. pp. 73-96, New York: marcel Dek­
ker, 1987. 
[Kun88]  S.Y. Kung, VLSI Array Processors, Prentice Hall, Englewood Cliffs, 1988. 
[Lev84]  H. Lev-Ari, Nonstationary lattice filter modeling, Ph.D Dissertation, Infor­
mation Systems Laboratory, Dept. of Elect. Engr., Stanford University, 
Stanford, CA, Sept., 1984. 185 
[LK88]  P. Lee and Z. M. Kedem, "Synthesizing linear array algorithms from nested 
for loop algorithms," IEEE Trans. on Computers, 37(12), pp.1578-1598, 
Dec. 1988. 
[LK90]  A. Li and S. Kiaei, "VLSI design of multi-rate arrays for DSP algorithms," 
1990 Int'l Conf. on Acoustics, Speech and Signal Processing, New Mexico. 
[LK93]  P. Lenders and S. Kiaei, "Synthesis of multi-dimensional ARE's", Submit­
ted to IEEE Trans. on Parallel and Dist. Computing, Feb. 1993. 
[LRS83]  C.E. Leiserson, F.M. Rose, and J.B. Saxe, "Optimizing synchrounous cir­
cuitry by retiming," in Third Caltec Conference on VLSI, R. Bryant Ed. pp. 
87-116, Rockville, MD: Comnputer Science Press, 1983. 
[LW84]  G. J. Li and B. W. Wah, "The design of optimal systolic arrays," IEEE 
Trans. on Computers, Vol. C-33, no. 10, Oct. 1984. 
[LW85]  G. J. Li and B. W. Wah, "The design of optimal systolic algorithms," IEEE 
Trans. on Computers, Vol. C-34, no. 1, Jan. 1985. 
[Ma189]  S. E. Mallat, "A theory for multiresolution signal decomposition: the wave­
let representation," IEEE Trans. on Pattern Analysis and Machine 
intelligence, Vol.11, No.7, pp.674-693, July, 1989. 
[McC58]  E.J. McCluskey, "Iterative combinational switching networks-general 
design considerations," IRE Trans. on Electronic Computers, Vol. EC-7, 
pp.285-291, 1958. 
[MF86]  D. I. Moldovan and J. A. B. Fortes, "Partitioning and mapping algorithms 
into fixed-size systolic arrays," IEEE Trans. Computers, Vol. C-35, No. 1, 
pp. 1-12, Jan. 1986. 
[Mo182]  D.I. Moldovan, "On the analysis and synthesis of VLSI algorithms," IEEE 
Trans. Comput., C-31:1121-1126, November 1982. 
[Mo193]  Moldovan, Dan I., PARALLEL PROCESSING: From Applications to Sys­
tems, Morgan Kaufmann Publishers, Inc., San Mateo, CA, 1993. 
[MQRS90]  C. Mauras, P. Quinton, S. Rajopadhye, and Y. Saouter, "Scheduling affine 
parameterized recurrences by means of variable dependent timing func­
tions," Int. Conf. on Application Specific Array Processors, IEEE Computer 
Society, pp.100-110, 1990. 
[OF86]  M.T. O'Keefe and J.A.B. Fortes, "A comparative study of two systematic 
design methodologies for systolic arrays," in M. Cosnard, P. Quinton, Y. 
Robert and M. Tchuente, Ed. Parallel Algorithms & Architectures, Elsevier 186
 
Science Publishers B.V. (North-Holland), pp. 313-324, 1986. 
[Opp78]  A. V. Oppenheim, Ed., Applications of Digital Signal Processing, Engle­
wood cliffs, NJ: Prentice-Hall, 1978. 
[Orf88]  S. J. Orfanidis, OPTIMUM SIGNAL PROCESSING: An Introduction (2nd 
edition), McGraw Publishing Company, pp.239-246, 1988. 
[PD84]  K. Preston and M.J.B. Duff, MODERN CELLULAR AUTOMATA, Ple­
mum Press, New york, 1984. 
[PRLN92]  J.G. Proakis, C.M. Rader, F. Ling and C. L. Nikias, ADVANCED DIGI­
TAL SIGNAL PROCESSING, McMillan Publishing Company, New York, 
1992. 
[Qui83]  Patrice Quinton, "The systematic design of systolic arrays," Tech. Rep. 216, 
Institute National de Recherche en Informatique et en Automatique INRIA, 
July, 1983. 
[Qui84]  P. Quinton, "Automatic synthesis of systolic arrays from uniform recur­
rence equations," Proc., 11th Ann. Symp. on Computer Architecture, 
pp.208-214, 1984. 
[Raj86]  S.V. Rajopadhye, "Synthesis, optimization and verification of systolic 
architectures," Ph.D. thesis, University of Utah, Salt Lake City, Utah 84112, 
Dec. 1986. 
[RF86]  S.V. Rajopadhye and R.M. Fujimoto, "Systolic array synthesis by static 
analysis of program dependencies," in Parallel Architecture and Languages 
Europe (Eds J.W. deBakker, A.J. Nijman and P.C. Treleaven), Springer 
Verlag, pp.295-310, 1987. 
[Raj89]  S. V. Rajopadhye, "Synthesizing systolic arrays with control signals from 
recurrence equations," Distributed Computing, pp.88-105, May 1989. 
[Rao85]  Sailesh Rao, "Regular iterative algorithms and their implementations on 
processor arrays," Ph.D. dissertation, Standard University, Information Sys­
tem Lab., Standard, CA, Oct. 1985. 
[RCM92]  J. Rosseel, F. Catthoor, and H. D. Man, "The exploitation of global opera­
tions in affine space-time mapping," VLSI Signal Processing, V. pp. 309­
318, 1992 
[RF88]  S. V. Rajopadhye and R. M. Fujimoto, "Synthesizing systolic arrays from 
recurrence equations," Parallel Computing, 1988. 187 
[Ros88]  B.A. Rosenfeld, A HISTORY OF NON-EUCLIDEAN GEOMETRY, Evo­
lution of the Concept of a Geometric Space, Springer-Verlag, New York, 
1988. 
[RTRK88]  P. Roychowdhury, L. Thiele, S.K. Rao, and T. Kai lath, "On the localization 
of algorithms for VLSI Processor Arrays," in R.W. Brodersen and H.S. 
Moscovitz (Eds) VLSI Signal Processing III, IEEE, pp.459-470, New York, 
1988. 
[SBM62]  D. L. Slotnick, W.C. Borck, and R.C. Mcreynolds, "The SOLOMON com­
puter," Proc. of the AFIPS Fall Joint Computer Conference, pp. 97-107, 
Washington D. C., 1962. 
[Sch86]  A. Schrijver, THEORY OF LINEAR AND INTEGER PROGRAMMING, 
John Wiley & Sons Ltd.Tiptree, Essex, 1986. 
[SC92]  Chris J. Scheiman and P. Cappello, "A processor-time minimal systolic 
array for transitive closure," IEEE Trans. on Parallel and Distributed Sys­
tems, Vol.3, No.3, pp. 257-269, May 1992. 
[SC93]  C. J. Scheiman and P. Cappello, "A period-processor-time-minimal sched­
ule for cubical mesh algorithms," In Proc. hit. Conf. on Application Specific 
Array Processors, pp. 261-272, IEEE Computer Society, Oct. 1993. 
[SF88]  W. Shang and J.A.B. Fortes, "Time optimal linear shcedules for algorithms 
with uniform dependencies," in Intl Conf on Systolic Arrays, pp.393-402. 
San Diego, May 1988. 
[SF89]  W. Shang and J.A.B. Fortes, "On the optimality of linear shcedules," J. of 
VLSI Signal Processing, Vol.1, 1989, pp.209-220, Kluwer Academic Pub­
lishers, Boston. 
[Sha73]  R. S. Shankland, "Conversations with Albert Einstein. II," American Jour­
nal of Physics, Vol. 41, pp.895-901. 1973. 
[Sha82]  R. Shaw, LINEAR ALGEBRA AND GROUP REPRESENTATIONS, 
Vol.1, "Linear Algebra and Inroduction to Group Representations," Aca­
demic Press, London, 1982. 
[Sha83]  R. Shaw, LINEAR ALGEBRA AND GROUP REPRESENTATIONS, 
Vol.2, "Multi linear Algebra and Group Representations," Academic Press, 
London, 1983. 
[Tau79]  gerald E. Tauber, Editor, ALBERT EINSTEIN'S THEORY OF GENERAL 
RELATIVITY, Crown Publishers, Inc. New York, 1979. 188 
[Thi89]  L. Thiele, "On the design of piecewise regular processor arrays," IEEE 
symp. on Circuits and Systems, Portland, pp.2239-2242, 1989. 
[Ung58]  H. Unger, "A computer oriented toward special problems,", Proc. IRE, 46, 
pp.1744-1750, 1958. 
[Vai90]  P. P. Vaidyanathan, "Multirate digital filters, filter banks, polyphase net­
works, and applications: A tutorial", Proc. of The IEEE, Vol.78, No.1, 
pp.56-93, 1990. 
[Wai67]  W.M. Waite, "Path detectionin multidimensional iterative arrays," J. ACM, 
14(2), pp.300-310, Apr. 1967. 
[WD85]  Yiman Wong and J. M. Delosme, "Optimal systolic implementation of N-
dimensional recurrences," IEEE Proc. ICCD, pp. 618-621, 1985. 
[WD89]  Yiman Wong and J. M. Delosme, "Optimization of processor count for sys­
tolic arrays," Dept. of Computer Sci. RR-697, Yale Univ., May 1989. 
[WD92]  Yiman Wong and J. M. Delosme, "Optimization of computation time for 
systolic arrays," IEEE Trans. on Computers, Vol. 41, no.2, pp. 159-177, 
Feb. 1992. 
[YC88]  Yoav Yaacoby and Peter R. Cappello, "Scheduling a system of affine recur­
rence equations onto a systolic array," Dept. of Electrical & Computer 
Engineering, and Dept. of Computer Science, U. of California, Santa Bar­
bara, 1988. 
[YC88a]  Y. Yaacoby and P.P. Cappello, "Converting Affine Recurrence Equations 
to Quasi-Uniform Recurrence Equations," Technical Report, Dept. of Com­
puter Science, UCSB, Santa Barbara, CA, February 10, 1988. 
[YC88b]  Y. Yaacoby and P.R. Cappello, "Bounded Broadcast in Systolic Arrays," 
Technical Report, Dept. of Computer Science, UCSB, Santa Barbara, April 
1988. 
[ZK93]  Y.P. Zheng and S. Kiaei, "Multi-rate transformation of directional affine 
recurrence equations," The Inter. Conf. on Application Specific Array Pro­
cessors, Venice, Italy, pp.392-403, 1993. 
[ZK93]  Y.P. Zheng and S. Kiaei, "Multi-rate transformation of directional affine 
recurrence equations," Application-Specific-Array-Processors, IEEE Com­
puter Society, pp.392-403, Oct. 1993. 
[ZK93]  Y. Zheng and S. Kiaei, "Automatic derivation of multi-rate VLSI arrays for 
directional affine recurrence equations," IEEE Trans. on Signal Processing, 189 
(under review). 
[ZK94]  Y. Zheng and S. Kiaei, "Overlapping transformation for time-area optimal 
VLSI arrays for Toeplitz system," J. of Computer and Software Engr., 1994. 
[ZR92]  X. Thong and S. V. Rajopadhye, "Quasi-linear allocation functions for effi­
cient array design," J. of VLSI Signal Processing, 4(97-110), 1992. 190
 
The following papers are all cited from "1990 International Conference On Applica­
tion Specific Array Processors", Edited by Sun-Yuan Kung; Earl E. Swartzlander,Jr; Jose 
A.B. Fortes and K. Wojtek Przytula. IEEE Computer Society Press. 
[Kat90]	  Y. Kato, "Application-oriented high speed processors: experiences and per­
spectives," 1990 International Conference On Application Specific Array 
Processors", Edited by Sun-Yuan Kung; Earl E. Swartzlander,Jr; Jose A.B. 
Fortes and K. Wojtek Przytula. IEEE Computer Society Press. 
Application-oriented high speed processor development should last forever by featuring 
one order higher speed processing capabilities than that of conventional microproces­
sors. special architecture tuned for a specific application has advantages over conven­
tional microprocessors. 
The future application trend is not a issue here, because every personal workstation will 
enjoy today's super computer processing capability in the near future by employing 
application-oriented processors in its peripheral. Such workstations will be able to per­
form more complex and higher speed applications. Real time foreign language inter­
preters and three dimensional video/image semantic recognitions are possible 
examples. Humanoid robots which can communicate with us will also become a reality. 
The key issue of the application-oriented high speed processor is the harmony among 
cost, size and processing capability for specific applications which creates harmony 
between human beings and machines. 
[C1a90]	  Ph. Clauss, C. Mongenet, and G.R. Perrin, "Calculus of space-optimal map­
ping of systolic algorithms onto processor arrays," 1990 International 
Conference On Application Specific Array Processors," Edited by Sun-
Yuan Kung; Earl E. Swartzlander,Jr; Jose A.B. Fortes and K. Wojtek 
Przytula. IEEE Computer Society Press. 
A method based on geometrical interpretations on the convex polyhedra in r for the 
mapping of systolic algorithms onto minimum number of processors is presented. Two 
space-optimal mappings of the Gaussian elimination algorithm are derived, one is 
called pipelining, and the other is grouping mapping. 191 
There is nothing new but they try to use different geometrical tools for the expressions 
of algorithms. I. polyhedron expression, expressed by the edges and vertices of the 
polyhedron of the algorithms. II. timing and allocation expression, decomposing the 
polyhedron into temporal planes and allocation directions; III. potential parallelism 
expression, considering the maximum parallelism of the algorithm and then determine 
the space-optimal mapping relative to a given timing. 
Overall, there is nothing new but try to use some geometrical terminology for the 
expression of algorithms and systolic arrays. 
[Sch90]	  C.J. Scheiman and P. Cappello, "A processor-time minimal systolic array 
for transitive closure," Edited by Sun-Yuan Kung; Earl E. Swartzlander,Jr; 
Jose A.B. Fortes and K. Wojtek Przytula. IEEE Computer Society Press. 
The computation of transitive closure of a relation over a set of n elements on n2 PEs 
requires 5n-4 steps as shown in the Kung, Lo, and Lewis (KLL) algorithm. This paper 
shows that to achieve this 5n-4 time bound, only [n2/31 PEs are needed. A processor-
time minimal systolic array using 1n2/31 PEs organized as a cylindrical 2D mesh 
when n is multiple of 3, or connected as a 2D mesh as twisted torus is presented. 
The dag of the KLL algorithm is defined as follows: 
G  (N, A) tc(n) 
A= {  k),  1c)  = + 1. ER.i' =i+ 1, 1 5 i, j, k <n} 
u [ (i, j, k), (il, jl,k1)}1i1 =  1,j1 =j 1, kl = k + 1, 1 < j 
lk<n 
[Hwa90]	  J.N. Hwang and J.M. Jong, "Systolic architecture for 2-D rank order filter­
ing," pp.90-99, 1990 International Conference On Application Specific 
Array Processors", Edited by Sun-Yuan Kung; Earl E. Swartzlander,Jr; Jose 
A.B. Fortes and K. Wojtek Przytula. IEEE Computer Society Press. 
2-D rank order filtering has found applications in a wide variety of applications in 
image processing. This design derives its architecture mainly from a systolic design for 
1-D rank order filtering. The adopted design, called sample oriented rank order filter 
design, takes advantage of the evaluated ranks values in the current window for the 
evaluation of the rank values in the next window without explicitly sorting the data in 
the window. This fact allows the conversion of a 2-D windowed data sequence into a 1­
D windowed data sequence with multiple data samples exchange at each window 
movement, and the 1-D systolic design can be used.By cascading many such 1-D sys­192 
tolic arrays into a 2-D array, and supplied with simple parallel-in serial-out logic, this 
architecture is able to achieve the maximally available parallelism with nearly 100% 
efficiency if either the pipeline interleaving or processor sharing technique is used. This 
design does not require the preloading of the whole 2-D data array, line scanned images 
are allowed to be processed with negligible time delay. 
[Mau90]	  C. Mauras, P. Quinton, S. Rajopadhye, and Y. Saouter, "Scheduling affine 
parameterized recurrences by means of variable dependent timing func­
tions," 1990 International Conference On Application Specific Array 
Processors", Edited by Sun-Yuan Kung; Earl E. Swartzlander,Jr; Jose A.B. 
Fortes and K. Wojtek Przytula. IEEE Computer Society Press. 
A new technique on affine scheduling of each variable of a system of ARE independent 
of the others by a affine timing function. This new technique makes it possible to ana­
lyze systems of RE with variables in different index spaces, and multi-step systolic 
algorithms. 
Previous synthesis techniques make implicitly the assumption that all variables are in 
the same index space. Thus, a preliminary heuristic transformation is necessary to 
ensure that this condition is met. 
This method is very similar to what I proposed but it is still a step behind, they implic­
itly assume that all index spaces are under the same coordinate system which makes its 
application to only unimodular dependence maps. But its writing will be a good exam­
ple for me to follow, especially section 2. The definition for MCS specification of ARE 
is given in section 1. 
[Jon93]  Don H. Johnson and Dan E. Dudgeon, ARRAY SIGNAL PROCESSING-­
Concepts and Techniques, Prentice Hall, Englewood Cliffs, NJ, 1993. 
Some good points on Eigenanalysis. Eigenvectors are used to define the coordinate sys­
tem tailored to the linear operator. This is very similar to what I am working on the der­
ivation of MCS from the dependence matrices. Some terminologies may help me in 
presenting MCS. But in this book, only full rank matrices are considered, i.e., the linear 
operator has no null space. In DSP algorithms, dependence matrices are often rank def­
icit and therefore it is important to cope with rank deficit matrices. 
[GRU67]	  Branko GRUNBAUM, CONVEX POLYTOPE, Interscience Publishers, 
London, 1967 193
 
This book talks about polytopes and convex hulls, hyperplanes, and all those n-dimen­
sional geometries needed for the development of synthesis theory of VLSI array pro­
cessors. The following is some notes taken from the book for the definitions of some 
terminologies: (x, y) denotes the scalar product of vector x and y. 
1. A hyperp/ane H is a set which may be defined as H = {x e Rd' (x,y) = a} for
 
suitable y E Rd, y  0.
 
2. An open half-space [closed half-space] is defined as H = {x E Rd' (x, y) > a}
 
[respectively H = {x E Rd] (x, y)  a} ] for suitable y E Rd, y # 0, and a.
 
3. A set K E Rd is a convex if and only if for each pair of distinct points a, b e Rd the 
closed segment with endpoints a and b is contained in K. equivalently, K is convex if its 
intersection with every straight line is either empty, or a connected set. 
K is a convex if a, b e K and 0 5. X  1 imply Xa + (1  X) b e K. 
Properties of convex: 
a). if {1C1,} is any family of convex sets in Rd, then their intersection (yr  is also con­
vex.  V 
b). if A, B are convex, then A + B and A B are convex, and for any real X, the set XA 
is convex. 
k 
c). If A is convex, ai E A and Xi  0 for i = 1, 2, ..., k, and  Xi = 1, then 
k  i =1 
y, Xiaie A.
 
i =1
 
d). if A c Rd is convex, the sets cl A and int A are also convex. 
e). if A c Rd is convex, x E A ye intA, then all points of the line segment between x 
and y belong to int A. 
f). If T is affine transformation of Rdinto itself, and if A c Rd is convex, then T (A) is 
convex. 
g). For a convex set A c Rd let H=aff A be the affine hull of A. The relative interior 
relint A of A as a subset of H is never empty, and relint cl A cA c cl relint A. The rela­
tive boundary relbd A of A with respect to H is empty iff A is an affine variety (i.e. 
A =H). 
4. Convex hull: A convex hull cony A of a subset A of Rd is the intersection of all the 
convex sets in Rd which contain A. 
[Ban90]	  Thomas F. Banchoff, BEYOND THE THIRD DIMENSION, Geometry, 
Computer Graphics, and Higher Dimensions. Scientific American Library, 
New York, 1990. (QA691.B26 1990) 
The axioms of Euclidean Plane Geometry: 194 
1. A straight line may be drawn between any two points. 
2. Any terminated straight line may be extended indefinitely. 
3. A circle may be drawn with any given point as center and any given radius. 
4. All right angles are equal. 
5. If two straight lines in a plane are met by another line, and if the sum of the internal 
angles on one side is less than two right angles, then the straight line will meet if 
extended sufficiently on the side on which the sum of the angles is less than two right 
angles. 
5'. (5 can be stated as follows): For any given point not on a given line, there is exactly 
one line through the point that does not meet the given line. 
Coordinate geometry. 
[Sch86]  Alexander Schrijver, THEORY OF LINEAR AND INTEGER PRO­
GRAMMING, John Wiley & Sons Ltd.Tiptree, Essex, 1986. 
The emphasis of this book is on the theoretical aspects of linear and integer program­
ming, and it aims at complementing the more practically oriented books. This book 
contains many basic definitions and theorems in linear algebra, matrix theory and 
Euclidean geometry. 
Pivoting: If A is a matrix, say 4 = I a b where a is a nonzero number, b is a row vector,
c D 
c is a column vector, and D is a matrix, then pivoting over the pivot element (1,1) means 
[
 
a-1 c D a-1 cb
 
defined similarly.
 
replacing A by the matrix -a-1  a-1 b ]  Pivoting over any other element of A is 
The sizes of a rational number a = p/ q (where p and q are relatively prime integers), 
of a rational vector c =  yn) and of a rational matrix A = (au)7 1,71.  are: 
size (a) = 1 + log2(1p1 + 1)1+ I log2 (Iq1 + 1)1 
size (c) = n + size (y1) + ... +size (7n) 
(4.1) 
size (A) = mn +Isize (au) 
Polynomial algorithms: 
If f, g1, ..., gm are real-valued functions, then f is said to be polynomially bounded by 
g1, ..., gm if there is a function 4) such that 4)  and such that 4) arises by a sequence 
of compositions from the functions g1, ..., gm and from some polynomials. 195 
In the special case that g1, ..., gm are polynomials, it follows that iff is polynomially 
bounded by g1,  gm, then f is bounded above by a polynomial. In that case, f is called 
a polynomially bounded function. 
An algorithm is called polynomial-time or polynomial if its running time function is 
polynomially bounded. A problem is said to be solvable in polynomial time or polyno­
mially solvable if the problem can be solved by a polynomial-time algorithm. 
The Classes of P, NP and co-NP 
The class of decision problems solvable in polynomial time is denoted p. Another, pos­
sibly larger, "complexity class" is the class NP [solvable by a Non-deterministic Turing 
machine in Polynomial-time]. 
A set A of Rn is called an (additive) group if: 
(i) 0 e A 
(4.2) (ii) if  x, y E A, then x+y E A and x E A 
The group is said to be generated by al, ..., am if 
A = { X1a1 +  + Xmand  ..., am E Z}	  (4.3) 
The group is called a lattice if it can be generated by linearly independent vectors. 
These vectors then are called a basis of the lattice. 
[Mo193]	  Moldovan, Dan I., PARALLEL PROCESSING: From Applications to Sys­
tems, Morgan Kaufmann Publishers, Inc., San Mateo, CA, 1993. 
SIMD Supercomputers:(pp.204-217) 
(1) The Connection Machine: CM-2. 64K PEs, organized in four 16K partitions. Each 
partition is controlled by a sequencer. The four sequencers are connected to as many as 
four front-end computer systems through a 4 x 4 cross-point switch. 
(2) The Hughes 3-D Computer: By Hughes Research Laboratory. It is the first system 
built with 3-D wafer-scale integration technology. The functional blocks of the individ­
ual array processing elements, such as the ALU, registers, comparators, memory, and 
other logic, are distributed across a number of vertically aligned wafers. Each wafer in 
the stack contains an entire 128 x 128 array constituting one particular functional 
block. The stack is about 1 inch high and 4 inches in diameter. 
Systolic arrays: The interconnection pattern should be simple and regular, with only 
local connections of PEs and without long wires that would need more area or more 
energy to drive them. Thus, VLSI structures are characterized by a high degree of mod­196 
ularity, absence of long data paths, localized connectivity for data transfer, limited 
capabilities of processing elements, absence of central control, and simple timing 
mechanisms. 
The general idea in applying systolic processing to a problem is to transform the prob­
lem to meet the VLSI circuit requirements, rather than the other way around. This shifts 
much of the design complexity from the chip layout designer to the algorithm designer. 
This approach also minimizes overall design efforts, because the layout design is much 
more time- and labor-intensive than are the algorithm manipulations needed to "systoli­
cize" algorithms. 
The challenge is to design algorithms that are computation and I/O balanced. 
Warp and iWarp: Warp machine has 3 components: Warp processor array, interface 
unit, and the host. The host executes parts of the application program that are not 
mapped onto systolic array, such as initialization subroutines and sequential code. the 
host supplies the data to, receives the results fro, the array. (Notice that this is the same 
concept I have for the future expert-computer-systems. Computers with different exper­
tise are integrated together to form a supercomputer system.) 
[Cox91]  H. S. M. Coxeter, REGULAR COMPLEX POLYTOPES, Second edition, 
Cambridge University Press, 1991. (QA691.C66 1991). 
[Bir91]  M.K. Birbas, D. J. Soudris,and C.E. Goutis,"Design methodology for direct 
mapping of iterative algorithms on array architectures," Intern. Symposium 
on Circuits and Systems, vol.5, pp.3058-3061, 1991. 
The concept of the array architecture is combined efficiently with the existing knowl­
edge of compiler techniques. It is considered as initial description of the iterative algo­
rithm Fortran-like nested loops, without the requirement of transforming it into any 
intermediate form such as URE. The principles of Lamport's Coordinate Method are 
employed. Depending on the systolic or non-systolic structure of the algorithm and/or 
the multidimensional mapping, the resulting architectures can be either Regular Arrays 
or Piecewise Regular Arrays. The important subclass of algorithms known as Weak 
Single Assignment Codes is treated in an one time a way. Finally, it should be stressed 
that multiprojection can be performed easily and the hardware specification of the pro­
cessing element of the array processor are described in detail. 197 
The clock ratios of an MRA array are determined by ratios of the scale sizes of the axes 
of different coordinate systems along the direction of the projection vector. Let the pro­
jection vector be A.,a. Let variable u depends on variable v, and their dependence vector 
be T, then the clock rate ratio r can be determined as follows: 
413v  T a vu r=  (4.4)
410u  aTuu 
where Xa Tu. should not equal to zero, otherwise, the variable U is a stored variable 
instead of propagated variable. We will consider the propagation clock rate of a stored 
variable as zero. 