Design of Efficient Algorithms Through Minimization of Data Transfers by Chong, Yong Mo
Old Dominion University 
ODU Digital Commons 
Electrical & Computer Engineering Theses & 
Dissertations Electrical & Computer Engineering 
Fall 1983 
Design of Efficient Algorithms Through Minimization of Data 
Transfers 
Yong Mo Chong 
Old Dominion University 
Follow this and additional works at: https://digitalcommons.odu.edu/ece_etds 
 Part of the Computational Engineering Commons, and the Signal Processing Commons 
Recommended Citation 
Chong, Yong M. "Design of Efficient Algorithms Through Minimization of Data Transfers" (1983). Master
of Science (MS), Thesis, Electrical & Computer Engineering, Old Dominion University,
DOI: 10.25777/zkjm-z906
https://digitalcommons.odu.edu/ece_etds/164 
This Thesis is brought to you for free and open access by the Electrical & Computer Engineering at ODU Digital 
Commons. It has been accepted for inclusion in Electrical & Computer Engineering Theses & Dissertations by an 
authorized administrator of ODU Digital Commons. For more information, please contact 
digitalcommons@odu.edu. 
DESIGN OF EFFICIENT ALGORITHMS THROUGH 
MINIMIZATION OF DATA TRANSFERS
by
Yong Mo Chong 
B.S.E.E. May 1981 Old Dominion University
A Thesis Submitted to the Faculty of
Old Dominion University in Partial Fulfillment of the
Requirements for the Degree of
MASTER OF ENGINEERING 
ELECTRICAL ENGINEERING
OLD DOMINION UNIVERSITY 
November 1983
Approved by:
Meghanad D. Wagh (Director)
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
©Copyright by Yong M. Chong 1983 
All Rights Reserved
ABSTRACT
DESIGN OF EFFICIENT ALGORITHMS THROUGH
MINIMIZATION OF DATA TRANSFERS
Yong M. Chong 
Old Dominion University 
Director: Meghanad D. Wagh
This thesis explores the time-optimal implementation of
computational graphs on a finite register machine. The implementation
fully exploits the machine architecture, especially the number of
registers. The derived algorithms allow one to obtain time-efficient
implementations of a given graph on machines with a known number of
registers.
These optimization procedures are applied to digital signal
processing graphs. It is shown that the regular structure of these
graphs allows one to identify computational kernels which, when used
repeatedly, can cover the entire graph. The 1- and r-register
implementations of Hadamard and Fast Fourier Transforms using various
computational kernels are studied for their code sizes and time
complexities. The results obtained also allow one to select optimal
hardware for a particular computational application.
ACKNOWLEDGMENT
I would like to thank Dr. Meghanad D. Wagh for his patience,
guidance, and help during the research. This work would not have been
possible without his enthusiasm, insight and encouragement. In
addition, his assistance during the preparation of this thesis was
appreciated since it was the result of many long nights together.
I would also like to acknowledge the other members of my thesis
committee, Dr. Sherad Kanetkar and Dr. John W. Stoughton, for their time
and consideration. Thanks are also due to Teri M. Owens for her
assistance in preparing this thesis.




TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF SYMBOLS
CHAPTER
1 INTRODUCTION
  1. Background
  2. Computer Architecture
  3. Problem Identification
  4. Unique Approach to the Problem
  5. Overview
2 COMPUTATIONALLY ORGANIZED BLOCK: 1-DIMENSION
  1. Graph Theory Preliminaries
  2. Computationally Organized Block (COB)
  3. Complexity of 1-Register Implementation
  4. Algorithm for Implementation of a 1-Register Machine
  5. Example
3 COMPUTATIONALLY ORGANIZED BLOCK: R-DIMENSION
  1. Time Complexity of R-Dimensional COBs
  2. R-Dimensional COB Algorithm
  3. Example
4 APPLICATIONS
  1. Primitive COB
  2. Hadamard Transform (HT)
  3. Implementation of a Complete HT Through Primitive COBs
  4. Fast Fourier Transform (FFT)
  5. Implementation of 2^8 Length FFT
5 CONCLUSIONS
  1. Summary of Selected Results
  2. Significance of the Results
  3. Suggestions for Further Work
LIST OF REFERENCES





LIST OF TABLES
CHAPTER 1
1.1 Execution times (in μsec) for various microprocessors
CHAPTER 4
4.1 Dependence of the complexities of two different implementations
    upon the number of registers in the machine
4.2 Comparison of implementations with and without primitive COBs
4.3 Complexities of various implementations of HT primitive COBs
4.4 Change in the values of Eta for various primitive COBs
4.5 Implementation of 2^12 length HT
4.6 Complexities of various implementations of FFT primitive COBs
4.7 Implementation of 2^8 length FFT





LIST OF FIGURES
CHAPTER 1
1.1 SISD architecture
CHAPTER 2
2.1 Graphical and alternate representation of a computation
2.2 Basic representation of a graph
2.3 Four topological sorts of the graph in Fig. 2.2
2.4 Example of a 1-dimensional COB
2.5 Example of a 2-dimensional COB
2.6 A computational graph and its 1-dimensional COB cover
CHAPTER 3
3.1 The four basic transformations used to form computable paths
3.2 Computational graph of 4-point FFT
3.3 1-dimensional COB cover of the 4-point FFT graph
3.4 Equivalent 1-register COB cover of the 4-point FFT graph
3.5 2-dimensional COB cover of the 4-point FFT graph
3.6 3-dimensional COB cover of the 4-point FFT graph
3.7 4- through 9-dimensional COB covers of the 4-point FFT graph
CHAPTER 4
4.1 A computational graph with 63 points
4.2 Primitive COBs for implementation of the graph in Fig. 4.1
4.3 Various implementations of 3- and 7-point primitive COBs
4.4 Cover of complete graph using 3- and 7-point primitive COBs
4.5 1-register implementation of HT
4.6 2-register implementation of HT
4.7 The three types of butterfly implementations prevalent in the
    2-register implementation of HT
4.8 3-register implementation of HT
4.9 Time complexity of various implementations of 2^12 length HT
4.10 Computational graph of 2-point FFT
4.11 1- and 2-dimensional COB cover of 2-point FFT
4.12 Time complexity of various implementations of 2^8 length FFT
LIST OF SYMBOLS
SYMBOL MEANING
Ci      i-th Computationally Organized Block (COB)
G       Computational graph
r       Number of registers
P       Set of computational points
∪       Union
Vi      i-th computable path
∈       Belongs to
∉       Does not belong to
∅       Null set
⊆       Subset




CHAPTER 1
INTRODUCTION
1.1 Background
The past two decades have seen rapid strides in the area of digital
signal processing. Many new signal processing techniques were designed
and many new applications were discovered. However, most of the effort
in this area was concentrated on reducing the complexity of the
algorithms involved. Since signal processing algorithms are used
repeatedly (and in some cases, continuously) for different data sets, a
small reduction in their complexity results in a large saving of
practical resources. In addition, the demanding real time applications
of signal processing techniques are becoming increasingly popular.

A reduction in time complexity may be achieved by employing
hardware techniques such as parallel processing and pipelining, by using
faster technologies, or by restructuring computational algorithms so
that the time intensive operations are reduced. The least expensive of
these, the third alternative, is the subject of this thesis.

Traditionally, only multiplication was viewed as the time
consuming operation. However, several breakthroughs in technology have
now reduced the multiplication time significantly. As a result, the
numbers of both multiplications and additions in an algorithm are
generally used to estimate its computational complexity. The
unsuitability of even this complexity measure may be illustrated by
pointing out a case of great practical significance. A Fourier
transform algorithm designed by Winograd (WFTA) in 1976 [1] had a
smaller number of multiplications and additions and was therefore
immediately accepted as a replacement for the fast Fourier transform
(FFT) [2]. However, an implementation of WFTA on PDP 11/55 and IBM
370/168 was found to be much slower than that of FFT [3]. This
discrepancy could be explained only after a detailed operation count was
maintained. It was found that on a PDP 11/55 (using Assembler), for
example, a 1008 point WFTA required 14.6 msec less time for
multiplications than FFT, but simultaneously used up 40.1 msec more for
the memory reference operations, resulting in an implementation that was
45% slower than the FFT. The fact that memory referencing is very time
intensive may also be understood by examining Table 1.1, which compares
the times for various operations in many general purpose microprocessors
available today. Even though the importance of reducing the number of
memory reference operations is thus obvious, little has been done about
it to date. There are two main reasons for this. Firstly, the
realization of the importance of these operations is rather recent, and
secondly, there does not exist a mathematical model which may, in a
rather systematic manner, pave the way to such optimization.
Table 1.1. Execution times (in μsec) for various microprocessors [4-8].

microproc.   8080   6800   Z-80   8085A   8086   68000   Z8000   TMS9900
clk. cycle   2.0    1.0    0.5    0.32    0.2    0.125   0.25    0.3333
Load         7      4      4      4.16    2.8    2.0     3.00    7.30
Store        7      4      4      4.16    3.0    2.125   3.50    7.30
Mop(+,-)     7      4      5      n/a     3.0    1.125   3.75    7.32
Copy         5      2      1      1.28    0.4    0.5     0.75    4.60
Rop(+,-)     4      2      1      1.28    0.6    0.5     1.00    4.60
Compiler designers had realized the importance of reducing memory
fetches as early as 1964. In that year, Anderson designed an
algorithm for compiling a computation expressed as a tree using a stack
of local registers [9]. His results were later extended by Nakada, who
obtained a compiling algorithm for arithmetic expressions in computers
with n accumulators [10]. His algorithm generated object code
which minimized the frequency of storing and was used in a FORTRAN IV
compiler for the HITAC-5020 computer, which has 14 accumulators. In a
computer with limited core memory, a large amount of data has to be
stored on a slow, external memory device. Thus, while solving problems
on such machines, one needs to minimize the reads and writes to that
slow memory. Specific algorithm implementations which distinguish
between slow and fast memory and reduce references to the slow memory
have also been reported. Both Brenner [11] and Naidu [12] have studied
computation of the FFT of a large sequence resident in an external device
such as a disk. Similarly, Eklundh [13] and Naidu [14] have implemented
fast transposition of matrices too large to be stored in fast memory.
More recently, Nawab and McClellan have done a detailed analysis of
the implementation of WFTA and FFT on finite register machines and have
found the optimum number of registers for different length WFTA [15].
1.2 Computer Architecture
One possible definition of computer architecture is the
characteristics of a machine as seen by a programmer. In general, it is
difficult to categorize different computer architectures because of the
numerous variations. One possible scheme proposed by Flynn [16] is to
divide computer architectures into four distinct categories: SISD
(Single-Instruction-Stream/Single-Data-Stream), SIMD
(Single-Instruction-Stream/Multiple-Data-Stream), MISD
(Multiple-Instruction-Stream/Single-Data-Stream), and MIMD
(Multiple-Instruction-Stream/Multiple-Data-Stream). With the exception
of SISD, all categories use some type of parallel processing with
multiple processors. The SISD architecture has only one processor which
uses one instruction per instruction cycle. Almost all general purpose
computers and microprocessor systems fall in the SISD category. For this
reason, the remainder of this thesis addresses only the SISD
architecture. A typical SISD architecture has a local register file and
a large main memory as shown in Fig. 1.1.
The instructions in SISD architecture may be divided in two
categories: memory referenced and local register referenced. A memory 
referenced instruction is one in which an operand resides in memory. A 
local register instruction, on the other hand, does not access the 
memory.
For this study, the set of instructions is restricted to the
following:
Load   : Rn ← Mj        (Load Register-n from Memory-j)
Store  : Mj ← Rn        (Store Register-n in Memory-j)
Mop(*) : Rn ← Rn * Mj   (+, -, x Memory-j to Register-n)
Copy   : Rn ← Rm        (Copy Register-m to Register-n)
Rop(*) : Rn ← Rn * Rm   (+, -, x Register-m to Register-n)
The execution times for these instructions depend upon the types
of operations and the specific architecture of the machine. Further,
for memory related instructions (Load, Store, and Mop(*)), they also
depend upon the addressing mode. However, in most cases (see Table
1.1) the execution of memory reference instructions (Load, Store, and
Mop(*)) is slower than the equivalent local register instructions (Copy
and Rop(*)). This time difference between the two types of instructions
is inherent to SISD architecture and may be attributed to the
comparatively large access time of memory.

Fig. 1.1 SISD architecture.

The following normalized times (suggested by actual times listed in
Table 1.1) are used in this work to denote the relative time complexity
of these instructions.
Tload     = 2 units
Tstore    = 2 units
Tmop(+,-) = 2 units
Tmop(x)   = 4 units
Tcopy     = 1 unit
Trop(+,-) = 1 unit
It should be noted that the time differences (Tload - Tcopy),
(Tstore - Tcopy), and (Tmop(+,-,x) - Trop(+,-,x)) are chosen to be
exactly equal, because they are all identical to the memory access
delay of the architecture.
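
The normalized times above can be turned into a small cost model. The following is only a minimal sketch: the dictionary keys, the helper names, and the sample instruction sequence are illustrative and do not come from the thesis. The value used for Rop(x) is not listed in the text; it is an assumption inferred from the stated rule that every memory instruction exceeds its register counterpart by the same access delay.

```python
# Normalized instruction times (abstract units) as listed in the text.
NORMALIZED_TIME = {
    "load": 2,
    "store": 2,
    "mop_add": 2,   # Mop(+,-)
    "mop_mul": 4,   # Mop(x)
    "copy": 1,
    "rop_add": 1,   # Rop(+,-)
    # ASSUMPTION: Trop(x) is not given in the text. It is inferred here as
    # Tmop(x) minus the 1-unit memory access delay.
    "rop_mul": 3,
}

# Instructions that reference main memory.
MEMORY_OPS = {"load", "store", "mop_add", "mop_mul"}


def total_time(program):
    """Sum the normalized times of a list of instruction names."""
    return sum(NORMALIZED_TIME[op] for op in program)


def memory_references(program):
    """Count the instructions in the list that touch main memory."""
    return sum(1 for op in program if op in MEMORY_OPS)


# Hypothetical sequence: load an operand, add four memory operands, store.
program = ["load", "mop_add", "mop_add", "mop_add", "mop_add", "store"]
print(total_time(program), memory_references(program))
```

Keeping the memory instructions in a separate set makes it easy to report the quantity this thesis sets out to minimize, the number of memory references, alongside the total time.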
1.3 Problem Identification
Recalling the discussion in earlier sections, two problems faced
by digital signal processing engineers can be easily identified.
Firstly, given a machine, how can one best exploit its architectural
features in order to obtain an efficient implementation of any signal
processing algorithm? Since signal processing algorithms are used over
and over again, any small improvement in their complexity that does not
call for improved hardware is immensely useful.
Secondly, given an algorithm, if one is to construct special
purpose hardware for its implementation, what architectural features
should be built into the hardware? Since the cost of
the hardware increases with every new feature added, one must have a
clear understanding of the advantages this new feature will provide.
The results obtained in this thesis are first steps towards the
solution to these problems. For example, by exploiting the two
accumulator feature in a machine (say a 6800 microprocessor) as shown
herein, one may improve the computational time of the Fast Fourier
Transform by 35.29%. Similarly, the results obtained here demonstrate
that hardware for implementing the Hadamard Transform need not have
more than three accumulators, since the gain due to more registers is
marginal.
1.4 Unique Approach to the Problem
A directed graph is used here to model a computational algorithm.
The nodes of the graph represent actual computations and the edges
represent the order between various computations. Since the aim here is
to minimize the memory reference operations, the graph is partitioned
into subgraphs (called COBs), each of which may be evaluated without any
memory reference on a given hardware configuration. This enables one to
identify the memory reference operations with the graph edges not
included in any COB. In order to minimize such edges, a two step
approach is used. First, the given graph is partitioned into COBs
suitable for a one accumulator architecture. Next, an accumulator is
added to the machine and the COB cover is modified to take into account
the availability of the extra register. This second step is repeated
until all available registers are used. In addition, the regularity in
a signal processing graph is exploited to identify the computational
kernels and to implement the graph by repeating the implementation of
the kernel.
1.5 Overview
Chapter 2 of this thesis reviews some graph theoretic preliminaries
required later. It also defines the Computationally Organized Block
(COB) of arbitrary dimension and presents an algorithm to partition the
given graph into 1-dimensional COBs. A procedure to cover the graph
using r-dimensional COBs (r ≥ 2) is presented and illustrated in Chapter
3. Using the algorithms, Chapter 4 explores the implementation of
efficient algorithms for the Hadamard Transform (HT) and Fast Fourier
Transform (FFT). This chapter also defines and uses the concept of a
primitive COB. Finally, Chapter 5 concludes this thesis by summarizing
the results obtained and pointing out directions for future research.
CHAPTER 2
COMPUTATIONALLY ORGANIZED BLOCK: 1-DIMENSION
As has been stated in Chapter 1, the major thrust of this thesis is
the establishment of a mathematical model appropriate for description
and implementation of a signal processing algorithm on a finite register
machine. Computational graphs for signal processing algorithms are
unlike the computational graphs studied in earlier literature in that
they do not have tree structures and instead have feed-forward
paths. This chapter is devoted to the investigation and modelling of
such graphs.
Section 2.1 describes the nomenclature and the basic properties of
signal processing graphs. Based on these properties, Section 2.2 then
derives the mathematical models for such computations in a finite
register machine. The basic approach here consists of partitioning the
graph into modules, each of which may be computed independently in a
machine with r registers without making a reference to the memory
external to the CPU. These modules are designated herein as
COMPUTATIONALLY ORGANIZED BLOCKS (COBs) of dimension r. Since the
computations within a COB do not require any memory fetches or stores,
the complexity of the algorithm in terms of the number of memory
references is determined solely by the number of graph edges
joining different COBs. This is shown in Section 2.3. An algorithm to
obtain an implementation in terms of 1-dimensional COBs is presented in
Section 2.4 and illustrated through an example in Section 2.5.
2.1 Graph Theory Preliminaries
A computational algorithm can always be represented as a directed
graph. Points in such a graph stand for computational nodes and a
directed edge from P1 to P2 indicates the involvement of the result at
P1 in the computation P2. Alternately, a directed graph G=(P,E) can be
represented by a set of points P and a set of ordered pairs, E={(x,y) |
x,y ∈ P}, as Figure 2.1 illustrates. Note that in this figure, points
A,B,C,D and the dotted lines shown in the graphical representation are
not really part of the computational graph and will not be shown in
graphs encountered later. We now give some basic definitions and
results from graph theory, which will be used later.
Partial Order:
A set E ⊆ P x P of ordered pairs is said to be a partial order if
it is weakly antisymmetric (i.e., if (x,y) ∈ E, then (y,x) ∉ E for
x ≠ y), reflexive (i.e., (x,x) ∈ E for all x ∈ P) and transitive
(i.e., if (x,y),(y,z) ∈ E then (x,z) ∈ E for all x,y,z ∈ P). In
representing computational graphs we will relax the reflexivity
requirement, which implies a loop at every computational node. Every
computational graph is then a partial order.
Total Order:
In addition to the partial order, if the set E ⊆ P x P is such that
for any x,y ∈ P either (x,y) ∈ E or (y,x) ∈ E or x=y, then E is
called a total order. We will show that 1-dimensional COBs are
subgraphs with total order.
Indegree, Outdegree:
Let Iy={x | (x,y) ∈ E} and Oy={x | (y,x) ∈ E}. Then |Iy| and |Oy| are
called the indegree and the outdegree of point y respectively. In a
computational graph, the indegree of a point can only be 0, 1 or 2
since we deal only with binary operations.
Minimal, Maximal points:
Points in a graph with indegree zero are called minimal points.
Similarly, points with outdegree zero are called maximal points.
Path:
An ordered n-tuple (X1,X2,...,Xn) with (Xi,Xi+1) ∈ E for i=1,2,...,n-1
is called a path of length n-1 in the graph G=(P,E).
Acyclic Graph:
A graph with no path with identical first and last points and length
≥ 2 is called an acyclic graph. A computational graph is always
acyclic for the following simple reason. (Xi,Xi+1) ∈ E implies that the
computation of point Xi+1 requires the result from point Xi. Now if
a sequence (X1,X2,X3,...,Xn-1,Xn=X1) with (Xi,Xi+1) ∈ E for
i=1,2,3,...,n-1 exists, then it implies that the computation of Xn
requires Xn-1, which in turn requires Xn-2, and so on. Proceeding in
this manner, we conclude that computation of Xn, which is really X1,
requires X2. But since (X1,X2) ∈ E, computation of X2 requires X1,
and thus this computation cannot be carried out.
Basic Representation of a Graph:
A subgraph obtained by eliminating from the original edge set every
pair (X,Y) for which there is a path between X and Y of length ≥ 2
is known as the basic representation of the graph. Figure 2.2 shows
a graph and its basic representation.
Topological Sort:
A topological sort of a graph G=(P,E) is a graph G'=(P,E') such that G'
has only one minimal point of outdegree one, only one maximal point
of indegree one, indegree and outdegree of all points except these
are one, and a path from X to Y (X,Y ∈ P) in G implies a path from X
to Y in G'. Figure 2.3 illustrates topological sorts.
A B E D C H G J K I F L
A B E D I G C J F K H L
B A E D G C I J H K L F
B A E D I F G J C K L H
Fig. 2.3. Four topological sorts of the graph in Fig. 2.2.
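
The definitions above map directly onto a few set operations. The sketch below is illustrative only: the function names and the four-point example graph are hypothetical and are not taken from the thesis figures. It computes indegree, outdegree, minimal and maximal points, and the basic representation of a small acyclic graph stored as a set of ordered pairs.

```python
def indegree(edges, y):
    """Number of points x with (x, y) in the edge set."""
    return sum(1 for (_, b) in edges if b == y)


def outdegree(edges, y):
    """Number of points x with (y, x) in the edge set."""
    return sum(1 for (a, _) in edges if a == y)


def minimal_points(points, edges):
    return {p for p in points if indegree(edges, p) == 0}


def maximal_points(points, edges):
    return {p for p in points if outdegree(edges, p) == 0}


def has_long_path(edges, src, dst, min_length):
    """True if some path from src to dst has at least min_length edges.

    Assumes the graph is acyclic, so the search always terminates.
    """
    stack = [(src, 0)]
    while stack:
        node, length = stack.pop()
        for a, b in edges:
            if a != node:
                continue
            if b == dst and length + 1 >= min_length:
                return True
            stack.append((b, length + 1))
    return False


def basic_representation(edges):
    """Drop every edge (x, y) that is bypassed by a path of length >= 2."""
    return {(x, y) for (x, y) in edges if not has_long_path(edges, x, y, 2)}


# Hypothetical example: the edge (A, C) is implied by the path A -> B -> C.
points = {"A", "B", "C", "D"}
edges = {("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")}
print(sorted(basic_representation(edges)))
print(minimal_points(points, edges), maximal_points(points, edges))
```

The path search is exhaustive and is meant for small graphs only; a practical implementation would more likely use a reachability matrix.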
The following results from graph theory are required in this thesis
[18].
Theorem 2.1
The restriction of any partial order is itself a
partial order.
Theorem 2.2
In a finite nonempty partially ordered set, there is
at least one maximal and one minimal element.
Theorem 2.3
If graph G is acyclic, then there exists a unique
basic representation.
Theorem 2.4
A topological sort of a finite graph G=(P,E) exists if
and only if G is acyclic. Further, this topological
sort is unique if and only if E is a total order
relation, in which case this sort is the basic
representation of G.
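
Theorem 2.4 can be checked on small graphs by enumerating every topological sort. The following sketch is an illustration under stated assumptions, not a method from the thesis: the function name and the three example graphs are hypothetical, and each sort is represented simply as a list of points rather than as a graph G'.

```python
def topological_sorts(points, edges):
    """Yield every linear order of the points compatible with the edges."""
    def extend(order, remaining):
        if not remaining:
            yield list(order)
            return
        for p in sorted(remaining):
            # p may be placed next only if all its predecessors are placed.
            if all(a not in remaining for (a, b) in edges if b == p):
                order.append(p)
                yield from extend(order, remaining - {p})
                order.pop()
    yield from extend([], frozenset(points))


# A chain is a total order, so it has exactly one topological sort.
chain = {("A", "B"), ("B", "C")}
# B and C are not comparable here, so two sorts exist.
diamond = {("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")}
# A cyclic graph has no topological sort at all.
cycle = {("A", "B"), ("B", "A")}

print(list(topological_sorts(["A", "B", "C"], chain)))
print(list(topological_sorts(["A", "B", "C", "D"], diamond)))
print(list(topological_sorts(["A", "B"], cycle)))
```

Exhaustive enumeration grows factorially with the number of points, so this is useful only for examining small examples.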
2.2 Computationally Organized Block (COB)
In this section, the concept of the Computationally Organized Block
(COB) is defined. Then, the computational complexity of an algorithm is
related to the partitioning of its graph into various COBs.
Definition of an r-dimensional COB:
Let G=(P,E) be an acyclic computational graph. Let Gy=(Y,Ey) denote
the subgraph obtained by restricting the set of points to Y ⊆ P.
Then, a COB Gy' of dimension r is a subgraph Gy'=(Y,Ey'), Ey' ⊆ Ey,
with the following property:
The computation represented by Gy' can be performed in a SISD
architecture machine with r registers without any store
operations.
For later use, for every COB, we define an integer function n(.) with
domain Y such that
(i) n(A) < n(B) if there exists a path from point A to point B in
graph G.
(ii) n(A) ≠ n(B) if A ≠ B.
Since 1-dimensional COBs are paths in the original graph and a path
in an acyclic graph is a total order, the points in every 1-dimensional
COB form a total order.
Example of a COB of dimension 1:
In graph G of Fig. 2.4, the subgraph Gy'=(Y,Ey') is a COB of
dimension 1, where Y={A,B,C,D} and Ey'={(A,B),(B,C),(C,D)}. It may be
implemented as R1 ← F, R1 ← R1+G, R1 ← R1+E, R1 ← R1+J, R1 ← R1+M1.
Example of a COB of dimension 2:
In graph G of Fig. 2.5, the subgraph Gy'=(Y,Ey') is a COB of
dimension 2, where Y={A,B,C,...,H}, and Ey'={(A,B), (B,C), (B,D),
(D,E), (E,F), (E,H), (E,G)}. It may be implemented as R1 ← L,
R1 ← R1+M, M1 ← R1, R1 ← R1+K, R2 ← R1, R1 ← R1+M1, R2 ← R2+N,
R2 ← R2+O, R1 ← R2, R2 ← R2+J, R2 ← R1, R2 ← R2+P, R1 ← R1+I.
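
The 1-dimensional example can be traced on a toy register machine. This is a minimal sketch under stated assumptions: the interpreter, its opcode names, and the numeric values placed in memory locations F, G, E, J and M1 are all hypothetical; only the instruction sequence and the normalized times come from the text.

```python
def run(program, memory, n_registers=1):
    """Execute (opcode, destination, source) triples and add up the time.

    Register index 0 is left unused so that R1 is registers[1].
    """
    registers = [0] * (n_registers + 1)
    elapsed = 0
    for op, dst, src in program:
        if op == "load":        # Rn <- Mj
            registers[dst] = memory[src]
            elapsed += 2
        elif op == "store":     # Mj <- Rn
            memory[dst] = registers[src]
            elapsed += 2
        elif op == "mop+":      # Rn <- Rn + Mj
            registers[dst] += memory[src]
            elapsed += 2
        elif op == "copy":      # Rn <- Rm
            registers[dst] = registers[src]
            elapsed += 1
        elif op == "rop+":      # Rn <- Rn + Rm
            registers[dst] += registers[src]
            elapsed += 1
        else:
            raise ValueError(f"unknown opcode: {op}")
    return registers, elapsed


# Hypothetical operand values for the memory locations named in the example.
memory = {"F": 1, "G": 2, "E": 3, "J": 4, "M1": 5}

# R1 <- F, R1 <- R1+G, R1 <- R1+E, R1 <- R1+J, R1 <- R1+M1
cob_program = [
    ("load", 1, "F"),
    ("mop+", 1, "G"),
    ("mop+", 1, "E"),
    ("mop+", 1, "J"),
    ("mop+", 1, "M1"),
]

registers, elapsed = run(cob_program, memory)
print(registers[1], elapsed)
```

The trace uses one Load followed by one Mop(+) per point of the COB and no Store, which matches the defining property of a COB stated above.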
2.3 Complexity of 1-Register Implementation
As can be noted from Fig. 2.4, a one register COB is a total order
and, except for the minimal (first) point, which needs to be evaluated
through a Load and a Mop(+), all other points in the COB are computed
only through a Mop(+) each. Similarly, only the maximal (last) point and
points with outdegree ≥ 2 need to be stored in the memory. If a
computational graph is covered by 1-register COBs, the complexity of the
complete graph may be obtained by summing the complexity associated with
the points in each COB. This immediately gives the following complexity
of the 1-register implementation of the total graph.
Number of Loads  = Number of COBs
Number of Mop(+) = Total number of points in the graph
Number of Stores = Number of points in the graph with outdegree ≥ 2
                   + Number of COBs with last point outdegree < 2

















Fig. 2.4. Example of a 1-dimensional COB.
Fig. 2.5. Example of a 2-dimensional COB.
From the assumptions in Chapter 1, each of these operations takes exactly
two units of time, and hence the total time complexity of computation
T = (# of Loads) + (# of Mop(+)) + (# of Stores)
  = [(total number of points in the graph)
     + (number of points with outdegree ≥ 2 in the graph)]
    + [(number of COBs) + (number of COBs with last point
       outdegree < 2)].
It should be noted here that both the terms in the first square bracket
are totally dependent on the given computational graph. On the other
hand, the terms in the second square bracket, namely, the number of COBs
and the number of COBs with last point's outdegree < 2, are dependent
upon the manner in which the COBs are chosen.
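
The counting formulas of this section can be evaluated mechanically for a given cover. The sketch below assumes a hypothetical four-point graph and a hypothetical cover by two paths; neither is taken from the thesis figures, and the function name is illustrative. Each COB is given as a list of points in path order.

```python
from collections import Counter


def one_register_complexity(edges, cobs):
    """Count Loads, Mop(+) and Stores for a 1-dimensional COB cover.

    edges: iterable of (x, y) pairs between computational points.
    cobs:  list of paths (lists of points) that partition the points.
    """
    points = [p for cob in cobs for p in cob]
    outdeg = Counter(a for a, _ in edges)   # missing points count as 0

    loads = len(cobs)
    mops = len(points)
    branching_points = sum(1 for p in points if outdeg[p] >= 2)
    plain_last_points = sum(1 for cob in cobs if outdeg[cob[-1]] < 2)
    stores = branching_points + plain_last_points

    return {
        "loads": loads,
        "mops": mops,
        "stores": stores,
        "T": loads + mops + stores,
    }


# Hypothetical graph: A feeds B and C, and both B and C feed D.
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
# Hypothetical cover by two paths; every edge between the COBs points
# from the first to the second, so the cover can be evaluated in order.
cobs = [["A", "B"], ["C", "D"]]

print(one_register_complexity(edges, cobs))
```

Here the graph-dependent bracket contributes 4 + 1 and the cover-dependent bracket contributes 2 + 2, so a cover with fewer COBs would lower only the second part.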
2.4 Algorithm for Implementation of a One Register Machine
It was shown in Section 2.3 that the time complexity of an
implementation on a 1-register machine is largely dependent upon the
number of one dimensional COBs covering the graph. In this section, we
present a heuristic algorithm which partitions the original graph into
one register COBs in a manner which minimizes the total number of COBs.
This partitioning is referred to as a 1-dimensional COB cover of
the graph. Since all points within a COB are evaluated consecutively,
computability of the implementation for the entire algorithm demands
that the graph obtained by replacing every COB by a point should still
be acyclic. The following algorithm guarantees this property of the COB
cover.
Step 1 (Initialization)
Set i = 1 and let G' = (P',E') be the Basic Representation of G.
Step 2 (Computable path determination)
Find all computable paths in G'. A path (X1,X2,...,Xt) is a
computable path if
a. X1 is a minimal point of G'.
b. (Xj,Xj+1) ∈ E', j = 1,2,...,t-1.
c. Xj has indegree one for j = 2,3,...,t.
d. Either Xt is a maximal point of G' or, for every X ∈ P'
   such that (Xt,X) ∈ E', there exists Y ∈ P' such that
   (Y,X) ∈ E' and Y ≠ Xi for i = 1,2,...,t-1.
Step 3 (Choosing a COB)
(a). If a computable path has a maximal point, choose the path as
     COB Ci = (Pi,Ei) and go to step 4. (If there is more than one
     computable path with a maximal point, one may choose any of
     them.)
(b). Generate graph G" from G' by deleting all points on all
     computable paths. Let S denote the set of minimal points
     of G". Find, if possible, computable paths V1,V2,...,Vn with
     terminal points X1,X2,...,Xn respectively such that for
     i = 1,2,...,n there exist (not necessarily distinct) Yi ∈ S
     satisfying (Xi,Yi) ∈ E' and for any X ∉ V1 ∪ V2 ∪ ... ∪ Vn,
     (X,Yi) ∉ E'. Choose the path V1 as COB Ci = (Pi,Ei) and go to
     step 4.
(c). Find computable paths V1,V2,...,Vn with terminal points
     X1,X2,...,Xn respectively such that for i = 2,3,...,n there
     exist Yi ∉ S and Y1 ∈ S satisfying (Xi,Yi), (Zi-1,Yi),
     (X1,Y1) ∈ E', where Zi is a non-terminal point of path Vi.
     Choose the path V1 as COB Ci = (Pi,Ei).
Step 4 (Deleting a COB from the graph)
Let Pi = {X1,X2,...,Xt} and Ei = {(Xj,Xj+1) | j = 1,2,...,t-1}. Modify
E' ← E' - {(X,Y) | X ∈ Pi} and P' ← P' - Pi. If P' = ∅, the procedure
ends. Otherwise, i ← i+1 and go to step 2.
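Steps 2 and 4 can be sketched in a few lines. The code below is an illustrative sketch only: the function names and the edge-list representation are assumptions, and the selection rules of step 3 are not implemented. A path is extended through indegree-one successors (condition c) and recorded once it can no longer be extended (condition d). The data are the point and edge sets of the second round of the example in Section 2.5, including the edge (M,N) that the third round of that example lists.

```python
from collections import defaultdict

def computable_paths(points, edges):
    """Step 2: list every computable path of the acyclic graph."""
    succ = defaultdict(list)
    indeg = {p: 0 for p in points}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1

    paths = []

    def extend(path):
        # Condition c: a point may join the path only if its indegree is one.
        followers = [v for v in succ[path[-1]] if indeg[v] == 1]
        if not followers:
            # Condition d: the terminal point is maximal, or each of its
            # successors also needs an input from outside the path.
            paths.append(tuple(path))
        for v in followers:
            extend(path + [v])

    for p in sorted(points):
        if indeg[p] == 0:            # condition a: start at a minimal point
            extend([p])
    return paths

def delete_cob(points, edges, cob):
    """Step 4: remove the COB's points and every edge leaving them."""
    gone = set(cob)
    return ([p for p in points if p not in gone],
            [(u, v) for u, v in edges if u not in gone])

points = list("EFGHIJKLMN")
edges = [("E", "F"), ("F", "G"), ("G", "H"), ("G", "K"), ("H", "I"),
         ("I", "N"), ("J", "K"), ("K", "L"), ("L", "M"), ("M", "N")]
round2 = computable_paths(points, edges)
points3, edges3 = delete_cob(points, edges, round2[0])
round3 = computable_paths(points3, edges3)
print(round2, round3)
```

Deleting the path (E,F,G,H,I) lowers the indegree of K and N to one, so the remaining points form a single computable path, as in the example.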
The reason for using the basic representation (as per step 1) in
the algorithm is to eliminate all extraneous edges from a given
computational graph. The edges removed by the basic representation are
those that can never be part of a computable path. This can be proved
as follows:
Let there exist an edge (A,B) and a path (A,...,C,B) of length ≥ 2 in
graph G. Suppose V = (X1,X2,...,Xn,A,B,...) is a computable path. Since
both (A,B) and (C,B) ∈ E, B uses the results of both the computations at
A and C. Thus point C should also be on the path V before point B, i.e.,
C = Xi for some i ≤ n. The total order of the points on the path implies
that there exists a path from C to A in G. But since (A,...,C,B) is also
a path in G, G has a cycle (A,...,C,...,A) and hence is not acyclic. Thus
our assumption that edge (A,B) is on a computable path is wrong.
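The edge removal can be sketched as follows, assuming that the basic representation is obtained by dropping every edge (A,B) for which some other path of length ≥ 2 leads from A to B. The function name and the set-of-pairs representation are illustrative, not taken from the thesis.

```python
def basic_representation(points, edges):
    """Drop each edge (a, b) that is bypassed by a longer path from a to b."""
    succ = {p: set() for p in points}
    for a, b in edges:
        succ[a].add(b)

    def reachable(src, dst, skip_edge):
        # Depth-first search from src to dst that never uses skip_edge.
        stack, seen = [src], {src}
        while stack:
            u = stack.pop()
            for v in succ[u]:
                if (u, v) == skip_edge or v in seen:
                    continue
                if v == dst:
                    return True
                seen.add(v)
                stack.append(v)
        return False

    return {(a, b) for a, b in edges if not reachable(a, b, (a, b))}

# The edge (K, M) is bypassed by the path (K, L, M) and is therefore removed.
reduced = basic_representation("KLM", [("K", "L"), ("L", "M"), ("K", "M")])
print(sorted(reduced))
```

For an acyclic graph the result does not depend on the order in which the edges are examined.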
Conditions a. through c. listed in step 2 of the algorithm ensure
that every path is computable. Condition d. allows one to choose the
longest possible chain of computable points as a computable path.
We now show that step 3 of the algorithm always allows one to
choose a COB. Note that if there is no path with a terminal point that
is a maximal point of G", then the graph G" is not empty and is acyclic
because of Theorem 2.1. Furthermore, the set S ≠ ∅ because of Theorem
2.2. Finally, notice that any s ∈ S has indegree 2 in G' and
indegree 0 in G". This follows from the fact that s ∈ S, being a
minimal point, has indegree 0 in G". If s had indegree 0 in G', then the
path (s) would have been a computable path and s ∉ G". Similarly, if s
had indegree 1 in G', then for some X on a computable path V, (X,s) ∈
E' and s would be on another computable path identical to V up to X and
containing s. Thus even in this case s ∉ G".
There are at least two computable paths left after eliminating those
computable paths V which have no points X ∈ V, s ∈ S such that
(X,s) ∈ E'. (The reason why there are at least two and not just one
computable path left is as follows: if the point s ∈ S gets both of its
inputs from the same computable path V in G', i.e., (Xi,s), (Xj,s) ∈ E',
i > j, with both Xi, Xj ∈ V, then there is a path of length ≥ 2 between
Xj and s, namely, the path (Xj,...,Xi,s). Therefore, the presence of the
edge (Xj,s) in G' contradicts the fact that G' is a basic
representation.)
To justify the weighing scheme outlined in step 3, suppose that the
last node of every COB is colored red. To minimize the number of COBs,
one should thus have as few red points as possible in the final graph.
All maximal points of G must be red, since COBs computing these must end
there. For this reason, if one finds a path whose last point is a
maximal point, then one may safely choose it as a COB, since no other
choice of a COB may ever save the last point of this path from being
red.
All points X of the graph for which there exists some indegree-one
point Y such that (X,Y) ∈ E' are definitely not red, since any
computable path containing X can always be extended to Y; thus, X is
never the last point of any COB. Thus the only points which may be
affected by the choice of COBs are those X ∈ P" for which every Y with
(X,Y) ∈ E' has indegree 2.
At any stage (any value of i) in the algorithm, no point Y with
indegree 2 in G' of that stage can belong to any computable path because
of condition c. of step 2. Thus a point Y ∈ P' of indegree 2 with
(X,Y), (Z,Y) ∈ E' can occur only in the following configurations:
i)   X, Z ∈ G".
ii)  X ∉ G" and X is a non-terminal point of a computable path.
     Z ∈ G".
iii) X, Z ∉ G". Neither X nor Z is a terminal point of its
     respective path V1 or V2.
iv)  X, Z ∉ G". X is a terminal point of path V1 and Z is a
     terminal point of path V2. (V1 ≠ V2, as has been shown
     earlier.)
v)   X, Z ∉ G". X is a terminal point of path V1, but Z is not a
     terminal point of path V2.
We now determine the effect on X and Z of choosing a particular path as
a COB at a given stage. In case i), choosing a particular path as a
COB at this stage clearly has no effect on the color of X and Z.
To deal with the remaining cases, note that a computable path at
any stage, if not chosen as a COB, still remains a computable path
at the next stage. There are only two exceptions to this. Firstly,
some initial portion of the path and the chosen COB may be the same. In
this case, those initial points already computed by the chosen COB will
no longer be on the path. Secondly, let X be the terminal point of the
computable path and (X,Y) ∈ E' for some indegree 2 point Y ∈ G".
The chosen COB might convert Y to a point of indegree 1. In this case,
the computable path will be extended by at least the point Y.
From the discussion above, the point X in case ii) and points X
and Z in case iii) cannot be painted red regardless of the choice of
COB. The point Z in case ii) is also obviously not affected by this
choice.
Regarding case iv), note that the choice of a computable path other
than V1 and V2 as a COB does not in any way affect paths V1 and V2.
Choosing V1 or V2 as a COB has the same effect of painting exactly one
of the points X or Z red. Thus at the present stage or at some time in
the future, one of these two points will be painted red. In this case,
one can choose either of the paths as a COB, since no other choice will
save both points from being red. The situation described in part
(b) of step 3 of the algorithm is a generalization of this case.
Finally, in case v), choosing a computable path other than V1 or
V2 has no effect on the two paths, as before. If V1 is chosen as a COB,
then X becomes a red point; however, choosing V2 as a COB reduces the
indegree of Y to one, thus implying that X will now never be red. Note
that in both cases point Z is not red, since it is not a terminal point
of any COB. One should, in this case, choose V2 as the COB to save one
red point. The situation described in part (c) of step 3 of the
algorithm is a generalization of this case.
These arguments also allow one to find bounds on the number of
1-dimensional COBs required to cover a given graph. The minimum number
of red points in a graph is equal to the number of maximal points, and
the maximum number of red points equals the number of maximal (certainly
red) points plus the number of indegree-two (potentially red) points in
the graph. Using the normalized execution times assumed in Section 1.2,
one may also get upper and lower bounds on the time complexity. For
example, in the graph of Fig. 2.6, there are only 2 maximal points and 5
points with indegree 2. Thus, for this graph,
2 ≤ Number of COBs ≤ 7.
Using the time complexity expression in Section 2.3, and the fact that
maximal points have outdegree 0, one gets the time complexity of this
graph as:
46 ≤ Time Complexity ≤ 66.
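The two bounds can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: the graph of Fig. 2.6 is taken to have 14 points and 5 points with outdegree ≥ 2 (the latter inferred from the 8 Stores and 3 COBs reported in Section 2.5), and the last point of every COB is counted as having outdegree < 2, which is the counting that reproduces the figures quoted above. The function name is illustrative.

```python
def one_register_bounds(n_points, n_outdeg_ge2, n_maximal, n_indeg2, unit=2):
    """Bounds on the number of COBs and on the 1-register time complexity."""
    min_cobs = n_maximal                 # every maximal point ends a COB
    max_cobs = n_maximal + n_indeg2      # each indegree-2 point may add one
    fixed = n_points + n_outdeg_ge2      # part that depends on the graph only
    # Assumption: each COB contributes one Load and one Store of its last point.
    t_min = unit * (fixed + 2 * min_cobs)
    t_max = unit * (fixed + 2 * max_cobs)
    return (min_cobs, max_cobs), (t_min, t_max)

cob_bounds, time_bounds = one_register_bounds(14, 5, 2, 5)
print(cob_bounds, time_bounds)
```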
2.5 Example
The following example finds an implementation of the graph G
in Fig. 2.6 on a 1-register machine.
Step 1: The basic representation of G = (P,E) is G' = (P',E') where
P' = P = {A,B,...,N} and E' = E - {(A,B),(K,M)}. Set i = 1.
Step 2: The computable paths are V1 = (A,B,C,D), V2 = (A,B,C,J), and
V3 = (A,E,F).
Step 3: Since V1 has a maximal point, it is chosen as the first COB
based on condition (a). C1 = (P1,E1) where P1 = {A,B,C,D} and
E1 = {(A,B),(B,C),(C,D)}.
Step 4: Modified P' = {E,F,...,N} and
E' = {(E,F),(F,G),(G,H),(G,K),(H,I),(I,N),(J,K),(K,L),(L,M),(M,N)}.
i ← 2.
Step 2: The computable paths are V1 = (E,F,G,H,I) and V2 = (J).
Step 3: In the present case, (J,K), (I,N), (G,K) ∈ E', I and J are
terminal points of V1 and V2 respectively, K ∈ S and N ∉ S.
Hence, based on condition (c) of step 3, the second COB C2 is
chosen as V1: C2 = (P2,E2) where P2 = {E,F,G,H,I} and
E2 = {(E,F),(F,G),(G,H),(H,I)}.
Step 4: Modified P' = {J,K,...,N} and E' = {(J,K),(K,L),(L,M),(M,N)}.
i ← 3.
Step 2: The only computable path is V1 = (J,K,L,M,N).
Step 3: The third COB C3 is chosen as V1: C3 = (P3,E3) where
P3 = {J,K,L,M,N} and E3 = {(J,K),(K,L),(L,M),(M,N)}.
The implementation of the computation of Fig. 2.6 on a one-register
machine will need (from Section 2.3) only 3 Loads, 14 Mop(+,-) and 8
Stores, requiring a total of 50 units of time. On the other hand, if
each point had been evaluated independently through a Load, a Mop(+,-)
and a Store, then one would have required 84 units of time.
CHAPTER 3
COMPUTATIONALLY ORGANIZED BLOCK: R-DIMENSION
As has been shown in Chapter 2, the number of edges between COBs
basically determines the efficiency of implementation of the algorithm.
The implementation on an r-register machine thus should be based on
cleverly formed r-dimensional COBs with as few interconnections as
possible. This would in general be a very difficult task, even for
algorithms of moderate complexity. In this thesis we adopt an approach
which allows us to design an implementation for an r-register machine
from that of an (r-1)-register machine.
In the first section of this chapter, the time complexity of the
implementation of a graph using r-dimensional COBs is derived. In
Section 3.2, an algorithm is presented to merge (r-1)-dimensional COBs
to form an r-dimensional COB cover for the graph. Using this algorithm
repeatedly, a COB cover of any dimension may be constructed. In order to
illustrate the COB merging process, a 4-point fast Fourier transform
algorithm is presented as an example in Section 3.3.
3.1 Complexity of r Register Implementation
In this section, the time complexity of an arbitrary computational
graph covered by r-dimensional COBs is derived. The derivation is
restricted to graphs whose points have outdegree at most 3. This
limitation does not impose a significant handicap for a realistic
computational graph.
Suppose the given computational graph is partitioned into r-dimensional
COBs. The following notation is used in the time complexity
derivation.
En  : number of edges outside of COBs which start from points with
      outdegree (not including the outdegree due to the edges within
      COBs) of n.
En' : number of edges outside of COBs which end at points with
      indegree (not including the indegree due to the edges within COBs)
      of n.
Pn  : number of points with outdegree n in the original graph.
Pn' : number of points with indegree n in the original graph.
The following operation counts, based on an implementation of the graph
in terms of the r-dimensional COBs, are easy to obtain.
# of Stores : E1 + E2/2 + E3/3
# of Loads  : P0' + E2'/2
# of Mop(*) : P0' + P1' + E1' + E2'/2
# of Copies : P2 + P3 - E1 - E2/2 - E3/3
# of Rop(*) : P2' - E1' - E2'/2
If all arithmetic operations are assumed to be (+,-) and the normalized
times for the various operations given in Section 1.2 are used,
Total Time = [ 4P0' + 2P1' + P2 + P2' + P3 ]
           + [ E1 + E2/2 + E3/3 + E1' + 1.5 E2' ].
The quantities in the first bracket are constants, since they are
related to the original graph. However, the quantities in the second
bracket depend upon the way the graph is partitioned into r-dimensional
COBs and are therefore related to the particular choice of an
r-dimensional COB cover. Thus, reducing the time complexity of an
implementation on a machine with r registers requires selecting an
r-dimensional COB cover for the graph which minimizes the number of
edges outside the COBs.
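The total-time expression can be checked against the individual operation counts. The sketch below assumes 2 time units for Load, Store and Mop (as in Chapter 2) and 1 unit for Copy and Rop; the latter two values are not stated in this section and are inferred here because they reproduce the bracketed expression term by term. The dictionaries and their sample values are hypothetical and do not describe any graph in the thesis.

```python
from fractions import Fraction

# Assumed normalized times; Copy = Rop = 1 is an inference, not a quotation.
TIMES = {"store": 2, "load": 2, "mop": 2, "copy": 1, "rop": 1}

def operation_counts(E, Ein, P, Pin):
    """E[n], Ein[n]: external edge counts En, En'; P[n], Pin[n]: Pn, Pn'."""
    half, third = Fraction(1, 2), Fraction(1, 3)
    stores = E[1] + half * E[2] + third * E[3]
    return {
        "store": stores,
        "load": Pin[0] + half * Ein[2],
        "mop": Pin[0] + Pin[1] + Ein[1] + half * Ein[2],
        "copy": P[2] + P[3] - stores,
        "rop": Pin[2] - Ein[1] - half * Ein[2],
    }

def total_time(E, Ein, P, Pin):
    counts = operation_counts(E, Ein, P, Pin)
    return sum(TIMES[op] * c for op, c in counts.items())

def closed_form(E, Ein, P, Pin):
    fixed = 4 * Pin[0] + 2 * Pin[1] + P[2] + Pin[2] + P[3]
    cover = (E[1] + Fraction(E[2], 2) + Fraction(E[3], 3)
             + Ein[1] + Fraction(3, 2) * Ein[2])
    return fixed + cover

# Hypothetical counts, chosen only to exercise the identity.
E, Ein = {1: 4, 2: 6, 3: 3}, {1: 5, 2: 8}
P, Pin = {2: 6, 3: 3}, {0: 2, 1: 6, 2: 10}
print(total_time(E, Ein, P, Pin), closed_form(E, Ein, P, Pin))
```

Exact fractions are used so that the terms E2/2 and E3/3 introduce no rounding.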
3.2 r-Dimensional COB Algorithm
The following algorithm may be used to obtain an r-dimensional COB cover
for a graph from an (r-1)-dimensional COB cover.
Step 1 (Initialization)
Let C' be the set of (r-1)-dimensional COBs. Assign an integer
function ni to the points in each COB Ci ∈ C' having the property that
ni(x) < ni(y), x,y ∈ Ci, iff the computation of x is done before the
computation of y. Let E" denote the set of edges in the original
graph G not included in any of the COBs in C'. Set m = 1.
Step 2 (Finding all computable paths)
A computable path is a sequence of points (COBs) of C' along with a
subset E' ⊆ E". A computable path is generated using the following four
transformations:
T1: Let Ci be the last COB of the current path. COB Cj may be
    appended to the path iff the only inputs to Cj are from COBs on
    the path, and if Ck precedes Ci on the path and, for some x ∈ Ck,
    y ∈ Ci, (x,y) ∈ E', then there should exist (z,w) ∈ E" such that
    z ∈ Ci and w ∈ Cj and ni(y) ≤ ni(z). If Cj is added to the
    path, set E' = E' ∪ {(z,w)}.
T2: COB Ck is inserted between two consecutive COBs Ci and Cj on
    the path iff the only inputs to Ck are from the COBs on the
    path up to Ci, and if for some x ∈ Ci, y ∈ Cj, (x,y) ∈ E', then
    there exists (x,z) ∈ E", z ∈ Ck. If Ck is added to the path,
    set E' = E' ∪ {(x,z)}.
T3: COB Ck is inserted between two consecutive COBs Ci and Cj on
    the path iff the only inputs to Ck are from the COBs on the
    path up to Ci, there is no edge on the path going from Ci to Cj,
    and for some x ∈ Ci, y ∈ Ck, (x,y) ∈ E", such that the function
    ni has its maximum value at x and the function nk has its
    minimum value at y. If Ck is added to the path, set
    E' = E' ∪ {(x,y)}.
T4: COB Ck is inserted between two consecutive COBs Ci and Cj on
    the path iff the only inputs to Ck are from the COBs on the
    path up to Ci, for some x ∈ Ci, y ∈ Cj, (x,y) ∈ E', function ni
    has its maximum value at x, and for some z ∈ Ci, w ∈ Ck,
    (z,w) ∈ E", such that the function nk has its minimum value at
    w. If Ck is added to the path, set E' = E' ∪ {(x,y)}.
These transformations are illustrated in Figure 3.1.
A computable path is generated as follows:
a. Set E' = ∅ and choose a COB with no input edges as the first
   point of the path.
b. Let (C1,C2,...,Ct) be the current path. Insert a COB after Ci
   in the path by applying rules T1, T2, T3 and T4 above iff no COB
   can be inserted after C1,C2,...,Ci-1.
c. The path is completed when rules T1, T2, T3 and T4 can no longer
   be applied to add COBs to that path.
Step 3 (Choosing an r-dimensional COB)
(a) For each computable path, find the number of COBs which can be
    attached to the path if input edges of attached COBs coming from
    COBs not on the path are disregarded. If there exists a path
    with 0 attachable COBs, then choose the path as the m-th COB of
    dimension r and go to step 4.
(b) Find, if possible, computable paths V1,V2,...,Vn such that more
    COBs can be attached to path Vi if input edges of attached COBs
    coming from COBs on path Vi-1 are disregarded for i = 2,...,n-1
    and more COBs may be attached to path V1 if the input edges of
    attached COBs coming from COBs on the path Vn are disregarded.
    Choose path V1 as the m-th COB of dimension r.
(c) Find computable paths V1,V2,...,Vn such that more COBs can be
    attached to path Vi if input edges of attached COBs coming from
    COBs on path Vi-1 are disregarded for i = 2,...,n-1. Choose
    path V1 as the m-th COB of dimension r and go to step 4.
[Figure: diagrams of the transformations T1 to T4. FP: the first point of a COB. LP: the last point of a COB.]
Fig. 3.1. The four basic transformations used to form computable paths.
Step 4 (Deleting an r-dimensional COB)
Delete from set E" the edges originating from the COBs on the chosen
path. If E" = ∅, then the procedure terminates; otherwise, let
m = m + 1 and go to step 2.
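The bookkeeping of steps 1 and 4 is simple enough to sketch; the path construction of step 2 and the selection rules of step 3 are not attempted here. In this illustrative sketch each COB is assumed to be given as a pair (points in computation order, edges inside the COB); the function names and the small test graph are hypothetical.

```python
def outside_edges(edges, cobs):
    """Step 1: the order function n_i and the set E'' of edges outside all COBs."""
    n = {}        # point -> position within its COB (the integer function n_i)
    cob_of = {}   # point -> index of the COB that contains it
    inside = set()
    for i, (points, cob_edges) in enumerate(cobs):
        for pos, x in enumerate(points, start=1):
            n[x] = pos            # earlier computation gets the smaller value
            cob_of[x] = i
        inside |= set(cob_edges)
    return n, cob_of, set(edges) - inside

def delete_chosen(e2, cob_of, chosen):
    """Step 4: drop from E'' every edge originating in a COB on the chosen path."""
    return {(u, v) for (u, v) in e2 if cob_of[u] not in chosen}

# Hypothetical cover: two 1-dimensional COBs joined by two outside edges.
edges = [("a", "b"), ("c", "d"), ("a", "c"), ("b", "d")]
cobs = [(["a", "b"], [("a", "b")]), (["c", "d"], [("c", "d")])]
n, cob_of, e2 = outside_edges(edges, cobs)
remaining = delete_chosen(e2, cob_of, chosen={0})
print(n, sorted(e2), remaining)
```

The procedure of step 4 terminates exactly when the set returned by `delete_chosen` is empty.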
The assignment of the integer function n(.) in step 1 ensures the
computational ordering within a COB.
The four transformations used to obtain a computable path in step 2
of the algorithm basically guarantee the computability of each path and
also ensure that each path absorbs as many edges in E" as possible. It
may also be noted that the four transformations are mutually exclusive.
T1 is the only transformation which adds a new COB at the end of the
current path. Only in T3 is the new COB inserted between two COBs on
the current path which are not connected. Transformations T2
and T4 would be identical only in the case when x is the last point of
Ci, y ∈ Cj, z is the first point of Ck, and (x,y) ∈ E', (x,z) ∈ E". But
in this case, since the only inputs to Ck are from the COBs on the
current path up to Ci, COBs Ci and Ck would not be separate COBs of
dimension (r-1).
Step 3 of this algorithm may be reasoned out in exactly the same
manner as step 3 of the algorithm for 1-dimensional COBs.
3.3 Example
In this section, implementations on various machines of the 4-point
Fast Fourier Transform (FFT) graph shown in Fig. 3.2 are sketched. The
1-dimensional COB cover of this graph shown in Fig. 3.3 is obtained by
the algorithm of Chapter 2 and used as an input for the algorithm of the
previous section. The following steps describe the formation of
2-dimensional COBs derived through the application of this algorithm.
Step 1: Graph G' = (C',E") is constructed as shown in Fig. 3.4.
Integer function ni is assigned to each point for every COB. E"
is the set of edges remaining outside of COBs in Fig. 3.4.
Steps 2 and 3 are shown in the following table for brevity.
Step 2                                                    Step 3
Computable   COB sequence   Path set E' *                 Number of
path                                                      attachable COBs
V1           C1 C2          (1,1;2,2)                     3
V2           C3             -                             1
V3           C5 C6 C9       (5,1;6,2),(6,2;9,1)           2
V4           C10            -                             1
* Notation (a,b;c,d) stands for an edge from the point b of COB a
to the point d of COB c.
There is no path with 0 attachable COBs. But path V2 may be extended by
COB C4 if the input to C4 from path V1 ((2,4;4,5)) is disregarded.
Similarly, path V4 may be extended by COB C11 if the input to C11 from
path V3 ((9,2;11,5)) is disregarded. Thus from condition (b) of step 3,
one may choose either V1 or V3 as the first COB. Let V3 be the first
2-dimensional COB.
Fig. 3.2. Computational graph of 4-point FFT.
Fig. 3.3. 1-dimensional COB cover of the 4-point FFT graph.
Fig. 3.4. Equivalent 1-register COB cover of the 4-point FFT graph.
Step 4: All edges originating from C5, C6, and C9 are deleted from E".
Steps 2 and 3:
Step 2                                                              Step 3
Computable   COB sequence      Path set E'                          Number of
path                                                                attachable COBs
V1           C3                -                                    1
V2           C1 C2             (1,1;2,2)                            3
V3           C10 C11 C14 C15   (10,1;11,2),(11,3;14,1),(14,1;15,2)  3
There is no path with 0 attachable COBs. But path V2 may be extended by
COB C13 if the input to C13 from the path V3 ((11,5;13,2)) is
disregarded. Thus from condition (b) of step 3, V3 is chosen as the
second 2-dimensional COB.
Step 4: Edges originating from C10, C11, C14 and C15 are deleted from
E".
Steps 2 and 3:
Step 2                                                              Step 3
Computable   COB sequence      Path set E'                          Number of
path                                                                attachable COBs
V1           C3                -                                    1
V2           C1 C2 C13 C20     (1,1;2,2),(2,2;13,1),(13,1;20,1)     0
V2 is chosen as the third 2-dimensional COB from condition (a) of step
3.
Step 4: Edges originating from C1, C2, C13 and C20 are deleted from E".
Steps 2 and 3:
Step 2                                                                   Step 3
Computable   COB sequence      Path set E'                               Number of
path                                                                     attachable COBs
V1           C3 C4 C12 C17     (3,1;4,2),(4,5;12,2),(4,5;17,1)           1
V2           C3 C4 C7 C8 C18   (3,1;4,2),(4,3;7,1),(7,1;8,2),(8,2;18,1)  0
V3           C3 C4 C16 C19     (3,1;4,2),(4,2;16,1),(16,1;19,1)          0
V2 is chosen as the fourth 2-dimensional COB from condition (a) of step
3.
Step 4: Edges originating from C3, C4, C7, C8 and C18 are deleted from
E".
Steps 2 and 3:
Step 2                                          Step 3
Computable   COB sequence   Path set E'         Number of
path                                            attachable COBs
V1           C12 C17        (12,1;17,1)         0
V2           C16 C19        (16,1;19,1)         0
V1 is chosen as the fifth 2-dimensional COB from condition (a) of step
3.
Step 4: Edges originating from C12 and C17 are deleted from E".
Steps 2 and 3:
Step 2                                          Step 3
Computable   COB sequence   Path set E'         Number of
path                                            attachable COBs
V1           C16 C19        (16,1;19,1)         0
V1 is chosen as the sixth 2-dimensional COB.
Step 4: After edges originating from C16 and C19 are deleted from E",
E" = ∅. Therefore the procedure terminates.
The resultant 2-dimensional COB cover is shown in Fig. 3.5. In order to
obtain a 3-dimensional COB cover, the r-register algorithm is applied to
Fig. 3.5. The result is three 3-dimensional COBs, as shown in Fig. 3.6.
Applying the r-register algorithm repeatedly, 4- to 9-register COBs are
found, as shown in Fig. 3.7.
Fig. 3.5. 2-dimensional COB cover of the 4-point FFT graph.
Fig. 3.6. 3-dimensional COB cover of the 4-point FFT graph.
Fig. 3.7. 4- through 9-dimensional COB covers of the 4-point FFT graph.
CHAPTER 4
APPLICATIONS
The intent of this chapter is to illustrate the concept of a
primitive COB and its integration with the principles developed in
earlier chapters. A primitive COB is defined and illustrated by an
example in Section 4.1. In Section 4.2, various primitive COBs
suitable for the Hadamard transform (HT), and their codes using the
algorithms developed in Chapters 2 and 3, are obtained. In Section 4.3,
HT implementations using these primitive COBs are investigated.
Sections 4.4 and 4.5 repeat this exercise for the fast Fourier transform
(FFT).
4.1 Primitive COB
Many signal processing algorithms have graphs which may be
partitioned into a set of identical subgraphs. This property greatly
simplifies the software implementation of signal processing algorithms.
As Morris illustrates in [19], automatic generation of digital signal
processing software is possible by making use of the regular structure
of the algorithm. In such software generation, a computational kernel
is identified and is used repeatedly to compute the complete algorithm.
This computational kernel is usually the smallest repeatable subgraph
possible.
A primitive COB is a computational kernel, but not necessarily the
smallest repeatable subgraph. A given graph may be covered using many
different primitive COBs. A computational graph may also be implemented
using different primitive COBs simultaneously. The following example
illustrates this idea through the implementation of a binary
computational graph of 63 points (shown in Fig. 4.1) using a set of
primitive COBs.
The procedure begins by finding a set of primitive COBs, as shown
in Fig. 4.2. The complete graph can be implemented in two different
ways. One way is to use the primitive COB of 3 points and another way
is to use the primitive COB of 7 points. The results of these two
different implementations are shown in Fig. 4.3. In addition to
different implementations, each primitive COB can be implemented on
machines with different numbers of registers to compare the execution
time for the complete graph. These implementations and their
complexities are shown in Fig. 4.4 and Table 4.1.
Table 4.1. Dependence of the complexities of two different
implementations upon the number of registers in the machine.
# of registers   Time/COB   Eta    # of COBs   Total time for the graph
Implementation using 3-point primitive COB
1                16         5.33   21          336
2                13         4.33   21          273
3                13         4.33   21          273
Implementation using 7-point primitive COB
1                36         5.14   9           324
2                30         4.29   9           270
3                27         3.86   9           243
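The derived columns of Table 4.1 follow from the time per COB by simple arithmetic. The short sketch below assumes that Eta denotes the time per point of the primitive COB, an interpretation inferred from the tabulated values rather than stated in this section; the function name is illustrative.

```python
def summarize(points_per_cob, time_per_cob, n_cobs):
    """Return (Eta, total time), with Eta taken as time per point of the COB."""
    eta = round(time_per_cob / points_per_cob, 2)
    return eta, time_per_cob * n_cobs

# (points per COB, time per COB, number of COBs) for 1, 2 and 3 registers.
three_point = [(3, 16, 21), (3, 13, 21), (3, 13, 21)]
seven_point = [(7, 36, 9), (7, 30, 9), (7, 27, 9)]
table = [summarize(*row) for row in three_point + seven_point]
print(table)
```

Both covers account for all 63 points of the graph: 21 x 3 = 9 x 7 = 63.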
An implementation of the complete graph may also be devised using
the algorithm developed earlier. The time complexity of this
implementation will generally be smaller than that of implementations
with primitive COBs. This is because the use of primitive COBs
artificially severs some links in the graph without regard for the
global implications. However, the increase in time is marginal, as
Table 4.2 shows.
Fig. 4.1. A computational graph with 63 points.
[Figure: panels "COB of 3 points" and "COB of 7 points".]
Fig. 4.2. Primitive COBs for implementation of the graph in Fig. 4.1.
[Figure: panels "1 register", "2 register" and "3 register" for each primitive COB.]
Fig. 4.3. Various implementations of 3- and 7-point primitive COBs.
[Figure: panels "Using 3 point primitive COBs" and "Using 7 point primitive COBs".]
Fig. 4.4. Cover of complete graph using 3- and 7-point primitive COBs.
Table 4.2. Comparison of implementations with and without primitive COBs.
# of registers   Time for implementation    % increase in time using COB of
                 without primitive COBs     3 points     7 points
1                316                        6.33         2.53
2                264                        3.41         2.27
3                241                        13.28        0.83
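The percentages in Table 4.2 can be recomputed from the total times of Table 4.1 and the times without primitive COBs. The helper below is an illustrative sketch with a made-up name; it only restates the arithmetic.

```python
def percent_increase(with_primitive, without_primitive):
    """Relative extra time, in percent, caused by using primitive COBs."""
    return round(100 * (with_primitive - without_primitive) / without_primitive, 2)

without = {1: 316, 2: 264, 3: 241}        # Table 4.2
three_point = {1: 336, 2: 273, 3: 273}    # total times from Table 4.1
seven_point = {1: 324, 2: 270, 3: 243}

increase = {r: (percent_increase(three_point[r], without[r]),
                percent_increase(seven_point[r], without[r]))
            for r in without}
print(increase)
```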
One may note that increasing the number of registers generally reduces
the time gap between the implementations with and without primitive
COBs. The only exception to this occurs when the primitive COB is too
small to fully utilize all the available registers.
In an actual software implementation, the time spent on
decision-making and arithmetic operations for loop control is assumed
to be eliminated by the use of in-line code. Therefore, whether the
primitive COB approach is used or not, the code sizes are approximately
the same. However, designing large, unstructured, error-free
software for an algorithm may be a time-consuming task without primitive
COBs. With primitive COBs, the software can be generated automatically
and with ease, since the portion of the software related to the primitive
COB can be used repeatedly to form the complete code.
4.2 Hadamard Transform (HT)
In this section, a description of efficient 1-, 2-, and 3-register
implementations for primitive COBs of 2X2, 4X3, 8X4, and 16X5 points
useful for computation of the HT is presented. These primitive COBs are
then used to compute a 2^12-point HT.
4.2.1 1-Register Implementation of Primitive COBs
Primitive COBs of 2X2, 4X3, 8X4, and 16X5 points which are
used here for implementing the Hadamard transform are shown in Fig. 4.5.
These primitive COBs were chosen for their superior performance (with
reference to their time complexity) from many different primitive COBs
that might be useful for implementing a Hadamard transform. Figure 4.5
also shows a 1-dimensional COB cover obtained through the algorithm of
Chapter 2 and lists the associated codes for a machine with only one
accumulator. Using the formula derived in Chapter 2, one obtains the
total number of operations in the case of a 2^n-length HT as:
Total # of operations = # of COBs + # of points + # of points with
                        outdegree ≥ 2 + # of COBs with terminal
                        point outdegree < 2
                      = (2+n) 2^(n-1) + (n+1) 2^n + n 2^n + 2^n
                      = (5n+6) 2^(n-1).
Since the execution times Tload, Tmop(+,-), and Tstore are assumed to be
2 units each, the total execution time is
Total Time = (5n+6) 2^(n-1) x 2 = (5n+6) 2^n.
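The closed form can be checked by summing the four terms directly. The sketch below uses illustrative function names and simply restates the counts for a 2^n-length HT: (n+2)·2^(n-1) COBs, (n+1)·2^n points, n·2^n points with outdegree ≥ 2, and 2^n COBs whose terminal point has outdegree < 2.

```python
def ht_operation_count(n):
    """Loads + Mops + Stores for a 1-register, 2**n-length Hadamard transform."""
    cobs = (n + 2) * 2 ** (n - 1)      # one Load per COB
    points = (n + 1) * 2 ** n          # one Mop per point
    fanout_points = n * 2 ** n         # points with outdegree >= 2
    terminal_cobs = 2 ** n             # COBs ending at an output point
    return cobs + points + fanout_points + terminal_cobs

def ht_total_time(n, unit=2):
    """Total time with every operation taking `unit` time units."""
    return unit * ht_operation_count(n)

for n in range(1, 13):
    assert ht_operation_count(n) == (5 * n + 6) * 2 ** (n - 1)
print(ht_total_time(12))
```

For n = 12, the case used in this chapter, the formula gives (5·12 + 6)·2^12 time units.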
2X2 primitive COB
[Figure: computation graph and code; the code listing is illegible.]
Rn: n-th register
In: n-th input data from memory
On: n-th output data to memory
Tn: n-th temporary scratch pad memory location
4X3 primitive COB
[Figure: computation graph.]
code:
R1 ← I3
R1 ← R1 + I7
T0 ← R1
R1 ← I1
R1 ← R1 + I5
T1 ← R1
R1 ← R1 + T0
T2 ← R1
R1 ← T1
R1 ← R1 - T0
T0 ← R1
R1 ← I2
R1 ← R1 + I6
T1 ← R1
R1 ← I0
R1 ← R1 + I4
T3 ← R1
R1 ← R1 + T1
T4 ← R1
R1 ← R1 + T2
O0 ← R1
R1 ← T4
R1 ← R1 - T2
O1 ← R1
R1 ← T3
R1 ← R1 - T1
T1 ← R1
R1 ← R1 + T0
O2 ← R1
R1 ← T1
R1 ← R1 - T0
O3 ← R1
Fig. 4.5a. 1-register implementation of HT.






[Figure: 8X4 primitive COB, computation graph and 1-register code]
Fig. 4.5b. 1-register implementation of HT (continued).
[Figure: 16X5 primitive COB computation]
Fig. 4.5c. 1-register implementation of HT (continued).
[Figure: 16X5 primitive COB code]
Fig. 4.5d. 1-register implementation of HT (continued).
We denote the time complexity of a COB implementation per point by
Eta. Eta is a measure of the efficiency of the implementation. A
smaller Eta indicates a better implementation. In a 1-register
implementation of a primitive COB of $2^n \times (n+1)$ points,
Eta $= 5 + 1/(n+1)$.
4.2.2 2-Register Implementation of Primitive COBs
Primitive COBs shown earlier may also be covered using 2-dimensional
COBs and implemented efficiently on a machine using 2 accumulators.
The results, obtained from the algorithm of Chapter 3, are
shown in Fig. 4.6. To compute the execution time of these
implementations, an inspection of their structure is in order. The odd
and even indexed points of the first n-1 stages of these highly regular
implementations are mere duplicates of one lower size implementation.
The last stage of the implementation is made up of three different types
of butterflies shown in Fig. 4.7. These butterflies occur in a regular
cycle of Types 1, 2, 1, 3, 1, 2, 1, 3, .... A Type-1 butterfly computation
involves only one Load, but two Mop(+) and two Stores. Its complexity
(complexity of computing the two end-points) is thus 10 time units.
A Type-2 butterfly involves a Rop(+), a Mop(+) and two Stores. It also
saves the storage of one of the source points. Its effective complexity
is thus 5 time units. Finally, the Type-3 butterfly involves two Mop(+)
and two Stores, but it converts the Store of a source point into a Copy,
thus having an effective complexity of 7 time units.
From the above discussion, the time complexity of a $2^n \times (n+1)$ point
primitive COB, C(n), is given by:
$$
C(n) = 2C(n-1) + 10 \times 2^{n-2} + 5 \times 2^{n-3} + 7 \times 2^{n-3}; \quad n > 2.
$$
[Figure: 2X2 primitive COB, computation graph]

2X2 primitive COB, 2-register code:
  R1 ← I1
  R1 ← R1 + I3
  R2 ← I0
  R2 ← R2 + I2
  T0 ← R2
  R2 ← R2 + R1
  O0 ← R2
  R1 ← R1 - T0
  O1 ← R1

[Figure: 4X3 primitive COB, computation graph]

4X3 primitive COB, 2-register code:
  R1 ← I3
  R1 ← R1 + I7
  R2 ← I1
  R2 ← R2 + I5
  T0 ← R2
  R2 ← R2 + R1
  T1 ← R2
  R1 ← R1 - T0
  T0 ← R1
  R1 ← I2
  R1 ← R1 + I6
  R2 ← I0
  R2 ← R2 + I4
  T2 ← R2
  R2 ← R2 + R1
  R1 ← R1 - T2
  T2 ← R2
  R2 ← R2 + T1
  O0 ← R2
  R2 ← T2
  R2 ← R2 - T1
  O1 ← R2
  R2 ← R1
  R1 ← R1 + T0
  O2 ← R1
  R2 ← R2 - T0
  O3 ← R2

Fig. 4.6a. 2-register implementation of HT.
[Figure: 8X4 primitive COB, computation graph and 2-register code]
Fig. 4.6b. 2-register implementation of HT (continued).
[Figure: 16X5 primitive COB computation]
Fig. 4.6c. 2-register implementation of HT (continued).
[Figure: 16X5 primitive COB code]
Fig. 4.6d. 2-register implementation of HT (continued).
[Figure: butterfly Types 1, 2 and 3]
Fig. 4.7. The three types of butterfly implementations prevalent
in the 2-register implementation of HT.
The solution of this difference equation yields the following closed
form expression for the time complexity of the two register
implementation.
$$
C(n) = (4n + 4.75)2^{n}; \quad n > 2.
$$
Also in this case, Eta $= 4 + 0.75/(n+1)$.
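The step from the difference equation to the closed form can be verified by direct iteration. The sketch below is an illustration rather than the author's procedure: the function names are hypothetical, and the seed value C(2) = 51 is taken from the 2-register time listed for the 4X3 primitive COB in Table 4.3.

```python
def c_recurrence(n: int) -> int:
    """2-register time of the 2**n x (n+1) COB from the difference equation."""
    if n < 2:
        raise ValueError("the recurrence is seeded at n = 2")
    if n == 2:
        return 51  # seed: 2-register time of the 4X3 COB in Table 4.3
    # Last stage: Type-1, Type-2 and Type-3 butterflies cost 10, 5 and 7 units.
    last_stage = 10 * 2 ** (n - 2) + 5 * 2 ** (n - 3) + 7 * 2 ** (n - 3)
    return 2 * c_recurrence(n - 1) + last_stage


def c_closed(n: int) -> int:
    """Closed form (4n + 4.75) * 2**n, written as (16n + 19) * 2**(n - 2)."""
    return (16 * n + 19) * 2 ** (n - 2)


def eta(n: int) -> float:
    """Time per point: C(n) divided by the 2**n * (n + 1) points of the COB."""
    return c_closed(n) / ((n + 1) * 2 ** n)


if __name__ == "__main__":
    for n in range(2, 6):
        assert c_recurrence(n) == c_closed(n)
        print(n, c_closed(n), round(eta(n), 4))
```

The recurrence and the closed form agree, giving 134 and 332 time units for n = 3 and n = 4, the 2-register times of the 8X4 and 16X5 COBs in Table 4.3.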
4.2.3 3-Register Implementation of Primitive COBs
The 3-dimensional COB cover of the primitive COBs under
consideration and the associated implementations on a machine with 3
accumulators are shown in Fig. 4.8.
4.2.4 0-Register and Infinite-Register Implementations
If an implementation computes each graph point independently,
without any regard for the graph structure, we call it a 0-register
implementation here. Each HT computational point is calculated by first
loading an operand, then adding to or subtracting from it an operand
located in memory, and storing the result back into the memory, taking a
total of 6 units of time. Each computational point, in this case, is a
COB. Since a 0-register implementation is constructed without any
effort to minimize memory related operations, its execution time is the
worst possible.
Since every computational point takes 6 units of time, the total time
for a computational graph may be obtained by merely multiplying the
number of computational points in the graph by 6.






[Figure: 2X2 and 4X3 primitive COBs, computation graphs and 3-register code]
Fig. 4.8a. 3-register implementation of HT.





[Figure: 8X4 primitive COB, computation graph and 3-register code]
Fig. 4.8b. 3-register implementation of HT (continued).
16X5 primitive COB computation
Fig. 4.8c. 3-register implementation of HT (continued).
[Figure: 16X5 primitive COB code]
Fig. 4.8d. 3-register implementation of HT (continued).
The time complexity and the Eta value for the 0-register
implementation of a $2^n \times (n+1)$ point primitive COB are given by:
$$
\text{Total time} = 6(n+1)2^{n}, \qquad \text{Eta} = 6.
$$
When an infinite number of registers is available, three different
types of butterfly computations exist. Each initial stage butterfly is
computed using 2 Loads, 1 Copy, and 2 Mop(+,-). Each final stage
butterfly is computed using 1 Copy, 2 Rop(+,-), and 2 Stores. Each of
the remaining butterflies is computed using 1 Copy and 2 Rop(+,-).
These computations are shown in Fig. 4.9. Thus, for a $2^n$ length
primitive COB, $2^n$ Loads, $2^n$ Mop(+,-), $n2^n$ Rop(+,-), $n2^{n-1}$
Copies, and $2^n$ Stores are required. Accordingly, the total time for a
$2^n \times (n+1)$ point primitive COB implementation on an infinite
accumulator machine is:
$$
\text{Total time} = 6(2^{n}) + 3n(2^{n-1}), \qquad \text{Eta} = 1.5 + 4.5/(n+1).
$$
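As a check on the infinite-register expression, the operation counts stated above can be summed directly. In this sketch the names are illustrative, and the unit costs for Rop and Copy (1 time unit each) are an inference from the term $3n(2^{n-1})$ rather than a value stated in this section; Load, Mop and Store cost 2 units as in Section 4.2.1.

```python
# Assumed unit costs: 2 for memory operations, 1 for register-only operations.
COST = {"Load": 2, "Mop": 2, "Store": 2, "Rop": 1, "Copy": 1}


def infinite_register_counts(n: int) -> dict:
    """Operation counts for a 2**n x (n+1) COB with unlimited registers."""
    return {
        "Load": 2 ** n,
        "Mop": 2 ** n,
        "Store": 2 ** n,
        "Rop": n * 2 ** n,
        "Copy": n * 2 ** (n - 1),
    }


def infinite_register_time(n: int) -> int:
    """Total time: each count weighted by its unit cost."""
    counts = infinite_register_counts(n)
    return sum(COST[op] * k for op, k in counts.items())


def infinite_register_eta(n: int) -> float:
    """Time per computational point."""
    return infinite_register_time(n) / ((n + 1) * 2 ** n)


if __name__ == "__main__":
    for n in range(1, 5):
        assert infinite_register_time(n) == 6 * 2 ** n + 3 * n * 2 ** (n - 1)
        print(n, infinite_register_time(n), round(infinite_register_eta(n), 2))
```

For n = 1 to 4 the totals are 15, 36, 84 and 192 time units, matching the last row of each block of Table 4.3.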
4.2.5 Consolidation of Results
Comparing the Eta values of 1-, 2- and infinite-register
implementations with that of the 0-register implementation, one can note
that for large values of n, by merely structuring the order of
computation, one can obtain savings of 16.7%, 33% and 75% respectively,
in the HT execution time compared to the non-structured 0-register case.
Table 4.3 lists the complexities of various implementations of
primitive COBs.
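The quoted savings follow from the large-n limits of the Eta expressions derived in Sections 4.2.1, 4.2.2 and 4.2.4. Written out as a worked step (the notation $\text{Eta}_r$ for the r-register value is introduced here only for this illustration):
$$
\begin{aligned}
\lim_{n\to\infty}\text{Eta}_1 &= \lim_{n\to\infty}\left(5 + \frac{1}{n+1}\right) = 5, &\quad 1 - \frac{5}{6} &\approx 16.7\%,\\
\lim_{n\to\infty}\text{Eta}_2 &= \lim_{n\to\infty}\left(4 + \frac{0.75}{n+1}\right) = 4, &\quad 1 - \frac{4}{6} &\approx 33.3\%,\\
\lim_{n\to\infty}\text{Eta}_\infty &= \lim_{n\to\infty}\left(1.5 + \frac{4.5}{n+1}\right) = 1.5, &\quad 1 - \frac{1.5}{6} &= 75\%.
\end{aligned}
$$
Each saving is measured against the 0-register value Eta = 6.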
Table 4.3 Complexities of various implementations of
HT primitive COBs.

COB size = 2 X 2        # of computational points: 4
  # R     Load  Mop(+,-)  Store  Copy  Rop(+,-)  Time  Eta
  0          4         4      4     0         0    24  6.0
  1          3         4      4     0         0    22  5.5
  2          2         3      3     0         1    17  4.25
  3-∞        2         2      2     1         2    15  3.75

COB size = 4 X 3        # of computational points: 12
  # R     Load  Mop(+,-)  Store  Copy  Rop(+,-)  Time  Eta
  0         12        12     12     0         0    72  6
  1          8        12     12     0         0    64  5.33
  2          5        10      9     1         2    51  4.25
  3          4         8      6     4         4    44  3.67
  5-∞        4         4      4     4         8    36  3.00

COB size = 8 X 4        # of computational points: 32
  # R     Load  Mop(+,-)  Store  Copy  Rop(+,-)  Time  Eta
  0         32        32     32     0         0   192  6
  1         20        32     32     0         0   168  5.25
  2         12        27     24     3         5   134  4.18
  3         12        18     16     8        14   114  3.56
  9-∞        8         8      8    12        24    84  2.63

COB size = 16 X 5       # of computational points: 80
  # R     Load  Mop(+,-)  Store  Copy  Rop(+,-)  Time  Eta
  0         80        80     80     0         0   480  6
  1         48        80     80     0         0   416  5.2
  2         28        68     60     8        12   332  4.15
  3         24        50     43    20        30   284  3.55
  17-∞      16        16     16    32        64   192  2.40

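The Time column of Table 4.3 can be recomputed from the operation counts. The sketch below is illustrative only: the unit costs for Copy and Rop (1 unit each) are inferred from the table totals rather than quoted from this section, the data layout is an assumption of this example, and the Store count for the 2-register 4X3 row is taken as 9, the value implied by its listed time of 51.

```python
# Assumed unit costs: 2 for Load/Mop/Store (Section 4.2.1), 1 for Copy/Rop.
COST = {"load": 2, "mop": 2, "store": 2, "copy": 1, "rop": 1}

# Rows: (registers, load, mop, store, copy, rop, listed_time)
TABLE_4_3 = {
    "2X2": [("0", 4, 4, 4, 0, 0, 24), ("1", 3, 4, 4, 0, 0, 22),
            ("2", 2, 3, 3, 0, 1, 17), ("3+", 2, 2, 2, 1, 2, 15)],
    "4X3": [("0", 12, 12, 12, 0, 0, 72), ("1", 8, 12, 12, 0, 0, 64),
            ("2", 5, 10, 9, 1, 2, 51), ("3", 4, 8, 6, 4, 4, 44),
            ("5+", 4, 4, 4, 4, 8, 36)],
    "8X4": [("0", 32, 32, 32, 0, 0, 192), ("1", 20, 32, 32, 0, 0, 168),
            ("2", 12, 27, 24, 3, 5, 134), ("3", 12, 18, 16, 8, 14, 114),
            ("9+", 8, 8, 8, 12, 24, 84)],
    "16X5": [("0", 80, 80, 80, 0, 0, 480), ("1", 48, 80, 80, 0, 0, 416),
             ("2", 28, 68, 60, 8, 12, 332), ("3", 24, 50, 43, 20, 30, 284),
             ("17+", 16, 16, 16, 32, 64, 192)],
}
POINTS = {"2X2": 4, "4X3": 12, "8X4": 32, "16X5": 80}


def row_time(load: int, mop: int, store: int, copy: int, rop: int) -> int:
    """Execution time of one implementation from its operation counts."""
    return (COST["load"] * load + COST["mop"] * mop + COST["store"] * store
            + COST["copy"] * copy + COST["rop"] * rop)


def mismatches() -> list:
    """Return (cob, registers) for every row whose listed time disagrees."""
    return [(cob, regs)
            for cob, rows in TABLE_4_3.items()
            for regs, load, mop, store, copy, rop, listed in rows
            if row_time(load, mop, store, copy, rop) != listed]


if __name__ == "__main__":
    assert mismatches() == []
    for cob, rows in TABLE_4_3.items():
        for regs, *counts, listed in rows:
            print(cob, regs, listed, round(listed / POINTS[cob], 2))
```

Under these unit costs every listed time is reproduced, and Eta is the listed time divided by the number of computational points.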
As can be seen from Table 4.3, choosing a larger primitive COB
improves the efficiency of the algorithm. But in practice, one should
consider both the improvement in time and the increase in code (program)
size to determine the appropriate primitive COB. A primitive COB should
be small enough so that the code for it can be generated without
difficulty. At the same time, it should be large enough to utilize all
available registers efficiently. The following example illustrates the
choice of a primitive COB in a machine with three accumulators.
Example: Suppose the target CPU contains 3 accumulators. In order to
fully utilize all available registers, 3-register implementations of
primitive COBs should be used. Based on the parameters listed in Table
4.4, an appropriate primitive COB may be chosen as follows.

Table 4.4 Change in the values of Eta for various primitive COBs
  primitive COB   Time/COB   Eta    % decrease in Eta from previous line
  2X2                   15   3.75   ...
  4X3                   44   3.67   2.13
  8X4                  114   3.56   3.00
  16X5                 284   3.55   0.28

The code size for a COB is directly proportional to the execution time
for the COB. Thus as we go down the COBs listed in Table 4.4, the code
size multiplies by a factor of approximately 2.5 to 3 each time. An
inspection of Table 4.4 now shows that a primitive COB of 8X4 points is
probably the best in these circumstances. If the size of this COB is
further increased, it has a marginal effect on Eta but the code size
increases by 149%.
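The trade-off in this example can be made explicit with a few lines of arithmetic. This sketch is only an illustration of the reasoning: the variable and function names are hypothetical, and code size is taken to be proportional to Time/COB, as the text assumes.

```python
# 3-register entries of Table 4.4: (primitive COB, time per COB, Eta)
TABLE_4_4 = [("2X2", 15, 3.75), ("4X3", 44, 3.67),
             ("8X4", 114, 3.56), ("16X5", 284, 3.55)]


def eta_decrease_percent(prev_eta: float, eta: float) -> float:
    """Percentage decrease in Eta relative to the previous table line."""
    return (prev_eta - eta) / prev_eta * 100


def code_growth_percent(prev_time: int, time: int) -> float:
    """Percentage increase in code size, assumed proportional to Time/COB."""
    return (time / prev_time - 1) * 100


if __name__ == "__main__":
    for (_, t0, e0), (cob, t1, e1) in zip(TABLE_4_4, TABLE_4_4[1:]):
        print(f"{cob}: Eta falls {eta_decrease_percent(e0, e1):.2f}%, "
              f"code grows {code_growth_percent(t0, t1):.0f}%")
```

Going from the 8X4 to the 16X5 primitive COB lowers Eta by about 0.28% while the code grows by about 149%, which is why the 8X4 COB is preferred here.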
4.3 Implementation of a complete HT through primitive COBs
This section discusses the issues involved in the implementation of
a complete graph through an example of a $2^{12}$ length HT. If the
primitive COBs of types discussed earlier with $2^t \times (t+1)$ points
are used to cover a $2^n$ length HT, then a total of $n2^{n-t}/(t+1)$
primitive COBs would be
required. Thus the odd divisors of (t+1) should divide n. In the
present case, it rules out the 16X5 primitive COB. The 2048 X 12 point
primitive COB also need not be considered because of its excessive code
size.
If one uses the 2X2 primitive COBs, the resultant implementation has
six computing stages, each with 2048 COBs. All six stages may be made
identical by rearranging the graph of HT [20],[21]. Thus the code for
each stage is identical except for the memory locations of input and
output data. Further, every pair of consecutive stages may have an
in-place code. Therefore, software for the entire $2^{12}$ length HT may
consist of the code for the first 2 stages placed in a loop, thus
reducing the code size by approximately 66.7%.
Use of 4X3 primitive COBs similarly results in 4 identical stages,
each with 1024 COBs. Use of a loop reduces the code size by
approximately 50%.
Use of 8X4 primitive COBs implies 3 identical stages each with 512
COBs. Use of a loop is not beneficial in this case.
Finally, if 32X6 primitive COBs are used for the implementation,
there are only 2 identical stages each with 128 COBs. As in the earlier
case, a loop is not useful.
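The stage and COB counts quoted above can be reproduced with a short calculation. The sketch below is a hedged illustration: the function names are hypothetical, the COB count uses n * 2**(n - t) / (t + 1), the form that agrees with the counts quoted in this section, and the loop rule (keep the code for one in-place pair of stages when the number of stages is even and greater than 2) is an assumption read off the four cases discussed, not a rule stated by the thesis.

```python
def cob_count(n: int, t: int) -> int:
    """Number of 2**t x (t+1) primitive COBs covering a 2**n length HT."""
    total, remainder = divmod(n * 2 ** (n - t), t + 1)
    if remainder:
        raise ValueError(f"a {2 ** t}X{t + 1} COB cannot cover a 2^{n} length HT")
    return total


def stage_layout(n: int, t: int) -> tuple:
    """(number of identical stages, COBs per stage); needs (t+1) to divide n."""
    if n % (t + 1):
        raise ValueError("stages of identical COBs need (t + 1) to divide n")
    return n // (t + 1), 2 ** (n - t)


def loop_saving(stages: int) -> float:
    """Fraction of code removed by looping over one in-place pair of stages."""
    if stages > 2 and stages % 2 == 0:
        return 1 - 2 / stages
    return 0.0  # odd stage count, or only one pair: a loop does not help


if __name__ == "__main__":
    for t in (1, 2, 3, 5):
        stages, per_stage = stage_layout(12, t)
        print(f"{2 ** t}X{t + 1}: {stages} stages x {per_stage} COBs, "
              f"loop saves {loop_saving(stages):.1%}")
```

For n = 12 this gives 6 x 2048, 4 x 1024, 3 x 512 and 2 x 128 COBs for the 2X2, 4X3, 8X4 and 32X6 primitive COBs, with loop savings of about 66.7%, 50%, 0% and 0%; the 16X5 COB (t = 4) is rejected because 5 does not divide 12.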
The execution time of the complete HT depends upon both the size
of the primitive COB used and the number of registers available to
implement each primitive COB. Table 4.5 and Fig. 4.9 display the results
obtained. While calculating the code sizes, the possibility of using
the in-place algorithm is kept in mind. One may conclude from these
that the computational time of HT is largely independent of the choice
of primitive COB. Also, using a machine with more than three registers
[Figure: time (in thousands of time units) versus number of registers (0 to 3), one curve per COB size]
Fig. 4.9. Time complexity of various implementations
of $2^{12}$ length HT.
is not justified in this case. A good trade off between the time and
the code size is obtained when one uses a 3-register machine and a 2X2
primitive COB.
Table 4.5 ($2^{12}$ length HT)
  Prim. COB   # of prim. COBs   Time per    Total time   Eta of      Code size
              within HT         prim. COB   for HT       prim. COB   for HT
0-register implementation
  2X2                 12288          24       294912      6.00          98304
  4X3                  4096          72       294912      6.00         147456
  8X4                  1536         192       294912      6.00         294912
  32X6                  256        1152       294912      6.00         294912
  2048X12                 2      147456       294912      6.00         294912
1-register implementation
  2X2                 12288          22       270336      5.50          90112
  4X3                  4096          64       262144      5.33         131072
  8X4                  1536         168       258048      5.25         258048
  32X6                  256         992       253952      5.17         253952
  2048X12                 2      124928       249856      5.08         249856
2-register implementation
  2X2                 12288          17       208896      4.25          69632
  4X3                  4096          51       208896      4.25         104448
  8X4                  1536         134       205824      4.18         205824
  32X6                  256         792       202752      4.13         202752
  2048X12                 2       99846       199692      4.06         199692
3-register implementation
  2X2                 12288          15       184320      3.75          61440
  4X3                  4096          44       180224      3.67          90112
  8X4                  1536         114       175104      3.56         175104

4.4 Fast Fourier Transform (FFT)
In this section, two primitive COBs for FFT are presented and
implemented using 0 to infinite number of registers. They are then
applied to implement a $2^8$ length FFT.
4.4.1 2-Point Primitive COB
The graph shown in Fig. 4.10 computes two complex points in the FFT
graph and hence is termed the 2-point primitive COB. The 1- and
2-register implementations and the associated codes are shown in Fig.
4.11. The implementation of this small primitive COB does not change if
the number of registers is increased beyond 2.
4.4.2 4-Point Primitive COB
The graph of a 4-point primitive COB is shown in Fig. 3.2.
Figures 3.3 through 3.11 then show its 1- through 9-register
implementations. A further increase in the number of registers does not
affect the implementation of this COB.
4.4.3 Consolidation of Results
The complexities of the two FFT COBs and, in particular, their
dependence on the number of registers in the machine are shown in Table
4.6. These results indicate that while using the 2-point COB, a
2-register machine will perform optimally, and even for the 4-point COB
increasing the number of registers beyond 5 has very little effect on
the time complexity.
Fig. 4.10. Computational graph of 2-point FFT.
Table 4.6 Complexities of various implementations of
FFT primitive COBs.

COB size = 2 X 1        # of complex computational points = 2
  R     Load  Mop(+,-)  Mop(x)  Rop(+,-)  Copy  Store  Time  Eta      % dec. in Eta
  0       10         6       4         0     0     10    68  34       ...
  1        6         6       4         0     0      8    56  28       17.65
  2-∞      4         4       4         2     2      4    44  22       21.43

COB size = 4 X 2        # of complex computational points = 8
  R     Load  Mop(+,-)  Mop(x)  Rop(+,-)  Copy  Store  Time  Eta      % dec. in Eta
  0       40        24      16         0     0     40   272  34.000   ...
  1       20        24      16         0     0     32   216  27.000   20.59
  2       13        17      16         7     6     19   175  21.875   18.98
  3       12        14      16        10     8     16   166  20.750    5.14
  4       12        11      16        13     8     14   159  19.875    4.22
  5       12         8      16        16     8     12   152  19.000    4.40
  6       11         8      16        16     9     11   149  18.625    1.97
  7       10         8      16        16    10     10   146  18.250    2.01
  8        9         8      16        16    11      9   143  17.875    2.05
  9-∞      8         8      16        16    12      8   140  17.500    2.10

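Table 4.6 does not state the cost of a memory multiplication, Mop(x), but it can be inferred from the totals. The following sketch is an illustration under stated assumptions: Load, Store and Mop(+,-) cost 2 units (Section 4.2.1), Rop and Copy are taken to cost 1 unit each, and the function and variable names are hypothetical.

```python
# Rows: (registers, load, mop_addsub, mop_mul, rop, copy, store, listed_time)
ROWS_2X1 = [("0", 10, 6, 4, 0, 0, 10, 68), ("1", 6, 6, 4, 0, 0, 8, 56),
            ("2+", 4, 4, 4, 2, 2, 4, 44)]
ROWS_4X2 = [("0", 40, 24, 16, 0, 0, 40, 272), ("1", 20, 24, 16, 0, 0, 32, 216),
            ("2", 13, 17, 16, 7, 6, 19, 175), ("3", 12, 14, 16, 10, 8, 16, 166),
            ("4", 12, 11, 16, 13, 8, 14, 159), ("5", 12, 8, 16, 16, 8, 12, 152),
            ("6", 11, 8, 16, 16, 9, 11, 149), ("7", 10, 8, 16, 16, 10, 10, 146),
            ("8", 9, 8, 16, 16, 11, 9, 143), ("9+", 8, 8, 16, 16, 12, 8, 140)]


def infer_mul_cost(row: tuple) -> int:
    """Solve the listed time of one row for the cost of a single Mop(x)."""
    _, load, addsub, mul, rop, copy, store, listed = row
    # Time spent on everything except the multiplications.
    other = 2 * load + 2 * addsub + 2 * store + rop + copy
    cost, remainder = divmod(listed - other, mul)
    if remainder:
        raise ValueError("listed time is not consistent with an integer cost")
    return cost


if __name__ == "__main__":
    costs = {infer_mul_cost(row) for row in ROWS_2X1 + ROWS_4X2}
    print("inferred Mop(x) cost:", costs)
```

Every row of the table yields the same value, 4 time units per Mop(x), which suggests the table was built with a multiplication costing twice as much as an addition.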
4.5 Implementation of $2^8$ Length FFT
An implementation of a $2^8$ length FFT using 2-point primitive COBs
results in 8 identical computational stages of 128 COBs each. As for
the case of HT, a pair of these stages may be calculated in-place
[20,21]. The size of code may therefore be reduced by 75% by using the
loop as described in Section 4.3. Similarly, use of 4-point primitive
COBs produces 4 identical stages of 64 COBs each. Use of a loop, in
this case, will reduce the code size by 50%. Table 4.7 and Fig. 4.11
display various factors affected by the choice of a particular
implementation.
Table 4.7 ($2^8$ length FFT)
  Prim. COB   # of prim. COBs   Time per    Total time   Eta of      Code size
              within FFT        prim. COB   for FFT      prim. COB   for FFT
0-register implementation
  2X1                  1024          68        69632     34             17408
  4X2                   256         272        69632     34             34816
1-register implementation
  2X1                  1024          56        57344     28             14336
  4X2                   256         216        55296     27             27648
2-register implementation
  2X1                  1024          44        45056     22             11264
[rows missing]
  2X1                  1024          44        45056     22             11264
  4X2                   256         159        40704     19.875         20352
[Figure: time (in thousands of time units) versus number of registers, including the curve for the 2-point FFT COB]
Fig. 4.12. Time complexity of various implementations of $2^8$ length FFT.
CHAPTER 5
CONCLUSIONS
This chapter reviews the results obtained during this work. After
summarizing the useful results in Section 5.1, and their applications in
Section 5.2, future research areas are identified in Section 5.3.
5.1 Summary of Selected Results
This work for the first time provides the means to design an
implementation of a given arbitrary computational graph, while taking
into account the number of accumulators available in the processor. The
1-register algorithm of Chapter 2 can be applied to form a time
efficient algorithm for the graph implemented on a one accumulator
processor. Since most of the general purpose microprocessors available
today have one accumulator, the results obtained here are universally
useful. This 1-register algorithm is extended to the r-register algorithm
in Chapter 3. Given a machine containing n general purpose registers,
any computational graph can be subjected to the 1- and r-register
algorithms to form a time efficient implementation. Furthermore, since
most signal processing algorithms contain regular structures, a
computational kernel, called a primitive COB here, may be used repeatedly
to cover the complete graph, as shown in Chapter 4. The primitive COB may
be subjected to the algorithms derived in this thesis to obtain its
efficient code for any given processor. By repeating this basic code,
one may then obtain an efficient code for the complete graph.
The results obtained in Chapter 4 point out several important
facts. First, for a given computational graph, the time complexity
decreases exponentially as the number of registers increases (See Figs.
4.9 and 4.11). This result implies that the increase in the number of
registers after a certain point does not yield a profitable decrease in
time complexity. (For the Hadamard transform, this is a modest three
accumulator architecture.) Consequently, an arbitrary increase in the
number of accumulators in processor design is not justified since the
cost of hardware inflates very rapidly as the number of accumulators
increases. Another important result obtained is that the size of the
primitive COB does not affect the time complexity significantly, as long
as it is large enough to fully utilize all available registers. One may
thus choose a small and efficient primitive COB, so that writing the
code for it is a trivial task.
5.2 Significance of the Results
The importance of this work is mainly due to the wide applicability
of the algorithms developed in Chapters 2 and 3. These algorithms
enable one to design a time efficient code by giving due consideration
to the hardware architecture, in particular, the number of registers
contained in the CPU. These algorithms enable one to utilize the
hardware capabilities to their fullest extent, thus improving the
performance without any additional cost.
Another potential application of this research is to provide means
to evaluate various architectures with respect to a given algorithm.
The procedures of Chapters 2 and 3 allow one to systematically study the
trade offs between various factors such as the time complexity, hardware
complexity and code size. This enables one to choose a good engineering
design in most practical situations.
Finally, this work also brings out the concept of a primitive COB.
A primitive COB can be used for automatic generation of software for
large signal processing problems and to reduce the code size of an
algorithm without sacrificing time efficiency. It may also have a
significant impact on the design of special purpose parallel processing
hardware for signal processing applications.
5.3 Suggestions for Further Work
The verification on an actual multi-accumulator machine of the
various implementations obtained here is highly desirable. It was not
possible to carry this out, mainly due to the time limitation and also
because of the lack of good multi-accumulator processors. Since most of
the available microprocessors have architectures geared towards
high-level language implementations rather than numerical applications,
it is necessary to design multi-accumulator hardware for this
verification. Such a hardware design would use the bit-slice
microprocessors AM2901 or AM2903 [22-24], since they have a sufficient
number of registers for our purpose and belong to a family that has a
large number of support ICs.
Another potential area for future research is the investigation of
the relationship between a graph structure and its ultimate
implementation on a finite register SISD machine. In particular, one
may be able to restructure the computational graph without affecting the
final results, such that the restructured graph may have a highly
efficient implementation.
Finally, it should be mentioned that the r-dimensional COB model may
not yield optimum results in some cases and merits further attention.
REFERENCES
[1] S. Winograd, "On computing the Discrete Fourier Transform," Proc.
Nat. Acad. Sci., U.S.A., vol. 73, pp. 1005-1006, Apr. 1976.
[2] J. W. Cooley and J. W. Tukey, "An algorithm for the machine
calculation of Complex Fourier Series," Math. of Comp., vol. 19,
pp. 296-301, 1965.
[3] L. R. Morris, "A comparative study of time efficient FFT and WFTA
programs for general purpose computers," IEEE Trans. Acoust., Speech
and Signal Processing, vol. ASSP-26, no. 2, pp. 141-150, Apr. 1978.
[4] H. D. Toong and A. Gupta, "An architectural comparison of
contemporary 16-bit microprocessors," IEEE Micro, vol. 1, pp. 26-37,
May 1981.
[5] Component Data Catalog, Intel Corporation, Santa Clara, CA, 1982.
[6] Z80 Microcomputer Data Book, Mostek Corp., Carrollton, TX, 1981.
[7] Electronic Device Division Data Catalog, Rockwell International,
Anaheim, CA, 1981.
[8] Microprocessor Data Manual, Motorola Inc., Austin, TX, 1981.
[9] J. P. Anderson, "A note on some compiling algorithms," Comm. ACM,
vol. 7, no. 3, pp. 149-150, Mar. 1964.
[10] I. Nakata, "On compiling algorithms for arithmetic expressions,"
Comm. ACM, vol. 10, no. 8, pp. 492-494, Aug. 1967.
[11] N. M. Brenner, "Fast Fourier Transform of externally stored data,"
IEEE Trans. Audio Electroacoust., vol. AU-17, no. 2, pp. 128-132,
June 1969.
[12] P. S. Naidu, "FFT of externally stored data," IEEE Trans. Acoust.,
Speech, and Signal Processing, vol. ASSP-26, no. 5, pp. 473, 1970.
[13] J. O. Eklundh, "A fast computer method for matrix transposition,"
IEEE Trans. Computers, vol. C-21, no. 7, pp. 801-803, July 1972.
[14] P. S. Naidu, "Fast matrix transpose computer implementation," Signal
Processing, North Holland Publishing Company, pp. 457-459, Mar.
1982.
[15] H. Nawab and J. H. McClellan, "Bounds on the minimum number of data
transfers in WFTA and FFT programs," IEEE Trans. Acoust., Speech and
Signal Processing, vol. ASSP-27, no. 4, pp. 394-398, Aug. 1979.
[16] M. J. Flynn, "Very high-speed computing system," IEEE Proc., vol.
54, no. 12, pp. 1901-1909, Dec. 1966.
[17] Peter M. Kogge, The Architecture of Pipelined Computers, New York:
McGraw-Hill, Inc., 1981.
[18] J. L. Pfaltz, Computer Data Structures, New York: McGraw-Hill,
Inc., 1977.
[19] L. R. Morris, "Automatic generation of time efficient digital
signal processing software," IEEE Trans. Acoust., Speech and
Signal Processing, vol. ASSP-25, no. 1, pp. 74-79, February 1977.
[20] A. V. Oppenheim and R. W. Schafer, Digital Signal Processing,
Englewood Cliffs, NJ: Prentice-Hall, 1975.
[21] L. R. Rabiner and B. Gold, Theory and Application of Signal
Processing, Englewood Cliffs, NJ: Prentice-Hall, 1975.
[22] J. Mick and J. Brick, Bit-Slice Microprocessor Design,
New York: McGraw-Hill, Inc., 1980.
[23] G. J. Myers, Digital System Design with LSI Bit-Slice Logic,
New York: Wiley Interscience, 1980.
[24] D. E. White, Bit-Slice Design: Controllers and ALUs, New York:
Garland STPM Press, 1981.
