Systolic architectures for parallel implementation of digital filters by Zhou, Bing Bing
SYSTOLIC ARCHITECTURES 
FOR 
PARALLEL IMPLEMENTATION 
OF 
DIGITAL FILTERS 
by 
Zhou Bing Bing 
A thesis submitted to the 
Australian National University 
for the degree of Doctor of Philosophy 
September, 1988 
PREFACE 
Some of the work of this thesis was carried out in collaboration with Pro-
fessor Richard Brent. In particular, Chapters 4 and 5 contain results which were 
established jointly. 
Elsewhere in the thesis, unless otherwise stated, the work described is my 
own. 
. 
1 
ACKNOWLEDGEMENTS 
I am indebted to Professor Richard Brent for his invaluable guidance through-
out the course of my research, and for his constructive criticisms of this thesis. In 
writing the thesis I was also greatly assisted by Dr. Iain Macleod who suggested 
many clarifications and corrections in the text. This work would have been im-
possible without their help. 
I am grateful to Dr. Heiko Schroder for his helpful advice. I would also like 
to thank all the staff of the Computer Sciences Laboratory for their friendship 
during the completion of this work. 
I would also like to record my deep appreciation for the support and encour-
agement of my wife Zheng Yong Min during my research. 
Finally I acknowledge with gratitude the financial support received from the 
Chinese Government and the Australian National University . 
. . 
11 
ABSTRACT 
This thesis addresses the design of efficient systolic architectures for real-time 
digital filters. 
Many VLSI architectures for digital filters have been introduced in the liter-
ature. Most of these architectures are one-dimensional because the problems to 
be solved are 1-D. Although 1-D architectures have the advantage of low I/0 
bandwidth requirement, the throughput is limited by their word-serial nature. In 
many high-throughput applications, (two-level) pipelined and/or (2-D) parallel 
architectures have to be considered seriously. 
Digital filters can be classified as non-recursive filters and recursive filters. In 
order to obtain efficient pipelined and/or parallel systolic architectures for these 
two types of filters, some difficult problems have to be solved. Suppose that N 1 and 
N2 are the numbers of coefficients and inputs of a non-recursive filter. Since there 
is no recursion in the system and there is a high degree of inherent parallelism, it 
is easy to obtain an N 1 x N 2 architecture for implementing that filter. However, 
this N 1 x N 2 architecture is impractical for implementation in VLSI because it has 
too many input/output lines and requires too large an area for a large problem. 
Therefore, the key problem is to construct 2-D systolic architectures which not 
only can solve the problem efficiently, but have a reasonable size well suited for 
VLSI implementation. 
Recursion implies sequential rather than parallel execution, which typically 
places an upper bound on the throughput with which a recursion can be imple-
mented. Therefore, to obtain parallel algorithms/architectures with guaranteed 
111 
stability for recursive filters is a more difficult task than for non-recursive filters. 
This thesis aims to solve these problems and then to obtain efficient pipelined 
and/ or parallel systolic architectures for non-recursive and recursive filters. 
We introduce a 2-D systolic architecture for linear phase non-recursive filters 
and two 2-D systolic ring structures for linear convolution problems. These archi-
tectures are based on nearest neighbour interconnections and have the property of 
linear area versus period tradeoff for a given problem. Moreover, they are unidi-
rectional structures. The technique of two-level pipelining may easily be applied 
to these structures. Thus the throughput can further be increased without a great 
. . increase 1n area. 
Although look-ahead computation is a good method for deriving parallel al-
gorithms for state-variable-form recursive filters, this conventional technique may 
cause numerical instability in direct-form recursive filters due to the effect of fi-
nite wordlength. We introduce a new method of Z domain derivation. Using this 
method, not only can parallel algorithms with guaranteed stability be derived, 
but the additional complexity required for this purpose can be minimized through 
a decomposition technique. A time domain derivation of parallel algorithms for 
direct-form recursive filters is also introduced. The derived algorithms are of par-
ticular interest for time-varying recursive systems. 
Using the stabilized parallel algorithms for direct-form recursive filters, very 
efficient pipelined and/ or parallel VLSI architectures can be constructed. We show 
that those algorithms lead directly to an efficient two-level pipelined structure. 
Two different parallel systolic structures are also derived based on those algo-
rithms. One has the advantage of regularity while the other can achieve a linear 
. 
IV 
complexity in parallel size for cascaded second-order recursive filters. Using 2-D 
parallel processing in combination with two-level pipelining, efficient pipelined and 
parallel architectures can also be constructed. With the same degree of complex-
ity, this combination of techniques enables a substantial increase in throughput 
compared to purely parallel architectures for a given problem. 
V 
TABLE OF CONTENTS 
Preface ................................................................... i 
Acknowledgern.ents ....................................................... ii 
Abstract ................................................................. iii 
Table of contents ........................................................ vi 
List of figures ............................................................ ix 
1. INTRODUCTION 1 
1.1. Systolic Arrays ..................................................... 3 
1.2. VLSI Implementation of Digital Filters .............................. 5 
1.3. Thesis Outline and Main Results .................................... 8 
2. CUT-SET LOCALIZATION RULES 10 
2.1. Introduction ....................................................... 10 
2.2. SFG Notations and Cut-Set Localization Rules ..................... 12 
2.3. Com.m.utative Rule ................................................ 14 
2.3.1. Linear convolution .......................................... 14 
2.3.2. Multiplication of two band matrices ......................... 16 
2.4. Discussion ......................................................... 20 
3. 2-D SYSTOLIC IMPLEMENTATIONS FOR 
NON-RECURSIVE DIGITAL FILTERS 21 
3.1. Introduction ....................................................... 21 
3.2. Linear-Phase Non-Recursive Filters ........ . ....................... 23 
3.2.1. Sub-system for computing y 2p •••••••••••••••••••••••••••••• 24 
3.2.2. Compact forin. ........................................... ... 28 
. 
V1 
3.3. Linear Convolution Problems ...................................... 30 
3.3.l. L x N1 2-D systolic array .................................. 30 
3.3.2. N1 x N2 2-D systolic array ................................. 33 
3.3.3. N 1 x L systolic ring structure ............................... 34 
3.3.4. An upper bound on AP2 ••••••••••••••••••••••••••••••••••• 42 
3.3.5. The second systolic ring structure ........................... 43 
3.4. Discussion ......................................................... 47 
4. STABILIZED PARALLEL ALGORITHMS FOR 
DIRECT-FORM RECURSIVE FILTERS 49 
4.1. Introduction ....................................................... 49 
4.2. Conventional Look-Ahead Computation ............................ 51 
4.3. Stabilized Parallel Algorithm ...................................... 55 
4.3.1. Algorithm .................................................. 55 
4.3.2. Degree of parallelism ....................................... 57 
4.3.3. Stability .................................................... 59 
4.4. Reduction of Complexity .......................................... 61 
4.5. Time Domain Derivation .......................................... 66 
4.5.1. Methods of derivation ...................................... 67 
4.5.1.1. Derivation without decomposition ................... 68 
4.5.1.2. Derivation with decomposition ....................... 70 
4.5.2. Condition for unique solution ...................... · ......... 72 
4.5.3. Stability .................................................... 76 
4.6. Discussion ......................................................... 78 
Appendix 4.1 ............................................................ 80 
.. 
Vll 
A pp end.ix 4.2 ............................................................ 82 
A pp end.ix 4.3 ............................................................ 84 
Appendix 4.4 ............................................................ 87 
A pp end.ix 4.5 ............................................................ 89 
A pp end.ix 4. 6 ............................................................ 91 
5. PIPELINED AND/OR PARALLEL ARCHITECTURES 
FOR DIRECT-FORM RECURSIVE FILTERS 94 
5.1. Introduction ....................................................... 94 
5.2. Two-Level Pipelined Structure .. ................................... 95 
5.3. Parallel Structures ................................................. 99 
5.3.1. Structure 1 ................................................. 99 
5.3.2. Structure 2 ................................................ 102 
5.4. Pipelined and Parallel Structures ................................. 106 
5.4.1. Case 1: L < N ............................................ 106 
5.4.2. Case 2: L > N ............................................ 110 
5.5. Discussion ........................................................ 112 
6. CONCLUSIONS 114 
REFERENCES 117 
Vlll 
LIST OF FIGURES 
2.1. Equivalent notations for a time delay operator .................... ....... 12 
2.2. SFG array for linear convolution ......................................... 15 
2.3. Systolic array for linear convolution ...................................... 15 
2.4. Transformation steps for obtaining an efficient 
systolic array from Fig. 2.2 .............................................. 16 
2.5. SFG array for multiplication of two band matrices ....................... 17 
2.6. Systolic array for multiplication of two band matrices .................... 17 
2. 7. Transformation steps for obtaining an efficient 
systolic array from Fig. 2.5 .............................................. 19 
3.1. Linear systolic aray for computing y~~o) ( N 1 = 6) ........................ 24 
3.2. Two equivalent systolic arrays for computing y~~l) ( N 1 = 6) .............. 25 
3.3. Systolic arrays ( a) for computing y~~) and (b) for computing y~~) ••••••••• 26 
3.4. 2-D systolic arrays ( a) for computing Y2p and (b) for computing Y2p+1 ••• 27 
3.5. The modified version of Fig. 3.4 ......................................... 29 
3.6. 2-D systolic array for the linear-phase non-recursive filter ................ 29 
3.7. The matrix-vector multiplier ............................................ . 31 
3.8. The L x N 1 systolic array for solving linear convolution problems ........ 32 
3.9. An N 1 x N 2 systolic array for solving linear convolution problems ........ 33 
3.10. A block-diagram of the N 1 x N 2 systolic array ........................... 34 
3.11. One block of the systolic array in Fig. 3.10 .............................. 35 
3.12. An N1 x L ring structure for solving linear convolution problems ........ 37 
. 
IX 
3.13. Systolic layout of the 2-D systolic ring structure for 
solving linear convolution problems ...................................... 38 
3.14. The route for one output in the ring structure ........................... 40 
3.15. A systolic array equivalent to that of Fig. 3.12 ........................... 44 
3.16. Steps for transforming Fig. 3.12 into Fig. 3.15 ........................... 45 
3.17. A systolic ring structure with L = 1 ..................................... 46 
4.1. Derivation without decomposition ( N = 2 and L = 6) .................... 68 
4.2. Derivation with decomposition ( N = 2 and L = 6) ....................... 70 
4.3. Derivation with decomposition (N = 3 and L = 8) ....................... 71 
4.4. Two other methods for time domain derivation (N = 2 and L = 6) ....... 73 
5.1. The structure for the recursive part (N = 2 and M = 4) ................. 96 
5.2. The structure for computing ( 5.4) ....................................... 97 
5.3. The structure with four stages of second-level pipeline for 
second-order recursive filters ............................................. 98 
5.4. The structure for the recursive part (N = 2 and L = 4) ................. 100 
5.5. The structure for computing (5.12) ..................................... 101 
5.6. The structure for computing Ysn+l for 4 < l < 1 ........................ 103 
5. 7. The structure for computing Y4n+l for O < l < 3 ........................ 104 
5.8. The structure for computing U4n+l for O < l < 1 ........................ 104 
5.9. The structure with M stages of second-level pipeline for 
computing Y4n+l for O < l < 3 .......................................... 111 
X 
CHAPTER 1 
INTRODUCTION 
This thesis addresses the design of efficient architectures for high-speed digital 
signal processing. 
Many signal processing applications have two characteristics which make them 
computationally intensive: (i) large amounts of processing (up to hundreds of el-
ementary operations at each point); (ii) real-time response ( at rates up to several 
MHz). Ever-increasing demands for sophistication and speed of processing clearly 
point to the need for dramatic increases in computational capability. Rapid im-
provements in digital integrated circuit technology have provided about a ten times 
increase in device speed every few years, but these improvements still cannot sup-
ply the required performance in areas of high-speed signal processing such as speech 
recognition, beamforming, and image analysis. Therefore, real-time computation 
needs accompanying architectural advances. 
Many compute-intensive tasks are now handled by general-purpose supercom-
puters. These general-purpose machines have complicated system organizations, 
are very expensive and provide facilities such as 64-bit word and arithmetic units 
which are not fully utilized in signal processing operations. The advent of low-
cost and high-density VLSI technology makes it possible to use special-purpose 
array processors for solving certain classes of compute-intensive problems faster 
and more economically than would be possible with a general-purpose system. 
In VLSI, communication is expensive whereas logic is relatively cheap [31]. To 
minimize communication cost, one can build architectures which are regular ar-
1 
rays or lattices of small processing elements. The only communication is between 
near neighbours and the array can be laid out simply and efficiently. Fortunately, 
most algorithms in modern signal processing possess some useful common proper-
ties, such as regularity, local data communication, and a high degree of potential 
parallelism. These properties may be utilized to exploit the power of VLSI and 
to circumvent its constraints. Because of its local communication, synchronous 
data fl.ow, simple control and modular parallelism with throughput directly pro-
portional to the number of cells, the systolic array [18,20] has been considered as 
a most promising VLSI architecture for real-time signal processing. In the last 
few years, there has been a dramatic worldwide growth in research on mapping 
various signal processing algorithms into systolic arrays. 
Digital filtering is one of the most important techniques in modern signal 
processing. It is widely used in areas such as speech and image processing, radar 
signal processing and biomedical engineering. While this technique is compute-
intensive, it is well suited to VLSI implementation and many examples have been 
reported in the literature. 
This thesis concentrates on the design of efficient systolic architectures for 
high-throughput digital filters. The next two sections of this chapter briefly review 
systolic architectures and VLSI implementation of digital filters. The final section 
then gives an outline of the following chapters, together with a summary of the 
main results obtained. 
2 
1.1. Systolic Arrays 
Systolic architectures are typically large, regular arrays with only a few dif-
ferent types of small processing elements, each capable of performing some simple 
operation. Data and control flow in the system is simple and regular. All op-
erations involving a data item are applied to it as it passes through the array; 
communication with the outside world occurs only at the "boundary cells". This 
method of computation eliminates the need to retrieve a data item from exter-
nal memory every time it is used. The regular pattern of local interconnections 
between cells implies that the systolic design can be made modular and extensi-
ble, so that a systolic array can easily be expanded to achieve high computation 
throughput without increased memory bandwidth. This property gives systolic 
architectures a major advantage over traditional architectures, which are limited 
by the "Von Neumann bottleneck". Moreover, the simplicity of systolic designs is 
particularly important for special-purpose applications, where design costs must 
be kept low. 
Because of their low memory bandwidth, simplicity, regularity, modularity 
and local communication, systolic architectures are well suited to implementation 
in VLSI. Various systolic systems have been developed to solve compute-intensive 
problems in areas such as signal and image processing and matrix arithmetic, for 
example see [1,4,5,12-14,17-22,24-26,37,44,45,50,52,56,57]. However, it should be 
noted that clock skew ( which may inhibit global synchronization for ultra-large-
scale 2-D arrays) is one possible disadvantage of the systolic array approach. A 
simple solution to this problem is to adopt the principle of dataflow computing in 
systolic array processors, which leads to wavefront array processors [24-26]. 
3 
In a wavefront array instructions cannot be executed until their operands 
have become available. The arrival of data from neighbouring processors will be 
interpreted as a change of state and will initiate some action. The wavefront arrays 
are named because of the analogy with wavefront propagation, and comprise a 
distributed and (globally) asynchronous array processing system. This approach 
replaces the requirement for correct timing by one of correct sequencing, and 
handles the data dependency locally. Thus, it eliminates the need for global control 
and global synchronization [25]. Since systolic arrays can easily be converted 
into wavefront array processors, in this thesis we regard the concept of systolic 
architectures as encompassing wavefront array processors. 
Systolic architectures can be either one-dimensional (1-D) or two-dimensional 
(2-D) depending on the problems to be solved. Sometimes a given problem can 
have both 1-D and 2-D systolic array solutions. 1-D systolic arrays have the 
advantage of low I/ 0 bandwidth requirement. However, throughput is limited 
by their word-serial nature. To achieve higher throughput, two-level pipelined 
and/ or 2-D parallel structures have to be considered. The two-level pipelined 
systolic approach, first introduced by H. T. Kung and his colleagues [19,21], is 
a good method not only for achieving high throughput computation, but also 
for reducing the area required in VLSI implementation in comparison with other 
parallel approaches. Thus it is desirable that a given system is implemented by 
first using two-level pipelining to the maximum possible extent, and then using 2-D 
parallel processing in combination with pipelining if further increase in throughput 
is required. 
4 
1.2. VLSI Implementation of Digital Filters 
In this thesis we study VLSI architectures for linear filters. Digital filters can 
be classified as non-recursive filters and recursive filters. 
Non-recursive filters 
One of the most important characteristics of non-recursive filters is that they 
can be designed to have exactly linear phase [38]. There have been many sig-
nal flow graph ( SFG) networks introduced to compute linear phase non-recursive 
filters [38]. 1-D systolic arrays can easily be obtained by applying cut-set localiza-
tion rules [24-26] to these SFG computing networks. ( One example is described 
in [24].) Suppose that the number of coefficients for a given problem is N 1 , which 
is assumed to be even. The derived 1-D systolic array will use N 1 /2 multipliers. 
We shall derive a 2-D systolic array for solving the same problem by using N 1 
multipliers. However, this 2-D architecture can achieve twice the throughput of 
the 1-D array. Thus our 2-D implementation is preferable in high-speed signal 
. processing. 
We also discuss 2-D systolic architectures for computing 1-D linear convolu-
tion equations. This is not only because the problem of non-recursive filtering can 
be represented by a linear convolution equation, but because other problems such 
as DFT, circular convolution and 2-D linear convolution can also be transformed 
into 1-D linear convolution [11,22,35]. Thus one structure can be used for several 
different kinds of problems. 
Although various types of systolic architectures for computing linear convo-
lution problems have been introduced in the literature [8,17 ,18,21,22,25,30,43,46], 
5 
only a few of them are 2-D. Suppose that N 1 and N2 are respectively the numbers 
of coefficients and inputs of a given problem. Cappello and Steiglitz [8] derived 
several N 1 x N2 2-D systolic arrays. In theory, these systolic arrays are AP2 
asymptotically optimal ( where A is defined as Area and P as Period, which is the 
reciprocal of throughput), but they are impractical for implementation in VLSI 
because they have too many input/output lines and require too large an area for a 
large problem. Lu and others [30] derived an L x N 1 2-D systolic array for the case 
L < N 2 • They split the linear convolution equation into a number of subequations. 
Each subequation contains L successive outputs. Because the subproblems have 
the form of matrix-vector multiplications, they can be implemented successively 
on a matrix-vector multiplier. In this design the throughput is increased by a 
factor of L in comparison with 1-D systolic arrays. However, a large proportion of 
the total area is consumed by the communication lines. Thus this implementation 
is not area efficient in VLSI. We shall introduce two N 1 x L systolic ring architec-
tures. These N 1 x L architectures are based on nearest neighbour interconnections 
and can achieve the same throughput as the L x N 1 array, but are more efficient 
in their use of area. 
Recursive filters 
Recursion implies sequential execution, which typically places an upper bound 
on the throughput with which a recursion can be implemented. Therefore, con-
ventional serial algorithms for recursive filters cannot effectively make use of large 
numbers of processing elements. To overcome this problem, Gold and Jordan [16] 
proposed block ( or parallel) recursion methods for implementing recursive filters. 
They demonstrated that rational transfer functions can be realized by finite con-
6 
volutions employing blocks of data. Subsequently, Voelcker and Hartquist [4 7] 
showed that a rational transfer function can be realized with block feedback and 
finite convolution. Burrus [6,7] developed a different approach based on a ma-
trix representation of convolution, which results in a state-variable description 
with block feedback. The relationship between the Gold and Jordan idea and 
the Burrus method was described by Gnanasekaran and Mitra [15], who also pro-
posed several new block structures using a matrix representation of convolution 
[34]. Meyer and Burrus [32,33] discussed the use of block implementations for 
multirate and periodically time-varying digital filters. Note that block imple-
mentations for state-variable-form recursive filters have recently received a lot of 
attention [2,3,30,34,36,39-41,48,53]. These implementations are based on the tech-
nique of so-called look-ahead computation. ( Although this terminology was first 
formally described in [39,41], the technique had been known previously for some 
time.) In the look-ahead technique, the given recursion is iterated as many times as 
desired to create the necessary concurrency; the concurrency created can then be 
used to obtain pipelined and/ or parallel implementations of recursive systems [41]. 
Although the look-ahead computation technique has been applied success-
fully to the parallel implementation of state-variable-form recursive filters, it may 
cause numerical instability in direct-form recursive filters due to the effect of fi-
nite wordlength. We shall introduce new methods for deriving parallel algorithms 
for direct-form recursive filters. In comparison with the original serial algorithm, 
not only is the degree of parallelism increased, but the stability is also improved. 
Moreover, these algorithms lead to very efficient systolic/wavefront architectures. 
Therefore, they are most suitable for VLSI. 
7 
1.3. Thesis Outline and Main Results 
In Chapter 2, the signal-flow graph (SFG) notations to be used throughout 
this thesis are described. A very powerful architecture transformation technique 
based on cut-set localization rules ( or systolization rules) and introduced by S. 
Y. Kung [24-26] is described. Then we introduce a commutative rule. Using cut-
set localization rules in combination with this commutative rule, one may obtain 
more efficient systolic arrays from certain feedforward SFG computing networks. 
Chapter 3 describes 2-D systolic architectures for non-recursive digital filters. 
Suppose that N 1 and N2 are respectively the numbers of coefficients and inputs of a 
linear filtering problem. It is easy to obtain N 1 x N 2 architectures for implementing 
that problem because there is no recursion in the system. However, as mentioned in 
the previous section these architectures are impractical for VLSI implementation. 
Therefore, the key problem is to construct 2-D systolic architectures which not 
only can solve the problem efficiently, but also have a reasonable size well suited 
for implementation in VLSI. We derive a 2-D systolic array for linear phase non-
recursive filters. By using twice the number of multipliers that 1-D systolic arrays 
normally require, this 2-D systolic architecture can achieve twice the throughput 
of l-D arrays for a given problem. We introduce two N 1 x L 2-D systolic ring 
structures, where L < N 2 • These 2-D structures are based on nearest neighbour 
interconnections and can achieve L times the throughput of 1-D systolic arrays 
for solving the same linear convolution problem. Moreover, the most efficient l-D 
systolic array for linear convolution is just a special case of a 2-D systolic ring 
array with L = l. The complexity measure AP2 is also analyzed. 
Chapter 4 derives stabilized parallel algorithms for direct-form recursive fil-
8 
ters. The conventional look-ahead computation may cause numerical instability 
in the direct-form implementation. Thus we introduce a new method for deriving, 
in the Z domain, a class of stabilized parallel algorithms for direct-form recur-
sive filters. In order to obtain this stabilized algorithm extra zeros and poles are 
introduced, so the complexity is greatly increased. We then introduce a decom-
position technique to minimize this increase in complexity. We also introduce a 
time-domain derivation of parallel algorithms, which have the same form as those 
derived in the Z domain. This is of particular interest for time-varying recursive 
systems. There are many different ways to achieve the same goal. The condition 
for the unique solution of the stabilized parallel algorithm is discussed. 
Chapter 5 introduces efficient pipelined and/or parallel architectures associ-
ated with the stabilized parallel algorithms derived in Chapter 4. We show that 
the stabilized parallel algorithms lead directly to an efficient two-level pipelined 
structure. Using these algorithms, we can also derive two different parallel struc-
tures. The first one has the advantage of regularity while the · second one can 
achieve a linear complexity in parallel size for cascaded second-order recursive 
filters. Combining parallel processing with pipelining we finally describe two ef-
ficient pipelined and parallel architectures for direct-form recursive filters. This 
combination of techniques enables a substantial increase in throughput compared 
to previously known architectures. 
9 
CHAPTER 2 
CUT-SET LOCALIZATION RULES 
2.1. Introduction 
A non-systolic system can be transformed into a systolic array by using ar-
chitecture transformation techniques. The original work was done by Leiserson 
and Saxe (27,28]; they introduced various rules such as retiming and slowdown 
and showed how these rules can be used to obtain an efficient systolic array from 
a given synchronous system. H. T. Kung and Lam (19] derived a cut theorem to 
show how to add delays to a given system without affecting its function. Because 
the main purpose of this theorem is to construct a two-level pipelined systolic 
array from a uni-directional structure, it is required that all the lines in the cut 
must point in the same direction. Thus further applications to more complicated 
systems are limited. Cut-set localization rules ( or "systolization" rules) were in-
troduced by S. Y. Kung (24]. By using these simpler, but more powerful rules, one 
can easily transform a signal flow graph (SFG) computing network into a systolic 
array. It has also been proved in (24] that all computable SFG arrays can be 
temporally localized by using these rules. 
Although an SFG computing network (or synchronous system) can be con-
verted into a systolic array by using the methods metioned above, we cannot 
guarantee that the derived system is the most efficient systolic array for solving 
the given problem. The reason is that the chosen SFG computing network may 
be inefficient. A method for converting a dependence graph into an efficient SFG 
computing network was introduced in (25]. Since a dependence graph may as-
10 
sociate with many different SFGs, we may choose the most efficient one among 
them (25]. 
In the following, we first describe the SFG notations and the systolization 
rules. Then we introduce a commutative rule for obtaining efficient systolic arrays 
from certain SFG computing networks without feedback. We give two examples 
to show that by applying the commutative law of addition, the direction of some 
lines in this kind of system may be changed without affecting the overall function. 
More efficient systolic arrays can thus be obtained after the modification. 
11 
2.2. SFG Notations and Cut-Set Localization Rules 
SFG notations 
The SFG notations used herein are similar to those in [24]. They are described 
as follows: A node is denoted by a square ( or a circle) representing an arithmetic 
or logic function performed with zero delay. On the other hand, a line denotes 
a delay. When a Ene is labeled with a number t ( a non-negative integer), it 
represents a time delay operator with a delay time t. We omit the number when 
it is zero. Therefore, a line without a number represents a zero delay operator. 
For convenience we sometimes use another notation for a time delay operator; 
a time delay operator with a delay time t ( t f- 0) is denoted by a small square 
( smaller than a node) with a number t in it. The t\vo equivalent representations 
are depicted in Fig. 2.1. 
t 
~ 
Fig. 2.1. Equivalent notations for a time delay operator 
12 
Cut-set localization rules 
1. 
2. 
There are only two basic rules: 
Time-scaling: AU delays in a system may be scaled by a single positive 
integer a. Correspondingly, the input and output rates also have to be scaled 
by a factor a. 
Delay-transfer: Given any cut-set of the SFG, we can group the lines in 
the cut-set into in-bound lines and out-bound lines, depending on the direc-
tions assigned to the lines. Rule 2 allows an advance of k time units on all 
the out-bound lines and a delay of k time units on the in-bound lines, and 
vice versa, provided the resulting delays are non-negative. 
The proof of cut-set localization rules and many examples of deriving systolic 
array structures from SFG computing networks can be found in [24-26]. 
13 
2.3. Commutative Rule 
From cut-set localization rules we see that the throughput rate of the derived 
systolic array is inversely proportional to the scaling factor a. The optimal local-
ization algorithm [26] is then proposed to search non-rescaling (NR) cut-sets so 
that a can be minimized. An NR cut-set is a "good" cut-set in which all the lines 
( excluding input lines) in the opposite direction to the zero-delay target line to be 
localized, have delays of at least two time units. (A "bad" cut-set is one in which 
this condition does not hold.) Once an NR cut is determined, one can simply 
apply the delay-transfer operation along the cut and localize the target line(s). If 
there exist no NR cut-sets in the original SFG computing network, time-scaling 
has to be applied, which will decrease the throughput of the system. In this section 
we give two examples to show how to avoid application of time-scaling to SFGs 
without feedback by using the commutative rule. 
2.3.1. Linear convolution 
A well known SFG computing network for linear convolution problems [18] 
is depicted in Fig. 2.2. It is easy to see that there exist no NR cut-sets in the 
structure. Time-scaling has to be applied and the throughput of the resulting 
systolic array is then decreased. 
To avoid use of time-scaling, we use the following transformation procedure. 
First we apply delay transfer to the SFG, as shown in Fig. 2.4( a). We see that each 
of the computation steps has now been divided into two stages; first all nodes or 
cells perform multiplications, and then these multiplication terms are accumulated 
from right to left as one of the outputs. Note that the order of additions is 
14 
X X X 
... 2 1 0 
Yo Y1 Y2 ... 
x . 
In 
Y out 
h· I 
X out := X in 
Fig. 2.2. SFG array for linear convolution 
Fig. 2.3. Systolic array for linear convolution in [27] 
changeable according to the commutative law. Therefore, the multiplication terms 
can also be accumulated from left to right. This means that we can change all 
left-pointing lines into right-pointing lines without affecting the function of the 
system. Thus we next change the direction of the left-pointing lines in Fig. 2.4( a). 
Since all the lines are now pointing in one direction, we can introduce a set of new 
cuts and add one delay to each of the lines in the cuts. The final result is the same 
as that described in [18] and has been proved to be the most efficient among 1-D 
systolic arrays for solving linear convolution problems [29]. 
15 
I 
~ ho J.!11 h1 
(a) Transfer delays 
(b) Change the direction of left-pointing lines 
I I I 
I 2 '2 '2 ~ ~ I l 1~ I l 1~ I l 1 :1 ho h1 h2 h3 
( c) Add delays 
Fig. 2.4. Transformation steps for obtaining an efficient sys-
tolic array from Fig. 2.2 
2.3.2. Multiplication of two band matrices 
Consider the problem of multiplying two band matrices as follows: 
C11 C1:? C13 C14 0 a11 a12 0 b11 b1:? b13 
C21 C22 C23 C2-1 a21 a22 a23 b21 b22 b23 
C31 C32 C33 C34 a31 a32 a33 b32 033 
C41 C-12 a-12 
0 0 0 
0 
b24 
b34 
Fig. 2.5 depicts an SFG network for computing the above matrix multipli-
cation. This structure was originally reported by Chern and Murata [9]. The 
corresponding systolic array, as shown in Fig. 2.6, was derived from Fig. 2.5 by 
using cut-set localization rules [24]. The derived systolic array is not very efficient 
16 
b43 b44 b45 b46 
b32 b33 b34 b35 
b in 
b21 b22 b23 b24 Cout 
0 b11 b12 b13 
ain aout 
... a34 a23 a12 0 
bout C in 
... a44 a33 a22 a11 aout . ain .-
bout .- bin 
... as4 a43 a32 a21 
* bin Cout 
. Cin + ain . -
Fig. 2.5. SFG array for multiplication of two band matrices 
0 0 b23 0 
0 b22 0 0 
b21 0 0 b13 
0 0 b12 0 
0 b 11 0 0 
0 0 0 0 
a32 0 0 a 21 O O 
0 0 a3 1 0 0 0 
Fig. 2.6. Systolic array for multiplication of two band matrices in (24] 
17 
because time-scaling with a = 3 is applied, which results in a reduction of the 
throughput by a factor of 3. Linear convolution and matrix multiplication are 
similar problems. We can use the localization procedure described in the previous 
sub-section to obtain a systolic array without decreasing the original throughput. 
Fig. 2. 7 depicts the transformation steps for obtaining a new systolic array. In 
Fig. 2. 7( a) we introduce a set of horizontal cuts and apply delay transfer. For the 
reason described in the previous sub-section, we can change the direction of the 
diagonal lines without affecting the function of the system. Finally we introduce 
a set of vertical cuts and add one delay to each of the lines in the cuts. A systolic 
array which maintains the original throughput is then obtained. (This array is 
the same as that in (49].) The only difference between these two systolic arrays is 
that all diagonal lines are reversed in direction, but the throughput of the array 
in Fig. 2. 7( c) is three times greater than that in Fig. 2.6. 
18 
(a) step 1 (b) step 2 
b54 b44 b34 b24 
~3 b33 t}z3 b13 
b32 b22 b12 0 
b21 b11 0 0 
0 0 0 0 
(c) step 3 
Fig. 2. 7. Transformation steps for obtaining an efficient sys-
tolic array from Fig. 2.5 
19 
2.4. Discussion 
In this chapter we first introduced the SFG notation to be used throughout 
this work. Then cut-set localization rules were described. Using these powerful 
rules all computable SFG networks can easily be converted into systolic arrays . 
The key problem in using these rules is to minimize the time-scaling factor. We 
introduced a new rule--the commutative rule. By applying this rule, we can 
avoid time-scaling and obtain better results from certain SFG networks without 
feedback. In Chapter 3 we use cut-set localization rules plus this commutative 
rule to transform one 2-D systolic ring structure into another for solving linear 
convolution problems. For SFG networks with feedback, however, it is not so easy 
to avoid time-scaling. In transforming an SFG structure ( derived from a serial 
algorithm) for an Nth order recursive filter into a systolic array, it is impossible 
to maintain the same throughput. In Chapters 4 and 5 we shall see that, if a high 
throughput is essential, the original algorithm has to be modified. This will cause 
an increase in complexity of the system. 
20 
CHAPTER 3 
2-D SYSTOLIC IMPLEMENTATIONS FOR 
NON-RECURSIVE DIGITAL FILTERS 
3.1. Introduction 
A non-recursive filter, or a linear convolution problem, can be expressed as 
N1-l 
Yn = L hiXn-i, 0 < n < N1 + N2 - 2 
i=O 
(3.1) 
where N 1 and N 2 are the numbers of coefficients and inputs respectively. This is a 
very important computational problem in modern signal processing. To perform 
this computation at high speed, various kinds of systolic architectures have been 
introduced [8,17 ,18,21,22,24,43,46]. Since most of these architectures are 1-D, the 
throughput is limited by the word-serial nature. To achieve higher throughput, 
2-D word-parallel structures have to be considered. 
One of the most important characteristics of non-recursive filters is that they 
can be designed to have exactly linear phase [38]. The finite impulse response for 
a causal non-recursive filter with linear phase has the property that 
Therefore, equation (3.1) can be rewritten as 
Ni/2-1 
Yn = L hj(Xn-j + Xn-(N1 -1-j)) 
j=O 
where we assume that N 1 is even. 
(3.2) 
(3.3) 
A 1-D systolic array can easily be obtained by applying cut-set localization 
rules to SFG structures corresponding to equation (3.3). (One example is described 
21 
in (24].) The derived systolic structure uses N 1 /2 multipliers. In section 3.2 we 
derive a 2-D systolic array for solving the same problem. By using N 1 multipliers, 
our 2-D systolic array can achieve twice the throughput of the 1-D array. Thus 
this 2-D implementation is preferable for high speed signal processing. 
The concept of a systolic ring was first introduced by H. T. Kung and Lam (19]. 
Section 3.3 introduces two N1 x L 2-D systolic ring structures for solving linear 
convolution problems in (3.1), where L < N2. These N 1 x L 2-D systolic ring 
structures are based on nearest neighbour interconnections and can achieve the 
same throughput rate as the L x N 1 structure described in (30]. The complexity 
measure AP2 for this systolic ring is inversely proportional to L, but AP is inde-
pendent of L. By varying L we can then trade off area versus period for a given 
problem. We shall see that the most efficient 1-D systolic array for solving linear 
convolution problems is just the special case of a 2-D systolic ring structure with 
L = 1. 
22 
3.2. Linear Phase Non-Recursive Filters 
In this section we derive a 2-D systolic array with twice the throughput of 
the 1-D systolic version for linear phase non-recursive filters. We only consider 
the case when N1 is even. The derivation for the filter with N 1 odd is similar. 
The basic idea for constructing this 2-D systolic array is as follows. The 
equation in (3.3) can be arranged into two groups, one containing the outputs 
with even subscripts and the other those with odd subscripts-
Ni/2-1 
Y2p = L hi(X2p-i + X2p-(N1 -1-i)) 
i=O 
Ni/2-1 
Y2p+1 = L hi(X2p+1-i + X2p+l-(N1-l-i)) 
i=O 
(3.4) 
It is easy to see that the above two equations can be computed independently. 
Therefore, we may use two sub-systems to solve these two problems in parallel. 
If each sub-system can achieve the same throughput as that of the l-D systolic 
version, the throughput of the entire system for the complete problem is then 
doubled. 
Since the two equations in (3.4) are similar, we can visualize that the struc-
tures for solving these equations are also similar. Therefore, we just give a detailed 
description for the structure for computing Y2p· 
23 
X2p 
y(OO) 
2p+1 
Xaut := Xin 
Y out := Yin + h i * X in 
Fig. 3.1. Linear systolic array for computing y~~o) (N1 = 6) 
3.2.1. Sub-system for computing Y2p 
We divide the equation for y 2 p into two sub-equations as 
M/2-1 
Yi~) = L h2q( X2p-2q + X2p-(N1-1-2q)) 
q=O 
(1) 
Y2p 
where M = N1 /2. 
M/2-1 
L h2q+1(X2p-(2q+l) + X2p-(N1-1-(2q+1))) 
q=O 
(3.5) 
The superscript O ( or 1) in (3.5) indicates that the equation contains only 
those coefficients with even ( or odd, respectively) subscripts. 
Again the two sub-equations in (3.5) have similar structures, so we only con-
"d (0) s1 er y2P • 
y~~) in (3.5) consists of two parts 
M/2-1 
(00) ~ h 
Y2p = L...J 2qX2p-2q 
q=O 
M/2-1 
(3.6) 
(01) ~ h 
Y2p L...J 2qX2p-(N1-1-2q) 
q=O 
where the second component of the superscript being O ( or 1) denotes that only 
the inputs with even ( or odd) subscripts are required for solving that problem. 
It is easy to see that the equation for y~~o) is a linear convolution problem 
which can be implemented in a 1-D systolic array, as shown in Fig. 3.1. 
24 
Based on the symmetric property of the linear phase non-recursive filter (i.e., 
(3.7) 
Thus the equation for y~~l) is also a linear convolution problem. Instead of using 
the array depicted in Fig. 3.1, however, we use another well known 1-D systolic 
array [18] to compute y~~l), as shown in Fig. 3.2( a). Using the symmetric property 
again, we obtain -Fig. 3.2(b ). The reasoning is as follows: If y~~l) is implemented 
as in Fig. 3.1, the coefficients should be placed in the opposite order. These two 
arrays cannot easily be combined into one array. We would then have to use N 1 /2 
multipliers for computing the equations for y~~o) and y~~l). But the equation for 
y~~) in (3.5) shows that only N 1 /4 multipliers are required for the problem. 
X2p+1 ~ I ~:1 h3 I ~:1 t: h1 hs y (01) Xin Xout 2p hi 
Yin Y out (a) 
I I xout := Xin 
X2p+1 
:t 1 l~:1 I l~:1 ~ Yin + hi * Xin h4 h2 ho Y out := Yk~1) I I 
(b) 
Fig. 3.2. Two equivalent systolic arrays for computing y~~l) (N1 = 6) 
Although the coefficients in Fig. 3.1 and Fig. 3.2(b) are in the same order, 
there are two problems to be solved before they can be combined into one array· 
with Ni/4 multipliers for computing y;~). One problem is that the outputs moving 
from one cell to the next in the two arrays do not travel at the same speed. The 
25 
other is that the inputs to these arrays do not accord with the demands of the 
equation for y~~) in (3.5). 
From Fig. 3.1 we see that y~~o) needs two time units for travelling from one 
cell to the next, while y~~l) in Fig. 3.2(b) requires only one time unit. Therefore, 
to solve the first problem we apply a set of cuts in Fig. 3.2(b) and add one delay 
to each line in the cuts. 
To solve the second problem, we consider the difference between the two inputs 
in (3.5). Let D be the difference. Then 
D = 2p - 2q - (2p - (N1 - 1 - 2q) 
(3.8) 
= N1 -1 - 4q 
Since the outputs in the arrays travel at the same speed after the first modification, 
we need only to consider D in the leftmost cells. Substituting q = M /2 - 1 into 
(3.8), we then obtain D = 3. This can be done by adding extra two delays to the 
leftmost input line in Fig. 3.2(b ). 
(a) 
(b) 
-.111~-...,__.........._ X 1 out 
X 1 out := X \n 
x2 out := x2in 
X2out 
Yout 
Y out ·- Yin +hi * ( X \n + x2 in) 
Fig. 3.3. Systolic arrays ( a) for computing y~~) and (b) for y~~) 
After the above two modifications, we can combine these two arrays into one 
for computing y~~) = y~~o) + y1~l) with N 1 /4 multipliers. The combined array is 
26 
depicted in Fig. 3.3(a). 
Since the equation for y~~) has a similar structure to the one for y~~), \Ve can 
use the same method described above to construct an array for computing y~~) . 
(See Fig. 3.3(b).) 
' ' ' 
' ' ' 
h4 
' 
h2 \ 
x2p 
Y2p 
1 X2p+1 ' hs ' h1 
' ' 
' \ ' 
' ' \ \ 
' (a) 
' ' 
' ' 
' ' hs \ h3 h1 X2p ' 
\ 
Y2p+1 
X2p+1 
h4 \ h2 
' 
ho 
\ \ 
\ \ 
(b) 
Fig. 3.4. 2-D systolic arrays ( a) for computing Y2p and (b) for 
computing Y2p+1 
We now combine the two partial results into the result y 2 p = y~~) + y~~). One 
way of doing this is to place an additional adder at the right ends of the two 
arrays in Fig. 3.3. Observing that the outputs in each array take two time units 
for moving from one cell to the immediately next, we can also accumulate the 
output Y2p by letting it move diagonally between these two arrays. Therefore, not 
27 
only one adder, but also about N 1 /2 shift registers can be saved. The array is 
depicted in Fig. 3.4( a). It is clear that this array requires only N 1 /2 multipliers 
for computing Y2p· 
3.2.2. Compact form 
By using the same method a similar systolic structure for Y2p+l may be ob-
tained, as shown in Fig. 3.4(b ). Though the two sub-systems in Fig. 3.4 can solve 
the complete problem with twice the throughput of 1-D arrays, we give a more 
compact form in this sub-section. 
To obtain the compact form, we first apply a set of cuts to the two sub-
systems in Fig. 3.4. It can be seen that all the lines in the cuts are pointing in 
one direction. Thus we can add one delay to each line in the cuts. The modified 
version is depicted in Fig. 3.5. The blank square in the figure denotes a null cell. It 
passes data through it, but has no other function. It is easy to see, from Fig. 3.5, 
that the inputs x 2p and x2p+l move through both sub-systems in exactly the same 
manner and the basic processing elements in Fig. 3.5( a) take the position of the 
null cells in Fig. 3.5(b ), and vice versa. If we put them together, the functions 
of the two sub-systems will not be changed. Therefore, the final compact form is 
obtained, as shown in Fig. 3.6. 
Although the idea described above can be extended to achieve a higher 
throughput, regular systolic structures cannot be constructed because long com-
munication lines in the system are required. This problem requires further study. 
28 
2 2 2 
x2p 1 
h4 1 h2 1 ho 
Y2p 
X2p+1 hs 2 h3 2 h1 
(a) 
2 2 2 
X2p 
hs 1 h3 1 h1 1 
2p+1 
X2p+1 
1 
1 h4 2 h2 2 
(b) 
Fig. 3.5. The modified version of Fig. 3.4 
X2p 
hs h4 h3 h2 h1 
Y2 
p+ 
x2p+1 h4 h1 
Fig. 3.6. 2-D systolic array for a linear phase filter 
29 
3.3. Linear Convolution Problems 
In Subsection 3.3.1 we describe the L x N 1 systolic array which was derived 
in (30]. The following subsections then derive two N 1 x L systolic ring structures. 
We first derive an N1 x N2 systolic array for solving a linear convolution problem 
with N1 coefficients and N2 inputs. This 2-D systolic array is impractical if N 2 is 
large. We next modify it and introduce an N 1 x L systolic ring structure for solving 
the same problem, where Lis an arbitrary integer in the range 1 < L < N 2 • By 
using cut-set localization rules plus the commutative rule described in Chapter 2, 
an equivalent N 1 x L systolic ring structure can be obtained. Then we can see 
that the most efficient l-D systolic array for solving linear convolution problems 
is just a special case of this 2-D array with L = l. An upper bound on AP2 for 
these ring structures is also presented in the following discussion. 
3.3.1. L x N 1 2-D systolic array 
In this subsection, we review the L x N 1 (L < N 2 ) 2-D systolic array derived 
by Lu and others (30]. 
Fig. 3. 7 depicts an L x K systolic array. This array is called a matrix-vector 
multiplier because it can perform a matrix-vector multiplication y = Hx, where 
H is an L x K matrix and y and x are L x 1 and K x 1 vectors respectively. 
The matrix H is prestored in the array. While the input data travel vertically 
from top to bottom and remain unchanged, the outputs accumulate their terms 
horizontally and obtain the final results at the right side of the array. If the same 
H can be used for computing a number of problems, then the period for this array 
becomes equal to one. 
30 
0 
0 
0 
Xo X1 X2 X3 X4 
hoo ho1 ho2 rlo3 ho4 Yo 
h10 h11 h12 h13 h14 Y1 
h20 h21 h22 h23 h24 Y2 
Fig. 3.7. The matrix-vector multiplier 
Use the notation x~K) to denote a K x 1 vector, that is, 
X(K) = 
n 
Xn 
Xn-1 
x in 
Yin h .. 
I J 
Xout 
Xout := xin 
Yout := Yin+hij 
The linear convolution problem in (3.1) can then be expressed as 
Yn = hT xc:'1) 
Yout 
* x in 
(3.9) 
(3.10) 
where hT is the transpose of the coefficient vector and x~1 ) is a vector of N1 
successive input data items. 
To achieve a high throughput rate, we can compute L outputs simultaneously 
by using the following equation 
(L) H (N1+L-l) 
YjL == xjL (3.11 ) 
where j = 0, 1, 2, 3 · · · and His an L x (N1 + L - l) banded Toeplitz matrix. For 
31 
0 ho h1 h2 YjL 
1 
0 ho h1 h2 Yj L-1 
1 
0 ho h1 h2 YjL~2 
Fig. 3.8. The L x N 1 systolic array for solving linear convolu-
tion pro bl ems 
example, when L = 3 and N 1 = 3, the equation has the form 
XjL 
( YiL ) (10 h1 h2 0 JJ XjL-1 YiL-1 ho h1 h2 XjL-2 YiL-2 0 ho h1 XjL-3 
XjL-4 
(3.12) 
The above equation describes a matrix-vector multiplication. It can then be 
computed by using the matrix-vector multiplier depicted in Fig. 3. 7. Because some 
input data has to be reused for computing different blocks of output (which is easily 
seen by taking two successive blocks of output into account), extra communication 
lines have to be introduced. From Fig. 3.8 we can see that the area taken by the 
communication lines is very large. Therefore, this implementation is not area 
efficient in VLSI. 
32 
3.3.2. N 1 x N 2 2-D systolic array 
. 
15 
or 
For convenience, we consider a simple example with N 1 = 3 and N 2 = 4, that 
2 
Yn = L hiXn-i, 
i=O 
Yo= hoxo 
(3.13.a) 
(3.13.b) 
By inspection of the equations (3.13) a 2-D systolic structure can easily be 
obtained, as shown in Fig. 3.9. 
h2 h2 h2 h2 h· I 
Yout 
xout 
h1 h1 h1 h1 Xout := Xin 
Y out:= Yin+h i * Xin 
ho ho ho ho ir . r-delay . 
Fig. 3.9. An N 1 x N 2 systolic array for solving linear 
convolution problems 
33 
line 
In this systolic array, the coefficients are pre-stored in the cells. Cells in the 
same row contain the same coefficient. While the input data travel in parallel from 
top to bottom, the partial results move diagonally downward to accumulate the 
terms for each sum in (3.13). 
From Fig. 3.9 it can be seen that the computation for a subsequent problem 
can start immediately after the inputs of the previous problem have entered the 
array and that the area is proportional to the product of N1 and N2. Therefore, 
AP2 = O(N1 N 2 ), which is asymptotically optimal [8]. 
3.3.3. N 1 x L systolic ring structure 
The 2-D systolic array derived above is impractical because it has too many 
input/output lines and requires too large an area for a large problem. If we use 
only one part of the array and do some modifications so that it can still solve the 
same problem, then this modified systolic array is also much faster than any 1-D 
systolic array for solving a given problem since more inputs can enter the system. 
- ---~ 
--
'4-L-+ 
a 
M-1 M-2 ... 1 0 N1 
H 
Fig. 3.10. A block-diagram of the N 1 x N2 systolic array 
34 
Fig. 3.10 gives a block-diagram of the 1V1 x N 2 systolic array. In this figure the 
array is divided into M blocks with L columns of cells per block, that is N 2 =ML. 
As mentioned above, only one block of the array is to be used for solving the given 
problem. The input has also to be organized as an M x L data matrix. Data 
enters this one-block array row by row. An example is given in Fig. 3.11, in which 
N2 = 9, N1 = 5 and L = 3, so M = N2 / L = 3. 
ho ho ho 
Fig. 3.11. One block of the systolic array in Fig. 3.10 
If N 2 is not a multiple of L, we can easily add some zeros to the left or right 
sides of the input. ( Correspondingly, we can put some extra columns of cells on 
35 
the left or right boundaries of Fig. 3.10.) This does not affect the correctness of 
the computation. 
From Subsection 3.3.2 it is known that the period is one for this systolic 
array. The corresponding partial results produced by different rows of the input 
in Fig. 3.11 cannot be accumulated to form the final output. By observing the 
block-diagram in Fig. 3.10, it can be seen that each block communicates only 
with adjacent blocks on the left and right boundaries. We can visualize that in 
Fig. 3.11 there must be some extra interconnections between the block's left and 
right boundaries in order to obtain the desired result for a given problem. 
In the following discussion we show that an N 1 x L ring structure, as shown 
in Fig. 3.12, can solve linear convolution problems. The only difference between 
Figs. 3.11 and 3.12 is that there are extra two-delay lines connecting the left and 
right boundaries in Fig. 3.12. A similar systolic ring structure was introduced 
in [19] for the LU-decomposition of a band matrix. It uses nearest neighbour 
interconnections, as shown in Fig. 3.13. 
The systolic layout in Fig. 3.13 can be obtained by the following method. We 
first number the columns from left to right as 1, 2, 3 and so on. For the first row, 
we bring the left and right columns together and get a ring structure. We then 
expand the space between columns by one cell length, so that if we flatten out this 
ring the consecutive columns in the "front" and "back" parts will be interleaved. 
For the remaining rows, we apply odd-even column interchanges. 
Before proving that the N 1 x L systolic ring structure can solve linear con-
volution problems, we set up Cartesian coodinates to locate the input Xk in the 
data matrix and the coefficients hi in the array. The coordinate axes are given in 
36 
Xg X7 Xs 
X5 X4 X3 m 
X2 X1 Xo 14 
h4 h4 h4 J 
ho ho ho 
Fig. 3.12. An N1 x L ring structure for solving linear convolution 
problems 
Fig. 3.12. 
The input XJ.: can be represented by XZTn· The relation between k and the two 
indices, l and m, is 
k = l +mL (3.14) 
where O < k < N 2 - 1, 0 < l < L - l and O < m < M - l. 
A cell in the systolic array is defined by its coordinates (j, l). For example, the 
37 
1 2 3 1 2 
1 2 3 3 2 
2 3 3 
x line 
1 2 3 2 
1 
I y line 
2 3 2 
Fig. 3.13. Systolic layout of the 2-D systolic ring structure for 
solving linear convolution problems 
3 
top-right cell in Fig. 3.12 is cell (0, 0) and the bottom-left cell is cell (N1 -1, L-1 ). 
Since cells on the same row store the same coefficient, it is easy to locate the 
coefficients in the array by using the j axis as 
i = Ni -1-j (3.15) 
where i is the subscript of the coefficients. 
Lemma 3.1 : The relation between the output Yn and the coordinates j, l and 
m must satisfy 
n = N1 - 1 + mL + l - j 
38 
(3.16) 
where n is the subscript of the output Yn· 
Proof: It can be seen from (3.1) that 
n-i = k (3.17) 
where n, i and k are the subscripts of y, h and x, respectively. 
Substituting (3.14) and (3.15) into (3.17), we obtain (3.16). D 
We define them th row in the data matrix as the m th wavefront since it requires 
m unit-time delays before it enters the systolic array. 
Consider the mih wavefront and the m~h or ( m 0 + a )th wavefront, where a is 
an arbitrary integer in the range 1 < a< M. IT these two wavefronts produce the 
corresponding partial results for a given n, the equation (3.16) must be satisfied. 
For the mih wavefront we then have 
n = N1 - 1 + moL + lo - io (3.18) 
and for the (mo+ a)th wavefront, 
n = N1 - 1 +(mo+ a)L + la - ia• (3.19) 
Combining (3.18) and (3.19) yields 
lo - io = aL + la - ia• (3.20) 
Since the difference between Figs. 3.11 and 3.12 is the interconnections be-
tween two boundaries, we only consider 10 = L - 1 (left boundary) and la = 0 
(right boundary). The equation (3.20) becomes 
io+(a-1)L+l=ia• 
39 
(3.21) 
The above equation shows that for a given n the partial result from the m 6h 
wavefront accumulates the final term at cell (j0,L -1), while the (mo+ a )th 
wavefront produces its first term for that n at cell (j0 + (a - l)L + 1, 0). It can be 
seen from Fig. 3.14 that the partial result at cell (j0 , L - 1) can only arrive at cell 
(jo + ( a - 1 )L + 1, 0). Therefore, we need to show that the partial result produced 
by the mih wavefront arrives at cell (j0 + ( a - 1 )L + 1, 0) at the same time as the 
(mo+ a)th wavefront arrives. 
U o, K-1) 
2 (j 0+ 1, 0) 
K 
(j 0+K, K-1) 
K 
(j 0 +2K,K-1) 
2 
(j 0+2K+ 1, 0) 
Fig. 3.14. The route for one output in the ring structure 
Lemma 3.2 : For a partial result to move downward L rows takes L + 1 units 
of time. 
40 
Proof: In Fig. 3.14, a partial result moving from cell (jo, L - l) to cell (j0 + 
L, L-1) first goes through a two-delay line to cell (jo + 1, 0) on the right boundary 
and then moves diagonally downward L rows to cell (jo + L, L - 1 ), which takes 
L - 1 units of time. Therefore the total time required is 
t = 2 + L - 1 = L + 1. 
It is easy to see that this is true not only for the first column on the left boundary, 
but also for other columns. D 
Suppose that the computation starts at time t = 0 and that the partial results 
from the (mo +a)th wavefront and from the m~h wavefront arrive at cell (j0 ,0) or 
(io + ( a - 1 )L + 1, 0) at times ta and to, respectively. 
The (mo + a)th wavefront requires m0 + a units of time before entering the 
system and then ia + l units of time to travel from the first row to the i!h row. 
The time ta is 
ta =mo+ a+ ia + l. (3.22) 
Now ia = io + (a - l)L + l, so the above equation becomes 
ta = mo + a + io + ( a - 1 )L + 2. (3.23) 
Similarly, the mih wavefront arriving at the jJh row takes to1 units of time, 
that is, 
to1 =mo+ io + 1. (3.24) 
The partial result produced by the m~h wavefront then moves zigzag down-
ward (a-l)L rows to cell (j0 +(a-l)L, L-1), which requires (a-l)L+a-1 units 
41 
of time according to Lemma 3.2. Finally this partial result has to pass through a 
two-delay line before arriving at cell (jo + ( a - l )L + 1, 0) on the right boundary. 
Fig. 3.14 gives an example with a = 3. The time for the partial result to move 
from cell (j0 , L - I) to cell (jo + ( a - I )L + 1, 0) is 
to2 = ( a - I )L + a - I + 2 = ( a - 1 )L + a + 1. (3.25) 
Therefore, combining (3.24) and (3.25), we obtain 
to= to1 + to2 =mo+ a+ io + (a - I)L + 2 = ta. (3.26) 
Since we assumed that a is arbitrary in the above discussion, all the partial 
results for the given output can properly be accumulated in the system. There-
fore, we conclude that the N 1 x L systolic ring structure can indeed solve linear 
convolution problems. 
3.3.4. An upper bound on AP2 
In this subsection we derive an upper bound on AP2 for the systolic ring 
structure described in Section 3.3.3. 
We first consider the area taken by the systolic array. From Fig. 3.13, it is 
easy to see that the area is dominated by cells. Therefore, the area taken by an 
N 1 x L systolic ring is proportional to the product of N 1 and L, that is, 
(3.27) 
Since linear convolution problems can be computed by using the N1 x L 
-
systolic ring, it is obvious that two product terms of hiXin produced by Xi 11 and 
Xib are accumulated by one output Yn in the N1 x L systolic ring if and only if 
42 
lia - ib l < N1 - 1. Therefore, in order to avoid two successive linear convolution 
problems interfering with each other, the first problem must be followed by N 1 -1 
zeros. However, since L inputs can enter the systolic array simultaneously at one 
time, the period is 
(3.28) 
Combining (3.27) and (3.28), we obtain the upper bound 
From (3.27) and (3.28), it can also be seen that AP is independent of L. 
Therefore, by varying L we can trade off area versus period for a given linear 
convolution problem. 
3.3.5. The second systolic ring structure 
Using cut-set localization rules plus the commutative rule described in Chap-
ter 2, we can transform the systolic array in Fig. 3.12 into the array in Fig. 3.15. 
First we apply a set of horizontal cuts and subtract one delay from each line 
in these cuts; the delays on the input ( Xi) lines become zero. Since Xi remains 
unchanged during the computation, we can change the direction of the input lines. 
We then apply a set of diagonal cuts and delay transfer for the lines in the cuts 
so that the delays on all diagonal lines become zero. We can change the direction 
of the diagonal lines without affecting the final result since this is a non-feedback 
system so that the commutative law for addition can be applied. 
Finally we apply another set of horizontal cuts and add one delay to each 
line on these cuts. The array in Fig. 3.12 is then transformed into the array in 
Fig. 3.15. 
43 
ho ho ho 
1 
Fig. 3.15. A systolic array equivalent to that of Fig. 3.12 
The above three transform steps are given in Fig. 3.16. 
For L = 1, the 2-D systolic ring in Fig. 3.15 becomes a 1-D systolic array, as 
shown in Fig. 3.17. This 1-D systolic array is the most efficient 1-D systolic array 
for solving linear convolution problems among those reported in the literature [29]. 
44 
(a) step 1 
(b) step 2 
(c) step 3 
Fig. 3.16. Steps for transforming Fig. 3.12 into Fig. 3.15 
45 
Fig. 3.17. A systolic ring structure with L = 1 
46 
3.4. Discussion 
In this chapter we derived several 2-D systolic architectures for non-recursive 
digital filters. 
We derived a 2-D systolic array for 1-D linear phase non-recursive filters. 
In this design we first divided the problem into several sub-problems. All these 
sub-problems are implemented in 1-D systolic arrays. Then we applied cut-set 
localization rules to put those l-D arrays together to form a regular 2-D system. 
By using N1 multipliers, this 2-D systolic array can achieve twice the throughput 
of the 1-D systolic arrays which use N 1 /2 multipliers for a given problem. The 
disadvantage of this design is that long communication lines are required to achieve 
a higher throughput. 
We also introduced two 2-D N 1 x L systolic ring structures for solving lin-
ear convolution problems. The method used in this design is different from the 
one for parallel implementation of linear phase filters. We first constructed a 2-D 
N1 x N 2 system for linear convolution problems with N 1 coefficients and N 2 in-
puts, and then partitioned this system into an N 1 x L ring structure. A systolic 
array for solving the same problem was obtained after applying odd-even column 
interchanges to the ring structure. The derived N 1 x L systolic ring structures are 
more efficient than the L x N 1 array described in [30] because to achieve the same 
throughput our arrays require less area in VLSI implementation. It is interesting 
to note that the property of the nearest neighbour interconnection in these systolic 
ring structures is always guaranteed when the length of Lis changed. With L = l 
a 2-D ring structure becomes a 1-D array which has been proved to be the most 
efficient 1-D systolic array for solving linear convolution problems. Because AP is 
47 
independent of L, we can trade off area versus time for a given problem by varying 
L. 
When we were constructing the 2-D systolic arrays for non-recursive filters , 
we also demonstrated the power and generality of cut-set localization rules. There 
are already several well known applications of these rules, such as to localize a 
computing network with global communications, to transform one structure into 
other different ones with the same functions and to apply second-level pipelining to 
a given system so that the throughput of that system is greatly increased without 
a great increase in complexity. In this chapter we showed that cut-set localization 
rules are also useful in combining several sub-systems to form an efficient system 
with high throughput. 
With some minor modifications, the systolic ring structures can also 
solve DFT, circular convolution and 2-D linear convolution problems because 
these problems can easily be transformed into 1-D linear convolution prob-
lems [11,22,43]. We shall see in Chapter 5 that the 2-D ring structures can be 
applied as a linear part in an efficient parallel structure for direct-form recursive 
filters. 
48 
CHAPTER 4 
STABILIZED PARALLEL ALGORITHMS FOR 
DIRECT-FORM RECURSIVE FILTERS 
4.1. Introduction 
Recursive filtering is one of the most important techniques in digital signal 
processing. A lot of effort has been made in the past few years to achieve high-
throughput implementation of this type of filtering [2,30,34,36,39-41,43,51,53-55]. 
However, these methods were mainly based on the conventional look-ahead com-
putation. Although this technique has been applied successfully to the parallel 
implementation of state-variable-form recursive filters, it may cause numerical in-
stability when applied to direct-form recursive filters, due to the effect of finite 
wordlength. Thus this chapter introduces new methods for deriving stabilized 
parallel algorithms for direct-form recursive filters. These algorithms lead to very 
efficient pipelined and/ or parallel structures, which will be described in Chapter 5. 
The derived structures belong to the category of systolic/wavefront arrays and are 
suitable for VLSI. 
In Section 4.2, we describe the conventional look-ahead computation technique 
and show that it may cause numerical instability in the direct-form implementation 
of recursive filters. Section 4.3 introduces a new method for deriving, in the Z 
domain, a stabilized parallel algorithm for direct-form recursive filters. The degree 
of parallelism and the stability of this algorithm are also analyzed. In order to 
obtain this stabilized algorithm, extra zeros and poles have been introduced. The 
complexity is then greatly increased. In Section 4.4 we introduce a technique 
49 
called decomposition to minimize this increase in complexity. Since the algorithm 
is derived in the Z domain, it cannot be used for time-varying recursive systems. 
In Section 4.5, we consider a time domain derivation of the parallel algorithms for 
direct-form recursive filters. 
50 
4.2. Conventional Look-Ahead Computation 
In the look-ahead technique, the given recursion is iterated as many times 
as desired to create the necessary concurrency and then the concurrency created 
can be used to obtain pipelined and/or parallel implementation of recursive sys-
tems [41). In the following we give an example to show that this conventional tech-
nique may cause numerical instability in the practical implementation of direct-
form recursive filters due to the effect of finite wordlength. 
An Nth order direct-form recursive filter can be expressed as 
N N 
Yn = L TjYn-j + L WjXn-j ( 4.1) 
j=l j=O 
Because Yn in ( 4.1) depends on the availability of the immediately previous output 
Yn-l, it is not obvious that any two outputs can be computed in parallel. Con-
ventional look-ahead computation is applied to increase the degree of parallelism 
as follows. 
Using ( 4.1 ), we write Yn-1 explicitly as 
N N 
Yn-1 = L TjYn-1-j + L WjXn-1-j ( 4.2) 
j=l j=O 
Let j' = j + l. Then 
N+l N+l 
Yn-1 = L Tj 1 -1Yn-j' + L Wj 1 -1Xn-j' (4.3) 
j'=2 j'=l 
Substituting ( 4.3) into ( 4.1), we obtain 
N N 
Yn = T1Yn-1 + L TjYn-j + L WjXn-j 
j=2 j=O 
N+l N+l N N 
= r1( L Tj-1Yn-j + L Wj-1Xn-j) + L TjYn-j + L WjXn-j (4.4) 
j=2 j=l j=2 j=O 
N+l N+l 
~ (1) ~ (1) L.J rj Yn-j + L.J Wj Xn-j 
j=2 j=O 
51 
where 
and 
j == 0 
1 <j < N 
j==N+l 
Since Yn is independent of Yn-1 after the modification, two outputs can be com-
puted simultaneously. We can prove that, after L - 1 iterations, Yn will become 
N+L-1 N+L-1 
~ (L-1) ~ (L-1) 
Yn = L..t rj Yn-j + L..t Wj Xn-j (4.5) 
j=L j=O 
In the above equation, the coefficients can be computed by the following iterative 
algorithm. 
for i < l < L do 
- -
begin 
for O < j < N + l - 1 do 
begin { to compute wJl-l) } 
if O < j < l - l then 
(l-1) (l-2) 
wj := wj 
else if l - l < j < N + l - l then 
(Z-1) ·- (l-2) (l-2) 
wj .-wj +r1_ 1 *Wj-(l-1) 
else w ( l-l) ·- r(l- 2) * w j .- l-1 N 
end; { end of computing wf-l) } 
for O < j < N + l - l do 
begin { to compute rf-l) } 
if O < j < l then 
r\l-1) := 0 
J 
52 
( 4.6) 
else if Z < j < N + l - l then 
( l-1 ) ( Z-2 ) (Z-2 ) 
rj :=rj +rz_1 *rj- (l-l ) 
( l-1 ) ( l-2 ) 
else r ·- r * r j . . l-1 N 
end { end of computing ry-i) } 
end. 
( The proofs of ( 4.5) and ( 4.6) are given in Appendix 4.1.) 
To analyze the stability of the modified algorithm, we transform it into the Z 
domain and then obtain the impulse response function as 
N+L-1 (L-1) -j 
• O W· Z H( ) J= J 
Z = ~N+L-1 (L-1) _ · 
1-L..,L rj z J 
(4.7) 
We prove, in Appendix 4.2, that H(z) in (4.7) can be rewritten as 
H(z) = H(z)~~:~ ( 4.8) 
In the above equation, fI ( z) is the impulse response function before the modifica-
tion and D(z) is an (L - 1rh order polynomial in z-1 , which is defined as 
L-1 
D(z) = 1 + L ry-i) z-i (4.9) 
j=l 
where r;j-l) can be computed by the iterative algorithm in ( 4.6). 
From ( 4.8) we see that, to obtain a parallel algorithm, we have multiplied both 
numerator and denominator of the original impulse response function by a factor 
D( z). Because of the zero and pole cancellation, the impulse response function 
after the modification is theoretically equivalent to the original one. However, 
we cannot guarantee that all roots of D( z) are within the unit circle. Thus the 
modification may cause numerical instability due to the effect of finite wordlength. 
We give an example below. 
53 
Consider a second-order recursive filter. The denominator of H(z) is given 
as 1 - l.5z-1 + 0.56z-2 . Since the two roots of this denominator are 0. 7 and 
0.8, the system is stable. We now derive a parallel algorithm by using the above 
conventional technique. After one iteration, the root of D( z) = 1 + r1 z-1 is 
-1.5, which is outside the unit circle. After two iterations, the roots of D( z) 
l+r1z- 1 +r?) z- 2 are -0. 75±1.5i. They are also outside the unit circle. Therefore, 
the modified system is definitely unstable. 
54 
4.3. Stabilized Parallel Algorithm 
In the previous section, we have shown that applying the conventional look-
ahead computation to the implementation of direct-form recursive filters may cause 
numerical instability because the extra poles introduced cannot be guaranteed to 
be inside the unit circle. In this section we derive a new parallel algorithm which 
is guaranteed to be stable if the original (serial) algorithm is stable. 
4.3.1. Algorithm 
The impulse response of an Nth order recursive filter can be expressed as 
( 4.10) 
where X(z) and Y(z) represent the Z transforms of input and output, respectively. 
To obtain our parallel algorithm, we first introduce a well known N x N 
matrix, which is called a companion matrix, 
0 1 0 0 
0 0 1 0 
B= ( 4.11) 
0 0 0 1 
-bo -b1 -b2 -bN-1 
In this matrix, the elements on the first superdiagonal are all equal to one and the 
jth element on the last row is -bj-l, but all other elements are equal to zero. It 
is known that 
det(zl - B) =ZN+ bN-1ZN-l + • • • + b1z + bo 
N 
N " N . = Z + L.J bN-jZ -J 
j=l 
( 4.12) 
where det(X) denotes the determinant of the matrix X and I is an N x N identity 
55 
matrix. Let -bj = rN-j· Then 
and 
0 1 0 0 
0 0 1 0 
B= 
0 0 0 1 
TN TN-1 TN-2 r1 
N 
det(zl - B) = zN - ~ TjZN-j 
i=l 
Multiplying both sides of ( 4.14) by z-N, we obtain 
N 
det(I - Bz-1 ) = 1 - ~ riz-i 
i=l 
( 4.13) 
( 4.14) 
( 4.15) 
From the above equation, we see that by using the companion matrix B, the 
denominator of the impulse response function in ( 4.10) can be expressed in matrix 
form. Thus we can rewrite H(z) as 
'\;"'N -i 
H(z) - _LJ_;...i_O _w_i z __ 
- det(I - Bz-1 ) ( 4.16) 
We next multiply both numerator and denominator of ( 4.16) by a factor 
det(LJ-01 Bi z-i), where B 0 = I and Bi is a product of j matrices B. (An 
efficient method for computing Bi is given in Appendix 4.3.) The impulse response 
function then becomes 
_ det(Lf 01 Bi z-i) H ( z) = H ( z )-d _(_"'_L ___ 1 B-. -_-.) 
et L.Ji=O J z J 
("'N -i)d t("'L-1 Bi -i) L.Ji=O WiZ e L.Ji=O z ( 4.17) 
- det(I - Bz-1 )det(I:f 01 Bi z-i) 
("'N -i)d t("'L-1 Bi -i) 
- L.Ji=O WiZ e L.Ji=O z 
det((I - Bz-1 )(I:f 01 Bi z-i)) 
Now 
L-1 
(I- Bz-1 )(~ Biz-i) = I- BLz-L ( 4.18) 
i=O 
56 
Substituting ( 4.18) into ( 4.17), we finally obtain our modified impulse response 
function as 
(""'N -i)d t(""'L-1 Bi -j) H z - Ltj=O WjZ e Ltj=O z 
( ) - det(I - BLz-L) ( 4.19) 
which implies a parallel algorithm. The rest of this section will analyze the degree 
of parallelism and the stability of this algorithm. 
4.3.2. Degree of parallelism 
We may divide ( 4.19) into two parts: 
Y(z) U(z) 
H(z) = U(z) X(z) = H2(z)H1(z) ( 4.20) 
In the above equation, U(z) is an intermediate variable, and H2(z) and H1(z) are 
the recursive part and linear part, respectively, defined as follows 
Y(z) 1 
H2(z) = U(z) = det(I- BLz-L) ( 4.21) 
and 
N L-1 U(z) ~ . ~ .. H1(z) = = (L..,, WjZ-J)det(L..,, BJz-J) X(z) . . 
;=O ;=O 
( 4.22) 
It is recursion that limits the parallel implementation of recursive filters. To ana-
lyze the degree of parallelism, therefore, we only consider the recursive part H2 ( z ). 
Lemma 4.1. Suppose that B is an N x N matrix. Then det(I - BL z-L) can 
be expressed as an N Lth order polynomial in z-1 with only N + l terms, that is, 
N 
det(I - BLz-L) = 1 - L bjLZ-jL 
j=l 
where bjL is a combination of some elements in BL. 
57 
( 4.23) 
Proof: Let A= zL. Then 
( 4.24) 
Since B is an N x N matrix, so is BL. Thus det(I - BL A- 1 ) is an Nth order 
polynomial in z-1 and can be expressed as 
N 
det(I- BLA- 1 ) = 1- L bjLA-j 
j=l 
where bjL is a combination of some elements in BL. 
Substituting zL for A in ( 4.25), we then obtain ( 4.23). 
Using Lemma 4.1, we can rewrite H2(z) as 
H2(z) = Y(z) = N 1 . 
U(z) 1 - ~j=l bjLz-JL 
Converting ( 4.26) into the time domain, we obtain 
N 
Yn = L bjLYn-jL + Un 
j=l 
( 4.25) 
0 
( 4.26) 
( 4.27) 
Because Yn in ( 4.27) depends only on Yn-jL for j = 1 to N, L outputs can be 
computed simultaneously. That is why we call our modified algorithm a parallel 
algorithm. 
In the following, we give an example of N = 2. 
( 
(L-2 ) 
From Appendix 4.3, we can obtain that BL = r fL-l) 
rL+l 
is computed by using ( 4.6). Then 
r L-1.. ( l ) (L-?)) 
(L-l ) , where r j 
I-BLz-L= -rL z 
( 
l (L-2) -L 
(L-1) -L 
-rL+1 z 
58 
rL 
(L-2) -L ) 
-rL-1 z 
1 (L-1 ) -L 
- TL Z 
( 4.28) 
We have 
d t(I _ BL -L) _ (l _ (L-2) -L)(l _ (L-1 ) -L) _ (L-2 ) (L-1 ) -2L e Z - TL Z TL Z TL-l TL+l Z 
_ l _ ( (L-2) + (L-1 ) ) -L _ 
- rL rL z 
( (L-2) (L-1) (L-2) (L-1)) -2L 
- rL-1 rL+1 - rL rL z 
where tr(BL) denotes the trace of BL, b1L = tr(BL) and b2L = -det(BL). 
Since H2(z) = YU((z)) = I;' 1 _ .L, then 
Z 1- b · LZ J i=l J 
Yn = b1LYn-L + b2LYn-2L + Un 
4.3.3. Stability 
Since we have assumed that Bis an N x N matrix, 
and 
L-1 L-1 
det(L Bi z-i) = z-N(L-l)det(L Bi zL-l-i) 
i=O i=O 
We can rewrite H(z) into another form as 
("!'_I . N-i)d t("~-1 Bi L-1-i) 61=0 WJZ e DJ=O Z 
H(z) = det(IzL - BL) 
Similarly, H(z) in (4.16) can be rewritten as 
°"N N-i 
- L.ij=O WjZ H ( z) - ---'-----
- det(Iz - B) 
59 
( 4.29) 
( 4.30) 
( 4.31) 
( 4.32) 
( 4.33) 
( 4.34) 
Suppose that the original algorithm before the modification is stable. Then the 
roots of det(Iz - B) are all in the unit circle. This means that the eigenvalues Zi 
of B are all in the unit circle. It is clear that the eigenvalues zf of BL are also 
in the unit circle and closer to the origin than their corresponding z/s. Thus, the 
stability of our modified algorithm is obvious. 
60 
4.4. Reduction of Complexity 
In this section we consider the number of multipliers required for implementing 
the modified algorithm in the 1-D case. (It will be seen, in the next chapter, that 
the number of multipliers in the 2-D case is only increased by a factor of L.) 
We shall see that the direct implementation of the modified algorithm is not very 
practical because the number of multipliers for that algorithm is very large. A 
technique called decomposition is introduced to reduce the number of multipliers. 
Lemma 4.2. Suppose B is an N x N matrix, then det(I:J 01 Bi z-i) is an 
N(L - l)th polynomial in z-1 . 
Proof: From Lemma 4.1, det(I - BLz-L) is an N Lth order polynomial in z-1 . 
We also know that det(I - Bz-1 ) is an Nth order polynomial in z- 1 . However, 
we have det(I - BLz-L) = det(I - Bz-1 )det(LJ 01 Bi z-i). Thus the order of 
( ~L-1 · · ( ) 0 det Lij=O BJz-J) must be NL - N =NL -1. 
Since det(LJ 01 Bi z-i) is an N(L - l)th order polynomial, N(L - 1) multi-
pliers are required for computing the associated convolution. Therefore, N(L -1) 
extra multipliers have been introduced in the modified algorithm, which dominates 
the total number of multipliers. In the following, we apply the decomposition tech-
nique to reduce this number of multipliers. 
Lemma 4.3. If L = l 1 l 2 , where 11 and 12 are positive integers, then LJ 01 Bi z-j 
can be expressed as 
L-1 11-l 11-l L Bi z-j = ( L (Bz-1 )il1 )( L (Bz-1 )5) ( 4.35) 
j=O j=O j=O 
where B 0 == I. 
61 
Proof: We arrange L terms of ~J-01 Bi z-i into l2 groups with l1 terms each 
as follows 
L-1 L Bi z-i = [I+ Bz- 1 + ... + (Bz- 1 ) 11-l ]+ 
i=O 
( 4.36) 
Let l1-l 
Q =I+ Bz-1 + ... + (Bz-1 )li-1 == L (Bz-1 )i ( 4.37) 
i=O 
For the ith group in ( 4.37) we have 
We then obtain 
L-1 L Bi z-i = Q + (Bz-1 )Zi Q + ... + (Bz-1 )(12-1)11 Q 
i=O 
= [I+ (Bz-l l1 + ... + (Bz-1 )<Z2-l)Z1 ]Q 
Z2-l 
= ( L (Bz-1 )il1 )Q 
i=O 
Z2-l Z1-l 
= ( L (Bz-1 )ili )( L (Bz-1 )i) 
i=O i=O 
D 
Lemma 4.4. If L = TI[ 1 lk, where lk is a positive integer, then 
L-1 K l1c-1 L Bi z-j = II ( L (Bz-1 / n::11 li) ( 4.39) 
i=O k=l i=O 
where B 0 = I and the li are not necessarily distinct. 
Proof: By induction on K from Lemma 4.3. D 
62 
We give an example with L = 12. Since 12 can be expressed as a product of 
three prime numbers 3, 2 and 2, we have 
11 L Bi z-i =(I+ Bz-1 + B 2z-2 ) + (B 3 z-3 + B 4 z-4 + B 5 z-5 )+ 
j=O 
( 4.40) 
From the above example, it is easy to see that a large polynomial has been 
decomposed into a product of three small polynomials. Then 'I:~ 1 0 Bi z-i can be 
implemented in a three-stage cascaded structure with only 4N multipliers, instead 
of llN multipliers, where we suppose B is an N x N matrix. This reduction of 
the number of multipliers can be formally expressed by the following two lemmas. 
Lemma 4.5. Suppose that B is an N x N matrix and q and p are constants, 
then 
q-1 (q-l)N 
det(L(Bz-1 )iP) = l + L djpz-iP ( 4.41) 
j=O j=l 
where B 0 = I and djp is a combination of some elements in BP,·· B(q-l)p. 
Proof: Let zP = .X. Then 
q-1 q-1 
det(L(Bz-1 )iP) = det(L BiP _x-i) ( 4.42) 
j=O j=O 
From Lemma 4.2, we know that det(I:1-~ BiP.X-i) is an N(q - l)th order poly-
nomial in .x-1 . It can then be expressed as 
q-1 (q-1)N 
det(L BiP .X-i) = L djp_x-1 ( 4.43) 
j=O j=O 
63 
where djp is a combination of some elements in B P, · · · B (q -l )p . 
Since the coefficient of ).0 is unity both in det (I-BL ).-L ) and in de t(I-B).-1 ) , 
the coefficient of ). 0 in ( 4.43) must also be unity. Then 
q-1 (q-l ) N 
det(I: BiP ).-i) = 1 + L djp).-J ( 4.44) 
j=O j=l 
Replacing ). by zP in ( 4.44), we then obtain 
q-1 (q-l)N 
det(L(Bz-1 )iP) = 1 + L djpZ-jp 
j=O j=l 
D 
We see, from Lemma 4.5, that there are only N( q - 1) multipliers required to 
compute the associated convolution although det(Lj-~ (Bz-1 )iP) is an Np( q- l )th 
order polynomial. By extending this result, we have Lemma 4.6. 
Lemma 4.6. If L = rrf 1 lk, there are N(Lf 1 lk - K) multipliers required 
for computing the convolution associated with det(LJ 01 Bi z-i) by using the 
decomposition technique, where B is an N x N matrix. 
Proof: From Lemma 4.4, we may have 
L-1 K l1c-l 
det( L Bj z-j) = det( II ( L (Bz- 1 )j rr:::t z.)) 
j=O k=l j=O 
K l1c-l 1c 
= II det( L (Bz-1 / rr.::t z.) 
( 4.45) 
k=l j=O 
. L-1 · · We see, from the above equation, that det(Lj=O B 1 z-1 ) can be expressed as a 
product of K small polynomials. It can then be implemented in a K-stage cascaded 
structure. From Lemma 4.5, however, there are N(lk - 1) multipliers required in 
the kth stage. For K stages, the total number of multipliers is then 
K K 
L N(lk - l) = N(L lk - K) ( 4.46) 
k=l k=l 
64 
Since ~f 1 lk - K may be much smaller than Tif 1 h - 1, the number of 
multipliers may be greatly decreased after the decomposition. If L = 2K ( or 
K = log2 L), for example, then the number in ( 4.46) becomes 
K 
~N(lk) = NK == Nlog 2 L ( 4.4 7) 
k=l 
65 
4.5. Time Domain Derivation 
In this section we consider a time domain derivation of the parallel algorithms 
for direct-form recursive filters. This is of particular interest for time-varying 
recursive systems. 
From Section 4.3 we see that the stabilized parallel algorithm derived in the 
Z domain has the form 
N 
Yn == ~ bjLYn-jL + Un 
j=l 
( 4.48) 
Thus our goal is to derive, in the time domain, a parallel algorithm with the same 
form. In Section 4.5.1, we use a second-order recursive filter as an example to 
demonstrate two methods of achieving this form from the original (serial) algo-
rithm. There are many other ways to achieve this form and one could ask if all 
these methods produce the same solution. Section 4.5.2 shows that uniqueness 
holds under one condition. If this condition is satisfied, all the methods give a 
unique solution and this solution is stable (if the original algorithm is stable), be-
cause it is the same as the one derived in Section 4.3, which has been proved to 
be stable. If the condition for uniqueness is not satisfied, we need to analyze the 
stability of the derived algorithm. This problem is discussed in Section 4.5.3. 
66 
4.5.1. Methods of derivation 
In this subsection, we use a second-order recursive filter as an example to 
demonstrate two methods for obtaining parallel algorithms with the same form as 
that in ( 4.48). We assume that the condition for the unique solution (which will 
be discussed in the next subsection) holds, so all steps of the following derivations 
are valid. 
A second-order recursive filter is expressed as 
( 4.49) 
where Vn = WoXn + W1Xn-1 + W2Xn-2. To increase the degree of parallelism by a 
factor of L = 6, a modification is made so that the following form is obtained 
Yn = b1LYn-L + b2LYn-2L + Un 
( 4.50) 
To achieve this, we introduce two methods. The first one needs (L - l)N steps to 
complete the derivation. Thus it is called the derivation without decomposition. 
The second one is called the derivation with decomposition because it takes only 
('I:;;° 1 h - K)N steps, where L = IT;;° 1 h. 
67 
1 (1 X 2 3)1 (0 1 2] 
2 (2X 3 4)1 (0 2 3] 
3 (3x 4 5)1 (0 3 4] 
4 (4X 5 6)1 (0 4 5] 
5 (5X 6 7)1 [O 5 6] 
6 (6 7x 8)1 [O 6 7) 
7 (6 3x 9)2 [O 6 8] 
8 (6 gx 10)3 (0 6 9) 
9 (6 1ox 11)4 (0 6 10] 
10 (6 11 X 12)5 [O 6 11] 
11 [O 6 12] 
Fig. 4.1. The derivation without decomposition (N = 2 and L = 6) 
4.5.1.1. Derivation without decomposition 
We are concerned with the derived form, but not with the exact values of 
coefficients. For simplicity, therefore, we use index notation to describe the pro-
cedure of our derivation, as shown in Fig. 4.1. The number in the first column 
denotes the step of the derivation. The numbers in the other columns represent 
an output. For example, 2 stands for Yn-2 and 3 for Yn-l. The numbers between 
the square brackets at the ith row represent an equation, which is the result from 
the immediately previous step and is called the equation at step i. For example, 
(0 6 7] on the sixth row denotes 
(5) (5) (5) 
Yn = b6 Yn-6 + b7 Yn-7 + Vn ( 4.51) 
where b)5 ) is the jth coefficient of the equation, which is derived from step 5. A 
68 
group of numbers between parentheses with a superscript i denotes that an output, 
the corresponding number of which has a superscript "x", is written explicitly by 
using the equation at step i. Since the equation at step 1 is expressed as 
(o) (o) (o) 
Yn = bl Yn-1 + b2 Yn-2 + Vn ( 4.52) 
(o) (o) (o) ( 
where b1 = r1, b2 = r2 and Vn = vn, then 6 7 8)1 can be written as 
(o) (o) (o) 
Yn-6 = bl Yn-7 + b2 Yn-8 + Vn-6 ( 4.53) 
. Thus (6 7x 8)1 on the sixth row is expressed as 
( 4.54) 
In the above we have assumed that bio) =/. 0. 
The operation at step i is described as follows: First an output is written ex-
plicitly according to the coefficients between the parentheses. Next the expression 
of this output is substituted into the equation at step i, which is determined by 
the coefficients between the square brackets and is derived from step i - 1. An 
equation for the next step is 'then obtained. An example follows. At step 6 in Fig. 
4.1, we write Yn-7 explicitly, using the equation at step 1. The result is given in 
( 4.54). Next we substitute it into ( 4.51 ), the equation at step 6. Then we obtain 
( 4.55) 
It is easy to verify, from Fig. 4.1, that the final result after ten iterations can 
be expressed as 
( 4.56) 
69 
where 
10 
Un= Vn + L djVn-j 
j=l 
( 4.57) 
Because ten extra multipliers are introduced for computing un, the complexity 
after the modification is greatly increased. This number of multipliers is just 
(L - l)N. That is why we call this method derivation without decomposition. In 
the following, we describe a more efficient method with decomposition. 
4.5.1.2. Derivation with decomposition 
1 (lx 2 3)1 (0 1 2] 
2 (2X 3 4)1 (0 2 3] 
3 (3 4x 5)1 (0 3 4] 
4 (3 5x 6)2 [O 3 5] 
5 (3X 6 9)5 (0 3 6] 
6 (6 gx 12)5 (0 6 9] 
7 (0 6 12] 
Fig. 4.2. Derivation with decomposition (N = 2 and L = 6) 
The index notation of this derivation is depicted in Fig. 4.2. Since we are 
considering the decomposition, the detailed derivation at each step ( as listed in 
Appendix 4.4) must be studied carefully. 
From Appendix 4.4 it is easily verified that Un can be computed in two stages, 
that is, 
4 
- "d(l) 
Un = Vn + L..J j Vn- j ( 4.58) 
j=l 
70 
and 
2 
- L d (2) -Un = Un + 3j Un-3j ( 4.59) 
j=l 
Therefore, only six multipliers are required for computing Un. 
1 (1 X 2 3 4)1 [O 1 2 3] 
2 (2 3x 4 5)1 [O 2 3 4] 
3 (2 4 5x 6)2 [O 2 4 5] 
4 (2x 4 6 8)4 [O 2 4 6] 
5 (4 5x 8 10)4 [O 4 6 8] 
6 (4 8 1ox 12)5 [O 4 8 10] 
7 (4X 8 12 16)7 [O 4 8 12] 
8 (8 12x 16 20)7 [O 8 12 16] 
9 (8 16 2ox 24) 8 [O 8 16 20] 
10 (0 8 16 24] 
Fig. 4.3. Derivation with decomposition ( N = 3 and L = 8) 
The idea described above can easily be extended to more complicated cases. 
Fig. 4.3 depicts an example of N = 3 and L = 8. It is easy to prove that Un in 
this example can be computed in three stages, using only nine multipliers ( which 
71 
4.5.2. Condition for unique solution 
In section 4.5.1, we described two methods for obtaining a time domain deriva-
tion. However, there are many other methods of achieving the same goal. Two 
of them are depicted in Fig. 4.4. In this subsection we give the condition for the 
unique solution of the stabilized parallel algorithm. 
The problem is defined as follows: Suppose that 
N (L-l)N N 
(1 - L Tjz-i)( L ajz-i) = L bjLZ-jL ( 4.60) 
j=l j=O j=O 
where b0 = 1 and 1 - ~f 1 TjZ-j is the determinant of the Z transform of a 
stable recursive filter. The question is whether there exists a unique solution for 
determining LN + 1 unknowns aj for O < j < (L - l)N and bjL for 1 < j < N. 
We first modify ( 4.60). Multiplying both sides of ( 4.60) by 
then 
(L-l)N "\;"'N b. -jL 
L -. L.,j=O :JLZ a ·z :J - -------:J - N . j=O 1 - Lj=l rjz-J 
By a power series expansion, we have 
l N N 
__ "\;"'_N ___ . = 1 + L TjZ-j + (L Tjz-i) 2 + ... 
1 - L.,j=l Tjz-J j=l j=l 
oo N 
= L(L TjZ-j)k 
k=O j=l 
00 
~ -k 
= L.J SkZ 
k=O 
1 
1- r · z-, 
~
N . ) 
j=l , 
( 4.61) 
( 4.62) 
where so = 1 and s k is a combination of some r j. (The calculation of s k for 
1 < k < (L - l)N can be found in Appendix 4.5.) The above expansion is 
convergent for z-1 in the unit circle because the original system is stable. 
72 
1 (1 2x 3)1 (0 1 2] 
2 (1 X 2 3)1 (0 1 3] 
3 (2X 3 5)2 (0 2 3] 
4 (3 5x 6)3 (0 3 5] 
5 (3X 6 9)5 [O 3 6] 
6 (6 gx 12)5 (0 6 9] 
7 (0 6 12] 
(a) 
1 (1 X 2 3)1 (0 1 2] 
2 (2 3x 4)1 (0 2 3] 
3 (2X 4 6)3 (0 2 4] 
4 (4X 6 8)3 (0 4 6] 
5 (6 3x 10)3 (0 6 8] 
6 (6 1ox 12)4 [O 6 10] 
7 [O 6 12] 
(b) 
Fig. 4.4. Two other methods for the time domain 
derivation (N = 2 and L = 6) 
Using ( 4.61) and ( 4.62), we can rewrite ( 4.60) as 
(L-l)N N oo 
L ajz-j = (L bjLZ-jL)(L skz-k) 
j=O j=O k=O 
( 4.63) 
From ( 4.63) it is easy to verify that, once bjL for O < j < N is determined, there is 
an unique solution for aj for O < j < (L- l)N. Therefore, our problem is reduced 
73 
to finding the condition for the unique solution of N unknowns bjL for 1 < j < N. 
Lemma 7. Consider an equation of the form ( 4.60) or ( 4.63). There exists a 
unique solution for bjL for 1 < j < N if and only if the following inequality is 
satisfied 
det 
S(L-l)N+l-L S(L-l)N+l-2L 
S(L-l)N+2-L S(L-l)N+2-2L 
SLN-1-L 
where s k = 0 for k < 0. 
S£N-1-2L 
SL-N+l 
SL-N+2 
SL-1 
Proof: We define aj = 0 for j > (L - l)N. Then we can write 
(L-l)N oo 
L ajz-i = L ajZ-j 
j=O j=O 
Substituting ( 4.65) into ( 4.63), we have 
oo N oo 
L ajz-i = (L bjLZ-jL)(L Skz-k) 
j=O j=O k=O 
( 4.64) 
( 4.65) 
( 4.66) 
Except for the first ( L - 1 )N + 1 linear equations, which are used to determine ai 
for O < j < (L- l)N, from (4.66) we can write an infinite set of linear equations 
as 
( 4.67) 
+ '' ' s L b(N-l)L 
where Sk = 0 for k < 0. 
Although there is a countably infinite number of linear equations in ( 4.67), 
only at most N of them are independent. This is proved as follows: Substituting 
74 
( 4.65) into ( 4.60), then 
N oo N 
(1 - L rjz-i)(L ajz-j) == L bjLZ-jL ( 4.68) 
j=l j=O j=O 
Since I:f O bjLZ-jL is an N Lth order polynomial in z-1 , the coefficient for any 
term with an order higher than NL must be zero. Thus we can write 
( 4.69) 
From the above equations it is easy to see that once the first N coefficients ai 
for (L - l)N + 1 < j < NL are known, aNL+l can be determined by the first 
equation, then aNL+2 is determined by the second and then aNL+3 by the third, 
and so on. Therefore, only at most N linear equations among those equations in 
( 4.67) are independent. 
To compute bjL for 1 < j < N, we choose the first N linear equations from 
( 4.67). Since bo == s 0 == 1, we then have 
0 == 8 (L-l)N+l +s(L-l)N+l-L bL + · · · +sL-N+l b(N-l)L 
0 == 8 (L-l)N+2 +s(L-1)N+2-L bL + ... +s L-N+2 b(N-l)L 
0 == SLN +sLN-L bL + ... +sL b(N-l)L +bNL 
or in a matrix form, 
S(L-1)N+1-L S£-N+l 0 bL S(L-1)N+1 
S(L-1)N+2-L S£-N+2 0 b2L S(L-1)N+2 
S£N-1-L S£-1 0 b(N-l)L S£N-1 
S£N-L SL 1 bNL SLN 
75 
( 4. 70) 
( 4. 71) 
where Sk = 0 for k < 0. From ( 4. 71 ), it is easy to see that the necessary and 
sufficient condition for the unique solution of bjL for 1 < j < N is 
det 
or 
S(L-l)N+l-L S(L-l)N+l-2L 
S(L-l)N+2-L S(L-l)N+2-2L 
det 
SLN-l-L 
SLN-L 
S£N-l-2L 
S£N-2L 
S(L-l)N+1-L S(L-l)N+l-2L 
S(L-l)N+2-L S(L-l)N+2-2L 
SLN-l-L 
where Sk = 0 for k < 0. 
S£N-l-2L 
4.5.3. Stability 
SL-N+l 0 
S£-N+2 0 
SL-l 0 
SL 1 
SL-N+l 
S£-N+2 
SL-l 
#0 ( 4. 72) 
D 
In the previous subsection we derived the condition for the unique solution 
of the stabilized parallel algorithm. In certain problems, however, this condition 
does not hold. Since the solutions in these cases are not the same, we need to 
analyze the stability of the derived algorithm. The following is an example with 
N = 2 and L == 6. 
From ( 4.64) we can express the condition of the unique solution for N == 2 
and L = 6 as 
S5 # 0 ( 4. 73) 
By using ( 4.A5.13) and ( 4.6), S5 can be written as 
( 4. 74) 
Therefore, the solutions will not be unique if eithe~ r 1 , rr + 3r2 or rf + r2 is equal 
to zero. 
76 
To investigate stability, we consider one of the above three cases, that is, 
(4.75) 
Since bj = 0 for j :/= iL in ( 4.A5.2), we can obtain a linear homogeneous system 
of N(L -1) equations. Using these equations plus the constraint (4.75), we can 
calculate the coefficient aj and then obtain 
10 L ajz-i = (l + r1z-1 - r2z- 2)(l - r2z- 2)(l + a6z-6) 
j=O 
( 4. 76) 
Suppose that the original system is stable. The two poles z1 and z2 of the 
system are inside the unit circle. It is easy to verify that the roots of 1 + r 1 z-1 -
inside the unit circle. To obtain a stabilized parallel algorithm, therefore, the 
absolute value of a6 in ( 4. 76) must be smaller than unity, that is, 
Consider one extreme case when a6 = 0. Then 
1 3 -6 = - r 2 z 
( 4. 77) 
( 4. 78) 
It can be verified that this result may be obtained by just using the conventional 
look-ahead technique described in Section 4.2. 
Using the same method described above to the other two cases, we can see 
that ( 4.77) is the condition for obtaining a stabilized algorithm with a degree of 
parallelism L = 6 for a second-order recursive filter when the condition for the 
unique solution ( 4.73) is not satisfied. 
77 
4.6. Discussion 
In this chapter we first showed that conventional look-ahead computation for 
obtaining parallel algorithms for recursive systems can cause numerical instability 
in the direct-form implementation of recursive filters. The reason is that the 
introduced poles cannot be guaranteed to be inside the unit circle. Then we 
introduced a new method for deriving, in the Z domain, an algorithm for direct-
form recursive filters. We showed that not only is the degree of parallelism of this 
new algorithm increased, but the stability is also improved. If the original serial 
computation is stable, that is, the eigenvalues Zi of the companion matrix in the 
original algorithm are all in the unit circle, the eigenvalues zf of the companion 
matrix in the parallel system will also be in the unit circle and closer to the origin 
than their corresponding zi's. This enhanced stability may mean that less bits per 
word are necessary in our parallel implementation. 
The penalty for using the new method is a great increase in complexity. For 
example, an extra N(L - 1) multipliers are required if we want to increase the 
degree of parallelism by a factor of L for an Nth order recursive filter. However, 
to implement the original serial algorithm requires only 2N + 1 multipliers. To 
alleviate this problem, we introduced a decomposition technique. After decompo-
sition, the extra multipliers introduced are decreased to N("2:f 1 lk - K) where 
L = IT~ 1 lk. 
We also considered the method for obtaining, in the time domain, parallel 
algorithms with the same form as the one obtained in the Z domain. The derived 
parallel algorithm can be applied to time-varying recursive systems which are an 
important tool in modern digital signal processing. There are many different ways 
78 
.... 
to achieve the same goal. We then discussed the condition for uniqueness of the 
stabilized parallel algorithm. We showed that if this condition is not satisfied, 
the derived algorithms are stable only under certain conditions. Therefore, the 
problem of stability has to be carefully considered when one is deriving parallel 
algorithms for a time-varying recursive system. 
The method described in this chapter can also be extended to obtain parallel 
algorithms for recursive filters in state-variable form. However, the decomposition 
technique is useful to reduce the increased complexity only if certain conditions 
are satisfied. A detailed description of this issue can be found in Appendix 4.6. 
79 
Appendix 4.1 
In this appendix, we prove ( 4.5) and ( 4.6) by using induction. 
From ( 4.2), ( 4.3) and ( 4.4), we see that ( 4.5) holds for L = 2. By induction, 
assume that the algorithm holds after L - 2 iterations, that is, 
N+L-2 N+L-2 
~ (L-2) ~ (L-2 ) 
Yn = .L..J r j Yn-j + .L..J W j Xn-j 
j=L-1 j=O 
where 
and 
(L-2) - r j + r L-2 r j-(L-2)' 
{ 
(L-3) (L-3) 
rj - (L-3) 
(L-2) 
W· = J 
rL-2 TN, 
(L-3) 
wj ' 
(L-3) (L-3) 
wj + rL-2 Wj-(L-2), 
(L-3) 
rL-2 WN, 
L - 1 < j < N + L - 2; 
j=N+L-2 
0 < j < L - 2; 
L - 2 < j < N + L - 2; 
j = N + L - 2. 
Using (4.1), we write Yn-(L-l) explicitly as 
N N 
Yn- (L-1) = L TjYn- (L-1)-j + L WjXn- (L-l) -j 
j=l j=O 
Let j 1 = j + L - 1. Then 
N+L-1 N+L-1 
Yn-(L-1 ) = L rj'- (L-l )Yn-j' + L Wj'- (L-1 ) Xn-j' 
j'=L j'=L-1 
80 
( 4.Al.1) 
( 4.Al.2) 
( 4.Al.3) 
( 4.Al.4) 
( 4.Al.5) 
Substituting ( 4.Al.5) into ( 4.Al.l ), we have 
N+L-l N+L-l 
(L-2) ( " " ) Yn = rL-l L_.,; Tj- (L-l )Yn-j + L_.,; Wj-(L-l)Xn-j + 
j=L j=L-l 
N+L-2 N+L-2 
" (L-2) " (L-2) + L,.,; rj Yn-j + L,.,; wj Xn-j 
j=L j=O 
N+L-2 
" ( (L-2) (L-2) ) (L-2) 
L_.,; rj + rL-l Tj-(L-l) Yn-j + rL-1 TnYn-(N+L-1)+ 
j=L ( 4.Al .6) 
L-2 N+L-2 
" (L-2) " ( (L-2) (L-2) ) + L,.,; wj Xn-j + L,.,; wj + rL-l Wj-(L-1) Xn-j+ 
j=O j=L-1 
(L-2) 
+ TL-1 WnXn-(N+L-1) 
N+L-1 N+L-1 
" (L-1) " (L-1) L,.,; rj Yn-j + L,.,; Wj Xn-j 
j=L j=O 
In the above equation, 
{ 
(L-2) (L-2) 
(L-1) - r j + r L-1 r j-(L-1)' 
rj - (L-2) 
rL-1 TN, 
and 
(L-1) 
W · = ] 
(L-2) 
wj ' 
(L-2) (L-2) 
wj + rL-1 Wj-(L-1), 
(L-2) 
rL-1 WN, 
81 
L < j < N + L -1; 
j=N+L-1 
0 < j < L - l; 
L -1 < j < N + L - 1; 
j = N + L -1. 
( 4.Al. 7) 
(4.Al.8) 
Appendix 4.2 : 
In the following, we prove 
N+L-1 N 
L w;L-l ) z-j = (L Wjz-i)D(z) ( 4.A2.1) 
j=O j=O 
where 
L-1 
D(z) = l + L r;j-l) z-i ( 4.A2.2) 
j=l 
The proof of the denominator of H(z) in (4.8) is similar. 
First let L = 2. Since we have, from ( 4.6), 
{ 
wo, j = O; 
w ;1) = w i + r1 w j-1, 1 < j < N + l; 
r1wN, j = N + l 
thus 
N+l N 
L w;1)z-i = wo + L(wj + r1Wj-i)z-i + r1wNZ-(N+i) 
j=O j=l 
N N 
= wo + L WjZ-j + r1(L Wj-lZ-i + WNZ-(N+i)) ( 4.A2.3) 
j=l j=l 
N N+l 
= L WjZ-j + r1 L Wj-1Z-j 
j=O j=l 
Let j' = j - 1. Then 
( 4.A2.4) 
Substituting ( 4.A2.4) into ( 4.A2.3), we obtain 
N+l N N 
L wJ1) z-i = L WjZ-j + r 1 z-1 L WjZ-j 
j=O j=O j=O 
N 
( 4.A2.5) 
= (L Wjz-i)D(z) 
j=O 
82 
By induction, assume that ( 4.A2.l) holds for L - 2, that is, 
N+L-2 N L-2 L w;L-2) z-j = (L Wjz-i)(l + L rf-l ) z-i) 
j=O j=O j=l 
Since we can also have, from ( 4.6), 
(L-1) 
W· = ] 
(L-2) 
w. ' ] 
(L-2) (L-2) 
wj + rL-1 Wj-(L-1), 
(L-2) 
rL-1 WN, 
thus 
N+L-1 L-2 N+L-2 
0 < j < L - 1; 
L - 1 < j < N + L - 1; 
j=N+L-1 
( 4.A2.6) 
~ (L-1) -j _ ~ (L-2) -j + ~ ( (L-2) + (L-2) . ) -j+ L.,,; wj z - L.,,; wj z L.,,; wj rL-l w 1_(L-l) z 
j=O j=O j=L-1 
+ 
(L-1) -(N+L-1) 
rL-1 WNZ 
L-2 N+L-2 
= L w;L-2) z-i + L w;L-2) z-i + 
j=O L-1 
N+L-2 
+ r1L /) ( L Wj-(L-1)Z-j + WNZ-(N+L-1)) 
j=L-1 
N+L-2 N+L-1 
~ (L-2) -j + (L-2) ~ -J 
L.,,; W J Z TL-l L.,,; Wj-(L-l)Z 
j=O j=L-1 
( 4.A2. 7) 
But 
N+L-1 N L Wj-(L-i)Z-j = z-(L-l) L WjZ-j ( 4.A2.8) 
j=L-1 j=O 
Thus 
N 
(L-2) -j + (L-2) -(L-1) ~ -j 
wj Z rL-1 Z L.,,; WjZ ( 4.A2.9) 
j=O j=O j=O 
Subtituting ( 4.A2.6) into ( 4.A2.9), we obtain ( 4.A2.l ). 
83 
Appendix 4.3 : 
In this appendix, we give an efficient method for computing B k. For simplicity, 
we show it in an example of N == 4. 
First we mention the shift matrix 
0 1 0 0 
0 0 1 0 
T== (4.A3.1) 
0 0 0 1 
0 0 0 0 
The effect of premultiplication by T is to upshift a matrix by one row and replace 
its last row by zeros. For example, 
0 1 0 0 a11 a12 a13 a14 a21 a22 a23 a24 
0 0 1 0 a21 a22 a23 a24 a31 a32 a33 a34 ( 4.A3.2) 
0 0 0 1 a31 a32 a33 a34 a41 a42 a43 a44 
0 0 0 0 a41 a42 a43 a44 0 0 0 0 
If the elements on the last row in the shift matrix are not all equal to zero, it is 
easy to see that nothing in the resulting matrix in ( 4.A3.2) is changed except that 
the elements on the last row will not all be equal to zero. 
Now consider B 2 , that is, 
0 1 0 0 
B2 == 
0 0 1 0 
0 0 0 1 
r4 r3 r2 r1 
Using the property of T, we have 
B2 == 
where 
0 1 0 0 
0 0 1 0 
0 0 0 1 
r4 r3 r2 r1 
0 0 1 0 
0 0 0 1 
84 
2 < j < 4; 
j == 5. 
( 4.A3.3) 
( 4.A3.4) 
By induction, suppose that, for k = K - l where k > N, we have 
(K-5) 
rK-1 
(K-5) 
rK-2 
(K-5) 
rK-3 
(K-5 ) 
rK-4 (K-4) (K-4) (K-4) (K-4) 
BK-1 = rK rK-1 rK-2 rK-3 (4.A3.5) (K-3) (K-3) (K-3 ) (K-3 ) 
rK+l rK rK-1 rK-2 (K-2) 
rK+2 
(K-2) 
rK+l 
(K-2) 
rK 
(K-2) 
rK-1 
where rY) can be computed by using the iterative algorithm in ( 4.6). 
Fork= K, then 
(K-5) (K-5) (K-5) (K-5) 
0 1 0 0 rK-1 rK-2 rK-3 rK-4 
0 0 1 0 (K-4) (K-4) (K-4) (K-4) BK= BBK-1 = rK rK-1 rK-2 rK-3 
0 0 0 1 (K-3) (K-3) (K-3) (K-3) 
rK+l rK rK-1 rK-2 
T4 T3 r2 r1 (K-2) (K-2) (K-2) (K-2) 
rK+2 rK+l rK rK-1 
( 4.A3.6) 
From the property of the shift matrix T, we know that BK should have the 
fallowing farm 
(K-4) 
rK 
(K-4) 
rK-1 
(K-4) 
rK-2 
(K-4) 
rK-3 
(K-3) (K-3) (K-3) (K-3) 
BK= rK+l rK rK-1 rK-2 ( 4.A3.7) (K-2) (K-2) (K-2) (K-2) 
rK+2 rK+1 rK rK-1 
X41 X42 X43 X44 
To compute x4j for 1 < j < 4 from ( 4.A3.6) requires N 2 = 16 multiplications. 
However, we have 
Then we can easily prove that 
X43 X44) = ( r(K-l) K+3 (K-1)) rK 
( 4.A3.8) 
( 4.A3.9) 
where rJK-l) can be computed from r;K-2) by using the iterative algorithm in 
( 4.6), that is, 
{ 
(K-2) (K -2) (K-1) _ rj +rK-1 Tj-(K-1), 
rj - (K-2) 
rK-1 r4, 
85 
K < j < K + 3; 
j = K + 3. 
From the above equation, we see that it takes only N 
compute X4j. 
In general, consider that B is an N x N matrix. Then 
fork< N, 
N-k+l 
and fork> N, 
Bk= 
0 
0 
0 
(k-1) 
rN+k-1 
(k-N) 
rk 
(k-N+l) 
rk+1 
. 
(k-1) 
rN+k-1 
k+l k+2 
1 
0 
0 
0 
1 
0 
TN-k TN-k-1 
(k-1) (k-1) 
rN-1 rN-2 ... 
(k-N) (k-N) 
r rk-N+l k-1 
(k-N+l) (k-N+1) 
rk ... rk-N+2 
. . 
(k-1) 
rN+k-2 
(k-1) 
rk 
4 multiplications to 
0 
0 
1 
(k-1) 
rk 
( 4.A3.10) 
(4.A3.11) 
Using the iterative algorithm in ( 4.6), the number of multiplications for computing 
Bk is Nk. 
86 
Appendix 4.4 : 
This appendix lists the detailed derivation steps using an example of N = 2 
and L = 6 with decomposition. 
The original equation can be written as 
(o) (o) (o) 
Yn = bl Yn-l + b2 Yn-2 + Vn 
where bio) = r 1 , b~o) = r 2 and v~o) = Vn. In each step of the following, the first 
equation is associated with the denotation on the second column in Fig. 4.2, the 
second one is the derived equation for the next step and the third is the modified 
equation for the linear part. We assume that no divisors below are equal to zero. 
Step 1: 
(0) (0) (0) 
Yn-l = bl Yn-2 + b2 Yn-3 + Vn-1 
(1) (1) (1) 
Yn = b2 Yn-2 + b3 Yn-3 + Vn 
Step 2: 
(0) (0) (0) 
Yn-2 = bl Yn-3 + b2 Yn-4 + Vn-2 
(2) (2) (2) 
Yn = b3 Yn-3 + b4 Yn-4 + Vn 
Step 3: 
87 
Step 4: 
(4) (4) (4) 
Yn = b3 Yn-3 + b6 Yn-6 + Vn 
Step 5: 
(4) (4) (4) 
Yn-3 = b3 Yn-6 + b6 Yn-9 + Vn-3 
(5) (5) (5) 
Yn = b6 Yn-6 + bg Yn-9 + Vn 
Step 6: 
1 b~4 ) 1 (4) 
Yn-9 = wYn-6 - wYn-12 - wvn-6 
b3 b3 b3 
b(6) b(6) (6) Yn = 6 Yn-6 + 12 Yn-12 + Vn 
From the above equations, it is easy to verify that 
4 
v( 4) = v(o) + "d\1) v (o) . 
n n L._,; J n-; ( 4.A4.l) 
j=l 
and 
2 
v(6) = v(4) + "d\2)v(4) . 
N n L._,; J n-3 ( 4.A4.2) 
j=l 
Let Un = v~4 ) and Un = v~6). We then obtain ( 4.58) and ( 4.59). 
88 
Appendix 4.5 : 
In this appendix, we calculate the value of s k for 1 < k < ( L - l )N. 
We first define bi = 0 for i -/= j L. Then 
Thus ( 4.60) can be rewritten as 
N (L-l)N NL 
(1 - L Tjz-i)( L ajz-i) = L bjz-i 
j=O j=O j=O 
( 4.A5.1) 
( 4.A5.2) 
Using the above equation, we can write a linear system of (L - l)N + 1 equations 
as 
ao bo 
a1 b1 
. . 
. . 
. . ( 4.A5.3) 
an bn -TN -r1 1 
0 a(L-l)N b(L-l)N 
or 
Ra=b ( 4.A5.4) 
where R is an ((L - l)N + 1) x (((L - l)N + 1) triangular Toeplitz matrix, a and 
b are ( ( L - 1 )N + 1) x 1 vectors. It is easy to verify that 
det(R) = 1 ( 4.A5.5) 
Therefore, R- 1 exists and ( 4.A5.4) can be rewritten as 
( 4.A5.6) 
Substituting ( 4.A5.l) into ( 4.63), we have 
(L-l)N NL oo 
L ajZ-j = (L bjz-i)(L skz-k) (4.A5.7) 
j=O j=O k=O 
89 
We can also write, from ( 4.A5.7), a linear system of (L - l)N + l equations as 
a= Sb ( 4.A5.8) 
where 
a= ( ao a1 a(L-1)N )T, 
b = ( bo b1 b(L-1)N )T 
and 
1 
S1 1 
S= S2 s1 1 
S(L-1)N S2 S1 1 
Comparing ( 4.A5.6) and ( 4.A5.8), we obtain 
( 4.A5.9) 
Therefore, to calculate s k for 1 < k < ( L - 1 )N, we can use the following equation 
SR=I ( 4.A5.10) 
Since Sis also a triangular Toeplitz matrix, all the information needed is contained 
in the first column ( or the last row). Then we can also use the following simplified 
equation for computing s k. 
1 
R 
S(L-1)N 
It is easy to verify that, for 1 < k < (L - l)N, 
1 
0 
0 
where rik-l) can be computed by using the iteration algorithm in ( 4.6). 
90 
( 4.A5.12) 
( 4.A5.13) 
Appendix 4.6 
This appendix extends the method derived in Chapter 4 to the state-variable 
form implementation of recursive filters. 
The state-variable form of an Nth order recursive filter can be expressed in 
two equations, that is, the state update equation 
x(n + 1) = Ax(n) + bu(n) ( 4.A6.l) 
and the output equation 
y(n) = cT x(n) + du(n) ( 4.A6.2) 
where A is an N x N matrix, b and c are N x 1 vectors, dis a scalar, x( n) represents 
an N x 1 state vector and u( n) and y( n) are input and output, respectively. 
To obtain a parallel algorithm, we consider the state update equation ( 4.A6.l) 
since only this equation contains recursion. Transforming ( 4.A6.l) into the Z 
domain, we have 
or 
X(z)z = AX(z) + bU(z) 
X(z) 
U(z) 
( 4.A6.3) 
( 4.A6.4) 
where I is an identity matrix and X( z) and U( z) represent the Z transforms of 
x( n) and u( n ), respectively. 
Multiplying both numerator and denominator of ( 4.A6.4) by a factor 
"'M-1 Ai -i th L-i=O z , en 
X(z) 
U(z) (I - Az-1 )(I:f1;1 Ai z-i) 
(~~;1 Ai z-i)bz-1 
I-AMz-M 
91 
( 4.A6.5) 
... 
The above equation can be divided into two parts, that is, a recursive part 
and a linear part 
1 X(z) 
V(z) I-AMz-M 
( ) M-1 V z == ( L Ai z-i)bz-1 
U(z) i=O 
M-1 
== L Aibz-(j+1) 
i=O 
where V(z) is a vector of intermediate variables. 
( 4.A6.6) 
( 4.A6.7) 
To analyze the degree of parallelism, we transform ( 4.A6.6) into the time 
domain and then obtain 
x(n + M) == AMx(n) + v(n + M) ( 4.A6.8) 
It can be seen from the above equation that M vectors of state var~ables x( n + i) 
for O < i < M - l can be computed simultaneously. 
Since Ai bis an N x 1 vector and can be precomputed, to implement ( 4.A6. 7) 
requires NM multipliers. The complexity of the modified system is increased. Un-
like the case in the direct form implementation, however, this number of multipliers 
may not be reduced by using decomposition if A is an N x N full matrix. 
Suppose that M can be factored as M == ITf 1 mk where mk is a positive 
integer. Applying the decomposition technique, Lf ; 1 Ai z-i can be expressed 
as a product form. Then 
M-1 K m"-1 
bz-1( L Ai z-i) == bz-1 II L (Az-1 )i rr:~t mi ( 4.A6.9) 
J= k=l i=O 
When A is an N x N full matrix, it is easy to verify that to implement ( 4.A6.9) re-
quires N 2(Lf 1 mk-K)+N multipliers where N 2 is due to full-matrix and vector 
92 
multiplications involved in the computation. Thus the decomposition technique is 
. 
not useful if N(~1( 1 mk - K + 1) is greater than M = IT1( 1 mk. 
To solve the above described problem, we recommend that A is a block-
diagonal matrix with block size q by q for q much smaller than N. A matrix 
with this form can easily be obtained from a parallel combination of qth order 
recursive filters. When A is a block-diagonal matrix with block size q by q, only 
qN multipliers are required for implementing each matrix-vector multiplication 
involved in ( 4.A6.9). Thus the number of multipliers is reduced to qN(~f 1 mk -
K)+N. Note that A will not be a simple diagonal matrix in most cases. Otherwise 
complex numbers will appear in the computation, which increases the complexity 
of the system. Consider the best case with q = 2. The number of multipliers 
required for implementing ( 4.A6.9) is minimized to 2N(~1( 1 mk - K) + N, which 
may be much smaller than N ITf 1 mk. 
93 
CHAPTER 5 
PIPELINED AND/OR PARALLEL ARCHITECTURES 
FOR DIRECT-FORM RECURSIVE FILTERS 
5.1. Introduction 
This chapter describes some efficient architectures associated with the stabi-
lized parallel algorithms derived in the previous chapter. The two-level pipeline, 
first introduced by H. T. Kung and his colleagues [19,21], is a good method not 
only for achieving high throughput computation, but also for having less area in 
VLSI implementation in comparison with other parallel approaches. In Section 
5.2 we show that, by using the stabilized parallel algorithms, an efficient two-level 
pipelined structure can easily be obtained. Another approach to achieving high 
throughput computation is by using parallel processing. We derive two parallel 
structures in Section 5.3. The first structure has the advantage of regularity while 
the second one can achieve a linear complexity in parallel size for cascaded second-
order recursive filters. Since the two-level pipelined structure has less area in VLSI 
implementation than parallel structures, the system should desirably be imple-
mented by first using pipelining to the maximum possible extent, and then using 
parallel processing in combination with pipelining if further increase in throughput 
is required. Therefore, we finally describe pipelined and parallel architectures for 
direct-form recursive filters in Section 5.4. 
94 
5.2. Two-Level Pipelined Structure 
In this section, we derive an efficient two-level pipelined structure with a 
throughput of M for Nth order recursive filters. As we know, the algorithm derived 
in the previous chapter can be divided into two parts, the recursive part and the 
linear part. We first introduce a structure for the recursive part and show that 
our stabilized algorithms can easily be used for second-level pipelining. Then 
we describe a structure for the linear part. Since this structure is unidirectional 
without feed-back, it can have as many second-level pipelined stages as desired. 
Therefore, the two structures can easily be matched with the same throughput. 
Recursive part : 
For convenience, we rewrite ( 4.27) as follows 
N 
Yn = L bjMYn-jM + Un 
j=l 
(5.1) 
It is easy to construct a structure for computing this algorithm, as shown in 
Fig. 5.1. The system has N identical basic processing elements ( or cells), each 
of which performs a simple multiplication-and-accumulation operation. It inputs 
the data Un at the leftmost cell and outputs the result Yn at the rightmost cell. 
The result is then fed back immediately into the system from the rightmost cell as 
another input for later results. Since Yn depends only on Yn-jM for j = 1, 2, · · · N, 
the required NM delays are evenly distributed in the system. By using cut-set 
localization rules described in Chapter 2, a two-level pipeline with M stages can 
be obtained. Therefore the throughput is M times higher than that of the system 
for computing the unmodified algorithm. 
95 
y.M~y. 1-~~I 
b· I Yout 
Fig. 5.1. The structure for the recursive part (N = 2 and M = 4) 
Linear part : 
From Lemmas 4.4 and 4.5, the impulse response function of the linear part H1 (z) 
in ( 4.22) can be expressed as a product of K + l polynomials in z- 1 as 
N K (m1c-l)N 
H 1 (z)=(~wiz-i)Il(l+ ~ d}k)z-ifI:,:tmi) (5.2) 
j=O k=l j=l 
where M = TI[ 1 m1.: for mk is a positive integer and d)k) denotes the coefficient 
in the kth stage in the product term. A I( + l stage cascaded structure may be 
applied for implementing this algorithm. Because of the similarity of these stages, 
we only consider one as follows 
U(z) (q-l)N 
V(z) = 1 + h djz-iP (5.3) 
where p and q are positive integers and V(z) and U(z) are the Z transforms of 
input and output, respectively. 
Converting (5.3) into the time domain, we have 
(q-l)N 
Un= Vn + ~ djVn-jp 
j=l 
96 
(5.4) 
The above equation can then be computed by using a structure such as that 
depicted in Fig. 5.2. This structure is similar to the one in Fig. 5.1, the only 
difference is that it is unidirectional without feed-back. 
V n-(q-1 )Np 
Un 
d· I 
U out 
Fig. 5.2. The structure for computing (5.4) 
Since the structure in Fig. 5.2 is unidirectional, as mentioned before it can 
have as many second-level pipelined stages as desired. Therefore, we can easily 
make Fig. 5.2 match ;Fig. 5.1 with the same throughput. An example of N = 2 
and M = 4 is given in Fig. 5.3. 
We see from the above discussion that N multipliers are required for cornput-
ing the recursive part. For implementing ~f O WjZ-j, we need N + 1 multipliers. 
Therefore, N + l + N(~f 1 mk - K) multipliers are required for computing the 
linear part. The total number of multipliers required for an Nth order recursive 
filter is then 
K K 
N + N(L mk - I()+ N + 1 = N(L mk - I(+ 2) + 1 (5.5) 
k=l k=l 
When mk is not a prime number and can be expressed as a product of some prime 
numbers, the small polynomial on the right-hand side of ( 5.2) can be further 
97 
Fig. 5.3. The structure with four stages of second-level pipeline 
for second-order recursive filters 
K' decomposed. In the best case ,vhen M = [Ik=l Pk where Pk is a prime number, 
therefore, the total number of multipliers can be reduced to N(Lf(' 1 Pk - I(' + 
2) + 1. If M is a power of two, for example, the total number of multipliers in 
(5.5) becomes 
1V + N log2 M + N + l = N(log 2 M + 2) + 1 (5.6) 
98 
5.3. Parallel Structures 
In this section, we derive two efficient parallel structures with a throughput 
of L for Nth order recursive filters. The first one is derived by using the stabilized 
algorithm and decomposition technique described in Chapter 4. The second one 
is derived by using the stabilized algorithm in combination with the original un-
modified algorithm. This technique is called incremental output computation and 
was originally introduced in [41] for block ( or parallel) implementation of recursive 
filters in state-variable form. 
5.3.1. Structure 1 
The structure can be divided into two parts, that is, the recursive part and 
the linear part. We first consider the recursive part. 
Recursive part : 
The algorithm for the recursive part is expressed as 
N 
Yn = L bjYn-jL + Un 
j=l 
Arranging the output Yn into L groups, we have 
for O < l < L -1 
N 
YLn+l = L biYL(n-j)+l + ULn+l 
j=l 
(5.7) 
(5.8) 
From the above equation it can be seen that to compute YLn+l we need YL(n-j)+l 
for j = l to N. However, they are just consecutive outputs in the same group. 
Therefore, the ~tructure for computing (5.7) or (5.8) may consist of L identical 
1-D sub-structures, which are independent of each other. Fig. 5.4 gives an exam-
ple with N = 2 and L = 4. 
99 
U4n+3 
Uin Y out 
b· I 
1 1 1 1 
1 Y out:= Yin 
Y 4n+1 Y4n+2 Y4n+3 
Fig. 5.4. The structure for the recursive part 
(N =. 2 and L = 4) 
Linear part : 
By using the decomposition techique, the linear part can be expressed as 
N K (lk-l)N 
H1 (z) = (L WjZ-j) IT (1 + L d)k) z-j n:,:t Zi) (5.9) 
j=O k=l j=l 
where L = fif 1 lk with lk a positive integer (or a prime number). A I(+ 1 
stage cascaded structure may be applied for computing this equation. However, 
to derive a parallel structure for each stage is not as simple as that in the two-level 
pipelined structure. Further effort is required. 
Consider one stage in (5.9) as follows 
or in the time domain 
U (q-1)N 
(z) = 1 + '°' d·z-iP 
V(z) ~ J 
J=l 
(q-l)N 
Un = Vn + L djVn-jp 
j=l 
where p and q are positive integers. 
100 
(5.10) 
(5.11) 
V4n+2 V 4n+1 V 4n+3 
u4n+3 
Fig. 5.5. The structure for computing (5.12) 
(N = _2, L = 4 and p = q = 2) 
We first arrange the output Un into p groups as 
for O < n2 < p and n1 = 0, 1, 2 · · · , 
(q-l)N 
Upn1+n2 = Vpn1+n2 + L djVpn1+n2-iP 
j=l 
(q-l)N 
= Vpn1+n2 + L djVp(n1-i)+n2 
j=l 
(5.12) 
From (5.12) we see that to compute Upni+n 2 we need those inputs only in group 
pn1 + n2. Thus p groups of the output can be computed independently. We 
can also see that each sub-equation in (5.12) is just a linear convolution problem 
so that they can be computed by using the systolic ring structure described in 
Chapter 3. Therefore, the system for computing (5.11) or (5.12) may consist of p 
identical ( q - 1 )N x L / p systolic ring structures. It is_ easy to determine that the 
number of multipliers required for computing (5.12) is 
p(q - l)NL/p = NL(q- l ) (5.13) 
An cxar11plc with N = 2, L = 4 and p = q = 2 is depicted in Fig. 5.5. 
101 
We have shown that NL multipliers are needed for computing the recursive 
part. Using (5.13), it can be shown that to implement the linear part requires 
N LC~[ 1 h - K + l) + L multipliers. Therefore, the total number of multipliers 
for this implementation is 
K K 
NL+ NL(L lk - K + l) + L = NL(L lk - K + 2) + L (5.14) 
k=l k=l 
5.3.2. Structure 2 
The second structure is derived from the incremental output computation. 
Consider the unmodified algorithm of an Nth order recursive filter 
where Vn = ~f O WjXn-j• 
N 
Yn = L rjYn-j + Vn 
j=l 
(5.15) 
We arrange the output into L groups and assume that (1) N < L and (2) the 
first N groups of the output are known. Then a regular and computable structure 
can easily be obtained. Fig. 5.6 gives an example with N = 4 and L = 8. 
It is known from the previous subsection that each group of the output can be 
computed independently by using the modified algorithm. Therefore our modified 
algorithm is applied for computing the first N groups of the output, which is 
expressed as follows 
for O < l < N - l 
N 
YLn+l = L bjYL(n-j)+l + ULn+l 
j=l 
(L-l)N 
ULn+l = VLn+l + L djVLn+l-j 
j=l 
102 
(5.16) 
Van+4 V an+S V an+6 Van+? 
Yan V · y in out 
r3 r3 r3 r3 
r · I 
Yan+1 
r2 r 2 r2 r2 V out Yin 
Yout:= Yin 
Yan+2 V out:= Vin +r i * Yin 
r 1 r 1 r 1 r 1 
Yan+3 
Yan+4 Yan+S Yan+6 Yan+? 
Fig. 5.6. The structure for computing Ysn+l for 4 < l < 7 
where Vn = ~f O WjXn-j· 
The structures for computing YLn+l for O < l < L - l and U£n+l for O < l < 
N - l are depicted in Figs. 5. 7 and 5.8, respectively. 
For computing Vn with a throughput rate of L we need (N + l)L multipliers. 
There are NL multipliers for computing L groups of the output YLn+l for l = 0 
to L - l. However, for the first N groups of the output we need to compute the 
intermediate result uLn+l for l = 0 to N -1, which requires (L- l)N 2 multipliers. 
Therefore, the total number of multipliers for this implementation is 
(N + l)L +NL+ (L - l)N2 = (2N + l)L + (L - l)N2 ( 5 .1 7) 
There are several disadvantages in this direct implementation of Nth order 
103 
1 
1 1 
r 1 
1 
Y4n+1 Y4n+2 Y4n+3 
Fig. 5. 7. The structure for computing Y4n+l for O < l < 3 
V 4n V 4n+2 V 4n+1 
u1 ,,1 ,1 , ,2 , '2 ~,2 
1 -- d1 - d2 -- d3 -- d4 .... ds -- ds ~ 
- - - - - - u 4n+1 
,,1 H , ' · U H H 
-- d1 -- d2 - d3 - d4 - ds - ds ~ 
-- -- - - - - U4n 
V out:= Vin 
V out 
Fig. 5 .8. 'l'he structure for computing U4n+l for O < l < 1 
recursive filters. Under the assumption we made previously, the incremental out-
put computation technique can only be applied in the case when N < L. Since we 
104 
must use two different algorithms for computing the two sets of L groups of the 
output, the structure is not as regular as the first one. From ( 5.17) we see that the 
number of multipliers is proportional to LN2 • By comparing (5.14) and (5.17), 
it can be verified that this implementation requires less multipliers than the first 
one only if the following inequality is satisfied: 
(5.18) 
where L = IJf 1 h. For example, if L = 8 then N < 4. 
This implementation can, however, be applied very efficiently for cascaded 
second-order recursive filters. The number of multipliers for one second-order 
stage is 
(2 x 2 + l)L + 22 (L - 1) = 9L - 4 
Then the total number of multipliers for N /2 stages becomes 
N 
-(9L - 4) = 4.5NL - 2N 
2 
This number is linearly proportional to the parallel size L. 
105 
(5.19) 
(5.20) 
5.4. Pipelined and Parallel Structures 
In the previous two subsections, we have constructed one two-level pipelined 
structure and two parallel structures for Nth order recursive filters. Now we com-
bine these two techniques together to construct efficient pipelined and parallel 
structures. To achieve the throughput of LM, we shall construct two systems by 
using a parallel size of L and a two-level pipeline with M stages. The first one can 
efficiently be applied in the case when L < N and the second one is used when 
the parallel size L is greater than the filter order N. 
5.4.1. Case 1: L < N 
As described in Subsection 5.3.2, the incremental output computation tech-
nique cannot be applied when L < N. To construct this structure, we then use 
the method in Subsection 5.3.1 for parallel processing. 
To achieve the throughput of LM, the algorithm ( derived in Section 4.3) can 
be expressed as 
( 5.21) 
Similar to the structures described in Section 5.2 and 5.3.1, this system also con-
sists of two parts, the recursive part and the linear part. These are as follows: 
106 
Recursive part : 
It is known that the algorithm in the time domain for the recursive part can 
be expressed as 
N 
Yn = L bjLMYn-jLM + Un 
j=l 
Arranging the output Yn into L groups, we have 
for O < l < L - 1, 
N 
YLn+l = L bjLMYLn+l-jLM + ULn+l 
j=l 
N 
= L bjLMYL(n-jM)+l + ULn+l 
j=l 
(5.22) 
(5.23) 
It can be seen that each group of outputs can be computed independently since 
YLn+l depends only upon YL(n-jM)+l for j = 1 to N. We can also see that each 
sub-equation in (5.23) is essentially identical to the equation in (5.1 ). Therefore, 
the structure for the recursive part may consist of L identical 1-D sub-structures, 
each of which is a two-level pipelined structure with M stages as depicted in 
Fig. 5.1. 
107 
Linear part : 
The algorithm for the linear part is expressed as 
N LM-1 
H1(z) = (L Wjz-i)det( L Bi z-i) (5.24) 
j=O j=O 
From Lemma 4.3, we can rewrite (5.24) as 
N L-1 M-1 
H1(z) = (L Wjz-i)det(L Biz-i)det( L BiLz-iL) (5.25) 
j=O j=O j=O 
In order to apply the decomposition technique, we assume that L and M can be 
factored, that is, L = IT[ 1 1 lk and Af = rrf l 1 ffik for lk and mk positive integers 
(not necessarily prime numbers). 
small polynomials in z-1 . Then the method for implementing this factor is ex-
actly the same as that described in Section 5.3.1. The term ~f O WjZ-j can be 
implemented by using an ( N + 1) x L systolic ring structure. Thus for implementing 
the linear part, we need only to consider the factor det(~f ; 1 BiL z-iL). 
Let BL = C and zL = .X. Then 
M-1 M-1 
det( L BjL z-jL) = det( L ci _x-i) ( 5.26) 
j=O j=O 
Since M = IJf 2 1 mk, by using the decomposition technique we have 
(5.27) 
j=O k=l j=l 
Replacing .X by zL, then 
(5.28) 
j=O k=l j=l 
108 
Consider stage k in the above equation, that is, 
(m1c-l )N 
U(z) = l + " c (.k) -jLm(k) 
V(z) ~ 1 z 
J=l 
(5.29) 
where m(k) = IT~ 11 mi, Transforming (5.29) into the time domain, we have 
(5.30) 
Arranging Un into L groups, we obtain 
for O < l < L - l, 
(m1c-l)N 
" (k) U£n+l = VLn+l + L....,; Cj VLn+l-jLm(k) 
j=l 
(m1c-l)N (5.31) 
= VLn+l + L ( k) Cj VL(n-jm(k))+l 
j=l 
It is easy to see that the L sub-equations in (5.31) are essentially the same as that 
in (5.4). Therefore, they can be implemented independently in the 1-D structure 
depicted in Fig. 5.2. 
Let LM = IT[ 1 l~ where K = Ki + K2, It is easy to determine that the 
number of multipliers required for computing the linear part is N L(Lf 1 l~ -K + 
1) + L. To implement the recursive part requires NL multipliers. Therefore, the 
total number of multipliers for this implementation is 
K K 
NL(L l~ - K + l) + L +NL= NL(L l~ - K + 2) + L (5.32) 
k=l k=l 
Using parallel implementation alone to achieve the same throughput requires 
N LM(Lf 1 l~ - K + 2) + LM multipliers. Thus the number of multipliers is 
reduced by a factor of M by using the pipelined and parallel implementation de-
scribed above. 
109 
5.4.2. Case 2: L > N 
When L > N, the incremental output computation technique can be applied 
to achieve a further reduction of complexity. 
As in the construction of the parallel structure in Section 5.3.2, the output 
Yn is arranged into L groups, that is, YLn+l for O < I < L - 1. For computing the 
first N groups of the output, we use our modified algorithm in (5.22) or (5.23). 
Then the associated structure may consist of N identical 1-D sub-structures and 
each sub-structure is just a two-level pipelined structure with M stages depicted 
in Fig. 5.1. By using the original unmodified algorithm in (5.15), the next L - N 
groups of output can be implemented in the structure in Fig. 5.6. Combining 
these two structures together, we obtain a structure for computing the L groups 
of the output YLn+l for O < l < L - 1. An example with N = 2 and L = 4 is 
depicted in Fig. 5.9. However, this structure is not a two-level pipelined structure 
with M stages. To solve this problem, we introduce a set of cuts to the system, as 
shown in Fig. 5.9. It can be seen that all the lines in the cuts are pointing in one 
direction. Therefore, the second-level pipeline with M stages in this structure is 
easily obtained by adding M delays to every line in the cuts. 
For the first N groups of the output, we need to compute 
U( ) LM-1 
_z_ = det( ~ Bj z-i) 
V(z) ~ 
;=O 
L-1 M-1 
(5.33) 
= det( L Bi z-i)det( L BiL z-jL) 
j=O j=O 
The first term det(:Z::J ; Bi z-i) can be implemented in the structure in Fig. 5.8. 
For implementing the second term, as described in the previous subsection a 
110 
,--
------- -
I 
---
I I 
I I 
b2 b2 I r2 I r2 I I 
I I 
I I 
I I 
I I 
I I 
I I 
I I 
I I 
b1 b1 I r 1 I r 1 I I 
I I 
I I 
I I 
I I 
I I 
I I 
I 
Fig. 5.9. The structure with M stages of second-level pipeline 
for computing Y-1n+l for O < Z < 3 
K 2 stage cascaded structure can be applied (when M = ITf(2 1 m1.:) and each stage 
consists of N independent 1-D sub-structures depicted in Fig. 5.2. 
It is easy to determine that to implement (5.33) we need N 2 (L - 1) + 
N 2 (~['l 1 mk - 1(2) multipliers. To implement I:f O WjZ-j requires (N + l)L 
multipliers, and NL multipliers are required for computing YLn+l for O < Z < 
L - 1. Therefore, the total number of multipliers in this implen1entation is 
N 2 (~!('l 1 mk + L - K 2 - 1) + L(2N + 1). For cascaded second-order recursive 
filters, this number is reduced to 4.5N L + 2N(~f'l 1 mk - I(z - 1). 
111 
5.5. Discussion 
This chapter introduced several efficient pipelined and/ or parallel structures 
for direct-form recursive filters by using the stabilized algorithms derived in t he 
previous chapter. 
We showed that those algorithms can lead directly to an efficient two-level 
pipelined structure with the complexity linearly proportional to I:f 1 mk - K 
when the size of two-level pipelining is M = IT[ 1 mk. Note that for simplicity 
we did not localize the feedback line in the recursive part of the structure. When 
this matter is taking into account, the size of two-level pipelining will be reduced 
to M - 1. In Chapter 2 we mentioned that applying cut-set localization rules 
to localize a recursive system derived from the original unmodified algorithm will 
decrease the throughput because time-scaling cannot be avoided. Our modified 
algorithm can be applied to maintain the original throughput. To achieve this 
goal, 3N + 1 multipliers are required. (This is easily verified by setting M = 2 in 
(5.6).) However, the system derived from the original unmodified algorithm only 
requires 2N + 1 multipliers. Thus the penalty is an increase in complexity. 
By using the method of parallel processing, we constructed two parallel ar-
chitectures. Although the first parallel implementation requires more multipliers 
than the second one, it has a regular form. Thus it is more easily implemented 
in VLSI. Moreover, the first architecture only uses our modified algorithm. It is 
known that this algorithm is more stable than the original unmodified one. There-
fore, in some applications the number of bits per word used in this implementation 
may be smaller than that used in the second one. 
We also constructed two efficient pipelined and parallel structures for direct-
112 
form recursive filters. With this combination of the methods for parallel processing 
and pipelining, the constructed architectures can achieve the same throughput 
with many fewer multipliers than pure parallel structures. This results in more-
efficient VLSI implementation of recursive filters. 
113 
CHAPTER 6 
CONCLUSIONS 
Special-purpose systolic arrays are a most promising VLSI architecture for 
modern signal processing and form a useful basis for high throughput applications. 
Demands for ever-higher speeds in signal processing are seemingly insatiable. This 
thesis demonstrates that substantial increases in throughput can be achieved by 
developing parallel computing structures which take advantage of the high degree 
of parallelism inherent in typical signal processing algorithms. These structures 
can at the same time benefit from the power of VLSI and fit within its contraints. 
Since the throughput of 1-D systolic arrays is limited by their word serial 
nature, in many high-throughput applications (two-level) pipelined and/or (2-D) 
parallel architectures have to be considered seriously. This thesis has investigated 
a number of pipelined and/ or parallel systolic architectures for high-speed signal 
processing, particularly for high-throughput digital filters. The main conclusions 
which can be drawn from this study are summarized below. 
The architecture transformation technique based on cut-set localization rules 
can convert all computable SFG networks into systolic arrays. The key problem 
in using this technique is to minimize the time-scaling factor. Using the commu-
tative rule described in Chapter 2, together with the previously known rules, one 
can avoid time-scaling and obtain better results from a class of feedforward SFG 
computing networks. 
A very important characteristic of non-recursive filters is that they can be 
designed to have exactly linear phase. 1-D systolic implementations of linear-phase 
114 
non-recursive filters can easily be obtained. Although the task of constructing 2-D 
systolic architectures is more difficult, we have demonstrated that th.is task can 
be simplified by synthesizing the 2-D structure from l-D computational modules. 
Using twice the area in VLSI implementation, our 2-D systolic architecture can 
achieve twice the throughput of l-D systolic arrays for a given problem. This 
results in no changes in the complexity measure AP. 
Linear convolution is a very important computational problem in modern 
signal processing. The simple and regular pattern of the convolution equation 
makes it well suited for VLSI implementation. Suppose that N 1 is the number of 
coefficients and L is a positive integer smaller than the number of inputs. Two 
N 1 x L systolic structures have been introduced. These structures are more effi-
cient than those described in the literature. They are based on nearest neighbour 
interconnections and can achieve L times the throughput of 1-D systolic arrays 
for the same problem. Moreover, the complexity measure AP is independent of L 
and the most efficient l-D systolic array for solving linear convolution problems 
is just a special case of a 2-D systolic ring structure with L = l. By varying L 
we can thus trade off area versus time for a given problem. Our 2-D systolic ring 
structures can also be applied to solve a number of other problems such as DFT, 
circular convolution and 2-D linear convolution because these problems can easily 
be transformed into 1-D linear convolution problems. 
Although look-ahead computation is a good method to derive parallel algo-
rithms for state-variable-form recursive filters, this conventional technique may 
cause numerical instability in direct-form recursive filters due to the effect of finite 
wordlength. Using the new method of Z domain derivation given in Chapter 4, 
115 
not only can parallel algorithms for direct-form recursive filters with guaranteed 
stability be derived, but the additional complexity required for this purpose can 
be minimized through a decomposition technique. 
A time domain derivation of parallel algorithms for direct-form recursive filters 
has also been introduced. The derived algorithms, which have the same form 
as those derived in the Z domain, are of a particular interest for time-varying 
recursive systems. The condition for unique solution of the stabilized parallel 
algorithms has been derived as one tool for determining the stability of the derived 
algorithm. 
Using the stabilized parallel algorithms for Nth order recursive filters, very 
efficient pipelined and/ or parallel architectures can be constructed. An efficient 
1-D systolic structure with two-level pipelining is directly obtained from those 
algorithms. Two different parallel systolic structures have also been derived based 
on those algorithms. The first one has the advantage of regularity while the second 
one can achieve a linear complexity in parallel size for cascaded second-order recur-
sive filters. Using 2-D parallel processing in combination with two-level pipelining, 
pipelined and parallel architectures can also be constructed. With the same degree 
of complexity, these architectures can achieve much greater throughput than pure 
parallel structures. 
116 
REFERENCES 
(1) Bandyopadhyay, S., Jullien, G. and Bayoumi, M., Systolic arrays over fi-
nite rings with applications to digital signal processing, in Systolic Arrays, 
Moore, W., McCabe, A. and Urquhart, R., (eds.) 
and Boston, 1986, pp. 123-132. 
Adam Hilger, Bristol 
[2) Barnes, C.W. and Shinnaka, S., Block shift invariance and block implemen-
tation of discrete-time filters, IEEE Trans. Circuits and Systems, Vol. 
CAS-27, Aug. 1980, pp. 667-672. 
[3] Barnes, C.W. and Shinnaka, S., Finite word effects in block-state realiza-
tions of fixed-point digital filters, IEEE Trans. Circuits and Systems, CAS-
27, May 1980, pp. 345-349. 
[4] Brent, R.P. and Luk, F. T ., A systolic array for the linear-time solution 
of Toeplitz systems of equations, Journal of VLSI and Computer Systems, 
Vol. 1, No. 1, 1983, pp. 1-22. 
[5) Brent, R.P. and Luk, F.T., The solution of singular-value and symmetric 
eigenvalue problems on multiprocessor arrays, SIAM Journal of Sci. Stat . 
Comput., Vol. 6, No. 1, 1985, pp. 69-84. 
[6] Burrus, C.S., Block implementation of digital filters, IEEE Trans . Circuit 
Theory, Vol. CT-18, Nov. 1971, pp. 697-701. 
(7) Burrus, C.S., Block realization of digital filters, IEEE Trans. Audio Elec-
troacoust., Vol. AU-20, Oct. 1972, pp. 230-235. 
(8) Cappello, P.R. and Steiglitz, K., Unifying VLSI array design with linear 
transformations of space-time, Advances in Computing Research, Vol. 2, 
117 
.. 
1984, pp. 23-65. 
(9] Chern, M.Y. and Murata, T., Efficient matrix multiplications on a concur-
rent data-loading array processor, Proc. 1983 Int. Conj. on Parallel Pro-
cessing, pp. 90-94. 
(10] Drake, B.L., Luk, F.T., Speiser, J.M. and Symanski, J.J., SLAPP: A sys-
tolic linear algebra parallel processor, IEEE Comput. Mag., Vol. 21, July 
1987, pp.45-49. 
(11] Ersoy, 0., Semisystolic array implementation of circular, skew circular, and 
linear convolutions, IEEE Trans. Computers, Vol. C-34, No. 2, Feb. 
1985, pp. 190-196. 
[12] Evans, R.A., Wood, D., Wood, K., McCanny, J.V., McWhirter, J.G. and 
McCabe, A.P.H., A CMOS implementation of a systolic multi-bit con-
volver chip, in VLSI '83, Anceau, F. and Aas, E.J ., (eds.) North-Holland, 
Aug. 1983, pp. 227-235. 
[13] Fu, K.S., Hwang, K. and Wah, B.W., VLSI architectures for pattern analysis 
and image database management, in VLSI and Modern Signal Processing, 
Kung, S.Y., Whitehouse, H.J. and Kailath, T., (eds.) Prentice-Hall, 1985, 
pp. 434-450. 
[14] Gentleman, W.H. and Kung, H.T., Matrix triangularization by systolic ar-
rays, Proc. SPIE, Vol. 298, 1981, pp. 19-26. 
[15] Gnanasekaran, R. and Mitra, S.K., A note on block implementation of IIR 
digital filters, Proc. IEEE, Vol. 65, July 1977, pp. 1063-1064. 
[16] Gold, B. and Jordan, K.L., A note on digital filter synthesis, Proc. IEEE, 
Vol. 56, Oct. 1968, pp. 1717-1718. 
118 
l. 
[17) Kung, H.T., Special-purpose devices for signal and image processing: an 
opportunity in VLSI, Tech. Report, Dept. of Comput. Science, Carnegie-
Mellon Univ., Pittsburgh, Pa., 1980. 
(18) Kung, H.T., Why systolic architectures? IEEE Comput. Mag., Vol. 15, 
No. 1, Jan. 1982, pp. 32-63. 
[19] Kung, H.T. and Lam, M.S., Wafer-scale integration and two-level pipelined 
implementations of systolic arrays, Journal of Parallel and DiJtributed Com-
puting, Vol. 1, 1984, pp. 32-63. 
(20] Kung, H.T. and Leiserson, C.E., Systolic arrays (for VLSI), Spar3e Matrix 
Proc. 1918, Academic Press, Orlando, Fla., 1979, pp. 256-282. 
(21] Kung, H.T., Ruane, L.M. and Yen, D.W.L., A two-level pipelined systolic 
array for convolutions, in VLSI Sy3tem3 and Computation3, Kung, H.T., 
Sproull, R.F. and Steele. Jr., G.L., (eds.) Computer Science Press, Oct. 
1981, pp. 273-278. 
(22] Kung, H.T. and Song, S.W., A systolic 2D convolution chip, Tech. Re-
port, Dept. of Comput. Science, Carnegie-Mellon Univ., Pittsburgh, 
Pa., 1981. 
(23] Kung, S.Y., From transversal filter to VLSI wavefront array, Proc. Int. 
Con/. on VLSI 1983, IFIP, North-Holland, 1983, pp.247-261. 
(24] Kung, S.Y., On supercomputing with systolic/wavefront array processors, 
Proc. IEEE, Vol. 72, No. 7, July 1984, pp. 867-884. 
(25] Kung, S.Y., VLSI array processors, in Systolic Arrays, Moore, W., Mc-
Cabe, A. and Urquhart, R., (eds.) 
1986, pp. 7-24. 
119 
Adam Hilger, Bristol and Boston, 
[26] Kung, S.Y., Lo, S.C. and Annevelink, J ., Temporal localization and sys-
tolization of signal flow graph (SFG) computing networks, Proc. SP IE, 
Real Time Signal Processing VII, SPIE, 1984, pp. 58-66. 
[27] Leiserson, C.E., Systolic and semisystolic design, Proc. IEEE Int. Conj. 
on Computer Design: VLSI in Computers, NY, Nov. 1983, pp. 627-
632. 
[28] Leiserson, C.E. and Saxe, J.B., Optimizing synchronous systems, Twenty-
Second Annual Symposium on Foundations of Computer Science IEEE, Oct. 
1981, pp. 23-26. 
[29] Li, G.J. and Wah, B.W., The design of optimal systolic arrays, IEEE 
Trans. on Comput., Vol. C-34, No. 1, Jan. 1985, pp. 66-77. 
[30] Lu, H.H., Lee, E.A. and Messerschmitt, D., Fast recursive filtering with 
multiple slow processing elements, IEEE Trans. Circuits and Systems, Vol. 
CAS-32, No. 11, Nov. 1985, pp. 1119-1129. 
[31] Mead, C. and Conway, L., Introduction to VLSI Systems, Addison-
Wesley, Reading, Mass., 1980. 
[32] Meyer, R.A. and Burrus, C.S., A unified analysis of multirate and periodi-
cally time-varying digital filters, IEEE Tran.s. Circuit.s and Systems, Vol. 
CAS-22, Mar. 1975, pp. 162-168. 
[33] Meyer, R.A. and Burrus, C.S., Design and implementation of multirate dig-
ital filters, IEEE Trans. Acou.st., Speech, Signal Process., Vol. ASSP-24, 
Feb. 1976, pp. 55-58. 
[34] Mitra, S.K. and Gnanasekaran, R., Block implementation of recursive digital 
filters - new structures and properties, IEEE Trans. Circuits and Systems, 
120 
Vol. CAS-25, April 1978, pp. 200-270. 
[35] Nussbaumer, H.J., Fast Fourier transform and convolution algorithms, 
Springer-Verlag, New York, 1982. 
[36] Nikias, C.L., Fast block data processing via new IIR digital filter structure, 
IEEE Trans. Acoust., Speech, Signal Process., vol. 32, No. 4, Aug. 1984. 
[37] Nudd, G.R. and Nash, J.G., Application of concurrent VLSI systems to 
two-dimensional signal processing, in VLSI and Modern Signal Processing, 
Kung, S.Y., Whitehouse, H.J. and Kailath, T ., (eds.) Prentice-Hall, 1985, 
pp. 307-325. 
[38] Oppenheim, A.V. and Schafer, R.W., Digital Signal Processing, Prentice-
Hall, Inc., Englewood Cliffs, N.J., 1975. 
[39] Parhi, K.K. and Messerschmitt, D.G., A bit-parallel bit level recursive filter 
architecture, Proc. of the IEEE Int. Conj. Computer Design, NY, 1986. 
[40] Parhi, K.K. and Messerschmitt, D.G., Block digital filtering via incremen-
tal block-state structure, Proc. of the IEEE International Symposium on 
Circuits and Systems, Philadelphia, May 1987. 
[41] Parhi, K.K. and Messerschmitt, D.G., Concurrent cellular VLSI adaptive fil-
ter architectures, IEEE Trans. Circuits and Systems, vol. 10, Oct. 1987, 
pp. 1141-1151. 
[42] Parhi, K. K. and Messerschmitt, D. G., Pipelined VLSI recursive filter ar-
chitectures using scattered look-ahead and decomposition, Proc. IEEE Int. 
Conj. Acoustics, Speech, Signal Process., NY, Apr. 1988. 
[43] Robert, Y. and Tchuente, M., An efficient systolic array for the lD recursive 
convolution problem, Journal of VLSI and Computer Systems, Vol. 1, No. 
121 
4, pp. 398-407. 
(44] Swartzlander, E., Systolic FFT processors, in Systolic Arrays, Moore, 
W ., McCabe, A. and Urquhart, R., (eds.) Adam Hilger, Bristol and Boston, 
1986, pp. 133-140. 
[45] Travassos, R.H., Real-time implementation of systolic Kalman filters, 
Proc. SPIE, Vol. 431, Aug. 1983. 
(46] Ullman, J .D ., Computational Aspects of VLSI, Computer Science Press, 
Rockville, Maryland 1984. 
[4 7] Voelcker, H.B. and Hartquist, E.E., Digital filtering via block recursion, 
IEEE Trans. Audio Electroacoust., Vol. AU-18, Oct. 1972, pp. 169-176. 
(48] Wambergue, C.A. and Roberts, R.A., Block processing structures for fixed 
point digital filters, Proc. of the IEEE Int. Conf. ASSP, Paris, May 1982, 
pp. 498-501. 
[49] Weiser, U. and Davis, A., A wavefront notation tool for VLSI array de-
sign, in Kung, H.T., Sproull, B. and Steele, G., (eds) VLSI Systems and 
Computations, Computer Science Press, Rockville MD, 1981. 
(50] Whitehouse, H.J., Speiser, J.M. and Bromley, K., Signal processing applica-
tions of concurrent array processor technology, in VLSI and Modern Signal 
Processing, Kung, S.Y., Whitehouse, H.J. and Kailath, T., (eds.) Prentice-
Hall, 1985, pp. 25-41. 
[51] Wu, C. W. and Cappello, R., Application-specific CAD of high through-
put IIR filters, Proc. Thirty-Second IEEE Computer Society International 
Conference, Feb. 1987, pp. 302-305. 
(52] Yen, D.W.L. and Kulkarni, A.V., Systolic processing and an implementation 
122 
for signal and image processing, IEEE Trans. Computer-', Vol. C-31, No. 
10, Oct. 1982, pp. 1000-1009. 
(53] Zeman, J. and Lindgren, A.G., Fast digital filters with low roundoff noise, 
IEEE Trans. Circuits and Systems, Vol. CAS-28, July 1981, pp. 716-
723. 
(54] Zhou, B. B. and Brent, R. P., An efficient architecture for solving the recur-
sive convolution equation with high throughput, Proc. 1st IASTED Inter-
national Symposium on Signal Processing and Its Applications, Aug. 1987, 
pp. 771-775. 
[55] Zhou, B. B. and Brent, R. P., A high throughput systolic implementation of 
the second order recursive filter, Proc. IEEE Int. Conf. Acoustics, Speech, 
Signal Process., Apr. 1988. 
(56] Zhou, B.B. and Brent, R.P., A stabilized parallel implementation of direct-
form recursive filters, Tech. Report TR-CS-88-07, Comput. Sci. Lab., 
Australian National Uni., 1988, submitted to Journal of Parallel and Dis-
tributed Computing. 
[57] Zhou, B.B. and Brent, R.P., A two-level pipelined implementation of direct-
form recursive filters, Tech. Report TR-CS-88-06, Comput. Sci. Lab., Aus-
tralian National Uni., 1988, submitted to IEEE Trans. Acoust., Speech, 
Signal Processing. 
123 
