The hardware implementation of block digital filters. by Slegel, Timothy John
Lehigh University
Lehigh Preserve
Theses and Dissertations
1-1-1982
The hardware implementation of block digital
filters.
Timothy John Slegel
Recommended Citation
Slegel, Timothy John, "The hardware implementation of block digital filters." (1982). Theses and Dissertations. Paper 1996.
THE HARDWARE IMPLEMENTATION OF
BLOCK DIGITAL FILTERS 
by 
Timothy John Slegel 
A Thesis 
Presented to the Graduate Committee 
of Lehigh University 
in Candidacy for the Degree of 
Master of Science 
in 
Electrical Engineering 
Lehigh University 
1982 
ProQuest Number: EP76269 
Published by ProQuest LLC (2015). Copyright of the Dissertation is held by the Author. 
All rights reserved. 
This work is protected against unauthorized copying under Title 17, United States Code 
Microform Edition © ProQuest LLC. 
CERTIFICATE OF APPROVAL
This thesis is accepted and approved in partial 
fulfillment of the requirements for the degree of Master 
of Science. 
(date) 
Professor in Charge
Chairman of Department
ACKNOWLEDGMENTS 
The author wishes to express his appreciation to Dr. 
Kalyan Mondal of Lehigh University and Bell Laboratories 
for introducing him to this fascinating area of digital 
signal processing. Also, Dr. Mondal's many suggestions
and comments were very helpful in performing this work. 
The author also wishes to thank the writers of the 
Scribe word processing program at Carnegie-Mellon 
University for making the typing and formatting of this 
thesis bearable, if not easy. 
Table of Contents 
ABSTRACT
1. INTRODUCTION TO BLOCK DIGITAL FILTERING
1.1 Description of Notation
1.2 A Review of Block Filtering
1.2.1 The basic block equation
1.2.2 Higher order block equations (L < M)
1.2.3 Block state filters
2. COMPUTATIONAL EFFICIENCY FOR VARIOUS BLOCK FILTERS
2.1 The Basic Block Filter
2.1.1 Analysis of matrices
2.1.2 Analysis of number of computations
2.2 Comparison to Scalar Filter
2.3 Analysis of Block State Structures
2.3.1 Block state structure I
2.3.2 Block State Structure II
2.4 Analysis of Scalar State Variable Filters
3. INTRODUCTION TO THE AP-120B ARRAY PROCESSOR
3.1 Description of Architecture
3.1.1 The pipelined units
3.1.2 Description of APAL assembly language
3.2 The Ax + b Computation
4. ANALYSIS OF BLOCK FILTERS IMPLEMENTED ON THE AP-120B
4.1 Basic Block Filter Implementation
4.2 Scalar Filter Implementation
4.3 Block State Filter Implementation
4.3.1 Block state structure I
4.3.2 Block state structure II
4.4 Scalar State Variable Implementation
4.5 Suggestions for Efficient Array Processor Programs
5. A VECTOR PROCESSING ARCHITECTURE FOR BLOCK FILTERS
5.1 General Description of Vector Processor
5.2 Row Processor Description
5.3 The Control Bus
5.4 Master Processor Description
5.5 Instruction Format
6. VECTOR PROCESSOR IMPLEMENTATION OF BLOCK FILTERS
6.1 Basic Block Filter Execution
6.2 Block State Filter Execution
6.3 Filter Execution for L > N
6.4 Summary of Results
REFERENCES
VITA
List of Figures 
Figure 1-1  Scalar Difference Equation in Matrix Form
Figure 1-2  Block Matrices for L = 7
Figure 1-3  Block Matrices for L = 4
Figure 2-1  Special Symmetry of Block Matrices
Figure 2-2  Example of Number of Non-zero Columns
Figure 2-3  Basic Block Structure
Figure 2-4  Block State Structure I
Figure 2-5  Block State Structure II
Figure 2-6  Numerical Example of Filter Efficiency
Figure 3-1  AP-120B Data Path Interconnection
Figure 3-2  Instructions to Perform Ax + b
Figure 5-1  Matrix times Vector Multiplication
Figure 5-2  Overview of the Vector Architecture
Figure 5-3  Diagram of Row Processor
Figure 5-4  Diagram of the Master Processor
Figure 5-5  Instruction Format in Master Processor
Figure 6-1  MP Program for Basic Block Filter
Figure 6-2  MP Program for Block State Filter I
Figure 6-3  Partitioned Ax + b Computation
Figure 6-4  MP program for L > N
Figure 6-5  Computational Efficiency for N RPs
Figure 6-6  Comparison of Filters and Architectures
ABSTRACT 
Block digital filtering is a relatively new signal
processing method in which the input samples are grouped
into vectors and the filtering operations are then carried
out on these vectors. If certain types of parallel
processing are used, greater throughput rates can be
achieved.
This thesis first compares the number of "raw"
computations required by a block digital filter with the
number required by a normal scalar digital filter. Here,
the block filter is shown to be far worse than a scalar
filter because of the matrix times vector multiplications
required by the block algorithm.
Next, both types of filters are examined to see how they
might be implemented on a pipelined array processor. In
this case, the block filter gains in throughput rate but
is still somewhat slower than a scalar filter, because
a scalar filtering algorithm can also be executed
efficiently on an array processor.
Finally, a vector processing architecture that is
optimized for executing a block filter is presented.
The execution time of a given block filter is shown to be
constant and independent of the order of the filter.
CHAPTER 1 
INTRODUCTION TO BLOCK DIGITAL FILTERING 
One of the principal aims of all research in the
area of digital signal filtering is to increase the
throughput rates for real-time signal processing
applications. Many applications, such as speech processing,
radar and sonar detection, and seismic exploration,
require higher throughput rates than are currently
available with present filter algorithms and processing
architectures.
This thesis will examine an algorithm called block
digital filtering, which first appeared in the literature
in the early 1970s. The block algorithm has a great deal
of inherent parallelism and hence is readily suited to
implementation on array and vector processors. The advent
of VLSI technology has also made it feasible to design and
build a complex vector processor at relatively low cost,
and such an architecture will be presented in this thesis.
Although they are very important, the effects of roundoff
noise and limit-cycle oscillations in these
implementations have not been studied here.
Several different formulations of block filtering
have been presented in the literature and were later
shown to be essentially equivalent. This chapter will
first describe the notation that will be used throughout
this thesis. Then the matrix implementation of block
filtering will be briefly discussed.
1.1 Description of Notation 
A scalar digital filter takes input samples {x(n)},
performs calculations, and produces output samples
{y(n)}. Scalar quantities, in this thesis, will be
indicated with lower case Roman letters with no
subscript. In performing the computations, scalar
coefficients a(i) and b(i) are used.
On the other hand, a block digital filter takes
blocks (or vectors) of L scalar inputs $\mathbf{x}_k$, performs
computations, and produces length-L output blocks $\mathbf{y}_k$.
Vector quantities (L x 1 matrices) will be shown as small
boldface letters and the coefficient matrices will be
indicated by upper case boldface letters. For example,
if the block length L = 4:
\[
\begin{aligned}
\mathbf{x}_0 &= [x(0),\, x(1),\, x(2),\, x(3)]^T \\
\mathbf{x}_1 &= [x(4),\, x(5),\, x(6),\, x(7)]^T \\
\mathbf{x}_k &= [x(4k),\, x(4k+1),\, x(4k+2),\, x(4k+3)]^T \\
\mathbf{y}_k &= [y(4k),\, y(4k+1),\, y(4k+2),\, y(4k+3)]^T
\end{aligned}
\tag{1.1}
\]
Since, in general, the filters are recursive, a 
block filter will use previous blocks of both input and 
output samples. The required vector delay units will be 
indicated in the filter diagrams as D. 
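The grouping of Eqn. (1.1) can be sketched in a few lines of Python. This is only an illustration: the function name `to_blocks` and the zero-padding of a short final block are assumptions of the sketch, not part of the thesis.

```python
def to_blocks(samples, L):
    """Group a sample sequence into consecutive blocks (vectors) of length L.

    Block k holds samples[k*L : (k+1)*L], mirroring Eqn. (1.1).
    Zero-padding a short final block is an assumption of this sketch.
    """
    if L < 1:
        raise ValueError("block length L must be at least 1")
    padded = list(samples)
    remainder = len(padded) % L
    if remainder:
        padded.extend([0] * (L - remainder))
    return [padded[i:i + L] for i in range(0, len(padded), L)]


x = list(range(10))          # stands in for x(0), x(1), ..., x(9)
blocks = to_blocks(x, 4)     # block length L = 4
print(blocks[0])             # block x_0
print(blocks[1])             # block x_1
```

Here the integer sample values simply label the sample index, so block k can be read off directly against Eqn. (1.1).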
1.2 A Review of Block Filtering 
The block implementation of infinite impulse
response (IIR) filters was first considered by Gold and
Jordan [1], who showed that these filters could be
realized by finite recursion. Later, Voelcker and
Hartquist [2], Read and Meek [3], and Meek and
Veletsos [4] presented papers on different forms of the
block implementation. However, none of these works
included any studies of the stability, sensitivity, or
techniques of hardware implementation of block filters.
Then, Burrus [5, 6] devised a matrix formulation
which makes it relatively easy to study the hardware
implementation of these filters. Finally, Gnanasekaran
and Mitra [7], and Gnanasekaran [8], slightly modified and
expanded Burrus's method into the form that will be used
in this thesis. They also conclusively proved the
stability property of these filters.
1.2.1 The basic block equation 
A linear time-invariant IIR digital filter can be
expressed by the scalar difference equation (1.2),
\[
y(n) = \sum_{i=0}^{M} a(i)\,x(n-i) \;-\; \sum_{i=1}^{M} b(i)\,y(n-i)
\tag{1.2}
\]
where M is the order of the filter. Note that the
simplifying assumption has been made that the number of
zeros and the number of poles of the filter are equal.
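A minimal sketch of Eqn. (1.2) in Python follows. It assumes zero initial conditions (all samples before n = 0 are zero) and b(0) = 1; the function name and the first-order example coefficients are hypothetical and serve only as an illustration.

```python
def scalar_filter(a, b, x):
    """Direct evaluation of the scalar difference equation (1.2).

    a[0..M] are the feed-forward coefficients and b[1..M] the feedback
    coefficients; b[0] is taken to be 1 and is not used.
    Samples before n = 0 are assumed to be zero.
    """
    M = len(a) - 1
    y = []
    for n in range(len(x)):
        acc = 0.0
        for i in range(M + 1):          # sum of a(i) x(n-i), i = 0..M
            if n - i >= 0:
                acc += a[i] * x[n - i]
        for i in range(1, M + 1):       # minus sum of b(i) y(n-i), i = 1..M
            if n - i >= 0:
                acc -= b[i] * y[n - i]
        y.append(acc)
    return y


# Hypothetical first-order example: y(n) = x(n) + 0.5 y(n-1), impulse input
a = [1.0, 0.0]
b = [1.0, -0.5]
print(scalar_filter(a, b, [1.0, 0.0, 0.0, 0.0]))
```

Each output sample costs M + 1 multiplications for the a(i) terms and M for the b(i) terms, which is the count used for the scalar filter later in the thesis.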
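This can be shown [8] to be equivalent to the matrix
formulation given in Fig. 1-1 for M = 7. These filter
matrices can now be partitioned into vectors of length L
and block matrices of L x L dimension. This partitioning
is shown in Fig. 1-2 for a block length L = 7. It is clear
that if the block length is greater than or equal to the
order of the filter ($L \ge M$), exactly two distinct
submatrices will be generated in each larger coefficient
matrix. By renaming the upper left "b(i)" matrix $\mathbf{B}_0$, the
submatrix directly below it $\mathbf{B}_1$, and similarly naming the
"a(i)" submatrices $\mathbf{A}_0$ and $\mathbf{A}_1$, it is possible to rewrite
the equation given in Fig. 1-2 as Eqn. (1.3)
\[
\begin{bmatrix}
\mathbf{B}_0 & \mathbf{0} & \mathbf{0} & \mathbf{0} & \cdots \\
\mathbf{B}_1 & \mathbf{B}_0 & \mathbf{0} & \mathbf{0} & \cdots \\
\mathbf{0} & \mathbf{B}_1 & \mathbf{B}_0 & \mathbf{0} & \cdots \\
\mathbf{0} & \mathbf{0} & \mathbf{B}_1 & \mathbf{B}_0 & \cdots \\
\vdots & \vdots & \vdots & \vdots & \ddots
\end{bmatrix}
\begin{bmatrix}
\mathbf{y}_0 \\ \mathbf{y}_1 \\ \mathbf{y}_2 \\ \mathbf{y}_3 \\ \vdots
\end{bmatrix}
=
\begin{bmatrix}
\mathbf{A}_0 & \mathbf{0} & \mathbf{0} & \mathbf{0} & \cdots \\
\mathbf{A}_1 & \mathbf{A}_0 & \mathbf{0} & \mathbf{0} & \cdots \\
\mathbf{0} & \mathbf{A}_1 & \mathbf{A}_0 & \mathbf{0} & \cdots \\
\mathbf{0} & \mathbf{0} & \mathbf{A}_1 & \mathbf{A}_0 & \cdots \\
\vdots & \vdots & \vdots & \vdots & \ddots
\end{bmatrix}
\begin{bmatrix}
\mathbf{x}_0 \\ \mathbf{x}_1 \\ \mathbf{x}_2 \\ \mathbf{x}_3 \\ \vdots
\end{bmatrix}
\tag{1.3}
\]
where $\mathbf{x}_k$ and $\mathbf{y}_k$ are the input-output blocks given by:
\[
\begin{aligned}
\mathbf{x}_k &= [x(7k),\, x(7k+1),\, \ldots,\, x(7k+6)]^T \\
\mathbf{y}_k &= [y(7k),\, y(7k+1),\, \ldots,\, y(7k+6)]^T
\end{aligned}
\tag{1.4}
\]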
Footnote 1: The assumption of equal numbers of zeros and
poles does not have to be made, and essentially all the
results obtained in this thesis would be the same without
it. However, the mathematics is simplified with this
assumption and little is gained by not making it.
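\[
\begin{bmatrix}
b(0) & 0 & 0 & 0 & 0 & 0 & 0 & 0 & \cdots \\
b(1) & b(0) & 0 & 0 & 0 & 0 & 0 & 0 & \cdots \\
b(2) & b(1) & b(0) & 0 & 0 & 0 & 0 & 0 & \cdots \\
b(3) & b(2) & b(1) & b(0) & 0 & 0 & 0 & 0 & \cdots \\
b(4) & b(3) & b(2) & b(1) & b(0) & 0 & 0 & 0 & \cdots \\
b(5) & b(4) & b(3) & b(2) & b(1) & b(0) & 0 & 0 & \cdots \\
b(6) & b(5) & b(4) & b(3) & b(2) & b(1) & b(0) & 0 & \cdots \\
b(7) & b(6) & b(5) & b(4) & b(3) & b(2) & b(1) & b(0) & \cdots \\
0 & b(7) & b(6) & b(5) & b(4) & b(3) & b(2) & b(1) & \cdots \\
0 & 0 & b(7) & b(6) & b(5) & b(4) & b(3) & b(2) & \cdots \\
0 & 0 & 0 & b(7) & b(6) & b(5) & b(4) & b(3) & \cdots \\
0 & 0 & 0 & 0 & b(7) & b(6) & b(5) & b(4) & \cdots \\
0 & 0 & 0 & 0 & 0 & b(7) & b(6) & b(5) & \cdots \\
0 & 0 & 0 & 0 & 0 & 0 & b(7) & b(6) & \cdots \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots
\end{bmatrix}
\begin{bmatrix}
y(0) \\ y(1) \\ y(2) \\ y(3) \\ y(4) \\ y(5) \\ y(6) \\ y(7) \\ y(8) \\ y(9) \\ y(10) \\ y(11) \\ y(12) \\ y(13) \\ \vdots
\end{bmatrix}
=
\]
\[
\begin{bmatrix}
a(0) & 0 & 0 & 0 & 0 & 0 & 0 & 0 & \cdots \\
a(1) & a(0) & 0 & 0 & 0 & 0 & 0 & 0 & \cdots \\
a(2) & a(1) & a(0) & 0 & 0 & 0 & 0 & 0 & \cdots \\
a(3) & a(2) & a(1) & a(0) & 0 & 0 & 0 & 0 & \cdots \\
a(4) & a(3) & a(2) & a(1) & a(0) & 0 & 0 & 0 & \cdots \\
a(5) & a(4) & a(3) & a(2) & a(1) & a(0) & 0 & 0 & \cdots \\
a(6) & a(5) & a(4) & a(3) & a(2) & a(1) & a(0) & 0 & \cdots \\
a(7) & a(6) & a(5) & a(4) & a(3) & a(2) & a(1) & a(0) & \cdots \\
0 & a(7) & a(6) & a(5) & a(4) & a(3) & a(2) & a(1) & \cdots \\
0 & 0 & a(7) & a(6) & a(5) & a(4) & a(3) & a(2) & \cdots \\
0 & 0 & 0 & a(7) & a(6) & a(5) & a(4) & a(3) & \cdots \\
0 & 0 & 0 & 0 & a(7) & a(6) & a(5) & a(4) & \cdots \\
0 & 0 & 0 & 0 & 0 & a(7) & a(6) & a(5) & \cdots \\
0 & 0 & 0 & 0 & 0 & 0 & a(7) & a(6) & \cdots \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots
\end{bmatrix}
\begin{bmatrix}
x(0) \\ x(1) \\ x(2) \\ x(3) \\ x(4) \\ x(5) \\ x(6) \\ x(7) \\ x(8) \\ x(9) \\ x(10) \\ x(11) \\ x(12) \\ x(13) \\ \vdots
\end{bmatrix}
\]
Figure 1-1: Scalar Difference Equation in Matrix Form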
The matrix equation of Fig. 1-1 partitioned into 7 x 7
submatrices and vectors of length 7:
\[
\left[\begin{array}{c|c|c}
\mathbf{B}_0 & \mathbf{0} & \cdots \\ \hline
\mathbf{B}_1 & \mathbf{B}_0 & \cdots \\ \hline
\vdots & \vdots & \ddots
\end{array}\right]
\left[\begin{array}{c}
y(0) \\ \vdots \\ y(6) \\ \hline y(7) \\ \vdots \\ y(13) \\ \hline \vdots
\end{array}\right]
=
\left[\begin{array}{c|c|c}
\mathbf{A}_0 & \mathbf{0} & \cdots \\ \hline
\mathbf{A}_1 & \mathbf{A}_0 & \cdots \\ \hline
\vdots & \vdots & \ddots
\end{array}\right]
\left[\begin{array}{c}
x(0) \\ \vdots \\ x(6) \\ \hline x(7) \\ \vdots \\ x(13) \\ \hline \vdots
\end{array}\right]
\]
where the upper left "b(i)" submatrix and the submatrix
directly below it are
\[
\mathbf{B}_0 =
\begin{bmatrix}
b(0) & 0 & 0 & 0 & 0 & 0 & 0 \\
b(1) & b(0) & 0 & 0 & 0 & 0 & 0 \\
b(2) & b(1) & b(0) & 0 & 0 & 0 & 0 \\
b(3) & b(2) & b(1) & b(0) & 0 & 0 & 0 \\
b(4) & b(3) & b(2) & b(1) & b(0) & 0 & 0 \\
b(5) & b(4) & b(3) & b(2) & b(1) & b(0) & 0 \\
b(6) & b(5) & b(4) & b(3) & b(2) & b(1) & b(0)
\end{bmatrix},
\quad
\mathbf{B}_1 =
\begin{bmatrix}
b(7) & b(6) & b(5) & b(4) & b(3) & b(2) & b(1) \\
0 & b(7) & b(6) & b(5) & b(4) & b(3) & b(2) \\
0 & 0 & b(7) & b(6) & b(5) & b(4) & b(3) \\
0 & 0 & 0 & b(7) & b(6) & b(5) & b(4) \\
0 & 0 & 0 & 0 & b(7) & b(6) & b(5) \\
0 & 0 & 0 & 0 & 0 & b(7) & b(6) \\
0 & 0 & 0 & 0 & 0 & 0 & b(7)
\end{bmatrix}
\]
and the "a(i)" submatrices have the same form with a(i) in
place of b(i).
Figure 1-2: Block Matrices for L = 7
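Equation (1.3) can be written in the algebraic form:
\[
\mathbf{B}_0\mathbf{y}_k + \mathbf{B}_1\mathbf{y}_{k-1} = \mathbf{A}_0\mathbf{x}_k + \mathbf{A}_1\mathbf{x}_{k-1}
\tag{1.5}
\]
Pre-multiplying by $\mathbf{B}_0^{-1}$ and rearranging, the basic block
equation (1.6) is obtained.
\[
\mathbf{y}_k = -\mathbf{B}_0^{-1}\mathbf{B}_1\mathbf{y}_{k-1} + \mathbf{B}_0^{-1}\mathbf{A}_0\mathbf{x}_k + \mathbf{B}_0^{-1}\mathbf{A}_1\mathbf{x}_{k-1}
\tag{1.6}
\]
The case where L < M results in more than four block
matrices and is discussed in the next subsection.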
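The block recursion for L >= M can be checked numerically against the scalar difference equation. The following Python sketch is an illustration under stated assumptions: the coefficients are hypothetical, b(0) = 1, initial conditions are zero, and the input length is a multiple of L. Instead of forming the inverse of B_0 explicitly, the sketch solves Eqn. (1.5) for y_k by forward substitution, which gives the same y_k as Eqn. (1.6) because B_0 is lower triangular.

```python
def toeplitz_block(c, L, shift):
    """L x L block whose (r, col) entry is c[shift + r - col], or 0 if out of range."""
    M = len(c) - 1
    return [[c[shift + r - col] if 0 <= shift + r - col <= M else 0.0
             for col in range(L)] for r in range(L)]


def matvec(A, v):
    """Plain matrix times vector product."""
    return [sum(A[r][c] * v[c] for c in range(len(v))) for r in range(len(A))]


def solve_lower(B0, rhs):
    """Forward substitution: solve B0 y = rhs for lower-triangular B0."""
    y = []
    for r in range(len(rhs)):
        s = rhs[r] - sum(B0[r][c] * y[c] for c in range(r))
        y.append(s / B0[r][r])
    return y


def block_filter(a, b, x, L):
    """Block recursion of Eqns. (1.5)/(1.6) for L >= M, zero initial blocks."""
    M = len(a) - 1
    assert L >= M and len(x) % L == 0
    A0, A1 = toeplitz_block(a, L, 0), toeplitz_block(a, L, L)
    B0, B1 = toeplitz_block(b, L, 0), toeplitz_block(b, L, L)
    x_prev, y_prev, y = [0.0] * L, [0.0] * L, []
    for k in range(len(x) // L):
        xk = x[k * L:(k + 1) * L]
        # Right-hand side of (1.5) moved to one side: A0 x_k + A1 x_{k-1} - B1 y_{k-1}
        rhs = [p + q - r for p, q, r in
               zip(matvec(A0, xk), matvec(A1, x_prev), matvec(B1, y_prev))]
        yk = solve_lower(B0, rhs)
        y.extend(yk)
        x_prev, y_prev = xk, yk
    return y


def scalar_filter(a, b, x):
    """Reference: direct evaluation of Eqn. (1.2) with zero initial conditions."""
    M, y = len(a) - 1, []
    for n in range(len(x)):
        acc = sum(a[i] * x[n - i] for i in range(M + 1) if n - i >= 0)
        acc -= sum(b[i] * y[n - i] for i in range(1, M + 1) if n - i >= 0)
        y.append(acc)
    return y


a = [0.5, 0.25, 0.125]       # hypothetical coefficients, M = 2
b = [1.0, -0.5, 0.25]
x = [1.0, 2.0, -1.0, 0.5, 0.0, 3.0, 1.0, -2.0]
yb = block_filter(a, b, x, L=4)
ys = scalar_filter(a, b, x)
print(max(abs(u - v) for u, v in zip(yb, ys)))   # largest block/scalar discrepancy
```

In a hardware implementation the products of the inverse of B_0 with the other matrices would be computed once in advance, as the thesis does in Chapter 2; forward substitution is used here only to keep the sketch short.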
A block filter has several features which make it
more desirable than the associated scalar filter. It can
be easily implemented on pipeline and vector processors,
giving greater computational efficiency; this is the
major emphasis of the later chapters of this thesis. A
block filter is also more stable than its scalar
counterpart, which leads to better sensitivity and error
characteristics. Many other important properties are
discussed in [8].
1.2.2 Higher order block equations (L < M) 
The basic block equation (1.6) derived in the
previous subsection applies for $L \ge M$, but things become
somewhat more complicated when this condition is not
satisfied. Referring back to the example filter (M = 7)
presented in Fig. 1-1, if the block length L is now made
equal to 4, the situation of Fig. 1-3 arises. There are
now a total of 3 distinct submatrices for each
coefficient matrix instead of 2 as before. It can be
shown [8] that there will be $(\lceil M/L \rceil + 1)$ submatrices for
each coefficient matrix (see footnote 2).
The matrix equation of Fig. 1-1 partitioned into 4 x 4
submatrices and vectors of length 4:
\[
\left[\begin{array}{c|c|c|c}
\mathbf{B}_0 & \mathbf{0} & \mathbf{0} & \cdots \\ \hline
\mathbf{B}_1 & \mathbf{B}_0 & \mathbf{0} & \cdots \\ \hline
\mathbf{B}_2 & \mathbf{B}_1 & \mathbf{B}_0 & \cdots \\ \hline
\mathbf{0} & \mathbf{B}_2 & \mathbf{B}_1 & \cdots \\ \hline
\vdots & \vdots & \vdots & \ddots
\end{array}\right]
\begin{bmatrix}
y(0) \\ y(1) \\ \vdots \\ y(13) \\ \vdots
\end{bmatrix}
=
\left[\begin{array}{c|c|c|c}
\mathbf{A}_0 & \mathbf{0} & \mathbf{0} & \cdots \\ \hline
\mathbf{A}_1 & \mathbf{A}_0 & \mathbf{0} & \cdots \\ \hline
\mathbf{A}_2 & \mathbf{A}_1 & \mathbf{A}_0 & \cdots \\ \hline
\mathbf{0} & \mathbf{A}_2 & \mathbf{A}_1 & \cdots \\ \hline
\vdots & \vdots & \vdots & \ddots
\end{array}\right]
\begin{bmatrix}
x(0) \\ x(1) \\ \vdots \\ x(13) \\ \vdots
\end{bmatrix}
\]
where the three distinct "b(i)" submatrices are
\[
\mathbf{B}_0 =
\begin{bmatrix}
b(0) & 0 & 0 & 0 \\
b(1) & b(0) & 0 & 0 \\
b(2) & b(1) & b(0) & 0 \\
b(3) & b(2) & b(1) & b(0)
\end{bmatrix},
\quad
\mathbf{B}_1 =
\begin{bmatrix}
b(4) & b(3) & b(2) & b(1) \\
b(5) & b(4) & b(3) & b(2) \\
b(6) & b(5) & b(4) & b(3) \\
b(7) & b(6) & b(5) & b(4)
\end{bmatrix},
\quad
\mathbf{B}_2 =
\begin{bmatrix}
0 & b(7) & b(6) & b(5) \\
0 & 0 & b(7) & b(6) \\
0 & 0 & 0 & b(7) \\
0 & 0 & 0 & 0
\end{bmatrix}
\]
and the "a(i)" submatrices have the same form with a(i) in
place of b(i).
Figure 1-3: Block Matrices for L = 4
The basic block equation (1.6) can now be rewritten
in the general form:
\[
\begin{aligned}
\mathbf{y}_k = {} & \mathbf{B}_0^{-1}\mathbf{A}_0\mathbf{x}_k + \mathbf{B}_0^{-1}\mathbf{A}_1\mathbf{x}_{k-1} + \cdots + \mathbf{B}_0^{-1}\mathbf{A}_J\mathbf{x}_{k-J} \\
& - \mathbf{B}_0^{-1}\mathbf{B}_1\mathbf{y}_{k-1} - \cdots - \mathbf{B}_0^{-1}\mathbf{B}_J\mathbf{y}_{k-J}
\end{aligned}
\tag{1.7}
\]
where J = (number of submatrices - 1). Clearly, the block
case reduces to the scalar difference equation for a
block length L = 1.
Footnote 2: Throughout this thesis the notation $\lceil t \rceil$ will be
used to indicate the smallest integer greater than or equal
to t. Note that Eqn. (1.7) is also valid for $L \ge M$.
1.2.3 Block state filters
The implementation of several block state filters
will be discussed rather extensively in later chapters, so
the basic ideas of these filters will be presented in
this section. Block state filters were first described
by Gnanasekaran [8].
The matrix derived by putting the infinite number of
convolution coefficients, h(i), into a matrix similar to
Fig. 1-1 and "blocking" it results in submatrices $\mathbf{H}_i$. It
can be shown [8] that the first two of these submatrices
are as given in Eqn. (1.8).
\[
\begin{aligned}
\mathbf{H}_0 &= \mathbf{B}_0^{-1}\mathbf{A}_0 \\
\mathbf{H}_1 &= \mathbf{B}_0^{-1}\mathbf{A}_1 - \mathbf{B}_0^{-1}\mathbf{B}_1\mathbf{H}_0
\end{aligned}
\tag{1.8}
\]
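By introducing a new vector $\mathbf{v}_k$, a new block structure can
be derived from Eqn. (1.6) which is given in Eqn. (1.9).
\[
\begin{aligned}
\mathbf{y}_k &= \mathbf{v}_k + \mathbf{H}_0\mathbf{x}_k \\
\mathbf{v}_{k+1} &= -\mathbf{B}_0^{-1}\mathbf{B}_1\mathbf{v}_k + \mathbf{H}_1\mathbf{x}_k
\end{aligned}
\tag{1.9}
\]
This will be called Block State Structure I.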
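As a consistency check (a sketch added here, not a derivation taken from [8]), Block State Structure I can be reduced back to the basic block equation. It assumes $\mathbf{H}_0 = \mathbf{B}_0^{-1}\mathbf{A}_0$ and $\mathbf{H}_1 = \mathbf{B}_0^{-1}\mathbf{A}_1 - \mathbf{B}_0^{-1}\mathbf{B}_1\mathbf{H}_0$, and the state form $\mathbf{y}_k = \mathbf{v}_k + \mathbf{H}_0\mathbf{x}_k$, $\mathbf{v}_{k+1} = -\mathbf{B}_0^{-1}\mathbf{B}_1\mathbf{v}_k + \mathbf{H}_1\mathbf{x}_k$. Substituting $\mathbf{v}_{k-1} = \mathbf{y}_{k-1} - \mathbf{H}_0\mathbf{x}_{k-1}$ into the state update gives:

```latex
\begin{aligned}
\mathbf{y}_k &= \mathbf{v}_k + \mathbf{H}_0\mathbf{x}_k \\
             &= -\mathbf{B}_0^{-1}\mathbf{B}_1\mathbf{v}_{k-1} + \mathbf{H}_1\mathbf{x}_{k-1} + \mathbf{H}_0\mathbf{x}_k \\
             &= -\mathbf{B}_0^{-1}\mathbf{B}_1\left(\mathbf{y}_{k-1} - \mathbf{H}_0\mathbf{x}_{k-1}\right)
                + \mathbf{H}_1\mathbf{x}_{k-1} + \mathbf{H}_0\mathbf{x}_k \\
             &= -\mathbf{B}_0^{-1}\mathbf{B}_1\mathbf{y}_{k-1}
                + \left(\mathbf{H}_1 + \mathbf{B}_0^{-1}\mathbf{B}_1\mathbf{H}_0\right)\mathbf{x}_{k-1}
                + \mathbf{H}_0\mathbf{x}_k \\
             &= -\mathbf{B}_0^{-1}\mathbf{B}_1\mathbf{y}_{k-1}
                + \mathbf{B}_0^{-1}\mathbf{A}_1\mathbf{x}_{k-1}
                + \mathbf{B}_0^{-1}\mathbf{A}_0\mathbf{x}_k
\end{aligned}
```

The last line is the basic block equation (1.6), so under these assumptions the state structure and the basic block filter produce the same output blocks.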
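Also, by defining another vector $\mathbf{s}_k$, it is possible
to derive [8] what will be called Block State Structure
II given in Eqn. (1.10).
\[
\begin{aligned}
\mathbf{y}_k &= \mathbf{H}_1\mathbf{s}_{k-1} + \mathbf{H}_0\mathbf{x}_k \\
\mathbf{s}_k &= -\mathbf{B}_0^{-1}\mathbf{B}_1\mathbf{s}_{k-1} + \mathbf{x}_k
\end{aligned}
\tag{1.10}
\]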
Now that all of the basics of the several block 
structures have been stated, the next chapter will 
investigate the various properties of the block matrices 
and determine the number of "raw" computations for 
implementing the structures. Later chapters will use
these results to discuss different hardware
implementations of the various structures.
CHAPTER 2 
COMPUTATIONAL EFFICIENCY FOR VARIOUS BLOCK FILTERS 
The only discussion of the number of computations
for block filters in the literature is given by
Burrus [6]. That presentation is not very clear, however,
and leads to misleading conclusions. Burrus counts
computations by implementing the basic block equation
(1.6) with the FFT, but he fails to mention that the
FFT is a somewhat more complex calculation than a simple
matrix times vector multiplication. Hence, the actual
execution time of an FFT-based block filter on a
processor will often be longer than that of a block
filter that does not use the FFT.
This will be discussed further in later chapters.
This chapter will discuss and compare the number of
raw computations for basic block filters, scalar filters,
block state filters, and scalar state variable filters.
An analysis of the various block matrices will also be
presented.
2.1 The Basic Block Filter 
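This section will deal with the general basic block
filter given in Eqn. (1.7) and repeated here as Eqn.
(2.1).
\[
\begin{aligned}
\mathbf{y}_k = {} & \mathbf{B}_0^{-1}\mathbf{A}_0\mathbf{x}_k + \mathbf{B}_0^{-1}\mathbf{A}_1\mathbf{x}_{k-1} + \cdots + \mathbf{B}_0^{-1}\mathbf{A}_J\mathbf{x}_{k-J} \\
& - \mathbf{B}_0^{-1}\mathbf{B}_1\mathbf{y}_{k-1} - \cdots - \mathbf{B}_0^{-1}\mathbf{B}_J\mathbf{y}_{k-J}
\end{aligned}
\tag{2.1}
\]
where $J = \lceil M/L \rceil$.
The following discussion will hold for the general case
of arbitrary L and M.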
2.1.1 Analysis of matrices 
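The first thing to note in Eqn. (2.1) is that each
product term contains the product of $\mathbf{B}_0^{-1}$ and another
matrix. Since both matrices are constructed of constant
coefficients, these matrices can be pre-multiplied before
the actual filter computation begins. Therefore Eqn.
(2.1) will be rewritten as Eqn. (2.2).
\[
\begin{aligned}
\mathbf{y}_k = {} & \mathbf{P}_{A0}\mathbf{x}_k + \mathbf{P}_{A1}\mathbf{x}_{k-1} + \cdots + \mathbf{P}_{AJ}\mathbf{x}_{k-J} \\
& + \mathbf{P}_{B1}\mathbf{y}_{k-1} + \cdots + \mathbf{P}_{BJ}\mathbf{y}_{k-J}
\end{aligned}
\tag{2.2}
\]
where $\mathbf{P}_{Ai} = \mathbf{B}_0^{-1}\mathbf{A}_i$, $\mathbf{P}_{Bi} = -\mathbf{B}_0^{-1}\mathbf{B}_i$, $i = 0, 1, \ldots, J$,
and $J = \lceil M/L \rceil$.
The "P" matrices will hereafter be called the product
matrices.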
Note that since $\mathbf{B}_0$ is lower triangular (see Fig.
1-3), $\mathbf{B}_0^{-1}$ is also lower triangular and has the same
special symmetry that $\mathbf{B}_0$ does. This symmetry is
illustrated in Fig. 2-1 for a block length L = 6. The matrix
$\mathbf{P}_{A0}$ is also lower triangular with the same symmetry
(given in Fig. 2-1) because $\mathbf{A}_0$ also has this form. It
might be possible to decrease memory requirements in a
processor by taking advantage of this symmetry, but
probably at the expense of increased execution time due
to index manipulation.
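\[
\begin{bmatrix}
a & 0 & 0 & 0 & 0 & 0 \\
b & a & 0 & 0 & 0 & 0 \\
c & b & a & 0 & 0 & 0 \\
d & c & b & a & 0 & 0 \\
e & d & c & b & a & 0 \\
f & e & d & c & b & a
\end{bmatrix}
\]
Figure 2-1: Special Symmetry of Block Matrices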
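The claim that the inverse of a lower triangular matrix with this symmetry keeps the same symmetry can be checked on a small example. The Python sketch below is an illustration only: the first-column values are hypothetical, exact rational arithmetic is used to avoid rounding questions, and the helper names are not from the thesis.

```python
from fractions import Fraction


def lower_toeplitz(first_col):
    """Lower-triangular matrix with the symmetry of Fig. 2-1, built from its first column."""
    L = len(first_col)
    return [[first_col[r - c] if r >= c else Fraction(0) for c in range(L)]
            for r in range(L)]


def invert_lower(T):
    """Invert a lower-triangular matrix column by column using forward substitution."""
    L = len(T)
    inv = [[Fraction(0)] * L for _ in range(L)]
    for j in range(L):
        for r in range(L):
            rhs = Fraction(1 if r == j else 0)
            s = rhs - sum(T[r][c] * inv[c][j] for c in range(r))
            inv[r][j] = s / T[r][r]
    return inv


def is_lower_toeplitz(T):
    """True if T is lower triangular and constant along every diagonal."""
    L = len(T)
    return all(T[r][c] == (T[r - c][0] if r >= c else 0)
               for r in range(L) for c in range(L))


# Hypothetical first column b(0), b(1), b(2), b(3) of a 4 x 4 block
B0 = lower_toeplitz([Fraction(1), Fraction(-1, 2), Fraction(1, 4), Fraction(0)])
B0_inv = invert_lower(B0)
print(is_lower_toeplitz(B0_inv))
print([str(row[0]) for row in B0_inv])   # first column determines the whole inverse
```

Because the whole matrix is determined by its first column, only L values need to be stored instead of L(L + 1)/2, which is the memory saving mentioned above; the cost is the extra index arithmetic needed to locate each entry.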
Most of the other matrices, $\mathbf{P}_{A1}, \ldots, \mathbf{P}_{A(J-1)}$ and
$\mathbf{P}_{B1}, \ldots, \mathbf{P}_{B(J-1)}$, are full matrices with no symmetry
present.
The remaining two matrices, $\mathbf{P}_{AJ}$ and $\mathbf{P}_{BJ}$, also have
special properties. For these matrices the first
[L - (M mod' L)] columns will be equal to 0, where:
\[
M \bmod' L =
\begin{cases}
M \bmod L & \text{for } M \ne iL \\
L & \text{for } M = iL
\end{cases}
\qquad i = 1, 2, 3, \ldots
\]
This is because $\mathbf{A}_J$ and $\mathbf{B}_J$ also have [L - (M mod' L)]
columns equal to zero, so the product matrices do
as well. Figure 2-2 shows an example of the number of
non-zero columns for a 6th order filter. This property
becomes very useful in reducing the execution time of
processors using a block algorithm since it is not
necessary to actually multiply by zero. A diagram of the
block structure is given in Fig. 2-3. Note that for the
very important special case $L \ge M$, Eqn. (2.3) holds.
\[
\mathbf{y}_k = \mathbf{P}_{B1}\mathbf{y}_{k-1} + \mathbf{P}_{A0}\mathbf{x}_k + \mathbf{P}_{A1}\mathbf{x}_{k-1}
\tag{2.3}
\]
 L    # of zero    # of non-zero    # of P
      columns      columns          matrices
 8       2             6               2
 7       1             6               2
 6       0             6               2
 5       4             1               3
 4       2             2               3
 3       0             3               3
 2       0             2               4
 1       0             1               6
Figure 2-2: Example of Number of Non-zero Columns for M = 6
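The zero and non-zero column counts of Fig. 2-2 follow directly from the definition of M mod' L. A short Python sketch (the function names are chosen here for illustration) reproduces those two columns for M = 6:

```python
def mod_prime(M, L):
    """M mod' L as defined in the text: M mod L, except L when L divides M."""
    r = M % L
    return r if r != 0 else L


def zero_columns(M, L):
    """Number of leading all-zero columns of P_AJ and P_BJ: L - (M mod' L)."""
    return L - mod_prime(M, L)


M = 6
for L in (8, 7, 6, 5, 4, 3, 2, 1):
    # L, number of zero columns, number of non-zero columns
    print(L, zero_columns(M, L), mod_prime(M, L))
```

The number of non-zero columns is simply M mod' L, so the cost of the last two product terms depends on M mod' L rather than on L itself.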
2.1.2 Analysis of number of computations 
The number of computations for a basic block filter
is derived in this section. These calculations are
simply raw multiplications and raw additions and do not
make use of any of the parallel execution schemes which
will be shown in Chapters 4, 5, and 6. Also, the final
results will be in terms of the number of computations
per output sample point. This is because block filters
give L results simultaneously, so the total computations
must be divided by L to make a fair comparison with scalar
filters.
Figure 2-3: Basic Block Structure
First, consider the $\mathbf{P}_{A0}\mathbf{x}_k$ computation. Since $\mathbf{P}_{A0}$ is
a lower triangular L x L matrix, a total of L(L + 1)/2
multiplications and L(L - 1)/2 additions will be required
(see for example [9]).
Next, the calculations dealing with $\mathbf{P}_{A1}\mathbf{x}_{k-1}, \ldots,
\mathbf{P}_{A(J-1)}\mathbf{x}_{k-J+1}$ and $\mathbf{P}_{B1}\mathbf{y}_{k-1}, \ldots, \mathbf{P}_{B(J-1)}\mathbf{y}_{k-J+1}$ will
be considered. Since each product matrix is full, as was
shown in the previous subsection, $L^2$ multiplications and
L(L - 1) additions are required for each product term.
Therefore, a total of $2(J - 1)L^2$ multiplications and
2(J - 1)L(L - 1) additions are required for these
calculations.
Finally, consider the $\mathbf{P}_{AJ}\mathbf{x}_{k-J}$ and $\mathbf{P}_{BJ}\mathbf{y}_{k-J}$
computations. Because there are only M mod' L non-zero
columns, a total of 2L(M mod' L) multiplications and
2L[(M mod' L) - 1] additions are required. There are
also a total of 2LJ additions required to add together
all the above intermediate vectors.
Equation (2.4) gives the total number of
computations required.
\[
\begin{aligned}
\text{Mults.} &= L(L+1)/2 + 2(J-1)L^2 + 2L(M \bmod' L) \\
\text{Adds.} &= L(L-1)/2 + 2(J-1)L(L-1) + 2L[(M \bmod' L) - 1] + 2LJ
\end{aligned}
\tag{2.4}
\]
Finally, the number of computations per output sample
point is given in Eqn. (2.5).
\[
\begin{aligned}
\text{Mults./point} &= (L+1)/2 + 2L(J-1) + 2(M \bmod' L) \\
\text{Adds./point} &= (L-1)/2 + 2(L-1)(J-1) + 2[(M \bmod' L) - 1] + 2J
\end{aligned}
\tag{2.5}
\]
For the important special case, $L \ge M$, the number of
computations reduces to:
\[
\begin{aligned}
\text{Mults./point} &= (L+1)/2 + 2M \\
\text{Adds./point} &= (L-1)/2 + 2(M-1) + 2
\end{aligned}
\tag{2.6}
\]
2.2 Comparison to Scalar Filter 
A scalar digital filter can, of course, be
implemented with the basic difference equation (1.2).
The number of computations to generate an output sample
point is given in Eqn. (2.7).
\[
\begin{aligned}
\text{Mults./point} &= 2M + 1 \\
\text{Adds./point} &= 2M
\end{aligned}
\tag{2.7}
\]
By comparing Eqn. (2.7) with Eqn. (2.5) it can be
seen that the number of raw computations is somewhat
worse for a block filter than for the associated
scalar filter. Furthermore, the block filter gets worse
as the block length increases. For a block length L = 1,
the block filter reduces to the scalar filter, as is
expected. A numerical example of the number of
computations for various filters will be given in Fig.
2-6 at the end of this chapter.
It must be restated that these results do not make
use of the inherent property of block filters: the
ability to perform computations in parallel. This will
be discussed in great detail in later chapters.
2.3 Analysis of Block State Structures 
Since much of the analysis of the number of
computations for block state filters is similar to that
derived for the basic block equation, this discussion
will be relatively brief with only the main results
presented. The following analysis holds only for the
normal case, $L \ge M$.
2.3.1 Block state structure I 
This filter is implemented by Eqn. (1.9), which was
derived in Chapter 1. The matrix $\mathbf{H}_0$ is equivalent to $\mathbf{P}_{A0}$,
as was shown in Eqn. (1.8), and is lower triangular. $\mathbf{H}_1$
is a full matrix in general, and $-\mathbf{B}_0^{-1}\mathbf{B}_1 = \mathbf{P}_{B1}$ by Eqn. (2.2),
with L - M columns equal to 0. Therefore, Eqn. (1.9) can
be rewritten as Eqn. (2.8)
\[
\begin{aligned}
\mathbf{y}_k &= \mathbf{v}_k + \mathbf{P}_{A0}\mathbf{x}_k \\
\mathbf{v}_{k+1} &= \mathbf{P}_{B1}\mathbf{v}_k + \mathbf{H}_1\mathbf{x}_k
\end{aligned}
\tag{2.8}
\]
with the structure given in Fig. 2-4.
Figure 2-4: Block State Structure I
The total number of computations for a $\mathbf{y}_k$ output
block is given in Eqn. (2.9).
\[
\begin{aligned}
\text{Mults.} &= L(L+1)/2 + L^2 + ML \\
\text{Adds.} &= L(L-1)/2 + L(L-1) + L(M-1) + 2L
\end{aligned}
\tag{2.9}
\]
The first term in the number of multiplications is for
$\mathbf{P}_{A0}\mathbf{x}_k$ and the second term is for $\mathbf{H}_1\mathbf{x}_k$. The third term is
for $\mathbf{P}_{B1}\mathbf{v}_k$, and the last term for additions is to sum the
intermediate vectors. The number of computations per
output sample point is given in Eqn. (2.10).
\[
\begin{aligned}
\text{Mults./point} &= (L+1)/2 + L + M = 1.5L + M + 0.5 \\
\text{Adds./point} &= (L-1)/2 + (L-1) + (M-1) + 2 = 1.5L + M - 0.5
\end{aligned}
\tag{2.10}
\]
2.3.2 Block State Structure II 
Equation (1.10) gives the definition for this block
structure. In this case, the term $-\mathbf{B}_0^{-1}\mathbf{B}_1$ can be
pre-multiplied to yield the matrix $\mathbf{P}_{B1}$. Therefore, the
modified structure is given in Eqn. (2.11)
\[
\begin{aligned}
\mathbf{s}_k &= \mathbf{P}_{B1}\mathbf{s}_{k-1} + \mathbf{x}_k \\
\mathbf{y}_k &= \mathbf{H}_1\mathbf{s}_{k-1} + \mathbf{P}_{A0}\mathbf{x}_k
\end{aligned}
\tag{2.11}
\]
and a diagram of it is given in Fig. 2-5.
Figure 2-5: Block State Structure II
The total number of computations for a $\mathbf{y}_k$ output
block is given as:
\[
\begin{aligned}
\text{Mults.} &= L(L+1)/2 + L^2 + L^2 \\
\text{Adds.} &= L(L-1)/2 + L(L-1) + L(L-1) + 2L
\end{aligned}
\tag{2.12}
\]
and the number of computations per output sample is
given by Eqn. (2.13).
\[
\begin{aligned}
\text{Mults./point} &= (L+1)/2 + 2L = 2.5L + 0.5 \\
\text{Adds./point} &= (L-1)/2 + 2(L-1) + 2 = 2.5L - 0.5
\end{aligned}
\tag{2.13}
\]
These results, Eqn. (2.10) and Eqn. (2.13), show
approximately the same computational efficiency and are
of the same order as the basic block filter (Eqn. (2.6)).
However, as will be shown in the next subsection, they
are far more efficient than a scalar state variable
filter. Also, note that block structure II is totally
independent of the order of the filter (for $L \ge M$).
2.4 Analysis of Scalar State Variable Filters 
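A simple scalar state variable filter is given by:
\[
\begin{aligned}
\mathbf{v}_{n+1} &= \mathbf{A}\mathbf{v}_n + \mathbf{b}\,x(n) \\
y(n) &= \mathbf{c}\mathbf{v}_n + d\,x(n)
\end{aligned}
\tag{2.14}
\]
where $\mathbf{A}$ is an M x M matrix, $\mathbf{b}$, $\mathbf{c}^T$, and $\mathbf{v}_n$ are length-M
vectors, and d is a scalar. The total number of
computations per output sample point is given by:
\[
\begin{aligned}
\text{Mults./point} &= M^2 + M + M + 1 = (M+1)^2 \\
\text{Adds./point} &= M(M-1) + (M-1) + M + 1 = M^2 + M
\end{aligned}
\tag{2.15}
\]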
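The state variable recursion can be sketched in Python as follows. The zero initial state and the first-order example values are assumptions made for illustration; the function name is hypothetical.

```python
def state_variable_filter(A, b, c, d, x):
    """Iterate v(n+1) = A v(n) + b x(n), y(n) = c v(n) + d x(n) from a zero state."""
    M = len(b)
    v = [0.0] * M
    y = []
    for xn in x:
        # Output equation: M multiplications for c v(n) plus one for d x(n)
        y.append(sum(c[i] * v[i] for i in range(M)) + d * xn)
        # State update: M*M multiplications for A v(n) plus M for b x(n)
        v = [sum(A[r][i] * v[i] for i in range(M)) + b[r] * xn for r in range(M)]
    return y


# Hypothetical first-order example:
#   v(n+1) = 0.5 v(n) + x(n),  y(n) = 0.5 v(n) + x(n)
print(state_variable_filter([[0.5]], [1.0], [0.5], 1.0, [1.0, 0.0, 0.0, 0.0]))
```

The comments mark where the M^2 + M + M + 1 multiplications per output sample arise, which is why this structure is the most expensive of those compared when no parallelism is available.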
In summary, in terms of raw computations, the most
efficient filter (with no parallel processing available)
is the basic scalar difference equation. All three of
the block structures are roughly equivalent, but their
efficiency decreases as the block length becomes large.
The least efficient filter is the scalar state variable
filter. A numerical example is presented in Fig. 2-6.
Since the number of multiplications is approximately
equal to the number of additions for all filters, only
multiplications will be considered in this example.
The next chapter will give a review of a popular
commercially available array processor so that these
filters may be examined when implemented on a pipelined
architecture.
Footnote 3: This will be reversed if parallel processing
is used, as will be shown in Chapter 6.
Number of Multiplications per Output Point
                        Basic      Scalar      Block      Block
              Scalar    block      state       state      state
  M     L     filter    filter     variable    I          II
 16    50       33       57.5       289         91.5      125.5
 16    32       33       48.5       289         64.5       80.5
 16    25       33       45.0       289         54.0       63.0
 16    16       33       40.5       289         40.5       40.5
 16     8       33       36.5       289          *          *
 16     1       33       33.0       289          *          *
 25    50       51       75.5       676        100.5      125.5
 25    32       51       66.5       676         73.5       80.5
 25    25       51       63.0       676         63.0       63.0
 25    16       51       58.5       676          *          *
 25     8       51       54.5       676          *          *
 25     1       51       51.0       676          *          *
* indicates formula not derived
Figure 2-6: Numerical Example of Filter Efficiency
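The entries of Fig. 2-6 can be regenerated from the multiplication counts derived in this chapter. The Python sketch below is an illustration; the function names are chosen here, and the block state formulas are applied only for L >= M, where they were derived.

```python
def mod_prime(M, L):
    """M mod' L: M mod L, except L when L divides M."""
    r = M % L
    return r if r else L


def scalar_mults(M):
    """Scalar difference equation: 2M + 1 multiplications per point."""
    return 2 * M + 1


def basic_block_mults(M, L):
    """Basic block filter: (L+1)/2 + 2L(J-1) + 2(M mod' L), with J = ceil(M/L)."""
    J = -(-M // L)                      # integer ceiling of M / L
    return (L + 1) / 2 + 2 * L * (J - 1) + 2 * mod_prime(M, L)


def scalar_state_mults(M):
    """Scalar state variable filter: (M + 1)^2 multiplications per point."""
    return (M + 1) ** 2


def block_state_1_mults(M, L):
    """Block State Structure I, valid for L >= M: 1.5L + M + 0.5."""
    return 1.5 * L + M + 0.5


def block_state_2_mults(M, L):
    """Block State Structure II, valid for L >= M: 2.5L + 0.5."""
    return 2.5 * L + 0.5


for M in (16, 25):
    for L in (50, 32, 25, 16, 8, 1):
        row = [M, L, scalar_mults(M), basic_block_mults(M, L), scalar_state_mults(M)]
        if L >= M:
            row += [block_state_1_mults(M, L), block_state_2_mults(M, L)]
        else:
            row += ["*", "*"]           # formula not derived for L < M
        print(*row)
```

Each printed row lists M, L, and the five counts in the same column order as the figure.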
CHAPTER 3 
INTRODUCTION TO THE AP-120B ARRAY PROCESSOR 
The AP-120B, manufactured by Floating Point Systems,
Inc., is an extremely popular commercial array processor
and has been interfaced with several different
minicomputers and large-scale computer systems. It is
typically used for such applications as radar signal
analysis, digital image processing, nuclear simulation,
and other real-time applications. The AP-120B has an
architecture that is typical of many attached array
processors, and this is why it was selected for analysis
in this research.
This chapter will rather briefly discuss the more
important architectural considerations necessary for the
understanding of the next chapter. For a much more
thorough presentation, the reader can refer to
references [10] through [14]. Finally, the computation of
Ax + b, where A is a matrix and x and b are vectors, will
be discussed in great detail; this will form the basis of
Chapter 4.
3.1 Description of Architecture 
The basic features of the AP-120B are a pipelined
adder, a pipelined multiplier, and a partial crossbar
switching arrangement to connect these arithmetic units
to four independent data memory units. A single assembly
language (called APAL) instruction can control a maximum
of two data computations, two memory accesses, four data
register accesses, an address computation, and a
conditional branch, for a total of ten distinct
operations [11]. Figure 3-1 shows the data paths along with
the size of the memory units. Note that the X and Y
registers can be read and written to simultaneously (a
total of four operations). The interface to the host
processor and the formatting of data transfers, although
fascinating, are not critical to an understanding of this
research and hence will not be discussed here.
The array processor is constructed of TTL logic
gates and is operated at a basic clock cycle of 167 ns.
This corresponds to a maximum theoretical computational
rate of 12 MFLOPS or 30 MIPS. In practice, this rate can
be approached with good programming and long
vector lengths.
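
As a rough check on the quoted floating-point rate, the sketch below assumes that the peak is reached when the adder and the multiplier each deliver one result per 167 ns cycle. That interpretation is an assumption made here for illustration. The 30 MIPS figure is not reproduced because the text does not say how it is counted.

```python
# Rough check of the peak floating-point rate.
# Assumption (not from the manufacturer): one adder result and one
# multiplier result are completed on every clock cycle.
CYCLE_NS = 167           # basic clock cycle quoted in the text
FLOPS_PER_CYCLE = 2      # one add + one multiply per cycle (assumed)


def peak_mflops(cycle_ns: float = CYCLE_NS,
                flops_per_cycle: int = FLOPS_PER_CYCLE) -> float:
    """Return the peak rate in millions of floating-point operations per second."""
    cycles_per_second = 1e9 / cycle_ns
    return flops_per_cycle * cycles_per_second / 1e6


if __name__ == "__main__":
    print(f"peak rate is about {peak_mflops():.1f} MFLOPS")
```

Under this assumption the result rounds to the 12 MFLOPS quoted above.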
Memory Unit                   Number of words   Word size (bits)
X data registers              32                38
Y data registers              32                38
Main memory (interleaved)     to 512K           38 + 3 parity
Auxiliary memory              to 64K            38
Program memory (not shown)    to 4K             64
Index registers               16                16
Sub. return registers         16                16

Figure 3-1: AP-120B Data Path Interconnection
(Figure 3-1 is taken from [11], p. 22.)

3.1.1 The pipelined units
The data computation units consist of a two-stage
(two clock cycles long) pipelined adder and a three-stage
pipelined multiplier. Also, the auxiliary memory and
main memory have pipelined read/write addressing. The
various stages of these units are as follows:
- Adder 
1. Align and add mantissas 
2. Normalize and round result 
- Multiplier 
1. Begin mantissa product and add exponents 
2. Complete mantissa product 
3. Normalize and round result 
- Main Memory 
1. Process memory address 
2. Begin read of memory location 
3. Complete read of memory location 
- Auxiliary Memory 
1. Process memory address 
2. Read memory location 
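
The stage lists above imply a fixed latency for each unit. The following toy model is only a sketch: the class and function names are hypothetical, and floating-point alignment, normalization, and rounding are not modeled. It shows that operands issued on one cycle emerge a fixed number of cycles later, while a new operand pair can still be issued on every cycle.

```python
from collections import deque
import operator


class Pipeline:
    """Toy fixed-latency pipeline (illustrative; not the AP-120B hardware)."""

    def __init__(self, latency, op):
        self.op = op
        # One slot per stage; None marks an empty stage (a bubble).
        self.slots = deque([None] * latency)

    def clock(self, operands=None):
        """Advance one cycle; return the result leaving the pipe, if any."""
        finished = self.slots.popleft()
        self.slots.append(None if operands is None else self.op(*operands))
        return finished


def run(pipe, issue):
    """Issue one operand pair per cycle, then flush; return (cycle, result) pairs."""
    out = []
    cycle = 0
    for operands in issue:
        cycle += 1
        result = pipe.clock(operands)
        if result is not None:
            out.append((cycle, result))
    # Flush: keep clocking with no new operands until the pipe is empty.
    while any(slot is not None for slot in pipe.slots):
        cycle += 1
        result = pipe.clock()
        if result is not None:
            out.append((cycle, result))
    return out


if __name__ == "__main__":
    multiplier = Pipeline(3, operator.mul)   # three-stage multiplier
    print(run(multiplier, [(1, 2), (3, 4), (5, 6)]))
```

In this model a product issued on cycle 1 is available on cycle 4, which matches the 3-cycle multiplier latency used in the cycle counts later in this chapter.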
3.1.2 Description of APAL assembly language 
The machine instructions for the AP-120B consist of
64-bit words which can control up to ten concurrent
operations. The use of the APAL assembly language
simplifies program writing but is still rather
complicated. This section will present a slightly simplified
version of APAL that could be rewritten into true APAL
and be executed in the same amount of time as discussed.
APAL instructions will be divided into five fields:
(1) present clock cycle (not required in true APAL), (2)
memory operations, (3) multiplication operations, (4)
addition operations, and (5) conditional branches and
comments. Each line represents one APAL instruction and is
executed in a single clock cycle. Variable names with
subscripts in parentheses are used to indicate array
elements.
The main types of instructions (for the various 
fields) are described below: 
- FETCHM: fetch from main memory. Takes 3 clock 
cycles to get data. 
- FETCHA: fetch from auxiliary memory. Takes 2 
clock cycles. 
- SAVEM and SAVEA: analogous to FETCH above but 
used for storing to memories. 
- FETCHX, FETCHY, SAVEX, and SAVEY: read/write 
to data registers. All four may occur 
simultaneously. 
- ADD op1,op2: puts 2 data values in adder
pipeline. Takes 2 cycles to get data result.
- MUL op1,op2: puts 2 data values in multiplier
pipeline. Takes 3 cycles to get complete
result.
- CLEAR: set variable to zero for initializing 
pipeline. 
- WAIT: no operation. Used in emptying 
pipelines. 
Although it is slightly more complicated than is
presented, the assembler (for the most part) keeps track
of which data registers and memory locations to read or
write to. For example, SAVEX TEMP would store TEMP in 1
of 32 X registers depending on the context in which this
instruction appears in the overall program. Therefore,
it will be assumed that the assembler will correctly
control the storage locations in the various memories.
In general, it is quite difficult to write an
efficient array processor program. Through the use of
instruction overlapping and folding [11] it is usually
possible to greatly decrease the execution time. More
will be said about how to get efficient programs for
block filtering in the next chapter.
3.2 The Ax + b Computation
The most common type of computation in block
filtering is a matrix multiplied by a vector, with this
product added to another vector. This section will
discuss in detail the Ax + b computation on the AP-120B
array processor, where A is a Q × Q matrix and x and b are
vectors of length Q. It is assumed that all required values
are stored in the array processor memory: there is a DMA
controller that will store these values from the host
computer automatically.
Obviously, there are two methods for performing the 
computation:  (1) perform Ax, save the temporary vector, 
then add it to b and (2) take a row of A and multiply by 
x, then immediately add the appropriate element from b 
(continue this for all other rows).  Both methods will be 
considered to find the most efficient one.   In either 
case,  decisions must be made on where to store the 
various data and coefficient values.   Clearly, if two 
different values from the same memory unit are required 
by an arithmetic pipeline simultaneously, a great 
decrease in efficiency will result since two simultaneous 
fetches can not be made. Therefore, the various types of 
data should be stored in different memory units. It will 
be assumed that the A matrix is stored in auxiliary 
memory, the x vector in main memory, and the b vector in 
the Y registers. 
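
The two orderings described above produce the same vector and differ only in when the elements of b are added. The following plain Python sketch (function names are hypothetical, and nothing here models the AP-120B memories or pipelines) makes the difference in ordering explicit.

```python
def ax_plus_b_method1(A, x, b):
    """Method 1: form the whole temporary vector Ax, then add b in one pass."""
    temp = [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]
    return [t_i + b_i for t_i, b_i in zip(temp, b)]


def ax_plus_b_method2(A, x, b):
    """Method 2: add b(i) as soon as the sum of products for row i is complete."""
    y = []
    for row, b_i in zip(A, b):
        acc = 0.0
        for a_ij, x_j in zip(row, x):
            acc += a_ij * x_j
        y.append(acc + b_i)   # scalar add in the middle of the computation
    return y


if __name__ == "__main__":
    A = [[1.0, 2.0], [3.0, 4.0]]
    x = [1.0, 1.0]
    b = [10.0, 20.0]
    print(ax_plus_b_method1(A, x, b))
    print(ax_plus_b_method2(A, x, b))
```

The per-row scalar addition in the second function is the step that turns out to be costly on a pipelined machine.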
Figure 3-2 shows the most efficient [11] code for
performing Ax + b using method 1. Clock cycles 1 through
3 are needed because of memory pipeline addressing and
cycles 4 through 6 load the multiplier pipe. The inner
loop at cycles 7 to Q forms the sum of products of the
row times the vector elements with a new result being
produced (and fed back to the adder pipe) every clock
cycle. Steps Q+1 through Q+6 are used to empty the
pipelines and Q+7 through Q+9 are used to combine the two
partial sums still in the addition pipeline. Then Q+10
stores the element in an X register for temporary
storage. The above operations are repeated Q times so
that Ax is formed. The final steps, Q(Q+10)+1 through
Q(Q+10)+Q+2, add Ax to b. This method requires:

$$Q(Q+10) + Q + 2 = Q^2 + 11Q + 2 \text{ clock cycles} \tag{3.1}$$

Storing Ax in the X registers limits the vector size Q to
32, but a less efficient algorithm can be used for Q > 32.
Clock cycle    Memory operations                          Mult. operations       Addition operations       Branches and comments

(Comment) The first Q+10 instructions are repeated Q times so each element of Ax can be computed.

1              FETCHM X(1)
2              FETCHM X(2); FETCHA A(1)
3              FETCHM X(3); FETCHA A(2)
4              FETCHM X(4); FETCHA A(3)                   MUL X(1),A(1)                                    Fill pipeline
5              FETCHM X(5); FETCHA A(4)                   MUL X(2),A(2)          CLEAR TEMP(-1)
6              FETCHM X(6); FETCHA A(5)                   MUL X(3),A(3)          CLEAR TEMP(0)
For J := 7 to Q
               FETCHM X(J); FETCHA A(I,J-1)               MUL X(J-3),A(I,J-3)    ADD A*X(J-6),TEMP(J-8)    Increment inner loop counter (J). Branch when J = Q.
Q+1            FETCHA A(I,Q)                              MUL X(Q-2),A(I,Q-2)    ADD A*X(Q-5),TEMP(Q-7)    Empty pipeline
Q+2                                                       MUL X(Q-1),A(I,Q-1)    ADD A*X(Q-4),TEMP(Q-6)
Q+3                                                       MUL X(Q),A(I,Q)        ADD A*X(Q-3),TEMP(Q-5)
Q+4                                                                              ADD A*X(Q-2),TEMP(Q-4)
Q+5                                                                              ADD A*X(Q-1),TEMP(Q-3)
Q+6                                                                              ADD A*X(Q),TEMP(Q-2)
Q+7            SAVEX TEMP(Q-1)
Q+8            FETCHX TEMP(Q-1)                                                  ADD TEMP(Q),TEMP(Q-1)     Combine partial sums
Q+9            WAIT
Q+10           SAVEX TEMP(I)                                                                               Increment row loop counter (I). Branch when I = Q.

(Comment) The last part of code adds Ax to b.

For K := Q(Q+10)+1 to Q(Q+10)+Q
               FETCHX TEMP(K); FETCHY b(K); SAVEY TEMP(K-3)                      ADD TEMP(K),b(K-1)        Add intermediate vectors.
Q(Q+10)+Q+1    SAVEY TEMP(Q-1)                                                                             Empty pipe
Q(Q+10)+Q+2    SAVEY TEMP(Q)

Figure 3-2: Instructions to Perform Ax + b
Method 2 immediately adds an element of b to the
element of Ax as soon as it is computed. At first
glance, this method appears to be better but this is not
the case. Method 2 uses Fig. 3-2 except that the
instruction at Q+10 is replaced by:

Q+10   FETCHY b(I)       ADD TEMP(I),b(I)
Q+11   WAIT
Q+12   SAVEY TEMP(I)

and Q(Q+10)+1 through Q(Q+10)+Q+2 are deleted.
Therefore, this method takes a total of:

$$Q(Q+12) = Q^2 + 12Q \text{ clock cycles} \tag{3.2}$$

which is worse than method 1 for Q > 2, the normal
case. Method 2 is not as good as method 1 because it
effectively performs a scalar addition in the middle of
the computations. Array processors are not particularly
good at scalar arithmetic because of the start-up and
flush times for the pipelines.
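
The crossover between the two methods can be checked directly from Eqns. (3.1) and (3.2). The short sketch below only evaluates those two formulas. The function names are hypothetical.

```python
def cycles_method1(q: int) -> int:
    """Eqn. (3.1): form Ax first, then add b in a final pipelined pass."""
    return q * (q + 10) + q + 2


def cycles_method2(q: int) -> int:
    """Eqn. (3.2): add b(i) immediately after each row is finished."""
    return q * (q + 12)


if __name__ == "__main__":
    print(" Q  method 1  method 2")
    for q in (2, 8, 16, 32):
        print(f"{q:2d}  {cycles_method1(q):8d}  {cycles_method2(q):8d}")
```

The two counts are equal at Q = 2, and the gap then grows by one cycle for each additional row, since method 2 spends one more cycle per row than method 1 spends per element in its final pass.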
This chapter presented the basic architecture of the
AP-120B and discussed the Ax + b calculation in detail.
The next chapter will apply these results to obtain
execution times of the various block structures on this
array processor.

Note: The clock cycle counts given above could be
multiplied by 167 ns to get actual execution times, but
this is not necessary since the results are relative.
CHAPTER 4 
ANALYSIS OF BLOCK FILTERS IMPLEMENTED ON THE AP-120B
Since an array processor is good at implementing 
algorithms with inherent parallelism, this chapter will 
deal with how the AP-120B could be used to make use of 
the parallelism in block filters. However, as will be 
shown, simple scalar filters can also be implemented 
efficiently on the AP-120B so a great deal will not be 
gained by using an array processor for block filtering. 
This chapter will discuss five major areas. First,
a basic block filter implemented on the AP-120B will be
analyzed. Second, this will be compared with a scalar
filter implementation. Third, block state structures will
be analyzed and, fourth, compared to a scalar state
variable filter. Finally, some suggestions will be made on
how to get efficient array processor programs when
implementing block filters on the AP-120B.
4.1 Basic Block Filter Implementation 
For all block filters analyzed in this chapter it
will be assumed that $L \ge M$ (the normal case) and
$6 \le L \le 32$. The lower bound is necessary because
for L < 6 the algorithm becomes more complicated because
of the effects of the start-up and flush times of the
pipelines, and it is unlikely that a block filter would be
used for a simple (L < 6) filter anyway. The upper
bound, $L \le 32$, is required because of the limited
size of the X and Y registers, although a more complicated
(and less efficient) program could be written for L > 32.
It is first necessary to make some decision about 
how the various memory units will be organized. An ef- 
ficient method is to use the auxiliary memory to hold the 
coefficients (block matrices) and the X and Y registers 
for storing intermediate vectors. The main memory will 
be used for storing the x(n) input data samples and the 
y(n) output data samples. It will be assumed that there 
is some sort of DMA controller responsible for getting 
the data samples to and from the "real world." This 
extra time will not be considered in these calculations 
because it is not part of the block algorithm itself and 
is dependent on the host computer being used. The data
movement time could be substantial in itself; however,
its effect will be small if the data movement is done in
parallel with the computations.
Recall from Chapter 2 that the basic block equation
for $L \ge M$ can be rewritten as:

$$y_k = P_{A0} x_k + P_{A1} x_{k-1} + P_{B1} y_{k-1} \tag{4.1}$$

where $P_{A0}$ is lower triangular and $P_{A1}$ and $P_{B1}$ have L-M
zero columns.
The first step is to perform the multiplication
$P_{A0} x_k$. Since $P_{A0}$ is lower triangular, the inner loop of
Fig. 3-2 need only be repeated R times for the Rth
row of $P_{A0}$. The first 6 rows of the calculation should
be coded sequentially (no looping) since using a loop
would increase the execution time. This is because with
fewer than 7 elements in a row, much of the time is spent
filling and flushing the pipelines. Beginning with the
7th row, looping in the code results in fewer
instructions (Fig. 3-2 can be used explicitly). Row 1 takes 9
clock cycles, row 2 takes 12 cycles, row 3 takes 13, row 4
takes 14, row 5 takes 15, row 6 takes 16, and all succeeding
rows take R+10 clock cycles. Therefore, the total time for
the first matrix times vector multiplication is:

$$\text{Time} = 9+12+13+14+15+16 + \sum_{R=7}^{L} (R+10) = 0.5L^2 + 10.5L - 2 \text{ clock cycles} \tag{4.2}$$
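
The closed form in Eqn. (4.2) follows from the standard sum of the first L integers. The intermediate steps, written out here as a check, are:

$$
\begin{aligned}
9+12+13+14+15+16 &= 79 \\
\sum_{R=7}^{L} (R+10) &= \left[\frac{L(L+1)}{2} - 21\right] + 10(L-6) = 0.5L^2 + 10.5L - 81 \\
79 + \left(0.5L^2 + 10.5L - 81\right) &= 0.5L^2 + 10.5L - 2
\end{aligned}
$$

Here 21 is the sum of the integers 1 through 6, which are excluded from the looped rows.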
The next step is to multiply $P_{A1} x_{k-1}$ and add to the
intermediate vector from step 1 above. Since $P_{A1}$ only
has M non-zero columns, Fig. 3-2 can be used with an
inner loop length of M. Therefore, the execution time is:

$$\text{Time} = L(M+10) + L + 2 = LM + 11L + 2 \text{ clock cycles} \tag{4.3}$$
The final step is to multiply $P_{B1} y_{k-1}$, add to the
intermediate vector from the previous step, and store the
$y_k$ block back into main memory. The store back into main
memory can be incorporated into the addition of vectors
in Fig. 3-2. The total execution time for this step is
therefore:

$$\text{Time} = LM + 11L + 5 \text{ clock cycles} \tag{4.4}$$
Now the 3 steps can be added together to get the total
execution time given in Eqn. (4.5)

$$\text{Time} = 0.5L^2 + 2LM + 32.5L + 5 \text{ clock cycles} \tag{4.5}$$

and the execution time per output sample point is given
in Eqn. (4.6).

$$\text{Time/point} = 0.5L + 2M + 32.5 + 5/L \text{ clock cycles} \tag{4.6}$$

It should be noted that there is still an L dependence in
the execution time so the best case is when L = M. As an
example, for an L = M = 10 filter, the time per output sample
point is 9.6 µs which implies a processing rate greater
than 100,000 samples/sec.
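
The numerical example can be reproduced by summing the three step times and converting cycles to time. The sketch below is illustrative only: the function names are hypothetical, and the 167 ns cycle is taken from Chapter 3. With that cycle time the L = M = 10 case evaluates to 58 cycles, or roughly 9.7 µs per point, which is close to the figure quoted above.

```python
CYCLE_NS = 167  # AP-120B clock cycle from Chapter 3


def step1_cycles(L):
    """Eqn. (4.2): lower-triangular P_A0 times x_k."""
    return 0.5 * L**2 + 10.5 * L - 2


def step2_cycles(L, M):
    """Eqn. (4.3): P_A1 times x_(k-1), added to the step 1 vector."""
    return L * M + 11 * L + 2


def step3_cycles(L, M):
    """Eqn. (4.4): P_B1 times y_(k-1), added and stored to main memory."""
    return L * M + 11 * L + 5


def cycles_per_point(L, M):
    """Total cycles for one output block divided by the block length L."""
    total = step1_cycles(L) + step2_cycles(L, M) + step3_cycles(L, M)
    return total / L


def microseconds_per_point(L, M, cycle_ns=CYCLE_NS):
    return cycles_per_point(L, M) * cycle_ns / 1000.0


if __name__ == "__main__":
    us = microseconds_per_point(10, 10)
    print(f"{cycles_per_point(10, 10):.1f} cycles/point, {us:.2f} us/point, "
          f"{1e6 / us:,.0f} samples/sec")
```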
4.2 Scalar Filter Implementation 
A filter derived from the scalar difference equation 
(1.2) can also be implemented efficiently on the AP-120B. 
It is clear that the first part (up to adding together of 
vectors) of Fig. 3-2 can be used where the index 
registers are needed to address the different sets of 
coefficients. The a(i) and b(i) coefficients would be
stored in auxiliary memory and the input and output data
samples would be stored in main memory. The only difference
from Fig. 3-2 is that instead of saving the result in an
X register, the new y(n) must be stored in main memory
which requires 2 additional clock cycles.
The total execution time is therefore:

$$\text{Time} = (2M+1) + 12 = 2M + 13 \text{ clock cycles} \tag{4.7}$$
Note that this is still slightly better than the 
associated block filter. This is because the scalar 
difference equation represents an algorithm that can also 
be executed efficiently on a pipelined array processor. 
Therefore, to achieve the greatest possible efficiency 
for the block algorithm, it will be necessary to use a 
processor that can handle several vectors simultaneously. 
This will be the main emphasis of Chapters 5 and 6. 
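
To put numbers on this comparison, the sketch below evaluates Eqn. (4.7) against Eqn. (4.6) for the best block case L = M. The function names are hypothetical and only the two formulas from the text are used.

```python
def scalar_cycles_per_point(M):
    """Eqn. (4.7): scalar difference equation on the AP-120B."""
    return 2 * M + 13


def block_cycles_per_point(L, M):
    """Eqn. (4.6): basic block filter on the AP-120B."""
    return 0.5 * L + 2 * M + 32.5 + 5 / L


if __name__ == "__main__":
    print(" M  scalar   block (L = M)")
    for M in (6, 10, 16, 32):
        print(f"{M:2d}  {scalar_cycles_per_point(M):6d}   "
              f"{block_cycles_per_point(M, M):8.2f}")
```

For every order in the assumed range of 6 to 32 the scalar count is the smaller of the two, in line with the observation above.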
4.3 Block State Filter Implementation 
Since block state structures I and II are rather 
similar in their method of execution, structure I will be 
derived fully and only the results will be presented for 
block state structure II. 
4.3.1 Block state structure I 
The equations defining block state structure I are
repeated here as Eqn. (4.8).

$$
\begin{aligned}
y_k &= \hat{y}_k + P_{A0} x_k \\
\hat{y}_{k+1} &= P_{B1} \hat{y}_k + H_1 x_k
\end{aligned}
\tag{4.8}
$$

This filter presents an additional problem that was not
encountered in the basic block filter: here the state
vector $\hat{y}_k$ must be stored in main memory because the X and
Y registers are used for intermediate results. One must
be careful here so that the computation is not slowed by
data usage conflicts in the main memory unit.
The first step is to multiply $P_{A0} x_k$ as in the basic
block filter but now $\hat{y}_k$ will be added immediately and the
resulting vector stored back in main memory. This will
take a total of:

$$\text{Time} = (0.5L^2+10.5L-2) + (L+3+2) + (L+3) = 0.5L^2 + 12.5L + 6 \text{ clock cycles} \tag{4.9}$$

The first term is identical to that derived in section
4.1 above. The second term is the pipelined fetch of $\hat{y}_k$
from memory and its addition to $P_{A0} x_k$. The final term is
the store back to main memory. Note that the fetch-add-store
cannot be overlapped because of a main memory conflict.
The next step in the computation is to perform $H_1 x_k$
and since $H_1$ is a full matrix, this is exactly the first
part of Fig. 3-2 which takes:

$$\text{Time} = L(L+10) \text{ clock cycles} \tag{4.10}$$

The final step involves multiplying $P_{B1} \hat{y}_k$, adding to
$H_1 x_k$, and storing in main memory. Recalling that $P_{B1}$ has
M non-zero columns, this part is equivalent to the second
step in section 4.1 above. Therefore, a total of:

$$\text{Time} = L(M+10) + (L+2) + 3 = LM + 11L + 5 \text{ clock cycles} \tag{4.11}$$

are required. The second term is the addition of
intermediate vectors and the 3 is for the pipelined write back
to main memory.
Hence, a total of:

$$\text{Time} = 1.5L^2 + LM + 33.5L + 11 \text{ clock cycles} \tag{4.12}$$

is required for an output block and the time per output
sample point is given in Eqn. (4.13).

$$\text{Time/point} = 1.5L + M + 33.5 + 11/L \text{ cycles} \tag{4.13}$$
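
As a check, the block total in Eqn. (4.12) is the sum of the three step times in Eqns. (4.9), (4.10), and (4.11):

$$
\begin{aligned}
\text{Time} &= \left(0.5L^2 + 12.5L + 6\right) + \left(L^2 + 10L\right) + \left(LM + 11L + 5\right) \\
            &= 1.5L^2 + LM + 33.5L + 11
\end{aligned}
$$

Dividing by the block length L gives the per-point figure of Eqn. (4.13).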
4.3.2 Block state structure II 
Block state structure II is very similar to 
structure I in terms of array processor execution. The 
defining equations are repeated here as Eqn. (4.14).

$$
\begin{aligned}
y_k &= H_1 \hat{x}_{k-1} + P_{A0} x_k \\
\hat{x}_k &= P_{B1} \hat{x}_{k-1} + x_k
\end{aligned}
\tag{4.14}
$$

Again, the same problem arises here in that the state
vector $\hat{x}_k$ must be stored in main memory which tends to
decrease efficiency.
It can be shown that the execution time per output
sample point is given by Eqn. (4.15).

$$\text{Time/point} = 2.5L + 33.5 + 11/L \text{ clock cycles} \tag{4.15}$$

Note that for the case L = M, both block state
structures have the same execution time but for larger L,
structure I becomes more efficient. Also, note that
the execution time of structure II is independent of the
order of the filter M. These results are very similar to
the results for the number of raw computations.
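
The relationship between the two block state structures can be seen by evaluating Eqns. (4.13) and (4.15) side by side. The sketch below uses hypothetical function names and only those two formulas.

```python
import math


def structure1_cycles_per_point(L, M):
    """Eqn. (4.13): block state structure I."""
    return 1.5 * L + M + 33.5 + 11 / L


def structure2_cycles_per_point(L):
    """Eqn. (4.15): block state structure II (no dependence on M)."""
    return 2.5 * L + 33.5 + 11 / L


if __name__ == "__main__":
    M = 8
    for L in (8, 16, 32):
        s1 = structure1_cycles_per_point(L, M)
        s2 = structure2_cycles_per_point(L)
        print(f"L={L:2d} M={M}: structure I {s1:7.2f}, structure II {s2:7.2f}")
    assert math.isclose(structure1_cycles_per_point(8, 8),
                        structure2_cycles_per_point(8))
```

The difference between the two is exactly L - M cycles per point, so the structures coincide at L = M and structure I pulls ahead as the block length grows beyond the filter order.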
4.4 Scalar State Variable Implementation 
The state variable filter described in Chapter 2 is
repeated here as Eqn. (4.16).

$$
\begin{aligned}
v_{n+1} &= A v_n + b\,x(n) \\
y(n) &= c\,v_n + d\,x(n)
\end{aligned}
\tag{4.16}
$$

Here, all coefficients should be stored in auxiliary
memory. The scalar data points x(n) and y(n) are stored
in main memory. The state vector $v_n$ could be stored in a
fast register unit but to keep the discussion general it
will be assumed that it is stored in main memory.
The first step is to perform $A v_n$ which is exactly
the first part of Fig. 3-2 and takes M(M+10) clock
cycles. The next step is to perform the multiplication
bx(n), add it to the intermediate vector above, and store
the result in main memory which takes a total of:

$$\text{Time} = (M+5) + (M+2) + 3 \text{ clock cycles} \tag{4.17}$$

The third step is to perform $c v_n$ which is the inner loop
of Fig. 3-2 and takes M+10 clock cycles. Next, dx(n) is
computed and added to the intermediate value; this is
effectively a scalar multiplication and addition which
takes 8 clock cycles. Finally, y(n) is stored back to
main memory which takes 3 cycles.
Therefore, the total clock cycles required for
execution is given by Eqn. (4.18).

$$\text{Time/point} = M^2 + 13M + 31 \text{ clock cycles} \tag{4.18}$$

It should be noted that this time is rather long compared
to the block state structures and in fact even has an M
dependence. This is because the array processor is
inefficient at performing scalar computations.
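
The total in Eqn. (4.18) is the sum of the five step times listed above, shown here term by term as a check:

$$
\begin{aligned}
\text{Time/point} &= M(M+10) + \left[(M+5) + (M+2) + 3\right] + (M+10) + 8 + 3 \\
                  &= M^2 + 13M + 31
\end{aligned}
$$

For M = 10 this gives 261 clock cycles per point, whereas Eqn. (4.13) with L = M = 10 gives 59.6 cycles per point for block state structure I.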
4.5 Suggestions for Efficient Array Processor Programs 
There were several ideas which were incorporated 
into getting efficient array processor filter execution 
in this chapter.  These included: 
- Use memory efficiently. Keep coefficients in
auxiliary memory and try to avoid memory unit
conflicts.
- Fold the instruction loops.  See [11]. 
- Chain results out of multiplier pipe directly 
into the adder pipe without an intermediate 
store to a register. 
- Do not do calculations on zero elements in 
matrices that are lower triangular or have 
first column(s) zero. This results in some 
overhead in filling and flushing the pipes but 
gives better results in the long run. 
- Keep $L \le 32$ if at all possible. For L > 32,
there is a great deal of extra time required in
transferring intermediate results to and from
main memory (this is specific to the AP-120B).
CHAPTER 5 
A VECTOR PROCESSING ARCHITECTURE FOR BLOCK FILTERS
The use of an array processor, as discussed in 
Chapters 3 and 4, for implementing block filters does not 
make maximum use of the parallelism inherent in the block 
algorithm. For example, when multiplying a matrix by a 
vector, if more than one processor is available it is 
possible to have each processor work on a single row of 
the matrix. And since each processor can itself be
pipelined it seems feasible to reduce the filter
execution time by a factor of approximately N if N
processors are available. (The factor is actually slightly
less than N because of control overhead.) A vector
architecture that makes use of this parallelism will be
presented in this chapter. It should be noted that although
this processor will be called a vector processor in this
thesis, it is more formally known as a distributed array
processor [15] or a modular array processor [16].
This chapter will first give a general description
of the overall architecture. Next, the row processors
will be described followed by a discussion of the control
bus. Finally, the master processor and instruction set
will be described. The next chapter will deal with the
implementation of block filters on this system.
5.1 General Description of Vector Processor 
When multiplying a matrix by a vector, the
calculation is, of course, performed as is shown in Fig.
5-1. In block filtering, all of the elements of the z
vector (which may be an input block or intermediate
vector) are available simultaneously within the
processor. Therefore, if there are N processors
available, it is reasonable that each processor could
perform its row computation independently and
simultaneously with the other processors. (If there are
fewer than N processors available this method can still be
used with reduced efficiency; this will be discussed in the
next chapter.) In other words, the ith processor would be
performing the computation
$A_{i1} z_1 + A_{i2} z_2 + \cdots + A_{iN} z_N$. Henceforth, the
processor that computes the ith element of the product
vector will be called row processor i or $RP_i$.

$$
\begin{bmatrix}
A_{11} & A_{12} & \cdots & A_{1N} \\
A_{21} & A_{22} & \cdots & A_{2N} \\
\vdots & \vdots &        & \vdots \\
A_{N1} & A_{N2} & \cdots & A_{NN}
\end{bmatrix}
\begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_N \end{bmatrix}
=
\begin{bmatrix}
A_{11} z_1 + A_{12} z_2 + \cdots + A_{1N} z_N \\
A_{21} z_1 + A_{22} z_2 + \cdots + A_{2N} z_N \\
\vdots \\
A_{N1} z_1 + A_{N2} z_2 + \cdots + A_{NN} z_N
\end{bmatrix}
$$

Figure 5-1: Matrix times Vector Multiplication

The execution of the complete processor is
controlled by the master processor or MP. The MP
contains the program that is actually executed and since
this program can be changed, various types of block
filters can be implemented. An overview of the
architecture is presented in Fig. 5-2. It is envisioned,
since each of the RPs is identical, that a single VLSI
chip could contain all of the hardware for a single RP.
This would tend to keep the cost of the system relatively
low.
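
The row-per-processor decomposition can be sketched in software as follows. This is only an illustration: the class and function names are hypothetical, the elements of z are assumed to be broadcast to all row processors one at a time, and the inner loop over row processors stands in for work that the hardware would perform simultaneously.

```python
class RowProcessor:
    """Software stand-in for one row processor (illustrative only)."""

    def __init__(self, coefficients):
        self.coefficients = list(coefficients)  # one row of the matrix A
        self.address = 0                        # index of the next coefficient
        self.accumulator = 0.0

    def take_data(self, z_j):
        """Multiply the broadcast element by the next coefficient and accumulate."""
        self.accumulator += self.coefficients[self.address] * z_j
        self.address += 1


def broadcast_matvec(A, z):
    """Compute Az with one RowProcessor per row of A."""
    row_processors = [RowProcessor(row) for row in A]
    for z_j in z:                    # one broadcast per step
        for rp in row_processors:    # simultaneous in hardware, serial here
            rp.take_data(z_j)
    return [rp.accumulator for rp in row_processors]


if __name__ == "__main__":
    print(broadcast_matvec([[1, 2], [3, 4]], [5, 6]))
```

With N row processors the outer loop runs N times regardless of the number of rows, which is the source of the approximately N-fold reduction in execution time discussed above.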
[Figure 5-2 block diagram: the master processor (MP)
accepts input samples and delivers output samples, and is
connected to the row processors $RP_1$ through $RP_N$ by a
bidirectional data bus and a control bus.]

Figure 5-2: Overview of the Vector Architecture
A bidirectional data bus was selected over discrete
input and output busses for two reasons. First, a
bidirectional bus keeps the system architecture simple
with very little loss of efficiency. Secondly, since
each of the RPs would be on a single VLSI chip, a
bidirectional data bus keeps the number of pins on the
package to a reasonable number. Note that no decision
has been made on the number of data bits in this
processor. It is assumed that the implementor will 
choose an appropriate number of bits for the particular 
application (probably between 16 and 32 bits). 
A processor such as described will be shown to have 
the following properties: 
- Execution time per output sample is approxi- 
mately constant and is not dependent on the 
order of the filter (providing enough RPs are 
available). This is a remarkable result and is 
very different from a scalar filter no matter 
how it is implemented. 
- If the row processors are implemented on a VLSI
chip, the cost for the system is proportional
to the block length L plus the cost of the
master processor. Therefore, as far as hardware
cost is concerned, it appears best to keep the
block size and hence the number of processors
small (equal to the order of the filter).
- The total system cost would probably be in the 
same range or less than the cost of commercial 
array processors but the execution time is 
greatly decreased with the specialized vector 
architecture. Also, if several systems are 
built the overhead costs for the first VLSI 
chip can be spread out so that system cost will 
go down. 
5.2 Row Processor Description 
As stated earlier, the row processors are
responsible for multiplying a specific row of a matrix by
a vector and summing the products. Each RP can be
implemented on a single VLSI chip [17] with inputs and
outputs to the bidirectional system data bus and inputs
from the control bus. It would be possible to use a
commercially available IC such as the Bell Laboratories
DSP chip [18]; however, in order to make a fair
comparison to previous results a new architecture will be
presented. The RPs will have a multiplier pipeline
length of 3 clock cycles and an addition pipe length of 2
cycles so that these results may be equitably compared to
the results in Chapter 4 for the AP-120B. Also, the Bell
Labs DSP chip is rather inefficient for use as a row
processor since it has unneeded functions and some
required functions would have to be implemented in
software.
The block diagram for a row processor is given in
Fig. 5-3. A typical filter system would probably have
fewer than 32 RPs so that 5 bits, RPA (row processor
address), are required to select a specific RP. There
would be an external decoder for each RP that would put a
"1" on the RPS (row processor select) line when the
master processor is "talking to" that particular RP. The
individual lines of RPA are also needed internally in the
chip to select RAM addresses.
The coefficient RAM is divided into 4 blocks of up
to 32 data words each. A particular block is selected by
the MB0 and MB1 control lines which will be discussed
further in the next section. Within each RP is an
address counter which is incremented on every clock cycle.
This address counter selects a particular address for
reading (or writing) within a given memory block. The
coefficients may be easily modified to change the filter
response by writing to an RP's RAM from the master
processor by selecting the appropriate control bus lines.

[Figure 5-3 block diagram: the row processor contains a
coefficient RAM, an address counter, a multiplier
pipeline, an adder pipeline, an output register, and a
control unit, all connected to the bidirectional data bus.
An external decoder drives the RPS line from the RPA
lines. Control inputs shown: Read, Write, CLK, CLR, NV,
TD, PD, MB1, MB0, and RPA4 through RPA0.]
* indicates tri-state output
--> indicates RP control signal inputs

Figure 5-3: Diagram of Row Processor
The multiplier pipeline is 3 clock cycles long and
the adder pipeline is 2 cycles long. Each of these units
receives the clock input which moves partial results along
the pipeline. The multiplier pipeline can be implemented
in a manner similar to the AP-120B. The first cycle is
used to start the multiplication of mantissas, the second
cycle to finish the multiplication, and the third cycle
to add exponents and normalize the mantissa.
The output register is used to hold intermediate and
final results whose output may then be fed back to the
adder pipeline or onto the system data bus. The feedback
path is only enabled at the end of a matrix times vector
multiplication to empty the addition pipeline of partial
sums. All of the data units have a clear input that
zeros all registers. Note that some data units have
tri-state outputs so that more than one unit can use a
common data bus at different times.
Each RP has a control unit that may be implemented
in a PLA or random logic. The control unit takes inputs
from the control bus, decodes them, and sends them to the
various arithmetic and memory units. Because of the
relatively large number of control lines, little decoding
is needed and this circuitry can be fairly small.
5.3 The Control Bus 
The system control bus consists of 13 lines from the
master processor to all of the row processors. There is
no "handshaking" arrangement, so it is assumed
that the RPs have indeed received or transmitted their
data as instructed. There are also no error
signals between the MP and the RPs, but these would not be
difficult to design into the system. Since there are only
approximately seven possible instructions to the RPs, the
various instructions could be encoded into 3 bits instead
of the 5 that are used. However, in order to keep the
RPs' logic simple (by reducing the amount of decoding
logic needed), the full 5 bits are used.
The control bus consists of the following lines: 
- CLK:  System clock. 
- CLR: Instructs the RPs to clear all registers 
and pipelines for new calculations. This would 
probably only be used (by itself) to clear the 
system after coefficients are changed in order 
to create a new filter. 
- NV: New vector. Tells RPs that a new Ax+b type
of calculation is about to begin.
- TD: Take data. Instructs all RPs to take data
from the data bus and load it into the
multiplier pipeline (or memory).
- PD: Put data. Tells a specific RP indicated
by the RPA lines to put the contents of the
output register on the data bus.
- MB0, MB1: Memory block. The 4 combinations of
these 2 lines indicate the high order address
bits in coefficient RAM. This is so up to four
matrices can be stored in RAM.
- RPA0, RPA1, ..., RPA4: Row processor address.
This has two uses. First, it indicates which
RP the master processor is talking to. This is
physically determined by the external decoders
for each RP chip. Secondly, when NV = 1 these
lines contain the low order starting address in
the RP's RAM, i.e., this address is loaded into
the RP's address counter.
There are also a few combinations of control bits 
which create different meanings: 
- (PD = 1) AND (TD = 1): The RP specified by the RPA
lines puts its output register onto the data
bus and also transfers it to its own multiplier
pipeline. As is normally the case, the other
RPs take that data word into their pipelines.
- (CLR = 1) AND (NV = 1): Instructs the RP indicated
by the RPA lines to write the data word on the
data bus into RAM. This is used for changing
filter coefficients.
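
The way a single row processor might interpret these lines can be sketched as a small decoding function. This is an illustration under stated assumptions, not the thesis's control unit design: the names ControlWord and decode and the action strings are hypothetical, the CLK and MB lines are omitted, and any combination not described in the text is simply treated as the union of the individual meanings.

```python
from typing import List, NamedTuple


class ControlWord(NamedTuple):
    """State of the control bus lines for one clock cycle (CLK and MB omitted)."""
    clr: int = 0
    nv: int = 0
    td: int = 0
    pd: int = 0
    rpa: int = 0   # 5-bit RP address, or RAM starting address when NV = 1


def decode(word: ControlWord, my_address: int) -> List[str]:
    """Return the actions taken by the RP whose external decoder matches my_address."""
    selected = word.rpa == my_address
    # Special combination: CLR and NV together mean "write bus word into RAM".
    if word.clr and word.nv:
        return ["write_ram_from_bus"] if selected else []
    actions = []
    if word.clr:
        actions.append("clear_registers")
    if word.nv:
        actions.append(f"load_address_counter({word.rpa})")
    if word.pd and selected:
        actions.append("drive_bus_from_output_register")
    if word.td:
        actions.append("load_multiplier_from_bus")
    return actions


if __name__ == "__main__":
    word = ControlWord(pd=1, td=1, rpa=3)
    for address in (3, 4):
        print(address, decode(word, address))
```

In the example, RP 3 both drives the bus and loads its own multiplier pipeline, while RP 4 only loads its pipeline, as described for the PD and TD combination.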
5.4 Master Processor Description 
The master processor (MP) is responsible for con- 
trolling all of the row processors and contains the 
software program which actually causes the filter to 
execute. The main elements of it are the program and
data RAMs, instruction register (IR), instruction
decoder, two registers, direct memory access (DMA)
controller, and the system clock (which could be external
depending on the application). A diagram of the MP is
shown in Fig. 5-4.
The instruction register is divided into three
fields: the opcode field, a data field, and an address
field. The use of these fields will be described in the
next section on the instruction format; their
implementation is dealt with here. The
opcodes are transferred to the instruction decoder which
sends out control signals to other parts of the MP and on
the control lines. In certain instructions, the data
field is transferred to REG which is a register that is
decremented on every clock cycle (depending on the
instruction). This register is used for looping within a
program and in this case, the address field is
transferred to the program counter (PC) to branch to the
beginning of the loop. On certain instructions, the low
order 5 bits of the address field are transferred to the
address register. This is an incrementable register used
as an address register for data RAM and to control the
RPA lines to the row processors. The data RAM is also
partitioned into 4 blocks, and the block is selected
by the MB0 and MB1 bits in the instruction.
There is also a direct memory access controller
which handles I/O to the "outside world." Although the
details are not important for this thesis, the DMA
controller would store input blocks automatically into
data RAM where they could later be used by the system.
Output is normally handled directly by the master
processor by connecting the system data bus to the outside
world as the $y_k$ vector is being stored into data RAM for
future use.

[Figure 5-4 block diagram: program RAM addressed by the
program counter (PC); the instruction register (IR) with
opcode, data, and address fields; the instruction decoder;
the clock (CLK); REG; the address register (AR) driving
the RPA lines; data in and data out paths; and connections
to the control bus and the system data bus.]

Figure 5-4: Diagram of the Master Processor
Since the MP hardware is relatively simple, it could
easily be assembled on a single PC board using standard
SSI and MSI components. If desired, the MP could be
produced on a single VLSI chip, but this seems
economically impractical unless a large number of systems
are to be manufactured, because only one MP is required
per system.
5.5 Instruction Format 
The instruction words are 28 bits long and control all
functions within the system. A horizontal programming
(very long instruction words) method was selected over
vertical programming (many instructions needed to perform
a single "macroinstruction") for several reasons [19].
With horizontal programming, many operations can be
performed in parallel, so each instruction executes in
the master processor within the same time frame in which
the corresponding events occur in the row processors.
With vertical programming, a very fast clock would have
to be used in the MP to execute the required number of
microinstructions within one system clock of the RPs.
The instructions are divided into three fields: the
opcode field, the data field, and the address field.
Figure 5-5 gives a detailed description of the various
fields and the bits within the fields. Many of the bits
correspond exactly to the control bus lines, so the
instruction decoder is relatively simple.
The program resides in 1K of program RAM and can be
easily modified. The program RAM could be enlarged if
necessary by simply adding another address bit, but this
is probably unnecessary since the block filter programs
are rather short because of the horizontal programming.
Also, if more than 32 RPs are needed, one or more extra
bits may be added to the DATA field to address more RPs.
The description of the block filter processing 
system is now complete and the next Chapter will use this 
processor for executing filters. Machine language 
programs will be given for each of the various filters 
and execution times will be derived. 
CLR  NV  TD  PD  INCA  MVI  DJNZ  WM  RM  OUT | DATA (5) | MB1  MB0  ADDRESS (10)
<---------------- Opcode ------------------->   <- Data ->

CLR:        Clear
NV:         New vector
TD:         Take data
PD:         Put data
INCA:       Increment address register
DJNZ:       Decrement REG and jump to ADDRESS if REG ≠ 0
MVI:        Move immediate DATA -> REG
WM:         Write to data memory
RM:         Read from data memory
OUT:        Connect data bus to OUT bus
MB1 & MB0:  Memory block address
DATA:       5 bits of immediate data
ADDRESS:    10 bits of address used for branching

Note: When NV=1 the address register AR is loaded with the low order
bits of ADDRESS.

Figure 5-5: Instruction Format in Master Processor
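As an illustration of the format in Fig. 5-5, the following minimal Python sketch packs the listed fields into a single integer and unpacks them again. The packing order (CLR as the most significant bit), the helper names `encode` and `decode`, and the dictionary layout are assumptions made for illustration only. The fields listed in the figure account for 27 bits; the use of the remaining bit of the 28-bit word is not stated, so it is omitted here.

```python
# Field names and widths as listed in Fig. 5-5, most significant first.
# The MSB-first packing order is an assumption made for this sketch.
FIELDS = [
    ("CLR", 1), ("NV", 1), ("TD", 1), ("PD", 1), ("INCA", 1),
    ("MVI", 1), ("DJNZ", 1), ("WM", 1), ("RM", 1), ("OUT", 1),
    ("DATA", 5), ("MB1", 1), ("MB0", 1), ("ADDRESS", 10),
]

def encode(**values):
    """Pack named field values into one integer; unnamed fields are 0."""
    unknown = set(values) - {name for name, _ in FIELDS}
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    word = 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bit(s)")
        word = (word << width) | value
    return word

def decode(word):
    """Unpack an integer produced by encode() into a field dictionary."""
    fields = {}
    for name, width in reversed(FIELDS):
        fields[name] = word & ((1 << width) - 1)
        word >>= width
    return fields

# First instruction of the basic block filter program (Fig. 6-1):
# NV=1, MVI=1, DATA=01010 (block length 10), all other fields 0.
first = encode(NV=1, MVI=1, DATA=0b01010)
print(format(first, "027b"))
print(decode(first)["DATA"])
```

Keeping the field list in one table means the decoder is just the encoder run in reverse, which mirrors the remark above that most instruction bits map directly onto control lines.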
CHAPTER 6 
VECTOR PROCESSOR IMPLEMENTATION OF BLOCK FILTERS 
The vector processor described in the previous 
Chapter is extremely efficient in executing block digital 
filters. This Chapter will discuss the execution time of 
three types of filters: basic block structure, block 
state structures, and filters where the block length is 
greater than the number of processors available. The 
required control program stored in the master processor 
will also be given for most of these filters. 
6.1 Basic Block Filter Execution 
The basic block equation is repeated here as Eqn. 
(6.1).
$y_k = P_{B1} y_{k-1} + P_{A1} x_{k-1} + P_{A0} x_k$        (6.1)
In this section it will be assumed that $L \geq M$ and
that there are L row processors available in the system.
As can be seen in Eqn. (6.1), two previous vectors of data
values are required and one present vector. If the
previous output vector, $y_{k-1}$, is still stored in the
output registers of the row processors (this will indeed
be shown to be the way the program works), then an
efficient method of execution is to put the elements of
$y_{k-1}$ on the data bus sequentially. As they appear,
they are taken and stored by the MP for output and taken
by all of the RPs to begin execution of $P_{B1} y_{k-1}$.
The remaining two matrix times vector multiplications
must have the data vectors coming from the MP since this
is the only place in the system that they are stored.
Since the individual elements of $P_{B1} y_{k-1}$ are
stored in the output registers of the RPs, when the
computation of $P_{A1} x_{k-1}$ is begun, the new values
are automatically summed into the respective elements of
$P_{B1} y_{k-1}$. The same process is repeated for
$P_{A0} x_k$. In other words, a given RP multiplies and
sums a row of the complete block equation.
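The accumulation described above can be sketched in a few lines of Python. This is only a functional model under stated assumptions: each row processor is represented by one accumulator, the function names `broadcast_mac` and `basic_block_step` and the 2x2 example values are hypothetical, and pipeline latency is ignored.

```python
def broadcast_mac(acc, matrix, vector):
    """Broadcast the vector one element at a time; RP i adds matrix[i][j]*vector[j]."""
    for j, element in enumerate(vector):      # one bus word per clock cycle
        for i in range(len(acc)):             # all RPs work in parallel in hardware
            acc[i] += matrix[i][j] * element
    return acc

def basic_block_step(PB1, PA1, PA0, y_prev, x_prev, x_now):
    """Compute y_k = PB1*y_{k-1} + PA1*x_{k-1} + PA0*x_k, one row per RP."""
    acc = [0.0] * len(y_prev)                 # models the RP output registers
    broadcast_mac(acc, PB1, y_prev)           # y_{k-1} comes from the RPs themselves
    broadcast_mac(acc, PA1, x_prev)           # x_{k-1} comes from MP data memory
    broadcast_mac(acc, PA0, x_now)            # x_k comes from MP data memory
    return acc

# Hypothetical 2x2 example values, chosen only to exercise the code.
PB1 = [[0.5, 0.0], [0.0, 0.5]]
PA1 = [[0.0, 1.0], [0.0, 0.0]]
PA0 = [[1.0, 0.0], [1.0, 1.0]]
y_k = basic_block_step(PB1, PA1, PA0,
                       y_prev=[2.0, 4.0], x_prev=[1.0, 3.0], x_now=[1.0, 1.0])
print(y_k)
```

The outer loop corresponds to successive words on the data bus, and the inner loop to the row processors, which is why the hardware needs only about L clocks per matrix times vector product.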
Figure 6-1 gives the MP program which will cause a 
basic block filter to execute.  In this example program, 
the block length L=10.   A description of the dynamic 
filter execution for a general filter of block length L 
is given below along with the clock cycle(s) that the 
action occurs: 
Clock Cycle       Action
1                 All of the system address counters are set to
                  zero and the block length of 10 is put into the
                  MP data register.
2 to L+1          The RPA lines are incremented with each clock
                  cycle so that successive RPs put their element
                  of $y_{k-1}$ on the data bus and these are
                  reentered into all other RPs. Also, the
                  $P_{B1}$ coefficients are loaded into the
                  multiplier pipelines. The MP also takes in the
                  $y_{k-1}$ data values and outputs them.
L+2 to L+8        The pipelines are emptied and the partial
                  result is stored in the RP's output register.
L+9               Zero address registers.
L+10 to 2L+9      MP puts the previous input data vector
                  $x_{k-1}$ on the data bus on successive clocks.
                  Note that MB is set so that the $P_{A1}$
                  coefficients are used. Note that since
                  $P_{B1} y_{k-1}$ is already in the output
                  registers, it is automatically summed into the
                  new computation.
2L+10 to 2L+16    Empty pipelines.
2L+17             Zero address registers.
2L+18 to 3L+17    Same process is repeated for the $P_{A0} x_k$
                  computation.
3L+18 to 3L+24    Empty pipelines. Note that the new $y_k$
                  vector is now stored in the RPs' data
                  registers.
3L+25             Jump to program address 0000000000 to begin
                  computation on a new block.

Address  CLR NV TD PD INCA MVI DJNZ WM RM OUT  DATA   MB1 MB0  ADDRESS
00000     0  1  0  0   0    1   0   0  0   0   01010   0   0   0000000000
00001     0  0  1  1   1    0   1   1  0   1   00000   0   0   0000000001
00010     0  0  0  0   0    1   0   0  0   0   00111   0   0   0000000000
00011     0  0  0  0   0    0   1   0  0   0   00000   0   0   0000000011
00100     0  1  0  0   0    1   0   0  0   0   01010   0   1   0000000000
00101     0  0  1  0   1    0   1   0  1   0   00000   0   1   0000000101
00110     0  0  0  0   0    1   0   0  0   0   00111   0   1   0000000000
00111     0  0  0  0   0    0   1   0  0   0   00000   0   1   0000000111
01000     0  1  0  0   0    1   0   0  0   0   01010   1   0   0000000000
01001     0  0  1  0   1    0   1   0  1   0   00000   1   0   0000001001
01010     0  0  0  0   0    1   0   0  0   0   00111   1   0   0000000000
01011     0  0  0  0   0    0   1   0  0   0   00000   1   0   0000001011
01100     0  0  0  0   0    1   1   0  0   0   00000   0   0   0000000000

Figure 6-1: MP Program for Basic Block Filter
Therefore the total execution time for a block is 3L+25
clock cycles, and the time per output sample is given in
Eqn. (6.2).
Time/point = 3 + 25/L clock cycles        (6.2)
This result shows that the architecture is extremely
efficient for block digital filtering. Recall that a
block filter executed on the AP-120B had an execution
time of 0.5L+2M+32.5+(2/L) clocks and a simple scalar
filter's execution time was 2M+13 clocks. For example,
with L=M=10, Eqn. (6.2) gives 5.5 clocks per output
sample against 33 clocks for the scalar filter on the
AP-120B. The vector architecture is therefore faster than
an array processor in executing a block filter.
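The cycle counts in the clock cycle listing above add up as follows. The short Python sketch below tallies the phases and evaluates Eqn. (6.2); the function names are hypothetical, and the 7-cycle pipeline drain is taken from the listing (cycles L+2 to L+8).

```python
PIPELINE_DRAIN = 7  # cycles L+2 .. L+8 in the listing

def basic_block_cycles(L):
    """Clock cycles per output block for the basic block filter (L RPs)."""
    per_product = 1 + L + PIPELINE_DRAIN   # set-up, L bus transfers, drain
    return 3 * per_product + 1             # three products plus the final jump

def basic_block_time_per_point(L):
    """Eqn. (6.2): clock cycles per output sample."""
    return basic_block_cycles(L) / L

def scalar_ap120b_time_per_point(M):
    """Scalar filter on the AP-120B, as quoted in the text: 2M+13 clocks."""
    return 2 * M + 13

for L in (10, 16, 32):
    print(L, basic_block_cycles(L), round(basic_block_time_per_point(L), 2))
```

Because the fixed overhead of 25 cycles is divided by L, the time per sample approaches 3 clocks as the block length grows.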
6.2 Block State Filter Execution 
The block state structure I equation is repeated 
here as Eqn. (6.3). 
$y_k = H_0 x_k + \hat{y}_k$
$\hat{y}_{k+1} = P_{B1} y_k + H_1 x_k$        (6.3)
Again, it will be assumed that $L \geq M$ and that there
are L row processors available. For this type of filter
the present input vector is needed twice and the previous
state vector is also needed twice. With some hardware
modifications, it might be possible to do both
computations simultaneously and thus achieve a great
increase in speed. However, with the system discussed in
Chapter 5, it will be necessary to do three matrix times
vector multiplications sequentially.
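As a functional illustration of the block state recursion, here is a minimal Python sketch of one block step. It assumes the two equations of Eqn. (6.3) read y_k = H_0 x_k + ŷ_k and ŷ_{k+1} = P_{B1} y_k + H_1 x_k; the helper names `matvec` and `block_state_step` and the 2x2 example values are hypothetical.

```python
def matvec(matrix, vector):
    """Plain matrix times vector product on nested lists."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]

def block_state_step(H0, H1, PB1, x_k, state):
    """Return (y_k, next_state) for one input block x_k."""
    # First equation: the state is already in the RP output registers,
    # so H0*x_k is simply accumulated onto it.
    y_k = [s + p for s, p in zip(state, matvec(H0, x_k))]
    # Second equation: two more products, summed into a cleared register.
    next_state = [p + q for p, q in zip(matvec(PB1, y_k), matvec(H1, x_k))]
    return y_k, next_state

# Hypothetical example values, chosen only to exercise the code.
H0 = [[1.0, 0.0], [0.5, 1.0]]
H1 = [[0.0, 0.25], [0.0, 0.0]]
PB1 = [[0.0, 0.5], [0.0, 0.0]]
state = [0.0, 0.0]
for x_k in ([1.0, 0.0], [0.0, 0.0]):
    y_k, state = block_state_step(H0, H1, PB1, x_k, state)
    print(y_k, state)
```

The three calls to `matvec` correspond to the three sequential matrix times vector multiplications mentioned above.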
In executing the filter equations, the first equation
will be computed to yield $y_k$. This algorithm assumes
that the $\hat{y}_k$ vector is stored in the output
registers of the RPs, which will be shown to indeed be
the case. Next, the $y_k$ vector must be recalled from
the master processor data memory to compute
$P_{B1} y_k$. Then, $H_1 x_k$ is calculated as usual.
Finally, the $\hat{y}_{k+1}$ state vector is sequentially
put on the data bus to be stored in MP memory while still
being saved in the output registers of the RPs. Figure
6-2 gives the machine language program for executing this
block state filter (for L=10). The following description
lists the actions at the various times in the system:

Clock Cycle       Action
1                 Clear all address counters.
2 to L+1          RPA lines are incremented on successive clocks
                  so that the $x_k$ vector is sent out and the
                  $H_0 x_k$ calculation is begun.
L+2 to L+8        The pipelines are emptied and since $\hat{y}_k$
                  was already in the output registers, the $y_k$
                  vector is complete and must be sent to the MP.
L+9               Clear address counters.
L+10 to 2L+9      The $y_k$ vector is sent back to the MP
                  sequentially on the data bus.
2L+10             Clear the RPs' output register for the second
                  filter equation. Also, clear the address
                  counters.
2L+11 to 3L+10    $P_{B1} y_k$ computation is begun. Master
                  processor sends out the $y_k$ vector that it
                  has stored in its memory.
3L+11 to 3L+17    Empty pipelines.
3L+18             Clear address counters.
3L+19 to 4L+18    Begin computation of $H_1 x_k$. Master
                  processor sends out $x_k$ sequentially.
4L+19 to 4L+25    Empty pipelines.
4L+26             Clear address counters.
4L+27 to 5L+26    The $\hat{y}_{k+1}$ vector is sent back to the
                  MP.
5L+27             Jump back to the beginning to start the next
                  block computation.

Address  CLR NV TD PD INCA MVI DJNZ WM RM OUT  DATA   MB1 MB0  ADDRESS
00000     0  1  0  0   0    1   0   0  0   0   01010   0   0   0000000000
00001     0  0  1  0   1    0   1   0  1   0   00000   0   0   0000000001
00010     0  0  0  0   0    1   0   0  0   0   00111   0   0   0000000000
00011     0  0  0  0   0    0   1   0  0   0   00000   0   0   0000000011
00100     0  1  0  0   0    1   0   0  0   0   01010   0   0   0000000000
00101     0  0  0  1   1    0   1   0  0   1   00000   0   0   0000000101
00110     1  1  0  0   0    1   0   0  0   0   01010   0   1   0000000000
00111     0  0  1  0   1    0   1   0  1   0   00000   0   1   0000000111
01000     0  0  0  0   0    1   0   0  0   0   00111   0   1   0000000000
01001     0  0  0  0   0    0   1   0  0   0   00000   0   1   0000001001
01010     0  1  0  0   0    1   0   0  0   0   01010   1   0   0000000000
01011     0  0  1  0   1    0   1   0  1   0   00000   1   0   0000001011
01100     0  0  0  0   0    1   0   0  0   0   00111   1   0   0000000000
01101     0  0  0  0   0    0   1   0  0   0   00000   1   0   0000001101
01110     0  1  0  0   0    1   0   0  0   0   01010   1   0   0000000000
01111     0  0  0  1   1    0   1   1  0   0   00000   1   0   0000001111
10000     0  1  0  0   0    1   1   0  0   0   00000   0   0   0000000000

Figure 6-2: MP Program for Block State Filter I
A total of 5L+27 clock cycles are required to compute an
output block, and the time per output sample point is
given in Eqn. (6.4).
Time/point = 5 + 27/L clock cycles        (6.4)
Again this is far less than the 1.5L+M+33.5+(8/L) cycles
required on the AP-120B. Note that this time is slightly
greater than the time for a basic block filter, but the
difference is insignificant compared with the time
savings in using the vector architecture over the array
processor architecture.
The derivation of the execution time for block state 
structure II is very similar to structure I and will not 
be given here. It can be shown that the execution time 
is given by Eqn. (6.5). 
Time/point = 5 + 29/L clock cycles        (6.5)
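The cycle tally for block state structure I can be checked the same way as for the basic block filter. In this Python sketch the phase lengths are taken from the clock cycle listing for structure I, and the function names are hypothetical. Eqn. (6.5) is evaluated as quoted, since its derivation is not given.

```python
PIPELINE_DRAIN = 7  # e.g. cycles L+2 .. L+8 in the listing

def block_state_1_cycles(L):
    """Clock cycles per output block for block state structure I."""
    phases = [
        1,                    # clear address counters
        L,                    # H0*x_k
        PIPELINE_DRAIN,
        1,                    # clear address counters
        L,                    # send y_k to the MP
        1,                    # clear output registers and address counters
        L,                    # PB1*y_k
        PIPELINE_DRAIN,
        1,                    # clear address counters
        L,                    # H1*x_k
        PIPELINE_DRAIN,
        1,                    # clear address counters
        L,                    # send the state vector to the MP
        1,                    # jump back to the start
    ]
    return sum(phases)

def time_per_point_state_1(L):
    """Eqn. (6.4): 5 + 27/L clock cycles per output sample."""
    return block_state_1_cycles(L) / L

def time_per_point_state_2(L):
    """Eqn. (6.5), as quoted: 5 + 29/L clock cycles per output sample."""
    return 5 + 29 / L

print(block_state_1_cycles(10), time_per_point_state_1(10), time_per_point_state_2(10))
```

Five of the phases scale with L (three products and two vector transfers), which is where the leading factor of 5 comes from.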
6.3 Filter Execution for L > N 
When the number of row processors N is less than the
block length L, the system efficiency is reduced
considerably, but the results are still excellent compared
to other signal filtering processors. It is now
impossible to perform a complete Ax+b calculation at one
time; instead, the computation has to be broken down into
several parts. Figure 6-3 shows the elements of an Ax+b
computation. The A matrix is partitioned into blocks of N
rows (the number of RPs in the system) and the
calculation is performed on each partition as in the
normal case, with intermediate results being held in the
master processor data memory. This process has to be
repeated B times, where $B = \lceil L/N \rceil$, and the
last block may have to be padded with zero rows if L/N is
not an integer.
[Figure 6-3: Partitioned Ax + b Computation. The L x L
matrix A is divided into B partitions of N rows each (the
last partition may hold fewer than N rows); x and b are
vectors of length L.]
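The partitioning can be sketched as follows. This Python model assumes N row processors, pads the last partition with zero rows, and collects the partial results in a list standing in for MP data memory. The function name `partitioned_matvec_add` and the 3x3 example values are hypothetical.

```python
import math

def partitioned_matvec_add(A, x, b, N):
    """Compute A*x + b using N rows at a time, as with N row processors."""
    L = len(A)
    B = math.ceil(L / N)                      # number of partitions
    zero_row = [0.0] * len(x)
    result = []                               # stands in for MP data memory
    for p in range(B):
        rows = A[p * N:(p + 1) * N]
        offsets = b[p * N:(p + 1) * N]
        pad = N - len(rows)                   # zero rows for the last partition
        rows = rows + [zero_row] * pad
        offsets = offsets + [0.0] * pad
        # Each of the N RPs accumulates one row of this partition.
        partial = [off + sum(a * v for a, v in zip(row, x))
                   for row, off in zip(rows, offsets)]
        result.extend(partial[:N - pad])      # discard the padded rows
    return result, B

# Hypothetical example values, chosen only to exercise the code.
A = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [1.0, 1.0, 1.0]]
x = [1.0, 2.0, 3.0]
b = [0.5, 0.5, 0.5]
y, B = partitioned_matvec_add(A, x, b, N=2)
print(y, B)
```

Note that the whole vector x is broadcast once per partition, so the bus traffic grows with B even though each RP still handles a single row at a time.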
The calculation to be considered will be the basic block
equation (6.1), although the general method will work
with any block filter. As usual, the machine language
program is given (Fig. 6-4) and the following is an
action description of the computation:

Clock Cycle         Action
1                   Clear output register and set address
                    counters to 00000 for the first partition.
2 to L+1            Begin first partition of the
                    $P_{B1} y_{k-1}$ computation. Note that now
                    the vector $y_{k-1}$ is sent out from the MP
                    since only a part of it is held in the output
                    registers.
L+2 to L+8          Empty pipelines.
L+9                 Reset address counters to 00000.
L+10 to 2L+9        Begin first partition of $P_{A1} x_{k-1}$.
2L+10 to 2L+16      Finish $P_{A1} x_{k-1}$.
2L+17               Reset address counters to 00000.
2L+18 to 3L+17      Begin first partition of $P_{A0} x_k$.
3L+18 to 3L+24      Empty pipelines. First partition of $y_k$ is
                    now complete and must be sent back to the MP.
3L+25               Reset address counters to 00000.
3L+26 to 4L+25      The first partition is sent back to the MP
                    for temporary storage.
4L+26 to B(4L+25)   The above sequence is repeated for each of
                    the B partitions. Note that where the address
                    counters were set to 00000 for the first
                    partition, they will be set to the
                    appropriate starting address for future
                    partitions.
B(4L+25) + 1        Jump back to address 0000000000 to begin next
                    output block computation.

Address  CLR NV TD PD INCA MVI DJNZ WM RM OUT  DATA   MB1 MB0  ADDRESS
00000     1  1  0  0   0    1   0   0  0   0   01010   0   0   0000000000
00001     0  0  1  0   1    0   1   0  1   0   00000   0   0   0000000001
00010     0  0  0  0   0    1   0   0  0   0   00111   0   0   0000000000
00011     0  0  0  0   0    0   1   0  0   0   00000   0   0   0000000011
00100     0  1  0  0   0    1   0   0  0   0   01010   0   1   0000000000
00101     0  0  1  0   1    0   1   0  1   0   00000   0   1   0000000101
00110     0  0  0  0   0    1   0   0  0   0   00111   0   1   0000000000
00111     0  0  0  0   0    0   1   0  0   0   00000   0   1   0000000111
01000     0  1  0  0   0    1   0   0  0   0   01010   1   0   0000000000
01001     0  0  1  0   1    0   1   0  1   0   00000   1   0   0000001001
01010     0  0  0  0   0    1   0   0  0   0   00111   1   0   0000000000
01011     0  0  0  0   0    0   1   0  0   0   00000   1   0   0000001011
01100     0  1  0  0   0    1   0   0  0   0   xxxxx   0   0   0000000000
01101     0  0  0  1   1    0   1   1  0   1   00000   0   0   0000001101
01110     0  1  0  0   0    1   1   0  0   0   00000   0   0   0000000000

xxxxx is the binary number of RPs

Figure 6-4: MP program for L > N
The total time required is B(4L+25)+1 clock cycles per
output block, which gives the execution time per output
sample point in Eqn. (6.6).
Time/point = [B(4L+25) + 1] / L clock cycles        (6.6)
Note that although this time is excellent, it does not
reduce to Eqn. (6.2) if there are in fact L processors
available. This is because this method requires
intermediate output vectors to be stored in the MP
instead of directly in the RPs' output registers. Figure
6-5 gives an example of computational efficiency using
this algorithm.
Block Length L=32

# of RPs        Time/output sample (clock pulses)
   32                   4.78
   16                   9.56
   10                  19.13
    4                  38.25
    2                  76.50
    1                 153.00

Figure 6-5: Computational Efficiency for N RPs
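Eqn. (6.6) is easy to evaluate for any L and N. The Python sketch below does so; the function names are hypothetical. The entries of Fig. 6-5 appear to correspond to B(4L+25)/L, that is, without the single jump cycle, so both values are printed for comparison.

```python
import math

def partitions(L, N):
    """B = ceil(L/N): number of N-row partitions for block length L."""
    return math.ceil(L / N)

def cycles_per_block(L, N):
    """Numerator of Eqn. (6.6): B(4L+25) + 1 clock cycles per output block."""
    return partitions(L, N) * (4 * L + 25) + 1

def time_per_point(L, N):
    """Eqn. (6.6): clock cycles per output sample."""
    return cycles_per_block(L, N) / L

L = 32
for N in (32, 16, 10, 4, 2, 1):
    without_jump = partitions(L, N) * (4 * L + 25) / L
    print(N, round(time_per_point(L, N), 2), round(without_jump, 2))
```

Since B depends only on the ceiling of L/N, values of N that give the same B (for example N=8 and N=10 when L=32) yield the same execution time.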
6.4 Summary of Results 
The vector processing architecture described in Chapter 5
has been shown to be extremely efficient in terms of
execution time for block digital filters. A numerical
example is presented in Fig. 6-6 which compares the
execution times on an array processor and the vector
processor. The great efficiency of the vector
architecture is achieved because many parts of a matrix
times vector multiplication are done in parallel and each
of these parts is itself pipelined. A block filter
executed on the vector processor is often faster by more
than an order of magnitude than a scalar filter. It
should also be noted that there is no known way to
implement a scalar filter on a vector processor since the
algorithm is not of a vector nature.
This thesis has first derived the number of raw 
computations for both scalar and block filters. As was 
expected, the number of raw computations was greater for 
the block filter than the scalar filter. Then, the 
execution times were derived for the AP-120B array 
processor. In this case, the array processor executing 
the block algorithm improved but was still somewhat 
slower than the associated scalar filter. Finally, a 
vector architecture was presented and execution times 
derived. Here, the block filter turned out to be far 
better than a scalar filter and the vector processor 
architecture proved to be extremely efficient for these 
types of algorithms. 
There are several areas where more work is needed. 
The vector processor could actually be built and tested 
to confirm the results in this thesis. In doing this, 
the actual dollar cost could be determined to find out if 
this is a practical means of implementing digital filters 
for high speed applications. Also, the vector processor 
architecture could be generalized to handle other signal 
processing applications efficiently such as the FFT and a 
language could be developed to microcode various filters 
on the vector processor. 
Execution time (clock cycles) per output sample

                 ARRAY PROCESSOR               VECTOR PROCESSOR
  L    N     Scalar   Block    Block           Block    Block
             Filter   Filter   State I         Filter   State I

For M=10 filter:
 32   32      33.0     68.7     91.8             3.8      5.8
 16   16      33.0     60.8     68.2             4.6      6.7
 10   10      33.0     58.0     59.6             5.5      7.7
 10    4      33.0     58.0     59.6            19.6     21.9
 10    1      33.0     58.0     59.6            65.1     67.5

For M=20 filter:
 32   32      53.0     88.7    101.8             3.8      5.8
 20   20      53.0     82.8     84.1             4.3      6.4
 20    8      53.0     82.8     84.1            15.8     17.9
 20    4      53.0     82.8     84.1            26.3     28.5
 20    1      53.0     82.8     84.1           105.1    107.4

For M=32 filter:
 32   32      77.0    112.7    113.8             3.8      5.8
 32   16      77.0    112.7    113.8             9.6     11.6
 32    8      77.0    112.7    113.8            19.2     21.3
 32    4      77.0    112.7    113.8            38.3     40.4
 32    1      77.0    112.7    113.8           153.0    155.2

Figure 6-6: Comparison of Filters and Architectures
REFERENCES 
1. B. Gold and K.L. Jordan, Jr., "A note on digital 
filter synthesis," Proc. IEEE (Letters), Vol. 56, 
October 1968, pp. 1717-1718. 
2. H.B. Voelcker and E.E. Hartquist, "Digital 
filtering via block recursion," IEEE Trans. Audio 
Electroacoust. (Special issue on digital 
filtering), Vol. AU-18, June 1970, pp. 169-176. 
3. R. Read and J. Meek, "Digital filters with poles 
via the FFT," IEEE Trans. Audio Electroacoust. 
(Corresp.), Vol. AU-19, December 1971, pp. 322-323. 
4. J.W. Meek and A.S. Veletsos, "Fast convolution for 
recursive digital filters," IEEE Trans. Audio 
Electroacoust. (Corresp.), Vol. AU-20, March 1972, 
pp. 93-94. 
5. C.S. Burrus, "Block implementation of digital 
filters," IEEE Trans. on Circuit Theory, Vol. 
CT-18, November 1971, pp. 697-701. 
6. C.S. Burrus, "Block realization of digital 
filters," IEEE Trans. Audio Electroacoust., Vol. 
AU-20, October 1972, pp. 230-235. 
7. R. Gnanasekaran and S.K. Mitra, "A note on block 
implementation of IIR digital filters," IEEE 
Proceedings (Letters), Vol. 65, July 1977, pp. 
1063-1064. 
8. R. Gnanasekaran, Block implementation of one-
dimensional recursive digital filters, PhD
dissertation, University of California, Santa
Barbara, 1978.
9. Murray R. Spiegel, Mathematical Handbook of 
Formulas and Tables, McGraw-Hill Book Company, New 
York, Schaum's Outline Series, 1968. 
10. Floating Point Systems, Inc., AP-120B Array 
Processor Handbook, Portland, OR, 1976. 
11. A.E. Charlesworth, "An approach to scientific array 
processing: the architectural design of the 
AP-120B/FPS-164 family," Computer, Vol. 14, 
September 1981, pp. 18-27. 
12. J. Strelchun, "Array processor responds in 
real-time," Electronics, Vol. 22, August 16 1973, 
pp. 118-124. 
13. W.R. Wittmyer, "Array processor provides high 
throughput rates," Computer Design, Vol. 17, March 
1978, pp. 93-100. 
14. R. Bernhard, "Giants in small packages," Spectrum, 
Vol. 19, No. 2, Feb. 1982, pp. 39-44. 
15. A. Dummich, "Array processor, Vector processor," 
Electronics, Vol. 22, Aug. 16 1979, p. 119. 
16. P. Alexander, "Array processor design concepts," 
Computer Design, Vol. 20, Dec. 1981, pp. 163-172. 
17. E. Swartzlander, "VLSI Architecture," in Very Large 
Scale Integration Fundamentals and Applications, 
D.F. Barbe, ed., Springer-Verlag, New York, 1980, 
ch. 6. 
18. R.C. Chapman (Editor), "Special Issue on DSP Chip," 
Bell System Technical Journal, Vol. 60, No. 7, Part 
2, Sept. 1981, pp. 1431-1709. 
19. S. Husson, Microprogramming: Principles and
Practice, Prentice-Hall, Englewood Cliffs, NJ, 1970.
VITA 
Timothy J. Slegel was born on August 27, 1958 in 
Philadelphia, PA, son of Joseph F. and Luetta B. Slegel. 
He received his Bachelor of Science degree in Electrical 
Engineering from Lehigh University in June 1980 and is a 
member of Tau Beta Pi, Eta Kappa Nu, Phi Eta Sigma, and 
the IEEE. 
While engaged in graduate studies, Mr. Slegel worked 
as a Teaching Assistant in the Electrical and Computer 
Engineering Department.  His primary area of interest is 
in  the  field of digital  logic design and  computer 
architecture. 
