Bit-sequential VLSI architectures for digital signal processing / by Tewari, Neeraj
Lehigh University
Lehigh Preserve
Theses and Dissertations
1986
Recommended Citation
Tewari, Neeraj, "Bit-sequential VLSI architectures for digital signal processing /" (1986). Theses and Dissertations. 4655.
https://preserve.lehigh.edu/etd/4655
Bit-Sequential VLSI Architectures 
for 
Digital Signal Processing 
by 
Neeraj Tewari 
A Thesis 
Presented to the Graduate Committee 
of Lehigh University 
in Candidacy for the Degree of 
Master of Science 
in
Electrical Engineering 
Lehigh University 
1986 
This thesis is accepted and approved in partial fulfillment of the requirements
for the Degree of Master of Science in Electrical Engineering.
EE Division Chairperson
CSEE Department Chairperson
Acknowledgments 
I am deeply indebted to Dr. M.D. Wagh for both his involved guidance 
and the numerous suggestions he made, without which parts of this thesis would 
have been incomplete. In addition, valuable insight was provided by friends and
fellow students at Lehigh University, and I would like to thank them all. I
would also like to thank the CSEE department at Lehigh University for the
financial support that was provided during my stay here. Finally, I owe 
gratitude to my parents for their constant encouragement over the years. 
Table of Contents 
Abstract 
1. Introduction 
1.1 Signal Processing and Computational Architectures 
1.2 Thesis Outline 
2. Special Purpose VLSI Architectures 
2.1 Architecture design principles 
2.1.1 Balancing computation with I/O
2.1.2 VLSI implementation considerations
2.1.2.1 Modular and repetitive structure 
2.1.2.2 Pipelining modules 
2.1.2.3 Localized data and control flow 
2.1.2.4 Reducing pin count 
2.2 State-of-the-art signal processing algorithms and architectures 
2.3 Advantages of bit-sequential VLSI architectures 
3. Bit-sequential Architectures for Bilinear Algorithms 
3.1 Bit-sequential architectures for finite length algorithms
3.2 Bilinear Algorithms 
3.3 Bit-sequential Arithmetic Modules 
3.3.1 A Proof of the Multiplier Operation
3.4 Architectural Description
3.4.1 Architecture Design 
3.4.2 Architecture Control 
3.5 FFT Butterfly Element design
4. Bit-sequential Systolic Array Architectures 
4.1 Vector-inner product algorithms 
4.2 Design considerations for array processing 
4.3 Finite Impulse Response Filters 
4.3.1 Proof of the FIR filter design 
4.4 Infinite Impulse Response Filters 
4.5 Evaluation of filter designs 
4.6 Pattern Recognition 
4.6.1 Double linear array architecture 
4.6.1.1 Proof of operation 
4.6.2 Single linear array architecture 
4.7 Evaluation of pattern recognition arrays
5. Conclusions 
References 
List of Figures
Figure 2-1: Illustration of multiple use of data points and concurrency in special purpose architectures
Figure 2-2: A comparison of serial and parallel architectures for a hypothetical VLSI system
Figure 3-1: The bit-sequential two's complement adder
Figure 3-2: The bit-sequential two's complement subtractor
Figure 3-3: The bit-sequential two's complement multiplier. Note that if the (2n-1)-bit Q' is obtained by sign extending an n-bit number, then bits q_{n-1} through q_{2n-2} are identical.
Figure 3-4: Enhanced design of the multiplication module.
Figure 3-5: Implementation of two-point convolution
Figure 3-6: Timing diagrams of the control signals in Figure 3-5
Figure 3-7: Bit-sequential architecture for an FFT butterfly element
Figure 4-1: The implementation of (4.3)
Figure 4-2: The implementation of (4.4)
Figure 4-3: Serial-input Sequential-output FIR filter
Figure 4-4: Bit-sequential array element for the filter
Figure 4-5: Bit-sequential array FIR filter
Figure 4-6: Bit-Sequential Systolic Array IIR Filter
Figure 4-7: Array module for the double array design
Figure 4-8: Double array architecture for pattern recognition
Figure 4-9: Array module for the single array design
Figure 4-10: Single array architecture for pattern recognition
Abstract 
With advances in signal processing and lack of adaptability of general pur-
pose machines, designing special purpose architectures has become increasingly 
important. Proposed in this thesis are bit-sequential architectures for signal
processing algorithms that can be expressed either in a bilinear form or as a
vector-inner product of two time dependent vectors. A three level architecture
is presented for bilinear algorithms, and systolic arrays for vector-inner
product algorithms.
These architectures are modular, employ extensive concurrency and pipelin-
ing with localized data and control flow and have a low pin count for input 
and output, thus making them ideally suited for VLSI implementation. The
design philosophy is supported by examples for the FFT butterfly element, FIR
and IIR filters, linear and cyclic convolution and pattern recognition.
Chapter 1 
Introduction 
1.1 Signal Processing and Computational Architectures
Modern signal processing techniques, algorithms and architectures are going
through a major revolution due to the recent advances in VLSI technology.
With more advanced signal processing algorithms there has been a steep increase
in the complexity of computations, processing speed requirements and the
volume of data handled. The current microelectronics technology, which permits
about a million devices per chip and over a million bits per square centimeter
of storage (based on less than half a micron line width), makes it possible,
with the right design, to implement these increasingly sophisticated systems on
VLSI chips.
The quest for efficient signal processing algorithms and their implementation
leads to distributed and parallel processors due to their fast processing
nature. Concurrent processing techniques employing different architectures have
been studied, simulated and implemented by many researchers [1-10]. In
general, these architectures can be broken down into three broad architectural
configurations:
• Pipeline computers
• Array processors
• Multiprocessor systems
A pipeline computer performs overlapped computations to exploit temporal
parallelism, array processors use multiple synchronized arithmetic logic
units to achieve spatial parallelism, and in multiprocessor systems asynchronous
parallelism is obtained through a set of iterative processors with shared resources
[1]. In fact, many modern designs combine these features to maximize the
processing concurrency by pipelining parallel processors. Various communication
strategies, ranging from a time-shared common bus to contention schemes like
CSMA/CD and token ring, and interprocessor connection strategies, ranging from
completely connected networks to serially connected networks [2], have been
proposed and studied. Software packages to solve signal processing problems
have been developed for array processors and vector supercomputers [3-4].
Signal processing, however, places computational demands significantly dif-
ferent from what the above mentioned general purpose architectures were 
designed for. These concurrent architectures were designed with the objective of 
efficiently distributing processes corresponding to single or multiple jobs, the 
motivation primarily being high speed besides better performance and improved 
reliability. As a consequence, considerable effort was spent on dynamic scheduling
of processors under varying load conditions to avoid throughput degradation
at heavy loads. The emphasis was on achieving the best possible average job 
time rather than a maximum worst case time. Signal processing on the other 
hand offers a fixed load and since most applications are real time there is a 
definite upper limit on the time available to finish the operation on the data 
set. Therefore, the traditional design of parallel computers would be unsuitable 
for signal processing due to heavy supervisory overhead incurred by synchronization,
communication and scheduling tasks, which severely hamper the very critical
throughput rate. Moreover, even if signal processing algorithms were
successfully implemented on these machines to do computations in real time, it
would be a definite overkill and an inefficient use of the hardware.
Another significant factor is that even though most signal processing
algorithms are inherently compute bound, input/output (I/O) has to be carefully
handled in order to avoid a possible I/O bound caused by heavy demands
placed on data transfer needed for computations. For example, two of the most
frequently encountered operations are the matrix-vector product and the
convolution function, both of which are compute bound tasks. The former needs
every entry in a column of the matrix to be multiplied by the same entry in the
vector and the latter needs each x_i to be multiplied by each of the k weights.
Efficient computation of a variety of elementary operations, such as
multiplication, vector rotation and trigonometric functions, also needs to be
implemented. In addition, these algorithms exhibit a substantial amount of
recursion, where all processors perform almost identical tasks, repeating the
fixed set of tasks on sequentially available data [6]. This recursive nature of
algorithms tends to increase I/O, and a communication bottleneck may be reached
with a conventional architecture approach. For every operation, at least one or
two operands have to be fetched (or stored) from (or to) memory, so that the
total amount of I/O is proportional to the number of operations rather than the
number of inputs and outputs. Thus, a problem that was compute bound can become
I/O bound during execution. The general purpose distributed processor
architectures do not exploit this repetitive nature of signal processing
algorithms. Therefore, there is a need to:
1. Incorporate parallelism and recursion in algorithms to reduce computational
complexity and to make efficient use of data to lower I/O requirements.
2. Design special purpose architectures to implement these algorithms.
For real-time signal processing systems, special purpose array processors like
systolic arrays and wavefront arrays have been proposed [5-10]. These arrays are,
however, word-oriented architectures and do not exploit pipelining beyond the
word level. The other class of architectures, i.e. the bit-sequential machines,
besides being pipelined down to the bit level can also be designed very efficiently
in VLSI both in terms of architecture and pin requirement [11-14].
1.2 Thesis Outline 
Presented in this thesis are bit-sequential computational architectures for 
different classes of signal processing algorithms. Two classes of algorithms, 
covering most signal processing applications have been considered: 
1. Finite length algorithms, which can be expressed efficiently in a
bilinear form, i.e. an addition stage, followed by a multiplication stage
and a last addition stage. Most of the transforms in signal processing
fall into this class of algorithms, classic examples being the Discrete
Fourier Transform (DFT) and cyclic convolution.
2. Algorithms which can be expressed as an inner product of two vectors,
i.e. a finite length vector sliding over a possibly infinite length vector.
Most of the applications that do not fall in the first class are
covered here. Typical examples of such algorithms are digital filtering,
convolution and pattern recognition.
Chapter 2 reviews architectural design principles for special purpose signal 
processors from a VLSI implementation point of view. Currently existing array
processors have been presented and compared with bit-sequential architectures, 
outlining the merits and demerits of such architectures. 
Chapter 3 deals with algorithm and hardware design principles of bit-
sequential architectures for bilinear algorithms. As a design example, the ar-
chitecture for the Fast Fourier Transform (FFT) butterfly element is given at 
the end of the chapter.
Chapter 4 deals with the class of algorithms that can be expressed as an inner
product of two vectors. Systolic bit-sequential array architectures have been
proposed for digital filtering, convolution and pattern recognition applications.
Chapter 5 concludes this work by presenting the highlights and making
suggestions for future extensions of the material presented.
Chapter 2 
Special Purpose VLSI Architectures 
2.1 Architecture design principles 
Cost-effectiveness is a chief concern in designing special purpose systems;
their cost must be low enough to justify their limited applicability. These costs
can be classified as nonrecurring (design) and recurring (part) costs. Recent
advances in VLSI have significantly reduced the part cost and therefore the design
cost of a special-purpose system must be relatively small for it to be attractive.
Desired speed for different applications can be achieved by a combination of
concurrency and very fast modules. Since the VLSI technology trend offers a
diminishing growth rate for component speed of complex modules, any major
improvement in speed must come from concurrent use of many simple, very fast
processing elements. The technology, however, offers the capability of integrating
these modules on a single chip and it is possible with the right design to
have economical VLSI special purpose architectures for signal processing.
Some design principles for special purpose signal processors are summarized
below:
2.1.1 Balancing computation with I/O
Real time response of a special purpose system can be classified as the
response rate where the system can accept a new input and output a result, at
the I/O bandwidth between the host and itself. As mentioned earlier, signal
processing algorithms are highly recursive and improper use of data can lead to
an I/O bottleneck. However, this recursive nature of algorithms can be
exploited by making repetitive use of each data point and concurrency, thereby
achieving real time response at the expense of increased hardware and more
internal memory in the system. By replacing a single processing element (PE)
with an array of PEs, a higher computation throughput can be achieved without
increasing I/O bandwidth. Information flows in these systems in a pipelined
fashion and communication with the host occurs only at boundary PEs.
The above ideas are illustrated in Fig. 2-1. Suppose that the I/O
bandwidth between the host and the special purpose system is 10 million bytes
per second. Assuming that two bytes are read from or written to the host for
each operation, the maximum rate would be 5 million operations per second
(MOPS), no matter how fast the special purpose system can operate. By multiple
use of each data item the speed can be enhanced to 30 MOPS.
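The arithmetic behind these two figures can be sketched in a few lines of Python. This is only an illustration: the function name `io_bound_rate` and the reuse factor of 6 are assumptions, the latter inferred from the ratio of the 30 MOPS and 5 MOPS figures rather than stated in the text.

```python
def io_bound_rate(bandwidth_bytes_per_s, bytes_per_op, reuse_factor=1):
    """Peak operations per second when each operation needs bytes_per_op
    bytes from the host and every fetched item is used reuse_factor times
    inside the special purpose system (hypothetical model)."""
    return bandwidth_bytes_per_s / bytes_per_op * reuse_factor

# 10 million bytes/s host link, 2 bytes moved per operation.
single_pe = io_bound_rate(10_000_000, 2)                  # no data reuse
pe_array = io_bound_rate(10_000_000, 2, reuse_factor=6)   # assumed reuse of 6

print(single_pe / 1e6, "MOPS")
print(pe_array / 1e6, "MOPS")
```

Under these assumptions the host link is equally loaded in both cases; only the number of operations performed per fetched item changes.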
2.1.2 VLSI implementation considerations
As mentioned in Sec. 2.1, the design cost for these special purpose systems
must be kept small. From VLSI considerations the design cost can be split up
into:
1. logic design time.
2. the layout constraints and the resultant interconnection costs in
terms of time.
In addition, recurring (part) cost, dependent upon the silicon area taken by the
design and the pin count, should be kept low. Design principles to minimize
these costs are listed below:
Figure 2-1: Illustration of multiple use of data points and concurrency
in special purpose architectures
[Diagram labels: INSTEAD OF: MEMORY (100 ns) and a single PE, 5 million operations per second at most. WE HAVE: MEMORY (100 ns) and an array of PEs, 30 MOPS possible.]
2.1.2.1 Modular and repetitive structure 
Architectures that can be decomposed into a few types of simple substructures
or building blocks, which are used repetitively with simple interfaces, can
significantly reduce design time. Such designs are likely to be modular and,
therefore, adjustable to various performance goals. This can be achieved with a
careful redesign and topological mappings from algorithms to architectures [5].
2.1.2.2 Pipelining modules 
Multiple use of each data point and concurrency demand pipelining to optimize
throughput, even though this would happen at the cost of additional
delay between the input and output (latency). Pipelining could be implemented
at all levels of the design, i.e. the system level, the word level and the bit level.
2.1.2.3 Localized data and control flow 
Communications and control become increasingly important for VLSI implementation,
where routing determines cost, power, time and area for the design
[11]. Highly concurrent systems must restrict communications to local processes
in terms of spatial proximity to exploit the fast speeds offered by modular
structure and to reduce interdependence [6]. The reduced interdependence can,
with proper design, lead to distributed simple local control as opposed to the
global synchronization problem that exists in all such processors.
Again, since scaling down a design in VLSI decreases the gate (switching) 
delay whereas the interconnection (communication) delay is left unchanged, the 
speed at which a circuit can operate is eventually dominated by interconnect 
delays rather than device delays[5]. Therefore to permit useful scaling of the 
design, communication and control should be kept local. 
2.1.2.4 Reducing pin count 
To exploit the fast speeds offered by these architectures, the I/O requirements
should be minimized to what can be handled by the electronic packaging
technology, or else efficient I/O port sharing and multiplexing schemes will have
to be designed.
2.2 State-of-the-art signal processing algorithms and architectures 
The current state-of-art in signal processing relies heavily on array proces-
sors. The most popular array architectures studied and implemented are the 
systolic arrays and wavefront arrays[5-13]. These arrays are word-oriented 
processors and can be used for signal and image processing, matrix arithmetic 
and non-numerical applications as well. Both these architectures make multiple
use of each input data item and use extensive concurrency to enhance speed. In
addition, both designs use multiple copies of a few types of simple cells with
only nearest neighbor connections.
In systolic arrays each PE is a Moore machine [13] and operates on a common
global clock (which is the only global communication besides power and
ground). Every PE regularly pumps data in and out, each time performing
some short computation, so that a regular flow of data is kept up in the
array [9]. The PEs are arranged, both in terms of topology and neighbor
connections, such that data flow is regulated in the array to produce the desired
result. For example, it is shown in [8,10] that some basic "inner product" PEs,
each performing the operation Y = Y + A × B, can be locally connected to perform
digital filtering, matrix multiplication and other related operations.
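As a rough functional illustration of how such inner product cells compose into a filter, the sketch below chains one cell per weight to form each output of an FIR filter. It is a minimal sketch only: the names `inner_product_pe` and `fir_with_pe_chain` are hypothetical, and the clocking, pipelining and data movement of a real systolic array are deliberately not modeled.

```python
def inner_product_pe(y, a, b):
    """One processing element: Y = Y + A x B."""
    return y + a * b

def fir_with_pe_chain(weights, samples):
    """Compute FIR outputs by passing each partial sum through one PE
    per weight. Only fully overlapped output positions are produced."""
    k = len(weights)
    outputs = []
    for n in range(k - 1, len(samples)):
        y = 0
        for j, w in enumerate(weights):
            # Cell j holds weight w and sees the sample delayed by j steps.
            y = inner_product_pe(y, w, samples[n - j])
        outputs.append(y)
    return outputs

print(fir_with_pe_chain([1, 2, 3], [1, 0, 0, 1, 2]))  # prints [3, 1, 4]
```

Each weight is used once per output and each sample is used by every cell, which is the multiple use of data that the array designs rely on.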
Wavefront arrays are a variant of systolic arrays and are based on the notion
of a computational wavefront which can be obtained by expressing the algorithm
as a recursion [5,6]. In addition to the data-flow locality of systolic arrays,
wavefront arrays also achieve control-flow locality. The PEs are data
driven, a concept extensively used in data-flow computers [2], and therefore the
array is asynchronous and self-timed, each PE using a simple handshake
protocol with neighboring PEs. Since wavefronts of two recursions never
intersect (Huygens' principle), these wavefronts can be pipelined to achieve
concurrency [5].
2.3 Advantages of bit-sequential VLSI architectures 
While the above mentioned array architectures satisfy the computational
demands and avoid the real-time bottleneck mentioned in Sec. 1.1, they fail to
reduce the data bandwidth requirement at the input and output. This problem
is critical, especially where the number of data points is large. As an example,
even a small 8 by 8 point Fast Fourier Transform (FFT) block, for 16 bit
words, would require 256 input/output pins, which is clearly unsuitable for VLSI
implementations. The solution is then to use a large number of small blocks or
to time multiplex a few blocks. Wavefront arrays use the first technique, whereas
systolic arrays can be designed in either fashion. This, however, leads to a
time-hardware (multiplexing versus parallelism) tradeoff with a conflict between
high speed and small system size.
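The 256-pin figure can be reproduced with a short calculation, assuming it counts 8 input words and 8 output words of 16 bits each, all presented in parallel. That reading, and the helper name `data_pins`, are assumptions made for illustration. For comparison, the sketch also counts one wire per word, which is the bit-serial case.

```python
def data_pins(words_in, words_out, wordlength, bit_serial=False):
    """Data pins needed when every word has its own port.
    A bit-parallel port needs one pin per bit; a bit-serial port needs one."""
    wires_per_word = 1 if bit_serial else wordlength
    return (words_in + words_out) * wires_per_word

parallel_pins = data_pins(8, 8, 16)                   # word-oriented block
serial_pins = data_pins(8, 8, 16, bit_serial=True)    # one wire per word

print(parallel_pins, serial_pins)  # prints 256 16
```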
Bit-sequential architectures offer a way out of this impasse. These architectures
have communication and computation executing simultaneously under a
global bit clock. As a consequence of serial communication, all data paths
within the network will be single wires. In contrast, bit parallel systems would
use buses of N wires, where N is the system wordlength. Bit-sequential
architectures have clear advantages in terms of system size, power and cost, for
VLSI implementation, over the bit parallel systems, without compromising on the
speed. This can be argued as below:
Taking a common (maximum) clock rate (1/T), a sequential computation
unit would take time NT to process each word with 1/N-th of the hardware
needed for parallel systems. The two realizations are externally equivalent when
N serial units are linked (in cascade or in parallel) as in Fig. 2-2 to realize the
same total processing capability as a single parallel unit. The total memory
requirement, fixed by the application, is the same in both cases.
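The equivalence argued above can be checked numerically. The sketch below is a simple model, not taken from the thesis: the function name `words_per_second`, the 10 MHz clock and the 16-bit wordlength are illustrative assumptions.

```python
def words_per_second(clock_hz, wordlength, units, bit_serial):
    """Throughput of `units` identical computation units.
    A parallel unit finishes one word per clock period T;
    a bit-serial unit needs N clock periods (time NT) per word."""
    clocks_per_word = wordlength if bit_serial else 1
    return units * clock_hz / clocks_per_word

N = 16                 # assumed wordlength
CLOCK = 10_000_000     # assumed common clock rate 1/T

one_parallel = words_per_second(CLOCK, N, units=1, bit_serial=False)
n_serial = words_per_second(CLOCK, N, units=N, bit_serial=True)

print(one_parallel == n_serial)  # prints True
```

With each serial unit taking 1/N-th of the hardware, N serial units match both the hardware and the throughput of the single parallel unit under this model.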
Fig. 2-2, however, does bring out several significant points. Firstly, in the
VLSI domain, it is the case in practice that one or more of the bit-sequential
modules will likely fit onto a single chip. The system development cost will
thus reflect the economic advantage of using high volumes of a single device
type. Thus the bit-sequential architectures offer a better structured, lower pin
count solution.
Looking at communication in this system, it can be seen that signals cross 
chip boundaries in the sequential architecture at the external data rate. In 
parallel architectures, the poor partitioning leads to signals crossing chip boun-
daries at the (potentially much faster) computation rate. This leads to a sig-
nificant power advantage to the bit sequential solution. 
All of these architectural advantages pertain to a comparison of the systems
under the condition of equal clock rates. In practice, however, the parallel
system clock must be severely derated to accommodate lengthy carry delays in the
basic arithmetic operations. These delays are not present in bit-sequential
systems because they are naturally pipelined down to the bit level. As a
consequence, the bit-sequential systems can be clocked at close to optimal shift
register rates [14].
Figure 2-2: A comparison of serial and parallel architectures for a
hypothetical VLSI system
[Diagram labels: (a) Serial Architecture and (b) Parallel Architecture, each showing MEMORY, COMPUTATIONS and LSI PARTITIONS.]
Therefore, it is desirable to design bit-sequential special purpose architec-
tures which can be readily implemented in VLSI. 
Presented in the next two chapters are such bit-sequential architectures. 
These architectures exploit the fact that for signal processing one of the 
operands is known or can be precomputed and preprogrammed. The designs are 
modular and repetitive, using identical blocks for such diverse operations as
multiplications and additions and need only a single bit bus for intermodular
communications. The resultant hardware is simpler than that of word oriented 
machines without compromising on the execution time. Further, the bit serial 
nature of the input and output saves on the number of bonding pads and exter-
nal pins, thereby 
Chapter 3 
Bit-sequential Architectures for 
Bilinear Algorithms 
3.1 Bit-sequential architectures for finite length algorithms
This chapter presents a bit-sequential architecture design philosophy for 
digital signal processing algorithms that can be expressed in a finite matrix-
vector product form with known matrix coefficients. Discrete Fourier Transform 
and all other commonly encountered transforms in signal processing fall into this 
category of algorithms. Also, cyclic convolution is another application where 
such architectures can be utilized. These architectures use modular and repeti-
tive blocks which operate synchronously under a global clock. Because these 
systems would be designed specifically for fixed size algorithms, global com-
munication has been permitted with the idea of reducing the hardware com-
plexity and obtaining a compact design. Global communication is however 
limited to the front and the rear of a three level design and is simple enough 
so as not to affect the timing or disrupt the modularity. Reduced I/O requirements
permit single chip implementation.
It can be shown that the hardware complexity of bit-sequential architec-
tures is directly related to the number of multiplications in the algorithm. 
Naturally, before an algorithm is implemented on such an architecture, it should 
be redesigned to reduce multiplications, even at the cost of additions. Such
algorithms are bilinear in nature, i.e., they can be partitioned into three
nontrivial sequential computational stages: additions, multiplications and additions.
Bilinear forms have attracted a great deal of attention [15-23] since it is known
that a good bilinear form optimizes the number of multiplications. Lower
bounds on the number of multiplications for bilinear forms have been studied
and in most cases algorithms achieving those bounds have been obtained.
Finally it should be mentioned that in the case of many signal processing
algorithms, larger length bilinear algorithms can be rather easily obtained from
those of smaller lengths.
Presented first are the features of bilinear forms important for implementation
through bit-sequential architectures. These architectures need serial input -
serial output multipliers, adders and subtractors which can be efficiently
pipelined together. Since the hardware complexity of a multiplier is significantly
more than that of the other two modules, special emphasis needs to be placed
on its design. Such a multiplier for positive n-bit numbers requiring 2n (5,3)
counters and its extension for 2's complement numbers using n (5,3) counters
have been demonstrated earlier [25,26]. The carry-save add-shift (CSAS) structure
of these multipliers eliminates ripple propagation delays and avoids carry
look-ahead hardware present in earlier designs [1]. The design was further
simplified in [27] for signal processing applications using the fact that one of the
operands may be precomputed. That design uses an (n-1)-bit ripple carry adder
along with n XOR gates to provide the n most significant bits in parallel. A
2's complement multiplier, which is strictly serial with simple timing, control
and pipelining capabilities, is presented next, followed by an enhanced version
of this design and the salient features of the proposed architecture.
This architecture is well suited for VLSI implementation of bilinear signal
processing algorithms. An FFT butterfly design illustrative of these concepts is
presented at the end of the chapter.
3.2 Bilinear Algorithms
Let a linear signal processing operation manipulating a length N vector x
to obtain a length N' vector X be expressed as
$$X = Rx \qquad (3.1)$$
A bilinear algorithm to evaluate (3.1) is based on constant matrices A, B, and
C and has the form:
$$X = C\,(Ar \odot Bx) \qquad (3.2)$$
where B and C are of dimensions m × N and N' × m respectively with small
integer entries, A has dimensions m × s, r is a length s vector made up of
independent entries in R, and $\odot$ denotes the component-wise product of vectors.
Assuming that R is known prior to the implementation of the algorithm, Ar could
be precomputed and stored. Neglecting these calculations as well as multiplications
by B and C (since they could be converted to additions because of their
integer entries), the multiplicative complexity of the algorithm is m.
Example 1. To evaluate cyclic convolution of two points
$$X_0 = r_0 x_0 + r_1 x_1, \qquad X_1 = r_1 x_0 + r_0 x_1$$
we may construct a bilinear algorithm based on
$$A = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}, \quad
B = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad
C = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \end{bmatrix}, \quad
r = \begin{bmatrix} r_0 \\ r_1 \\ r_1 \\ r_0 \end{bmatrix}$$
which translates into the following in-line code requiring 4 multiplications and 2
additions.
$$m_i = a_i b_i, \quad i = 1,2,3,4$$
One may also base this calculation on matrices
$$A = \begin{bmatrix} 0.5 & 0.5 \\ 0.5 & -0.5 \end{bmatrix}, \quad
B = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}, \quad
C = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}, \quad
r = \begin{bmatrix} r_0 \\ r_1 \end{bmatrix}$$
and have an equivalent code with only 2 multiplications and 4 additions.
$$a_1 = (r_0 + r_1)/2$$
$$a_2 = (r_0 - r_1)/2$$
$$m_i = a_i b_i, \quad i = 1,2$$
As can be seen from the example, the bilinear algorithm consists of an addition
stage followed by multiplications and subsequently another addition stage. It
is shown in Section 3.4 that the three stages of a bilinear computation interface
naturally with each other in a bit-sequential architecture. The hardware
complexity of such an implementation is mainly determined by the number of
multiplications in the algorithm. Thus the architecture for the second algorithm
listed above will be only half as complex as that for the first one. This
reduction in the number of multiplications may be attributed to the nontrivial
nature of matrix A and a willingness to reduce the number of multiplications even
at the cost of additions.
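Both algorithms of Example 1 can be checked against a direct evaluation of the two-point cyclic convolution. The following Python sketch evaluates the form (3.2) with plain lists; the helper names (`matvec`, `bilinear`, `cyclic_conv2`) and the numeric test values are assumptions for illustration, and the matrices are those read from the example.

```python
def matvec(M, v):
    """Matrix-vector product using plain Python lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def bilinear(A, B, C, r, x):
    """Evaluate X = C (Ar . Bx); also return the multiplication count m."""
    a = matvec(A, r)                       # can be precomputed when R is known
    b = matvec(B, x)                       # first addition stage
    m = [ai * bi for ai, bi in zip(a, b)]  # multiplication stage
    return matvec(C, m), len(m)            # second addition stage

def cyclic_conv2(r, x):
    """Direct two-point cyclic convolution, used as a reference."""
    return [r[0] * x[0] + r[1] * x[1], r[1] * x[0] + r[0] * x[1]]

r0, r1, x = 3, 5, [2, 7]

# First algorithm: 4 multiplications.
A1 = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
B1 = [[1, 0], [0, 1], [1, 0], [0, 1]]
C1 = [[1, 1, 0, 0], [0, 0, 1, 1]]
x_four, mults_four = bilinear(A1, B1, C1, [r0, r1, r1, r0], x)

# Second algorithm: 2 multiplications.
A2 = [[0.5, 0.5], [0.5, -0.5]]
B2 = [[1, 1], [1, -1]]
C2 = [[1, 1], [1, -1]]
x_two, mults_two = bilinear(A2, B2, C2, [r0, r1], x)

print(x_four, mults_four)
print(x_two, mults_two)
print(cyclic_conv2([r0, r1], x))
```

The second call returns floating point values because of the 0.5 entries in A, but they agree with the integer reference for these inputs.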
As has been illustrated by this example, there could be more than one bilinear algorithm to carry out a linear computation. In particular, it is known [16] that any set of matrices A, B and C satisfying

$$R = C \, [\mathrm{dia}(Ar)] \, B \qquad (3.3)$$

can be used to obtain X of equation (3.1). Matrix [dia(Ar)] in eq. (3.3) is a diagonal matrix with vector Ar as its main diagonal. The number of multiplications in the algorithm equals the dimension of this matrix. Thus to implement (3.1) on a minimum complexity bit-sequential architecture, one would try to minimize the dimension of [dia(Ar)] while remaining within the constraints of (3.3).
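Condition (3.3) can be checked numerically for the two algorithms of Example 1. The following is only a sketch: the helper names (`matmul`, `diag`, `bilinear_R`) are illustrative, the matrices of the four-multiplication algorithm are taken as read from the example (A the 4x4 identity, r = [r0, r1, r1, r0]), and the target R is assumed to be the 2x2 cyclic convolution matrix [[r0, r1], [r1, r0]].

```python
# Sketch: verify R = C [dia(A r)] B for both bilinear algorithms of Example 1.
# All helper names are hypothetical; they are not taken from the thesis.

def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def diag(v):
    """Diagonal matrix with vector v on its main diagonal."""
    return [[v[i] if i == j else 0 for j in range(len(v))]
            for i in range(len(v))]

def bilinear_R(A, B, C, r):
    """Return C [dia(A r)] B, the right-hand side of (3.3)."""
    Ar = [sum(a * x for a, x in zip(row, r)) for row in A]
    return matmul(matmul(C, diag(Ar)), B)

r0, r1 = 3, 5  # arbitrary test values

# Four-multiplication algorithm (dimension of dia(Ar) is 4).
A4 = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
B4 = [[1, 0], [0, 1], [1, 0], [0, 1]]
C4 = [[1, 1, 0, 0], [0, 0, 1, 1]]
R4 = bilinear_R(A4, B4, C4, [r0, r1, r1, r0])

# Two-multiplication algorithm (dimension of dia(Ar) is 2).
A2 = [[0.5, 0.5], [0.5, -0.5]]
B2 = [[1, 1], [1, -1]]
C2 = [[1, 1], [1, -1]]
R2 = bilinear_R(A2, B2, C2, [r0, r1])

print(R4, R2)
```

Both calls yield the same matrix, while the diagonal matrix has dimension 4 in the first case and 2 in the second, which is the multiplication count being minimized.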
As has been shown here, a bilinear form computation consists of three stages: addition, multiplication and addition. Thus the entire computation can be carried out through a pipeline architecture based upon adder and multiplier modules. Further, all the multiplications in the second stage are independent of each other and can therefore be done simultaneously. As shown later, the adder complexity is less than the multiplier complexity by a factor equal to the word length. Thus, at the cost of a small increase in the number of additions, one can afford to make all the addition stage outputs independent of each other. This implies a possibility of parallel computation of all the outputs of the adder stages, thus minimizing the total pipeline length.
The features of bilinear algorithms presented here are exploited by the architecture of Section 3.4.
3.3 Bit-sequential Arithmetic Modules 
As has been shown in Section 3.2, a bilinear algorithm requires only additions (or subtractions) and multiplications. Bit-sequential modules for these operations are obtained in this section. These designs are based on two's complement fixed point arithmetic and assume that the operands are so scaled that no overflow occurs.
The adder module as shown in Figure 3-1 is essentially a latched full adder with the output carry fed back to the same adder as a carry input. The latches storing the sum and the carry are referred to as the S-latch and the C-latch respectively and have both reset and preset capabilities. The sum of two numbers P and Q can be obtained from the adder module by resetting both the S- and C-latches and then feeding the two numbers synchronously in a bit-serial fashion at the inputs of the module. If the input strings are fed with the Least Significant Bit (LSB) first, then the output can be shown to be a binary string corresponding to the sum of P and Q with its LSB coming out first.
Subtraction can be done by the same module with the addition of an inverter at the subtrahend input. If the C-latch is preset, the S-latch reset and the binary sequences corresponding to P and Q are applied to the modified module as before, then the output is the difference of P and Q. The two's complement subtractor module is shown in Figure 3-2.
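The behaviour of the adder and subtractor modules can be sketched in software. This is a minimal model under stated assumptions: the function names (`to_bits`, `from_bits`, `serial_add`) are illustrative, the one-clock latency of the S-latch is not modelled, and the operands are assumed to be scaled so that no overflow occurs.

```python
# Sketch of the bit-sequential adder/subtractor (LSB first).
# Function names are hypothetical; only the bit-level behaviour follows the text.

def to_bits(value, width):
    """Two's complement bits of value, least significant bit first."""
    return [(value >> k) & 1 for k in range(width)]

def from_bits(bits):
    """Interpret an LSB-first bit list as a two's complement integer."""
    raw = sum(b << k for k, b in enumerate(bits))
    return raw - (1 << len(bits)) if bits[-1] else raw

def serial_add(p_bits, q_bits, subtract=False):
    """One full adder with its carry fed back through the C-latch."""
    carry = 1 if subtract else 0      # C-latch preset for subtraction, reset for addition
    out = []
    for p, q in zip(p_bits, q_bits):  # one bit pair per clock
        if subtract:
            q ^= 1                    # inverter at the subtrahend input
        out.append(p ^ q ^ carry)     # sum bit, stored in the S-latch
        carry = (p & q) | (p & carry) | (q & carry)
    return out

total = from_bits(serial_add(to_bits(5, 8), to_bits(-3, 8)))
difference = from_bits(serial_add(to_bits(-7, 8), to_bits(6, 8), subtract=True))
print(total, difference)
```

Presetting the carry and inverting Q is the usual two's complement identity P - Q = P + (not Q) + 1, applied one bit per clock.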
The multiplier is the workhorse of the processor and its simplicity for bit-
sequential processing is what makes the architecture so desirable. Recall that in 
the case of bilinear algorithms for signal processing, one of the operands in 
every multiplication is precomputed and known while designing the architecture. 
Figure 3-1: The bit-sequential two's complement adder

Figure 3-2: The bit-sequential two's complement subtractor
These known operands can be programmed or hardcoded into the architecture 
depending on the particular application. 
Let the multiplier, P, and the multiplicand, Q, be expressed in n-bit two's complement notation. It is now shown that if the two numbers are sign extended to 2n-1 bits and the least significant 2n-1 bits of the product of these binary strings are considered, then one gets the product PQ in its (2n-1)-bit two's complement form. It can be easily verified that the sign extension to 2n-1 bits implies the representation of the numbers in (2n-1)-bit two's complement notation. Thus in the extended notation, a negative number of magnitude P will be represented by a string of bits whose binary value is $P' = 2^{2n-1} - P$, whereas a positive P will have a string value $P' = P$. In the expressions below, P and Q denote the magnitudes of the operands.
To verify that the product of these binary strings P' and Q', truncated to 2n-1 bits, is indeed the desired result, three cases need to be considered.
If both P and Q are positive, then P'Q' is a string of fewer than 2n-1 bits, no truncation occurs, and hence $P'Q' = PQ$.
If only one of the operands, say P, is negative, then

$$P'Q' = (2^{2n-1} - P) \cdot Q \mod 2^{2n-1}. \qquad (3.4)$$

The modulo operation in (3.4) results from the truncation of the product to its 2n-1 least significant bits. If Q is 0, then from (3.4), $P'Q' = 0$ as expected. Otherwise, rearranging the terms of (3.4), one gets

$$\begin{aligned} P'Q' &= (2^{2n-1} - PQ) + 2^{2n-1}(Q - 1) \mod 2^{2n-1} \\ &= (2^{2n-1} - PQ) \mod 2^{2n-1}, \end{aligned} \qquad (3.5)$$

which is the (2n-1)-bit representation of the negative number $-PQ$.
Finally, if both P and Q are negative, then

$$\begin{aligned} P'Q' &= (2^{2n-1} - P)(2^{2n-1} - Q) \mod 2^{2n-1} \\ &= PQ + 2^{2n-1}(2^{2n-1} - P - Q) \mod 2^{2n-1} \\ &= PQ, \end{aligned} \qquad (3.6)$$

which is the correct representation of the positive number PQ.
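The three cases above can be checked exhaustively for a small word length. The sketch below assumes n = 4 and uses illustrative function names (`sign_extended_string`, `truncated_product`). It excludes the single pair P = Q = -2^(n-1), whose product does not fit in 2n-1 bits, in line with the assumption that operands are scaled to avoid overflow.

```python
# Sketch: the low 2n-1 bits of P'Q' are the two's complement product PQ.
# Function names are hypothetical.

def sign_extended_string(value, n):
    """Unsigned value of the (2n-1)-bit sign extension of an n-bit number."""
    return value % (1 << (2 * n - 1))

def truncated_product(p, q, n):
    """Multiply the two extended strings, keep 2n-1 bits, read as two's complement."""
    width = 2 * n - 1
    d = (sign_extended_string(p, n) * sign_extended_string(q, n)) % (1 << width)
    return d - (1 << width) if d >> (width - 1) else d

n = 4
low, high = -(1 << (n - 1)), (1 << (n - 1)) - 1
mismatches = [(p, q)
              for p in range(low, high + 1)
              for q in range(low, high + 1)
              if (p, q) != (low, low) and truncated_product(p, q, n) != p * q]
print(mismatches)
```

An empty list of mismatches covers the positive, mixed-sign and negative cases of (3.4) to (3.6) for this word length.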
This analysis, and the fact that one of the operands is known prior to the computation, leads to the multiplication module shown in Figure 3-3. Note that the (2n-1)-bit sign extended precomputed operand is stored in a register (or is hardwired) and is available in parallel to the full adders. After resetting all the C- and S-latches, the other (2n-1)-bit sign extended operand is fed in serially (LSB first). The (2n-1)-bit output is also obtained as a binary string with its LSB emerging first and is delayed with respect to the input string by exactly one clock cycle. This module is adapted from the usual concept of multiplication involving repeated additions of shifted versions of the multiplicand. However, since the output is now generated bit-serially, one need not complete each addition with carry propagation from one end to the other. The module of Figure 3-3 thus stores all the partial sums as well as all the carries at each clock. A proof that the module would indeed calculate the product of the two (2n-1)-bit strings in 2n-1 clocks is presented below.
3.3.1 A Proof of the Multiplier Operation 
This section shows that the multiplication module of Figure 3-3 indeed 
produces the binary string that corresponds to the product of operands P' (fed 
serially) and Q' (fed parallelly). 
The inputs and outputs of the i-th ($0 \le i \le 2n-2$) full adder of Figure 3-3 at the end of the j-th ($0 \le j \le 2n-1$) clock cycle can be defined as follows. The inputs to the adder are denoted by P(i, j) and Q(i, j), the carry input by Z(i, j), and the outputs of the S- and C-latches by S(i, j) and C(i, j) respectively. Note that in this notation, the rightmost adder index is 0 and the indices increase from right to left. Let the (2n-1)-bit strings P' and Q' and their product string truncated to (2n-1) bits, D, be

$$P' = [\, p_{2n-2} \;\; p_{2n-3} \;\; \cdots \;\; p_0 \,]$$
$$Q' = [\, q_{2n-2} \;\; q_{2n-3} \;\; \cdots \;\; q_0 \,]$$
$$D = [\, d_{2n-2} \;\; d_{2n-3} \;\; \cdots \;\; d_0 \,]$$

It is then required to show that

$$S(0, j+1) = d_j, \quad 0 \le j \le 2n-2. \qquad (3.7)$$

Figure 3-3: The bit-sequential two's complement multiplier. Note that if the (2n-1)-bit Q' is obtained by sign extending an n-bit number, then bits $q_{n-1}$ through $q_{2n-2}$ are identical.

The operation of the i-th adder can be characterized by the following equations:

$$S(i, 0) = 0, \qquad C(i, 0) = 0,$$

and for $1 \le j \le 2n-1$,

$$\begin{aligned} S(i, j) &= P(i, j-1) \oplus Q(i, j-1) \oplus Z(i, j-1), \text{ and} \\ C(i, j) &= [P(i, j-1) \wedge Q(i, j-1)] \vee [\bar{S}(i, j) \wedge Z(i, j-1)]. \end{aligned} \qquad (3.8)$$

Note that the symbols $\oplus$, $\wedge$ and $\vee$ in (3.8) denote the logical operations of XOR, AND and OR respectively and $\bar{S}$ represents the complement of S.
The interconnections of the full adders shown in Figure 3-3 imply the following relations between the variables. For $0 \le i \le 2n-2$ and $0 \le j \le 2n-1$,

$$\begin{aligned} P(i, j) &= S(i+1, j) \quad \text{if } i \ne 2n-2, \\ P(2n-2, j) &= 0, \\ Q(i, j) &= q_i \cdot p_j, \text{ and} \\ Z(i, j) &= C(i, j). \end{aligned} \qquad (3.9)$$
Visualize the multiplication of P' and Q' as a repeated addition of the shifted versions of the precomputed operand Q'. Let E(k) denote the partial sum thus obtained at the k-th step. Then

$$E(k) = \sum_{j=0}^{k-1} Q' \, p_j \, 2^j, \quad 1 \le k \le 2n-1. \qquad (3.10)$$

Clearly, $E(2n-1) = P'Q'$ and therefore,

$$E(2n-1) \bmod 2^{2n-1} = P'Q' \bmod 2^{2n-1} = D. \qquad (3.11)$$

Define a function F(k) related to the sum and carry bits produced by the circuit of Figure 3-3 as

$$F(k) = \sum_{i=0}^{2n-2} [2C(i, k) + S(i, k)] \, 2^{i+k-1} + \sum_{i=1}^{k-1} S(0, i) \, 2^{i-1}. \qquad (3.12)$$

We then have the following relation.
Lemma 1. $E(k) = F(k)$, $1 \le k \le 2n-1$.
Proof. Supplied later.
Using Lemma 1, we prove (3.7) as the following theorem.
Theorem 1. $S(0, j+1) = d_j$, $0 \le j \le 2n-2$.
Proof. From Lemma 1 and the definition equation (3.12) of F(k), one gets

$$\begin{aligned} E(2n-1) &= F(2n-1) \\ &= 2^{2n-1} \sum_{i=0}^{2n-2} C(i, 2n-1) \, 2^i + 2^{2n-2} \sum_{i=0}^{2n-2} S(i, 2n-1) \, 2^i + \sum_{i=1}^{2n-2} S(0, i) \, 2^{i-1} \\ &= 2^{2n-1} \sum_{i=0}^{2n-2} C(i, 2n-1) \, 2^i + 2^{2n-1} \sum_{i=1}^{2n-1} S(i, 2n-1) \, 2^{i-1} + \sum_{i=1}^{2n-1} S(0, i) \, 2^{i-1} \end{aligned}$$

Substitution of this value of E(2n-1) in equation (3.11) and use of the modulus function now gives

$$\sum_{i=1}^{2n-1} S(0, i) \, 2^{i-1} = D$$

or equivalently,

$$\sum_{i=0}^{2n-2} S(0, i+1) \, 2^i = \sum_{i=0}^{2n-2} d_i \, 2^i.$$

Comparing terms on the two sides of this expression,

$$S(0, i+1) = d_i, \quad 0 \le i \le 2n-2. \qquad \text{Q.E.D.}$$

Theorem 1 proves that the output S(0, i+1) of the multiplier module generates successive bits (beginning with the LSB) of the product string D. Note that it also shows that there is exactly one unit delay between the input strings and the output strings. We now give a proof of Lemma 1 to complete the proof of this theorem.
Proof of Lemma 1. (By Mathematical Induction) Equation (3.12) gives

$$F(1) = \sum_{i=0}^{2n-2} [2C(i, 1) + S(i, 1)] \, 2^i.$$

But from (3.8) and (3.9),

$$C(i, 1) = 0 \quad \text{and} \quad S(i, 1) = q_i \cdot p_0.$$

Therefore,

$$F(1) = \sum_{i=0}^{2n-2} q_i \, p_0 \, 2^i = E(1), \quad \text{from (3.10)}. \qquad (3.13)$$

Thus Lemma 1 is true for k = 1. We now show that if $F(j) = E(j)$, then $F(j+1) = E(j+1)$.
From (3.10) one gets,

$$E(j+1) = E(j) + \sum_{i=0}^{2n-2} q_i \, p_j \, 2^{i+j}. \qquad (3.14)$$

Since E(j) is assumed to be equal to F(j), one may use (3.12) to obtain its value. Substituting it in (3.14) one gets,

$$\begin{aligned} E(j+1) &= \sum_{i=0}^{2n-2} [2C(i, j) + S(i, j)] \, 2^{i+j-1} + \sum_{i=1}^{j-1} S(0, i) \, 2^{i-1} + \sum_{i=0}^{2n-2} q_i \, p_j \, 2^{j+i} \\ &= \sum_{i=0}^{2n-2} [2C(i, j) + S(i, j)] \, 2^{i+j-1} - S(0, j) \, 2^{j-1} + \sum_{i=1}^{j} S(0, i) \, 2^{i-1} + \sum_{i=0}^{2n-2} Q(i, j) \, 2^{i+j} \\ &= M(i, j) + \sum_{i=1}^{j} S(0, i) \, 2^{i-1} - S(0, j) \, 2^{j-1} \end{aligned} \qquad (3.15)$$

where,

$$M(i, j) = \sum_{i=0}^{2n-2} [2C(i, j) + S(i, j)] \, 2^{i+j-1} + \sum_{i=0}^{2n-2} Q(i, j) \, 2^{i+j}. \qquad (3.16)$$

(3.8) and (3.9) can be combined to get,

$$\begin{aligned} C(i, j+1) &= [S(i+1, j) \wedge Q(i, j)] \vee [\bar{S}(i, j+1) \wedge C(i, j)] \text{ and} \\ S(i, j+1) &= S(i+1, j) \oplus Q(i, j) \oplus C(i, j). \end{aligned} \qquad (3.17)$$

Since all the logical variables in (3.17) have binary values 0 or 1, it is easy to verify that if they satisfy (3.17), they also satisfy the following numerical equation

$$C(i, j) + S(i+1, j) + Q(i, j) = 2C(i, j+1) + S(i, j+1). \qquad (3.18)$$

M(i, j) can then be written as,

$$\begin{aligned} M(i, j) &= \sum_{i=0}^{2n-2} C(i, j) \, 2^{i+j} + \sum_{i=0}^{2n-2} S(i, j) \, 2^{i+j-1} + \sum_{i=0}^{2n-2} Q(i, j) \, 2^{i+j} \\ &= \sum_{i=0}^{2n-2} C(i, j) \, 2^{i+j} + S(0, j) \, 2^{j-1} + \sum_{i=1}^{2n-2} S(i, j) \, 2^{i+j-1} + \sum_{i=0}^{2n-2} Q(i, j) \, 2^{i+j} \\ &= C(2n-2, j) \, 2^{2n+j-2} + \sum_{i=0}^{2n-3} C(i, j) \, 2^{i+j} + S(0, j) \, 2^{j-1} + \sum_{i=0}^{2n-3} S(i+1, j) \, 2^{i+j} + \sum_{i=0}^{2n-2} Q(i, j) \, 2^{i+j} \\ &= \sum_{i=0}^{2n-3} [C(i, j) + S(i+1, j) + Q(i, j)] \, 2^{i+j} + S(0, j) \, 2^{j-1} + C(2n-2, j) \, 2^{2n-2+j} + Q(2n-2, j) \, 2^{2n-2+j} \end{aligned}$$

Using (3.18), this now yields

$$M(i, j) = \sum_{i=0}^{2n-3} [2C(i, j+1) + S(i, j+1)] \, 2^{i+j} + S(0, j) \, 2^{j-1} + L(i, j) \qquad (3.19)$$

where,

$$L(i, j) = C(2n-2, j) \, 2^{2n-2+j} + Q(2n-2, j) \, 2^{2n-2+j}. \qquad (3.20)$$

But from (3.9), $S(2n-1, j) = P(2n-2, j) = 0$. Adding this variable to (3.20) and using (3.18), one gets

$$\begin{aligned} L(i, j) &= C(2n-2, j) \, 2^{2n-2+j} + S(2n-1, j) \, 2^{2n-2+j} + Q(2n-2, j) \, 2^{2n-2+j} \\ &= [2C(2n-2, j+1) + S(2n-2, j+1)] \, 2^{2n-2+j}. \end{aligned}$$

Using (3.15) and (3.19), E(j+1) can now be written as

$$\begin{aligned} E(j+1) &= \sum_{i=0}^{2n-3} [2C(i, j+1) + S(i, j+1)] \, 2^{i+j} + \sum_{i=1}^{j} S(0, i) \, 2^{i-1} - S(0, j) \, 2^{j-1} + S(0, j) \, 2^{j-1} + L(i, j) \\ &= \sum_{i=0}^{2n-2} [2C(i, j+1) + S(i, j+1)] \, 2^{i+j} + \sum_{i=1}^{j} S(0, i) \, 2^{i-1} \\ &= F(j+1) \quad \text{(from (3.12))}. \end{aligned}$$

Hence by induction,

$$F(k) = E(k), \quad 1 \le k \le 2n-1.$$

Q.E.D.
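Theorem 1 and Lemma 1 can also be checked by simulating the latch equations directly. The following sketch assumes the update rules (3.8) and the interconnections (3.9) as read above. The function name `serial_multiply` and its return convention are illustrative, and the majority form of the carry is used, which agrees with the carry expression of (3.8) for binary inputs.

```python
# Sketch: simulate the carry-save multiplier and check E(k) = F(k) at every clock.
# Names are hypothetical; p_str and q_str are the unsigned values of P' and Q'.

def serial_multiply(p_str, q_str, width):
    """Return (product string D as an integer, True if Lemma 1 held at every clock)."""
    q = [(q_str >> i) & 1 for i in range(width)]
    S = [0] * width                  # S-latches, reset before the computation
    C = [0] * width                  # C-latches, reset before the computation
    out = []                         # out[j] = S(0, j+1)
    E = 0                            # running partial sum E(k) of (3.10)
    lemma_holds = True
    for j in range(width):
        p_j = (p_str >> j) & 1       # serial input bit, LSB first
        new_S, new_C = [], []
        for i in range(width):
            P = S[i + 1] if i < width - 1 else 0   # sum from the left neighbour
            Q = q[i] & p_j                         # AND gate on the stored bit
            Z = C[i]                               # carry fed back from the C-latch
            new_S.append(P ^ Q ^ Z)
            new_C.append((P & Q) | (P & Z) | (Q & Z))
        S, C = new_S, new_C          # all latches are clocked together
        out.append(S[0])
        k = j + 1
        E += (q_str * p_j) << j
        F = (sum((2 * C[i] + S[i]) << (i + k - 1) for i in range(width))
             + sum(out[m] << m for m in range(k - 1)))
        lemma_holds = lemma_holds and (E == F)
    return sum(b << m for m, b in enumerate(out)), lemma_holds

n = 4
width = 2 * n - 1
mask = (1 << width) - 1
failures = []
for p in range(-8, 8):
    for q in range(-8, 8):
        d, ok = serial_multiply(p & mask, q & mask, width)
        if not ok or d != (p * q) & mask:
            failures.append((p, q))
print(failures)
```

For n = 4 the list of failures is empty: each output bit S(0, j+1) equals d_j, and E(k) = F(k) holds at every clock.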
This use of the carry-save technique has implications in both hardware and time savings. Since the architecture is pipelined, as explained in Section 3.4, the slowest module in the pipeline determines the throughput. The bit rate of the multiplier of Figure 3-3 is determined by a single synchronous full adder and thus it can accept and generate bits at the same rate as the modules of Figures 3-1 and 3-2. All three modules described here therefore have the same speed and can be pipelined and driven by the same clock without any loss of speed.
The principles of a complete bit-sequential architecture using these modules are presented in the next section.
3.4 Architectural Description 
3.4.1 Architecture Design 
Recall from Section 3.2 that the bilinear computation is made up of three sequential stages. The adders and multipliers used in these stages accept and produce sequences bit-serially (LSB first) and therefore can be interconnected to form a computational network to implement the bilinear algorithm.
By exploiting the inherent nature of the computations involved, the complexity of the multiplier can be reduced by a factor of two to simplify the architectural design and implementation. To achieve this, note that both the operands involved in the multiplication are sign extended from n bits to (2n-1) bits and one of them, Q', is precomputed and made available in parallel to the circuit. If Q' is positive, its sign bit $q_{n-1}$ and the extended bits $q_n$ to $q_{2n-2}$ are all zeroes. Since these are ANDed with the incoming bits, the leftmost n adders in Figure 3-3 are fed with zeroes and hence can be effectively ignored.
The effects of the sign bit of Q' can be easily accounted for by an Exclusive-OR (XOR) gate and a latched full adder, leading to the multiplier design shown in Figure 3-4. The operand Q' in this modified scheme is stored in sign-magnitude form, while the serial input is still a sign extended (2n-1)-bit two's complement number. The top block of (n-1) adders behaves as a magnitude multiplier, multiplying the serial input by the absolute value of the stored operand. The sign 1 or 0 of Q' presets or resets the C-latch of the bottom level adder. For positive Q', the last adder does not affect the product evaluated by the top level adders, but for negative Q', the magnitude product is two's complemented by the combination of the XOR and the carry preset. This multiplier is capable of multiplying two n-bit two's complement numbers in a bit-serial fashion to yield a (2n-1)-bit serial output, using n latched full adders, (n-1) AND gates and an XOR gate. Its hardware complexity is thus O(n) where n is the word length.

Figure 3-4: Enhanced design of the multiplication module.
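The sign handling of the enhanced multiplier can be sketched as follows. The model is behavioural only: the magnitude product is computed with ordinary integer arithmetic in place of the (n-1) carry-save adders, and the names `serial_negate_if` and `sign_magnitude_multiply` are illustrative. What it shows is the XOR gate plus carry preset acting as a bit-serial two's complementer.

```python
# Sketch: XOR gate plus carry preset as a conditional bit-serial negation.
# Names are hypothetical; the magnitude multiplier is modelled arithmetically.

def serial_negate_if(bits, sign):
    """XOR every LSB-first bit with `sign`; add `sign` via the preset C-latch."""
    carry = sign                  # C-latch preset (sign = 1) or reset (sign = 0)
    out = []
    for b in bits:
        x = b ^ sign              # XOR gate driven by the stored sign bit
        out.append(x ^ carry)     # last-level adder, second input tied to 0
        carry = x & carry
    return out

def sign_magnitude_multiply(p, q, n):
    """Product of n-bit p (serial, two's complement) and q (stored sign-magnitude)."""
    width = 2 * n - 1
    mask = (1 << width) - 1
    sign = 1 if q < 0 else 0
    magnitude_product = ((p & mask) * abs(q)) & mask   # output of the top adder block
    bits = [(magnitude_product >> k) & 1 for k in range(width)]
    return sum(b << k for k, b in enumerate(serial_negate_if(bits, sign)))

n = 4
mask = (1 << (2 * n - 1)) - 1
bad = [(p, q) for p in range(-8, 8) for q in range(-7, 8)
       if sign_magnitude_multiply(p, q, n) != (p * q) & mask]
print(bad)
```

The stored operand is restricted to -7..7 here because an n-bit sign-magnitude word has only n-1 magnitude bits.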
The architecture design is intimately related to the structure of the bilinear algorithm. In particular, matrix B of the algorithm is implemented through the adder and subtractor modules of Section 3.3 and constitutes the first stage of the architecture. The second stage uses the required number of multipliers presented here, working in parallel. Finally, the third stage, designed from matrix C, uses the adder/subtractor modules and computes all the output points in parallel. The implementation of the two-multiplication algorithm of Example 1 is shown in Figure 3-5. It uses (2n+4) latched full adders and (2n+2) gates besides the control circuitry based on a 2n-bit ring counter as described in Section 3.4.2. The multipliers need a (2n-1)-bit sign-extended input, which can be easily obtained by suppressing the clock to the first addition stage after n clock cycles.
3.4.2 Architecture Control 
The architecture is efficiently pipelined since each segment, a latched adder with a gate, has the same delay. The system clock period is given by

$$T_c = t_{gate} + t_{full\,adder} + t_{latch} + t_{setup}$$

Typical values of these quantities (for a fanout of 4) in 2µ CMOS VLSI technology are [24] $t_{gate}$ = 2.8 nsec., $t_{full\,adder}$ = 5.8 nsec., $t_{latch}$ = 4.4 nsec. and $t_{setup}$ = 1.5 nsec., which yields $T_c$ = 14.5 nsec. Therefore it is conceivable to use a clock of period 20 nsec., creating output bits at 50 MHz. Based on this assumption, one can obtain the entire set of 32-bit output words at a frequency greater than 1.5 MHz.

Figure 3-5: Implementation of two-point convolution
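The timing figures quoted above follow from simple arithmetic, reproduced in the short sketch below. The variable names are illustrative, and the 32-clock word time is taken from the 32-bit output word length assumed in the text.

```python
# Sketch: clock period and output word rate from the quoted delays (in nsec).
# Variable names are hypothetical.

t_gate, t_full_adder, t_latch, t_setup = 2.8, 5.8, 4.4, 1.5

t_c = t_gate + t_full_adder + t_latch + t_setup   # minimum clock period
clock_period = 20.0                               # chosen with margin over t_c
bit_rate_mhz = 1e3 / clock_period                 # one output bit per clock
word_rate_mhz = bit_rate_mhz / 32                 # 32 clocks per 32-bit output word

print(round(t_c, 1), bit_rate_mhz, word_rate_mhz)
```

The result is a bit rate of 50 MHz and a word rate of about 1.56 MHz, consistent with the "greater than 1.5 MHz" figure.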
The delay between the input and the output can be computed by noting that the adder and multiplier modules introduce 1 and 3 clock delays respectively and that many modules of the same type work in parallel. When the number of input/output data points, N, is large, the adders tend to be organized in $O(\log_2 N)$ level trees and the net delay of the pipeline is $O(\log_2 N)$.
Since the delays in any data path accumulate, one should ensure that the data entering each module is properly synchronized. It might sometimes be necessary to delay data by a fixed number of cycles in order to compensate for the difference in lengths of two combining paths.
The control of this architecture is simple and uses a 2n-bit ring counter. New n-bit data sets are scheduled to enter the architecture every 2n clocks and to produce (2n-1)-bit output words after the pipeline delay. This provides one clock period to initialize the latches for every new data set. These reset/preset signals are generated by a single 1 rippling through the ring counter. Figure 3-6 shows the timing for the architecture of Figure 3-5 generated by the ring counter. Note that the same clock is applied to all except the first level latches. The clock to the first level is blocked after the first n cycles to create the sign extension. The control signals (reset or preset) for all the modules at a given level are synchronous and are delayed by one clock from those of the previous level.
This simple control strategy adds to the charm of these bit-sequential architectures.
Figure 3-6: Timing diagrams of the control signals in Figure 3-5
(Signals shown: system clock, level 1 clock with the clock absent for (n-1) cycles to produce the sign extension, level 1 reset and request for input pulse, R1, R2, R3, last level reset/preset, and the output bits available in LSB first format after the pipeline delay.)
3.5 FFT Butterfly Element design
The concepts in the previous sections are illustrated through the design of the butterfly element (BE), which is characterized by the operation [16]

$$\begin{bmatrix} Y \\ Z \end{bmatrix} = \begin{bmatrix} 1 & a \\ 1 & -a \end{bmatrix} \begin{bmatrix} y \\ z \end{bmatrix}$$

where a is a complex constant, representing some power of the N-th primitive root of unity for an N-point FFT. Y and Z represent the complex outputs for the complex input points y and z of the BE. Let subscripts r and i represent the real and imaginary parts of these variables. An efficient three-multiplication bilinear algorithm for the above computation can be written in the bilinear form of Section 3.2, in which the outputs $[Y_r, Y_i, Z_r, Z_i]^T$ are obtained by applying C to the element-wise products of Ar and Bx, where,

$$Bx = [\, z_r + z_i, \;\; z_i, \;\; z_r, \;\; y_r, \;\; y_i \,]^T$$
$$Ar = [\, a_r, \;\; -a_i - a_r, \;\; a_i - a_r, \;\; 1, \;\; 1 \,]^T$$

and

$$C = \begin{bmatrix} 1 & 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 & 1 \\ -1 & -1 & 0 & 1 & 0 \\ -1 & 0 & -1 & 0 & 1 \end{bmatrix}$$

This algorithm can be implemented in the manner described in the last section. The multiplicands for the three multiplications, $a_r$, $-a_i - a_r$ and $a_i - a_r$, can be precomputed and stored in sign-magnitude form.
The implementation is shown in Figure 3-7 for an input word length of four. Dotted lines in this figure indicate preset/reset control by the sign bit at the proper clock cycle. Note that the two subtractor modules at the final stage need preset control, whereas all other addition modules need reset control.
Delay elements are needed for synchronization as demonstrated in this design. These ensure that at any level the proper bits are being added, subtracted or multiplied. The reset/preset pulse of these delay elements is identical to that of the other blocks at the same level. Since the first level blocks are clocked only for the first n clock cycles, the design ensures that $x_r$ and $x_i$ are properly sign extended before being added to data from the other path. Also, note that though Ar has five elements, the effective number of multiplications is only three since multiplication by unity can be implemented through a delay element in the architecture.
The delay of the pipeline is six clock cycles and is independent of the word length. The processing time of this BE for an output word length of 32 bits and a clock rate of 50 MHz (obtained in Section 3.4.2) is only 640 nsec. Note that new data sets may enter the architecture every 2n clocks, ensuring the most efficient utilization of the entire hardware. This facilitates easy pipelining of these BEs into large meaningful FFT processors. As can be seen from Figure 3-7, the architecture uses only 3n + 7 latched adders and a few gates and latches, for n-bit input words.
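The three-multiplication butterfly can be compared against direct complex arithmetic. The sketch below assumes the vectors Bx = [z_r + z_i, z_i, z_r, y_r, y_i] and Ar = [a_r, -a_i - a_r, a_i - a_r, 1, 1] together with the 4x5 matrix C as read from this section. The function name `butterfly_three_mult` is illustrative.

```python
# Sketch: butterfly Y = y + a*z, Z = y - a*z with three real multiplications.
# The function name is hypothetical; the two unit coefficients are plain delays.

def butterfly_three_mult(y, z, a):
    yr, yi = y.real, y.imag
    zr, zi = z.real, z.imag
    ar, ai = a.real, a.imag
    b = [zr + zi, zi, zr, yr, yi]            # first addition stage (Bx)
    c = [ar, -ai - ar, ai - ar, 1, 1]        # precomputed coefficients (Ar)
    m = [ci * bi for ci, bi in zip(c, b)]    # multiplication stage
    Yr = m[0] + m[1] + m[3]                  # rows of C applied to m
    Yi = m[0] + m[2] + m[4]
    Zr = -m[0] - m[1] + m[3]
    Zi = -m[0] - m[2] + m[4]
    return complex(Yr, Yi), complex(Zr, Zi)

y, z, a = 3 + 4j, 1 - 2j, 2 + 5j
Y, Z = butterfly_three_mult(y, z, a)
print(Y, Z, y + a * z, y - a * z)
```

Only m[0], m[1] and m[2] involve non-trivial coefficients, which is why the hardware needs three multiplier modules.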
One significant aspect of such a design is that all the outputs are computed simultaneously and therefore, the time it takes to compute one output word is the time it takes for the FFT computation. Serial outputs ensure that implementing such a processor is feasible even for large length problems. This is in contrast to the word oriented machines where there is a time-hardware tradeoff for large size problems, because of infeasibility of implementation. Also these numbers show that the architecture described here achieves as much speed as the state of the art processors with relatively little hardware.

Figure 3-7: Bit-sequential architecture for an FFT butterfly element
Chapter 4 
Bit-sequential Systolic Array Architectures 
4.1 Vector-inner product algorithms 
Signal processing algorithms that can be represented as an inner product of a finite vector with time shifted versions of another finite or infinite vector constitute an important class of algorithms. A large number of important signal processing algorithms fall in this category, the more commonly used ones being digital filtering, linear and cyclic convolution and pattern matching algorithms. Since inner product calculation demands several arithmetic or logical operations to be performed simultaneously (multiplications and additions in digital filtering, bit comparisons in pattern recognition), various architectures incorporating parallelism can be constructed for the same. Efficient implementation of inner product algorithms is, therefore, still a major area of current research in VLSI and researchers have proposed varied architectures for the same [9,10,30-33]. These designs have however been proposed for word oriented machines. With the advantages of bit-sequential processing mentioned earlier, it is of value to generalize these designs to be able to input and process a data stream in a bit-sequential fashion.
Outlined in the following sections are the design considerations and methodology for architectures implementing bit-sequential vector processing. Numerical processing for filtering and convolution as well as non-numerical processing for pattern recognition has been considered. The algorithms to facilitate bit-processing are obtained and architectures implementing these modified algorithms are given. Section 4.2 lists some of the key ideas in algorithm and architecture design for array processors. Since Finite Impulse Response (FIR) filter and linear convolution algorithms have the same representation, they are both discussed in Sec. 4.3. Sections 4.4 and 4.6 deal with Infinite Impulse Response (IIR) filters and pattern matching respectively. Sections 4.5 and 4.7 evaluate these designs in terms of their performance.
The designs presented are centered around the modified multiplier module developed in Chapter 3 and are simple and modular, using identical repetitive blocks. The overall architectures are based on nearest neighbor interconnections, avoiding any broadcasts, and can be easily implemented as highly area efficient VLSI arrays. In addition, the basic modules are easily cascadable, without timing considerations, and can be used as a basic block for implementing larger length algorithms. The arrays employ concurrent processing and are completely pipelined; that is, the throughput rate (bits/second) is independent of the word size and the filter length.
Array processing concepts mentioned in Chapter 2 are elaborated in this section. Arguments are based on the vector product for digital filtering, for illustration, but can be easily generalized to algorithms having a similar structure.
Digital filters can be characterized by

$$Y(n) = \sum_{i=1}^{K} B(i) \, Y(n-i) + \sum_{j=0}^{M-1} A(j) \, X(n-j) \qquad (4.1)$$

where Y(n-i) is the n-th output delayed by i units in time and X(n-j) is the n-th input delayed by j units. The filter is called FIR or IIR, depending upon whether or not all B(i)'s are zero. M is an important parameter of the filter and is referred to as the order of the filter. The value of K determines the number of poles of the filter and stability arguments dictate that K be less than M. The expression for convolution is identical to that of an FIR filter.
The computational demands placed by such algorithms are significantly different from those of the bilinear algorithms discussed in Sections 3.1 and 3.2. The number of multiplications required to evaluate each output component as per (4.1) equals the number of non-zero filter coefficients A(i)'s and B(i)'s. In most realistic situations this is a reasonably small number and, more importantly, it is independent of the number of points in the input sequence. Thus, there exists no inherent need to minimize the number of multiplications. As a consequence, these algorithms can be implemented without the multiplication minimizing constraint, giving more flexibility in terms of architecture. The pre-multiplication addition stage encountered in bilinear algorithms is also absent here. We now show how to map the multiplication-addition structure of these algorithms into VLSI arrays.
The key idea behind array processing is to spread out the computations in space, thus executing the algorithm in a highly concurrent (parallel) fashion. The spread in space, along with pipelining, results in high computation throughputs. The data flow is laminar and the control is also spread out in space, i.e. control signals flow in the array as does data, thus simplifying timing. Also, since most operations have to be performed simultaneously, pipelining these arrays is essential to improve throughput. Array structures have the added advantage of simple analysis and simulation tools [30].
To illustrate these concepts, consider an FIR filter,

$$Y(n) = \sum_{j=0}^{M-1} A(j) \, X(n-j) \qquad (4.2)$$

Defining the delay operator, Z, as $Z \cdot X(n) = X(n-1)$, equation (4.2) can be written as,

$$Y(n) = \sum_{j=0}^{M-1} A(j) \, [Z^j \cdot X(n)] \qquad (4.3)$$

or as,

$$Y(n) = \sum_{j=0}^{M-1} Z^j \, [A(j) \cdot X(n)] \qquad (4.4)$$
Even though (4.3) and (4.4) are computationally equivalent, they have different characteristics. Figs. 4-1 and 4-2 show the networks corresponding to these equations. For example, it is clear that (4.3) requires a simultaneous addition of M elements, whereas (4.4) uses pipelined additions which may be performed within a shorter cycle than needed for the former, hence yielding a higher throughput. Also, for bit-sequential implementation, addition of more than two operands simultaneously is undesirable since it implies forming a tree of adder modules of Fig. 3-1, with appropriate delays to balance the addition paths. The computation of (4.4) is thus more suited for bit-sequential implementation than the former.
It may however be observed that unlike (4.3), (4.4) requires the broadcasting of the input signal X = {X(n)} to all the computational elements. Broadcast is undesirable for VLSI implementation, because of the delay, timing and fanout problems it may cause. In addition, broadcast disrupts modularity and the cascading of smaller blocks to form bigger blocks.
Both of these implementations spread the computations entirely in space, thereby achieving fully parallel implementations. In the implementation of Fig. 4-1 all the computations for each Y(n) are simultaneous, whereas in that of Fig. 4-2 the computations are skewed in the time/space domain, which is typical of pipelines. Moreover, both designs have implicit control embedded in space, by virtue of the array size M.

Figure 4-1: The implementation of (4.3)

Figure 4-2: The implementation of (4.4)

The translation of the design of Fig. 4-2 into bit-sequential array filters is presented in the following section.
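The difference between the two network forms can be illustrated at the word level. In the sketch below, `fir_direct` follows (4.3), delaying the input and forming one M-term sum per output, while `fir_transposed` follows (4.4), broadcasting each input to all multipliers and passing the products through a chain of two-operand adders and delays. Both function names are illustrative and the model ignores bit-serial timing.

```python
# Sketch: word-level models of the networks for (4.3) and (4.4).
# Function names are hypothetical.

def fir_direct(coeffs, samples):
    """(4.3): shift the input through a delay line, then add M products at once."""
    delay_line = [0] * len(coeffs)
    out = []
    for x in samples:
        delay_line = [x] + delay_line[:-1]
        out.append(sum(a * d for a, d in zip(coeffs, delay_line)))
    return out

def fir_transposed(coeffs, samples):
    """(4.4): broadcast the input, then delay and accumulate the products."""
    M = len(coeffs)
    regs = [0] * M                    # regs[j] feeds the adder of stage j
    out = []
    for x in samples:
        products = [a * x for a in coeffs]          # x reaches every multiplier
        out.append(products[0] + regs[0])           # two-operand add at the output
        regs = [products[j + 1] + regs[j + 1] for j in range(M - 1)] + [0]
    return out

coeffs = [1, 2, 3]
samples = [1, 0, 0, 4, 5]
direct = fir_direct(coeffs, samples)
transposed = fir_transposed(coeffs, samples)
print(direct, transposed)
```

Both forms give the same output sequence. The second uses only two-operand additions per stage, at the cost of broadcasting each input sample.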
4.3 Finite Impulse Response Filters 
Consider two's complement multipliers (M modules) which take in input words N bits wide and produce in a bit-sequential fashion the 2N-1 bits of the product of the input word with the stored coefficient. Using these and the two's complement adder modules of Fig. 3-1, the FIR filter implementation in Fig. 4-2 can be transformed to the design shown in Fig. 4-3.
Each word delay element in Fig. 4-2 is now replaced by 2N bit delays to obtain the same effect. Since the modules need one cycle to clear the residue bits from a previous operation, as mentioned in Section 3.4, after giving an input to the filter, the multipliers and the adders are clocked 2N times to yield the 2N-1 bit output and then to clear the modules.
Since for each clocking of the multiplier only one bit of output is produced and needed in the design, we can make use of the multiplier module described in Fig. 3-3 to take even the input in a bit-sequential form. As described in Sec. 3.3 the input is 2N-1 bits, in two's complement sign extended form. This however does not avoid the broadcast of the input bits to all modules.
Figure 4-3: Serial-input Sequential-output FIR filter (PS Multipliers: Parallel
input Serial output Multipliers, storing coefficients A(0), A(1), ..., A(M-1))

The broadcast can be avoided by moving the delays between additions to
between multiplications. One other important feature of this design is that the
control signal for clearing the residues in the computational modules is delayed
in the same fashion as the input bits. This leads to the design of the basic
array element shown in Fig. 4-4, which could be cascaded to get FIR filters of
any order.

Figure 4-4: Bit-sequential array element for the filter

It can be seen that this element consists of the multiplier and the adder
presented in Chapter 3, along with a single bit delay in the propagation path of
the input bits to the next module, and delays in the path of the reset signal to
successive levels of latches within the element. The output of the multiplier is
added to the input from the adjacent module delayed by 2N-2 bits. As
explained earlier, the constant filter coefficient is stored in sign-magnitude form
and its sign is employed to obtain the two's complement of the product if
necessary.

Figure 4-5: Bit-sequential array FIR filter (Modules #0 to #(M-1), followed by a
2N-3 bit delay before the output OUT)
The array structure using these modules is shown in Fig. 4-5. The input
to the array is the successive bits of a signal component, sign-extended to 2N-1
bits. Every 2N-th clock, a clear signal is introduced into the first element,
which ripples through to clear the multiplier and the adder in each element and
is also passed to the succeeding elements. The next 2N-1 clocks are used to
input bits of the next signal component and the entire cycle is repeated till all
the signal components are exhausted. The control of the array is thus simple
and can be done through a 2N counter. In order to synchronize the input and
the output an additional 2N-3 bit delay is introduced before the output is
extracted from the array. This would produce a delay exactly equal to 2N clocks
(the word delay) between the input and the output.
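At the word level, the behaviour described above amounts to an FIR filter whose output lags the input by one word, which is the relation stated as Theorem 2 in Section 4.3.1. The following Python sketch models only that word-level relation and not the bit-level timing; the function name `fir_word_level` and the use of unbounded Python integers for the samples are illustrative assumptions.

```python
def fir_word_level(x, a):
    """Word-level model: out[k] = sum over l of x[k-l-1] * a[l].

    Samples with an index outside the input are taken as zero. out[k] is the
    word that emerges during word slot k, one word time after x[k-1] entered.
    """
    M = len(a)
    out = []
    for k in range(len(x) + 1):
        acc = 0
        for l in range(M):
            idx = k - l - 1
            if 0 <= idx < len(x):
                acc += x[idx] * a[l]
        out.append(acc)
    return out


x = [1, 2, 3, 4]      # hypothetical input words X(0..3)
a = [2, -1, 3]        # hypothetical coefficients A(0..2)
y = fir_word_level(x, a)
```

The first output word is zero because no input word has yet been fully shifted in; the model ignores the finite 2N-1 bit output width.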
The operation of the design in Fig. 4-5 can be verified by a mathematical
characterization of the circuit. This leads to a proof of operation, which is
presented next.
4.3.1 Proof of the FIR filter design 
Developing a mathematical representation demands evolving a notation for
such bit-sequential architectures. In the arguments developed in this section,
the following notation is used:

$$VAR(i, j), \quad 0 \le j \le M-1$$

where VAR(i, j) is the bit for the variable VAR of the j-th module at the i-th
time instant. If either argument is negative for the variable, then the value of
the variable is zero.

Let the input and output samples and the stored coefficients be
respectively represented as:

$$X(k) = [\, x_{k,2N-1} \;\; x_{k,2N-2} \;\; \dots \;\; x_{k,0} \,]$$
$$Y(k) = [\, y_{k,2N-1} \;\; y_{k,2N-2} \;\; \dots \;\; y_{k,0} \,]$$
$$A(k) = [\, a_{k,N-1} \;\; a_{k,N-2} \;\; \dots \;\; a_{k,0} \,] \tag{4.5}$$

As mentioned before, the constant is stored in sign-magnitude form and the N
bit input is given in two's complement sign-extended form to obtain the 2N-1
bit output in two's complement form. Therefore,

$$a_{k,N-1} = \begin{cases} 0 & \text{if } A(k) > 0 \\ 1 & \text{otherwise} \end{cases}$$

and,

$$x_{k,2N-1} = x_{k,2N-2} = \dots = x_{k,N-1}$$

The input to the j-th module, $X_{in}$, is then given by

$$X_{in}(i, j) = x_{(i-j)\ \mathrm{div}\ 2N,\; (i-j) \bmod 2N} \tag{4.6}$$

The reset condition for the C-latch at the final stage adder can be expressed as

$$C(i, j) = a_{j,N-1} \quad \text{for } i = 2kN + j + 2 \tag{4.7}$$

Let P(i, j) be the output of the EX-OR gate, which is fed to the final stage
adder.

Theorem 1.

$$\sum_{i=0}^{2N-1} 2^i\, P(2kN+i+l+2,\, l) - (2^{2N}-1)\, a_{l,N-1} = A(l)\, X(k)$$

Proof. As proved in Sec. 3.3, the output of the multiplication block M(i, j)
is given by

$$\sum_{i=0}^{2N-1} 2^i\, M(i+1,\, l) = |A(l)| \sum_{i=0}^{2N-1} 2^i\, X_{in}(i,\, l) \tag{4.8}$$

The complementation by the sign bit of the stored coefficient can be
characterized as

$$P(i+1,\, l) = a_{l,N-1} + (-1)^{a_{l,N-1}}\, M(i,\, l) \tag{4.9}$$

By changing variables and summing both sides of (4.9) we get

$$\sum_{i=0}^{2N-1} 2^i\, P(2kN+i+l+2,\, l) = \sum_{i=0}^{2N-1} 2^i\, a_{l,N-1} + (-1)^{a_{l,N-1}} \sum_{i=0}^{2N-1} 2^i\, M(2kN+i+l+1,\, l) \tag{4.10}$$

and then substituting from (4.8) into (4.10) we get

$$= \sum_{i=0}^{2N-1} 2^i\, a_{l,N-1} + (-1)^{a_{l,N-1}}\, |A(l)| \sum_{i=0}^{2N-1} 2^i\, X_{in}(2kN+i+l,\, l)$$
$$= (2^{2N}-1)\, a_{l,N-1} + (-1)^{a_{l,N-1}}\, |A(l)| \sum_{i=0}^{2N-1} 2^i\, X_{in}(2kN+i+l,\, l)$$

Using (4.6) this can be written as

$$= (2^{2N}-1)\, a_{l,N-1} + (-1)^{a_{l,N-1}}\, |A(l)| \sum_{i=0}^{2N-1} 2^i\, x_{k,i}$$

or,

$$\sum_{i=0}^{2N-1} 2^i\, P(2kN+i+l+2,\, l) - (2^{2N}-1)\, a_{l,N-1} = (-1)^{a_{l,N-1}}\, |A(l)| \sum_{i=0}^{2N-1} 2^i\, x_{k,i} \tag{4.11}$$

The right hand side of (4.11) can be expressed as

$$= |A(l)| \sum_{i=0}^{2N-1} 2^i\, x_{k,i} \quad \text{for } A(l) > 0$$

and,

$$= -|A(l)| \sum_{i=0}^{2N-1} 2^i\, x_{k,i} \quad \text{for } A(l) < 0$$

which is just the product A(l) X(k) in either case.
Q.E.D.
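The identity behind Theorem 1 can be checked numerically at the word level. The Python sketch below assumes a product word of 2N bits, treats the magnitude product as an unsigned word modulo 2^(2N), applies the EX-OR with the sign bit as in (4.9), and then tests (4.11). It also checks that presetting the adder carry to the sign bit, as in (4.7), yields the two's complement product. The helper names are hypothetical and the clock-level timing of the modules is not modelled.

```python
def xor_with_sign(m_word, sign, width):
    """EX-OR every bit of m_word with the sign bit, as in (4.9)."""
    mask = (1 << width) - 1
    return (m_word ^ (mask if sign else 0)) & mask


def check_theorem1(coeff, x, N):
    """Return (identity_4_11_holds, carry_preset_gives_product)."""
    width = 2 * N
    mask = (1 << width) - 1
    sign = 1 if coeff < 0 else 0
    # Bits M of |A(l)| * X(k), taken modulo 2^(2N)
    m_word = (abs(coeff) * x) & mask
    p_word = xor_with_sign(m_word, sign, width)
    # Left side of (4.11): sum of 2^i P_i minus (2^(2N) - 1) * sign
    lhs = p_word - mask * sign
    # Right side of (4.11): (-1)^sign times the sum of 2^i M_i
    rhs = -m_word if sign else m_word
    # Adding the preset carry (the sign bit) completes the two's complement
    adder_out = (p_word + sign) & mask
    return lhs == rhs, adder_out == (coeff * x) & mask


result = check_theorem1(-3, 5, 4)
```

For a negative coefficient the EX-OR produces the one's complement of the magnitude product, and the preset carry supplies the missing +1.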
Theorem 2.

$$\sum_{i=0}^{2N-1} 2^i\, \mathrm{OUT}(i+2kN) = \sum_{l=0}^{M-1} X(k-l-1)\, A(l)$$

Proof. The final stage adder in each module of Fig. 4-5 is characterized by

$$AO(i, j) = P(i-1,\, j) + AI(i-(2N-2)-1,\, j) + C(i-1,\, j) - 2\,C(i,\, j)$$

but,

$$AI(i-1,\, j) = AO(i-1,\, j+1)$$

and,

$$AI(i,\, M-1) = 0 \tag{4.12}$$

Using equations (4.12), the above expression can be recursively expanded for
the zero-th module to yield

$$AO(i, 0) = \sum_{l=0}^{M-1} \left[\, P(i-l(2N-1)-1,\, l) + C(i-l(2N-1)-1,\, l) - 2\,C(i-l(2N-1),\, l) \,\right] \tag{4.13}$$

Since the delay of each module is 3 units, consider the output of the zero-th
module at time i+t, where t = 2kN+3. Summing the output with appropriate
weights, from (4.13) we obtain

$$\sum_{i=0}^{2N-1} 2^i\, AO(i+t,\, 0) = \sum_{i=0}^{2N-1} 2^i \sum_{l=0}^{M-1} P(i+t-l(2N-1)-1,\, l) + \sum_{l=0}^{M-1} S(l) \tag{4.14}$$

where,

$$S(l) = \sum_{i=0}^{2N-1} \left[\, 2^i\, C(i+t-l(2N-1)-1,\, l) - 2^{i+1}\, C(i+t-l(2N-1),\, l) \,\right]$$
$$= \sum_{i=0}^{2N-1} 2^i\, C(i+t-l(2N-1)-1,\, l) - \sum_{i=1}^{2N} 2^i\, C(i+t-l(2N-1)-1,\, l)$$
$$= C(t-l(2N-1)-1,\, l) - 2^{2N}\, C(t-(l-1)(2N-1),\, l)$$

Substituting the value of t and using the reset conditions for the carry from (4.7),

$$= C((k-l)2N+l+2,\, l) - 2^{2N}\, C((k-l+1)2N+l+2,\, l)$$
$$= -(2^{2N}-1)\, a_{l,N-1} \tag{4.15}$$

Finally, since

$$\mathrm{OUT}(i+2kN) = AO(i+2kN-(2N-3),\, 0)$$

using (4.14) and (4.15) we obtain

$$\sum_{i=0}^{2N-1} 2^i\, \mathrm{OUT}(i+2kN) = \sum_{i=0}^{2N-1} 2^i \sum_{l=0}^{M-1} P((k-l-1)2N+i+l+2,\, l) - \sum_{l=0}^{M-1} (2^{2N}-1)\, a_{l,N-1}$$
$$= \sum_{l=0}^{M-1} \left[\, \sum_{i=0}^{2N-1} 2^i\, P((k-l-1)2N+i+l+2,\, l) - (2^{2N}-1)\, a_{l,N-1} \,\right]$$

Using theorem 1 the above equation can be written as

$$\sum_{i=0}^{2N-1} 2^i\, \mathrm{OUT}(i+2kN) = \sum_{l=0}^{M-1} X(k-l-1)\, A(l)$$

Q.E.D.

4.4 Infinite Impulse Response Filters
An IIR filter implementation using two linear arrays of sizes M and K and
an adder is shown in Fig. 4-6. As in the FIR filter, the upper array modules
are numbered 0 to M-1, with the i-th module storing A(i). For the lower
array, modules numbered 0 to K-1 store coefficients B(1) to B(K) respectively.
The 2N-3 bit delay of the previous case is now generated through the unit
delay of the adder and an additional delay of 2N-4 bits, and the output is
fed into the lower array to create the required inner products of the output
vector and the B(l) coefficient vector.
The analysis of the FIR filter architecture of the last section can be easily
extended to the IIR filter, as presented in theorem 3.
Figure 4-6: Bit-Sequential Systolic Array IIR Filter (upper array Modules #0 to
#(M-1), a full adder with latched output, a 2N-4 bit delay, and lower array
Modules #0 to #(K-1))

Theorem 3. For the array structure of Fig. 4-6,

$$\sum_{i=0}^{2N-1} 2^i\, \mathrm{OUT}(i+2kN) = \sum_{l=0}^{M-1} X(k-l-1)\, A(l) + \sum_{l=1}^{K} Y(k-l-1)\, B(l).$$

Proof. Let the output bit of the upper array at time t be denoted by
AO(t) and that of the lower bank by BO(t). Also let the carry of the full
adder in Fig. 4-6 be C(t). Then the adder can be characterized by

$$\mathrm{OUT}(t+2N-4) + 2\,C(t) = AO(t-1) + BO(t-1) + C(t-1) \tag{4.16}$$

and,

$$C(2kN+3) = 0, \quad \text{for all integer } k. \tag{4.17}$$

By substituting t = 2(k-1)N+i+4 and summing both sides of (4.16) with proper
weights, we get

$$\sum_{i=0}^{2N-1} 2^i\, \mathrm{OUT}(i+2kN) = \sum_{i=0}^{2N-1} 2^i\, AO(2(k-1)N+i+3) + \sum_{i=0}^{2N-1} 2^i\, BO(2(k-1)N+i+3) + S(k) \tag{4.18}$$

where,

$$S(k) = \sum_{i=0}^{2N-1} 2^i\, C(2(k-1)N+i+3) - \sum_{i=0}^{2N-1} 2^{i+1}\, C(2(k-1)N+i+4)$$
$$= \sum_{i=0}^{2N-1} 2^i\, C(2(k-1)N+i+3) - \sum_{i=1}^{2N} 2^i\, C(2(k-1)N+i+3)$$
$$= C(2(k-1)N+3) - 2^{2N}\, C(2kN+3)$$
$$= 0, \quad \text{from (4.17)}.$$

By combining (4.14) and (4.15) and using theorem 1, one gets

$$\sum_{i=0}^{2N-1} 2^i\, AO(2(k-1)N+i+3) = \sum_{l=0}^{M-1} X(k-l-1)\, A(l). \tag{4.19}$$

Also, since the input to the lower array is

$$X_{in}(i, j) = y_{\lfloor (i-j-2N)/2N \rfloor,\; (i-j-2N) \bmod 2N}$$

and the stored constants are B(1), B(2), ..., B(K), it can be argued that

$$\sum_{i=0}^{2N-1} 2^i\, BO(2(k-1)N+i+3) = \sum_{l=0}^{K-1} Y(k-l-2)\, B(l+1). \tag{4.20}$$

From (4.18), (4.19) and (4.20),

$$\sum_{i=0}^{2N-1} 2^i\, \mathrm{OUT}(i+2kN) = \sum_{l=0}^{M-1} X(k-l-1)\, A(l) + \sum_{l=1}^{K} Y(k-l-1)\, B(l).$$

Q.E.D.
As before, it is therefore verified that the output bits of Y(k-1) are
available during clock cycles 2kN to 2(k+1)N-1, i.e. $\mathrm{OUT}(i+2kN) = y_{k-1,i}$.
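Read at the word level, Theorem 3 together with the remark above gives the familiar recursive difference equation y(n) = sum over l of A(l) x(n-l) plus sum over l of B(l) y(n-l), with n = k-1. A minimal Python sketch of that recurrence follows; the function name `iir_word_level`, the list layout of the coefficients, and the use of unbounded integers (no wordlength limit or overflow) are assumptions made for illustration.

```python
def iir_word_level(x, a, b):
    """Word-level IIR recurrence.

    a[l] holds A(l) for l = 0..M-1; b[l-1] holds B(l) for l = 1..K.
    Samples before the start of the sequences are taken as zero.
    """
    y = []
    for n in range(len(x)):
        acc = 0
        for l in range(len(a)):              # feed-forward part, upper array
            if n - l >= 0:
                acc += a[l] * x[n - l]
        for l in range(1, len(b) + 1):       # feedback part, lower array
            if n - l >= 0:
                acc += b[l - 1] * y[n - l]
        y.append(acc)
    return y


impulse_response = iir_word_level([1, 0, 0, 0], [1], [2])
```

With A(0) = 1 and B(1) = 2, each output is twice the previous one, so an impulse produces a doubling sequence.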
4.5 Evaluation of filter designs
The systolic array design is robust in terms of cascadability to any length
problem, retaining the simplicity of control and data flow. The hardware
needed for these designs is O(L), where L is the total number of coefficients in
(4.1), and is lower than that of corresponding word-oriented designs by a factor
equal to the wordlength. The time for one output computation is O(N), where
N is the wordlength of the output words. The design permits very high
clocking rates. Assuming the 50 MHz clock used earlier in Chapter 3, it would
take 640 nsec to process a 32 bit word, which corresponds to a throughput rate
of about 1.5 MHz.
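The throughput figure can be reproduced with a short calculation, sketched here in Python under the assumption that one bit of the 32 bit word is processed per clock cycle; the variable names are illustrative.

```python
clock_hz = 50e6                                   # 50 MHz clock
bits_per_word = 32                                # one clock per bit (assumed)

clock_period_ns = 1e9 / clock_hz                  # duration of one clock in ns
word_time_ns = bits_per_word * clock_period_ns    # time to process one word
throughput_mhz = 1e3 / word_time_ns               # words per microsecond
```

Under these assumptions the word time is 32 clock periods of 20 nsec each, and the word rate is the reciprocal of that time.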
4.6 Pattern Recognition 
Text-editing, visual processing and signal reconstruction often require
searching through a string of characters or bits, looking for instances of a given
"pattern" string. The obvious way to search for a matching pattern is to begin
at the first position in the text and proceed till a mismatch is found, in which
case the starting position is advanced by one and the search continued. This
approach is very inefficient; in particular, if all possible matches in the string
are to be determined, the worst case time needed would be
O(mn), where m is the length of the pattern and n is the length of the string.
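The restart-on-mismatch search described above can be sketched as follows. This is an illustrative Python version, not taken from the thesis; the function name `naive_match` and the comparison counter are assumptions added to make the O(mn) worst case visible.

```python
def naive_match(pattern, text):
    """Return (match_positions, comparisons) for the restart-on-mismatch search."""
    m, n = len(pattern), len(text)
    positions = []
    comparisons = 0
    for start in range(n - m + 1):
        for i in range(m):
            comparisons += 1
            if text[start + i] != pattern[i]:
                break                      # mismatch: advance the start by one
        else:
            positions.append(start)        # all m symbols matched
    return positions, comparisons


worst_case = naive_match("1111", "1" * 10)
```

When every alignment matches in full, each of the n-m+1 starting positions costs m comparisons, which is the worst case referred to in the text.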
Researchers have proposed various algorithms which can match the string
in O(m+n) time with O(m) hardware [31-34]. In particular, the algorithm by
Knuth, Morris and Pratt [31] creates a jump table and uses this to implement
asynchronous jumps of various steps over the string. Such irregularities in data
flow make the hardware implementation of the algorithm inefficient. Other
algorithms suffer from similar limitations.
This section presents two pattern matching architectures based on
bit-sequential systolic arrays, to find all occurrences of a pattern of length m in a
string of length n. The first design, based on a double linear array structure,
takes O(m+n) time and the second design, based on a linear array, O(3m+n)
time. The hardware complexity of both designs is O(m). The characters in both
the pattern and the string are assumed to be bits, but the hardware can easily
be replicated for parallel matching in case the characters are words and the
same time efficiency is desired.
Let the bit pattern P, string S and the output flag vector R be
represented as:

$$P = [\, a(0) \;\; a(1) \;\; \dots \;\; a(m-1) \,]$$
$$S = [\, x_0 \;\; x_1 \;\; \dots \;\; x_{n-1} \,]$$
$$R = [\, r_0 \;\; r_1 \;\; \dots \;\; r_{n-1} \,] \tag{4.21}$$

As explained earlier, each component of these vectors is either a 0 or a 1. The
output bit $r_i$ is set to 1 if the substring $[x_{i-m+1} \; x_{i-m+2} \; \dots \; x_i]$ matches the
pattern P in every bit; $r_i$ is set to 0 otherwise. Thus

$$r_i = \neg\{a(0) \oplus x_{i-m+1}\} \wedge \neg\{a(1) \oplus x_{i-m+2}\} \wedge \dots \wedge \neg\{a(m-1) \oplus x_i\} \tag{4.22}$$

where $\wedge$ is a two operand AND. Therefore $r_i = 1$ implies a match at position
(i-m+1) in the string ($[r_0 \; r_1 \; \dots \; r_{m-2}]$ are ignored). Without loss of
generality we can assume m = 2M, and by defining

$$F(j) = r_{j+m-1}, \tag{4.23}$$

equation (4.22) can be expressed as

$$F(j) = \prod_{i=0}^{2M-1} \neg\{a(i) \oplus x(j+i)\} \tag{4.24}$$

where $\oplus$ is the exclusive-or, $\neg$ is the logical not and $\prod$ stands for the
multioperand AND. Equation (4.24) can be manipulated to yield two different
designs as shown in the next two sections.
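The relation between the flag vector and F(j) in (4.23) and (4.24) can be illustrated with a direct evaluation. The Python sketch below uses bit lists for the pattern and the string; the function names `match_flags` and `F` and the choice to store 0 in the ignored leading flags are assumptions for illustration.

```python
def match_flags(a, x):
    """r[i] = 1 if x[i-m+1 .. i] equals the pattern a in every bit, else 0.

    The first m-1 flags, which the text says are ignored, are left at 0.
    """
    m = len(a)
    r = [0] * len(x)
    for i in range(m - 1, len(x)):
        flag = 1
        for j in range(m):
            # NOT(a XOR x) for single bits, ANDed over the whole pattern
            flag &= 1 ^ (a[j] ^ x[i - m + 1 + j])
        r[i] = flag
    return r


def F(a, x, j):
    """F(j) = r[j+m-1]: indicator of a match starting at position j."""
    return match_flags(a, x)[j + len(a) - 1]


a = [1, 0, 1, 1]
x = [0, 1, 0, 1, 1, 0, 1, 1, 1]
r = match_flags(a, x)
```

Indexing by F(j) simply relabels each flag by the position where the matching substring starts rather than where it ends.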
4.6.1 Double linear array architecture 
Using the delay operator, Z, described in Sec. 4.2, equation (4.24) can be
expressed as:

$$F(j) = \prod_{i=0}^{M-1} [\neg\{a(2i) \oplus x(j+2i)\}] \;\wedge\; \prod_{i=0}^{M-1} [\neg\{a(2i+1) \oplus x(j+2i+1)\}]$$
$$= \prod_{i=0}^{M-1} \{ Z^{-i} [\, Z^{-i}\, \neg(a(2i) \oplus x(j)) \,] \} \;\wedge\; Z^{-1} \prod_{i=0}^{M-1} \{ Z^{-i} [\, Z^{-i}\, \neg(a(2i+1) \oplus x(j)) \,] \}$$

where $Z^{-1}$ implies advancing the appropriate signal by unit time. Modules
shown in Fig. 4-7 can be cascaded in the form of a double array, each of length
M, as shown in Fig. 4-8, to implement the above equation. If the modules are
numbered 0 to M-1 from left to right, as shown, then the stored pattern bit for
the j-th module is given by

$$a(2(M-j)-2) \quad \text{for the upper array}$$

and,

$$a(2(M-j)-1) \quad \text{for the bottom array} \tag{4.25}$$
As in the filter case, it can be observed that the results flow in the direction 
opposite to that of data. Since all the modules have latched outputs, i.e., they 
are Moore machines, any length pattern can be matched on the architecture by 
increasing the array length, without affecting the clocking rate. 
4.6.1.1 Proof of operation 
For representation purposes let

$$VAR(i, j, k), \quad 0 \le j \le M-1, \quad k = \text{upper or lower}$$

be the value of the variable VAR for the j-th module at the i-th clock. The
index k associates the variable with the upper or the lower array. Then using
(4.25), the interconnections of Fig. 4-8 yield the following relationships:

$$X_{in}(i, j, \text{upper}) = X_{in}(i, j, \text{lower}) = x(i-j)$$
$$R_{out}(i+1, j, \text{upper}) = R_{out}(i, j+1, \text{upper}) \wedge [\neg\{a(2(M-j)-2) \oplus X_{in}(i, j, \text{upper})\}]$$
$$= R_{out}(i, j+1, \text{upper}) \wedge [\neg\{a(2(M-j)-2) \oplus x(i-j)\}]$$

and,

$$R_{out}(i+1, j, \text{lower}) = R_{out}(i, j+1, \text{lower}) \wedge [\neg\{a(2(M-j)-1) \oplus x(i-j)\}] \tag{4.26}$$

Figure 4-7: Array module for the double array design

Figure 4-8: Double array architecture for pattern recognition
If O(i) is the output of the array, then

$$O(i+1) = R_{out}(i, 0, \text{upper}) \wedge R_{out}(i+1, 0, \text{lower}) \tag{4.27}$$

Also,

$$R_{in}(i, M-1, \text{upper}) = R_{in}(i, M-1, \text{lower}) = 1, \quad \text{for all } i \tag{4.28}$$

The following theorem describes the output of the circuit in Fig. 4-8:
Theorem 4: For the array structure of Fig. 4-8,

$$O(k+2M) = F(k).$$

Proof. From (4.27) we get

$$O(k+2M) = R_{out}(k+2M-1, 0, \text{upper}) \wedge R_{out}(k+2M, 0, \text{lower}) \tag{4.29}$$

The terms in (4.29) can be expanded using (4.26) as below:

$$R_{out}(k+2M-1, 0, \text{upper}) = R_{out}(k+2M-2, 1, \text{upper}) \wedge \{\neg[a(2M-2) \oplus x(k+2M-2)]\}$$

and, finally through recursion,

$$= R_{in}(k+M-2, M-1, \text{upper}) \wedge \prod_{i=0}^{M-1} \{\neg[a(2i) \oplus x(k+2i)]\} \tag{4.30}$$

Similarly, using (4.26), for the bottom array we can derive

$$R_{out}(k+2M, 0, \text{lower}) = R_{in}(k+M-1, M-1, \text{lower}) \wedge \prod_{i=0}^{M-1} \{\neg[a(2i+1) \oplus x(k+2i+1)]\} \tag{4.31}$$

Using (4.28), (4.29), (4.30) and (4.31) we get

$$O(k+2M) = \prod_{i=0}^{M-1} \{\neg[a(2i) \oplus x(k+2i)]\} \wedge \prod_{i=0}^{M-1} \{\neg[a(2i+1) \oplus x(k+2i+1)]\}$$
$$= \prod_{i=0}^{2M-1} \{\neg[a(i) \oplus x(k+i)]\}$$
$$= F(k).$$

Q.E.D.
From (4.23), (4.24) and theorem 4 it can be inferred that the $r_i$'s, the
indicators of occurrence of the pattern in the string, are available as outputs of
the array starting at time 2M, which is the length of the pattern.
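Theorem 4 can also be checked by clocking the recurrences (4.26) to (4.28) directly. The Python sketch below is a behavioural simulation under stated assumptions: all latches start at 0, string bits outside the string read as 0, and the names `x_at`, `simulate_double_array` and `F` are hypothetical helpers rather than anything defined in the thesis.

```python
def x_at(x, t):
    """String bit x(t); indices outside the string read as 0 (assumed)."""
    return x[t] if 0 <= t < len(x) else 0


def simulate_double_array(a, x):
    """Clock (4.26)-(4.28) and return a dict {time: O(time)}."""
    M = len(a) // 2
    # up[j], lo[j] hold Rout(i, j, upper/lower); index M models Rin of the
    # last module, tied to 1 by (4.28).
    up = [0] * M + [1]
    lo = [0] * M + [1]
    out = {}
    for i in range(len(x)):
        new_up = [up[j + 1] & (1 ^ a[2 * (M - j) - 2] ^ x_at(x, i - j))
                  for j in range(M)] + [1]
        new_lo = [lo[j + 1] & (1 ^ a[2 * (M - j) - 1] ^ x_at(x, i - j))
                  for j in range(M)] + [1]
        out[i + 1] = up[0] & new_lo[0]     # (4.27): upper at i, lower at i+1
        up, lo = new_up, new_lo
    return out


def F(a, x, k):
    """Direct evaluation of (4.24) for a match starting at position k."""
    return int(all(a[i] == x[k + i] for i in range(len(a))))


a = [1, 0, 1, 1]
x = [0, 1, 0, 1, 1, 0, 1, 1, 1]
out = simulate_double_array(a, x)
agrees = all(out[k + len(a)] == F(a, x, k) for k in range(len(x) - len(a) + 1))
```

The initial latch contents never reach the compared outputs, because each result bit entering the chain starts from the constant 1 at the last module.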
4.6.2 Single linear array architecture
The design approach for the systolic arrays presented so far was based on
manipulating the delays in the algorithm, so that broadcast and multiple data
addition could be avoided, being balanced by the delays introduced. The
implementation of the modified algorithm was a mere translation of the final
expression.
In this section a different approach is taken and the design is based on
the study of data flow in an array. For the pattern recognition algorithm the
flow of data and results can be mapped into the array shown in Fig. 4-10,
which uses the modules of Fig. 4-9 as its basic elements.
The modules numbered 0 to m-1, from left to right, as shown, store a(0)
to a(m-1) respectively. Let u(i, j) and v(i, j) be the outputs of latches u and v
respectively, for the j-th module at the i-th clock. Then the interconnections of
the modules can be expressed as:

$$v(i+1, j) = u(i, j)$$
$$u(i+1, j) = v(i, j-1) \wedge [\neg\{a(j) \oplus x(i-j)\}]$$
$$v(i, -1) = 1 \tag{4.32}$$

Figure 4-9: Array module for the single array design

Figure 4-10: Single array architecture for pattern recognition

Theorem 5: For the array of Fig. 4-10,

$$v(k+2m, m-1) = F(k).$$

Proof.

$$v(k+2m, m-1) = u(k+2m-1, m-1)$$

Using (4.32) this can be recursively expanded to get

$$= v(k, -1) \wedge \prod_{j=0}^{m-1} [\neg\{a(j) \oplus x(j+k)\}]$$
$$= \prod_{j=0}^{m-1} [\neg\{a(j) \oplus x(j+k)\}]$$
$$= F(k).$$

Q.E.D.
Therefore the pattern indicators, F(0), F(1), ..., are available at the
output of the array starting at time 2m, which is twice the length of the pattern.
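The single array can be exercised in the same way by clocking the recurrences (4.32). The Python sketch below is a behavioural simulation under stated assumptions: the u and v latches start at 0, string bits outside the string read as 0, and the names `x_at`, `simulate_single_array` and `F` are hypothetical helpers.

```python
def x_at(x, t):
    """String bit x(t); indices outside the string read as 0 (assumed)."""
    return x[t] if 0 <= t < len(x) else 0


def simulate_single_array(a, x):
    """Clock (4.32) and return a dict {time: v(time, m-1)}."""
    m = len(a)
    u = [0] * m          # u(i, j)
    v = [0] * m          # v(i, j)
    out = {}
    for i in range(len(x) + m):
        new_v = list(u)                                   # v(i+1, j) = u(i, j)
        new_u = [(v[j - 1] if j > 0 else 1)               # v(i, -1) = 1
                 & (1 ^ a[j] ^ x_at(x, i - j))
                 for j in range(m)]
        u, v = new_u, new_v
        out[i + 1] = v[m - 1]
    return out


def F(a, x, k):
    """Direct evaluation of (4.24) for a match starting at position k."""
    return int(all(a[i] == x[k + i] for i in range(len(a))))


a = [1, 0, 1, 1]
x = [0, 1, 0, 1, 1, 0, 1, 1, 1]
out = simulate_single_array(a, x)
agrees = all(out[k + 2 * len(a)] == F(a, x, k)
             for k in range(len(x) - len(a) + 1))
```

Data advances one module per clock while a partial result needs two clocks per module (latch u, then latch v), which is why the indicator for position k appears at time k+2m.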
4.7 Evaluation of pattern recognition arrays
As mentioned earlier, the two designs presented for pattern recognition
applications are derived from different considerations. The double array design is
based on broadcast and multiple data addition elimination through delay
utilization, whereas the single array is based on data flow study. Since the first
technique has been well illustrated in this chapter, it is simpler to implement for
inner-vector product algorithms, as opposed to the second technique, which
lacks a strong theoretical design background.
The fact that results and data flow in opposite directions in the double
array design can be utilized to load in the pattern in the first M clocks in
parallel with the data (both the arrays are simultaneously loaded and it takes M
cycles to load 2M pattern samples), whereas in the single array design such a
parallel load is not possible and the pattern has to be preloaded. Thus the
overall execution time, which is the load time and the time it takes to generate
n match indicators, F(k)'s, is (m+n) for the double array compared to (3m+n)
for the single array. Therefore the single array may be undesirable in
applications where a new pattern has to be frequently loaded. The hardware
complexity of both designs is however the same, being O(m).
The arrays can be clocked close to shift register rates, yielding a very high
throughput. They are easily cascadable for large length patterns without
affecting system performance. Also, for applications where the characters are more
than a single bit wide, as is frequently the case, these arrays can be cascaded in
parallel to work on individual bits of the character simultaneously and the
outputs of all such parallel arrays ANDed. Wildcards for matching can be
incorporated into the array at the expense of extra hardware in the basic array
modules without affecting control or the timing. Therefore, these arrays are
fast, modular and versatile, capable of being designed for any length of
pattern and any character set. With all the advantages mentioned above, the
array design is very attractive for VLSI implementation.
Chapter 5
Conclusions
Signal processing algorithms need efficient computational architectures as
well as input/output handling to process the increasingly large volumes of data
encountered in modern day signal processing techniques. Current parallel
computers, such as pipeline computers, array processors and multiprocessing
systems, are however designed for general purpose computing and hence are not
efficient for signal processing applications.
Special purpose architectures, including systolic and wavefront arrays, have
recently been proposed by researchers for signal processing applications. These
architectures exploit the recursion present in signal processing algorithms and
aim at balancing computation with input/output. In addition, these
architectures are modular, using pipelined repetitive blocks with localized data and
control flow, which makes them very efficient VLSI designs.
The above designs, however, suffer from the large bandwidth needed at the
input/output, which becomes increasingly critical as the problem size grows.
Bit-sequential architectures enjoy the same advantages as the arrays mentioned
above and in addition lead to an even better structured, lower pin count
solution. They are pipelined to the bit level, have a distinct power advantage and
usually lead to a hardware complexity which is less than that of
word-oriented machines by a factor equal to the word length, without compromising
on the execution speed.
Bit-sequential architectures for two different classes of signal processing
algorithms have been presented in this thesis. For algorithms that can be
efficiently represented in the bilinear form, like most of the transforms
encountered in signal processing, a three level architecture has been proposed. For
algorithms which can be expressed as an inner product of two vectors,
bit-sequential linear arrays have been proposed. The bilinear architecture is
illustrated by constructing an FFT butterfly element. The latter class of
algorithms covers digital filtering, convolution and pattern recognition, and designs
for all of these applications have been given. Also, for both classes, the design
philosophy has been outlined and the architecture is a mere translation of the
modified algorithm expression. The designs are simple and use identical
modules for diverse operations like addition and multiplication. These designs
are easy and efficient to implement in VLSI and are cascadable to any size
problem, without timing problems.
These designs, however, do not lead to an improvement in speed as
compared to the word-oriented machines. Any improvement in speed must come
from parallel processing of individual bits or bit-slices, which would be a
compromise between word-oriented and bit-sequential machines. This could be one
area of intensive research as a possible extension of this work. Further work
could be done on designing a system supporting these designs, which would lead
to a bit-sequential machine.
With the right peripheral support and speed enhancements through bit-slice
processing, the ideas presented in this thesis could lead to very sophisticated
and cost-effective designs in VLSI.
References 
[1] Kai Hwang and Faye A. Briggs, "Computer Architecture and Parallel
Processing," McGraw-Hill, New York, 1984.
[2] J.B. Dennis, "Data Flow Computers," IEEE Computer Magazine, pp.
48-56, November 1980.
[3] J.R. Rice, "Matrix Computations and Mathematical Software,"
McGraw-Hill, New York, 1981.
[4] Kai Hwang, S.P. Su and L.M. Ni, "Vector computer architecture and
processing techniques," in Advances in Computers, vol. 20, Academic,
New York, 1981.
[5] S.Y. Kung, "VLSI Array Processors," IEEE ASSP Magazine, pp.
4-22, July 1985.
[6] Sun-yuan Kung, K.S. Arun, Ron J. Gal-Ezer and D.V. Bhaskar Rao,
"Wavefront Array Processor: Language, Architecture, and Applications,"
IEEE Trans. on Computers, vol. C-31, pp. 1054-1063, November
1982.
[7] S.Y. Kung, "On Supercomputing with Systolic/Wavefront Array
Processors," Proceedings of the IEEE, vol.7, January 1984.
[8] H.T. Kung, "Why Systolic Architectures?," Computer, vol. 15, No. 1,
pp. 37-46, January 1982.
[9] H.T. Kung, "The structure of parallel algorithms," in Advances in
Computers, vol. 19, Ed.: Marshall C. Yovits, pp. 65-112, Academic
Press, 1980.
[10] H.T. Kung and C.E. Lieserson, "Systolic Arrays (for VLSI)," in Sparse
Matrix Symposium, SIAM, pp. 256-282, December 1983.
[11] I.E. Sutherland and C.A. Mead, "Microelectronics and Computer
Science," Scientific American, vol. 237, No. 3, pp. 210-228, September
1977.
[12] C.D. Thompson, "A Complexity Theory for VLSI," Ph.D. thesis,
Carnegie-Mellon University, Computer Science Department, August
1980.
[13] P. Lieserson, "Area-efficient VLSI computation," MIT Press,
Cambridge, Mass., 1983.
[14] P.B. Denyer, "An introduction to bit-serial architectures for VLSI
signal processing," in VLSI Architecture, Ed.: B. Randell and P.C.
Treleaven, pp. 225-241, Prentice-Hall, 1983.
[15] S. Winograd, "On the number of multiplications necessary to compute
certain functions," Comm. Pure and Appl. Math., vol. 23, pp. 165-179,
1970.
[16] C.M. Fiduccia, "On obtaining upper bounds on the complexity of
matrix multiplication," in Complexity of Computer Computations, Ed.:
R.E. Miller, J.W. Thatcher and J.D. Bohlinger, Plenum Press, New
York, N.Y., 1972.
[17] R.W. Brockett and D. Dobkin, "On the optimal evaluation of a set of
bilinear forms," Proc. of Fifth Annual ACM Symposium on Theory of
Computing, pp. 88-94, Austin, Texas, 1973.
[18] R.C. Agarwal and C.S. Burrus, "Fast one-dimensional digital
convolution by multidimensional techniques," IEEE Trans. Acoust., Speech
and Signal Processing, vol. ASSP-22, pp. 1-10, Feb. 1974.
[19] J.C. Lafan, "Optimal computation of p bilinear forms," Linear Algebra
and Appl., vol. 10, pp. 225-240, 1975.
[20] S. Winograd, "Some bilinear forms whose multiplicative complexity
depends on the field of constants," Math. Syst. Theory, vol. 10, pp.
169-180, 1977.
[21] R.C. Agarwal and J.W. Cooley, "New algorithms for digital
convolution," IEEE Trans. Acoust., Speech and Signal Processing,
vol. ASSP-25, pp. 392-410, Oct. 1977.
[22] S. Winograd, "On computing the discrete Fourier transform," Math.
Comput., vol. 32, pp. 175-199, Jan. 1978.
[23] P. Zellini, "On the optimal computation of a set of symmetric and
persymmetric bilinear forms," Linear Algebra and Appl., vol. 23, pp.
101-119, 1979.
[24] E. Feig, "Certain systems of bilinear forms whose minimal algorithms
are all quadratic," J. Algorithms, vol. 4, pp. 137-149, 1983.
[25] I.N. Chen and R. Willoner, "An O(n) parallel multiplier with
bit-sequential input and output," IEEE Trans. Comput., vol. C-28, pp.
721-727, Oct. 1979.
[26] R. Gnanasekaran, "On a bit-serial input and bit-serial output
multiplier," IEEE Trans. Comput., vol. C-32, pp. 878-880, Sept.
1983.
[27] N.R. Strader and V.T. Rhyne, "A canonical bit-sequential multiplier,"
IEEE Trans. Comput., vol. C-31, pp. 791-795, Aug. 1982.
[28] R. Gnanasekaran, "A fast serial-parallel binary multiplier," IEEE
Trans. Comput., vol. C-34, pp. 741-744, Aug. 1985.
[29] LSI Logic Corporation, "CMOS Macrocell Manual," LSI Logic Corp.,
Milpitas, CA, Sept. 1984.
[30] Lennart Johnsson and Danny Cohen, "A mathematical approach to
modelling the flow of data and control in computational networks,"
in VLSI Systems and Computations, Ed.: H.T. Kung, Bob Sproull and
Guy Steele, pp. 213-224, Computer Science Press, Inc., 1981.
[31] D.E. Knuth, J.H. Morris and V.R. Pratt, "Fast pattern matching in
strings," SIAM Journal of Computing, Vol. 6, No. 2, pp. 323-350,
June 1977.
[32] M.J. Foster and H.T. Kung, "The design of special purpose VLSI
Chips," Computer, Vol. 13, No. 1, pp. 26-40, January 1980.
[33] M.J. Fischer and M.S. Paterson, "String matching and other
products," Massachusetts Institute of Technology, Project MAC,
Technical Report 41, 1974.
[34] R.S. Boyer and J.S. Moore, "A fast string search algorithm," Comm.
ACM, Vol. 20, No. 10, pp. 762, October 1977.
Vita 
Neeraj Tewari was born to Mr. and Mrs. M.S. Tewari in May '63. After
graduating from St. Joseph's High School, Nainital, he went on to get a
Bachelor of Technology degree in Electrical Engineering from the Indian
Institute of Technology, Kanpur, in May 1984. During this period he received an
award for academic excellence and the best project award for his senior year
project titled "Band selectable low frequency spectrum analyzer".
Subsequently, he was enrolled as a fellow, for a Master of Science degree, in
Computer Science and Electrical Engineering at Lehigh University, Bethlehem.
After his graduation in May 1986, he is going to be a Member of
Technical Staff at M/A-COM DCC, in Germantown, Maryland, working on packet
switching systems for satellite communications.
His professional interests are in computer system architecture, parallel
processing and telecommunication networks. His other interests include flying,
mathematics and classical literature and poetry.
