Design of low-complexity pipeline digital systems by Rao, Anita
Lehigh University
Lehigh Preserve
Theses and Dissertations
1999
Recommended Citation
Rao, Anita, "Design of low-complexity pipeline digital systems" (1999). Theses and Dissertations. Paper 632.
Rao, Anita
Design of Low-Complexity Pipeline Digital Systems
January 2000
Design of Low-Complexity Pipeline Digital Systems
by
Anita Rao
A Thesis
Presented to the Graduate and Research Committee
of Lehigh University
in Candidacy for the Degree of
Master of Science
in
Computer Engineering
Lehigh University
June 1, 1999

Acknowledgements
I am indebted to my mentor and thesis advisor, Dr. Meghanad Wagh, who has instructed me,
stimulated my research and encouraged my learning. Every one of our meetings left me
highly motivated. This thesis, which represents the culmination of all that I have learnt in the
course of my study at Lehigh University, was made possible by his guidance and patience. I
would like to acknowledge here his breadth as well as depth of knowledge.
I am grateful to Zhenyu Zhu of the Image Processing and Pattern Analysis Lab for providing me
with material that was helpful in preparing the applications of the model.
I also wish to acknowledge the support of my husband, Venkatesh Rao, his valuable
feedback on this thesis and his help in the preparation of this document.
Contents
Abstract
1 Introduction
2 Base Module and An Example
2.1 Example
2.2 Modules: Registers and Adders
2.3 Modeling and Analysis
2.3.1 Properties of Generator Polynomial
2.4 Module Design Algorithms
2.4.1 Algorithm 1 (Factor $(1 + x^k)$)
2.4.2 Algorithm 2 (Complete Factorization)
2.5 Architecture Design
2.6 Implementation Issues
3 Applications of the Model
3.1 Running Sum of 16 Numbers
3.2 Running Maximum
3.3 Digital Filters
3.4 Extrapolation
4 Generalizations and Extensions
4.1 Design of Non-Conforming Systems
4.1.1 Heuristic Algorithms
4.2 Design and Implementation of 2-D Applications
4.2.1 The Laplacian Filter
5 Conclusion
List of Figures
1.1 Example for Systolic Architecture: Matrix Multiplication
1.2 Example for Systolic Architecture: Matrix Multiplication
2.1 A bad implementation of $y_i = x_i + 3x_{i-1} + 3x_{i-2} + x_{i-3}$
2.2 Another Implementation of $y_i = x_i + 3x_{i-1} + 3x_{i-2} + x_{i-3}$
2.3 Our Implementation of $y_i = x_i + 3x_{i-1} + 3x_{i-2} + x_{i-3}$
2.4 Basic Problem
2.5 Base Module
2.6 Propagation through a single module
3.1 Running Sum: A General Implementation
3.2 Running Sum: Our Implementation
3.3 Running Maximum
3.4 Running Maximum Module
3.5 Sobel Operator
3.6 Extrapolation
3.7 Implementation
4.1 Row and Column Modules for Laplacian Filter
4.2 Implementation of Laplacian Filter
Abstract
This thesis introduces a novel architecture suitable for VLSI implementation for linear as well as
nonlinear systems. The simple design techniques can be applied to diverse applications to obtain
architectures with minimal hardware and optimal time complexities.
In order to facilitate the modeling and design of these systems, the concept of a Generator
Polynomial is introduced. An analysis of the generator polynomial and its salient properties is
presented so as to aid in the design methodology. The architecture consists of several identical
modules connected serially. The internal hardware of each module depends on the specific
application. Two algorithms have been derived to design the hardware and the architecture
structure.
To illustrate the wide potential of this model, the design and implementation of practical
applications have been incorporated into the thesis. For applications that do not directly embrace
the technique, heuristic algorithms have been proposed. Generalization of the concept to 2-D
applications has also been introduced. It is believed that this is an ideal first step toward the
design and implementation of powerful, low-complexity pipeline architectures.
Chapter 1
Introduction
Very large-scale integrated (VLSI) circuits have permeated the electronics industry in
the form of consumer and commercial chips such as microprocessors, memory, digital signal
processors, and embedded controllers. These mass-market chips have found their way into an
astounding range of consumer products, from computers down to simple appliances. Another
wave of VLSI chips now becoming prevalent in many commercial ventures is the
application-specific integrated circuits (ASICs). These are designed for specialized applications
and produced in much lower quantities than the previously described chips.
The exponential advances in VLSI technology have made complex systems possible on single
chips. For example, it is now possible to implement a simple RISC core on 1 square millimeter.
It is projected that, with the ongoing phenomenal progress, logic implementations with several
hundreds of millions of transistors will soon be integrated on a single IC. In the case of memory
chips, the density is expected to be even higher.
The major advantages of a "system on a chip" are highly increased speed (because of short
interconnects), low cost (because of high throughput) and greater applicability (because of fewer
ICs, smaller real estate and power). Further, there is high throughput since concurrent processes
are possible.
Figure 1.1: Example for Systolic Architecture: Matrix Multiplication
However, to realize these advantages, VLSI designs need to be modular, with each module
individually optimized; they also need to be repetitive so that the layout is area efficient and
layout time is optimized. Unfortunately, the design is therefore deprived of its flexibility. Added
to all these is the constraint that data paths should not cross each other.
An important class of architectures that exploits the recent trends in VLSI technology is the
systolic architectures [1], introduced by H. T. Kung and Charles Leiserson in 1978. A systolic
architecture is an arrangement of processors in an array (often rectangular) where data flows
synchronously across the array between neighbors. Each processor is a Moore machine and data
is exchanged only between adjacent processors. In this architecture, alternate processors are
clocked together.
A processor at each step takes in data from one or more neighbors (e.g. top and left), processes
it and, in the next step, outputs results in the opposite direction (downward and right).
Thus systolic computation employs both pipelining and parallel computation efficiently. Because
of its modularity and regularity, it is ideal for VLSI implementation. Finally, because each
processor communicates only with its neighbors and follows the Moore model, the clock speed
of a systolic architecture is independent of the size of the data sequence.
An example of a systolic algorithm is matrix multiplication. One matrix is fed in a row
at a time from the top of the array and is passed down the array; the other matrix is fed in a
column at a time from the left hand side of the array and passes from left to right. Dummy
values are then passed in until each processor has seen one whole row and one whole column. At
this point, the result of the multiplication is stored in the array and can now be output a row or
a column at a time, flowing down or across the array.
As illustrated in Fig 1.1, each cell (P1, P2, P3) executes just one instruction: multiply the top and
bottom inputs, add the left input to the product just obtained, and output the final result to the
right. The cells are simple with just one adder and a few registers.
The order in which one feeds input into the systolic array is very important. At time $t_0$, the
array receives l, a, p, q, and r (the other inputs are all zero). At time $t_1$, the array receives m,
d, b, p, q, and so on. Results emerge after 5 steps.
Systolic arrays have traditionally provided efficient, high performance execution for
computation-intensive applications. The idea is to exploit VLSI efficiently by laying out algorithms
(and hence architectures) in 2-D (not all systolic machines are 2-D, but probably most are). The
architectures thus produced are not general but tied to specific algorithms.
However, this is good for computation-intensive tasks but not for I/O-intensive tasks such as
signal processing. Most designs are simple and regular in order to keep the VLSI implementation
costs low. Programs with simple data and control flow are best. Hence we have developed a new
architecture which addresses both areas: computation-intensive as well as I/O-intensive
applications. It is not necessary to flush all the buffers and refill them with data when the
processing sequences are very long. For example, Fig 1.2 shows linear convolution implemented with
systolic arrays. It can be seen that for very long sequences of data, the modules have to be
flushed and refilled.
Figure 1.2: Example for Systolic Architecture: Matrix Multiplication
(Each module is a multiplier and accumulator with weight $W$: $Y_{out} = Y_{in} + W X_{in}$, $X_{out} = X_{in}$.)
In this thesis, it is shown that many problems could be solved using much lower hardware and
time complexities. The architecture has pipelining and the partial data actually moves through
cascaded units reducing computation time and the amount of hardware.
For example, using this strategy, it is possible to find a running maximum of every 16 consecutive
numbers in a sequence. By using only four adders and nineteen registers, one can find each
maximum within time equal to one adder delay.
A Laplacian filter used in image processing may likewise be implemented. The hardware
complexity for this is a mere four adders and ten registers. The time to process each set of data
is just one adder delay. Similarly, point detection and line detection filters can be realized
using this strategy, requiring four adders and eleven registers, and six adders and fifteen
registers, respectively.
Another example of where this idea may be applied is numerical integration which uses a linear
combination of data. The hardware complexity for using the trapezoidal rule is just two adders
and four registers.
Chapter 2 of this thesis formulates the problem and models the architecture. All the theoretical
results necessary to obtain final implementation are derived here. We also provide implemen-
tation of an example application which clearly illustrates the superiority of our technique over
others. Chapter 3 is devoted to four diverse applications. In particular, we derive architecture
for Running Sum, Running Maximum, Sobel Operator and Extrapolation. The hardware and
time complexities of each architecture are also discussed. Chapter 4 deals with generalizations
including one to 2-D applications. Finally, Chapter 5 summarizes the conclusions and presents
avenues for further research.
Chapter 2
Base Module and An Example
In the previous chapter, we described how our architecture is faster and reduces hardware
requirements. Beginning with an example, we illustrate this point further. Any design needs
specific rules to follow in order to realize it. In this chapter, we lay the groundwork
for designing architectures using our method. The theory behind the results is explained
and substantiated. We introduce the base module and the generator polynomial,
which are the building blocks of our architecture. This chapter also gives the algorithms one
would require to design the modules for a particular problem. Another important aspect dealt
with here is the translation of the design method into the architecture. Issues that come up
during this mapping are discussed.
2.1 Example
Suppose $y_i = x_i + 3x_{i-1} + 3x_{i-2} + x_{i-3}$. This problem may be implemented in a number of
ways. We show how, with our design, low hardware complexity may be achieved.
Figure 2.1: A bad implementation of $y_i = x_i + 3x_{i-1} + 3x_{i-2} + x_{i-3}$
Figure 2.2: Another Implementation of $y_i = x_i + 3x_{i-1} + 3x_{i-2} + x_{i-3}$
One possible implementation:
Fig 2.1 shows one of the ways in which the stated problem may be realised. It uses five adders and
seven registers.
Another implementation:
Fig 2.2 shows another implementation of the stated problem. It uses just three adders as
compared to five in the previous case, but the number of registers is nine. The greatest
drawback is the presence of the multiplier in this architecture, which increases hardware and
computation time.
Our Design
Our architecture for the same problem, shown in Fig 2.3, is optimum for hardware as well as
computational time. It uses three adders and 6 registers, and a result is available every
adder time.
In this design, we use cascaded adders so that the solution propagates in a pipeline fashion. For
the first adder, one input is the data $x_i$ and the second input is $x_i$ delayed by 1 unit. The
output of this adder and its delayed version are the inputs to the second adder, and so on.
In this particular implementation, the inputs to the first adder are $x_i$ and $x_{i-1}$. Therefore the
output of the first adder is $x_i + x_{i-1}$. Hence, the second adder sums the terms $x_i + x_{i-1}$ and
$x_{i-1} + x_{i-2}$. The output of this is $x_i + 2x_{i-1} + x_{i-2}$. Propagating through the third adder in
the same fashion, the final output is $y_i = x_i + 3x_{i-1} + 3x_{i-2} + x_{i-3}$.
The last implementation uses just three adders and five registers. And it may be pointed out
that one clock cycle of this architecture is equivalent to one adder time.
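The cascade described above can be checked with a small software model. The following is a minimal sketch, assuming every register starts at zero; the function names are hypothetical and only mimic the data flow, not the hardware timing.

```python
def cascade_three_adders(x):
    """Simulate three cascaded modules; each adds its input to a 1-step delayed copy."""
    signal = list(x)
    for _ in range(3):                       # three identical modules
        delayed = [0] + signal[:-1]          # one register holds the previous value
        signal = [s + d for s, d in zip(signal, delayed)]
    return signal


def direct_form(x):
    """Reference: y_i = x_i + 3x_{i-1} + 3x_{i-2} + x_{i-3}, with zeros before the start."""
    pad = [0, 0, 0] + list(x)
    return [pad[i + 3] + 3 * pad[i + 2] + 3 * pad[i + 1] + pad[i]
            for i in range(len(x))]


samples = [5, -2, 7, 1, 0, 4, 3]             # arbitrary illustrative data
impulse_response = cascade_three_adders([1, 0, 0, 0, 0])
```

Feeding a unit impulse through the cascade reproduces the coefficients 1, 3, 3, 1, which is a convenient way to confirm that the pipeline realises the intended equation.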
2.2 Modules: Registers and Adders
In this thesis, we will study architectures which look similar to that of the last implementation.
One can notice that there is a basic module which is repeated three times to provide the needed
result.
One may observe that using this module yields a large degree of flexibility towards solving a
problem. There may be a few different ways in which to group the modules. Some methods may
require additional adders, and others may require additional delay elements. It is interesting to
note that the requirements of a delay flip-flop (FF) and a full adder (FA) are comparable [2],
thus giving not much reason to prefer one method to the other. A delay FF would require 22
transistors and 16 grids, while an FA would require 24 transistors and 15 grids.
Figure 2.3: Our Implementation of $y_i = x_i + 3x_{i-1} + 3x_{i-2} + x_{i-3}$
Figure 2.4: Basic Problem (input $x(i)$, SYSTEM, output $y(i)$)
2.3 Modeling and Analysis
Consider a system which has input $x(i)$ and output $y(i)$ as shown in Fig 2.4. Design of such
a system is often a complicated task. For a linear system, $y(i)$ may be expressed as a
linear convolution of the input $x(i)$ and the system impulse response. For nonlinear
systems, designing the hardware becomes an especially involved process. In this section, we show
how we approach this by modeling the problem. We write the output as:
$$y(i) = \Xi_{j}\; x(i-j)\,c_j$$
where $\Xi$ is some arbitrary linear or nonlinear operation and the $c_j$'s are integers. We will
assume that $\Xi$ is associative and commutative. However, multiplication may not be distributive
over $\Xi$, and therefore we require a mapping of the problem so that the system can be implemented.
Examples of $\Xi$ are addition, subtraction, maximum, minimum, parity, weights, etc.
In order to study these systems, we use the concept of a generator polynomial represented by
$$g_n(z) = \sum_{i=0}^{n} c_i z^i$$
Figure 2.5: Base Module
We will show that for certain forms of the generator polynomial, a system can be implemented
in a modular fashion with minimized hardware and simultaneously optimal time. All modules
are identical and their functionality mimics $\Xi$. Fig 2.5 shows one such module. The overall
architecture, on the other hand, will have a structure based only upon the generator polynomial.
We now present the desirable properties [3][4] of such a generator polynomial.
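As a worked instance, the example of Section 2.1 can be written in this notation. With addition as the operation and coefficients $(c_0, c_1, c_2, c_3) = (1, 3, 3, 1)$, the generator polynomial is

$$g_3(z) = 1 + 3z + 3z^2 + z^3 = (1 + z)(1 + z)(1 + z)$$

Each of the three factors $(1 + z)$ corresponds to one adder with a unit delay, which matches the three cascaded modules of the design in Section 2.1.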
2.3.1 Properties of Generator Polynomial
P1. Roots on Unit Circle The generator polynomial $g_n(z) = \sum_{i=0}^{n} c_i z^i$ has all its roots
on the unit circle.
Proof (by induction): Consider the first stage of an architecture where the module looks like the
one in Fig 2.5. If the delay is $k$ units and $x(i)$ is the input, then the output will be
$$y(i) = x(i) \oplus x(i-k)$$
and the polynomial corresponding to this output will be
$$g(z) = 1 + z^k$$
Thus for stage 1, the roots of $g(z)$ are on the unit circle.
We assume that the generator polynomial at an intermediate stage $n$ of the architecture has its
roots on the unit circle, i.e. $\sum_{i=0}^{N} c_i x^i = \prod_{i=1}^{n} (1 + x^{d(i)})$. The degree of
the polynomial is $N = \sum_{i=1}^{n} d(i)$. Also, it is assumed that $c_0 = 1$. The polynomial may be
represented as $x(i) + c_1 x(i+1) + c_2 x(i+2) + \cdots$.
Let $y(i) = \sum_{t=0}^{N} c_t x(i+t)$. Consider $y(i)$ propagating through one module with a delay
$d(n+1)$ as shown in Fig 2.6. The output of the adder will be
$$\sum_{t=0}^{N} c_t\, x(i+t) + \sum_{t=0}^{N} c_t\, x(i - d(n+1) + t)$$
which, after re-indexing the output time by $d(n+1)$, reads
$$\sum_{t=0}^{N} c_t\, x(i + d(n+1) + t) + \sum_{t=0}^{N} c_t\, x(i+t) = \sum_{t=0}^{N+d(n+1)} c'_t\, x(i+t), \quad \text{where } c'_t = c_t + c_{t-d(n+1)}$$
Note that $c_t = 0$ if $t < 0$ or $t > N$.
Now, the new representation of the polynomial is
$$\begin{aligned}
\sum_{i=0}^{N+d(n+1)} c'_i z^i &= \sum_{i=0}^{N+d(n+1)} \left(c_i + c_{i-d(n+1)}\right) z^i \\
&= \sum_{i=0}^{N+d(n+1)} c_i z^i + \sum_{i=0}^{N+d(n+1)} c_{i-d(n+1)} z^i \\
&= \sum_{i=0}^{N} c_i z^i + \sum_{i=d(n+1)}^{N+d(n+1)} c_{i-d(n+1)} z^i \\
&= \sum_{i=0}^{N} c_i z^i + \sum_{i=0}^{N} c_i z^{i+d(n+1)} \\
&= \left(1 + z^{d(n+1)}\right) \sum_{i=0}^{N} c_i z^i \\
&= \prod_{i=1}^{n+1} \left(1 + z^{d(i)}\right)
\end{aligned}$$
Since every factor $1 + z^{d(i)}$ has all its roots on the unit circle, so does the product. ∎
Figure 2.6: Propagation through a single module (inputs $y(i)$ and $y(i - d(n+1))$)
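P1. Symmetry The coefficients of the generator polynomial are symmetric, i.e., the coefficient
of $z^i$ is the same as the coefficient of $z^j$ if $i + j$ = degree of the polynomial.
Proof: For a polynomial $g_n(z)$ of degree $m$ to be symmetric, it has to satisfy the condition
$z^m g_n(z^{-1}) = g_n(z)$. Let the generator polynomial of the previous stage be represented by
$$g_{n-1}(z) = \sum_{i=0}^{n-1} c_i z^i$$
and assume that it is symmetric. Using the above representation, it follows that, after a module
with delay $t$,
$$g_n(z) = \sum_{i=0}^{n-1} c_i z^i + \sum_{i=0}^{n-1} c_i z^{i+t} = (1 + z^t)\, g_{n-1}(z)$$
Writing this equation in $z^{-1}$ and multiplying both sides by the highest power $z^{n-1+t}$ in $g_n(z)$,
$$z^{n-1+t} g_n(z^{-1}) = (z^t + 1)\, z^{n-1} g_{n-1}(z^{-1}) = (1 + z^t)\, g_{n-1}(z) = g_n(z)$$
Therefore it follows that the new polynomial is symmetric. ∎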
P2. Power of 2 The sum of coefficients of the generator polynomial is a power of 2, i.e.,
$\sum_{i=0}^{n} c_i = 2^p$ where $p$ is an integer.
Proof: Let us assume that the generator polynomial $g_{n-1}(z) = \sum_{i=0}^{n-1} c_i z^i$ has the sum
of its coefficients equal to a power of 2.
The new polynomial may be written as
$$g_n(z) = (1 + z^t)\, g_{n-1}(z)$$
In the above equation, $g_n(z)$ may be treated as a product of two polynomials. The sum
of the coefficients of $(1 + z^t)$ is 2. It has already been assumed that the sum of coefficients of
$g_{n-1}(z)$ is a power of two, say $2^p$ where $p$ is any integer. Then the sum of coefficients of $g_n(z)$
will be the product of 2 and $2^p$, which is again a power of 2. Hence the sum of coefficients of the
new polynomial is still a power of two. ∎
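The three properties can be checked numerically for any chosen set of module delays. The sketch below is illustrative only: the delays are a hypothetical example, polynomials are stored as coefficient lists with the lowest degree first, and the helper names are not taken from the thesis.

```python
import cmath


def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists (lowest degree first)."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def generator(delays):
    """Expand the product of (1 + z^d) over the given module delays."""
    g = [1]
    for d in delays:
        factor = [1] + [0] * (d - 1) + [1]   # coefficients of 1 + z^d
        g = poly_mul(g, factor)
    return g


def evaluate(g, z):
    """Evaluate the polynomial g at the (complex) point z."""
    return sum(c * z ** i for i, c in enumerate(g))


delays = [1, 2, 3]                           # hypothetical example delays
g = generator(delays)

# The roots of each factor 1 + z^d are z = exp(j*pi*(2m+1)/d), all of modulus 1.
roots = [cmath.exp(1j * cmath.pi * (2 * m + 1) / d)
         for d in delays for m in range(d)]

is_symmetric = g == g[::-1]                  # symmetry property
coefficient_sum = sum(g)                     # equals 2 ** (number of modules)
```

For these delays the expansion is $1 + z + z^2 + 2z^3 + z^4 + z^5 + z^6$: the coefficients read the same in both directions and sum to $2^3$.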
The architecture design is clearly related to the factors of the generator polynomial. The following
section details how the polynomial may be factorized and implemented to facilitate modular
implementation [5]. The complexity of the algorithms proposed is also discussed.
2.4 Module Design Algorithms
The factor $(1 + x^k)$ of a degree $n$ polynomial $P(x)$ can be determined using the following algorithm.
2.4.1 Algorithm 1 (Factor $(1 + x^k)$)
1. (initialization) Let $P(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n$, where $a_i$, $0 \le i \le n$, are
constant coefficients.
2. (computation) Compute constants $b_i$, $0 \le i \le n$, as:
$$b_i = \begin{cases} a_i - b_{i-k}, & i \ge k \\ a_i, & \text{otherwise} \end{cases} \qquad (2.1)$$
3. (checking) If $b_i = 0$ for all $n - k < i \le n$, then $(1 + x^k)$ is a factor of $P(x)$, and $P(x)$ may
be written as $P(x) = (1 + x^k) P_1(x)$ where
$$P_1(x) = \sum_{i=0}^{n-k} b_i x^i$$
Proof: Polynomial $P(x)$ of degree $n$ may be written as $\sum_{i=0}^{n} a_i x^i$. If $(1 + x^k)$ is a
factor of $P(x)$, then
$$(1 + x^k) \;\Big|\; \sum_{i=0}^{n} a_i x^i$$
Thus, $\sum_{i=0}^{n} a_i (-1)^{\lfloor i/k \rfloor} x^{i \bmod k} = 0$.
The LHS may be written as
$$\sum_{t=0}^{k-1} \sum_{j=0}^{\lfloor n/k \rfloor} a_{t+kj} (-1)^j x^t = \sum_{t=0}^{k-1} \sum_{j=0}^{q} a_{t+kj} (-1)^j x^t$$
The degree of the polynomial, $n$, may be expressed as $n = qk + t$ where $q$ and $t$ are integers and
$0 \le t < k$. Hence, for the last $k$ terms in $P_1(x)$,
$$b_{n-t} = a_{n-t} - b_{n-t-k} = a_{n-t} - (a_{n-t-k} - b_{n-t-2k}) = \sum_{j=0}^{q} a_{n-t-kj} (-1)^j$$
∎
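Algorithm 1 can be sketched in a few lines of code. This is a minimal illustration, assuming the recurrence $b_i = a_i - b_{i-k}$ for $i \ge k$ and $b_i = a_i$ otherwise; the function name is hypothetical, and coefficients are listed from the lowest degree upward.

```python
def factor_one_plus_xk(a, k):
    """Test whether (1 + x^k) divides P(x) = sum of a[i] * x^i.

    Returns the quotient coefficients if it does, otherwise None.
    """
    n = len(a) - 1
    if not 1 <= k <= n:
        return None
    b = []
    for i in range(n + 1):
        # Recurrence of step 2: subtract the value computed k positions earlier.
        b.append(a[i] - b[i - k] if i >= k else a[i])
    # Step 3: (1 + x^k) is a factor only if the last k values are all zero.
    if any(b[n - k + 1:]):
        return None
    return b[:n - k + 1]


quotient = factor_one_plus_xk([1, 3, 3, 1], 1)   # (1 + x)^3 divided by (1 + x)
```

The loop performs one subtraction for each index from $k$ to $n$, that is $n + 1 - k$ operations per tested factor.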
The number of terms in the polynomial is $(n + 1)$. The computation of $P(x)$ when $x^i = 1$ is
expanded and shown below:
$$\begin{array}{l}
a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n \\
a_i + a_{i+1} x + a_{i+2} x^2 + \cdots + a_{i+n} x^n \\
a_{2i} + a_{2i+1} x + a_{2i+2} x^2 + \cdots + a_{2i+n} x^n \\
a_{3i} + \cdots
\end{array}$$
Notes:
For each $(1 + x^k)$ that is tested as a factor, the number of additions involved is $(n + 1 - k)$,
where $1 \le k \le n$. The number of terms in the polynomial is $(n + 1)$.
Suppose $k = 1$. Then $x^k = x$. Therefore, all the terms in $P(x)$ add together, making the total
number of additions = total number of terms in $P(x)$ - 1, which is $(n + 1 - 1) = n$. If $k = 2$,
then $\lceil (n+1)/2 \rceil$ terms would be of the form $cx^0$ and the rest of the form $dx^1$, where
$c$ and $d$ may be any integers. So the number of additions in each set will be (number of terms in
set - 1). On the whole, the number of additions is given by total number of terms in $P(x)$ - 2 = $(n - 1)$.
Hence, it may be observed that for some $k$, the number of sets is $k$, and each set of about
$(n+1)/k$ terms would add together, totaling $(n + 1 - k)$ additions.
A degree n polynomial may be factorized completely with the following algorithm.
2.4.2 Algorithm 2 (Complete Factorization)
1. (initialization) Let $P_1(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n$, where $a_i$, $0 \le i \le n$, are
constant coefficients, and let $(1 + x^{k_1})$ be a factor of $P_1(x)$.
2. (quotient representation) From the previous step,
$$P_1(x) = (1 + x^{k_1})\, P_2(x)$$
where $P_2(x)$ may be written as $b_0 + b_1 x + b_2 x^2 + \cdots + b_{n-k_1} x^{n-k_1}$.
3. (iterative computation) Using the Factor $(1 + x^k)$ Algorithm, a factor of the form $(1 + x^{k_2})$
may be found for $P_2(x)$. This process is continued until $P_1(x)$ is reduced to the form
$(1 + x^{k_1})(1 + x^{k_2}) \cdots P_j(x)$, where $P_j(x)$ is either $(1 + x^{k_j})$ or cannot be reduced
further using Algorithm 1.
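Proof (by induction): Let $Q(x) = b_0 + b_1 x + \cdots + b_{n-k} x^{n-k}$ be the quotient when $(1 + x^k)$
divides $P(x)$. Equating the constant terms of $P(x) = (1 + x^k)\,Q(x)$ gives, since $k \ge 1$,
$$b_0 = a_0$$
Let us assume that it is true for $b_i$. Then for the next term,
$$b_{i+1} = \begin{cases} a_{i+1} - b_{i+1-k}, & i + 1 \ge k \\ a_{i+1}, & \text{otherwise} \end{cases} \qquad (2.2)$$
∎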
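A possible software rendering of Algorithm 2 is sketched below. It repeatedly applies the factor test and, as an assumption that goes beyond the text, backtracks when a factor that was pulled out first leaves a quotient that cannot be reduced further. Only factors of the form $(1 + x^k)$ are handled, so a factor such as $(1 - x^k)$ would need an extra case. All names are hypothetical.

```python
def divide(a, k):
    """Quotient of P(x) by (1 + x^k), or None if the division leaves a remainder."""
    n = len(a) - 1
    if not 1 <= k <= n:
        return None
    b = []
    for i in range(n + 1):
        b.append(a[i] - b[i - k] if i >= k else a[i])
    return None if any(b[n - k + 1:]) else b[:n - k + 1]


def complete_factorization(a, smallest=1):
    """Return exponents [k1, k2, ...] with P(x) = product of (1 + x^k), or None."""
    if a == [1]:
        return []                                   # nothing left to factor
    for k in range(smallest, len(a)):
        q = divide(a, k)
        if q is not None:
            rest = complete_factorization(q, k)     # keep exponents non-decreasing
            if rest is not None:
                return [k] + rest
    return None                                     # no factorization of this form


running_sum_factors = complete_factorization([1] * 16)
```

For the 16-term polynomial $1 + x + \cdots + x^{15}$ the search returns the exponents 1, 2, 4 and 8, while $1 + x^3$ is kept as a single factor because dividing out $(1 + x)$ first leads to a dead end.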
Proposition:
Factoring a degree $n$ polynomial $(1 + x^{i_1})(1 + x^{i_2})(1 + x^{i_3}) \cdots (1 + x^{i_t})$ requires exactly
$$\sum_{j = n+1-i_1-i_2-\cdots-i_t}^{n} (j)$$
additions or subtractions.
Proof (by induction): Let $i_1 \le i_2 \le \cdots \le i_t$. For the first factor, we test $(1 + x)$, which
amounts to one addition. If $(1 + x)$ is indeed a factor, then since $i_1 = 1$, the number of additions
required was $\sum_{j=n+1-i_1}^{n} (j) = n$. For the second factor, first $(1 + x)$ is tested and then
$(1 + x^2)$ is tested. If this satisfies as a factor, then the required number of additions is
$\sum_{j=n+1-i_1-i_2}^{n-i_1} (j)$. Hence the total number of additions for determining factors one and two
is:
$$\text{Total number of additions} = \sum_{j=n+1-i_1}^{n} (j) + \sum_{j=n+1-i_1-i_2}^{n-i_1} (j) = \sum_{j=n+1-i_1-i_2}^{n} (j)$$
Now, let us assume that the proposition is true for factors determined up to $i_t$.
Then, if the next factor is $(1 + x^{i_{t+m}})$, the number of additions/subtractions required is
$$\sum_{j=n+1-i_1-\cdots-i_t}^{n} (j) + (n + 1 - i_1 - \cdots - i_t - i_{t+1}) + \cdots + (n + 1 - i_1 - \cdots - i_t - i_{t+m}) = \sum_{j=n+1-i_1-\cdots-i_t-i_{t+m}}^{n} (j)$$
∎
2.5 Architecture Design
Up to this point, the theory suggests what has to be done for a given problem. The
problem is written in terms of a pertinent polynomial, which is then subjected to the two algorithms
proposed in the previous section. Still, the translation of the result of these algorithms into the
final architecture remains to be discussed. We do so in this section.
From the second algorithm proposed in the previous section, all the factors are extracted. These
factors are the key to what the architecture will look like.
Consider a problem which has been written in the form of a polynomial
$P(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n$. When the algorithms Factor $(1 + x^k)$ and Complete
Factorization are applied, $P(x)$ reduces to $(1 + x^{k_1})(1 + x^{k_2}) \cdots (1 + x^{k_j})$. From these
factors, we can say what the architecture looks like. The number of adders in the design is the number
of factors obtained from the second algorithm; that is, each factor translates to an adder. The delay
units depend on the power to which the $x$ terms in the factors are raised. For example, for the first
adder, the delay would be equal to $k_1$, taken from the first factor $(1 + x^{k_1})$, the second adder's
delay would be $k_2$, and so on.
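The mapping from factors to hardware can be illustrated with a short model. This is a sketch under the assumption that each factor $(1 + x^k)$ becomes one adder fed by $k$ registers in series and that all registers start at zero; the function names and the dictionary layout are hypothetical.

```python
def describe_architecture(exponents):
    """Count the adders and delay units implied by the factor exponents."""
    return {"adders": len(exponents), "delay_units": sum(exponents)}


def run_pipeline(x, exponents):
    """Feed the samples x through the cascaded modules, one module per factor."""
    signal = list(x)
    for k in exponents:
        delayed = ([0] * k + signal)[:len(signal)]   # k registers in series
        signal = [s + d for s, d in zip(signal, delayed)]
    return signal


exponents = [1, 2]                                   # factors (1 + x)(1 + x^2)
hardware = describe_architecture(exponents)
impulse_response = run_pipeline([1, 0, 0, 0, 0], exponents)
```

The impulse response of the cascade lists the coefficients of the expanded polynomial, here $1 + x + x^2 + x^3$, so it can be compared directly with the original problem statement.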
2.6 Implementation Issues
Hardware and Time Complexities
The hardware reduces drastically with our architecture. This is because partial
results are moved into the next cascaded stage, and reusing the previously computed results saves
hardware and time. The hardware complexity is determined by the factors to which the problem
has been reduced. The number of adders and delay units required to implement the problem
define the hardware complexity. If $P(x)$ is factorizable in unitary factors, then the total delays
required in its implementation = degree of $P(x)$ and the number of adders = the number of
factors. The time to produce each output depends only on the adder time; it is exactly equal to one
adder time, and this gives the time complexity.
Complete Factorization
The complexity involved in the design is reduced when the pertinent equation has been fully
factorized. As discussed in Section 2.4 of this chapter, the Complete Factorization algorithm
is simple in that it can be applied and the result obtained very fast.
Subtractors
In some cases, one or more factors of the polynomial may be of the form $(1 - x^k)$. Then the
module corresponding to this factor will be a subtractor and not an adder. The architecture
complexity does not differ in this case and the time also remains very much the same.
Chapter 3
Applications of the Model
In the previous chapter, the theory of the generator polynomial that fits into the architecture was
developed. It may look limited in scope, but as illustrated here, the applications of our architecture
span many different areas. The first section shows how our technique may be used to solve an age-old
but very simple problem: the Running Sum. This is a linear application and has been solved
by many people in different ways. Using our method, we show that there is a drastic reduction
in hardware and time complexities. The second section deals with an interesting problem: the
Running Maximum. Again, the simplicity of the design solution shows how potent our technique
is. In the next section, we deal with a more particular application in Image Processing. Filtering
and working on images is a slightly more complex process. The Sobel Operator is extensively
used in Spatial Filtering. Though the function looks like it may involve a complex architecture, it
really works out simple using our technique. Moving on to a more involved application in the last
section, we show how something like extrapolation fits into the realm of our design.
Figure 3.1: Running Sum: A General Implementation
3.1 Running Sum of 16 Numbers:
Consider calculating the running sum of $2^n$ data points. The first step in the design process
would be to translate the data into a polynomial, say $P(x)$. Referring to Section 2.4 of Chapter 2,
algorithms Factor $(1 + x^k)$ and Complete Factorization are applied on this polynomial. Once all
the factors have been determined, the number of adders as well as the delay units required are
determined by observation. Now, for $2^n$ data values, the polynomial can be reduced to $n$
factors and will look like this:
$$P(x) = 1 + x + x^2 + \cdots + x^{2^n - 1} = (1 + x)(1 + x^2)(1 + x^4) \cdots (1 + x^{2^{n-1}})$$
Hence, since there are $n$ factors, the number of adders in the architecture will be $n$ and the total
number of delay registers used will be $2^n - 1$.
As an example, consider calculating the running sum of 16 data points. A general implementation
would use the Divide and Conquer algorithm as shown in Fig 3.1. All the 16 values are first
latched into registers. They are then summed in pairs in the next stage, which leaves 8 values.
These in turn are summed in pairs, and so on until the final sum is arrived at. This particular
architecture requires 15 adders and 30 registers for it to function, and the hardware complexity
is O(n).
Now, if our architecture is applied to solving the same problem, there is a drastic reduction in
hardware complexity. The number of adders required is a mere 4, and the number of registers 14, as
compared to 15 adders and 30 registers in the previous implementation. The hardware complexity is now
O(log n).
Each module shown is an adder. The number of registers is indicated by the number of delays
used. Each module sums the value available to it and its delayed version. At the end of the first
module, the sum of two numbers is available. The second module then sums four data points. The
third stage outputs the sum of eight values. Hence at the end of the fourth stage, the sum of
sixteen values is obtained.
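The four-stage behaviour described above can be reproduced with a short simulation. This is a minimal sketch, assuming module delays of 1, 2, 4 and 8 (one per factor of the polynomial) and registers that start at zero; the function name and the test data are hypothetical.

```python
def running_sum_16(x):
    """Four cascaded adders with delays 1, 2, 4 and 8 produce a 16-point running sum."""
    signal = list(x)
    for delay in (1, 2, 4, 8):
        delayed = ([0] * delay + signal)[:len(signal)]   # delay line of this module
        signal = [s + d for s, d in zip(signal, delayed)]
    return signal


data = list(range(1, 41))                                # arbitrary illustrative data
out = running_sum_16(data)

# Once 16 samples have entered, each output equals the sum of the last 16 inputs.
window_sums = [sum(data[i - 15:i + 1]) for i in range(15, len(data))]
```

After the first stage every value is a sum of 2 inputs, after the second a sum of 4, then 8, and finally 16, which mirrors the description of the modules.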
3.2 Running Maximum
Running Maximum is an interesting application and it too fits in very well with the generator
polynomial. It is an associative operation, which makes it feasible to carry the results computed
through the cascaded modules. The maximum would be Max{... Max{Max{$x_1$, $x_2$}, Max{$x_3$, $x_4$}} ...}.
The maximum of 16 numbers could be implemented with an architecture similar
to that of the running sum application. Each module would contain a subtractor and logic for
determining the maximum of two values as shown in Fig 3.4. The architecture is shown in Fig
3.3. As there are 16 numbers, which is $2^4$, the architecture comprises 4 modules, and the
total delay units required would be one less than the number of data points, which in this case is 15.
Figure 3.2: Running Sum: Our Implementation
Figure 3.3: Running Maximum (four cascaded Max Modules)
Figure 3.4: Running Maximum Module (inputs $y(i)$ and $y(i - d(n+1))$, output max)
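The same cascade works when the adder in each module is replaced by a maximum. The sketch below assumes delays of 1, 2, 4 and 8 and treats an empty register as minus infinity so that it never wins a comparison; both choices, and the names used, are illustrative assumptions.

```python
def running_max_16(x):
    """Four cascaded max modules with delays 1, 2, 4 and 8."""
    signal = list(x)
    for delay in (1, 2, 4, 8):
        # Before the pipeline fills, the delayed input is treated as "no value".
        delayed = ([float("-inf")] * delay + signal)[:len(signal)]
        signal = [max(s, d) for s, d in zip(signal, delayed)]
    return signal


data = [(7 * i * i + 3 * i) % 31 for i in range(40)]    # arbitrary illustrative data
out = running_max_16(data)

# Reference: maximum over the current sample and up to 15 preceding samples.
reference = [max(data[max(0, i - 15):i + 1]) for i in range(len(data))]
```

Because the maximum is associative and commutative, each stage can combine two partial maxima exactly as the adders combine partial sums.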
3.3 Digital Filters
Another area where applications fit into our architecture nicely is Image Processing. Many of the
digital filters [6] used for image enhancement, detection, etc., can be realised using this model.
One example is the Sobel operator. In spatial filtering, many gradient operators such as the Roberts
Cross, Isotropic, Sobel and Prewitt operators are used. These gradient edge detectors
emphasise regions of high spatial frequency that correspond to edges. Typically, the Sobel operator
is used to find the approximate absolute gradient magnitude at each point in an input grayscale
image. One of the convolution kernels of the Sobel operator [7] has been implemented in Fig 3.5.
It may be observed here that image processing mainly uses filters of dimension $3 \times 3$ through
which the pixels are processed. With this in view, our architecture is ideally suited to
implementing these filters. The $3 \times 3$ pixel points from the image are input serially through the
realised filter and the image is thus processed.
The $3 \times 3$ kernel is written as a function $Y(x) = -X_1 - 2X_2 - X_3 + X_7 + 2X_8 + X_9$. In order
to fit it into the generator equation, $Y(i)$ is written as $-1 - 2X - X^2 + X^6 + 2X^7 + X^8$. Using the
Factor $(1 + x^k)$ and Complete Factorization algorithms developed in Chapter 2, the pertinent equation
factorizes into $(1 + X)(1 + X)(-1 + X^6)$. From the factors it is clear that the architecture has
3 modules: two adders with one delay unit each, and a subtractor with six delay units.
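The factorization can be confirmed by expanding the product, and the three modules can be modelled directly. The sketch below assumes zero-initialised registers and lists coefficients from the lowest power of $X$ upward; the helper names are hypothetical, and no claim is made about the order in which pixels are streamed in.

```python
def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists (lowest degree first)."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def sobel_pipeline(x):
    """Two adder modules with delay 1 followed by a subtractor module with delay 6."""
    signal = list(x)
    for _ in range(2):
        delayed = [0] + signal[:-1]
        signal = [s + d for s, d in zip(signal, delayed)]
    delayed = ([0] * 6 + signal)[:len(signal)]
    return [d - s for s, d in zip(signal, delayed)]     # factor (-1 + X^6)


# Expand (1 + X)(1 + X)(-1 + X^6).
kernel = [1]
for factor in ([1, 1], [1, 1], [-1, 0, 0, 0, 0, 0, 1]):
    kernel = poly_mul(kernel, factor)

impulse_response = sobel_pipeline([1] + [0] * 8)
```

Both the expanded product and the impulse response of the cascade give the coefficients $-1, -2, -1, 0, 0, 0, 1, 2, 1$, matching the pertinent equation.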
Figure 3.5: Sobel Operator
(Problem: $Y(i) = -X_1 - 2X_2 - X_3 + X_7 + 2X_8 + X_9$. Pertinent equation:
$Y(i) = -1 - 2X - X^2 + X^6 + 2X^7 + X^8 = (1 + X)^2(-1 + X^6)$.)
Figure 3.6: Extrapolation (plot of $y(x)$ against $x$ at $x = 0, -1, -2, -3$, with the extrapolated value $c$)
3.4 Extrapolation
To illustrate the varied application potential, we show how this architecture may be used for
extrapolation. Extrapolation is a complex and involved operation to implement, but in this
section we show that our technique reduces it to an extremely simple solution.
We derive the error of extrapolation for an $n$ degree polynomial $y(x)$ and show that
it fits into our generator equation. We then design the architecture for a second degree fit
equation.
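Let us assume that the n degree curve has to be extrapolated for $y_0$. We write an equation that
fits the points $y_{-n}, y_{-n+1}, \ldots, y_{-1}$:
\[ y(x) = \sum_{i=0}^{n-1} c_i x^i \]
Using the Vandermonde matrix $V$, $Y = VC$:
\[
\begin{bmatrix} y(-n) \\ y(-n+1) \\ \vdots \\ y(-1) \end{bmatrix}
=
\begin{bmatrix}
(-n)^0 & (-n)^1 & \cdots & (-n)^{n-1} \\
(-n+1)^0 & (-n+1)^1 & \cdots & (-n+1)^{n-1} \\
\vdots & \vdots & & \vdots \\
(-1)^0 & (-1)^1 & \cdots & (-1)^{n-1}
\end{bmatrix}
\begin{bmatrix} c_0 \\ c_1 \\ \vdots \\ c_{n-1} \end{bmatrix}
\]
Now, error $= y(0) - \sum_{i=0}^{n-1} c_i 0^i$.
$V$ is invertible because $-n, -n+1, \ldots, -1$ are distinct. Therefore,
\[ c_0 = \sum_{i=0}^{n-1} \frac{|V_{-n+i}|}{|V|} \, y(-n+i) \]
where $|V_{-n+i}| = (-n)(-n+1)(-n+2)\cdots(-1)$ except $(-n+i)$; for example,
$|V_{-n}| = (-n+1)(-n+2)\cdots(-1)$.
Therefore, the coefficient of $y(-n+i)$ in $c_0$ is
\begin{align*}
\frac{|V_{-n+i}|}{|V|}
&= \frac{(-n)(-n+1)(-n+2)\cdots(-1) \text{ except } (-n+i)}{(n-i-1)(n-i-2)\cdots(1)\,(-1)(-2)\cdots(-i)} \\
&= \frac{n(n-1)(n-2)\cdots(1)\,(-1)^{n-1}}{(n-i)(n-i-1)(n-i-2)\cdots(1)\,(-1)(-2)\cdots(-i)} \\
&= \frac{n!}{(n-i)!\,i!}\,(-1)^{n-1-i} \\
&= C_i^n\,(-1)^{n-1-i}
\end{align*}
Therefore,
\[ c_0 = -\sum_{i=0}^{n-1} C_i^n\,(-1)^{i-n}\,y(-n+i) \]
\begin{align*}
\text{Error} &= y(0) - c_0 \\
&= C_n^n\,(-1)^{n-n}\,y(-n+n) - c_0 \\
&= \sum_{i=0}^{n} C_i^n\,(-1)^{i-n}\,y(i-n)
\end{align*}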
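The closed form for the error can be sanity-checked numerically. The sketch below is an illustration under stated assumptions: it evaluates $c_0$ by Lagrange interpolation at $x = 0$ (a standard equivalent of solving the Vandermonde system) and compares the result with the binomial expression. The function names are hypothetical.

```python
from fractions import Fraction
from math import comb

def error_coefficients(n):
    """Coefficient of y(i - n) in the error, for i = 0..n.

    (-1)**(n - i) is used because it equals (-1)**(i - n) and keeps the
    exponent non-negative."""
    return [comb(n, i) * (-1) ** (n - i) for i in range(n + 1)]

def c0_by_lagrange(n, y):
    """Value at x = 0 of the polynomial through (x, y(x)) for x = -n..-1."""
    xs = list(range(-n, 0))
    total = Fraction(0)
    for xi in xs:
        weight = Fraction(1)
        for xj in xs:
            if xj != xi:
                weight *= Fraction(0 - xj, xi - xj)
        total += weight * y(xi)
    return total

n = 3
coeffs = error_coefficients(n)

def error(y):
    """Error = sum over i of C(n, i) (-1)^(i - n) y(i - n)."""
    return sum(c * y(i - n) for i, c in enumerate(coeffs))

def quadratic(x):      # degree n - 1: the fit is exact, so the error vanishes
    return 2 * x * x - 5 * x + 7

def cubic(x):          # degree n: the fit is no longer exact
    return x ** 3
```

For $n = 3$ the coefficients are $-1, 3, -3, 1$ applied to $y(-3), y(-2), y(-1), y(0)$, which is the third finite difference.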
Thus the error between the actual and predicted values is seen to fit the generator polynomial
exactly, making it implementable using our technique.
To illustrate this point, let us consider a second degree polynomial $y(x) = ax^2 + bx + c$. In
Fig 3.6, c gives the extrapolated point and y(0) is the actual data point in some second degree
curve. The error, which is y(0) - c, can be calculated by solving the polynomial equation for c
and subtracting it from y(0). It amounts to $y(0) - 3y(-1) + 3y(-2) - y(-3)$.
As shown in the previous illustration, this function is now tested for a fit in the generator
equation as depicted in Fig 3.7. Once it is written in polynomial representation, the Complete
Factorization algorithm is used to determine the fit. From the factors, we can observe that three
subtractor modules with one delay element each are required in the architecture to implement
this function.
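The three-subtractor architecture can be simulated on a sample stream. This is a sketch under assumptions not stated in the text: samples arrive in order of increasing index, registers start at zero, and the names `subtractor_module` and `extrapolation_error` are hypothetical.

```python
def subtractor_module(stream, delay=1):
    """Stage for the factor (1 - X^delay): u[t] - u[t - delay], zero initial state."""
    return [u - (stream[t - delay] if t >= delay else 0)
            for t, u in enumerate(stream)]

def extrapolation_error(stream):
    """Three identical subtractor modules in series, i.e. (1 - X)^3."""
    for _ in range(3):
        stream = subtractor_module(stream)
    return stream

# Samples of an exact second degree curve: the error settles to zero once
# three earlier samples are available.
quadratic_samples = [3 * t * t - 2 * t + 5 for t in range(8)]
errors = extrapolation_error(quadratic_samples)

# Disturbing one sample makes the error at that instant equal the disturbance.
perturbed = list(quadratic_samples)
perturbed[5] += 4
perturbed_errors = extrapolation_error(perturbed)
```

The first three outputs are start-up transients caused by the empty registers and should be ignored.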
Problem: $P(i) = Y_0 - 3Y_1 + 3Y_2 - Y_3$
Pertinent Equation: $P(i) = 1 - 3X + 3X^2 - X^3 = (1 - X)(1 - X)(1 - X)$
Figure 3.7: Implementation
Chapter 4
Generalizations and Extensions
The previous chapter has shown the potential of our technique across several applications, and
Chapter 2 discussed how a system may be fitted into the Generator Polynomial and factorised
completely using the algorithms Factor $(1 + x^k)$ and Complete Factorization. The number of
modules and the module designs are thus determined, which gives rise to the overall architecture.
We have shown that, using this methodology, systems can be implemented with minimal hardware and
optimal time.
The third chapter discusses applications which fit the Generator Equation. These systems can be
modelled such that they are completely factorized into the form of the generator equation. The
Complete Factorization Algorithm is applied successively until the starting degree n polynomial
is exactly of the form $(1 + x^{i_1})(1 + x^{i_2}) \cdots (1 + x^{i_j})$, where
$\sum_{m=1}^{j} i_m = n$.
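The idea of stripping factors of the form $(1 + x^k)$ can be illustrated with a short sketch. The Factor $(1 + x^k)$ and Complete Factorization algorithms themselves are defined in Chapter 2 and are not reproduced here; the code below is only a greedy stand-in that tries the smallest $k$ first and accepts a factor when the remainder is zero, so it may differ from the thesis's algorithms (which may impose further conditions). All function names are hypothetical.

```python
def divide_by_one_plus_xk(poly, k):
    """Divide poly (low-to-high integer coefficients) by 1 + x^k.

    Returns (quotient, remainder)."""
    rem = list(poly)
    deg = len(rem) - 1
    if deg < k:
        return [], rem
    quot = [0] * (deg - k + 1)
    for i in range(deg, k - 1, -1):
        q = rem[i]
        quot[i - k] = q
        rem[i] -= q          # cancel the leading term q * x^i
        rem[i - k] -= q      # subtract the matching q * x^(i - k)
    return quot, rem[:k]

def factor_greedily(poly):
    """Repeatedly strip factors (1 + x^k), smallest k first.

    Returns (list of k values, last quotient)."""
    factors = []
    current = list(poly)
    found = True
    while found and len(current) > 1:
        found = False
        for k in range(1, len(current)):
            quot, rem = divide_by_one_plus_xk(current, k)
            if not any(rem):
                factors.append(k)
                current = quot
                found = True
                break
    return factors, current

# 1 + x + ... + x^15 splits completely; the exponents sum to the degree, 15.
window_factors, window_rest = factor_greedily([1] * 16)
# 2 + 5x + 5x^2 + 3x^3 + x^4 leaves a quotient that cannot be reduced further.
partial_factors, partial_rest = factor_greedily([2, 5, 5, 3, 1])
```

In the first case the exponents found sum to the degree of the starting polynomial, matching the condition $\sum i_m = n$; in the second the leftover quotient is the kind of non-conforming term this chapter addresses.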
In this chapter we focus on systems that do not fit the Generator Polynomial directly. When the
Complete Factorization algorithm is applied to such a system, the end result is a term that is
not of the form $(1 + x^k)$ and also cannot be reduced further.
Also, there are problems that may be two dimensional in nature. Design of such systems brings
in new issues that make the architecture design very complicated. Keeping this in mind, we
present some preliminary ideas on how to approach the design of such systems.
4.1 Design of Non-Conforming Systems
In a situation where a generator polynomial cannot be decomposed into ideal factors, one can still
try to partition the polynomial into two or more components, each of which satisfies the
properties specified in Chapter 2. This problem is, in general, very hard and we can only provide
heuristic algorithms to attack it. However, as the following proposition shows, one should isolate
all the possible factors of the type $(1 + x^k)$ before these algorithms are attempted.
Proposition. It is optimal to factorize the polynomial until the last possible stage.
Whatever the method of design, it is best to apply the Complete Factorization Algorithm
from section 2.4 before any attempt is made to fit the system.
We have developed some techniques of getting around the limitation of some systems that do
not fit very well into our design methodology. These work well and provide optimal architecture
designs for these systems in most cases. However, there are instances where these algorithms do
not perform optimally. Having said this, we now present these algorithms. We have also included
counter examples which illustrate the fact that they do not provide satisfactory design solutions
in all cases.
4.1.1 Heuristic Algorithms
The following algorithms partition a generator polynomial into multiple polynomials each having
the properties described in section 2.4 of Chapter 2. The process is different in each case and
most of the time, the split terms can be incorporated into an optimal architecture design.
Algorithm 3 (Partition - I)
1. (initialization) Apply the Complete Factorization algorithm to the generator polynomial.
Denote the last quotient by $Q(x) = c_0 + c_1 x + \cdots + c_j x^j$. Note that $Q(x)$ does not
have any more factors of the form $(1 + x^k)$.
2. (expansion) Expand $Q(x)$ in the following fashion:
$1 + \cdots + 1$ ($c_0$ times) $+\; x + \cdots + x$ ($c_1$ times) $+ \cdots$
3. (grouping) Group each 1 term with $x^k$ terms, beginning with the lowest k and moving on to
successively higher powers, until all the 1 terms are exhausted.
4. (factorization) Factor out the lowest power of x from the remaining terms.
5. (iteration) Repeat the previous two steps, grouping and factorization, until the remaining
terms are of the form $(1 + x^k)$ or $(1 + x^k + x^m)$, in which case the latter is treated as
two partitions, $(1 + x^k)$ and $x^m$.
Example 1. $P(x) = 1 + 3x + 4x^2 + x^3 + x^4$.
Initialization: Algorithm Complete Factorization yields no factors for $P(x)$. Therefore,
$Q(x) = P(x)$.
Expansion: $Q(x)$ is expanded as follows:
\begin{align*}
Q(x) &= 1 + 3x + 4x^2 + x^3 + x^4 \\
&= 1 + x + x + x + x^2 + x^2 + x^2 + x^2 + x^3 + x^4
\end{align*}
Grouping, Factorization and Iteration:
\begin{align*}
Q(x) &= (1 + x)^2 + x(1 + x + x + x + x^2 + x^3) \\
&= (1 + x)^2 + x((1 + x)^2 + x + x^3) \\
&= (1 + x)^2 + x((1 + x)^2 + x(1 + x^2)) \\
&= (1 + x)^3 + x^2(1 + x^2)
\end{align*}
Now, these two partitions can be implemented separately and then combined using an adder to
give the overall result.
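As a quick check, the two partitions of Example 1, $(1 + x)^3$ and $x^2(1 + x^2)$, can be multiplied out and added to confirm that they recombine to $P(x)$. The helpers below are hypothetical and use low-to-high coefficient lists.

```python
def poly_mul(a, b):
    """Multiply two polynomials given as low-to-high coefficient lists."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def poly_add(a, b):
    """Add two polynomials of possibly different lengths."""
    out = [0] * max(len(a), len(b))
    for i, c in enumerate(a):
        out[i] += c
    for i, c in enumerate(b):
        out[i] += c
    return out

one_plus_x = [1, 1]
first_partition = poly_mul(poly_mul(one_plus_x, one_plus_x), one_plus_x)  # (1 + x)^3
second_partition = poly_mul([0, 0, 1], [1, 0, 1])                         # x^2 (1 + x^2)
recombined = poly_add(first_partition, second_partition)
```

`recombined` holds the coefficients of $1 + 3x + 4x^2 + x^3 + x^4$, the polynomial the example started from.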
Example 2. $P(x) = 2 + 5x + 5x^2 + 3x^3 + x^4$.
Initialization: Algorithm Complete Factorization yields the factor $(1 + x)^2$ for $P(x)$.
Therefore, $Q(x) = 2 + x + x^2$.
Expansion: $Q(x)$ is expanded as follows:
\begin{align*}
Q(x) &= 2 + x + x^2 \\
&= 1 + 1 + x + x^2
\end{align*}
Grouping:
\[ Q(x) = (1 + x) + (1 + x^2) \]
Since this is the minimal form that can be reached, the end result is:
\[ P(x) = (1 + x)^2 \left[ (1 + x) + (1 + x^2) \right] \]
Now, these two partitions can be implemented separately and then combined using an adder to
give the overall result.
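The combination of two partitions with an adder can be simulated on a sample stream. In this sketch the factor $(1 + x)^2$ of Example 2 is shared as a common front end feeding the two partitions of $Q(x)$; that sharing, the zero initial register contents, and the function names are assumptions made for illustration.

```python
def adder_module(stream, delay):
    """Stage for the factor (1 + x^delay): u[t] + u[t - delay], zero initial state."""
    return [u + (stream[t - delay] if t >= delay else 0)
            for t, u in enumerate(stream)]

def fir(stream, coeffs):
    """Direct evaluation: the coefficient of x^i weights the sample i steps back."""
    return [sum(c * stream[t - i] for i, c in enumerate(coeffs) if t >= i)
            for t in range(len(stream))]

samples = [5, -2, 7, 1, 0, 3, 8, -4, 6, 2]

# Common front end: the factor (1 + x)^2 found by Complete Factorization.
shared = adder_module(adder_module(samples, 1), 1)
# The two partitions of Q(x) = (1 + x) + (1 + x^2), run side by side.
branch_a = adder_module(shared, 1)
branch_b = adder_module(shared, 2)
# A final adder combines the partition outputs.
combined = [a + b for a, b in zip(branch_a, branch_b)]

direct = fir(samples, [2, 5, 5, 3, 1])
```

The partitioned pipeline and the direct evaluation of $P(x) = 2 + 5x + 5x^2 + 3x^3 + x^4$ produce the same output stream.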
Algorithm 4 (Partition - II)
1. (initialization) Apply the Complete Factorization algorithm to the generator polynomial.
Denote the last quotient from the previous step by $Q(x) = c_0 + c_1 x + \cdots + c_j x^j$. Note
that $Q(x)$ does not have any more factors of the form $(1 + x^k)$.
2. (evaluation) Compute the polynomial $Q(x)$ at $x^k = \pm 1$, where $1 \le k \le n$.
3. (check) If the result is a power of 2, then for the corresponding k, $(1 \mp x^k)$ corresponds
to a possible module.
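The evaluation and check steps can be sketched in code. The algorithm does not spell out how a result that still contains powers of x is scored, so this sketch assumes the score is the sum of the coefficients of the reduced polynomial, compares its magnitude with powers of two, and does not count a score of zero as a power of two. These rules and all names are assumptions, not the thesis's definition. The polynomial used is the one from the example in Chapter 5.

```python
def reduce_mod(poly, k, sign):
    """Substitute x^k = sign (+1 or -1) into poly (low-to-high coefficients).

    Returns the k coefficients of the reduced polynomial."""
    reduced = [0] * k
    for i, c in enumerate(poly):
        reduced[i % k] += c * sign ** (i // k)
    return reduced

def is_power_of_two(value):
    """True when the magnitude of value is 1, 2, 4, 8, ..."""
    value = abs(value)
    return value > 0 and value & (value - 1) == 0

def candidate_modules(poly):
    """List (k, sign, score) for every substitution x^k = sign whose score
    (coefficient sum of the reduced polynomial, an assumed rule) passes the check."""
    n = len(poly) - 1
    found = []
    for sign in (1, -1):
        for k in range(1, n + 1):
            score = sum(reduce_mod(poly, k, sign))
            if is_power_of_two(score):
                found.append((k, sign, score))
    return found

# P(x) = 1 + 2x + x^3 + x^4 + x^5
candidates = candidate_modules([1, 2, 0, 1, 1, 1])
```

Each entry `(k, -1, score)` corresponds to a candidate module $(1 + x^k)$; deciding which combination of candidates gives the best design is a separate step.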
Example 1. $P(x) = 2 + x + 3x^2 + x^3 + x^4$.
Initialization: Algorithm Complete Factorization yields the factor $(1 + x^2)$ for $P(x)$.
Therefore, $Q(x) = 2 + x + x^2$.
Evaluation: Computing the polynomial $Q(x)$ at $x^k = \pm 1$, where $1 \le k \le n$, we get:

Substitution     Q(x)                            Possible Partition
$x = 1$          $6$ ($\neq 2^p$)                None
$x = -1$         $4$ ($= 2^2$)                   $(1 + x)$
$x^2 = 1$        $4 + 2x$ ($4 + 2 \neq 2^p$)     None
$x^2 = -1$       $2x$ ($2 = 2^1$)                $(1 + x^2)$
Check: For this problem, $P(x)$ decomposes into
\begin{align*}
P(x) &= (1 + x^2)\left((1 + x) + (1 + x^2)\right) \\
&= (1 + x^2)(1 + x) + (1 + x^2)^2
\end{align*}
The architecture will have two modules for the first partition, one with two delay elements
(registers) and one with a single register. The second partition will have two modules with two
registers each.
4.2 Design and Implementation of 2-D Applications
The fields of two-dimensional Digital Signal Processing and Digital Image Processing have
maintained tremendous vitality over the past two decades and there is every indication that this
trend will continue. In this section, we pick some specific two-dimensional filters [8] and
illustrate how our model can be extended.
4.2.1 The Laplacian Filter
The Laplacian filter is an important tool in the area of Digital Image Processing and is used for
edge detection. The filter can be defined as a 3 x 3 kernel whose values are chosen according to
the image and the requirements. The pixels from the image are processed through this filter to
obtain the desired result. One such Laplacian filter is [7]:
\[ \begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix} \]
Since the Laplacian is a two dimensional filter, it cannot be treated as a one-dimensional
system, and handling it with the techniques discussed so far is not in keeping with the general
concepts proposed. Unlike in the previous problems, here there are $n^2$ output points and the
hardware complexity would also be of the order of $n^2$.
However, since it is a two-dimensional system, it lends itself naturally to partitioning into two
components. So we propose to use the generator polynomial separately for rows and columns.
We would now be operating with row modules and column modules.
For a 3 x 3 part of an image on which the Laplacian filter is to be used, Fig 4.1 shows how the
design methodology works. Part (a) is the Laplacian filter kernel. Part (b) denotes how the space
in the image is defined. The i terms correspond to row pixels and the j terms correspond to the
column pixels in the image. The image pixels corresponding to the non-zero values of the filter
kernel are operated upon. So the row module handles
\[ \begin{bmatrix} \cdot & \cdot & \cdot \\ 1 & -2 & 1 \\ \cdot & \cdot & \cdot \end{bmatrix} \]
of
\[ \begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix} \]
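[Figure 4.1: Row and Column Modules for Laplacian Filter. (a) The Laplacian filter kernel.
(b) The image pixels $x_{i-1,j}$, $x_{i,j-1}$, $x_{i,j}$, $x_{i,j+1}$ and $x_{i+1,j}$ that
correspond to the non-zero kernel values. (c) The two module outputs,
$x_{i-1,j} - 2x_{i,j} + x_{i+1,j}$ and $x_{i,j-1} - 2x_{i,j} + x_{i,j+1}$, which are combined to
give the solution.]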
[Figure 4.2: Implementation of Laplacian Filter. Block diagram in which the Row Module and the
Column Module feed the Laplacian Filter Output.]
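And the column module handles
\[ \begin{bmatrix} \cdot & 1 & \cdot \\ \cdot & -2 & \cdot \\ \cdot & 1 & \cdot \end{bmatrix} \]
of
\[ \begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix} \]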
Part (c) of Fig 4.1 shows how the two modules interact to produce the filter result. The outputs
from the two modules are processed through an adder, and this results in a two-dimensional
Laplacian Filter.
The overall architecture to process one set of 3 x 3 image pixels is shown in Fig 4.2. Each of
the modules consists of two adders. They are then tied together with a single adder to produce
the desired result.
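The row and column decomposition can be checked on a single 3 x 3 patch. This is a minimal sketch: the function names are hypothetical, and each module is modelled simply as a three-tap weighted sum with weights 1, -2, 1 rather than as the adder-level hardware.

```python
LAPLACIAN = [[0, 1, 0],
             [1, -4, 1],
             [0, 1, 0]]

def second_difference(a, b, c):
    """Three-tap module with weights 1, -2, 1."""
    return (a - b) + (c - b)

def laplacian_by_modules(patch):
    """Sum of the two one-dimensional modules through the centre pixel."""
    along_row = second_difference(patch[1][0], patch[1][1], patch[1][2])
    along_column = second_difference(patch[0][1], patch[1][1], patch[2][1])
    return along_row + along_column    # the final combining adder

def laplacian_direct(patch):
    """Direct application of the 3 x 3 kernel to the patch."""
    return sum(LAPLACIAN[r][c] * patch[r][c]
               for r in range(3) for c in range(3))

patch = [[1, 2, 3],
         [4, 9, 6],
         [7, 8, 5]]
module_result = laplacian_by_modules(patch)
direct_result = laplacian_direct(patch)
```

Both routes give the same value because the centre weight -4 is split as -2 in each direction.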
To operate on the complete image, one would require an array of n x n such architectures. This
would mean a hardware complexity of the order of $n^2$ and a time complexity of the order of n.
One other possibility is to reduce the number of modules and thus reduce the hardware. However,
this would compromise the time complexity. The number of modules can be restricted to n, which
translates to a hardware complexity of the order of n and a time complexity of the order of $n^2$.
Chapter 5
Conclusion
In this thesis, we have presented a new architecture with optimal hardware and time complexities.
The design approach proposed is simple and the application potential spans many varied fields.
Our technique may be used to design and implement architectures for linear as well as nonlinear
operations such as addition, maximum, minimum, and more. Operations that fall within the
purview of our methodology are associative and commutative. The third chapter has been
dedicated to illustrating the wide potential of our design methodology.
In order to study these systems, we have introduced the concept of the Generator Polynomial
\[ g_n(z) = \sum_{i=0}^{n} c_i z^i. \]
An analysis of the generator polynomial yielded some interesting properties. These have been
discussed in section 2.4 and are important since they are extensively referred to during our
designs. It is imperative that these be satisfied in order for a system to be designed using this
technique.
It is worthwhile to highlight how and where the generator polynomial influences the design.
The overall structure of the architecture is dictated by the generator polynomial.
The architecture has identical modules arranged serially, and their functionality depends on the
operation being performed. Once the generator polynomial has been determined for a particular
system, the algorithms Factor $(1 + x^k)$ and Complete Factorization are applied.
The result decides whether the generator polynomial fulfills the properties discussed in section 2.4.
We have proposed some heuristic algorithms, Partition-I and Partition-II, in Chapter 4 toward
determining design solutions when the generator polynomial does not satisfy the required
properties. As illustrated in that chapter, these perform well in most cases. However, in some
instances, they do not provide optimal architecture designs. Consider the following example:
Example 1. $P(x) = 1 + 2x + x^3 + x^4 + x^5$.
Let us apply Algorithm 4 (Partition-II) to this polynomial.
Initialization: Algorithm Complete Factorization yields no factors for $P(x)$. Therefore,
$Q(x) = P(x)$.
Evaluation: Computing the polynomial $Q(x)$ at $x^k = \pm 1$, where $1 \le k \le n$, we get:

Substitution     Q(x)                   Possible Partition
$x = 1$          $6$ ($\neq 2^p$)       None
$x^2 = 1$        $6$ ($\neq 2^p$)       None
$x^3 = 1$        $6$ ($\neq 2^p$)       None
$x^4 = 1$        $6$ ($\neq 2^p$)       None
$x^5 = 1$        $6$ ($\neq 2^p$)       None
$x = -1$         $-2$ ($2 = 2^1$)       $(1 + x)$
$x^2 = -1$       $4$ ($= 2^2$)          $(1 + x^2)$
$x^3 = -1$       $0$                    $(1 + x^3)$
$x^4 = -1$       $2$ ($= 2^1$)          $(1 + x^4)$
$x^5 = -1$       $4$ ($= 2^2$)          $(1 + x^5)$
Check: For this problem, there are multiple possibilities. It has to be determined which
combination of factors gives rise to the optimal hardware design.
The various possible partitions are
Of these, the last structure has the lowest hardware and time complexities and hence is the
optimal one.
From the above example, it is clear that the algorithm has to be improved so as to encompass
the last logic decision also. More work is required so that this design process may be perfected.
Another area open to extensions is that of 2-D applications. This was touched upon in Chapter
4 through the Laplacian filter. A generalized technique needs to be developed for the design of
2-D systems.
Bibliography
[1] A. Aggoun, "Systolic arrays for digital filters," in International Society for Optical
Engineering, vol. 3217, pp. 162-168, 1997.
[2] AT&T, AT&T Application Specific Integrated Circuits: 1.25U CMOS Library Standard Cells
and Function Blocks.
[3] P. Erdos, "Some remarks on polynomials," American Mathematical Society, vol. 53,
pp. 1169-1176, 1947.
[4] N. G. DeBruijn, "On the zeros of a polynomial and its derivatives," in Nederlands Akad.
Wetensch Proceedings, vol. 49, pp. 1037-1044, 1946.
[5] G. Polya and G. Szego, Problems and Theorems in Analysis VI. 1970.
[6] R. W. Hamming, Digital Filters. 1977.
[7] J. S. Lim, Two-Dimensional Signal and Image Processing. 1990.
[8] S. K. Mitra, Digital Signal Processing - A Computer-based Approach.
Vita
Anita Rao was born in Chennai, India, to C. S. Kalavathy and C. B. Sivasubramaniam. Her
initial schooling was at Rosary Matriculation Higher Secondary School, Chennai and National
College, Bangalore. She earned her Bachelor of Engineering Degree, majoring in Electrical and
Electronics Engineering, from Regional Engineering College, Tiruchirapalli, India. A wish to
pursue specialization in the Computer Engineering field brought her to Lehigh University, PA,
USA. She currently lives in Chicago with her husband and will be working at ADI, IL.
