IMPLEMENTATION OF FIR FILTERS IN
HARDWARE DESCRIPTION LANGUAGE (HDL)
by
TONG KIN WAH




Submitted to the Electrical & Electronics Engineering Programme
in Partial Fulfillment of the Requirements
for the Degree
Bachelor of Engineering (Hons)











A project dissertation submitted to the
Electrical & Electronics Engineering Programme
Universiti Teknologi PETRONAS
in partial fulfillment of the requirement for the
Bachelor of Engineering (Hons)
(Electrical & Electronics Engineering)







This is to certify that I am responsible for the work submitted in this project, that the
original work is my own except as specified in the references and acknowledgements,
and that the original work contained herein has not been undertaken or done by




Digital filters are used in digital signal processing (DSP) to improve the quality of a
signal, to extract information from signals or to separate two or more signals previously
combined. The advancements in VLSI technology have seen the growing popularity of
digital filters rather than analog filters. Due to a surge in high performance portable
systems, there is a continuous drive for methodologies and approaches of low power and
high throughput FIR filter cores. The components of an FIR filter include adders,
multipliers, a memory unit and a control unit. This project compares the
performance of different adder and multiplier structures and integrates the best
of them to yield a filter with the best performance in terms of area, speed
and power consumption. The hardware implementation of the FIR filters is done using
Verilog Hardware Description Language (HDL). All the filter components are modeled
in HDL and are then synthesized, implemented and simulated. Once the
simulated design has been verified, it is downloaded into a Field Programmable Gate
Array (FPGA), a Xilinx Virtex-II chip. Hardware verification is performed
by testing the filter output using a logic analyzer. Important considerations in this project
are the selection of an appropriate number of bits for the input samples and filter coefficients,
and the number representation scheme, since these choices affect the
performance of the filter. This project brings out the importance of exploring various
structures of adders and multipliers that will improve filter performance. This area of
study is lacking, although there is a great deal of research on advanced techniques for
implementing low power and high throughput filters. The FIR filter designed in this project
can be further improved by comparing more structures of adders and multipliers and
incorporating some of these advanced techniques.
ACKNOWLEDGEMENTS
This design project has equipped me with abundant knowledge, and it would not
have been a success without the help of a legion of people. First and foremost, I would like to
express my heartfelt gratitude to my supervisor, Azrina, who has never failed to attend to
my needs. She has been very helpful in providing solutions to my problems
and in leading me to resources that were of great help. I would also like to thank her for
directing one of my problems to her friend, Weng Fook Lee, who provided
me with suggestions that guided me through the design process.
I am also indebted to three lecturers, Mr. Lo, Mr. Patrick and Dr. Yap, who have
helped and guided me much in this project. I want to thank them for spending hours with
me in debugging and for their precious advice. They were patient with all
my inquiries and always willing to lend a helping hand. I also wish to give
my thanks to the lab technician, Kak Azira, for her eagerness to help in every way
regarding the lab equipment. It would have been a tough time without her help in installing the
software and obtaining the lab equipment and manuals.
There is another person to whom I owe my thanks - Kuang Sun, who is one of
Mr. Lo's FYP students. He was of tremendous help in my project since a part of his project
is rather similar to mine. With his help and advice in using the software and lab
equipment, a lot of time was saved and more focus could be put into the design. Lastly, I
want to take this opportunity to thank everyone who has been directly or indirectly involved in
this project, be it offering technical information or giving other useful advice. Once




LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
CHAPTER 1 INTRODUCTION
1.1 BACKGROUND OF STUDY
1.2 PROBLEM STATEMENT
1.3 OBJECTIVES
1.4 SCOPE OF STUDY
CHAPTER 2 LITERATURE REVIEW/THEORY
2.1 DIGITAL FIR FILTERS
2.2 TWO'S COMPLEMENT
2.3 ADDERS
2.3.1 Carry-Look-Ahead Adder (CLA)
2.3.2 Carry-Save Adder (CSA)
2.4 MULTIPLIERS
2.4.1 Radix-4 Booth's Multiplier (Booth's Algorithm)
2.4.2 Baugh-Wooley Array Multiplier
CHAPTER 3 METHODOLOGY/PROJECT WORK
3.1 PROJECT FLOW
3.2 BASIC DESIGN METHODOLOGY
3.3 BIT REPRESENTATION SCHEME
3.4 IDENTIFICATION OF TOOLS
3.5 TASKS ACCOMPLISHED
3.6 PROBLEMS ENCOUNTERED
3.7 TESTING & TROUBLESHOOTING
CHAPTER 4 RESULTS & DISCUSSION
4.1 FIR FILTER SPECIFICATIONS
4.1.1 Analysis of Designed FIR Filter
4.2 VERILOG CODES
4.2.1 Baugh-Wooley Array Multiplier
4.2.2 Carry-Look-Ahead Adder (CLA)
4.2.3 Shift Register (Delay Units)
4.2.4 Filter Implementation
4.3 SOFTWARE SIMULATIONS
4.3.1 Performance Comparisons
4.3.2 Complete filter
4.4 HARDWARE SYNTHESIS
4.5 DISCUSSION







Table 1 Advantages and disadvantages of digital filters
Table 2 Comparison between FIR and IIR filters
Table 3 Radix-4 Booth's recoding
Table 4 Selection of multiplier based on fewer transitions in 0's or 1's
Table 5 Filter specifications
Table 6 Performance comparison between multipliers
Table 7 Performance comparison between adders with one input port
Table 8 Performance comparison between adders with eight input ports
Table 9 Complete filter performance
LIST OF FIGURES
Figure 1 A simplified block diagram of a real-time digital filter with analog input and output signals
Figure 2 A conceptual representation of a digital filter
Figure 3 Gate-level circuits and equations for (a) half adder and (b) full adder
Figure 4 A 4-bit CLA showing carry-out circuitry
Figure 5 General block diagram layout for a CSA using full adders
Figure 6 Sequential multiplication of 2's-complement numbers with right shifts
Figure 7 Radix-4 multiplication with modified Booth's recoding
Figure 8 Hardware realization of radix-4 multiplier based on Booth's recoding
Figure 9 Recoding logic and multiplexer to generate partial products
Figure 10 A 5-bit Baugh-Wooley multiplier
Figure 11 (a) DF FIR filter architecture (b) TDF FIR filter architecture
Figure 12 Entire project flow
Figure 13 Steps in designing small modules of a filter
Figure 14 Codes to test the filter performance
Figure 15 Original signal and generated random noise
Figure 16 Noisy signal and filtered signal
Figure 17 Partial codes of Baugh-Wooley multiplier
Figure 18 Test-bench for Baugh-Wooley array multiplier
Figure 19 Full adder
Figure 20 Half adder
Figure 21 16-bit CLA
Figure 22 17-bit CLA
Figure 23 Shift register acts as delay units by flip-flop instantiations
Figure 24 Verilog codes of a D flip-flop
Figure 25 Verilog description for the complete filter
Figure 26 Test-bench for the complete filter
Figure 27 Partial results for the functional simulation of the filter test-bench
Figure 28 Partial results for the timing simulation of the filter test-bench
Figure 29 Partial waveforms for the functional simulation of filter test-bench
Figure 30 Partial waveforms for the timing simulation of filter test-bench
Figure 31 Signal generator module providing inputs to filter
Figure 32 Top-level module
Figure 33 Verilog codes of signal generator module
Figure 34 Baugh-Wooley multiplier with instantiations of full adders
Figure 35 Radix-4 Booth's multiplier with 8-bit inputs
Figure 36 Recoding logic and multiplexer to generate partial products
Figure 37 CSA for Booth's multiplier to sum all partial products
Figure 38 Test-bench for radix-4 Booth's multiplier
Figure 39 16-bit CSA adding four operands
Figure 40 16-bit CSA adding five operands
Figure 41 19-bit CSA adding four operands
Figure 42 4-bit CLA without sign extension
Figure 43 4-bit CLA with sign extension
Figure 44 18-bit CLA
Figure 45 19-bit CLA
Figure 46 20-bit CLA
Figure 47 Results of functional simulation for the test-bench of Booth's multiplier
Figure 48 Results of timing simulation for the test-bench of Booth's multiplier
Figure 49 Results of functional simulation for the test-bench of Baugh-Wooley multiplier
Figure 50 Results of timing simulation for the test-bench of Baugh-Wooley multiplier
Figure 51 Overall adder formed by CLA instantiations with only one input port
Figure 52 Test-bench for the overall adder with CLA instantiations and one input port
Figure 53 Overall adder formed by CLA instantiations with eight input ports
Figure 54 Test-bench for the overall adder with CLA instantiations and eight input ports
Figure 55 Results of functional simulation for CLA with one input port
Figure 56 Results of timing simulation for CLA with one input port
Figure 57 Results of functional simulation for CLA with eight input ports
Figure 58 Results of timing simulation for CLA with eight input ports
Figure 59 Overall adder formed by CSA instantiations with only one input port
Figure 60 Test-bench for the overall adder with CSA instantiations and one input port
Figure 61 Overall adder formed by CSA instantiations with eight input ports
Figure 62 Test-bench for the overall adder with CSA instantiations and eight input ports
Figure 63 Results of functional simulation for CSA with one input port
Figure 64 Results of timing simulation for CSA with one input port
Figure 65 Results of functional simulation for CSA with eight input ports



























Very high speed HDL




1.1 BACKGROUND OF STUDY
Digital filtering is one of the most important operations in digital signal
processing (DSP). Digital filters are widely used in any area where information is
handled in digital form or controlled by a digital processor. The continuously growing
trend towards digital solutions can be seen in all areas - from electronic instrumentation,
control, data manipulation, signal processing and telecommunication to consumer
electronics. Due to the advancements in VLSI technology, digital filters are fabricated
with greater reliability, smaller size, lower cost, lower power consumption and higher
operation speed.
The objectives of using digital filters in DSP are to improve the quality of a
signal (for example, to remove or reduce noise), to extract information from signals or to
separate two or more signals previously combined. The use of digital filters is especially
important to minimize the distortion of the in-band signal components. For instance,
digital filters are used in speech synthesis. The Speak and Spell, an electronic learning
aid for children, is an example. It uses LPC (linear predictive coding)
techniques, in which the human speech to be reproduced later is modeled as the
response of a time-varying digital filter to a periodic or random excitation signal.
There is a continuous demand for low power and high throughput FIR filtering
cores in DSP architectures. Researchers have developed a number of
techniques for implementing digital filters that meet these goals. These include
the use of differential coefficients, wordlength optimization, multirate
architectures and dynamic adjustment of filter order [1,2]. Other techniques introduced
by researchers include coefficient segmentation, block processing and combined
segmentation and block processing algorithms, as demonstrated in [3,4,5]. The choice of
number representation scheme, investigated in [6,7], can also affect the filter performance.
Digital filters are normally modeled using software simulation and then
synthesized into the corresponding hardware circuit using field programmable gate arrays
(FPGAs) or application-specific integrated circuits (ASICs). A Hardware Description
Language (HDL) provides the framework for the complete logical design. Verilog and
VHDL are the two most commonly used HDLs today. Verilog as an HDL was
introduced by Cadence Design Systems, which placed it into the public domain in 1990. It
was established as a formal IEEE Standard in 1995, and a revised version was
brought out in 2001.
Software simulators offer flexible schemes to code the algorithm in a choice of
many languages but cannot always offer the speed that a hardware simulator can.
Unfortunately, building hardware prototypes to model different systems can be costly
and time consuming when constant changes have to be made. Therefore, a middle
ground might be found using custom computing platforms or programmable logic. Such
systems can offer flexibility similar to that of software and still retain some or all of the
hardware acceleration, within a shorter implementation cycle.
FPGAs are becoming increasingly popular for rapid prototyping of designs with
the aid of software simulation and synthesis. Software synthesis tools translate high-level
language descriptions of the implementation into formats that may be loaded directly
into the FPGAs. As the number of design changes grows, making them through software synthesis
becomes more cost effective than making similar changes to hardware prototypes. In
addition, the implementation may be constructed on existing hardware to help further
reduce the cost.
1.2 PROBLEM STATEMENT
The requirement of this project is to implement FIR filters using a suitable
Hardware Description Language (HDL). The design can then be synthesized into
a hardware circuit using an FPGA. There are innumerable methodologies and
techniques used to implement low power and high throughput FIR filtering cores, as
discussed in [1,2]. The components of a filter include adders, multipliers, a memory unit
and a control unit. Different structures of adders and multipliers give different
performance. Hence, this project aims at modeling the components in HDL and
investigating the performance of different structures of adders and multipliers using a
simulation tool. The performance is viewed in terms of structure size, speed and
power consumption.
The adder and multiplier structures that give the best performance are to be used
in the filter design, and the overall filter performance is analyzed. Once software
simulation is completed successfully, the final filter design is downloaded into an FPGA
and verified to ensure that the filter is functioning properly. Performance comparisons
among various structures of adders and multipliers are lacking, since much
research currently focuses on filter implementation techniques. Hence, this project
brings out the importance of investigating the structures of adders and multipliers.
1.3 OBJECTIVES
1. To develop software simulations for FIR filters using Verilog HDL.
2. To compare the performance of the different structures of adders and multipliers in
relation to area, speed and power consumption.
3. To select the structures of adder and multiplier with the best performance and
integrate them with memory unit and control unit to build the overall filter.
4. To select a suitable computational arithmetic (unsigned, signed, fixed or floating
point) and the number of bits to represent filter coefficients and input data.
5. To synthesize the filter design into hardware using FPGA and verify its functionality
using appropriate equipment.
1.4 SCOPE OF STUDY
1. The concepts and theories of FIR filters are learnt.
2. The design methodology for FIR filters, from specifications, coefficient calculation,
filter structure and finite wordlength effects to filter implementation, is learnt.
3. A suitable data processing style and computational arithmetic for representing the
input samples and filter coefficients are decided upon.
4. Each component of the filter (adders, multipliers, memory unit and control unit) is
coded in Verilog and its functionality is verified.
5. Different types of adders and multipliers are explored. The performance of different
structures of each component is compared in terms of area, speed and power
consumption.
6. All components are integrated to form a complete filter. The final design is verified
functionally and a detailed analysis is done.




2.1 DIGITAL FIR FILTERS
A filter is essentially a system or network that selectively changes the wave
shape, amplitude-frequency and/or phase-frequency characteristics of a signal in a
desired manner. A digital filter is a mathematical algorithm implemented in hardware
and/or software that operates on a digital input signal to produce a digital output signal
for the purpose of achieving a filtering objective. Digital filters often operate on digitized
analog signals or just numbers representing some variable.
A simplified block diagram of a real-time digital filter, with analog input and
output signals, is given in Figure 1. The bandlimited analog signal is sampled
periodically and converted into a series of digital samples, x(n). The digital processor
implements the filtering operation, mapping the input sequence, x(n), into the output
sequence, y(n), in accordance with a computational algorithm for the filter. The DAC
converts the digitally filtered output into analog values which are then analog filtered to













Figure 1 A simplified block diagram of a real-time digital filter with analog input and
output signals [8]
Digital filters play important roles in DSP. Compared to analog filters, they are
preferred in a number of applications; for example, data compression, biomedical signal
processing, speech processing, image processing, data transmission, digital audio and
telephone echo cancellation. The advantages and disadvantages of digital filters
compared to analog filters are summarized in Table 1.
Table 1 Advantages and disadvantages of digital filters [8]
Advantages:
i) Can have a truly linear phase response.
ii) Performance of filters does not vary with environmental changes - eliminates the need
to calibrate periodically.
iii) Frequency response can be automatically adjusted if it is implemented using a
programmable processor.
iv) Several input signals or channels can be filtered by one digital filter without
replicating the hardware.
v) Both filtered and unfiltered data can be saved for further use.
vi) Can be fabricated small in size and consume low power due to advancements in VLSI
technology.
vii) More flexible in terms of precision - only limited by the wordlength used.
viii) Can be made to work over a wide range of frequencies, even at very low frequencies.
Disadvantages:
i) Speed limitation. The operating speed of digital filters depends on the speed of the digital
processor used and the number of arithmetic operations performed.
ii) Finite wordlength effects. Digital filters are subject to ADC noise resulting from
quantizing a continuous signal and to roundoff noise incurred during computation.
iii) Long design and development times. Hardware development for digital filters
can take longer than for analog filters.
Digital filters can be divided into two categories, namely infinite impulse
response (IIR) and finite impulse response (FIR) filters. Either type of filter, in its basic
form, can be represented by its impulse response sequence, h(k) as in Figure 2. The
choice between FIR and IIR filters depends largely on the relative advantages of the two







Figure 2 A conceptual representation of a digital filter
Table 2 Comparison between FIR and IIR filters [8]
FIR filter | IIR filter
Can have exactly linear phase response | Nonlinear phase response, especially at band edges
Nonrecursive, always stable | Stability problems
Finite wordlength effects are much less severe | Finite wordlength effects are more severe
Requires more processing time and storage for a given amplitude response specification | Fewer coefficients, leading to less processing time and storage
Filters with arbitrary frequency responses are easier to synthesize | Analog filters are readily transformed into equivalent IIR filters meeting similar specifications
The basic FIR filter is characterized by the following two equations:
$$y(n) = \sum_{k=0}^{N-1} h(k)\,x(n-k)$$ (Equation 1)
$$H(z) = \sum_{k=0}^{N-1} h(k)\,z^{-k}$$ (Equation 2)
where h(k) are the impulse response coefficients of the filter, H(z) is the transfer function
of the filter and N is the filter length, which is the number of filter coefficients. The sole
objective of most FIR coefficient calculation (or approximation) methods is to obtain
values of h(n) such that the resulting filter meets the design specifications. Several
methods are available to obtain h(n) and the most commonly used are window, optimal
(Parks-McClellan) and frequency sampling methods. All three lead to linear phase FIR
filters.
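As an illustration of Equation 1, the following is a minimal behavioral sketch in Python rather than the Verilog used for the actual design. The function name `fir_filter` and the coefficient values are assumptions chosen only for the example, and input samples before n = 0 are assumed to be zero.

```python
def fir_filter(x, h):
    """Direct evaluation of y(n) = sum_{k=0}^{N-1} h(k) * x(n - k)."""
    n_taps = len(h)                     # N, the filter length
    y = []
    for n in range(len(x)):
        acc = 0
        for k in range(n_taps):
            if n - k >= 0:              # samples before n = 0 are assumed to be zero
                acc += h[k] * x[n - k]
        y.append(acc)
    return y


# Hypothetical 4-tap symmetric coefficients and a unit impulse as input.
h = [1, 2, 2, 1]
impulse = [1, 0, 0, 0, 0, 0]
print(fir_filter(impulse, h))           # the impulse response reproduces h(k)
```

Feeding a unit impulse returns the coefficients themselves followed by zeros, which is a quick check that the impulse response is finite and equal to h(k).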
The number of bits used to represent the input data to the filter and the filter
coefficients and in performing arithmetic operations must be small for efficiency and to
limit the cost of the digital filter. The problems caused by using a finite number of bits
are referred to as finite wordlength effects and can lead to performance degradation of
the filter. Finite wordlength effects include [8]:
i) ADC noise. ADC quantization noise which results when the filter input is derived
from analog signals.
ii) Coefficient quantization errors. These result from representing filter coefficients
with a limited number of bits.
iii) Roundoff errors from quantizing results of arithmetic operations. This may be
caused by the wordlength of the processor used.
iv) Arithmetic overflow. This occurs when partial sums or filter output exceeds the
permissible wordlength of the system.
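Two of the effects listed above, coefficient quantization and arithmetic overflow, can be illustrated with a small Python sketch. The 8-bit wordlength, the coefficient value 0.3 and the helper names `quantize` and `wrap_twos_complement` are assumptions made for the example only; they are not the values used in this project.

```python
def quantize(value, frac_bits):
    """Round a real coefficient to a signed fixed-point integer with frac_bits fractional bits."""
    return round(value * (1 << frac_bits))


def wrap_twos_complement(value, width):
    """Model arithmetic overflow: keep only `width` bits and reinterpret them as signed."""
    value &= (1 << width) - 1
    if value & (1 << (width - 1)):      # sign bit set after truncation
        value -= 1 << width
    return value


coeff = 0.3                              # hypothetical coefficient
q = quantize(coeff, 7)                   # 8-bit word: 1 sign bit + 7 fractional bits
error = coeff - q / 128                  # coefficient quantization error
overflowed = wrap_twos_complement(100 + 100, 8)   # 200 does not fit in 8 signed bits
print(q, error, overflowed)
```

The quantization error is bounded by half of one least significant bit, while the overflowed sum wraps around to a negative value, which is why partial sums are usually given extra bits.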
The computation of output sequence, y(n) involves multiplications,
additions/subtractions and delays. Thus, filter implementation needs the following basic
components:
i) memory (RAM) to store the present and past input samples, x(n) and x(n-k)
ii) memory (RAM or ROM) for storing the filter coefficients, the h(k)
iii) multipliers to multiply input samples and filter coefficients
iv) adders to sum the outputs from multipliers
v) control unit to schedule the operations of all components in a filter
2.2 TWO'S COMPLEMENT
Two's complement number representation is used to represent signed numbers.
This form of representation is also known as radix complement (RC) representation.
Two's complement is selected over other representation schemes because
signed addition and multiplication can be performed using the same circuitry as unsigned
addition and multiplication. To obtain the two's complement of a number, first
complement (invert) all the bits in the number, including the sign bit and all magnitude
bits, then add one to the least significant bit of the number. In order to add or multiply
two 4-bit operands, sign extension needs to be carried out beforehand, so that the MSB
is the sign bit and the remaining four bits are magnitude bits. For example, integer 5 is represented
by $00101_2$ while integer -5 is represented by $11011_2$. Hence, addition of two 4-bit
operands requires a 5-bit adder.
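The invert-and-add-one rule and the sign extension step can be checked with a short Python sketch. The function names are hypothetical helpers introduced for illustration, and bit strings are written MSB first.

```python
def to_twos_complement(value, width):
    """Return the width-bit two's complement bit string of a signed integer."""
    if not -(1 << (width - 1)) <= value < (1 << (width - 1)):
        raise ValueError("value does not fit in the given width")
    return format(value & ((1 << width) - 1), "0{}b".format(width))


def negate(bits):
    """Invert every bit, then add one at the least significant bit."""
    width = len(bits)
    inverted = int(bits, 2) ^ ((1 << width) - 1)
    return format((inverted + 1) & ((1 << width) - 1), "0{}b".format(width))


def sign_extend(bits, new_width):
    """Replicate the sign bit (MSB) to widen the operand."""
    return bits[0] * (new_width - len(bits)) + bits


print(to_twos_complement(5, 5))    # 00101
print(negate("00101"))             # 11011, i.e. -5
print(sign_extend("1011", 5))      # 4-bit -5 widened to 5 bits
```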
2.3 ADDERS
The iterative design process is used to design adder and subtractor circuits at gate
level. Two's complement representation of signed numbers is used so that subtraction
can be done using the same circuitry as addition. The two basic adders are the half adder
(HA) and the full adder (FA). A half adder is capable of adding two 1-bit operands, while a
full adder can add two 1-bit operands and an input carry. Both adders produce two
outputs - a sum and an output carry. The gate-level circuits and equations for the half adder
and full adder are shown in Figure 3.
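Since the figure is not reproduced here, the following Python sketch uses the standard textbook equations for the two adders; it is assumed, not confirmed, that these match the equations in Figure 3. The full adder is built from two half adders and an OR gate.

```python
def half_adder(a, b):
    """Sum = A xor B, carry = A and B."""
    return a ^ b, a & b


def full_adder(a, b, cin):
    """Two half adders plus an OR gate combining the two partial carries."""
    s1, c1 = half_adder(a, b)
    s, c2 = half_adder(s1, cin)
    return s, c1 | c2


# Exhaustive truth table for the full adder: columns are A, B, CI, sum, carry-out.
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            print(a, b, cin, *full_adder(a, b, cin))
```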
Higher-bit adders are formed by employing full adders and half adders where
appropriate in an iterative modular design process. Examples of higher bit adders are




















Figure 3 Gate-level circuits and equations for (a) half adder and (b) full adder
2.3.1 Carry-Look-Ahead Adder (CLA)
The 4-bit CLA showing the carry-out circuitry is indicated in Figure 4. This
figure assumes that there is no input carry at bit position 0. The propagation delay times
shown in parentheses for the carry-out bits and the sum bits of the CLA become substantially
smaller than those of a ripple-carry adder as the number of stages increases. The CLA
contains carry-generate terms ($G_i = A_i \cdot B_i$) and carry-propagate terms ($P_i = A_i + B_i$). From
the full adder, $CO_{i+1} = A_i \cdot B_i + CI_i \cdot (A_i + B_i)$. The carry bit thus contains one carry-generate term
and one carry-propagate term. When the expression $A_i \cdot B_i$ is 1, the carry-out bit becomes
1 independent of the carry-in bit $CI_i$, and so the expression $A_i \cdot B_i$ is called the
carry-generate term. It generates the carry-out bit [9].
















Figure 4 A 4-bit CLA showing carry-out circuitry [9]
When the carry-in bit $CI_i$ is 1 and the expression $A_i + B_i$ is also 1, the carry-out bit
becomes 1, and so the expression $A_i + B_i$ is called the carry-propagate term. It propagates
or moves the value $CI_i$ to the carry-out bit [9]. The carry-out bit of the non-ripple
expandable CLA can be written as follows for each bit position:
Bit position 0:
$CO_1 = G_0 + CI_0 \cdot P_0$
Bit position 1:
$CO_2 = G_1 + CI_1 \cdot P_1$
$= G_1 + CO_1 \cdot P_1$
$= G_1 + G_0 \cdot P_1 + CI_0 \cdot P_0 \cdot P_1$
Bit position 2:
$CO_3 = G_2 + CI_2 \cdot P_2$
$= G_2 + CO_2 \cdot P_2$
$= G_2 + G_1 \cdot P_2 + G_0 \cdot P_1 \cdot P_2 + CI_0 \cdot P_0 \cdot P_1 \cdot P_2$
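In general, the CLA bit position organization scheme for i = 0, 1, 2, ... is:
$$CO_{i+1} = G_i + G_{i-1} \cdot P_i + G_{i-2} \cdot P_{i-1} \cdot P_i + G_{i-3} \cdot P_{i-2} \cdot P_{i-1} \cdot P_i + \ldots + CI_0 \cdot P_0 \cdot P_1 \cdots P_{i-1} \cdot P_i$$ (Equation 3)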
Since each carry-out bit is in SOP (sum of products) form, each function can be
implemented as a 2-level gate circuit that depends only on the carry-generate and
carry-propagate terms for the current bit position and all the previous (less significant)
bit positions. Since each carry-generate and carry-propagate term requires only a single gate
level of logic, each carry-out function past bit position 0 can be implemented as a 3-level
gate circuit with a settling time (propagation delay time) of just $3t_p$. This reduces the
settling time for the sum bits to only $6t_p$ for any CLA with three or more bits [9].
Three things limit the usefulness of CLA circuitry when it is applied over a large
number of stages:
i) The carry-generate term $G_0$ from the first bit position must be capable of driving
each of the succeeding stages.
ii) Each succeeding stage requires gates with an increasing number of inputs (gates
with a higher fan-in).
iii) Gate count increases and thus, cost increases with each additional stage.
Due to these limiting factors, CLA is usually implemented over small groups of bits
(such as 4 bits). The carry-look-ahead technique can then be applied again over the
groups as they are cascaded [9].
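The carry equations can be checked with a small behavioral model. The Python sketch below evaluates Equation 3 directly from the carry-generate and carry-propagate terms, without rippling a carry from stage to stage; the function names and the 4-bit default width are assumptions for illustration, and bit lists are ordered LSB first.

```python
def cla_carries(a_bits, b_bits, ci0=0):
    """Return [CO_1, ..., CO_n] computed from Equation 3 (bit lists are LSB first)."""
    g = [a & b for a, b in zip(a_bits, b_bits)]    # carry-generate terms G_i
    p = [a | b for a, b in zip(a_bits, b_bits)]    # carry-propagate terms P_i
    carries = []
    for i in range(len(g)):
        co = 0
        for j in range(i, -1, -1):                 # product term G_j . P_{j+1} ... P_i
            term = g[j]
            for m in range(j + 1, i + 1):
                term &= p[m]
            co |= term
        term = ci0                                 # product term CI_0 . P_0 ... P_i
        for m in range(i + 1):
            term &= p[m]
        carries.append(co | term)
    return carries


def cla_add(a, b, width=4):
    """Add two unsigned width-bit numbers using the look-ahead carries."""
    a_bits = [(a >> i) & 1 for i in range(width)]
    b_bits = [(b >> i) & 1 for i in range(width)]
    carry_in = [0] + cla_carries(a_bits, b_bits)   # CI_0 = 0, CI_{i+1} = CO_{i+1}
    total = 0
    for i in range(width):
        total |= (a_bits[i] ^ b_bits[i] ^ carry_in[i]) << i
    return total | (carry_in[width] << width)      # final carry-out is the extra bit


print(cla_add(9, 7))    # 16
```

Each carry is a flat sum of products, so in hardware every carry-out settles after the same small number of gate levels, at the price of the growing fan-in noted in the list above.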
2.3.2 Carry-Save Adder (CSA)
Carry-save adders are designed to add more than two operands. This technique
involves cascading full adders such that the carry output of each adder is shifted
left by one bit position and added by an FA in the next row (referred to as carry save), except
in the last row. A single RCA (ripple-carry adder) or CLA may be used in the last row.
The concept is illustrated below for the addition of five 1-bit operands A0, B0, C0, D0
and E0. The following relationship is used to determine the number of rows of adders
required [9].
Number of rows of adders = Number of operands to be added - 1
                      A0    Operand 1
                      B0    Operand 2
                   +  C0    Operand 3
                     S10    Sum, Row 1
                CO11        Carry, Row 2 (carry save)
                   +  D0    Operand 4
                S21  S20    Sum, Row 2
                CO21        Carry, Row 3 (carry save)
                   +  E0    Operand 5
                S31  S30    Sum, Row 3
          CO32  CO31        Carry, Row 4 (carry save)
    CO43  CO42  CO41        Carry, Row 4 (no carry save)
     S43   S42   S41   S40  Sum, Row 4 (last row)
Extending the concept for more bits, a general block diagram layout for a CSA using FA
can be drawn. The diagram layout is illustrated in Figure 5. This type of circuit
configuration is also referred to as a Wallace-Tree Summing Network. HA can be used
in places where only two bits must be added and the least significant bit is not required





































Figure 5 General block diagram layout for a CSA using full adders [9]
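The carry-save idea can be modeled on whole words rather than single bits. In the Python sketch below, which assumes unsigned operands and uses hypothetical function names, each row of full adders reduces three words to a sum word and a left-shifted carry word, and only the last row performs a carry-propagating addition (standing in for the RCA or CLA).

```python
def carry_save_row(x, y, z):
    """One row of full adders: three words in, a sum word and a shifted carry word out."""
    sum_word = x ^ y ^ z                               # per-bit full-adder sum
    carry_word = ((x & y) | (x & z) | (y & z)) << 1    # per-bit carry, saved one position left
    return sum_word, carry_word


def carry_save_add(operands):
    """Add three or more unsigned integers; return (total, number of adder rows used)."""
    if len(operands) < 3:
        raise ValueError("carry-save addition needs at least three operands")
    sum_word, carry_word = carry_save_row(operands[0], operands[1], operands[2])
    rows = 1
    for operand in operands[3:]:                       # one extra row per extra operand
        sum_word, carry_word = carry_save_row(sum_word, carry_word, operand)
        rows += 1
    total = sum_word + carry_word                      # last row: carry-propagate adder
    return total, rows + 1


print(carry_save_add([1, 1, 1, 1, 1]))    # five 1-bit operands: total 5 using 4 rows
```

For five operands the model uses four rows, which agrees with the relationship that the number of rows equals the number of operands minus one.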
2.4 MULTIPLIERS
Multiplication of signed numbers represented in two's complement is not as
straightforward as multiplication of unsigned numbers. It
employs either a right-shift or a left-shift algorithm. In this section, the right-shift
algorithm is discussed because it requires less hardware. Multiplication
with right shifts uses top-to-bottom accumulation as governed by the following equation:
$$p^{(j+1)} = \left(p^{(j)} + x_j\,a\,2^{k}\right)2^{-1} \quad \text{with } p^{(0)} = 0 \text{ and } p^{(k)} = p = ax + p^{(0)}\,2^{-k}$$
where a is the multiplicand, $x_j$ is bit j of the k-bit multiplier x, the addition inside the
parentheses is the add step and the multiplication by $2^{-1}$ is the shift-right step.
The example in Figure 6 shows a sequential multiplication of two's complement
numbers with right shifts. The multiplicand is -10 and the multiplier is 11, which yields
the result -110. For two's complement, an arithmetic shift right (ASR) is used: the contents
are shifted right by one bit while the MSB (sign bit) is preserved. For example, 1101 becomes 1110























































































































Figure 6 Sequential multiplication of 2's-complement numbers with right shifts [10]
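The right-shift recurrence can be traced with the following Python sketch, which is a behavioral model and not the register-level hardware. The function name is hypothetical. Python's `>>` on negative integers behaves as an arithmetic shift right, so the sign is preserved. The subtraction in the last cycle is an added assumption that handles a negative multiplier, whose sign bit carries negative weight; the -10 x 11 example does not exercise it.

```python
def multiply_right_shift(a, x, k):
    """Multiply signed a by a k-bit two's complement multiplier x, one multiplier bit per cycle."""
    x_bits = x & ((1 << k) - 1)           # k-bit two's complement pattern of x
    p = 0                                 # p(0) = 0
    for j in range(k):
        term = a * ((x_bits >> j) & 1)    # x_j * a
        if j == k - 1:
            term = -term                  # sign bit of x has weight -2^(k-1)
        p = (p + (term << k)) >> 1        # add x_j * a * 2^k, then arithmetic shift right
    return p


print(multiply_right_shift(-10, 11, 5))   # -110
```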
2.4.1 Radix-4 Booth's Multiplier (Booth's Algorithm)
Booth's Algorithm replaces each string of 1's in the multiplier by a +1 and a -1 digit.
This most basic form of Booth's Algorithm is called radix-2 Booth recoding. There are
two ways to speed up the multiplication process:
i) Reducing the number of operands to be added by handling more than one
multiplier bit at a time.
ii) Adding the operands faster via parallel/pipelined multi-operand addition using
tree and array multipliers.
Radix-4 Booth's recoding is a variation of modified Booth's Algorithm. Table 3
shows the recoding technique associated with radix-4 Booth's Algorithm. The multiplier bit
at position i is denoted $x_i$, and the recoded radix-4 digit of the multiplier is $z_{i/2}$. An example of recoding
a multiplier is provided below the table. From the example, it can be seen that a 16-bit
multiplier is recoded to an 8-digit radix-4 operand, thus reducing the number of partial products to
be added.
Table 3 Radix-4 Booth's recoding [10]
x_{i+1}  x_i  x_{i-1} | y_{i+1}  y_i | z_{i/2} | Explanation
0  0  0 |  0   0 |  0 | No string of 1s in sight
0  0  1 |  0   1 |  1 | End of string of 1s
0  1  0 |  0   1 |  1 | Isolated 1
0  1  1 |  1   0 |  2 | End of string of 1s
1  0  0 | -1   0 | -2 | Beginning of string of 1s
1  0  1 | -1   1 | -1 | End a string, begin new one
1  1  0 |  0  -1 | -1 | Beginning of string of 1s
1  1  1 |  0   0 |  0 | Continuation of string of 1s
Example:
1001 1101 1010 1110 = $(21\ 31\ 22\ 32)_{four}$    Operand x
-2  2  -1  2  -1  -1  0  -2    Recoded version z (from Table 3, most significant digit first)





a              0 1 1 0
x              1 0 1 0
z               -1  -2        Recoded
p(0)       0 0 0 0 0 0
+z0 a      1 1 0 1 0 0
4p(1)      1 1 0 1 0 0
p(1)       1 1 1 1 0 1 0 0    Shifted 2 bits to the right and sign extended
+z1 a      1 1 1 0 1 0
4p(2)      1 1 0 1 1 1 0 0
p(2)       1 1 0 1 1 1 0 0
Figure 7 Radix-4 multiplication with modified Booth's recoding [10]
The example in Figure 7 illustrates radix-4 multiplication with modified Booth's
recoding of the two's complement multiplier. The multiplicand is 6 whereas the
multiplier is -6, which gives -36 as the result. Since the multiplier is 4 bits long, only two
additions of partial products are required with radix-4 multiplication. The redundant sign
bits in front of the final result can be discarded. Note that the right-shift algorithm is used.
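The recoding in Table 3 and the 6 x (-6) example can be reproduced with the Python sketch below. It is a simplified model: the partial products are summed with explicit weights of 4^j instead of the right-shift accumulation used in Figure 7, and the names `BOOTH_RADIX4`, `booth_recode` and `booth_multiply` are assumptions for illustration.

```python
# Table 3 as a lookup: (x_{i+1}, x_i, x_{i-1}) -> radix-4 digit z_{i/2}
BOOTH_RADIX4 = {
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0,
}


def booth_recode(x, k):
    """Recode a k-bit (k even) two's complement multiplier into k/2 digits, least significant first."""
    bits = [0] + [(x >> i) & 1 for i in range(k)]      # bits[0] is the implied x_{-1} = 0
    return [BOOTH_RADIX4[(bits[i + 2], bits[i + 1], bits[i])] for i in range(0, k, 2)]


def booth_multiply(a, x, k):
    """Sum the partial products z_j * a, each weighted by 4^j."""
    return sum(z * a * 4 ** j for j, z in enumerate(booth_recode(x, k)))


print(booth_recode(-6, 4))        # [-2, -1], i.e. z1 = -1 and z0 = -2
print(booth_multiply(6, -6, 4))   # -36
```

A 4-bit multiplier yields only two partial products, each of which is 0, ±a or ±2a, so every partial product can be formed by shifting and optionally negating the multiplicand.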
An advantage of using the modified Booth's recoding technique is that the number of
partial products is reduced, which in turn reduces the hardware and delay required to sum
the partial products. This is because when there is a string of 0's or a string of 1's in the
multiplier, only a shifting operation is performed, which is faster than addition. Hence, it is
often wise to choose as the multiplier whichever two's complement operand has fewer changes
between 0's and 1's. For instance, consider the two's complement numbers 101001
and 111001 in Table 4. A disadvantage of Booth's Algorithm is that it adds delay to
the formation of partial products.
Table 4 Selection of multiplier based on fewer transitions in 0's or 1's
Operand   Transitions
101001    4 changes. From 1 to 0, from 0 back to 1, then back to 0, and from 0
          to 1 for the last bit.
111001    2 changes. The 1 in bit-3 changes to 0, then the 0 in bit-1 changes
          to 1. Selected as multiplier.
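The selection rule of Table 4 amounts to counting adjacent bit positions that differ. A small Python sketch (helper names are illustrative):

```python
def transitions(bits: str) -> int:
    """Count positions where adjacent bits of the pattern differ."""
    return sum(1 for left, right in zip(bits, bits[1:]) if left != right)


def pick_multiplier(op_a: str, op_b: str) -> str:
    """Return the operand with fewer 0/1 transitions (the first one on a tie)."""
    return op_a if transitions(op_a) <= transitions(op_b) else op_b


chosen = pick_multiplier("101001", "111001")
```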
The hardware implementation of radix-4 multiplier requires registers for
multiplicand, multiplier and partial product, recoding logic, multiplexer and adder. The
simplified block diagram for a radix-4 multiplier based on Booth's recoding is
represented in Figure 8.





Figure 8 Hardware realization of radix-4 multiplier based on Booth's recoding [10]
Figure 9 shows the recoding logic and multiplexer used to generate a partial product.
The multiplier group consists of three bits of the multiplier (x_{i+1}, x_i, x_{i-1}). The output
of the Booth decoder selects 0, M or 2M, where M is the multiplicand. The XOR gates
generate the one's complement by inverting all the bits. If the MSB of the multiplier group is
0, the partial product is 0, M or 2M; if the MSB of the multiplier group is 1,
all the bits of the partial product are inverted. -M or -2M is then obtained by
adding S = 1, which forms the two's complement of the partial product. The resulting partial
product is then added to the previous partial product stored in a register that is shifted







Figure 9 Recoding logic and multiplexer to generate partial products [10]
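The select-invert-add-S scheme described for Figure 9 can be checked with a short bit-level model. The sketch below is a Python illustration under assumed names and an assumed (width+1)-bit datapath; it is not taken from the 'Boothpar' module. The most negative multiplicand is excluded from the check because -2M would then need one more bit.

```python
def partial_product(x_hi: int, x_mid: int, x_lo: int, m: int, width: int):
    """Return (bits, s): the (width+1)-bit multiplexer/XOR output and the S correction bit."""
    mask = (1 << (width + 1)) - 1
    # the decoder selects the magnitude 0, M or 2M from the three-bit multiplier group
    select = (x_mid + x_lo) if x_hi == 0 else 2 - (x_mid + x_lo)
    bits = (select * m) & mask
    if x_hi:
        bits ^= mask          # XOR gates invert every bit (one's complement)
    return bits, x_hi         # S = 1 completes the two's complement


def signed_value(bits: int, s: int, width: int) -> int:
    """Interpret bits + S as a (width+1)-bit two's complement number."""
    total = (bits + s) & ((1 << (width + 1)) - 1)
    sign = 1 << width
    return (total ^ sign) - sign


example_bits, example_s = partial_product(1, 0, 0, 6, 4)   # group 100 selects -2M
```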
2.4.2 Baugh-Wooley Array Multiplier
The Baugh-Wooley array multiplier is used to multiply positive and negative numbers
in two's complement. The principle of this multiplier is that a subtraction can be performed
as an addition by complementing the subtrahend and adding 1. This multiplier has a regular
structure and is governed by a final equation derived as follows [11]:
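Let us consider two numbers A and B:
$$A = (a_{n-1} \ldots a_0) = -a_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} a_i 2^i$$
$$B = (b_{n-1} \ldots b_0) = -b_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} b_i 2^i$$
The product of A and B is given by the following equation:
$$A \cdot B = a_{n-1} b_{n-1} 2^{2n-2} + \sum_{i=0}^{n-2} \sum_{j=0}^{n-2} a_i b_j 2^{i+j} - a_{n-1} \sum_{i=0}^{n-2} b_i 2^{i+n-1} - b_{n-1} \sum_{i=0}^{n-2} a_i 2^{i+n-1}$$
In order to use only adder cells, the negative terms are rewritten as:
$$-a_{n-1} \sum_{i=0}^{n-2} b_i 2^{i+n-1} = a_{n-1} \left( -2^{2n-2} + 2^{n-1} + \sum_{i=0}^{n-2} \bar{b}_i 2^{i+n-1} \right)$$
Hence, the product of A and B becomes:
$$A \cdot B = a_{n-1} b_{n-1} 2^{2n-2} + \sum_{i=0}^{n-2} \sum_{j=0}^{n-2} a_i b_j 2^{i+j} + b_{n-1} \left( -2^{2n-2} + 2^{n-1} + \sum_{i=0}^{n-2} \bar{a}_i 2^{i+n-1} \right) + a_{n-1} \left( -2^{2n-2} + 2^{n-1} + \sum_{i=0}^{n-2} \bar{b}_i 2^{i+n-1} \right)$$
The final equation is:
$$A \cdot B = -2^{2n-1} + \left( \bar{a}_{n-1} + \bar{b}_{n-1} + a_{n-1} b_{n-1} \right) 2^{2n-2} + \sum_{i=0}^{n-2} \sum_{j=0}^{n-2} a_i b_j 2^{i+j} + \left( a_{n-1} + b_{n-1} \right) 2^{n-1} + \sum_{i=0}^{n-2} b_{n-1} \bar{a}_i 2^{i+n-1} + \sum_{i=0}^{n-2} a_{n-1} \bar{b}_i 2^{i+n-1}$$
since
$$-\left( b_{n-1} + a_{n-1} \right) 2^{2n-2} = -2^{2n-1} + \left( \bar{a}_{n-1} + \bar{b}_{n-1} \right) 2^{2n-2}$$
Equation 4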
A and B are n-bit operands, so their product is a 2n-bit number. Consequently,
the most significant weight is 2^{2n-1}, and the first term -2^{2n-1} is taken into account by
adding a 1 in the most significant cell of the multiplier. Figure 10 shows the structure of
a 5-bit Baugh-Wooley multiplier, which can be verified using the final equation by
substituting n=5. The array comprises (n-1)*(n-1)+1 full adders, multiplication units
(AND gates) and carry propagation adders.
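The final Baugh-Wooley expression can be checked numerically: a constant -2^{2n-1}, a top cell weighted 2^{2n-2}, the (n-1) by (n-1) array of AND terms, a correction (a_{n-1} + b_{n-1}) 2^{n-1}, and two rows of cross terms that use complemented bits. The Python sketch below evaluates these terms for every pair of 5-bit operands; it is an arithmetic check under that reading of the equation, not a model of the adder array itself, and the function name is illustrative.

```python
def baugh_wooley_product(a: int, b: int, n: int) -> int:
    """Evaluate the final Baugh-Wooley equation for n-bit two's complement a and b."""
    abit = [(a >> i) & 1 for i in range(n)]     # two's complement bits, LSB first
    bbit = [(b >> i) & 1 for i in range(n)]
    an, bn = abit[n - 1], bbit[n - 1]           # sign bits a_{n-1}, b_{n-1}
    total = -(1 << (2 * n - 1))
    total += ((1 - an) + (1 - bn) + an * bn) << (2 * n - 2)
    total += sum((abit[i] * bbit[j]) << (i + j)
                 for i in range(n - 1) for j in range(n - 1))
    total += (an + bn) << (n - 1)
    total += sum((bn * (1 - abit[i])) << (i + n - 1) for i in range(n - 1))
    total += sum((an * (1 - bbit[i])) << (i + n - 1) for i in range(n - 1))
    return total


check_n5 = all(baugh_wooley_product(a, b, 5) == a * b
               for a in range(-16, 16) for b in range(-16, 16))
```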
W_ij = AND(X_i, Y_j)
W_ij = AND(X_i', Y_j)














The flow of the entire project is outlined as seen in Figure 12. The design
methodology for an FIR filter starts from filter specifications, coefficients calculations,
filter structure, study of finite wordlength effects and finally filter implementation.
Specifications of the filter are determined based on the type of filter designed. There are
four types of filters, namely low-pass, high-pass, bandpass and bandstop filter. Several
methods are available to obtain filter coefficients and the most commonly used are
window, optimal and frequency sampling methods. Two most basic FIR filter
architectures are direct form (DF) and transpose direct form (TDF), given in Figure 11.
In this project, a low-pass FIR filter with DF architecture is designed using Kaiser
Window method.






Figure 11 (a) DF FIR filter architecture (b) TDF FIR filter architecture [12]
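The two architectures compute the same output sequence and differ only in where the delays sit: the direct form delays the input samples, while the transposed form delays partial sums. The Python sketch below is a behavioural illustration of that equivalence with hypothetical coefficients; it says nothing about hardware cost.

```python
def fir_direct(h, x):
    """Direct form: a delay line of past inputs feeds all multipliers, then one sum."""
    delay = [0] * (len(h) - 1)
    out = []
    for sample in x:
        taps = [sample] + delay
        out.append(sum(c * t for c, t in zip(h, taps)))
        delay = taps[:-1]                 # shift the delay line by one sample
    return out


def fir_transposed(h, x):
    """Transposed direct form: the input feeds all multipliers, registers hold partial sums."""
    state = [0] * (len(h) - 1)
    out = []
    for sample in x:
        out.append(h[0] * sample + (state[0] if state else 0))
        nxt = []
        for k in range(len(state)):
            carry_in = state[k + 1] if k + 1 < len(state) else 0
            nxt.append(h[k + 1] * sample + carry_in)
        state = nxt
    return out


h_demo = [1, -2, 3, -2, 1]                # hypothetical coefficients
x_demo = [4, 0, -1, 7, 2, 2, 0, 0, 0]
same = fir_direct(h_demo, x_demo) == fir_transposed(h_demo, x_demo)
```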
The different structures of adders and multipliers are explored, and some of the
structures have already been discussed in the previous chapter. Design description, which
describes the circuit in terms of its behaviour, can be done at several levels of
abstraction. The lowest level is circuit level with switches as the basic element,
followed by gate level, data flow level and lastly, the highest level, which is behavioural
level. In common practice, both gate level and data flow level modeling (RTL level) are
used because many of the behavioural level constructs are not directly synthesizable.
Even if synthesizable, they are likely to yield relatively redundant or wrong hardware.
The number of bits used to represent input data and filter coefficients, and also the
number representation scheme, are important considerations that can affect the filter
performance.
A basic FIR filter consists of multipliers, adders and delay units, as can be
deduced from Equation 1 (page 7). Depending on the architecture and performance
objectives, a filter can also have memory and control unit. Each of the filter components
is coded into Verilog and its functionality is verified. Performance comparison is done
for different structures of adders and multipliers in view of their propagation delay, area
and power consumption. Two different structures of adders and multipliers are compared
in this project. The better component structure based on performance is chosen to be
integrated into the complete filter design. Functionality of the complete filter is verified
through simulation and its performance is tabulated. Once successful, the design is
downloaded into FPGA and functionality verification is carried out by analyzing the
filter output using a logic analyzer.
[Flowchart: extensive research on FIR filter concepts and design methods, and on adders and multipliers; familiarization with Verilog HDL and design software; decision on the number of bits used to represent data and computational ...; ... to form complete filter and verify functionality; debugging]
Figure 12 Entire project flow
3.2 BASIC DESIGN METHODOLOGY
Figure 13 indicates the crucial steps in designing small modules of a filter. Each
















Figure 13 Steps in designing small modules of a filter [13]
1. Determine specification. The specification details the behavior and interface of each
module in the design. At the module level, the specification includes the following:
i) A description of the top-level behavior of the module
ii) A description of all inputs and outputs, their timing and constraints
iii) Performance requirements and constraints
2. Structure design to register transfer level (RTL). This is a logic design phase where a
block diagram for the design is determined, which includes registers and functions of
combinational logic.
3. Capture design as Verilog. Design description can be done based on a few levels of
abstraction - the highest is behavioral level, followed by data flow level, gate level
and the lowest switch (circuit) level. Many of the behavioral level constructs are not
directly synthesizable; even if synthesized they are likely to yield relatively
redundant or wrong hardware. The solution is to redo the behavioral modules at
lower levels.
4. Verify design. This is a pre-synthesis verification process to determine that the design
is 100% functionally correct. This process is known as functional simulation.
5. Synthesize design. Synthesis tools are used to transform the Verilog design into a
gate level design.
6. Verify results of synthesis. Gate-level simulation, timing analysis and other
techniques are used to verify that the design produced by the synthesis tool is correct
and consistent with the Verilog RTL design.
7. Place and route. This stage is referred to as physical design where the actual layout
of the chip is determined. The gates in the chip are assigned (placement) to positions
on the chip and then connected together with wires (routing). Post-place-and-route
simulation can then be performed to obtain area and timing information.
8. Final verification. A number of final checks are done to ensure that the chip is wired
up correctly and is manufacturable. The nature of these checks is beyond the scope of
this project.
3.3 BIT REPRESENTATION SCHEME
In this project, the number of bits used to represent input data and filter
coefficients is eight bits. Signed numbers will be used with two's complement as the
representation scheme. Fixed-point numbers will be employed instead of floating-point
which needs a more complex number representation scheme. Area, speed and power
consumption analyses are performed by using ModelSim and Xilinx ISE simulation
tools.
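As an illustration of the 8-bit two's complement scheme, the Python sketch below converts between signed values and 8-bit patterns. The `quantize` helper additionally assumes that coefficients are scaled by 2^7 (a Q1.7 fixed-point format); that scaling is an assumption made for the example and is not stated in the text.

```python
def to_twos_complement(value: int, bits: int = 8) -> int:
    """Return the bit pattern of a signed value in two's complement."""
    if not -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
        raise ValueError("value does not fit in the given width")
    return value & ((1 << bits) - 1)


def from_twos_complement(pattern: int, bits: int = 8) -> int:
    """Return the signed value of a two's complement bit pattern."""
    sign = 1 << (bits - 1)
    return (pattern ^ sign) - sign


def quantize(coeff: float, frac_bits: int = 7, bits: int = 8) -> int:
    """Scale a real coefficient by 2**frac_bits, round, and saturate (assumed format)."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, round(coeff * (1 << frac_bits))))


pattern = to_twos_complement(-4)          # 0xfc
```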
3.4 IDENTIFICATION OF TOOLS
1. ModelSim and Xilinx ISE simulation tools
2. MATLAB
3. Virtex-II xc2v1000 reference board, carrying the FPGA into which the filter design is
programmed.
4. Agilent Technologies 1673G logic analyzer and probes
5. Xilinx JTAG cable
3.5 TASKS ACCOMPLISHED
1. Two structures of adders and multipliers are designed, simulated and their
performances are compared. The adders are CLA and CSA while multipliers are
radix-4 Booth's multiplier and Baugh-Wooley array multiplier. CLA and Baugh-
Wooley multiplier are found to have better performance compared to their
counterparts.
2. A DF low-pass, 18th order FIR filter is designed by using adders, multipliers and
delay units. The filter is implemented using a parallel approach, which eliminates the
need for a memory and control unit.
3. The performance of the complete filter is analyzed. Its functionality is verified
through software simulation, as well as hardware verification.




Throughout this project, some problems and challenges are encountered as
discussed briefly below:
1. Inexperience in employing the different levels of abstractions of Verilog coding. As
mentioned, some codes written in behavioural level may be non-synthesizable.
Considerable amount of time is used to debug the faulty codes when simulation fails
or gives incorrect output.
2. Limitation of the Virtex-II device. This device has 172 bonded IOBs. However, both
adders accept outputs from 19 multipliers simultaneously, which gives a total of 304
bits for all outputs of the multipliers. This limitation of the target device causes
simulations to fail for both CLA and CSA. The solution is discussed in the next chapter.
3. Inability of the 1673G logic analyzer to provide inputs. The logic analyzer available
in the lab is not able to provide inputs to the filter that is downloaded into the Virtex-
II chip. Hence, inputs to the filter are provided by extending the codes to
include a signal generator module.
4. Difficulty in predicting the output from the filter. It can be seen from the codes that
filter operation is controlled by the triggering of the clock. During hardware testing of
the filter functionality, the onboard clock is utilized and is always running once the
board is powered up. Therefore, it is very hard to compare the output from
simulations with the output obtained from the logic analyzer. A manual push button
(available on the board) is used to serve the function of a clock trigger.
3.7 TESTING & TROUBLESHOOTING
A lot of debugging is done on the codes when simulation fails or gives incorrect
output. This is often so when behavioural level modeling is used to model the filter
components. Behavioural level modeling is inevitable when conditional and looping
constructs are employed in the process of designing. Examples of these types of constructs
are 'if', 'if-else', 'while' and 'for'. In this case, experience is vital to recognize the way
of writing that results in codes that are synthesizable.
All the filter components are simulated and verified to ensure that their intended
functionalities are correct before proceeding to the next step in designing. The complete
filter does not require much troubleshooting since all lower level modules are
functioning correctly. The simulated design is verified through hardware synthesis using




4.1 FIR FILTER SPECIFICATIONS
A low-pass FIR filter is designed using Kaiser Window with MATLAB 'sptool'.
A set of filter specifications is defined in Table 5.
Table 5 Filter specifications
Specifications                 Values
Passband frequency, Fp         1000 Hz
Stopband frequency, Fs         2000 Hz
Passband ripple, Rp            0.4455 dB (5%)
Stopband ripple, Rs            40 dB (1%)
Sampling frequency, Fsamp      8000 Hz
This set of specifications yields an 18th order filter with 19 coefficients altogether.
The specifications are chosen such that the number of coefficients is not too large, in order
to limit the filter size. The multiplication and addition process carried out by the filter
is intended to be parallel so that the throughput and sample rate of the filter can be
maximized. Due to the parallelism, the number of coefficients has to be small in order to
reduce hardware. FIR filters can also be implemented sequentially, an approach that
aims to minimize area requirements through the reuse of as much hardware as
possible. However, its bottleneck is low throughput. A direct form (DF) FIR filter is
realized in this project.
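The order of 18 is consistent with Kaiser's empirical order estimate for the specifications in Table 5, using the tighter ripple (1%, that is 40 dB). The Python sketch below applies that textbook formula; whether MATLAB's 'sptool' uses exactly this estimate internally is an assumption, and the function names are illustrative.

```python
import math


def kaiser_order(f_pass: float, f_stop: float, f_samp: float, atten_db: float) -> int:
    """Kaiser's estimate N = (A - 7.95) / (2.285 * delta_omega), rounded up."""
    delta_omega = 2 * math.pi * (f_stop - f_pass) / f_samp   # transition width, rad/sample
    return math.ceil((atten_db - 7.95) / (2.285 * delta_omega))


def kaiser_beta(atten_db: float) -> float:
    """Kaiser window shape parameter for a given stopband attenuation in dB."""
    if atten_db > 50:
        return 0.1102 * (atten_db - 8.7)
    if atten_db >= 21:
        return 0.5842 * (atten_db - 21) ** 0.4 + 0.07886 * (atten_db - 21)
    return 0.0


order = kaiser_order(1000, 2000, 8000, 40)
beta = kaiser_beta(40)
```

An order of 18 corresponds to 19 coefficients, as stated above.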
4.1.1 Analysis of Designed FIR Filter
The defined filter specifications are analyzed to determine the level of filter
performance in removing or reducing high-frequency noise. It can be seen in Figure 14
that the generated signal has a frequency of 500 Hz and the random noise has frequencies
ranging from 500 Hz to 8000 Hz. The two signals are combined to create a noisy signal, s,
which is then passed through the designed filter to give the filtered
output y. The second plot in Figure 16 resembles the original signal, in that the filtered
signal is relatively smooth without the jagged edges caused by high-frequency noise. Since
the cutoff frequency of the designed filter is 1500 Hz, any frequencies above this are
significantly suppressed. These suppressed frequencies have negligible amplitudes owing
to the 40 dB stopband ripple. However, the filtered output displays a phase lag, termed
group delay, of nine samples. The group delay of a filter is a measure of the average delay
of the filter as a function of frequency. It is the negative first derivative of the phase
response of the filter.
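For a symmetric (linear-phase) FIR filter with 19 coefficients, the group delay is (19 - 1)/2 = 9 samples at every frequency. The Python sketch below estimates the negative phase derivative numerically for a hypothetical symmetric 19-tap filter (a triangular set of taps chosen only for illustration, not the designed coefficients).

```python
import cmath


def freq_response(h, omega: float) -> complex:
    """H(e^{j omega}) = sum over k of h[k] * e^{-j omega k}."""
    return sum(c * cmath.exp(-1j * omega * k) for k, c in enumerate(h))


def group_delay(h, omega: float, step: float = 1e-4) -> float:
    """Negative first derivative of the phase response, by finite difference."""
    p0 = cmath.phase(freq_response(h, omega))
    p1 = cmath.phase(freq_response(h, omega + step))
    return -(p1 - p0) / step


taps = list(range(1, 11)) + list(range(9, 0, -1))   # 19 symmetric taps: 1, 2, ..., 10, ..., 2, 1
delay = group_delay(taps, 0.1)
```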




%to create noise with 16 different frequencies
for k=1:16
    nn(k,:)=0.08*randn(1)*sin(2*pi*k*500*t);
end
sum=0;
for k=1:16
    sum=sum+nn(k,:);
end
s=x+sum;
%filt1 consists of designed filter specs




subplot(2,1,1); plot(x(m));
xlabel('Time index n'); ylabel('Amplitude');
title('Signal, x = sin(500\pit)');
subplot(2,1,2); plot(sum(m));
xlabel('Time index n'); ylabel('Amplitude');
title('Random noise, sum');
figure(2);
subplot(2,1,1); plot(s(m));
xlabel('Time index n'); ylabel('Amplitude');
title('Noisy signal, x + sum');
subplot(2,1,2); plot(y(m));




Figure 14 Codes to test the filter performance









Figure 15 Original signal and generated random noise
























Figure 16 Noisy signal and filtered signal
4.2 VERILOG CODES
This section indicates the associated codes that are used in the filter design. These
include codes for Baugh-Wooley array multiplier, CLA, shift register and the complete
filter. Note that other Verilog codes associated with radix-4 Booth's multiplier and carry-
save adder are included in Appendix A.
4.2.1 Baugh-Wooley Array Multiplier
Variable B (codes in Figure 17) represents the coefficient of the filter and is
declared as parameter so that its value can be changed in the complete filter design
during instantiation of this module. The following codes illustrate an example which
declares B as having the hexadecimal value 02. The test-bench for Baugh-Wooley array
multiplier instantiates the module 'Wooley' that declares B as an input port rather than
parameter in order to be used for simulation purpose. The complete codes for this



















...
sum11, sum12, sum13, sum14, sum15, sum16, ...
sum21, sum22, sum23, sum24, sum25, sum26, ...
sum31, sum32, sum33, sum34, sum35, sum36, ...
cout0, cout1, cout2, cout3, cout4, cout5, ...
cout11, cout12, cout13, cout14, cout15, ...
cout21, cout22, cout23, cout24, cout25, ...
cout31, cout32, cout33, cout34, cout35, ...
cout41, cout42, cout43, cout44, cout45, ...
cout51, cout52, cout53, cout54, cout55, ...
assign W[0] = A[0] & B[0];
assign W[1] = A[1] & B[0];
assign W[2] = A[2] & B[0];
assign W[3] = A[3] & B[0];
assign W[4] = A[4] & B[0];
assign W[5] = A[5] & B[0];












..., cout20,
..., cout30,
..., cout40;






Wooley woo (A, B, P);
initial
begin
    A = 8'h00; B = 8'h00;
    ...
    #50 A = 8'h31; B = 8'h32;
    ...
end
initial $monitor($realtime, " A=%h, B=%h, product=%h", A, B, P);
endmodule
Figure 18Test-bench for Baugh-Wooley array multiplier
module full_adder (cin, b, a, sum, cout);
input cin, b, a;
output sum, cout;
wire S01, C01, C02;
//two half adders and an OR gate form the full adder
half_adder ha1 (a, b, S01, C01);
half_adder ha2 (cin, S01, sum, C02);
assign cout = C01 | C02;
endmodule
Figure 19 Full adder
module half_adder (A, B, sum, cout);
input A, B;
output sum, cout;
assign cout = A & B;
assign sum = A ^ B;
endmodule
Figure 20 Half adder
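The composition of a full adder from two half adders and an OR gate can be checked exhaustively. The Python sketch below is a gate-level model for illustration only; the function names mirror the module names but the code is not the project's Verilog.

```python
def half_adder(a: int, b: int):
    """Return (sum, carry) of two one-bit inputs."""
    return a ^ b, a & b


def full_adder(a: int, b: int, cin: int):
    """Two half adders plus an OR gate, as in the structural description."""
    s01, c01 = half_adder(a, b)
    total, c02 = half_adder(cin, s01)
    return total, c01 | c02


truth_ok = all(
    2 * full_adder(a, b, c)[1] + full_adder(a, b, c)[0] == a + b + c
    for a in (0, 1) for b in (0, 1) for c in (0, 1)
)
```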
4.2.2 Carry-Look-Ahead Adder (CLA)
Figures 21 and 22 represent 16-bit CLA and 17-bit CLA respectively. As the
names imply, a 16-bit CLA is capable of adding two operands that have 16 bits. Note
that the Verilog codes for CLA_nsx (4-bit CLA without sign extension), CLA (4-bit







wire CI0 = 0;
wire C01, C02, C03;
CLA_nsx clan1 (A[3:0], B[3:0], CI0, S[3:0], C01);
CLA_nsx clan2 (A[7:4], B[7:4], C01, S[7:4], C02);
CLA_nsx clan3 (A[11:8], B[11:8], C02, S[11:8], C03);
CLA cla1 (A[15:12], B[15:12], C03, S[15:12], S[16]);
endmodule








wire CI0 = 0;
CLA_nsx clan1 (A[3:0], B[3:0], CI0, S[3:0], C01);
CLA_nsx clan2 (A[7:4], B[7:4], C01, S[7:4], C02);
CLA_nsx clan3 (A[11:8], B[11:8], C02, S[11:8], C03);
CLA_nsx clan4 (A[15:12], B[15:12], C03, S[15:12], C04);
assign A17 = A[16], A18 = A[16], A19 = A[16];
assign B17 = B[16], B18 = B[16], B19 = B[16];
CLA cla1 ({A19, A18, A17, A[16]}, {B19, B18, B17, B[16]}, C04, {S19, S18, S[17:16]}, S20);
endmodule
Figure 22 17-bit CLA
4.2.3 Shift Register (Delay Units)
Figure 23 shows the codes for a shift register which consists of instantiations of
eighteen flip-flops. The flip-flops serve as delay units for the filter.
`timescale 1ns/1ps
module delay (clk, reset, x, y1, y2, y3, y4, y5, y6, y7, y8, y9,





























































Figure 23 Shift register acts as delay units by flip-flop instantiations
`timescale 1ns/1ps













Figure 24 Verilog codes of a D flip-flop
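The behaviour of eighteen cascaded D flip-flops can be summarized by a simple model in which every clock edge moves each stored sample one stage along. The Python sketch below is a behavioural illustration; the class name is hypothetical, and the reset is modelled as clearing all stages, which is an assumption about the flip-flop description.

```python
class ShiftRegister:
    """Cascade of D flip-flops acting as the delay units of the filter."""

    def __init__(self, stages: int = 18):
        self.q = [0] * stages             # q[0] is y1, q[17] is y18

    def clock(self, d: int, reset: bool = False):
        """Apply one positive clock edge and return the stage outputs."""
        if reset:
            self.q = [0] * len(self.q)
        else:
            self.q = [d] + self.q[:-1]    # each stage takes its predecessor's value
        return list(self.q)


reg = ShiftRegister()
for sample in (1, 6, 11):
    outputs = reg.clock(sample)
```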
4.2.4 Filter Implementation
The Verilog description for the complete filter and its associated test-bench can
be seen in Figures 25 and 26 respectively. During the instantiations of multipliers, the














//register acts as buffer for data storage for one clock cycle













delay shift_reg (clock, reset, data_buf, y1, y2, y3, y4, y5, y6, y7, y8, y9, y10, y11,
y12, y13, y14, y15, y16, y17, y18);






Wooley #(8'hf8) mult6 (y5, P6);
Wooley #(8'hfc) mult7 (y6, P7);
Wooley #(8'h0d) mult8 (y7, P8);








Wooley mult17 (y16, P17);
Wooley #(8'h00) mult18 (y17, P18);
Wooley #(8'h00) mult19 (y18, P19);
//instantiations of adders that add operands with varying number of bits
















CLA_18 cla18b (...);
CLA_19 cla19a (...);
assign r18 = Ree[17], r19 = Ree[17];
CLA_20 cla20a (Rhh, {r19, r18, Ree}, out);
Figure 25 Verilog description for the complete filter
`timescale 1ns/1ps





parameter offset = 100;
parameter cycle = 20;
filter filt (.clock(clock), .reset(reset), .data_in(data_in), .out(out));
//clock generation: starts after 'offset' and toggles every 'cycle'
initial
begin
    clock = 0; reset = 0; data_in = 8'h00;
    #offset;
    forever #cycle clock = ~clock;
end
//reset pulse followed by an input that increases by 5 every clock period
initial
begin
    #(offset+cycle) reset = 1;
    #cycle;
    reset = 0;
    data_in = 8'h01;
    for (i=0; i<20; i=i+1)
    begin
        #(cycle*2);
        data_in = data_in + 3'd5;
    end
end
initial $monitor($time, " clock=%b, reset=%b, input=%h, output=%h", clock, reset, data_in, out);
Figure 26 Test-bench for the complete filter
4.3 SOFTWARE SIMULATIONS
Functional and timing simulation results for the radix-4 Booth's multiplier and
Baugh-Wooley multiplier are included in Appendix B.
Simulations of the CLA for performance comparison are done on the overall
adder formed by multiple CLA instantiations. However, the number of I/Os of the
overall adder exceeds the number of I/Os that the selected device is capable of
handling, which causes simulation to fail. Thus, some of the input ports are declared as
'wire' and assigned values internally. To ensure the accuracy of the simulation results in
terms of performance criteria, two sets of the number of input ports are chosen, namely
one and eight input ports. It can be seen in Tables 7 and 8 that the percentage
difference follows a consistent trend for the three performance criteria. The percentage
differences for all three criteria (path delay, area and power consumption) decrease by
about half when the number of input ports increases from one to eight. The respective
Verilog codes are attached in Appendix B, shown in Figures 51 and 53, together with the
simulation results for both test-benches.
Similar to CLA, the simulations for CSA for performance comparison are done
based on the overall adder formed by multiple CSA instantiations. The CSA also
encounters the same problem as in the case of CLA. Similar method as in CLA is used to
perform simulations on CSA. The Verilog codes for overall adder with one input and
eight input ports are included in Appendix B, shown in Figures 59 and 61, together with
the simulation results for both test-benches.
4.3.1 Performance Comparisons
The following results are obtained through functional and timing simulations
using Xilinx ISE synthesis tool.






                                               Radix-4 Booth's   Baugh-Wooley   Percentage
                                               multiplier        multiplier     difference
Maximum path delay after place & route (ns)    24.542            25.078         2.14%
Area (no. of slices out of 5120)               78                64             -21.88%
Power consumption (mW)                         510.34            481.65         -5.96%








                                               CLA               CSA            Percentage
                                                                                difference
Maximum path delay after place & route (ns)    27.200            26.090         4.08%
Area (no. of slices out of 5120)               31                51             -64.52%
Power consumption (mW)                         570.49            510.34         10.54%
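The percentage column of this table is consistent with the difference between the two values expressed relative to the first value. The Python sketch below reproduces the three figures under that assumed definition; the function name is illustrative.

```python
def percentage_difference(first: float, second: float) -> float:
    """(first - second) / first, as a percentage rounded to two decimals."""
    return round((first - second) / first * 100, 2)


delay_pct = percentage_difference(27.200, 26.090)
area_pct = percentage_difference(31, 51)
power_pct = percentage_difference(570.49, 510.34)
```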
Table 8 Performance comparison between adders with eight input ports
Eight inputs
Maximum path
delay after place &
route (ns)





















Both the functional and timing simulation results for the complete filter are




























































































































000000 //at this time, input data is stored in register
000000 //input 01 is available at data_out, y[1]
000000
































Figure 27 Partial results for the functional simulation of the filter test-bench
0 clock=0 reset=0, input=00, output=xxxxxx
27 clock=0 reset=0, input=00, output=000000
120 clock=1 reset=1, input=00, output=000000
160 clock=1 reset=0, input=01, output=000000
200 clock=1 reset=0, input=06, output=000000
240 clock=1 reset=0, input=0b, output=000000
280 clock=1 reset=0, input=10, output=000000
293 clock=1 reset=0, input=10, output=000002
300 clock=0 reset=0, input=15, output=000002
320 clock=1 reset=0, input=15, output=000002
334 clock=1 reset=0, input=15, output=00000e
360 clock=1 reset=0, input=1a, output=00000e
386 clock=0 reset=0, input=1f, output=000020
400 clock=1 reset=0, input=1f, output=000020
422 clock=0 reset=0, input=24, output=000022
440 clock=1 reset=0, input=24, output=000022
466 clock=0 reset=0, input=29, output=000000
480 clock=1 reset=0, input=29, output=000000
504 clock=0 reset=0, input=2e, output=1fffdb
520 clock=1 reset=0, input=2e, output=1fffdb
548 clock=0 reset=0, input=33, output=00000f
560 clock=1 reset=0, input=33, output=00000f
583 clock=0 reset=0, input=38, output=000107
600 clock=1 reset=0, input=38, output=000107
623 clock=0 reset=0, input=3d, output=0002e4
640 clock=1 reset=0, input=3d, output=0002e4
664 clock=0 reset=0, input=42, output=000562
680 clock=1 reset=0, input=42, output=000562
705 clock=0 reset=0, input=47, output=000810
720 clock=1 reset=0, input=47, output=000810
743 clock=0 reset=0, input=4c, output=000aa6
760 clock=1 reset=0, input=4c, output=000aa6
787 clock=0 reset=0, input=51, output=000d1a
800 clock=1 reset=0, input=51, output=000d1a
828 clock=0 reset=0, input=56, output=000f88
840 clock=1 reset=0, input=56, output=000f88
864 clock=0 reset=0, input=5b, output=001200
880 clock=1 reset=0, input=5b, output=001200
904 clock=0 reset=0, input=60, output=001480
920 clock=1 reset=0, input=60, output=001480
944 clock=0 reset=0, input=65, output=001700
Figure 28 Partial results for the timing simulation of the filter test-bench
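The logged output sequence can be cross-checked with a behavioural model of a 19-tap direct form filter fed by the test-bench ramp (01, 06, 0b, ...). The coefficient list in the Python sketch below is an assumption: it was inferred by fitting a symmetric integer filter to the logged outputs and is not a list given in the text. Outputs are formatted as 21-bit two's complement hexadecimal, matching the six-digit values in the log.

```python
# Assumed coefficients (inferred from the logged outputs, symmetric about the centre tap)
COEFFS = [0, 0, 2, 2, -2, -8, -4, 13, 37, 48, 37, 13, -4, -8, -2, 2, 2, 0, 0]


def fir_outputs(samples, coeffs=COEFFS, width=21):
    """Direct form FIR; returns each output as a width-bit two's complement hex string."""
    taps = [0] * len(coeffs)
    out = []
    for s in samples:
        taps = [s] + taps[:-1]                       # shift register of input samples
        acc = sum(c * t for c, t in zip(coeffs, taps))
        out.append(format(acc & ((1 << width) - 1), "06x"))
    return out


ramp = [1 + 5 * i for i in range(19)]                # 0x01, 0x06, 0x0b, ...
log = fir_outputs(ramp)
```

Under this assumption the coefficients sum to 128, so once the ramp fills the filter each output grows by 5 * 128 = 0x280, as in the last three logged values.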
Table 9 Complete filter performance
                                               Complete filter using Baugh-Wooley array
                                               multipliers and carry-look-ahead adders
Maximum path delay after place & route (ns)    32.133
Area (no. of slices out of 5120)               414
























































































































































































































































































































The design is programmed into the Virtex-II chip and tested using a logic
analyzer. The logic analyzer was intended to provide input to the filter while, at the same
time, the filter output is observed. Unfortunately, the available logic analyzer is unable to
provide input. Thus, the codes are extended to account for the input generator module







Figure 31 Signal generator module providing inputs to filter




module filter_in (clock, reset, out);
input clock, reset;
output [20:0] out;
wire [7:0] data_in;
input_gen gen (clock, reset, data_in);
filter filt (clock, reset, data_in, out);
endmodule
Figure 32 Top-level module
//This program generates input data internally to the filter.
`timescale 1ns/1ps
module input_gen (clock, reset, data_in);
input clock, reset;
output [7:0] data_in;
reg [7:0] data_in = 8'h00;
always @(posedge clock or posedge reset)
begin
    if (reset) data_in <= 8'h00;
    else
        data_in <= data_in + 3'd5;
end
endmodule
Figure 33 Verilog codes of signal generator module
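The generator is an 8-bit register that is cleared on reset and otherwise increases by 5 on every clock edge, so it wraps around modulo 256. A behavioural Python sketch (the function name and the `reset_cycles` argument are illustrative):

```python
def input_gen(cycles: int, reset_cycles: int = 0):
    """Return the register value after each clock edge; reset holds it at zero."""
    data = 0
    values = []
    for n in range(cycles):
        data = 0 if n < reset_cycles else (data + 5) & 0xFF   # 8-bit wrap-around
        values.append(data)
    return values


sequence = input_gen(52)
```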
4.5 DISCUSSION
The module that describes the radix-4 Booth's multiplier with 8-bit inputs (see
Figure 35 in Appendix A) instantiates four 'Boothpar' modules which in turn yield four
partial products. All four partial products are summed using a 16-bit CSA. 'Boothpar'
module realizes the hardware implementation of recoding logic and multiplexer. In
'CSA_16_booth' module, the 9-bit partial products are required to be shifted accordingly
based on the weights of bits in each partial product. Functional and timing simulations
for Booth's multiplier are verified and found to be identical.
Baugh-Wooley array multiplier basically consists of AND gates and full adders
as reflected by the structure in Figure 10. Functional and timing simulations for Baugh-
Wooley multiplier are also verified and found to be identical. From the performance
comparison in Table 6, both multipliers have almost similar path delay with Booth's
multiplier delay recorded at a slightly lower value. However, the area occupied by
Booth's multiplier is 78 slices as compared to 64 slices for Baugh-Wooley multiplier.
Power consumption for Baugh-Wooley multiplier is about 30mW less than Booth's
multiplier. By looking at the percentage difference, Baugh-Wooley multiplier displays a
better performance and hence, it is selected for the filter design.
For the CLA modules, there are multiple instantiations of 'CLA_nsx'
modules followed by an instantiation of a 'CLA' module. The 'CLA_nsx' module performs
addition between two 4-bit operands that are not sign extended. In contrast, the
'CLA' module adds two 4-bit operands that are sign extended, where these four bits are
the upper four bits of an operand. Sign extension is necessary for the upper four bits in
order to obtain the correct result.
Figures 42 and 43 (in Appendix A) show the HDL descriptions for modules
'CLA_nsx' and 'CLA' respectively. It can be seen that the codes are divided into four
stages since it is a 4-bit adder in the case of 'CLA_nsx'. This block of codes
is based on the formula given in Equation 3. In the case of 'CLA', there is an extra
stage owing to sign extension of the operands. Output S4 is the sign bit, which corresponds
to S[16] of the top-level module 'CLA_16'. The carry-out bit, C05, can be discarded since
the output range requires only five bits for a 4-bit adder. Higher-order adders can be
designed by cascading several 'CLA_nsx' modules with one 'CLA' module for the
upper four bits.
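The cascade described above can be modelled at bit level: three plain 4-bit look-ahead blocks, then a top block whose extra stage adds the sign-extended bits to produce S[16]. The Python sketch below uses the standard generate/propagate carry equations and illustrative function names; it is a model of the idea, not a transcription of 'CLA_nsx' or 'CLA'.

```python
def cla4(a: int, b: int, c0: int):
    """4-bit carry-look-ahead block: every carry is computed directly from g, p and c0."""
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]        # generate
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]      # propagate
    c1 = g[0] | p[0] & c0
    c2 = g[1] | p[1] & g[0] | p[1] & p[0] & c0
    c3 = g[2] | p[2] & g[1] | p[2] & p[1] & g[0] | p[2] & p[1] & p[0] & c0
    c4 = (g[3] | p[3] & g[2] | p[3] & p[2] & g[1] | p[3] & p[2] & p[1] & g[0]
          | p[3] & p[2] & p[1] & p[0] & c0)
    carries = [c0, c1, c2, c3]
    total = sum((p[i] ^ carries[i]) << i for i in range(4))
    return total, c4


def cla16_signed(a: int, b: int) -> int:
    """Add two 16-bit two's complement values and return the exact 17-bit signed sum."""
    ua, ub = a & 0xFFFF, b & 0xFFFF
    carry, result = 0, 0
    for k in range(4):
        s, carry = cla4((ua >> 4 * k) & 0xF, (ub >> 4 * k) & 0xF, carry)
        result |= s << (4 * k)
    # extra stage of the top block: sign-extended bits plus the carry give S[16]
    sign = ((ua >> 15) ^ (ub >> 15) ^ carry) & 1
    result |= sign << 16
    return result - (1 << 17) if sign else result


corner_cases = [-32768, -12345, -1, 0, 1, 4096, 32767]
```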
The overall adder formed by several CLA instantiations accepts outputs from 19
multipliers simultaneously since the multiplication and addition process is carried out in
parallel. Each multiplier output consists of 16 bits, thus there are 304 bits for all outputs
of the 19 multipliers. However, the target device has only 172 bonded IOBs. Therefore, a
method is used, which is mentioned in 'Software Simulations' section, in order to
perform simulations on the adder. Similar problem is encountered by overall adder with
several CSA instantiations and the same method is used to resolve it.
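The pin-count problem and the width of the final sum follow from simple arithmetic, sketched below in Python. The 21-bit result width is an inference from adding 19 products of 16 bits each; it agrees with the six-hexadecimal-digit filter outputs whose leading digit is 1 for negative values.

```python
import math

multipliers = 19
product_bits = 16
bonded_iobs = 172

adder_input_bits = multipliers * product_bits                    # bits entering the overall adder
sum_bits = product_bits + math.ceil(math.log2(multipliers))      # growth when summing 19 terms
exceeds_device = adder_input_bits > bonded_iobs
```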
The overall adder formed by multiple CSA instantiations (module 'adder_csa' in
Figure 59 or 61 in Appendix B) instantiates three 16-bit adders capable of adding five
operands, one 16-bit and one 19-bit adder, in which both are capable of adding four
operands. This is the best combination of different sizes of adders due to two reasons:
1. If the CSA were to add only three operands, it would function like a ripple-carry adder,
thus the advantage of using CSA cannot be displayed.
2. The more operands the CSA adds, the more bits of sign extension are
required, since adding two operands requires one sign extension. More sign
extensions increase hardware.
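A carry-save stage reduces three operands to a sum word and a carry word without propagating carries; only the final two words need a carry-propagating addition. The Python sketch below illustrates this reduction for several operands. It relies on Python's unbounded integers, so sign extension is implicit here, unlike in the fixed-width hardware; the function names are illustrative.

```python
def carry_save(a: int, b: int, c: int):
    """3:2 compression: bitwise sum and carry words with a + b + c == s + carry."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry


def csa_sum(operands):
    """Reduce the operand list with carry-save stages, then do one ordinary addition."""
    ops = list(operands)
    while len(ops) > 2:
        s, carry = carry_save(ops[0], ops[1], ops[2])
        ops = ops[3:] + [s, carry]
    return sum(ops)                      # final carry-propagate addition


five_operand_total = csa_sum([3, -7, 12, 5, -1])
```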
Functional and timing simulations for CLA and CSA are carried out for overall
adders that have one and eight input ports. From the performance comparison in
Tables 7 and 8, CLA has a significantly smaller area compared to CSA, being
64.52% and 33.88% less for the overall adder with one input port and eight input ports
respectively. The trade-offs for the decrease in area are the increases in path delay and
power consumption. CLA shows an increase of 4.08% in path delay and 10.54% in power
consumption for the adder with one input port, while for the adder with eight input ports, an
increase of 2.45% in path delay and 5.09% in power consumption can be observed. It can be
safely said that CLA offers better performance than CSA, judging by the
much larger decrease in area. Hence, it is selected for the filter design.
Since the design is an 18th order filter, there are eighteen delay units for the input
samples to pass through. The delay units are implemented using D flip-flops where, in
this design, the input data appears at the output at the positive edge of the clock that triggers
the flip-flop. The 'delay' module in Figure 23 instantiates eighteen flip-flops, which
are cascaded to form a shift register. The HDL description for the complete filter
in Figure 25 is rather straightforward. The 'always' construct defines a register that holds
an input sample temporarily for one clock cycle before it goes out to the shift register.
The functional and timing simulations for the filter are verified.
In this filter design, the memory unit and control unit are omitted because the
arithmetic operations are performed in parallel. The RAM that would store the input
samples is replaced by a single register. The ROM initially suggested for storing the
filter coefficients is not necessary because the coefficients are defined directly as
parameters in the multiplier module. A control unit is also not required, as the processing
of data and the computation of the output sample, y[n], are carried out in one clock cycle.
Omitting the memory unit and control unit simplifies the design and uses less
hardware, hence reducing cost.
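A minimal sketch of how a coefficient defined as a parameter removes the need for a ROM is given below. It uses the behavioural `*` operator rather than the Baugh-Wooley structure of the project, has three taps instead of nineteen, and uses placeholder coefficient values (5, -3, 5); all module and signal names are hypothetical.

```verilog
// Hypothetical constant-coefficient multiplier: the coefficient is a
// parameter, so no coefficient memory is needed.
module coeff_mult #(
    parameter signed [7:0] COEFF = 8'sd5   // placeholder value
) (
    input  wire signed [7:0]  x,
    output wire signed [15:0] p
);
    assign p = x * COEFF;                  // 8-bit x 8-bit -> 16-bit product
endmodule

// Three-tap illustration of the parallel arrangement: every product is
// formed and summed combinationally, so no control unit sequences the work.
module fir3_sketch (
    input  wire signed [7:0]  x0, x1, x2,  // x[n], x[n-1], x[n-2]
    output wire signed [17:0] y
);
    wire signed [15:0] p0, p1, p2;
    coeff_mult #(.COEFF( 8'sd5)) m0 (.x(x0), .p(p0));
    coeff_mult #(.COEFF(-8'sd3)) m1 (.x(x1), .p(p1));
    coeff_mult #(.COEFF( 8'sd5)) m2 (.x(x2), .p(p2));
    assign y = p0 + p1 + p2;               // 16 bits + 2 growth bits
endmodule
```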
The functionality of the filter is verified by implementing it on the FPGA. During
hardware verification, it is difficult to predict the filter output because the onboard
24 MHz oscillator is used as the clock, which starts running as soon as the board is
powered. This problem is highlighted in the preceding chapter. Hence, a manual push
button is used to test the output of the filter. Each push of the button acts as a
clock trigger and thus runs the filter for one clock cycle.




This project requires the implementation of an FIR filter in HDL, in which the
filter components can be divided into adders, multipliers, memory unit and control unit.
Input data and filter coefficients are represented as eight-bit fixed-point numbers in
two's complement. Carry-look-ahead and carry-save adders are designed and compared,
as are the radix-4 Booth's multiplier and the Baugh-Wooley array multiplier. The
carry-look-ahead adder and the Baugh-Wooley array multiplier both display better
performance than their counterparts. Hence, they are selected for the filter design.
The design is an eighteenth order filter with nineteen filter coefficients. Therefore, the
shift register has eighteen cascaded D flip-flops. The memory unit and control unit are
omitted because the arithmetic operations of the filter are carried out in parallel. The filter
employs the DF architecture, and its performance obtained via simulations is summarized.
The complete filter is synthesized and implemented on an FPGA, and its overall functionality
is validated in hardware.
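The filter relation summarized above, and the word length it implies for the final sum, can be written out as follows. The symbols h[k] (coefficients) and x[n] (input samples) are assumed notation; only y[n] appears in the text.

```latex
\[
  y[n] = \sum_{k=0}^{18} h[k]\, x[n-k]
\]
% Each 8-bit by 8-bit two's complement product occupies 16 bits.
% Summing 19 such products without overflow requires
\[
  16 + \lceil \log_2 19 \rceil = 16 + 5 = 21 \ \text{bits}.
\]
```

This agrees with the 21-bit total sum declared in the adder test-bench of the appendix.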
Improvements can be made to the current design, which include the following:
i) More adder and multiplier structures can be compared for their performance.
ii) Other factors that affect the filter performance can be incorporated into the design.
These factors include different number representation schemes such as sign
magnitude and advanced techniques such as the differential coefficient method (DCM).
iii) A combination of sequential and parallel filter implementation approaches can be
explored to determine the trade-off between consumed area and throughput.
iv) The versatility of this design enables the filter to be modified to other types besides
low-pass based on specific applications. However, one limitation is that the
verification of the design is rather cumbersome due to the lack of suitable equipment.
REFERENCES
[I] A.T. Erdoganand T. Arslan, "High Throughput FIR FilterDesign for Low Power
SOC Applications", University of Edinburgh, 2000, pp. 374-378.
[2] A.T. Erdogan and T. Arslan, "Low Power FIR Filter Implementations Based on
Coefficient Ordering Algorithm", Proceedings of the IEEE Computer Society
Annual Symposium on VLSI Emerging Trends in VLSI Systems Design, 2004.
[3] A.T. Erdogan, M. Hasanand T. Arslan, "Algorithmic LowPower FIR cores", IEE
Proc.-Circuits Devices Syst, Vol. 150, No. 3, June 2003, pp. 155-160.
[4] C.H. Wang, A.T. Erdogan, T. Arslan, "High Throughput and Low Power FIR
Filtering IP Cores", University of Edinburgh, 2004, pp. 127-130.
[5] A.T. Erdogan and T. Arslan, "LowPower Block Based FIRFiltering Cores",
University of Edinburgh, 2003, pp. 341-344.
[6] T. Arslan and A.T. Erdogan, "LowPower Implementation of High Throughput FIR
' Filters", University of Edinburgh, 2002, pp. 373-376.
[7] A.T. Erdogan, E. Zwyssig and T. Arslan, "Architectural Trade-offs in the Design of
Low PowerFIR Filtering Cores", IEE Proc.-Circuits Devices Syst., Vol. 151, No.
1, Feb. 2004, pp.10-17.
[8] Emmanuel C. Ifeachor, Barrie W. Jervis, Digital Signal Processing, A Practical
Approach, 2nd Ed., Prentice Hall, 2002.
[9] Richard S. Sandige, Digital Design Essentials, Prentice Hall, 2002.
[10] Prof. Vojin G. Oklobdzija, University of California, "Lecture 9: Multipliers", 11
May 2004, http://lapwww.epflxWcourses/comparitWLectiires/VLSI-Arithmetic-
Lect-9-Multiplier.pdf
[II] D. Mlynek, "Chapter 6 Arithmetic for Digital Systems", 11 October 1998,
http://www.vlsi.wpi.edu/webcourse/ch06/ch06.html
[12] A.T. Erdogan, T. Arslan and D.H. Horrocks, "Low PowerMultiplication Schemes
for Single Miltiplier CMOS Based FIRDigital FilterImplementations", University
of Wales Cardiff, 1997, pp. 1940-1943.
[13] David R. Smith, Paul D. Franzon, Verilog Stylesfor Synthesis ofDigitalSystems,
Prentice Hall, 2001.
47
[14] T. Arslan, Chapter 4: VLSI Design, Institute for System Level Integration/
University of Edinburgh, 2001/2002.
[15] T.R. Padmanabhan, B. Bala Tripura Sundari, Design Through Verilog HDL, Wiley
Inter-Science, 2004.
[16] Weng Fook Lee, Verilog Codingfor Logic Synthesis, Wiley Inter-Science, 2003.
[17] Stephen Brown, Zvonko Vranesic, Fundamentals ofDigital Logic with Verilog
Design, McGraw Hill, 2003.
[18] Virtex-II XC2V40/XC2V1000 Reference Board User's Guide.




1. Baugh-Wooley Array Multiplier
[Verilog listing, largely illegible in the source. Recoverable structure: a module with
8-bit input A and 16-bit output P; partial-product bits formed by assignments of the
form W[n] = A[i] & B[j]; constant wires gnd = 0 and high = 1; P[0] = W[0]; and a chain
of full_adder instantiations (fa1 to fa59) that sums the partial products into the
remaining bits of P.]
Figure 34 Baugh-Wooley multiplier with instantiations of full adders
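For reference, a behavioural sketch of an 8-bit Baugh-Wooley multiplication is given below. It uses the commonly taught modified form (complemented sign-row partial products plus two correction constants), which is not necessarily the exact arrangement of the listing in Figure 34, and it accumulates with `+` instead of instantiating full adders. The module names are hypothetical.

```verilog
// Behavioural sketch of a modified Baugh-Wooley 8x8 two's complement multiply.
module bw8_sketch (input wire [7:0] a, b, output wire [15:0] p);
    reg [15:0] acc;
    integer i, j;
    always @* begin
        acc = (16'd1 << 8) + (16'd1 << 15);              // correction constants
        for (i = 0; i < 7; i = i + 1)
            for (j = 0; j < 7; j = j + 1)                // unsigned 7x7 core
                acc = acc + ({15'd0, a[i] & b[j]} << (i + j));
        for (i = 0; i < 7; i = i + 1) begin              // complemented sign rows
            acc = acc + ({15'd0, ~(a[i] & b[7])} << (i + 7));
            acc = acc + ({15'd0, ~(a[7] & b[i])} << (i + 7));
        end
        acc = acc + ({15'd0, a[7] & b[7]} << 14);        // sign x sign term
    end
    assign p = acc;
endmodule

// Exhaustive self-check against the built-in signed multiply.
module bw8_sketch_tb;
    reg  [7:0]  a, b;
    wire [15:0] p;
    wire signed [15:0] expected = $signed(a) * $signed(b);
    integer i, errors;
    bw8_sketch dut (.a(a), .b(b), .p(p));
    initial begin
        errors = 0;
        for (i = 0; i < 65536; i = i + 1) begin
            {a, b} = i[15:0]; #1;
            if (p !== expected) errors = errors + 1;
        end
        $display("mismatches = %0d", errors);
        $finish;
    end
endmodule
```

The appeal of the scheme for an array multiplier is that every partial-product bit is added with a positive weight, so the array needs only AND gates, inverters and adders.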
2. Radix-4 Booth's Multiplier
//Radix-4 Booth multiplier with 8-bit input operands, generates
//16-bit result.





wire [8:0] B1;
assign B1 = B << 1;
Boothpar par1 (A, B1[2:0], P1);
Boothpar par2 (A, B1[4:2], P2);
Boothpar par3 (A, B1[6:4], P3);
Boothpar par4 (A, B1[8:6], P4);
CSA_16_booth csa(P1,P2,P3,P4,P);
endmodule
Figure 35 Radix-4 Booth's multiplier with 8-bit inputs
/*This program implements the recoding logic and
multiplexer for radix-4 Booth's algorithm to generate









wire [8:0]out;
/* Sign extension is needed for the case when MSB of multiplicand is 1
and decoded version of 3-bit multiplier group is 1 (M=1) and also MSB of
multiplier group is 0. */
assign Aa=A[7]; // sign extension
assign M = B[0] ^ B[1];
assign M2 = ~(M | (B[1] ~^ B[2])); // 2*multiplicand
assign out[0] = (M & A[0]) ^ B[2];
assign out[1] = ((M2 & A[0]) | (M & A[1])) ^ B[2];
assign out[2] = ((M2 & A[1]) | (M & A[2])) ^ B[2];
assign out[3] = ((M2 & A[2]) | (M & A[3])) ^ B[2];
assign out[4] = ((M2 & A[3]) | (M & A[4])) ^ B[2];
assign out[5] = ((M2 & A[4]) | (M & A[5])) ^ B[2];
assign out[6] = ((M2 & A[5]) | (M & A[6])) ^ B[2];
assign out[7] = ((M2 & A[6]) | (M & A[7])) ^ B[2];
assign out[8] = ((M2 & A[7]) | (M & Aa)) ^ B[2];
assign P = out + B[2];
assign P9 = (M & Aa) ^ B[2];
endmodule
Figure 36 Recoding logic and multiplexer to generate partial products
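The recoding scheme behind Figures 35 and 36 can be summarized in a behavioural sketch. It is not the gate-level recoder of the listing: it selects 0, ±1 or ±2 times the multiplicand with a `case` statement and accumulates the four partial products with `+` instead of a CSA. The module names are hypothetical.

```verilog
// Behavioural sketch of radix-4 Booth multiplication, 8-bit signed operands.
module booth4_sketch (
    input  wire signed [7:0]  a,   // multiplicand
    input  wire signed [7:0]  b,   // multiplier
    output reg  signed [15:0] p
);
    reg [8:0] b1;                  // multiplier with a zero appended below bit 0
    reg signed [15:0] a_ext, pp;
    integer i;

    always @* begin
        b1    = {b, 1'b0};
        a_ext = a;                 // sign-extended to 16 bits
        p     = 16'sd0;
        for (i = 0; i < 4; i = i + 1) begin
            // Overlapping 3-bit group b1[2i+2 : 2i]
            case ({b1[2*i+2], b1[2*i+1], b1[2*i]})
                3'b001, 3'b010: pp =  a_ext;          // +1 x multiplicand
                3'b011:         pp =  a_ext <<< 1;    // +2 x multiplicand
                3'b100:         pp = -(a_ext <<< 1);  // -2 x multiplicand
                3'b101, 3'b110: pp = -a_ext;          // -1 x multiplicand
                default:        pp =  16'sd0;         // 000 and 111
            endcase
            p = p + (pp <<< (2*i));                   // group i has weight 4^i
        end
    end
endmodule

// Exhaustive self-check against the built-in signed multiply.
module booth4_sketch_tb;
    reg  signed [7:0]  a, b;
    wire signed [15:0] p;
    integer i, j, errors;
    booth4_sketch dut (.a(a), .b(b), .p(p));
    initial begin
        errors = 0;
        for (i = -128; i < 128; i = i + 1)
            for (j = -128; j < 128; j = j + 1) begin
                a = i; b = j; #1;
                if (p !== a * b) errors = errors + 1;
            end
        $display("mismatches = %0d", errors);
        $finish;
    end
endmodule
```

The four overlapping groups correspond to the slices B1[2:0], B1[4:2], B1[6:4] and B1[8:6] passed to the Boothpar instances in Figure 35.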
/*This program adds four 16-bit operands, creating a 16-bit CSA.
The 9-bit input operands are internally sign extended to 16 bits.
Input operands are shifted left accordingly before addition to
... */
[The remainder of this listing is illegible in the source. The surviving fragments show
three groups of full-adder instantiations: a first group adding the shifted bits of
operands A, B and C (with gnd in the unused positions), a second group adding operand D
to the first-stage sums and carries, and a final group that chains the carries to form
the result.]
Figure 37 CSA for Booth's multiplier to sum all partial products







A = 8'h00; B = 8'h00;
#100 A = 8'h01; B = 8'h10;
#50 A = 8'h11; B = 8'h1a;
#50 A = 8'h21; B = 8'h2b;
#50 A = 8'h31; B = 8'h32;
#50 A = 8'h83; B = 8'h30;
#50 A = 8'ha1; B = 8'h1a;
#50 A = 8'hdc; B = 8'h9b;
#50 A = 8'hfb; B = 8'hc2;
end
initial
$monitor($realtime," A=%b, B=%b, product=%h", A,B,P);
endmodule
Figure 38 Test-bench for radix-4 Booth's multiplier
3. Carry-Save Adder (CSA)
/*This program adds four 16-bit operands, creating a 16-bit CSA.
Each operand is sign-extended to generate the 16th and 17th bit.





















output [17:0]S; // S18, S19 not needed as output, hence declared as wire
wire A16,A17,B16,B17,C16,C17,D16,D17,S18,S19;
wire sum0,sum1,sum2,sum3,sum4,sum5,sum6,sum7,sum8,sum9,sum10;
wire sum11,sum12,sum13,sum14,sum15,sum16,sum17,sum18,sum19,sum20;
wire sum21,sum22,sum23,sum24,sum25,sum26,sum27,sum28,sum29,sum30;







assign A16=A[15],B16=B[15],C16=C[15],D16=D[15];
assign A17=A[15],B17=B[15],C17=C[15],D17=D[15];
full_adder fa1(C[0],A[0],B[0],sum0,cout0);
full_adder fa2(C[1],A[1],B[1],sum1,cout1);
full_adder fa3(C[2],A[2],B[2],sum2,cout2);
full_adder fa4(C[3],A[3],B[3],sum3,cout3);
full_adder fa5(C[4],A[4],B[4],sum4,cout4);
full_adder fa6(C[5],A[5],B[5],sum5,cout5);
full_adder fa7(C[6],A[6],B[6],sum6,cout6);
full_adder fa8(C[7],A[7],B[7],sum7,cout7);
full_adder fa9(C[8],A[8],B[8],sum8,cout8);
full_adder fa10(C[9],A[9],B[9],sum9,cout9);
full_adder fa11(C[10],A[10],B[10],sum10,cout10);
full_adder fa12(C[11],A[11],B[11],sum11,cout11);
full_adder fa13(C[12],A[12],B[12],sum12,cout12);
full_adder fa14(C[13],A[13],B[13],sum13,cout13);
full_adder fa15(C[14],A[14],B[14],sum14,cout14);
full_adder fa16(C[15],A[15],B[15],sum15,cout15);
full_adder fa17(C16,A16,B16,sum16,cout16);
full_adder fa18(C17,A17,B17,sum17,cout17);
full_adder fa19(gnd,sum0,D[0],S[0],cout18);
full_adder fa20(cout0,sum1,D[1],sum18,cout19);
full_adder fa21(cout1,sum2,D[2],sum19,cout20);
full_adder fa22(cout2,sum3,D[3],sum20,cout21);
full_adder fa23(cout3,sum4,D[4],sum21,cout22);
full_adder fa24(cout4,sum5,D[5],sum22,cout23);
full_adder fa25(cout5,sum6,D[6],sum23,cout24);
full_adder fa26(cout6,sum7,D[7],sum24,cout25);
full_adder fa27(cout7,sum8,D[8],sum25,cout26);
full_adder fa28(cout8,sum9,D[9],sum26,cout27);
full_adder fa29(cout9,sum10,D[10],sum27,cout28);
full_adder fa30(cout10,sum11,D[11],sum28,cout29);
full_adder fa31(cout11,sum12,D[12],sum29,cout30);
full_adder fa32(cout12,sum13,D[13],sum30,cout31);
full_adder fa33(cout13,sum14,D[14],sum31,cout32);
full_adder fa34(cout14,sum15,D[15],sum32,cout33);
full_adder fa35(cout15,sum16,D16,sum33,cout34);
full_adder fa36(cout16,sum17,D17,sum34,cout35);
full_adder fa37(gnd,sum18,cout18,S[1],cout36);
full_adder fa38(cout36,sum19,cout19,S[2],cout37);
full_adder fa39(cout37,sum20,cout20,S[3],cout38);
full_adder fa40(cout38,sum21,cout21,S[4],cout39);
full_adder fa41(cout39,sum22,cout22,S[5],cout40);
full_adder fa42(cout40,sum23,cout23,S[6],cout41);
full_adder fa43(cout41,sum24,cout24,S[7],cout42);
full_adder fa44(cout42,sum25,cout25,S[8],cout43);
full_adder fa45(cout43,sum26,cout26,S[9],cout44);
full_adder fa46(cout44,sum27,cout27,S[10],cout45);
full_adder fa47(cout45,sum28,cout28,S[11],cout46);
full_adder fa48(cout46,sum29,cout29,S[12],cout47);
full_adder fa49(cout47,sum30,cout30,S[13],cout48);
full_adder fa50(cout48,sum31,cout31,S[14],cout49);
full_adder fa51(cout49,sum32,cout32,S[15],cout50);
full_adder fa52(cout50,sum33,cout33,S[16],cout51);
// ... (remaining instantiations not legible in the source)
Figure 39 16-bit CSA adding four operands
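The building block of these carry-save listings can be sketched compactly. The full_adder module itself is not shown in the appendix, so the definition below is an assumption that only follows the port order seen in the instantiations (three input bits, then sum, then carry-out); csa_stage and its parameter W are hypothetical names.

```verilog
// Assumed full adder: three input bits, then sum, then carry-out.
module full_adder (input wire x, y, z, output wire sum, cout);
    assign sum  = x ^ y ^ z;
    assign cout = (x & y) | (y & z) | (x & z);
endmodule

// One 3:2 carry-save stage: three W-bit operands are reduced to a sum
// vector and a carry vector, with no carry propagation between bits.
module csa_stage #(parameter W = 18) (
    input  wire [W-1:0] a, b, c,
    output wire [W-1:0] s,    // bitwise sums
    output wire [W-1:0] cy    // carries, one bit position heavier than s
);
    genvar i;
    generate
        for (i = 0; i < W; i = i + 1) begin : g_fa
            full_adder fa (a[i], b[i], c[i], s[i], cy[i]);
        end
    endgenerate
endmodule
```

Numerically, a + b + c equals s + (cy << 1). Each additional operand adds one more such stage, and only the final stage needs the carries to ripple from bit to bit.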
/*This program adds five 16-bit operands, creating a 16-bit CSA.
Each operand is sign-extended to generate the 16th, 17th and ... bit.
(One bit sign extension for addition of two operands)
*/
module CSA_16_5(A,B,C,D,E,S);
[The remainder of this listing is illegible in the source. The surviving fragments show
successive rows of full-adder instantiations (up to fa76) that reduce the five
sign-extended operands to the output S.]
Figure 40 16-bit CSA adding five operands
/*This program adds four 19-bit operands, creating a 19-bit CSA.
Each operand is sign-extended to generate the 19th and 20th bit.















wire sum0,sum1,sum2,sum3,sum4,sum5,sum6,sum7,sum8,sum9,sum10;
wire sum11,sum12,sum13,sum14,sum15,sum16,sum17,sum18,sum19,sum20;
wire sum21,sum22,sum23,sum24,sum25,sum26,sum27,sum28,sum29,sum30;
wire sum31,sum32,sum33,sum34,sum35,sum36,sum37,sum38,sum39,sum40;
wire cout0,cout1,cout2,cout3,cout4,cout5,cout6,cout7,cout8,cout9,cout10;
wire cout11,cout12,cout13,cout14,cout15,cout16,cout17,cout18,cout19,cout20;
wire cout21,cout22,cout23,cout24,cout25,cout26,cout27,cout28,cout29,cout30;
wire cout31,cout32,cout33,cout34,cout35,cout36,cout37,cout38,cout39,cout40;












assign A19=A[18],B19=B[18],C19=C[18],D19=D[18];
assign A20=A[18],B20=B[18],C20=C[18],D20=D[18];
full_adder fa1(C[0],A[0],B[0],sum0,cout0);
full_adder fa2(C[1],A[1],B[1],sum1,cout1);
full_adder fa3(C[2],A[2],B[2],sum2,cout2);
full_adder fa4(C[3],A[3],B[3],sum3,cout3);
full_adder fa5(C[4],A[4],B[4],sum4,cout4);
full_adder fa6(C[5],A[5],B[5],sum5,cout5);
full_adder fa7(C[6],A[6],B[6],sum6,cout6);
full_adder fa8(C[7],A[7],B[7],sum7,cout7);
full_adder fa9(C[8],A[8],B[8],sum8,cout8);
full_adder fa10(C[9],A[9],B[9],sum9,cout9);
full_adder fa11(C[10],A[10],B[10],sum10,cout10);
full_adder fa12(C[11],A[11],B[11],sum11,cout11);
full_adder fa13(C[12],A[12],B[12],sum12,cout12);
full_adder fa14(C[13],A[13],B[13],sum13,cout13);
full_adder fa15(C[14],A[14],B[14],sum14,cout14);
full_adder fa16(C[15],A[15],B[15],sum15,cout15);
full_adder fa17(C[16],A[16],B[16],sum16,cout16);
full_adder fa18(C[17],A[17],B[17],sum17,cout17);
full_adder fa19(C[18],A[18],B[18],sum18,cout18);
full_adder fa20(C19,A19,B19,sum19,cout19);









full_adder fa22(gnd,sum0,D[0],S[0],cout21);
full_adder fa23(cout0,sum1,D[1],sum21,cout22);
full_adder fa24(cout1,sum2,D[2],sum22,cout23);
full_adder fa25(cout2,sum3,D[3],sum23,cout24);
full_adder fa26(cout3,sum4,D[4],sum24,cout25);
full_adder fa27(cout4,sum5,D[5],sum25,cout26);
full_adder fa28(cout5,sum6,D[6],sum26,cout27);
full_adder fa29(cout6,sum7,D[7],sum27,cout28);
full_adder fa30(cout7,sum8,D[8],sum28,cout29);
full_adder fa31(cout8,sum9,D[9],sum29,cout30);
full_adder fa32(cout9,sum10,D[10],sum30,cout31);
full_adder fa33(cout10,sum11,D[11],sum31,cout32);
full_adder fa34(cout11,sum12,D[12],sum32,cout33);
full_adder fa35(cout12,sum13,D[13],sum33,cout34);
full_adder fa36(cout13,sum14,D[14],sum34,cout35);
full_adder fa37(cout14,sum15,D[15],sum35,cout36);
full_adder fa38(cout15,sum16,D[16],sum36,cout37);
full_adder fa39(cout16,sum17,D[17],sum37,cout38);
full_adder fa40(cout17,sum18,D[18],sum38,cout39);
full_adder fa41(cout18,sum19,D19,sum39,cout40);
full_adder fa42(cout19,sum20,D20,sum40,cout41);
full_adder fa43(gnd,sum21,cout21,S[1],cout42);
full_adder fa44(cout42,sum22,cout22,S[2],cout43);
full_adder fa45(cout43,sum23,cout23,S[3],cout44);
full_adder fa46(cout44,sum24,cout24,S[4],cout45);
full_adder fa47(cout45,sum25,cout25,S[5],cout46);
full_adder fa48(cout46,sum26,cout26,S[6],cout47);
full_adder fa49(cout47,sum27,cout27,S[7],cout48);
full_adder fa50(cout48,sum28,cout28,S[8],cout49);
full_adder fa51(cout49,sum29,cout29,S[9],cout50);
full_adder fa52(cout50,sum30,cout30,S[10],cout51);
full_adder fa53(cout51,sum31,cout31,S[11],cout52);
full_adder fa54(cout52,sum32,cout32,S[12],cout53);
full_adder fa55(cout53,sum33,cout33,S[13],cout54);
full_adder fa56(cout54,sum34,cout34,S[14],cout55);
full_adder fa57(cout55,sum35,cout35,S[15],cout56);
full_adder fa58(cout56,sum36,cout36,S[16],cout57);
full_adder fa59(cout57,sum37,cout37,S[17],cout58);
full_adder fa60(cout58,sum38,cout38,S[18],cout59);
full_adder fa61(cout59,sum39,cout39,S[19],cout60);
full_adder fa62(cout60,sum40,cout40,S[20],cout61);
full_adder fa63(cout61,cout41,cout20,S21,S22);
endmodule
Figure 41 19-bit CSA adding four operands
4. Carry-Look-Ahead Adder (CLA)
/*This program adds two 4-bit operands, creating a 4-bit adder.
No sign extension to the operands, hence only suitable for
addition of unsigned numbers.
*/
module CLA_nsx(A,B,CI0,S,CO4);
input [3:0]A; // Input-four bits
input [3:0]B;
input CI0;
output [3:0]S;




wire c1,c2,c3,c4,c5,c6,c7,c8,c9,c10;
wire ss1,ss2,ss3,ss4;
assign ss1 = A[0] ^ B[0];
assign S[0] = CI0 ^ ss1;
assign G0 = A[0] & B[0];
assign P0 = A[0] | B[0];
assign c1 = P0 & CI0;
assign CO1 = G0 | c1;
assign ss2 = A[1] ^ B[1];
assign S[1] = CO1 ^ ss2;
assign G1 = A[1] & B[1];
assign P1 = A[1] | B[1];
assign c2 = G0 & P1;
assign c3 = P0 & P1 & CI0;
assign CO2 = G1 | c2 | c3;
assign ss3 = A[2] ^ B[2];
assign S[2] = CO2 ^ ss3;
assign G2 = A[2] & B[2];
assign P2 = A[2] | B[2];
assign c4 = G1 & P2;
assign c5 = G0 & P1 & P2;
assign c6 = P0 & P1 & P2 & CI0;
assign CO3 = G2 | c4 | c5 | c6;
assign ss4 = A[3] ^ B[3];
assign S[3] = CO3 ^ ss4;
assign G3 = A[3] & B[3];
assign P3 = A[3] | B[3];
assign c7 = G2 & P3;
assign c8 = G1 & P2 & P3;
assign c9 = G0 & P1 & P2 & P3;
assign c10 = P0 & P1 & P2 & P3 & CI0;
assign CO4 = G3 | c7 | c8 | c9 | c10;
endmodule
Figure 42 4-bit CLA without sign extension
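The carry equations of the 4-bit CLA can also be written in vector form and checked exhaustively. The sketch below is an illustration under assumed names (cla4_sketch, g, p, c), not a replacement for the listing; it uses the OR form of the propagate signal.

```verilog
// Vector-form sketch of a 4-bit carry-look-ahead adder.
module cla4_sketch (
    input  wire [3:0] a, b,
    input  wire       cin,
    output wire [3:0] s,
    output wire       cout
);
    wire [3:0] g = a & b;    // generate
    wire [3:0] p = a | b;    // propagate (OR form)
    wire [4:0] c;
    // Every carry is a two-level function of g, p and cin only.
    assign c[0] = cin;
    assign c[1] = g[0] | (p[0] & c[0]);
    assign c[2] = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c[0]);
    assign c[3] = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
                | (p[2] & p[1] & p[0] & c[0]);
    assign c[4] = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
                | (p[3] & p[2] & p[1] & g[0])
                | (p[3] & p[2] & p[1] & p[0] & c[0]);
    assign s    = a ^ b ^ c[3:0];
    assign cout = c[4];
endmodule

// Exhaustive self-check over all 512 input combinations.
module cla4_sketch_tb;
    reg  [3:0] a, b;
    reg        cin;
    wire [3:0] s;
    wire       cout;
    integer i, errors;
    cla4_sketch dut (.a(a), .b(b), .cin(cin), .s(s), .cout(cout));
    initial begin
        errors = 0;
        for (i = 0; i < 512; i = i + 1) begin
            {a, b, cin} = i[8:0]; #1;
            if ({cout, s} !== a + b + cin) errors = errors + 1;
        end
        $display("mismatches = %0d", errors);
        $finish;
    end
endmodule
```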
/*This program adds two 4-bit operands, creating a 4-bit adder.
The two operands are sign-extended to create 5-bit operands,
generating CO5, which is insignificant to the result.
*/
module CLA(A,B,CI0,S,S4);




output S4; // Carry-out bit
wire A4,B4;
wire CO1,CO2,CO3,CO4,CO5; // CO5 is for overflow due to sign extension




assign A4=A[3], B4=B[3]; // fifth bit - sign extension
assign ss1 = A[0] ^ B[0];
assign S[0] = CI0 ^ ss1;
assign G0 = A[0] & B[0];
assign P0 = A[0] | B[0];
assign c1 = P0 & CI0;
assign CO1 = G0 | c1;
assign ss2 = A[1] ^ B[1];
assign S[1] = CO1 ^ ss2;
assign G1 = A[1] & B[1];
assign P1 = A[1] | B[1];
assign c2 = G0 & P1;
assign c3 = P0 & P1 & CI0;
assign CO2 = G1 | c2 | c3;
assign ss3 = A[2] ^ B[2];
assign S[2] = CO2 ^ ss3;
assign G2 = A[2] & B[2];
assign P2 = A[2] | B[2];
assign c4 = G1 & P2;
assign c5 = G0 & P1 & P2;
assign c6 = P0 & P1 & P2 & CI0;
assign CO3 = G2 | c4 | c5 | c6;
assign ss4 = A[3] ^ B[3];
assign S[3] = CO3 ^ ss4;
assign G3 = A[3] & B[3];
assign P3 = A[3] | B[3];
assign c7 = G2 & P3;
assign c8 = G1 & P2 & P3;
assign c9 = G0 & P1 & P2 & P3;
assign c10 = P0 & P1 & P2 & P3 & CI0;
assign CO4 = G3 | c7 | c8 | c9 | c10;
assign S4 = CO4 ^ ss4;
assign G4 = A4 & B4;
assign P4 = A4 | B4;
assign c11 = G3 & P4;
assign c12 = G2 & P3 & P4;
assign c13 = G1 & P2 & P3 & P4;
assign c14 = G0 & P1 & P2 & P3 & P4;
assign c15 = P0 & P1 & P2 & P3 & P4 & CI0;
assign CO5 = G4 | c11 | c12 | c13 | c14 | c15;
endmodule








wire CI0 = 0;
CLA_nsx clan1(A[3:0],B[3:0],CI0,S[3:0],CO1);
CLA_nsx clan2(A[7:4],B[7:4],CO1,S[7:4],CO2);
CLA_nsx clan3(A[11:8],B[11:8],CO2,S[11:8],CO3);
CLA_nsx clan4(A[15:12],B[15:12],CO3,S[15:12],CO4);
assign A18=A[17],A19=A[17];
assign B18=B[17],B19=B[17];
CLA cla1({A19,A18,A[17:16]},{B19,B18,B[17:16]},CO4,{S19,S[18:16]},S20);
endmodule
Figure 44 18-bit CLA
// 19-bit CLA







































CLA cla1({A19,A[18:16]},{B19,B[18:16]},CO4,S[19:16],S20);
endmodule







wire CI0 = 0;
CLA_nsx clan1(A[3:0],B[3:0],CI0,S[3:0],CO1);
CLA_nsx clan2(A[7:4],B[7:4],CO1,S[7:4],CO2);
CLA_nsx clan3(A[11:8],B[11:8],CO2,S[11:8],CO3);
CLA_nsx clan4(A[15:12],B[15:12],CO3,S[15:12],CO4);
CLA cla1(A[19:16],B[19:16],CO4,S[19:16],S[20]);
endmodule
Figure 46 20-bit CLA
APPENDIX B
1. Radix-4 Booth's Multiplier
Finished circuit initialization process.
0 A=00000000, B=00000000, product=0000
100 A=00000001, B=00010000, product=0010
150 A=00010001, B=00011010, product=01ba
200 A=00100001, B=00101011, product=058b
250 A=00110001, B=00110010, product=0992
300 A=10000011, B=00110000, product=e890
350 A=10100001, B=00011010, product=f65a
400 A=11011100, B=10011011, product=0e34
450 A=11111011, B=11000010, product=0136












































































































Figure 48 Results of timing simulation for the test-bench of Booth's multiplier
2. Baugh-Wooley Array Multiplier
Finished circuit initialization process.
0 A=00, B=00, product=0000
100 A=01, B=10, product=0010
150 A=11, B=1a, product=01ba
200 A=21, B=2b, product=058b
250 A=31, B=32, product=0992
300 A=82, B=10, product=f820
350 A=af, B=7a, product=d966
400 A=c5, B=bb, product=0fe7
450 A=ff, B=ff, product=0001
Figure 49 Results of functional simulation for the test-bench of Baugh-Wooley multiplier
0 A=00, B=00, product=xxxx
16.910 A=00, B=00, product=0000
100 A=01, B=10, product=0000
164.274 A=11, B=1a, product=01ba
200 A=21, B=2b, product=01ba
214.141 A=21, B=2b, product=058b
250 A=31, B=32, product=058b
265.071 A=31, B=32, product=0992
300 A=82, B=10, product=0992
319.016 A=82, B=10, product=f820
350 A=af, B=7a, product=f820
366.985 A=af, B=7a, product=d966
400 A=c5, B=bb, product=d966
419.848 A=c5, B=bb, product=0fe7
450 A=ff, B=ff, product=0fe7
472.321 A=ff, B=ff, product=0001
Figure 50 Results of timing simulation for the test-bench of Baugh-Wooley multiplier
3. Carry-Look-Ahead Adder (CLA)
/*This program adds the results from the 19 multiplications between
data and filter coefficients using CLA.*/
`timescale 1ns/1ps
[The remainder of this listing is illegible in the source. The surviving fragments show
an adder tree built from CLA_17, CLA_19 and CLA_20 instantiations, with sign-extension
assignments before the final CLA_20 stage.]
Figure 51 Overall adder formed by CLA instantiations with only one input port
`timescale 1ns/1ps
module adder_cla_tst;
reg [15:0]M;
wire [20:0] csum;
adder_cla adder(M, csum);
initial
begin
M = 16'h0c1d;
#50 M = 16'h1111;
#50 M = 16'h02aa;
#50 M = 16'hd051;




$monitor($realtime," M=%h, totalsum=%h", M,csum);
Figure 52 Test-bench for the overall adder with CLA instantiations and one input port
`timescale 1ns/1ps
// ... (module header, declarations and constant product assignments not legible in the source)
wire [18:0]Rff,Rgg;
wire [19:0]Rhh;
wire m16,r18,r19;
// ...
assign m16 = M19[15];
CLA_17 cla17a(Ra,Rb,Raa);
CLA_17 cla17b(Rc,Rd,Rbb);
CLA_17 cla17c(Re,Rf,Rcc);
CLA_17 cla17d(Rg,Rh,Rdd);
CLA_17 cla17e(Ri,{m16,M19},Ree);
CLA_18 cla18a(Raa,Rbb,Rff);
CLA_18 cla18b(Rcc,Rdd,Rgg);
CLA_19 cla19a(Rff,Rgg,Rhh);
assign r19 = Ree[17], r18 = Ree[17];
CLA_20 cla20a(Rhh,{r19,r18,Ree},csum);
endmodule
Figure 53 Overall adder formed by CLA instantiations with eight input ports
`timescale 1ns/1ps
module adder_cla_tst;
reg [15:0]M1,M2,M6,M7,M11,M12,M16,M19;
wire [20:0]tsum;
adder_cla adder(M1,M2,M6,M7,M11,M12,M16,M19,tsum);
initial
begin
M1=16'h0000;M2=16'h0000;M6=16'h0000;M7=16'h0000;M11=16'h0000;
M12=16'h0000;M16=16'h0000;M19=16'h0000;




#50 M1=16'h1200;M2=16'h02f0;M6=16'h1530;M7=16'h0afd;M11=16'h1009;
M12=16'hc510;M16=16'h6009;M19=16'haf12;
end
//the arguments within $monitor system task should all be in one line
initial $monitor("#%0d, M1=%h, M2=%h, M6=%h, M7=%h, M11=%h, M12=%h, M16=%h, M19=%h, totalsum=%h",
$time,M1,M2,M6,M7,M11,M12,M16,M19,tsum);
endmodule
Figure 54 Test-bench for the overall adder with CLA instantiations and eight input ports
Simulator is doing circuit initialization process.






Figure 55 Results of functional simulation for CLA with one input port
0 M=0c1d, totalsum=xxxxxx
16 M=0c1d, totalsum=00e627
50 M=1111, totalsum=00e627
65 M=1111, totalsum=014443
100 M=02aa, totalsum=014443
118 M=02aa, totalsum=00329e
150 M=d051, totalsum=00329e
166 M=d051, totalsum=1c7603
200 M=0023, totalsum=1c7603
214 M=0023, totalsum=000299
Figure 56 Results of timing simulation for CLA with one input port
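The totals reported for the one-input adder can be cross-checked arithmetically: with all 19 products tied to the same value M, the expected output is 19 times M, treated as a signed 16-bit number and held in 21 bits. The following is a minimal behavioural sketch for that check, not the thesis's adder_cla; the module names adder_ref_one_input and adder_ref_one_input_tst and the task check are hypothetical. The expected values are the settled totals read from Figure 56.

```verilog
`timescale 1ns/1ps
// Hypothetical behavioural reference, not the thesis's adder_cla:
// adding 19 copies of the signed 16-bit input M is the same as 19*M.
module adder_ref_one_input(M, tsum);
  input  signed [15:0] M;
  output signed [20:0] tsum;
  // Both operands are signed, so M is sign-extended to 21 bits first.
  assign tsum = M * 21'sd19;
endmodule

module adder_ref_one_input_tst;
  reg  [15:0] M;
  wire [20:0] tsum;
  integer     errors;

  adder_ref_one_input ref_model(M, tsum);

  // Apply one input value and compare with the total read from Figure 56.
  task check;
    input [15:0] value;
    input [20:0] expected;
    begin
      M = value;
      #1;
      if (tsum !== expected) begin
        errors = errors + 1;
        $display("MISMATCH M=%h, model=%h, figure=%h", value, tsum, expected);
      end
    end
  endtask

  initial begin
    errors = 0;
    check(16'h0c1d, 21'h00e627);
    check(16'h1111, 21'h014443);
    check(16'h02aa, 21'h00329e);
    check(16'hd051, 21'h1c7603);   // negative input: -12207 * 19
    check(16'h0023, 21'h000299);
    $display("mismatches: %0d", errors);
    $finish;
  end
endmodule
```

The fourth vector is the informative one: d051 only yields 1c7603 if M is interpreted as a two's-complement value, which suggests the adder tree sign-extends its operands.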
#0, M1=0000, M2=0000, M6=0000, M7=0000, M11=0000, M12=0000, M16=0000, M19=0000, totalsum=0061b4
#50, M1=1000, M2=0200, M6=0330, M7=0efd, M11=1579, M12=00f0, M16=6709, M19=afff, totalsum=00b352
#100, M1=1200, M2=02f0, M6=1530, M7=0afd, M11=1009, M12=c510, M16=6009, M19=af12, totalsum=007b05
Figure 57 Results of functional simulation for CLA with eight input ports
#0, M1=0000, M2=0000, M6=0000, M7=0000, M11=0000, M12=0000, M16=0000, M19=0000, totalsum=xxxxxx
#26, M1=0000, M2=0000, M6=0000, M7=0000, M11=0000, M12=0000, M16=0000, M19=0000, totalsum=0061b4
#50, M1=1000, M2=0200, M6=0330, M7=0efd, M11=1579, M12=00f0, M16=6709, M19=afff, totalsum=0061b4
#70, M1=1000, M2=0200, M6=0330, M7=0efd, M11=1579, M12=00f0, M16=6709, M19=afff, totalsum=00b352
#100, M1=1200, M2=02f0, M6=1530, M7=0afd, M11=1009, M12=c510, M16=6009, M19=af12, totalsum=00b352
#122, M1=1200, M2=02f0, M6=1530, M7=0afd, M11=1009, M12=c510, M16=6009, M19=af12, totalsum=007b05
Figure 58 Results of timing simulation for CLA with eight input ports
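The eight-input results can be checked in the same way. When all eight inputs are zero the adder reports 0061b4, so that value is the combined contribution of the eleven products held constant inside the module. The sketch below assumes only this: it adds the eight signed inputs to that observed offset. The names adder_ref_eight_inputs, FIXED_SUM and apply are hypothetical and do not appear in the thesis.

```verilog
`timescale 1ns/1ps
// Hypothetical behavioural reference for the eight-input adder.
// FIXED_SUM is the total observed in the results when every input is zero,
// i.e. the sum of the products that are tied to constants in the module.
module adder_ref_eight_inputs(M1, M2, M6, M7, M11, M12, M16, M19, tsum);
  input  signed [15:0] M1, M2, M6, M7, M11, M12, M16, M19;
  output signed [20:0] tsum;
  localparam signed [20:0] FIXED_SUM = 21'sh0061b4;
  // All operands are signed, so each input is sign-extended to 21 bits.
  assign tsum = M1 + M2 + M6 + M7 + M11 + M12 + M16 + M19 + FIXED_SUM;
endmodule

module adder_ref_eight_inputs_tst;
  reg  [15:0] M1, M2, M6, M7, M11, M12, M16, M19;
  wire [20:0] tsum;
  integer     errors;

  adder_ref_eight_inputs ref_model(M1, M2, M6, M7, M11, M12, M16, M19, tsum);

  // Drive one row of inputs and compare with the reported total.
  task apply;
    input [15:0] v1, v2, v6, v7, v11, v12, v16, v19;
    input [20:0] expected;
    begin
      M1 = v1;   M2 = v2;   M6 = v6;   M7 = v7;
      M11 = v11; M12 = v12; M16 = v16; M19 = v19;
      #1;
      if (tsum !== expected) begin
        errors = errors + 1;
        $display("MISMATCH model=%h, figure=%h", tsum, expected);
      end
    end
  endtask

  initial begin
    errors = 0;
    apply(16'h0000, 16'h0000, 16'h0000, 16'h0000,
          16'h0000, 16'h0000, 16'h0000, 16'h0000, 21'h0061b4);
    apply(16'h1000, 16'h0200, 16'h0330, 16'h0efd,
          16'h1579, 16'h00f0, 16'h6709, 16'hafff, 21'h00b352);
    apply(16'h1200, 16'h02f0, 16'h1530, 16'h0afd,
          16'h1009, 16'hc510, 16'h6009, 16'haf12, 21'h007b05);
    $display("mismatches: %0d", errors);
    $finish;
  end
endmodule
```

Using the observed offset rather than the individual constants keeps the check independent of how the fixed products are assigned inside the adder.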
4. Carry-Save Adder (CSA)
/*This program adds all 19 results from multiplications between
inputs and filter coefficients using CSA.
*/
module adder_csa(M,tsum);
input [15:0]M;
output [20:0]tsum;
wire [15:0]M1,M2,M3,M4,M5,M6,M7,M8,M9,M10;
wire [15:0]M11,M12,M13,M14,M15,M16,M17,M18,M19;
wire [18:0]Ra,Rb,Rc;
wire [17:0]Rd;
wire Rd18;
assign M1=M,M2=M,M3=M,M4=M,M5=M,M6=M,M7=M,M8=M,M9=M,M10=M;
assign M11=M,M12=M,M13=M,M14=M,M15=M,M16=M,M17=M,M18=M,M19=M;
CSA_16_5 csa16_5a(M1,M2,M3,M4,M5,Ra);
CSA_16_5 csa16_5b(M6,M7,M8,M9,M10,Rb);













M = 16'h0c1d;
#50 M = 16'h1111;
#50 M = 16'h02aa;
#50 M = 16'hd051;




M=%h, totalsum=%h", M, tsum)
Figure 60 Test-bench for the overall adder with CSA instantiations and one input port
`timescale 1ns/1ps
module adder_csa(M1,M2,M6,M7,M11,M12,M16,M19,tsum);




wire [17:0]Rd;
wire Rd18;
assign M3 = 16'h0100;
assign M4 = 16'h1000;
assign M5 = 16'h0a00;
assign M8 = 16'h2000;
assign M9 = 16'h0700;
assign M10 = 16'h00e0;
assign M13 = 16'h0012;
assign M14 = 16'h0290;
assign M15 = 16'h0230;
assign M17 = 16'h1a00;
assign M18 = 16'h0002;
CSA_16_5 csa16_5a(M1,M2,M3,M4,M5,Ra);
CSA_16_5 csa16_5b(M6,M7,M8,M9,M10,Rb);
CSA_16_5 csa16_5c(M11,M12,M13,M14,M15,Rc);
CSA_16 csa16a(M16,M17,M18,M19,Rd);
assign Rd18 = Rd[17];
CSA_19 csa19a(Ra,Rb,Rc,{Rd18,Rd},tsum);
endmodule
Figure 61 Overall adder formed by CSA instantiations with eight input ports
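The idea that the CSA blocks rely on can be illustrated with a single carry-save stage. The sketch below is an assumption-laden illustration rather than the thesis's CSA_16_5, CSA_16 or CSA_19: the module name csa_3to2_sketch, its ports and the parameter W are hypothetical, and the operands are treated as unsigned for simplicity (signed operands would need sign extension first). Three operands are reduced to a sum vector and a carry vector without any carry propagation, and only the final addition of those two vectors propagates carries.

```verilog
`timescale 1ns/1ps
// Hypothetical single carry-save (3:2) stage; not the thesis's CSA modules.
module csa_3to2_sketch(a, b, c, sum_vec, carry_vec);
  parameter W = 16;
  input  [W-1:0] a, b, c;
  output [W-1:0] sum_vec;     // bitwise sum, carries ignored
  output [W:0]   carry_vec;   // bitwise carries, shifted left by one

  // Each bit position is an independent full adder: no carry chain.
  assign sum_vec   = a ^ b ^ c;
  assign carry_vec = {(a & b) | (a & c) | (b & c), 1'b0};
endmodule

module csa_3to2_sketch_tst;
  reg  [15:0] a, b, c;
  wire [15:0] s;
  wire [16:0] cy;
  wire [17:0] total    = s + cy;      // the only carry-propagating addition
  wire [17:0] expected = a + b + c;   // behavioural reference
  integer     errors;

  csa_3to2_sketch #(16) dut(a, b, c, s, cy);

  // Apply three operands and compare the carry-save result with a + b + c.
  task check;
    input [15:0] x, y, z;
    begin
      a = x; b = y; c = z;
      #1;
      if (total !== expected) begin
        errors = errors + 1;
        $display("MISMATCH a=%h b=%h c=%h total=%h", x, y, z, total);
      end
    end
  endtask

  initial begin
    errors = 0;
    check(16'h0000, 16'h0000, 16'h0000);
    check(16'hffff, 16'hffff, 16'h0001);
    check(16'h1234, 16'h0f0f, 16'h8001);
    check(16'hffff, 16'hffff, 16'hffff);
    $display("mismatches: %0d", errors);
    $finish;
  end
endmodule
```

Chaining such stages reduces many operands to two vectors before a single conventional adder is needed, which is why a carry-save tree is a natural fit for summing the 19 products.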
`timescale 1ns/1ps
module adder_csa_tst;
reg [15:0]M1,M2,M6,M7,M11,M12,M16,M19;
wire [20:0]tsum;





#50 M1=16'h1000;M2=16'h0200;M6=16'h0330;M7=16'h0efd;M11=16'h1579;
M12=16'h00f0;M16=16'h6709;M19=16'hafff;
#50 M1=16'h1200;M2=16'h02f0;M6=16'h1530;M7=16'h0afd;M11=16'h1009;
M12=16'hc510;M16=16'h6009;M19=16'haf12;
//the arguments within $monitor system task should all be in one line
initial $monitor("#%0d, M1=%h, M2=%h, M6=%h, M7=%h, M11=%h, M12=%h, M16=%h,
endmodule
Figure 62 Test-bench for the overall adder with CSA instantiations and eight input ports
Finished circuit initialization process.
0 M=0c1d, totalsum=00e627
50 M=1111, totalsum=014443
100 M=02aa, totalsum=00329e
150 M=d051, totalsum=1c7603
200 M=0023, totalsum=000299










Figure 64 Results of timing simulation for CSA with one input port
#0, M1=0000, M2=0000, M6=0000, M7=0000, M11=0000, M12=0000, M16=0000, M19=0000, totalsum=0061b4
#50, M1=1000, M2=0200, M6=0330, M7=0efd, M11=1579, M12=00f0, M16=6709, M19=afff, totalsum=00b352
#100, M1=1200, M2=02f0, M6=1530, M7=0afd, M11=1009, M12=c510, M16=6009, M19=af12, totalsum=007b05
Figure 65 Results of functional simulation for CSA with eight input ports
#0, M1=0000, M2=0000, M6=0000, M7=0000, M11=0000, M12=0000, M16=0000, M19=0000, totalsum=xxxxxx
#23, M1=0000, M2=0000, M6=0000, M7=0000, M11=0000, M12=0000, M16=0000, M19=0000, totalsum=0061b4
#50, M1=1000, M2=0200, M6=0330, M7=0efd, M11=1579, M12=00f0, M16=6709, M19=afff, totalsum=0061b4
#70, M1=1000, M2=0200, M6=0330, M7=0efd, M11=1579, M12=00f0, M16=6709, M19=afff, totalsum=00b352
#100, M1=1200, M2=02f0, M6=1530, M7=0afd, M11=1009, M12=c510, M16=6009, M19=af12, totalsum=00b352
#122, M1=1200, M2=02f0, M6=1530, M7=0afd, M11=1009, M12=c510, M16=6009, M19=af12, totalsum=007b05
Figure 66 Results of timing simulation for CSA with eight input ports
