A Synthesizable single-cycle multiply-accumulator by Childs, John
Rochester Institute of Technology 
RIT Scholar Works 
Theses 
8-1-2003 
A Synthesizable single-cycle multiply-accumulator 
John Childs 
Recommended Citation 
Childs, John, "A Synthesizable single-cycle multiply-accumulator" (2003). Thesis. Rochester Institute of 
Technology. Accessed from 
A SYNTHESIZABLE SINGLE-CYCLE MULTIPLY-ACCUMULATOR
By
JOHN S. CHILDS
Thesis submitted to the Faculty of Rochester Institute of Technology in partial














DEPARTMENT OF ELECTRICAL ENGINEERING, COLLEGE OF ENGINEERING
ROCHESTER INSTITUTE OF TECHNOLOGY, ROCHESTER, NEW YORK
AUGUST 2003
Department of Electrical Engineering
A SYNTHESIZABLE SINGLE-CYCLE MULTIPLY-ACCUMULATOR
I, John S. Childs hereby grant the permission to the Wallace Library of the Rochester
Institute of Technology to reproduce my thesis in whole or in part. Any reproduction will





Acknowledgments
The work contained in this thesis would not have been possible without the assistance of
several people both within and outside of RIT. First, I would like to give my sincere
thanks to Dr. Ferat Sahin, my graduate advisor. Dr. Sahin has provided me with the
guidance and direction needed to complete this work. He also introduced me to the
Multi-Agent Bio-Robotics Lab (MABL) whose meetings challenged me to look at a wide
range of problems in new ways.
I would also like to thank my thesis committee, Dr. Dorin Patru and Dr. Daniel
Phillips, for carefully reviewing my work. The suggestions and feedback that they have
provided have been a tremendous help to me.
The faculty and staff of the Electrical Engineering Department at RIT have given
me a great deal of support over the last few years. Many thanks to the Electrical
Engineering Department Head, Dr. Robert Bowman, for the advice and perspective he
has provided me. I would also like to thank the Electrical Engineering Department staff,
Ms. Florence Layton, Ms. Jill Lewis, and Ms. Patti Vicari, for keeping me on track and
focused on my goals at RIT.
Last, I would like to thank Improv Systems, Inc. for providing the motivation for
conducting this research, as well as the tools and resources necessary for its completion.
I would especially like to thank Mark Indovina, Richard Wanzenried, John Gostomski,
and Robert Childs for the assistance and suggestions that they have each offered.





Master of Science in Electrical Engineering
Abstract
The multiplication and multiply-accumulate operations are expensive to implement in
hardware for Digital Signal Processing, video, and graphics applications. A standard
multiply-accumulator has three inputs and a single output that is equal to the product of
two of its inputs added to the third input. For some applications it is desirable for a
multiply-accumulator to have two outputs; one output that is the product of the first two
inputs, and a second output that is the multiply-accumulate result.
The goal of this thesis is to investigate algorithms and architectures used to design
multipliers and multiply-accumulators, and to create a multiply-accumulator that
computes both outputs in a single clock cycle. In high-speed designs, the most
time-consuming operations are often pipelined to meet the system timing requirements.
If the multiply-accumulate computation can be reduced to a single-cycle operation, the
overall processor performance can be improved for many applications.
A multiply-accumulator with two outputs can be created using a combination of
standard multiply, add, or multiply-accumulate components. Using these components, a
multiplier and a multiply-accumulator can be used to produce the outputs in the most
time-efficient manner. A multiplier and an adder will result in a smaller design with a
larger worst-case delay. Therefore, the goal is to create a multiply-accumulator that is
comparable in speed, but requires less area than a design using an industry standard
multiplier and multiply-accumulator.





Table of Contents
List of Figures
List of Tables
1. Introduction
1.1 Motivation
1.2 Thesis Outline
2. Problem Definition
3. Multiplication Algorithms
3.1 Introduction to Binary Multiplication
3.2 Unsigned Multiplier Designs
3.2.1 Carry Save Adder
3.2.2 Simple Array Multiplier
3.2.3 Multipass Array Multiplier
3.2.4 Even/Odd Array Multiplier
3.3 Signed Multiplier Partial Product Generation
3.3.1 The Baugh-Wooley Algorithm
3.3.2 Booth's Algorithm
3.3.3 Radix-4 Booth Encoding
3.3.3.1 Gate-Level Implementation
3.3.3.2 Sign-Extension Correction
3.3.4 Higher Radix Booth Encoding
3.4 Partial Product Reduction
3.4.1 Wallace Tree Multipliers
3.4.2 Dadda Multipliers
3.4.3 The Three Dimensional Method
3.4.4 4:2 Compressors
3.5 Addition Algorithms
3.5.1 Ripple Carry Adder
3.5.2 Carry-Lookahead Adder
3.5.3 Carry-Select Adder
3.5.4 Carry-Skip Adder
3.5.5 Conditional Sum Adder
3.5.6 Brent-Kung Adder
3.5.7 Input Delay Considerations
4. Design Alternatives
4.1 Introduction
4.2 Design Stages
4.2.1 Partial Product Generation
4.2.2 Partial Product Reduction
4.2.2.1 Partial Product Reduction - Method 1
4.2.2.2 Partial Product Reduction - Method 2
4.2.3 Final Addition
4.2.3.1 Hybrid Adder 1
4.2.3.2 Hybrid Adder 2
4.2.3.3 Hybrid Adder 3
4.2.3.4 Hybrid Adder 4
5. MAC Designs and Results
5.1 Introduction
5.2 Standard Multiply-Accumulators
5.3 MAC Design Solutions
5.3.1 Partial Product Generation: Baugh-Wooley vs. Booth Encoding
5.3.2 Final Adder Comparison
5.3.3 Booth Coding Style Comparison
5.3.4 Partial Product Reduction Comparison - Method 1 vs. Method 2
5.3.5 Partial Product Reduction Comparison - Dadda vs. TDM
5.3.6 TDM PPRT Analysis
5.3.7 Method 1 Partial Product Reduction Analysis
5.3.8 Method 2 Partial Product Reduction Analysis
5.3.9 Dadda MAC Designs
5.3.10 MAC Design Summary
6. Future Work
7. Conclusion
8. References
9. Appendix A
List of Figures
Figure 1-1: Comparison of (a) a MAC with Two Outputs and (b) a Standard MAC
Figure 2-1: Multiply-Accumulator Block Diagram
Figure 3-1: Unsigned Multiplication Example
Figure 3-2: Four-Input Adder using RCAs
Figure 3-3: Four-Input Adder using CSAs
Figure 3-4: Eight-Bit Simple Array Multiplier
Figure 3-5: Eight-Bit Multipass Array Multiplier [3]
Figure 3-6: Even/Odd Array Multiplier [3]
Figure 3-7: Incorrect Signed Multiplier Example
Figure 3-8: Baugh-Wooley Partial Product Formation
Figure 3-9: Booth Encoding Example
Figure 3-10: Radix-4 Booth Encoding Example
Figure 3-11: Booth Decoder (a) and Encoder (b) [9]
Figure 3-12: Partial Products using an MBE Scheme
Figure 3-13: Sum of Sign Bits for Negative Partial Products
Figure 3-14: Sign-Extension Correction Examples for a Signed Multiplier
Figure 3-15: MBE Scheme with Sign-Extension Correction [9]
Figure 3-16: Full Adder Schematic
Figure 3-17: Half Adder Schematic
Figure 3-18: An 8-Bit Wallace Multiplier
Figure 3-19: 8-Bit Wallace Multiplier Block Diagram
Figure 3-20: 8-Bit Dadda Multiplier
Figure 3-21: PPRT Example
Figure 3-22: Delay Comparison for Two Signal Arrangements [13]
Figure 3-23: Non-TDM PPRT
Figure 3-24: TDM PPRT
Figure 3-25: A 4:2 Adder Built using Full Adders
Figure 3-26: 4:2 Compressors to Sum Four Numbers
Figure 3-27: Redesigned 4:2 Compressor Logic [8]
Figure 3-28: Alternate 4:2 Compressor using Full Adders
Figure 3-29: 16-Bit Carry-Lookahead Adder
Figure 3-30: 8-Bit Carry-Select Adder
Figure 3-31: 13-Bit Carry-Select Adder
Figure 3-32: 16-Bit Carry-Skip Adder
Figure 3-33: 16-Bit Brent-Kung Adder
Figure 4-1: Partial Products for a Signed/Unsigned MAC
Figure 4-2: Sign-Extension Correction Constant for a Signed/Unsigned MAC
Figure 4-3: Partial Products for a Signed/Unsigned MAC
Figure 4-4: MAC Design using Method 1
Figure 4-5: MAC Design using Method 2
Figure 4-6: Hybrid Adder 1 Block Diagram
Figure 4-7: Hybrid Adder 2 Block Diagram
Figure 4-8: Hybrid Adder 3 Block Diagram
Figure 4-9: Hybrid Adder 4 Block Diagram
Figure 5-1: Final Adder Simulation Results - 0.18 micron Technology
Figure 5-2: Final Adder Simulation Results - 0.13 micron Technology
Figure 5-3: Booth Coding Style Comparison - 0.18 micron Technology
Figure 5-4: Booth Coding Style Comparison - 0.13 micron Technology
Figure 5-5: Partial Product Reduction Comparison - 0.18 micron Technology
Figure 5-6: Partial Product Reduction Comparison - 0.13 micron Technology
Figure 5-7: TDM vs. Dadda Comparison - 0.18 micron Technology
Figure 5-8: TDM vs. Dadda Comparison - 0.13 micron Technology
Figure 5-9: TDM Analysis - 0.18 micron Technology
Figure 5-10: TDM Analysis - 0.13 micron Technology
Figure 5-11: Method 1 PPRT Comparison - 0.18 micron Technology
Figure 5-12: Method 1 PPRT Comparison - 0.13 micron Technology
Figure 5-13: Method 2 PPRT Comparison - 0.18 micron Technology
Figure 5-14: Method 2 PPRT Comparison - 0.13 micron Technology
Figure 5-15: Dadda MAC Comparison - 0.18 micron Technology
Figure 5-16: Dadda MAC Comparison - 0.13 micron Technology
Figure 5-17: MAC Comparison - 0.18 micron Technology
Figure 5-18: MAC Comparison - 0.13 micron Technology
List of Tables
Table 3-1: Booth's Algorithm
Table 3-2: Alternate Booth Encoding
Table 3-3: Radix-4 Booth Encoding
Table 3-4: pp_LSB Formation
Table 3-5: Inner Partial Product Bits
Table 3-6: Logic Table for Inner Partial Product Bits
Table 3-7: neg_cin Logic Values
Table 3-8: Full Adder Truth Table
Table 3-9: Half Adder Truth Table
Table 3-10: Dadda and Wallace Multiplier Comparison [14]
Table 3-11: 8-Bit Conditional Sum Adder [7]
Table 3-12: Step 1 of a Brent-Kung Adder
Table 3-13: Step 2 of a Brent-Kung Adder
Table 3-14: Step 3 of a Brent-Kung Adder
Table 5-1: Simulation Results for Two DesignWare MACs
Table 5-2: Booth vs. Baugh-Wooley Partial Product Generation
Table 5-3: Booth vs. Baugh-Wooley MAC Simulation Results
Table 5-4: Delay Estimate Summary for TDM Analysis
Table 5-5: TDM Analysis Simulation Summary
Table 5-6: MAC Design Summary
Table 5-7: MAC Design Descriptions
1. Introduction
1.1 Motivation
The Improv Systems Jazz DSP [1] has a multiply-accumulator with multiply and
multiply-accumulate outputs. However, to meet the timing requirements of the system,
the current architecture takes two clock cycles to complete the operation. Therefore,
methods have been investigated to increase the speed of the multiply-accumulator so that
the operation can be completed in a single clock cycle.
A multiply-accumulator with two outputs can process data more efficiently than a
standard multiply-accumulator for certain applications. For example, assume the
following two equations need to be processed:
$$p = (wx + y)\,wx \tag{1-1}$$
$$q = (wx + y)\,wx + z \tag{1-2}$$
These equations were selected to demonstrate the increased computational power that a
multiply-accumulator with two outputs has over a standard multiply-accumulator.
Equations (1-1) and (1-2) are typical examples of calculations that are part of more
complex computations performed by a DSP for video, graphics, compression, or
filtering applications.
Assume that to solve this set of equations either a multiply-accumulator (MAC)
with two outputs or a standard multiply-accumulator can be used. If both
multiply-accumulators can complete one computation per clock cycle, a MAC with two outputs
can solve these equations in two clock cycles (Figure 1-1 (a)), while a standard MAC



































Figure 1-1: Comparison of (a) a MAC with Two Outputs and (b) a Standard MAC
This example provides the motivation for creating a MAC that can compute the
product of two numbers as well as the multiply-accumulate result of the inputs.
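The comparison can be sketched in Python by modelling each unit as a function and counting one clock cycle per call. The class and function names are illustrative assumptions, and the four-step schedule for the standard MAC is only one possible schedule under this model, not a figure taken from the text.

```python
class CycleCounter:
    """Hypothetical model: every MAC operation issued costs one clock cycle."""

    def __init__(self):
        self.cycles = 0

    def mac_two_outputs(self, a, b, c):
        """MAC with two outputs: returns (a*b, a*b + c)."""
        self.cycles += 1
        product = a * b
        return product, product + c

    def mac_standard(self, a, b, c):
        """Standard MAC: returns only a*b + c."""
        self.cycles += 1
        return a * b + c


def solve_with_two_outputs(w, x, y, z):
    unit = CycleCounter()
    wx, wx_plus_y = unit.mac_two_outputs(w, x, y)   # cycle 1: wx and wx + y
    p, q = unit.mac_two_outputs(wx_plus_y, wx, z)   # cycle 2: p and q
    return p, q, unit.cycles


def solve_with_standard(w, x, y, z):
    unit = CycleCounter()
    wx = unit.mac_standard(w, x, 0)                 # product only (c = 0)
    wx_plus_y = unit.mac_standard(w, x, y)
    p = unit.mac_standard(wx_plus_y, wx, 0)
    q = unit.mac_standard(wx_plus_y, wx, z)
    return p, q, unit.cycles


print(solve_with_two_outputs(3, 4, 5, 6))
print(solve_with_standard(3, 4, 5, 6))
```

Both schedules return the same p and q; only the number of operations issued differs.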
1.2 Thesis Outline
The following chapters summarize algorithms that have been used to design standard
multipliers and multiply-accumulators. Possible design solutions to meet the design
requirements are also presented.
Chapter 2 defines the design specifications for a multiply-accumulator with two
outputs in greater detail.
Chapter 3 summarizes the algorithms that are used to create a binary multiplier.
Both signed and unsigned multipliers are considered. The three stages of a combinational
multiplier are described, with several alternatives to implement each stage.
Chapter 4 discusses design methods for standard multiply-accumulators with a
single output. The modifications to the standard design methods required to meet the
specifications of a multiply-accumulator with two outputs are presented.
Chapter 5 proposes several possible design solutions. Each design is simulated,
and compared to the performance of industry standard components.
Chapter 6 summarizes the results achieved, and suggests two specific designs
that fulfill the problem requirements. The constraints will dictate which solution should
be used.
2. Problem Definition
A MAC has three operands: a, b, and c. The specifications require that inputs a and b are
each 32-bit values, while c is a 64-bit value. The operands can be either signed (two's
complement) or unsigned values. The tc, or "two's complement", input is a single bit that
indicates whether the inputs are signed or unsigned. If tc is set (has a logic value of 1),
the inputs are all signed values. Otherwise, the inputs are unsigned values.
There are two 64-bit outputs from the MAC. The first output is the product of the
a and b inputs, where a is the multiplicand, and b the multiplier. The second output is the
multiply-accumulate (MAC) output, which is the product of the a and b inputs plus the c
input. If overflow occurs from the MAC operation, the MAC output is truncated to
64 bits. That is, only the lower 64 bits of the MAC output are calculated. A block diagram
summarizing the operation of a MAC is given in Figure 2-1.
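The behaviour specified above can be captured in a small Python reference model, which may be useful for checking a hardware implementation. This is a sketch under the stated specification only; the names `mac_reference` and `to_signed` are assumptions introduced here, and both outputs are returned as 64-bit patterns.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def mac_reference(a, b, c, tc):
    """Return (product, mac) as 64-bit patterns.

    a, b: 32-bit inputs; c: 64-bit input; tc = 1 selects signed operands.
    The MAC output keeps only the lower 64 bits of a*b + c.
    """
    a &= MASK32
    b &= MASK32
    c &= MASK64
    if tc:
        a, b, c = to_signed(a, 32), to_signed(b, 32), to_signed(c, 64)
    product = a * b
    return product & MASK64, (product + c) & MASK64


# Signed example: 5 * (-5) + 0, with -5 given as a 32-bit pattern.
product, mac = mac_reference(5, 0xFFFFFFFB, 0, tc=1)
print(to_signed(product, 64), to_signed(mac, 64))
```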
Two additional constraints placed on the design are that it must be synthesizable
and technology independent. Each of the proposed designs has been coded using the
Verilog Hardware Description Language. The code must be written using a style that can
be compiled using the Synopsys Design Compiler tool. In addition, no
technology-specific cells can be instantiated within the design, allowing the same code to
be used to target any technology library.

















3.1 Introduction to Binary Multiplication
The multiplication and multiply-accumulate operations can be implemented using similar
methods. A multiply-accumulator can be considered a multiplier with one extra partial
product. This chapter will first describe several unsigned multiplier designs, and then
present the algorithms used to create a signed (where the inputs are in two's complement
format) multiplier.
The specifications for the multiply-accumulator to be designed require that the
outputs be produced in a single clock cycle. Therefore, a combinatorial path must exist
between the inputs and the outputs of the unit. The most common class of multipliers
that produce a result in a single clock cycle consists of three stages: partial product
generation, partial product reduction, and final addition. The first stage generates a set of
binary numbers based on the multiplicand and multiplier whose sum equals the product
of the inputs. Next, the partial products are reduced to two binary numbers whose sum
equals the desired product. In the final stage the two binary numbers generated from the
second stage are summed using a fast final adder.
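The three stages can be illustrated with a short Python sketch for unsigned operands. It is a behavioural model only: the reduction stage here uses a simple 3:2 (carry-save) step applied repeatedly, which is one assumed choice among the reduction schemes discussed later, and all function names are hypothetical.

```python
def generate_partial_products(a, b, n):
    """Stage 1: AND each multiplier bit with the (shifted) multiplicand."""
    return [(a << i) if (b >> i) & 1 else 0 for i in range(n)]


def carry_save_add(x, y, z):
    """3:2 step: three numbers in, a sum word and a carry word out."""
    sum_word = x ^ y ^ z
    carry_word = ((x & y) | (y & z) | (x & z)) << 1
    return sum_word, carry_word


def reduce_partial_products(rows):
    """Stage 2: reduce the rows until only two numbers remain."""
    rows = list(rows)
    while len(rows) > 2:
        s, c = carry_save_add(rows[0], rows[1], rows[2])
        rows = rows[3:] + [s, c]
    while len(rows) < 2:
        rows.append(0)
    return rows[0], rows[1]


def multiply_unsigned(a, b, n):
    rows = generate_partial_products(a, b, n)
    x, y = reduce_partial_products(rows)
    return x + y  # Stage 3: final carry-propagate addition


print(multiply_unsigned(5, 11, 4))
```

At every point in the reduction the remaining rows still sum to the product, which is the invariant that lets the final adder be the only place where carries propagate.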
The next section describes some representative multiplier designs. Sections 3.3,
3.4 and 3.5 consider a multiplier with signed inputs, and look at each stage in greater
detail.
3.2 Unsigned Multiplier Designs
The main difference between a signed and an unsigned multiplier is in the partial product
generation stage. ANDing each multiplier bit with the multiplicand can generate the
partial products for an unsigned multiplier. This is similar to multiplication using the
traditional longhand method, as shown below. The values shown in parentheses are the
decimal equivalents of the respective unsigned binary numbers.
          0 1 0 1   (5)
x         1 0 1 1   (11)
    -------------
          0 1 0 1
        0 1 0 1
      0 0 0 0
+   0 1 0 1
    -------------
    0 1 1 0 1 1 1   (55)
Figure 3-1: Unsigned Multiplication Example
3.2.1 Carry Save Adder
A carry save adder (CSA) [2, 3] is simply a row of full adders. A full adder has three
inputs (a, b, c) and two outputs (sum, carry). Each of the inputs to a full adder has the
same weight. The sum output has the same weight as the input bits, while the carry
output has twice the weight of the sum. For example, assume the inputs to a full adder
are (1, 0, 1), each with weight $2^0$. The sum output will be 0 with weight $2^0$, and the
carry output is 1 with weight $2^1$. The logical equations for the sum and carry outputs of
a full adder are given in Equations (3-1) and (3-2), where $\oplus$ represents the exclusive
OR operation, and + represents the logical OR operation.
$$sum = a \oplus b \oplus c \tag{3-1}$$
$$carry = ab + bc + ac \tag{3-2}$$
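Equations (3-1) and (3-2) can be checked with a few lines of Python. The sketch below enumerates all eight input combinations; the function name `full_adder` is an assumption used for illustration.

```python
from itertools import product


def full_adder(a, b, c):
    """Sum and carry of a full adder, following Equations (3-1) and (3-2)."""
    total = a ^ b ^ c
    carry = (a & b) | (b & c) | (a & c)
    return total, carry


# Truth table rows: (a, b, c, sum, carry).
truth_table = [(a, b, c, *full_adder(a, b, c)) for a, b, c in product((0, 1), repeat=3)]
for row in truth_table:
    print(row)
```

In every row the relation a + b + c = sum + 2 x carry holds, which reflects the carry output having twice the weight of the sum output.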
Assume four 4-bit binary numbers (a, b, c, d) are to be summed. One possible
adder design consists of three levels of full-adders, where the carry-out bit of each full
adder is connected to the c input of the adjacent full adder, forming a ripple carry adder
(RCA). Such a design is given below, where a simplified block diagram is given to the
right. The 'HA' blocks represent half adders. A half adder is equivalent to a full adder
with one of the inputs set to zero.


















Figure 3-2: Four-Input Adder using RCAs
The same four-input adder can be implemented using two carry save adders and
one ripple carry adder. A design using CSAs allows the full adders in each row to
perform the addition of its inputs simultaneously, without waiting for the carry-out of the
adder to the right to be calculated. Therefore, a design using CSAs will be faster than
one using RCAs, while each design uses approximately the same number of full and half
adders.
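A bit-level Python sketch of the CSA-based four-input adder follows. It is an illustrative model, not the figure's exact wiring: the helper names are assumptions, bit lists are stored LSB first, and the operands are zero-extended by two bits so that the carry dropped from the top of each CSA row is always zero.

```python
def full_adder(a, b, c):
    return a ^ b ^ c, (a & b) | (b & c) | (a & c)


def to_bits(value, width):
    return [(value >> i) & 1 for i in range(width)]  # LSB first


def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))


def carry_save_adder(x, y, z):
    """One row of independent full adders; no carry ripples inside the row."""
    pairs = [full_adder(a, b, c) for a, b, c in zip(x, y, z)]
    sums = [s for s, _ in pairs]
    # Carries move one position to the left; the top carry is dropped,
    # which is safe here only because of the two bits of headroom.
    carries = [0] + [cy for _, cy in pairs[:-1]]
    return sums, carries


def ripple_carry_adder(x, y):
    out, carry = [], 0
    for a, b in zip(x, y):
        s, carry = full_adder(a, b, carry)  # each stage waits for the previous carry
        out.append(s)
    out.append(carry)
    return out


def add_four(a, b, c, d, width=4):
    w = width + 2  # enough bits to hold the sum of four width-bit numbers
    a, b, c, d = (to_bits(v, w) for v in (a, b, c, d))
    s1, c1 = carry_save_adder(a, b, c)    # first CSA
    s2, c2 = carry_save_adder(s1, c1, d)  # second CSA
    return from_bits(ripple_carry_adder(s2, c2))


print(add_four(15, 15, 15, 15))
```

Only the final `ripple_carry_adder` call contains a carry chain; the two CSA rows each have the delay of a single full adder.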







Figure 3-3: Four-Input Adder using CSAs
3.2.2 Simple Array Multiplier
A simple array multiplier [3] adds the partial products in a linear arrangement. The first
three partial products are added using a CSA, which produces a carry and a sum. Next,
the fourth partial product is added to the carry and sum of the first CSA. The bits must be
properly aligned so that the carry is shifted to the left by one bit. This procedure is
continued until all the partial products have been added in a CSA. The last CSA
produces a carry and a sum that must be added in a carry-propagate adder (CPA) to
obtain the final result. A carry-propagate adder sums two binary numbers to produce a
single result. Ripple carry, carry-lookahead, carry-select, and Brent-Kung adders are
examples of carry-propagate adders that will be discussed in Section 3.5.
An 8-bit Simple Array Multiplier is shown below. The inputs to the multiplier are
unsigned, so the partial products can be formed using AND gates. In this example, A
represents all eight bits of the multiplicand, while $b_i$ indicates bit i of the multiplier.






















Figure 3-4: Eight-Bit Simple Array Multiplier
3.2.3 Multipass Array Multiplier
If the speed of the design is not a critical factor, an array multiplier can be arranged to
minimize cost [3]. A multipass array multiplier [3] can be used to minimize the space
used on a chip, by reusing hardware. The structure of a multipass array multiplier is
similar to that of a simple array multiplier. However, signals pass through the hardware
twice before a result is found. An 8-bit multipass array multiplier is shown in Figure 3-5.
On the first pass through the array, the first five partial products are added. The results of
the first pass are fed back into the first CSA, and the sixth, seventh, and eighth partial
products are reduced. The sum and carry outputs from the last CSA after the second pass
are added in the CPA to give the product.
The timing of a multipass array multiplier will determine whether the circuit
functions properly. The sixth, seventh, and eighth partial products must be applied after
the result of the respective CSA on the first pass has been found. The sum and carry
from the last CSA on the second pass must be applied to the CPA at the correct time to
prevent errors in the product. The delay of the multiplier will vary based on the
technology used to implement the design. Therefore, such a design cannot be
implemented using a synthesis tool, and will not be of use due to the specifications of this
project.














Figure 3-5: Eight-Bit Multipass Array Multiplier [3]
3.2.4 Even/Odd Array Multiplier
The arrangement of the inputs to each of the CSAs of an even/odd array multiplier allows
multiplication to occur faster than that of a simple array multiplier [3]. Any given signal
passes through only half of the adders, making the partial product reduction almost twice
as fast as the method used in a simple array multiplier. For example, an 8-bit even/odd
array consists of six CSAs and one CPA. A signal will not pass through more than four
CSAs before it reaches the CPA. An 8-bit simple array multiplier has six CSAs and one
CPA. However, a signal could pass through all six of the CSAs before it reaches the CPA.





















Figure 3-6: Even/Odd Array Multiplier [3]
3.3 Signed Multiplier Partial Product Generation
The inputs to a signed multiplier are in two's complement format. A signed multiplier
differs from an unsigned multiplier only in the partial product generation stage. Using
AND gates does not create the proper partial products for a signed multiplier, as shown in
the example in Figure 3-7.
          0 1 0 1   (5)
x         1 0 1 1   (-5)
    -------------
          0 1 0 1
        0 1 0 1
      0 0 0 0
+   0 1 0 1
    -------------
    0 1 1 0 1 1 1   (55)
Figure 3-7: Incorrect Signed Multiplier Example
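The failure shown in Figure 3-7 can be reproduced in Python. The sketch sums AND-gate partial products for the two 4-bit patterns and compares the result with the true signed product; the function names are assumptions made for this illustration.

```python
def to_signed(value, bits):
    """Interpret a bit pattern as a two's complement number."""
    return value - (1 << bits) if (value >> (bits - 1)) & 1 else value


def and_gate_product(a_bits, b_bits, n):
    """Sum of AND-gate partial products: an unsigned multiply of the bit patterns."""
    return sum(a_bits << i for i in range(n) if (b_bits >> i) & 1)


a_bits, b_bits, n = 0b0101, 0b1011, 4
naive = and_gate_product(a_bits, b_bits, n)
expected = to_signed(a_bits, n) * to_signed(b_bits, n)
print(naive, expected)
```

The AND-gate array treats 1011 as 11 rather than -5, so the two values disagree.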
This section presents three methods of generating the partial products of a signed
multiplier: the Baugh-Wooley algorithm [4, 5], Booth's algorithm [2, 3, 6, 7, 8], and
radix-4 Booth encoding [3, 7, 8, 10].
3.3.1 The Baugh-Wooley Algorithm
The Baugh-Wooley method of partial product generation uses a method similar to that of
longhand multiplication. The partial products are generated using AND gates, where
each bit of the multiplier is ANDed with the multiplicand. The Baugh-Wooley method
[4, 5] can be modified to correctly generate the partial products for a multiplier that
accepts either signed or unsigned values as inputs. An example illustrating the operation
of this sort of multiplier for 5-bit operands is given below.
tc
                                   a4     a3     a2     a1     a0
                            x      b4     b3     b2     b1     b0
                                          b0a3   b0a2   b0a1   b0a0
                                   b1a3   b1a2   b1a1   b1a0
                            b2a3   b2a2   b2a1   b2a0
                     b3a3   b3a2   b3a1   b3a0
              ~b3a4  ~b2a4  ~b1a4  ~b0a4          (do not negate if tc = 0)
              ~b4a3  ~b4a2  ~b4a1  ~b4a0          (do not negate if tc = 0)
       b4a4   0      0      tc
(~ marks a negated term; each column holds bits of equal weight, with bit 0 at the right)
Figure 3-8: Baugh-Wooley Partial Product Formation
A derivation of the above modified Baugh-Wooley method of partial product
generation, similar to that given in [4, 5], follows. In general, an n-bit number, N, in two's
complement form can be written as:
$$N = -a_{n-1}2^{n-1} + \sum_{i=0}^{n-2} a_i 2^i \tag{3-3}$$
where $a_{n-1}$ is the sign bit of N. Similarly, -N can be found by taking the two's
complement of Equation (3-3). That is, Equation (3-4) is formed by negating each bit of






Let A be an m-bit multiplicand and B be an n-bit multiplier, where the bits of A and B are
written as:
$$A = (a_{m-1}\, a_{m-2}\, a_{m-3} \ldots a_0)$$
$$B = (b_{n-1}\, b_{n-2}\, b_{n-3} \ldots b_1\, b_0)$$
Assume that A and B are signed numbers in two's complement form, where





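$$P = A \times B = \left(-a_{n-1}2^{n-1} + \sum_{i=0}^{n-2} a_i 2^i\right)\left(-b_{n-1}2^{n-1} + \sum_{j=0}^{n-2} b_j 2^j\right) \tag{3-5}$$
Expanding Equation (3-5) gives Equation (3-6):
$$P = a_{n-1}b_{n-1}2^{2n-2} + \sum_{i=0}^{n-2}\sum_{j=0}^{n-2} a_i b_j 2^{i+j} - \sum_{i=0}^{n-2} a_{n-1} b_i 2^{n-1+i} - \sum_{j=0}^{n-2} b_{n-1} a_j 2^{n-1+j} \tag{3-6}$$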
The first term in Equation (3-6) is the result of the two sign bits being ANDed
together, with the weight placing the result in the second most significant bit of the
product. The second term in Equation (3-6) represents the standard method of finding the
partial products, without including either of the sign bits. These partial products are
found using AND gates. The last two terms in Equation (3-6) can be rewritten using
Equation (3-4). Instead of subtracting the negative summands, these terms can be























Similarly, the last term in Equation (3-6) can be written as:
(3-8)




















However, the binary representation of $-2^{2n}$ has each bit in bit position 2n and
greater set, and will be zero in the 2n - 1 bit position and below. The sign bit of the
product, P, has a weight of $2^{2n-1}$, so the $-2^{2n}$ term will not affect the product.
Therefore, this term can be ignored when forming the partial products.
To summarize, the modified Baugh-Wooley method presented can be used to
form the partial products for a multiplier that can have either signed or unsigned inputs,
depending on the tc input bit. If the inputs are unsigned values, tc will equal zero.
Therefore, the partial products can be found using the longhand method of forming the
partial products by ANDing each bit of the multiplier with the multiplicand. For signed
operands, the tc input will be set, and the described modifications need to be made to give
a valid set of partial products.
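The summary above can be exercised with a Python sketch. It follows the common textbook formulation of Baugh-Wooley for equal operand widths, extended with a tc flag: when tc = 1 the rows involving exactly one sign bit are inverted and two correction constants (at bit positions n and 2n - 1) are added. The placement of these constants and all function names are assumptions of this sketch rather than details confirmed by the text.

```python
def bit(value, i):
    return (value >> i) & 1


def baugh_wooley_rows(a, b, n, tc):
    """Return rows (as integers) whose sum mod 2**(2n) equals the product."""
    rows = []
    # Rows that involve no sign bit: plain AND gates.
    for i in range(n - 1):
        row = 0
        for j in range(n - 1):
            row |= (bit(b, i) & bit(a, j)) << (i + j)
        rows.append(row)
    # Rows that involve exactly one sign bit: inverted only when tc = 1.
    row_a_sign = row_b_sign = 0
    for i in range(n - 1):
        row_a_sign |= ((bit(a, n - 1) & bit(b, i)) ^ tc) << (n - 1 + i)
        row_b_sign |= ((bit(b, n - 1) & bit(a, i)) ^ tc) << (n - 1 + i)
    rows += [row_a_sign, row_b_sign]
    # Product of the two sign bits, then the correction constants for tc = 1.
    rows.append((bit(a, n - 1) & bit(b, n - 1)) << (2 * n - 2))
    rows.append((tc << (2 * n - 1)) | (tc << n))
    return rows


def baugh_wooley_multiply(a, b, n, tc):
    return sum(baugh_wooley_rows(a, b, n, tc)) & ((1 << (2 * n)) - 1)


def to_signed(value, bits):
    return value - (1 << bits) if (value >> (bits - 1)) & 1 else value


# 5-bit example: 01101 (13) times 10110 (-10 when signed, 22 when unsigned).
print(to_signed(baugh_wooley_multiply(0b01101, 0b10110, 5, 1), 10))
print(baugh_wooley_multiply(0b01101, 0b10110, 5, 0))
```

With tc = 0 every row is an ordinary AND-gate partial product, so the same array serves both operand types.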
3.3.2 Booth's Algorithm
Booth's algorithm offers an alternative to partial product generation methods using AND
gates. Nearly 90% of all multipliers use some form of Booth's algorithm in the partial
product generation stage [6].
Booth's algorithm [2, 3, 6, 7, 8] assumes that the operands are signed binary
values. Let a be the multiplicand and b be the multiplier, where $b_i$ represents the
i-th bit of b. The functionality of Booth's algorithm is shown in Table 3-1.





Table 3-1: Booth's Algorithm
$b_i$   $b_{i-1}$   Operation
0       0           Add 0
0       1           Add a
1       0           Subtract a
1       1           Add 0
Using Booth's algorithm, bits $b_0$ and $b_{-1}$ are examined first, where $b_{-1}$ is always
zero. The correct operation is selected based on the values of adjacent bits of the
multiplier, as summarized in Table 3-1. Next, a is shifted left by one bit (equivalent to
multiplying by 2), and bits $b_1$ and $b_0$ are used to determine the correct operation. The
procedure ends when each bit of b has been used. An example using Booth's algorithm
for 4-bit operands is given in Figure 3-9.
            0 1 0 1     (5)
x           1 0 1 1     (-5)
    ---------------
    1 1 1 1 1 0 1 1
    0 0 0 0 0 0 0
    0 0 0 1 0 1
+   1 1 0 1 1
    ---------------
    1 1 1 0 0 1 1 1     (-25)
Figure 3-9: Booth Encoding Example
The first and fourth partial products need to be sign-extended to obtain the correct result
for this example.
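The procedure can be sketched in Python as follows. The model works on signed integer values rather than sign-extended bit rows, so sign extension is implicit; the function names are assumptions introduced for illustration.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def booth_radix2_partial_products(a, b, n):
    """a, b: n-bit two's complement bit patterns. Returns signed partial products."""
    a_val = to_signed(a, n)
    products = []
    previous = 0  # b_{-1} is always zero
    for i in range(n):
        current = (b >> i) & 1
        digit = previous - current  # +1: add a, -1: subtract a, 0: add 0
        products.append(digit * (a_val << i))  # a shifted left by i bits
        previous = current
    return products


rows = booth_radix2_partial_products(0b0101, 0b1011, 4)
print(rows, sum(rows))
```

For the operands of Figure 3-9, the four partial products are -5, 0, 20, and -40, which sum to -25.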
A proof of Booth's algorithm starts by representing the operations shown in Table
3-1 by the value of $(b_{i-1} - b_i)$ [2]. The result determines the operation, as given in
Table 3-2.





Next, because a shift to the left by one bit is equivalent to multiplying by two,




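$$(b_{-1} - b_0) \times a \times 2^0 + (b_0 - b_1) \times a \times 2^1 + (b_1 - b_2) \times a \times 2^2 + (b_2 - b_3) \times a \times 2^3 \tag{3-11}$$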






$$b_i \times 2^{i+1} - b_i \times 2^i = (2b_i - b_i) \times 2^i = b_i \times 2^i \tag{3-12}$$
But $b_{-1}$ is always zero, so if a is factored out of each term of Equation (3-11):
$$a \times (-b_3 \times 2^3 + b_2 \times 2^2 + b_1 \times 2^1 + b_0 \times 2^0) = a \times b \tag{3-13}$$
The value in parenthesis in Equation (3-13) is the two's complement
representation of b, leading to the equality shown. Therefore, Booth's algorithm performs
two's complement multiplication of a and b.
3.3.3 Radix-4 Booth Encoding
The radix-4 Booth encoding method, also known as Booth-MacSorley recoding [3, 7, 8,
10], is an extension of Booth's algorithm that can be used to reduce the number of partial
products, thereby increasing the speed of a multiplier. Instead of examining two bits at a
time, and shifting the multiplicand to the left by one bit after each step, three bits can be
evaluated, and the multiplicand will be shifted left by two bits at a time. Using a radix-4
modified Booth encoding scheme reduces the number of partial products by a factor of
two.
The radix-4 Booth encoding method can be derived using a similar procedure to
that of the radix-2 Booth's algorithm proof, except now three bits of the multiplier are















The formation of the partial products using this method is shown in Table 3-3.
Table 3-3: Radix-4 Booth Encoding
Current Bits            Previous Bit   Operation
$b_{2i+1}$   $b_{2i}$   $b_{2i-1}$
0            0          0              Add 0
0            0          1              Add a
0            1          0              Add a
0            1          1              Add 2a
1            0          0              Subtract 2a
1            0          1              Subtract a
1            1          0              Subtract a
1            1          1              Add 0
Although the radix-4 version of Booth's algorithm halves the number of partial
products (if the number of bits in the multiplier is even), the drawback is that 2a, -a,
and -2a need to be calculated before the multiplication can begin.
In addition to shifting a to the left by two bits after each step, the values of b
being examined are also shifted to the left by two bits. For example, on the first step bits
1, 0, and -1 are used to determine the correct operation. On the second step bits 3, 2, and
1 are used, and so on. The same example used to illustrate radix-2 encoding is used to
show how the radix-4 version can be used. There are only two partial products when
radix-4 Booth encoding is used for a 4-bit multiplier.
            0 1 0 1     (5)
x           1 0 1 1     (-5)
    ---------------
    1 1 1 1 1 0 1 1     (110: Subtract a)
+   1 1 1 0 1 1         (101: Subtract a)
    ---------------
    1 1 1 0 0 1 1 1     (-25)
Figure 3-10: Radix-4 Booth Encoding Example
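A Python sketch of the radix-4 recoding follows. It encodes Table 3-3 as a lookup from the three examined bits to a multiple of a, and works on signed integer values so that sign extension is implicit. The names used are assumptions, and the sketch requires an even operand width.

```python
# Table 3-3 as a lookup: (b_{2i+1}, b_{2i}, b_{2i-1}) -> multiple of a.
BOOTH_RADIX4 = {
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0,
}


def to_signed(value, bits):
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def booth_radix4_partial_products(a, b, n):
    """a, b: n-bit two's complement bit patterns (n even)."""
    if n % 2:
        raise ValueError("this sketch assumes an even operand width")
    a_val = to_signed(a, n)
    b_ext = (b & ((1 << n) - 1)) << 1  # append b_{-1} = 0 on the right
    products = []
    for i in range(n // 2):
        group = (b_ext >> (2 * i)) & 0b111  # bits b_{2i+1}, b_{2i}, b_{2i-1}
        key = ((group >> 2) & 1, (group >> 1) & 1, group & 1)
        products.append(BOOTH_RADIX4[key] * a_val * 4 ** i)  # shift by 2i bits
    return products


rows = booth_radix4_partial_products(0b0101, 0b1011, 4)
print(rows, sum(rows))
```

For the operands of Figure 3-10, the two partial products are -5 and -20, which sum to -25.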
3.3.3.1 Gate-Level Implementation
To use radix-4 Booth encoding, it was assumed that -a and -2a have been calculated, and
are available to be used to generate the partial products. However, actually finding the
two's complement of a binary number typically requires negating each bit of the number
(finding the one's complement) and adding one to the result. In order to add one, a
carry-propagate adder must be used, which would require a significant amount of hardware
and time to produce the result. Therefore, this section presents a hardware implementation
of the radix-4 Booth encoding of partial products that does not require addition.
In [9, 23] two similar modified Booth encoding methods have been presented, and
will be described in this section. First, it is known that the least significant bit of a partial
product will be set if the least significant bit of the multiplicand (a) is set and the
operation according to Booth's algorithm is either (add a) or (subtract a). A partial
product will have a zero in the least significant bit if the operation according to Booth's
algorithm is (add 2a), (subtract 2a), or (add 0). Table 3-4 summarizes the values taken
by the least significant bit of a partial product. The LSB of a partial product, pp_LSB, is
found using Equation (3-15).
Table 3-4: pp_LSB Formation
$b_{2i}$   $b_{2i-1}$   $a_0$   pp_LSB
0          0            1       0
0          1            1       1
1          0            1       1
1          1            1       0
$$pp\_LSB = a_0 \cdot (b_{2i-1} \oplus b_{2i}) \tag{3-15}$$
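Equation (3-15) can be checked against the radix-4 operations directly. The Python sketch below compares the formula with the least significant bit of the selected multiple of a for every bit group; the function names and the lookup table layout are assumptions of this illustration.

```python
# Radix-4 Booth operations: (b_{2i+1}, b_{2i}, b_{2i-1}) -> multiple of a.
RADIX4_DIGIT = {
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0,
}


def pp_lsb(a0, b2i, b2i_minus_1):
    """Equation (3-15): LSB of a radix-4 Booth partial product."""
    return a0 & (b2i ^ b2i_minus_1)


def lsb_from_operation(a, b2i_plus_1, b2i, b2i_minus_1):
    """LSB of the selected multiple of a, computed arithmetically."""
    digit = RADIX4_DIGIT[(b2i_plus_1, b2i, b2i_minus_1)]
    return (digit * a) & 1


mismatches = [
    (a, group)
    for a in range(-8, 8)
    for group in RADIX4_DIGIT
    if pp_lsb(a & 1, group[1], group[2]) != lsb_from_operation(a, *group)
]
print(mismatches)
```

The formula does not depend on $b_{2i+1}$ because negating a number leaves its least significant bit unchanged.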
Bits one through the MSB of the partial product are found using Table 3-5. Each
bit of the multiplicand is negated for the (-a) and (-2a) operations of Booth's algorithm.
The multiplicand is shifted to the left by one bit for the (2a) and (-2a) operations.
Table 3-5: Inner Partial Product Bits
b_{2i+1}   b_{2i}   b_{2i-1}   pp_{ij}
   0          0         0       0
   0          0         1       a_j
   0          1         0       a_j
   0          1         1       a_{j-1}
   1          0         0       \overline{a_{j-1}}
   1          0         1       \overline{a_j}
   1          1         0       \overline{a_j}
   1          1         1       0
One possible way of selecting the correct partial product bits would be to use an
8-to-1 MUX. However, [9, 11] present an alternative design. First, a simple encoder
creates the Neg, d (doubled), Z, and nd (not doubled) signals. These signals are used with
bits from the multiplicand to generate the respective partial product bit. A schematic and
truth table for this design are given in Figure 3-11 and Table 3-6, respectively. The
internal signals q, r, s, and t are shown to help clarify the logic of the circuit.
Table 3-6: Logic Table for Inner Partial Product Bits
b_{2i+1}  b_{2i}  b_{2i-1}  Neg  Z  d  nd  q               r                   s               t                   pp_{ij}
   0        0        0       0   1  1  0   a_j             a_{j-1}             1               1                   0
   0        0        1       0   1  0  1   a_j             a_{j-1}             \overline{a_j}  1                   a_j
   0        1        0       0   0  0  1   a_j             a_{j-1}             \overline{a_j}  1                   a_j
   0        1        1       0   0  1  0   a_j             a_{j-1}             1               \overline{a_{j-1}}  a_{j-1}
   1        0        0       1   0  1  0   \overline{a_j}  \overline{a_{j-1}}  1               a_{j-1}             \overline{a_{j-1}}
   1        0        1       1   0  0  1   \overline{a_j}  \overline{a_{j-1}}  a_j             1                   \overline{a_j}
   1        1        0       1   1  0  1   \overline{a_j}  \overline{a_{j-1}}  a_j             1                   \overline{a_j}
   1        1        1       1   1  1  0   \overline{a_j}  \overline{a_{j-1}}  1               1                   0















Figure 3-11: Booth Decoder (a) and Encoder (b) [9]
Using the pp_LSB logic and Figure 3-11 for the upper partial product bits will not
produce the correct partial products, however. Therefore, one more modification needs
to be made. A neg_cin bit is appended to the right of the LSB of each partial product
except the first partial product. The neg_cin bit for the first partial product is appended to
the second partial product; the neg_cin bit for the second partial product is appended to
the third partial product, and so on. The neg_cin bit for the last partial product will be
placed below the last partial product.
The neg_cin bit is always set for the (-2a) operation. The LSB of the partial
product will be set to zero, and each bit of a will be negated, giving the one's complement
of a shifted to the left by one bit. To find the two's complement of a, one must be added,
which causes neg_cin to be set for this operation.
The neg_cin bit will also be set for the (-a) operation if bit zero of the
multiplicand a is zero. In this case, the neg_cin bit represents a carry-out of the LSB of
the partial product. That is, a carry-out of the LSB will only occur for the (-a) operation
if a_0 equals zero. For this case, negating a causes its LSB to be one, and adding one
causes a carry-out. For all other cases neg_cin will be zero. Table 3-7 summarizes the
values taken by neg_cin. Equations (3-16) and (3-17) give two separate, but equivalent,
representations of neg_cin.
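Table 3-7: neg_cin Logic Values
b_{2i+1}   b_{2i}   b_{2i-1}   a_0   neg_cin
   0          0         0       0       0
   0          0         0       1       0
   0          0         1       0       0
   0          0         1       1       0
   0          1         0       0       0
   0          1         0       1       0
   0          1         1       0       0
   0          1         1       1       0
   1          0         0       0       1
   1          0         0       1       1
   1          0         1       0       1
   1          0         1       1       0
   1          1         0       0       1
   1          1         0       1       0
   1          1         1       0       0
   1          1         1       1       0
$neg\_cin = b_{2i+1}\,\overline{b_{2i-1}}\,\overline{a_0} + b_{2i+1}\,\overline{b_{2i}}\,\overline{b_{2i-1}} + b_{2i+1}\,\overline{b_{2i}}\,\overline{a_0}$    (3-16)
$neg\_cin = b_{2i+1}\left[(\overline{b_{2i-1}} + \overline{a_0})(\overline{b_{2i}} + \overline{b_{2i-1}})(\overline{b_{2i}} + \overline{a_0})\right]$    (3-17)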
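The equivalence of the two neg_cin expressions can be confirmed by exhaustive evaluation. The sketch below assumes that Equation (3-16) is the sum-of-products form and Equation (3-17) the product-of-sums form, with b_{2i}, b_{2i-1}, and a_0 appearing complemented in both; it also compares them with the rule stated in the text (set for -2a, and for -a when a_0 is zero). The function names are hypothetical.

```python
from itertools import product


def neg_cin_sop(b2ip1, b2i, b2im1, a0):
    """Sum-of-products form, as assumed for Equation (3-16)."""
    nb2i, nb2im1, na0 = 1 - b2i, 1 - b2im1, 1 - a0
    return (b2ip1 & nb2im1 & na0) | (b2ip1 & nb2i & nb2im1) | (b2ip1 & nb2i & na0)


def neg_cin_pos(b2ip1, b2i, b2im1, a0):
    """Product-of-sums form, as assumed for Equation (3-17)."""
    nb2i, nb2im1, na0 = 1 - b2i, 1 - b2im1, 1 - a0
    return b2ip1 & (nb2im1 | na0) & (nb2i | nb2im1) & (nb2i | na0)


def neg_cin_rule(b2ip1, b2i, b2im1, a0):
    """Rule from the text: set for (-2a), or for (-a) when a0 is zero."""
    group = (b2ip1, b2i, b2im1)
    if group == (1, 0, 0):                  # subtract 2a
        return 1
    if group in ((1, 0, 1), (1, 1, 0)):     # subtract a
        return 1 - a0
    return 0


# All 16 input combinations give the same value for the three descriptions.
for bits in product((0, 1), repeat=4):
    assert neg_cin_sop(*bits) == neg_cin_pos(*bits) == neg_cin_rule(*bits)
```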
The partial products generated using the modified Booth encoding (MBE) scheme
for a multiplier with 8-bit operands are shown in Figure 3-12. The sign bits are found
using the same logic used for the partial product bits. Only one of the sign bits for each
partial product needs to be calculated; the rest of the sign bits are found by
sign-extending.
[Dot diagram: bit positions 15 to 0 across the top and partial products 1 to 4 down the side; the legend distinguishes sign bits, partial product bits, and neg_cin bits.]
Figure 3-12: Partial Products using an MBE Scheme
3.3.3.2 Sign-Extension Correction
This section presents a method of reducing the number of sign bits in Figure 3-12.
Sign-extension of the partial products is required to ensure that negative partial products have
the proper two's complement representation. However, if the number of sign bits can be
reduced, the speed of the multiplier will increase, and less hardware will be required.
A method for sign-extension correction similar to those given in [6, 10] follows.
Assume that each of the partial products shown in Figure 3-12 is negative. Therefore,
each of the sign bits will be set. The sum of the sign bits will have a constant value, as
shown in Figure 3-13.









      1 1 1 1 1 1 1 1
      1 1 1 1 1 1
      1 1 1 1
    + 1 1
      ---------------
      1 0 1 0 1 0 1 1
Figure 3-13: Sum of Sign Bits for Negative Partial Products
In general, the sum of the sign bits (with the assumption that all sign bits are set)
will have the constant value 10101...1011, where the number of leading '10' values
depends on the operand length. The constant value can be used to replace all the
sign-extension bits given in Figure 3-13. It will be correct as long as all the partial products
are actually negative. If a partial product turns out to be positive, one is added to the bit
position corresponding to the least significant sign-bit for the given partial product.
If the operands of a signed multiplier are n-bit values there will be (n/2) partial
products, not including the extra neg_cin bit for the last partial product. The first partial
product will need three sign-extension bits in bit positions n + 2 to n. The values of these
sign-extension bits are assigned according to Equation (3-18). The subscript on se
gives the bit positions of the sign-extension bits.
$se_{n+2:n} = \begin{cases} 100 & \text{if partial product} \ge 0 \\ 011 & \text{if partial product} < 0 \end{cases}$    (3-18)
For partial products 2 through n/2, there will be two sign-extension bits appended
to the most significant end of the respective partial products. The bit position will depend
on the associated partial product, but will follow the general pattern given in Figure 3-12.
The values that these bits take on are given in Equation (3-19). The most significant
sign-extension correction bit will always be set, which corresponds to the set bits in the
constant value given in Figure 3-13. If the partial product is negative as assumed, the
least significant sign-bit will be zero. For a positive partial product, the assumption was
wrong, so the least significant sign-bit must be set.
Partial products 2 through n/2: $se = \begin{cases} 11 & \text{if partial product} \ge 0 \\ 10 & \text{if partial product} < 0 \end{cases}$    (3-19)
Three examples using this sign-extension correction method are given in Figure
3-14. The examples assume that the operands of the multiplier are signed 8-bit values.
Only the sign-bits are considered. For each example, the calculation on the left sums
the sign bits, while the calculation on the right uses the sign-extension correction method.
For each example the resulting sum is the same, indicating that the sign-extension correction
method gives an equivalent representation of the sum of the sign bits.
Example 1: The First Partial Product is Positive
Bit:  15 14 13 12 11 10  9  8    Bit:  15 14 13 12 11 10  9  8
       0  0  0  0  0  0  0  0                          1  0  0
       1  1  1  1  1  1                             1  0
       1  1  1  1                             1  0
   +   1  1                         +   1  0
      -----------------------          -----------------------
       1  0  1  0  1  1  0  0           1  0  1  0  1  1  0  0
Example 2: All Partial Products are Positive
Bit:  15 14 13 12 11 10  9  8    Bit:  15 14 13 12 11 10  9  8
       0  0  0  0  0  0  0  0                          1  0  0
       0  0  0  0  0  0                             1  1
       0  0  0  0                             1  1
   +   0  0                         +   1  1
      -----------------------          -----------------------
       0  0  0  0  0  0  0  0           0  0  0  0  0  0  0  0
Example 3: The Second Partial Product is Positive
Bit:  15 14 13 12 11 10  9  8    Bit:  15 14 13 12 11 10  9  8
       1  1  1  1  1  1  1  1                          0  1  1
       0  0  0  0  0  0                             1  1
       1  1  1  1                             1  0
   +   1  1                         +   1  0
      -----------------------          -----------------------
       1  0  1  0  1  1  1  1           1  0  1  0  1  1  1  1
Figure 3-14: Sign-Extension Correction Examples for a Signed Multiplier
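The three examples cover only some of the sign combinations. A brief sketch, assuming the 8-bit case and the field values of Equations (3-18) and (3-19), checks every combination of positive and negative partial products. The helper names `full_sign_extension` and `corrected_sign_extension` are hypothetical, and sums are truncated to 16 bits.

```python
from itertools import product

N = 8                       # operand width (assumption: the 8-bit case of Figure 3-14)
MASK = (1 << (2 * N)) - 1   # results are kept to 2N = 16 bits


def full_sign_extension(negative):
    """Sum the sign bits when every partial product is fully sign-extended.

    negative[i] is True if partial product i (0-based) is negative. Its sign
    bits occupy bit positions N + 2*i up to 2N - 1.
    """
    total = 0
    for i, neg in enumerate(negative):
        if neg:
            lsb = N + 2 * i
            total += MASK & ~((1 << lsb) - 1)   # ones from bit lsb to bit 2N-1
    return total & MASK


def corrected_sign_extension(negative):
    """Sum the short sign-extension fields used by the correction method."""
    total = 0
    for i, neg in enumerate(negative):
        lsb = N + 2 * i
        if i == 0:
            field = 0b011 if neg else 0b100     # three bits for the first partial product
        else:
            field = 0b10 if neg else 0b11       # two bits for the remaining partial products
        total += field << lsb
    return total & MASK


for negative in product((False, True), repeat=N // 2):
    assert full_sign_extension(negative) == corrected_sign_extension(negative)

# Example 1 of Figure 3-14: only the first partial product is positive.
print(format(corrected_sign_extension((False, True, True, True)) >> N, "08b"))  # 10101100
```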
The partial products generated for an 8-bit multiplier with signed inputs using the
MBE scheme with sign-extension correction are given in Figure 3-15. A triangle with a 1
inside indicates that the bit will always be set.
[Dot diagram: bit positions 15 to 0 across the top and partial products 1 to 4 down the side; the legend distinguishes bits that are always set, partial product bits, and neg_cin bits.]
Figure 3-15: MBE Scheme with Sign-Extension Correction [9]
3.3.4 Higher Radix Booth Encoding
Radix-4 Booth encoding produces half as many partial products as the standard Booth
encoding method, at the expense of a larger design. Methods of using even higher
radices, such as 8 or 16, have been proposed such that even fewer partial products are
produced [24]. As the radix increases, the logic becomes more complex, causing the area
to increase while the speed of the partial product generation stage will decrease.
The main difficulty with using a radix greater than four is that higher multiples of
the multiplicand are required. For example, a radix-8 scheme will require that 3 times
the multiplicand be calculated. One approach to finding three times the multiplicand
would be to use an adder: (3a = 2a + a), where 2a is found by shifting to the left by one
bit. However, this addition would consume valuable time and area. Methods of
increasing the radix of the Booth encoding method past four generally become too
complicated [8]. The extra cost applied to the partial product generation stage typically
does not justify the savings achieved in the partial product reduction stage.
3.4 Partial Product Reduction
Once the partial products have been generated, they need to be summed to give the
desired product. The addition of the partial products generally occurs in two stages. In
the partial product reduction stage either adders or compressors are used to sum the
partial products so that there are no more than two bits in each bit position, or column.
The third and final stage, the final addition stage, will sum the two binary numbers from
the partial product reduction stage to produce the product. It would be inefficient to sum
the partial products using only full and half adders due to the propagation of the carry
bits. Therefore, two stages are used, where the final addition stage uses additional logic
to increase the speed of carry propagation.
The partial product reduction stage is implemented using full and half adders. A
full adder has three inputs and two outputs whose value represents the binary sum of the
inputs. The truth table for a full adder is given in Table 3-8, and a gate-level schematic is
given in Figure 3-16 [25]. Similarly, a half adder has two inputs and two outputs that are
the sum of the two inputs. The truth table for a half adder is given in Table 3-9, and
Figure 3-17 shows the corresponding schematic.
The full and half adders are organized in a tree-like structure, or a partial product
reduction tree (PPRT) [17]. The full adder or (3,2) counter has 3 inputs and 2 outputs.
Other counters also exist such as the (5,3) [26] or (7,3) counter. Although the ratio of
inputs to outputs for these counters is greater than that of the (3,2) counter, the size and speed
of a (3,2) counter allow it to reduce the partial products at rates greater than or equal to any
other counter. The schematic given in Figure 3-16 is one possible schematic for a full
adder; other circuits with equivalent logic can also be drawn.
Table 3-8: Full Adder Truth Table
Inputs Outputs
a b c sum carry
0 0 0 0 0
0 0 1 1 0
0 1 0 1 0
0 1 1 0 1
1 0 0 1 0
1 0 1 0 1
1 1 0 0 1
1 1 1 1 1
Figure 3-16: Full Adder Schematic




Table 3-9: Half Adder Truth Table
Inputs Outputs
a b sum carry
0 0 0 0
0 1 1 0
1 0 1 0
1 1 0 1
Figure 3-17: Half Adder Schematic
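The two truth tables can be reproduced with a small behavioral model. The sketch below builds the full adder from two half adders and an OR gate, which is one possible arrangement and not necessarily the circuit drawn in the thesis figures; the function names are hypothetical.

```python
from itertools import product


def half_adder(a, b):
    """(2,2) counter: returns (sum, carry)."""
    return a ^ b, a & b


def full_adder(a, b, c):
    """(3,2) counter built from two half adders and an OR gate.

    In this arrangement the c input passes through only the second XOR
    on its way to the sum output.
    """
    s1, c1 = half_adder(a, b)
    s, c2 = half_adder(s1, c)
    return s, c1 | c2


# The outputs always encode the number of inputs that are set.
for a, b, c in product((0, 1), repeat=3):
    s, carry = full_adder(a, b, c)
    assert a + b + c == s + 2 * carry
```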
In this section several different methods of implementing the partial product
reduction stage of a multiplier or MAC will be presented. The benefits of each algorithm
will be discussed. The Wallace Tree, Dadda's Algorithm, the Three Dimensional
Method, and a series of 4:2 Compressors can each be used to reduce the partial products.
3.4.1 Wallace Tree Multipliers
In 1964, Wallace [12] proposed a multiplication architecture that offered a speed increase
over the traditionally used array multiplier. An array multiplier uses carry-propagate
adders to sum the partial products, so that the delay of the multiplier grows linearly as the
operand size increases. A Wallace tree multiplier [3, 7, 10, 11-14] uses parallel carry
save adders to sum the partial products such that the delay of the multiplier grows
proportionally to the logarithm of the operand size.
The original Wallace implementation only considered unsigned operands.
Therefore, the partial products are generated using AND gates. The partial products are
summed using (3,2) and (2,2) counters using a Wallace tree to form two numbers. The
two outputs from the Wallace tree are input to a fast carry-propagate adder to form the
product. Although Wallace assumed that the operands are unsigned, the partial product
reduction method can be applied to a signed multiplier with some modification. This
section will present the Wallace method of summing the partial products for an unsigned
multiplier.
Consider an unsigned multiplier with 8-bit inputs. There will be eight 8-bit partial
products formed by ANDing each bit of the multiplier with the multiplicand. The LSB of
each partial product will have the same weight as the bit of the multiplier used in its
formation.
A Wallace tree arranges the partial products into groups of three in each stage. In
each group, if there are two bits in a column a half adder is used to sum the partial
products, while a full adder is used to sum three bits in a column. The sum and carry bits
from these additions are organized in the next stage, where the procedure continues until
there are at most two bits in each column.
The bits to be summed in each stage of a Wallace tree can be considered as
elements in a matrix. For a multiplier with n-bit inputs, the height of the matrix, h_j, at a
given stage j follows from the initial height h_0 = n:
$h_{j+1} = 2\lfloor h_j / 3 \rfloor + h_j \bmod 3$    (3-20)
The structure of an 8-bit Wallace tree multiplier is given in Figure 3-18. Each of
the dots represents a partial product, sum, or carry bit. In stage 1 there are eight 8-bit
partial products. The first three partial products are grouped together, and summed using
(3,2) and (2,2) counters, as represented by the circled bits. Partial products four through
six are similarly grouped and summed. The last two partial products are moved to stage
2 without modification. After stage 1, the eight partial products have been reduced to six
numbers. These six numbers are summed in two groups, resulting in four numbers in
stage 3. The process is continued as shown.






Figure 3-18: An 8-Bit Wallace Multiplier
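The stage-by-stage row counts described above (eight rows, then six, then four, and so on) can be generated with a few lines. This sketch assumes the grouping rule from the text: every full group of three rows becomes two rows, and leftover rows pass through unchanged. The function name `wallace_heights` is hypothetical.

```python
def wallace_heights(n):
    """Matrix height at each stage of a Wallace tree that starts with n rows."""
    heights = [n]
    while heights[-1] > 2:
        h = heights[-1]
        # each full group of three rows yields a sum row and a carry row
        heights.append(2 * (h // 3) + h % 3)
    return heights


print(wallace_heights(8))            # [8, 6, 4, 3, 2]
print(len(wallace_heights(8)) - 1)   # 4 reduction stages
```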
The number of (3,2) adders and the length of the final adder for a Wallace
multiplier can be calculated based on the operand length, n, and the number of stages in
the Wallace tree, S [14]. For 3 < n < 5:




$\#(3,2) = n^2 - 4n + 3 + S$    (3-21)
Final Adder Length $= 2n - 2 - S$    (3-22)
$\#(3,2) = n^2 - 4n + 2 + S$    (3-23)
$\#(3,2) = n^2 - 4n + 1 + S$    (3-24)
Final Adder Length $= 2n - 1 - S$    (3-25)
depending on the number of bits after the final stage of the tree. The number of (2,2)
counters is more difficult to estimate, so the next section lists the number of (2,2)
counters for several different values of n. The number of (2,2) counters will always be at
least n.
The structure of a Wallace tree can also be represented using carry save adders in
a block diagram. Such a diagram emphasizes the fact that the addition in each stage of
the Wallace tree occurs in parallel. The greater the value of n, the more additions occur in
parallel, so the delay of the multiplier does not increase linearly, as in an array multiplier.
A block diagram of a Wallace multiplier with 8-bit inputs is given in Figure 3-19. The
lines coming in from the top represent the eight partial products. Therefore, each line is
an 8-bit bus. The right output from each CSA is the sum output, while the left output is
the carry. This diagram represents the same structure given in Figure 3-18.











Figure 3-19: 8-Bit Wallace Multiplier Block Diagram
3.4.2 Dadda Multipliers
In 1965, Dadda [15] proposed a modification to the Wallace multiplier to optimize the
number of (3,2) and (2,2) counters used in the PPRT. A Dadda multiplier [13, 14, 15]
finds the matrix height of each stage by working backwards from the last stage, for which
the height is two. The height of each matrix is the largest integer that is no more than 1.5
times the height of the previous stage. For example, a 16-bit multiplier will have six
reduction stages, with heights: 13, 9, 6, 4, 3, and 2. A Dadda multiplier uses (2,2) and
(3,2) counters to form the reduced matrix with no more than h_j bits in any column, where
h_j is the matrix height of the j-th stage from the end.
$h_1 = 2$    (3-26)
$h_{j+1} = \lfloor 1.5 \cdot h_j \rfloor$    (3-27)
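The height sequence can be generated by working backwards from a final height of two, as described above. The sketch assumes the rule that each earlier stage is 1.5 times the later one, rounded down, and keeps only heights smaller than the number of partial products n. The function name `dadda_heights` is hypothetical.

```python
def dadda_heights(n):
    """Target matrix heights for a Dadda tree, listed from first stage to last."""
    heights = [2]                             # the last stage always has height 2
    while heights[-1] * 3 // 2 < n:
        heights.append(heights[-1] * 3 // 2)  # floor(1.5 * h) using integers
    return heights[::-1]


print(dadda_heights(16))   # [13, 9, 6, 4, 3, 2]
print(dadda_heights(8))    # [6, 4, 3, 2]
```

The 16-bit result reproduces the six stage heights quoted in the text.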
A Dadda multiplier uses the minimum possible number of (3,2) counters. For an
unsigned multiplier with n-bit inputs, there are n^2 bits in the original partial product
matrix. The final two numbers that are to be summed are (2n - 1)-bit and (2n - 2)-bit
values, resulting in 4n - 3 total bits. Each (3,2) counter reduces the number of bits in the
matrix by one, while a (2,2) counter leaves the total unchanged. Therefore, the number of
(3,2) counters and the length of the final adder for a Dadda multiplier can be expressed as:
$\#(3,2) = n^2 - 4n + 3$    (3-28)
Final Adder Length $= 2n - 2$    (3-29)
At least one (2,2) counter is used in each stage of a Dadda reduction tree. The
first reduction stage has (2,2) counters in columns c to n, where column c is the smallest
column number that needs to be reduced by one bit to get to the next level. In the i-th
stage from the end, columns with heights h_i to h_{i+1} - 1 will have a (2,2) counter.
Therefore, columns 2 to n will have one (2,2) counter, giving a total of n - 1 (2,2)
counters.
Again consider an unsigned multiplier with 8-bit inputs, whose partial products
are formed using AND gates. The design of such a multiplier using Dadda's method is
summarized in Figure 3-20, using the same notation as that used in Figure 3-18 for the
corresponding Wallace multiplier.




Figure 3-20: 8-Bit Dadda Multiplier
In general, a Dadda multiplier will have fewer (3,2) and (2,2) adders, but will
have a longer final adder than a Wallace multiplier. Each multiplier will have the same
number of reduction stages, and will have approximately the same delay. On average,
the delay of a Dadda multiplier will be the same or slightly less than that of a Wallace
multiplier, depending on the operand length [14]. Table 3-10 compares the number of
(3,2) and (2,2) adders as well as the length of the final adder needed for various operand
lengths using the Wallace and Dadda methods [14].
Table 3-10: Dadda and Wallace Multiplier Comparison [14]
Multiplier           #(3,2)   #(2,2)   CPA
8 by 8 Dadda            35        7     14
8 by 8 Wallace          38       15     11
16 by 16 Dadda         195       15     30
16 by 16 Wallace       200       54     25
32 by 32 Dadda         899       31     62
32 by 32 Wallace       906      164     55
64 by 64 Dadda        3843       63    126
64 by 64 Wallace      3850      459    117
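The Dadda rows of Table 3-10 can be cross-checked against the closed-form expressions given earlier for an unsigned n-by-n multiplier: n^2 - 4n + 3 (3,2) counters, n - 1 (2,2) counters, and a final adder of length 2n - 2. The sketch below is only an arithmetic check; the names `dadda_rows` and `dadda_counts` are hypothetical.

```python
# (n, #(3,2), #(2,2), CPA length) copied from the Dadda rows of Table 3-10
dadda_rows = [(8, 35, 7, 14), (16, 195, 15, 30), (32, 899, 31, 62), (64, 3843, 63, 126)]


def dadda_counts(n):
    """Counter and adder sizes for an unsigned n-by-n Dadda multiplier."""
    full_adders = n * n - 4 * n + 3   # Equation (3-28)
    half_adders = n - 1               # one (2,2) counter in each of columns 2 to n
    cpa_length = 2 * n - 2            # Equation (3-29)
    return full_adders, half_adders, cpa_length


for n, fa, ha, cpa in dadda_rows:
    assert dadda_counts(n) == (fa, ha, cpa)
```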
The Wallace and Dadda methods use a similar structure to reduce the partial
products. Other methods of reducing the partial products have been proposed to
minimize the delay of a partial product reduction tree. One such method is the Three
Dimensional Method, which will be discussed in the next section.
3.4.3 The Three Dimensional Method
Rather than reducing the partial products of a multiplier or a MAC by considering each
partial product as a row, or level of the reduction tree, each column can be reduced
individually using the Three Dimensional Method (TDM). In general, a PPRT can be
formed using full and half adders to sum the partial products starting from bit position
zero. For each full and half adder, the sum output remains in the same bit position as its
inputs, while the carry output is placed in the next most significant bit position. If the
total number of partial product and carry-in bits to a column is odd, a half adder must first
be used, with the rest of the addition using full adders. If there are an even number of
partial product and carry-in bits, only full adders are used for the column. This restriction
ensures that there will be two bits output from the PPRT for each column (except for bit
0, in which there is only one partial product generated).
The lower 8-bits of an 8-bit multiplier whose partial products are generated using
radix-4 Booth encoding can be summed as given in Figure 3-21. The partial products are
named ppxy in Figure 3-21, where x is the partial product number and y is the bit number.
For example, pp23 represents bit 3 of partial product 2.
















































Figure 3-21: PPRT Example
The PPRT given in Figure 3-21 connects the full and half adders without using a
structured method. As long as the carry-out bits are placed in the next most significant
bit position and the sum bits have the same weight as the input bits, the order in which
the partial product and carry-in bits are summed will not affect the result. However, the
manner in which the full and half adders are connected will affect the timing of the
PPRT.
The Three Dimensional Reduction Method [6, 9, 13, 16, 17, 18] is a procedure for
connecting full and half adders to optimize the speed of a PPRT. The TDM attempts to
equalize the delay of each path of the PPRT.
The TDM considers the 'fast' input to a full adder when determining the
connections in the PPRT. The development of the TDM is based on an LSI 100K 1µ
CMOS-ASIC cell for a full adder [13]. The full adder using this technology can be
represented by the schematic given in Figure 3-16. For such an implementation, the
propagation delay from the a and b inputs to the sum output is two XOR gate delays.
However, the delay from the c input to the sum output is equal to one XOR gate delay.
Therefore, the c input is referred to as the 'fast input'. The propagation delay for the carry
output is the same for all inputs. However, in [6, 9, 13, 16, 17, 18] a level of AND-OR
logic (or equivalently, two levels of NAND logic) is considered to have a propagation
delay equivalent to one XOR gate delay. This value will depend on the technology of
implementation and the circuit techniques used, and can be changed to any value
(including non-integer values). A one XOR equivalent delay for the carry output will be
used here to demonstrate the TDM. The equations for the time assigned to the sum and
carry outputs of a full adder are given in Equations (3-30) and (3-31).
$Delay(s)_{FA} = \max\{Delay(a) + D_{a-s},\ Delay(b) + D_{b-s},\ Delay(c_{in}) + D_{c_{in}-s}\}$    (3-30)
$Delay(c)_{FA} = \max\{Delay(a) + D_{a-c},\ Delay(b) + D_{b-c},\ Delay(c_{in}) + D_{c_{in}-c}\}$    (3-31)
The value of Delay(x) is the arrival time of the x input to the full adder. D_{p-q}
represents the delay assigned from the p input to the q output of the full adder. In
Equations (3-30) and (3-31) the s represents the sum output, c_in is the carry-in or c input
to the full adder, and c is the carry output from the full adder. Using equivalent XOR
gate delays for the D_{p-q} values, Equations (3-30) and (3-31) can be rewritten as:
$Delay(s)_{FA} = \max\{Delay(a) + 2,\ Delay(b) + 2,\ Delay(c_{in}) + 1\}$    (3-32)
$Delay(c)_{FA} = \max\{Delay(a) + 1,\ Delay(b) + 1,\ Delay(c_{in}) + 1\}$    (3-33)
Similarly, equations for the delay of the sum and carry outputs from a half adder are
given in Equations (3-34) and (3-35).
$Delay(s)_{HA} = \max\{Delay(a) + 1,\ Delay(b) + 1\}$    (3-34)
$Delay(c)_{HA} = \max\{Delay(a) + 0.5,\ Delay(b) + 0.5\}$    (3-35)
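The effect of the fast input can be seen by evaluating the delay equations directly. The sketch below encodes the XOR-equivalent delays of Equations (3-32) through (3-35); the function names and the arrival times in the example are hypothetical.

```python
def fa_delays(t_a, t_b, t_c):
    """Arrival times (sum, carry) of a full adder in XOR-equivalent delays."""
    t_sum = max(t_a + 2, t_b + 2, t_c + 1)      # c is the fast input to the sum
    t_carry = max(t_a + 1, t_b + 1, t_c + 1)    # same delay from every input
    return t_sum, t_carry


def ha_delays(t_a, t_b):
    """Arrival times (sum, carry) of a half adder in XOR-equivalent delays."""
    return max(t_a + 1, t_b + 1), max(t_a + 0.5, t_b + 0.5)


# Three signals arriving at times 0, 0 and 1 (hypothetical values).
print(fa_delays(0, 0, 1))   # late signal on the fast c input: (2, 2)
print(fa_delays(1, 0, 0))   # late signal on the a input:      (3, 2)
```

Placing the latest signal on the c input hides one XOR delay, which is the observation the TDM builds on.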
Using these equations, each signal in the PPRT is assigned a time corresponding
to the number of equivalent XOR propagation delays required to produce the signal.
Each partial product bit is assigned a time of zero, assuming that all the partial product
bits arrive at the PPRT at the same time. This assumption can be modified if it is found
to be inaccurate for a specific partial product generation scheme.
The TDM modifies the procedure described previously for creating a PPRT by
establishing a more formal method of connecting the full and half adders of the tree.
Before each full or half adder is placed in the PPRT, the remaining signals to be summed
in the column are sorted in ascending order according to their time. If a full adder is
needed, the three fastest signals (those assigned the smallest time) are assigned to the a,
b, and c inputs, respectively. The slowest of the three inputs must be assigned to the c
input to equalize the delay through the full adder. For a half adder, the fastest two signals
are assigned to the a and b inputs. In contrast to the full adder, order is not as critical for
a half adder because the delay is the same for both inputs. However, if a half adder is
needed, it should be used before any full adders. Each time a full or half adder is placed
in the PPRT, the remaining signals to be summed must be updated and resorted.
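The column procedure described above can be sketched as a small timing simulation. It tracks only arrival times, not logic values, and makes several assumptions: all partial product bits arrive at time zero, the XOR-equivalent delays of Equations (3-32) through (3-35) apply, and carries out of the last listed column are discarded. The name `tdm_reduce` and its argument are hypothetical.

```python
def tdm_reduce(column_heights):
    """Reduce each column to at most two bits; return the arrival times left per column.

    column_heights[k] is the number of partial product bits in bit position k.
    """
    carries = []                       # carry bits entering the current column
    result = []
    for height in column_heights:
        signals = [0.0] * height + carries
        carries = []
        # an odd number of bits needs one half adder, placed before any full adders
        use_half_adder = len(signals) > 2 and len(signals) % 2 == 1
        while len(signals) > 2:
            signals.sort()             # fastest (smallest time) first
            if use_half_adder:
                a, b = signals[:2]
                signals = signals[2:] + [max(a, b) + 1]
                carries.append(max(a, b) + 0.5)
                use_half_adder = False
            else:
                a, b, c = signals[:3]  # c is the slowest of the three chosen signals
                signals = signals[3:] + [max(a + 2, b + 2, c + 1)]
                carries.append(max(a, b, c) + 1)
        result.append(sorted(signals))
    return result


print(tdm_reduce([1, 2, 3, 4]))
```

Re-sorting after every adder placement corresponds to updating the list of remaining signals each time a component is added to the tree.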
An example comparing the TDM delays to a non-TDM arrangement is given in
Figure 3-22. The numbers shown are the time values (in equivalent XOR delays)
assigned to each input and output signal. This example shows that the TDM results in


















Figure 3-22: Delay Comparison for Two Signal Arrangements [13]
Figures 3-23 and 3-24 compare a non-TDM design to a PPRT using the TDM.
These two figures show the PPRT for the lower 8-bits of an 8-bit multiplier, and assume
that the partial products have been generated using the modified radix-4 Booth encoding.
Figure 3-23 uses the same setup as Figure 3-21, except now the times assigned to each
signal are shown. These examples show that the TDM PPRT equalizes the delay for all
paths, making the critical path shorter. The same number of full and half adders are used
for both examples; the only difference is how the components are connected.
Figure 3-23: Non-TDM PPRT



























































Figure 3-24: TDM PPRT
The Wallace, Dadda, and TDM methods of partial product reduction each use full
and half adders to sum the partial products. An alternative method of summing the
partial products using 4:2 compressors will be described next.
3.4.4 4:2 Compressors
The final method of partial product reduction to be discussed uses a series of
compressors. Possible types are the 4:2, 5:2, or 9:2 compressors. Of the three, the 4:2
compressor is the most commonly used because it offers the best trade-off in terms of
speed and area [8, 13]. A 4:2 compressor has four inputs and two outputs. Therefore,
using one level of 4:2 compressors will reduce the number of partial products by a factor of
2. The four input bits have the same weight; that is, they are partial product bits from the
same column. There is a sum and a carry output from the 4:2 compressor, where the sum
remains in the same column as the input bits, and the carry is shifted into the next most
significant column. There are also Cin and Cout bits used internally by the 4:2
compressor. These values are not considered inputs or outputs because the carry-out
(Cout) of the compressor is connected to the carry-in (Cin) of the next most significant
compressor.














Figure 3-25: A 4:2 Adder Built using Full Adders
Figure 3-25 shows the structure of a 4:2 compressor for one bit position. A 4:2
compressor block can be viewed as having five inputs (four input bits and a carry-in from
the previous compressor) and three outputs (sum, carry, and carry-out). By properly
connecting 4:2 compressors, four 2-bit numbers can be summed as shown in Figure 3-26.
The same structure can be applied to sum larger numbers as well.










Figure 3-26: 4:2 Compressors to Sum Four Numbers
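A behavioral sketch of a 4:2 compressor made from two full adders is given below. The wiring is one plausible arrangement and may differ from the one drawn in the figures; the names `full_adder` and `compressor_4_2` are hypothetical. The check confirms that the three outputs always encode the number of set inputs, with the carry and carry-out each carrying twice the weight of the sum.

```python
from itertools import product


def full_adder(a, b, c):
    """Return (sum, carry) of three bits."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def compressor_4_2(x1, x2, x3, x4, c_in):
    """4:2 compressor built from two full adders; returns (sum, carry, c_out)."""
    s1, c_out = full_adder(x1, x2, x3)    # c_out does not depend on c_in
    s, carry = full_adder(s1, x4, c_in)
    return s, carry, c_out


for bits in product((0, 1), repeat=5):
    s, carry, c_out = compressor_4_2(*bits)
    assert sum(bits) == s + 2 * (carry + c_out)
```

Because c_out is formed without c_in, a row of these compressors does not ripple a carry along its length.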
The advantage of using a 4:2 compressor is that the structure is more regular than
that of a Wallace Tree or the Three-Dimensional Method. However, the regular structure
can reduce the speed of the PPRT due to interconnects that are not optimized. Using the
same XOR-equivalent model used for the TDM [6, 9, 13, 16, 17, 18] to analyze the speed
of a circuit, the critical path of the compressor will be 4 XOR gate delays from the full
adder model shown in Figure 3-25. To speed up the circuit, the logic can be redesigned
as shown in Figure 3-27 [8].
Figure 3-27: Redesigned 4:2 Compressor Logic [8]
By redesigning the logic of the 4:2 compressor, the critical path reduces to 3 XOR
gate delays [8]. However, a similar speed improvement can be achieved by considering
the 'fast' input to a full adder when constructing the 4:2 compressors using full adders.
An alternate method of creating a 4:2 compressor from full adders is given in Figure
3-28, where the sum output of one full adder is connected to the c input of the second full
adder and the Cin input is connected to the a input of the second full adder. This design
also gives a critical path of 3 XOR gate delays, using the same delay estimates used for
the TDM.















Figure 3-28: Alternate 4:2 Compressor using Full Adders
Using the 4:2 compressor method restricts the full adder interconnections by
requiring that the Cin bit be connected to the a input of the second full adder. This
restriction gives the 4:2 compressor a regular structure, but doing so does not fully
optimize the speed of the adder tree. The TDM looks at each path individually,
optimizing the speed of the PPRT by equalizing each path of the tree. Therefore, using
the TDM will give the fastest design using this delay model.
3.5 Addition Algorithms
Once the partial products of a multiplier or a MAC have been reduced to two bits per
column, they need to be summed using a final adder. The advantages and disadvantages
of several methods of performing binary addition will be presented in this section.
3.5.1 Ripple Carry Adder
The simplest adder is constructed by wiring together full adders in series. For example, a
16-bit adder can be made from 16 full adders by connecting the carry-out bit of one full
adder to the carry-in of the next most significant bit. The carry-in to the least significant
bit is tied to zero. Such an adder is known as a Ripple Carry Adder (RCA) [7, 19, 20].
The Ripple Carry Adder is the simplest and most area efficient type of adder.
However, as the operands grow in length, the time needed to compute the result increases
linearly. For applications in which speed is not important, a ripple carry adder is the best
choice.
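A bit-serial model makes the linear carry chain explicit. The sketch below is a behavioral illustration only; the function name `ripple_carry_add` and the default width of 16 bits are assumptions chosen to match the example in the text.

```python
def ripple_carry_add(a, b, n=16):
    """Add two n-bit unsigned values one bit at a time; returns (sum, carry_out)."""
    carry = 0                                   # carry-in to the LSB is tied to zero
    total = 0
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        s = ai ^ bi ^ carry                                 # full adder sum
        carry = (ai & bi) | (ai & carry) | (bi & carry)     # full adder carry
        total |= s << i
    return total, carry


print(ripple_carry_add(0xFFFF, 1))   # (0, 1): the carry ripples through all 16 bits
```

Each loop iteration needs the carry from the previous one, which is why the delay grows linearly with the operand length.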
3.5.2 Carry-Lookahead Adder
A carry-lookahead adder (CLA) is faster than a ripple carry adder, but requires more
hardware to implement. The inputs to an adder contain all of the information needed to
generate the carry-in to each bit position. Therefore, instead of waiting for the carry to
propagate from least significant bit to the most significant bit, as in a ripple carry adder,
some logic can be developed to generate the carry-in to each bit in parallel.
Carry-lookahead adders [3, 7, 19, 21, 22] use propagate and generate signals to
determine the carry-in to each bit position:
$g_i = a_i b_i$    (3-36)
$p_i = a_i + b_i$    (3-37)
where the subscript i refers to the i-th bit of the respective number. These equations
indicate that a carry-out of bit position i will occur if both of the operands are set (a carry
is generated). Similarly, if the carry-in to bit position i is one, that carry-in bit will
propagate through to the next bit position if one or both of the operands are set. Using
the generate and propagate bits, the carry-in bits can be expressed as follows:
$c_1 = g_0 + p_0 c_0$    (3-38)
$c_2 = g_1 + p_1 g_0 + p_1 p_0 c_0$    (3-39)
$c_3 = g_2 + p_2 g_1 + p_2 p_1 g_0 + p_2 p_1 p_0 c_0$    (3-40)
$c_4 = g_3 + p_3 g_2 + p_3 p_2 g_1 + p_3 p_2 p_1 g_0 + p_3 p_2 p_1 p_0 c_0$    (3-41)
Equations (3-38) through (3-41) indicate that a carry-in bit is set if an earlier bit generates
a carry and each of the next most significant bits allows the carry to propagate.
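The carry expressions can be verified against ordinary addition for every 4-bit operand pair. The sketch assumes the usual definitions of the generate bit (the AND of the operand bits) and the propagate bit (the OR of the operand bits); the function name `cla4_carries` is hypothetical.

```python
def cla4_carries(a, b, c0=0):
    """Carries c1..c4 of a 4-bit adder, each computed directly from g, p and c0."""
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(4)]   # generate bits
    p = [((a >> i) & 1) | ((b >> i) & 1) for i in range(4)]   # propagate bits
    c1 = g[0] | p[0] & c0
    c2 = g[1] | p[1] & g[0] | p[1] & p[0] & c0
    c3 = g[2] | p[2] & g[1] | p[2] & p[1] & g[0] | p[2] & p[1] & p[0] & c0
    c4 = (g[3] | p[3] & g[2] | p[3] & p[2] & g[1] | p[3] & p[2] & p[1] & g[0]
          | p[3] & p[2] & p[1] & p[0] & c0)
    return c1, c2, c3, c4


for a in range(16):
    for b in range(16):
        for c0 in (0, 1):
            # reference: carry into bit k from adding the k low-order bits
            expected = tuple((((a & m) + (b & m) + c0) >> k) & 1
                             for k, m in ((1, 1), (2, 3), (3, 7), (4, 15)))
            assert cla4_carries(a, b, c0) == expected
```

None of the four carries depends on another carry other than c0, so all of them can be evaluated in parallel.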
Using generate and propagate bits allows the carry-in to each bit to be found using
two levels of AND-OR logic. Once the carry-in bit is known, a full adder can be used to
produce the individual sum bits. Ideally, a carry-lookahead adder of any width can be
implemented using five levels of logic. However, from Equations (3-38) through (3-41),
each successive carry-in bit has one more term than the previous carry-in. To obtain the
carry-in to bit 1, a two-input AND gate and a two-input OR gate are required. To
produce the carry-in to bit 4, a 5-input, a 4-input, a 3-input, and a 2-input AND gate are
needed, as well as a 5-input OR gate. The maximum fan-in limits the number of
inputs a gate can have. Both the maximum fan-in and the effect of fan-in on
the propagation delay of a gate depend on the technology used to implement the
circuit.
A 4-bit carry-lookahead adder could be designed using Equations (3-38) through
(3-41). For this case, the carry-in to the LSB ($c_0$) is set to zero, so the last term of each
of these equations does not need to be calculated. If the same circuit were to be used as an
adder as well as a subtractor, these terms would be needed to set the carry-in to the LSB
of the subtractor.
As the operands grow in length, finding the carry-in bits using only propagate and
generate bits will not be adequate. To reduce the number of terms in the equations for the
carry-in bits, a second level of abstraction can be implemented, using the carry-lookahead
approach. A 16-bit adder can be constructed using four 4-bit carry-lookahead adders. In
order to do this, "super" generate, propagate, and carry bits must be formed. The
equations for the "super" propagate bits are as follows:
$P_0 = p_3 p_2 p_1 p_0$ (3-42)
$P_1 = p_7 p_6 p_5 p_4$ (3-43)
$P_2 = p_{11} p_{10} p_9 p_8$ (3-44)
$P_3 = p_{15} p_{14} p_{13} p_{12}$ (3-45)
where the P's are the propagate bits for the second level of abstraction, and the p's are
from Equation (3-37). These equations show that a carry can propagate through a group
of four bits only if each of the individual propagate bits is set.
The equations for the "super" generate bits are given below:
$G_0 = g_3 + p_3 g_2 + p_3 p_2 g_1 + p_3 p_2 p_1 g_0$ (3-46)
$G_1 = g_7 + p_7 g_6 + p_7 p_6 g_5 + p_7 p_6 p_5 g_4$ (3-47)
$G_2 = g_{11} + p_{11} g_{10} + p_{11} p_{10} g_9 + p_{11} p_{10} p_9 g_8$ (3-48)
$G_3 = g_{15} + p_{15} g_{14} + p_{15} p_{14} g_{13} + p_{15} p_{14} p_{13} g_{12}$ (3-49)
These equations indicate that a group generates a carry if one of the bits in the
group generates a carry and each of the more significant propagate bits in the group is set.
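Using the "super" generate and propagate bits, the carry-in to each
four-bit group can be found using Equations (3-50) through (3-53).
$C_1 = G_0 + P_0 c_0$ (3-50)
$C_2 = G_1 + P_1 G_0 + P_1 P_0 c_0$ (3-51)
$C_3 = G_2 + P_2 G_1 + P_2 P_1 G_0 + P_2 P_1 P_0 c_0$ (3-52)
$C_4 = G_3 + P_3 G_2 + P_3 P_2 G_1 + P_3 P_2 P_1 G_0 + P_3 P_2 P_1 P_0 c_0$ (3-53)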
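Equations (3-50) through (3-53) were formed using the same reasoning used to
develop Equations (3-38) through (3-41). A block diagram showing how a 16-bit
carry-lookahead adder can be built from 4-bit blocks is shown in Figure 3-29.
[Block diagram: four 4-bit adders producing sum[15:12], sum[11:8], sum[7:4], and sum[3:0]]
Figure 3-29: 16-Bit Carry-Lookahead Adder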
In Figure 3-29, each block represents a 4-bit carry-lookahead adder. The sum
values in brackets indicate the sum bits produced by each 4-bit adder. Using the same
method, another level of abstraction can be added to form a 64-bit adder.
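The two-level scheme can be modeled in Python as a sketch. The helper `lookahead` below is a hypothetical name; it evaluates the same flat AND-OR pattern at both levels, first on the bit-level g and p signals and then on the "super" G and P signals. Reusing it with a carry-in of zero to obtain each group generate bit is a modeling shortcut, not a statement about how the hardware is wired.

```python
def lookahead(g, p, c0):
    """Carries [c0, c1, c2, c3, c4] from four (g, p) pairs as flat AND-OR terms."""
    carries = [c0]
    for k in range(4):
        term, prod = g[k], p[k]
        for j in range(k - 1, -1, -1):
            term |= prod & g[j]   # p_k ... p_(j+1) g_j
            prod &= p[j]
        carries.append(term | (prod & c0))
    return carries


def cla16_add(a, b, c0=0):
    """16-bit adder from four 4-bit lookahead groups; returns (sum, carry_out)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(16)]
    p = [((a >> i) | (b >> i)) & 1 for i in range(16)]
    G, P = [], []
    for grp in range(4):
        gg, pp = g[4 * grp:4 * grp + 4], p[4 * grp:4 * grp + 4]
        P.append(pp[0] & pp[1] & pp[2] & pp[3])   # all four bits propagate
        G.append(lookahead(gg, pp, 0)[4])          # group carry-out with carry-in 0
    C = lookahead(G, P, c0)                        # carry-in to each 4-bit group
    total = 0
    for grp in range(4):
        c = lookahead(g[4 * grp:4 * grp + 4], p[4 * grp:4 * grp + 4], C[grp])
        for i in range(4):
            bit = 4 * grp + i
            total |= (((a >> bit) ^ (b >> bit) ^ c[i]) & 1) << bit
    return total, C[4]
```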
A carry-lookahead adder generates the carry-out of the most significant bit several
times faster than a ripple carry adder. For example, a 16-bit ripple carry adder will take
32 gate delays (16 bits x 2 gate delays per bit = 32) to find the carry-out of the sign bit.
A carry-lookahead adder using the second level of abstraction will take 5 gate delays to
calculate the carry-out of bit 15 (1 level of logic for $g_i$ and $p_i$, 2 levels for G and P, 2
levels for the group carries $C_i$), and 9 gate delays to calculate the sum (5 levels for $C_i$, 2
levels for $c_i$, and 2 levels for the sum). Therefore, a 16-bit carry-lookahead adder is almost
four times faster than a ripple carry adder [3]. In the previous calculations it was assumed
that an AND gate and an OR gate have the same unit delay, regardless of the number of inputs.
Making this assumption allows the number of levels of logic to be proportional to the
time needed for a signal to propagate through the circuit.
3.5.3 Carry-Select Adder
A carry-select adder allows addition to be performed even faster than a carry-lookahead
adder, at the expense of increased hardware. A carry-select adder [7, 10] performs two
additions in parallel: one assuming the carry-in is zero, the other assuming the carry-in is
one. Once the carry-in is actually known, the correct sum is selected.
For example, an 8-bit adder can be constructed from three 4-bit carry-lookahead
adders as shown in Figure 3-30.
Figure 3-30: 8-Bit Carry-Select Adder
In Figure 3-30, a is added to b, where the subscript indicates an individual bit of
the respective operand. The carry-out of the rightmost carry-lookahead adder is used as
the select bit on all four of the MUX's shown. The eight-bit sum is indicated by $s_7$
through $s_0$.
Any type of adder can be used to implement a carry-select adder. For example,
rather than using a carry-lookahead adder, a ripple carry
adder could have been used.
However, it will take eight gate delays to obtain the carry-out of bit 3 for a ripple carry
adder, compared to 3 gate delays for a carry-lookahead adder. An 8-bit carry-select adder
using three 4-bit ripple carry adders
will be nearly twice as fast as an 8-bit ripple carry
adder, but requires approximately 50% more hardware.
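A behavioral sketch of the 8-bit carry-select arrangement follows. The 4-bit block `add4` is a stand-in that uses Python integer addition, since the text notes that any adder type can be used for the blocks; both function names are illustrative.

```python
def add4(a, b, cin):
    """Stand-in 4-bit adder block; returns (sum, carry_out)."""
    t = (a & 0xF) + (b & 0xF) + cin
    return t & 0xF, t >> 4


def carry_select_add8(a, b):
    """8-bit carry-select adder built from three 4-bit blocks."""
    lo_sum, lo_cout = add4(a, b, 0)          # lower block, carry-in known to be 0
    hi_if_0 = add4(a >> 4, b >> 4, 0)        # upper block assuming carry-in = 0
    hi_if_1 = add4(a >> 4, b >> 4, 1)        # upper block assuming carry-in = 1
    # MUX: the lower block's carry-out selects the correct upper result
    hi_sum, cout = hi_if_1 if lo_cout else hi_if_0
    return (hi_sum << 4) | lo_sum, cout
```

All three blocks can work at the same time; only the final selection waits for the carry-out of the lower block.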
The operands do not have to be split into two groups to use a carry-select adder.
A 13-bit carry-select adder broken into three groups is shown below. The AND-OR logic
is used to determine the carry-in to bit 8. If this adder is made up of ripple carry adders,
an optimal design will have the leftmost adder be one bit wider than the first two adders.
This is because the extra two-gate delay to generate the carry-in to bit 8 allows an extra
bit to be added in the ripple carry adder (2 gate delays are needed for a carry to ripple
through one bit of a ripple carry adder).
Figure 3-31: 13-Bit Carry-Select Adder
3.5.4 Carry-Skip Adder
A carry-skip adder combines a ripple carry adder with part of a carry-lookahead adder
[27-29]. The size and speed of this type of adder are between those of a ripple carry adder
and a carry-lookahead adder. A carry-skip adder is most practical when rippling
can be done quickly. The logic needed to form the propagate ($P_i$) signal is much simpler
than that needed to form the generate ($G_i$) signal for the second level of abstraction, as
given in Equations (3-42) through (3-49). Therefore, a carry-skip adder only computes the
propagate bits. A block diagram of an example 16-bit carry-skip adder is
shown in Figure 3-32.











Figure 3-32: 16-Bit Carry-Skip Adder
The carry-skip adder above is made from four 4-bit ripple carry adders. Although
not explicitly shown, each adder produces its respective four-bit sum. The P bits are
the "super" propagate signals for the bits in the range indicated by the subscript,
inclusive.
For a carry-skip adder to work properly, the carry-out of each adder must be reset
at the beginning of each addition, which is a drawback of this type of adder. The
carry-out acts as the generate signal for each adder. The carry-out is reset before each
addition to ensure that a false generate signal is not produced.
A carry-skip adder operates as follows: first, each carry-out bit is
reset. Next, each of the four-bit ripple carry adders sums its operands, producing a
carry-out with the assumption that the carry-in is zero. The only case that will change the
carry-out bit is if the carry-in bit is actually set, and the carry is allowed to propagate (the
propagate bit, P, is set). The carry-out bit that is produced when the carry-in is zero can
be viewed as the generate bit; it will be set only if a carry is generated in the
previous four bits. Once the true carry-out of bit 3 ($c_4$) is known, the carry-out of bit 7
may not be correct. Rather than recalculating the sum with the true carry, the AND-OR
logic is used to find the carry-out without waiting for it to actually be calculated. Therefore,
the second addition block is "skipped". However, $c_4$ will affect the sum calculated by the
second 4-bit adder (bits 7 through 4), so that sum will need to be recalculated. The carry-in
to the third adder ($c_8$) is known two gate delays after the carry-out from the first 4-bit
adder has been found. Similarly, $c_{12}$ is known two gate delays after $c_8$ is found, rather
than waiting the eight gate delays required for a four-bit ripple carry adder.
The 16-bit adder of Figure 3-32 requires 20 gate delays to complete (8 for the first
4-bit adder, 4 to skip the middle two adders, and another 8 gate delays for the last adder).
Previously it was determined that a 16-bit ripple carry adder required 32 gate delays to
complete, while a 16-bit carry-lookahead adder completes in 9 gate delays. Depending
on the bit width of the operands, a carry-skip adder often can be made faster by using
variable length ripple carry adders [27, 29]. If the interior blocks are made larger, a
speed increase is possible, but it still will not be as fast as a carry-lookahead adder.
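The skip logic can be sketched functionally in Python. This model assumes fixed 4-bit blocks and the inclusive-OR propagate signal used earlier in the chapter; it captures which values are computed, not their timing, and the function names are illustrative.

```python
def rca4(a, b, cin):
    """4-bit ripple carry block; returns (sum, carry_out)."""
    carry, s = cin, 0
    for i in range(4):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        s |= (ai ^ bi ^ carry) << i
        carry = (ai & bi) | (carry & (ai | bi))
    return s, carry


def carry_skip_add16(a, b):
    """16-bit carry-skip adder from four 4-bit ripple blocks; returns (sum, carry_out)."""
    cin, total = 0, 0
    for blk in range(4):
        a4, b4 = (a >> 4 * blk) & 0xF, (b >> 4 * blk) & 0xF
        _, gen = rca4(a4, b4, 0)        # carry-out assuming carry-in 0 (acts as generate)
        prop = int((a4 | b4) == 0xF)    # block propagate: every bit propagates
        s, _ = rca4(a4, b4, cin)        # block sum recomputed with the true carry-in
        total |= s << 4 * blk
        cin = gen | (prop & cin)        # AND-OR skip logic forms the next carry-in
    return total, cin


# Delay estimate from the text: first block, two skips, last block
print(8 + 2 * 2 + 8)  # 20 gate delays
```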
3.5.5 Conditional Sum Adder
A conditional sum adder [7] computes two sets of outputs for a given group of operand
bits, one assuming the carry-in to the group is one and the second assuming the carry-in
is zero. Once the carry-in is actually known, the correct sum and carry bits are selected
using a MUX.
If the operand width is an integer power of 2, one possible grouping of the operand bits is
in groups of 1, 2, 4, 8, etc. Using this grouping allows an 8-bit adder to be completed in
three steps as shown in Table 3-11.
Table 3-11: 8-Bit Conditional Sum Adder [7]
Step    Carry-in  i        7  6  5  4  3  2  1  0
                  x_i      1  0  1  1  0  1  1  0
                  y_i      0  0  1  0  1  1  0  1
Step 1  0         s_i      1  0  0  1  1  0  1  1
        0         c_{i+1}  0  0  1  0  0  1  0  0
        1         s_i      0  1  1  0  0  1  0
        1         c_{i+1}  1  0  1  1  1  1  1
Step 2  0         s_i      1  0  0  1  0  0  1  1
        0         c_{i+1}  0     1     1     0
        1         s_i      1  1  1  0  0  1
        1         c_{i+1}  0     1     1
Step 3  0         s_i      1  1  0  1  0  0  1  1
        0         c_{i+1}  0           1
        1         s_i      1  1  1  0
        1         c_{i+1}  0
Result            s_i      1  1  1  0  0  0  1  1
In step 1, for i between 1 and 7, $x_i$ is added to $y_i$ twice, once assuming the carry-in
is zero, the other assuming the carry-in is one. The sum in bit position i is represented by
$s_i$, while $c_{i+1}$ is the carry-out of bit position i. For bit 0, it is known that the carry-in is
zero, so an addition assuming a carry-in of one is not necessary. Step 2 computes sums
two bits at a time. The carry-out of bit 0 is zero in step 1 when the carry-in equals 0, so
the sum and carry-out of bit 1 assuming the carry-in is zero are selected (using a MUX),
forming the values for bits 1 and 0 for step 2. Similarly, when the carry-in to bit 2 in step
1 is zero, the carry-out is one. Therefore, the values for bit 3 assuming the carry-in is one
from step 1 are used in step 2. The same process is used to complete the entries for step 2, as
shown. Step 3 uses the same procedure, except now the sum is computed four bits at a
time. After step 3, one more selection is needed to find the result.
An 8-bit conditional sum adder can be constructed in hardware using a level of
full adders and three levels of MUX's. Assuming that two levels of logic are needed to
implement a full adder, and another two levels are used for a MUX, an 8-bit conditional
sum adder completes in 8 gate delays. An n-bit conditional sum adder completes in
$\log_2 n$ steps, if the groups are divided as in the previous example. A conditional sum
adder does not need to be divided into equal-sized subgroups, so it can be used even if the
operand width is not a power of 2. The speed of a conditional sum adder is similar to that of
a carry-lookahead adder, but it is less modular, so carry-lookahead adders are more
frequently used [7].
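The doubling procedure of Table 3-11 can be sketched in Python as follows. The model assumes the operand width is a power of two and keeps, for every group, the (sum, carry-out) pair for both possible carry-ins; the function name and data layout are illustrative.

```python
def conditional_sum_add(x, y, width=8):
    """Conditional sum addition; width must be a power of two. Returns (sum, carry_out)."""
    # Step 1: per-bit results for carry-in 0 and carry-in 1
    groups = []
    for i in range(width):
        xi, yi = (x >> i) & 1, (y >> i) & 1
        groups.append({cin: (xi ^ yi ^ cin, (xi + yi + cin) >> 1) for cin in (0, 1)})
    size = 1
    while len(groups) > 1:
        merged = []
        for k in range(0, len(groups), 2):
            lo, hi = groups[k], groups[k + 1]
            entry = {}
            for cin in (0, 1):
                lo_sum, lo_cout = lo[cin]
                hi_sum, hi_cout = hi[lo_cout]  # MUX: low group's carry-out selects
                entry[cin] = (lo_sum | (hi_sum << size), hi_cout)
            merged.append(entry)
        groups, size = merged, size * 2
    return groups[0][0]  # the carry-in to bit 0 is zero


# Operands from Table 3-11
print(conditional_sum_add(0b10110110, 0b00101101))  # (227, 0), i.e. sum 11100011
```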
3.5.6 Brent-Kung Adder
A Brent-Kung parallel prefix adder uses some of the ideas of a carry-lookahead adder,
but takes a slightly different approach. A Brent-Kung adder [7, 30] performs addition in
three steps. In the first step, the inputs to the adder, A and B, are transformed to two new
variables, U and V. The U and V variables are used to determine whether a carry will
propagate, be generated, or rejected.
Table 3-12: Step 1 of a Brent-Kung Adder
U  V  Description
X  1  Carry-out from this bit equals one (a carry is generated)
1  0  Carry-out from this bit equals one if the carry-in is one (a carry will propagate)
0  0  Carry-out from this bit equals zero (carry reject)
$U = A + B$ (3-54)
$V = AB$ (3-55)
The equations for U and V for a Brent-Kung adder are the same as the propagate
and generate signals, respectively, for a carry-lookahead adder, but are interpreted in a
different manner.
A Brent-Kung adder begins to differ from a carry-lookahead adder in the second
stage. Adjacent U and V bits are used to determine whether a pair of bits will generate,
propagate, or reject a carry. The same reasoning is used to determine the carry for groups
of 2, 4, 8, 16, 32, ... bits at a time, forming a tree structure. In the table below, $U_1$, $V_1$ and
$U_0$, $V_0$ are adjacent values formed using Equations (3-54) and (3-55). Based on these
values, the U and V values shown are formed using Equations (3-56) and (3-57) for the
two bits examined.
Table 3-13: Step 2 of a Brent-Kung Adder
U1  V1  U0  V0  Description                      U  V
0   0   X   X   Bit 1 Rejects the Carry          0  0
0   1   X   X   Bit 1 Generates a Carry          X  1
1   0   0   0   Bit 1 Propagates, Bit 0 Rejects  0  0
1   0   0   1   Bit 1 Propagates, Bit 0 Generates  X  1
1   0   1   0   Bit 1 Propagates, Bit 0 Propagates  1  0
1   0   1   1   Bit 1 Propagates, Bit 0 Generates  X  1
$U = U_1 U_0$ (3-56)
$V = V_1 + U_1 V_0$ (3-57)
In the third step of a Brent-Kung adder, the carry-in to each bit position is found.
If the carry-in to a group of bits and the U, V pair for the group is known, the carry-out
from the group of bits can be calculated. The carry-out of a group can be found as shown
in Table 3-14 and Equation (3-58).
Table 3-14: Step 3 of a Brent-Kung Adder
U  V  c_in  Description      c_out
0  0  X     Carry Reject     0
1  0  0     Carry Propagate  0
1  0  1     Carry Propagate  1
1  1  X     Carry Generate   1
$c_{out} = V + U c_{in}$ (3-58)
The carry-out of each bit position is used as the carry-in to the next most
significant bit position. Once the carry-in to each bit is known, the sum is found using
the exclusive-or of the two input bits with the carry-in to the column. An example
illustrating the structure of a 16-bit Brent-Kung adder is shown in Figure 3-33. The
inputs come in at the top of the diagram, and are transformed to the U, V pair using step
1, as represented by the squares. The numbers in each square indicate the bit position of
the inputs. Next, step 2 is used to find the U, V pair for groups of 2, 4, 8, and 16 bits, as
represented by the circles. The U, V pairs are then combined with the carry-in to the
group to find the carry-out of the group, as pictured using triangles. Last, the sum is
found using the exclusive-or gates at the bottom of the figure.




[Legend: square = Step 1, circle = Step 2, triangle = Step 3]
Figure 3-33: 16-Bit Brent-Kung Adder
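The three steps can be sketched in Python as shown here. The tree follows the usual up-sweep and down-sweep formulation of a Brent-Kung prefix network, which is an assumption: it may not match the exact wiring of Figure 3-33, although it uses the same U, V combining rule. The width is assumed to be a power of two, and the function names are illustrative.

```python
def combine(hi, lo):
    """Step 2: merge the (U, V) pairs of two adjacent groups; hi is more significant."""
    (u1, v1), (u0, v0) = hi, lo
    return (u1 & u0, v1 | (u1 & v0))


def brent_kung_add(a, b, n=16, c0=0):
    """Brent-Kung style prefix adder; n must be a power of two. Returns (sum, carry_out)."""
    # Step 1: U = A + B and V = AB for every bit
    uv = [(((a | b) >> i) & 1, ((a & b) >> i) & 1) for i in range(n)]
    # Up-sweep: build (U, V) for groups of 2, 4, 8, ... bits
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):
            uv[i] = combine(uv[i], uv[i - d])
        d *= 2
    # Down-sweep: fill in the remaining positions so uv[i] covers bits i..0
    d = n // 4
    while d >= 1:
        for i in range(3 * d - 1, n, 2 * d):
            uv[i] = combine(uv[i], uv[i - d])
        d //= 2
    # Step 3: carry-out of each prefix group is V + U * c_in
    carries = [c0] + [v | (u & c0) for u, v in uv]
    total = 0
    for i in range(n):
        total |= (((a >> i) ^ (b >> i) ^ carries[i]) & 1) << i
    return total, carries[n]
```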
3.5.7 Input Delay Considerations
The adders described to this point have been designed based on the assumption that the
inputs are available at the same time. For a multiplier, the outputs from the PPRT will
not arrive at the same time. The arrival time of the inputs will be the shortest for the least
significant bits. Moving from the LSB towards the MSB, the time of arrival will increase
linearly for the first several bits, and will level off for the middle bits. The profile will
then slope back down for the upper bits. This means that the middle bits will take the
longest to arrive, and the upper and lower bits will be available at earlier times. The
exact form of the arrival profile will depend on the methods used to generate and reduce
the partial products and the technology of implementation. To efficiently design the final
adder of a multiplier or a MAC, the time at which the inputs to each column become
stable needs to be considered.
Several methods of designing the final adder to take advantage of the input delay
profile to the final adder have been proposed [9, 13, 31]. Two of the most common
methods of optimizing the speed of the final adder are to adjust the size of the adder
blocks and to use a slower, more area-efficient adder for the LSBs.
For the lower bits, no matter how fast the adder is, the speed at which the two
values can be added will be limited by the time at which the inputs arrive. For example,
let the two values from the PPRT at bit position 4 arrive at time t = Δ, and the two bits in
bit position 5 arrive at time t = 3Δ. If a ripple carry adder can sum consecutive bits with a
delay of Δ due to carry propagation, then the addition will be delayed by the
arrival of the inputs, not by the time to propagate the carry. Therefore, for this case a ripple
carry adder would be the best solution because it requires the least amount of logic, and
will produce a result in the same time as a more complex adder using more hardware. In
general, a ripple carry adder should be used for as many of the least significant bits as
possible, without sacrificing speed. The length of the ripple carry adder will depend on
the arrival times of the bits for each bit position.
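A simple timing model makes this argument concrete. The sketch below assumes that the carry out of a bit settles one fixed delay after the later of its input arrival and its carry-in arrival; the function name and the unit delays are illustrative, not measured values.

```python
def ripple_finish_times(arrival, carry_delay):
    """Time at which the carry-out of each bit settles in a ripple carry adder.

    arrival[i] is the time at which the two inputs of bit i become stable.
    """
    t_carry = 0  # carry-in to the first bit is available at time 0
    done = []
    for t_in in arrival:
        # a bit cannot finish before both its inputs and its carry-in are ready
        t_carry = max(t_in, t_carry) + carry_delay
        done.append(t_carry)
    return done


# Inputs arriving 2 units apart with a 1-unit carry delay: arrival dominates
print(ripple_finish_times([0, 1, 3, 5, 7], 1))  # [1, 2, 4, 6, 8]
# Inputs all arriving together: the carry chain dominates
print(ripple_finish_times([0, 0, 0, 0, 0], 1))  # [1, 2, 3, 4, 5]
```

In the first case a faster adder would not help, because each bit finishes one delay after its own inputs arrive.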
Once the slope of the arrival profile begins to level off such that the arrival time
between consecutive bits becomes less than the time needed for a carry to propagate
through a ripple carry adder, a faster adder is needed. For the upper bits a
carry-






To this point, the discussion has been focused on the design of fast multipliers. Much of
the logic for a multiplier is the same as that for a multiply-accumulator. The design of a
multiplier-accumulator is often considered as a multiplier with one extra partial product
[6, 17, 32]. A single-cycle multiply-accumulator can therefore be designed using the
same three stages as that of a multiplier: partial product generation, partial product
reduction, and final addition.
The multiply-accumulators discussed in the literature, and those provided by
Synopsys as DesignWare components, only produce the multiply-accumulate output. For
this project, the specifications require that two outputs be produced: the multiplied result
(a x b) as well as the multiply-accumulate result (a x b) + c, where a, b, and c are the
inputs to the unit.
A multiply-accumulator that has both a multiply output and a
multiply-accumulate output can be designed by adding minimal hardware to a standard
multiply-accumulate unit. As discussed in Chapter 1, a MAC with two outputs decreases
the number of clock cycles needed to perform certain computations. This section will
present several different proposed solutions for designing a fast multiply-accumulator
that meets the specifications for this problem.
4.2 Design Stages
To meet the specifications for the multiply-accumulator to be designed, modifications
need to be made to the algorithms that have been previously discussed. The MAC for
this project will be designed using the same three stages that are used for a multiplier.
Possible methods of satisfying each of these stages are presented in the next three
sections.
4.2.1 Partial Product Generation
The two main methods of generating partial products for a signed multiplier are the
Baugh-Wooley method and Booth's algorithm. The modified Baugh-Wooley method can
produce the correct partial products for both signed and unsigned operands. However,
the radix-4 Booth encoding method only works for signed operands. Therefore, a few
modifications need to be made for this method to be used for a multiplier-accumulator
with either signed or unsigned inputs.
As discussed before, Booth's algorithm performs two's complement
multiplication. Therefore, to be able to use Booth's algorithm for a multiplier that can
have either signed or unsigned operands, two modifications need to be made. First, the
multiplicand a must be sign-extended by one bit. This bit can be found by ANDing the tc
input with the most significant bit of the multiplicand. For a multiplier with n-bit
operands (bit n - 1 is the MSB, bit 0 is the LSB), one additional bit is appended to the
most significant bit of the multiplicand in bit position n using:
$a_n = tc \cdot a_{n-1}$ (4-1)
where $a_n$ represents bit position n of a, and '$\cdot$' is the logical AND operator. The
multiplicand needs to be sign-extended because the operands can be either signed or
unsigned, depending on the tc input. By sign-extending using Equation (4-1), the
multiplicand becomes a signed value, allowing Booth's algorithm to be used to generate
the partial products.
Second, one more partial product must be created that is zero if tc is one (no
modification is necessary if signed multiplication is being performed), and equals $b_{n-1}$
ANDed with each bit of a, multiplied by $2^n$ if tc = 0, where the operands are n-bit values.









To verify Equation (4-2), assume first that the
inputs are signed. For this case tc
is one and the last partial product is zero because Booth's algorithm already
performs





For unsigned operands, tc will be zero and the last partial product will be used as a
correction term. Again using a four-bit example,




the desired partial products for an unsigned multiplier. The second line shows the actual








A diagram of the partial products formed using this method is given in Figure 4-1
for n = 8 rather than the specified value of n = 32, for simplicity. The method has been
presented in general terms so that it can be easily applied to other even integer values of
n. In Figure 4-1 the partial products are each one bit longer than those in Figure 3-13 and
the c input is shown as a partial product to transform the multiplier into a MAC. Partial
product 5 is the additional correction term needed for unsigned operands. The LSB of this
partial product is the neg_cin bit from partial product 4.
[Partial product diagram: bit positions 15 through 0 across the top; rows 1 through 5 are the partial products and row 6 is the c input; the legend distinguishes sign bits from partial product bits]
Figure 4-1: Partial Products for a Signed/Unsigned MAC
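The two modifications can be checked with an integer-level Python sketch. This is a model of the arithmetic only, not of the bit-level hardware: the function names are hypothetical, the Booth digits are the standard radix-4 recoding of b read as a two's complement number, and the correction term is taken to be $b_{n-1} \cdot a \cdot 2^n$ when tc = 0.

```python
def booth_digits(b, n):
    """Radix-4 Booth digits (each in -2..2) of the n-bit two's complement value b."""
    digits, prev = [], 0
    for k in range(0, n, 2):
        b0, b1 = (b >> k) & 1, (b >> (k + 1)) & 1
        digits.append(-2 * b1 + b0 + prev)
        prev = b1
    return digits


def partial_products(a, b, n, tc):
    """Partial products (as integers) for n-bit a and b; tc = 1 means signed operands."""
    a_msb, b_msb = (a >> (n - 1)) & 1, (b >> (n - 1)) & 1
    a_n = tc & a_msb                 # extension bit of Equation (4-1)
    a_val = a - (a_n << n)           # value of the (n+1)-bit extended multiplicand
    pps = [d * a_val * 4 ** k for k, d in enumerate(booth_digits(b, n))]
    # Extra partial product: zero for signed operands, correction term for unsigned
    pps.append(0 if tc else b_msb * a * 2 ** n)
    return pps


# Unsigned 4-bit example: 13 x 11 = 143
print(sum(partial_products(13, 11, 4, tc=0)))
# Signed 4-bit example: the same bit patterns read as -3 x -5 = 15
print(sum(partial_products(13, 11, 4, tc=1)))
```

Without the extra partial product, the unsigned case would be off by exactly $a \cdot 2^n$ whenever the MSB of b is set, because Booth recoding reads that bit as a sign.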
Once again, sign-extension correction can be used to make the partial products
shorter. The constant sum of the sign bits assuming that all partial products are negative
changes now that the partial products are one bit longer.
[Diagram: the sign-extension bits of each partial product in bit positions 15 through 9, all assumed to be set, and their sum, 0101011]
Figure 4-2: Sign-Extension Correction Constant for a Signed/Unsigned MAC
In general, the sum of the sign bits (with the assumption that all sign bits are set)
will have the constant value 010101...1011. This constant value can be used to replace
all the sign-extension bits given in Figure 4-1. It will be correct as long as all the partial
products are actually negative. If a partial product turns out to be positive, one is added
to the bit position corresponding to the least significant sign-bit for the given partial
product.
Let the multiplicand and multiplier inputs to a MAC be n-bit values. Therefore,
there will be (n/2) + 1 partial products and the c input can be considered the [(n/2) + 2]th
partial product. The sign-extension correction scheme follows the same pattern as that
presented in the section on multiplier designs, but will be given again here with the
necessary modifications for the signed and unsigned MAC case.
Partial product 1 will need three sign-extension bits in bit positions n + 3 to n + 1.
The values of these sign-extension bits are assigned according to Equation (4-4). The
subscript on se indicates the bit positions where the sign-extension bits should be placed,
inclusive.
partial product 1: $se_{(n+3):(n+1)} = \begin{cases} 100 & \text{if partial product} \geq 0 \\ 011 & \text{if partial product} < 0 \end{cases}$ (4-4)
For partial products 2 through [(n/2) - 1], there will be two sign-extension bits,
appended to the most significant end of the respective partial products. The bit position
will depend on the associated partial product, but will follow the general pattern given in
Figure 4-2.
partial products 2 through [(n/2) - 1]: $se = \begin{cases} 11 & \text{if partial product} \geq 0 \\ 10 & \text{if partial product} < 0 \end{cases}$ (4-5)
The (n/2)th partial product will have one sign-correction bit in bit position
(2n - 1). The sign bit will be set if the partial product is positive, and zero if it is
negative, as given by Equation (4-6).
partial product n/2: $se_{2n-1} = \begin{cases} 1 & \text{if partial product} \geq 0 \\ 0 & \text{if partial product} < 0 \end{cases}$ (4-6)
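The correction scheme can be checked exhaustively with a short Python sketch. It assumes that the sign-extension bits of partial product k begin at bit n + 1 + 2(k - 1), which is inferred from the bit positions quoted in the text (n + 1 for the first partial product, 2n - 1 for the last); the function names are illustrative.

```python
from itertools import product


def true_sign_extension(signs, n):
    """Sum (mod 2^(2n)) of full sign extensions; signs[k-1] = 1 if product k is negative."""
    width = 2 * n
    total = 0
    for k, neg in enumerate(signs, start=1):
        lsb = n + 1 + 2 * (k - 1)   # least significant sign-extension bit (assumed)
        if neg:
            total += (1 << width) - (1 << lsb)   # ones from lsb up to bit 2n-1
    return total % (1 << width)


def corrected_sign_extension(signs, n):
    """Sum (mod 2^(2n)) of the short sign-extension fields of the correction scheme."""
    width, last = 2 * n, n // 2
    total = 0
    for k, neg in enumerate(signs, start=1):
        lsb = n + 1 + 2 * (k - 1)
        if k == 1:
            bits = 0b011 if neg else 0b100   # three bits for partial product 1
        elif k < last:
            bits = 0b10 if neg else 0b11     # two bits for the middle partial products
        else:
            bits = 0 if neg else 1           # one bit for partial product n/2
        total += bits << lsb
    return total % (1 << width)


n = 8
print(format(true_sign_extension([1] * (n // 2), n), "016b"))  # 0101011000000000
print(all(true_sign_extension(s, n) == corrected_sign_extension(s, n)
          for s in product((0, 1), repeat=n // 2)))            # True
```

The first line reproduces the constant 0101011 in bit positions 15 through 9 for n = 8.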
The partial products generated for an 8-bit MAC with signed or unsigned inputs
using a modified Booth encoding scheme with sign-extension correction are given in
Figure 4-3. A triangle with a '1' inside indicates that the bit will always be set.
[Partial product diagram: bit positions 15 through 0 across the top; rows 1 through 5 are the partial products with shortened sign extension and row 6 is the c input]
Figure 4-3: Partial Products for a Signed/Unsigned MAC
4.2.2 Partial Product Reduction
The partial product reduction stage of a MAC with two outputs needs to reduce the
partial products formed from the a and b inputs to two binary numbers, and it also needs
to sum the partial products with the c input into two numbers. The same methods of
summing the partial products for a multiplier can be applied to a MAC, except now there
are two pairs of outputs from the partial product reduction tree. Two methods of
achieving this result will be described in Sections 4.2.2.1 and 4.2.2.2. Based on these two
general methods, specific solutions will be described in Section 4.3.
4.2.2.1 Partial Product Reduction - Method 1
Two separate methods for designing a MAC have been created. The main difference
between these two designs is in the partial product reduction stage. The first method
requires four stages. In stage one, the partial products for the product (a x b) are
generated. These partial products are summed in a PPRT to produce two bits per column,
as in a multiplier. At this point, the design splits into two paths, one for each output. To
calculate the MAC output, the two values from the PPRT must be added to the c input.
These three values are reduced to two bits per column using one more row of full adders
in the third stage. The fourth stage uses two final adders: one to calculate the product
output, the other to find the MAC output. The Method 1 design of a MAC is illustrated in
Figure 4-4.












Figure 4-4: MAC Design using Method 1
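The third and fourth stages of Method 1 can be sketched in Python. The model is restricted to unsigned operands, and stages one and two are replaced by a stand-in that simply splits the product into two words whose sum is a x b, since any such pair exercises the later stages in the same way. All names are illustrative.

```python
def carry_save_row(x, y, z, width):
    """One row of full adders: reduces three words to a sum word and a carry word."""
    mask = (1 << width) - 1
    sum_word = (x ^ y ^ z) & mask
    carry_word = (((x & y) | (x & z) | (y & z)) << 1) & mask
    return sum_word, carry_word


def mac_method1(a, b, c, n):
    """Returns (product, mac), both modulo 2^(2n), for unsigned n-bit a and b."""
    width = 2 * n
    mask = (1 << width) - 1
    # Stand-in for stages 1 and 2: two words whose sum equals a * b
    product = (a * b) & mask
    even_bits = int("01" * n, 2)
    pprt_x, pprt_y = product & even_bits, product & ~even_bits & mask
    # Stage 3: one extra row of full adders folds in the c input
    s, cy = carry_save_row(pprt_x, pprt_y, c & mask, width)
    # Stage 4: two final adders, one per output
    return (pprt_x + pprt_y) & mask, (s + cy) & mask


print(mac_method1(200, 100, 12345, 8))  # (20000, 32345)
```

The product output bypasses the extra row of full adders, so only the MAC path pays the additional full adder delay.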
4.2.2.2 Partial Product Reduction - Method 2
The partial product generation stage for Method 2 is identical to that of Method 1.
However, the partial product reduction stage considers the c input as a partial product.
Two PPRT's are used: one to sum the partial products only, and a second to sum the
partial products and the c input. These PPRT's are not entirely separate, however.
Because the two PPRT's sum many of the same values, several of the full and half adders
can be shared. There will be a trade-off between the number of full and half adders
shared and the speed of the MAC. In fact, Method 1 can be considered a special case of
Method 2 in which the maximum number of full and half adders are shared.
The "shared" PPRT using this method produces two numbers whose sum equals the
product output, and two values for the MAC output that are summed in parallel using two
final adders.
[Block diagram: a shared PPRT, which shares the full adders that both paths have in common, feeds two final adders that produce the Product and MAC outputs]
Figure 4-5: MAC Design using Method 2
To optimize the speed of the design, the timing of the MAC output can be given
the highest priority, because it is known that the MAC output will be in the critical path of
the unit. Using this method, the PPRT will only share a full or half adder if it does not
affect the timing of the MAC output.
Method 1 will create a MAC that will require less area but be slower than the
Method 2 approach. In Method 1, the PPRT for the product and the MAC outputs is
shared, and the only extra hardware over that of a multiplier is the extra row of full
adders and one extra final adder. Therefore there will be one extra full adder delay and
slightly more hardware than would be needed for a multiplier.
Method 2 optimizes the MAC for speed. More area will be required because
there are several extra full adders in the shared PPRT. The fewer the number of adders
that are shared in the PPRT, the faster the MAC will be at the expense of additional
hardware. The fastest design using this method will result when the critical path of the
MAC output is not compromised due to the product output. This restriction could be
relaxed to sacrifice speed with the benefit of reduced area. Method 1 and Method 2
should theoretically give the range of possible performance for the MAC, which will be
confirmed in Section 5.
To effectively sum the outputs of the partial product reduction tree, a fast adder
must be designed. The optimal final adder design for a MAC will consider the
varying arrival times of its inputs. Four hybrid adder designs will be presented next.
4.2.3 Final Addition
Four hybrid adder designs that can be used as the final adder of a multiply-accumulator
will be presented in this section. A multiply-accumulator that has both a multiply output
and a multiply-accumulate output will require two final adders. In this section, 64-bit adders
will be considered, but the same methods can be applied to any size adder.
4.2.3.1 Hybrid Adder 1
A 64-bit adder can be designed using a combination of ripple carry, carry-lookahead, and
carry-select adders as shown in Figure 4-6. At the highest level, this adder is made up of
four 16-bit adders, as shown in Figure 4-6(a). The carry-ins to bits 16, 32, and 48 are
found using the equations for a carry-lookahead adder. However, since only these three
carry-in bits are needed, much of the logic for the carry-lookahead adder can be
eliminated.
Each 16-bit adder is made from three 8-bit adders, as shown in Figure 4-6(b).
The lower eight bits are summed in parallel with the upper eight bits. Once the carry-out of
the lower eight bits is known, the correct sum for the upper eight bits is selected.
Similarly, each 8-bit adder is itself a carry-select adder,
formed from three 4-bit ripple carry adders, where the carry-out from the lower four bits
is used to select the correct sum for the upper four bits.
Ripple carry adders are used in this design to reduce the area of the adder.
However, the RCAs are limited to 4 bits to prevent a long carry-propagation delay. To
increase the speed of the adder, carry-select adders are used, which allow several
additions to occur in parallel before the carry-in is actually known. Further speed
increases are achieved using the carry-lookahead method to generate the carry-in to the
16-bit adder blocks.























[Block diagram: each 8-bit adder is built from 4-bit RCAs]
Figure 4-6: Hybrid Adder 1 Block Diagram
4.2.3.2 Hybrid Adder 2
A second 64-bit hybrid adder can also be created using carry-lookahead, carry-select, and
ripple carry adders, but in a different arrangement. Hybrid adder 2 uses a carry-select
adder at the top level, such that two 32-bit blocks are used to compute the
upper and lower 32 bits of the sum. A block diagram illustrating this adder design is
shown in Figure 4-7.
Each 32-bit adder is itself a carry-select adder, in which the upper and lower
blocks are 16 bits wide. Each of the 16-bit adders is formed from four 4-bit ripple carry
adders. The carry-ins to bits 4, 8, and 12 are found using a carry-lookahead adder at the
second level of abstraction. To further speed up the adder, the carry-out of the lower 16
bits of the 32-bit adders can be found using the carry-lookahead method as well.
However, if the carry-out of the lower 16 bits of the 32-bit carry-select adder is already
computed before the sum of the upper 16 bits, then no speed increase will be seen. Therefore, the
extra hardware investment in this case would not be beneficial.
The 16-bit adders used for this adder were designed to be fast and area-efficient.
The four 4-bit ripple carry adders keep the area to a minimum. However, to improve the
speed, the carry-in to each 4-bit block is found using the second level of abstraction of a
carry-lookahead adder. Such a 16-bit adder will require less area than a traditional
carry-lookahead adder, without sacrificing much speed.
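As an illustration of the 16-bit building block described above, the following Python sketch sums each 4-bit group with a ripple carry adder and derives the carry-ins to bits 4, 8, and 12 from block generate and propagate signals. It is a minimal behavioural sketch under the assumption that "second level of abstraction" refers to the usual group generate/propagate equations; the function names are hypothetical.

```python
def rca4_sum(a, b, cin):
    """4-bit ripple carry adder; only the sum bits are returned."""
    total, carry = 0, cin
    for i in range(4):
        x, y = (a >> i) & 1, (b >> i) & 1
        total |= (x ^ y ^ carry) << i
        carry = (x & y) | (carry & (x ^ y))
    return total


def block_gp(a, b):
    """Group generate and propagate of one 4-bit block."""
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]
    gen = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    prop = p[3] & p[2] & p[1] & p[0]
    return gen, prop


def adder16(a, b, cin=0):
    """16-bit adder: four 4-bit RCAs with lookahead carries between blocks."""
    nibbles = [((a >> s) & 0xF, (b >> s) & 0xF) for s in (0, 4, 8, 12)]
    (g0, p0), (g1, p1), (g2, p2), (g3, p3) = [block_gp(x, y) for x, y in nibbles]
    # Each block carry depends only on block-level signals and cin.
    c4 = g0 | (p0 & cin)
    c8 = g1 | (p1 & g0) | (p1 & p0 & cin)
    c12 = g2 | (p2 & g1) | (p2 & p1 & g0) | (p2 & p1 & p0 & cin)
    c16 = (g3 | (p3 & g2) | (p3 & p2 & g1) | (p3 & p2 & p1 & g0)
           | (p3 & p2 & p1 & p0 & cin))
    total = 0
    for k, ((x, y), c) in enumerate(zip(nibbles, (cin, c4, c8, c12))):
        total |= rca4_sum(x, y, c) << (4 * k)
    return total, c16
```

Because the ripple chains never exceed four bits, the longest carry path is one block-level lookahead stage followed by one 4-bit ripple, which is the trade-off the paragraph above describes.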
Figure 4-7: Hybrid Adder 2 Block Diagram
4.2.3.3 Hybrid Adder 3
Another hybrid adder using a combination of carry-lookahead and carry-select adders is
given in Figure 4-8. This adder uses seven 16-bit carry-lookahead adders. Once the
carry-out of the lower 16-bits is known, the sum calculated from the next set of adders
will be selected. Two more multiplexers are used to select the upper 32 bits of the sum,
once the actual carry-in is known. This adder will be faster than Hybrid Adders 1 and
2. However, this adder will also require more logic than the other adders because of the
seven 16-bit carry-lookahead adders it uses.






















Figure 4-8: Hybrid Adder 3 Block Diagram
4.2.3.4 Hybrid Adder 4
Another hybrid adder can be created using the same structure as that of Hybrid Adder 3
by modifying the 16-bit adder for the least significant bits. Hybrid Adder 4 replaces the
lower carry-lookahead adder with a ripple carry adder, as shown in Figure 4-9. Although
a ripple carry adder is slower than the carry-lookahead adder it replaces, the ripple carry
adder requires less logic.
The inputs to the final adder of a multiply-accumulator do not arrive
simultaneously. The least significant bits arrive earlier in time than the middle bits.
Therefore, a slower and more area-efficient adder can be used to sum the lower bits
without increasing the overall delay of the multiply-accumulator. For this reason, Hybrid
Adder 4 will be the best design choice of those presented for a final adder.
Figure 4-9: Hybrid Adder 4 Block Diagram
5. MAC Designs and Results
5.1 Introduction
Chapter 4 described each stage of a multiplier-accumulator, and presented several
algorithms that can be used to implement each stage. This chapter will evaluate each of
these methods using twenty-six different multiply-accumulator designs. Each design will
be presented and the simulation results compared to the theoretically expected timing and
area results.
Each of the designs in this chapter has been coded using Verilog HDL. Synopsys
Design Compiler was used to compile, optimize, and synthesize each design. The timing
and area numbers were obtained using the Synopsys PrimeTime static timing analyzer.
To show that these designs are technology independent, results will be presented using
0.13 micron and 0.18 micron CMOS process technologies. For deep-submicron
technologies, the effects of interconnect parasitics become increasingly important as the
cell delay decreases. Therefore, a conservative wire load model provided by the
technology vendor was used to estimate the interconnect delays. The wire load model
takes into consideration the drive strength and fan-out of each cell, from which a
capacitive load is assigned to each output. From these parameters, a delay value is then
assigned to each net, giving the delay estimation.
More accurate interconnect delay estimates can be achieved using back-end place
and route tools. However, the MAC designs simulated are part of a larger design that is
undefined. That is, the MAC will be a single computational unit within a larger DSP
design. The other components of the DSP have not been considered. These components
will affect the placement of the cells of the MAC, which in turn will affect its timing.
Therefore, because the MAC was considered as an individual unit that can be part of any
number of larger designs, the wire load model for delay estimation was chosen. The
designs each use the same wire load model so that an equal comparison can be made.
The 0.13 micron technology was simulated under typical operating conditions,
using an operating voltage of 1.0 V and a temperature of 25 °C. The 0.18 micron
technology was also simulated using the typical operating conditions, with an operating
voltage of 1.8 V and at a temperature of 25 °C.
5.2 Standard Multiply-Accumulators
As a basis for comparison, Synopsys DesignWare components are first simulated to
obtain timing and area results. DesignWare components are synthesizable, technology
independent arithmetic units that can be instantiated within a design coded using a
hardware description language. No single component exists to meet the specifications of
this project. Therefore, combinations of DesignWare components will be used to give
the desired functionality.
To create a multiply-accumulator with a product and a multiply-accumulate
output, two possible combinations of DesignWare components can be employed. First, a
32-bit signed/unsigned multiplier can be used to generate the product, which is then
added to the c input using a 64-bit adder. The second solution uses a DesignWare MAC
to give the multiply-accumulate output and a multiplier to produce the product output.
A design using a multiplier and an adder should be smaller but slower than the
design using a MAC and a multiplier. The multiply-accumulate output will be relatively
slow because the critical path is through two adders. However, this design will require
less area than the implementation using a MAC and a multiplier because more logic will
be shared. Conversely, the design using a MAC and a multiplier does not share any
hardware, and therefore should be faster but will use more area.
The static timing results obtained using PrimeTime support the expected results,
as given in Table 5-1. The DW MAC Mult design uses a DesignWare multiplier-
accumulator with a DesignWare multiplier, while the DW Mult Add design uses a
DesignWare multiplier and adder.
Table 5-1: Simulation Results for Two DesignWare MACs
Design         0.18 micron Technology    0.13 micron Technology
               time (ns)  area (µm²)     time (ns)  area (µm²)
DW Mult Add    3.576      163,540        2.098      92,375
DW MAC Mult    2.954      251,602        1.690      167,183
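The size of the speed and area trade-off between the two DesignWare baselines can be worked out directly from the numbers in Table 5-1. The short Python sketch below does only that arithmetic; the dictionary layout and the function name are illustrative choices, and the column order (0.18 micron first, then 0.13 micron) is assumed to match the other result tables in this chapter.

```python
# Worst-case delay (ns) and area (square microns) copied from Table 5-1.
results = {
    "0.18 micron": {"DW Mult Add": (3.576, 163_540), "DW MAC Mult": (2.954, 251_602)},
    "0.13 micron": {"DW Mult Add": (2.098, 92_375), "DW MAC Mult": (1.690, 167_183)},
}


def tradeoff(tech):
    """Percent delay reduction and percent area increase of DW MAC Mult."""
    t_add, a_add = results[tech]["DW Mult Add"]
    t_mac, a_mac = results[tech]["DW MAC Mult"]
    delay_reduction = (t_add - t_mac) / t_add * 100
    area_increase = (a_mac - a_add) / a_add * 100
    return delay_reduction, area_increase


for tech in results:
    faster, larger = tradeoff(tech)
    print(f"{tech}: {faster:.1f}% less delay for {larger:.1f}% more area")
```

Under that reading of the table, DW MAC Mult has roughly 17% less delay for roughly 54% more area in 0.18 micron technology, and roughly 19% less delay for roughly 81% more area in 0.13 micron technology.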
5.3 MAC Design Solutions
This section will present each of the MAC designs that have been created to meet the
specifications of this project. The purpose of each design is to explore the various design
alternatives presented, and to be able to draw conclusions from the simulation results as
to which methods produce the best results. The goal is to create a design that is at least as
fast as the DW MAC Mult design, but requires less area to implement.
5.3.1 Partial Product Generation: Baugh-Wooley vs. Booth Encoding
The first question that needs to be addressed is how to generate the partial products for
the multiply-accumulator. The two main solutions that have been presented for a
multiplier with signed or unsigned inputs are the Baugh-Wooley method and the
modified radix-4 Booth encoding method.
The Baugh-Wooley method uses only AND and NAND gates to generate the
partial products, and therefore will use less hardware and be faster than using Booth's
algorithm. However, the advantage of using the radix-4 Booth encoding method is that
fewer partial products are generated. Therefore, although the partial product generation
stage will take more time and area to implement, the partial product reduction stage will
be faster and require less hardware because fewer numbers need to be summed.
The other factor that affects which method will give better results is the bit width
of the input operands, n. The Baugh-Wooley method produces (n + 1) partial products,
while radix-4 Booth encoding gives [(n/2) + 2] partial products, where the c input to the
MAC is counted as a partial product. Therefore, as n increases the difference in the
number of partial products generated also increases. For small n, the extra hardware
required to implement Booth encoding will not reduce the number of partial products
sufficiently for the difference to be made up in the partial product reduction stage. The
Baugh-Wooley method will work well for small n, while radix-4 Booth encoding should
be used for larger n.
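The partial product counts quoted above can be reproduced with a short sketch. The Python below evaluates the two formulas and also recodes a multiplier into radix-4 Booth digits. It assumes the common scheme in which an n-bit operand that may be signed or unsigned is extended by two bits, giving n/2 + 1 Booth digits (plus the c input, for n/2 + 2 rows in total); this is an assumption made for illustration and is not taken from the logic of Section 3.3.3. All function names are hypothetical.

```python
def partial_product_counts(n):
    """Row counts from the text, with the c input counted as one row."""
    return {"baugh_wooley": n + 1, "booth_radix4": n // 2 + 2}


def booth_radix4_digits(multiplier, n, signed):
    """Recode an n-bit multiplier into n/2 + 1 digits, each in -2..2."""
    value = multiplier & ((1 << n) - 1)
    if signed and (value >> (n - 1)) & 1:
        value -= 1 << n  # reinterpret the bit pattern as two's complement
    digits, prev = [], 0  # prev is the implied 0 to the right of the LSB
    for i in range(0, n + 2, 2):
        # Python's >> sign-extends negative integers, which supplies the
        # two extension bits for signed operands automatically.
        b0 = (value >> i) & 1
        b1 = (value >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits


def digits_value(digits):
    """Value represented by a list of radix-4 digits, least significant first."""
    return sum(d * 4 ** i for i, d in enumerate(digits))
```

For n = 32, `partial_product_counts(32)` gives 33 rows for Baugh-Wooley and 18 rows for radix-4 Booth encoding, which is the "almost half" relationship the text relies on.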
It is difficult to predict the exact value of n where Booth encoding begins to be
more efficient than the Baugh-Wooley method. A value of n equal to 32 should be large
enough for Booth encoding to be used, but the only way to be certain is through
simulation.
Two simulations to compare Booth encoding to the Baugh-Wooley method were
performed using 0.18 micron and 0.13 micron technologies. The first comparison only
involves producing the partial products, while the second compares two
multiply-accumulators. Because the modified radix-4 Booth encoding produces almost half as
many partial products as the Baugh-Wooley method, 4:2 compressors were used to
reduce the number of partial products for the Baugh-Wooley case to make the two
methods as equivalent as possible.
Design              0.18 micron Technology    0.13 micron Technology
                    time (ns)  area (µm²)     time (ns)  area (µm²)
Booth               0.611      56,103         0.374      30,910
Baugh-Wooley/4:2    0.896      103,435        0.515      78,115
The simulation results indicate that the radix-4 Booth encoding method has better
area and speed performance than the Baugh-Wooley method with 4:2 compressors in
both implementations. However, due to the differences in the formation of the partial
products using these two methods, the total number of partial product bits is not the same
for both methods. Therefore, this test does not give a completely accurate comparison,
but it is as close as possible when only the partial products are considered. A more
accurate test is to compare a multiply-accumulator using the Booth method to one using
the Baugh-Wooley method.
To test whether Booth encoding or the Baugh-Wooley method has better
performance, two designs were created and simulated. The first MAC, MAC 1, uses the
Baugh-Wooley method to create the partial products, the three-dimensional reduction
method with the delay estimates given in Section 3.4.3 as the PPRT, and two final adders
similar to Hybrid Adder 4 of Section 4.2.3.4. The PPRT uses Method 2 style reduction,
in which as many full adders as possible are shared without affecting the timing of the
multiply-accumulate output.
For comparison, a second design, Booth MAC, was created that is the same as
MAC 1 except that the radix-4 Booth encoding method is used to generate the partial
products. The PPRT is also different because there are fewer partial products to
sum; however, the same general method is employed.
The simulation results for these two designs are given in Table 5-3 for two
different process technologies. Both designs are faster than DW MAC Mult when
implemented in 0.18 micron technology, but are slightly slower in 0.13 micron technology.
Compared to each other, MAC 1 is faster than the Booth MAC using the 0.18 micron
process, but is slightly slower in the 0.13 micron process. There are several explanations
for these somewhat conflicting results. First, each technology library has its own set of
cells, each with different delay characteristics. From these cells, certain designs
can be implemented more efficiently than others. Therefore, it is likely that
the differences in the structure of MAC 1 and the Booth MAC, combined with the
differences in the technology library cells, contribute to the relative timing differences
seen.
Table 5-3: Booth vs. Baugh-Wooley MAC Simulation Results
Design         0.18 µm Technology        0.13 µm Technology
               time (ns)  area (µm²)     time (ns)  area (µm²)
MAC 1 (B-W)    2.801      312,870        1.750      179,750
Booth MAC      2.905      265,776        1.707      151,514
The constraints placed on a design during synthesis affect the timing and area
results of the design. Timing and area constraints are used to guide the synthesis tool in
optimizing the design for the given performance goals. Based on these constraints, the
tool will use optimization techniques to attempt to meet these goals. If it cannot meet the
constraints, it will place a higher priority on the timing result than the area number. This
means that it will create the fastest design possible and will then reduce the area as long
as the timing is not affected. The optimization methods used during compilation will also
lead to performance differences between designs.
Finally, as the channel length of the transistors decreases, the interconnect delays
have an increasing impact on the speed of the design due to decreasing cell delays. For
deep-submicron technologies, the interconnect parasitics must be considered to give the
most accurate timing results. Therefore, a design with short interconnects and a regular
layout becomes increasingly desirable as the channel length of the transistors decreases.
Although the timing numbers are inconsistent, the area numbers present the biggest
difference between MAC 1 and the Booth MAC. For both cases the Booth MAC requires
less area, indicating that the additional hardware used to implement Booth's algorithm is
made up for by using fewer full adders in the PPRT. These results indicate that Booth
encoding should be used to generate the partial products for this case, where n equals 32.
Further improvements are needed, however, to meet the desired timing and area goals.
5.3.2 Final Adder Comparison
In Section 4.2.3 several final adder designs were presented. Of these designs, Hybrid
Adder 4 was shown to be best suited as the final adder of a multiply-accumulator. The
inputs to a final adder of a MAC do not arrive simultaneously, so the fastest adder design
assuming all inputs arrive at the same time will not necessarily be the best suited as a
final adder.
The only way to accurately compare final adder designs is to use the adder in an
actual MAC design. Therefore, three separate multiply-accumulators with identical
partial product generation and partial product reduction stages were created. The designs
only differ in the final addition stage. Each MAC was designed using a radix-4 version
of Booth's algorithm to generate the partial products. The three-dimensional reduction
method using the delay estimates presented in Section 3.4.3 was used to implement the
PPRT. The PPRT was designed using the Method 2 style, in which full adders are only
shared when it does not impact the delay estimate of the multiply-accumulate output.
MAC 2 uses two DesignWare final adders, one to give the multiply-accumulate
output, and the other to produce the product output. This design will be used as a
benchmark for the other two designs. MAC 3 uses two Brent-Kung adders, and MAC 4
uses Hybrid Adder 4. MAC 4 was coded differently than the Booth MAC in Table 5-3 by
removing one level of hierarchy in the Verilog design. The simulation results are given
in Figures 5-1 and 5-2. All area numbers presented have units of square microns (µm²).





Figure 5-1: Final Adder Simulation Results - 0.18 micron Technology
Final Adder Comparison: 0.13 micron Technology
area x 100,000 1.371 1.495
Figure 5-2: Final Adder Simulation Results - 0.13 micron Technology
The simulation results indicate that the Brent-Kung adder gives approximately the
same results as Hybrid Adder 4 when implemented in 0.18 micron technology. However,
in 0.13 micron technology, Hybrid Adder 4 is faster and larger than the Brent-Kung
implementation. For both cases, the DesignWare adder design is faster but requires more
area than the other two designs. This illustrates the tradeoff that exists between the size
and speed of these designs.
5.3.3 Booth Coding Style Comparison
The coding style used to describe a hardware design can impact the circuit that is created,
and thus the simulation results. Therefore, three different methods of coding the radix-4
version of Booth's algorithm were created and simulated. The three multiply-accumulators
are identical except in the way that the partial product generation stage was
coded.
The partial product generation stage of MAC 5 was coded to match the logic
described in Section 3.3.3. This coding style was intended to reflect the specific
gate-level implementation of the design. However, optimization can occur during compilation
to create a design that meets the specified constraints. The functionality of Booth's
algorithm is described at a higher level in MAC 6 by using one Verilog case statement
and an extra logic equation for bit one of the partial product. The LSB of each partial
product (the neg_cin bit) was coded using the same Boolean equation used for MAC 5.
Bit one of each partial product was also coded using the equivalent logic equation. The
upper bits of each partial product are determined using a case statement, where three bits
of the multiplicand are used as the select bits for determining the correct partial product
bits to be used. MAC 7 also uses a case statement to determine the partial products using
the radix-4 Booth encoding. However, in this design bit one of each partial product is
also selected within the case statement, rather than using a separate equation. The code
for each of these designs is given in Appendix A.
The three MACs are logically equivalent, but each is coded in a different
manner. Therefore, this comparison tests the impact of coding style on the way in which
Synopsys Design Compiler interprets the design.
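The case-statement style of partial product selection can be illustrated with a behavioural model. The Python sketch below is not the Verilog of Appendix A: it handles signed operands only, uses a lookup table in place of the case statement, and models the neg_cin bit as a separate value added at the least significant position of each row. The operand names x and y and all function names are hypothetical; y is the operand whose bits form the three-bit select groups.

```python
def booth_select(group, x, width):
    """Select one partial product row from a 3-bit Booth group.

    Returns (row bits, neg_cin). Negative multiples are formed as a
    bitwise complement, with neg_cin supplying the +1 of two's complement.
    """
    mask = (1 << width) - 1
    x1 = x & mask          # +1 * x, sign-extended to `width` bits
    x2 = (x << 1) & mask   # +2 * x
    table = {
        0b000: (0, 0),
        0b001: (x1, 0),
        0b010: (x1, 0),
        0b011: (x2, 0),
        0b100: (~x2 & mask, 1),
        0b101: (~x1 & mask, 1),
        0b110: (~x1 & mask, 1),
        0b111: (0, 0),
    }
    return table[group]


def booth_multiply(x, y, n):
    """Signed n-bit by n-bit product (2n bits) from radix-4 Booth rows."""
    width = 2 * n
    mask = (1 << width) - 1
    total, prev = 0, 0
    for i in range(0, n, 2):
        group = (((y >> i) & 3) << 1) | prev   # bits y[i+1], y[i], y[i-1]
        row, neg_cin = booth_select(group, x, width)
        total += (row + neg_cin) << i
        prev = (y >> (i + 1)) & 1
    return total & mask
```

For example, `booth_multiply(-3, 5, 8)` returns the 16-bit two's complement encoding of -15. Whether the low bits of each row are written as separate equations or folded into the selection table does not change this result, which is the sense in which the three coding styles are logically equivalent.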
For MACs 5, 6, and 7 the partial product reduction stage was designed using a
combination of Dadda's method and the TDM. Dadda's method was used, but the TDM
delay estimates were used to determine the interconnections of the full and half adders of
the PPRT. Hybrid Adder 4 was used as the final adder, and was coded identically to that
of MAC 4. The simulation results are summarized in Figures 5-3 and 5-4.

































time (ns) 1.746
Figure 5-4: Booth Coding Style Comparison - 0.13 micron Technology
These results indicate that there is no clear best choice. The MAC 5 design has
the worst timing performance in the 0.18 micron technology, but is the fastest in the 0.13
micron technology. The MAC 7 design is the fastest in 0.18 micron technology, and is
only slightly slower than MAC 5 using the 0.13 micron technology. If speed is the most
critical constraint on the design, MAC 7 produces the best overall timing numbers for
both technologies. The increased speed of the design comes at the expense of area, most
notably when implemented using the 0.18 micron technology. These results indicate that
the coding style has an impact on the simulation results.
5.3.4 Partial Product Reduction Comparison - Method 1 vs. Method 2
In Section 4.2.2 two different methods of reducing the partial products of a
multiply-accumulator were introduced. Method 1 first sums the partial products excluding the c
input to give two binary numbers. Then these numbers are added to the c input to yield
the multiply-accumulate output. Method 2 considers the c input as a partial product and
begins to sum it with the other partial products immediately. The partial product
reduction tree has two separate paths; one produces the product output, while the other
calculates the multiply-accumulate output. As discussed, a MAC using Method 1 should
be slower, but require less area than a MAC implemented using Method 2.
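The functional equivalence of the two methods can be shown with a word-level carry-save model. The Python sketch below is only an illustration: it uses a plain AND array for unsigned operands instead of Booth rows, reduces rows in arrival order rather than by any timing estimate, and models the extra step of Method 1 as one additional 3:2 compression ahead of the final adder. All function names are hypothetical.

```python
def csa(x, y, z):
    """3:2 compression on whole words: three addends in, two out, same total."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c


def reduce_to_two(addends):
    """Apply 3:2 compression until only two numbers remain."""
    rows = list(addends)
    while len(rows) < 2:
        rows.append(0)
    while len(rows) > 2:
        s, c = csa(rows[0], rows[1], rows[2])
        rows = rows[3:] + [s, c]
    return rows


def and_array_partial_products(a, b, n):
    """Unsigned partial products: one shifted copy of a per set bit of b."""
    return [(a << i) if (b >> i) & 1 else 0 for i in range(n)]


def mac_method1(pps, c, width=64):
    """Reduce the product rows first, then fold in c before the final add."""
    mask = (1 << width) - 1
    s, k = reduce_to_two(pps)
    product = (s + k) & mask
    s2, k2 = csa(s, k, c)          # c joins only after the product rows
    return product, (s2 + k2) & mask


def mac_method2(pps, c, width=64):
    """Treat c as one more row in a separate reduction path."""
    mask = (1 << width) - 1
    s, k = reduce_to_two(pps)
    product = (s + k) & mask
    s2, k2 = reduce_to_two(list(pps) + [c])   # c is summed from the start
    return product, (s2 + k2) & mask
```

Both functions return the same (product, multiply-accumulate) pair. The difference in hardware is that Method 1 reuses the reduced product rows for the accumulate path, while Method 2 duplicates part of the tree so that c does not wait for the product reduction to finish.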
MACs 8 and 9 were created to test whether the predicted results for Methods 1
and 2 match the simulation results. Both designs use radix-4 Booth encoding, coded
using a single case statement in Verilog, as in MAC 7. The final adder uses the Hybrid
Adder 4 design, identical to that described for MAC 4. The only difference between the
two designs is in the partial product reduction stage.
MAC 8 uses the Method 2 approach to reduce the partial products. The timing
estimates for the three-dimensional reduction method were used to estimate the delay of
each path through the PPRT. Adders were only shared if it did not affect the timing
estimate of the multiply-accumulate output. MAC 9 also uses the TDM approach to
reduce the partial products, except in this design the Method 1 approach was used. The
simulation results for these two designs are summarized in Figures 5-5 and 5-6.













                 MAC 8 (Method 2)  MAC 9 (Method 1)
time (ns)        2.794             2.940
area x 100,000   2.709             2.188
Figure 5-5: Partial Product Reduction Comparison - 0.18 micron Technology
Partial Product Reduction Comparison: 0.13 micron Technology
                 MAC 8 (Method 2)  MAC 9 (Method 1)
time (ns)        1.686             1.773
area x 100,000   1.545             1.225
Figure 5-6: Partial Product Reduction Comparison - 0.13 micron Technology
The simulation results confirm the theoretically predicted results. In both cases,
MAC 8 is faster but requires more area than MAC 9. Therefore, the timing requirements
specified for a system will determine the implementation. If the timing performance of a
MAC designed using Method 1 is sufficient, it should be used for the area savings it
allows.
5.3.5 Partial Product Reduction Comparison - Dadda vs. TDM
In [14] the area and speed of Dadda and Wallace multipliers were compared using 0.25
micron and 0.18 micron technologies. These simulations show that Wallace multipliers
require more area than the equivalent Dadda multipliers, and have approximately the same
worst-case delay [14]. From these results, the Dadda method of partial product reduction
will be preferred to a Wallace tree. However, the simulations in [14] did not consider the
three-dimensional method as an alternative, which will be presented in this section.
To compare the TDM to the Dadda partial product reduction approach, two of the
MAC designs already presented can be used. MAC 7 uses radix-4 Booth encoding to
create the partial products, the Dadda method using the delay estimates of the TDM for
each level of the Dadda tree to reduce the partial products, and Hybrid Adder 4 as the
final adder. MAC 8 is identical to MAC 7 except the TDM is used in the partial product
reduction stage. Because both designs use the TDM delay estimates, the worst-case
delay should be approximately the same. However, the TDM approach considers the
timing of each column individually, whereas the Dadda approach assigns a time to each
level of the tree. Therefore, the TDM has the potential to be slightly faster than the
Dadda method. As a result of assigning a time to each level, the Dadda approach will
share more full adders in the PPRT than the TDM, which should result in a
design with smaller area.
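The number of levels in a Dadda tree follows from the standard Dadda height sequence 2, 3, 4, 6, 9, 13, 19, 28, in which each height is 1.5 times the previous one, rounded down. The Python sketch below computes the number of levels for a given number of rows. It is included only to make "level of the tree" concrete, and the row counts used in the note after it are the (n + 1) and [(n/2) + 2] figures from Section 5.3.1 with n = 32. The function names are hypothetical.

```python
def dadda_heights(rows):
    """Dadda target heights that are strictly below `rows`, smallest first."""
    heights = [2]
    while heights[-1] * 3 // 2 < rows:
        heights.append(heights[-1] * 3 // 2)
    return heights


def dadda_levels(rows):
    """Number of reduction levels needed to bring `rows` operands down to two."""
    return len(dadda_heights(rows)) if rows > 2 else 0
```

With these definitions, 18 rows (radix-4 Booth encoding plus the c input) need 6 levels, while 33 rows (Baugh-Wooley plus the c input) need 8. In the Dadda-based designs each of these levels is assigned a single time value, whereas the TDM assigns times column by column.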
The simulation results for the two designs are given in Figures 5-7 and 5-8. The
simulation results generally match the expected results. The Dadda design has a smaller
worst-case delay in the 0.18 micron simulation, however, indicating that the delay is
technology dependent. The simulations confirm that the Dadda design requires less area
than the TDM design.











                 MAC 7 (Dadda)  MAC 8 (TDM)
time (ns)        2.776          2.794
area x 100,000   2.637          2.709
Figure 5-7: TDM vs. Dadda Comparison - 0.18 micron Technology















                 MAC 7 (Dadda)  MAC 8 (TDM)
time (ns)        1.754          1.686
area x 100,000   1.437          1.545
Figure 5-8: TDM vs. Dadda Comparison - 0.13 micron Technology
5.3.6 TDM PPRT Analysis
The delay estimates used to develop the three-dimensional reduction method were based
on a full adder from an LSI 100K 1µ CMOS-ASIC cell [13]. Other technology libraries
will most likely not follow the same normalized delay model as this cell. That is, each
technology library will have its own full adder implementation, so a universal model
cannot be used to generalize the delay of a full adder for all technologies.
The TDM is a useful method if the delay estimates for the cells of the technology
library are known. However, for a technology independent design, this information is not
known. To test these theories, MAC designs MAC 10 through MAC 19 were created
using varying delay estimates for a full adder. The simulation results for the MAC 8
design previously discussed will be used again for comparison with the other TDM
designs.
Each MAC uses the radix-4 Booth encoding method to generate the partial
products, and a Hybrid Adder 4 final adder. The TDM is used to reduce the partial
products, where full adders are shared only if it does not affect the delay estimate of the
multiply-accumulate output. The normalized delay estimates from each input (a, b, c) to
the outputs of the full adder (sum, carry) for each design are summarized in Table 5-4.
Using these delay estimates, the TDM was used to create each design. The simulation
results for each design are given in Figures 5-9 and 5-10. Table 5-5 lists the numeric
timing and area numbers for each of the TDM designs.
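The role of the per-input delay estimates can be illustrated with a small timing model of one full adder. In the Python sketch below, each output arrival time is the latest input arrival plus that input's delay to the output, and the three signals are assigned to inputs a, b, and c so that the worst output time is minimized. The delay values used in the usage note are illustrative assumptions chosen for this sketch, not values recovered from Table 5-4, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class FullAdderDelays:
    to_sum: tuple    # normalized delays (a, b, c) -> sum
    to_carry: tuple  # normalized delays (a, b, c) -> carry


def full_adder_times(arrivals, delays):
    """Arrival times of (sum, carry) for inputs a, b, c arriving at `arrivals`."""
    t_sum = max(t + d for t, d in zip(arrivals, delays.to_sum))
    t_carry = max(t + d for t, d in zip(arrivals, delays.to_carry))
    return t_sum, t_carry


def assign_inputs(arrivals, delays):
    """Order three signals on (a, b, c) to minimize the later of the two outputs."""
    return min(permutations(arrivals),
               key=lambda order: max(full_adder_times(order, delays)))
```

With an assumed uneven model such as `FullAdderDelays((2, 2, 1), (1, 1, 1))`, a signal arriving one unit late is best placed on input c, giving sum and carry times of (2, 2) rather than (3, 2). With an even model such as `FullAdderDelays((1, 1, 1), (1, 1, 1))`, the sum and carry of any full adder always share the same time estimate, so the input assignment no longer matters.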
Table 5-4: Delay Estimate Summary for TDM Analysis
Design   a -> sum  b -> sum  c -> sum  a -> carry  b -> carry  c -> carry
MAC 10   2         2                   0.6         0.6         0.6
MAC 11   2         2                   0.8         0.8         0.8
MAC 8    2         2                   1.0         1.0         1.0
MAC 12   2         2                   1.2         1.2         1.2
MAC 13   2         2                   1.4         1.4         1.4
MAC 14   2         2                   1.6         1.6         1.6
MAC 15   2         2                   1.8         1.8         1.8
MAC 16   2         2                   2.0         2.0         2.0
MAC 17   2         2                   2.2         2.2         2.2
MAC 18   1         1                   1.0         1.0         1.0
MAC 19   1         1         2         1.0         1.0         1.0





Figure 5-9: TDM Analysis - 0.18 micron Technology




TDM Comparison: 0.13 micron Technology
Figure 5-10: TDM Analysis - 0.13 micron Technology
Table 5-5: TDM Analysis Simulation Summary
Design   0.18 micron Technology    0.13 micron Technology
         time (ns)  area (µm²)     time (ns)  area (µm²)
MAC 10   2.788      276,009        1.731      145,026
MAC 11   2.842      268,923        1.658      157,101
MAC 8    2.794      270,889        1.686      154,549
MAC 12   2.839      260,879        1.703      147,716
MAC 13   2.728      275,255        1.764      146,283
MAC 14   2.683      275,841        1.728      146,749
MAC 15   2.764      278,213        1.699      153,105
MAC 16   2.735      263,128        1.728      145,608
MAC 17   2.824      270,351        1.688      157,713
MAC 18   2.859      255,128        1.748      146,292
MAC 19   2.819      268,183        1.742      147,824
The simulation results indicate that no single delay estimate
gives the best performance in both technologies. MAC 14 is the fastest choice in
0.18 micron technology, while MAC 11 has the best worst-case delay for 0.13 micron
technology. These results confirm that predicting a normalized delay model for a full
adder implemented in any technology cannot be accomplished with the methods and tools
presented.
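The ranking stated above can be checked mechanically from Table 5-5. The Python sketch below copies the table into a dictionary and picks the fastest and smallest design for each technology; the data layout and function names are illustrative choices.

```python
# (time in ns, area in square microns) for 0.18 micron, then 0.13 micron,
# copied from Table 5-5.
table_5_5 = {
    "MAC 10": ((2.788, 276_009), (1.731, 145_026)),
    "MAC 11": ((2.842, 268_923), (1.658, 157_101)),
    "MAC 8":  ((2.794, 270_889), (1.686, 154_549)),
    "MAC 12": ((2.839, 260_879), (1.703, 147_716)),
    "MAC 13": ((2.728, 275_255), (1.764, 146_283)),
    "MAC 14": ((2.683, 275_841), (1.728, 146_749)),
    "MAC 15": ((2.764, 278_213), (1.699, 153_105)),
    "MAC 16": ((2.735, 263_128), (1.728, 145_608)),
    "MAC 17": ((2.824, 270_351), (1.688, 157_713)),
    "MAC 18": ((2.859, 255_128), (1.748, 146_292)),
    "MAC 19": ((2.819, 268_183), (1.742, 147_824)),
}


def fastest(tech):
    """Design with the smallest delay; tech 0 is 0.18 micron, 1 is 0.13 micron."""
    return min(table_5_5, key=lambda name: table_5_5[name][tech][0])


def smallest(tech):
    """Design with the smallest area for the given technology index."""
    return min(table_5_5, key=lambda name: table_5_5[name][tech][1])


for tech, label in enumerate(("0.18 micron", "0.13 micron")):
    print(label, "fastest:", fastest(tech), "smallest:", smallest(tech))
```

The fastest design differs between the two technologies (MAC 14 and MAC 11), and so does the smallest (MAC 18 and MAC 10), which is consistent with the conclusion that no single normalized delay model suits both libraries.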
5.3.7 Method 1 Partial Product Reduction Analysis
It was shown in Section 5.3.4 that the Method 1 approach to reducing the partial products
of a multiply-accumulator saves area in the design at the cost of reduced speed. Only
one design using the Method 1 approach was presented. Therefore, alternative design
methodologies will be considered in this section.
Based on the results of Section 5.3.6, the TDM delay prediction cannot be applied
generally to all technologies. In this section the TDM approach using Method 1 to reduce
the partial products will be compared to a Method 1 approach using an even delay
estimate for the full adder timing. That is, the delay from each input to each output will
be estimated to have equal delay. An even delay estimate will allow for more full adders
to be shared without affecting the timing estimate. For example, a full adder that sums
three signals in the path of both the multiply-accumulate and product outputs will
produce a sum and a carry that have an equal time estimate assignment. Therefore, these
two outputs can more easily be assigned to the same full adder in the next level of the tree
because they have the same timing, and do not need to be split up in an attempt to
equalize the delay through the tree.
An even delay model is a generic timing estimate that can be used to create a
technology independent design. Although better estimates exist for specific
implementations, an even delay model will give adequate results for all technologies.
Three designs have been created using the TDM and a Method 1 approach to
partial product reduction. MAC 9, as previously described, uses the TDM estimates given
in [13] to reduce the partial products. MAC 20 reduces the partial products using a full
adder delay model in which the delay from each input to each output is equal. MAC 21 also
uses an even delay model, but it uses a Brent-Kung final adder. The first two designs use
Hybrid Adder 4 as the final adder. Each design uses the radix-4 Booth encoding method to create the
partial products.


















                 MAC 9 (TDM)  MAC 20 (Even Delay, Hybrid 4)  MAC 21 (Even Delay, B-K)
time (ns)        2.940        2.936                          3.000
area x 100,000   2.188        2.183                          2.150
Figure 5-11: Method 1 PPRT Comparison - 0.18 micron Technology







                 MAC 9 (TDM)  MAC 20 (Even Delay, Hybrid 4)  MAC 21 (Even Delay, B-K)
time (ns)        1.773        1.757                          1.712
area x 100,000   1.225        1.203                          1.325
Figure 5-12: Method 1 PPRT Comparison - 0.13 micron Technology
The simulation results displayed in Figures 5-11 and 5-12 indicate that the
performance of MAC 20 serves as a good compromise for both technologies. In both
cases it is slightly faster than MAC 9, which uses the standard TDM estimate. MAC 21
shows that using a Brent-Kung adder gives the fastest design in 0.13 micron technology
at the expense of increased area, but it is the slowest design in 0.18 micron technology.
Based on these results, a TDM estimate using even delays and a Hybrid Adder 4 final
adder can provide good results regardless of the technology of implementation.
5.3.8 Method 2 Partial Product Reduction Analysis
An analysis similar to that performed for the Method 1 approach can be applied to designs using
Method 2 to reduce the partial products. Once again three designs were simulated to
determine whether an even delay estimate or the TDM delay estimate produces the better
results.
All designs use the radix-4 Booth encoding method to generate the partial
products. MAC 8 uses the standard TDM delay estimates, and only shares full adders if it
does not affect the timing of the multiply-accumulate output. Hybrid Adder 4 is used for
the final addition stage of MAC 8. MAC 18 uses full adder delay estimates that are equal
from all inputs to all outputs, and is otherwise identical to MAC 8. The MAC 22
design uses a Brent-Kung final adder, radix-4 Booth encoding, and an equal delay
full adder model.
The simulation results shown in Figures 5-13 and 5-14 using the Method 2
approach do not match the results using Method 1. MAC 8 gives the fastest and largest
design for both technologies, and MAC 18 gives the slowest and smallest design for both
cases. These results indicate that for a Method 2 approach, the even delay method will
give a design that is not as fast as the TDM.
















Figure 5-13: Method 2 PPRT Comparison - 0.18 micron Technology

















Figure 5-14: Method 2 PPRT Comparison - 0.13 micron Technology
5.3.9 Dadda MAC Designs
To this point, most of the MAC designs presented have used the three-dimensional
reduction method with different delay estimates to sum the partial products. In this
section, various Dadda MAC designs are analyzed.
All of the MAC designs presented in this section use the radix-4 Booth encoding
method coded using a single case statement in Verilog for the upper bits of the partial
product. The partial product reduction stage of each MAC uses the Method 2 approach,
in which the c input is summed as a partial product from the beginning of the PPRT.
MAC 7 uses the Dadda method to reduce the partial products using the standard TDM
method, as described previously.
MAC 23 uses the Dadda method without considering the timing of any of the
signals in the circuit when deciding on how the full and half adders should be connected.
The final adder used is Hybrid Adder 4. MAC 26 uses the same partial product
generation and reduction methods as MAC 23, and the final adder is a Brent-Kung adder.
These two designs should be the slowest because no consideration is given to optimizing
the circuit for speed. These designs were created to show that speed improvements can
be achieved when an attempt is made to equalize the delay of each path through the
MAC.
MAC 24 sums the partial products using Dadda's method, where each path
through a full adder is considered to have equal delay. Using this delay estimate allows
each level of the Dadda tree to be considered together, where the sum and carry outputs
from the level are assigned the same time value. The final adder is a Hybrid Adder 4.
MAC 25 is identical to MAC 24 except that the final adder is a Brent-Kung adder. Using
an even delay model has been predicted to give a good compromise in performance for a
technology independent design. This was true for the Method 1 designs but was found
not to be true for the Method 2 designs in Section 5.3.8. Results similar to those seen
in Section 5.3.8 should therefore be expected for these simulations.
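The level-by-level timing used for MAC 24 and MAC 25 can be sketched as follows.
Dadda's method reduces the partial product matrix through the target heights 2, 3, 4,
6, 9, 13, ..., where each height is the previous one multiplied by 1.5 and rounded
down. Under the even delay model, every output of a level is assigned the same time.
The function names, the unit full adder delay, and the row count of 18 are assumptions
for illustration only and are not figures taken from the designs.

```python
def dadda_heights(rows):
    """Return the Dadda target heights (2, 3, 4, 6, 9, ...) below `rows`."""
    heights = []
    h = 2
    while h < rows:
        heights.append(h)
        h = (3 * h) // 2
    return heights


def level_times(rows, fa_delay=1.0):
    """Time assigned to the outputs of each level under the even delay model."""
    stages = len(dadda_heights(rows))
    return [fa_delay * (level + 1) for level in range(stages)]


print(dadda_heights(18))  # [2, 3, 4, 6, 9, 13]
print(level_times(18))    # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```

The number of entries returned by dadda_heights is the number of reduction levels, so
an 18-row matrix would need six levels in this model.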

































Figure 5-15: Dadda MAC Comparison - 0.18 micron Technology
Figure 5-16: Dadda MAC Comparison - 0.13 micron Technology
The simulation results shown in Figures 5-15 and 5-16 for the Dadda MACs
confirm that for both implementations the designs that consider the timing of the signals
in the circuit are faster but larger than a design that does not. The next step is to
determine which method is best for predicting the delays of the full adders in the circuit.
For both technologies, MAC 25 is faster than the MAC using the standard TDM.
However, this speed increase comes at the expense of increased area. When considering
both speed and area, MAC 7 presents the best compromise.
5.3.10 MAC Design Summary
The proposed MAC designs with the best performance with respect to area and speed are
summarized in this section. The goal is to find a design that is approximately the same
speed but requires less area than a DesignWare MAC and multiplier. Several of the
proposed designs are compared to the DesignWare components in Figures 5-17 and 5-18.
[Bar chart, 0.18 micron technology: time (ns) and area (x 100,000) for DW Add Mult,
DW MAC Mult, MAC 2, MAC 8, MAC 9, MAC 15, MAC 18, MAC 20, MAC 23, and MAC 25]
Figure 5-17: MAC Comparison - 0.18 micron Technology
[Bar chart, 0.13 micron technology: time (ns) and area (x 100,000) for DW Add Mult,
DW MAC Mult, MAC 2, MAC 8, MAC 9, MAC 15, MAC 18, MAC 20, MAC 23, and MAC 25]
Figure 5-18: MAC Comparison - 0.13 micron Technology
Table 5-6: MAC Design Summary
                0.18 micron Technology      0.13 micron Technology
Design          time (ns)    area (µm²)     time (ns)    area (µm²)
DW Add Mult     3.576        163,540        2.098         92,374
DW MAC Mult     2.954        251,602        1.690        167,183
MAC 2           2.666        282,310        1.659        158,946
MAC 8           2.794        270,889        1.686        154,549
MAC 9           2.940        218,789        1.773        122,461
MAC 15          2.764        278,213        1.699        153,105
MAC 18          2.859        255,128        1.748        146,292
MAC 20          2.936        218,323        1.757        120,298
MAC 23          3.076        215,748        1.912        120,856
MAC 25          2.738        283,128        1.700        155,387
Table 5-7: MAC Design Descriptions
Design        Partial Product Gen.   Partial Prod. Reduction     Final Addition
DW Add Mult   DesignWare             DesignWare                  DesignWare
DW MAC Mult   DesignWare             DesignWare                  DesignWare
MAC 2         Booth                  TDM - Method 2              DesignWare
MAC 8         Booth                  TDM - Method 2              Hybrid Adder 4
MAC 9         Booth                  TDM - Method 1              Hybrid Adder 4
MAC 15        Booth                  Modified TDM - Method 2     Hybrid Adder 4
MAC 18        Booth                  Even Delay - Method 2       Hybrid Adder 4
MAC 20        Booth                  Even Delay - Method 1       Hybrid Adder 4
MAC 23        Booth                  Dadda, No Opt. - Method 1   Hybrid Adder 4
MAC 25        Booth                  Dadda, Even - Method 2      Brent-Kung
Table 5-7 gives a brief summary of each MAC design used in Figures 5-17 and
5-18. Each design has been described in more detail in the previous sections. The first
two designs use only DesignWare components to create the MAC and serve as the
baseline for comparison with the other designs. The designs with 'Even Delay' in the
Partial Product Reduction column assign equal delay estimates to each path of a full
adder. The partial product reduction tree for MAC 23 does not consider any delay
estimates in its formation; hence there is no delay optimization (No Opt.).
From the simulation results, it is difficult to pick a single design that outperforms
the others for both technologies. If speed were the only consideration, MAC 8 would be a
good choice because it is the only design free of DesignWare components that is faster
than the DW MAC Mult for both technologies. MAC 15 offers similar performance to
MAC 8, except that it is even larger than MAC 8 in the 0.18 micron technology.
Similarly, MAC 25 offers acceptable timing numbers, but its area in 0.18 micron
technology is even greater than that of MAC 15.
In general, the designs using Method 1 are slower but require less area than those
using Method 2 to reduce the partial products. MAC 9, MAC 20, and MAC 23 each use
Method 1. Of these designs, MAC 20 is the fastest and also offers respectable area
numbers for both technologies. MAC 20 is faster than the DW MAC Mult in the 0.18
micron technology, but slightly slower in the 0.13 micron technology. However, it offers
an area improvement in both implementations. Therefore, if the timing numbers are
adequate, MAC 20 would be a good design choice.
MAC 18 offers worst-case delay numbers similar to those of MAC 20, but MAC 20
offers an area savings for both cases. Therefore, of the designs presented, MAC 8 should
be used if the speed of the circuit is the most critical factor. However, if the timing of
MAC 20 is sufficient, it should be used to take advantage of the area savings it offers.
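The selection argument above can be checked directly against the numbers in Table 5-6.
The short script below is only a convenience sketch: the tuple layout and variable
names are assumptions, and the DesignWare flag is taken from the Final Addition column
of Table 5-7. It lists the designs free of DesignWare components that beat the DW MAC
Mult on delay in both technologies, and then the fastest design that beats it on area
in both.

```python
# (name, uses DesignWare, 0.18 time ns, 0.18 area um^2, 0.13 time ns, 0.13 area um^2)
TABLE_5_6 = [
    ("DW Add Mult", True, 3.576, 163540, 2.098, 92374),
    ("DW MAC Mult", True, 2.954, 251602, 1.690, 167183),
    ("MAC 2", True, 2.666, 282310, 1.659, 158946),
    ("MAC 8", False, 2.794, 270889, 1.686, 154549),
    ("MAC 9", False, 2.940, 218789, 1.773, 122461),
    ("MAC 15", False, 2.764, 278213, 1.699, 153105),
    ("MAC 18", False, 2.859, 255128, 1.748, 146292),
    ("MAC 20", False, 2.936, 218323, 1.757, 120298),
    ("MAC 23", False, 3.076, 215748, 1.912, 120856),
    ("MAC 25", False, 2.738, 283128, 1.700, 155387),
]

ref = next(row for row in TABLE_5_6 if row[0] == "DW MAC Mult")
custom = [row for row in TABLE_5_6 if not row[1]]

# Faster than the reference in both technologies.
faster = [r[0] for r in custom if r[2] < ref[2] and r[4] < ref[4]]

# Smaller than the reference in both technologies, then the fastest of those.
smaller = [r for r in custom if r[3] < ref[3] and r[5] < ref[5]]
fastest_small = min(smaller, key=lambda r: (r[2], r[4]))[0]

print(faster)         # ['MAC 8']
print(fastest_small)  # MAC 20
```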
6. Future Work
Several design solutions have been presented and simulated that meet the required timing
and area specifications. For more accurate timing and area estimates for the multiply-
accumulator, place and route needs to be performed and static timing rerun. To get the
most accurate results, the power grid and other design constraints must be taken into
account. Once place and route has been performed, the interconnect delays can be more
accurately modelled, and the area number will take into account the area of interconnects
between the cells. The simulation results presented may change after place and route
since each design method may be affected differently and the performance difference
between several of the designs is small.
In addition to performing more simulations, other design methods can be explored
for possible performance improvements. First, higher radix Booth recoding methods can
be used to further decrease the number of partial products formed in the partial product
generation stage. As the radix increases, more hardware resources are required as the
logic becomes more complex. For this reason, radices higher than four usually do not
result in speed improvements. However, a higher radix Booth encoding could be
beneficial if it could be implemented using a method that does not require a significant
timing and area increase over a radix 4 method.
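The trade-off described above can be made concrete with a small recoding model. The
sketch below is an illustration, not the encoder used in the MAC designs; the function
names are hypothetical. It recodes a two's complement value into radix-2^k Booth
digits, confirms that the digits reproduce the value, and lists the odd multiples of
the multiplicand that cannot be produced by shifting alone. An unsigned 32-bit operand
is assumed to be held in 33 bits so that the same recoding covers signed and unsigned
inputs.

```python
def booth_digits(x, n, k):
    """Radix-2**k Booth digits (least significant first) of the n-bit
    two's complement value x."""
    count = -(-n // k)  # ceil(n / k) digits, one partial product each
    mask = (1 << k) - 1
    digits = []
    for i in range(count):
        chunk = (x >> (k * i)) & mask
        if chunk >> (k - 1):  # top bit set: the k-bit chunk is negative
            chunk -= 1 << k
        prev = (x >> (k * i - 1)) & 1 if i else 0  # bit below the chunk
        digits.append(chunk + prev)
    return digits


def hard_multiples(k):
    """Odd multiples above 1 needed for digits up to 2**(k-1) in magnitude."""
    return list(range(3, 2 ** (k - 1) + 1, 2))


x = 0xFFFFFFFF  # largest unsigned 32-bit value, held in 33 bits
for k in (2, 3, 4):
    d = booth_digits(x, 33, k)
    assert sum(v << (k * i) for i, v in enumerate(d)) == x
    print(2 ** k, len(d), hard_multiples(k))
# 4 17 []
# 8 11 [3]
# 16 9 [3, 5, 7]
```

Moving from radix 4 to radix 8 reduces the digit count from 17 to 11 in this model, but
the multiple 3x then has to be formed with a carry-propagate addition, which is the
added hardware and delay referred to above.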
The final adder used in the multiply-accumulator design has a large impact on the
speed of the design. Once specific partial product generation and reduction methods have
been decided upon, the final adder can be customized to the input delay profile. Based on
the input delay profile, Hybrid Adder 4 can be modified to give the best performance.
The length of the ripple carry adder used for the least significant bits and the bit width of
each of the adders in the most significant bits can be adjusted to the input delay profile to
give a more efficient design.
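One way to reason about the length of the ripple carry section is sketched below. This
is a simplified model under stated assumptions, not the procedure used to build Hybrid
Adder 4: the function names, the unit carry delay, and the arrival profile are all
hypothetical. The idea is to extend the ripple section for as long as its carry output
is ready no later than the latest input of the remaining upper bits, so that the ripple
section adds no delay of its own.

```python
def ripple_carry_out(arrival, length, t_carry=1.0):
    """Time at which the carry leaves a ripple adder over bits 0..length-1."""
    carry = 0.0  # assumed: carry-in is available at time zero
    for bit in range(length):
        carry = max(carry, arrival[bit]) + t_carry
    return carry


def longest_free_ripple(arrival, t_carry=1.0):
    """Largest ripple length whose carry is ready no later than the latest
    input of the remaining upper bits."""
    best = 0
    for length in range(1, len(arrival)):
        if ripple_carry_out(arrival, length, t_carry) <= max(arrival[length:]):
            best = length
    return best


profile = [0, 1, 2, 3, 6, 6, 6, 6]   # hypothetical arrival time of each bit
print(longest_free_ripple(profile))  # 4
```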
It has been shown that the delay estimates assigned to a full adder using the Three
Dimensional Method will be technology independent. If a specific technology library is
specified, and the delay of the full adder cell to be used in the partial product reduction
stage is known, a more accurate timing estimate can be developed.
The initial simulation results indicate that the design solutions presented will give
the desired timing and area performance. More can be done to verify these results and to
possibly improve the design.
7. Conclusion
The goal of this work was to investigate algorithms and architectures used to design fast
multipliers and multiply-accumulators, and to apply these methods to create a
single-cycle multiplier-accumulator with product and multiply-accumulate outputs.
Several different architectures have been described and the simulation results of each
alternative compared. The simulation results of the proposed designs have been
compared to those of the equivalent industry standard components.
The area and worst-case delay of each design were considered as measures of
performance. To show that the proposed designs are general enough to be considered
technology independent, each design was simulated using two different technology
libraries. For comparison, a multiplier and an adder were simulated to give the desired
outputs. The results show that the area numbers are the smallest for this type of design,
but the worst-case delay is the longest. The other option using standard components uses
a multiplier-accumulator and a multiplier to give the equivalent outputs. This design
gives an increased speed at the expense of a greater amount of hardware.
The proposed designs were created to give a compromise between the two
standard designs. If the speed of the design is the most critical constraint, the proposed
MAC 8 offers a speed improvement over the fastest standard design at the expense of
increased area when using the 0.18 micron technology. If the speed only needs to be
approximately that of the standard multiply-accumulator and multiplier, MAC 20 should
be used because of the cost savings it offers with respect to the area of the design.
MAC 20 is a good compromise between the two standard designs because it offers a
speed increase over the multiplier and adder implementation while requiring less area
than the multiplier-accumulator and multiplier design.
The improvements to the standard design alternatives for a multiplier-accumulator
with two outputs were achieved by noticing the similarities between the multiply and
multiply-accumulate operations. These similarities allow a single unit to produce both
outputs more efficiently than a design that produces the outputs using standard
operations.
The multiplier-accumulators proposed used two 32-bit and one 64-bit input to
produce two 64-bit outputs. The methods proposed, however, are general enough to be
applied to inputs of any size. The inputs were allowed to be either signed or unsigned
values. If the inputs were restricted to either only signed or only unsigned values,
simplifications to the hardware could be achieved, resulting in a more efficient design.
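The input and output widths described above can be captured in a short behavioural
reference model, which is useful for checking a hardware implementation against known
results. The sketch below is not the thesis design. The function names, the single
flag that selects signed or unsigned interpretation for both operands, and the choice
to wrap the accumulate result modulo 2^64 are all assumptions.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def mac_reference(a, b, c, signed):
    """Return (product, multiply-accumulate) as 64-bit patterns.

    a and b are 32-bit operands and c is the 64-bit accumulate input.
    """
    a &= MASK32
    b &= MASK32
    if signed:
        a, b = to_signed(a, 32), to_signed(b, 32)
    product = a * b
    return product & MASK64, (product + c) & MASK64


p, m = mac_reference(0xFFFFFFFF, 2, 5, signed=True)   # (-1) * 2 and (-1) * 2 + 5
print(hex(p), hex(m))  # 0xfffffffffffffffe 0x3
p, m = mac_reference(0xFFFFFFFF, 2, 5, signed=False)  # 4294967295 * 2 (+ 5)
print(hex(p), hex(m))  # 0x1fffffffe 0x200000003
```

The same operand bits give different products in the two modes, which is why a unit
that accepts both signed and unsigned inputs needs extra hardware compared with one
restricted to a single interpretation.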




[2] D. A. Patterson and J. L. Hennessy, Computer Organization & Design: The
Hardware/Software Interface.
2nd
ed. San Francisco, CA: Morgan Kaufmann,
1998.
[3] D. A. Patterson and J. L. Hennessy, Computer Architecture: A Quantitative
Approach. San Francisco, CA: Morgan Kaufmann, 1996.
[4] K. Hwang, Computer Arithmetic: Principles, Architecture and Design. John Wiley
and Sons, Inc., 1979.
[5] C. R. Baugh and B. A. Wooley, "A Two's Complement Parallel Array
Multiplication
Algorithm,"
IEEE Transactions on Computers, vol. C-22, no. 12, pp.
1045-1047, Dec. 1973.
[6] A. A. Farooqui and V. G. Oklobdzija, "General Data-Path Organization of a MAC
unit for VLSI Implementation of DSP
Processors,"
Proc. 1998 Int'l Symp. on
Circuits and Systems, vol. 2, pp. 260-263, 1998.
[7] I. Koren, ComputerArithmeticAlgorithms.
2nd
ed. Natick, MA: A K Peters, 2002.
[8] D. Villeger and V. G. Oklobdzija, "Analysis of Booth Encoding Efficiency in
Parallel Multipliers Using Compressors for Reduction of Partial
Products," 27h
Asilomar Conference, vol. 1, pp. 781-784, 1993.
[9] W. C. Yeh and C. W. Jen, "High-Speed Booth Encoded Parallel Multiplier
Design,"
IEEE Transactions on Computers, vol. 49, no. 7, pp. 692-701, July 2000.
[10] J. Fadavi-Ardekani, "M x N Booth Encoded Multiplier Generator Using Optimized
Wallace
Trees,"
IEEE Transactions on Very Large Scale Integration (VLSI)
Systems, vol. 1, no. 2, pp. 120-125, June 1993.
[1 1] G. Goto et al., "A 54 x 54-b Regularly Structured Tree
Multiplier,"
IEEE lournal of
Solid-State Circuits, vol. 27, no. 9, pp. 1229-1235, Sept. 1992.




EC- 13, pp. 14-17, Feb. 1964.
[13] V. G. Oklobdzija, D. Villeger, and S. S. Liu, "A Method
for Speed Optimized
Partial Product Reduction and Generation of Fast Parallel Multipliers Using an
Algorithmic
Approach,"
IEEE Transactions on Computers, vol. 45, no. 3, pp.
294-
306,March 1996.
9/3/03 Department ofElectrical Engineering
128






ComputerArithmetic, pp. 33-39, June 2001.
[15] L. Dadda, "Some Schemes for Parallel
Multipliers,"
Aha Frequenza vol 34 pp
349-356, 1965.
[16] P. F. Stelling, C. U. Martel, V. G. Oklobdzija, and R. Ravi, "Optimal Circuits for
Parallel
Multipliers,"
IEEE Transactions on Computers, vol. 47, no. 3, pp. 273-285
March 1998.






ComputerArithmetic, pp. 99-106, 1997.






Arithmetic, pp. 42-49, 1995.
[19] J. M. Rabaey, A. Chandrakasan, and B. Nikolic, Digital Integrated Circuits, a
Design Perspective.
2nd
ed., Upper Saddle River, NJ: Pearson Education, Inc.,
2003.
[20] J. P. Uyemura, Introduction to VLSI Circuits and Systems. New York: John Wiley
& Sons, Inc., 2002.
[21] B. D. Lee and V. G. Oklobdzija, "Optimization and Speed Improvement Analysis
of Carry-Lookahead Adder
Structure," 24rh
Asilomar Conference on Signals,
Systems, and Computers, vol. 2, pp. 918-922, 1990.
[22] B. D. Lee and V. G. Oklobdzija, "Improved CLA scheme with Optimized
Delay,"
Journal ofVLSI Signal Processing, vol. 3, no. 4, pp. 265-274, Nov. 1 99 1 .
[23] G. Goto et al., "A 4.1 ns Compact 54 x 54-b Multiplier Utilizing Sign-Select Booth
Encoders,"
IEEE Journal of Solid-State Circuits, vol. 32, no. 11, pp. 1676-1681,
Nov. 1997.
[24] P. E. Madrid, B. Millar, and E. E. Swartzlander, Jr., "Modified Booth Algorithm for
High Radix Fixed-Point
Multiplication,"
IEEE Transactions on Very Large Scale
Integration (VLSI) Systems, vol. 1, no. 2, pp. 164-167, June 1993.
[25] A. S. Tanenbaum, Structured Computer Organization.
4th
ed. Upper Saddle River,
NJ: Prentice Hall, 1999.
9/3/03 Department of Electrical Engineering
129
[26] L. Ciminiera and P. Montuschi, "Carry-Save Multiplication Schemes Without Final
Addition,"
IEEE Transactions on Computers, vol. 45, no. 9 pp 1050-1055 SeDt
1996.
' V'
[27] S^ Turrini, "Optimal Group Distribution in Carry-Skip
Adders,"
Proceedings of the9th
Symposium on ComputerArithmetic, pp. 96-103, Sept. 1989.
[28] V. G. Oklobdzija and E. R. Barnes, "Some Optimal Schemes for ALU
Implementation in VLSI
Technology," 7*
Symposium on Computer Arithmetic,
June 1985.
[29] P. K. Chan and M. D. Schlag, "A Note on Designing Two-Level Carry-Skip
Adders,"
Journal ofVLSI Signal Processing, vol. 3, pp. 275-281, 1991.
[30] R. Brent and H. T. Kung, "A Regular Layout for Parallel
Adders,"
IEEE
Transactions on Computers, vol. C-31, no. 3, pp. 260-264, March 1982.
[31] V. G. Oklobdzija, "Design and Analysis of Fast Carry-Propagate Adder Under
Non-Equal Input Signal Arrival
Profile," 28"'
Asilomar Conference, vol. 2, pp.
1398-1401,1994.
[32] P F. Stelling and V. G. Oklobdzija, "Optimal Designs for Multipliers and Multiply-
Accumulators," 15'h
IMACS World Congress on Scientific Computation, Modeling,
andAppliedMathematics, pp. 139-1AA, 1997.
[33] V. G. Oklobdzija and E. R. Barnes, "On Implementing Addition in VLSI
Technology,"
Journal ofParallel and Distributed Computing, no. 5, pp. 716-728,
1988.
[34] S. Nakamura, "Algorithms for Iterative Array
Multiplication,"
IEEE Transactions
on Computers, vol. 35, no. 8, pp. 713-719, August 1986.
[35] H. M. Deitel and P. J. Deitel, C How to Program.
2nd
ed. Englewood Cliffs, NJ:
Prentice-Hall, Inc., 1994.
[36] S. Brown and Z. Vranesic, Fundamentals of Digital Logic with VHDL Design.
Toronto: McGraw-Hill, 2000.
[37] G. Choe and E. E. Swartzlander Jr., "Interconnection Effects in Fast
Multipliers,"
Conference Record of the Thirty-Third Asilomar Conference on Signals, Systems,
and Computers, vol. 2, Oct. 1999.
[38] M. D. Ercegovac and T. Lang, "Fast Multiplication without Carry-Propagate
Addition,"
IEEE Transactions on Computers, vol. 39, no. 11, pp. 1385-1390, Nov.
1990.
9/3/03 Department ofElectrical Engineering
130
9. Appendix A
The CD-ROM contains all of the Verilog HDL source code used to create each of
the MAC designs described.
