IEEE Compliant Double-Precision FPU and 64-bit ALU with Variable Latency Integer Divider by Williams, Ryan D
Rochester Institute of Technology 
RIT Scholar Works 
Theses 
2007 
IEEE Compliant Double-Precision FPU and 64-bit ALU with 
Variable Latency Integer Divider 
Ryan D. Williams 
Recommended Citation 
Williams, Ryan D., "IEEE Compliant Double-Precision FPU and 64-bit ALU with Variable Latency Integer 
Divider" (2007). Thesis. Rochester Institute of Technology. Accessed from 
IEEE Compliant Double-Precision FPU and 64-bit ALU 
with Variable Latency Integer Divider 
by 
Ryan D. Williams 
A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of 
Master of Science in Computer Engineering 
Approved By: 
Supervised by 
Dr. Kenneth W. Hsu 
Department of Computer Engineering 
Kate Gleason College of Engineering 
Rochester Institute of Technology 
Rochester, NY 
June 2007 
Dr. Kenneth W. Hsu 
Primary Advisor - R.I.T. Dept. of Computer Engineering 
Dr. Roy Melton 
Secondary Advisor - R.I.T. Dept. of Computer Engineering 
Dr. Dhireesha Kudithipudi 
Secondary Advisor - R.I.T. Dept. of Computer Engineering 
Thesis Release Permission Form 
Rochester Institute of Technology 
Kate Gleason College of Engineering 
Title: IEEE Compliant Double-Precision FPU and 64-Bit ALU with Variable 
Latency Integer Divider 
I, Ryan D. Williams, hereby grant permission to the Wallace Memorial Library to 
reproduce my thesis in whole or part. 
Ryan D. Williams 
Date 
Dedication
To Emileigh, without whose patience and support I never would have been able to
complete this work, and to my parents for their continued support.
Acknowledgements
I would like to thank Dr. Kenneth W. Hsu for lending me his support, knowledge, and
enthusiasm, and the members of my committee, Dr. Dhireesha Kudithipudi and Dr. Roy
Melton, for their help and suggestions.
Abstract
Together the arithmetic logic unit (ALU) and floating-point unit (FPU) perform
all of the mathematical and logic operations of computer processors. Because they are
used so prominently, they fall in the critical path of the central processing unit - often
becoming the bottleneck, or limiting factor for performance. As such, the design of a
high-speed ALU and FPU is vital to creating a processor capable of performing up to the
demanding standards of today's computer users.
In this paper, both a 64-bit ALU and a 64-bit FPU are designed based on the
reduced instruction set computer architecture. The ALU performs the four basic
mathematical operations - addition, subtraction, multiplication and division - in both
unsigned and two's complement format, basic logic operations and shifting. The division
algorithm is a novel approach, using a comparison multiples based SRT divider to create
a variable latency integer divider. The floating-point unit performs the double-precision
floating-point operations add, subtract, multiply and divide, in accordance with the IEEE
754 standard for number representation and rounding.
The ALU and FPU were implemented in VHDL, simulated in ModelSim, and
constrained and synthesized using Synopsys Design Compiler (2006.06). They were
synthesized using TSMC 0.13 µm CMOS technology. The timing, power and area
synthesis results were recorded, and, where applicable, compared to those of the
corresponding DesignWare components. The ALU synthesis reported an area of 122,215
gates, a power of 384 mW, and a delay of 2.89 ns - a frequency of 346 MHz. The FPU
synthesis reported an area of 84,440 gates, a delay of 2.82 ns and an operating frequency of
355 MHz. It has a maximum dynamic power of 153.9 mW.
Table of Contents
Abstract
List of Figures
List of Tables
Glossary
1. Introduction
2. ALU Overview
2.1 Select MIPS Instructions
3. Integer Adder/Subtractor
3.1 Parallel Prefix Adders
3.1.1 Traditional Equations
3.1.2 Ling Equations
3.2 PPA Tree Structures
3.2.1 Kogge-Stone
3.2.2 Brent-Kung
3.2.3 Han-Carlson
3.2.4 Comparison of Radix-2 PPAs
3.2.5 Radix-4 Kogge-Stone
3.3 Hybrid Adder
3.4 Subtraction Using Addition
3.5 Signed and Unsigned Addition/Subtraction
3.6 Addition/Subtraction Exceptions
3.7 Chapter Summary
4. Integer Multiplier
4.1 Multiplier Types
4.1.1 Shift and Add Multipliers
4.1.2 Tree Multipliers
4.2 Booth Encoding
4.3 Booth-Wallace Multiplier
4.4 Chapter Summary
5. Integer Division
5.1 Types of Dividers
5.1.1 Functional Iteration
5.1.2 Digit Recurrence
5.2 SRT Division by Comparison Multiples
5.2.1 QDS by Comparison Multiples
5.3 Implementation of SRT Division by Comparison Multiples
5.3.1 QDS by Comparison Multiples
5.3.2 Residual Computation
5.3.3 Quotient Conversion
5.3.4 Adaptation for Universal Integer Divider
5.4 Chapter Summary
6. ALU Synthesis Results
6.1 Adder Synthesis
6.2 Multiplier Synthesis
6.3 Divider Synthesis
6.4 ALU Synthesis
7. Floating-Point Unit Overview
7.1 The IEEE 754 Standard
7.2 FPU Instructions
8. Floating-Point Adder/Subtractor
8.1 Common FP Adder/Subtractor Optimizations
8.1.1 Use of Compound Adders
8.1.2 Parallel Paths
8.1.3 One's Complement Significand Negation
8.1.4 Leading Zero Approximation
8.1.5 Reduction of Rounding Modes
8.2 The SE FP Addition/Subtraction
8.2.1 Near Path
8.2.2 Far Path - First Cycle
8.2.3 Far Path - Second Cycle
8.3 Chapter Summary
9. Floating-Point Multiplication
9.1 FP Multiplier Using Booth-Wallace
9.2 Injection by Rounding
9.3 Exponent Calculation and Overflow Exception
9.4 Chapter Summary
10. Floating-Point Division
10.1 FP Divider using SRT by Comparison Multiples
10.2 Exponent Calculation and Overflow Exception
10.3 Chapter Summary
11. FPU Synthesis Results
11.1 FP Adder/Subtractor Synthesis
11.2 FP Multiplier Synthesis
11.3 FP Divider Synthesis
11.4 FPU Synthesis
12. Conclusion and Future Work
List of Figures
3.1 64-Bit Kogge-Stone Adder
3.2 64-Bit Kogge-Stone Adder with Carry-in
3.3 64-Bit Brent-Kung Adder
3.4 64-Bit Han-Carlson Adder
3.5 64-Bit Han-Carlson Adder with Carry-in
3.6 64-Bit Radix-4 Han-Carlson Adder
3.7 64-Bit Hybrid Han-Carlson Carry-Select Adder
3.8 Adder/Subtractor
4.1 4-Bit Unsigned and Signed Multiplication Examples
4.2 Shift and Add Multiplier Circuit
4.3 3:2 and 4:3 Wallace Trees with 7 Inputs
4.4 4:2 Compressor Stage
4.5 Booth-Wallace Multiplier
5.1 PD Plot
5.2 Comparison Multiples Generator
5.3 Comparator
5.4 BSDA - Binary Signed Digit Carry-Free Adder
5.5 Sign Detector
5.6 QDS Output Encoding
5.7 QDS Logic
5.8 Adjust Unit for Integer Divider
5.9 SRT Division Block Diagram
5.10 Operand Conditioning
6.1 HC/CS Adder Schematic from Synthesis
6.2 Universal Multiplier from Synthesis
6.3 Universal Divider from Synthesis
6.4 ALU from Synthesis
6.5 Barrel Shift Left from Synthesis
7.1 Double-Precision IEEE 754 Format
8.1 Block Diagram of Floating-Point Adder/Subtractor
8.2 High-Level SE FP Adder/Subtractor Block Diagram
8.3 SE Adder Near Path
8.4 PN Recoding Circuit
8.5 Binary Priority Encoder
8.6 Area Optimized Unary Priority Encoder
8.7 BPENC Implemented Using Recursive Structure
8.8 SE Adder Far Path - First Cycle
8.9 SE Adder Far Path - Second Cycle
8.10 CRS Generation in Naive FP Addition/Subtraction Rounding
8.11 CRS Creation for Effective Addition
8.12 CRS Creation for Effective Subtraction
8.13 CRS Circuit
8.14 Injection Based Rounding
9.1 Generic Floating-Point Multiplier Block Diagram
9.2 53-Bit Booth-Wallace Tree for FP Multiplication
9.3 CRS Generation in Naive FP Multiplication Rounding
9.4 FP Multiplier with Rounding by Injection - Mantissa Only
10.1 Generic Floating-Point Division Block Diagram
10.2 Adjust Unit for Floating-Point Divider
10.3 Floating-Point PR Sign Detector
10.4 Floating-Point Conversion and Rounding
10.5 FP Divider Rounding and Normalization Logic
10.6 FP SRT Division Block Diagram
11.1 FP Adder/Subtractor from Synthesis
11.2 FP Booth-Wallace Multiplier from Synthesis
11.3 FP Divider from Synthesis
11.4 FPU from Synthesis
List of Tables
2.1 ALU Operations
3.1 Comparison of Radix-2 PPAs for Theoretical Area and Delay
4.1 Modified Booth Encoding
5.1 BSD Conversions
5.2 Redundant Digit Sets
5.3 QDS Function
5.4 Upper and Lower Bounds for r=4, a=2 BSD Set
5.5 BSDA Outputs
5.6 QDS Output Encoding
5.7 Compression Process for 1 e {0,1,2}
6.1 Comparison of Adder Synthesis Results
7.1 Double Precision IEEE 754 FP Values
7.2 Possible Implementation of IEEE Rounding Modes
7.3 FPU Operations
8.1 Compound Adder Inputs Based on Exponents
8.2 Reduce Rounding Modes
8.3 RNRI to INJ Conversion
8.4 Truth Tables for CRS Generation
10.1 FP Compression Process for 1 e {0,1,2}
10.2 Analysis of Initial q Values with Respect to x and d
10.3 CRN Rounding Rules
10.4 CRN Rounding
Glossary
ALU Arithmetic Logic Unit
BPENC Binary priority encoder
BSD Binary Signed-Digit
CFA Carry-Free Adder
CSA Carry-Save Adder
CMOS Complementary metal oxide semiconductor
CRN Convert, Round, and Normalize
FPU Floating Point Unit
IEEE Institute of Electrical and Electronics Engineers
ISA Instruction Set Architecture
LSB Least significant bit
MIPS Millions of instructions per second
MSB Most significant bit
MUX Multiplexer
QDS Quotient Digit Selection
PENC Priority encoder
PG Propagate and Generate
PP Partial product
PPA Parallel prefix adder - an adder with a prefix stage that converts the A and B
inputs into propagate and generate signals, and uses a tree structure to
compute the carries in parallel
RI Round to infinity
RISC Reduced instruction set computer
RM Round to minus infinity
RN Round to nearest
RNE Round to nearest (Even)
RP Round to positive infinity
RZ Round to zero
TSMC Taiwan Semiconductor Manufacturing Company
ULP Unit of least precision - used interchangeably with LSB
UPENC Unary priority encoder
VHDL VHSIC Hardware Description Language
Chapter 1 Introduction
Computing pioneer John von Neumann proposed the idea of a separate block for
arithmetic and logic operations in 1945 [43]. The idea was that dividing the architecture
of a central processing unit into separate functional blocks would allow designers to
focus on the specific problems associated with the creation of a single group of
specialized circuitry, without worrying about the remainder of the architecture. This is a
vital concept in processor design, and especially in the research and implementation
presented in this paper. This paper focuses on the design of
the arithmetic logic unit (ALU) and floating-point unit (FPU), which perform integer
mathematical and logic operations, and floating-point mathematical operations, respectively.
The research of this paper was undertaken with the notion that it will be combined with
theses on an instruction control unit and a cache hierarchy to create a functional processor
based on the MIPS64 architecture.
The end goal of this thesis is the design, VHDL implementation and synthesis of
the arithmetic logic unit and floating-point unit of a 64-bit reduced instruction set
computer (RISC) processor. The RISC architecture, and specifically MIPS64, was
chosen because it requires the implementation of relatively few instructions, while
allowing for the same functionality of more complicated instruction sets. The MIPS64
instruction set was chosen because it is a well defined, fairly prevalent architecture in the
vein of the RISC ideal.
The ALU and FPU are two of the most time critical components in a processor, so
effort was made to reduce the delay as much as possible. There is always a tradeoff
between speed, area, and power, with an increase in speed usually coming at the expense
of an increase in area and power usage. All three factors are limited by the technology
used, and the type of logic implemented. The TSMC process used in this paper is highly
scalable but is not known for its speed. Most modern processors utilize silicon on
insulator (SOI) technology that allows for faster switching of transistors, and thus circuits.
Dynamic logic typically performs faster than static logic, but uses more power. The
TSMC 0.13 µm library that was utilized for synthesis with Synopsys uses only static
logic. As a result, it cannot be expected that the implemented designs will be of
comparable speed to commercial processors. Nor will they approach the speeds of
the implementations from the IEEE papers that they are based upon, if more modern
technologies were utilized in those papers [26, 44, 49].
The remainder of this paper is divided into two parts - one each for the ALU and
FPU. The first part describes the implementation of the functions of the arithmetic logic
unit. Integer adders, multipliers, and dividers are discussed in detail, with the basic logic
operations - AND, OR, XOR, etc. - not focused upon, as they are simple to implement.
The second part of the paper describes the floating-point unit. Implementations of
double-precision adder/subtractors, multipliers, and dividers that conform to the IEEE
754 standard are presented.
Part I
Arithmetic Logic Unit
This section presents the implementation of a 64-Bit arithmetic logic unit based on
the MIPS 64 instruction set. The arithmetic logic unit supports signed and unsigned
addition, subtraction, multiplication and division, as well as arithmetic and logical
shifting, and basic logic operations.
Chapter 2 ALU Overview
The arithmetic logic unit (ALU) is a fundamental component of the central
processing unit (CPU) of a computer. As the name implies, the ALU handles all of the
integer arithmetic operations - addition, subtraction, multiplication, and division - as
well as the logic operations - OR, AND, XOR, and inversion - and operand shifting
operations performed by the CPU. The ability to perform all of these operations results
in the ALU being one of the most complex circuits in the CPU.
The concept of a separate unit designed specifically for arithmetic and logic was
proposed by John von Neumann in 1945. Von Neumann stated that since a computer
must always perform the basic mathematical and logic operations, it is "reasonable that
[the computer] should contain specialized organs for these operations" [43].
Implementing a logic block designed specifically for mathematical and logic operations
has allowed designers to create many design improvements over the past 60 years, greatly
reducing the delay caused by the mathematical operations, and thus allowing for much
higher frequencies.
While one ALU is necessary for a computer, more is better. The use of multiple
ALUs allows multiple instructions to be executed in parallel, greatly increasing
throughput for the processor. This comes at the cost of area, but with ever-decreasing
transistor sizes, the area is typically sacrificed for the additional performance that having
multiple arithmetic logic units offers.
2.1 Selected MIPS Instructions
The designed ALU takes a clock signal, two 64-bit operands, and a 4-bit opcode
as inputs. The outputs of the ALU are a 128-bit signal spread across two 64-bit registers,
and a status signal that denotes errors such as overflow, underflow, and divide-by-zero.
The instructions implemented for the ALU are a subsection of the MIPS64 instruction set
architecture (ISA) that deals with 64-bit arithmetic and logic operations [28]. A complete
list of the implemented instructions can be seen in Table 2.1.
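Table 2.1: ALU Operations
Type                              OP Code   Operation
Arithmetic Operations: Unsigned   0000      A + B
Arithmetic Operations: Unsigned   0001      A - B
Arithmetic Operations: Unsigned   0010      A * B
Arithmetic Operations: Unsigned   0011      A / B
Arithmetic Operations: Signed     0100      A + B
Arithmetic Operations: Signed     0101      A - B
Arithmetic Operations: Signed     0110      A * B
Arithmetic Operations: Signed     0111      A / B
Logic Operations: Logic           1000      AND
Logic Operations: Logic           1001      OR
Logic Operations: Logic           1010      XOR
Logic Operations: Logic           1011      NOR
Logic Operations: Logic           1100      INV
Logic Operations: Shifts          1101      Arithmetic Shift Right
Logic Operations: Shifts          1110      Logical Shift Left
Logic Operations: Shifts          1111      Logical Shift Right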
As shown in Table 2.1, all four of the arithmetic operations are implemented in both
unsigned and signed (two's complement) format. Similarly, the right shift is
implemented both arithmetically - where the operand is sign extended as it is shifted -
and logically - where 0s are inserted into the bit positions that are vacated. The shift
operations are performed on operand A, with the least significant six bits of B indicating
the number of bits to shift A. All of the shift operations, as well as the addition, subtraction and
division operations, output to the lowest 64 bits of the output, with the most significant
64 bits filled with 0s. The multiplication operation produces a 128-bit output that
utilizes both registers.
The logical operations implement a series of standard logic operations on the
operands at the bit level. The AND operation produces a 1 at output bit i only if $A_i$ and
$B_i$ are both equal to 1. The OR operation produces a 1 at output bit i if $A_i$ or $B_i$ is equal
to 1. The XOR operation produces a 1 at output bit i if either $A_i$ or $B_i$ is equal to 1, but
not both. The NOR operation is the opposite of the OR operation. A 0 is inserted at the
bit position if the operation conditions are not met. All four of the operations use only
the lowest 64 output bits, filling the higher bits with 0s. The INV operation inverts each
bit of both A and B. The inversion of B is output to the most significant 64 bits, while
the inversion of A is output to the least significant 64 bits. Using a combination of these
operations, any logic operation can be implemented.
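The logic and shift behavior described above can be summarized in a small behavioral model. This is only an illustrative Python sketch, not the VHDL of this thesis; the function name `alu_logic_shift` and the choice to return a single 128-bit integer are assumptions made for the example.

```python
MASK64 = (1 << 64) - 1  # 64-bit operand mask


def alu_logic_shift(opcode: int, a: int, b: int) -> int:
    """Behavioral model of opcodes 1000..1111 from Table 2.1.

    Returns the 128-bit result as one integer (hypothetical interface):
    bits 127..64 are the upper register, bits 63..0 the lower register.
    """
    a &= MASK64
    b &= MASK64
    shamt = b & 0x3F  # least significant six bits of B give the shift amount
    if opcode == 0b1000:
        return a & b
    if opcode == 0b1001:
        return a | b
    if opcode == 0b1010:
        return a ^ b
    if opcode == 0b1011:
        return ~(a | b) & MASK64
    if opcode == 0b1100:
        # INV: ~B goes to the upper 64 bits, ~A to the lower 64 bits
        return ((~b & MASK64) << 64) | (~a & MASK64)
    if opcode == 0b1101:
        # Arithmetic shift right: reinterpret A as signed so the sign is extended
        signed_a = a - (1 << 64) if a >> 63 else a
        return (signed_a >> shamt) & MASK64
    if opcode == 0b1110:
        return (a << shamt) & MASK64  # logical shift left, zeros shifted in
    if opcode == 0b1111:
        return a >> shamt  # logical shift right, zeros shifted in
    raise ValueError("opcode is not a logic or shift operation")
```

Only INV uses the upper 64 bits here; every other operation in the sketch leaves them at zero, matching the description in the text.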
The design of the arithmetic portion of the ALU is further broken down into the
universal adder/subtractor, universal multiplier, and universal divider to simplify
implementation. The proper functionality of the ALU is then selected
using multiplexers controlled by the opcode. The three arithmetic blocks that make up
the majority of the complexity of the ALU are described in Chapters 3, 4, and 5,
respectively.
Chapter 3 Integer Addition/Subtraction
Fast addition is extremely important in many digital systems. Adders are utilized
in the critical path of address generation units, floating-point units and arithmetic logic
units. As such, the characteristics of the adder are particularly important factors in the
size, speed, and power efficiency of microprocessors.
The adder speed is critical to the throughput of a microprocessor, as it is involved
in some way or another with every mathematical operation, in addition to its function in
address generation. However, designing purely for speed can result in a design that uses
a large amount of power. While this may be acceptable in devices that use power outlets,
it can greatly reduce the battery life of portable devices. Similarly, fast designs are
typically large designs, and the size of the adder influences the overall design size.
To implement high-speed addition, parallel prefix adders are commonly used.
These adders utilize tree structures to compute the carries of each bit in parallel, reducing
the time required for sum generation to a logarithmic function rather than a linear
function. These carry trees are the critical features of parallel prefix adders both in terms
of speed and area. As such, the trade-offs of the tree structures must be determined to
select an appropriate adder for a design.
In this paper, two common types of tree structures - Kogge-Stone and
Han-Carlson - are compared in terms of theoretical delay and area. A third design is then
introduced which offers a cross between a parallel prefix adder and a carry-select adder.
The three designs are implemented in VHDL, and synthesized in standard 0.13 µm TSMC
CMOS using Synopsys Design Compiler (2006.06). The designs are compared for power,
timing and area.
3.1 Parallel Prefix Adders
3.1.1. Traditional Equations
All parallel prefix adders can be divided into three stages. The prefix generation
stage takes in the addition inputs 'A' and 'B', where $A = \{a_{n-1} a_{n-2} \ldots a_1 a_0\}$ and
$B = \{b_{n-1} b_{n-2} \ldots b_1 b_0\}$, and converts them into propagate and generate signals using the
following computations:
$$g_i(A,B) = a_i \cdot b_i$$
$$p_i(A,B) = a_i \oplus b_i$$
The carry generation stage then uses the propagate and generate signals to form
the carries. As their names imply, the propagate signals are used to transmit a carry-in
into a carry-out, while the generate signals are able to create a carry-out regardless of the
value of the carry-in. The propagates and generates can be grouped together to form a
tree. In these groups, the values represent a larger number of bits per signal, but maintain
the same properties. The method in which the grouping occurs, specifically the radix -
the number of p,g pairs per operation - and the sparseness - the number of bits skipped
for every bit used in the tree structure - are the cause of the variation in speed and area of
the adders. The radix-2 groupings are as follows:
$$P_{1:0} = p_1 \cdot p_0$$
$$G_{1:0} = g_1 + p_1 \cdot g_0$$
The recursive merge operation is expressed in one of the two following ways, where the
second (carry) equation is essentially a subsection of the first (merge) equation, with the
propagate logic removed to reduce the area.
$$(G,P) \circ (G',P') = (G + P \cdot G',\; P \cdot P')$$
$$c_{i+1} = g_i + (p_i \cdot c_i)$$
The final stage is sum generation. In this step, the sum of A and B is computed as
$$s_i = p_i \oplus c_i,$$
where $c_i = G_{i-1:0}$.
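To make the three stages concrete, the following Python sketch evaluates the traditional equations bit by bit. It is an illustration only: the carries are computed with the serial recurrence $c_{i+1} = g_i + (p_i \cdot c_i)$ rather than with a parallel tree, and the function name `prefix_add` is hypothetical.

```python
def prefix_add(a: int, b: int, n: int = 64) -> int:
    """Add two n-bit values using the generate/propagate equations."""
    # Stage 1: prefix generation, g_i = a_i AND b_i, p_i = a_i XOR b_i
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(n)]
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(n)]

    # Stage 2: carry generation, evaluated serially here for clarity
    c = [0] * (n + 1)  # c[0] is the carry-in, assumed to be 0
    for i in range(n):
        c[i + 1] = g[i] | (p[i] & c[i])

    # Stage 3: sum generation, s_i = p_i XOR c_i
    s = 0
    for i in range(n):
        s |= (p[i] ^ c[i]) << i
    return s | (c[n] << n)  # carry-out placed at bit n
```

A parallel prefix adder produces the same carries, but computes them in a logarithmic number of merge stages instead of the linear loop used in this sketch.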
3.1.2 Ling Equations
In 1981, Ling published a paper detailing how to create a faster parallel prefix
adder, regardless of the tree structure used for the design [17]. The speed increase is
accomplished by replacing the propagate signal with
$$t_i(A,B) = a_i + b_i.$$
Since $g_i$ and $t_i$ both cover the case when $a_i = b_i = 1$, the Ling equations produce
what is called a pseudo-carry, which represents a carry-in or a carry-out of a bit position,
rather than the carry-out signal that the traditional equations produce.
$$H_i = G_i + G_{i-1}$$
$$H_i = g_i + t_{i-1} \cdot H_{i-1}$$
In addition to using OR for the propagation operation, which is less costly than an
XOR operation, the pseudo-carry further reduces the cost of the tree because $t_i \cdot g_i = g_i$,
resulting in one less logic gate per operation. For example:
$$G_{3:0} = g_3 + p_3 \cdot g_2 + p_3 \cdot p_2 \cdot g_1 + p_3 \cdot p_2 \cdot p_1 \cdot g_0$$
in the traditional equations becomes
$$H_{3:0} = g_3 + g_2 + t_2 \cdot g_1 + t_2 \cdot t_1 \cdot g_0$$
in the Ling equations. The propagate equations for the merge operations are essentially
the same as the traditional equations, with
$$P_{1:0} = p_1 \cdot p_0$$
in the traditional equations becoming
$$P_{1:0} = t_0 \cdot t_{-1}$$
in the Ling equations.
This efficiency in the prefix and carry generation stages outweighs the slower sum
generation stage implemented with the Ling equations. The sum stage is more complex
due to the fact that the carry bit needs to be derived from the pseudo carry. The carry can
be derived as
$$c_i = t_{i-1} \cdot H_{i-1:0}$$
resulting in a sum [49] of
$$s_i = p_i \oplus (t_{i-1} \cdot H_{i-1:0}).$$
In spite of the increased sum delay, analysis shows that the Ling equations always
outperform the traditional equations [22].
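The relationship between the pseudo-carry and the true carry can be checked with a short Python sketch. It assumes no carry-in, so the terms with index -1 are taken as 0, and it evaluates the pseudo-carry recurrence serially instead of in a tree; the name `ling_add` is hypothetical.

```python
def ling_add(a: int, b: int, n: int = 64) -> int:
    """Add two n-bit values using Ling pseudo-carries."""
    abits = [(a >> i) & 1 for i in range(n)]
    bbits = [(b >> i) & 1 for i in range(n)]
    g = [x & y for x, y in zip(abits, bbits)]  # generate
    t = [x | y for x, y in zip(abits, bbits)]  # Ling transmit (OR)
    p = [x ^ y for x, y in zip(abits, bbits)]  # XOR, still needed for the sum

    h_prev = 0  # pseudo-carry H_{i-1}; assumed 0 below bit 0 (no carry-in)
    t_prev = 0  # t_{i-1}; assumed 0 below bit 0
    s = 0
    for i in range(n):
        c_i = t_prev & h_prev              # true carry recovered from the pseudo-carry
        s |= (p[i] ^ c_i) << i             # s_i = p_i XOR (t_{i-1} AND H_{i-1})
        h_prev = g[i] | (t_prev & h_prev)  # H_i = g_i OR (t_{i-1} AND H_{i-1})
        t_prev = t[i]
    carry_out = t_prev & h_prev            # c_n = t_{n-1} AND H_{n-1}
    return s | (carry_out << n)
```

The extra AND in the sum stage is the cost referred to above: the tree itself is simpler, but each sum bit must first convert the pseudo-carry back into a carry.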
3.2 PPA Tree Structures
The design of the carry tree is the critical component in creating a trade-off
between speed and area. There has been a large amount of research done in this area,
resulting in many different structures. The structures presented in this section are the
most commonly used, regular structures.
10
3.2.1 Kogge-Stone
In 1973, Kogge and Stone introduced a densely packed, regular tree structure that
uses the minimum possible number of stages, allowing for the highest possible theoretical
speed of any parallel prefix adder [1]. This tree - shown in Figure 3.1 without a carry-in
bit - has often been utilized due to its high speed, regular layout and limited fan-out -
each output drives at most two merge operations after buffering.
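[Legend: white square = propagate and generate; shaded dot = merge, $(G,P) \circ (G',P') = (G + P \cdot G',\; P \cdot P')$; white dot = carry, $c_{i+1} = g_i + (p_i \cdot c_i)$; shaded square = sum computation]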
Figure 3.1: 64-Bit Kogge-Stone Adder
While the Kogge-Stone adder offers high speed, it requires a large amount of area
to implement due to the large number of carry and merge operations. The merge
operations are depicted by the shaded dots in the figures, while the white dots represent
the carry operations. The white and shaded squares represent the propagate and generate
(pg) logic, and sum computations respectively.
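The Kogge-Stone merge pattern can be sketched in Python as follows. This is a behavioral illustration without a carry-in, not the synthesized design; the function name `kogge_stone_add` and the returned stage count are assumptions added for the example.

```python
def kogge_stone_add(a: int, b: int, n: int = 64):
    """Return (sum including carry-out, number of tree stages)."""
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(n)]
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(n)]

    G, P = g[:], p[:]  # group generate/propagate, initially one bit wide
    stages = 0
    d = 1              # distance to the group being merged in
    while d < n:
        new_g, new_p = G[:], P[:]
        for i in range(d, n):
            # (G,P) o (G',P') = (G + P.G', P.P') at every bit position
            new_g[i] = G[i] | (P[i] & G[i - d])
            new_p[i] = P[i] & P[i - d]
        G, P = new_g, new_p
        stages += 1
        d *= 2

    # G[i] now equals G_{i:0}, the carry out of bit i
    s = p[0]
    for i in range(1, n):
        s |= (p[i] ^ G[i - 1]) << i
    return s | (G[n - 1] << n), stages
```

For 64-bit operands the loop runs six times, which matches the $\log_2 k$ stage count usually quoted for this structure, while every stage performs an operation at nearly every bit position.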
Another problem with the Kogge-Stone adder is that it is ill adapted to handle a
carry-in. In order to add the carry-in to the tree, a specialized pg generation or carry
operation must be created to integrate the carry-in into the tree. Figure 3.2 shows the
Kogge-Stone tree adapted for a carry-in. Note the additional carry operation, and that the
second carry dot in the first stage takes three inputs; both changes are required to
propagate the carry-in. The three-input carry increases the delay of the tree by increasing the
number of gates in the logic for the dot to $(G_{1:0} \circ c_{in}) = g_1 + p_1 \cdot g_0 + p_1 \cdot p_0 \cdot c_{in}$.
[Legend: white square = propagate and generate; shaded dot = merge, $(G,P) \circ (G',P') = (G + P \cdot G',\; P \cdot P')$; white dot = carry, $c_{i+1} = g_i + (p_i \cdot c_i)$; shaded square = sum computation]
Figure 3.2: 64-Bit Kogge-Stone Adder with Carry-in
3.2.2 Brent-Kung
The Brent-Kung adder [1] was introduced in 1982 and takes the opposite
approach of the Kogge-Stone adder. Rather than attempting to create the fastest parallel
prefix adder, Brent and Kung tried to create the smallest. The Brent-Kung adder has a
sparseness of 2 - alternating columns are skipped until the last stage - trading extra tree
stages for fewer merge operations.
While the additional stages of the Brent-Kung adder preclude it from being
included in a discussion of high-speed adders, and it was thus not seriously considered for use
in the ALU designed in this paper, it is an important precursor to the Han-Carlson adder
discussed in the following section.
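[Legend: white square = propagate and generate; shaded dot = merge, $(G,P) \circ (G',P') = (G + P \cdot G',\; P \cdot P')$; white dot = carry, $c_{i+1} = g_i + (p_i \cdot c_i)$; shaded square = sum computation]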
Figure 3.3: 64-Bit Brent-Kung Adder
3.2.3 Han-Carlson
The Han-Carlson adder [19] was introduced in 1987. It combines the
philosophies behind the Kogge-Stone and Brent-Kung adders - in fact, Parhami [39]
refers to it as a hybrid Brent-Kung/Kogge-Stone adder, rather than a Han-Carlson. The
Han-Carlson adder utilizes the merge pattern of the Kogge-Stone adder, but only
performs the operations on the odd-numbered bits, giving it a sparseness of 2 like the
Brent-Kung tree. The result is a structure that requires one additional stage to generate the
carries than does the Kogge-Stone, but is significantly smaller in area. This tradeoff of
area and speed makes the Han-Carlson adder a popular choice for recent designs.
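[Legend: white square = propagate and generate; shaded dot = merge, $(G,P) \circ (G',P') = (G + P \cdot G',\; P \cdot P')$; white dot = carry, $c_{i+1} = g_i + (p_i \cdot c_i)$; shaded square = sum computation]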
Figure 3.4: 64-Bit Han-Carlson Adder
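The Han-Carlson structure without a carry-in can be sketched in Python under the usual reading of the tree: a first stage that pairs each odd bit with its even neighbor, a Kogge-Stone pattern over the odd bits only, and a final stage that fills in the even bits. The function name `han_carlson_add` and the returned stage count are assumptions for illustration, and the traditional (non-Ling) equations are used.

```python
def han_carlson_add(a: int, b: int, n: int = 64):
    """Return (sum including carry-out, number of tree stages); n a power of two."""
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(n)]
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(n)]
    G, P = g[:], p[:]
    stages = 0

    # First stage: each odd bit merges with the even bit below it
    for i in range(1, n, 2):
        G[i] = g[i] | (p[i] & g[i - 1])
        P[i] = p[i] & p[i - 1]
    stages += 1

    # Middle stages: Kogge-Stone merges on the odd bits only
    d = 2
    while d < n:
        new_g, new_p = G[:], P[:]
        for i in range(d + 1, n, 2):
            new_g[i] = G[i] | (P[i] & G[i - d])
            new_p[i] = P[i] & P[i - d]
        G, P = new_g, new_p
        stages += 1
        d *= 2

    # Final stage: even bits take the carry from the odd bit below them
    for i in range(2, n, 2):
        G[i] = g[i] | (p[i] & G[i - 1])
    stages += 1

    s = p[0]
    for i in range(1, n):
        s |= (p[i] ^ G[i - 1]) << i
    return s | (G[n - 1] << n), stages
```

For 64-bit operands this takes seven stages, one more than the Kogge-Stone sketch would, while the middle stages touch only half of the bit positions.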
Like the Kogge-Stone adder, the Han-Carlson adder can be converted to handle a
carry-in by altering the carry operation in the first stage of the tree to handle three inputs.
However, the sparseness of the Han-Carlson adder allows for a better design to handle
the carry-in than does the denser Kogge-Stone adder. To accommodate the carry-in, A
and B can be shifted left along the tree by one bit, inserting the carry-in where the least
significant bits of A and B were positioned. In this way, the sum can be computed in the
same number of stages, with the addition of a single carry dot to compute the carry-out
bit. This presents an optimal solution to the implementation of the carry-in because it
only alters the final stage of the tree and does not add any delay due to the fact that the
carry operation added to the tree is the same as the other operations implemented in the
final stage. The Han-Carlson adder modified for a carry-in can be seen in Figure 3.5.
[Legend: white square = propagate and generate; shaded dot = merge, $(G,P) \circ (G',P') = (G + P \cdot G',\; P \cdot P')$; white dot = carry, $c_{i+1} = g_i + (p_i \cdot c_i)$; shaded square = sum computation]
Figure 3.5: Han-Carlson Adder with Carry-in
3.2.4 Comparison of Radix-2 PPAs
As can be seen in Table 3.1, the Kogge-Stone adder is theoretically the fastest and
largest for any size inputs, the Brent-Kung is the smallest and slowest, and the Han-
Carlson presents a tradeoff between size and speed. This comparison only takes the
number of logic operations into consideration, ignoring the effects that interconnect
delays may have upon the adders. In the table, k represents the operand width in terms of
bits. As will be demonstrated in subsequent discussions, wiring delays, fan-outs and
capacitances can greatly affect the synthesis and actual results achieved by the adders.
Table 3.1: Comparison of Radix-2 PPAs for Theoretical Area and Delay [18]
Factor   Brent-Kung          Kogge-Stone            Han-Carlson
Delay    2*log2(k) - 2       log2(k)                log2(k) + 1
Area     2k - 2 - log2(k)    k*log2(k) - (k - 1)    (k/2)*log2(k)
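As a worked example, the expressions in Table 3.1 can be evaluated for a given operand width. The Python sketch below does this; the function name `ppa_costs` is hypothetical, delay is counted in tree stages and area in merge and carry operations, and k is assumed to be a power of two.

```python
def ppa_costs(k: int) -> dict:
    """Evaluate the Table 3.1 delay and area expressions for operand width k."""
    lg = k.bit_length() - 1  # log2(k), exact when k is a power of two
    return {
        "Brent-Kung": {"delay": 2 * lg - 2, "area": 2 * k - 2 - lg},
        "Kogge-Stone": {"delay": lg, "area": k * lg - (k - 1)},
        "Han-Carlson": {"delay": lg + 1, "area": (k // 2) * lg},
    }
```

For k = 64 these expressions give delays of 10, 6 and 7 stages and areas of 120, 321 and 192 operations for the Brent-Kung, Kogge-Stone and Han-Carlson trees, respectively.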
3.2.5 Radix-4 Han-Carlson Adder
As mentioned previously, increasing the radix of the tree can reduce the number
of stages in an adder tree. The radix is equal to the maximum number of
inputs that the merge operations have. As such, an increase in the radix causes an
increase in the amount of logic required to process the merge operations, so radices are
typically limited to 2, 4, or 8 so that the amount of logic in each operation does not
become overly large.
As can be seen in Figure 3.6, the increase in radix also increases the fan-out of
each merge operation. This increased load can result in a slower operation time
depending on the type of logic and process used to create the adder. It is reported in [49]
that using an SOI process, radix-4 adders perform better than radix-2 adders; however
Synopsys synthesis results conducted in TSMC 0.13 µm static CMOS report that the
radix-2 implementations perform at higher speeds, in spite of having three more stages of
carry and merge operations.
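[Legend: white square = propagate and generate; shaded dot = radix-4 merge, $G_{3:0} = g_3 + p_3 \cdot g_2 + p_3 \cdot p_2 \cdot g_1 + p_3 \cdot p_2 \cdot p_1 \cdot g_0$ and $P_{3:0} = p_3 \cdot p_2 \cdot p_1 \cdot p_0$; white dot = radix-4 carry, $G_{3:0} = g_3 + p_3 \cdot g_2 + p_3 \cdot p_2 \cdot g_1 + p_3 \cdot p_2 \cdot p_1 \cdot g_0$; shaded square = sum computation]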
Figure 3.6: 64-Bit Radix-4 Han-Carlson Adder
3.3 Hybrid Adder
Recently, adder designers have focused on combining parallel prefix adders with
carry select adders to produce faster designs [22, 49]. Carry select adders generate the
possible sums for each bit in parallel with the computation of the carries. The carries are
then used to select the correct sums. When this approach is combined with the fast carry
generation of the parallel prefix adders, it can result in a design that is faster than a
parallel prefix adder.
The speed increase is dependent upon the parallel prefix structure used. If a
Kogge-Stone design is utilized, the sum select multiplexers replace the sum generation
stage of the adder, resulting in a similar speed, with a larger area. If a Han-Carlson
structure is utilized, the sum-select multiplexers can replace both the final stage of merge
operations, and the sum generation, using each of the n/2 carries generated to select two
sum bits. This results in an adder that has the same number of logic levels as the Kogge-
Stone carry-select adder, but is significantly smaller.
To implement the Han-Carlson carry-select adder, the Ling Han-Carlson tree is
used, with the $\log_2 k + 1$ stage removed. The sum pre-computation [49] for the odd bits
and 0 is
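$$s_i^0 = a_i \oplus b_i$$
$$s_i^1 = a_i \oplus b_i \oplus (a_{i-1} + b_{i-1})$$
and the even sum pre-computes are computed as
$$s_i^0 = a_i \oplus b_i \oplus (a_{i-1} \cdot b_{i-1})$$
$$s_i^1 = a_i \oplus b_i \oplus [a_{i-1} \cdot b_{i-1} + (a_{i-1} + b_{i-1}) \cdot (a_{i-2} + b_{i-2})]$$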
where the superscript signifies the even carry-out values used for sum selection. Each
carry-out is used to select both the subsequent odd sum and the following even sum.
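The selection scheme can be illustrated with a Python sketch. Several points are assumptions rather than details confirmed by the text: the select signal is taken to be the Ling pseudo-carry of an even bit j, which chooses the sums of bits j+1 and j+2; the pseudo-carries are produced by a serial recurrence instead of the Han-Carlson tree; there is no carry-in; and the name `carry_select_add` is hypothetical.

```python
def carry_select_add(a: int, b: int, n: int = 64) -> int:
    """Add using pre-computed sum pairs selected by even-bit pseudo-carries."""
    abits = [(a >> i) & 1 for i in range(n)]
    bbits = [(b >> i) & 1 for i in range(n)]
    g = [x & y for x, y in zip(abits, bbits)]
    t = [x | y for x, y in zip(abits, bbits)]
    p = [x ^ y for x, y in zip(abits, bbits)]

    # Pseudo-carries H_i = g_i OR (t_{i-1} AND H_{i-1}); serial stand-in for the tree
    H, h, t_prev = [], 0, 0
    for i in range(n):
        h = g[i] | (t_prev & h)
        H.append(h)
        t_prev = t[i]

    s = p[0]  # bit 0 has no carry-in in this sketch
    for i in range(1, n):
        if i % 2 == 1:
            # Odd bit: both candidates depend only on bit i-1, selected by H_{i-1}
            s0 = p[i]
            s1 = p[i] ^ t[i - 1]
            sel = H[i - 1]
        else:
            # Even bit: candidates also absorb bit i-1, selected by H_{i-2}
            s0 = p[i] ^ g[i - 1]
            s1 = p[i] ^ (g[i - 1] | (t[i - 1] & t[i - 2]))
            sel = H[i - 2]
        s |= (s1 if sel else s0) << i
    carry_out = t[n - 1] & H[n - 1]
    return s | (carry_out << n)
```

Because both candidate sums for every bit are available before the pseudo-carries arrive, the last step reduces to a two-input multiplexer per bit, which is how the final merge stage and the sum stage can be replaced.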
After comparing the synthesis results for the size, speed, and power of the various
adders discussed in this chapter, this hybrid Han-Carlson carry-select design was selected
to implement the 64-bit adder/subtractor, as it has the highest speed, while still having a
smaller area than the Kogge-Stone adder. The synthesis results can be viewed in Chapter
6, where the ALU synthesis results are discussed.
Figure 3.7: 64-Bit Hybrid Han-Carlson Carry-Select Adder
3.4 Subtraction Using Addition
As any elementary school child knows, addition and subtraction are essentially
the same operation. Subtraction merely inverts the sign of the second operand. Using
this knowledge, it becomes simple to implement integer subtraction using any type
of integer adder.
To alter the adder to be able to handle both integer addition and subtraction, logic
must be established to perform two's complement conversion on the B operand if
subtraction is selected, and to leave B untouched if addition is selected. This can be done
in one of two ways: using an inverter and a multiplexer on each bit, or an XOR gate with
one bit tied to a control, and the other tied to the corresponding bit of B. As a multiplexer
and an XOR gate have similar delays using the static CMOS in the TSMC process, the
XOR gate method is used for the inversion to remove the additional logic necessary for
the alternative method.
The inversion of the B operand results in the one's complement of B. However,
modern computers typically operate using two's complement format for signed numbers.
To convert the one's complement value into a two's complement value, 1 must be added
to the least significant bit. This can be accomplished during the addition/subtraction by
adjusting the carry-in bit accordingly. Since the MIPS instructions do not include a
carry-in, the carry-in can be tied directly to the control bit that selects between addition
and subtraction. In this method the control bit will be '1' if subtraction is selected,
causing the inversion of the B operand, and the addition of a 1 to the least significant bit
via the carry-in - completing the conversion for subtraction.
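A minimal behavioural sketch of this arrangement follows. The function name, the `width` parameter, and the returned carry-out are illustrative assumptions; the point is only that one control bit drives both the per-bit XOR on B and the carry-in.

```python
def add_sub(a, b, sub, width=64):
    """Add or subtract using one adder: XOR B with the control bit, tie carry-in to it."""
    mask = (1 << width) - 1
    b_xor = (b ^ (mask if sub else 0)) & mask       # one XOR gate per bit of B
    total = (a & mask) + b_xor + (1 if sub else 0)  # carry-in tied to the control bit
    return total & mask, total >> width             # (sum, carry-out)
```

For example, `add_sub(7, 5, 1)` returns a sum of 2, and `add_sub(5, 7, 1, width=8)` returns 254, the 8-bit two's complement pattern for -2.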
3.5 Signed and Unsigned Addition/Subtraction
Neither addition nor subtraction requires any modifications to switch between
signed and unsigned operation when two's complement format is used for the signed
representation. The difference lies in the way that the results are interpreted, with the
MSB representing the sign in signed format, and the most significant magnitude bit in
unsigned format.
3.6 Addition/Subtraction Exceptions
Integer addition and subtraction also lend themselves nicely to error detection.
According to the MIPS ISA, the only exceptions that need to be detected for either are
overflows for unsigned operations. These can be detected directly from the carry-out of
the adder, with no additional logic required.
[Figure: 64-bit adder with B and Cin inputs and a 64-bit Sum output]
Figure 3.8: Adder/Subtractor
3.7 Summary
This chapter provides an overview for some of the basic types of parallel prefix
adders, presenting strengths and weaknesses for each. An adder combining the strengths
of the Han-Carlson parallel prefix structure, with those of the carry-select adder is then
introduced as a higher speed addition architecture. This Han-Carlson/Carry-select hybrid
is then leveraged into performing both unsigned and signed addition and subtraction
through a simple modification of the B operand, and the carry-in. The carry-out of the
adder is used to detect the overflow condition, as specified in the MIPS ISA.
Chapter 4 Integer Multiplication
Multiplication, like addition, is a heavily used operation that figures prominently
in many types of operations. Among many other uses, multiplication is used in signal
processing and scientific applications. It is also a common basis for division. With such
wide-ranging applications, multiplication has been a heavily researched area of digital
design, with many different types of multipliers and optimizations proposed.
Integer multipliers can be implemented in a variety of ways. Typical
implementations range from the slow, but small, shift and add sequential multipliers to
the larger and faster tree multipliers. Regardless of the implementation used, integer
multiplication is the only mathematical operation explored in this paper that creates a 2n-
bit output from n-bit inputs. In this chapter shift and add, and tree multipliers are
described, and a universal (signed/unsigned) integer multiplier is designed for use in the
ALU.
4.1 Multiplier Types
4.1.1 Shift and Add
Shift and add multipliers are the simplest and easiest to implement types of
multipliers. The basic unsigned radix-2 shift and add multiplier is the binary equivalent
of the multiplication method taught in grade school. The A input is multiplied by each
bit of B in turn to create each of the n partial products, where n is the number of bits of B.
Each partial product is then shifted left by the bit position of B from which it was created,
with the shifted bits and the unused significant bits filled with 0's. The partial products
are added together to form the product. In this implementation, one partial product is
generated and added to the running product per clock cycle. To expand this
implementation to handle signed notation, each partial product must be sign extended to
2n bits, rather than grounding the unused bits. 4-bit examples of both signed and
unsigned are presented in Figure 4.1.
                     Unsigned                                  Signed
Operand A              1 0 1 1                                 1 0 1 1
Operand B            x 0 1 1 0                               x 0 1 1 0
                       0 0 0 0   PP1 (added first clk)   0 0 0 0 0 0 0 0
                     1 0 1 1 x   PP2 (added second clk)  1 1 1 1 0 1 1 x
                   1 0 1 1 x x   PP3 (added third clk)   1 1 1 0 1 1 x x
               + 0 0 0 0 x x x   PP4 (added fourth clk)  + 0 0 0 0 0 x x x
Product        0 1 0 0 0 0 1 0                           1 1 1 0 0 0 1 0
Figure 4.1: 4-Bit Unsigned and Signed Multiplication Examples
This implementation is fairly straightforward, requiring only an n-bit adder, a 2n-bit
register and a multiplexer. The multiplier is initially loaded into the least significant
n bits of the register. The least significant bit of the register controls the multiplexer,
selecting the multiplicand if 1 and ground if 0. The most significant n bits of the partial
product are then added to the output of the multiplexer, and the resulting sum and carry-out
are loaded into the most significant n+1 bits of the partial product register. Bits n-1
down to 1 of the partial product are loaded into the lowest n-1 bits of the register,
resulting in a shift of the multiplier. This method takes n clock cycles, but allows for a
high frequency and small area.
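The register behaviour described above can be sketched as follows for the unsigned case. The function name and the 0-based bit indexing are assumptions made for illustration; each loop pass models one clock cycle.

```python
def shift_add_multiply(multiplicand, multiplier, n=4):
    """Unsigned radix-2 shift-and-add multiply using a single 2n-bit register."""
    mask = (1 << n) - 1
    reg = multiplier & mask                      # multiplier starts in the low n bits
    for _ in range(n):                           # one partial product per clock
        addend = (multiplicand & mask) if (reg & 1) else 0   # multiplexer
        upper = (reg >> n) + addend              # n-bit add, n+1 bits with carry-out
        # Sum and carry-out go to the top n+1 bits; bits n-1..1 move down one place.
        reg = (upper << (n - 1)) | ((reg & mask) >> 1)
    return reg
```

With the operands of Figure 4.1, `shift_add_multiply(0b1011, 0b0110)` returns 66, that is 11 times 6.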
Figure 4.2: Shift and Add Multiplier Circuit [4]
Although the radix-2 implementation of the shift and add multiplier is the
simplest type of multiplier to implement, it is unsuitable for high speed circuits due to the
large latency that the large number of required clock cycles creates. This problem can be
partially rectified by increasing the effective radix of the multiplier by recoding the
operands to reduce the number of partial products. The most common method of doing
this is through Booth encoding, which is discussed in detail later in this chapter.
4.1.2 Tree Multipliers
In a tree multiplier, the partial products are computed and added in parallel. The
tree multipliers reduce the number of partial products by a set factor each stage, and can
execute multiple stages per clock cycle. Typical trees utilize 3:2 or 4:2 compressors,
which reduce the number of partial products to 2/3 and 1/2, respectively, at each stage of
the tree other than the first. While these circuits result in multipliers with greatly reduced
latency, the circuitry required to compute the partial products in parallel, and to perform
the tree compressions results in a much larger circuit.
The traditional trees - Wallace [46] and Dadda [6] - were introduced in the mid
1960's, and used a series of full adders and half adders to sum the partial products. Since
these circuits consist of three inputs and two outputs, the tree structures were irregular.
To make the structures more regular, compressor structures with a larger number of
inputs were introduced. In this paper, a simple 4:2 compressor is utilized. As can be
seen in Figure 4.3, this structure allows for a more regular tree structure by cutting the
number of partial products in half at each stage of the tree. Since the outputs of two
partial product summations become the inputs of the following summation, the 4:2
tree is simpler to map and lay out than the 3:2 tree.
[Figure, two panels: "Wallace Tree with 3:2 Compressors" (inputs PP7 to PP1, k-bit CSA stages, final k-bit CPA) and "Wallace Tree with 4:2 Compressors" (inputs PP7 to PP1 and 0, k-bit 4:2 CSA stages, final k-bit CPA)]
Figure 4.3: 3:2 and 4:2 Wallace Trees with 7 inputs
Each of the carry-save adders (CSA) in the 3:2 Wallace tree is a series of full
adders, producing the carry and sum outputs. The 4:2 CSAs in the 4:2 Wallace tree also
produce the carry and sum outputs for each bit using a 4:2 compressor. As the
compressor unit makes up the majority of the Wallace tree multiplier, it has been a
widely researched topic [2, 20, 24, 29]. The logic for the Nagamatsu [29] 4:2 compressor
is shown in Figure 4.4.
Figure 4.4: 4:2 Compressor Stage [3, 29]
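The reduction idea can be illustrated with a word-level sketch. It is a simplification: the 4:2 stage is modelled here as two cascaded 3:2 carry-save adders rather than the Nagamatsu cell, the function names are illustrative, and rows are padded with zeros so that every level works on groups of four.

```python
def csa_3_2(a, b, c):
    """3:2 carry-save adder on whole words: a + b + c == s + carry."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry

def csa_4_2(a, b, c, d):
    """4:2 compression modelled as two cascaded 3:2 stages."""
    s1, c1 = csa_3_2(a, b, c)
    return csa_3_2(s1, c1, d)

def tree_sum_4_2(operands):
    """Reduce non-negative rows to two with 4:2 levels, then do one carry-propagate add."""
    rows = list(operands)
    levels = 0
    while len(rows) > 2:
        while len(rows) % 4:
            rows.append(0)                 # pad an incomplete group with zeros
        reduced = []
        for i in range(0, len(rows), 4):
            reduced.extend(csa_4_2(*rows[i:i + 4]))
        rows = reduced
        levels += 1
    return sum(rows), levels               # final add plays the role of the CPA
```

In this padding model seven rows need two levels, 32 rows need four, and 33 rows need five, which is one way to see why a single extra partial product can cost a tree stage.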
4.2 Booth Encoding
Independent of the type of multiplier used, the simplest way to speed up the
multiplier is to reduce the number of partial products that must be added together. This
can be done through recoding methods, the most famous of which was developed by
Booth in 1951 [3].
Booth deduced that by analyzing one of the two operands for strings of 1's and
0's, it could be recoded so that it might produce fewer partial products. Booth did this by
analyzing groups of two consecutive bits. If the bits were the same value, the partial
product was a string of 0's. If the most significant of the two bits was 0 and the other bit
was 1, it was noted that a string of ones was ending and the other operand (shifted
accordingly) was used for the partial product. If the most significant of the two bits was
1 and the other bit was 0, a string of 1's was starting and the two's complement of the
other operand was used as the partial product. To ensure that the least significant bit is
not seen as a continuation of a string of 1's, a 0 is appended to the end of the analyzed
operand. This method has the potential to remove a large number of partial products if
the 0's are merely shifted over, and not added. However, the reduction of partial
products is entirely dependent upon the operand being analyzed, and if the operand
alternates 0's and 1's, this method provides no benefit.
To create a recoding method that is always beneficial, Booth encoding can be
extended to analyze three consecutive bits, with the index of the middle bit incremented
by two so that the middle of the group is only analyzed once. This method forces the
multiplier into radix-4 format, and reduces the number of partial products by half.
Table 4.1: Modified-Booth Encoding [4]
b_{i+1}  b_i  b_{i-1}  PP Value  Status
0        0    0        0         Midstring of 0s
0        0    1        A         End string of 1s
0        1    0        A         Single 1
0        1    1        2*A       End string of 1s
1        0    0        -2*A      Begin string of 1s
1        0    1        -A        Single 0
1        1    0        -A        Begin string of 1s
1        1    1        0         Midstring of 1s
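The recoding in the table can be written as a short sketch. It assumes an n-bit two's complement operand with n even; the function name is illustrative. Each overlapping 3-bit group maps to the digit -2*b_{i+1} + b_i + b_{i-1}, which reproduces the PP values of the table as multiples of A.

```python
def booth_radix4_digits(b, n):
    """Recode an n-bit two's complement operand (n even) into n/2 digits in {-2..2}."""
    ext = (b & ((1 << n) - 1)) << 1          # append the 0 below the LSB
    digits = []
    for i in range(0, n, 2):                 # groups overlap by one bit
        group = (ext >> i) & 0b111           # bits b[i+1], b[i], b[i-1]
        hi, mid, lo = group >> 2, (group >> 1) & 1, group & 1
        digits.append(-2 * hi + mid + lo)
    return digits                            # least significant digit first
```

For the multiplier 0110 of Figure 4.1, `booth_radix4_digits(0b0110, 4)` gives `[-2, 2]`, and -2 + 2*4 = 6, so two partial products replace four.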
It should be noted that Booth encoding is inherently signed. To produce an
unsigned product, the analyzed operand must be buffered with a 0 at the most significant
bit, potentially creating the need for an additional partial product [15]. This bit must be a
sign extension of the MSB if the same circuit is to be used for signed multiplication as
well. Thus, in order to achieve a universal multiplier, the mode and MSB are ANDed
together for each operand and concatenated with the operand to form an n+1 bit extended
operand.
4.3 Booth-Wallace Multiplier
In order to produce a high-speed multiplier, it is common to combine modified
Booth encoding with a Wallace tree structure [3, 25, 27]. The modified Booth encoder
reduces the number of partial products from 64 to 32 (65 to 33 when the universal
multiplier is implemented) and the Wallace tree is used to add the partial products
quickly.
It is this Booth-Wallace approach that is used for the universal multiplier
implemented for the ALU in this paper. To prepare the operands for the Booth encoding,
B is sign extended two bits and a 0 is appended. A is sign extended one bit. The sign
extensions are necessary to switch between signed and unsigned multiplication, while the
appended 0 on B prevents the initial group from being marked as a continuation of a
string of 1s. The modified A and B values are thereby denoted as
A_ext = SE & A
B_ext = SE & SE & B & 0
where the '&' symbol represents a concatenation operation. The sign extensions here are
not true sign extensions, but are instead defined as SE = MSB · S (the AND of the MSB
and S), where S is 1 if a signed operation is being performed, and 0 otherwise. This forces
the MSB of the extended operand to 0 for unsigned operations, and extends the sign for
signed operations.
The 67-bit extended B operand is then divided into groups of 3 bits, with the end
bits of the group overlapping, to create 33 partial products. The additional partial product
creates a non-symmetrical tree, which results in an additional stage. The outputs of the
tree are added together to form the 128-bit product. No exceptions must be handled for
integer multiplication. Figure 4.5 shows the structure of the Booth-Wallace Multiplier
that was designed in this paper.
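A behavioural sketch of the operand preparation and recoding follows. It is not the hardware structure: partial products are simply accumulated in a loop instead of a compressor tree, and the function name and return convention are assumptions for illustration.

```python
def universal_multiply(a, b, signed, n=64):
    """Signed/unsigned multiply via SE = MSB AND S extension and radix-4 Booth recoding."""
    mask = (1 << n) - 1
    a &= mask
    b &= mask
    s = 1 if signed else 0
    se_a = (a >> (n - 1)) & s                # SE for A
    se_b = (b >> (n - 1)) & s                # SE for B
    a_val = a - (se_a << n)                  # value of A_ext = SE & A (n+1 bits)
    # B_ext = SE & SE & B & 0, n+3 bits in total (67 bits for n = 64)
    b_ext = (se_b << (n + 2)) | (se_b << (n + 1)) | (b << 1)
    product = 0
    count = 0
    for i in range(0, n + 2, 2):             # overlapping 3-bit groups
        group = (b_ext >> i) & 0b111
        digit = -2 * (group >> 2) + ((group >> 1) & 1) + (group & 1)
        product += (digit * a_val) << i      # partial product weighted by 4^(i/2)
        count += 1
    return product & ((1 << (2 * n)) - 1), count
```

With the default n = 64 the loop runs 33 times, matching the 33 partial products described above; `universal_multiply(0xFB, 6, True, n=8)` returns the 16-bit pattern 0xFFE2, which is -30.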
[Figure: partial products reduced through the compressor tree and summed by a 128-Bit HCCS adder to form the product]
Figure 4.5: Booth-Wallace Multiplier
4.4 Summary
In this chapter, a universal 64-bit by 64-bit integer multiplier was designed. The
multiplier uses a Wallace tree to provide a regular structure to add the partial products in
parallel. A modified Booth encoder is used to reduce the number of partial products by
half, increasing the radix from two to four. To allow for both signed and unsigned
operation, A and B must be extended by a bit, introducing an additional partial product,
bringing the final tally to 33. Integer multiplication does not produce any observable
exceptions.
Chapter 5 Integer Division
Division is the least used of the four basic arithmetic operations. As such, it has
been the least researched of the four operations, and remains the most difficult operation to
implement efficiently. Although many different high-speed, high-precision mathematical
algorithms have been developed in the past 50 years, few are suitable for implementation
in VLSI, and those that are require a large clock cycle per iteration or many clock cycles
to implement. As such, division remains the limiting factor in ALU performance.
5.1 Types of Dividers
Oberman and Flynn [35] group division algorithms into five classes: digit
recurrence, functional iteration, very high radix, table lookup and variable latency. Digit
recurrence and functional iteration are the predominant types of dividers, although the
implementations of each typically incorporate aspects of the other three groups.
5.1.1 Functional Iteration
Functional iteration is the division equivalent of a tree multiplier. Division by
functional iteration uses a multiplier for the fundamental operator. As a result, the
quotient is converged upon quadratically - the number of correct bits is doubled each
iteration, much as the number of partial products is halved at each stage of a tree
multiplier. While the functional
iteration method requires fewer clock cycles than a linear approach, the use of a
multiplier as the fundamental operation greatly increases the required clock period, as the
multiplier has a significantly longer delay than the adders that serve as the fundamental
operators of the digit recurrence algorithm.
The basis for the functional iteration method can be seen by expressing the
quotient (Q) as the product of the dividend (x) and reciprocal of the divisor (d):
$$Q = \frac{x}{d} = x \cdot \frac{1}{d}$$
As the dividend is provided, the difficulty in this method is computing the reciprocal of
the divisor. It is the method of obtaining the reciprocal that varies between functional
iteration algorithms. The most common types of functional dividers are the
Newton-Raphson and Goldschmidt algorithms.
The Newton-Raphson divider gets its name from the fact that it is based on the
Newton-Raphson convergence algorithm
$$X_{i+1} = X_i - \frac{f(X_i)}{f'(X_i)}$$
which is combined with the priming function
$$f(X) = \frac{1}{X} - d = 0$$
to form the approximation equation
$$X_{i+1} = X_i (2 - d X_i).$$
The corresponding error is given by the equation
$$\epsilon_{i+1} = \epsilon_i^2 (d).$$
From the equations, it can be seen that each iteration of the Newton-Raphson divider
requires two multiplications and a subtraction.
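A small exact-arithmetic sketch shows the quadratic convergence. The function name, the use of Python's `Fraction` type, and the choice of starting estimate are assumptions for illustration; a hardware divider would use a fixed-width multiplier and typically a table-based seed.

```python
from fractions import Fraction

def newton_raphson_reciprocal(d, x0, iterations):
    """Iterate X_{i+1} = X_i * (2 - d * X_i) and record the error 1/d - X_i."""
    d = Fraction(d)
    x = Fraction(x0)
    errors = [1 / d - x]
    for _ in range(iterations):
        x = x * (2 - d * x)          # two multiplications and one subtraction
        errors.append(1 / d - x)
    return x, errors
```

For d = 3/4 and a starting estimate of 1, the error sequence begins 1/3, 1/12, 1/192, and each term equals d times the square of the previous one.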
The Goldschmidt algorithm uses a Maclaurin series expansion to approximate the
reciprocal of the divisor to similar effect. The mathematics behind this approach can be
seen in [1].
5.1.2 Digit Recurrence
Digit recurrence is the simplest and most widely implemented class of division
algorithms. It utilizes a shift and subtract approach that is similar to the shift and add
multiplication algorithm introduced in Chapter 4. A fixed number of quotient bits are
retired per iteration, with the number of bits retired dependent upon the radix of the
implementation. The basic recurrence algorithms are also similar to shift and add
multiplication in that they require a small amount of area to implement, but the large
number of required clock cycles results in a large latency. The radix of the algorithms
can be increased to reduce the overall latency, at the expense of requiring a larger area
and longer clock cycle. In fact, the possible variations in digit recurrence division are
numerous enough to warrant entire textbooks on the subject. Ercegovac and Lang [10]
have created such a textbook, which presents a comprehensive analysis of digit
recurrence algorithms, and is recommended for those who require further information on
the topic.
Digit recurrence division is defined by the equations:
$$x = q \cdot d + rem$$
$$|rem| < |d| \cdot ulp$$
$$sign(rem) = sign(x)$$
where the dividend x, and the divisor d are the operands, q is the quotient, and rem is the
remainder. In integer division, the unit of least precision (ULP) is 1, as there are no
fractional bits. Digit recurrence assumes that the operands x and d are normalized at the
time of implementation.
Using the number of bits to be retired per iteration (b), and the number of bits in
the operands (n = 64 for integer division), the radix (r), and number of iterations (k) are
defined as
$$r = 2^b, \qquad k = \frac{n}{b}.$$
The following equation is the critical recurrence used in each iteration of the
algorithm:
$$P_{j+1} = r P_j - d \, q_{j+1}$$
where $P_j$ is the partial remainder (residual) at iteration j, and $q_{j+1}$ are the quotient bits
as determined by the quotient digit selection algorithm. As digit recurrence is a
subtractive algorithm, it is necessary to use the initial condition of
$$r P_0 = x.$$
The quotient digit selection (QDS) algorithm can be implemented in a number of
ways, so for simplicity's sake, it is written as
$$q_{j+1} = SEL(r P_j, d).$$
Using the QDS values, it is possible to determine the quotient after the jth iteration using
the following equation. It should be noted that the final quotient occurs when j = k.
$$Q[j] = \sum_{i=1}^{j} q_i r^{-i}$$
The remainder is defined as
$$rem = \begin{cases} P_k & \text{if } P_k \ge 0 \\ P_k + d \cdot ulp & \text{if } P_k < 0 \end{cases}$$
Digit recurrence algorithms can be divided into two groups: restoring and
non-restoring. Restoring division supports only non-negative values of $q_j$, and must revert to
the last state before continuing with the next iteration if a negative $q_j$ is formed. This
reversion takes time and logic. In restoring division, the multiples of d act as the
separation point for the values of $q_j$. In radix-2 format, $q_{j+1}$ is defined as
$$q_{j+1} = \begin{cases} 0, & \text{if } rP_j < d \\ 1, & \text{if } rP_j \ge d \end{cases}$$
To eliminate the reversion step, non-restoring division utilizes both positive and
negative values of $q_j$. By using both positive and negative values for $q_j$, one of the
separation points can be 0, simplifying the QDS selection.
$$q_{j+1} = \begin{cases} \bar{1}, & \text{if } rP_j < 0 \\ 1, & \text{if } rP_j \ge 0 \end{cases}$$
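The two selection rules can be compared with a bit-serial integer sketch. It assumes an unsigned n-bit dividend and a positive divisor, shifts one dividend bit into the residual per iteration instead of working on normalized fractions, and uses illustrative function names. The non-restoring version uses digits -1 and +1 and needs one final correction step when the last residual is negative.

```python
def restoring_divide(x, d, n):
    """Radix-2 restoring division: quotient digits in {0, 1}."""
    rem, q = 0, 0
    for i in range(n - 1, -1, -1):
        rem = (rem << 1) | ((x >> i) & 1)    # shift in the next dividend bit
        if rem >= d:                         # digit 1: keep the subtraction
            rem -= d
            q = (q << 1) | 1
        else:                                # digit 0: trial subtraction is undone
            q <<= 1
    return q, rem

def nonrestoring_divide(x, d, n):
    """Radix-2 non-restoring division: quotient digits in {-1, +1}."""
    rem, q = 0, 0
    for i in range(n - 1, -1, -1):
        shifted = 2 * rem + ((x >> i) & 1)
        if shifted >= 0:                     # digit +1: subtract the divisor
            rem, q = shifted - d, 2 * q + 1
        else:                                # digit -1: add the divisor
            rem, q = shifted + d, 2 * q - 1
    if rem < 0:                              # final correction of a negative residual
        rem, q = rem + d, q - 1
    return q, rem
```

Both functions return the same quotient and remainder, for example (14, 2) for 100 divided by 7 with n = 8; only the per-iteration decision differs.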
The quotient digit selection algorithms can be further reduced through the use of
redundant digit sets. Redundant digit sets have more digits than the digit set that they
represent. So a redundant digit set for a radix-2 circuit would have 3 or more numbers.
Redundancy allows for more separation points, allowing for simpler QDS functions. For
simplicity, binary signed digit (BSD) representation is typically used for the redundant
digit set. BSD is a symmetrical digit set that is defined by the following rules:
$$\{\bar{a}, \overline{a-1}, \ldots, \bar{1}, 0, 1, \ldots, a-1, a\} \text{ where } \bar{n} = -n, \text{ and } \frac{r}{2} \le a \le r-1$$
Numbers expressed in BSD format are represented by two arrays of bits. One
array represents the positive bits, while the other represents the negative bits. BSD
format can be converted into binary by subtracting the negative bits from the positive
bits. Using this methodology, each bit of the BSD number can be decoded as seen in
Table 5.1. The sign of the BSD number is decided by the most significant bit where the
positive and negative values differ, with a larger negative value resulting in an underflow,
and creating an automatic sign extension. BSD format also has the advantage of being
invertible in zero time - without logic - simply by switching the positive and negative
arrays.
Table 5.1: BSD Conversion
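The two-array representation can be sketched with a pair of integers acting as bit vectors. The function names are illustrative, and the per-digit decoding shown (positive bit minus negative bit, so that a digit with both bits set decodes to 0) is an assumption consistent with the subtraction rule in the text.

```python
def bsd_to_int(pos, neg):
    """Convert BSD to binary: subtract the negative bit vector from the positive one."""
    return pos - neg

def bsd_negate(pos, neg):
    """Negation costs no logic: the two bit vectors simply swap roles."""
    return neg, pos

def bsd_digits(pos, neg, width):
    """Digits in {-1, 0, 1}, most significant first."""
    return [((pos >> i) & 1) - ((neg >> i) & 1) for i in range(width - 1, -1, -1)]
```

For example, positive bits 0101 and negative bits 0010 give digits 0, 1, -1, 1, which is 4 - 2 + 1 = 3; swapping the arrays gives -3.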
Each signed digit set is measured by a redundancy factor $\rho$, where
$$\frac{1}{2} < \rho = \frac{a}{r-1} \le 1.$$
A greater redundancy results in more overlap in the regions covered by the digits,
allowing for greater leniency in selecting a separation point. It also results in more
complex logic to select the appropriate value.
Some sample SD sets for radix-2, radix-4 and radix-8 implementations can be
seen in the following table. In this table, maximally redundant means that a = r-1, and
represents the largest SD set, while minimally redundant means that a = 0.5*r, and
represents the minimal SD set.
Table 5.2: Redundant Digit Sets [31]
r  a  SD set                                          ρ    Type
2  1  {-1,0,1}                                        1    Maximally and minimally redundant
4  2  {-2,-1,0,1,2}                                   2/3  Minimally redundant
4  3  {-3,-2,-1,0,1,2,3}                              1    Maximally redundant
4  4  {-4,-3,-2,-1,0,1,2,3,4}                         4/3  Over redundant
8  3  {-3,-2,-1,0,1,2,3}                              3/7  Non-redundant
8  4  {-4,-3,-2,-1,0,1,2,3,4}                         4/7  Minimally redundant
8  7  {-7,-6,-5,-4,-3,-2,-1,0,1,2,3,4,5,6,7}          1    Maximally redundant
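The ρ column and the type labels can be reproduced from r and a alone. The sketch below is illustrative: the function names and the exact label strings are assumptions, and the classification rules are read off the table (a below r/2 is non-redundant, a equal to r/2 is minimal, a equal to r-1 is maximal, a above r-1 is over-redundant).

```python
from fractions import Fraction

def redundancy_factor(r, a):
    """rho = a / (r - 1) for the symmetric digit set {-a, ..., a} in radix r."""
    return Fraction(a, r - 1)

def classify(r, a):
    """Label a digit set the way the table does."""
    if 2 * a < r:
        return "non-redundant"
    if a > r - 1:
        return "over-redundant"
    labels = []
    if a == r - 1:
        labels.append("maximally")
    if 2 * a == r:
        labels.append("minimally")
    return (" and ".join(labels) + " redundant") if labels else "redundant"
```

For the radix-4 set with a = 2, `redundancy_factor(4, 2)` is 2/3 and `classify(4, 2)` is "minimally redundant".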
5.2 SRT Division by Comparison Multiples
The most common implementation of digit recurrence division in modern
processors is SRT division, named for Sweeney [5], Robertson [40], and Tocher [45], who
independently developed a method of performing non-restoring division around the same
time. They proposed the inclusion of 0 into the signed digit set. This allows some
iterations to be reduced to shifting, reducing latency in asynchronous implementations.
$$q_{j+1} = \begin{cases} \bar{1}, & \text{if } rP_j < -d \\ 0, & \text{if } -d \le rP_j < d \\ 1, & \text{if } rP_j \ge d \end{cases}$$
SRT division is easily expandable to higher-radix division by implementing more
complex QDS functions. There are three basic methods of implementing the QDS
function. The selection intervals method uses Robertson's diagram to graphically
determine the precision required for $rP_j$ and d to choose $q_{j+1}$ from a lookup table. The
selection constants method is the most commonly implemented QDS algorithm. It also
uses a lookup table, but forms separation points based on values calculated using r
and $\rho$, rather than graphically. Information about these methods can be found in [10].
The third method, comparison multiples, is used in the divider implemented in this paper,
and will be described in the following section.
5.2.1 QDS by Comparison Multiples
Although most processors implement SRT division using the selection constants
method, a comparison multiples approach is used for the dividers in this paper. With this
approach, the residuals are compared to multiples of d to determine separation points for
the values of $q_{j+1}$. The residuals are compared to each of the separation points in parallel
to determine which region they fall into and thus which value of $q_{j+1}$ to output.
Ercegovac and Lang [9] report that this method is complicated due to the need for
multiples and comparisons; however it is shown in [31] that this method is a reasonable
alternative to selection constants.
Each value of $q_{j+1}$ is bounded by an upper and lower limit as shown in the
equation
$$d(l - \rho) \le rP_j \le d(l + \rho)$$
where $l \in \{\bar{a}, \overline{a-1}, \ldots, \bar{1}, 0, 1, \ldots, a-1, a\}$. When redundant digit sets like BSD are used, the
upper bound of $q_{j+1} = l$ overlaps the lower bound of $q_{j+1} = l+1$. It is in this overlap
region that a multiple of d is selected to act as a separation point between $l$ and $l+1$.
[Figure: P-D plot marking the bound $d(l+\rho)$ and the overlap region]
Figure 5.1: PD Plot for $q_{j+1} = l$ and $q_{j+1} = l+1$
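Using the bounds for $rP_j$, $q_{j+1}$ can be defined as:
Table 5.3: QDS Function
q_{j+1}           Condition
$\bar{a}$           $rP_j \le -d(a - \rho)$
$\overline{a-1}$    $-d(a - 1 + \rho) \le rP_j \le -d(a - 1 - \rho)$
...
$\bar{1}$           $-d(1 + \rho) \le rP_j \le -d(1 - \rho)$
0                 $-d\rho \le rP_j \le d\rho$
1                 $d(1 - \rho) \le rP_j \le d(1 + \rho)$
...
a-1               $d(a - 1 - \rho) \le rP_j \le d(a - 1 + \rho)$
a                 $d(a - \rho) \le rP_j$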
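The overlap between values allows for more error in defining the separation
points, allowing for truncated multiples to be used for the comparisons. The truncated
multiples, reduced to c bits, allow for smaller and faster circuits for the determination of
the comparison multiples. Defining $M_k$ as $M_k = A_k d$, where $A_k$ is a rational number,
and using the symmetry $M_{-k+1} = -M_k$, where $k \in \{1, \ldots, a-1, a\}$, the number of
comparison multiples can be reduced to a, and $q_{j+1}$ can be rewritten as
$$q_{j+1} = \begin{cases}
\bar{a}, & \text{if } \{rP_j\} < 0 \text{ and } \{rP_j\}_c < -\{M_a\}_c \\
\bar{l}, & \text{if } \{rP_j\} < 0 \text{ and } -\{M_{l+1}\}_c \le \{rP_j\}_c < -\{M_l\}_c \\
\bar{1}, & \text{if } \{rP_j\} < 0 \text{ and } -\{M_2\}_c \le \{rP_j\}_c < -\{M_1\}_c \\
0, & \text{if } \{rP_j\} < 0 \text{ and } -\{M_1\}_c \le \{rP_j\}_c \\
0, & \text{if } \{rP_j\} \ge 0 \text{ and } \{rP_j\}_c < \{M_1\}_c \\
1, & \text{if } \{rP_j\} \ge 0 \text{ and } \{M_1\}_c \le \{rP_j\}_c < \{M_2\}_c \\
l, & \text{if } \{rP_j\} \ge 0 \text{ and } \{M_l\}_c \le \{rP_j\}_c < \{M_{l+1}\}_c \\
a, & \text{if } \{rP_j\} \ge 0 \text{ and } \{M_a\}_c \le \{rP_j\}_c
\end{cases}$$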
The number of bits required for the truncated comparison multiples was
determined in [7] as follows. It should be noted that this method was derived for a
floating-point multiplier normalized so that $\frac{1}{2} \le d < 1$. However, these equations can be
easily adapted to integer division by assuming that d has been shifted so that the most
significant non-zero bit is the first bit of a fraction.
Truncating a BSD number X to t bits results in the inequality
$$\{X\}_t - 2^{-t} < X < \{X\}_t + 2^{-t},$$
which when combined with the inequalities used in the QDS function changes to
$$M_l - 2^{-c} < \{M_l\}_c \le \{rP_j\}_c < rP_j + 2^{-c}.$$
Removing the middle statements in the inequality results in
$$M_l - 2^{-c+1} < rP_j.$$
As $q_{j+1} = l$, adding the recurrence subtraction $-ld$ results in an inequality involving the
subsequent residual.
$$M_l - 2^{-c+1} - ld < P_{j+1}$$
In order for the division to converge,
$$-\rho d \le P_{j+1} \le \rho d,$$
so
$$-\rho d \le M_l - 2^{-c+1} - ld$$
must be met. Reworking this equation, the lower bound on $M_l$ is derived as
$$d(l - \rho) + 2^{-c+1} \le M_l.$$
Similarly, the upper bound on $M_{l+1}$ is derived as
$$M_{l+1} \le d(l + \rho) - 2^{-c}.$$
Combining the two bounds on $M_l$ gives the inequality:
$$d(l - \rho) + 2^{-c+1} \le M_l \le d(l + \rho - 1) - 2^{-c},$$
which must be maintained for the truncation to be a viable substitution for the complete
number. Eliminating the middle condition and reworking the equation gives
$$2^c \ge \frac{3}{d(2\rho - 1)}.$$
Since $d \ge \frac{1}{2}$ in the shifted form, the final equation for minimum number of bits in the
truncated comparison multiples is based on the redundancy factor as
$$2^c \ge \frac{6}{2\rho - 1}.$$
The QDS function must also allow for the maximum possible overflow of the additions.
This number of bits is defined as e, and is based on the radix and number of integer bits
(i) coming into the QDS function as $e = \log_2 r + i$.
5.3 Implementation of SRT Division by Comparison Multiples
5.3.1 QDS by Comparison Multiples
Before beginning the design of the hardware for the divider, the radix and digit set
that will be used must be determined. As the Booth-Wallace multiplier in Chapter 4 was
implemented in radix-4 format, the radix for the divider has also been chosen to be 4. To
minimize the amount of hardware, the BSD set was chosen such that a = 2, resulting in a
redundancy factor of $\rho = \frac{2}{3}$. Using the redundancy factor, the truncation precisions are
determined to be c = 5 and e = 2, as the number of integer bits into the QDS function is
said to be 0 after the theoretical shifting.
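The two precisions can be recomputed from the bounds stated earlier, namely that 2^c must be at least 6/(2ρ - 1) and that e = log2 r + i. The sketch below assumes ρ is greater than 1/2 and that the radix is a power of two; the function names are illustrative.

```python
from fractions import Fraction

def truncation_bits(rho):
    """Smallest c with 2**c >= 6 / (2*rho - 1); requires rho > 1/2."""
    bound = Fraction(6) / (2 * Fraction(rho) - 1)
    c = 0
    while 2 ** c < bound:
        c += 1
    return c

def overflow_bits(radix, integer_bits):
    """e = log2(radix) + i for a power-of-two radix."""
    return (radix.bit_length() - 1) + integer_bits
```

With ρ = 2/3 the bound is 18, so `truncation_bits(Fraction(2, 3))` gives 5, and `overflow_bits(4, 0)` gives 2, in agreement with c = 5 and e = 2.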
Table 5.4: Upper and Lower Bounds for r=4, a=2 BSD Set [35]
l    L_l = d(l-ρ)   U_l = d(l+ρ)
-2   -(8/3)*d       -(4/3)*d
-1   -(5/3)*d       -(1/3)*d
0    -(2/3)*d       (2/3)*d
1    (1/3)*d        (5/3)*d
2    (4/3)*d        (8/3)*d
With the chosen digit set, the upper ($U_l$) and lower ($L_l$) bounds of $l$ are given
according to the above table. Analyzing the lower bound of $l$ with the upper bound of
$l-1$, it can be seen that due to symmetry, only two separation points - and thus two
comparison multiples - need to be derived. The multiples lie in the overlap between $l = 0$
and $l = 1$, and between $l = 1$ and $l = 2$.
In order for the comparison multiples method to be feasible, the comparison
multiples must be constructed strictly of powers of two of d. This is due to the fact that
powers of two can be implemented via simple shifting of the d value. If the comparison
multiples cannot be constructed of powers of two, the derivation of the multiples
becomes too costly for hardware implementation. For this reason the comparison
multiples are chosen as $M_1 = 0.5d$ and $M_2 = 1.5d$. $M_2$ can be defined as
$M_2 = d + 0.5d$ or $M_2 = 2d - 0.5d$. These cases must be analyzed to ensure that they
conform to the inequalities described previously.
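Before truncation is considered, it is easy to confirm that 0.5d and 1.5d fall inside the overlap regions of the r = 4, a = 2 digit set. The sketch below works in units of d with exact fractions; the function names are illustrative, and it checks only the untruncated bounds, not the 5-bit truncated conditions analyzed next.

```python
from fractions import Fraction

def digit_bounds(l, rho):
    """(L_l, U_l) in units of d: d(l - rho) and d(l + rho)."""
    return l - rho, l + rho

def overlap_region(l, rho):
    """Region shared by digits l and l+1: from L_{l+1} up to U_l, in units of d."""
    lower_next, _ = digit_bounds(l + 1, rho)
    _, upper_this = digit_bounds(l, rho)
    return lower_next, upper_this

def is_valid_multiple(coefficient, l, rho=Fraction(2, 3)):
    """True if coefficient * d can separate digit l from digit l+1 (no truncation)."""
    low, high = overlap_region(l, rho)
    return low <= coefficient <= high
```

For ρ = 2/3 the overlaps are [1/3, 2/3] and [4/3, 5/3], so 1/2 and 3/2 are both acceptable coefficients, while a coefficient of 1 would not separate 0 from 1.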
Firstly, the binary (for d) and BSD (for the residuals) truncation conditions must
be described.
$$0.5d - 2^{-5} < \{0.5d\}_5 \le 0.5d$$
$$d - 2^{-5} < \{d\}_5 \le d$$
$$2d - 2^{-5} < \{2d\}_5 \le 2d$$
$$4P_j - 2^{-5} < \{4P_j\}_5 < 4P_j + 2^{-5}$$
$M_1 = 0.5d$
It can be shown that $M_1 = 0.5d$ conforms to the inequalities by analyzing the two cases
that $M_1$ borders: $q_{j+1} = 0$ and $q_{j+1} = 1$.
For $q_{j+1} = 0$:
$$\min(\{rP_j\}_5) = 4P_j - 2^{-5} < \{4P_j\}_5 < \{M_1\}_5 \le \frac{1}{2}d = \max(\{M_1\}_5)$$
Since $q_{j+1} = 0$, $P_{j+1} = rP_j$, and the previous equation can be rewritten as
$$P_{j+1} = 4P_j < \frac{1}{2}d + 2^{-5}.$$
When $P_{j+1}$ is bounded for convergence as $-\rho d \le P_{j+1} \le \rho d$, d can be derived as
$$d \ge \frac{3}{16}.$$
Since $d \ge \frac{1}{2}$, this condition is always met, so $M_1 = 0.5d$ can be used for $q_{j+1} = 0$.
For $q_{j+1} = 1$:
$$\min(\{M_1\}_5) = \frac{1}{2}d - 2^{-5} < \{M_1\}_5 \le \{4P_j\}_5 < 4P_j + 2^{-5} = \max(\{rP_j\}_5)$$
Since $q_{j+1} = 1$, $P_{j+1} = rP_j - d$, and the previous equation can be rewritten as
$$-\frac{1}{2}d - 2^{-4} < 4P_j - d = P_{j+1}.$$
When $P_{j+1}$ is bounded for convergence, d can be derived as
$$d \ge \frac{3}{8}.$$
Both $q_{j+1} = 0$ and $q_{j+1} = 1$ are always correct, so $M_1 = \frac{1}{2}d$ is a valid comparison
multiple.
$M_2 = d + 0.5d$
$M_2$ is the border between: $q_{j+1} = 1$ and $q_{j+1} = 2$.
For $q_{j+1} = 1$:
$$\min(\{rP_j\}_5) = 4P_j - 2^{-5} < \{4P_j\}_5 < \{M_2\}_5 \le d + \frac{1}{2}d = \max(\{M_2\}_5)$$
Since $q_{j+1} = 1$, $P_{j+1} = rP_j - d$, and the previous equation can be rewritten as
$$P_{j+1} = 4P_j - d < \frac{1}{2}d + 2^{-5}.$$
When $P_{j+1}$ is bounded for convergence, d can be derived as
$$d \ge \frac{3}{16}.$$
For $q_{j+1} = 2$:
$$\min(\{M_2\}_5) = d - 2^{-5} + \frac{1}{2}d - 2^{-5} < \{M_2\}_5 \le \{4P_j\}_5 < 4P_j + 2^{-5} = \max(\{rP_j\}_5)$$
Since $q_{j+1} = 2$, $P_{j+1} = rP_j - 2d$, and the previous equation can be rewritten as
$$-\frac{1}{2}d - 2^{-4} - 2^{-5} < 4P_j - 2d = P_{j+1}.$$
When $P_{j+1}$ is bounded for convergence, d can be derived as
$$d \ge \frac{9}{16}.$$
As $d \ge \frac{1}{2}$, $q_{j+1} = 2$ is not correct for all cases and $M_2 = d + \frac{1}{2}d$ is not a valid
comparison multiple.
$M_2 = 2d - 0.5d$
$M_2$ is the border between: $q_{j+1} = 1$ and $q_{j+1} = 2$.
For $q_{j+1} = 1$:
$$\min(\{rP_j\}_5) = 4P_j - 2^{-5} < \{4P_j\}_5 < \{M_2\}_5 \le 2d - \frac{1}{2}d = \max(\{M_2\}_5)$$
Since $q_{j+1} = 1$, $P_{j+1} = rP_j - d$, and the previous equation can be rewritten as
$$P_{j+1} = 4P_j - d < \frac{1}{2}d + 2^{-5}.$$
When $P_{j+1}$ is bounded for convergence, d can be derived as
$$d \ge \frac{3}{16}.$$
For $q_{j+1} = 2$:
$$\min(\{M_2\}_5) = 2d - 2^{-5} - \left(\frac{1}{2}d - 2^{-5}\right) < \{M_2\}_5 \le \{4P_j\}_5 < 4P_j + 2^{-5} = \max(\{rP_j\}_5)$$
Since $q_{j+1} = 2$, $P_{j+1} = rP_j - 2d$, and the previous equation can be rewritten as
$$-\frac{1}{2}d - 2^{-5} < 4P_j - 2d = P_{j+1}.$$
When $P_{j+1}$ is bounded for convergence, d can be derived as
$$d \ge \frac{3}{16}.$$
Since $d \ge \frac{1}{2}$, both $q_{j+1} = 1$ and $q_{j+1} = 2$ are always correct, so $M_2 = 2d - \frac{1}{2}d$ is a valid
comparison multiple.
The implementation of the comparison multiples generator is rather
straightforward. The multiples 0.5d and 2d are derived in zero time by shifting d using wires. A
subtractor constructed from inverters and a Han-Carlson adder is used to subtract 0.5d
from 2d. Since the values for d will not change from iteration to iteration, the generation
of the comparison multiples is a one-time delay per division.
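The role of the two multiples in the recurrence can be illustrated with an exact-arithmetic sketch of radix-4 SRT division. It is a simplification: comparisons use the full-precision values 0.5d and 2d - 0.5d rather than 5-bit truncations, the residual is a plain fraction rather than a BSD pair, and the function name is illustrative. The operands are assumed to be normalized fractions with 1/2 <= d < 1 and |x| <= (8/3)d.

```python
from fractions import Fraction

def srt4_divide(x, d, iterations):
    """Radix-4 SRT division with digit set {-2..2} and comparison multiples M1, M2."""
    x, d = Fraction(x), Fraction(d)
    rho = Fraction(2, 3)
    m1 = d / 2                       # M1 = 0.5d (a wire shift in hardware)
    m2 = 2 * d - d / 2               # M2 = 2d - 0.5d
    p = x / 4                        # initial condition r * P_0 = x
    assert abs(p) <= rho * d
    q_sum = Fraction(0)
    digits = []
    for j in range(1, iterations + 1):
        rp = 4 * p
        magnitude = 0 if abs(rp) < m1 else (1 if abs(rp) < m2 else 2)
        q = -magnitude if rp < 0 else magnitude
        p = rp - q * d               # P_{j+1} = r * P_j - d * q_{j+1}
        assert abs(p) <= rho * d     # convergence bound on the residual
        digits.append(q)
        q_sum += Fraction(q, 4 ** j)
    return 4 * q_sum, p, digits      # quotient estimate of x / d, last residual, digits
```

For x = 3/4 and d = 5/8 the digits begin 1, 1, -1, 1 and the estimate approaches 6/5; after k iterations the error is at most (2/3) * 4^(1-k).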
[Figure: generator with truncated inputs $\{\frac{1}{2}d\}_5 = 0.01xxx$ and $\{2d\}_5 = 1.xxxxx$ producing the two comparison multiples]
Figure 5.2: Comparison Multiples Generator
To compare the binary comparison multiples with the BSD residual P, a three
input comparator using a carry-free BSD adder (BSDA) was constructed,
[Figure: comparator taking the truncated positive and negative residual vectors and the truncated comparison multiple as inputs]
Figure 5.3: Comparator
where the BSDA is constructed using full-adders, with the A input being
the positive residual vector, the B input being the negative residual vector, and the
carry-ins being the XNOR of the comparison multiple and the sign of $q_{j+1}$, where the BSDA
produces the equation
$$a_i^+ - a_i^- + b_i = 2 s_{i+1}^+ - s_i^-$$
Table 5.5: BSDA Outputs [31]
Entries are $(s_{i+1}^+, s_i^-)$.
b_i \ a_i    -1       0        1
0            (0,1)    (0,0)    (1,1)
1            (0,0)    (1,1)    (1,0)
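The digit rule in Table 5.5 can be checked with a short sketch. The function names below are hypothetical, digits are stored least significant first, and the example adds a BSD number to a binary number; it illustrates only the carry-free digit rule, not the thesis hardware or its carry-in handling.

```python
def bsda_digit(a, b):
    """One BSDA position: returns (s_plus_next, s_minus) with a + b == 2*s_plus_next - s_minus."""
    total = a + b                          # a in {-1, 0, 1}, b in {0, 1} -> total in {-1, 0, 1, 2}
    s_plus_next = 1 if total >= 1 else 0   # positive digit handed to the next position
    s_minus = 2 * s_plus_next - total      # negative digit kept at this position
    return s_plus_next, s_minus


def bsd_plus_binary(a_digits, b_bits):
    """Add a BSD number and a binary number (LSB first); each position is independent."""
    pairs = [bsda_digit(a, b) for a, b in zip(a_digits, b_bits)]
    s_plus = [0] + [p for p, _ in pairs]   # s_plus_next lands one position higher
    s_minus = [m for _, m in pairs] + [0]
    return [p - m for p, m in zip(s_plus, s_minus)]


def value(digits):
    """Numeric value of an LSB-first digit list."""
    return sum(d * 2 ** i for i, d in enumerate(digits))
```

Because every output pair depends only on the inputs at its own position, no carry ripples through the word, which is the property the comparator relies on.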
In this manner, the BSD residuals can be compared with the binary comparison
multiples, without the necessity of propagating a carry-bit. As a result, the comparison
can be completed in less time than with a typical binary subtractor.
Figure 5.4: BSDA - Binary Signed Digit Carry-Free Adder
The comparator therefore produces
$$rP_j^+ - rP_j^- - M_i = rP_j - M_i = P_i^+ - P_i^-$$
The signs of the comparator outputs must then be determined so that they may be
used to find the proper value of $q_{j+1}$. To determine the sign of the comparator outputs,
the inverse of the carry-out of $P_i^+ - P_i^-$ was calculated. To perform the calculation, a
Han-Carlson adder with the sum stage removed was used.
Figure 5.5: Sign Detector
The final stage of the QDS circuit was the encoding of the 2-bit unsigned $q_{j+1}$
output using the signs of the two comparators and the sign of $q_{j+1}$, which were computed
in the previous stage. The QDS output was determined as seen in Table 5.6.
Table 5.6: QDS Output Encoding [31]
S_M1   S_M2   Sign(q_{j+1})   Mag(q_{j+1}) = q1 q0   |q_{j+1}|
1      1      1               11                     2
1      0      1               10                     1
0      0      1               00                     0
1      1      0               00                     0
0      1      0               10                     1
0      0      0               11                     2
Analyzing the table, it can be seen that $q_{j+1}$ can be generated using simple
XNOR gates.
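The XNOR encoding can be sketched as follows. The assignment of the first comparator sign to q1 and the second to q0 is an assumption read off Table 5.6, and the function names are hypothetical.

```python
def xnor(a, b):
    """1 when both bits are equal, 0 otherwise."""
    return 1 - (a ^ b)


def qds_encode(s_m1, s_m2, sign_q):
    """Return the magnitude bits (q1, q0) from the two comparator signs and the sign of q."""
    q1 = xnor(s_m1, sign_q)
    q0 = xnor(s_m2, sign_q)
    return q1, q0


# Magnitude represented by each (q1, q0) code in Table 5.6.
MAGNITUDE = {(0, 0): 0, (1, 0): 1, (1, 1): 2}
```

Each output bit depends on one comparator sign and the quotient sign, so the coder adds only a single gate delay.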
Figure 5.6: QDS Output Encoding
Figure 5.7: QDS Logic
Combining these components yields the complete QDS circuit shown in Figure 5.7. It consists of
the comparison multiple generator, which creates the two comparison multiples. Two
comparators compare the residual, shifted by the radix, with the comparison multiples.
Two sign detectors then determine if the comparison multiple was larger or smaller than
the shifted residual. The final outputs are determined from the sign of the QDS output -
calculated in the previous cycle, as described in the next section - and the output of the sign
detectors.
5.3.2 Residual Computation
With the quotient digit selection function implemented, it is possible to implement
the recurrence function $P_{j+1} = rP_j - d \cdot q_{j+1}$ that is at the heart of the SRT division. The
implementation of $rP_j = 4P_j$ is a simple matter of appending two grounds to the end of
the residuals. The two bit shift acts as a multiplication by four, and takes zero time.
Similarly, since $q_{j+1} \in \{-2,-1,0,1,2\}$, whose nonzero magnitudes are all powers of two, the
values for $d \cdot q_{j+1}$ can all be calculated by shifting d. To obtain a value of 2d, one ground
is appended, and to obtain d, no action is taken. If $q_{j+1}$ is zero, $P_{j+1} = rP_j - d \cdot 0 = rP_j$,
so the d component can be ignored entirely.
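The recurrence can be illustrated with a small exact-arithmetic sketch. It is not the thesis circuit: it compares the full shifted residual against 0.5d and 2d - 0.5d rather than 5-bit BSD estimates, it starts from P_0 = x/4 (an assumption that keeps the first residual within the convergence bound), and the function names are hypothetical.

```python
from fractions import Fraction


def select_digit(shifted, d):
    """Pick q in {-2..2} by comparing 4*P with the multiples 0.5d and 2d - 0.5d."""
    m1, m2 = d / 2, 2 * d - d / 2
    mag = abs(shifted)
    q = 2 if mag >= m2 else 1 if mag >= m1 else 0
    return q if shifted >= 0 else -q


def srt_radix4(x, d, iterations):
    """Run the radix-4 recurrence P[j+1] = 4*P[j] - q[j+1]*d; returns (digits, final residual)."""
    d = Fraction(d)
    p = Fraction(x) / 4                      # assumed initial residual
    digits = []
    for _ in range(iterations):
        shifted = 4 * p                      # in hardware: append two zero bits
        q = select_digit(shifted, d)
        p = shifted - q * d                  # q*d is 0, d, or d shifted by one bit
        assert abs(p) <= Fraction(2, 3) * d  # convergence bound with rho = 2/3
        digits.append(q)
    return digits, p
```

With this initial residual, the quotient is recovered as the sum of q_j * 4^(1-j), and the remaining error after n iterations is 4^(1-n) * P_n / d.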
Examining the recurrence equation reveals the familiar formula
$BSD - BIN = BSD^+ - BSD^- - BIN$. This is the exact same equation used for the QDS
comparators and, in fact, the same comparator hardware - extended to support the bit
lengths of the shifted residual - can be used to calculate the possible residuals for the next
cycle.
With the possible residuals for the following cycle computed for all $q_{j+1}$ values,
the comparator outputs must be reduced from 69 bits to 66, and the signs of each must be
calculated. The sign detector is an extended version of the sign detector used for the
QDS function. It simply uses a Han-Carlson tree to determine the carry-out of
$BSD^+ - BSD^-$ and inverts it.
The bit reduction utilizes the convergence condition $-\rho d < P_{j+1} < \rho d$ to perform
the compression. Given that $\rho = \tfrac{2}{3}$, $\tfrac{1}{2} \le d < 1$, and $-\tfrac{2}{3} < P_{j+1} < \tfrac{2}{3}$, this means that the
integer portion of the comparator outputs can be removed without affecting the value of
the residuals. Furthermore, [31] has demonstrated that the most significant two bits of
the compressed residual can be defined as seen in Table 5.7.
Table 5.7: Compression Process for $i \in \{0,1,2\}$
P;<68:6.tt>=ABCXY Pi<S5:S4>= Jty
000XY XY
001 .01 .11
0O1.1X IX
ooloi li
ooTjx .IX
01I0T n
oiTjx IX
011.01 "n
mux JX
iITdT n
mix IX
111.01 ."IT
Tii.ix jx
The encoding of the 65th and 66th residual bits can be seen in the following figure.
The lowest 64 bits of the residual remain the same.
Figure 5.8: Adjust Unit for Integer Divider
With the possible residuals calculated, they can be stored in registers for the following cycle.
The $q_{j+1}$ value can then be used to select the residual $P_{j+1}$ and the sign of the following
quotient digit using multiplexers.
5.3.3 Quotient Conversion
The final component in the division algorithm is the conversion of the quotient
from a BSD number to a binary number. The simple method of converting the numbers
is to wait until all iterations of the division have been completed, and then to subtract the
negative quotient array from the positive array. However, rounding in floating-point
division requires the converted quotient, so using this method could necessitate an
additional clock cycle. For this reason, an on-the-fly rounding was described by
Ercegovac and Lang in [11] and expanded upon in [9] and [10]. Since the integer divider
is derived from the floating-point divider, it also utilizes this on-the-fly rounding method.
This method utilizes two (n+2)-bit shift registers, Q and Qm, where Qm is always
equal to Q - 1. Each clock, the quotient bits are appended to the shift registers as:
$$Q[j+1] = \begin{cases} (Q[j],\ q_{j+1}) & \text{if } q_{j+1} \ge 0 \\ (Qm[j],\ (r - |q_{j+1}|)) & \text{if } q_{j+1} < 0 \end{cases}$$
$$Qm[j+1] = \begin{cases} (Q[j],\ q_{j+1} - 1) & \text{if } q_{j+1} > 0 \\ (Qm[j],\ (r - |q_{j+1}| - 1)) & \text{if } q_{j+1} \le 0 \end{cases}$$
where (a,b) is a concatenation, Q[0] = 0 and Qm[0] is undefined. Due to the assumption
that $\tfrac{1}{2} \le d < 1$ and $\tfrac{1}{2} \le x < 1$, $q_1$ is always positive, so it is not necessary to initialize Qm.
To automate the divider to detect when the division has finished, Q[0] can be initialized
to all 1's except for the LSB, which should be 0. Using this method, the proper number
of iterations can be detected when the 0 bit is shifted out of the Q register. At this point,
the Q register can be used to determine the quotient. If the MSB of Q is 1, the output is
Q(n+1:2), otherwise it is Q(n:1). Once again, this normalization shows the algorithm's
roots in floating-point format.
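The register updates can be sketched with plain integers, where concatenating a radix-4 digit is a multiplication by the radix followed by an addition. This is a hedged illustration only: the function name is hypothetical, and the sentinel-bit termination and the final normalization described above are not modeled.

```python
def on_the_fly_convert(digits, radix=4):
    """Convert signed digits (most significant first) to binary while keeping Qm == Q - 1."""
    q, qm = 0, None                      # Q[0] = 0, Qm[0] undefined
    for digit in digits:
        if digit <= 0 and qm is None:
            raise ValueError("first digit must be positive, since Qm[0] is undefined")
        if digit > 0:
            new_q = q * radix + digit
            new_qm = q * radix + (digit - 1)
        elif digit == 0:
            new_q = q * radix
            new_qm = qm * radix + (radix - 1)
        else:
            new_q = qm * radix + (radix - abs(digit))
            new_qm = qm * radix + (radix - abs(digit) - 1)
        q, qm = new_q, new_qm            # both registers update from the old values
    return q, qm
```

No carry-propagating subtraction is needed at the end, because a negative digit simply borrows from the already available Qm register.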
Figure 5.9: SRT Division Block Diagram
5.3.4 Adaptation for Universal Integer Division
Unlike the mantissas used in the floating-point division algorithm, the number of
bits following the leading one of the integer dividend and divisor is not constant. For this
reason, the operands must be conditioned before they are used in the SRT divider.
Firstly, as the divider handles both signed and unsigned division, a method for
handling both must be implemented. The simplest method, and the method implemented
here, is to take the two's complement of the operand if the MSB is 1 and signed division
is chosen. This results in all values being positive, so all values can be handled in the
same manner.
To shift the operands so that they are 64-bits with the MSB = 1, the operands - or
their two's complements - are fed into leading-zeros detectors. This determines the
number of leading zeros of x (LZX) and d (LZD), and thus the number of bits that the
operand must be shifted left. x and LZX, and d and LZD, are then fed into barrel shifters,
and x and d are shifted so that there are no leading zeros. The shifted values are then put
into registers, and used as the x and d inputs to the SRT divider on the following clock.
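The conditioning step can be sketched as follows, assuming 64-bit unsigned magnitudes. The function names are hypothetical, and Python's `int.bit_length` stands in for the leading-zeros detector.

```python
WIDTH = 64


def leading_zeros(value, width=WIDTH):
    """Number of zero bits above the leading one (equals width when value is 0)."""
    return width - value.bit_length()


def normalize(value, width=WIDTH):
    """Shift left until the MSB is 1; returns (shifted value, shift amount)."""
    lz = leading_zeros(value, width)
    return (value << lz) & ((1 << width) - 1), lz
```

A result of `lz == width` corresponds to a zero operand, the case the leading-zeros detectors are also used to flag.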
This shift of x and d has the effect of changing the number of iterations required
to determine q. While q would be perfectly accurate if simply allowed to run for 33 clock
cycles, this wastes valuable ALU cycles and reduces the throughput. To alter the
algorithm so that the minimum number of cycles is used, Q[0] is first shifted left by LZX.
This value is entered into a register, as it takes the same amount of time to determine as
the x and d values. At the start of the next clock cycle, it is additionally shifted by
n-2 LZD to account for shift of d that occurred. Taking advantage of the fact that the
first QDS output is not appended to shift registers until the beginning of the following
cycle, this value is then inserted as the initial Q register value: Q[0].
Figure 5.10: Operand Conditioning
The leading zero detectors can also be utilized to detect when an input operand is
zero. If either operand is 0, the division is completed. In the case of x being 0, 0 is
output as the quotient. If d is 0, a divide-by-zero exception is triggered.
When the quotient has been produced by the SRT division circuit, provided that
both operands are non-zero, the quotient can be modified accordingly. If exactly one of x
and d was negative, and signed division is being used, then the two's complement of the
quotient is formed. Otherwise, the quotient remains unchanged. This value is the final
quotient for the divider.
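The sign and zero handling around the divider core can be sketched as below. This is a behavioral illustration under stated assumptions: Python's integer division stands in for the SRT core, the quotient is negated only when exactly one operand is negative, and the function names are hypothetical.

```python
MASK = (1 << 64) - 1


def twos_complement(value):
    """64-bit two's complement negation."""
    return (-value) & MASK


def universal_divide(x, d, signed):
    """Divide two 64-bit patterns as signed or unsigned integers."""
    x &= MASK
    d &= MASK
    if d == 0:
        raise ZeroDivisionError("divide-by-zero exception")
    if x == 0:
        return 0                                   # zero dividend: finished immediately
    neg_x = signed and bool((x >> 63) & 1)
    neg_d = signed and bool((d >> 63) & 1)
    mag_x = twos_complement(x) if neg_x else x     # work on positive magnitudes only
    mag_d = twos_complement(d) if neg_d else d
    quotient = mag_x // mag_d                      # stand-in for the SRT divider core
    if neg_x != neg_d:
        quotient = twos_complement(quotient)
    return quotient
```

Working on magnitudes means the quotient truncates toward zero, so -7 divided by 2 yields -3.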
5.4 Summary
The universal integer divider described in this chapter is based upon the SRT
digit-recurrence floating-point divider described in [31], and contains elements of
variable latency and high-radix division. The FP algorithm requires that the inputs be
normalized. Maintaining this condition requires an additional cycle of operand
conditioning to take the two's complement of negative numbers if signed division is
selected, and to shift the operands so that the most significant bit is 1. This allows the
divider to require a variable number of cycles to execute.
The divider uses a radix-4 quotient digit selection function by comparison
multiples to reduce the maximum number of iterations by half. This reduces the
maximum number of iterations to 34, including the conditioning cycle. The minimum
number of cycles is 0, which occurs if either operand is 0 or if d is 1.
Chapter 6 ALU Synthesis Results
Each of the arithmetic blocks described previously was implemented in the
VHDL hardware description language. The designs were constructed primarily of basic
logic gates, multiplexers and flip-flops using a structural architecture format. Test
benches were then designed and used to verify the functionality of the blocks using
ModelSim. Upon successful verification of the VHDL implementation, each arithmetic
block was constrained and synthesized using Synopsys Design Compiler (2006.06). The
designs were synthesized using a TSMC 0.13 µm static CMOS process. The designs were
synthesized with a focus on speed, with area and power trade-offs considered, but
secondary. The delay, area, and power of each were recorded.
6.1 Adder Synthesis
As the integer adder is such a vital part of all of the arithmetic components
discussed in this paper, special emphasis was placed on determining the trade-offs
between the different types of adders. First, radix-2 and radix-4 implementations of the
Kogge-Stone parallel prefix adder were synthesized to determine which radix produces
faster circuits with the TSMC process. As can be seen in Table 6.1, the radix-2
implementation demonstrated faster speeds, presumably due to the fan-out requirements
of the radix-4 adder.
With the radix of the adder determined, the Han-Carlson and Han-Carlson/Carry-
select adders were synthesized. These were compared to each other, as well as the two
Kogge-Stone implementations. It was observed that the Han-Carlson performed slightly
better than the corresponding Kogge-Stone, with the hybrid adder outperforming the rest.
Finally, to obtain a baseline against which to compare the performances of the
adders being tested, an adder implementation from the DesignWare library was
synthesized under identical conditions. The DesignWare libraries consist of highly
optimized VHDL and Verilog implementations of circuits created by the Synopsys team.
While the components implemented in the library are not identical to those implemented
in this paper, they can, at least, be synthesized using the same technology. As
comparable circuits in IEEE papers have been implemented in different sizes, and with
different technologies, the DW libraries provide the best opportunity to make a fair
comparison of the implemented circuits against other implementations. The DW adder
that was synthesized was a variation on the Brent-Kung, modified to improve
performance, without a large increase in area. This is the fastest implementation offered
by DesignWare - another carry-lookahead variant may be chosen, but it is both slower,
and larger.
Table 6.1: Comparison of Adder Synthesis Results
Adder   Conditions   Area (gates)   Power (mW)   Delay (ns)   Speed (MHz)
KS-R2   Worst Case   3093           16.58        1.2          833
KS-R2   Best Case    3093           34.11        0.44         2,273
KS-R4   Worst Case   3007           17.96        1.39         719
KS-R4   Best Case    3007           39.26        0.52         1,923
HC      Worst Case   2406           16.23        1.17         855
HC      Best Case    2406           66.07        0.44         2,273
HC/CS   Worst Case   2564           18.38        1.07         926
HC/CS   Best Case    2564           35.35        0.4          2,500
DWBK    Worst Case   2092           11.62        1.25         800
DWBK    Best Case    2092           24.54        0.52         1,923
As can be seen in the preceding table, the Han-Carlson/Carry-select hybrid is the
fastest of the five adders that were analyzed, though, as expected, the increased performance
comes at the cost of an increase in size over the strictly PPA Han-Carlson
implementation. The radix-4 Kogge-Stone performed the worst due to increased fan-out,
but did offer a decrease in area over the radix-2 Kogge-Stone implementation. The
DesignWare adder was the smallest and used the least power, but was also slower than
the radix-2 adders that were implemented - as expected with a Brent-Kung solution.
As this design is focused primarily on speed, the Han-Carlson/Carry-select adder
is the best solution. The schematic that was produced by Synopsys Design Vision can be
seen in Figure 6.1.
Figure 6.1: HC/CS Schematic from Synthesis
6.2 Multiplier Synthesis
As with the adders, the DesignWare libraries provide the best opportunity to
gauge the effectiveness of the multiplier design. The DesignWare libraries offer several
Booth-Wallace multipliers, each requiring a different number of clock cycles - the
number of rows of flip-flops can be selected to determine the number of required cycles.
In this manner, the DW multiplier is extremely similar to the multiplier described
previously in its flexibility.
As the integer divider has a fixed iteration, it is the limiting factor in the ALU.
The number of stages in the multiplier can then be adjusted so that the fewest possible
stages are used, while keeping the cycle time less than that of the divider. The
synthesized universal Booth-Wallace multiplier can be seen in Figure 6.2.
Figure 6.2: Universal Booth-Wallace Multiplier from Synthesis
Using four stages - with registers after the Booth encoder, and each two stages of
the Wallace tree - the custom Booth-Wallace multiplier synthesis produced an area of
95,649 gates, at a power dissipation of 345 mW and a delay of 2.04 ns - a speed of 490
MHz - under worst case conditions. Using the same conditions and constraints, the DW
implementation required an area of 52,167 gates, a maximum power of 254.72 mW, and a
delay of 2.03 ns. While the optimized DW multiplier utilized less power and area, the
custom design operates at nearly the same speed.
6.3 Divider Synthesis
Unfortunately, the DesignWare libraries do not offer a clocked divider. As such,
there is no good benchmark with which to compare the universal divider. The reported
divider area was 20925 gates and the power was 49.64 mW. The delay was 2.72 ns, for a
frequency of 368 MHz.
Figure 6.3: Universal Divider from Synthesis
6.4 ALU Synthesis
The synthesis of the completed ALU yielded an area of 122,215 gates, a power of
384 mW, and a delay of 2.89 ns - a frequency of 346 MHz. As expected, the critical path
of the ALU is in the divider - from the output of the partial residual registers, through the
computation of the next partial residuals. The multiplier dominated the area.
As with most large designs, the schematic produced by Synopsys is sufficiently
large and dense to be dominated by wiring - making it nearly impossible to see any of the
logic. For this reason, the ALU synthesis results diagram includes an inset to show the
locations of two of the arithmetic blocks. The inset includes four blocks - the universal
multiplier (UM), universal divider (UD), left shifter (BL) and right shifter (BR). The
adder/subtractor and logic blocks are located below the inset.
Figure 6.4: ALU from Synthesis
While the implementation of the XOR, OR, AND, and inverse logic functions is
trivial, the implementation of the shifting operations is more complex. They required
the use of a series of multiplexers to implement. The multiplexers use the shift control
to determine whether to output a portion of the input array, or a constant. Using several
multi-bit multiplexers, a barrel shifter is formed that can shift an n-bit number up to n bits
in roughly the same amount of time that it takes to perform an n-bit addition - a one
clock cycle operation. A Synopsys schematic of the left barrel shifter is included below.
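A multiplexer-based barrel shifter can be sketched as a sequence of stages, each of which either passes its input through or shifts it by a fixed power of two and fills with constant zeros. The staging below is an assumed logarithmic arrangement, not necessarily the one synthesized in the thesis, and it covers shift amounts from 0 to width - 1.

```python
def barrel_shift_left(value, shift, width=64):
    """Left shift built from log2(width) multiplexer stages."""
    mask = (1 << width) - 1
    stages = (width - 1).bit_length()          # 6 stages for a 64-bit word
    for stage in range(stages):
        if (shift >> stage) & 1:               # this control bit selects the shifted input
            value = (value << (1 << stage)) & mask
    return value
```

Each control bit drives one row of multiplexers, so the delay grows with the logarithm of the word width rather than with the shift distance.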
Figure 6.5: Barrel Shift Left from Synthesis
Part II
Floating-Point Unit
This section presents the implementation of an IEEE compliant double-precision
floating-point unit based on the MIPS 64 instruction set. The floating-point unit
supports double precision addition, subtraction, multiplication and division, and
conforms to the IEEE 754 standard for number representation and rounding.
Chapter 7 Floating-Point Unit Overview
Floating-Point Units (FPU) are the hardware components that handle real-number
mathematical operations in the CPU. Like the ALU, the FPU implements the four basic
mathematical operations - addition, subtraction, multiplication, and division - the
difference being the number representation scheme utilized. An ALU handles integer
values, represented in binary numbers. This means that all 64 bits of the bit vector
represent the portion of a number to the left of the radix point. An FPU deals with
both the integer and fraction portions of numbers. As there is no way to slide a radix
point into the bit vectors to tell the computer what is the integer and what is the fraction,
the operands must be divided into sections representing the sign, exponent, and mantissa
of the number.
7.1 The IEEE 754 Standard
At the dawn of the computing age, chaos ruled the floating-point world. That is
to say, there was no standard way of implementing a floating-point operation. Each
platform implemented their own floating-point rules, with different ranges and precisions
- different mantissa and exponent lengths, and methods of implementing exponent biases.
As a result, programs written for one platform could not be easily ported to another
platform.
In 1977, the companies of Silicon Valley formed a committee to rectify this
problem under the banner of IEEE p754 [42]. Among the represented companies were
Intel, National Semiconductor, Zilog, Motorola, IBM, DEC, CDC and Cray. After years
of debate and compromise - with many companies trying to persuade the committee to
use their standards so that they did not have to implement new formats - the IEEE
754-1985 [21] was adopted.
The IEEE 754 standard implemented a floating-point algorithm more complex
than any to that day. While most companies implemented one type of rounding, the
IEEE standard supports four modes: round to nearest (RN), round to zero (RZ), round to
positive infinity (RP), and round to negative infinity (RM). The standard supports
single-precision 32-bit numbers, and double-precision 64-bit numbers. As would be expected,
double-precision offers a larger range (11 exponent bits compared to 8) and greater accuracy (52
fraction bits compared to 23) than the single-precision. As they operate in the same
manner and the focus of the work presented in this paper is on 64-bit inputs, only the
double precision format will be explored.
[Figure: sign (1 bit, bit 63), exponent (11 bits, bits 62-52), fraction (52 bits, bits 51-0)]
Figure 7.1: Double-Precision IEEE 754 Format
The decimal value of normalized FP numbers in IEEE 754 format is represented
as $(-1)^{sign} \times 2^{e - bias} \times 1.f$. A complete list of the possible values for double precision numbers
can be seen in the following table. The difference between single and double precision
format lies in the exponent bias. For double precision it is 1023, while the single
precision bias is 127. The bias allows for values both with and without integer digits.
Table 7.1: Double Precision IEEE 754 FP Values
               Exponent (e)    Fraction (f)   Value
Zero           00000000000     zero           0
Denormalized   00000000000     nonzero        $(-1)^{sign} \times 2^{e - bias + 1} \times 0.f$
Normalized     0 < e < 2047    any            $(-1)^{sign} \times 2^{e - bias} \times 1.f$
Infinity       11111111111     zero           Infinity
NAN            11111111111     nonzero        Not a number
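The value rules in Table 7.1 can be checked against a host machine's own doubles with a short sketch. The function name is hypothetical, and the sign of infinities is deliberately ignored to keep the example small.

```python
import struct
from fractions import Fraction

BIAS = 1023


def decode_double(x):
    """Decode a Python float by hand using the field rules of Table 7.1."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    e = (bits >> 52) & 0x7FF                 # 11 exponent bits
    f = bits & ((1 << 52) - 1)               # 52 fraction bits
    if e == 0x7FF:
        return "Infinity" if f == 0 else "NaN"
    frac = Fraction(f, 1 << 52)
    if e == 0:                               # zero or denormalized: hidden bit is 0
        magnitude = Fraction(2) ** (1 - BIAS) * frac
    else:                                    # normalized: hidden bit is 1
        magnitude = Fraction(2) ** (e - BIAS) * (1 + frac)
    return (-1) ** sign * magnitude
```

Using exact fractions avoids any rounding in the check itself, so the decoded value can be compared for equality with the original float.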
IEEE rounding supports four modes that can be compressed into three rounding
modes dependent upon the sign of the result. The round to nearest mode (RN) rounds the
infinitely precise result to the nearest ULP. In the case where the infinite precision is
exactly midway between the possible values, the result is rounded to the value with the
ULP of 0. The round to positive infinity mode (RP) rounds the infinite precision number
to the smallest value greater than or equal to itself. The round to negative infinity mode
(RM) rounds the infinite precision number to the largest value less than or equal to itself.
The round to zero (RZ) mode rounds the infinite precision value to the smallest value
greater than or equal to itself if the number is negative, or the largest value less than or
equal to itself if the number is positive.
Analyzing these modes, it can be seen that the rounding modes possess a certain
amount of redundancy. As such, the IEEE rounding can be implemented without the RM
and RP modes, as suggested by Quach, et al in 1991 [36]. These modes are replaced by a
round to infinity mode that rounds away from zero, regardless of the sign of the number.
Table 7.2: Possible Implementation of IEEE Rounding Modes [37]
IEEE Rounding Mode   Positive Number Treated As   Negative Number Treated As
RN                   RN                           RN
RP                   RI                           RZ
RM                   RZ                           RI
RZ                   RZ                           RZ
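The mode reduction in Table 7.2 can be sketched as a mapping followed by sign-magnitude rounding. For brevity the example rounds an exact fraction to an integer rather than a significand to one ULP, and the function names are hypothetical.

```python
import math
from fractions import Fraction


def reduce_mode(mode, negative):
    """Map the four IEEE modes onto RN, RZ and RI (round away from zero)."""
    if mode == "RP":
        return "RZ" if negative else "RI"
    if mode == "RM":
        return "RI" if negative else "RZ"
    return mode                                   # RN and RZ are sign independent


def round_to_integer(value, mode):
    """Round on the magnitude only, using the reduced mode."""
    value = Fraction(value)
    negative = value < 0
    magnitude = abs(value)
    truncated = math.floor(magnitude)
    remainder = magnitude - truncated
    reduced = reduce_mode(mode, negative)
    if reduced == "RZ" or remainder == 0:
        result = truncated
    elif reduced == "RI":
        result = truncated + 1
    else:                                         # RN, ties go to the even value
        half = Fraction(1, 2)
        round_up = remainder > half or (remainder == half and truncated % 2 == 1)
        result = truncated + 1 if round_up else truncated
    return -result if negative else result
```

After the reduction, the rounding logic never needs to look at the sign again, which is the simplification the table describes.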
7.2 FPU Instructions
The instructions implemented for the floating-point unit are add, subtract,
multiply and divide. The floating-point addition, subtraction and multiplication
operations support all four IEEE rounding modes. The FP division implements round to
nearest rounding only, as it is common to sacrifice the remaining three rounding modes in
order to decrease the division latency.
Table 7.3: FPU Operations
Operation        OP Code   Rounding   Description
Addition         00        00         DP Addition with RZ Rounding
Addition         00        01         DP Addition with RP Rounding
Addition         00        10         DP Addition with RM Rounding
Addition         00        11         DP Addition with RN Rounding
Subtraction      01        00         DP Subtraction with RZ Rounding
Subtraction      01        01         DP Subtraction with RP Rounding
Subtraction      01        10         DP Subtraction with RM Rounding
Subtraction      01        11         DP Subtraction with RN Rounding
Multiplication   10        00         DP Multiplication with RZ Rounding
Multiplication   10        01         DP Multiplication with RP Rounding
Multiplication   10        10         DP Multiplication with RM Rounding
Multiplication   10        11         DP Multiplication with RN Rounding
Division         11        -          DP Division with RN Rounding
Chapter 8 Floating-Point Adder/Subtractor
Just as integer addition/subtraction is the most used ALU operation, FP
addition/subtraction is the most utilized floating-point operation. Oberman and Flynn
[34] report that the FP adder is used for 55% of all floating-point instructions. Because of
this, many methods of implementing floating-point addition have been researched and
implemented. Although the approaches can be implemented in vastly different manners
in an attempt to optimize the circuitry, all methods implement some variation of the block
diagram shown in Figure 8.1.
Figure 8.1: Block Diagram of Floating-Point Adder/Subtractor
The unpacking stage in this diagram involves separating the sign, exponent, and
significand for each operand. This includes reinstating the hidden 1 for normalized
numbers, and the hidden 0 for denormalized numbers. If a different number formatting is
used in the adder, the number conversion occurs here as well.
The difference between the exponents is used to determine the amount of right
shifting necessary to align the smaller operand with the larger operand. It is also utilized
to determine which operand is larger.
To reduce the costs of additional logic, the alignment of the mantissas is limited
to only one operand, requiring only one shifter. This requires that it be permissible
to swap the operands so that the one with the smaller magnitude may be fed into the
shifter. To lessen the effects of the swap, it is implemented in the same block that
performs selective two's complement in preparation for effective subtraction.
After the addition/subtraction of the aligned significands, the
sum/difference has a magnitude in the range [0,4). This result must be normalized to be
in the range of [1,2) to conform to the IEEE standard. If the result is in the range of [2,4),
it must be normalized by shifting one bit to the right. If the result is in the range of [0,1),
it must be shifted one bit to the left for normalization. The exponent must be adjusted in
accordance to the normalization shift - an increase of 1 for a shift right, a decrease of 1
for a shift left.
After normalization, IEEE rounding is performed in accordance with the standard
described in the previous chapter. If the normalized result is rounded upward, it has the
potential to cause an overflow to 2, which must then be normalized with another shift
right, and an increment in the exponent. This result can then be converted to IEEE format.
8.1 Common FP Adder/Subtractor Optimizations
8.1.1 Use of Compound Adders
Compound adders compute a+b+cin and a+b+cin+1 in parallel. Using compound
adders to add the aligned significands allows the rounding decision to be computed in
parallel. The round decision is then used to select between the sum and sum+1. This
replaces an adder with a multiplexer, reducing both delay and area.
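A behavioral sketch of the idea follows. The carry-in is omitted for brevity, the `round_up` flag stands in for whatever rounding decision logic is used, and the function names and the 53-bit default width are assumptions.

```python
def compound_add(a, b, width):
    """Return (sum, sum + 1), both truncated to width bits, as a compound adder would."""
    mask = (1 << width) - 1
    total = (a + b) & mask
    return total, (total + 1) & mask


def rounded_sum(a, b, round_up, width=53):
    """Select between the two precomputed results instead of adding a rounding increment."""
    plain, incremented = compound_add(a, b, width)
    return incremented if round_up else plain      # the multiplexer
```

Both candidates are available at the same time, so the rounding decision only has to drive a select line rather than trigger a second carry-propagating addition.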
8.1.2 Parallel Paths
The FP-adder pipeline can be partitioned into two parallel paths that work under
different assumptions, as described by Farmwald [6]. Each path can be optimized for its
specific purpose, and disregard some of the steps required for the alternative path. The
most common method of partitioning the paths is according to exponent difference. For
small exponent differences ($A_{exp} - B_{exp} \in \{-1, 0, 1\}$), a near path is defined. For larger
exponent differences, a far path is defined.
8.1.3 One's Complement Significand Negation
The one's complement of a number is formed by inverting the bits of the
significand, while conversion to the two's complement format requires the use of
inverters and an adder, as the two's complement is equal to the one's complement + ULP.
The conversion of one's complement to two's complement format can be accounted for
by adjusting the ULP of a subsequent calculation. For example, using a compound adder,
the a+b+1 output accounts for the missing ULP of the two's complement.
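The following sketch shows how the two compound adder outputs cover both outcomes of a subtraction when only the one's complement of b is supplied. It is a generic illustration of the technique with hypothetical names, not the SE adder's exact datapath.

```python
def subtract_magnitude(a, b, width):
    """Return |a - b| using a + ~b (sum) and a + ~b + 1 (sum + 1) only."""
    mask = (1 << width) - 1
    raw = a + (~b & mask)              # one's complement of b, no ULP added
    sum0 = raw & mask                  # a + ~b
    sum1 = (raw + 1) & mask            # a + ~b + 1, i.e. a - b in two's complement
    carry = (raw + 1) >> width         # 1 when a >= b
    if carry:
        return sum1                    # the +1 output supplies the missing ULP
    return ~sum0 & mask                # negative result: invert a + ~b to get b - a
```

In the negative case no increment is required either, because inverting a + ~b directly yields b - a.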
8.1.4 Leading Zero Approximation
For the near path, effective subtraction may result in a sum/difference that will
require normalization via left shifting. The number of bits that the shift will require can
be approximated by analyzing the number of leading zeros of a recoding of the mantissas.
Since it requires only the mantissas, and not an exact value for the sum/difference, the
approximate number of leading zeros can be derived in parallel to the
addition/subtraction operation taking place via the compound adder. The exact number
of leading zeros can then be selected simultaneously to the selection of the correct
sum/difference using multiplexers. This eliminates several logic delays, allowing the
correct normalization to be performed immediately upon the selection of the correct
sum/difference. The exact process for approximating the leading zeros varies between
FP adder/subtractor implementations.
8.1.5 Reduction of Rounding Modes
Using the compressed rounding, as described in the previous section, allows for
simpler rounding implementation. There are many proposed rounding algorithms that
improve upon the naive implementation described previously [13, 36, 37, 48]. Those that
offer the highest speeds tend to use this compression from four down to three rounding
modes. The benefits of this process are not difficult to see. The compression logic can
be implemented in parallel to the computation of the sum/difference, so it does not
introduce additional delay into the system. The amount of rounding logic, and the logic
used to select between rounding modes may then be reduced to accommodate one fewer
mode, reducing both area and delay.
8.2 The SE FP Adder/Subtractor
As there are so many variations on the floating-point adder/subtractor - each
company has its own variation(s) - only the one implemented for this paper will be
discussed. The FP adder/subtractor used for the FPU described in this paper is based on
the Seidel-Even (SE) FP adder [41], but has been modified to allow for denormalized
numbers, which were not supported by the original. The formulae behind the SE adder are
proven in [47].
The SE adder utilizes a two-path parallel path architecture, and takes two clock
cycles to execute. The N path is similar to the near path described previously, assuming
that $A_{exp} - B_{exp} \in \{-1, 0, 1\}$ and that effective subtraction is taking place. Effective
subtraction means that the sign of the A operand and the sign of the output could possibly
differ, depending upon the respective sizes of A and B:
$$S_{EFF} = sign(A) \oplus sign(B) \oplus is\_sub$$
where is_sub is 1 if subtraction is being performed and sign(A) and sign(B) are the sign
bits of the A and B operands.
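As a small illustration, and assuming the operators lost from the printed formula are exclusive-ORs, effective subtraction can be computed as follows. The function name is hypothetical.

```python
def effective_subtraction(sign_a, sign_b, is_sub):
    """S_EFF = sign(A) xor sign(B) xor is_sub; 1 means the magnitudes are subtracted."""
    return sign_a ^ sign_b ^ is_sub
```

For example, subtracting a negative number from a positive one gives `effective_subtraction(0, 1, 1) == 0`, so the magnitudes are actually added.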
This parallel path method has the advantages described previously. The paths can
be optimized for a single purpose, and the hardware can be smaller and faster. This
partitioning of logic offers the additional benefit of only requiring one path - the N path -
to perform subtraction. By limiting the near path to effective subtraction, the significand
is guaranteed to be of smaller magnitude than the larger of the inputs. As such, the
rounding operation can be removed from this path. The use of a single subtraction path is
a method utilized by both AMD [33] and SUN [16] in recent processor implementations.
Figure 8.2: High-Level SE FP Adder/Subtractor Block Diagram [41]
8.2.1 Near Path
To determine which of the operands is the larger and which is the smaller, the
least significant two bits of the exponent of B are subtracted from the least significant
two bits of the exponent of A. The two-bit output of this subtraction is used as the
selection inputs for a group of multiplexers that determine the inputs for a compound
adder. If the difference is "00", the exponents are the same, so the inputs for the adder
are the mantissa of A and the inverse of the mantissa of B. If the difference is "11", the
exponent of B is one larger than the exponent of A, so the mantissa of B is multiplied by
two (shifted left by one bit) and used as an input along with the inverse of the mantissa
of A. If the difference is "01", the exponent of A is one larger, so the mantissa of A is
multiplied by two and used as an input along with the inverse of the mantissa of B.
Figure 8.3: SE Adder Near Path
Table 8.1: Compound Adder Inputs Based on Exponents
EA[1:0]   EB[1:0]   Diff[1:0]   Adder Inputs
00        00        00          FA + ~FB
00        01        11          2FB + ~FA
00        10        10          none (far path)
00        11        01          2FA + ~FB
01        00        01          2FA + ~FB
01        01        00          FA + ~FB
01        10        11          2FB + ~FA
01        11        10          none (far path)
10        00        10          none (far path)
10        01        01          2FA + ~FB
10        10        00          FA + ~FB
10        11        11          2FB + ~FA
11        00        11          2FB + ~FA
11        01        10          none (far path)
11        10        01          2FA + ~FB
11        11        00          FA + ~FB
The case where the difference is "10" can be disregarded, as an exponent difference of two
violates the assumptions for the near path, so the far path will be used. Similarly, cases
where the exponents of A and B differ at a higher bit position than those analyzed violate
the near path conditions and fall into the far path. Doubling the significand of the operand
whose exponent is larger by one aligns the two significands, ensuring that correct values
are subtracted.
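The selection rule of Table 8.1 can be sketched in a few lines of Python. This is a behavioral illustration only, not the multiplexer network of Figure 8.3; the function name `near_path_inputs` and the 54-bit working width are assumptions made for the example.

```python
MANT_BITS = 53  # significand width including the hidden bit

def near_path_inputs(ea, eb, fa, fb):
    """Return the two compound adder inputs of Table 8.1, or None for the far path."""
    mask = (1 << (MANT_BITS + 1)) - 1      # one extra bit for the doubled operand
    diff = (ea - eb) & 0b11                # two-bit exponent difference
    if diff == 0b00:                       # equal exponents: FA + ~FB
        return fa, ~fb & mask
    if diff == 0b01:                       # EA = EB + 1: 2FA + ~FB
        return (fa << 1) & mask, ~fb & mask
    if diff == 0b11:                       # EB = EA + 1: 2FB + ~FA
        return (fb << 1) & mask, ~fa & mask
    return None                            # "10": not a near path case
```

Only the two least significant bits of the exponents influence the result, which is why the full exponent comparison can be left to the far path logic.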
The compound adder used to perform the subtraction uses the lazy one's
complement, where the ULP required to convert the inverse of the lesser
significand to two's complement format is not added into the calculation (i.e. by a carry
in). Instead, it is accounted for by the structure of the compound adder itself. This adder
is specially designed to accommodate the fact that the operands should be switched if
EA and EB are equivalent and FB is greater than FA. This, combined with the selection
and alignment logic described previously, ensures that the operation |A| - |B| is performed
if A > B, and |B| - |A| is performed if B > A.
Rather than merely producing the carries, as done for the integer adder described
in Chapter 2, the propagation groups and the original propagation signals from the prefix
logic are also utilized in this compound adder. The carries and propagation groups
produced by the Han-Carlson tree can be OR'd together to form an incremented
carry-group. The incremented carry-group can then be XOR'd with the propagation
signals from the prefix logic to create the two's complement subtraction
$|A| + \overline{|B|} + 1$. An XNOR of the propagation signals with the carries produces the
inverse of the addition of the inputs: $\overline{|A| + \overline{|B|}}$. The resulting output
of the compound adder is defined as:
$$\mathrm{abs}(|A| - |B|) = \begin{cases} |A| + \overline{|B|} + 1 & \text{if } |A| - |B| > 0 \\ \overline{|A| + \overline{|B|}} & \text{if } |A| - |B| < 0 \end{cases}$$
Using this method for the compound adder allows both |A| - |B| and |B| - |A| to be
computed simultaneously. Using the most significant bit of the
$|B| - |A| = \overline{|A| + \overline{|B|}}$ output,
the correct orientation can be determined. If that value is 1, then |B| > |A|, so |B| - |A|
must be selected. Otherwise, |A| - |B| is selected.
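The identity behind the compound adder can be checked with a short behavioral model in Python. The sketch below is an illustration under stated assumptions: the name `compound_abs_diff` and the width parameter `n` are hypothetical, and the selection uses the carry-out of the raw sum as a stand-in for the most significant bit test described above.

```python
def compound_abs_diff(a, b, n=54):
    """Behavioral model: |a - b| from a + ~b without a separate carry-in increment."""
    mask = (1 << n) - 1
    raw = a + (~b & mask)          # lazy one's complement sum, n + 1 bits wide
    a_minus_b = (raw + 1) & mask   # a + ~b + 1, the two's complement a - b
    b_minus_a = ~raw & mask        # ~(a + ~b), which equals b - a when b >= a
    a_is_larger = (raw >> n) & 1   # carry-out is 1 exactly when a > b
    return a_minus_b if a_is_larger else b_minus_a
```

Both candidate results are formed from the same raw sum, which mirrors how the hardware derives them from one set of prefix carries rather than from two separate adders.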
The number of leading zeros of the differences are predicted in parallel with the
compound subtractions. As explained previously, this allows for an immediate
normalization upon the selection of the proper subtraction orientation. To perform the
leading zeros prediction, the subtraction is approximated using PN-recoding, as described
in [10], [11]. This produces sum and carry outputs, which are XOR'd together. The
output of the XOR is then OR'd with the exponent of B, decoded into one-hot format,
with an exponent value of 0 decoded into a 64-bit value of 1. This limits the number of
leading zeros so that denormalized numbers may be included in the calculations. These
values are then fed into priority encoders such that one priority encoder (PENC55 in the
diagram) takes an additional 0, creating an output of 1 greater than the other priority
encoder. If |B| > |A|, then the output of PENC55 is used as the leading zeros count. The
output of PENC54 is used otherwise.
To implement the PN-recoding, two levels of half-adders are used. The holes
created by the half-adders - carry[LSB] and sum[MSB] - are filled with 0's to align the
outputs, as the MSB of the carry should be two times greater than the MSB of the sum.
Figure 8.4: PN Recoding Circuit
Figure 8.5: Binary Priority Encoder [12]
The priority encoder (PENC) counts the number of leading ones in a binary string,
and expresses that number in binary format. In order to do this, it is constructed of a
simpler type of priority encoder that uses a unary format. The unary output is then
converted into one-hot format, and subsequently encoded into binary format.
The n-bit binary input is encoded into an n-bit unary output in accordance with
the following rule, with the input x[0 : n - 1] and the output y[0 : n - 1].
$$y[i] = \mathrm{OR}(x[0:i])$$
To reduce the area of the unary priority encoder (UPENC), an n/2-input UPENC is used with a
level of OR gates before and after it to incorporate the remaining inputs.
Figure 8.6: Area Optimized Unary Priority Encoder
The difference logic is a simple array of XOR gates that compares each bit to the
previous bit. In this manner, the location of the change from 0 to 1 is detected, and the
unary format is converted to one-hot. The one-hot format is then padded with 0's so that
it becomes a power of two in length, and converted into binary format by way of an
encoder. A zero detector is utilized to ensure that the input is nonzero, but this is
unnecessary with the addition of the decoder and OR logic added for the denormalized
numbers.
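The three stages of the priority encoder (unary encoding, difference logic, binary encoding) can be modeled as follows. This is a minimal sketch, assuming x[0] is the most significant bit; the function names are hypothetical, the padding to a power of two is omitted, and an all-zero input is not distinguished from a leading one at position 0.

```python
def unary_priority_encode(x):
    """y[i] = OR(x[0..i]); x[0] is taken as the most significant bit."""
    y, seen = [], 0
    for bit in x:
        seen |= bit
        y.append(seen)
    return y

def unary_to_one_hot(y):
    """XOR each bit with its predecessor to mark the 0 -> 1 transition."""
    return [y[i] ^ (y[i - 1] if i > 0 else 0) for i in range(len(y))]

def one_hot_to_binary(h):
    """Return the index of the single set bit (0 if no bit is set)."""
    return sum(i for i, bit in enumerate(h) if bit)

def leading_zero_count(x):
    """Chain the three stages; the result is the position of the first 1."""
    return one_hot_to_binary(unary_to_one_hot(unary_priority_encode(x)))
```

For example, the input 0 0 1 0 1 yields the unary string 0 0 1 1 1 and the one-hot string 0 0 1 0 0.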
As both the UPENC and encoder circuit grow non-linearly in area, the size of the
binary priority encoder (BPENC) can be reduced through the use of a divide and conquer
implementation. The recursive n-input BPENC uses two n/2 input BPENC circuits as
shown in the following figure.
Figure 8.7: BPENC Implemented Using Recursive Structure
With both the leading zeros and the mantissa difference calculated, the difference
can be normalized by way of left shifting to remove the leading zeros created by the
subtraction. The difference is shifted left by the number of leading zeros predicted
through the approximation. Final normalization is then performed using a multiplexer with
the most significant bit (bit 53) of the shifted difference used for the selection logic. If
the MSB is 1, then the output is bits 53 down to 1 of the shifted difference; otherwise it is
bits 52 down to 0.
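The two-step normalization can be sketched as below. The name `normalize_near` and the treatment of the difference as a 54-bit integer are assumptions made for illustration.

```python
def normalize_near(diff, lz_pred):
    """Left-shift a 54-bit difference by the predicted count, then fix up by one bit."""
    shifted = (diff << lz_pred) & ((1 << 54) - 1)   # keep bits 53..0
    if (shifted >> 53) & 1:                         # bit 53 set: take bits 53 down to 1
        return shifted >> 1
    return shifted & ((1 << 53) - 1)                # otherwise bits 52 down to 0
```

In this model, two predictions that differ by one position produce the same normalized 53-bit result, which is the case the final multiplexer covers.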
8.2.2 Far Path - First Cycle
The far path encompasses all cases not covered by the near path. That is to say,
all effective addition operations and subtraction operations where the exponents differ by
more than a ULP utilize the far path. The far path is divided into two parts - one for each
clock cycle.
Figure 8.8: SE Adder Far Path - First Cycle
The first cycle of the far path determines which operand is the larger, and shifts
the smaller accordingly to accommodate the difference in magnitude. It also converts the
four IEEE rounding modes into the compressed three rounding mode format utilized by
the rounding algorithm and determines if the exponent difference is sufficiently large to
require the far path.
The difference in magnitude is determined using a 1's complement subtraction of
EB from EA. To accommodate the use of 1's complement instead of 2's complement
format, the value of the effective subtraction bit is subtracted from the larger of the
exponents. Which exponent is larger is determined by the most significant bit of the
exponent subtraction. If it is 1, the magnitude of B is larger than that of A.
The significands are chosen through a series of shifts and multiplexers such that
$$flp = fl \cdot 2^{seff}$$
$$fsopa = \begin{cases} fs \cdot 2^{-\delta_{lim}} & \text{if } seff = 0 \\ \ldots & \text{otherwise} \end{cases}$$
The exponent subtraction result δ can be limited such that a shift need not exceed the next
power of two larger than the number of bits of the mantissa. This is true due to the fact
that a difference of greater than 53 in the exponent will result in a shift of the smaller
mantissa such that the smaller mantissa has no effect upon the larger mantissa when they
are added together.
$$\delta_{lim} = \min\{\delta, 2^q\}$$
$$q \ge \log_2(p + 2) = \log_2 55 \Rightarrow q = 6$$
As a result, only the least significant six bits of the difference, XOR'd with the
seventh bit, need to be analyzed to determine a shift in the smaller operand. The seventh
bit of δ is used to determine whether A or B is larger in magnitude when the exponent
difference is limited to the least significant six bits. A 0 at δ[6] shows that the exponent
of A is larger than that of B. A 1 at this position shows that the exponent of B is larger
than that of A. The inversion if B is larger allows the lower six bits to be used as the shift
regardless of which operand is larger.
The most significant bit of δ is also XOR'd with δ[11:6], and the results are OR'd together
to determine if the δ_lim condition has been met. If so, the most significant 65 bits of the
smaller operand's significand are
padded with the effective subtraction bit, which is a faster operation when performed
directly than the variable length shifting performed otherwise. This is_big condition can
be OR'd with the shift bits to determine the size of δ, and produce a component of the path
selection algorithm.
$$is\_r1 = (|\delta| \ge 2)$$
The rounding compression reduces the four standard IEEE rounding modes into
the three effective modes described previously. The compression requires only AND
gates, inverters and a single OR gate. As it is produced concurrently with the significand
conditioning, compression adds no additional delays to the circuit. The sign used to
determine the sign of the output is selected using δ[11], as done for the large and small
mantissa values.
Table 8.2: Reduced Rounding Modes
IEEE Rounding Mode    Reduced Rounding Mode - RNRI
                      Positive Number    Negative Number
RZ (00)               RZ (00)            RZ (00)
RP (01)               RI (01)            RZ (00)
RM (10)               RZ (00)            RI (01)
RN (11)               RN (10)            RN (10)
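The mode reduction of Table 8.2 can be expressed as a small lookup. The sketch below uses the two-bit encodings listed in the table; the constant and function names are hypothetical, and the code is a behavioral model rather than the AND/OR gate network.

```python
# IEEE mode encodings and reduced RNRI encodings as listed in Table 8.2
RZ, RP, RM, RN = 0b00, 0b01, 0b10, 0b11
R_RZ, R_RI, R_RN = 0b00, 0b01, 0b10

def reduce_rounding_mode(ieee_mode, sign):
    """Map an IEEE rounding mode and a result sign (1 = negative) to RNRI."""
    if ieee_mode == RN:
        return R_RN
    if ieee_mode == RP:                 # toward +infinity
        return R_RZ if sign else R_RI
    if ieee_mode == RM:                 # toward -infinity
        return R_RI if sign else R_RZ
    return R_RZ                         # round to zero is unaffected by sign
```

Rounding toward plus or minus infinity becomes either round to infinity or round to zero in magnitude terms, depending only on the sign of the result.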
8.2.3 Far Path - Second Cycle
The second cycle of the far path adds the mantissas, rounds the result, and
determines if an overflow exception has occurred. By implementing the rounding in
parallel with the compound adder used to create the sum of the mantissas, the latency of
the adder is significantly reduced over the naive rounding method.
Figure 8.9: SE Adder Far Path - Second Cycle
To accommodate the desire to perform the rounding in parallel with the mantissa
addition, the mantissa of the larger operand (FLP) and the most significant 54 bits of the
smaller operand's mantissa are run through a half-adder before the compound adder.
This compression creates the "carry" input necessary for the rounding algorithm as the
least significant bit of the sum output. The half-adder also has the benefit of reducing the
compound adder inputs from 54 bits to 53 bits, requiring slightly less hardware to
implement. No carry is generated to bit 54, as the sum will be less than four.
The compound adder uses the same format as the compound adder in the near
path. The only difference is that the XNORs are replaced with XORs so that the far
path adder forms A+B and A+B+1 rather than $\overline{A+B}$ and A+B+1 as formed in the near
path. The use of the compound adder here produces both possible mantissas, the larger
of which is used if the rounding requires that non-LSB bits of the mantissa be changed.
Each of the compound adder outputs is normalized using its most significant bit to
control a one bit shift operation.
Naive rounding implementations perform the addition/subtraction of the shifted
mantissa values to get a sum/difference value of σ. Three values are analyzed for
rounding. The carry bit "C" is at the bit position of the least significant bit of FLP.
The round bit "R" is the most significant bit not affected by FLP. The sticky bit "S"
represents all bits of σ below R. Due to the fact that the smaller operand may have been
shifted by up to 65 bits, while the larger operand requires only a one bit shift, it can be
observed that the least significant 64 bits of σ are less significant than C. To acquire the
sticky bit, the least significant 63 bits of σ are OR'd together. The rounding bit is equal to
σ[63]. The carry bit is equal to σ[64].
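As a small illustration of these bit positions, the carry, round, and sticky values can be extracted from an integer that models the full-width result. The function name `naive_crs` is hypothetical.

```python
def naive_crs(sigma):
    """Extract (C, R, S) from the full-width sum/difference."""
    c = (sigma >> 64) & 1                        # bit 64: aligned with FLP[0]
    r = (sigma >> 63) & 1                        # bit 63: first bit below FLP
    s = int((sigma & ((1 << 63) - 1)) != 0)      # OR of bits 62 down to 0
    return c, r, s
```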
Figure 8.10: CRS Generation in Naive FP Addition/Subtraction Rounding
The rounding algorithm used in the SE Adder is a variation of rounding by
injection [13], where an injection value is added to the FLP and FSOPA addition. The
injection is defined as follows, where the 2^-52 bit position occurs at FLP[0] and
FSOPA[64], and the 2^-53 bit position occurs at FSOPA[63]. This results in the injection
being added to the lower part of the FSOPA - directly influencing the R and S values
respectively.
$$INJ = \begin{cases} 0 & \text{if } RZ \\ 2^{-53} & \text{if } RN \\ 2^{-52} - ULP & \text{if } RI \end{cases}$$
However, due to the fact that the R and S bits rely only on the smaller operand
FSOPA, the injections can be added directly to the R and S bits. This eliminates the need
to extend the injection for RI all the way to the ULP, reducing a 64-bit adder to a 2-bit
adder. The resulting 2-bit injection can be derived from the RN and RI bits of the three mode
rounding using a simple XOR gate of RN and RI for INJ[1], and using RI as INJ[0].
Table 8.3: RNRI to INJ Conversion
RN   RI   INJ
0    0    00
0    1    11
1    0    10
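The conversion in Table 8.3 reduces to one XOR and a wire, which can be written directly. The function name `injection_bits` is an assumption for this sketch.

```python
def injection_bits(rn, ri):
    """Two-bit injection aligned with the (R, S) positions."""
    inj1 = rn ^ ri      # INJ[1]: XOR of RN and RI
    inj0 = ri           # INJ[0]: RI passed through
    return (inj1 << 1) | inj0
```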
The addition of the injection is fairly straightforward when effective addition is
being implemented. As FSOPA is being added to FLP, there is no two's complement to
account for, so R can be read directly from FSOPA[63], and S can be derived from an
OR tree of the lowest 63 bits of FSOPA. The injection, truncated to two bits, can then be
added to R and S to create a new R and S, and a carry-out.
Figure 8.11: CRS Creation for Effective Addition
Since the full addition of the FLP and FSOPA is not required for the rounding, as
only the LSB of the FLP is being used, the S bit must be modified to account for the fact
that addition and subtraction would affect the FSOPA differently. While the FSOPA is
already negated if effective subtraction is being performed, the increment of the ULP
required for 2's complement subtraction is not implemented. If the lower part of the
FSOPA is all ones, the addition of the ULP results in an overflow that must be accounted
for. To accommodate this fact, the least significant 63 bits of the FSOPA are XOR'd
with the effective subtraction bit. This has the effect of inverting the least significant 63
bits of FSOPA if effective subtraction is performed, and leaving them alone otherwise.
As a result, the S bit can be used to denote both itself, and the carry into the R position -
represented as the inverse of S. If the lower part of FSOPA is all ones an increment
would result in an overflow. This is shown in the hardware as the inversion of all ones
results in an S bit of 0. The inverse of S then shows the carry-in of the R position as 1. If
the lower part of FSOPA is not all ones, no overflow results from an increment, so the
XOR results in an S bit of 1 and no carry-in into the R position is required.
Figure 8.12: CRS Creation for Effective Subtraction
Since the derivation of S requires an array of XOR gates followed by an OR tree,
it takes significantly longer to generate than C, R or the injection. For this reason, two
sets of CRS outputs are created - one each for the cases when S is 0 and when S is 1.
The generated value for S is then used to select which CRS to use for the rounding
implementation. In this representation, C denotes a carry-in into the a[0] bit position.
Figure 8.13: CRS Circuit
The CRS circuit is derived from the following truth tables, which were in turn
derived from an analysis of CRS creation diagrams for effective subtraction and addition
displayed previously. The tables are divided into the cases where S = 0 and S = 1, to
accommodate the division of the circuitry. The cases where INJ is "01" were omitted
from the truth tables as they should never occur. The accompanying optimized CRS
equations were derived using Karnaugh maps.
Table 8.4: Truth Tables for CRS Generation

S = 0
R'   SEFF   INJ   CRS
0    0      00    000
0    0      10    010
0    0      11    011
0    1      00    010
0    1      10    100
0    1      11    101
1    0      00    010
1    0      10    100
1    0      11    101
1    1      00    100
1    1      10    110
1    1      11    111
C = (SEFF * R') + (SEFF * INJ(1)) + (INJ(1) * R')
R = R' xor SEFF xor INJ(1)
S = INJ(0)

S = 1
R'   SEFF   INJ   CRS
0    0      00    001
0    0      10    011
0    0      11    100
0    1      00    001
0    1      10    011
0    1      11    100
1    0      00    011
1    0      10    101
1    0      11    110
1    1      00    011
1    1      10    101
1    1      11    110
C = (INJ(1) * INJ(0)) + (INJ(1) * R')
R = R' xor INJ(1) xor INJ(0)
S = not(INJ(0))
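The optimized CRS equations can be cross-checked against a plain arithmetic reading of the CRS creation diagrams. In the sketch below, `crs` transcribes the two sets of equations, while `crs_reference` is an assumed interpretation: the pair (R', S') plus the two-bit injection, with an extra carry into the R position when effective subtraction is performed and S' is 0. Both function names are hypothetical.

```python
def crs(r_prime, seff, inj, s):
    """Return (C, R, S) from the optimized equations for the given sticky value."""
    inj1, inj0 = (inj >> 1) & 1, inj & 1
    if s == 0:
        c = (seff & r_prime) | (seff & inj1) | (inj1 & r_prime)
        r = r_prime ^ seff ^ inj1
        return c, r, inj0
    c = (inj1 & inj0) | (inj1 & r_prime)
    r = r_prime ^ inj1 ^ inj0
    return c, r, inj0 ^ 1

def crs_reference(r_prime, seff, inj, s):
    """Arithmetic model: 2*R' + S' + INJ, plus a carry into R when SEFF=1 and S'=0."""
    total = 2 * r_prime + s + inj + 2 * (seff & (s ^ 1))
    return (total >> 2) & 1, (total >> 1) & 1, total & 1
```

The two functions agree for every combination of inputs with INJ restricted to 00, 10, and 11, the only injection values produced by the RNRI conversion.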
The least significant bit of the half-adder sum described previously accounts for
the remaining portion of the carry bit. When it is XOR'd with the C bit from the CRS,
the actual carry bit (σ[64]) is created. This, combined with the R, S, and RN bits, is used
to define the rounding modifiers OVF and NOVF, used for overflows and non-overflows
of the compound addition of FLP and FSOPA.
When there is no overflow in the compound addition, the least significant bit of
the mantissa formed by the XORing of the C and least significant bit of the sum output of
the half-adder array is valid for all cases except for one. The round to nearest algorithm
implemented rounds upward if the result is exactly halfway between two representable
numbers. The IEEE round to nearest algorithm rounds to the nearest even number in this
case, always pulling the ULP down. To accommodate this difference, the halfway
condition must be detected and used to create a mask to alter the output LSB. Without
the injection, the halfway condition occurs when R is 1 and S is 0. With the injection for
RN, the halfway condition occurs when R is 0 and S is 0. As the mask only affects the
output when the halfway condition is met and RN rounding is used, the no overflow mask
is defined as
$$NOVF = R + S + \overline{RN}$$
This produces a 0 if the conditions are met and a 1 otherwise. When NOVF and the
output of XSUM[0] ⊕ C are AND'd together, they produce the correct LSB for the case
when no overflow has occurred.
Similarly, the case when an overflow occurs must also be altered for the round to
nearest halfway condition. Without the injection, the halfway condition for this case
comes when XSUM[0] ⊕ C is 1, R is 0, and S is 0. With the injection, the halfway
condition occurs when XSUM[0] ⊕ C is 1, R is 1 and S is 0. The mask can then be
created as
$$OVF = \overline{(XSUM[0] \oplus C)} + \overline{R} + S + \overline{RN}$$
This mask must then be applied to the LSBs of both outputs of the compound adder in
order to provide the correctly rounded values of each. The most significant bits of the
compound adder outputs are then used to select from the overflow and non-overflow least
significant bits.
Figure 8.14: Injection Based Rounding
To choose between the straight sum and incremented sum of the compound
adder, as well as the corresponding rounded least significant bits, an increment circuit is
implemented along with the rounding circuit. The increment circuit is divided into two
paths - one each for the overflow and no overflow conditions. The overflow is equal to the
most significant bit of the non-incremented compound adder output (FOPSUM[52]). For
the non-overflow case, when FOPSUM[52] is 0 or the round to zero rounding mode is
selected, the FOPSUM[1] bit needs to be incremented only if both C and XSum[0]
are 1.
For the case when FOPSUM[52] is 1 and round to infinity or round to nearest is
selected, the path needs to be further divided according to the rounding mode. If the
round to infinity mode is selected, the increment takes place if the value is greater than a
truncation to 53 bits, so the increment occurs if C or XSum[0] is 1. If round to nearest is
selected, the increment will only take place if the value is greater than the halfway
condition. This occurs when R + C + XSum[0] ≥ 2, with XSum[0] and C each
independently representing the halfway condition.
$$INC = \begin{cases} C \cdot XSum[0] & \text{if } \overline{FOPSUM[52]} \text{ or } RZ \\ C + XSum[0] & \text{if } FOPSUM[52] \text{ and } RI \\ R \cdot C + C \cdot XSum[0] + R \cdot XSum[0] & \text{if } FOPSUM[52] \text{ and } RN \end{cases}$$
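The increment decision can be written as a three-way selection. This is a sketch, assuming the reduced rounding mode is passed as one of the strings "RZ", "RI", or "RN" and that `ovf` stands for FOPSUM[52]; the function name is hypothetical.

```python
def increment(c, xsum0, r, ovf, mode):
    """Decide whether the incremented compound adder output is selected."""
    if not ovf or mode == "RZ":
        return c & xsum0                           # both C and XSum[0] set
    if mode == "RI":
        return c | xsum0                           # either C or XSum[0] set
    return (r & c) | (c & xsum0) | (r & xsum0)     # majority of R, C, XSum[0]
```

The round to nearest case is a majority function, which matches the condition that at least two of R, C, and XSum[0] are set.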
8.3 Summary
In this chapter, a floating-point adder/subtractor based on that of Seidel and Even
was presented. The design uses the two-path architecture, dividing the operations by
exponent difference and effective operation. Subtraction is only performed if the
exponent difference is 0 or 1, and effective subtraction is performed - the "near" path.
Rounding by injection is only performed on the "far" path. The original design was
altered to accommodate denormalized numbers, and to fix an error in the SE paper that
resulted in an exponent difference of 2 being in the "near" path.
Chapter 9 Floating-Point Multiplication
Floating-point multiplication has nearly as many far-reaching applications as
floating-point addition/subtraction. Oberman and Flynn [34] report that floating-point
multiplication accounts for 37% of all floating-point operations. As a result, it is
important to implement an efficient multiplier design. The major variation in floating-point
multiplier design stems from the method used to implement the unsigned
multiplier, and the corresponding rounding.
Figure 9.1: Generic FP Multiplier Block Diagram
It is important to note that the exponent biases must be accounted for during the
exponent addition. As each exponent includes a bias, straight addition would result in a
product exponent offset by an additional bias.
(AExp + bias) + (BExp + bias) = (PExp + bias) + bias
This also makes the prospects of an exponent overflow exception much more likely. A
simple subtraction of one bias value from either exponent input, or the resulting exponent
will resolve this issue.
9.1 FP Multiplier using Booth-Wallace
The multiplier can be implemented using either the shift and add sequential or the
parallel implementations as described in Chapter 4. While the shift and add
implementation requires less hardware, reducing area, it requires more clock cycles, and
thus increases latency. For that reason, the Booth-Wallace architecture is once again
utilized in this multiplier design.
Structurally, the Booth-Wallace multipliers implemented for the floating-point
and integer units are very similar. The major difference stems from the fact that the
floating-point multiplier does not need to support both signed and unsigned
multiplication. The multiplier used in the floating-point unit needs to support only unsigned
numbers, as the mantissas represent 53-bit unsigned numbers, even though the
mantissa of each normalized number begins with a leading one.
As the mantissas with the hidden bit added are 53-bit, two bits must be added to B
to prepare it for use in the Booth encoder. A 0 is appended as the LSB so that the initial
partial product is not mistakenly encoded as belonging to the middle of a string of 1s. For
B to be divided into groups of three, it must have an odd number of bits, so the 54-bit
extended B must be padded with an additional 0. To ensure that the initial value of A is
represented in unsigned format, A is also padded with a 0.
A_ext = 0 & A
B_ext = 0 & B & 0
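The effect of the padding can be illustrated with a short model of radix-4 Booth recoding. This is a minimal sketch using the standard digit rule (-2 times the high bit plus the two lower bits of each overlapping triple); the function name `booth_radix4_digits` is hypothetical, and partial product generation and the Wallace tree are not modeled.

```python
def booth_radix4_digits(b, width=53):
    """Recode an unsigned multiplier into radix-4 digits in the range -2..2."""
    b_ext = b << 1                    # B_ext = 0 & B & 0 (the MSB pad is implicit)
    n_groups = (width + 2) // 2       # 27 overlapping groups for 53 bits
    digits = []
    for i in range(n_groups):
        triple = (b_ext >> (2 * i)) & 0b111
        b2, b1, b0 = (triple >> 2) & 1, (triple >> 1) & 1, triple & 1
        digits.append(-2 * b2 + b1 + b0)
    return digits
```

Summing digit i weighted by 4^i recovers B, and the zero padded above the MSB guarantees that the most significant digit is never negative for an unsigned operand.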
The radix-4 modified Booth encoding is exactly the same as that used in the 64-bit
integer multiplier, the functionality of which may be seen in Table 3. The Booth
encoding results in 27 partial products, resulting in an asymmetric Wallace tree. The
27-input Wallace tree consists of two fewer compressor blocks than the symmetrical 32-input
tree, while maintaining the same four-level design.
Figure 9.2: 53-Bit Booth-Wallace Tree for FP Multiplication
The 128-bit Han-Carlson adder that performed the final summation of the
Wallace tree outputs in the integer multiplier is replaced with a compound adder and
rounding logic. This allows for both the incremented and straight sums that are the
possible results of the rounding and the rounded LSB to be computed simultaneously, in
much the same way as done in the floating-point adder.
9.2 Rounding by Injection
While any number of rounding methods can be employed for the multiplier,
including those mentioned previously [13, 36, 37, 48], the rounding by injection method
[13] introduced for the floating-point adder/subtractor is also utilized for the floating-point
multiplier. However, as the algorithm was developed for multiplicative rounding, it
lends itself to a much more integrated and elegant implementation in the multiplier.
The multiplicative rounding of the outputs of a tree structure differs from the
additive rounding of the shifted mantissas in two simple ways. Firstly, as the addition
algorithm allows for a 6-bit shift control, the smaller operand may be 118 bits in length,
while the maximum bit size of the output of the partial products tree is twice that of the
inputs, or 106 bits - 108 bits including the leading zeros that result from the unsigned
buffering of A. This accounts for the fact that both operands are normalized in the range
of [1,2), so can never attain a value of 4. The second difference, which has more of an
impact on the rounding algorithm, is that both operands can exceed the 53-bit mantissa
size, and not just one as with the addition algorithm. For addition, the larger operand is
shifted a maximum of one bit, so the lower bits are zeroed, allowing for simpler handling
of the rounding algorithm that creates the carry, round and sticky bits. This is not true for
the multiplication algorithm, where the Wallace tree produces a carry and a sum, both of
which are 106 bits in length. This means that the lower bits must also be added together,
a step which was unnecessary in the additive rounding algorithm. The carry, round, and
sticky bits for the multiplier rounding scheme are defined the same as they were for the
addition/subtraction rounding scheme.
Figure 9.3: CRS Generation in Naive FP Multiplication Rounding
The injections used in floating-point multiplication are defined the same as those
used in the floating-point adder:
$$INJ = \begin{cases} 0 & \text{if } RZ \\ 2^{-53} & \text{if } RN \\ 2^{-52} - ULP & \text{if } RI \end{cases}$$
The primary difference between the two methods is that the injection cannot be reduced
to a two bit number due to the fact that both Wallace tree outputs have the same number
of bits, and both affect the CRS. The Wallace tree structure still does allow for a
different injection optimization though. Examining the Wallace tree structure for the 53-
bit multiplier, it can be observed that there is one input that is unaccounted for. In the
previous diagram, it was zeroed so as not to disrupt the addition of the partial products, but
it is possible to insert the injection value here. By injecting it directly into the Wallace
tree, no additional addition hardware is required, and the sum and carry outputs already
include injection. The injection is selected from the RNRI three mode rounding
compression, using multiplexers.
As the injection is already included in the Wallace tree outputs, the computation
of the CRS bits is derived by a simple addition of the lower 52 bits of the Wallace tree
outputs, which produces the C as the carry out, and the R as the MSB. The remaining 51
output bits are OR'd together to create S.
Apart from the CRS generation, the multiplicative and additive rounding are exactly
the same. A compound adder is used to create the sum and the incremented sum after
a row of half adders finds the sum at the C position without the carry from the previous bit
(XSum[0]), while the possible least significant bits are formed by the rounding block.
The rounding block also determines if the incremented sum or straight sum is used for the
product using the C, R and XSum[0] bits. The increment decision logic is the same as
that used for the floating-point adder/subtractor.
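$$INC = \begin{cases} C \cdot XSum[0] & \text{if } \overline{FOPSUM[52]} \text{ or } RZ \\ C + XSum[0] & \text{if } FOPSUM[52] \text{ and } RI \\ R \cdot C + C \cdot XSum[0] + R \cdot XSum[0] & \text{if } FOPSUM[52] \text{ and } RN \end{cases}$$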
Figure 9.4: FP Multiplier with Rounding by Injection - Mantissa Only
9.3 Exponent Calculation and Overflow Exception
As mentioned previously, the unbiased exponent of the product is equal to the
sum of the unbiased exponents of the operands. The result may have to be incremented if
the product of the mantissas produces a value greater than or equal to two, which must
subsequently be normalized into the range of [1,2). The addition of the exponents has the
possibility of creating an overflow condition, where the carry-out of the addition is equal
to 1. This means that the product of the operands cannot be expressed in double-precision
format, and an exception flag must be raised.
To account for the additional bias, and to detect the overflow condition, the bias is
removed from the exponent of the A operand using an 11-bit adder. The two's complement of
the bias - "10000000001" - is added to the exponent of A. The sum is equal to the exponent
with the bias removed, and the carry-out denotes whether the two's complement of the value
must be taken. If the carry-out is 1, the exponent minus the bias is positive (or zero), and
the sum denotes the difference. If the carry-out is 0, the exponent minus the bias is
negative, so the exponent is negative, and the absolute value of the exponent can be found
by taking the two's complement of the sum.
The exponent of A, with the bias removed, is then added to the exponent of B,
with the bias, using a compound adder. This creates both possible exponents. The
carries created can then be used in conjunction with the carry from A exponent - bias
calculation to determine if an overflow has occurred for each of the two possible
exponents. If both carries are 1, then an overflow has occurred - the resulting exponent
plus the bias is equal to or greater than 2048 - and the exception must be flagged. If both
carries are 0, the resulting value is denormalized. If the carries differ, the resulting
exponent plus bias are between 0 and 2047. These are the acceptable exponent values.
The overflow flags can therefore be detected by an XNOR of the carries.
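The carry-based range check can be modeled with two 11-bit additions. The sketch below is a behavioral illustration; the names `product_exponent`, `NEG_BIAS`, and the `increment` argument (standing in for the second output of the compound adder) are assumptions made for the example.

```python
NEG_BIAS = 0b10000000001        # two's complement of the bias (1023) in 11 bits
MASK11 = (1 << 11) - 1

def product_exponent(ea, eb, increment=0):
    """Return (biased product exponent, out_of_range) for 11-bit biased inputs."""
    t = ea + NEG_BIAS                         # remove the bias from the A exponent
    carry_a, unbiased_a = t >> 11, t & MASK11
    u = unbiased_a + eb + increment           # add the biased B exponent
    carry_p, exp_p = u >> 11, u & MASK11
    out_of_range = 1 - (carry_a ^ carry_p)    # XNOR of the two carries
    return exp_p, out_of_range
```

With both exponents at 2046 the two carries are both 1 and the flag is raised, while with both exponents at 1 the carries are both 0, which corresponds to the denormalized case described above.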
9.4 Summary
This chapter presents a floating-point multiplier using a Booth encoder and a
Wallace tree for partial product reduction. IEEE rounding is implemented using
injection-based rounding. Overflow exceptions are detected and flagged.
Chapter 10 Floating-Point Division
Oberman and Flynn [34] report that while division accounts for only 3% of
floating-point operations, it accounts for 40% of the latency. This assumes a twenty
clock cycle latency for division, and a three clock cycle latency for multiplication and
addition. The disproportionate latency is a result of the large number of cycles required
for a division operation, as demonstrated with the integer divider.
[Figure: block diagram with stages FP Operands, Unpack, XOR, Subtract Exponents, Divide
Significands, Adjust Exp / Normalize, Round, Adjust Exp / Normalize, Pack, Quotient]
Figure 10.1: Generic FP Division Block Diagram
As with the floating-point multiplier, the exponent biases in the division must be
accounted for. In floating-point division, the exponent of the divisor is subtracted from
that of the dividend. A straight subtraction results in an exponent without the IEEE bias.
(AExp + bias) - (BExp + bias) = QExp
The bias must be added to the quotient exponent to conform to the IEEE standard.
10.1 FP Divider Using SRT by Comparison Multiples
The divider used in the floating-point division algorithm is a variation of the SRT
non-restoring digit recurrence algorithm proposed independently by Sweeny [5],
Robertson [40], and Tocher [45]. The benefits of using the SRT divider were discussed
in Chapter 5. While it requires more clock cycles to implement than functional dividers,
it has a smaller area and operates at a higher frequency.
The quotient digit selection algorithm used for the SRT divider was implemented
using the comparison multiples method described in [10] and [31]. The integer
implementation has a few key differences from this floating-point implementation from
which it was derived. Firstly, as the floating-point divider need only handle one type of
division, and need not perform both signed and unsigned operations, the operand
conditioning (two's complement and shifting) that was implemented in the integer
divider is not found in the floating-point divider. The removal of the operand shifting is
done with the assumption that normalized numbers will have a leading one, and not
leading zeroes. This means that the FP divider has a fixed latency, unlike the variable
latency integer divider. The other key difference is that the floating-point rounding
algorithm requires three shift registers, as opposed to the two register method utilized in
the integer divider.
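As the algorithm used for the integer divider was adapted from the floating-point
divider, all of the equations derived in Chapter 5 are applicable here. The radix-4 QDS
function is instituted once again, with a redundancy factor of ρ = 2/3 and a signed digit
set of {-2, -1, 0, 1, 2}. Thus, the quotient digit selection function is defined as:
\[
q_{j+1} =
\begin{cases}
-2, & \text{if } \{rP_j\} < 0 \text{ and } \{rP_j\}_c < -\{M_2\}_c \\
-1, & \text{if } \{rP_j\} < 0 \text{ and } -\{M_2\}_c \le \{rP_j\}_c < -\{M_1\}_c \\
\phantom{-}0, & \text{if } \{rP_j\} < 0 \text{ and } -\{M_1\}_c \le \{rP_j\}_c \\
\phantom{-}0, & \text{if } \{rP_j\} \ge 0 \text{ and } \{rP_j\}_c < \{M_1\}_c \\
\phantom{-}1, & \text{if } \{rP_j\} \ge 0 \text{ and } \{M_1\}_c \le \{rP_j\}_c < \{M_2\}_c \\
\phantom{-}2, & \text{if } \{rP_j\} \ge 0 \text{ and } \{M_2\}_c \le \{rP_j\}_c
\end{cases}
\]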
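The selection rule can be illustrated with a short Python sketch of a radix-4 digit recurrence. It is a simplified model, not the thesis hardware: it assumes exact (untruncated) comparisons against the multiples 0.5d and 1.5d, uses rational arithmetic instead of BSD residuals, and the function names are hypothetical.

```python
from fractions import Fraction


def select_digit(rp, d):
    """Pick q in {-2,...,2} by comparing rP with the multiples 0.5d and 1.5d."""
    m1, m2 = d / 2, 3 * d / 2
    if rp >= 0:
        if rp >= m2:
            return 2
        if rp >= m1:
            return 1
        return 0
    if rp < -m2:
        return -2
    if rp < -m1:
        return -1
    return 0


def srt_radix4(x, d, steps):
    """Run the recurrence P[j+1] = 4*P[j] - q[j+1]*d for x, d in [0.5, 1)."""
    digits = []
    p = x / 4  # chosen so that the first shifted residual rP equals x
    for _ in range(steps):
        rp = 4 * p
        q = select_digit(rp, d)
        p = rp - q * d
        digits.append(q)
    return digits, p
```

With x = Fraction(3, 4) and d = Fraction(5, 8), the first digit is 1, and the signed digits weighted by decreasing powers of 4 converge on x/d = 1.2 while the residual stays within (2/3)d.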
Utilizing the truncation inequalities determined in Chapter 5,
2C >-
-, e = log2r + i,
2p-l
in conjunction with the fact that the dividend and divisor are shifted such that they fall
into the range of [0.5, 1), the minimum number of fraction digits required after
truncation, c, is determined to be 5. The minimum number of integer digits is 2. This
proves to be the same as in the QDS function of the integer divider, in spite of the
differences in the number of bits used to represent the numbers. As a result, the QDS
function used for the floating-point divider can be used with absolutely no changes in the
integer divider, or vice versa.
Similarly, much of the remaining divider hardware can be used in both dividers
with little or no modification. The multiplexers need only be shortened so that they are 55
bits, rather than 66 bits. The comparators for the partial residual computations can also
be used with a simple alteration of bit length. The adjust unit that compresses the partial
residual into a 55-bit format is then formed by removing the least significant 11 bits from
the integer adjust unit. The resulting operation can be seen in the following table, with
the hardware implementation following immediately.
Table 10.1: FP Compression Process for i ∈ {0,1,2}
,'i<57:0>
r-TTP
nt57:53>= ABCXY in<54:53>= jy
OOOXY XY
OOIDI .11
ooilx .IX
001J01
~
B ._
001JX .IX
011.01 .11
011JX IX
_ ^_
011.01 .11
-n M
011JX .IX
1TT.0T .11
111JX IX
Turn IT
Tiiix .Ix
Figure 10.2: Adjust Unit for Floating-Point Divider
Using binary signed digit (BSD) representation, a number A is represented as the
difference A^+ - A^-, so the sign of the number is determined from the subtraction by the
first bit position where A^+ ≠ A^-. Using the two's complement representation of the
negative BSD value produces a sign extended value which then allows the carry-out of the
subtraction to indicate the inverse of the sign of A. After the inversion of A^-, the sign
detectors for the partial residuals are composed of the parallel prefix creation (propagate
and generate formation) and carry-tree of the Han-Carlson adder. To reduce logic, and
subsequently area, only the most significant carry is output, and any dot operations
that do not contribute to it are removed. The carry-out bit of each sign detector is
then run through an inverter to find the sign of each of the possible partial residuals. The
following figure depicts the modified Han-Carlson tree used for the sign detectors.
[Figure: prefix tree producing the Sign output. Legend: propagate and generate cells; dot
operator (G, P) o (G', P') = (G + P·G', P·P')]
Figure 10.3: Floating-Point PR Sign Detector
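The carry-only sign detection can be modeled with the following Python sketch. It is an assumption-laden illustration: a plain pairwise reduction with the prefix dot operator stands in for the Han-Carlson topology, the carry-in of 1 that completes the two's complement is folded in as an extra (generate, propagate) pair, and the function names are hypothetical.

```python
def carry_out(a, b, width, carry_in=1):
    """Carry-out of a + b + carry_in, computed only from (g, p) pairs."""
    # Position -1 holds the carry-in as a pure "generate" pair; bits follow LSB first.
    pairs = [(carry_in, 0)]
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        pairs.append((ai & bi, ai ^ bi))
    # Reduce adjacent pairs with the dot operator until one pair remains.
    while len(pairs) > 1:
        merged = []
        for i in range(0, len(pairs) - 1, 2):
            (g_lo, p_lo), (g_hi, p_hi) = pairs[i], pairs[i + 1]
            merged.append((g_hi | (p_hi & g_lo), p_hi & p_lo))
        if len(pairs) % 2:
            merged.append(pairs[-1])  # odd element out keeps its (highest) position
        pairs = merged
    return pairs[0][0]


def bsd_is_negative(a_plus, a_minus, width):
    """Sign of A = A+ - A-: the inverted carry-out of A+ + NOT(A-) + 1."""
    mask = (1 << width) - 1
    return carry_out(a_plus & mask, ~a_minus & mask, width) == 0
```

For instance, bsd_is_negative(3, 5, 6) is True because 3 - 5 is negative, while bsd_is_negative(5, 5, 6) is False.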
The largest difference between floating-point and integer dividers, apart from the lack of
operand conditioning, occurs in the block where the BSD to binary conversion
and rounding takes place. The floating-point divider uses the on-the-fly conversion and
rounding scheme introduced in [2], and refined in [3] and [4]. The on-the-fly summation
is similar to that implemented by the integer divider, except that a third shift register is
introduced that stores the running total plus 1. The Q register maintains an on-the-fly
conversion of the quotient by concatenating the q value produced each cycle with itself (if
positive or zero) or with the Qm register (if negative). The Qm register maintains Q - 1 ULP,
allowing for negative values to be added to the total without the need to perform a literal
subtraction of one ULP before the concatenation. The newly added Qp register maintains
Q + 1 ULP to similar effect. Although the choice of a redundancy factor (ρ) of 2/3 makes
Qp unnecessary for on-the-fly conversion, it is used for rounding. The registers are
updated according to the following equations.
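\[
Q_m[j+1] =
\begin{cases}
(Q[j],\ (q_{j+1} - 1)) & \text{if } q_{j+1} > 0 \\
(Q_m[j],\ (r - |q_{j+1}| - 1)) & \text{if } q_{j+1} \le 0
\end{cases}
\]
\[
Q[j+1] =
\begin{cases}
(Q[j],\ q_{j+1}) & \text{if } q_{j+1} \ge 0 \\
(Q_m[j],\ (r - |q_{j+1}|)) & \text{if } q_{j+1} < 0
\end{cases}
\]
\[
Q_p[j+1] =
\begin{cases}
(Q[j],\ (q_{j+1} + 1)) & \text{if } -1 \le q_{j+1} \le r - 2 \\
(Q_m[j],\ (r - |q_{j+1}| + 1)) & \text{if } q_{j+1} < -1
\end{cases}
\]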
In these equations, (x,y) is a concatenation operation. It may be observed that Qp
is not covered for all r: Qp becomes itself, shifted left two bits, if qj+1 is r-1 (3 in this
case). However, as the possible qj+1 values are {-2,-1,0,1,2} due to the choice of
redundancy factor, this case is eliminated. These equations also assume that the
registers are 54 bits wide and initialized to zero. A simple modification, initializing Q
such that the least significant bit is 0 and all other bits are 1, allows the
conversion/rounding block to become an indicator of when the proper number of
q values have been added. Using this method, when the initial 0 overflows to bit 54,
27 two-bit q values have been added to the registers, ensuring that the necessary 53 bits
of the normalized quotient have been produced.
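The three register updates can be checked with a small Python model. This is a sketch under assumptions: registers are unbounded Python integers rather than 54-bit shift registers, all three start at zero, the first digit is positive (as it is for normalized operands), and the function name is hypothetical.

```python
def on_the_fly(digits, r=4):
    """Track Q, Qm = Q - 1 and Qp = Q + 1 using concatenation only."""
    q_reg = qm_reg = qp_reg = 0
    for q in digits:
        if not -(r - 2) <= q <= r - 2:
            raise ValueError("digit outside the redundant digit set")
        # (x, y) concatenation in radix r is x * r + y, with 0 <= y < r.
        new_q = q_reg * r + q if q >= 0 else qm_reg * r + (r - abs(q))
        new_qm = q_reg * r + (q - 1) if q > 0 else qm_reg * r + (r - abs(q) - 1)
        new_qp = q_reg * r + (q + 1) if q >= -1 else qm_reg * r + (r - abs(q) + 1)
        # All three registers are updated simultaneously from the old values.
        q_reg, qm_reg, qp_reg = new_q, new_qm, new_qp
    return q_reg, qm_reg, qp_reg
```

For the digit sequence [1, -2, 0, 2, -1, 1] the model returns (541, 540, 542), matching the value obtained by summing the signed digits with radix-4 weights. No subtraction or carry propagation is needed at any step.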
The qj+1 value is determined by comparing rP with the comparison multiples of
d. As the initial rP value is equal to x, an analysis of the possible values for x and d,
which fall into the range of [0.5,1), and the resulting comparison multiples (1.5d and
0.5d) results in an initial qj+1 of "01" or "10", as shown in Table 10.2.
Table 10.2: Analysis of Initial q Values With Respect to x and d
x, d Values       x vs Comp Mults     Resulting q
x ~ 1, d ~ 1      .5d < x < 1.5d      qj+1 = 1
x ~ 1, d = .5     x > 1.5d            qj+1 = 2
x = .5, d ~ 1     1.5d > x > .5d      qj+1 = 1
x = d = .5        .5d < x < 1.5d      qj+1 = 1
However, an initial q of 2 will be followed by a negative q, to bring the result below 2.
So the quotient after 27 q values have been added (at which point the overflow stops the
incrementing of the shift registers) will have a leading one at bit 52 or 51. Combining
this with q28 for LSB calculations gives a 56 bit quotient, allowing for a rounded result.
q[28] =
bit number:  55  54  53  ...  2  1  0
value:        0   1   X  ...  X  X  X

q[28] =
bit number:  55  54  53  52  ...  2  1  0
value:        0   0   1   X  ...  X  X  X
In the above representation, the 0 that was initially at the LSB of Q would be at
bit 56. Each cycle, two bits are shifted out of the shift register. The least significant of
the two bits shifted out each cycle is caught in an overflow bit, depicted above as bit 54.
When the overflow bit is 0, the shift registers are disabled to prevent quotient bits from
being shifted out in subsequent cycles. It is for this reason that the initialization routine
fills all but the LSB with 1s, so that the LSB may act as an indicator that the proper
number of cycles has passed. With the shift registers frozen, the entire 53-bit quotient
can be used in conjunction with the following (n+1) quotient digit to determine the
rounded quotient. If the MSB (bit 53 in the above representation) is 1, the LSB is
excluded; otherwise bits 52 through 0 are used.
[Figure: QM, Q, and QP load-shift registers (parallel load with wired left shift) with
inputs QM,in, Q,in, and QP,in; load/shift control; output q Reg. (rounded)]
Figure 10.4: FP Conversion and Rounding [5]
Because division has the potential to produce non-terminating fractions, the
precision required to implement the round to positive infinity or round to negative
infinity IEEE rounding modes could theoretically approach an infinite number of digits.
As a result, it is common for floating-point dividers to implement only the round to nearest
(even) rounding mode, as is the case with this divider. The rounded LSB for the quotient
in the round to nearest mode can be determined using only the least significant bit of the
pre-rounding quotient and the bit position immediately to the right of that.
Since the quotient values are in BSD format, they have the potential to be
negative values, so the rounding is slightly more complex than it sounds. A negative
value for q28 could require that the Qm register be used. Similarly, if the rounding will
result in an overflow, the Qp register is used. The following table shows the rounding
rules for the CRN unit.
Table 10.3: CRN Rounding Rules [5]

53-bit normalised and rounded q (columns: QM[27]<52>, sign)
q28   00                  01                  10              11
-2    (QM[27]<51:0>,1)    (QM[27]<51:0>,1)    Q[27]<52:0>     Q[27]<52:0>
-1    (Q[27]<51:0>,0)     (QM[27]<51:0>,1)    Q[27]<52:0>     Q[27]<52:0>

53-bit normalised and rounded q (columns: Q[27]<52>, sign)
q28   00                  01                  10              11
0     (Q[27]<51:0>,0)     (Q[27]<51:0>,0)     Q[27]<52:0>     Q[27]<52:0>
1     (Q[27]<51:0>,1)     (Q[27]<51:0>,0)     Q[27]<52:0>     Q[27]<52:0>
2     (Q[27]<51:0>,1)     (Q[27]<51:0>,1)     QP[27]<52:0>    Q[27]<52:0>
To determine whether to use the normal Q, the incremented Qp, or the
decremented Qm, two control signals are used. These signals, s0 and s1, are used to
control multiplexers which select the proper Q. Simultaneously, the rounded LSB is
computed as u. The values for u, s0, and s1 can be derived from the following table,
which takes into account the most significant bits of Q and Qm, as well as q28 and the
sign of the partial residual that results from the q28 calculation.
Table 10.4: CRN Rounding [5]
q28        sign   Q[27]<52>   QM[27]<52>   u   s1   s0
-2 = 111   0      X           0            1   1    1
-2 = 111   0      X           1            X   0    0
-2 = 111   1      X           0            1   1    1
-2 = 111   1      X           1            X   1    1
-1 = 110   0      X           0            0   0    0
-1 = 110   0      X           1            X   0    0
-1 = 110   1      X           0            1   1    1
-1 = 110   1      X           1            X   0    0
 0 = x00   0      0           X            0   0    0
 0 = x00   0      1           X            X   0    0
 0 = x00   1      0           X            0   0    0
 0 = x00   1      1           X            X   0    0
 1 = 010   0      0           X            1   0    0
 1 = 010   0      1           X            X   0    0
 1 = 010   1      0           X            0   0    0
 1 = 010   1      1           X            X   0    0
 2 = 011   0      0           X            1   0    0
 2 = 011   0      1           X            X   0    1
 2 = 011   1      0           X            1   0    0
 2 = 011   1      1           X            X   0    0
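Using this table, and some 8x8 Karnaugh maps, the equations for u, s1, and s0 can
be derived using the six inputs. Using Sq, q1 and q0 to denote the sign, middle and LSB
bits of q28, respectively, these equations were derived as:
\[ u = q_1 \cdot \left[\, q_0 + \overline{(S_q \oplus Sign)} \,\right] \]
\[ s_1 = \overline{Q_m[52]} \cdot (Sign + q_0) \cdot (S_q \cdot q_1) \]
\[ s_0 = \overline{Q_m[52]} \cdot (Sign + q_0) \cdot (S_q \cdot q_1) + \overline{S_q} \cdot q_0 \cdot \overline{Sign} \cdot Q[52] \]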
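As a sanity check, the u column of Table 10.4 can be compared against a candidate Boolean expression in Python. Two things here are assumptions rather than statements from the thesis: the operator between Sq and Sign is taken to be an XNOR, and the digit encoding (Sq, q1, q0) is read from the table as -2 = 111, -1 = 110, 0 = x00, 1 = 010, 2 = 011. Only rows where u is not a don't-care are listed, and the names are hypothetical.

```python
# (Sq, q1, q0) encodings of q28; Sq is a don't-care for zero, so it is listed as 0 here.
ENCODING = {-2: (1, 1, 1), -1: (1, 1, 0), 0: (0, 0, 0), 1: (0, 1, 0), 2: (0, 1, 1)}

# (q28, sign, u) for the rows of Table 10.4 in which u is specified.
U_ROWS = [
    (-2, 0, 1), (-2, 1, 1),
    (-1, 0, 0), (-1, 1, 1),
    (0, 0, 0), (0, 1, 0),
    (1, 0, 1), (1, 1, 0),
    (2, 0, 1), (2, 1, 1),
]


def rounded_lsb(sq, q1, q0, sign):
    """Candidate expression: u = q1 AND (q0 OR XNOR(Sq, Sign))."""
    xnor = 1 - (sq ^ sign)
    return q1 & (q0 | xnor)


def mismatches():
    """Return the table rows that the candidate expression fails to reproduce."""
    return [row for row in U_ROWS
            if rounded_lsb(*ENCODING[row[0]], row[1]) != row[2]]
```

Under these assumptions mismatches() returns an empty list, so the candidate expression agrees with every specified u entry.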
Using these equations, the rounding and normalization portion of the CRN can be
implemented as follows.
[Figure: rounding and normalization multiplexers with inputs Qm, Qp, and Q (53 bits each),
select signal s0, and control inputs Q[52], Qm[52], and Sign]
Figure 10.5: FP Divider Rounding and Normalization Logic
Using the components described previously in this section, the mantissa divider
can be constructed. With the exception of the CRN block, it is structurally very similar to
the divider described in Chapter 5 - without the operand conditioning.
[Figure: divider datapath with partial residual registers, multiplexers (MUX), the QDS
block producing q1 and q0, sign detectors, and adjust units]
Figure 10.6: FP Divider Block Diagram
10.2 Exponent Calculation and Overflow Exception
As mentioned previously, the unbiased exponent of the quotient is equal to the
difference between the unbiased exponents of the operands. Assuming the normalized
mantissa range of [1,2), the possible quotient range is (0.5, 2). As the range of [1,2) is
normalized, any quotient less than one must be shifted left, and the exponent must be
decremented. The subtraction of the exponents has the possibility of creating an
overflow condition, when the unbiased exponent of the divisor is negative. An exception
flag must be triggered in this event, but the flag need not be differentiated from the
divide-by-zero exception.
When the exponent subtraction A-B is processed, it has the effect of
removing the bias from the resulting exponent. The bias must then be added back into
the resulting exponent so that it conforms to the IEEE 754 standard. To form both
possible exponents (the biased subtraction, and the biased subtraction minus one used if
the quotient from the mantissa division is less than one) a compound subtractor is used.
The compound subtractor performs the compound addition of the exponent of A and the
inverse of the exponent of B. By not including the addition of the ULP for proper two's
complement subtraction, the result of the addition becomes Aexp-Bexp-1 for the sum
output, and Aexp-Bexp for the incremented sum output. The bias is then added to each
of the possible exponents, and the same bit used to select whether or not to use the
rounding bit, u, is used to select the exponent to output.
The overflow condition occurs only when the absolute value of the divisor,
2^(Bexp-bias) x 1.f, is less than one, and thus results in a quotient that is larger than the
absolute value of the dividend. More specifically, the overflow occurs when the dividend
exponent minus the divisor exponent plus the bias results in a value greater than the
11-bit exponent can hold (2047). The detection of the overflow condition follows easily
from the exponent calculation. Firstly, the overflow will only occur if A > B; if A ≤ B
and B < 1, the resulting quotient will be in the range of (A, 1], and will not overflow. If
this condition is met, the subtraction of the exponents will result in a carry-out of 1.
Secondly, for an overflow to occur, the unbiased quotient exponent plus the bias must
create a carry-out. The overflow condition can therefore be detected using a simple AND
of the carry-outs of the compound subtraction and subsequent bias additions. This
produces two possible overflow bits: one for Aexp-Bexp and one for Aexp-Bexp-1.
The proper overflow bit is chosen along with its corresponding exponent.
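The compound subtraction and the AND-based overflow detection can be sketched in Python as follows. This is a behavioral illustration only: exponents are modeled as 11-bit unsigned integers, the helper names are hypothetical, and underflow handling is deliberately left out because the text above does not describe it.

```python
BIAS = 1023
WIDTH = 11
MASK = (1 << WIDTH) - 1


def add_with_carry(a, b, carry_in=0):
    """11-bit addition returning (sum, carry_out)."""
    total = a + b + carry_in
    return total & MASK, total >> WIDTH


def quotient_exponents(a_exp, b_exp):
    """Return ((exp, ovf), (exp_minus_one, ovf_minus_one)) for biased inputs."""
    not_b = ~b_exp & MASK
    diff_m1, c_sub_m1 = add_with_carry(a_exp, not_b)   # Aexp - Bexp - 1
    diff, c_sub = add_with_carry(a_exp, not_b, 1)      # Aexp - Bexp
    exp, c_bias = add_with_carry(diff, BIAS)           # re-apply the bias
    exp_m1, c_bias_m1 = add_with_carry(diff_m1, BIAS)
    # Overflow needs both a non-negative difference and a carry from the bias add.
    return (exp, c_sub & c_bias), (exp_m1, c_sub_m1 & c_bias_m1)
```

For example, quotient_exponents(1030, 1020) returns ((1033, 0), (1032, 0)); the second pair would be selected when the mantissa quotient is less than one and must be shifted left.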
10.3 Summary
This chapter describes the creation of a double-precision divider following the
IEEE 754 standard and supporting round-to-nearest (even) rounding. The SRT non-
restoring digit recurrence algorithm is implemented using a radix-4 quotient digit
selection function by comparison multiples, with on-the-fly digit conversion, rounding
and normalization. The divider detects and triggers an exception bit for the overflow and
divide-by-zero exceptions as specified in the MIPS64 ISA.
Chapter 11 FPU Synthesis Results
Each of the floating-point blocks was implemented in VHDL. Test benches were
developed in ModelSim, and executed in order to ensure that the functions performed as
expected. The VHDL code was then synthesized under worst case conditions using
Synopsys Design Compiler (2006.06) and the power, area, and delay were observed.
Post-synthesis verification was then done on the synthesized circuits. Unlike the integer
adder and divider, there are no DesignWare libraries against which to compare the FP
implementations, so the FP blocks were compared to the ALU in terms of speed to ensure
that no component required a longer clock period than the universal integer divider.
11.1 FP Adder/Subtractor Synthesis
The floating-point adder/subtractor requires two cycles to execute an instruction
but is fully pipelined such that one instruction may be issued per cycle, without affecting
the operation issued the previous cycle - giving it a maximum throughput of one
operation per cycle. When synthesized, the reported area was 16490 gates, with a
maximum power draw of 92 mW. At a latency of 2.67 ns - a frequency of 375 MHz -
the FP adder/subtractor is slightly faster than the integer divider. The schematic
produced by Synopsys can be seen in the following figure. Due to the lack of logic
outside of the R, N, and path selection blocks, the top view of the design is relatively
uncluttered.
Figure 11.1: FP Adder/Subtractor (Top View) from Synthesis
11.2 FP Multiplier Synthesis
The floating-point Booth-Wallace multiplier utilizes a four clock cycle structure
that is divided very similarly to the integer multiplier. The three sets of registers are
placed after the Booth encoding, and after each two stages of the Wallace tree. Unlike the
integer multiplier, the critical path of the floating-point multiplier lies not in the Wallace-
tree stages, but in the final addition and rounding. As a result, the FP multiplier has a
longer delay than the integer multiplier, at 2.32 ns - an operating frequency of 431 MHz
- but is still faster than the integer divider. The floating-point multiplier has an area of
55,891 gates and a maximum dynamic power of 197.6 mW.
Figure 11.2: FP Booth-Wallace Multiplier from Synthesis
11.3 FP Divider Synthesis
The floating-point divider compares favorably to the integer divider in terms of
both size and speed. Since the critical path of each is the same portion of the circuit (the
creation of the new partial residual) and the floating-point divider uses fewer bits to
perform the calculations, the floating-point divider is slightly faster than the integer
divider at a speed of 392 MHz, a delay of 2.55 ns. The lack of operand conditioning in the
floating-point divider, combined with the smaller number of bits to be processed by the
actual divider circuit, counters the addition of the rounding and exponent circuitry to
result in an area of 11,879 gates. The maximum dynamic power of the floating-point divider was
reported as 2.315 mW. Division is the only operation that is iterative, and thus does not
allow for pipelining. As with the ALU, multiple FPUs can be included to increase
throughput.
Figure 11.3: FP Divider from Synthesis
11.4 FPU Synthesis
The completed floating-point unit utilizes the three floating-point
arithmetic designs, and adds a small amount of logic to switch between them.
As a result, the floating-point unit is slightly slower than the divider with a delay of
2.82 ns, and an operating frequency of 355 MHz. The area of the FPU is 84,440 gates. It
has a maximum dynamic power of 153.9 mW. As one instruction can be completed per
cycle when executing addition or multiplication instructions, the maximum number of
floating-point operations per second (FLOPS) is 355 million.
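The relationship between the reported delays and operating frequencies in this chapter is simply f = 1/T. The short Python check below recomputes each frequency from its delay; the dictionary layout and function name are illustrative choices, while the numbers are those reported in Sections 11.1 through 11.4.

```python
def delay_to_mhz(delay_ns):
    """Convert a clock period in nanoseconds to a frequency in MHz."""
    return 1000.0 / delay_ns


# Component: (reported delay in ns, reported frequency in MHz).
REPORTED = {
    "FP adder/subtractor": (2.67, 375),
    "FP multiplier": (2.32, 431),
    "FP divider": (2.55, 392),
    "FPU": (2.82, 355),
}


def consistent(reported=REPORTED):
    """True if every reported frequency equals the rounded value of 1/delay."""
    return all(round(delay_to_mhz(delay)) == mhz
               for delay, mhz in reported.values())
```

With one addition or multiplication issued per cycle, the FPU figure of 355 MHz corresponds directly to the quoted peak of 355 million floating-point operations per second.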
The schematic for the floating-point unit, as taken from Synopsys, primarily
shows the multiplexer logic, as that was much of the top level of VHDL code. The
floating-point adder, multiplier and divider in the circuit have been labeled.
Figure 11.4: FPU from Synthesis
Chapter 12 Conclusions and Future Work
Both a 64-bit arithmetic logic unit and a double-precision floating-point unit were
designed, modeled in VHDL and tested using ModelSim. The ALU performs arithmetic
and logical shifts, logic operations including XOR, NOR, OR, AND, and INV, as well as
signed and unsigned integer addition, subtraction, multiplication and division. The
addition/subtraction was implemented using a 64-bit Han-Carlson/carry-select hybrid
adder that operates at higher speeds than standard parallel prefix adders. The multiplier
was implemented using a signed/unsigned Booth-Wallace structure. Division was
implemented using a novel approach that combined a comparison multiples SRT divider
with operand conditioning to perform both signed and unsigned division at variable
latency. The ability to perform variable latency division reduces latency for all cases
where the (positive) operands have leading zeros.
The floating-point unit performs double-precision addition, subtraction and
multiplication with all four IEEE rounding modes supported. FP addition and subtraction
were implemented using a dual path adder based on the Seidel-Even adder [41]. The
multiplier used a Booth-Wallace structure with rounding by injection. The FP divider
utilized an SRT digit recurrence algorithm with quotient digit selection by comparison
multiples.
All designs were constrained and synthesized using Synopsys Design Compiler
(2006.06). The delay, area, and power were recorded. The ALU synthesis reported an
area of 122,215 gates, a power of 384 mW, and a delay of 2.89 ns, a frequency of 346
MHz. At one operation per cycle for addition, this results in 346 million instructions
per second. The FPU synthesis reported an area of 84,440 gates, a delay of 2.82 ns and an
operating frequency of 355 MHz, or 355 MFLOPS when pipelining is considered. It has a
maximum dynamic power of 153.9 mW.
While the components were designed for speed, and compared favorably with the
corresponding DesignWare components, the exhibited speeds of around 350 MHz under
worst case conditions, or around 1 GHz for best case conditions, are nowhere near the
speeds produced by commercial processors. Much of this can be attributed to the speed
differences between the mixed (static and dynamic) logic used in industry and the static
logic produced by synthesis with Synopsys. The SOI CMOS that is used in industry has
also been shown to provide significant speed improvements over the bulk CMOS used here.
Future work should include rewriting the VHDL code in Verilog. Verilog
provides a lower level implementation and also allows for transistor modeling. The
transistor modeling can be used to create dynamic logic to increase speed, however
transistor level synthesis is not currently supported by Synopsys. A concerted effort to
implement the designs in both SOI and dynamic logic would result in a product that
could be compared to industry products. An examination into compressor structures
should also be undertaken to allow for multipliers that are smaller in terms of both delay
and area.
Bibliography
1. Brent, R.P., Kung, H.T., "A Regular Layout for Parallel Adders," IEEE
Transactions on Computers, vol. C-31, no. 3, pp. 260-264, Mar. 1982.
2. Chang, C.H., et al., "Ultra Low-Voltage Low-Power CMOS 4-2 and 5-2
Compressors for Fast Arithmetic Circuits," IEEE Transactions on Circuits and
Systems, vol. 51, no. 10, Oct. 2004.
3. Chang, K.C., Digital Systems Design with VHDL and Synthesis: An Integrated
Approach. Los Alamitos: IEEE Computer Society Press, 1999.
4. Ciletti, M.D., Advanced Digital Design with the Verilog HDL. Upper Saddle River,
NJ: Prentice Hall, 2002.
5. Cocke, J., Sweeny, D.W., "High Speed Arithmetic in a Parallel Device," Technical
report, IBM Corporation, February 1957.
6. Dadda, L., "Some Schemes for Parallel Multipliers," Alta Frequenza, vol. 34, no. 5,
pp. 349-356, 1965.
7. Daumas, M., Matula, D.W., "Recoders for Partial Compression and Rounding,"
Technical Report RR97-01, Laboratoire de l'Informatique du Parallelisme, Lyon,
France, 1997.
8. Doran, R.W., "Variants of an Improved Carry Look-Ahead Adder," IEEE
Transactions on Computers, vol. 37, no. 9, pp. 1110-1113, Dec. 1988.
9. Ercegovac, M.D., Lang, T., Digital Arithmetic. San Francisco: Morgan Kaufmann
Publishers, 2004.
10. Ercegovac, M.D., Lang, T., Division and Square Root: Digit-Recurrence
Algorithms and Implementations. Boston: Kluwer Academic Publishers, 1994.
11. Ercegovac, M.D., Lang, T., "On-the-Fly Rounding," IEEE Transactions on
Computers, vol. 41, no. 12, pp. 1497-1503, Dec. 1992.
12. Even, G., "Computer Structure & Introduction to Digital Computers Lecture
Notes," Tel-Aviv University, 2003,
ftp://www.eng.tau.ac.il/~guy/Computer_Structure03/lecture_notes/master.ps
13. Even, G., Seidel, P.M., "A Comparison of Three Rounding Algorithms for IEEE
Floating-Point Multiplication," IEEE Transactions on Computers, vol. 49, no. 7,
pp. 638-650, July 2000.
14. Farmwald, P.M., "On the Design of High Performance Digital Arithmetic Units,"
Ph.D. thesis, Stanford University, August 1981.
15. Fried, R., "Minimizing Energy Dissipation in High-Speed Multipliers," Proceedings
of the 1997 International Symposium on Low Power Electronics and Design,
pp. 214-219, August 1997.
16. Gorshtein, V.Y., et al., "Floating point addition methods and apparatus," Sun
Microsystems, U.S. patent 5808926, 1998.
17. Grad, J., Stine, J., "A Hybrid Ling Carry-Select Adder," IEEE Signals, Systems and
Computers, vol. 2, pp. 1363-1367, Nov. 2004.
18. Gunawan, S., Hsu, K., "Design and Analysis of 64-bit Han Carlson Adder in
0.18um CMOS," submitted to ACM, April 2004.
19. Han, T.D., Carlson, D.A., "Fast Area-Efficient VLSI Adders," Proceedings
Computer Arithmetic, The Computer Society of the IEEE, pp. 49-55, May 1987.
20. Hsiao, S.F., et al., "Design of High-Speed Low-Power 3-2 Count and 4-2
Compressor for Fast Multipliers," Electronic Letters, vol. 34, no. 4, pp. 869-891,
1998.
21. IEEE. Std 754-1985 IEEE Standard for Binary Floating-Point Arithmetic.
Standards Committee of The IEEE Computer Society. New York, NY, 1985.
22. Kogge, P.M., Stone, H.S., "A Parallel Algorithm for the Efficient Solution of a
General Class of Recurrence Equations," IEEE Transactions on Computers, vol.
C-22, no. 8, pp. 786-793, 1973.
23. Ling, H., "High-Speed Binary Adder," IBM J. Res Develop, vol. 25, pp. 156-166,
May 1981.
24. Margala, M., Durdle, N.G., "Low-Power Low-Voltage 4-2 Compressors for VLSI
Applications," IEEE Alessandro Volta Memorial Workshop, pp. 84-90, Mar.
1999.
25. Masuri, Othman, Lakshmanan, et al., "High Performance Parallel Multiplier using
Wallace-Booth Algorithm," IEEE International Conference on Semiconductor
Electronics, pp. 433-436, Dec. 2002.
26. Matthew, S.K., Krishnamurthy, R.K., et al., "Sub-500-ps 64-b ALUs in 0.18-um
SOI/Bulk CMOS: Design and Scaling Trends," IEEE Journal of Solid-State Circuits,
vol. 36, no. 11, pp. 1636-1646, Nov. 2001.
27. Millar, B., et al., "A Fast Hybrid Multiplier Combining Booth and Wallace/Dadda
Algorithms," Proceedings of the 35th Midwest Symposium on Circuits and
Systems, pp. 158-165, 1992.
28. MIPS Technologies, MIPS64 Architecture For Programmers Volume II: The
MIPS64 Instruction Set, Revision 0.95, March 2001.
29. Nagamatsu, M., "A 15 ns 32x32 bit CMOS Multiplier with an Improved Parallel
Structure," IEEE Custom Integrated Circuits Conference, 1989.
30. Nielsen, A.M., et al., "An IEEE Compliant Floating-point Adder that Conforms
with the Pipeline Packet-Forwarding Paradigm," IEEE Transactions on
Computers, vol. 49, no. 1, pp. 33-47, Jan. 2000.
31. Nikmehr, H., "Architectures for Floating-Point Division," Ph.D. Thesis, University
of Adelaide, Australia, 2005.
32. Nikmehr, H., Lim, C.C., "A New On-the-fly Summation Algorithm," Asia-Pacific
Computer Systems Architecture Conference, pp. 258-267, 2003.
33. Oberman, S.F., "Floating-point Arithmetic Unit Including an Efficient Close Data
Path," AMD, U.S. patent 6094668, July 2000.
34. Oberman, S.F., Flynn, M.J., "Design Issues in Division and Other Floating-Point
Operations," IEEE Transactions on Computers, vol. 46, no. 2, pp. 154-161, Feb.
1997.
35. Oberman, S.F., Flynn, M.J., "Division Algorithms and Implementations," IEEE
Transactions on Computers, vol. 46, no. 8, pp. 833-854, August 1997.
36. Quach, N.T., et al., "On Fast IEEE Rounding," Technical Report CSL-TR-91-459,
Stanford University, January 1991.
37. Quach, N.T., et al., "Systematic IEEE Rounding Method for High-Speed
Floating-Point Multipliers," IEEE Transactions of Very Large Scale Integration Systems,
vol. 12, no. 5, pp. 511-521, May 2004.
38. Quach, N.T., Flynn, M.J., "High-Speed Addition in CMOS," IEEE Transactions on
Computers, vol. 41, no. 12, pp. 1612-1619, December 1992.
39. Parhami, B., Computer Arithmetic Algorithms and Hardware Designs. New York:
Oxford University Press, 2000.
40. Robertson, J.E., "A New Class of Digital Division Methods," IRE Transactions on
Electronic Computers, EC-7, pp. 88-92, September 1958.
41. Seidel, P.M., Even, G., "Delay-Optimized Implementation of IEEE Floating-Point
Addition," 2002, http://hyde.eng.tau.ac.il/Projects/FPADD/index.html
42. Severance, C., "An Interview with the Old Man of Floating-Point," University of
California - Berkeley, February 1998,
http://www.cs.berkeley.edu/~wkahan/ieee754status/754story.html
43. Stallings, William, Computer Organization & Architecture: Designing for
Performance, 7th ed. Upper Saddle River, NJ: Prentice Hall, 2006.
44. Sun, S., Han, Y., et al., "409ps 4.7 FO4 64b Adder Based on Output Prediction
Logic in 0.18um CMOS," Proceedings of the IEEE Computer Society Annual
Symposium on VLSI: New Frontiers in VLSI Design, pp. 52-58, 2005.
45. Tocher, K.D., "Techniques of Multiplication and Division for Automatic Binary
Computers," Quarterly Journal of Mechanics and Applied Mathematics, vol. 11,
pp. 364-384, 1958.
46. Wallace, C.S., "A Suggestion for a Fast Multiplier," IEEE Trans. Electronic
Computers, vol. 13, pp. 14-17, 1964.
47. Yariv Levin, "Supporting De-normalized Numbers in an IEEE Compliant
Floating-Point Adder Optimized for Speed," Tel-Aviv University, July 2001.
48. Yu, R., Zyner, G., "167 MHz Radix-4 Floating Point Multiplier," Proc. 12th
Symposium on Computer Arithmetic, pp. 149-154, 1995.
49. Zlatanovici, R., Nikolic, B., "Power-Performance Optimal 64-bit Carry-Lookahead
Adders," IEEE Solid-State Circuits Conference, September 2003.
