Decimal Floating-point Fused Multiply Add with Redundant Number Systems by Han, Liu
Decimal Floating-point Fused Multiply
Add with Redundant Number Systems
A Thesis Submitted to the
College of Graduate Studies and Research
in Partial Fulllment of the Requirements
for the degree of Doctor of Philosophy
in the Department of Electrical and Computer Engineering
University of Saskatchewan
Saskatoon, Saskatchewan, Canada
By
Liu Han
©Liu Han, May, 2013. All rights reserved.
Permission to Use
In presenting this thesis in partial fulfillment of the requirements for a Postgraduate degree
from the University of Saskatchewan, I agree that the Libraries of this University may make
it freely available for inspection. I further agree that permission for copying of this thesis in
any manner, in whole or in part, for scholarly purposes may be granted by the professor or
professors who supervised my thesis work or, in their absence, by the Head of the Department
or the Dean of the College in which my thesis work was done. It is understood that any
copying or publication or use of this thesis or parts thereof for financial gain shall not be
allowed without my written permission. It is also understood that due recognition shall be
given to me and to the University of Saskatchewan in any scholarly use which may be made
of any material in my thesis.
Requests for permission to copy or to make other use of material in this thesis in whole
or part should be addressed to:
Head of the Department of Electrical and Computer Engineering
University of Saskatchewan
57 Campus Drive
Saskatoon, Saskatchewan
Canada
S7N 5A9
Abstract
The IEEE standard for decimal floating-point arithmetic was officially released in 2008.
The new decimal floating-point (DFP) format and arithmetic can be applied to remedy the
conversion error caused by representing decimal floating-point numbers in binary floating-point
format and to improve the computing performance of decimal processing in commercial and
financial applications. Many architectures and algorithms for individual arithmetic functions
on decimal floating-point numbers (e.g., addition, multiplication, division, and square root)
have been proposed and investigated. However, because decimal numbers are represented less
efficiently on binary devices, the area consumption and performance of DFP arithmetic units
are not comparable with those of their binary counterparts.
IBM introduced a binary fused multiply-add (FMA) function in the POWER series of processors
in order to improve the performance of floating-point computations and to reduce the
complexity of hardware design in reduced instruction set computing (RISC) systems. Such an
instruction has also proven suitable for efficiently implementing not only stand-alone
addition and multiplication, but also division, square root, and other transcendental
functions. Additionally, unconventional number systems, including digit sets and encodings,
have shown advantages in performance and area efficiency in many applications of computer
arithmetic.
In this research, by analyzing typical binary floating-point FMA designs and the design
strategy of unconventional number systems, "a high-performance decimal floating-point fused
multiply-add (DFMA) with redundant internal encodings" is proposed. First, the fixed-point
components inside the DFMA (i.e., addition and multiplication) were studied and investigated
as the basis of the FMA architecture. Specific number systems were also applied to improve
the basic decimal fixed-point arithmetic. The superiority of redundant number systems in
stand-alone decimal fixed-point addition and multiplication is demonstrated by the synthesis
results. Afterwards, a new DFMA architecture which exploits the specific redundant internal
operands was proposed. Overall, the specific number system improved not only the efficiency
of the fixed-point addition and multiplication inside the FMA, but also the architecture and
algorithms used to build the FMA itself.
Division, square root, reciprocal, reciprocal square root, and many other functions that
rely on Newton's method or similar methods can benefit from the proposed DFMA architecture.
With a few on-chip memory devices (e.g., look-up tables), or even with software routines
alone, these functions can be implemented on the basis of the hardwired FMA function.
Therefore, the proposed DFMA can be implemented on chip as the sole key component in order
to reduce the hardware cost. Additionally, our research on decimal arithmetic with
unconventional number systems opens the way to other high-performance decimal arithmetic
(e.g., stand-alone division and square root) built upon the basic binary devices (i.e., AND
gate, OR gate, and binary full adder). The proposed techniques are also expected to be
helpful to other non-binary based applications.
Acknowledgements
The entire research was sponsored by the Department of Electrical and Computer Engineering
at the University of Saskatchewan and the Natural Sciences and Engineering Research Council
(NSERC) of Canada. All the toolkits and standard cell libraries used in this research were
provided by CMC Microsystems, Canada.
First of all, I would like to thank my supervisor, Dr. Seok-Bum Ko. In the first year of my
Ph.D. program, I took a course taught by Dr. Ko. We discussed my Ph.D. project a great deal
during that time. He gave me much inspiration through his experience, not only in research
but also in life philosophy. Without Dr. Ko's support, this research would not have been
finished or even started. I would also like to thank other professors in our university.
Without the help of Dr. Li Chen, the evaluation work might not have been finished quickly.
Dr. Aryan Saadat Mehr, Dr. Anh van Dinh, Dr. Chip Hong Chang (Nanyang Technological
University), and Dr. Raymond J. Spiteri provided many helpful ideas and suggestions to
improve the quality of the research and the thesis. The lab manager, Trevor Zintel, also
showed great patience in making sure that the toolkits were working properly. I would like
to thank my friends in our lab for their kind advice, help, and support. Finally, I would
like to thank my wife, Lidan Hu, and my parents for their patience and support, which
accompanied me during the hardest time in my life.
To my father, Han Shuyu, and my son, Han Tianen.
Rest in peace.
Contents
Permission to Use
Abstract
Acknowledgements
Contents
List of Tables
List of Figures
List of Abbreviations
I Preface
1 Introduction
1.1 Background of the Decimal Floating-point
1.1.1 Floating-point Number
1.1.2 Why is Decimal Floating-point Arithmetic Necessary?
1.2 Motivation of Research
1.3 Overview of Research
II Research Background
2 Decimal Floating-point Standard
2.1 Basics of Decimal Floating-point Standard
2.1.1 Basic Format
2.1.2 Special Format
2.1.3 Rounding Modes
2.1.4 Flags
3 Fused Multiply-Add
3.1 Basics of Fused Multiply-Add
3.2 FMA Designs of Binary Floating-point
3.2.1 Original Binary Floating-point FMA Architecture
3.2.2 Multiple-path Binary Floating-point FMA Architecture
3.2.3 Binary Floating-point FMA with Reduced Latency
3.2.4 Combined Decimal and Binary Floating-point FMA
3.3 Applications of Binary FMA
4 Number Systems
4.1 Binary Number System
4.2 Decimal Number System
4.3 Redundant Number System
III Designs
5 Previous Designs
5.1 Decimal Fixed-point Addition
5.2 Decimal Fixed-point Multiplication
5.2.1 Analysis of Previous Parallel Designs
5.2.2 Analysis of Previous Sequential Designs
5.3 Decimal Floating-point FMA
6 Proposed Designs
6.1 Decimal Fixed-point Addition
6.1.1 Carry Free Addition
6.1.2 Absolute Value Digit-Set Conversion
6.2 Parallel Decimal Fixed-point Multiplication
6.2.1 Signed Digit Partial Product Generation
6.2.2 SD Partial Product Reduction
6.2.3 SD-BCD Conversion
6.3 Sequential Decimal Fixed-point Multiplication
6.3.1 Signed Digit Partial Product Generation
6.3.2 Partial Product Accumulation
6.4 Decimal Floating-point FMA
6.4.1 Pre-Alignment
6.4.2 Post-Alignment and Sticky Bits Generation
6.4.3 Rounding
7 Comparison and Discussion
7.1 Decimal Fixed-point Addition
7.2 Decimal Fixed-point Multiplication
7.2.1 Parallel Multiplication
7.2.2 Sequential Multiplication
7.3 Decimal Floating-point FMA
7.3.1 Performance Evaluation
7.3.2 Comparison and Discussion
7.3.3 Pipeline Configuration
IV Conclusion
8 Summary and Future Research
8.1 Summary and Conclusion
8.2 Future Research
References
List of Tables
2.1 Parameters of decimal floating-point numbers
2.2 The rounding directions of different rounding modes
4.1 The signed numbers in different representations
4.2 The decimal digits in different representations
6.1 Range division directly based on operands
6.2 Signed digit representation of the proposed multiples
6.3 Analysis of the number of operands of SD addition
6.4 Proposed SD addition algorithm
6.5 Proposed transfer digit and interim sum recoder
6.6 Delay analysis of each digit of the proposed partial product reduction
6.7 Selection of the easy-multiples
6.8 Conversion from BCD to the specific digit set
6.9 Iterative Conversion
6.10 Selection algorithm of the shifted addend
6.11 Scenarios of one digit error on leading one position
6.12 Node functions for the positive and negative detection trees
6.13 Rounding increment generation algorithm of "TiesToAway" and "TowardPositive" modes
6.14 Rounding increment generation algorithm of "TiesToEven" and "TowardNegative" modes
6.15 Rounding increment generation algorithm of "TowardZero" mode
7.1 Synthesized results and comparison of 16-digit adders
7.2 Delay analysis of 16 × 16-digit decimal fixed-point multipliers
7.3 Performance comparison of 16 × 16-digit decimal fixed-point multipliers
7.4 Critical path of the proposed 16 × 16-digit multiplier
7.5 The critical delay path of the proposed multiplier (ns)
7.6 Area consumption of the proposed 16-digit multiplier
7.7 Comparison of the 16-digit multipliers
7.8 Delay and area partition of the proposed architecture
7.9 Performance comparison
List of Figures
1.1 The layout of the floating-point axis
1.2 Example of round-off error
1.3 Representation errors created in the computation of decimal data in binary system
1.4 Example of the 1 ulp error in decimal processing
2.1 The layout of the bits to represent decimal floating-point
3.1 Basic FMA architecture
3.2 Binary FMA architecture of [26]
3.3 Shifting range of alignment
3.4 Binary FMA architecture of [34]
3.5 Combined Decimal and Binary FMA architecture
4.1 Example of calculation with redundant number system
4.2 Example of calculation with redundant number system: reduced digit set in output
4.3 Example of calculation with redundant number system: signed digit set
4.4 Consideration of the number system
5.1 Decimal floating-point fused multiply-add architectures
6.1 Proposed n-digit signed digit decimal adder
6.2 Proposed absolute value digit-set converter
6.3 Adjust and correction logics of the proposed digit-set converter
6.4 Top level architecture of the proposed parallel decimal multiplication
6.5 Example of the proposed 4 × 4-digit multiplication algorithm
6.6 Proposed architecture of partial product generation
6.7 Restructure of the proposed partial product reduction
6.8 Dot notation of the proposed two levels of multi-operand SD additions
6.9 Hardware structure of the proposed 1st level multi-operand SD adder
6.10 Hardware structure of the proposed 2nd level multi-operand SD adder
6.11 Top level architecture of the proposed partial product reduction unit
6.12 Simplified 4-bit CLA and G, P generation circuit
6.13 Proposed hybrid prefix network in the SD-BCD converter
6.14 Final conditional constant adder
6.15 Recoding of the multiplier
6.16 The proposed partial product generation
6.17 The dot-notation of partial product accumulation (digit-slice)
6.18 The circuitry of partial product accumulation (digit-slice)
6.19 The proposed parallel conversion
6.20 The proposed sequential decimal multiplier
6.21 Proposed architecture
6.22 Details of structure
6.23 Details of calculation
6.24 Decimal floating-point fused multiply-add architectures
6.25 Left and right shifting range of the pre-alignment
6.26 Architecture of the pre-alignment
6.27 Layout of the aligned product and addend
6.28 Post-alignment shift amount decision
6.29 Detailed structure of the post-alignment shift amount calculation
6.30 Hardware structure of the correction detection unit
6.31 Architecture of the rounder
7.1 Area-Delay Comparison
7.2 Power-Delay Comparison
7.3 Delay-area space of the decimal multipliers
7.4 Evaluation of speed, area, power consumption of the proposed sequential multiplier
7.5 A regular pipeline configuration of the proposed architecture
List of Abbreviations
ADP Area Delay Product
BCD Binary Coded Decimal
BFA Binary Full Adder
BFP Binary Floating-point
CFA Carry Free Adder
CLA Carry Lookahead Adder
CPA Carry Propagate Adder
CSA Carry Save Adder
DFP Decimal Floating-point
DFMA Decimal Floating-point Fused Multiply-Add
DPD Densely Packed Decimal
DSD Decimal Signed Digit
FA Full Adder
FADD Floating-point Addition
FDIV Floating-point Division
FMA Fused Multiply-add
FMUL Floating-point Multiplication
FSQRT Floating-point Square Root
FXP Fixed-point
HA Half Adder
LSB Least Significant Bit
LSD Least Significant Digit
LUT Look-up Table
LZA Leading Zero Anticipator
LZD Leading Zero Detector
MSB Most Significant Bit
MSD Most Significant Digit
NaN Not a Number
PDP Power Delay Product
PPA Partial Product Accumulation
PPG Partial Product Generation
PPR Partial Product Reduction
RISC Reduced Instruction Set Computing
RTA Round Ties to Away
RTE Round Ties to Even
RTN Round Toward Negative
RTP Round Toward Positive
RTZ Round Toward Zero
SD Signed Digit
SDDA Signed Digit Decimal Adder
SIMD Single Instruction Multiple Data
TZD Trailing Zero Detection
ulp unit of least precision
Part I
Preface
Chapter 1
Introduction
This chapter introduces the basic principles of floating-point arithmetic. Section 1.1 then
shows why decimal floating-point arithmetic is necessary by analyzing the limits of binary
floating-point processing. Financial and commercial applications, in which decimal computing
is dominant, demand high computational performance; moreover, the existing decimal
floating-point solutions are restricted by their limited efficiency. These conditions
motivate this research, as described in section 1.2. Finally, section 1.3 provides the layout
of this thesis.
1.1 Background of the Decimal Floating-point
Computer arithmetic is a vital element in the functionality of any computer system. Two
major types of data, integer and floating-point, are processed by computers to mimic
computation on real numbers. An integer is simply defined and processed by the basic binary
devices. However, floating-point arithmetic involves procedures much more complicated than
those for the underlying integer data.
1.1.1 Floating-point Number
A floating-point number comprises four elements: sign, significand, exponent, and an
implicit base number. Therefore, a floating-point number F is defined as:
\[ F = (-1)^{S} \times C \times B^{E} \tag{1.1} \]
where S, C, E, and B represent the sign, significand, exponent, and base number, respectively.
When a real number with infinitely many digits (e.g., $\pi = 3.141592\ldots$) is represented
with a finite hardware resource, the precision problem emerges. A floating-point format
therefore has two important attributes: range and precision. The range, which defines the
maximum and minimum representable numbers in a given floating-point format, depends mainly
on the number of bits used to represent the exponent. Note that the exponent represents the
power of the base number, so the floating-point axis is not divided equally by the numbers
on it. Additionally, zero is a special case in floating-point, having an arbitrary exponent.
Consequently, the maximum number, farthest from zero, is represented as
$C_{max} \times B^{E_{max}}$, and the minimum number, closest to zero, is represented as
$C_{min} \times B^{E_{min}}$. A fragment of the floating-point axis is shown in Fig. 1.1:
Figure 1.1: The layout of the floating-point axis (marks at -Fmin, 0, Fmin, B×Fmin, and B^2×Fmin)
Once an exponent is given, the precision of a floating-point format indicates how many
non-zero numbers (i.e., $B^{P} - 1$, where P is the number of digits in the significand) can be
exactly represented in the given exponent window, which is defined by $B^{E}$. For example, 9
numbers (.1 to .9) are exactly representable if P is 1, B is 10, and E is 0.
The precision of a given floating-point format also implies another useful parameter, the
so-called unit of least precision (ulp). The ulp is defined as the minimum non-zero value of
the least significant digit. If the decimal point is implicitly on the left of the most
significant digit, 1 ulp equals $1/B^{P}$. Note also that several slightly different
definitions of ulp are summarized in [1]. In this thesis, the definition given above is used.
Since real numbers with infinitely many digits (e.g., $\pi = 3.141592\ldots$) cannot be exactly
represented in a floating-point format, the difference between a real number and the
corresponding floating-point number, termed the round-off error, can be measured in ulps.
Fig. 1.2 shows the differences between the exact result ($R_{exact}$) and the possible rounded
results ($R_{rounded}$). Depending on the rounding mode, the exact result is rounded to the
closest exactly representable number on its left or on its right. Therefore, the maximum
round-off error is bounded by either 1 ulp or 0.5 ulp, depending on the rounding mode and the
value of the exact result.
Figure 1.2: Example of round-off error (the exact result Rexact lies between two neighbouring rounded results Rrounded that are 1 ulp apart)
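The round-off error can be measured in ulps with Python's `decimal` module. This is a minimal sketch, not part of the thesis: the exact value 0.1266 and the three-digit precision (1 ulp = 0.001) are arbitrary choices made for illustration, and `round_off_error_in_ulp` is a hypothetical helper.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_DOWN, ROUND_CEILING

def round_off_error_in_ulp(exact, ulp, mode):
    """Round `exact` to a multiple of `ulp` and return (rounded, error in ulp)."""
    rounded = exact.quantize(ulp, rounding=mode)
    return rounded, abs(rounded - exact) / ulp

exact = Decimal("0.1266")   # assumed exact result
ulp = Decimal("0.001")      # assumed precision: three digits after the point

for mode in (ROUND_HALF_EVEN, ROUND_DOWN, ROUND_CEILING):
    rounded, err = round_off_error_in_ulp(exact, ulp, mode)
    print(mode, rounded, err)
```

With round-to-nearest the error stays at or below 0.5 ulp, while the directed modes can leave an error anywhere below 1 ulp.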
Currently, binary floating-point, which employs 2 as the base (i.e., it uses "0" and "1" in
the significand), is dominant in computer systems. A technical standard for floating-point
arithmetic (IEEE 754) was established by the IEEE in 1985 and revised in 2008.
1.1.2 Why is Decimal Floating-point Arithmetic Necessary?
Although binary arithmetic has shown its advantages in processing speed, hardware
complexity, and storage compactness, decimal arithmetic has been in use since the first days
of modern electronic computers [2, 3]. In the applications where binary arithmetic is
dominant, either fixed-point (e.g., integer) arithmetic is adequate for the tasks, or the
tasks are tolerant of rounding error. Nonetheless, this is not true in some specific
applications. For example, in financial computing, both very small numbers, such as
$2.6 \times 10^{-5}$, and very large numbers, such as $1.9 \times 10^{9}$, need to be dealt
with, according to the amount and unit in the transactions. Note that these monetary data are
represented in scientific notation with a decimal base. Consequently, a representation error
can arise when converting the decimal monetary data to binary floating-point format. For
instance, the decimal number 0.4 in binary floating-point single precision format is about
"0.4000000059604645...". The small difference between the binary approximation and the
original decimal number could cause an error of 1 ulp or more in some cases. Furthermore,
such a tiny difference could accumulate during the computation and may cause serious problems.
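The representation error of 0.4 in single precision can be reproduced with the Python standard library. This is a sketch only: `to_float32` is a hypothetical helper that round-trips a value through the 4-byte binary32 encoding using `struct`.

```python
import struct
from decimal import Decimal

def to_float32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

stored = to_float32(0.4)
print(Decimal(stored))                   # exact value held in single precision
print(Decimal(stored) - Decimal("0.4"))  # representation error
```

`Decimal(stored)` converts the binary value exactly, so the printed digits are the stored value itself and not a second approximation.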
Figure 1.3: Representation errors created in the computation of decimal data in binary system
(a) Processing binary data in binary system: binary operands, binary system (round-off error), binary results
(b) Processing decimal data in binary system: decimal operands, decimal-to-binary conversion (conversion error), binary system (round-off error), binary-to-decimal conversion (conversion error), decimal results
(c) Processing decimal data in decimal system: decimal operands, decimal system (round-off error), decimal results
In Fig. 1.3, three computation models are given to illustrate how errors are generated
during decimal processing. The traditional binary floating-point processing is shown in
Fig. 1.3(a). The computation error occurs mainly because of the finite precision of the
hardware, which can be solved by iterative computations in software. Prior to the decimal
floating-point standard, decimal floating-point data were processed in the binary system,
as shown in Fig. 1.3(b). However, the conversion error, as described in the first paragraph
of this section, complicates the problem. First, this conversion error is fed into the
binary system. After computations inside the binary system, the error can accumulate.
Second, once the result is rounded and converted from binary back to decimal form, if the
exact result and the hardware result are not on the same side of the half ulp, a 1 ulp error
may be created in some rounding modes. This error is shown in Fig. 1.4.
Figure 1.4: Example of the 1 ulp error in decimal processing (Rexact and Rcomputed lie on opposite sides of the half-ulp point, so RN(Rcomputed) and RN(Rexact) differ by 1 ulp)
In financial applications, monetary data sometimes need to be rounded to the cent. For
example, if a cell phone call costs 1.3 CAD, with 5 percent tax, the bill for this call will
be $1.3 \times 1.05 = 1.365$ CAD, and the rounded bill is 1.37 CAD. However, the real
computation in a binary system with a single precision binary floating-point library can be
$1.2999999523162842\ldots \times 1.0499999523162842\ldots = 1.364999771118164\ldots$, and the
rounded result is 1.36 CAD. The error is much larger than 1 ulp. A benchmark demonstrated
that in a large telephone billing system, the errors could accumulate to about 5 million
dollars per year [4].
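The phone bill example can be replayed in software. The sketch below rests on several assumptions: the operands are rounded to binary32 with a hypothetical `to_float32` helper, their product is kept exactly (it is not rounded to single precision a second time, which does not change the outcome in this case), and the final rounding to the cent is round-half-up.

```python
import struct
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")

def to_float32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def bill_binary32(price, tax_factor):
    """Multiply single-precision operands, then round to the cent."""
    product = to_float32(price) * to_float32(tax_factor)
    return Decimal(product).quantize(CENT, rounding=ROUND_HALF_UP)

def bill_decimal(price, tax_factor):
    """Multiply exact decimal operands, then round to the cent."""
    product = Decimal(price) * Decimal(tax_factor)
    return product.quantize(CENT, rounding=ROUND_HALF_UP)

print(bill_binary32(1.3, 1.05))     # binary path
print(bill_decimal("1.3", "1.05"))  # decimal path
```

Both single-precision operands lie slightly below their decimal values, so the binary product falls just under 1.365 and the bill rounds down, whereas the decimal product is exactly 1.365 and rounds up.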
To avoid the errors caused by representing decimal fractional numbers in binary
floating-point format, software solutions can be applied to obtain more accurate results,
either by iterative algorithms or by converting the operands to integer space. For instance,
Intel provided a software library to define and process DFP numbers on Windows, Linux, and
HP-UX [13]. Other programming languages and open source libraries that support DFP
arithmetic can be found in [14-16]. The benefit of the software solution is its flexibility
across various platforms. However, the hardware solution is superior in processing speed and
energy efficiency. Experiments on different benchmarks showed that the hardware solution is
typically 100 to 1000 times faster than the software solution [16, 18, 19].
In 2003, IBM proposed a decimal floating-point (DFP) hardware solution for financial
computing and other similar commercial applications [5]. Meanwhile, some decimal fixed-point
and floating-point arithmetic units were implemented and integrated into IBM's processors
and mainframes [7-12]. Recently, hardware solutions for many basic arithmetic operations
have been proposed. For example, Wang et al. proposed a DFP adder and multifunction unit
with injection-based rounding [20, 21]. Hickmann et al. presented a parallel DFP multiplier
in [22], and Lang et al. presented a 16-digit decimal SRT division algorithm in [23].
Architectures and algorithms for transcendental functions are also investigated in [24, 25].
With continuously increasing demands for decimal computation and decreasing transistor sizes
on integrated chips, the hardware solution has gained popularity for commercial and
financial applications.
Because of the importance of decimal arithmetic, the DFP format and operations were
included in 2008 in the latest version of the IEEE standard for floating-point arithmetic
(IEEE 754-2008) [17]. In this thesis, a pure hardware architecture for decimal floating-point
processing, as shown in Fig. 1.3(c), is discussed.
1.2 Motivation of Research
There are several different methods for implementing decimal floating-point arithmetic on
chip. First, a decimal adder with the necessary logic can be implemented on chip, and other
functions can be computed serially (i.e., digit-by-digit) by the decimal adder. The advantage
of this method is that it needs much less hardware area than any of the other methods.
However, its processing speed is lower than that achieved by the other methods. Second,
separate function units can be implemented on chip. The functions on chip can then be
performed simultaneously to achieve better performance. In contrast to the previous method,
the hardware area, local routing, and power consumption may be considerable. Third, a fused
multiply-add (FMA) function can be implemented on chip, and other functions can reuse the
FMA hardware with the necessary hardware or software support. With the on-chip FMA, the
hardware requirement of other functions would be minimized or even eliminated. For example,
addition and multiplication could be implemented simply by setting the operands of the FMA
function properly (i.e., $(A \times 1) + C$ and $(A \times B) + 0$). Furthermore, other
numerical functions can then be evaluated mathematically by a series of additions and
multiplications, as shown in equation 1.2:
\[
\begin{aligned}
f(x) &= a_0 + a_1 x + a_2 x^2 + a_3 x^3 + \cdots + a_n x^n \\
     &= a_0 + x(a_1 + x(a_2 + \cdots + x(a_{n-1} + a_n x) \cdots))
\end{aligned}
\tag{1.2}
\]
Since this method provides a hardware solution which is balanced on performance and
cost, it has been supported in several commercial processor architectures, such as Intel Ita-
nium and IBM POWER series microprocessors [28, 29].
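The reuse of a single FMA operation for addition, multiplication, and polynomial evaluation (equation 1.2) can be sketched in software. This is an illustration rather than the proposed hardware: it relies on `Decimal.fma` from Python's `decimal` module, which computes a product plus an addend with a single rounding, and the wrapper names `fma`, `add`, `mul`, and `horner` are hypothetical.

```python
from decimal import Decimal

def fma(a, b, c):
    """a * b + c with a single rounding (software stand-in for a DFMA unit)."""
    return a.fma(b, c)

def add(a, c):
    """Addition expressed as (A x 1) + C."""
    return fma(a, Decimal(1), c)

def mul(a, b):
    """Multiplication expressed as (A x B) + 0."""
    return fma(a, b, Decimal(0))

def horner(coeffs, x):
    """Evaluate a0 + a1*x + ... + an*x^n with one FMA per coefficient.
    `coeffs` is [a0, a1, ..., an]."""
    acc = coeffs[-1]
    for a in reversed(coeffs[:-1]):
        acc = fma(acc, x, a)   # acc = acc * x + a
    return acc

coeffs = [Decimal("1"), Decimal("0.5"), Decimal("0.25")]  # 1 + 0.5x + 0.25x^2
print(horner(coeffs, Decimal("2")))
print(add(Decimal("1.3"), Decimal("0.07")), mul(Decimal("1.3"), Decimal("1.05")))
```

A polynomial of degree n needs n FMA steps in this form, which is why the nested form of equation 1.2 maps naturally onto an FMA unit.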
In the past decade, many such decimal hardware solutions have been investigated. In these
designs, unconventional encoding systems have shown advantages in processing performance.
The traditional binary coded decimal (BCD) encoding applies 4 binary bits to represent the
10 decimal digits, so the representation space is not fully exploited, which lowers
performance. Additionally, the computational efficiency of decimal processing upon binary
hardware devices (e.g., AND gates, OR gates, and full adders) is limited by the inefficient
representation of the decimal numbers. In this thesis, we focus on efficient number systems
for decimal processing on binary devices. The redundant number systems, which include
unconventional digit sets and encodings, are studied and examined, not only to represent
decimal numbers efficiently, but also to process decimal data at the register transfer level
and at the architectural level.
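The cost of BCD on binary devices can be made concrete with a short sketch. It shows the standard textbook BCD digit addition with the +6 correction, not the redundant adder studied in this thesis, and the function names are hypothetical.

```python
def bcd_encode(n):
    """Pack a non-negative integer into BCD, 4 bits per decimal digit."""
    code, shift = 0, 0
    while True:
        n, digit = divmod(n, 10)
        code |= digit << shift
        shift += 4
        if n == 0:
            return code

def bcd_digit_add(a, b, carry_in=0):
    """Add two BCD digits with a binary adder plus the +6 correction."""
    s = a + b + carry_in    # plain binary addition of two 4-bit digits
    if s > 9:               # result is not a valid decimal digit
        s += 6              # skip the six unused codes 1010..1111
    return s & 0xF, s >> 4  # (sum digit, carry out)

unused_codes = 2 ** 4 - 10  # codes that BCD never uses
print(unused_codes)
print(hex(bcd_encode(1365)))
print(bcd_digit_add(7, 8))
```

Six of the sixteen 4-bit codes are never used, and every digit position needs a comparison and a correction on top of the binary addition, which is the kind of overhead that motivates other digit sets and encodings.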
As analyzed above, the research covers the following topics:
1, the number system and related hardware design to improve the efficiency of basic
decimal processing,
2, the high performance algorithms and architectures of basic decimal arithmetic, such as
addition and multiplication,
3, the high performance architecture of decimal floating-point fused multiply-add.
In short, efficient algorithms, architectures, and encodings that enable a better decimal
floating-point fused multiply-add are studied and investigated in this research in order to
improve the performance of decimal floating-point processing.
1.3 Overview of Research
In this thesis, decimal fixed-point addition and multiplication are studied first, and the
application of unconventional digit sets and encodings is examined. Afterwards,
a new two-step non-speculative carry-free decimal adder is proposed to decrease the delay
and the hardware cost at the same time. Additionally, a new decimal fixed-point parallel
multiplier for high-throughput applications is proposed to create fewer partial products
without carry propagation. A hybrid carry propagation network is also created to efficiently
accumulate the final product in the last step of the proposed parallel multiplier. Furthermore,
a fixed-point sequential decimal multiplier utilizing two necessary multiples and an on-the-fly
digit set converter is proposed to perform decimal multiplication with less hardware cost.
Subsequently, the decimal floating-point fused multiply-add, which exploits our fixed-point
addition and parallel multiplication, is investigated. With the help of the unconventional
number system, the architecture of the proposed FMA can be significantly optimized. Moreover,
the proposed design follows the definition of this operation in the IEEE standard. Thus,
all necessary flags and special operands are supported. By exploiting unconventional digit
sets and encodings in our designs, not only is the performance of decimal processing improved,
but the understanding of redundant number systems for non-binary data processing is also
expanded.
This thesis is organized as follows:
Part I: Preface includes:
    Chapter 1: Introduction presents the basics of floating-point numbers, why
    decimal floating-point is needed, and the motivation of the research.
Part II: Research Background includes:
    Chapter 2: Decimal Floating-point Standard introduces the IEEE 754-2008
    standard, which defines the basics of the decimal floating-point format, the special
    cases of the operands, the rounding modes, and the exception handling.
    Chapter 3: Fused Multiply-Add gives the basic concepts of the fused multiply-add
    operation. Additionally, the state-of-the-art algorithms and architectures
    proposed for binary designs are briefly introduced to inform the decimal designs.
    Chapter 4: Number Systems presents the fundamentals of number systems
    (i.e., digit sets and encodings), which are part of the scope of this research.
Part III: Designs includes:
    Chapter 5: Previous Designs discusses the state-of-the-art designs proposed in the
    literature, which include decimal fixed-point addition, multiplication, and decimal
    floating-point fused multiply-add. These previous works are reviewed
    to analyze possible improvements to the existing designs.
    Chapter 6: Proposed Designs describes the new designs proposed during this
    research. The algorithms and architectures of the new designs are discussed in
    detail.
    Chapter 7: Comparison and Discussion analyzes the differences between the
    previous designs and our proposed designs. The improvements that achieve the
    better performance are discussed.
Part IV: Conclusion includes:
    Chapter 8: Summary and Future Research summarizes the research results and
    proposes future work.
Part II
Research Background
Chapter 2
Decimal Floating-point Standard
This chapter introduces the basics of the IEEE standard for floating-point arithmetic (i.e.,
IEEE 754-2008). As a warm-up, the format is first described briefly. The special
operands, rounding modes, exception conditions, and flags are then introduced in turn. The
details of the contents described in this chapter are reviewed and discussed in later chapters.
2.1 Basics of Decimal Floating-point Standard
The representation error caused by the conversion from decimal floating-point to binary
floating-point was introduced in section 1.1. However, in the database systems in the fields
of banking, financial analysis, retail sales, etc., over 55% of data are in decimal format [5].
Because of the inefficiency of the current binary hardware, the decimal processing overhead
could be over 90% in order to achieve accurate results in decimal format [4]. A decimal
floating-point specification was introduced in 2001 for obtaining accurate decimal
floating-point results with appropriate efficiency in financial and commercial applications [6].
A new data type and the necessary arithmetic on it were further introduced by M. F. Cowlishaw
in [5]. With the help of many researchers, a new floating-point industry standard (i.e., IEEE
754-2008) was released in 2008 [17]. In IEEE 754-2008, a family of commercially feasible
ways to perform decimal floating-point arithmetic is defined. Four basic aspects are
introduced in the following subsections.
2.1.1 Basic Format
In IEEE 754-2008, a decimal floating-point number is represented by a sign (S), a significand
(M), an exponent (E), and an implied base (B = 10). A signed decimal floating-point number is
therefore represented as:
\[
(-1)^S \times M \times B^E \tag{2.1}
\]
where,
S is 0 or 1,
E is in $[E_{min}, E_{max}]$,
M is in $[0, B)$, and M is represented as $d_0 \,.\, d_1 d_2 \ldots d_{p-1}$ (p means the precision).
If M is represented as shown above, the decimal floating-point number is viewed in scientific
form, in which the radix point is right after the first digit. In some cases, it is convenient to
represent the significand as an integer. Thus, the signed decimal floating-point number is
represented as:
\[
(-1)^S \times C \times B^Q \tag{2.2}
\]
where,
S is 0 or 1,
Q is any integer with $E_{min} \le Q + p - 1 \le E_{max}$,
C is any integer with $0 \le C < B^p$ (p means the precision).
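The relation between the scientific form (2.1) and the integer form (2.2) can be checked numerically. The sketch below assumes Python's standard `decimal` module; the helper names `integer_form`, `scientific_form`, and `value` are hypothetical and only illustrate $M = C \times 10^{-(p-1)}$ and $E = Q + p - 1$ for finite numbers.

```python
from decimal import Decimal


def integer_form(x):
    """Return (S, C, Q) with x == (-1)**S * C * 10**Q (finite x only)."""
    sign, digits, exponent = x.as_tuple()
    c = int("".join(map(str, digits)))
    return sign, c, exponent


def scientific_form(s, c, q, p):
    """Convert (S, C, Q) to (S, M, E) for precision p.

    M = C / 10**(p - 1) places the radix point after the first of the
    p digit positions, and E = Q + p - 1 compensates for that scaling.
    """
    m = Decimal(c).scaleb(-(p - 1))
    e = q + p - 1
    return s, m, e


def value(s, m, e):
    """Evaluate (-1)**S * M * 10**E as in equation (2.1)."""
    return (-1) ** s * m.scaleb(e)


# Illustrative operand, using the decimal32 precision p = 7.
x = Decimal("-12.345")
s, c, q = integer_form(x)                # (1, 12345, -3)
s, m, e = scientific_form(s, c, q, p=7)  # M = 0.012345, E = 3
```

Both forms denote the same value; only the position of the radix point inside the significand differs.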
By defining the maximum and minimum exponents and different precisions, the decimal
floating-point numbers are divided into three formats.
Table 2.1: Parameters of decimal floating-point numbers
parameter    decimal32    decimal64    decimal128
p, digits    7            16           34
Emax         +96          +384         +6144
In Table 2.1, only Emax is defined. Emin is calculated as $1 - E_{max}$. In hardware,
a decimal floating-point number is represented in three segments (i.e., the base is
implicit). The layout of the hardware representation of a decimal floating-point number is
illustrated in Fig. 2.1. Since the range of the exponent is fixed, a bias is added to the
exponent in each format in order to simplify the exponent processing. The number of
bits for each segment can be found in [17].
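The derived parameters follow from Table 2.1 and the constraint on Q in equation (2.2). The sketch below is illustrative: the function name `derived_parameters` is hypothetical, and the bias is assumed to be the smallest value that makes the biased exponent Q + bias non-negative; the normative values are those given in [17].

```python
# (precision p, Emax) taken from Table 2.1.
FORMATS = {
    "decimal32": (7, 96),
    "decimal64": (16, 384),
    "decimal128": (34, 6144),
}


def derived_parameters(p, emax):
    """Derive Emin, the range of Q, and an exponent bias from p and Emax."""
    emin = 1 - emax              # Emin = 1 - Emax
    q_min = emin - (p - 1)       # from Emin <= Q + p - 1
    q_max = emax - (p - 1)       # from Q + p - 1 <= Emax
    bias = -q_min                # assumption: smallest bias with Q + bias >= 0
    return {"emin": emin, "q_min": q_min, "q_max": q_max, "bias": bias}


params = {name: derived_parameters(p, emax) for name, (p, emax) in FORMATS.items()}
```

With a bias chosen this way, the biased exponent is an unsigned integer, so exponent comparison needs no sign handling.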
[Figure 2.1: The layout of the bits to represent decimal floating-point. Fields: S (sign, 1 bit), E (combination field, MSB to LSB), and C (significand field, MSB to LSB).]
2.1.2 Special Format
In IEEE 754-2008, several special numbers are defined. These numbers have to be processed
correctly by a hardware solution that is standard compliant.
Infinity
Every finite number can be represented by equation (2.2). However, if the magnitude
of a decimal floating-point number is larger than that of the largest representable number,
it is defined as infinity. Computations with infinity are usually exact. For example,
$(\infty + x) = \infty$ and $(\infty \times x) = \infty$ for a finite positive x. Therefore, the
exception signals are not set for these computations. However, in some special cases, the
exception signals are set. For example, $(x \div 0) = \infty$, or an overflow raised by finite
computations. The detailed definition of such a number can be found in section 6.1 in [17].
NaNs
NaN, which stands for Not a Number, comes in two different kinds (i.e., signaling and
quiet). The signaling NaN, or sNaN, provides representations for uninitialized variables in
memory and for other enhancements of arithmetic which are not defined in the standard. The
quiet NaN, or qNaN, provides diagnostic information for invalid or unavailable results. For
example, $(0 \times \infty)$, $(+\infty + (-\infty))$, and $(0 \div 0)$ have no valid results.
The invalid flag is therefore set and a NaN is given as the result. The detailed definition of
such a number can be found in sections 6.2 and 7.2 in [17].
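The behavior of the special operands can be observed in software. The following sketch assumes Python's standard `decimal` module, which follows the decimal arithmetic specification closely; the helper `evaluate` is hypothetical, and traps are disabled so that exceptional results are returned together with the invalid flag instead of raising a Python exception.

```python
from decimal import Context, Decimal, InvalidOperation


def evaluate(op):
    """Run op in a fresh, non-trapping context; return (result, invalid flag)."""
    ctx = Context(traps=[])
    result = op(ctx)
    return result, bool(ctx.flags[InvalidOperation])


INF = Decimal("Infinity")
NEG_INF = Decimal("-Infinity")

# Exact computation with infinity: no flag is raised.
inf_sum, inf_sum_invalid = evaluate(lambda c: c.add(INF, Decimal(5)))

# Operations without a valid result: a quiet NaN is returned and invalid is set.
zero_times_inf, zti_invalid = evaluate(lambda c: c.multiply(Decimal(0), INF))
inf_minus_inf, imi_invalid = evaluate(lambda c: c.add(INF, NEG_INF))
zero_div_zero, zdz_invalid = evaluate(lambda c: c.divide(Decimal(0), Decimal(0)))

# A signaling NaN operand signals invalid and yields a quiet NaN.
from_snan, snan_invalid = evaluate(lambda c: c.add(Decimal("sNaN"), Decimal(1)))
```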
2.1.3 Rounding Modes
If the result cannot be exactly represented in a given format (e.g., decimal64 or decimal128),
the result has to be rounded to a representable number adjacent to the true result. The
difference between the rounded result and the true result is the so-called rounding error. The
standard defines five rounding modes that have different maximum rounding errors.
Examples of the rounding modes are given in Table 2.2.
Table 2.2: The rounding directions of different rounding modes
inputs    RTP    RTN    RTZ    RTE    RTA
 5.4       6      5      5      5      5
 5.5       6      5      5      6      6
 6.5       7      6      6      6      7
 5.6       6      5      5      6      6
-5.4      -5     -6     -5     -5     -5
-5.5      -5     -6     -5     -6     -6
-6.5      -6     -7     -6     -6     -7
-5.6      -5     -6     -5     -6     -6
Therefore, the maximum rounding errors of the "roundTowardPositive (RTP)",
"roundTowardNegative (RTN)", and "roundTowardZero (RTZ)" modes are less than one ulp
(unit in the last place), and the maximum rounding errors of the "roundTiesToEven (RTE)"
and "roundTiesToAway (RTA)" modes are at most half an ulp.
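The rows of Table 2.2 can be reproduced in software. The sketch below assumes Python's standard `decimal` module; the mapping from the five IEEE mode names to the module's rounding constants is an assumption of this example, and the helper names are hypothetical.

```python
from decimal import (Decimal, ROUND_CEILING, ROUND_DOWN, ROUND_FLOOR,
                     ROUND_HALF_EVEN, ROUND_HALF_UP)

# Assumed correspondence between the IEEE 754-2008 modes and decimal constants.
MODES = {
    "RTP": ROUND_CEILING,    # roundTowardPositive
    "RTN": ROUND_FLOOR,      # roundTowardNegative
    "RTZ": ROUND_DOWN,       # roundTowardZero
    "RTE": ROUND_HALF_EVEN,  # roundTiesToEven
    "RTA": ROUND_HALF_UP,    # roundTiesToAway (ties go away from zero)
}
ORDER = ("RTP", "RTN", "RTZ", "RTE", "RTA")


def round_to_integer(text, mode):
    """Round the decimal literal `text` to an integer under the given mode."""
    return int(Decimal(text).quantize(Decimal("1"), rounding=MODES[mode]))


def table_row(text):
    """Return one row of Table 2.2 in the column order RTP, RTN, RTZ, RTE, RTA."""
    return [round_to_integer(text, mode) for mode in ORDER]
```

For instance, `table_row("6.5")` shows the difference between RTE and RTA on a tie: the former selects the even neighbor 6 and the latter selects 7.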
2.1.4 Flags
The standard also defines five flags to provide diagnostic information for exceptions. The
Invalid Operation flag is raised when the result is not usefully definable or any operand of
the operation is invalid. In this case, a NaN is given as the result. Examples are the square
root of a number less than zero, or $(\infty \div \infty)$. The Division by Zero flag is raised
when a finite nonzero number is divided by zero, and an infinity with the appropriate sign is
given as the result. The Overflow flag is set if the magnitude of the result is larger than the
maximum representable magnitude. In this case, the result is rounded first and the signal is
set according to the rounding direction. On the other hand, if the magnitude of a nonzero
result is less than the minimum representable magnitude, the Underflow flag is set. In a given
format, the precision is always fixed. However, in some cases (e.g., multiplication, division,
etc.), the exact result cannot be completely represented in the given precision. Therefore,
the result is rounded to a representable number, and the Inexact flag is set. The details of
the exception handling and flags are given in section 7 in [17].
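The five flags can be observed with a software context configured like decimal32. This is a minimal sketch, assuming Python's standard `decimal` module with p = 7, Emax = 96, and Emin = -95; the helper `run` is hypothetical, and traps are disabled so that flags are recorded instead of exceptions being raised.

```python
from decimal import (Context, Decimal, DivisionByZero, Inexact,
                     InvalidOperation, Overflow, Underflow)

FIVE_FLAGS = (InvalidOperation, DivisionByZero, Overflow, Underflow, Inexact)


def run(op):
    """Run op in a fresh decimal32-like context; return (result, raised flag names)."""
    ctx = Context(prec=7, Emax=96, Emin=-95, traps=[])
    result = op(ctx)
    raised = {flag.__name__ for flag in FIVE_FLAGS if ctx.flags[flag]}
    return result, raised


invalid = run(lambda c: c.sqrt(Decimal(-1)))
div_zero = run(lambda c: c.divide(Decimal(1), Decimal(0)))
overflow = run(lambda c: c.multiply(Decimal("9.999999E96"), Decimal(10)))
underflow = run(lambda c: c.divide(Decimal("1E-95"), Decimal("1E10")))
inexact = run(lambda c: c.divide(Decimal(1), Decimal(3)))
```

Note that a single operation may raise more than one flag; an overflowed result, for example, is also inexact.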
Chapter 3
Fused Multiply-Add
A high performance decimal floating-point fused multiply-add (DFMA) is one of the targets
of this research. Reviewing the architectures of the existing binary floating-point fused
multiply-add designs helps to understand why such a function is useful and to
obtain inspiration for designing a DFMA. The basics of the fused multiply-add function
are therefore introduced first in section 3.1. Subsequently, several typical previous binary
floating-point designs are analyzed and summarized in section 3.2. Finally, the applications
of such a function in binary designs are discussed in section 3.3.
3.1 Basics of Fused Multiply-Add
The Fused Multiply-Add (FMA), which is also known as Multiply-Add Fused (MAF) and
Multiply-Accumulator (MAC), was first proposed by IBM in the floating-point processor of
its RISC System/6000 (RS/6000) in 1990 [26]. The key feature of the floating-point FMA is
a merged floating-point addition and multiplication that minimizes the rounding error of
chained binary floating-point multiplications and additions (i.e., $(A \times B) + C$), and also
reduces the hardware area and on-chip busing of binary floating-point processors.
Individual addition and multiplication can easily be implemented on the FMA by
setting B to one (i.e., $A + C = (A \times 1) + C$) and setting C to zero (i.e.,
$A \times B = (A \times B) + 0$). Moreover, many computations, such as division, square root,
reciprocal, and many transcendental functions, can be calculated iteratively on the FMA
architecture with Newton's method or other similar methods [27]. Thus, these functions can be
implemented with very little or even no extra area cost. Because of the benefits in accuracy,
latency, and hardware cost, several high performance commercial processors implement only the
FMA architecture in the floating-point unit [28, 29].
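Both properties described above, the reuse of the FMA for addition and multiplication and the reduced rounding error, can be illustrated in software. The sketch below assumes Python's standard `decimal` module with a deliberately small precision of 3 digits; the helper names and operand values are illustrative only.

```python
from decimal import Context, Decimal

ctx = Context(prec=3)  # tiny precision so that the rounding effect is visible


def fma(a, b, c):
    """Compute a*b + c with a single rounding at the end."""
    return ctx.fma(a, b, c)


def add_via_fma(a, c):
    """A + C implemented as (A x 1) + C."""
    return fma(a, Decimal(1), c)


def mul_via_fma(a, b):
    """A x B implemented as (A x B) + 0."""
    return fma(a, b, Decimal(0))


a, b, c = Decimal("1.01"), Decimal("1.01"), Decimal("-1.02")

# Fused: the exact product 1.0201 is kept until the final rounding.
fused = fma(a, b, c)
# Chained: the product is first rounded to 1.02, so the difference is lost.
chained = ctx.add(ctx.multiply(a, b), c)
```

Here the fused result is 0.0001, while the chained multiply and add returns zero because the intermediate product was rounded before the addition.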
The basic data flow of binary floating-point FMA is shown in Fig. 3.1. The product of
the significands of A and B is created by a fixed-point multiplier array. Meanwhile, the
shifting amount to align the product and the addend C is calculated in parallel. The following
operations are similar to those of a floating-point addition. The product and addend are
first swapped and shifted. An addition or subtraction is then performed according to the
effective operation. Subsequently, the number of leading zeros in the result is detected, and
the result is shifted in order to normalize it. Finally, a rounded result is obtained
by adding the possible increment to the normalized result.
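The stages of this data flow can be mirrored in a small behavioral model. The following is a sketch only, written for decimal operands given as (sign, coefficient, exponent) triples and rounding with roundTiesToEven; it uses unbounded Python integers instead of fixed-width hardware, ignores special operands, exponent limits, and the sign rules for zero, and the names `toy_fma` and `to_decimal` are hypothetical.

```python
from decimal import Context, Decimal


def toy_fma(a, b, c, p=7):
    """Behavioral model of (A x B) + C on (sign, coefficient, exponent) triples."""
    (sa, ca, qa), (sb, cb, qb), (sc, cc, qc) = a, b, c
    # Multiplication: exact double-width product of the significands.
    sp, cp, qp = sa ^ sb, ca * cb, qa + qb
    # Alignment: express product and addend with the same (smaller) exponent.
    q = min(qp, qc)
    xp = cp * 10 ** (qp - q)
    xc = cc * 10 ** (qc - q)
    # Addition or subtraction, depending on the effective operation.
    total = (-xp if sp else xp) + (-xc if sc else xc)
    sign, mag = (1, -total) if total < 0 else (0, total)
    # Normalization: determine how many low-order digits exceed the precision.
    drop = max(len(str(mag)) - p, 0)
    # Rounding: roundTiesToEven on the dropped digits.
    if drop:
        mag, rem = divmod(mag, 10 ** drop)
        half = 5 * 10 ** (drop - 1)
        if rem > half or (rem == half and mag % 2 == 1):
            mag += 1
        if mag == 10 ** p:  # the increment produced a carry-out
            mag //= 10
            drop += 1
        q += drop
    return sign, mag, q


def to_decimal(triple):
    """Convert a (sign, coefficient, exponent) triple to a Decimal for checking."""
    s, m, q = triple
    return Decimal((s, tuple(int(d) for d in str(m)), q))
```

The result of such a model can be compared with a reference such as `Context(prec=p).fma(...)`, which also rounds only once.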
Because of the advantages of binary floating-point FMA, this architecture has gained attention
over the last 20 years. P. W. Markstein proved that the accuracy improvement provided by
FMA is the prerequisite for obtaining correctly rounded results for division and square
root with the Newton-Raphson approach and for elementary function evaluation [27]. Thus, the
characteristics of FMA make it possible to remove the hardwired division and square root
unit from the chip. In [30], R. M. Jessani et al. discussed the effect on area and performance
of a floating-point FMA with a reduced multiplier array, namely dual-pass. Such an FMA with
a halved multiplier reduces the chip size by 40%, but unfortunately also decreases the
performance. P. M. Seidel proposed a multiple-path algorithm for FMA by following the basic
ideas used in the dual-path adder [31], in which the two shifters for alignment and
normalization are placed in different data paths. By reallocating the normalizer and rounder,
T. Lang and J. D. Bruguera reduced the delay of the FMA in [32, 33]; their further improved
architecture with two paths makes floating-point addition faster than floating-point
multiplication and FMA [34]. E. Quinnell et al. presented a bridge architecture that reuses
the components of the existing floating-point adder (FADD) and floating-point multiplier
(FMUL) on chip [35, 36]. The single instruction multiple data (SIMD) feature of FMA was
also discussed and implemented by L. Huang et al. in [37, 38]. So far, many commercial
general purpose processors support this instruction in hardware, such as the IBM PowerPC,
the HP PA-8000, and the HP/Intel Itanium [39]. Since the FMA function is included in the
IEEE standard 754-2008 as a primitive operator, more processors are expected to implement
this instruction in the future.
[Figure 3.1: Basic FMA architecture. Inputs A, B, and C; stages: multiplication, alignment shifting calculation, operands swapping, alignment, addition, leading zero detection, normalization, and rounding; output: Result.]
3.2 FMA Designs of Binary Floating-point
After the first FMA was introduced in 1990 [26], several novel binary designs were
proposed to speed up this function because of its superiority in floating-point processing.
Typical designs are reviewed in this section to summarize the ideas in hardware design.
3.2.1 Original Binary Floating-point FMA Architecture
The first FMA arithmetic unit, which was implemented in 1990 [24], is shown in Fig. 3.2.
As described before, this architecture increases the accuracy because it needs
one less rounding operation. The multiplication inside the FMA first produces a double-length
product (i.e., $A \times B$). To add the double-width product to the third operand C,
which has single width, the alignment range has to be enlarged, which implies a wider data
path or a larger delay. This problem is shown in Fig. 3.3. However, the wider data
path, which includes the shifter, adder, and normalizer, keeps all the necessary information
for the final rounding operation. Therefore, the accuracy is improved compared to separate
FMUL and FADD operations. To shorten the total latency, the alignment operation, which shifts
the significands of the product and addend so that the addition is performed on operands with
the same exponent, can be placed in parallel with the multiplication path. Since the delay of
the alignment is normally less than that of the multiplier tree, the latency of the FMA is
reduced by hiding the delay of the alignment shifting. Moreover, leading zero anticipation
(LZA) can be applied to hide part of the delay of counting the leading zeros before
normalization. Finally, only one rounding operation is performed for one fused multiplication
and addition.
3.2.2 Multiple-path Binary Floating-point FMA Architecture
In 2004, P. M. Seidel presented a theoretical analysis in [31], in which a multiple-path
algorithm similar to that of his dual-path adder was applied. Since the alignment just
shifts the two operands (i.e., $(A \times B)$ and C) to eliminate the difference between their
exponents, the algorithm is organized around this exponent difference. In the algorithm,
only part of the data paths is enabled during the calculation, and each sub-path is simpler
than a combined path that covers all cases. Thus, the delay and power could be reduced at
the cost of a larger area.
[Figure 3.2: Binary FMA architecture of [26]. Recoverable block labels: register file, operands A, B, and C, MUXes, alignment shifter, latches, adder, leading-zero anticipator, hex normalize, binary normalize shifter, and IEEE round.]
[Figure 3.3: Shifting range of alignment. Labels: $A \times B$, alignment range of C, and bit width of $A \times B$.]
The cases are split based on the difference among the exponents, which is denoted by $\delta$
(i.e., $\delta = (E_A + E_B) - E_C + bias$, where $E_A$, $E_B$, and $E_C$ are the exponents of
operands A, B, and C).
1. $\delta \le -54$, where the exponent of the product is too small to affect the addend in the
intermediate result, and the product can only form the sticky bit that generates the increment
bit in some rounding modes. In this case, the adder can be disabled.
2. $-53 \le \delta \le -3$, where part of the product is added to the low-weight bits of
the addend, and the exceeding bits of the product are added to or subtracted from all-zero
trailing bits, so this part of the operation can be simplified to a negation (for subtraction).
Only the high 53 bits need to be handled in the adder.
3. $-2 \le \delta \le 1$, where cancellation may generate leading zeros in the result of a
subtraction, so a relatively large normalization shift is applied after the adder. However,
the large alignment shifter is not needed in this case.
4. $2 \le \delta \le 52$, where the low-weight bits of the product and the addend need to be
added together. However, only the high 53 bits form the intermediate result, and the exceeding
part only affects the rounding. The addition can be simplified to determining only the
increment for rounding.
5. $53 \le \delta$, where the intermediate result is formed entirely by the product, and the
addend only forms the sticky bit. Thus the adder can be disabled.
In [35], the authors proposed a three-path design in which the Far path is further divided into
two paths (i.e., Adder Far path and Product Far path). This design is not discussed further
since it is theoretically similar to the dual-path design proposed in [34, 40].
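The five cases above amount to a selection on the exponent difference. The following sketch only encodes the case boundaries listed in the text for a 53-bit significand; the function name `select_case` and the short case descriptions are illustrative, and the hardware behavior inside each path is not modeled.

```python
# Short summaries of the five cases; the wording is a paraphrase for illustration.
CASE_DESCRIPTIONS = {
    1: "product only contributes a sticky bit; adder disabled",
    2: "product overlaps low-weight bits of the addend; high 53 bits added",
    3: "possible massive cancellation; large normalization shift",
    4: "addend overlaps low-weight bits of the product; increment only",
    5: "addend only contributes a sticky bit; adder disabled",
}


def select_case(delta):
    """Map the exponent difference delta to one of the five cases."""
    if delta <= -54:
        return 1
    if delta <= -3:
        return 2
    if delta <= 1:
        return 3
    if delta <= 52:
        return 4
    return 5


# Example: an exponent difference of 0 falls into the cancellation case.
example_case = select_case(0)
```

Because the ranges are contiguous and disjoint, exactly one sub-path is enabled for any value of the exponent difference.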
3.2.3 Binary Floating-point FMA with Reduced Latency
In [32, 33], the authors proposed an improved architecture that combines the addition and the
rounding. In this architecture, the final adder with carry propagation and the rounding
unit operate in parallel by using a compound adder, which simultaneously calculates sum
and sum + 1. The correct result is then selected by the increment bit generated by the
rounding logic. In this architecture, the final addition is placed at the end of the dataflow.
Therefore, the normalization has to be performed before the addition, and the LZA cannot
be placed in parallel with the adder. Consequently, a new normalization scheme is proposed for
shifting the two operands that are obtained by adding three operands (i.e., the sum and carry
of the multiplier and the addend) with a carry save adder. Moreover, the authors also designed
a modified LZA structure to fit the proposed normalization and avoid increasing the delay.
The brief architecture is shown in Fig. 3.4.
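The two building blocks named above, the carry save adder and the compound adder, can be modeled at the word level. The sketch below is a behavioral illustration on unbounded Python integers, not the circuit of [32, 33]; the function names and operand values are hypothetical, and the normalization step between the two blocks is omitted.

```python
def carry_save_add(a, b, c):
    """3:2 compression: return (s, carry) with a + b + c == s + carry.

    Each bit position is handled independently, so no carry propagates.
    """
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry


def compound_add(x, y):
    """Return both x + y and x + y + 1, as a compound adder does in parallel."""
    total = x + y
    return total, total + 1


def add_and_round(mult_sum, mult_carry, addend, increment):
    """Reduce three operands to two, then select sum or sum + 1 for rounding."""
    s, carry = carry_save_add(mult_sum, mult_carry, addend)
    total, total_plus_one = compound_add(s, carry)
    return total_plus_one if increment else total


# Illustrative operands: 5 + 3 + 6 = 14, or 15 when the increment is selected.
rounded_down = add_and_round(5, 3, 6, increment=False)
rounded_up = add_and_round(5, 3, 6, increment=True)
```

Since both candidates are available at the same time, the rounding decision only controls a final selection instead of triggering a second carry propagation.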
In all the designs referenced previously, FADD and FMUL take the same latency as
the FMA unit does. Thus, these designs increase the latency of the individual FADD and FMUL
instructions compared to implementing them in a dedicated floating-point adder and
floating-point multiplier. In [34], the authors discussed a method to shorten the latency of
FADD in an FMA unit by bypassing the multiplier (i.e., a recoder and a CSA tree) when running
the addition instruction individually. To do so, the alignment is no longer performed in
parallel with the multiplier as in previous designs, but is delayed. Furthermore, to avoid
having the two shifters for alignment and normalization on the critical path, a dual-path
design for the addition part separates the alignment and normalization shifters into two
paths. Finally, the compound adder, which calculates sum and sum + 1 simultaneously, lets the
rounder operate in parallel with the significand adder, as discussed in [32, 33].
[Figure 3.4: Binary FMA architecture of [34]. Recoverable block labels: multiplier tree (A, B), alignment (C), bit-invert, carry save adder, LZA, sign detection, normalization shifter, HAs and part of adder, rounding unit, dual adder, MUX, and Result.]
3.2.4 Combined Decimal and Binary Floating-point FMA
Since several designs of decimal FADD and decimal FMUL have been proposed and implemented
in the last several years, a decimal FMA can borrow ideas from those state-of-the-art
decimal designs. P. K. Monsson proposed an FMA with encodings for combined
decimal and binary processing [41]. The author chose the non-speculative multi-operand adder
proposed in [42] to reduce the partial product array in the multiplier. The
data path is divided into a decimal path and a binary path in the partial product generation.
These two paths share the reduction tree, since its size is relatively large compared to the
entire architecture. To support combined decimal and binary processing, the data path has
to be constructed in a conventional structure, which leaves less room for optimization. The
traditional encodings in the reduction tree cause an extra delay in the decimal correction
unit. Furthermore, the optimization method in [34] cannot be applied to implement an elaborate
rounding and addition unit. The brief architecture of the combined decimal and binary FMA is
shown in Fig. 3.5. The units for decimal or binary computations are marked by the letter "D"
or "B" at the bottom right.
The synthesis results of the dual-radix design show that the design costs too many
transistors compared to the individual designs. The area of the combined FMA is about 282%
of that of the FMUL in [22] and 1240% of that of the FADD in [20]. Furthermore,
the area of the combined FMA is about 1267% of that of the binary
FMA in [43]. These results violate one of the motivations of the FMA, which is to implement
FMUL and FADD with a small total area.
3.3 Applications of Binary FMA
In the previous sections, only how the FMA can improve the performance and accuracy of
consecutive individual FADDs and FMULs was introduced. In fact, other functions (e.g.,
floating-point division (FDIV), floating-point square root (FSQRT), etc.) can be implemented
based on the FMA unit. This is also one of the benefits of the RISC processor, in which only
key arithmetic units are implemented in hardware and other functions are implemented in
software with the support of a library and assisting hardware. Additionally, several
papers have demonstrated that the binary FMA is a requisite component for obtaining correct
results for some functions.
[Figure 3.5: Combined Decimal and Binary FMA architecture. Recoverable block labels: decimal inputs DECA, DECB, and DECC with DPD2BCD converters; binary inputs BINA, BINB, and BINC; partial product generation (decimal and binary); MUXes; multiplier tree; alignment; carry save adder; addition on digit and decimal correction; carry propagate adder; leading zero anticipation; normalization; rounding; Result. Units are marked "D" or "B".]
P. W. Markstein in [27] described how to use the Newton-Raphson (NR) method to perform
division and square root with the FMA instruction and analyzed the correctness of this method.
Previously, without the FMA instruction, using the NR approach for division required a
special corrective action at the end of the iterations to get the last bit rounded correctly.
If the corrective action is not applied, the result can be rounded incorrectly [44].
Additionally, in the computation of elementary functions, the FMA instruction is also very
useful for improving the accuracy of the argument reduction, which is important for the
accuracy of the elementary function evaluation.
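An NR division built only from FMA operations can be sketched as follows. This is an illustration of the general technique, not the exact algorithm analyzed in [27]: it assumes Python's standard `decimal` module as the FMA provider, takes the initial reciprocal estimate `seed` as an argument instead of reading it from a table, and makes no claim of correct rounding for all operands. The name `nr_divide` and its parameters are hypothetical.

```python
from decimal import Context, Decimal


def nr_divide(a, b, seed, prec=16, iterations=5):
    """Approximate a / b using Newton-Raphson steps expressed as FMAs.

    seed is a rough estimate of 1/b; each iteration roughly doubles the
    number of correct digits of the reciprocal estimate y.
    """
    ctx = Context(prec=prec)
    one = Decimal(1)
    neg_b = b.copy_negate()
    y = seed
    for _ in range(iterations):
        e = ctx.fma(neg_b, y, one)  # e = 1 - b*y, error of the estimate
        y = ctx.fma(y, e, y)        # y = y + y*e, refined reciprocal
    q = ctx.multiply(a, y)          # initial quotient
    r = ctx.fma(neg_b, q, a)        # r = a - b*q, remainder with one rounding
    return ctx.fma(r, y, q)         # corrected quotient q + r*y


# Illustrative call: 1 / 3 with a one-digit seed of 0.3.
quotient = nr_divide(Decimal(1), Decimal(3), Decimal("0.3"))
```

The remainder step relies on the FMA: since $b \times q$ is not rounded before the subtraction, the small residual $a - b \times q$ is obtained accurately, which is what the final correction needs.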
F. G. Gustavson et al. in [45] proposed an algorithm to correctly calculate the
four basic operations (i.e., ADD, SUB, MUL, and DIV) with good performance. The
conversion error (representation error), which is the difference between the binary
approximated number and the decimal floating-point number, is propagated during the
calculation. The authors proposed a method to process the result in two parts (i.e., extended
precision), and showed that the FMA instruction is the key to performing exact floating-point
multiplication and division.
To obtain a higher frequency, more pipeline stages can be applied. However, this
increases the latency of the operation. In the NR algorithm, the refinements of the guesses of
the quotient and of the reciprocal of the divisor in one iteration can be interleaved inside
the multiply-add architecture, so that the different pipeline stages of the multiplier are
used in every iteration. However, since the multiplications of two successive iterations
cannot be run independently, the utilization ratio of the multiplier decreases as the pipeline
depth increases.
R. C. Agarwal et al. proposed a method based on power series approximation on the IBM
POWER3 [46]. In this method, the authors used a table to first create an approximation of
the reciprocal of the divisor, and applied the power series of the error to refine the
approximated reciprocal. Since the error can be calculated, and the initial approximation is
obtained depending on the range of the divisor, the refining formula can be rearranged into a
series of multiply-add operations on known variables. Hence, the dependence between
the multiply-add iterations is decreased. The results also showed that the performance is
much better in a more deeply pipelined architecture.
The key benefits obtained from implementing such high-level functions on the FMA
are less area and a simpler hardware design, which also means a faster speed. This is also
why the FMA alone is included in the floating-point unit of RISC processors. The algorithms for
implementing high-level functions in software with the assistance of the FMA instruction may
also be implemented in hardware. Such a strategy balances the complexities of the hardware
design (i.e., area consumption) and the compiler design. The computing latency and memory
accesses may also be minimized. Moreover, the contention for the exclusive use of the dedicated
unit can be alleviated by multi-core and multi-issue architectures, which are a popular trend
in RISC processors.
Chapter 4
Number Systems
For designing a high performance DFMA architecture, unconventional number systems are
considered a promising technique to improve the performance and area efficiency.
Therefore, in this chapter, the basics of number systems are introduced as a preliminary study
before going to the decimal designs. The binary, decimal, and redundant number systems
are briefly introduced in sections 4.1, 4.2, and 4.3, respectively.
4.1 Binary Number System
A number system comprises the methods to represent numbers and the rules to perform
arithmetic operations on the numbers. Nowadays, the central processing units inside computers
are predominantly built from bistable devices, which have two stable states (e.g., high-level
voltage and low-level voltage). Binary numbers and arithmetic are therefore widely
used in today's computer systems.
In the binary number system, which has 2 as the radix, two elements (i.e., "0" and "1")
are used to represent numbers. For example, the decimal number 5 can be represented as
the binary number 101 or 0101.
\[
(101)_2 = 1 \times 2^2 + 0 \times 2^1 + 1 \times 2^0 = 4 + 1 = (5)_{10} \tag{4.1}
\]
In a conventional number system, the radix determines the elements that can be used to
represent numbers in the system. Since every number that is larger than or equal
to the radix generates a carry to the higher position, the available elements of a
conventional number system with a given radix are $[0, r - 1]$, where r is the radix. The set
of the elements is called the digit set. In the binary system, the digit set is therefore
[0, 1]. Note that the unit of the element has a specific name in different number systems. For
example, in the binary system, "bit" is used, and in the decimal system, "digit" is usually
used. So for equation (4.1), we say "a 1-digit decimal number, 5, is represented as a 3-bit
binary number, 101".
Another very important attribute of a number system is the encoding, which means how the
numbers in a given number system are represented with the basic elements. For example, how
is the negative decimal number "-8" represented with binary bits? To answer this question,
let us first look at the basic methods to represent signed numbers, which can be positive or
negative.
Signed magnitude representation employs one sign bit, which is usually the most significant
bit, to indicate the sign of a number (e.g., "0" means a positive number, and "1" means
a negative number). The rest of the bits are used for the magnitude, or absolute
value, of the number. For example, the decimal number "-8" can be represented as "11000".
Complement representation in the binary system includes one's complement and two's
complement. A positive number in one's complement has a "0" at the most significant bit,
and the rest of the bits are exactly the same as those of the unsigned representation. A
negative number in one's complement is represented by inverting each bit of the corresponding
positive number. For example, the decimal number "8" in one's complement is "01000",
and "-8" in one's complement is "10111". Alternatively, the negation of a number in one's
complement can be done by subtracting the number from $2^n - 1$, where n is the bit width of
the number. For example, "-8" can be obtained from "8" in one's complement by:
\[
(2^5 - 1) - \text{``01000''} = \text{``11111''} - \text{``01000''} = \text{``10111''} \tag{4.2}
\]
On the other hand, two's complement represents a negative number by adding an extra 1 at
the least significant bit to the one's complement. For example, the negation of
"8" is done by inverting "01000" to "10111", and adding the extra "1" to obtain "11000".
Alternatively, the negation of a number in two's complement can be done by subtracting the
number from $2^n$, where n is the bit width of the number. For example, "-8" can be obtained
from "8" in two's complement by:
\[
2^5 - \text{``01000''} = \text{``100000''} - \text{``01000''} = \text{``11000''} \tag{4.3}
\]
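The three signed representations can be generated programmatically, which makes equations (4.2) and (4.3) easy to check. The function names below are hypothetical, and each function assumes that the value fits into the given bit width n.

```python
def signed_magnitude(value, n):
    """Sign bit followed by the (n - 1)-bit magnitude."""
    sign = "1" if value < 0 else "0"
    return sign + format(abs(value), f"0{n - 1}b")


def ones_complement(value, n):
    """Negative values are encoded as (2**n - 1) - |value|, as in equation (4.2)."""
    if value >= 0:
        return format(value, f"0{n}b")
    return format((2 ** n - 1) - abs(value), f"0{n}b")


def twos_complement(value, n):
    """Negative values are encoded as 2**n - |value|, as in equation (4.3)."""
    return format(value % 2 ** n, f"0{n}b")


# The running example of the text: -8 encoded in 5 bits.
encodings_of_minus_8 = (
    signed_magnitude(-8, 5),
    ones_complement(-8, 5),
    twos_complement(-8, 5),
)
```

Note that "-8" happens to have the same bit pattern, "11000", in 5-bit signed magnitude and in 5-bit two's complement, although the two encodings differ for most other values.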
30
Table 4.1: The signed numbers in dierent representations
Binary representation Signed Magnitude One's complement Two's complement
\000" +0 +0 0
\001" +1 +1 +1
\010" +2 +2 +2
\011" +3 +3 +3
\100" -0 -3 -4
\101" -1 -2 -3
\110" -2 -1 -2
\111" -3 -0 -1
Table 4.1 lists the signed numbers that can be represented with 3 bits in signed magnitude,
one's complement, and two's complement. In the first two representations, two zeros
(i.e., positive and negative zero) exist. Thus one code is wasted and
computation on these numbers is more complicated. However, the symmetry of these numbers
may simplify negation. On the other hand, only one zero exists in the two's complement
representation, and addition and subtraction on two's complement numbers
are straightforward. The representable numbers, however, are no longer symmetrical around zero.
Further information about the binary number system can be found in [47, 48].
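
The three representations and the two negation rules in equations (4.2) and (4.3) can be checked with a short Python sketch. The function names `decode`, `negate_ones`, and `negate_twos` are illustrative only and do not come from the thesis.

```python
def decode(bits: str) -> dict:
    """Decode a bit string under the three signed representations of Table 4.1."""
    n = len(bits)
    raw = int(bits, 2)                       # unsigned value of the bit pattern
    negative = bits[0] == "1"                # most significant bit acts as the sign
    magnitude = raw & ((1 << (n - 1)) - 1)   # remaining n-1 bits
    return {
        # Python integers have no -0, so both zeros decode to 0.
        "signed_magnitude": -magnitude if negative else magnitude,
        "ones_complement": raw - ((1 << n) - 1) if negative else raw,
        "twos_complement": raw - (1 << n) if negative else raw,
    }


def negate_ones(x: int, n: int) -> int:
    """One's complement negation: subtract from 2^n - 1 (equation 4.2)."""
    return ((1 << n) - 1) - x


def negate_twos(x: int, n: int) -> int:
    """Two's complement negation: subtract from 2^n, kept to n bits (equation 4.3)."""
    return ((1 << n) - x) % (1 << n)
```

For the 5-bit example in the text, `negate_ones(0b01000, 5)` gives `0b10111` and `negate_twos(0b01000, 5)` gives `0b11000`.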
4.2 Decimal Number System
Since decimal numbers are used by humans, binary coded decimal (BCD) numbers
are used to represent decimal numbers in the radix-10 system, or decimal number system.
The binary number system is in fact very efficient for decimal integer calculation. However,
conversion between decimal and binary numbers has to be performed at the first
and last steps. Furthermore, binary integers are inefficient for decimal shifting and
rounding. For example, (80)10 right shifted by 1 digit is (08)10, which for a binary integer
requires a division by ten rather than a simple bit shift.
In the conventional decimal number system, the digit set is [0,9], as implied by the radix 10.
To represent these ten digits, many encodings have been investigated. The BCD numbers
use binary bits to represent each decimal digit, as the name implies. The most common
encoding is the BCD-8421 system, where "8421" gives the weight of each binary bit. For example,
the decimal number "9" can be represented by 4 binary bits as:
\[ (1001)_2 = 1 \times 2^3 + 0 \times 2^2 + 0 \times 2^1 + 1 \times 2^0 = (9)_{10} \tag{4.4} \]
However, the weights of the 4 binary bits do not have to be "8421". Some unconventional
BCD encodings are available. For example, in BCD-4221, the decimal number "5" can be
represented as "1001" or "0111". Another useful encoding is Excess-3 BCD, which
adds 3 to each number in the BCD-8421 encoding. For example, the decimal number "0" is
represented as "0011", and "9" is represented as "1100".
Table 4.2: The decimal digits in different representations
Decimal digit   BCD-8421   BCD-4221            Excess-3 BCD
"0"             "0000"     "0000"              "0011"
"1"             "0001"     "0001"              "0100"
"2"             "0010"     "0010" or "0100"    "0101"
"3"             "0011"     "0011" or "0101"    "0110"
"4"             "0100"     "1000" or "0110"    "0111"
"5"             "0101"     "1001" or "0111"    "1000"
"6"             "0110"     "1010" or "1100"    "1001"
"7"             "0111"     "1011" or "1101"    "1010"
"8"             "1000"     "1110"              "1011"
"9"             "1001"     "1111"              "1100"
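
The codes in Table 4.2 follow directly from the bit weights, which the following Python sketch illustrates by enumerating every 4-bit pattern whose weighted sum equals a given digit. The helper names `encodings` and `excess3` are hypothetical.

```python
from itertools import product


def encodings(digit: int, weights: tuple) -> list:
    """Return all 4-bit codes whose weighted bit sum equals `digit`."""
    return [
        "".join(str(b) for b in bits)
        for bits in product((0, 1), repeat=4)            # "0000" up to "1111"
        if sum(b * w for b, w in zip(bits, weights)) == digit
    ]


def excess3(digit: int) -> str:
    """Excess-3 code: the BCD-8421 code of digit + 3."""
    return format(digit + 3, "04b")


BCD_8421 = (8, 4, 2, 1)
BCD_4221 = (4, 2, 2, 1)
```

With these weights, `encodings(5, BCD_4221)` returns both codes listed for "5", while every digit has exactly one BCD-8421 code.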
In Table 4.2, the decimal digits from 0 to 9 are represented in three different encodings.
Note that only ten of the sixteen 4-bit codes are used in the BCD-8421 system. The unused
codes cause two disadvantages. First, 37.5% (i.e., 6/16) of the representation space is wasted,
which means a lower encoding efficiency. Second, the carry and the sum may be generated
incorrectly when these numbers are added. The example below illustrates the problem.
\[ (3)_{10} + (9)_{10} = (0011)_{BCD8421} + (1001)_{BCD8421} = (1100)_{BCD8421} \tag{4.5} \]
In equation (4.5), the resulting code "1100" is not used in the BCD-8421 system. The correct
result (i.e., 12) can be obtained by adding 6 to the intermediate result whenever it exceeds the
representable range. For example,
\[ (1100)_{BCD8421} + (0110)_{BCD8421} = (0001\ 0010)_{BCD8421} = (12)_{10} \tag{4.6} \]
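
The +6 correction of equations (4.5) and (4.6) can be sketched for a single digit position as follows. This is a minimal Python illustration, not a hardware description, and the name `bcd_digit_add` is hypothetical.

```python
def bcd_digit_add(a: int, b: int, carry_in: int = 0) -> tuple:
    """Add two BCD-8421 digits; return (carry_out, sum_digit)."""
    s = a + b + carry_in      # plain binary addition, may land on an unused code
    if s > 9:                 # result left the representable range [0, 9]
        s += 6                # skip the six unused codes "1010" to "1111"
    return s >> 4, s & 0xF    # bit 4 is the decimal carry, low 4 bits the digit
```

For the example in the text, `bcd_digit_add(3, 9)` yields a carry of 1 and a sum digit of 2, i.e., 12.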
The first disadvantage is addressed by the unconventional BCD encodings, such as the BCD-4221
encoding mentioned above or BCD-5211. The second disadvantage can be partially solved by the
excess-3 BCD encoding. Since each excess-3 code equals the corresponding
BCD-8421 code plus 3, the sum of two excess-3 operands carries an excess of 6. Therefore the
carry is always correct in excess-3 addition. However, a correction of the sum is still necessary.
A notable advantage of the excess-3 encoding is that the nine's complement is obtained simply by
inverting each bit, like the one's complement in the binary system.
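
Both properties of the excess-3 encoding can be illustrated with a small Python sketch. The sum-correction rule used here (add 3 when a carry is produced, subtract 3 otherwise) is the usual textbook rule and is an assumption, since the text only states that a correction is needed; the function names are hypothetical.

```python
def xs3_add(a_code: int, b_code: int) -> tuple:
    """Add two excess-3 digit codes; return (carry, excess-3 code of the sum digit)."""
    raw = a_code + b_code          # binary sum holds an excess of 6
    carry = raw >> 4               # hence the binary carry is the decimal carry
    low = raw & 0xF
    # Restore an excess of exactly 3 in the sum digit (assumed textbook rule).
    return carry, (low + 3 if carry else low - 3)


def xs3_nines_complement(code: int) -> int:
    """Nine's complement of an excess-3 digit: invert all four bits."""
    return code ^ 0xF
```

For example, 3 + 9 in excess-3 is `xs3_add(6, 12)`, which returns a carry of 1 and the code 5, i.e., the digit 2.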
4.3 Redundant Number System
In the previous two sections, the radix, digit set, and encoding of the conventional number
systems were introduced. If, for a given radix, the number of elements in the digit set is larger
than the radix, then the number system is redundant. Two examples are first provided to illustrate
redundant number systems. In the first example, in Fig. 4.1, suppose that 4 binary bits
are used to represent a decimal digit, but the decimal digit set is extended to [0,11]. If
two operands are added, each digit of the position sum is in [0,22]. Consequently,
2 is the largest carry in this number system (i.e., the carry digit is in [0,2]). After extracting the
carry, the intermediate sum is in [0,9]. Finally, the result digit, which is the sum of
the incoming carry and the intermediate sum, is in [0,11]. In this number system, the
carry only propagates to the next higher digit. Thus, the long carry propagation, which
causes a large timing delay, is eliminated.
However, the digit set of the result does not have to be the same as that of the input operands.
In the second example, shown in Fig. 4.2, the digit set of the operands is extended by one digit.
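[Figure 4.1, digit i: operands in [0,11] + [0,11] give a position sum in [0,22], which is split into a carry in [0,2] sent to digit i+1 and an intermediate sum in [0,9]; adding the carry in [0,2] from digit i-1 gives a result in [0,11].]
Figure 4.1: Example of calculation with redundant number system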
After the calculation, the digit set of the result is smaller than that of the operands. As long as
the digit set of the result is not larger than that of the operands, consecutive additions without
long carry propagation can be performed.
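[Figure 4.2, digit i: operands in [0,12] + [0,12] give a position sum in [0,24], which is split into a carry in [0,2] sent to digit i+1 and an intermediate sum in [0,9]; adding the carry in [0,2] from digit i-1 gives a result in [0,11].]
Figure 4.2: Example of calculation with redundant number system: reduced digit set
in output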
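
The first example (digit set [0,11]) can be sketched in Python as below. Numbers are little-endian lists of digits, and the split of the position sum into carry and intermediate sum by division with remainder is an assumption consistent with the ranges given above; `redundant_add` and `value` are hypothetical names.

```python
def redundant_add(x: list, y: list) -> list:
    """Add two little-endian digit lists whose digits lie in [0, 11]."""
    n = max(len(x), len(y))
    x = x + [0] * (n - len(x))
    y = y + [0] * (n - len(y))
    carries, sums = [0], []
    for xi, yi in zip(x, y):
        p = xi + yi                  # position sum in [0, 22]
        carries.append(p // 10)      # carry in [0, 2], decided by this digit alone
        sums.append(p % 10)          # intermediate sum in [0, 9]
    # Each result digit absorbs the carry of its right neighbour only.
    result = [w + c for w, c in zip(sums, carries)]   # each in [0, 11]
    result.append(carries[-1])
    return result


def value(digits: list) -> int:
    """Integer value of a little-endian digit list (digits may exceed 9)."""
    return sum(d * 10 ** i for i, d in enumerate(digits))
```

Because every carry depends only on the two operand digits at its own position, all positions could be evaluated in parallel; the loop here is only for readability.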
In the binary number system, carry save addition is widely used in many applications. In
binary carry save addition, three binary bits are added together, and a 1-bit sum and a 1-bit
carry are obtained. If a digit set (e.g., [0,2]) and the corresponding encoding with 2 binary
bits (e.g., 0="00", 1="01" or "10", and 2="11") are defined, binary carry save addition
can be considered an addition on incomplete redundant operands.
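
As an illustration, binary carry save addition on whole words can be written bitwise in Python. The sum word and the carry word together form the redundant result, with each bit position holding a digit in [0,2]; the function name is hypothetical.

```python
def carry_save_add(a: int, b: int, c: int) -> tuple:
    """3:2 compression: reduce three operands to a sum word and a carry word."""
    s = a ^ b ^ c                                 # per-bit sum, no carry propagation
    carry = ((a & b) | (a & c) | (b & c)) << 1    # per-bit carry, weighted one bit higher
    return s, carry
```

The invariant is that `s + carry` equals `a + b + c`; only that final addition needs a carry-propagating adder.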
Two's complement and ten's complement numbers are efficient for performing subtraction.
In a redundant number system, however, subtraction can be performed easily with
symmetrical digit sets (i.e., digit sets that include negative digits). Consider a redundant
decimal number system with the symmetrical digit set [-6,6]. The carry propagation of the
addition is shown in Fig. 4.3. Note that if subtraction is performed on this digit set, the carry
propagation is the same as that shown in Fig. 4.3. This advantage reduces the complexity of
subtraction in the redundant number system by eliminating the complement operation. Numbers
of this kind are called signed digit numbers. However, conversions from the traditional digit
set to the redundant digit set, and back, might be needed at the first and last
operations. More information about redundant number systems can be found in [49].
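[Figure 4.3, digit i: operands in [-6,6] + [-6,6] give a position sum in [-12,12], which is split into a transfer in [-1,1] sent to digit i+1 and an intermediate sum in [-5,5]; adding the transfer in [-1,1] from digit i-1 gives a result in [-6,6].]
Figure 4.3: Example of calculation with redundant number system: signed digit set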
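
A minimal Python sketch of addition and subtraction on the digit set [-6,6] follows. The thresholds that select the transfer digit (position sum at or beyond 6 in magnitude) are an assumption chosen to reproduce the ranges in Fig. 4.3; the figure itself does not state them, and `sd_add` is a hypothetical name.

```python
def sd_add(x: list, y: list, subtract: bool = False) -> list:
    """Add or subtract little-endian signed digit lists with digits in [-6, 6]."""
    if subtract:
        y = [-d for d in y]          # negation is digit-wise, no complement step
    n = max(len(x), len(y))
    x = x + [0] * (n - len(x))
    y = y + [0] * (n - len(y))
    out, t_in = [], 0
    for xi, yi in zip(x, y):
        p = xi + yi                                   # position sum in [-12, 12]
        t = 1 if p >= 6 else -1 if p <= -6 else 0     # transfer in [-1, 1] (assumed rule)
        w = p - 10 * t                                # intermediate sum in [-5, 5]
        out.append(w + t_in)                          # result digit in [-6, 6]
        t_in = t
    out.append(t_in)
    return out


def sd_value(digits: list) -> int:
    """Integer value of a little-endian signed digit list."""
    return sum(d * 10 ** i for i, d in enumerate(digits))
```

For instance, `sd_add([2, 8], [5, 4], subtract=True)` computes 82 - 45 and returns digits whose value is 37, using the same data flow as addition.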
Once the specification is determined, the factors that need to be considered in a number system,
and what each of them affects, are illustrated in Fig. 4.4. In the previous sections,
the digit set and the corresponding encoding were discussed. However, an implicit factor
related to the digit set and encoding has not yet been introduced. Once a digit set is applied to
represent numbers, or an encoding is applied to represent the digit set, the number of binary bits
used directly determines the size of the memory and the width of the data path, that is, the
representation efficiency. Furthermore, how the basic operations (e.g., addition
and subtraction) are performed on a given digit set and encoding directly determines the
complexity of the hardware design. For example, a binary full adder, which includes two XOR
gates and one carry circuit, is necessary for 1-bit carry save addition. Regarding the digit set,
the main contribution is eliminating carry propagation. The encoding, however, is very important
in determining the arithmetic rules and, in turn, the complexity of the hardware
design. These two problems will be examined in later chapters.
[Figure 4.4 relates the specification and the numbers (radix) to the digit sets, encodings, and arithmetic rules of the number system, which determine the representation efficiency and the operation efficiency and, in turn, the hardware cost and processing speed.]
Figure 4.4: Consideration of the number system
Part III
Designs
Chapter 5
Previous Designs
Prior to presenting the proposed decimal designs, the existing decimal fixed-point and
floating-point designs related to ours are introduced as a literature review. This
chapter describes the leading edge of decimal designs, which serve as the basis of, and the
reference for comparison with, our research. Since the research is divided into three parts
(i.e., addition, multiplication, and fused multiply-add), the following three sections describe
decimal fixed-point addition, multiplication, and floating-point FMA, respectively.
5.1 Decimal Fixed-point Addition
In decimal floating-point arithmetic, operations on the significand, which is encoded in the
densely packed decimal (DPD) format (expressing 1000 decimal numbers in 10 binary bits instead
of the 12 binary bits of BCD encoding), can be implemented with the same techniques as in
decimal fixed-point arithmetic. Among the fixed-point decimal operations, addition is the
most important one since it is the basis of all other operations. For instance, in sequential
multiplication and digit-recurrence division, the partial products of each iteration are
accumulated by the adder. Moreover, in parallel multiplication and functional division,
partial products are reduced by an adder tree. In [58], the authors show that the logic
related to the significand computation units accounts for 51% of the timing delay and 41% of the
area of a decimal floating-point adder. Since an improvement in addition benefits many other
decimal operations, many methods and algorithms have been applied to boost the performance of
the decimal adder.
Traditionally, a decimal digit $z_i$, where $z_i \in \{0, 1, \ldots, 8, 9\}$, is represented in the 4-bit
binary coded decimal (BCD) encoding. Alternatively, the signed digit set
$\{-\alpha, -(\alpha - 1), \ldots, \alpha - 1, \alpha\}$ is applied to represent the decimal numbers.
If $2\alpha + 1 > r$, where $r$ is the radix, the number system is redundant. In a redundant number
system, a number can have more than one representation, which allows carry-free addition. The
conventional SD carry-free addition/subtraction algorithm is given as follows [49]:
Algorithm 5.1: One Digit Conventional Carry Free Addition
Data: SD operands $X_i$, $Y_i$, transfer digit $T_i$ and operation op.
Result: SD result $S_i$ and transfer digit $T_{i+1}$.
1. Compute $P_i = X_i \; op \; Y_i$
2. Divide $P_i$ into $T_{i+1}$ and $W_i = P_i - r \times T_{i+1}$
3. Compute $S_i = W_i + T_i$
where $P_i$ is the position sum, $T_{i+1}$ is the transfer digit, $W_i$ is the temporary sum and $r$ is
the radix.
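
The three steps of Algorithm 5.1 can be sketched for one digit position as follows. The algorithm does not fix how the position sum is divided in step 2; the threshold at ±alpha used here is one workable choice and is an assumption, as are the parameter names `alpha` and `r` and the function name.

```python
def carry_free_digit(x: int, y: int, t_in: int, op: str = "+",
                     alpha: int = 9, r: int = 10) -> tuple:
    """One digit of conventional carry-free SD addition; returns (S_i, T_{i+1})."""
    # The simple threshold rule below needs this much redundancy (assumption).
    assert r + 1 <= 2 * alpha <= 2 * (r - 1)
    p = x + y if op == "+" else x - y                        # step 1: position sum
    t_out = 1 if p >= alpha else -1 if p <= -alpha else 0    # step 2: transfer digit
    w = p - r * t_out                                        #         temporary sum
    s = w + t_in                                             # step 3: absorb incoming transfer
    return s, t_out
```

Note that `t_out` is computed from `x` and `y` only, so it never depends on `t_in`; this is what removes the carry chain between digit positions.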
The signed digit (SD) number system can be applied to eliminate the carry chain through
carry-free addition. Since the first published decimal signed digit adder, designed by A.
Svoboda in 1969 [50], several designs have been presented in the last decade.
B. Shirazi et al. in [59] proposed a redundant binary coded decimal (RBCD) adder
with the digit set [-7, 7]. Since BCD encoding uses 4 binary bits only to represent 0 to 9 as
an unsigned number, six codes are wasted. In RBCD, the authors treat the
4 binary bits as a signed number, so that [-8, 7] can be represented within the 4 binary
bits. However, to perform subtraction on a symmetrical digit set, "-8" is not used in
RBCD addition. To perform the addition, a 4-bit binary adder with carry propagation
is applied first to obtain the intermediate sum (i.e., a + b) of the two operands. Afterwards,
a combinational logic unit detects whether the intermediate sum is out of range (i.e., outside
[-6, 6]) and decides the transfer digit, or carry, to the next digit. Since the intermediate
sum has to be corrected by considering the carry out to the next digit and the carry in to
the current digit, another combinational logic unit creates a correction signal
based on the carry in and carry out. Finally, the redundant result in [-7, 7] is obtained by
adding the correction signal to the intermediate sum in a 4-bit binary carry propagation
adder. All the operations mentioned above are performed in series. Note that conversions have to
be performed before and after the RBCD addition, since the BCD digits "8" and "9" are not in
the range of the RBCD encoding.
H. Nikmehr et al. presented decimal signed digit (DSD) adders with the digit set [-9, 9] in
[51]. This DSD adder applies a speculative method, which creates all possible results and selects
the correct one based on the carry-in and carry-out signals. For instance, in [51],
the intermediate sum p is first created by adding the operands a and b. Subsequently,
p + 9, p + 10, p - 1, p - 10, and p - 11 are calculated at the same time. Combinational
logic then decides the transfer digit to the next digit. Finally, the correct result is
selected from the pre-calculated results by the carry signals. Note that two 4-bit binary
numbers are used to represent a decimal signed digit. The extra bits needed to represent the
numbers and the pre-calculated results both reduce area efficiency.
J. Moskal et al. provided a non-speculative method to perform decimal signed digit
addition with the digit set [-9, 9] in [56]. The digit set and number representation are similar
to those of the design in [51]. However, since a non-speculative method is used, the final result
is not selected from pre-calculated results but corrected from the intermediate result. The
two operands are first compressed (added) to obtain the intermediate sum. After that, the
correction signal is created by considering the sign signal, which is obtained by a bit-wise
carry propagation network. Finally, a signed digit binary adder array is applied to correct
the intermediate sum. However, due to the complicated representation, some combinational
logic (i.e., multiplexers, inverters, and reduction logic) has to be used inside the adder.
In [52], A. Kaivani provided a fully redundant decimal addition based on the stored unibit
transfer (SUT) encoding with the digit set [-8, 9]. In the SUT encoding, an extra binary bit, the
so-called unibit, is applied to represent "-1" and "1". Thus, together with a 4-bit signed binary
number, the digit set is enlarged to [-8, 9]. Note that the 4-bit signed binary number has the
opposite sign on the weight of each bit (i.e., "8 -4 -2 -1") compared to a traditional signed
binary number. Therefore, without the unibit, the signed binary number is in [-7, 8]. To perform
the addition, the unibits of the two operands are first extended to a 4-bit signed binary number.
A binary carry save adder is then applied to add up the three 4-bit operands. Subsequently, to
compress the result, a 3-bit binary carry propagation has to be used after the carry save adder.
Finally, the result is divided into two parts to be corrected at the last stage. The unibit is
therefore "stored" in the result.
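
The digit range of the SUT encoding described above can be verified by enumeration. In this Python sketch the polarity of the unibit (1 meaning +1, 0 meaning -1) is an assumption, since the text only says that the unibit represents "-1" and "1"; the function name `sut_value` is hypothetical.

```python
def sut_value(bits4: int, unibit: int) -> int:
    """Value of one SUT digit: 4 bits weighted (8, -4, -2, -1) plus a unibit."""
    b3, b2, b1, b0 = ((bits4 >> k) & 1 for k in (3, 2, 1, 0))
    main = 8 * b3 - 4 * b2 - 2 * b1 - b0     # range [-7, 8] without the unibit
    return main + (1 if unibit else -1)      # assumed polarity: 1 -> +1, 0 -> -1


# All values reachable by the 32 (bits4, unibit) combinations.
reachable = sorted({sut_value(b, u) for b in range(16) for u in (0, 1)})
```

The enumeration confirms that the 4-bit part alone covers [-7, 8] and that adding the unibit extends the digit set to [-8, 9].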
The architectures mentioned above fall into two branches. The speculative architecture
is at a disadvantage in hardware area efficiency. Additionally, if the encoding is
complicated, the timing delay to compute all the necessary pre-calculated results could be
large. On the other hand, the non-speculative architecture generally has a smaller hardware
area. However, the selection of the encoding is also very important in this architecture.
Let us review the conventional signed digit addition/subtraction algorithm. Since the
transfer digit $T_{i+1}$ to the next stage is independent of $T_i$ from the previous stage, the carry
chain is eliminated, and the delay is no longer related to the digit width of the input. However,
this algorithm could still be improved in the following two aspects:
Deciding $T_{i+1}$ without $P_i$
Once $P_i$ is obtained, the transfer digit $T_{i+1}$ is generated according to the range of the position
sum. Hence, the carry chain for calculating $P_i$ limits the performance of the adder. In this
thesis, an algorithm that decides the transfer digit directly from the ranges of the operands, $X_i$
and $Y_i$, is introduced. The method for range division is discussed in section 6.1.
Calculating $S_i$ without $W_i$
After generating $T_{i+1}$, a compensation in decimal addition (i.e., 10) is applied to
correctly calculate the temporary sum, $W_i$. Further, the transfer digit $T_i$ from the previous
stage is added to obtain the final result $S_i$. These two consecutive additions imply a
complicated and slower design. A method to merge them into one add operation, which leads
to better performance in area and delay, is proposed in section 6.1.
5.2 Decimal Fixed-point Multiplication
Multiplication is one of the four basic arithmetic operations. An analysis of benchmarks
shows that decimal multiplication could account for over 27% of the execution time
in some applications [18]. Due to the importance of multiplication, several decimal fixed-point
designs have been proposed in [42, 63-70]. Furthermore, decimal floating-point multipliers based
on those fixed-point designs have been published in [22, 73, 74, 93, 94].
5.2.1 Analysis of Previous Parallel Designs
In [65], to avoid complicated multiples of X, each digit of the operand Y is recoded into two parts,
$Y_i = Y_{H_i} + Y_{L_i}$, where $Y_H \in \{0, 5, 10\}$ and $Y_L \in \{-2, -1, 0, 1, 2\}$. Therefore, only
$-2X$, $-X$, $2X$, and $5X$ need to be implemented in logic gates. Since the multiples are
represented in 10's complement format, negation is implemented by a 9's complement recoder,
and the increment of one is only applied to the least significant digit (LSD). Furthermore, to
generate the partial products from 1X to 9X in BCD carry save (BCD-CS) format, a decimal
CSA has to be applied. The parallel partial product reduction (PPR) for 2n partial products
(i.e., n sums and n carries) is implemented by 6 levels of BCD full adders (BCD-FA) for a
16 × 16-digit multiplication. Half of the decimal carries of the partial products are added
separately by carry counters. The two outputs of the PPR, a 2n-digit sum and a 2n-bit carry, are
added together by a prefix network with a conditional adder. Furthermore, an improved PPR
algorithm based on the multi-operand decimal addition in [71] is provided by L. Dadda in [66].
The partial products in each column are first added in binary form with binary carry save adders.
Subsequently, a binary to decimal conversion algorithm is applied to convert the binary result
to decimal encoding.
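
The digit recoding $Y_i = Y_{H_i} + Y_{L_i}$ attributed to [65] can be illustrated in Python. The search below simply finds the unique pair in the two stated sets for each digit; the function names are hypothetical, and the actual hardware recoder in [65] is not described in this text.

```python
def recode_digit(y: int) -> tuple:
    """Split a decimal digit into (YH, YL) with YH in {0, 5, 10}, YL in [-2, 2]."""
    for yh in (0, 5, 10):
        yl = y - yh
        if -2 <= yl <= 2:
            return yh, yl
    raise ValueError("digit must be in 0..9")


def digit_multiple(x: int, y: int) -> int:
    """Form y * X from the two easy parts YH * X and YL * X."""
    yh, yl = recode_digit(y)
    return yh * x + yl * x
```

For example, the digit 8 is recoded as 10 - 2, so 8X is obtained from a one-digit shift of X and the multiple -2X.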
In [67], G. Jaberipur et al. propose a new partial product generation (PPG) algorithm which only
generates 2X and 5X to compose the other multiples from 1X to 7X. The 8X and 9X multiples are
each divided into two parts: 8X is implemented as $E = 10E_h + E_l$, and 9X is implemented in
the same way as $N = 10N_h + N_l$. Therefore, the algorithm avoids not only the negation logic for
$-2X$ and $-X$, but also the 4X (2X doubled) needed to generate 8X and 9X. Furthermore,
by analyzing the range of the computation and the gate-level representation, the BCD-FA in the
PPR unit is simplified. The two outputs of the PPR unit are further reduced to one 2n-digit
and one $(2n - l)$-digit BCD number, where $l$ is the number of levels of BCD-FAs in the
PPR structure. Hence, for the final product computation, only $2n - l$ digits are involved in
the carry propagation adder that generates the final multiplication result.
In [70], the authors propose a redundant decimal addition algorithm based on a specific
encoding, namely the weighted bit-set encoding. With this addition algorithm, a multiplier
based on the redundant number system is provided. The double-BCD format multiples are
first created by combining the easy decimal multiples (i.e., 2X, 4X, and 5X). In the
PPR unit, two-operand redundant adders are applied to reduce 2n BCD partial products to
a redundant number with digits in the range [0, 15], the so-called overloaded decimal digit set
(ODDS). In the last step, the redundant product is converted to BCD encoding by
a digit set converter with a propagation process.
In [68], A. Vazquez et al. propose an improved design of their previous work published
in [76]. In the improved new family parallel decimal multiplier, two unconventional decimal
encodings (i.e., BCD-4221 and BCD-5211) and two architectures (i.e., radix-10 and radix-
5) are applied to generate and reduce the partial product. In radix-10 architecture, the
operand Y is recoded into SD digit-set [ 5; 5], and n + 1 partial products are selected by
the recoded Y . Alternatively, in radix-5 architecture, the second operand is encoded into
two parts, Yi = Y
U
i  5 + Y Li , where Y Ui 2 f0; 1; 2g and Y Li 2 f 2; 1; 0; 1; 2g. Therefore,
in this scheme, there are 2n partial products need to be reduced. In the PPR unit, only
the binary full adders and combinational recoders are applied due to the specic encodings.
Finally two 2n-digit results are added together with a quaternary tree (Q-T) adder based on
the conditional speculative decimal addition proposed by the same authors in [75].
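
A possible radix-10 recoding of Y into the SD digit set [-5, 5] is sketched below in Python. The text does not give the recoding rule used in [68]; the rule here (subtract 10 from every digit of 5 or more and pass a 1 to the next digit) is a common choice and should be read as an assumption. It produces n + 1 digits, which matches the n + 1 partial products mentioned above.

```python
def recode_sd(digits: list) -> list:
    """Recode little-endian decimal digits (0..9) into n+1 digits in [-5, 5]."""
    out, prev = [], 0
    for d in digits:
        s = 1 if d >= 5 else 0        # transfer decided by this digit alone
        out.append(d - 10 * s + prev) # in [-5, 0] if s == 1, in [0, 5] otherwise
        prev = s
    out.append(prev)                  # extra most significant digit
    return out


def sd_to_int(digits: list) -> int:
    """Integer value of a little-endian signed digit list."""
    return sum(d * 10 ** i for i, d in enumerate(digits))
```

Since each transfer depends only on its own digit, the recoding needs no carry propagation across digits.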
In [74], another variant of the design proposed in [76] (i.e., the radix-10 architecture) is
introduced. The author applies the idea and basic architecture of Vazquez's radix-10 design
to create a decimal floating-point multiplier. The only difference is that the final product
accumulation is replaced by a decimal adder with a Kogge-Stone carry network.
5.2.2 Analysis of Previous Sequential Designs
Sequential multipliers generate partial products gradually (i.e., one per iteration) and
accumulate them sequentially. This architecture, although not very fast, is widely used
whenever cost efficiency is the main goal. All sequential multipliers consist of two
main steps: 1) partial product generation (PPG) and 2) partial product accumulation (PPA).
The most popular algorithm for decimal PPG is based on the generation of easy multiples
of the multiplicand X, i.e., the multiples which can be generated as non-redundant decimal
numbers via a carry-free approach (e.g., X, 2X, 4X, 5X). The generation of these
multiples is very similar to that in the parallel multipliers.
To obtain the final product, one needs to generate and sum up the partial products
for all digits of the multiplier. This step, also known as partial product accumulation,
usually uses a carry-propagating or carry-free decimal adder, which is much simpler than
the partial product reduction array of a parallel multiplier. However, as mentioned above, the
performance (i.e., latency and throughput) is limited by the sequential processing strategy.
It should be noted that a final conversion to a non-redundant representation is required
when a carry-free decimal adder is used.
For example, Erle et al. propose a traditional method of decimal multiplication in [63].
The design borrows the idea from binary multiplication of reducing the partial products
in a carry save adder (CSA) based structure. Furthermore, to reduce the complexity of
multiple generation, a so-called secondary set containing {X, 2X, 3X, 4X, 8X} is
applied, and all the missing multiples can be generated from the elements of the
secondary set with no more than one carry save addition. The decimal 3:2 CSA and 4:2
compressor are described in [63]. The partial product of each iteration can be
added iteratively within the delay of a decimal 4:2 compressor. An n × n-digit multiplication
can be finished in n + 4 cycles.
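
The claim that every multiple from 0X to 9X needs at most one addition of secondary-set elements can be checked with a small Python search. The function name `compose` and the tie-breaking order of the search are illustrative; the specific pairs chosen in [63] may differ.

```python
SECONDARY = (1, 2, 3, 4, 8)   # multiples of X in the secondary set


def compose(m: int) -> tuple:
    """Express the multiple m (0..9) with at most two secondary-set elements."""
    if m == 0:
        return ()
    if m in SECONDARY:
        return (m,)
    for a in SECONDARY:
        for b in SECONDARY:
            if a <= b and a + b == m:
                return (a, b)         # one carry save addition suffices
    raise ValueError("multiple must be in 0..9")
```

The missing multiples 5X, 6X, 7X, and 9X come out as 1X + 4X, 2X + 4X, 3X + 4X, and 1X + 8X, respectively.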
Another sequential decimal multiplier with easy multiples (i.e., {X, 2X, 4X, 5X}) is
proposed in [42]. Additionally, a 2-stage overloaded decimal adder, which can sum two partial
products and one iteration result with less delay than a decimal 4:2 compressor, is presented.
As a consequence, a clean-up block has to be applied to correct the decimal encoding before
the carry propagate addition in the final step. Thus, in such a multiplier, the latency of
one operation is up to n + 8 cycles.
An alternative sequential redundant multiplication is described in [64]. The authors
present an algorithm which recodes both operands into the SD digit set [-5, 5] to generate the
SD operands with simple logic. Further, a digit multiplier block on the range [2, 5] × [2, 5]
is proposed to generate the partial products in SD format. A Svoboda signed digit
adder with a restricted range is then applied to add the signed digit partial products
iteratively. The SD sequential multiplier takes n + 4 cycles to finish one multiplication.
5.3 Decimal Floating-point FMA
To the best of our knowledge, four pure decimal floating-point fused multiply-add (DFMA)
designs have been published previously. In [103], the author described a top-level design of
a DFMA which is based on a previous binary FMA architecture with decimal sub-modules
[33]. In [93], the authors provided a conventional DFMA comprising a parallel decimal
multiplier proposed by the same research group in 2008 [94], a decimal carry propagation
adder (CPA) to accumulate the intermediate results, and a decimal rounder which creates the
final significand after the decimal FMA core. In [95], the authors proposed a new leading zero
anticipation (LZA) algorithm which starts the detection and decision of the post-alignment
shift amount in parallel with the decimal adder. In [102], the author introduced a new FMA
architecture which combines the addition and rounding operations into a single unit.
Since the first two designs are not described in structural detail, only
the latter two designs are introduced in this section.
The two previous DFMA architectures proposed in [102] and [95] are shown
in Figures 5.1(a-b). In Fig. 5.1(a), the alignment shifting is performed in parallel with the
multiplier array. Therefore, the pre-alignment is excluded from the critical path. To
determine the rounding position before the final addition is performed, the following selection
and adder modules select the proper range of the operands, and a 4221-BCD decimal carry
save adder adds the product in carry-save format and the addend. A leading
zero anticipator and a shift controller then decide the possible exponent and
rounding position. The final result is obtained by a combined adder and rounder. On
the other hand, in Fig. 5.1(b), a new leading zero anticipation algorithm is introduced. In
this architecture, a swapping module and a shifting module are placed after the multiplier
array. Afterwards, the carry propagation adder and the leading zero anticipation unit are
applied in parallel. The following post-alignment shifter creates and transmits a 2n-digit
intermediate result to the final rounder.
[Figure 5.1(a), architecture proposed in [102], blocks: DPD Decoder, Mul Array, Pre-Align, Op Selection, 4221-BCD CSA, LZA, Shamt Calculation, LSH & RSH, Combined Add Round, Post-processing, DPD Encoder.]
[Figure 5.1(b), architecture proposed in [95], blocks: DPD Decoder, Mul Array, Pre-Align, Swap Unit, LSH & RSH, Adder (CPA), LZA, LZD, LSH, Shamt Calculation, Rounding, Post-processing, DPD Encoder.]
Figure 5.1: Decimal floating-point fused multiply-add architectures
Chapter 6
Proposed Designs
This chapter describes the details of the proposed designs. Fixed-point addition and
multiplication with redundant internal encodings are designed first to reduce the latency
of the major components in a DFMA. Additionally, the architecture and algorithms of the
proposed DFMA benefit not only from the new addition and multiplication but also from
the specific number system. Detailed descriptions of the addition, multiplication, and
DFMA are provided in sections 6.1, 6.3, and 6.4.
6.1 Decimal Fixed-point Addition
Addition is the basis of all other arithmetic operations. Therefore, a fixed-point carry-free
addition is studied first to increase the performance of decimal floating-point
processing. Additionally, a new final digit-set conversion is described so that
the carry-free addition can be used on its own.
6.1.1 Carry Free Addition
In the conventional carry-free addition algorithm, to obtain the transfer digit $T_{i+1}$, the
operands $X_i$ and $Y_i$ have to be added together and compared with a threshold value.
The carry chain in this process limits the performance of the carry-free adder. To improve it,
a speculative method could be used (e.g., [51], [53], and [54]). In this method, all possible
results, which depend on the different transfer digits $T_i$ and $T_{i+1}$, are calculated
simultaneously, and the correct one is selected by the values of the transfer digits. The
hardware redundancy of the aforementioned designs implies a larger area and higher power
consumption. On the other hand, the non-speculative method for maximally redundant SD addition
(e.g., [55]) provided a faster design in the binary world (i.e., radix-$2^h$). In this section, a
new non-speculative decimal SD addition which directly calculates the result without hardware
redundancy is introduced. The proposed adder, which works with the digit set [-9, 9], has simple
range division logic. Moreover, the operands and result are encoded in 5-bit two's complement to
reuse binary circuits as much as possible.
The Algorithm
To unify addition and subtraction, a new operand $Yop_i$ is defined in equation (6.1).
Hence, the adder and subtractor can be represented by a unified model as shown in
equation (6.2).
\[ Yop_i = \begin{cases} Y_i & \text{if operation is add} \\ -Y_i & \text{if operation is sub} \end{cases} \tag{6.1} \]
\[ X_i \pm Y_i = X_i + Yop_i \tag{6.2} \]
In the traditional carry-free algorithm, the transfer digit $T_{i+1}$ and the temporary sum $W_i$
are generated from the position sum $P_i$. The process can be represented by equation (6.3).
\[ \begin{aligned} T_{i+1} &= f(P_i) = f(X_i + Yop_i) \\ W_i &= g(P_i) = g(X_i + Yop_i) \end{aligned} \tag{6.3} \]
To reduce the timing delay and parallelize the transfer digit generation with the position
sum calculation, the temporary sum $W_i$ and the transfer digit $T_{i+1}$ can also be expressed
directly in terms of $X_i$ and $Yop_i$, as shown in equation (6.4).
\[ \begin{aligned} T_{i+1} &= f'(X_i, Yop_i) \\ W_i &= g'(X_i, Yop_i) \end{aligned} \tag{6.4} \]
In the decimal signed digit number system, $\pm 9$ should be avoided in the temporary sum; otherwise,
an incoming transfer digit could lead to a carry to the next digit. A position sum
equal to $\pm 9$ is called an exception in the proposed design. Exception
detection lowers the performance of the decimal SD adder compared with its binary
counterpart. Hence, the fewer the exceptional cases, the better.
Table 6.1: Range division directly based on operands
Case    Range of X_i and Yop_i     T_{i+1}  W_i        S_i for T_i = -1 = "11"   S_i for T_i = 0/1 = "00"/"01"   Correction signal cor^{4...1}
case 1  X_i >= 1, Yop_i >= 1       1        P_i - 10   P_i + (-11)               P_i + (-10)/(-9)                "1010", if T_i^1 = 1
        and (0,9), (9,0)                               = P_i + "10101"           = P_i + "10110"/"10111"         "1011", if T_i^1 = 0
case 2  X_i >= 0, Yop_i <= 0       0        P_i        P_i + (-1)                P_i + 0/1                       "1111", if T_i^1 = 1
        X_i <= 0, Yop_i >= 0                           = P_i + "11111"           = P_i + "00000"/"00001"         "0000", if T_i^1 = 0
        exclude (0,±9), (±9,0)
case 3  X_i <= -1, Yop_i <= -1     -1       P_i + 10   P_i + 9                   P_i + 10/11                     "0100", if T_i^1 = 1
        and (0,-9), (-9,0)                             = P_i + "01001"           = P_i + "01010"/"01011"         "0101", if T_i^1 = 0
To implement the range division and exception detection efficiently, a scheme that divides
the cases for generating the different values of the transfer digit $T_{i+1}$ is proposed in Table 6.1.
In this method, only four pairs of $X_i$ and $Yop_i$ need to be detected as exceptions.
Therefore, only six multi-input gates are needed to implement the exception detecting circuit.
Furthermore, these gates can be reused to decide the transfer digits. Consequently, besides
the exception detecting logic, only the most significant bits of $X_i$ and $Yop_i$ and simple logic
are needed to determine the transfer digit.
In Algorithm 5.1, once the transfer digit is obtained, $W_i$ is generated by adding a correction
value to $P_i$, and then the result $S_i$ is calculated by adding $T_i$ to $W_i$. These
two serial computations limit the speed. To combine the two additions into one
computation efficiently (i.e., the three-operand addition $S_i = P_i - r \times T_{i+1} + T_i$), an
analysis of the effect of the input range and the incoming transfer bits on the decimal correction
value is provided in Table 6.1.
To generate the correction signal, the transfer digit $T_i$ from the previous digit is used.
The decimal correction signal can therefore be decided directly by $T_i^1$ and the range of the
input operands $X_i$ and $Yop_i$, and the further computation for adding $T_i$ is removed. The
final bit of each 5-bit correction value in Table 6.1 shows that the least significant bit of the
correction signal is equal to $T_i^0$.
The Hardware Implementation
To reuse the well-optimized circuits of the binary world as much as possible, the operands $X_i$
and $Y_i$ are encoded in 2's complement. Therefore, $Yop_i$ is obtained by inverting $Y_i$ with
XOR gates controlled by the operation signal $op$, which is the penalty of the subtraction. The
increment on the least significant bit can be added as an incoming carry to the rightmost
full adder.
In this design, the exception logic is minimized to the detection of four pairs of operands,
and to improve the speed, the operands $X_i$ and $Y_i$ are used directly for the exception
handling. In equation (6.5), the signals $E_p$ and $E_n$ indicate the positive exception (i.e.,
$(0, 9)$ or $(9, 0)$) and the negative exception (i.e., $(0, -9)$ or $(-9, 0)$), respectively.
Figure 6.1: Proposed n-digit signed digit decimal adder
* The two full adders only contain the logic for sum.
** The three full adders contain the logic for $p$ ($a \oplus b$) and $g$ ($a \wedge b$).
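$$
\begin{aligned}
E_p &= \Big[(X_i = 0) \wedge \big(((Y_i = 9) \wedge \overline{op}) \vee ((Y_i = -9) \wedge op)\big)\Big] \vee \big((X_i = 9) \wedge (Y_i = 0)\big)\\
E_n &= \Big[(X_i = 0) \wedge \big(((Y_i = -9) \wedge \overline{op}) \vee ((Y_i = 9) \wedge op)\big)\Big] \vee \big((X_i = -9) \wedge (Y_i = 0)\big)
\end{aligned}
\tag{6.5}
$$
$$
\begin{aligned}
case1 &= \Big[\big(\overline{X^4_i} \wedge (X_i \neq 0)\big) \wedge \big(\overline{Yop^4_i} \wedge (Y_i \neq 0)\big)\Big] \vee E_p\\
case3 &= \Big[X^4_i \wedge \big(Yop^4_i \wedge (Y_i \neq 0)\big)\Big] \vee E_n\\
case2 &= \Big[\big(\overline{X^4_i} \wedge (Yop^4_i \vee (Y_i = 0))\big) \vee \big((X^4_i \vee (X_i = 0)) \wedge (\overline{Yop^4_i} \vee (Y_i = 0))\big)\Big] \wedge \overline{E_p} \wedge \overline{E_n}
\end{aligned}
\tag{6.6}
$$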
In Table 6.1, the range division of the operands for generating $T_{i+1}$ does not fall exactly
on zero; thus, the zero input must be excluded in some cases. The zero detection logic in
equation (6.5) can be reused, and the range division logic is given in equation (6.6).
The transfer digit depends only on the range division, and it can be obtained at the same
time as $P_i$ is ready. Thus, the critical path passes through only one of the two units for
transfer digit generation and position sum addition. Equation (6.7) shows the logic that
generates the transfer bits.
$$
\begin{aligned}
T^1_{i+1} &= case3\\
T^0_{i+1} &= case1 \vee case3
\end{aligned}
\tag{6.7}
$$
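The range division of Table 6.1 can be checked at the arithmetic level with a short Python sketch. It is only an illustration, not the gate-level design: the function names `sd_digit_add` and `sd_add` are hypothetical, digits are plain integers in $[-9, 9]$ instead of 5-bit 2's complement codes, and `yop` is assumed to already hold the (possibly negated) second operand digit.

```python
def sd_digit_add(x, yop, t_in):
    """One digit position of the SD adder (arithmetic model, not gate level).

    x, yop are digits in [-9, 9]; t_in is the incoming transfer digit T_i.
    Returns (T_{i+1}, S_i) following the case division of Table 6.1.
    """
    pos_exc = (x, yop) in ((0, 9), (9, 0))      # positive exception pairs
    neg_exc = (x, yop) in ((0, -9), (-9, 0))    # negative exception pairs
    if (x >= 1 and yop >= 1) or pos_exc:        # case 1
        t_out = 1
    elif (x <= -1 and yop <= -1) or neg_exc:    # case 3
        t_out = -1
    else:                                       # case 2
        t_out = 0
    p = x + yop                                 # position sum P_i
    s = p - 10 * t_out + t_in                   # S_i = P_i - r*T_{i+1} + T_i, r = 10
    return t_out, s


def sd_add(xs, yops):
    """Add two SD numbers given as digit lists, least significant digit first."""
    out, t = [], 0
    for x, y in zip(xs, yops):
        t_next, s = sd_digit_add(x, y, t)       # t_next does not depend on t
        out.append(s)
        t = t_next
    out.append(t)                               # final transfer digit
    return out
```

Because the outgoing transfer digit depends only on the two operand digits, every position can be evaluated in parallel; the loop above is sequential only for brevity.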
According to the analysis in Table 6.1, the decimal correction signals are decided by the
operand range and the incoming transfer bits. A conditional adder with a multiplexer
controlled by $T_i$ and $case_i$ could be applied. Nevertheless, to reduce the area, the
combinational logic shown in equation (6.8) is used to generate the correction signal directly
and to connect it to the second level of binary full adders. An example of the operation of
the proposed adder is shown in Fig. 6.1.
$$
\begin{aligned}
cor^4 &= case1 \vee (T^1_i \wedge case2)\\
cor^3 &= case3 \vee (T^1_i \wedge case2)\\
cor^2 &= cor^4\\
cor^1 &= \overline{T^1_i \oplus case2}\\
cor^0 &= T^0_i
\end{aligned}
\tag{6.8}
$$
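As a sanity check of the correction logic, the following Python sketch evaluates the five correction bits and compares them with the constant $-10T_{i+1} + T_i$ listed in Table 6.1. The helper names are hypothetical, `cor^1` is taken as the XNOR of $T^1_i$ and $case2$ (an assumption that matches the table entries), and the transfer digit is assumed to be coded as "11" for $-1$, "00" for 0 and "01" for 1.

```python
def correction_value(case, t1, t0):
    """Return cor^4..cor^0 as a signed 5-bit 2's complement value.

    case is 1, 2 or 3 (the range division of Table 6.1);
    t1, t0 are the two bits of the incoming transfer digit T_i.
    """
    case1, case2, case3 = case == 1, case == 2, case == 3
    t1, t0 = bool(t1), bool(t0)
    cor4 = case1 or (t1 and case2)
    cor3 = case3 or (t1 and case2)
    cor2 = cor4
    cor1 = not (t1 ^ case2)                 # XNOR of T^1_i and case2
    cor0 = t0
    raw = sum(int(b) << k for k, b in zip((4, 3, 2, 1, 0), (cor4, cor3, cor2, cor1, cor0)))
    return raw - 32 if raw >= 16 else raw   # interpret as 5-bit 2's complement


# Transfer digit produced by each case, and the bit coding of T_i.
T_NEXT = {1: 1, 2: 0, 3: -1}
T_CODE = {-1: (1, 1), 0: (0, 0), 1: (0, 1)}
```

With these definitions, `correction_value(case, *T_CODE[t])` equals `-10 * T_NEXT[case] + t` for all nine combinations, which is the single constant added to $P_i$ in the merged three-operand addition.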
Since the critical path passes through the second level of full adders, a simplified carry
chain, which is similar to a prefix network, is applied to further improve the performance of
the proposed design, as shown in Fig. 6.1.
Finally, the hardware implementation of the proposed decimal SD adder is given in
Fig. 6.1. The bold dashed line is the critical path, which passes through the transfer and
correction logic and an optimized carry chain. The full adders marked with an asterisk only
contain the logic for the sum. Furthermore, in the second level of full adders, the critical
path only passes through one XOR gate in the leftmost full adder.
6.1.2 Absolute Value Digit-Set Conversion
Since decimal data stored in memory are generally encoded in BCD format, to use the
decimal carry-free adder, the operands in BCD encoding must be converted into the internal
encoding scheme used in the proposed design. Similarly, the final result coming from the SD
adder needs to be converted back to BCD format before being sent to memory.
In the IEEE 754-2008 Standard, the absolute value of the mantissa is represented in the
significant digits section. However, for signed digit subtraction, the result can be less
than zero. Hence, before being sent to memory, a result that is less than zero must be
converted to its absolute value.
In [58], a negation unit and a prefix network are applied to correctly calculate the final
result in BCD format. In [59] and [52], 9 levels and 1 level of gates, respectively, are used
to convert the BCD encoding to the internal encoding. Further, the authors proposed two
algorithms to convert from the internal encoding to BCD encoding with a carry propagation
chain. The negation unit for the redundant number system can be implemented digit by digit.
In this section, a merged algorithm that directly converts a negative result to its
absolute value in BCD encoding with a smaller delay penalty is introduced.
The Algorithm
In our design, the digit set is encoded in 5-bit 2's complement and the input operands
in BCD encoding are always in the digit set $[0, 9]$; thus, the front conversion, which is only
a 1-bit sign extension, does not cost any logic. Converting from the internal 5-bit format to
the BCD encoding involves a borrow (negative carry) propagation that passes through logic
spanning the entire word width.
For a negative SD result, before converting to the BCD encoding, its absolute value is
obtained by inverting the sign of each digit. For example,
$$-(\bar{1}2\bar{3}\bar{4}010)_{SD} = (1\bar{2}340\bar{1}0)_{SD} = (0833990)_{BCD}.$$
To merge the negation algorithm with the digit-set conversion algorithm and improve the
performance of the converter, an absolute value digit-set conversion algorithm that includes
a prefix network and a correction unit is proposed. The algorithm leads to a logarithmic
delay, which is more suitable for high precision computation. An example is provided
in Fig. 6.2.
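Algorithm 6.1.2: Absolute Value Digit Set Conversion
Data: SD number $S$.
Result: BCD number $R$ ($R = |S|$).
1. Compute the generate bit ($G_i$) and propagate bit ($P_i$) for each digit of the result.
$$
G_i = \begin{cases} 1 & \text{if } S_i < 0\\ 0 & \text{otherwise,} \end{cases}
\qquad
P_i = \begin{cases} 1 & \text{if } S_i = 0\\ 0 & \text{otherwise,} \end{cases}
$$
$$
G_{i:j} = \begin{cases} G_i & \text{if } i = j\\ G_i \vee (P_i \wedge G_{i-1:j}) & \text{if } i > j, \end{cases}
\qquad
P_{i:j} = \begin{cases} P_i & \text{if } i = j\\ P_i \wedge P_{i-1:j} & \text{if } i > j. \end{cases}
\tag{6.9}
$$
2. Compute the negative carry $C_i$.
$$
C_{i+1} = G_{i:j} \vee (P_{i:j} \wedge C_j), \qquad
C_0 = \begin{cases} 0 & \text{if } S \geq 0\\ 1 & \text{if } S < 0. \end{cases}
\tag{6.10}
$$
3. Generate the result $R_i$ in BCD format.
$$
R_i = \begin{cases}
S_i & \text{if } C_{i+1}C_i = 00 \text{ and } S \geq 0\\
S_i - 1 & \text{if } C_{i+1}C_i = 01 \text{ and } S \geq 0\\
S_i + 10 & \text{if } C_{i+1}C_i = 10 \text{ and } S \geq 0\\
S_i + 9 & \text{if } C_{i+1}C_i = 11 \text{ and } S \geq 0\\
\overline{S_i} + 10 & \text{if } C_{i+1}C_i = 00 \text{ and } S < 0\\
\overline{S_i} + 11 & \text{if } C_{i+1}C_i = 01 \text{ and } S < 0\\
\overline{S_i} & \text{if } C_{i+1}C_i = 10 \text{ and } S < 0\\
\overline{S_i} + 1 & \text{if } C_{i+1}C_i = 11 \text{ and } S < 0,
\end{cases}
\tag{6.11}
$$
where the symbol $\overline{S_i}$ is the bit inversion of the digit $S_i$.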
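The three steps above can be sketched in Python as follows. This is a behavioural model under stated assumptions, not the proposed circuit: the function name `abs_sd_to_bcd` is hypothetical, the carries are computed with a simple ripple loop instead of a prefix network, the sign of $S$ is taken from its integer value rather than from the prefix tree output, and the bit inversion $\overline{S_i}$ of a 2's complement digit is modelled as `~d` (that is, $-S_i - 1$).

```python
def abs_sd_to_bcd(digits_msd_first):
    """Convert an SD number (digits in [-9, 9], MSD first) to the BCD digits of |S|."""
    s = digits_msd_first[::-1]                        # s[i] has weight 10**i
    value = sum(d * 10**i for i, d in enumerate(s))
    negative = value < 0                              # sign of S (assumed known here)

    # Steps 1 and 2: G_i, P_i and the negative carries C_i, with C_0 = sign of S.
    c = [1 if negative else 0]
    for d in s:
        g, p = d < 0, d == 0
        c.append(int(g or (p and c[-1])))

    # Step 3: digit-wise correction selected by (C_{i+1}, C_i) and the sign.
    r = []
    for i, d in enumerate(s):
        if not negative:
            r.append(d - c[i] + 10 * c[i + 1])        # S_i, S_i-1, S_i+10, S_i+9
        else:
            r.append(~d + c[i] + 10 * (1 - c[i + 1])) # ~S_i+10, ~S_i+11, ~S_i, ~S_i+1
    return r[::-1]
```

For the example used in this section, `abs_sd_to_bcd([-1, 2, -3, -4, 0, 1, 0])` returns `[0, 8, 3, 3, 9, 9, 0]`.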
The Hardware Implementation
The architecture of the proposed converter for a p-digit input is given in Fig. 6.2. $C'_{msb}$,
which is the sign of the SD result, is obtained at the output of the prefix tree. Therefore,
if the SD result contains trailing zeros, then $C_0$, which is $C'_{msb}$, cannot be propagated
correctly. To fix this problem, a trailing zero detection is placed in parallel with the prefix
tree, and a two-gate logic is applied to adjust the result coming from the prefix tree, as shown
in Fig. 6.3a.
In BCD encoding, only 4 bits are used to represent a digit. Thus, the fifth bit $S^4_i$ of
each digit in the SD result is discarded in the final 4-bit adder that compensates the result.
Fig. 6.3b shows the logic for the 4-bit correction signal of each digit, which is obtained based
on equation (6.11) in Algorithm 6.1.2.
Figure 6.2: Proposed absolute value digit-set converter
* The full adder only contains the logic for sum.
Example in the figure: $S = \bar{1}2\bar{3}\bar{4}010$, $P = 0000101$, $G = 1011000$, $C' = 1011000$, $Z = 0000001$, $C = 10110011$, Sign $= 1$, $R = 0833990$.
Figure 6.3: Adjust and correction logics of the proposed digit-set converter. (a) Adjust logic. (b) Correction logic.
Before converting the SD result to the BCD encoding, an exclusive-OR gate is applied to
each bit of $S_i$ to perform the bit inversion for a negative result. In Fig. 6.2, the bold
dashed line is the critical path, which passes through the PG generation block, the prefix
network, the adjustment logic, the correction signal generator, and three full adders in the
final converting unit. On the critical path, only the delay of the prefix network grows with
the width of the input.
6.2 Parallel Decimal Fixed-point Multiplication
In the proposed parallel multiplication, one of the two operands is recoded into the digit-set
$[-5, 5]$, and the multiples of the other operand, from $-5X$ to $5X$, are represented in the
digit-set $[-8, 8]$. By doing so, all the multiples can be obtained with a constant delay, and
only $n + 1$ partial products (PPs) are generated. Furthermore, to reduce the $n + 1$ partial
products to the final SD result, a multi-level multi-operand SD addition is discussed. To
reduce the delay and area of the hardware in the PPR unit, binary arithmetic units and
combinational recoders are applied in the multi-operand SD adder. Finally, a digit-set
converter with a hybrid carry propagation network is applied to convert the product from SD
to BCD encoding. In the proposed hybrid prefix tree, different prefix trees with smaller digit
widths are combined to construct one large prefix carry propagation network. Consequently, in
the prefix tree, the levels of prefix nodes after the longest column in the PPR unit are
reduced. Overall, the structures of the PPG, PPR, and final converter are balanced, and the
delay of the proposed multiplier is optimized. The top level architecture of the proposed
multiplication is shown in Fig. 6.4.
In the first stage, the n-digit operand $Y_{BCD}$ is recoded into the $(n+1)$-digit $Y_{SD}$ in the
digit-set $[-5, 5]$. The 5-bit "one hot" selection signal $Ys_i$ for each digit is generated based
on the recoded operand $Y_{SD}$. In the proposed design, only the positive multiples, $X$ through
$5X$, are implemented by logic gates. The negative multiples, $-X$, $-2X$, $-3X$, $-4X$, and $-5X$,
could be generated in a similar way. However, to reduce the area of the multiplier, the
negative multiples are generated by inverting the sign of each digit of the positive multiples.
Since each digit of the proposed SD multiples is represented in 2's complement encoding, the
inversion is done by XOR gates controlled by $Yn_i$, which is also the increment bit added to
each digit to complete the sign inversion. Note that one bit is enough to invert a whole
partial product; therefore, the increment bits for all digits in a partial product are identical.
The second stage of the proposed multiplication is a PPR unit implemented by multiple
levels of multi-operand SD adders. For example, in a $16 \times 16$-digit multiplication, there
are 17 partial products to be reduced after the PPG unit. The layout of the partial product
array is rearranged so that two levels of SD adders generate the final product in SD format.
In such a multi-operand SD addition, the operands are first reduced by binary arithmetic
units, and at the end a recoder is applied to correct the transfer digit and interim sum in
the decimal manner. Thus, in the proposed multiplication, the decimal correction is compacted
as much as possible.
Figure 6.4: Top level architecture of the proposed parallel decimal multiplication
In the third stage, to convert the SD product back to BCD encoding efficiently, a hybrid
carry propagation network, which consists of several small carry prefix networks, is provided
to counterbalance the different delays on different bits of the result of the PPR unit.
Compared to traditional methods, the hybrid prefix tree has fewer levels of nodes on the
middle digits of the result from the PPR unit, at the cost of more levels on the less
significant digits. Since the middle columns of a partial product array consume more delay
than the end columns, the overall delay of the multiplier can be further reduced with the
proposed hybrid prefix tree. An example that illustrates each component of a proposed
$4 \times 4$-digit multiplier is provided in Fig. 6.5.
Figure 6.5: Example of the proposed $4 \times 4$-digit multiplication algorithm
Example in the figure: inputs $X_{BCD} = 9234$ and $Y_{BCD} = 6789$; SD recoding $Y_{SD} = 1\bar{3}\bar{2}\bar{1}\bar{1}$; stages PPG, PPR, and SD-BCD conversion; output $R_{BCD} = 62689626$.
6.2.1 Signed Digit Partial Product Generation
In the proposed multiplier, we follow the SD radix-10 method described in [68] to recode
$Y_{BCD}$ into the SD digit-set $[-5, 5]$. To represent the multiples, a new method that generates
$n + 1$ partial products without carry propagation is proposed. In Table 6.2, the positive
multiples from $1X$ to $5X$ are represented in the SD digit-set $[-8, 7]$. Thus, a 4-bit 2's
complement number can represent each digit of the multiples from $1X$ to $5X$. To reduce
the area of the PPG unit, the negative multiples are generated by inverting the sign of each
digit of the corresponding positive multiples, and the digit-set for all multiples from $-5X$
to $5X$ is extended to $[-8, 8]$ (i.e., $[-8, 7] \cup [-7, 8] = [-8, 8]$). Unlike the binary signed
digit encoding used in the decimal signed digit number system in [51], an increment bit is
involved in the proposed encoding system to invert the sign of each digit. Therefore, one
signed digit of the proposed multiples is represented by 5 bits. However, the penalty on the
hardware area is minimized, since the increment bits for all digits in a partial product are
identical.
Table 6.2: Signed digit representation of the proposed multiples
BCD operand | 1X_i: T_{i+1}, W_i | 2X_i: T_{i+1}, W_i | 3X_i: T_{i+1}, W_i | 4X_i: T_{i+1}, W_i | 5X_i: K_{i+2}, T_{i+1}, W_i | Y_i: T_{i+1}, W_i
0 | 0, 0 | 0, 0 | 0, 0 | 0, 0 | 0, 0, 0 | 0, 0
1 | 0, 1 | 0, 2 | 0, 3 | 1, -6 | 0, 0, 5 | 0, 1
2 | 0, 2 | 1, -6 | 1, -4 | 1, -2 | 0, 1, 0 | 0, 2
3 | 0, 3 | 1, -4 | 1, -1 | 1, 2 | 0, 1, 5 | 0, 3
4 | 0, 4 | 1, -2 | 2, -8 | 2, -4 | 1, -8, 0 | 0, 4
5 | 0, 5 | 1, 0 | 2, -5 | 2, 0 | 1, -8, 5 | 1, -5
6 | 1, -4 | 1, 2 | 2, -2 | 3, -6 | 1, -7, 0 | 1, -4
7 | 1, -3 | 1, 4 | 2, 1 | 3, -2 | 1, -7, 5 | 1, -3
8 | 1, -2 | 2, -4 | 3, -6 | 3, 2 | 1, -6, 0 | 1, -2
9 | 1, -1 | 2, -2 | 3, -3 | 4, -4 | 1, -6, 5 | 1, -1
Range of T_i + W_i | [-4, 6] | [-6, 6] | [-8, 6] | [-6, 6] | | [-5, 5]
Range of K_i + T_i + W_i | | | | | [-8, 7] |
Generation of Multiples
In Table 6.2, every multiple is divided into two parts, except $5X_i$, which is divided
into three parts. To simplify the description of the multiple generation, three variables
are defined in Table 6.2, where $W_i$ represents the residual number, which has the same weight
as the current BCD digit, and $T_{i+1}$ and $K_{i+2}$ are the transfer digits to the next two digits,
which have 10 and 100 times the weight of the current BCD digit, respectively. The sum of the
three variables is restricted to the range $[-8, 7]$ to form one digit of the SD number. Since
the variables can be generated directly from the inputs, and the carry (transfer digit) never
propagates beyond three neighboring digits, the delay of the proposed PPG is independent of
the width of the operand. In addition, for an n-digit operand, each multiple contains $n + 2$
SD digits. The SD multiple could be obtained by adding $K_i$, $T_i$, and $W_i$ with a 4-bit adder
after a recoder that generates these variables. Due to the specific conversion pattern in
Table 6.2, the conversion can be treated as an addition of constants. Thus, the 4-bit add
operation is optimized and converted to combinational logic to reduce area and delay.
The equations for one digit of the positive multiples are listed below. Note that the signals
on the right side of the equal sign are in BCD encoding, and the signals on the left side of
the equal sign are in the proposed SD encoding. The hardware implementation can be optimized
with faster logic gates (e.g., NAND, NOR, XNOR gates).
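The decomposition in Table 6.2 can be exercised with a small Python sketch. It works on integer digits rather than on the 4-bit codes, and the names `T_W`, `K_T_W_5` and `sd_multiple` are hypothetical; the table entries are copied from Table 6.2.

```python
# (T_{i+1}, W_i) for each BCD digit 0..9 and each multiple 1X..4X (Table 6.2).
T_W = {
    1: [(0, 0), (0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (1, -4), (1, -3), (1, -2), (1, -1)],
    2: [(0, 0), (0, 2), (1, -6), (1, -4), (1, -2), (1, 0), (1, 2), (1, 4), (2, -4), (2, -2)],
    3: [(0, 0), (0, 3), (1, -4), (1, -1), (2, -8), (2, -5), (2, -2), (2, 1), (3, -6), (3, -3)],
    4: [(0, 0), (1, -6), (1, -2), (1, 2), (2, -4), (2, 0), (3, -6), (3, -2), (3, 2), (4, -4)],
}
# (K_{i+2}, T_{i+1}, W_i) for the multiple 5X (Table 6.2).
K_T_W_5 = [(0, 0, 0), (0, 0, 5), (0, 1, 0), (0, 1, 5), (1, -8, 0),
           (1, -8, 5), (1, -7, 0), (1, -7, 5), (1, -6, 0), (1, -6, 5)]


def sd_multiple(x_digits_msd_first, m):
    """Return the (n+2)-digit SD representation of m*X, MSD first, for m in 1..5."""
    x = x_digits_msd_first[::-1]            # x[i] has weight 10**i
    out = [0] * (len(x) + 2)
    for i, d in enumerate(x):
        if m == 5:
            k, t, w = K_T_W_5[d]
        else:
            (t, w), k = T_W[m][d], 0
        out[i] += w                         # residual stays in place
        out[i + 1] += t                     # transfer to the next digit
        out[i + 2] += k                     # second transfer (5X only)
    return out[::-1]
```

Each output digit is a sum of at most three small constants that depend on three neighbouring BCD digits only, so no carry propagates along the word. For $X = 9234$, `sd_multiple([9, 2, 3, 4], 5)` gives `[1, -6, 6, 2, -3, 0]`, which has the value 46170.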
1X: Since the digit-set $[-8, 7]$ is applied to generate the positive multiples in the
proposed PPG algorithm, $1X$ has to be converted to the target digit-set. In the equation, the
signal $T_i$ represents the incoming transfer digit, which is determined by the previous digit.
$$
\begin{aligned}
T_i &= X^3_{i-1} + X^2_{i-1}X^1_{i-1}\\
1X^3_i &= X^3_i(\overline{X^0_i} + \overline{T_i}) + X^2_iX^1_i\\
1X^2_i &= T_iX^1_iX^0_i + X^3_i(\overline{X^0_i} + \overline{T_i}) + X^2_i\\
1X^1_i &= T_iX^0_i(\overline{X^3_i}\,\overline{X^1_i} + X^2_i) + (\overline{X^0_i} + \overline{T_i})(\overline{X^2_i}X^1_i + X^3_i)\\
1X^0_i &= T_i \oplus X^0_i
\end{aligned}
\tag{6.12}
$$
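A Boolean form of the $1X$ digit can be checked exhaustively against Table 6.2 with the Python sketch below. The placement of the complemented literals is an assumption chosen so that the logic reproduces the table for all valid BCD inputs; the function names are hypothetical.

```python
def one_x_digit_bits(x, x_prev):
    """4-bit 2's complement code of one 1X digit from two neighbouring BCD digits."""
    bit = lambda v, k: (v >> k) & 1
    inv = lambda b: b ^ 1                                   # single-bit complement
    x3, x2, x1, x0 = (bit(x, k) for k in (3, 2, 1, 0))
    t = bit(x_prev, 3) | (bit(x_prev, 2) & bit(x_prev, 1))  # previous digit >= 6
    o3 = (x3 & (inv(x0) | inv(t))) | (x2 & x1)
    o2 = (t & x1 & x0) | (x3 & (inv(x0) | inv(t))) | x2
    o1 = (t & x0 & ((inv(x3) & inv(x1)) | x2)) | ((inv(x0) | inv(t)) & ((inv(x2) & x1) | x3))
    o0 = t ^ x0
    return (o3 << 3) | (o2 << 2) | (o1 << 1) | o0


def one_x_digit_reference(x, x_prev):
    """Same digit computed arithmetically from the 1X column of Table 6.2."""
    w = x if x <= 5 else x - 10          # residual W_i
    t = 1 if x_prev >= 6 else 0          # incoming transfer T_i
    return (w + t) & 0xF                 # 4-bit 2's complement code
```

The two functions agree for all 100 combinations of BCD digits; the codes 10 to 15 are treated as don't-care inputs.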
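2X: In the proposed algorithm, since the transfer digit from the previous digit of the
multiple $2X$ can be from 0 to 2, two bits (i.e., $T^1_i$ and $T^0_i$) are needed to represent the
incoming transfer digit.
$$
\begin{aligned}
T^1_i &= X^3_{i-1}\\
T^0_i &= X^2_{i-1} + X^1_{i-1}\\
2X^3_i &= \overline{X^0_i}(\overline{T^1_i}X^2_i\overline{X^1_i} + X^3_i) + \overline{X^2_i}X^1_i + \overline{T^1_i}X^3_i\\
2X^2_i &= X^0_i(T^1_i\overline{X^3_i}\,\overline{X^2_i} + X^1_i) + T^1_iX^1_i + \overline{X^0_i}(\overline{T^1_i}X^2_i\overline{X^1_i} + X^3_i) + \overline{T^1_i}X^3_i\\
2X^1_i &= \overline{X^2_i}\,\overline{X^1_i}(T^1_i\overline{X^0_i} + \overline{T^1_i}X^0_i) + (T^1_iX^0_i + \overline{T^1_i}\,\overline{X^0_i})(X^1_i + X^2_i)\\
2X^0_i &= T^0_i
\end{aligned}
\tag{6.13}
$$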
5X: To generate the multiple $5X$ in the digit-set $[-8, 7]$, two transfer digits, which have 10
and 100 times the weight of the current digit, are needed. Since there are only two elements
in the digit-sets of the residual number $W_i$ and the transfer digit $K_i$, the logic can be
simplified as shown in equation (6.14).
$$
\begin{aligned}
W_i &= X^0_i\\
K_i &= X^3_{i-2} + X^2_{i-2}\\
5X^3_i &= X^3_{i-1}(\overline{K_i} + \overline{W_i}) + X^2_{i-1}\\
5X^2_i &= W_i(\overline{X^3_{i-1}} + \overline{K_i})\\
5X^1_i &= W_iK_i\overline{X^3_{i-1}} + X^3_{i-1}(\overline{K_i} + \overline{W_i}) + X^1_{i-1}(K_i + W_i)\\
5X^0_i &= X^1_{i-1} \oplus (W_i \oplus K_i)
\end{aligned}
\tag{6.14}
$$
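3X: By applying the redundant number system to represent the partial product, the $3X$
logic no longer contains carry propagation at the digit level. Thus, a constant delay
in the PPG can be achieved.
$$
\begin{aligned}
T^1_i &= X^3_{i-1} + X^2_{i-1}\\
T^0_i &= X^3_{i-1} + \overline{X^2_{i-1}}X^1_{i-1}\\
3X^3_i &= X^3_i(\overline{X^0_i} + \overline{T^0_i} + \overline{T^1_i}) + X^2_i\overline{X^1_i} + X^1_i\big(\overline{T^0_i}\,\overline{T^1_i}\,\overline{X^2_i} + \overline{X^0_i}(\overline{X^2_i} + \overline{T^1_i})\big)\\
3X^2_i &= T^1_iX^3_i\overline{X^0_i} + X^1_i\big(\overline{T^0_i}\,\overline{T^1_i}\,\overline{X^2_i} + \overline{X^0_i}(\overline{X^2_i} + \overline{T^1_i})\big)\\
&\quad + X^0_i\big(T^0_iT^1_iX^2_i + \overline{X^1_i}(T^0_i\overline{X^3_i} + T^1_i\overline{T^0_i}) + \overline{T^1_i}X^3_i\big)\\
3X^1_i &= \overline{X^3_i}\big(\overline{T^1_i}\,\overline{T^0_i}X^0_i + T^1_i(\overline{X^0_i} + T^0_i)\big)(\overline{X^1_i} + \overline{X^2_i})\\
&\quad + \big(T^1_i\overline{T^0_i}X^0_i + \overline{T^1_i}(\overline{X^0_i} + T^0_i)\big)(X^2_iX^1_i + X^3_i)\\
3X^0_i &= T^0_i\overline{X^0_i} + \overline{T^0_i}X^0_i
\end{aligned}
\tag{6.15}
$$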
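4X: The multiple $4X$ in the proposed work is not generated as two times $2X$,
as in other works. A direct and simple method is shown below.
$$
\begin{aligned}
4X^3_i &= \overline{X^3_i}\,\overline{X^2_i}\,\overline{X^1_i}X^0_i + X^2_iX^1_i\overline{X^0_i} + \overline{X^0_{i-1}}(\overline{X^2_i}\,\overline{X^1_i}X^0_i + X^2_i\overline{X^0_i})\\
&\quad + \overline{X^3_{i-1}}\big(X^1_i\overline{X^2_{i-1}}(\overline{X^0_i} + X^2_i) + \overline{X^2_i}\,\overline{X^1_i}X^0_i + X^2_i\overline{X^0_i}\big)\\
4X^2_i &= X^3_i(X^3_{i-1}\overline{X^0_{i-1}} + X^2_{i-1})\\
&\quad + X^0_i\big(\overline{X^3_i}X^3_{i-1}(\overline{X^1_i}X^0_{i-1} + \overline{X^2_i}) + \overline{X^3_{i-1}}(X^2_iX^1_i\overline{X^2_{i-1}} + X^3_i) + \overline{X^2_i}X^2_{i-1}\big)\\
&\quad + \overline{X^0_i}\big(\overline{X^2_i}(\overline{X^1_i}X^3_{i-1}X^0_{i-1} + X^1_i\overline{X^3_{i-1}}\,\overline{X^2_{i-1}})\\
&\quad\quad + X^2_i(X^3_{i-1}(\overline{X^0_{i-1}} + X^1_i) + \overline{X^1_i}\,\overline{X^3_{i-1}} + X^2_{i-1})\big)\\
4X^1_i &= \overline{X^2_{i-1}}(X^0_{i-1} + \overline{X^3_{i-1}})(X^3_i\overline{X^0_i} + \overline{X^3_i}\,\overline{X^2_i}X^0_i + X^1_i)\\
&\quad + (X^3_{i-1}\overline{X^0_{i-1}} + X^2_{i-1})\big(\overline{X^1_i}(\overline{X^3_i}\,\overline{X^0_i} + X^2_i) + X^3_iX^0_i\big)\\
4X^0_i &= \overline{X^3_{i-1}}\,\overline{X^2_{i-1}}X^0_{i-1} + X^3_{i-1}\overline{X^0_{i-1}} + X^1_{i-1}
\end{aligned}
\tag{6.16}
$$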
Selection of Partial Product
In the proposed multiplier, the minimally redundant radix-10 digit-set $[-5, 5]$ is applied to
represent the operand $Y$. Since the recoded set is symmetric and the multiples are encoded
as signed digit numbers, the selection signals for the negative multiples are the same as those
for the positive multiples (i.e., $Ys^{4 \ldots 0}_i$ are the signals that select $5X, \ldots, 1X$). If a
negative multiple is selected, a one-bit negation signal $Yn_i$ for the selected partial product
is applied to invert the signs of all digits of the corresponding positive multiple. The
selection signal and the negation signal are given in equation (6.17).
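$$
\begin{aligned}
T_i &= Y^2_{i-1}(Y^0_{i-1} + Y^1_{i-1}) + Y^3_{i-1}\\
Ys^4_i &= Y^2_i\overline{Y^1_i}(T_i\overline{Y^0_i} + \overline{T_i}Y^0_i)\\
Ys^3_i &= T_iY^0_i(\overline{Y^3_i}\,\overline{Y^2_i}\,\overline{Y^1_i} + Y^2_iY^1_i) + \overline{T_i}\,\overline{Y^0_i}(\overline{Y^2_i}Y^1_i + Y^3_i)\\
Ys^2_i &= Y^1_i(T_i\overline{Y^0_i} + \overline{T_i}Y^0_i)\\
Ys^1_i &= T_iY^0_i(\overline{Y^2_i}Y^1_i + Y^2_i\overline{Y^1_i}) + \overline{T_i}Y^2_i\overline{Y^0_i}\\
Ys^0_i &= \overline{Y^2_i}\,\overline{Y^1_i}(T_i\overline{Y^0_i} + \overline{T_i}Y^0_i)\\
Yn_i &= Y^3_i(\overline{Y^0_i} + \overline{T_i}) + Y^2_i(Y^0_i + Y^1_i)
\end{aligned}
\tag{6.17}
$$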
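At the arithmetic level, the recoding of $Y_{BCD}$ into the digit-set $[-5, 5]$ and the resulting selection can be sketched in Python as follows. The function names `recode_y` and `select` are hypothetical, and the one-hot vector is ordered here as $1X, \ldots, 5X$ purely for illustration.

```python
def recode_y(y_digits_msd_first):
    """Recode an n-digit BCD operand into n+1 SD digits in [-5, 5] (MSD first)."""
    out, t = [], 0
    for d in y_digits_msd_first[::-1]:   # least significant digit first
        w = d if d <= 4 else d - 10      # residual W_i from the Y_i column of Table 6.2
        out.append(w + t)                # add the transfer from the previous digit
        t = 1 if d >= 5 else 0           # transfer T_{i+1} to the next digit
    out.append(t)                        # the (n+1)th digit is 0 or 1
    return out[::-1]


def select(y_sd_digit):
    """Return (one-hot selection of 1X..5X, negation flag) for one recoded digit."""
    one_hot = [int(abs(y_sd_digit) == m) for m in (1, 2, 3, 4, 5)]
    return one_hot, y_sd_digit < 0
```

For example, `recode_y([6, 7, 8, 9])` returns `[1, -3, -2, -1, -1]`, since $10000 - 3000 - 200 - 10 - 1 = 6789$. A recoded digit of 0 selects no multiple, and a negative digit selects the positive multiple together with the negation flag.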
In Table 6.2, the column for the $T_{i+1}$ of $Y_i$ shows that an n-digit operand $Y_{BCD}$ generates
an $(n+1)$-digit SD recoded operand $Y_{SD}$, and the $(n+1)$th digit of $Y_{SD}$ can only be
0 or 1. Thus, the $(n+1)$th partial product can only be $0X$ (all zeros) or $1X$. Furthermore,
since the $(n+2)$th digit of the multiple $1X$ is always zero and the $(n+1)$th digit of
$1X$ can only be 0 or 1, 1 bit is enough to represent the two most significant digits of
the $(n+1)$th partial product, $PP_n$. Thus, the selection logic for $PP_n$ can be simplified.
Additionally, the actual bit-widths at the output of the PPG are $4 \times (n+2) \times n + (4n+1)$
for the partial products $PP$ and $n$ for the inversion signal $Yn$. The detailed structure of the
PPG is shown in Fig. 6.6.
Figure 6.6: Proposed architecture of partial product generation
6.2.2 SD Partial Product Reduction
To illustrate the proposed algorithm, the PPR scheme of a $16 \times 16$-digit multiplier is
implemented and discussed. First, the layout of the partial product array and the basic
structure of the PPR unit are introduced. Subsequently, a PPR algorithm based on multi-operand
SD addition is discussed. Finally, a hardware implementation of the proposed PPR unit for a
$16 \times 16$-digit multiplier is addressed. Additionally, a delay model in terms of the delay
of a binary full adder is analyzed to guide the design of the proposed SD-BCD converter.
Figure 6.7: Restructure of the proposed partial product reduction. Panels: (a) layout of the partial product array, (b) rearranged array, (c) recoding of the 17th partial product, (d) dataflow from the partial product generator through the two levels of SD adders.
Legend: $p$: one digit of the partial products (PPs), $p \in [-8, 8]$. $h$: one digit of the 17th partial product, $h \in [-4, 6]$. $u$: the most significant digit of a partial product, $u \in [-1, 1]$. $\dot{h}$: the 17th digit of the 17th partial product, $\dot{h} \in [0, 1]$. $h'$, $h''$: combinations of one digit $h$, $h' \in [-2, 2]$, $h'' \in [0, 1]$. $PI_i$: increment bit of the digits in the $i$th partial product.
Partial Product Reduction Array
As described in section 4, for the multiplication of two n-digit operands, $n + 1$ partial
products of $(n + 2)$ digits each are generated by the PPG unit in the proposed algorithm. The
$n + 1$ partial products are then shifted according to the weight of each digit of the second
operand. Finally, these $n + 1$ shifted partial products are added by the SD multi-operand adders.
An example of the layout of the partial products for the proposed $16 \times 16$-digit
multiplication is shown in Fig. 6.7(a). The partial product $0\dot{h}h \ldots hh$ is the partial
product generated by the most significant digit (MSD) of the recoded operand $Y_{SD}$. Recalling
the description in section 6.3, the MSD of $Y_{SD}$ can only be 1 or 0, and the 18th digit of
$PP_{16}$ is always zero for a $16 \times 16$-digit multiplication. Furthermore, the 17th digit of
$PP_{16}$, $\dot{h}$, is in $[0, 1]$ as shown in Table 6.2. The partial products $up \ldots pp$ are
generated according to the 16 least significant digits of $Y_{SD}$. Since the 18th digit may be
1 only in $5X$, the range of $u$ is restricted to $[-1, 1]$.
In Fig. 6.7(b), the layout of the partial product array is rearranged so that, except for
the middle two partial product columns, no column contains more than 16 digits. For a
$16 \times 16$-digit decimal multiplication, the result has at most 32 digits, as for the product
of two operands of 16 consecutive nines. If the product in SD format is not going to be
used by other SD arithmetic units before being converted back to BCD format, then the digits
beyond the least significant 32 digits can be discarded (e.g., the digits in the dashed
rectangle). For example, in a $16 \times 16$-digit multiplication, the least significant 33 digits
of the SD product can only be in one of the formats shown in equation (6.18). Otherwise, after
conversion back to BCD format, the product would be longer than 32 digits.
$$
R_{SD} = \begin{cases}
1\bar{d}\ldots, & \text{or}\\
10\ldots0\bar{d}\ldots, & \text{or}\\
0d\ldots, & \text{or}\\
00\ldots0d\ldots.
\end{cases}
\tag{6.18}
$$
where $d$ is a positive decimal digit, and $\bar{d}$ is the negation of $d$. The range of $d$ depends
on the digit set applied (e.g., $d \in [1, 5]$ for the digit set $[-5, 5]$).
The leading one in the first case of equation (6.18) is reduced by one to form a ten
in the next less significant digit, which cancels out the negative digit $\bar{d}$. The 32nd digit
of the SD result is then converted to $(10 + \bar{d})$ or $(10 + \bar{d} - 1)$, depending only on the
values of the digits in the same position and in less significant positions. In the second
case, the 0-sequence on the right side of the leading one is converted to a 9-sequence to
cancel out the first negative digit on its right side, in the same manner as in the first
case. In the latter two cases, the most significant positive digit is converted to $d$ or
$d - 1$, depending on the less significant digits. Furthermore, the most significant positive
digit guarantees that no extra borrow is propagated to its left side. Hence, the conversion
of an SD digit depends only on its own sign and on the less significant digits. Consequently,
in all of the cases the 33rd digit is always zero in the result in BCD format. The details of
the conversion algorithm that correctly generates the result in BCD format are discussed in
section 6.
The 16 partial products can be divided into four groups, to each of which a 4-operand SD
adder is applied. Subsequently, the 4 results of the first level of 4-operand SD adders are
summed up by a 4-operand SD adder in the second level. In the middle two columns of the
partial product array, there are 17 partial products, which potentially leads to a complicated
design. In the proposed algorithm, the 17th operand is recoded into four subtle numbers (i.e.,
$h'$ and $h''$) as shown in Fig. 6.7(c), which are issued to the four SD adders in the first
level. The maximum number of operands for the first level of SD adders is shown in the dashed
circle. Note that the increment signals $PI_i$ for all digits in one partial product are
identical. The dataflow of the PPR algorithm is given in Fig. 6.7(d). As shown in Fig. 6.7(d),
two levels of SD adders are applied in the proposed PPR unit. Furthermore, as shown in the
next section, the multi-operand adder that processes four $p$ and one $h'$ has the same
complexity as the adder for four $p$.
Multi-Operand SD Addition Algorithm
In the proposed multiplier, the $n + 1$ partial products are encoded in the SD digit-set
$[-8, 8]$ with a 4-bit 2's complement number and a 1-bit increment. The partial product
reduction is in fact a multi-operand SD addition. Although, in principle, the result of the
multi-operand SD addition could be in the same digit-set as the input operands, the result of
the SD addition is kept in $[-8, 7]$, in which the 1-bit increment signal is removed, to reduce
the number of internal wires. An SD addition can be summarized in three steps: adding the
operands to get the position sum $ps_i$; extracting the transfer digit $t_{i+1}$ and obtaining the
interim sum $w_i = ps_i - 10t_{i+1}$ (for radix 10); and computing the final sum $s_i = w_i + t_i$.
A two-operand SD adder could be applied as the basic element of the PPR unit, with the
position sum corrected (i.e., $ps_i - 10t_{i+1}$) in each addition. However, the correcting
operation is not needed immediately, and it can be postponed to reduce the delay and area of
the PPR. Table 6.3 shows the cases for multiple operands in the SD addition. The range of
$ps_i$ limits the selection of $t_i$, and the range of $w_i$ cannot be decreased indefinitely,
because it must cover all the digits in a decimal range $[0, 9]$. Table 6.3 shows that as the
range of $ps_i$ increases, the ranges of $t_i$ and $s_i$ increase. To restrict the range of $s_i$ to
$[-8, 7]$, the maximum number of operands in $[-8, 8]$ is four.
Table 6.3: Analysis of the number of operands of SD addition
#Op. | Range of ps_i | Range of t_i | Range of w_i | Range of s_i
2 | [-16, 16] | [-1, 1] | [-6, 6] | [-7, 7]
3 | [-24, 24] | [-2, 2] | [-5, 5] | [-7, 7]
4 | [-32, 32] | [-3, 3] | [-5, 4] | [-8, 7]
5 | [-40, 40] | [-4, 4] | [-5, 4] | [-9, 8]
If $t_i$ and $w_i$ are in the ranges $[-3, 3]$ and $[-5, 4]$ in the proposed algorithm, the
maximum range of the position sum $ps_i$ can reach $[-35, 34]$. The extra range beyond
$[-32, 32]$ (i.e., the sum of four numbers in $[-8, 8]$) implies that the number of operands in
$[-8, 8]$ that can be added lies between 4 and 5. In Fig. 6.7(c), the 17th operand,
$h \in [-4, 6]$, is recoded into four parts, and the maximum range of the subtle numbers (i.e.,
$h'$ and $h''$) is $[-2, 2]$. Thus, it is possible to add four operands together with a subtle
number without overflow of the number system. The process of the SD addition according to the
proposed number system is listed in Table 6.4. In the proposed SD addition, the operands are
summed up with binary arithmetic. To perform the decimal correction, a recoder that maps the
binary position sum $ps$ ($ps'$) to the decimal transfer digit $t$ ($t'$) and interim sum $w$ ($w'$)
is applied in each level of SD addition.
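The recoding step can be illustrated with a short Python sketch. It models the recoder arithmetically, with hypothetical names `recode` and `sd_multi_add`; the floor-division rule used to pick $t$ is an assumption that yields $w \in [-5, 4]$, and the binary carry-save reduction of the actual design is replaced by an ordinary integer sum.

```python
def recode(ps):
    """Split a position sum into (t, w) with ps = 10*t + w and w in [-5, 4]."""
    t = (ps + 5) // 10          # floor division keeps w = ps - 10*t in [-5, 4]
    return t, ps - 10 * t


def sd_multi_add(operands):
    """Add several SD numbers (equal-length digit lists, LSD first) in one level.

    Returns the digit list of the sum, LSD first, with one extra leading digit.
    """
    out, t_in = [], 0
    for i in range(len(operands[0])):
        ps = sum(op[i] for op in operands)   # position sum ps_i
        t_out, w = recode(ps)                # transfer t_{i+1} and interim sum w_i
        out.append(w + t_in)                 # s_i = w_i + t_i
        t_in = t_out
    out.append(t_in)                         # most significant transfer digit
    return out
```

For every $ps \in [-35, 34]$ the recoder returns $t \in [-3, 3]$, so four digits in $[-8, 8]$ plus one digit in $[-2, 2]$ (at most $\pm 34$) give final sum digits within $[-8, 7]$.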
Since signed digit operands are involved in the multi-operand addition, the addition
algorithm for the weighted bit-set (WBS) encoding proposed in [78] is applied and extended to
multiple operands and multiple bit-widths in our algorithm. In Fig. 6.8, the proposed two
levels of SD additions are illustrated in the dot notation proposed in [78]. In Fig. 6.8, a
white circle represents a binary bit with negative weight, namely a negabit, and a black
circle represents a binary bit with positive weight, namely a posibit. Additionally, the
carry save half adder, full adder, and 4:2 compressor are represented by dashed rectangles
with 2, 3, and 4 circles, respectively. The solid line, solid double line, and bold solid line
represent one level of carry save arithmetic units, a carry lookahead adder, and a recoder,
respectively.
Table 6.4: Proposed SD addition algorithm
Addition step                                  Operation                  Ranges of the SD addition operands
level1-step1: sum the partial products         ps = p + p + p + p + h'    p: 4 x [-8, 8]; h': [-2, 2]; ps: [-34, 34]
level1-step2: generate t_{i+1} and w_i         s_i = t_i + w_i            w: [-5, 4]; t: [-3, 3]; s: [-8, 7]
level2-step1: sum the four SD results          ps' = s + s + s + s        s: 4 x [-8, 7]; ps': [-32, 28]
level2-step2: generate t'_{i+1} and w'_i       s'_i = t'_i + w'_i         w': [-5, 4]; t': [-3, 3]; s': [-8, 7]
As shown in Fig. 6.8, the transfer digits and interim sums from the first level of SD addition are summed directly in the second level of SD addition to avoid the delay of a carry lookahead adder that would add $w$ and $t$. Therefore, step 2 of the first level of addition and step 1 of the second level of addition in Table 6.4 are merged. Furthermore, to reduce the number of arithmetic units in the hardware implementation, the sign bits of the operands (i.e., $h'$ and $P_I$) are not extended. Thus the position sum $ps$ ($ps'$) is given in hybrid posibit-negabit encoding. For example, the third bit and sixth bit of $ps$ have negative weights $-2^2$ and $-2^5$. Note that in Fig. 6.8(a), the increment signal $P_I$ for each digit is summed by a binary counter to reduce the number of operands in the least significant bit of each SD adder. Such a counter can be applied right after the Radix-10 operand recoder of the operand Y, so it does not affect the critical path. Additionally, since the increment bits for all digits in a partial product are identical, the number of counters can be minimized.
Figure 6.8: Dot notation of the proposed two levels of multi-operand SD additions. (a) First level. (b) Second level.
The recoder from hybrid posibit-negabit encoded binary to signed-digit decimal is a one-to-one mapping and can be implemented in combinational logic. A segment of the map, in binary bits, used to recode $ps_i$ and $ps'_i$ is given in Table 6.5. As shown in Fig. 6.8, $ps$ is represented in hybrid posibit-negabit encoding, and the negatively weighted bits are placed at the third and sixth binary positions. Thus, in the recoder, an input of "1100010" (34) generates "011" (3) as $t$ and "0100" (4) as $w$.
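As a cross-check of the first-level entries, the mapping can be modelled behaviourally. The sketch below is an assumption-laden illustration rather than the gate-level recoder: it takes the seven bits most significant first, gives the third and sixth bits the weights -4 and -32 as stated in the text, and reuses a floor-based split (an assumed rule) to keep $w$ in [-5, 4]. All names are hypothetical.

```python
# Bit weights, most significant bit first; the sixth and third bits are negabits.
WEIGHTS = (64, -32, 16, 8, -4, 2, 1)

def decode_hybrid(bits):
    """Value of a 7-bit hybrid posibit-negabit string such as '1100010'."""
    return sum(weight * int(bit) for weight, bit in zip(WEIGHTS, bits))

def twos_complement(number, width):
    """Two's complement bit string of `number` in `width` bits."""
    return format(number & ((1 << width) - 1), "0{}b".format(width))

def recode_first_level(bits):
    """Return (t, w) as 3-bit and 4-bit two's complement strings."""
    ps = decode_hybrid(bits)
    t = (ps + 5) // 10          # assumed rule that keeps w in [-5, 4]
    w = ps - 10 * t
    return twos_complement(t, 3), twos_complement(w, 4)
```

With these weights the model reproduces the example in the text: "1100010" decodes to 34 and yields "011" and "0100". The second-level recoder uses a different bit weighting, which this sketch does not attempt to model.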
Table 6.5: Proposed transfer digit and interim sum recoder
Recoder in 1st-level SD adder         Recoder in 2nd-level SD adder
ps_i        t_{i+1}   w_i             ps'_i       t'_{i+1}   w'_i
"1100010"   "011"     "0100"          "1111100"   "011"      "1110"
"1100001"   "011"     "0011"          "1111011"   "011"      "1101"
"1100000"   "011"     "0010"          "1111010"   "011"      "1100"
"1100111"   "011"     "0001"          "1111001"   "011"      "1011"
"1100110"   "011"     "0000"          "1111000"   "010"      "0100"
"1100101"   "011"     "1111"          "1110111"   "010"      "0011"
"1100100"   "011"     "1110"          "1110110"   "010"      "0010"
"0011011"   "011"     "1101"          "1110101"   "010"      "0001"
"0011010"   "011"     "1100"          "1110100"   "010"      "0000"
"0011001"   "011"     "1011"          "1110011"   "010"      "1111"
"0011000"   "010"     "0100"          "1110010"   "010"      "1110"
...         ...       ...             ...         ...        ...
"0100110"   "101"     "1100"          "0100000"   "101"      "1110"
Hardware Implementation and Delay Model of the Proposed PPR
As shown in Fig. 6.8(a), the maximum number of operand bits of the first-level SD adder is six, which can be reduced to one carry-sum pair by 3 levels of binary full adders (FA) and half adders (HA), as shown in Fig. 6.9. In the WBS adder, inverters are placed on the inputs or outputs of the traditional arithmetic units, such as a full adder. As shown in [79], [80], and [78], the inverters between the arithmetic units can be canceled. The remaining inverters at the input and output of the calculation unit can be absorbed by the preceding logic. For example, the inverters of the negabits $p^3_{i \ldots i+3}$ can be removed by implementing the XOR gates at the output port of the PPG with inverted logic (i.e., as XNOR gates). To save delay on the critical path, the transfer digit $t_i$ and the interim sum $w_i$ generated by the first level of SD adders are kept separate rather than being added. Additionally, the eight internal parameters are added by two levels of binary 4:2 compressors, as shown in Fig. 6.10. Since the recoder inside the PPR unit is a simple one-to-one mapping from the inputs to the outputs, the recoders described in Table 6.5 are built directly from combinational logic gates. Note that, except for the two middle columns of operands shown in Fig. 6.7(b), all other columns can be reduced with elements no more complicated than the adders on the critical path shown in Fig. 6.9 and Fig. 6.10. For example, 12 operands can be reduced by four 3-operand SD adders on the first level and one 4-operand SD adder on the second level. Thus the area of the PPR is potentially reduced. Finally, a segment of the top level architecture of the SD adders in the PPR unit for a 16 x 16-digit multiplier is given in Fig. 6.11.
Figure 6.9: Hardware structure of the proposed 1st level multi-operand SD adder
In addition, the different column structures of the PPR unit make the result signals of different digits of the PPR available at different times. To analyze the delay of each digit of the PPR output, the number of equivalent binary full adders in the modules on the critical path of each column is listed in Table 6.6. We assume that the binary 4:2 compressor has a delay of 1.5 binary full adders (BFAs) and that the binary 5:2 compressor has a delay equal to that of 2 BFAs [77]. According to the delay analysis, we assume that the 3-bit and 4-bit carry lookahead adders (CLAs) have delays of 1 BFA and 1.25 BFAs on the critical paths that pass through $ps^4_i$ and ${ps'}^4_i$, respectively. The delay of the combinational recoders, obtained by delay analysis, is also expressed in terms of the 3:2 BFA. Thus, a rough estimate of the delay of each digit of the PPR can be obtained in terms of equivalent binary full adders. In Table 6.6, the delay from connected neighboring columns is considered. Additionally, since the latency to generate each partial product in the PPG and the delay of the CLA that adds $w'$ and $t'$ are almost the same for every digit, the influence of the delay of the PPG stage and the final CLA is not considered in Table 6.6.
Table 6.6: Delay analysis of each digit of the proposed partial product reduction
Column   Bin 3:2   Bin 4:2*   Bin 5:2*   3-bit CLA*   4-bit CLA*   Recoders*   Equivalent BFAs
31       1         1.5        -          2            -            2.5         7
30       1         1.5        -          2            -            2.5         7
29       -         3          -          1            1.25         2.75        8
28       3         1.5        -          1            1.25         3           9.75
27       3         1.5        -          1            1.25         3           9.75
26       3         1.5        -          1            1.25         3           9.75
25       3         1.5        -          1            1.25         3           9.75
24       -         3          2          -            2.5          3           10.5
23       -         4.5        -          -            2.5          3           10
22       -         4.5        -          -            2.5          3           10
21       -         3          2          -            2.5          3           10.5
20       -         3          2          -            2.5          3           10.5
19       -         3          2          -            2.5          3.25        10.75
18       -         3          2          -            2.5          3.25        10.75
17       3         3          -          1            1.25         3.5         11.75
16       3         3          -          1            1.25         3.5         11.75
15       -         3          2          -            2.5          3.25        10.75
14       -         3          2          -            2.5          3.25        10.75
13       -         3          2          -            2.5          3           10.5
12       -         3          2          -            2.5          3           10.5
11       -         4.5        -          -            2.5          3           10
10       -         4.5        -          -            2.5          3           10
9        -         4.5        -          -            2.5          3           10
8        3         -          2          1            1.25         3           10.25
7        -         1.5        2          1            1.25         2.5         8.25
6        -         1.5        2          1            1.25         2.5         8.25
5        -         3          -          1            1.25         2.75        8
4        -         1.5        -          -            2.5          2.5         6.5
3        -         -          2          -            1.25         1.5         4.75
2        -         1.5        -          -            1.25         1.5         4.25
1        1         -          -          1            -            1.25        3.25
0        -         -          -          -            -            1           1
* The delay is represented in the number of equivalent BFAs.
Figure 6.10: Hardware structure of the proposed 2nd level multi-operand SD adder
For a 34 x 34-digit multiplication, at most 35 partial products must be reduced. The 35 partial products can be divided into three groups (i.e., two groups of 17 partial products and one extra partial product). For the two groups of 17 partial products, the proposed structure can be applied to obtain two SD results. Thus one more level of 3-operand SD adders is applied on the critical path to reduce the two SD results of the 17 : 2 SD additions and the extra partial product.
Figure 6.11: Top level architecture of the proposed partial product reduction unit
6.2.3 SD-BCD Conversion
The partial products in a signed digit-set can be reduced to one SD result with the multi-operand SD adders. Unlike in other works, a digit-set converter is proposed to convert the SD result back into the conventional BCD encoding. Moreover, a hybrid carry propagation network for this SD-BCD conversion algorithm is discussed in detail.
SD-BCD Conversion Algorithm
In the proposed multiplier, the 2n-digit result of the PPR is in digit-set [-8, 7]. If a digit is negative, a borrow (i.e., negative carry) occurs. To convert it back to the digit set [0, 9] in BCD encoding, the negative digit is increased by 10, and the first non-zero digit with higher weight is reduced by one. All the consecutive zeros between the current negative digit and the first non-zero digit on its left side are converted to 9. An example is provided below:
$$(1\,0\,0\,\bar{4}\,\bar{8}\,0\,2\,\bar{3})_{SD} = (09952017)_{BCD}$$
Thus, to convert the SD result into BCD encoding, the negative digits (which generate the negative carry) and the zero digits (which propagate the negative carry) need to be detected. Furthermore, a carry propagation network and the necessary logic are applied to determine and convert the SD digits into BCD encoding. The conversion algorithm is shown as follows:
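Algorithm 6.2.3: SD to BCD Encoding Conversion
Data: SD number S.
Result: BCD number R.
1. Detect the borrow generation bit ($G_i$) and propagation bit ($P_i$) for each digit of S.
$$G_i = \begin{cases} 1 & \text{if } S_i < 0 \\ 0 & \text{otherwise,} \end{cases} \qquad P_i = \begin{cases} 1 & \text{if } S_i = 0 \\ 0 & \text{otherwise,} \end{cases}$$
$$G_{i:j} = \begin{cases} G_i & \text{if } i = j \\ G_i + (P_i \cdot G_{i-1:j}) & \text{if } i > j, \end{cases} \qquad P_{i:j} = \begin{cases} P_i & \text{if } i = j \\ P_i \cdot P_{i-1:j} & \text{if } i > j. \end{cases} \tag{6.19}$$
2. Compute the negative carry $C_i$ of S ($C_0 = 0$).
$$C_{i+1} = G_{i:j} + (P_{i:j} \cdot C_j). \tag{6.20}$$
3. Convert the result S to BCD encoding.
$$R_i = \begin{cases} S_i & \text{if } C_{i+1}C_i = 00 \\ S_i - 1 & \text{if } C_{i+1}C_i = 01 \\ S_i + 10 & \text{if } C_{i+1}C_i = 10 \\ S_i + 9 & \text{if } C_{i+1}C_i = 11 \end{cases} \tag{6.21}$$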
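The three steps of Algorithm 6.2.3 can be traced with a short behavioural model. The Python sketch below evaluates the carries serially (digit by digit) instead of with a prefix network, so it illustrates the arithmetic only; the function name and the least-significant-digit-first list convention are assumptions of this sketch.

```python
def sd_to_bcd(digits):
    """Convert SD digits in [-8, 7] (least significant first) to digits in [0, 9].

    Implements equations (6.19) to (6.21) with a serial carry chain.
    """
    carry = 0                                   # C_0 = 0
    result = []
    for s in digits:
        g = 1 if s < 0 else 0                   # borrow generation bit G_i
        p = 1 if s == 0 else 0                  # borrow propagation bit P_i
        carry_out = g | (p & carry)             # C_{i+1} = G_i + P_i * C_i
        if (carry_out, carry) == (0, 0):
            r = s
        elif (carry_out, carry) == (0, 1):
            r = s - 1
        elif (carry_out, carry) == (1, 0):
            r = s + 10
        else:
            r = s + 9
        result.append(r)
        carry = carry_out
    if carry:
        raise ValueError("the SD number is negative")
    return result

# The SD number 1 0 0 -4 -8 0 2 -3, written least significant digit first.
example = [-3, 2, 0, -8, -4, 0, 0, 1]
converted = sd_to_bcd(example)                  # digits of 09952017, LSD first
```

Reading `converted` from its last element to its first gives 0 9 9 5 2 0 1 7, which matches the worked example for the SD number with digits 1, 0, 0, -4, -8, 0, 2, -3.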
Hardware Implementation of the Converter
In Algorithm 6.2.3, the first step is to detect the negative and zero digits. Since in the proposed multiplier the outputs of the PPR can be added into an SD number in 4-bit two's complement encoding, negative detection is simply a check of the fourth bit. To detect a zero digit in two's complement encoding, all four bits are needed. Since, inside the 4-bit CLA that sums the final transfer digit $t'$ and interim sum $w'$, the result bits of a digit become available at different times, zero detection can be achieved with only one extra OR gate on the critical path by connecting three OR gates in cascade, as shown in Fig. 6.12.
Figure 6.12: Simplified 4-bit CLA and G, P generation circuit
For the traditional method in the carry propagation step, a $\lceil \log(an) \rceil$-level prefix network is applied to quickly generate the final carry. The parameter $a$ depends on the processing scope (e.g., in [75] the proposed quaternary tree unit works at the bit level, so a $\lceil \log(4n) \rceil$-level prefix tree is applied). No matter how many levels the prefix tree has, the critical path passes through all levels of internal nodes. On the other hand, in the PPR stage, the longest path is potentially in the middle columns of the partial product array, and the remaining columns have shorter paths. This implies that the digits of the final product that are close to the least and most significant digits are available earlier and can be processed before the digits in the middle part of the partial product array are ready. In section 5.3, a delay model for each digit of the final product is shown in Table 6.6. According to the estimated delay, the 32-digit SD result is divided into five groups: $gp_0 = \{S_{11}, \ldots, S_0\}$, $gp_1 = \{S_{15}, \ldots, S_{12}\}$, $gp_2 = \{S_{17}, S_{16}\}$, $gp_3 = \{S_{21}, \ldots, S_{18}\}$, and $gp_4 = \{S_{31}, \ldots, S_{22}\}$. For each group, a small traditional carry propagation tree is applied. Thus the well-optimized prefix tree circuits for binary designs can be reused. The carry propagation process is described by the following equations:
$$C_i = \begin{cases} 0 & \text{if } i = 0 \\ G_{i-1:0} & \text{if } 12 \ge i \ge 1 \\ G_{i-1:12} + P_{i-1:12} \cdot C_{12} & \text{if } 16 \ge i \ge 13 \\ G_{i-1:16} + P_{i-1:16} \cdot C_{16} & \text{if } 18 \ge i \ge 17 \\ G_{i-1:18} + P_{i-1:18} \cdot C_{18} & \text{if } 22 \ge i \ge 19 \\ G_{i-1:22} + P_{i-1:22} \cdot C_{22} & \text{if } 32 \ge i \ge 23 \end{cases} \tag{6.22}$$
where $C_i$ is the carry-in of the $i$-th digit; note that the carry-in to the least significant digit is always zero.
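To see that the grouped formulation of equation (6.22) yields the same carries as a plain digit-by-digit chain, the two can be compared in a small model. The Python sketch below is a functional check under stated assumptions, not a model of the prefix-tree topology or its timing: the group signals are evaluated with a simple loop, and all function names are hypothetical.

```python
import random

GROUP_STARTS = (0, 12, 16, 18, 22)    # lowest digit index of gp0 .. gp4
WIDTH = 32                            # number of SD digits

def generate_propagate(digits):
    g = [1 if s < 0 else 0 for s in digits]
    p = [1 if s == 0 else 0 for s in digits]
    return g, p

def group_gp(g, p, i, j):
    """Group signals G_{i:j} and P_{i:j} of equation (6.19), computed iteratively."""
    big_g, big_p = g[j], p[j]
    for k in range(j + 1, i + 1):
        big_g = g[k] | (p[k] & big_g)
        big_p = p[k] & big_p
    return big_g, big_p

def grouped_carries(digits):
    """Carries C_0 .. C_32 following the case split of equation (6.22)."""
    g, p = generate_propagate(digits)
    carries = [0] * (WIDTH + 1)
    bounds = GROUP_STARTS + (WIDTH,)
    for start, end in zip(bounds, bounds[1:]):
        for i in range(start + 1, end + 1):
            big_g, big_p = group_gp(g, p, i - 1, start)
            carries[i] = big_g | (big_p & carries[start])
    return carries

def ripple_carries(digits):
    """Reference: C_{i+1} = G_i + P_i * C_i evaluated digit by digit."""
    g, p = generate_propagate(digits)
    carries = [0]
    for gi, pi in zip(g, p):
        carries.append(gi | (pi & carries[-1]))
    return carries

def check_equivalence(trials=200, seed=0):
    rng = random.Random(seed)
    for _ in range(trials):
        digits = [rng.choice((-8, -1, 0, 0, 0, 3, 7)) for _ in range(WIDTH)]
        assert grouped_carries(digits) == ripple_carries(digits)
    return True
```

The random digits are drawn with many zeros on purpose, so that long borrow-propagation runs crossing the group boundaries are exercised.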
In Fig. 6.13, a detailed structure of the proposed prefix network is shown. The white dot represents the logic that creates the generation bit $G_i$ and propagation bit $P_i$ for each digit. The black dot represents the logic that creates the group generation bits $G_{i:j}$ and group propagation bits $P_{i:j}$ described in equation (6.19). For the lower 12 digits, a Ladner-Fischer network is applied to minimize the number of levels and the area cost. Since the carry-in to the least significant digit is always zero, the carry-in to the 13th digit equals $G_{11:0}$. For the digits from $S_{12}$ to $S_{15}$, a two-level Ladner-Fischer network is used to create the group-carry-in generation and propagation signals, $G_{15:12}$ and $P_{15:12}$. To further calculate the carry, only an AND-OR gate is needed. For carries $C_{18}$ and $C_{17}$, a 2-bit carry look-ahead structure is used. In the higher 14 digits, the same technique as in the lower 16 digits is used. Note that, to reduce the fanout of gates from low-weight inputs to high-weight outputs, a Han-Carlson network is applied to calculate the group-carry propagation and generation signals. In the 16-digit multiplication, at least 5 levels of internal nodes are on the critical path in a conventional method. In the proposed architecture, about 3 levels of nodes are connected after the outputs of the middle columns of the partial product array, and the number of node levels after the most significant columns is kept at 5. Although the number of prefix tree node levels connected to the less significant columns may be greater than five, the shorter delay of those columns can counterbalance the delay of the nodes in the prefix network. Note that the architecture of the hybrid prefix network depends heavily on the structure of the PPR. An improved structure would provide better performance if the PPR structure is changed.
Figure 6.13: Proposed hybrid prefix network in the SD-BCD converter
Figure 6.14: Final conditional constant adder
In the third step of Algorithm 6.2.3, the BCD-encoded result produced by the conditional adder is selected by the carry signals of two neighboring digits of S. Since the correction constants (i.e., "0000", "1111", "1010", and "1001") for the four different carry cases are fixed, the correction can be designed as a conditional constant addition, which can be simplified considerably. Fig. 6.14 shows the circuit of a one-digit conditional constant adder, which consists of three constant adders and a combinational selector.
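The effect of the four correction constants can be checked with 4-bit modular arithmetic. The following Python sketch is a behavioural illustration under the assumption that each SD digit is held in 4-bit two's complement and that the adder wraps modulo 16; the dictionary and function names are hypothetical.

```python
# Correction constant for each carry pair (C_{i+1}, C_i), from equation (6.21):
# 0000 adds 0, 1111 adds -1, 1010 adds 10, and 1001 adds 9 (all modulo 16).
CORRECTION = {(0, 0): 0b0000, (0, 1): 0b1111, (1, 0): 0b1010, (1, 1): 0b1001}

def convert_digit(s, carry_out, carry_in):
    """Select the corrected BCD digit for an SD digit s in [-8, 7]."""
    encoded = s & 0xF                              # 4-bit two's complement of s
    # All candidates are formed in parallel, as the constant adders would do.
    candidates = {pair: (encoded + constant) & 0xF
                  for pair, constant in CORRECTION.items()}
    return candidates[(carry_out, carry_in)]       # the selector picks one of them

def check_all_digits():
    """Exhaustively compare against R_i = S_i - C_i + 10 * C_{i+1}."""
    for s in range(-8, 8):
        for carry_in in (0, 1):
            carry_out = 1 if (s < 0 or (s == 0 and carry_in)) else 0
            expected = s - carry_in + 10 * carry_out
            assert 0 <= expected <= 9
            assert convert_digit(s, carry_out, carry_in) == expected
    return True
```

Forming all candidates before the carries arrive mirrors the idea behind the selector: only the final multiplexing has to wait for the carry signals.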
6.3 Sequential Decimal Fixed-point Multiplication
In contrast to the parallel multiplication described in section 6.1, sequential multiplication has the advantage of area efficiency. Thus, if hardware cost is the more critical constraint, the sequential design can be applied in the DFMA to achieve a different balance between cost and performance.
6.3.1 Signed Digit Partial Product Generation
The PPG of the proposed multiplier is based on the generation of easy-multiples of the multiplicand; thus, the required easy-multiples have to be determined. The representation of the multiplier Y plays a pivotal role in selecting the appropriate easy-multiples. Consequently, digit-set [-4,5] is selected to represent the multiplier Y in order to reduce the number of required easy-multiples and hence the complexity of the PPG. This, however, calls for a recoder to convert the multiplier from digit-set [0,9] to [-4,5]. The recoder is implemented based on equation (6.23), where $y^c_i + y^s_i$ constitutes the $i$-th digit of the multiplier Y in the [-4,5] digit-set.
$$\begin{cases} y^s_i = y_i,\; y^c_{i+1} = 0 & \text{if } y_i \le 5 \\ y^s_i = y_i - 10,\; y^c_{i+1} = 1 & \text{if } y_i > 5 \end{cases} \tag{6.23}$$
Given that the recoded multiplier needs to be ready iteratively (one digit per iteration), the carries in equation (6.23) ($y^c_{i+1}$) are stored in a latch and used in the next iteration, as shown in Fig. 6.15.
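The iterative recoding with a latched carry can be mimicked in a few lines. The Python sketch below is only one possible reading: it applies the threshold of equation (6.23) to the digit plus the latched carry, which is an assumption made here so that every recoded digit stays within [-4, 5]; the equation itself states the threshold on $y_i$ alone. The function names are hypothetical.

```python
def recode_multiplier(bcd_digits):
    """Recode digits in [0, 9] (least significant first) into digits in [-4, 5].

    One digit is produced per loop iteration; `carry` plays the role of the latch.
    """
    carry = 0
    recoded = []
    for y in bcd_digits:
        v = y + carry              # assumption: the latched carry is added first
        if v <= 5:
            recoded.append(v)
            carry = 0
        else:
            recoded.append(v - 10)
            carry = 1
    recoded.append(carry)          # a final carry becomes one extra digit
    return recoded

def value(digits):
    """Integer value of a digit list (least significant digit first)."""
    return sum(d * 10 ** i for i, d in enumerate(digits))
```

For instance, the multiplier 95 (digits 5 and 9, least significant first) is recoded to the digits 5, -1, 1, that is 5 - 10 + 100.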
Figure 6.15: Recoding of the multiplier
Given the digit-set of the multiplier, i.e., [-4,5], computing X, 2X, and 4X as easy-multiples is sufficient for generating a partial product as a sum of two decimal numbers (i.e., $P_i = U_i + V_i$). It should be noted that the addition $U_i + V_i$ is actually performed in the PPA step. Finally, combinational logic is required to select the appropriate easy-multiples based on the value of the multiplier's digit. Table 6.7 describes the selection rules for generating $U_i$ and $V_i$.
Table 6.7: Selection of the easy-multiples
y_i   -4    -3    -2    -1    0    1    2    3    4    5
U_i   0     1X    0     1X    0    1X   0    1X   0    1X
V_i   -4X   -4X   -2X   -2X   0    0    2X   2X   4X   4X
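The selection rule of Table 6.7 has a compact arithmetic form: $U_i$ carries the odd part of the multiplier digit and $V_i$ the remaining even part. The Python sketch below expresses this with integer coefficients; it is an illustration of the table, not the selector logic, and the function names are hypothetical.

```python
def select_multiples(y):
    """Return coefficients (u, v) with U_i = u * X and V_i = v * X for y in [-4, 5]."""
    if not -4 <= y <= 5:
        raise ValueError("multiplier digit outside [-4, 5]")
    u = y % 2            # 1 for odd digits; Python's % also gives 1 for negative odd y
    v = y - u            # one of -4, -2, 0, 2, 4
    return u, v

def partial_product(x, y):
    """Partial product y * X formed as U_i + V_i."""
    u, v = select_multiples(y)
    return u * x + v * x
```

For example, the digit -3 selects 1X and -4X, and the digit 5 selects 1X and 4X, in agreement with the table.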
With the intention of reducing the complexity of the PPG, the easy-multiples are generated in the encodings shown in Table 6.8, thereby simplifying the carry-free addition of the PPA step (see details in Section 6.3.2). In particular, easy-multiple X is kept in BCD, and 2X and 4X are encoded into digit-set [-6,6] and represented in signed-digit two's complement. In this approach, first, each digit (e.g., the $i$-th) is divided into a transfer $t_{i+1}$ and a sum $w_i$ (as shown in Table 6.8); next, $w_i + t_i$ generates the converted $i$-th digit.
Table 6.8: Conversion from BCD to the specific digit set
        2X                 4X
X_i     t_{i+1}   w_i      t_{i+1}   w_i
0       0         0        0         0
1       0         2        1         -6
2       1         -6       1         -2
3       1         -4       1         2
4       1         -2       2         -4
5       1         0        2         0
6       1         2        3         -6
7       1         4        3         -2
8       2         -4       3         2
9       2         -2       4         -4
According to Table 6.8, the generation of the easy-multiples 2X and 4X is performed via logical expressions similar to equations (6.13) and (6.16).
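The entries of Table 6.8 can be exercised directly as lookup tables. The Python sketch below is a table-driven model rather than the logical expressions used in hardware: it forms each converted digit as $w_i + t_i$ and appends the last transfer as an extra leading digit. The dictionary and function names are hypothetical.

```python
# (transfer t_{i+1}, interim sum w_i) for each BCD digit, copied from Table 6.8.
TABLE_2X = {0: (0, 0), 1: (0, 2), 2: (1, -6), 3: (1, -4), 4: (1, -2),
            5: (1, 0), 6: (1, 2), 7: (1, 4), 8: (2, -4), 9: (2, -2)}
TABLE_4X = {0: (0, 0), 1: (1, -6), 2: (1, -2), 3: (1, 2), 4: (2, -4),
            5: (2, 0), 6: (3, -6), 7: (3, -2), 8: (3, 2), 9: (4, -4)}

def easy_multiple(bcd_digits, table):
    """Digits of 2X or 4X in [-6, 6], least significant digit first."""
    transfers = [table[x][0] for x in bcd_digits]
    interims = [table[x][1] for x in bcd_digits]
    digits = [interims[0]]
    for i in range(1, len(bcd_digits)):
        digits.append(interims[i] + transfers[i - 1])   # w_i + t_i
    digits.append(transfers[-1])                        # transfer out of the top digit
    return digits

def value(digits):
    """Integer value of a digit list (least significant digit first)."""
    return sum(d * 10 ** i for i, d in enumerate(digits))
```

Since each output digit depends only on two neighboring input digits, both multiples are produced without any carry propagation.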
Given the symmetric signed-digit two's complement representation of 2X and 4X, the -2X and -4X multiples are generated through a simple per-digit two's complement. However, the two's complement operation is partially deferred until the PPA step. The overall architecture of the proposed PPG is illustrated in Fig. 6.16, where $c_i$ is stored for the two's complement operation (per digit) performed in the PPA step.
Figure 6.16: The proposed partial product generation
6.3.2 Partial Product Accumulation
Partial product accumulation adds the generated partial product (i.e., $U_i + V_i + C_i$, according to Section 6.3.1) to the previously accumulated products $P[i]$. This is captured by the recurrence equation (6.24), where $C_i$ is the word-wide extension of $c_i$ (one bit $c_i$ per digit).
$$P[i+1] = 0.1 \times P[i] + U_i + V_i + C_i \tag{6.24}$$
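The recurrence can be followed numerically with exact rational arithmetic. The Python sketch below is a value-level illustration only: it assumes the multiplier digits are already recoded into [-4, 5], and it folds the increment $C_i$ into $V_i$, since negation is exact in this integer model. The function name is hypothetical.

```python
from fractions import Fraction

def sequential_multiply(x, y_digits):
    """Iterate P[i+1] = 0.1 * P[i] + U_i + V_i over the multiplier digits.

    y_digits holds signed digits in [-4, 5], least significant first.
    After m digits the accumulator equals X * Y / 10**(m - 1).
    """
    p = Fraction(0)
    for y in y_digits:
        u = (y % 2) * x                 # U_i: 0 or 1X
        v = (y - y % 2) * x             # V_i: -4X, -2X, 0, 2X or 4X
        p = p / 10 + u + v              # one accumulation step
    return p

# Usage: the digits 5, 2, -3 (most significant first) represent Y = 517.
x = 1234
y_digits = [-3, 2, 5]
y = sum(d * 10 ** i for i, d in enumerate(y_digits))
product = sequential_multiply(x, y_digits) * 10 ** (len(y_digits) - 1)
```

The division by ten in every step corresponds to the one-digit right shift of the accumulated product, so the final scaling by a power of ten only restores the position of the decimal point.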
With the intention of reducing the latency of the PPA step, a multi-operand redundant adder can be used to implement equation (6.24), where $P[i]$ and $P[i+1]$ are represented in a carry-save format. Figures 6.17 and 6.18 illustrate the dot notation and the circuitry of the multi-operand redundant addition used in the proposed PPA, where the (4:2) compressors marked with an asterisk are simplified ones.
Figure 6.17: The dot-notation of partial product accumulation (digit-slice)
Figure 6.18: The circuitry of partial product accumulation (digit-slice)
Finally, after n iterations, the generated product P[n + 1] should be converted to the standard BCD format. This conversion is performed iteratively (one digit per iteration) for the lower part of the product $P_L$ (based on Table 6.9), and in parallel (in two cycles) for the higher part $P_H$.
The parallel conversion consists of two main parts, with the following duties.
Part I: Preparing the generate and propagate signals (i.e., g and p) to be used by the parallel prefix tree in Part II. Moreover, a 4-bit carry-look-ahead adder (CLA) is responsible for generating the appropriate digit value.
Part II: A parallel prefix tree computes the carry of each digit position; then combinational logic (based on Table 6.9) produces the final converted product.
Fig. 6.19 depicts the architecture of the proposed parallel conversion.
Table 6.9: Iterative Conversion
Digit in   Carry in   Digit out   Carry out      Digit in   Carry in   Digit out   Carry out
4          0          4           0              4          -1         3           0
3          0          3           0              3          -1         2           0
2          0          2           0              2          -1         1           0
1          0          1           0              1          -1         0           0
0          0          0           0              0          -1         9           -1
-1         0          9           -1             -1         -1         8           -1
-2         0          8           -1             -2         -1         7           -1
-3         0          7           -1             -3         -1         6           -1
-4         0          6           -1             -4         -1         5           -1
-5         0          5           -1             -5         -1         4           -1
In a nutshell, the whole architecture of the proposed sequential multiplier (including the PPG and PPA) is shown in Fig. 6.20, where concatenating $P_L$ and $P_H$ produces the final product.
Figure 6.19: The proposed parallel conversion
6.4 Decimal Floating-point FMA
The top level architecture of the proposed DFMA is shown in Fig. 6.21. After the operand decoder, the two significands CX and CY are fed into the multiplier array. Meanwhile, the alignment shift is performed in parallel with the multiplication. Subsequently, a decimal carry-free adder sums the redundant product and the addend, which is inverted according to the effective operation. With the internal redundant number system, the carry propagation in the final digit-set converter of the multiplier array is eliminated. Moreover, a simpler leading zero decision algorithm can be applied to the carry-free result. The propagation in the decimal addition is therefore eliminated before the rounding position is obtained. In the post-alignment shifter, the digits that exceed the required precision are shifted out, and the (n+1)-digit result is sent to the final rounder. In the final rounding unit, the absolute value conversion, the digit-set conversion, and the rounding operation are performed at the same time. A detailed structure of the proposed DFMA for the Decimal64 format is given in Fig. 6.22. The characteristics and process of the proposed FMA computation are described as follows.
1. In the multiplier array, the multiplication structure proposed in [96] is exploited. Since the redundant intermediate product is used further in the following units, the final digit-set conversion proposed in [96], which involves a carry propagation, is no longer needed. A (2n+1)-digit product in digit-set [-8, 7] is therefore retained.
2. In the meantime, the pre-alignment of the addend, which is performed in parallel with the multiplier array, is no longer on the critical path. First, the exponent difference between the product and the addend is obtained by binary prefix tree adders. Subsequently, the addend is shifted right or left depending on the sign and absolute value of the exponent difference. Since the product consists of 2n+1 redundant digits, the shifting range of the alignment is extended to 4n+2 digits to guarantee the required precision and rounding information of the final result. Finally, XOR gates are applied to negate the shifted addend for effective subtraction.
3. In the addition module, the nonspeculative decimal adder proposed in [97] is modified to add two operands in [-8, 7] and [-9, 9] and produce a result in [-8, 7]. Since the number of digits to shift in post-alignment is detected from the redundant result, carry propagation is no longer necessary.
Figure 6.20: The proposed sequential decimal multiplier
Figure 6.21: Proposed architecture
Figure 6.22: Details of structure
Figure 6.23: Details of calculation
4. The post-alignment unit shifts the intermediate result after the addition to achieve the preferred exponent and guarantee the required precision. Since the digits (radix - 1) (i.e., 9 in this radix-10 system) are not used, long-term cancellation does not occur. Consequently, leading zero detection for the proposed digit-set is simple. However, the sticky digit is harder to determine than in other architectures, since the shifted-out digits may represent a negative value. A method is introduced to obtain the sticky bits, which represent a signed sticky digit, with almost the same delay as the post-alignment shifting.
5. The result from the post-alignment shifter can be positive or negative, and the digit-set is redundant. Therefore, in a straightforward method, the absolute value of the non-redundant result has to be obtained before the rounding decision is made. More than one carry propagation might be involved in this process. In this work, an algorithm is described that negates, converts, and rounds the redundant intermediate result into the BCD format with one long-term carry propagation and constant-delay logic.
To illustrate the principle of the computation in the proposed FMA, an example is given in Fig. 6.23. Once all the operands are ready, CX x CY is first computed in the multiplier array, and the redundant Product is obtained. In the meantime, the exponent difference is calculated by two adders. If EZ - (EA + EB) is positive, left shifting is active; otherwise right shifting is active and selected. Since the effective operation EOP is subtractive, the shifted addend is further negated by the XOR gates after the multiplexer array to obtain the two's complement of every digit. The missing increment of one for every digit is therefore sent to the adder through EOP. Note that the negated addend digit a actually means -10 (i.e., -9 if the increment 1 in EOP is included). However, in hardware only 4 bits are used for each digit, and a is represented as "0110" without the fifth bit. After alignment, the intermediate exponent Exp1 = 1 is calculated by adding the right shift amount to EZ or subtracting the left shift amount from it, and then subtracting 16 to move the decimal point. Subsequently, the two operands from the alignment unit and the multiplier are added, and a redundant result is obtained in Sum1. At the same time, the right shift amount for the post-alignment is calculated. Since the result Sum1 has more than 16 significant digits, which exceeds the required precision, only a 16-digit significand, a 1-digit rounding digit, and a 2-bit sticky digit are retained in Sum2 after the shifter. Due to the post-alignment, the intermediate exponent Exp2 is updated by adding Rsa2 to Exp1. In the final rounding unit, the rounding digit 1 and the positive Sticky2 cause a zero increment in the least significant digit (LSD) of the significand. Consequently, the negative carry C for converting the digit-set, taking the rounding increment into account, is obtained. Finally, the absolute value of the rounded result in the conventional digit-set is obtained by adding the correction value of the 10's complement, determined by C, to the negated Sum2 obtained from the XOR gates.
To illustrate the differences between our proposed design and previous designs at the top level, the simplified architectures of three designs are given in Fig. 6.24. The core architecture of the proposed DFMA is partitioned into five sub-modules: the multiplication, pre-alignment, addition, post-alignment, and rounding units. Since the major work on the multiplier and adder has already been presented in previous sections, in this section these two basic computations are simplified by two models, which create an (n+m+1)-digit result for an n-digit x m-digit redundant multiplication and a (k+1)-digit result for an n-digit + m-digit redundant addition, where k = max{n, m}. The remaining components of the design, which include addend alignment, the decision of the rounding position, and rounding of the redundant result with direct conversion, are described in detail in this section, including algorithms and hardware structures.
6.4.1 Pre-Alignment
In the proposed DFMA, the pre-alignment block operates in parallel with the multiplier array.
Therefore the pre-alignment shifting is performed only on the addend CZ, and the product is
kept in its position while CZ is shifted. In principle, the pre-alignment algorithm shifts the
operand so that the following addition operates on two operands that have the same exponent,
or, once the shifting range is too large, guarantees that the result of the addition still equals
what it is supposed to be. Since the decimal operand is not normalized, the number of significant
digits of the non-zero product obtained after the multiplier can be from 1 to 2n + 1. Thus,
to guarantee the precision and the correct rounding digit, the necessary shifting width of
the addend can be 4n + 2 digits, which is decided by two extreme cases. In the first case,
the least significant digit of the addend is shifted 3n digits to the left, and n digits of precision
are therefore guaranteed on the left of the product. Note that if more digits are required to be
shifted to the left, the first digit lower than the 16 most significant digits is implied to be zero.
In the second case, the most significant digit of the addend is shifted 2n digits to the right. If
the exponent difference between the product and the addend is larger than the necessary range of
alignment shifting, the extra digits beyond the necessary shifting range do not affect the
correctness of the final result. Additionally, the decimal point is shifted n digits to the right to
guarantee that the final result is an integer.
Figure 6.24: Decimal floating-point fused multiply-add architectures: (a) architecture proposed
in [102]; (b) architecture proposed in [95]; (c) proposed architecture.
Figure 6.25: Left and right shifting range of the pre-alignment (relative to the product, the
addend CZ is shifted left by [0, 2n+LZD(CZ)+1] digits or right by [0, 2n] digits).
If the product is zero, the result should be numerically equal to the addend. However,
the preferred exponent defined in the standard should still be achieved. If EP ≥ EC,
the absolute value of the result is exactly equal to the addend, which has the smaller exponent. If
EP < EC, the significand of Z has to be shifted to the left to reduce the exponent of the addend.
The final result will be normalized to obtain the maximum possible number of significant digits or
the minimum possible exponent, which is close to EP. If the addend is zero, the precision and
preferred exponent of the final result are guaranteed by the digits in the product and by the
post-alignment algorithm, regardless of the shifting direction and shift amount of the zero addend.
In the first case, the final result can be determined directly, and in the latter two cases, the
computing process follows the pre-alignment rule analyzed in the previous paragraph. The
post-alignment algorithm that guarantees the rounding position and the preferred exponent is
described in the next section. The shifting range used to align the two operands is shown in Fig. 6.25.
In the proposed architecture, the pre-alignment algorithm is divided into four cases, which
are 1) left shifting with overflow, 2) left shifting without overflow, 3) right shifting without
overflow, and 4) right shifting with overflow. In case 1), the exponent of the addend is
larger than the exponent of the product, and the difference is larger than the maximum left
shifting amount. In this case, the addend is shifted 2n+1+LZD(CZ) digits to the left, and the OV
signal is set to indicate that a left shifting overflow has occurred. In case 2), the exponent
of the addend is larger than the exponent of the product, but the difference does not exceed
the maximum left shift amount. Hence, the exponent difference is used as the left
shifting amount (Lsa1). In the latter two cases, the exponent of the product is larger than
the exponent of the addend; thus, right shifting is performed. The mathematical description
of the pre-alignment algorithm is given in Algorithm 6.4.1.
Algorithm 6.4.1: Pre-alignment algorithm
Data: EX, EY, EZ, CZ.
Result: Left and right shift amounts Lsa1, Rsa1; overflow signal OV.
    if (2n + 1 + LZD(CZ) < EZ − EP) then
        Lsa1 = 2n + 1 + LZD(CZ);
        OV = "10";
    else if (0 ≤ EZ − EP ≤ 2n + 1 + LZD(CZ)) then
        Lsa1 = EZ − EP;
        OV = "00";
    else if (0 < EP − EZ ≤ 2n) then
        Rsa1 = EP − EZ;
        OV = "00";
    else if (2n < EP − EZ) then
        Rsa1 = 2n;
        OV = "01";
    end
where LZD() means the leading zero detection function, and EP = EX + EY.
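The four cases of Algorithm 6.4.1 can be sketched in software as follows. This is only a behavioural illustration under stated assumptions, not the hardware described in this section: exponents are treated as plain (unbiased) integers, the coefficient CZ is a non-negative integer with at most n digits, and the helper names lzd and pre_align are hypothetical.

```python
def lzd(cz: int, n: int = 16) -> int:
    """Count the leading zero digits of an n-digit decimal coefficient."""
    return n - len(str(cz)) if cz > 0 else n


def pre_align(ex: int, ey: int, ez: int, cz: int, n: int = 16):
    """Return (Lsa1, Rsa1, OV) following the four cases of Algorithm 6.4.1."""
    ep = ex + ey                         # exponent of the product
    max_left = 2 * n + 1 + lzd(cz, n)    # largest useful left shift
    if ez - ep > max_left:               # case 1: left shift with overflow
        return max_left, 0, "10"
    if ez - ep >= 0:                     # case 2: left shift without overflow
        return ez - ep, 0, "00"
    if ep - ez <= 2 * n:                 # case 3: right shift without overflow
        return 0, ep - ez, "00"
    return 0, 2 * n, "01"                # case 4: right shift with overflow


# A 3-digit addend has 13 leading zeros, so the left shift saturates at 33 + 13 = 46.
print(pre_align(0, 0, 100, 123))
```

The unused shift amount is returned as 0 in each case, which is a convention of this sketch; the algorithm itself leaves it unspecified.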
The hardware implementation of the proposed pre-alignment unit is depicted in Fig. 6.26(a).
The left and right shifting amounts Lsa1 and Rsa1 are calculated simultaneously by two
binary prefix tree adders. Since the maximum shift amounts to the right and to the left are
constant, only the lower bits of the results from the two carry-propagate adders are fed into the
shifters. To reduce the timing delay, the number of leading zeros in the addend, LZD(CZ), is not
considered when obtaining Lsa1 before the left shifter. Instead, the addend without the leading
zeros (CZwolz) is created by a separate shifter and selected as the most significant digits if left
overflow occurs (i.e. OV = "10"). To select the correct shifted addend from the shifters, a
selection signal generator and a multiplexer array are placed beside and after the shifters.
The selection signal generator decides the actual shifting direction and selects the correct
shifted addend from the results of the three shifters. The selection signal can easily be derived
from the sign of EZ − EP and the overflow signal OV.
Figure 6.26: Architecture of the pre-alignment: (a) hardware implementation of the pre-alignment;
(b) hardware implementation of a simplified left shifter in the pre-alignment.
Since the widths of the input and output of the two shifters in Fig. 6.26(a) are different,
it is possible to refine the structure and reduce the hardware cost of the shifter. In
Fig. 6.26(b), a simplified model of the proposed left shifter is shown, which shifts a one-bit
input x to the left. Since the lower bits of the result are obtained earlier than the higher bits in
the binary adder, the multiplexers for shifting fewer digits are placed at the top of the shifter.
The right shifter has a symmetrical structure. In contrast to the original shifter, which has the
same width on both input and output, the refined shifter saves about 37% of the multiplexers.
Table 6.10: Selection algorithm of the shifted addend
{Sign1(a), OV}   CZsh.S3(b)   CZsh.S2    CZsh.S1    CZsh.S0
000              CZlsh.S3     CZlsh.S2   CZlsh.S1   0
001              x            x          x          x
010              CZwolz       0          0          0
011              x            x          x          x
100              0            0          CZrsh.S1   CZrsh.S0
101              0            0          0          0
110              x            x          x          x
111              x            x          x          x
(a) Sign1 = the sign of (EZ − EP)
(b) S.S3 = S{65:49}; S.S2 = S{48:33}; S.S1 = S{32:17}; S.S0 = S{16:0}
After the pre-alignment, the layout of the aligned addend and product and the shifted
decimal point are shown in Fig. 6.27. The exponent after pre-alignment, Exp1 (i.e. the
exponent of the result after the carry-free adder), is adjusted by subtracting 16. Some
signals that are used off the critical path can be calculated in the pre-alignment unit
as well. The equations of these signals are given below:
\[
EOP = S_X \oplus S_Y \oplus S_Z \tag{6.25}
\]
\[
Exp1 =
\begin{cases}
E_C - LZD(CZ) - 49 & \text{if } OV = \text{``10''} \\
E_P - 16 & \text{otherwise}
\end{cases}
\tag{6.26}
\]
\[
Sticky1 =
\begin{cases}
\text{``00''} & \text{if } RSHOR(CZ) = 0 \\
\text{``11''} & \text{if } RSHOR(CZ) \neq 0 \text{ and } EOP = 1 \\
\text{``01''} & \text{otherwise}
\end{cases}
\tag{6.27}
\]
where RSHOR() means the bit-by-bit OR of all digits shifted out to the right of CZrsh.
In equation (6.26), if left shifting overflow occurs (i.e. OV = "10"), the real left shifting
amount is only 2n + 1 + LZD(CZ) digits. In this case, Exp1 is adjusted by the real left
shifting amount and the shifting of the decimal point.
Figure 6.27: Layout of the aligned product and addend
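Equations (6.25) to (6.27) can be mirrored by a short behavioural sketch. The function names are hypothetical, the constants 16 and 49 are taken from the equations as printed (they correspond to n = 16), and the shifted-out digits are assumed to be available as a list of integers.

```python
def effective_op(sx: int, sy: int, sz: int) -> int:
    """EOP of equation (6.25): XOR of the three operand sign bits."""
    return sx ^ sy ^ sz


def exp1(ov: str, ep: int, ec: int, lzd_cz: int) -> int:
    """Exponent after pre-alignment, equation (6.26)."""
    if ov == "10":                       # left shift overflow
        return ec - lzd_cz - 49
    return ep - 16


def sticky1(shifted_out: list, eop: int) -> str:
    """Two-bit sticky of the digits shifted out to the right, equation (6.27)."""
    if not any(shifted_out):             # RSHOR(CZ) = 0: nothing non-zero was lost
        return "00"
    return "11" if eop == 1 else "01"


# With EC = 100 and 13 leading zeros, the saturated left shift of 46 digits plus the
# 16-digit decimal point shift gives 100 - 46 - 16 = 38.
print(exp1("10", 0, 100, 13))
```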
6.4.2 Post-Alignment and Sticky Bits Generation
Post-Alignment Shifting Amount Decision
The result from the adder may have a large number of significant digits, exceeding the
required precision. Thus a post-alignment unit is applied to decide a proper exponent and
truncate the significand to fit the required precision. Moreover, if the result is inexact,
enough information has to be kept to decide the increment to the least significant digit
in the following rounding unit. In the decimal floating-point standard, neither the input
operands nor the output result are normalized. Therefore, the post-alignment processing
in decimal floating-point is more complicated than the normalization processing in
binary floating-point. The most difficult problem is to decide whether the preferred exponent can
be achieved. In a conventional method, the leading zero anticipation algorithm needs
to detect the position of the most significant one and detect the decimal cancellation in parallel.
For example, if the significand is "19̄9̄...9̄234..." (an overbar marks a negative digit), the
sequence of 9̄ after the leading 1 will have to be canceled in the final result, and the width of
the significand is reduced accordingly. In the proposed number system, the digits 9 and 9̄ do not
exist. Therefore, the cancellation only causes a one-digit error on the basis of the conventional
leading one detection algorithm. For example, "18̄...123..." will be converted to "02...123...".
Algorithm 6.4.2: Analysis of the shifting direction and range to achieve the preferred exponent
    if (EP ≥ EC) then
        /* right shift addend */
        Exp1 = EP − 16;
        Expp = EC;
        DIFFpre = EC − EP + 16;
        DIFFpre = −DIFFabs + 16;
        DIFFpre ≤ 16;
    else
        /* left shift addend */
        if (OV = 0) then
            Exp1 = EP − 16;
            Expp = EP;
            DIFFpre = EP − EP + 16;
            DIFFpre = 16;
        else
            Exp1 = EC − LZD(CZ) − 49;
            Expp = EP;
            DIFFpre = EP − EC + LZD(CZ) + 49;
            DIFFpre = −DIFFabs + LZD(CZ) + 49;
            DIFFpre ≤ 16;
        end
    end
    DIFFpre = Expp − Exp1;
    DIFFabs = ABS(EP − EC);
    Expp = MAX(−398, MIN(EP, EC));
ABS() means the absolute value function.
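The one-digit error caused by a leading 1 followed by a negative digit can be checked numerically with a small sketch. It simply evaluates a radix-10 signed-digit string and prints the value in the conventional digit set with the same width; the function names are hypothetical, and negative digits are written as negative integers instead of overbarred digits.

```python
def sd_value(digits):
    """Value of a radix-10 signed-digit string, most significant digit first."""
    value = 0
    for d in digits:
        value = 10 * value + d
    return value


def to_conventional(digits):
    """Same value written with conventional digits, padded to the same width."""
    value = sd_value(digits)
    sign = "-" if value < 0 else ""
    return sign + str(abs(value)).zfill(len(digits))


# A leading 1 followed by -8 collapses to a leading 0 followed by 2,
# so the leading non-zero position moves one digit to the right.
print(to_conventional([1, -8, 1, 2, 3]))   # 02123
print(to_conventional([1, 2, 1, 2, 3]))    # 12123
```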
Algorithm 6.4.3: Post-alignment algorithm
    LOP′ = LOP + 1 − needcorrect;
    if (LOP′ − TZD ≤ 16 and TZD − DIFFpre ≥ 0) then
        if (LOP′ − DIFFpre ≤ 16) then
            /* case 1 */
            Rsa2 = DIFFpre;
        else if (LOP′ − DIFFpre > 16) then
            /* case 2 */
            Rsa2 = LOP′ − 16;
        end
    else if (LOP′ − TZD ≤ 16 and TZD − DIFFpre < 0) then
        /* case 3 */
        Rsa2 = TZD;
    else if (LOP′ − TZD > 16 or DIFFpre < 0) then
        /* case 4 or 5 */
        Rsa2 = LOP′ − 16;
    end
To figure out the cases of the shifting in post-alignment, the exponent of the temporary
result of the addition and the preferred exponent are analyzed in Algorithm 6.4.2. First of
all, two parameters are created. DIFFpre is defined as the difference between the preferred
exponent and the temporary exponent. If DIFFpre is larger than zero, it gives the number of
right shifting digits necessary to achieve the preferred exponent. DIFFabs is defined as the
absolute value of the difference between the exponents of the product and the addend. In the first
case, the addend is right shifted and DIFFpre ≤ 16. If 0 ≤ DIFFpre ≤ 16, the number of
right shifting digits in post-alignment depends on the significant digits between the leading and
trailing zeros. If DIFFpre < 0, the temporary result would have to be shifted to the left to achieve
the preferred exponent. But after moving the decimal point, the significant digits are always enough
to guarantee the required precision. Therefore, in this case the preferred exponent cannot be
achieved, and the number of right shifting digits depends on the most significant non-zero digit.
In the second case, where DIFFpre = 16, the analysis is similar to the first case. In the third
case, LZD(CZ) + 49 is the maximum possible number of left shifting digits in the proposed
pre-alignment. Since left overflow happens in this case, DIFFabs is always larger than
LZD(CZ) + 33 according to Algorithm 6.4.1. Thus DIFFpre is less than 16, and the analysis of
shifting is similar to the first case.
In Figures 6.28(a-e), the post-alignment algorithm is illustrated in five cases, and the
mathematical description is given in Algorithm 6.4.3. In the first two cases, the result can
be exactly represented in 16 digits. Therefore, the preferred exponent might be reached. In
case 1, the number of significant digits, which is equal to the difference between the leading
one position (LOP) and the number of trailing zeros (TZD), is less than 16 digits, and the
difference between the temporary exponent and the expected exponent (DIFFpre) is smaller
than the number of trailing zeros in the redundant result. Note that LOP′ means the
real position of the leading non-zero digit, which is obtained by correcting LOP. After
shifting the result of the addition to the right by DIFFpre digits, the exponent reaches the
preferred one, and the result still keeps all the significant digits. In case 2, after shifting
by DIFFpre digits, not all the significant digits fit within the 16-digit precision, and more
digits need to be shifted. In this case, only the most significant 16 digits are retained. In
case 3, DIFFpre is greater than the number of trailing zeros. Thus, the maximum right shifting
amount can only be TZD in order to keep all the significant digits. The preferred exponent cannot
be reached in this case, and the adjusted exponent (Exp2) is less than and closest to the preferred
exponent. In the previous three cases, the result is exact. In case 4, the number of significant
digits is larger than 16. Hence, the preferred exponent cannot be reached and the final result is
inexact. In case 5, if the preferred exponent is smaller than the temporary exponent, the
preferred exponent cannot be reached, since the LOP is at least 16 digits. Therefore,
left shifting is not possible in the proposed architecture. Consequently, any digits beyond the
required precision are shifted out.
Figure 6.28: Post-alignment shift amount decision: (a) Case 1, Rsa2 = DIFFpre; (b) Case 2,
Rsa2 = LOP′ − 16; (c) Case 3, Rsa2 = TZD; (d) Case 4, Rsa2 = LOP′ − 16; (e) Case 5,
Rsa2 = LOP′ − 16.
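The case analysis of Algorithm 6.4.3 can be traced with a small behavioural sketch. It assumes that LOP, TZD and DIFFpre are already available as integers, that the corrected position is LOP′ = LOP + 1 − needcorrect, and that the precision is 16 digits; the function name and the returned case label are hypothetical additions for illustration.

```python
def post_align_shift(lop: int, tzd: int, diff_pre: int,
                     need_correct: int = 0, precision: int = 16):
    """Return (Rsa2, case) following the branches of Algorithm 6.4.3."""
    lop_c = lop + 1 - need_correct             # corrected leading position LOP'
    if lop_c - tzd <= precision and tzd - diff_pre >= 0:
        if lop_c - diff_pre <= precision:
            return diff_pre, 1                 # case 1: preferred exponent reached
        return lop_c - precision, 2            # case 2: keep the top 16 digits
    if lop_c - tzd <= precision:               # here tzd - diff_pre < 0
        return tzd, 3                          # case 3: shift out trailing zeros only
    return lop_c - precision, 4                # case 4 or 5: result is truncated


print(post_align_shift(19, 10, 5))   # (5, 1)
print(post_align_shift(19, 6, 10))   # (6, 3)
```

The sketch relies on the observation in the text that the leading position is at least 16 digits, so the returned shift amount is never negative for inputs that can occur in the architecture.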
Figure 6.29: Detailed structure of the post-alignment shift amount calculation
The hardware implementation is shown in Fig. 6.29. The data path is divided into two
branches, which are selected by detecting the one-digit error of the leading one position
in Sum1. The left path covers the five cases if the possible one-digit error does not exist.
Otherwise, the right path is selected. There are three blocks above the shifting amount
decision unit. The leading one detector (LOD), which generates the position of the leading
non-zero digit minus one (LOP′ − 1), is similar to the detector applied in binary designs.
The trailing zero detector (TZD) similarly generates the number of trailing zeros. The
correction detector (CD) is introduced in the next section.
Leading One Position Correction
To detect the possible one-digit error on the leading one position in Sum1, two cases have to be
recognized. As shown in Table 6.11, if the pattern of Sum1 is z^k 1 z^l n(x) or z^k 1̄ z^l p(x), the
leading one or leading minus one will be converted to zero in the rounding unit. The position
of the leading non-zero digit is therefore reduced by one. To detect these two patterns,
a binary tree structure is applied for both positive and negative Sum1. The principle of
the correction detector is similar to the algorithm described in [98]. The difference is that
the result of the radix-10 signed-digit subtraction is more complicated than that of the
radix-2 signed-digit subtraction proposed in [98]. In Table 6.12 and equations (6.28) and
(6.29), the algorithm and logic of the basic node of the detection tree for both positive and
negative Sum1 are given. Similar to the logic proposed in [98], x is represented by setting
all the output signals to zero. A simplified hardware structure for positive Sum1 is shown
in Fig. 6.30. In the last level, only the correction signal y is needed, and the final correction
signal is the OR of the two correction signals for Sum1 > 0 and Sum1 < 0. A leading zero
anticipation algorithm for binary redundant encodings can be found in [104].
Table 6.11: Scenarios of one digit error on leading one position
Sign of Sum1   Sum1 string pattern   No. of LZ   Example
Sum1 > 0       z^k 1 z^l p(x)        k           0...012...
               z^k 1 z^l n(x)        k + 1       0...012̄...
               z^k p+(x)             k           0...022...
Sum1 < 0       z^k 1̄ z^l n(x)        k           0...01̄2̄...
               z^k 1̄ z^l p(x)        k + 1       0...01̄2...
               z^k n−(x)             k           0...02̄2̄...
z: (s = 0); p: (s > 0); n: (s < 0); p+: (s > 1); n−: (s < −1);
k ≥ 0; l ≥ 0; x: don't care
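\[
Sum1 > 0 \Rightarrow
\begin{cases}
p^{+} = p^{+}_{l} \cdot z_{r} + z_{l} \cdot p^{+}_{r} \\
p^{o} = z_{l} \cdot p^{o}_{r} + p^{o}_{l} \cdot z_{r} \\
z = z_{l} \cdot z_{r} \\
n = n_{l} + z_{l} \cdot n_{r} \\
y = y_{l} + z_{l} \cdot y_{r} + p^{o}_{l} \cdot n_{r}
\end{cases}
\tag{6.28}
\]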
Table 6.12: Node functions for the positive and negative detection trees
Sum1 > 0
                  right branch
left branch    p+    po    z     n     x     y
p+             x     x     p+    x     x     x
po             x     x     po    y     x     x
z              p+    po    z     n     x     y
n              n     n     n     n     n     n
x              x     x     x     x     x     x
y              y     y     y     y     y     y
Sum1 < 0
                  right branch
left branch    n−    no    z     p     x     y
n−             x     x     n−    x     x     x
no             x     x     no    y     x     x
z              n−    no    z     p     x     y
p              p     p     p     p     p     p
x              x     x     x     x     x     x
y              y     y     y     y     y     y
po: (s = 1); no: (s = −1); y: need correction
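\[
Sum1 < 0 \Rightarrow
\begin{cases}
n^{-} = n^{-}_{l} \cdot z_{r} + z_{l} \cdot n^{-}_{r} \\
n^{o} = z_{l} \cdot n^{o}_{r} + n^{o}_{l} \cdot z_{r} \\
z = z_{l} \cdot z_{r} \\
p = p_{l} + z_{l} \cdot p_{r} \\
y = y_{l} + z_{l} \cdot y_{r} + n^{o}_{l} \cdot p_{r}
\end{cases}
\tag{6.29}
\]
where s_l means the signal s comes from the left branch, and s_r means the signal s comes from the
right branch; "·" means logic AND, and "+" means logic OR.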
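The node function of equation (6.28) can be exercised in software to see how the tree detects the pattern "zeros, a single 1, zeros, then a negative digit". The sketch below is a behavioural model for positive Sum1 only (the tree for negative Sum1 is its mirror image); the dictionary keys, the function names and the pairwise reduction order are assumptions made for illustration and do not describe the actual gate-level structure.

```python
def leaf(d: int) -> dict:
    """Per-digit input signals: p+ (d > 1), po (d == 1), z (d == 0), n (d < 0)."""
    return {"pp": d > 1, "po": d == 1, "z": d == 0, "n": d < 0, "y": False}


def node(l: dict, r: dict) -> dict:
    """Basic node of the detection tree for positive Sum1, equation (6.28)."""
    return {
        "pp": (l["pp"] and r["z"]) or (l["z"] and r["pp"]),
        "po": (l["z"] and r["po"]) or (l["po"] and r["z"]),
        "z": l["z"] and r["z"],
        "n": l["n"] or (l["z"] and r["n"]),
        "y": l["y"] or (l["z"] and r["y"]) or (l["po"] and r["n"]),
    }


def need_correct(digits) -> bool:
    """Reduce a non-empty digit list (most significant first) pairwise, level by level."""
    nodes = [leaf(d) for d in digits]
    while len(nodes) > 1:
        paired = [node(nodes[i], nodes[i + 1]) for i in range(0, len(nodes) - 1, 2)]
        if len(nodes) % 2:               # an odd node passes through to the next level
            paired.append(nodes[-1])
        nodes = paired
    return nodes[0]["y"]


print(need_correct([0, 1, 0, -2, 5]))   # True: the leading 1 is followed by a negative digit
print(need_correct([0, 1, 2]))          # False: the leading 1 is followed by a positive digit
```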
Figure 6.30: Hardware structure of the correction detection unit
To decide the post-alignment shifting amount Rsa2, some intermediate signals have to be
generated first. As shown in Fig. 6.29, the intermediate signals are generated
and fed into three detection units. Afterwards, the corresponding variables are obtained,
and the shifting amount is decided. All the intermediate signals have already been
introduced in Table 6.11 and Table 6.12. These seven signals, z, p, n, p+, n−, po and no, are
directly generated from Sum1.
Sticky Bits Generation
Since the digit-set of the result Sum1 is redundant in [−8, 7], if right shifting is applied in
post-alignment, the shifted out digits can be positive, zero or negative. For example, after
right shifting, the shifted out digits after the rounding digit of the string "123...90...56.54̄3̄2̄..."
are "4̄3̄2̄...". The rounding digit 5 after the point is therefore reduced by one due to the negative
sticky digit. Consequently, the rounding process is more complicated than the one in the
conventional architecture. To correctly round the result, a signed sticky with two bits has to
be detected.
If the signal p_i is set to 1, the value formed by the i-th digit and the digits to its right is
larger than zero, i.e. positive, and if z_i is set to 1, the value formed by the i-th digit and the
digits to its right equals zero. If both signals are 0, the value is negative. For example, the p
and z signals are set to "0011..." and "0000..." for the string "03̄21...". The detection algorithm
is similar to the carry propagation process, in which a positive (negative) digit is propagated to
the left by another positive (negative) digit or a zero digit and killed by a negative (positive)
digit. A prefix tree structure is applied to generate the sticky signals. The logic of a basic cell
of the prefix tree is given in equation (6.30). Once the signals p and z are ready, the correct
sticky digit can be obtained by right shifting them by Rsa2 digits. Therefore, the right shifter
for obtaining the sticky only has 2 bits at its output.
\[
\begin{aligned}
p &= p_{l} + z_{l} \cdot p_{r} \\
z &= z_{l} \cdot z_{r}
\end{aligned}
\tag{6.30}
\]
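The cell of equation (6.30) can be applied serially from the least significant digit to obtain the p and z signals of every position. The sketch below is a sequential stand-in for the prefix tree, not the tree itself; the function name is hypothetical and negative digits are written as negative integers.

```python
def sticky_signals(digits):
    """Return (p, z), where p[i] and z[i] describe the value of digits[i:].

    digits are radix-10 signed digits, most significant first.
    """
    p_acc, z_acc = False, True           # an empty suffix has the value zero
    p = [0] * len(digits)
    z = [0] * len(digits)
    for i in range(len(digits) - 1, -1, -1):
        p_l, z_l = digits[i] > 0, digits[i] == 0
        p_acc = p_l or (z_l and p_acc)   # p = p_l + z_l . p_r
        z_acc = z_l and z_acc            # z = z_l . z_r
        p[i], z[i] = int(p_acc), int(z_acc)
    return p, z


# The string 0, -3, 2, 1 reproduces the signals "0011" and "0000" quoted in the text.
print(sticky_signals([0, -3, 2, 1]))
```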
6.4.3 Rounding
In the proposed DFMA, the redundant result which is shifted in the post-alignment unit
is sent to the rounding block to obtain the final rounded result in the conventional BCD
encoding. Since the digits of the result Sum2 can be positive or negative, two major problems
emerge in the proposed system. First, the digits shifted out of the rounding digit can be
positive or negative. In some rounding cases, this might affect the value of the rounding digit
and hence the rounded result. For example, "ssss...ssss51234..." and "ssss...ssss51̄2̄3̄4̄..."
have the same 16 most significant digits (ssss...ssss) and the same rounding digit (5),
but different digits on the right side (1234... and 1̄2̄3̄4̄...). In the Ties-to-Away mode,
the first result will be rounded to "ssss...ssss + 1", and the latter one will be rounded to
"ssss...ssss". Second, the final result can be a negative number, which needs to be inverted
first. In this case, all the considerations of the rounding algorithm are based on the
negation of the current result. For example, if "ssss...ssss51234..." is negative, it is negated to
"s̄s̄s̄s̄...s̄s̄s̄s̄5̄1̄2̄3̄4̄...", and rounded to "s̄s̄s̄s̄...s̄s̄s̄s̄ − 1" in the Ties-to-Away rounding mode.
However, the simplicity of the proposed number system should be noted. In the proposed
digit-set, there is no positive carry propagation at all. The only consideration is the negative
carry propagation, or borrow propagation. The final result is divided into two parts (i.e. the
most significant 15 digits and the least significant digit). In the most significant 15 digits,
the only processes are the negation of the negative result and the negative carry propagation
for digit-set conversion. On the other hand, the least significant digit can be incremented by "1"
or retained according to the rounding increment.
The rounding algorithm is therefore divided into two steps: obtaining the rounding increment,
and converting the result with the consideration of the rounding increment. An algorithm
which generates the increment accordingly is given in Tables 6.13-6.15. Suppose a positive
final result with a negative intermediate sum "1̄000...0003̄.51234..." creates an increment
"+1" in Ties-to-Away mode. Since the intermediate sum is less than zero, it is negated
to "1000...0003.5̄1̄2̄3̄4̄...", and the increment is negated to "−1". The final result is
therefore rounded and converted to "1000...0002". The complicated computation is
encapsulated in the following conversion algorithm. The signal SignF means the sign of the final
result, and its logic is optimized as "Sign2 ⊕ (SX ⊕ SY)".
The top level structure of the proposed rounder is given in Fig. 6.31. Since the digit-set
of the post-aligned sum is [−8, 7], after adding the increment from the rounding digit, no
positive carry will be generated from Sum2lsd. Instead, a negative carry might be generated
(e.g. "1200...0003̄" or "1200...0000" with increment −1). Therefore, the propagation bits and
generation bits of the negative carry for the 15 most significant digits are generated first by a
prefix tree structure. In the meantime, the possible negative carry (NClsd) from Sum2lsd is
generated as shown in equation (6.31). Once the negative carry C for all the 16 digits of CR
is obtained, the digit-set conversion with absolute value conversion is performed
as given in equations (6.32) and (6.33). The correction value Cor2 is simply obtained by
the nine's complement conversion algorithm with borrow consideration. Since the LSD of
CR will be adjusted by the rounding increment, the correction value for the LSD is different
from that of the other 15 digits. If the intermediate result Sum2 is less than zero, the XOR gates
are necessary to negate the digits. Note that, in the digit-set conversion algorithm, the least
significant bit of the negative carry signal C is equal to the sign of Sum2. Therefore,
if Sign2 = 1, the negative carry might be propagated through Sum2{1}. Subsequently, the
negative carry for the higher 15 digits can be obtained by the equation
NC_{i:0} = g_{i:0} & (p_{i:0} | NClsd).
Table 6.13: Rounding increment generation algorithm of "TiesToAway" and "TowardPositive" modes
Ties to Away
Sign2 = 0                        | Sign2 = 1
RD(a)       SD(b)      inc       | RD          SD         inc
[6, 7]      x(c)       +1        | [6, 7]      x          −1
5           0 or 1     +1        | 5           1          −1
5           −1         0         | 5           0 or −1    0
[0, 4]      x          0         | [0, 4]      x          0
[−4, −1]    x          0         | [−4, −1]    x          0
−5          0 or 1     0         | −5          1          0
−5          −1         −1        | −5          0 or −1    +1
[−8, −6]    x          −1        | [−8, −6]    x          +1
Toward Positive
Sign2 = 0, SignF = 0             | Sign2 = 0, SignF = 1
RD          SD         inc       | RD          SD         inc
[1, 7]      x          +1        | [1, 7]      x          0
0           1          +1        | 0           1 or 0     0
0           0 or −1    0         | 0           −1         −1
[−8, −1]    x          0         | [−8, −1]    x          −1
Sign2 = 1, SignF = 0             | Sign2 = 1, SignF = 1
[1, 7]      x          0         | [1, 7]      x          −1
0           1 or 0     0         | 0           1          −1
0           −1         +1        | 0           0 or −1    0
[−8, −1]    x          +1        | [−8, −1]    x          0
(a) RD = Rounding Digit; (b) SD = Sticky Digit; (c) x = don't care
Table 6.14: Rounding increment generation algorithm of "TiesToEven" and "TowardNegative" modes
Ties to Even
Sign2 = 0                              | Sign2 = 1
RD          SD     LE(a)   inc         | RD          SD     LE     inc
[6, 7]      x      x       +1          | [6, 7]      x      x      −1
5           1      x       +1          | 5           1      x      −1
5           0      0       +1          | 5           0      0      −1
5           0      1       0           | 5           0      1      0
5           −1     x       0           | 5           −1     x      0
[0, 4]      x      x       0           | [0, 4]      x      x      0
[−4, −1]    x      x       0           | [−4, −1]    x      x      0
−5          1      x       0           | −5          1      x      0
−5          0      0       −1          | −5          0      0      +1
−5          0      1       0           | −5          0      1      0
−5          −1     x       −1          | −5          −1     x      +1
[−8, −6]    x      x       −1          | [−8, −6]    x      x      +1
Toward Negative
Sign2 = 0, SignF = 0             | Sign2 = 0, SignF = 1
RD          SD         inc       | RD          SD         inc
[1, 7]      x          0         | [1, 7]      x          +1
0           1 or 0     0         | 0           1          +1
0           −1         −1        | 0           0 or −1    0
[−8, −1]    x          −1        | [−8, −1]    x          0
Sign2 = 1, SignF = 0             | Sign2 = 1, SignF = 1
[1, 7]      x          −1        | [1, 7]      x          0
0           1          −1        | 0           1 or 0     0
0           0 or −1    0         | 0           −1         +1
[−8, −1]    x          0         | [−8, −1]    x          +1
(a) LE = LSD is Even
Table 6.15: Rounding increment generation algorithm of "TowardZero" mode
Toward Zero
Sign2 = 0, SignF = 0             | Sign2 = 0, SignF = 1
RD          SD         inc       | RD          SD         inc
[1, 7]      x          0         | [1, 7]      x          0
0           1 or 0     0         | 0           1 or 0     0
0           −1         −1        | 0           −1         −1
[−8, −1]    x          −1        | [−8, −1]    x          −1
Sign2 = 1, SignF = 0             | Sign2 = 1, SignF = 1
[1, 7]      x          −1        | [1, 7]      x          −1
0           1          −1        | 0           1          −1
0           0 or −1    0         | 0           0 or −1    0
[−8, −1]    x          0         | [−8, −1]    x          0
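As an illustration of how the increment tables can be read, the Ties-to-Away part of Table 6.13 can be reproduced by a short sketch. It relies on an observation about the table rather than on a rule stated in the text: the Sign2 = 1 columns coincide with the Sign2 = 0 columns evaluated on the negated rounding digit and sticky digit. The function name is hypothetical, RD is an integer in [−8, 7] and SD is −1, 0 or +1.

```python
def inc_ties_to_away(rd: int, sd: int, sign2: int) -> int:
    """Rounding increment for the Ties-to-Away mode (Table 6.13).

    rd: rounding digit in [-8, 7]; sd: sticky digit in {-1, 0, 1};
    sign2: 1 if the intermediate sum Sum2 is negative, else 0.
    """
    if sign2 == 1:
        rd, sd = -rd, -sd                # view the discarded part after negation
    # (rd, sd) compared lexicographically orders the discarded fraction:
    # (5, 0) is exactly one half, (-5, 0) is exactly minus one half.
    if (rd, sd) >= (5, 0):
        return 1
    if (rd, sd) < (-5, 0):
        return -1
    return 0


print(inc_ties_to_away(5, -1, 0))   # 0: slightly below one half
print(inc_ties_to_away(5, 1, 1))    # -1: matches the Sign2 = 1 column
```

The other rounding modes also depend on SignF (and on the LE signal for Ties-to-Even) and would need their own tables or functions.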
To clarify the rounding and conversion algorithm, the example provided in Fig. 6.23 is
considered. Since the leading positive one in Sum1 will be corrected, it is not shifted in the
17-digit Sum2, and the sign of Sum1 is positive. In the rounder, the 15-bit propagation
"000000000100000" and generation "111111001000101" signals for the negative carry of the
15 most significant digits are first created. In the Ties-to-Away rounding mode, no increment
is generated from the rounding digit 1. Subsequently, the LSD "3" creates a negative carry to
the higher digits. The 15-bit propagation and generation signals, together with the negative
carry from the LSD and the positive Sign2, create the final negative carry "11111100100010110".
Afterwards the correction signal "99999a0fa00faf9a" is further created by equations
(6.32) and (6.33). Finally, the rounded significand is obtained by adding the bit-inverted
Sum2′ = "3456563230431123" to the correction signal.
Figure 6.31: Architecture of the rounder
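\[
NC^{+1}_{lsd} =
\begin{cases}
1 & \text{if } Sum2\{1\} < -1 \text{ or } (Sum2\{1\} = -1 \;\&\; Sign2 = 1) \\
0 & \text{otherwise}
\end{cases}
\]
\[
NC^{0}_{lsd} =
\begin{cases}
1 & \text{if } Sum2\{1\} < 0 \text{ or } (Sum2\{1\} = 0 \;\&\; Sign2 = 1) \\
0 & \text{otherwise}
\end{cases}
\]
\[
NC^{-1}_{lsd} =
\begin{cases}
1 & \text{if } Sum2\{1\} < 1 \text{ or } (Sum2\{1\} = 1 \;\&\; Sign2 = 1) \\
0 & \text{otherwise}
\end{cases}
\]
\[
NC_{lsd} =
\begin{cases}
NC^{+1}_{lsd} & \text{if } RDinc = +1 \\
NC^{0}_{lsd} & \text{if } RDinc = 0 \\
NC^{-1}_{lsd} & \text{if } RDinc = -1
\end{cases}
\tag{6.31}
\]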
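The three candidate carries of equation (6.31) differ only in the threshold against which the LSD is compared, so the selection can be written compactly. The sketch below is a behavioural restatement under that reading; the function name is hypothetical, and the LSD is passed as an integer in [−8, 7].

```python
def nc_lsd(lsd: int, rd_inc: int, sign2: int) -> int:
    """Negative carry out of the LSD, equation (6.31).

    lsd: the digit Sum2{1}; rd_inc: rounding increment in {-1, 0, +1};
    sign2: 1 if Sum2 is negative, else 0.
    """
    threshold = -rd_inc                  # -1, 0 or +1 for RDinc = +1, 0 or -1
    return int(lsd < threshold or (lsd == threshold and sign2 == 1))


print(nc_lsd(0, -1, 0))   # 1: decrementing a zero LSD borrows from the next digit
print(nc_lsd(3, 0, 1))    # 0: a positive LSD without increment produces no borrow
```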
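\[
Cor2\{0\} =
\begin{cases}
1 & \text{if } C_{1:0} = 00 \;\&\; RDinc = 1, \\
10 & \text{if } C_{1:0} = 01 \;\&\; RDinc = 1, \\
11 & \text{if } C_{1:0} = 10 \;\&\; RDinc = 1, \\
0 & \text{if } C_{1:0} = 11 \;\&\; RDinc = 1, \\
0 & \text{if } C_{1:0} = 00 \;\&\; RDinc = 0, \\
11 & \text{if } C_{1:0} = 01 \;\&\; RDinc = 0, \\
10 & \text{if } C_{1:0} = 10 \;\&\; RDinc = 0, \\
1 & \text{if } C_{1:0} = 11 \;\&\; RDinc = 0, \\
-1 & \text{if } C_{1:0} = 00 \;\&\; RDinc = -1, \\
12 & \text{if } C_{1:0} = 01 \;\&\; RDinc = -1, \\
9 & \text{if } C_{1:0} = 10 \;\&\; RDinc = -1, \\
2 & \text{if } C_{1:0} = 11 \;\&\; RDinc = -1.
\end{cases}
\tag{6.32}
\]
where C = {NC[15:0], Sign2}.
\[
Cor2\{15:1\} =
\begin{cases}
0 & \text{if } C_{i+1:i} = 00 \;\&\; C_{0} = 0, \\
-1 & \text{if } C_{i+1:i} = 01 \;\&\; C_{0} = 0, \\
10 & \text{if } C_{i+1:i} = 10 \;\&\; C_{0} = 0, \\
9 & \text{if } C_{i+1:i} = 11 \;\&\; C_{0} = 0, \\
10 & \text{if } C_{i+1:i} = 00 \;\&\; C_{0} = 1, \\
11 & \text{if } C_{i+1:i} = 01 \;\&\; C_{0} = 1, \\
0 & \text{if } C_{i+1:i} = 10 \;\&\; C_{0} = 1, \\
1 & \text{if } C_{i+1:i} = 11 \;\&\; C_{0} = 1.
\end{cases}
\tag{6.33}
\]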
Chapter 7
Comparison and Discussion
In this chapter, all the proposed designs are first evaluated by synthesizing the Verilog model
with Synopsys tools. Furthermore, the differences in performance between the proposed
designs and the corresponding previous designs are illustrated and analyzed. The remaining
sections cover addition (Section 7.1), multiplication (Section 7.2), and DFMA (Section 7.3).
7.1 Decimal Fixed-point Addition
A model of the proposed decimal SD adder is implemented in VHDL, and an exhaustive test is
performed to ensure its correctness. Subsequently, the proposed design was synthesized by
Synopsys Design Compiler in STM 90 nm CMOS technology with normal case parameters
(1.2 V, 25℃).
To compare with other designs, the designs in [51], [56], [52] and [59] were also
implemented in the same technology. The implementation results, including timing delay, hardware
area, power consumption, area delay product (ADP) ratio and power delay product (PDP)
ratio, are listed in Table 7.1. In timing delay, the proposed design has at least a 34%
improvement. In terms of ADP and PDP, our design has more than 32% and 76% improvement,
respectively, compared with the referenced designs.
For further evaluation, the hardware area and power consumption under different timing
constraints of the designs in [51], [56], [52], [59] and the proposed one are shown in Fig. 7.1 and
Fig. 7.2. A smaller area and a shorter timing delay than all other works can be obtained
simultaneously once the timing constraint is larger than 0.3 ns. Since decimal computation is
currently mostly used on high performance servers [57], we focus on computation speed
rather than hardware cost, which could be improved in the future.
Table 7.1: Synthesized results and comparison of 16-digit adders
Design     Digit Set   Delay   Ratio   Area    Ratio   ADP     Power   Ratio   PDP     FW-Conv   BW-Conv
                       (ns)            (µm²)           Ratio   (mW)            Ratio   (G)       (G)
[51]       [−9, 9]     0.49    1.69    35561   2.76    4.66    57.94   3.55    5.99    0         N/A
[56]       [−9, 9]     0.51    1.76    22078   1.71    3.01    35.67   2.18    3.84    0         N/A
[52]       [−8, 9]     0.39    1.34    12654   0.98    1.32    21.39   1.31    1.76    1         2n + 8
[59]       [−7, 7]     0.45    1.55    12781   0.99    1.54    20.01   1.22    1.90    9         2n + 10
Proposed   [−9, 9]     0.29    1       12898   1       1       16.34   1       1       0         2 log n + 10
Since our design works on the digit set [ 9; 9] and the operands are encoded in two's
complement, no extra forward converter which converts the BCD inputs to the proposed
digit set is needed at all. In [52], the authors use the digit set [ 8; 9], so there is an OR gate
in the front converter mentioned in their paper. In [59], a combinational logic to generate
the correction signal and a 4-bit adder are proposed to convert BCD to RBCD encoding with
9G (i.e., level of gates) delay.
For the backward converter, which converts the digit set of the proposed design to the
conventional BCD encoding, the authors of [52] and [59] proposed two algorithms whose delay
grows linearly with the digit width of the input. Furthermore, to generate the absolute
value, those designs need additional logic to check the sign of the result and invert it
digit by digit, which is not counted in Table 7.1. The converter proposed in this thesis
generates the absolute value of the result in BCD encoding with a delay that grows
logarithmically with the digit width.
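As a behavioural illustration of what the backward converter has to compute, the following Python sketch turns a signed-digit number over [-9, 9] into a sign flag and the BCD digits of its absolute value. It is only a functional model under stated assumptions: the borrow is rippled sequentially for readability, whereas the hardware described in this thesis resolves it in logarithmic depth, and the function name and list-based digit format are hypothetical.

```python
def sd_to_bcd_abs(digits):
    """Convert signed digits (most significant first, each in [-9, 9]) into
    a pair (is_negative, BCD digits of the absolute value)."""
    if any(not -9 <= d <= 9 for d in digits):
        raise ValueError("digit outside [-9, 9]")
    # With digits limited to [-9, 9], the sign of the whole number equals the
    # sign of its most significant non-zero digit.
    lead = next((d for d in digits if d != 0), 0)
    negative = lead < 0
    # Negating every digit of a negative number yields its absolute value.
    work = [-d for d in digits] if negative else list(digits)
    out = []
    borrow = 0
    # Ripple a borrow from the least significant digit upward (sequential model).
    for d in reversed(work):
        d -= borrow
        borrow = 1 if d < 0 else 0
        out.append(d + 10 * borrow)
    # The value is non-negative here, so no borrow leaves the top digit.
    return negative, out[::-1]


if __name__ == "__main__":
    print(sd_to_bcd_abs([1, -9, 5]))   # 100 - 90 + 5
    print(sd_to_bcd_abs([-1, 2, -3]))  # -100 + 20 - 3
```

Folding the sign handling into the same pass is what lets a converter of this kind avoid a separate digit-by-digit inversion stage.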
[Figure: area of the 16-digit adder (um2) versus delay (ns, 0.24 to 0.72) for the proposed
design and the designs in [51], [52], [56], and [59].]
Figure 7.1: Area-Delay Comparison
7.2 Decimal Fixed-point Multiplication
To compare the proposed multiplication algorithm with other designs, a delay model is
first created in terms of the fanout-of-4 inverter delay on the estimated critical path. The
effects of gate fanout and gate scaling are therefore ignored in the theoretical comparison.
[Figure: power of the 16-digit adder (mW) versus delay (ns, 0.24 to 0.72) for the proposed
design and the designs in [51], [52], [56], and [59].]
Figure 7.2: Power-Delay Comparison
To obtain a more accurate comparison, a Verilog-HDL model of the proposed 16 × 16-digit
multiplier is synthesized with the STM 90 nm standard cell normal case library (1.0V, 25℃). To
compare fairly with previous works implemented in different technologies, the fanout-of-4
inverter delay and the NAND2 gate area are used as the units of delay and area. Since the
absolute value of each unit scales with the technology, these units place the different
designs on an identical reference. A discussion of the performance differences between our
proposed architecture and other designs is given afterwards.
7.2.1 Parallel Multiplication
Performance Evaluation
Table 7.2 lists the numbers of logic gates (i.e., NAND2 gates, or G) for the different stages
of the parallel 16 × 16-digit multipliers from other designs. We assume that an AND2/OR2
gate is equivalent to one NAND2 gate and an XOR gate to two NAND2 gates. The PPG
unit in Table 7.2 generates the partial products in a format that can be processed directly
by the PPR unit in the next stage. For example, the decimal carry save adder that reduces
the multiples from the double-BCD format (i.e., double-four-bit) to the BCD-CS format
(i.e., one-four-bit), applied in the sequential design in [63] and the parallel design in [65],
is counted in the PPG stage. Additionally, to analyze the efficiency of the PPR methods
fairly, we suppose that the outputs of the PPR unit are two numbers in arbitrary formats
(e.g., double-BCD or BCD-CS format). Thus, the fourth level of the ODDS adder in [70] and
the final simplified CLA shown in Fig. 6.12 are assumed to be the adder setup unit in the final
stage. Finally, for the three sequential multipliers at the bottom of Table 7.2, only the ratio
(marked by an asterisk) between the G involved in the iterative cycles and the proposed
design is provided, since the non-iterative cycles can be pipelined without reducing the
overall efficiency of the multiplier. As shown in Table 7.2, some algorithms may be faster
than our proposed design in PPG or PPR, but when the trade-off among the three
multiplication stages is considered, our design performs the best.
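The stage-wise accounting described above can be sketched in a few lines of Python. The weights encode the stated gate-equivalence assumption, and the per-stage figures are the parallel rows of Table 7.2; the names `GATE_WEIGHT`, `path_cost`, `total`, and `total_ratio` are hypothetical helpers introduced only for this illustration.

```python
# NAND2-equivalent weights under the assumption stated in the text.
GATE_WEIGHT = {"NAND2": 1, "AND2": 1, "OR2": 1, "XOR2": 2}

# (PPG, PPR, setup + final adder) in G, transcribed from the parallel rows of Table 7.2.
PARALLEL = {
    "[65]": (20, 60, 17),
    "[74]": (37, 54, 25),
    "[67]": (9, 57, 19),
    "[70]": (11, 46, 29),
    "Radix-10 [68]": (39, 42, 23),
    "Radix-5 [68]": (11, 53, 23),
    "Proposed": (12, 47, 16),
}


def path_cost(gates):
    """Cost in G of a chain of gates, e.g. ["XOR2", "AND2", "OR2"]."""
    return sum(GATE_WEIGHT[g] for g in gates)


def total(name):
    """Total cost in G of one multiplier: sum of its three stages."""
    return sum(PARALLEL[name])


def total_ratio(name, baseline="Proposed"):
    """Total cost relative to the baseline design."""
    return total(name) / total(baseline)


if __name__ == "__main__":
    for design in sorted(PARALLEL, key=total):
        print(f"{design:14s} total = {total(design):3d} G, ratio = {total_ratio(design):.2f}")
```

Sorting by the total makes the trade-off visible: [67] and [70] have a cheaper PPG and [70] and the radix-10 design of [68] a cheaper PPR, yet the proposed design has the smallest sum.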
To evaluate not only the delay but also the hardware cost more accurately, a hardware
model of a 16 × 16-digit multiplier is implemented in Verilog-HDL and synthesized using
Synopsys Design Compiler with the STM 90 nm CMOS standard cell library, in which the
delay of an inverter with a fanout of four inverters is 45 ps and the area of the smallest
NAND2 gate is 4.4 um2. 500,000 random cases and 100 manually created boundary cases are
applied to the Verilog-HDL model to verify its correctness. The delay in picoseconds of
each module on the critical path is shown in Table 7.4. Furthermore, the delay-area points
obtained from Design Compiler, ranging from 1.94 ns and 49900 NAND2 to 2.65 ns and 36655
NAND2, are shown in Fig. 7.3. The delay-area values of other parallel designs are also
provided. The latest Radix-10 and Radix-5 architectures in [68] and the architecture in [70]
are implemented and evaluated with our library and synthesis environment.
Comparison and Discussion
In Table 7.3, the state-of-the-art decimal multipliers for 16-digit operands are listed in terms
of delay, hardware area, throughput, and latency. In [65], the design is synthesized using
the STM 90 nm library, the same library as used for our design. The latency reported by
the authors is 2.65 ns, which equals about 58.9 FO4. In [66], the authors improve the design
in [65] and reduce the latency to 2.51 ns (55.8 FO4) with an elaborate PPR tree and a
binary-to-decimal converter. Both of these designs have an area of 68000
Table 7.2: Delay analysis of 16 × 16-digit decimal fixed-point multipliers
Architecture    PPG (G)  Ratio  PPR (G)  Ratio  Setup + Final Adder (G)  Ratio  Total (G)  Ratio
Parallel
[65]            20       1.67   60       1.28   17                       1.06   97         1.29
[74]            37       3.08   54       1.15   25                       1.56   116        1.55
[67]            9        0.75   57       1.21   19                       1.19   85         1.13
[70]            11       0.92   46       0.98   29                       1.81   86         1.15
Radix-10 [68]   39       3.25   42       0.89   23                       1.44   104        1.39
Radix-5 [68]    11       0.92   53       1.13   23                       1.44   87         1.16
Proposed        12       1      47       1      16                       1      75         1
Sequential
[42]            11       -      13 × 17  -      43                       -      275        2.95*
[64]            13       -      31 × 17  -      -                        -      -          7.03*
[63]            20       -      20 × 17  -      17                       -      377        4.76*
* Ratio = Σ PPR / proposed total
Table 7.3: Performance comparison of 16 × 16-digit decimal fixed-point multipliers
Architecture        # Cycles  Cycle Time (FO4)  Latency (FO4)  Ratio  Throughput (Mult./Cycle)  Area (NAND2)  Ratio
Parallel
Bin. Radix-4 [65]   1         -                 31.1           0.72   1                         34000         0.68
[65]                1         -                 58.9           1.37   1                         68000         1.36
[66]                1         -                 55.8           1.29   1                         68000         1.36
[74]                1         -                 54.4           1.26   1                         60500         1.21
[67]                1         -                 53.5           1.24   1                         79600         1.60
[70]                1         -                 48.1           1.12   1                         49500         0.99
Radix-10 [68]       1         -                 48.4           1.12   1                         44400         0.89
Radix-5 [68]        1         -                 47.8           1.11   1                         50900         1.02
Proposed            1         -                 43.1           1      1                         49900         1
Sequential
[42]                24        12.7              305            5.00*  1/17                      31500         0.63
[64]                20        16                320            6.31*  1/17                      16000         0.32
[63]                20        14.7              294            5.80*  1/17                      18550         0.37
* Ratio = (FO4 Delay / Throughput) / FO4 proposed
Table 7.4: Critical path of the proposed 16 × 16-digit multiplier
Gen. of mult.  Sel.+Inv.  PPR      GP gen.+Prefix Tree+Sel.
160 ps         140 ps     1230 ps  410 ps
[Figure: area (#NAND2) versus delay (#FO4, 42.5 to 60) for the proposed design, Radix-5 [68],
Radix-10 [68], [65], [66], [67], [70], and [74].]
Figure 7.3: Delay-area space of the decimal multipliers
NAND2 gates. Our PPG algorithm avoids the decimal CSA used in the PPG unit of those
designs. Furthermore, the PPG in [65], which consists of six levels of BCD-FAs, involves six
levels of 4-bit carry propagation, which lowers the performance of the multiplier.
These two radix-10 combinational multipliers have at least 29% more delay and 36%
more area than our proposed design.
In [74], the authors propose a parallel decimal floating-point multiplier by applying the
fixed-point design with the radix-10 architecture proposed in [76]. This parallel decimal
multiplier applies new decimal encodings (i.e., BCD-4221 and BCD-5211) to simplify the design
of the PPR tree. In the radix-10 design proposed in [76], the PPG stage involves a carry
propagation through all bits of an operand. Besides, the 17:2 reduction tree with binary
CSAs and encoding converters is slower than our proposed PPR unit with two levels of SD
adders. Overall, our proposed algorithm reduces the delay by about 26% and the area by
about 21% compared to the radix-10 design applied in [74].
In [67], the authors propose a method that represents 8X and 9X in two digits to avoid the
long path in the PPG. Consequently, the delay of the PPG is reduced significantly. To reduce
the partial products, the authors present an architecture with six levels of simplified BCD-FAs.
Additionally, a narrower result is obtained after the PPR unit. However, the number of levels
of the prefix tree in the final addition cannot be reduced, since the PPR result is not
narrowed by more than half of the width. The BCD-FA used in the PPG in [65] is replaced by
a simplified BCD half adder. Nevertheless, the digit-level reduction tree based on BCD-FAs
shows the same disadvantages of relatively large delay and area as described for the design
proposed in [65]. The design in [67], synthesized with the TSMC 130 nm standard cell library,
costs about 53.5 FO4 and 79600 NAND2. Although its PPG unit, which has no XOR gate and
only a simple selection circuit, is faster than our proposed PPG, the slower PPR and final
addition in [67] mean that our multiplier achieves about 24% less delay with 60% less
hardware cost overall.
In the SD multiplier proposed in [70], the efficiency of the PPG and PPR units is
ensured, as in our proposed design. However, the double-BCD partial product
array offsets this advantage in the overall delay of [70]. Furthermore,
the final overloaded decimal digit set adder with the subsequent traditional digit converter is
slower than the simplified 4-bit CLA and the converter proposed in our design. Finally, the
synthesis result shows that our design has 12% less delay with almost the same area.
In [68], the authors improve the design they proposed in [76]. The PPR trees are optimized
for both the radix-10 and radix-5 architectures of [76]. Thus, as shown in Table 7.2, the
17:2 reduction tree in the radix-10 architecture, measured in number of gates, is faster than
our proposed PPR. However, the 32:2 reduction tree of the radix-5 architecture is still slower
than our design, since about twice as many operands are processed in that structure.
Moreover, the carry propagation in the PPG of the radix-10 architecture, which affects the
overall performance, cannot be avoided. In the radix-5 architecture, the partial products
are generated by a couple of recoders with a small delay. Additionally, the unconventional
encodings avoid the complicated decimal correction used in most other works. Thus, the PPR
tree proposed in [68] can be arranged as a binary CSA tree (i.e., a Wallace-like structure
based on binary CSAs and encoding converters). However, our design balances the delay of
the PPG and PPR and applies a simpler final conversion than the designs in [68]. Overall,
our design has about 11% less delay with 2% less area compared to the fastest
state-of-the-art design (radix-5) in [68], and about 12% less delay with 11% more area
compared to the radix-10 architecture in [68].
The sequential designs of fixed-point decimal multiplication are also listed in Table 7.7.
The latency ratios marked with an asterisk are calculated from the FO4 delay spent on the
iterative cycles. As expected, such sequential designs show an advantage in area and a
disadvantage in latency and throughput.
Generally speaking, the output format of a PPG algorithm can be single-BCD (e.g.,
the radix-10 architecture in [68]), single-BCD with an identical carry for each partial product
(e.g., the proposed method), BCD-CS (e.g., the method applied in [65]), or double-BCD
(e.g., the algorithms used in [67, 70] and the radix-5 architecture in [68]). In general, the
fewer bits in the output of a PPG, the more complex the PPG and the less complex the PPR.
For example, the single-BCD result of the PPG in the radix-10 architecture in [68] allows the
simplest PPR unit to be applied, but it cannot avoid the carry propagation in the PPG. On
the other hand, the double-BCD result of the simplest PPG in [67] requires a complicated
PPR unit. Our proposed PPG method generates partial products whose bit-width is close to
that of the single-BCD format without carry propagation. Thus, the complexity of the PPR
is potentially reduced. Moreover, since only simple combinational logic is applied to convert
the digit set in the proposed PPG, the BCD-FA used in the PPG of [65] is eliminated. The
PPR algorithm depends heavily on the encoding of the PPG result. Besides, in the PPR, the
narrower the input and the less bit-level carry propagation, the better. Our proposed PPR
design, based on multi-operand SD addition with two bit-level carry propagations, is slightly
more complicated than the design in [68] with the same input width (i.e., n + 1 digits), and
is simpler than the design in [68] with the double-sized input width and the BCD-FA based
designs in [65, 67]. The carry propagation in the final addition cannot be avoided in any
method, since the result of the PPR is in a redundant format. However, the efficiency of the
final addition or conversion is affected by the complexity of the setup logic and the prefix
tree. The proposed conversion method involves a 4-bit carry propagation to generate the
propagate and generate bits for each digit, but by applying the hybrid carry prefix tree, the
logic on the critical path is minimized. In summary, although the proposed method is not the
simplest in every stage of a multiplication, the overall delay of the proposed multiplication
is minimized by considering the trade-off of the complexity in each stage.
7.2.2 Sequential Multiplication
The proposed multiplier, according to Fig. 6.20, consists of three main parts, namely PPG,
PPA, and Conversion, which consume 1, n + 1, and 2 cycles, respectively, where n is the digit
width of the input. Hence a single multiplication is performed in n + 4 cycles with an
initiation interval of n + 1 cycles.
The cycle time, and thus the clock frequency, is determined by the critical delay path of the
PPA and is equal to the latency of the multi-operand adder shown in Fig. 6.18. The details of
the critical delay path are tabulated in Table 7.5.
Table 7.5: The critical delay path of the proposed multiplier (ns)
(4:2) Compressor  4-bit CLA  Recoder  Register  Total
0.17              0.13       0.15     0.18      0.63
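The cycle count and the cycle time above combine into the total latency of one multiplication. The Python sketch below works through this arithmetic for n = 16, assuming the 45 ps FO4 delay of the STM 90 nm library quoted in this chapter; the function names are illustrative only.

```python
FO4_DELAY_NS = 0.045  # FO4 inverter delay of the STM 90 nm library quoted in this chapter

# Components of the critical delay path in ns, transcribed from Table 7.5.
CRITICAL_PATH_NS = {
    "(4:2) compressor": 0.17,
    "4-bit CLA": 0.13,
    "recoder": 0.15,
    "register": 0.18,
}


def cycle_counts(n):
    """Cycles per multiplication for n-digit operands: PPG 1, PPA n + 1, conversion 2."""
    ppg, ppa, conversion = 1, n + 1, 2
    return {"latency_cycles": ppg + ppa + conversion, "initiation_interval": ppa}


def cycle_time_ns():
    """Cycle time as the sum of the critical-path components."""
    return sum(CRITICAL_PATH_NS.values())


def cycle_time_fo4():
    """Cycle time expressed in FO4 inverter delays."""
    return cycle_time_ns() / FO4_DELAY_NS


def total_latency_fo4(n):
    """Latency of one multiplication in FO4 inverter delays."""
    return cycle_counts(n)["latency_cycles"] * cycle_time_fo4()


if __name__ == "__main__":
    print(cycle_counts(16))
    print(round(cycle_time_ns(), 2), "ns per cycle")
    print(round(cycle_time_fo4(), 1), "FO4 per cycle")
    print(round(total_latency_fo4(16)), "FO4 per multiplication")
```

For 16-digit operands this gives 20 cycles of about 14 FO4 each, i.e., about 280 FO4 in total, while a new multiplication can start every 17 cycles.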
The area consumption of the proposed 16-digit multiplier is evaluated as the sum of the
area cost of various constituent parts tabulated in Table 7.6.
Table 7.6: Area consumption of the proposed 16-digit multiplier
Component   Area (um2)
PPG         6900
PPA         20100
Conversion  4500
Registers   7680
Misc        220
Total       39400
Comparison and Discussion
The multiplier described in [63] requires n + 4 cycles per multiplication, where the cycle
time is equal to the latency of a BCD (4:2) compressor plus registers. According to the
evaluation in [68], the cycle time and the area of this design are 16 FO4 and 16000 NAND2
gates, respectively, for a 16-digit multiplication.
In [42], the multiplier using the overloaded decimal representation calls for a special
decimal carry-free adder, whose critical delay path consists of a (4:1) multiplexer, a
+6 increment block, and a binary full adder plus registers. This leads to a cycle time of 12.7
FO4, and the number of required cycles is n + 8. The area of this multiplier for 16-digit
operands is reported as 31500 NAND2 gates.
Another multiplier, proposed by Erle in [64], takes advantage of the decimal signed-digit
adder introduced in [50] for the iterative portion of the PPA. Thus the latency
of the redundant adder plus registers (i.e., 14.7 FO4) determines the cycle time. The number
of required cycles is the same as in [63] (i.e., n + 4), and the area cost is reported as 18550
NAND2 gates for a 16-digit multiplication.
In accordance with the above discussion, Table 7.7 presents the detailed evaluation
results and compares the proposed design with the others in terms of latency and area.
Moreover, the simulation results of the proposed multiplier under different delay constraints
are depicted in Fig. 7.4. The proposed design consumes less area than the previous works.
The evaluation and comparison results show the clear area advantage of the proposed
sequential decimal multiplier over the previous sequential designs.
Table 7.7: Comparison of the 16-digit multipliers
Design    Cycle time (FO4)  Ratio  No. of cycles  Total Latency (FO4)  Ratio  Area (NAND2)  Ratio
[63]      16                1.14   20             320                  1.14   16000         1.79
[42]      12.7              0.91   24             305                  1.09   31500         3.52
[64]      14.7              1.05   20             294                  1.05   18550         2.07
Proposed  14                1      20             280                  1      8960          1
[Figure: area cost (um2), power consumption (mW), and power/delay (mW/GHz) of the proposed
sequential multiplier versus clock speed (ns, 0.6 to 1.1).]
Figure 7.4: Evaluation of speed, area, power consumption of the proposed sequential
multiplier
7.3 Decimal Floating-point FMA
To analyze the performance of the proposed architecture, a Verilog-HDL model is created and
verified by a test package with 425599 vectors [101] and 50K random vectors generated with
the Python decimal library. Furthermore, the Verilog model is synthesized by Synopsys Design
Compiler with the normal case of the STM 90 nm standard cell library [100] (1.0V, 25℃),
in which the delay of an inverter with a fanout of four inverters is 45 ps and the area
of the smallest NAND2 gate is 4.4 um2.
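Random reference vectors of this kind can be produced with the `fma` method of a `decimal.Context`. The sketch below is one possible way to do it and is not the script used for this thesis: the decimal64 context parameters follow IEEE 754-2008, while the operand distribution, the function names, and the output tuple layout are assumptions made for illustration.

```python
import random
from decimal import Context, Decimal, ROUND_HALF_EVEN

# decimal64 per IEEE 754-2008: 16 digits, Emin = -383, Emax = 384.
# Traps are disabled so that exceptions are recorded as flags instead of raised.
DEC64 = Context(prec=16, Emin=-383, Emax=384, rounding=ROUND_HALF_EVEN,
                clamp=1, traps=[])


def random_decimal64(rng):
    """Draw a finite decimal64 value with a random sign, coefficient, and exponent."""
    sign = rng.randint(0, 1)
    coefficient = rng.randrange(10 ** 16)
    exponent = rng.randint(-398, 369)  # quantum exponent range of decimal64
    digits = tuple(int(c) for c in str(coefficient))
    return Decimal((sign, digits, exponent))


def fma_vectors(count, seed=0):
    """Yield (x, y, z, expected x*y+z, raised flag names) reference vectors."""
    rng = random.Random(seed)
    for _ in range(count):
        x, y, z = (random_decimal64(rng) for _ in range(3))
        DEC64.clear_flags()
        result = DEC64.fma(x, y, z)  # product is not rounded before the addition
        raised = sorted(sig.__name__ for sig, on in DEC64.flags.items() if on)
        yield x, y, z, result, raised


if __name__ == "__main__":
    for vector in fma_vectors(3, seed=1):
        print(vector)
```

Uniformly random exponents mostly exercise the cases in which one operand dominates, so a directed set of vectors such as [101] remains necessary for cancellation and boundary cases.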
7.3.1 Performance Evaluation
In this section, only the synthesis result of the combinational logic of the proposed
architecture, which does not contain the registers at the input and output, is provided.
Table 7.8 shows the delay and area of the entire design, which contains six major blocks. If
only the combinational configuration is considered, the pre-alignment is not on the critical
path. Additionally, the exception processing unit includes the DPD/BCD conversions at the
front and end of the design and the post-processing unit, which handles the exceptions and
creates the flag signals.
Table 7.8: Delay and area partition of the proposed architecture
Component         Delay (ns)  Ratio   Area (um2)  Ratio
Multiplier Array  1.75 (a)    44.6%   211362      66.9%
Pre-alignment     1.75        -       19341       6.1%
Adder             0.38 (a)    9.7%    38095       12.1%
Post-alignment    0.8 (a)     20.4%   33092       10.5%
Rounding          0.5 (a)     12.8%   6002        1.9%
Exception Proc.   0.49 (a)    12.5%   7933        2.5%
Total             3.92        100%    315825      100%
(a) The units on the critical path.
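The totals and percentage columns of Table 7.8 can be reproduced from the per-block figures. The following Python sketch does so; the tuple layout and helper names are hypothetical, and the pre-alignment is excluded from the delay sum because it operates in parallel with the multiplier array.

```python
# (delay in ns, area in um2, on the critical path?) transcribed from Table 7.8.
BLOCKS = {
    "Multiplier Array": (1.75, 211362, True),
    "Pre-alignment": (1.75, 19341, False),
    "Adder": (0.38, 38095, True),
    "Post-alignment": (0.80, 33092, True),
    "Rounding": (0.50, 6002, True),
    "Exception Proc.": (0.49, 7933, True),
}


def critical_path_ns():
    """Sum the delays of the blocks that lie on the critical path."""
    return sum(delay for delay, _, on_path in BLOCKS.values() if on_path)


def total_area_um2():
    """Sum the area of all blocks, whether or not they are on the critical path."""
    return sum(area for _, area, _ in BLOCKS.values())


def delay_share(name):
    """Share of the critical-path delay contributed by one block, in percent."""
    return 100.0 * BLOCKS[name][0] / critical_path_ns()


def area_share(name):
    """Share of the total area occupied by one block, in percent."""
    return 100.0 * BLOCKS[name][1] / total_area_um2()


if __name__ == "__main__":
    print(round(critical_path_ns(), 2), "ns,", total_area_um2(), "um2")
    for block, (_, _, on_path) in BLOCKS.items():
        delay = f"{delay_share(block):.1f}%" if on_path else "-"
        print(f"{block:17s} delay {delay:>6s}  area {area_share(block):.1f}%")
```

The multiplier array accounts for roughly two thirds of the area but less than half of the delay, which is why moving the pre-alignment off the critical path does not change the total delay.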
7.3.2 Comparison and Discussion
To explain the advantages of the proposed architecture, it is compared in detail with two
previously published designs in this section. In [95], first, the multiplier involves a final
partial product accumulation performed by a decimal quaternary tree adder. In our design,
the redundant product obtained from the partial product reduction of the multiplier is used
directly by the following units. Second, since only the addend is shifted in our design, the
swapping unit that exchanges the operands is not necessary, and the pre-alignment shifters
are moved entirely off the critical path, which passes through the multiplier. However, since
both left and right shifts are performed in [95], its data path is restricted to 2n digits,
which may reduce the area in the DFP128 format. Third, in [95] the decimal leading zero
anticipator, which operates simultaneously with the adder, involves the "pre-correction" of
the decimal adder, two prefix networks, one 4-bit binary adder, and some combinational
logic. In our design, on the other hand, the carry free adder is applied before the leading
zero anticipator. However, fewer units (i.e., one prefix network, one 4-bit binary adder, and
some combinational logic in the "Transfer digit generation" and "Intermediate signals
generation") are needed to determine the number of leading and trailing zeros in our design.
The rounding unit is not analyzed, since its detailed design is not provided in [95].
In [102], although the top-level architecture is similar to our design, the complexity of
the circuits in the sub-modules is different. First, the multiplier, which does not include
the final partial product accumulation, has a complexity similar to that of our multiplier.
However, the multiplexer in the selection unit and the two encoding converters in the decimal
4221-BCD CSA take more delay than the correction signal generator in our decimal carry free
adder. Second, a four-step anticipation algorithm is proposed for the LZA of [102]. In our
LZA design, however, the input signals are generated directly from the single redundant
result, which has a 4-bit two's complement encoding for each digit. Additionally, in our
design, only a binary leading zero detector is on the critical path to obtain the possible
position of the leading non-zero digit. The pattern is only used to correct the possible
one-digit error at the end of the shift amount generator for post-alignment. Moreover, the
post-alignment shifter in our design is simply a right shifter, optimized to reduce the delay
on the critical path and the hardware area. Finally, the combined addition and rounding unit
in [102] employs a binary compound adder with pre-/post-corrections, which are bitwise
constant adders with multiplexers, after which the final result is selected by the rounding
increments. The final rounding and conversion unit in our design, on the other hand, applies
three generation units with only a small constant delay and one bitwise binary CLA.
In Table 7.9, a comparison of the synthesis results of three combinational designs and the
performance of the corresponding software libraries evaluated in [99] is provided. The actual
delays of the designs in [102] and [95] under 65 nm technology are 5.4 ns and 4.6 ns, which
are slower than that of our design under 90 nm technology. Since the previous works are
synthesized with different standard cell libraries, their delay and area are normalized by
the FO4 delay of 35 ps and the NAND2 area of 1.44 um2 in 65 nm technology. Under the same
metric, our design takes about 66% of the delay and 83% of the hardware area of the previous
fastest design, proposed in [95]. Additionally, the estimated power of our design is about
114 mW. Note that the number of cycles of the software libraries depends on the processor
and compiler.
Table 7.9: Performance comparison
Design          Delay (FO4)  Ratio  Area (NAND2)  Ratio
[102]           154.3        1.77   107708        1.50
[95]            131.4        1.51   86061         1.20
Proposed        87.1         1      71778         1
decDouble [99]  785 (a)      -      -             -
idfpl64 [99]    879 (a)      -      -             -
decNum [99]     1683 (a)     -      -             -
(a) The performance of software libraries is measured by the number of cycles.
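The technology-independent figures in Table 7.9 come from dividing each delay by the FO4 delay, and each area by the NAND2 area, of the library it was synthesized with. A minimal Python sketch of this normalization follows, using only the constants quoted in the text; the `TECH` table and function names are assumptions introduced for the example.

```python
# FO4 inverter delay and NAND2 area per technology node, as quoted in the text.
TECH = {
    "65nm": {"fo4_ps": 35.0, "nand2_um2": 1.44},
    "90nm": {"fo4_ps": 45.0, "nand2_um2": 4.4},
}


def to_fo4(delay_ns, node):
    """Express a delay in units of the FO4 inverter delay of the given node."""
    return delay_ns * 1000.0 / TECH[node]["fo4_ps"]


def to_nand2(area_um2, node):
    """Express an area in units of the NAND2 gate area of the given node."""
    return area_um2 / TECH[node]["nand2_um2"]


if __name__ == "__main__":
    proposed_fo4 = to_fo4(3.92, "90nm")        # total delay from Table 7.8
    proposed_nand2 = to_nand2(315825, "90nm")  # total area from Table 7.8
    fastest_prior_fo4 = to_fo4(4.6, "65nm")    # design in [95]
    print(round(to_fo4(5.4, "65nm"), 1), round(fastest_prior_fo4, 1), round(proposed_fo4, 1))
    print(round(proposed_nand2))
    print(round(proposed_fo4 / fastest_prior_fo4, 2))
```

Although the proposed design is synthesized in the older 90 nm technology, its normalized delay is about two thirds of that of the design in [95].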
7.3.3 Pipeline Configuration
The proposed DFMA can also be configured as a regular pipeline to perform efficiently. A
possible configuration is shown in Fig. 7.5. In this case, the multiplier array is divided
into 3 cycles. If an addition is performed, the multiplier array can be bypassed by the
multiplexer at the end of its third cycle. The following three units can be partitioned
accordingly. The minimum cycle time is therefore decided by the delay of the pre-alignment
unit and the pipeline registers. Consequently, a decimal floating-point addition may be
finished in 4 cycles at 1.1 GHz (i.e., 0.9 ns per cycle). On the other hand, if a decimal
floating-point multiplication is performed, all the components on the critical path have to
be enabled by setting Z to 0. Hence, a DFP multiplication may be finished in 6 cycles at
1.1 GHz.
[Figure: pipeline stages Mul Cycle 1 to Mul Cycle 3, Pre-Alignment, Addition, Post-Align 1 to
Post-Align 3, and Rounding, with inputs X, Y, Z, output R, and Bypass/Enable controls.]
Figure 7.5: A regular pipeline configuration of the proposed architecture
Part IV
Conclusion
Chapter 8
Summary and Future Research
8.1 Summary and Conclusion
In this thesis, architectures and algorithms for decimal fixed-point addition and for
both parallel and sequential multiplication are first proposed. Afterwards, a new decimal
floating-point FMA architecture is described in detail. The motivations of the study are
recalled here before the research and theoretical work are summarized. In Section 1.2,
the demand for high-performance decimal floating-point arithmetic is discussed. Further,
decimal floating-point processing with a hardwired fused multiply-add function is proposed
as the major work of this research. Decimal processing with unconventional number systems
is also considered and shown to be a competitive technique for performance improvement.
During this research, the previous designs of decimal fixed-point carry free addition and
of parallel/sequential multiplication were studied. On the basis of these techniques, our
own ideas to further improve the performance of decimal fixed-point addition and
multiplication were proposed.
In the proposed addition, a new nonspeculative decimal carry free adder, which operates on
operands in the digit set [-9, 9] with two's complement binary encoding, was discussed.
This design determines the transfer digit directly from the input operands instead of from
the position sum, as is done in the conventional carry free signed digit addition
algorithm. The digit range of the operands that minimizes the cost of exception handling
was also analyzed. Furthermore, to improve the speed and reduce the area of the adder,
a new algorithm that calculates the result digit without the temporary result was proposed.
The synthesized results demonstrate the superiority of the proposed design in terms of both
the area-delay product and the power-delay product. Overall, the delay is reduced by about
25% in comparison to the fastest state-of-the-art decimal redundant adder.
Furthermore, a digit set conversion algorithm that directly converts the absolute value of
the signed digit result to the conventional BCD encoding was introduced in detail, so that
the proposed carry free addition can be applied on its own (i.e., without a subsequent
processing unit). The new conversion algorithm, which has only one propagation with
logarithmic delay, is well suited to high-precision computation.
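For reference, the conventional carry free signed digit addition that the proposed adder is contrasted with can be modelled as follows. This Python sketch shows the textbook scheme, in which the transfer is chosen from the position sum; it is not the adder proposed in this thesis, which derives the transfer directly from the input operands. The function names and the least-significant-first digit order are assumptions of the example.

```python
def sd_add(x, y):
    """Add two signed-digit numbers given least significant digit first,
    with every digit in [-9, 9]; returns one digit more than the inputs."""
    if len(x) != len(y):
        raise ValueError("operands must have the same number of digits")
    transfers = [0]  # no transfer enters the least significant position
    interim = []
    for xi, yi in zip(x, y):
        p = xi + yi  # position sum in [-18, 18]
        t = 1 if p >= 9 else -1 if p <= -9 else 0
        interim.append(p - 10 * t)  # interim sum in [-8, 8]
        transfers.append(t)
    # Each result digit uses only positions i and i - 1, so no carry chain forms.
    result = [w + t for w, t in zip(interim, transfers)]
    result.append(transfers[-1])
    return result


def sd_value(digits):
    """Integer value of a signed-digit number given least significant digit first."""
    return sum(d * 10 ** i for i, d in enumerate(digits))


if __name__ == "__main__":
    total = sd_add([9, 9, 9], [9, 9, 9])
    print(total, sd_value(total))
```

Keeping the interim sum within [-8, 8] leaves room for the incoming transfer of -1, 0, or +1, which is why every result digit stays inside [-9, 9].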
In the proposed multiplications, a new technique to implement parallel decimal
multiplication is first introduced. Unlike other designs, the proposed algorithm represents
the multiples (i.e., from -5X to 5X) in the redundant digit set [-8, 8]. Thus, the signed
digit partial products can be generated without the carry propagation in 3X. To reduce the
partial products into one signed digit result, a partial product reduction unit based on
multi-operand signed digit addition was discussed. Moreover, all of the components inside
the multi-operand signed digit adder, except for two combinational recoders, can be reused
in binary designs. The combinational recoders are currently implemented with logic gates;
customized circuits could be applied to further improve the performance. Moreover, the
proposed hybrid prefix network showed the advantage of removing more delay from the
critical path in the final digit-set conversion for standalone application. Overall, the
synthesis result under STM 90 nm technology showed that the proposed parallel multiplier
achieves about 11% less delay with 2% less hardware cost, even when compared to the
fastest state-of-the-art parallel decimal multiplier. In the proposed sequential
multiplication, we exploited the signed digit multiples generation algorithm for 1X, 2X, and
4X. Furthermore, a partial product generation algorithm that uses only these three easy
multiples was proposed by introducing redundancy into the second operand of the
multiplication (i.e., Y). Following this step, a partial product accumulation architecture,
including a series of multi-operand carry free adders, was given to sum up the partial
products iteratively. At the same time, the lower half of the product digits is converted.
After the last iteration of the partial product accumulation, the upper half of the product
digits is generated by a parallel conversion algorithm with a prefix network. Finally,
the evaluation of the proposed design showed that it achieves about 52% less
area and 0.5% less latency compared to the fastest state-of-the-art design.
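To illustrate why only the multiples from -5X to 5X are needed, the Python sketch below recodes a BCD multiplier into digits in [-5, 5] and forms the product from the corresponding multiples. The recoding shown is a common radix-10 signed-digit scheme from the literature and is only assumed to be close in spirit to the one used here; the multiples are held as plain integers, so the [-8, 8] digit encoding of the proposed design is not modelled, and all names are hypothetical.

```python
def recode_multiplier(y_digits):
    """Recode BCD digits (least significant first, each 0..9) into digits in [-5, 5]."""
    out = []
    carry_in = 0
    for y in y_digits:
        # The outgoing transfer depends on this digit only, so no carry chain forms.
        carry_out = 1 if y >= 5 else 0
        out.append(y + carry_in - 10 * carry_out)
        carry_in = carry_out
    out.append(carry_in)
    return out


def multiply(x, y):
    """Multiply non-negative integers using only the multiples -5x .. 5x."""
    multiples = {k: k * x for k in range(-5, 6)}
    y_digits = [int(c) for c in reversed(str(y))]
    recoded = recode_multiplier(y_digits)
    # Each recoded digit selects one multiple, shifted to its decimal position.
    return sum(multiples[d] * 10 ** i for i, d in enumerate(recoded))


if __name__ == "__main__":
    print(recode_multiplier([9, 9]))  # 99 recoded as 100 - 1
    print(multiply(1234567890123456, 9876543210987654))
```

The recoding adds one extra partial product for the final transfer, which matches the 17 partial products reduced by a 17:2 tree for 16-digit operands.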
Meanwhile, three decimal floating-point fused multiply-add designs have been published.
After the fixed-point addition and multiplication, this work discusses these FMA designs.
Subsequently, a new technique to improve the performance of the decimal floating-point
fused multiply-add is proposed. The fixed-point adder and parallel multiplier were
reused and modified in the new DFMA. Applying the specific number system required that
the digit-set conversions inside the proposed DFMA be minimized. Therefore, only two
stages, partial product generation and partial product reduction, are retained
in the parallel multiplier. Moreover, the pre-alignment could be moved off the
critical path by shifting only the addend. The modification of our proposed decimal carry free
adder also allows different digit sets for the operands and the result. In the post-alignment
unit, the rounding position was decided by detecting the number of leading/trailing zeros
and the possible cancellation with simple logic, owing to the specific number
system. Since the digits (radix - 1) do not exist in the proposed digit set, only a one-digit
error may occur in the leading zero detection of the accumulation result. Therefore, the
shift amount decision in the post-alignment unit is relatively simple. Finally, the rounder
combined the absolute value conversion, the digit-set conversion, and the rounding operation
in one carry propagation process. The synthesis result of the Verilog model shows that the
delay and area were reduced by about 33.7% and 16.6%, respectively, in comparison to the
previous fastest design.
So far, the advantage of unconventional number systems in decimal arithmetic has been
demonstrated in the previous chapters. With the specific redundant number system and careful
hardware design, both the processing speed and the area efficiency (i.e., hardware cost) of
decimal fixed-point addition and multiplication were improved. Furthermore, both the proposed
fixed-point functions and the specific number system were applied to design a new
decimal floating-point fused multiply-add with better performance. Decimal floating-point
arithmetic was thereby enriched by the proposed designs and techniques. The ideas proposed
in this thesis (e.g., the two-step non-speculative adder, multiples without carry generation,
hybrid carry propagation, and easy leading zero anticipation) can also be applied or
extended to other non-binary computing systems built from binary devices in order to improve
their performance.
8.2 Future Research
The decimal floating-point fused multiply-add itself is discussed in this thesis. However,
the application of such a hardwired design is a topic for future work. For example,
the division, square root, reciprocal, and reciprocal square root functions, which can be
computed by a series of fused multiply-add operations with Newton's or similar methods, could
benefit from the proposed DFMA. Both software and hardware solutions are applicable.
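As a software illustration of this idea, the sketch below approximates a decimal reciprocal with Newton's iteration, using only the `fma` operation of Python's `decimal` module at 16-digit precision. It is a minimal sketch under stated assumptions: the two-digit seed table, the fixed iteration count, and the function name are choices made for this example, and the result is close to, but not guaranteed to be, the correctly rounded quotient.

```python
from decimal import Context, Decimal

CTX = Context(prec=16)  # 16 significant digits, the decimal64 precision

# Two-digit reciprocal estimates indexed by the leading digit of the scaled
# divisor (a hypothetical start-up table chosen for this sketch).
SEED = {1: "6.7", 2: "4.0", 3: "2.9", 4: "2.2", 5: "1.8",
        6: "1.5", 7: "1.3", 8: "1.2", 9: "1.05"}


def reciprocal(d, iterations=6):
    """Approximate 1/d for a finite non-zero decimal using only FMA operations."""
    d = CTX.create_decimal(d)
    if not d.is_finite() or d == 0:
        raise ValueError("finite non-zero operand expected")
    e10 = d.adjusted() + 1                 # |d| = m * 10**e10 with 0.1 <= m < 1
    m = abs(d).scaleb(-e10, context=CTX)
    x = Decimal(SEED[int(m.scaleb(1))])    # initial estimate of 1/m
    for _ in range(iterations):
        err = CTX.fma(-m, x, 1)            # err = 1 - m*x with a single rounding
        x = CTX.fma(x, err, x)             # x = x + x*err with a single rounding
    return x.scaleb(-e10, context=CTX).copy_sign(d)


if __name__ == "__main__":
    for value in ("3", "7", "0.125", "-42"):
        print(value, reciprocal(value))
```

The residual 1 - m*x suffers heavy cancellation near convergence; computing it with a single rounding is what lets each iteration roughly double the number of correct digits, which is the property a hardwired DFMA would provide.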
To improve the area efficiency, the parallel multiplier can be replaced by the
sequential design. However, the latency and especially the throughput would worsen
as a result. Furthermore, if the sequential multiplier is applied, the carry free adder
currently used in the DFMA may no longer be necessary, and the architecture and
corresponding algorithms of the DFMA would have to be changed. The pros and cons of DFMA
architectures with parallel and sequential multipliers are another topic for future research.
To perform standalone decimal floating-point addition and multiplication efficiently,
the hardware of the proposed DFMA could be optimized. The techniques applied in
existing binary designs could be exploited in the future.
Alternatively, the concept of applying an unconventional number system to improve
the performance of the DFMA could be considered for other functions (e.g., sequential
division, square root, reciprocal, and even reciprocal square root).
References
[1] J. M. Muller, \On the denition of ulp (x)", URL:
http://ljk.imag.fr/membres/Carine.Lucas/TPScilab/JMMuller/ulp-toms.pdf, Last ac-
cess: May 17, 2013.
[2] H. H. Goldstine and A. Goldstine, \The electronic numerical integrator and computer
(ENIAC)", IEEE Annals of the History of Computing, vol. 18, no. 1, pp. 10-16, 1996.
[3] G. Gray, \UNIVAC I instruction set", Unisys History Newsletter, vol. 5, no. 3, 2001.
[4] M. F. Cowlishaw, \The `telco' benchmark", URL:
http://speleotrove.com/decimal/telco.html, Last access: May 17, 2013.
[5] M. F. Cowlishaw, \Decimal oating-point: Algorism for computers", in 16th IEEE Sym-
posium on Computer Arithmetic, pp. 104-111, Jun. 2003.
[6] M. F. Cowlishaw, E. M. Schwarz, R. M. Smith, and C. F. Webb, \A Decimal Floating-
Point Specication", in 15th IEEE Symposium on Computer Arithmetic, pp. 147-154, Jun.
2001.
[7] A. Y. Duale, M. H. Decker, H. G. Zipperer, M. Aharoni, and T. J. Bohizic, \Decimal
oating-point in z9: An implementation and testing perspective", IBM Journal of Re-
search and Development, vol. 51, no. 1/2, pp. 217-228, Jan.-Mar. 2007.
[8] C. F. Webb, \IBM z10: The next-generation mainframe microprocessor", IEEE Micro,
vol. 28, no. 2, pp. 19-29, Mar.-Apr. 2008.
[9] E. M. Schwarz, J. Kapernick, and M. Cowlishaw, \Decimal oating-point support on
the IBM z10 processor", IBM Journal of Research and Development, vol. 53, no. 1, pp.
4:1-4:10, Jan. 2009.
[10] H. Q. Le, W. J. Starke, J. S. Fields, F. P. O'Connell, D. Q. Nguyen, B. J. Ronchetti, W.
M. Sauer, E. M. Schwarz, and M. T. Vaden, \IBM POWER6 microarchitecture", IBM
Journal of Research and Development, vol. 51, no. 6, pp. 639-662, 2007.
[11] J. Friedrich, B. McCredie, N. James, B. Huott, B. Curran, E. Fluhr, G. Mittal, E.
Chan, Y. Chan, D. Plass, S. Chu, H. Le, L. Clark, J. Ripley, S. Taylor, J. Dilullo, and M.
Lanzerotti, \Design of the POWER6 microprocessor", in IEEE International Solid-State
Circuits Conference (ISSCC), pp. 96-97, Feb. 2007.
135
[12] L. Eisen, J. W. Ward, III, H.-W. Tast, N. Mading, J. Leenstra, S. M. Mueller, C. Jacobi,
J. Preiss, E. M. Schwarz, and S. R. Carlough, \IBM POWER6 accelerators: VMX and
DFU", IBM Journal of Research and Development, vol. 51, no. 6, pp. 663-684, Nov. 2007.
[13] M. Cornea, \Intel Decimal Floating-Point Math Library", URL:
http://software.intel.com/en-us/articles/intel-decimal-oating-point-math-library/, Last
access: May 17, 2013.
[14] \Class BigDecimal", URL:
http://download.oracle.com/javase/1,5.0/docs/api/java/math/BigDecimal.html, Last
access: May 17, 2013.
[15] \Decimal xed point and oating point arithmetic", URL:
http://docs.python.org/library/decimal.html, Last access: May 17, 2013.
[16] \The decNumber Library", URL:
http://speleotrove.com/decimal/decnumber.html, Last access: May 17, 2013.
[17] IEEE Standard for Binary Floating-Point Arithmetic, IEEE Standard 754-2008, 2008.
[18] L.-K. Wang, C. Tsen, M. J. Schulte, and D. Jhalani, \Benchmarks and Performance
Analysis of Decimal Floating-Point Applications", in 25th International Conference on
Computer Design, pp. 164-170, Oct. 2007.
[19] M. Anderson, C. Tsen, L.-K. Wang, K. Compton, and M. J. Schulte, \Performance anal-
ysis of decimal oating-point libraries and its impact on decimal hardware and software
solutions", in 26th International Conference on Computer Design, pp. 465-471, Oct. 2009.
[20] L.-K. Wang and M. J. Schulte, \Decimal oating-point adder and multifunction unit
with injection-based rounding", in 18th IEEE Symposium on Computer Arithmetic, pp.
56-68, Jun. 2007.
[21] L.-K. Wang and M. Schulte, \A decimal oating-point adder with decoded operands and
a decimal leading-zero anticipator", in 19th IEEE Symposium on Computer Arithmetic,
pp. 125-134, Jun. 2009.
[22] B. J. Hickmann, A. Krioukov, M. J. Schulte, and M. A. Erle, \A parallel IEEE P754
decimal oating-point multiplier", in 25th International Conference on Computer Design,
Oct. 2007, pp. 296-303.
[23] T. Lang and A. Nannarelli, \A radix-10 digit-recurrence division unit: Algorithm and
architecture", IEEE Transactions on Computers, vol. 56, no. 6, pp. 727-739, Jun. 2007.
[24] D. Chen, L. Han, Y. Choi, and S. Ko, \Improved Decimal Floating-Point Logarithmic
Converter Based on Selection by Rounding", IEEE Transactions on Computers, vol. 61,
no. 6, pp. 607-621, May 2012.
[25] A.Vazquez, J. Villalba, and E. Antelo, \Computation of Decimal Transcendental Func-
tions Using the CORDIC Algorithm", in 19th IEEE Symposium on Computer Arithmetic,
pp. 179-186, June 2009.
136
[26] R. K. Montoye, E. Hokenek, S. L. Runyon, \Design of the IBM RISC System/6000
oating-point execution unit", IBM Journal of Research and Development, vol. 34, no. 1,
pp. 59-70, Jan. 1990.
[27] P. W. Markstein, \Computation of elementary functions on the IBM RISC System/6000
processor", IBM Journal of Research and Development, vol. 34, no. 1, pp. 111-119, Jan.
1990.
[28] M. Cornea, J. Harrison, and P. Tang, \Intel Itanium oating-point architecture", Inter-
national Symposium On Computer Architecture, Article 3, 2003.
[29] S. Anderson, R. Bell, J. Hague, H. Holtho, P. Mayes, J. Nakano, D. Shieh, and J. Tuc-
cillo, \RS/6000 Scientic and Technical Computing: POWER3 Introduction and Tuning
Guide", IBM Corporation, International Technical Support Organization, First edition,
Oct. 1998.
[30] R.M. Jessani, M. Putrino, \Comparison of single- and dual-pass multiply-add fused
oating-point units", IEEE Transactions on Computers, vol. 47, no. 9, pp. 927-937, Sep.
1998.
[31] P.-M. Seidel, \Multiple path IEEE oating-point fused multiply-add", in 2003 IEEE
International Symposium on Micro-NanoMechatronics and Human Science, vol. 3, pp.
1359-1362, Dec. 2003.
[32] T. Lang and J. D. Bruguera, \Floating-Point Fused Multiply-Add with Reduced La-
tency", in IEEE International Conference on Computer Design: VLSI in Computers and
Processors, pp. 145-150, 2002.
[33] T. Lang and J. D. Bruguera, \Floating-Point Multiply-Add-Fused with Reduced La-
tency", IEEE Transactions on Computers, vol. 53, no. 8, pp. 988-1003, Aug. 2004.
[34] J. D. Bruguera and T. Lang, \Floating-point fused multiply-add: reduced latency for
oating-point addition", in 17th IEEE Symposium on Computer Arithmetic, pp. 42-51,
June 2005.
[35] E. Quinnell, E.E. Swartzlander, C. Lemonds, \Floating-Point Fused Multiply-Add Ar-
chitectures", in 41th Asilomar Conference on Signals, Systems and Computers, pp. 331-
337, Nov. 2007.
[36] E. Quinnell, E.E. Swartzlander, C. Lemonds, \Bridge Floating-Point Fused Multiply-
Add Design", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol.
16, no. 12, pp. 1727-1731, Dec. 2008.
[37] L. Huang, L. Shen, K. Dai, and Z. Wang, \A New Architecture For Multiple-Precision
Floating-Point Multiply-Add Fused Unit Design", in 18th IEEE Symposium on Computer
Arithmetic, pp. 69-76, Jun. 2007.
[38] L. Huang, S. Ma, L. Shen, Z. Wang, and N. Xiao, \Low Cost Binary128 Floating-Point
FMA Unit Design with SIMD Support", IEEE Transactions on Computers, Apr. 2011.
137
[39] J.-M. Muller et al., Handbook of Floating-point Arithmetic, ISBN 978-0-8176-4704-9,
Boston : Birkhuser, 2010.
[40] P.-M. Seidel, G. Even, \On the design of fast IEEE oating-point adders", in 15th IEEE
Symposium on Computer Arithmetic, pp. 184-194, Jun. 2001.
[41] P. K. Monsson, Combined Binary and Decimal Floating-point Unit, Master Thesis, Dept.
of Information and Mathematical Modeling, Technical University of Denmark, Aug. 2008.
[42] R. D. Kenney, M. J. Schulte, and M. A. Erle, \A high-frequency decimal multiplier", in
IEEE International Conference on Computer Design: VLSI in Computers and Processors,
pp. 26-29, Oct. 2004.
[43] H. He, Z. Li, and Y. Sun, \Multiply-add fused oat point unit with on-y denormalized
number processing", in 48th Midwest Symposium on Circuits and Systems, vol. 2, pp.
1466-1468, Aug. 2005.
[44] W. Kahan, Check Whether Floating-Point Division Is Correctly Rounded, monograph,
Dept. of Computer Science, University of California, Berkeley, 1956.
[45] F. G. Gustavson, J. E. Moreira, and R. F. Enenkel, \The fused multiply-add instruc-
tion leads to algorithms for extended-precision oating point: applications to java and
high-performance computing", in 1999 conference of the Centre for Advanced Studies on
Collaborative research, 1999.
[46] R.C. Agarwal, F.G. Gustavson, and M.S. Schmookler, \Series approximation methods
for divide and square root in the Power3TM processor", in 14th IEEE Symposium on
Computer Arithmetic, pp. 116-123, 1999.
[47] I. Koren, Computer Arithmetic Algorithms, 2nd Edition, ISBN 9781568811604, A. K.
Peters, 2002.
[48] M. Ercegovac and T. Lang, Digital Arithmetic, ISBN 1558607986, Elsevier Science
(USA), 2004.
[49] B. Parhami, Computer Arithmetic - Algorithms and Hardware designs, ISBN
0195125835, Oxford University Press, 2004.
[50] A. Svoboda, \Decimal adder with signed digit arithmetic", IEEE Transactions on Com-
puters, C-18(3), pp. 212-215, 1969.
[51] H. Nikmehr, B. Phillips and C.C. Lim, \A decimal carry-free adder", in SPIE conference
on Smart Materials, Nano-, Micro-Smart Systems, pp. 786-797, 2004.
[52] A. Kaivani and G. Jaberipur, \Fully redundant decimal addition and subtraction using
stored-unibit encoding", Integration, the VLSI journal, pp. 34-41, 2010.
[53] H. Fahmy and M.J. Flynn, \The case for a redundant format in oating-point aritmetic",
in 16th IEEE Symposium on Computer Arithmetic, pp. 95-102, June 2003.
138
[54] G. Jaberipur and M. Ghodsi, \High Radix Signed Digit Number Systems: Representa-
tion Paradigms", Scientia Iranica , 10(4), pp. 383-391, 2003.
[55] G. Jaberipur and S. Gorgin, \A Nonspeculative Maximally Redundant Signed Digit
Adder", The 13th international CSI Computer Conference, pp. 235-242, 2008.
[56] John Moskal, Erdal Oruklu and Jafar Saniie, \Design and Synthesis of a Carry-Free
Signed-Digit Decimal Adder", IEEE International Symposium on Circuits and Systems,
pp. 1089-1092, 2007.
[57] L-K. Wang, M. A. Erle, C. Tsen, E. M. Schwarz, M. J. Schulte, \A survey of hardware
designs for decimal arithmetic", IBM Journal of Research and Development, vol. 54, no.
2, Mar. 2010.
[58] L-K. Wang, M. J. Schulte, J. D. Thompson and N. Jairam, \Hardware Designs for Deci-
mal Floating-Point Addition and Related Operations", IEEE Transactions on Computers,
vol. 58, no. 3, Mar. 2009.
[59] B. Shirazi, D. Yun, C. N. Zhang, \RBCD: redundant binary coded decimal adder", in
IEE Proceedings, vol. 136, no. 2, March 1989.
[60] F. Y. Busaba et al., \The IBM z900 Decimal Arithmetic Unit", in Conference Record
of the Thirty-Fifth Asilomar Conference on Signals, Systems and Computers, vol. 2, pp.
1335-1339, 2001.
[61] E. M. Schwarz, J. S. Kapernick, and M. F Cowlishaw, \Decimal oating-point support
on the IBM System z10 processor", IBM Journal of Research and Development, vol. 53,
no. 1, pp. 4:1-4:10, Apr. 2010.
[62] M. Cornea et al., \A software implementation of the IEEE 754R decimal oating-point
arithmetic using the binary encoding format", IEEE Transactions on Computers, vol. 58,
no. 2, pp. 148-162, 2009.
[63] M. A. Erle and M. J. Schulte, \Decimal Multiplication Via Carry-Save Addition", in
IEEE International Conference on Application Specic systems, Architectures, and Pro-
cessors, pp. 348-358, Jun. 2003.
[64] M. A. Erle, E. M. Schwarz, and M. J. Schulte, \Decimal multiplication with ecient
partial product generation", in 17th IEEE Symposium on Computer Arithmetic, pp. 21-28,
2005.
[65] T. Lang and A. Nannarelli, \A Radix-10 Combinational Multiplier", in 40th Asilomar
Conference on Signals, Systems and Computers, pp. 313-317, Oct. 2006.
[66] L. Dadda and A. Nannarelli, \A Variant of a Radix-10 Combinational Multiplier", in
IEEE International Symposium in Circuits and Systems (ISCAS 2008), pp. 3370-3373,
May 2008.
[67] G. Jaberipur and A. Kaivani, \Improving the Speed of Parallel Decimal Multiplication",
IEEE Transactions on Computers, vol. 58, no. 11, pp. 1539-1552, Nov. 2009.
139
[68] A. Vazquez, E. Antelo, and P. Montuschi, \Improved Design of High-Performance Par-
allel Decimal Multipliers", IEEE Transactions on Computers, vol. 59, no. 5, pp. 679-693,
May 2010.
[69] I. D. Castellanos and J. E. Stine, \Decimal partial product generation architectures", in
51st Midwest Symposium on Circuits and Systems, pp. 962-965, Aug. 2008.
[70] S. Gorgin and G. Jaberipur, \A fully redundant decimal adder and its application in
parallel decimal multipliers", Microelectronics Journal, vol. 40, no. 10, Oct. 2009.
[71] L. Dadda, \Multioperand Parallel Decimal Adder: a mixed Binary and BCD Approach",
IEEE Transactions on Computers, vol. 56, pp. 1320-1328, Oct. 2007.
[72] I. D. Castellanos and J. E. Stine, \Compressor Trees for Decimal Partial Product Re-
duction", in 18th ACM Great Lakes Symposium on VLSI, pp. 107-110, May 2008.
[73] M. A. Erle, M. J. Schulte, and B. J. Hickmann, \Decimal oating-point multiplication
via carry-save addition", in 18th IEEE Symposium on Computer Arithmetic, pp. 25-27,
2007.
[74] M. A. Erle, B. J. Hickmann, and M. A. Schulte, \Decimal Floating-Point Multiplica-
tion", IEEE Transactions on Computers, vol. 58, no. 7, pp. 902-916, Jul. 2009.
[75] A. Vazquez and E. Antelo, \Conditional Speculative Decimal Addition", in 7th Confer-
ence on Real Numbers and Computers (RNC 7), pp. 47-57, Jul. 2006.
[76] A. Vazquez, E. Antelo, and P. Montuschi, \A New Family of High-Performance Parallel
Decimal Multipliers", in 18th IEEE Symposium on Computer Arithmetic, pp. 195-204,
June 2007.
[77] C. H. Chang, J. Gu, and M. Zhang, \Ultra low-voltage low-power CMOS 4-2 and 5-2
compressors for fast arithmetic circuits", IEEE Transactions on Circuits and Systems I:
Regular Papers, vol. 51, no. 10, pp. 1985-1997, 2004.
[78] G. Jaberipur and B. Parhami, \Constant-time addition with hybrid-redundant numbers:
Theory and implementations", Integration, the VLSI journal, vol. 41, pp. 49-64, 2008.
[79] T. Aoki et al., \Signed-weight arithmetic and its application to a eld-programmable
digital lter architecture", IEICE Transactions on Electronics, vol. E82-C, no.9, pp. 1687-
1698, 1999.
[80] P. Kornerup, \Reviewing 4-to-2 Adders for Multi-Operand Addition", Journal of VLSI
Signal Processing, vol. 40, pp. 143-152, 2005.
[81] Decimal IP, SilMinds, URL:
http://www.silminds.com/decimal-products, Last access: May 17, 2013.
[82] GNU C compiler library URL:
http://gcc.gnu.org/onlinedocs/gcc/Decimal-Float.html, Last access: May 17, 2013.
140
[83] J. Thompson, M. J. Schulte, and N. Karra, \A 64-bit decimal oating-point adder", in
IEEE Computer society Annual Symposium on VLSI, pp. 297-298, Feb. 2004.
[84] A. Vazquez and E. Antelo, \A high-performance signicand BCD adder with IEEE 754-
2008 decimal rounding", in 19th IEEE Symposium on Computer Arithmetic, pp. 135-144,
Jun. 2009.
[85] S. Gorgin and G. Jaberipur, \Fully redundant decimal arithmetic", in 19th IEEE Sym-
posium on Computer Arithmetic, pp. 145-152, Jun. 2009.
[86] L.-K. Wang, M. J. Schulte, J. D. Thompson, and N. Jairam, \Hardware designs for dec-
imal oating-point addition and related operations", IEEE Transactions on Computers,
vol. 58, no. 3, pp. 322-335, Mar. 2009.
[87] L.-K. Wang and M. J. Schulte, \A decimal oating-point divider using Newton-Raphson
iteration", Journal of VLSI Signal Processing Systems, vol. 49, no. 1, pp. 3-18, Oct. 2007.
[88] L.-K. Wang and M. J. Schulte, \Decimal Floating-Point Square Root Using Newton-
Raphson Iteration", in 16th IEEE International Conference of Application-Specic Sys-
tems, Architectures and Processors, 2005.
[89] A. Vazquez, E. Antelo, and P. Montuschi,\A radix-10 SRT divider based on alternative
BCD codings", in IEEE International Conference on Computer Design, pp. 280-287, Oct.
2007.
[90] R. C. Agarwal, F. G. Gustavson, and M. S. Schmookler, \Series approximation methods
for divide and square root in the Power3TM processor", in 14th IEEE Symposium on
Computer Arithmetic, pp. 116-123, 1999.
[91] R. M. Jessani and M. Putrino, \Comparison of single- and dual-pass multiply-add fused
oating-point units", IEEE Transactions on Computers, vol. 47, no. 9, pp. 927-937, Sep.
1998.
[92] J. D. Bruguera and T. Lang, \Floating-point fused multiply-add: reduced latency for
oating-point addition", in 17th IEEE Symposium on Computer Arithmetic, pp. 42-51,
June 2005.
[93] R. Samy, H. A. H. Fahmy, R. Raafat, A. Mohamed, T. ElDeeb, and Y. Farouk, \A deci-
mal oating-point fused-multiply-add unit", in 53rd IEEE International Midwest Sympo-
sium on Circuits and Systems, pp. 529-532, Aug. 2010.
[94] R. Raafat, A. M. Abdel-Maheed, R. Samy, T. ElDeeb, Y. Farouk, M. Elkhouly, and H.
A. H. Fahmy, \A decimal fully parallel and pipelined oating point multiplier", in 42
Asilomar Conference on Signals, Systems, and Computers, Asilomar, Oct. 2008.
[95] A. Akkas and M. J. Schulte, \A decimal oating-point fused multiply-add unit with
a novel decimal leading-zero anticipator", in 22nd IEEE International Conference on
Application-specic Systems, Architectures and Processors, Sep. 2011.
141
[96] L. Han and S. Ko, \High Speed Parallel Decimal Multiplication with Redundant Internal
Encodings", IEEE Transactions on Computers, vol. 62, no. 5, pp. 956-968, May 2013.
[97] L. Han, D. Chen, K. A. Wahid, and S. Ko, \Nonspeculative decimal signed digit adder",
in 2011 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1053-1056,
May 2011.
[98] J. D. Bruguera and T. Lang, \Leading-one prediction with concurrent position correc-
tion", IEEE Transactions on Computers, vol. 48, no. 10, Oct. 1999.
[99] M. Cowlishaw, \Decimal library Performance v1.12", URL:
http://speleotrove.com/decimal/decperf.pdf, Last access: May 17, 2013.
[100] STMicroelectronics, 90nm CMOS Design Platform, 2007.
[101] A. S. Ahmed and H. A. H. Fahmy, \2010 07 d64 fma.zip", URL:
http://eece.cu.edu.eg/hfahmy/arith debug/#vectors, Last access: May 17, 2013.
[102] A. EITantawy, Decimal oating point arithmetic unit based on a fused multiply add
module, MS.c. dissertation, Electronics and Electrical Communications Engineering De-
partment of Cairo University, 2011.
[103] A. Vazquez, High-performance decimal oating-point units, Ph.D. dissertation, Elec-
tronics and Computer Engineering Department of University of Santiago de Compostela,
2009.
[104] H. A. H. Fahmy, A Redundant Digit Floating Point System, Ph.D. dissertation, Stanford
University, June 2003.
142
