On the design of IEEE compliant floating point units and their quantitative analysis by Seidel, Peter-Michael
On the Design of IEEE Compliant Floating-Point Units
and Their Quantitative Analysis
Dissertation
submitted for the degree of
Doctor of Engineering
(Dr.-Ing.) to the Faculty of Technology
in the Department of Computer Science
of the Universität des Saarlandes
by
Peter-Michael Seidel
Saarbrücken 1999
Dean: Prof. Dr. Wolfgang Paul
First reviewer: Prof. Dr. Wolfgang Paul
Second reviewer: Priv.-Doz. Dr. Silvia Müller
Date of the colloquium: 14 January 2000
Abstract
This thesis addresses the question of which issues are important in the design of a
high-speed floating-point unit (FPU) that is fully compliant with the IEEE floating-point
standard 754-1985 [19]. Several choices need to be made when designing an IEEE compliant
FPU, among them: the internal representation of floating-point numbers, the rounding
algorithms, the handling of denormal results, the use of the same rounding hardware for
different units (e.g. adder, multiplier, divider), and the implementations of the adder,
the multiplier and the divider. These choices influence both the cost and the performance
of the FPU. Nevertheless, these issues have not been discussed in the open literature to
date. This work begins to fill this gap by designing, analyzing and comparing 18
different IEEE compliant FPU implementations that consider design options regarding:
(a) the internal representation of floating-point numbers; (b) the rounding algorithms;
(c) sharing of a rounding unit, the implementation of gradual step rounding or the
implementation of dedicated rounding units for each functional unit; (d) the
implementation of the floating-point multiplier; and (e) the implementation of the
floating-point divider. The presented FPU designs also make use of the following
innovations, which were developed in the context of this work: (a) a fast implementation
of variable position rounding integrated into an FP multiplier [37]; (b) to the best of
our knowledge the fastest integrated FP addition and rounding algorithm published to date
[40]; (c) the fastest FP multiplication rounding algorithm published to date [11, 12];
(d) the fastest linear reciprocal approximation implementation published to date
[36, 39]; (e) an efficient integration of single and double precision rounding [9]; and
(f) a Booth encoded adder tree with an improved cost formula [30].
All the FPUs designed in this work are fully compliant with the IEEE standard for all
implemented operations, support both single and double precision, and deal with denormal
values and special cases in hardware. Because designing an IEEE compliant FPU is a
complex and error-prone task, all the FPU designs are specified in full detail at gate
level and the correctness of the FPU designs (in particular the compliance with the IEEE
standard) is proven. The proposed FPU implementations are analyzed and compared regarding
the hardware cost, the cycle time and the performance that they achieve on traces of the
SPECfp92 benchmark suite [17] when integrated into a pipelined RISC processor from [23].
This quantitative analysis [38] demonstrates that the choice of the rounding architecture
in the FPU has a larger impact on the performance of the microprocessor than the choice
of the FP multiplication or the FP division implementation. By comparison, the impact of
the rounding architecture choice on the cost is relatively small. The rounding
architecture that uses dedicated rounding units provides the best performance at only a
small additional cost, so that this rounding architecture seems to be the best choice in
floating-point implementations. The fast implementation of this rounding architecture is
only made possible by the fast variable position rounding implementation for multipliers
from [37]. This underlines the importance of this technique.
Short Abstract (Kurzzusammenfassung)
This thesis investigates which design decisions are the most important in the
implementation of a fast floating-point unit (FPU) that complies with the IEEE standard
754-1985 [19]. Several decisions have to be made when designing an IEEE compliant FPU,
among them: the internal representations of the floating-point (FP) numbers, the rounding
algorithms, the way denormalized results are handled, the reuse of parts of the hardware,
e.g. the use of the same rounding hardware for different units, and the implementations
of the FP adder, the FP multiplier and the FP divider. These decisions influence both the
cost and the performance of the FPU. Nevertheless, these decisions have not been
discussed in the literature so far. The present work addresses this gap. 18 different
FPUs are presented, analyzed and compared that consider options for the following
decisions: (a) internal representation of the FP numbers; (b) rounding algorithms;
(c) sharing of one general rounding unit, splitting the rounding into several steps and
sharing the implementation of a subset of these steps, or a complete dedicated
implementation of the rounding for each functional unit; (d) implementation of the FP
multiplier; (e) implementation of the FP divider. In addition, the presented FPU designs
use the following innovations that were developed in the course of this work: (a) a fast
rounding implementation for the FP multiplier with variable rounding position [37];
(b) to the best of our knowledge the fastest algorithm published so far for adding and
rounding FP numbers [40]; (c) the fastest algorithm published so far for rounding in FP
multiplication [11, 12]; (d) the fastest implementation published so far of a linear
approximation of reciprocals [36, 39]; (e) an efficient integration of rounding in single
precision and double precision [9]; (f) a Booth multiplier with reduced cost [30].
All designed FPUs are fully compliant with the IEEE FP standard 754 for all implemented
operations, support both single and double precision numbers, and handle even
denormalized results and special cases in hardware. Because the design of IEEE compliant
FPUs is a complex and error-prone task, all designed FPUs are specified in detail at gate
level and their correctness (in particular the compliance with the IEEE FP standard 754)
is proven. The presented FPU implementations are analyzed and compared with respect to
the hardware cost, the cycle time and the performance that they achieve on traces of the
SPECfp92 benchmark suite when integrated into a pipelined RISC processor from [23]. This
quantitative analysis (see also [38]) demonstrates that the choice of the rounding
architecture of an FPU has a larger influence on the processor performance than the
choice of the implementation of the FP multiplication or the FP division. In contrast,
the influence of the choice of the rounding architecture on the hardware cost is
comparatively small. The rounding architecture that uses complete dedicated rounding
implementations for each functional unit delivers by far the best performance and is only
slightly more expensive than variants with other rounding architectures. Consequently,
this rounding architecture appears to be the best choice in FP implementations. The fast
implementation of this rounding architecture was only made possible by the fast rounding
implementation for FP multipliers with variable rounding position from [37]. This
underlines the importance of this technique.
Extended Abstract
The importance of floating-point operations is increasing in recent graphics and
multimedia applications. Therefore, each modern microprocessor has to contain at least
one floating-point unit that supports and accelerates the floating-point computations. To
achieve a well-defined behavior during the computations, the floating-point support
should conform to the IEEE floating-point standard 754-1985 [19].
Despite the high demand for floating-point hardware implementations, an answer to the
question of how to design a fast IEEE compliant FP unit can rarely be found in the open
literature. Moreover, several choices need to be made when designing an IEEE compliant
FPU, among them: the internal representation of floating-point numbers, the rounding
algorithms, the handling of denormal results, the use of the same rounding hardware for
different units (e.g. adder, multiplier, divider), and the implementations of the adder,
the multiplier and the divider. These choices influence both the cost and the performance
of the FPU. Nevertheless, these issues have not been discussed in the open literature to
date. In contrast to this lack of publications about the implementation of fully IEEE
compliant FP operations or fully IEEE compliant FPUs, there are many published
implementations of specific floating-point operations for the case of normalized operands
in a specific precision, e.g. [9, 26, 27, 32, 40, 43, 44], and these implementations are
highly optimized for speed. Therefore, it is an important question how to integrate the
implementations of the different FP operations for normalized operands into a
floating-point unit that supports more than one precision, denormalized numbers and
special value results. This mainly includes the questions of which internal FP
representations should be used in an FP unit and how the microarchitecture of an FP unit
could be organized.
This work starts to fill these gaps in the open literature and to find answers to these
open questions. For this purpose, 18 different implementations of a floating-point unit
are designed, quantitatively analyzed and compared in this thesis. All proposed FP
designs provide full compliance with the IEEE FP standard 754-1985 for all implemented
operations, support both single precision and double precision operands and also handle
denormalized numbers, special values, exponent wrapping and floating-point exceptions in
hardware. The core of this work is the design and the comparison of three different FPU
microarchitectures that consider the following three options:
(I) the use of a shared general rounder for all functional units. A basic specification
of such a rounder was first described in [10]. Thereafter, this rounder was implemented
by our group, resulting in a version that will be included in [23], where a rigorous
proof of the compliance with the IEEE rounding definition will also be found. This
rounder was further optimized to be included in this thesis.
(II) a gradual rounding implementation in two steps: a first rounding step within the
functional units, assuming the case of a normalized double precision result, and a
second rounding step within a shared gradual rounder that fixes the result for all
other cases. For the integrated rounding in the functional units, assuming normalized
double precision operands and results, several algorithms from the literature could be
used. The implementation of the gradual rounder is based on the theory of gradual
rounding from [21]. This rounding technique is integrated in this thesis for fully
IEEE compliant rounding, including the handling of denormalized results, special
values, exceptions and exponent wrapping.
(III) the use of separate fully IEEE compliant rounding implementations for each
functional unit, each including the handling of denormalized numbers, special cases,
exceptions and exponent wrapping. The implementation of this microarchitecture for a
fully IEEE compliant FPU with dedicated rounding implementations is completely new in
this thesis. In particular, the integration of a variable position rounding
implementation into the multiplier, which is required to deal with denormalized
multiplication results, was one of the main problems for the implementation of this
microarchitecture and is one of the main innovations of this work [37].
Directly linked to the choice of the FP microarchitecture is the question of the internal
floating-point representations. In this work, five different internal FP representations
are defined. These are used to specify the interfaces between the functional units in
detail. In addition to the consideration of the three different microarchitectures for
the FP implementation, the implementations of the FP multiplication and the FP division
are chosen among 6 (2 × 3) variants:
• For the FP multiplication implementation, a Booth encoded adder tree is used either
in a full-sized version that is able to compute double precision and single precision
multiplications in one iteration, or in a half-sized version that computes double
precision multiplications in two iterations and single precision multiplications in
one iteration.
• For the FP division implementation, we consider three different implementations of
the Newton-Raphson iteration with an initial reciprocal approximation whose absolute
approximation error is bounded by $2^{-8}$, $2^{-16}$, and $2^{-28}$, respectively.
For this initial reciprocal approximation, a fast implementation of a linear
approximation formula using partial compressions was developed [36, 39].
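To illustrate why the accuracy of the initial reciprocal approximation matters, the following minimal Python sketch models the Newton-Raphson reciprocal iteration with exact rational arithmetic. It is only a software model under stated assumptions, not the hardware division algorithm of this thesis: the helper `initial_reciprocal` is a hypothetical stand-in that simply truncates 1/b, the accuracy goal of 53 bits is chosen for illustration, and the final IEEE rounding of the quotient is not modeled.

```python
from fractions import Fraction

TARGET_BITS = 53  # assumed accuracy goal (double precision significand width)


def initial_reciprocal(b, error_bits):
    """Hypothetical stand-in for the initial approximation: 1/b truncated to
    error_bits fractional bits, so that 0 <= 1/b - x0 < 2**-error_bits."""
    scale = 2 ** error_bits
    return Fraction((scale * b.denominator) // b.numerator, scale)


def newton_raphson_reciprocal(b, x0, target_bits=TARGET_BITS):
    """Iterate x <- x * (2 - b * x) until |1/b - x| < 2**-target_bits.

    Each step maps the error e to b * e**2, so the number of correct bits
    roughly doubles. Returns the approximation and the iteration count."""
    bound = Fraction(1, 2 ** target_bits)
    x, iterations = x0, 0
    while abs(1 / b - x) >= bound:
        x = x * (2 - b * x)
        iterations += 1
    return x, iterations


b = Fraction(3, 2)  # example significand in [1, 2)
iteration_counts = {
    bits: newton_raphson_reciprocal(b, initial_reciprocal(b, bits))[1]
    for bits in (8, 16, 28)
}
print(iteration_counts)  # {8: 3, 16: 2, 28: 1}
```

For this example operand, a more accurate starting value saves iterations, which is the trade-off between lookup cost and latency that the three division variants explore.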
In addition to the different design choices for the internal FP representations, the
rounding microarchitecture, and the FP multiplication and FP division implementations,
the presented FPU designs also make use of the following innovations, which were
developed in the context of this work:
(a) a fast implementation of variable position rounding for FP multiplication [37];
(b) to the best of our knowledge the fastest integrated FP addition and rounding
algorithm published to date [40];
(c) the fastest FP multiplication rounding algorithm published to date [11, 12];
(d) the fastest linear reciprocal approximation implementation published to date
[36, 39];
(e) an efficient integration of single and double precision rounding for FP
multiplication [9];
(f) a Booth encoded adder tree with an improved cost formula [30].
The proposed FPUs are quantitatively analyzed regarding the hardware cost, the cycle
time and the performance. The hardware cost and the cycle time are measured using the
formal hardware model from [22]. The performance of the FP units is analyzed on traces
of the SPECfp92 benchmark suite when integrated into a pipelined RISC processor from [23].
This quantitative analysis (see also [38]) demonstrates that the choice of the rounding
microarchitecture in the FPU has a larger impact on the performance of the
microprocessor than the choice of the FP multiplication or the FP division
implementation. By comparison, the impact of the microarchitecture choice on the cost is
relatively small. The microarchitecture that uses dedicated rounding units provides the
best performance at only a small additional cost, so that this rounding architecture
seems to be the best choice in floating-point implementations.
Summary (Zusammenfassung)
Floating-point operations are becoming increasingly important in today's graphics and
multimedia applications. Current microprocessors therefore contain at least one
floating-point unit that supports and accelerates the floating-point computations. To
obtain a well-defined behavior of the floating-point computations, the floating-point
support should conform to the IEEE floating-point standard 754-1985 [19].
Despite the great importance of floating-point implementations in hardware, the open
literature offers only sparse answers to the question of how to design a fast IEEE
compliant FP unit. Moreover, several decisions have to be made when designing an IEEE
compliant FPU, among them: the choice of the internal representations of the
floating-point (FP) numbers, the rounding algorithms, the way denormalized results are
handled, the reuse of parts of the hardware, e.g. the use of the same rounding hardware
for different units, and the implementations of the FP adder, the FP multiplier and the
FP divider. These decisions influence both the cost and the performance of the FPU.
Nevertheless, these decisions have not been discussed in the literature so far.
In contrast to this lack of publications about the implementation of IEEE compliant
FPUs, there are a number of published implementations of individual floating-point
operations for the case of normalized operands in a fixed precision, e.g.
[9, 26, 27, 32, 40, 43, 44], and these implementations are optimized for speed. It is
therefore an important and interesting question how these implementations of individual
FP operations for normalized operands can be integrated into an FPU that supports more
than one FP type and also handles denormalized numbers and special values. This mainly
involves the questions of which internal FP number representations should be used in an
FP unit and how the architecture of an FP unit is to be organized.
The present work addresses this gap. For this purpose, 18 different FP implementations
are designed, quantitatively analyzed and compared in this thesis. All presented FPU
designs are fully compliant with the IEEE standard 754-1985 for the FP operations that
they implement, support both single precision and double precision operands, and also
handle denormalized results, special values, exponent wrapping and FP exceptions in
hardware. The core of this work is the design and the comparison of three different FPU
architectures that consider the following options:
(I) the use of a shared general rounder for all functional units. A basic specification
of such a rounder was first described in [10]. Thereafter, this rounder was implemented
in our group in a version that will be presented in [23]. This rounder was further
optimized for the present work.
(II) a rounding implementation in two steps (gradual rounding): a first rounding step in
the functional units under the assumption of normalized results in double precision,
and a second rounding step in a shared gradual rounder that adjusts the result for all
other cases (not double precision, or not a normalized result). For the rounding in the
functional units under the assumption of normalized double precision results, various
algorithms from the open literature can be used. The implementation of the gradual
rounder is based on the theory from [21]. In the present work, this rounding principle
is integrated for fully IEEE compliant rounding, taking into account denormalized
results, special values, exceptions and exponent wrapping.
(III) the use of separate fully IEEE compliant rounding implementations for each
functional unit, each of which independently handles denormalized results, special
values, exceptions and exponent wrapping according to the IEEE standard. The
implementation of this architecture of an IEEE compliant FPU with dedicated rounding
implementations is completely new in this work. In particular, the integration of
variable position rounding into the multiplier, which is required to handle
denormalized multiplication results, is one of the main problems of this FPU
architecture, and the described implementation is therefore one of the most important
innovations of the present work [37].
Directly linked to the choice of the FPU architecture is the question of which internal
FP representations to use. In this work, five different internal FP representations are
defined. These are then used to specify the interfaces between the functional units
simply but in detail.
In addition to considering the three different FPU architectures, we choose the
implementations of the FP multiplication and the FP division among 6 (2 × 3) different
variants:
• For the implementation of the FP multiplication, either a full-sized Booth2
multiplier is used, which can compute both single and double precision
multiplications in one iteration, or a half-sized Booth2 multiplier is used, which
computes single precision multiplications in one iteration and double precision
multiplications in two iterations.
• For the implementation of the FP division, we consider three different
implementations of the Newton-Raphson iteration with an initial approximation of the
reciprocal $1/f_b$ with an absolute approximation error smaller than $2^{-8}$,
$2^{-16}$ and $2^{-28}$, respectively. For this approximation of the reciprocal, a
fast implementation of a linear approximation formula using a partial compression was
developed [36, 39].
In addition, the presented FPU designs use the following innovations that were developed
in the course of this work:
(a) a fast rounding implementation for the FP multiplier with variable rounding
position [37];
(b) to the best of our knowledge the fastest algorithm published so far for adding and
rounding FP numbers [40];
(c) the fastest algorithm published so far for rounding in FP multiplication [11, 12];
(d) the fastest implementation published so far of a linear approximation of
reciprocals [36, 39];
(e) an efficient integration of rounding in single precision and double precision [9];
(f) a Booth multiplier with reduced cost [30].
The presented FPU implementations are analyzed and compared with respect to the hardware
cost, the cycle time and the performance that they achieve on traces of the SPECfp92
benchmark suite when integrated into a pipelined RISC processor from [23]. This
quantitative analysis (see also [38]) shows that the choice of the rounding architecture
of an FPU has a larger influence on the processor performance than the choice of the
implementation of the FP multiplication or the FP division. In contrast, the influence
of the choice of the rounding architecture on the hardware cost is comparatively small.
The rounding architecture that uses complete dedicated rounding implementations for each
functional unit delivers by far the best performance and is only slightly more expensive
than variants with other rounding architectures. Consequently, this rounding
architecture appears to be the best choice in FP implementations.
Contents
1 Introduction
2 IEEE Floating-Point Standard
2.1 Notation
2.2 Numbers and Operations
2.2.1 Factorings
2.2.2 IEEE Numbers
2.2.3 Packed IEEE Floating-Point Format
2.2.4 Operations
2.3 Rounding
2.3.1 IEEE Rounding Definition
2.3.2 Rounding Functions
2.3.3 IEEE Rounding Functions
2.4 Special Cases
2.4.1 IEEE Flags
2.4.2 Exceptions
2.4.3 Operations on Special Values
2.4.4 Summary of IEEE Computations
2.5 Rounding Computation Utilities
2.5.1 Representatives
2.5.2 Injection Based Rounding
2.5.3 Gradual Rounding
2.6 Internal Representations
2.6.1 Packed Format
2.6.2 Unpacked Format
2.6.2.1 Packed Format → Unpacked Format
2.6.2.2 Unpacked Format → Packed Format
2.6.3 Normalized Format
2.6.3.1 Unpacked Format → Normalized Format
2.6.3.2 Normalized Format → Unpacked Format
2.6.4 Representative Format
2.6.5 Gradual Result Format
3 FP Microarchitectures
4 Basic FP Operations
4.1 Internal Format Conversions
4.1.1 Unpacking I-III (packed → normalized format)
4.1.2 General Rounding I (representative → packed format)
4.1.3 Gradual Rounding II (gradual result → packed format)
4.1.4 Packing III (normalized → packed format)
4.2 Addition/Subtraction
4.2.1 Addition/Subtraction I (normalized → representative format)
4.2.2 Addition/Subtraction II (normalized → gradual result format)
4.2.3 Addition/Subtraction III (normalized → normalized format)
4.3 Multiplication
4.3.1 Multiplication I (normalized → representative format)
4.3.2 Multiplication II (normalized → gradual result format)
4.3.3 Multiplication III (normalized → normalized format)
4.4 Division
4.4.1 Initial Reciprocal Approximation
4.4.1.1 Approximation formula
4.4.1.2 Redundant Booth-Digit Representations
4.4.1.3 Implementation
4.4.2 Division I (normalized → representative format)
4.4.2.1 Approximation of the quotient (step 1.)
4.4.2.2 Computation of the p-representative for $f_{rc}$ (step 2.)
4.4.3 Division II (normalized → gradual result format)
4.4.4 Division III (normalized → normalized format)
5 Evaluation
Chapter 1
Introduction
The importance of floating-point operations is increasing in recent graphics and
multimedia applications. Therefore, each modern microprocessor has to contain at least
one floating-point unit that supports and accelerates the floating-point computations. To
achieve a well-defined behavior during the computations, the floating-point support
should conform to the IEEE floating-point standard 754-1985 [19]. Compliance with this
IEEE specification could also be achieved by supporting parts of it in software, but for
high-performance systems a hardware solution is preferable.
Despite the high demand for floating-point hardware implementations, a full answer to
the question of how to design a fast IEEE compliant FP unit can rarely be found in the
open literature. Moreover, several choices need to be made when designing an IEEE
compliant FPU, among them: the internal representation of floating-point numbers, the
rounding algorithms, the handling of denormal results, the use of the same rounding
hardware for different units (e.g. adder, multiplier, divider), and the implementations
of the adder, the multiplier and the divider. These choices influence both the cost and
the performance of the FPU. Nevertheless, these issues have not been discussed in the
open literature to date. In contrast to this lack of publications about the
implementation of fully IEEE compliant FP operations or fully IEEE compliant FPUs, there
are many publications about the implementation of specific floating-point operations for
the case of normalized operands in a specific precision, e.g.
[9, 26, 27, 32, 40, 43, 44], and these implementations are highly optimized for speed.
Therefore, it is an important question how to integrate the implementations of the
different FP operations for normalized operands into a floating-point unit that supports
more than one precision, denormalized numbers and special value results. This mainly
includes the questions of which internal FP representations should be used in an FP unit
and how the microarchitecture of an FP unit could be organized.
We present an answer to this question by developing and comparing three different
rounding microarchitectures for an FP unit:
(I) In the first microarchitecture, all the rounding computations are concentrated in a
shared general rounding unit. This rounding unit handles the rounding for all IEEE
results, including the exponent wrapping and the FP exceptions, for both single and
double precision operations. A basic specification of such a rounder was first
described in [10]. Thereafter, this rounder was implemented by our group, resulting in
a version that will be included in [23], where a rigorous proof of the compliance with
the IEEE rounding definition will also be found. This rounder is further optimized in
this thesis.
(II) In the second microarchitecture, the rounding for the case of normalized double
precision results is computed within each functional unit, and this rounded result is
fixed for all the remaining cases in a second rounding step implemented by a shared
gradual rounding unit. For the integrated rounding in the functional units, assuming
normalized double precision operands and results, several algorithms from the
literature could be used. The implementation of the gradual rounder is based on the
theory of gradual rounding from [21]. This rounding technique is applied in this thesis
for fully IEEE compliant rounding, including the handling of denormalized results,
special values, exceptions and exponent wrapping.
(III) With the third rounding architecture, a completely new architecture for an IEEE
compliant FPU is proposed. In this architecture no rounding hardware is shared; instead,
each functional unit contains a dedicated rounding implementation that computes full
IEEE rounding, considering denormal and special values, exceptions and exponent
wrapping. The particular problem with this microarchitecture is the implementation of
the floating-point multiplication. The floating-point multiplier conventionally
requires normalized significands in its operands and delivers an almost normalized
result. For the fast integration of IEEE rounding into the FP multiplier, the
significand has to be rounded in parallel with the multiplication computations. For
the case of denormalized results, this rounding has to be computed at a variable
rounding position, which could be at any position within the significand. The idea of
how to integrate such a variable position rounding into the multiplication
implementation is the key concept for this microarchitecture. Such an implementation is
developed in this work. Because such a multiplication implementation allows working on
normalized FP representations (even for denormalized values) as inputs and outputs,
the internal FP representations can be changed to normalized FP representations for
this microarchitecture.
To find out the impact of the microarchitecture choice on the quality of the floating-point
implementation, we model the performance and the cost of designs that differ in the
microarchitecture they use. A comparison of three FP designs would already make this
possible, but to improve the expressiveness of the comparison, and to be able
to compare the rounding architectures under several conditions, we additionally vary the
FP multiplication and FP division implementation for each FP microarchitecture. For
this purpose, we choose between two different FP multiplication and three different FP
division implementations.
• For the FP multiplication implementation, a Booth encoded adder tree is used either
in a full-sized version that is able to compute double precision and single precision
multiplications in one iteration, or in a half-sized version that computes double
precision multiplications in two iterations and single precision multiplications in one
iteration. For the Booth encoded adder trees the constructions from [30], for which we
improved the cost formula, are used.
• For the FP division implementation, we consider three different implementations
of the Newton-Raphson iteration with an initial reciprocal approximation whose
absolute approximation error is bounded by $2^{-8}$, $2^{-16}$, and $2^{-28}$, respectively.
For this initial reciprocal approximation a fast implementation of a linear approximation
formula using partial compressions is used, which we developed in [36, 39].
In combination with the three microarchitectures, these options yield a comparison
of 18 different FP implementations.
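The effect of the accuracy of the initial reciprocal approximation can be illustrated by
a small experiment. The following Python sketch is only an illustration under stated
assumptions and not one of the divider designs of this thesis: it uses the textbook
Newton-Raphson recurrence $x \leftarrow x \cdot (2 - b \cdot x)$ for $1/b$ in exact rational
arithmetic, starts with the largest allowed initial error, and counts the steps until the
error drops below $2^{-53}$. The function name `iterations_needed`, the operand $b = 3/2$ and
the target accuracy are hypothetical choices made for this example.

```python
from fractions import Fraction

def iterations_needed(b, initial_error_bits, target_bits=53):
    """Count Newton-Raphson steps x <- x*(2 - b*x) until |1/b - x| < 2**-target_bits."""
    # Assumption: start with the worst initial error that the bound allows.
    x = 1 / b + Fraction(1, 2 ** initial_error_bits)
    steps = 0
    while abs(1 / b - x) >= Fraction(1, 2 ** target_bits):
        x = x * (2 - b * x)        # the error becomes b * error**2 in every step
        steps += 1
    return steps

b = Fraction(3, 2)                  # hypothetical divisor significand in [1, 2)
for bits in (8, 16, 28):
    print(bits, iterations_needed(b, bits))
```

Because the error is roughly squared in each step, a more accurate (and more expensive)
initial approximation saves iterations, which is the trade-off that the three division
options explore.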
All the FPUs designed in this work are fully compliant with the IEEE standard for all
implemented operations, support both single and double precision, and deal with
denormalized values and special cases in hardware. Because designing an IEEE compliant
FPU is a complex and error-prone task, all the FPU designs are specified in full detail at
gate level, and the correctness of the FPU designs (in particular the compliance with the
IEEE standard) is proven.
The performance of the designs is measured by a trace-driven run-time simulation of
an R3000-like pipelined RISC processor [22, 23] that integrates the proposed floating-point
implementations. The simulations are computed on traces of the SPECfp92 benchmark
suite [17]. The costs of the designs are modeled by counting the gates that are required
by the different implementations. Thus, based on the performance and the cost of each
FP design, the quality of the FP designs and, in particular, the quality of the rounding
microarchitectures can be compared.
This thesis is partitioned into the following chapters. Chapter 2 presents the definitions
of the IEEE FP standard in preparation for the description of the FP implementations.
The basic description of the FP standard is similar to the description in [10, 23].
Moreover, in this chapter a general framework for the integrated description of different
rounding functions is developed. Rigorous correctness proofs for the partitioning of full
IEEE compliant rounding into these rounding functions are given. This chapter also
provides computation utilities for the implementation of these rounding functions. As one
important basic concept, injection-based rounding [9, 40, 11, 12] is introduced. Finally,
this chapter prepares the internal FP representations, by which the interfaces between the
functional units and the shared rounding hardware are specified. Chapter 3 gives an
overview of the requirements on the implementation of an FPU and describes the
microarchitectures and the design choices for the proposed FP designs. Chapter 4 describes
the implementations of all basic FP operations for all three microarchitectures. In
combination with a detailed description of the implementations at gate level, the
correctness of the designs and the compliance with the IEEE standard are proven. Finally,
in Chapter 5 the proposed FPU implementations are quantitatively analyzed and compared.
Chapter 2
IEEE Floating-Point Standard
The IEEE floating-point Standard 754-1985 [19] specifies floating-point number formats,
operations and exception handling in detail. This chapter presents its information in a
slightly different form following [10, 23].
2.1 Notation
We denote real values by lower-case names $xyz$ and bit-strings by small-capital names
$\textsc{xyz}$. The single bits of a bit-string $\textsc{xyz} \in \{0,1\}^n$ can be indexed by
$\textsc{xyz}[n_2 : n_1] = (\textsc{xyz}[n_2], \ldots, \textsc{xyz}[n_1])$ with integers
$n_2 = n_1 + n - 1$. The operation $\langle \textsc{xyz}[n_2 : n_1] \rangle$ defines the binary
value of $\textsc{xyz}[n_2 : n_1]$, $\langle \textsc{xyz}[n_2 : n_1] \rangle_2$ defines the value of
$\textsc{xyz}[n_2 : n_1]$ interpreted as a two's complement number, and
$\langle \textsc{xyz}[n_2 : n_1] \rangle_{bias_n}$ defines the value of $\textsc{xyz}[n_2 : n_1]$
interpreted as a biased binary number, which includes the bias $bias_n = 2^{n-1} - 1$:
\begin{align*}
\langle \textsc{xyz}[n_2 : n_1] \rangle &= \sum_{i=n_1}^{n_2} \textsc{xyz}[i] \cdot 2^{i} \\
\langle \textsc{xyz}[n_2 : n_1] \rangle_2 &= -\textsc{xyz}[n_2] \cdot 2^{n_2} + \sum_{i=n_1}^{n_2-1} \textsc{xyz}[i] \cdot 2^{i} \\
\langle \textsc{xyz}[n_2 : n_1] \rangle_{bias_n} &= \sum_{i=n_1}^{n_2} \textsc{xyz}[i] \cdot 2^{i} - bias_n .
\end{align*}
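As an illustration, the three interpretations of a bit-string can be evaluated directly from
their defining sums. The following Python sketch is not part of the thesis; the function
names `binary_value`, `twos_value` and `biased_value` are hypothetical, a bit-string is
modeled as a list of bits from the left index $n_2$ down to the right index $n_1$, and the
bias is taken as $2^{n-1} - 1$ with $n$ the length of the string.

```python
from fractions import Fraction

def binary_value(bits, n1=0):
    """<xyz[n2:n1]>: bits are listed from index n2 (left) down to n1 (right)."""
    n2 = n1 + len(bits) - 1
    return sum(b * Fraction(2) ** i for b, i in zip(bits, range(n2, n1 - 1, -1)))

def twos_value(bits, n1=0):
    """<xyz[n2:n1]>_2: the leftmost bit xyz[n2] carries the weight -2**n2."""
    n2 = n1 + len(bits) - 1
    return -bits[0] * Fraction(2) ** n2 + binary_value(bits[1:], n1)

def biased_value(bits, n1=0):
    """<xyz[n2:n1]>_bias_n with bias_n = 2**(n-1) - 1 and n = len(bits)."""
    bias = 2 ** (len(bits) - 1) - 1
    return binary_value(bits, n1) - bias

x = [1, 0, 1, 1]
print(binary_value(x), twos_value(x), biased_value(x))  # 11 -5 4
```

The same four bits thus denote 11, -5 or 4, depending on the interpretation that is applied.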
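To avoid negative indices, we allow the right index of a bit-string to be larger than the
left index, as in $\textsc{xyz}[n_1 : n_2]$. Then, we define a second version of the operations
$\langle\,\rangle$ and $\langle\,\rangle_2$ that interprets the indices as multiplied by $(-1)$.
These operations are defined by
\begin{align*}
\langle \textsc{xyz}[n_1 : n_2] \rangle_{neg} &= \sum_{i=n_1}^{n_2} \textsc{xyz}[i] \cdot 2^{-i} \\
\langle \textsc{xyz}[n_1 : n_2] \rangle_{2neg} &= -\textsc{xyz}[n_1] \cdot 2^{-n_1} + \sum_{i=n_1+1}^{n_2} \textsc{xyz}[i] \cdot 2^{-i} .
\end{align*}
The operation $bin_{\alpha}^{\alpha+n-1}(x) : \mathbb{R} \to \{0,1\}^n$ computes the bit-string of
the binary representation of $x$ of length $n$ from the bit-position with weight $2^{\alpha}$ to
the bit-position with weight $2^{\alpha+n-1}$. If $x$ has two different binary representations, we
choose the binary representation with finite length, so that in $x = \sum_i \textsc{x}[i] \cdot 2^{i}$
the $\textsc{x}[i]$ are unique and $bin_{\alpha}^{\alpha+n-1}(x)$ can be written as:
\[ bin_{\alpha}^{\alpha+n-1}(x) = \textsc{x}[\alpha+n-1 : \alpha] . \]
For $\textsc{x} \in \{0,1\}^n$ and $s \in \{0,1\}$ we define
\[ \overline{\textsc{x}} = (\overline{\textsc{x}[n-1]}, \ldots, \overline{\textsc{x}[0]})
   \quad\text{and}\quad
   \textsc{x} \oplus s = (\textsc{x}[n-1] \oplus s, \ldots, \textsc{x}[0] \oplus s) . \]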
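Some crucial properties of two's complement numbers are (see [MP95])
\begin{align*}
\langle 0, x[n-1:0] \rangle_2 &= \langle x[n-1:0] \rangle \\
-\langle x[n-1:0] \rangle_2 &= \langle \overline{x}[n-1:0] \rangle_2 + 1 \\
\langle x[n-1], x[n-1:0] \rangle_2 &= \langle x[n-1:0] \rangle_2 \\
\langle x[n-1:0] \rangle_2 &\equiv \langle x[n-2:0] \rangle \bmod 2^{n-1} .
\end{align*}
From these properties one immediately derives the basic subtraction algorithm for binary
numbers. Let $x, y \in \{0,1\}^n$ and let $\langle x \rangle \ge \langle y \rangle$. Because
$2^n > \langle x \rangle - \langle y \rangle$, it suffices to compute the result modulo $2^n$. Thus
\begin{align*}
\langle x \rangle - \langle y \rangle &= \langle 0, x \rangle_2 - \langle 0, y \rangle_2 \\
 &= \langle 0, x \rangle_2 + \langle 1, \overline{y} \rangle_2 + 1 \\
 &\equiv \langle x \rangle + \langle \overline{y} \rangle + 1 \bmod 2^n .
\end{align*}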
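The subtraction rule can be checked on small operands. The following Python sketch is
only an illustration; the function name `subtract` is hypothetical, and bit-strings are
modeled as non-negative integers below $2^n$.

```python
def subtract(x, y, n):
    """<x> - <y> for n-bit operands with <x> >= <y>, computed as <x> + <not y> + 1 mod 2**n."""
    mask = (1 << n) - 1
    not_y = ~y & mask              # bitwise complement of the n-bit string y
    return (x + not_y + 1) & mask  # reduction modulo 2**n

print(subtract(13, 6, 4))  # 7
```

The adder needs no dedicated subtraction hardware: inverting the second operand and
setting the carry-in to one is sufficient.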
Lemma 2.1 Biased number strings $x[n-1:0] \ne 1^n$ can be converted to two's complement
number strings by (i) an increment and the inversion of the sign bit. Using
$\langle y[n-1:0] \rangle = \langle x[n-1:0] \rangle + 1$, we have:
\[ \langle (0, x[n-1:0]) \rangle_{bias_n} = \langle (\overline{y[n-1]}, \overline{y[n-1]}, y[n-2:0]) \rangle_2 . \]
(ii) In the conversion, the order of the sign bit inversion and the increment can also
be reversed, so that:
\[ \langle (0, x[n-1:0]) \rangle_{bias_n} = \langle (\overline{x[n-1]}, \overline{x[n-1]}, x[n-2:0]) \rangle_2 + 1 . \]
Proof: (i):
\begin{align*}
\langle (0, x[n-1:0]) \rangle_{bias_n} &= \langle (0, x[n-1:0]) \rangle_2 - bias_n \\
 &= \langle (0, x[n-1:0]) \rangle_2 + \langle (1, 1\,0^{n-2}\,1) \rangle_2 \\
 &= \langle (0, y[n-1:0]) \rangle_2 + \langle (1, 1\,0^{n-2}\,0) \rangle_2 \\
 &= \langle (\overline{y[n-1]}, \overline{y[n-1]}, y[n-2:0]) \rangle_2 .
\end{align*}
(ii):
\begin{align*}
\langle (0, x[n-1:0]) \rangle_{bias_n} &= \langle (0, x[n-1:0]) \rangle_2 + \langle (1, 1\,0^{n-2}\,1) \rangle_2 \\
 &= \langle (\overline{x[n-1]}, \overline{x[n-1]}, x[n-2:0]) \rangle_2 + 1 .
\end{align*}
$\Box$
Lemma 2.2 In the other direction, two's complement number strings
$x[n-1:0] \ne (1, 0^{n-1})$ can be converted to biased number strings by an inversion of
the sign bit and a decrement. Using
$\langle y[n-1:0] \rangle = \langle (\overline{x[n-1]}, x[n-2:0]) \rangle - 1$, we have:
\[ \langle x[n-1:0] \rangle_2 = \langle y[n-1:0] \rangle_{bias_n} . \]
Proof:
\begin{align*}
\langle x[n-1:0] \rangle_2 &= \langle (x[n-1], x[n-1:0]) \rangle_2 + bias_n - bias_n \\
 &= \langle (x[n-1], x[n-1:0]) \rangle_2 + \langle (0, 0\,1^{n-1}) \rangle_2 - bias_n \\
 &= \langle (x[n-1], x[n-1:0]) \rangle_2 + \langle (0, 1\,0^{n-1}) \rangle_2 - 1 - bias_n \\
 &= \langle (0, \overline{x[n-1]}, x[n-2:0]) \rangle_2 - 1 - bias_n \\
 &= \langle (\overline{x[n-1]}, x[n-2:0]) \rangle - 1 - bias_n \\
 &= \langle y[n-1:0] \rangle - bias_n \\
 &= \langle y[n-1:0] \rangle_{bias_n} .
\end{align*}
$\Box$
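Both conversions can be checked exhaustively for a small exponent width. The following
Python sketch is an illustration only; the function names are hypothetical, bit-strings are
modeled as non-negative integers, and the decrement of Lemma 2.2 is carried out on the
binary value of the string.

```python
def biased_to_twos(x, n):
    """Lemma 2.1 (i): increment, then invert bit n-1 and duplicate it as the sign bit.
    x is an n-bit biased string (x != 2**n - 1); the result is an (n+1)-bit string."""
    y = x + 1
    top = ((y >> (n - 1)) & 1) ^ 1           # inverted bit y[n-1]
    low = y & ((1 << (n - 1)) - 1)           # y[n-2:0]
    return (top << n) | (top << (n - 1)) | low

def twos_to_biased(x, n):
    """Lemma 2.2: invert the sign bit of the n-bit string x, then decrement."""
    return (x ^ (1 << (n - 1))) - 1

def twos_value(bits, width):
    """Value of a width-bit two's complement string given as an integer."""
    return bits - (1 << width) if bits >> (width - 1) else bits

# n = 8 (single precision exponent), bias_8 = 127
print(twos_value(biased_to_twos(1, 8), 9))   # biased string 1 encodes the exponent -126
print(twos_to_biased(-126 % 256, 8))         # and back to the biased string 1
```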
2.2 Numbers and Operations
2.2.1 Factorings
Every real number $x$ can be factored into a sign factor (determined by a sign-bit $s$), a
scale factor (determined by an exponent $e$) and a significand $f$:
\[ x = (-1)^s \cdot f \cdot 2^e . \]
The triple $(s, e, f)$ is called a factoring and the operation
\[ x = val(s, e, f) = (-1)^s \cdot f \cdot 2^e \]
computes the value of this factoring.
Although every factoring $(s, e, f)$ is mapped to exactly one real number $x$ by the
operation $val(s, e, f)$, every real number $x$ can be represented by infinitely many different
factorings that correspond to the same value:
\[ x = val(sign(x), 0, |x|) = val(sign(x), -1, 2 \cdot |x|) = \cdots . \]
Definition 2.1 For a set of numbers $X$, we define the set of factorings, $FACT(X)$, that
represent numbers of $X$ by
\[ FACT(X) = \{ (s, e, f) \mid s \in \{0,1\},\ e \in \mathbb{Z},\ f \in \mathbb{R} \text{ and } val(s, e, f) \in X \} . \]
To define a unique factoring representation of a real number, normalized factorings are
introduced:
Definition 2.2 A normalized factoring $(s', e', f')$ is a factoring with $s' \in \{0,1\}$,
$e' \in \mathbb{Z}$, $f' \in [1, 2[$. The condition $f' \in [1, 2[$ defines a normalized significand $f'$.
Thus, every non-zero real value can be represented by a unique normalized factoring.
Definition 2.3 For all non-zero factorings $(s, e, f)$ with $f \ne 0$, we define the operation
$\eta(s, e, f) = (s', e', f')$ to compute the normalized factoring $(s', e', f')$, so that
$val(s', e', f') = val(s, e, f)$. For factorings of zero with $f = 0$, we define $\eta$ to compute
the identity function: $\eta(s, e, 0) = (s, e, 0)$. Because the exponent range is not limited in
the normalization operation $\eta$, $\eta$ is called an unbounded normalization shift. The result of
an unbounded normalization shift, $(s', e', f')$, is called an unbounded normalized factoring.
Note that from the definition of the unbounded normalization shift for zeros it follows
that all factorings of zero with $f = 0$ are unbounded normalized as well.
Lemma 2.3 (i) For $f \ne 0$ and $k = -\lfloor \log_2(f) \rfloor$, the unbounded normalization shift
$\eta(s, e, f)$ can be computed by: $\eta(s, e, f) = (s, e - k, f \cdot 2^{k})$.
(ii) If $2^{-\alpha} \le f < 2$, $k$ can be interpreted as the number of leading zeros $lz$ of the
binary representation $bin_{-\alpha}^{0}(f)$, so that $\eta(s, e, f) = (s, e - lz, f \cdot 2^{lz})$.
Proof: (i) The result of the unbounded normalization shift $\eta(s, e, f)$ has to be the
normalized factoring of $(s, e, f)$. Therefore, $(s, e - k, f \cdot 2^{k})$ has to fulfill the properties
(1) $val(s, e - k, f \cdot 2^{k}) = val(s, e, f)$ and (2) $f \cdot 2^{k} \in [1, 2[$.
(1) $val(s, e - k, f \cdot 2^{k}) = (-1)^s \cdot f \cdot 2^{k} \cdot 2^{e-k} = (-1)^s \cdot f \cdot 2^{e} = val(s, e, f)$.
(2) From $-\log_2(f) \le -\lfloor \log_2(f) \rfloor < -\log_2(f) + 1$, it follows that
\begin{align*}
f \cdot 2^{-\log_2(f)} &\le f \cdot 2^{-\lfloor \log_2(f) \rfloor} < f \cdot 2^{-\log_2(f)+1} \\
f / f &\le f \cdot 2^{-\lfloor \log_2(f) \rfloor} < 2 f / f ,
\end{align*}
and, therefore, $f \cdot 2^{k} \in [1, 2[$, as required.
(ii) We know from (i) that $f \cdot 2^{k} \in [1, 2[$, and therefore $f \in [2^{-k}, 2^{-k+1}[$. From the
condition $f < 2^{-k+1}$, it follows that in the binary representation of $f$,
$f[0:\alpha] = bin_{-\alpha}^{0}(f)$, the bits $f[0 : k-1]$ have to be zero. From $f \ge 2^{-k}$ and
$f[0 : k-1] = 0^{k}$, it follows that $f[k] = 1$. Thus, $f[0:\alpha]$ contains exactly $lz = k$ leading
zeros and the lemma follows. $\Box$
Definition 2.4 In contrast to the definition of an unbounded normalization shift, we
define a bounded normalization shift of $(s, e, f)$ with the exponent bound $\beta$ by the
operation $\lfloor \eta_{\beta} \rfloor (s, e, f) = (s'', e'', f'')$:
\[
\lfloor \eta_{\beta} \rfloor (s, e, f) =
\begin{cases}
(s', e', f') = \eta(s, e, f) & \text{if } e' \ge \beta \\
(s, \beta, f \cdot 2^{e-\beta}) & \text{otherwise,}
\end{cases}
\tag{2.1}
\]
i.e., the factoring $(s'', e'', f'')$ is normalized only if the normalization operation does not
produce an exponent smaller than $\beta$. The result of a bounded normalization shift,
$(s'', e'', f'')$, is called a bounded normalized factoring. From
$val(s, e, f) = val(s, \beta, f \cdot 2^{e-\beta})$ and definition 2.3, it follows that the bounded
normalization shift does not change the value of the factoring either, and we have
$val(\lfloor \eta_{\beta} \rfloor (s, e, f)) = val(s, e, f)$.
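The two normalization shifts can be modeled in a few lines of exact rational arithmetic.
The following Python sketch is an illustration only and says nothing about the hardware
shifters designed later; the function names `val`, `eta` and `eta_bounded` and the
parameter name `bound` are hypothetical, and the significand is expected as a
`Fraction`.

```python
from fractions import Fraction

def val(s, e, f):
    """Value of the factoring (s, e, f)."""
    return (-1) ** s * f * Fraction(2) ** e

def eta(s, e, f):
    """Unbounded normalization shift: scale f into [1, 2) and adjust e accordingly."""
    if f == 0:
        return (s, e, f)           # factorings of zero are left unchanged
    while f >= 2:
        f, e = f / 2, e + 1
    while f < 1:
        f, e = f * 2, e - 1
    return (s, e, f)

def eta_bounded(s, e, f, bound):
    """Bounded normalization shift: never produce an exponent below `bound`."""
    s1, e1, f1 = eta(s, e, f)
    if e1 >= bound:
        return (s1, e1, f1)
    return (s, bound, f * Fraction(2) ** (e - bound))

print(eta(0, 0, Fraction(3, 16)))              # normalized: exponent -3, significand 3/2
print(eta_bounded(0, 0, Fraction(3, 16), -2))  # exponent clamped at -2, significand 3/4
```

In the second call the exponent stops at the bound, and the significand stays below one,
which is exactly the situation of a denormalized number.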
2.2.2 IEEE Numbers
Floating-point number types form subsets of the reals. They can be represented by
factorings with limited and discretized value ranges for exponents and significands. The
IEEE floating-point types are defined by describing the possible choices for the sign, the
exponent and the significand of a factoring and by the definition of some special values,
so that each IEEE floating-point type (precision) consists of:
• Normalized numbers are represented by normalized factorings $(s', e', f')$, where the
exponent $e'$ is an integer in the range $e_{min} \le e' \le e_{max}$ and the significand $f'$ belongs
to the discrete set $\langle f'[0 : p-1] \rangle_{neg} \in \{1, 1 + 2^{-p+1}, \ldots, 2 - 2^{-p+1}\}$. The
condition $f' \in [1, 2[$ defines a normalized significand $f'$.
• Denormalized numbers are represented by factorings $(s, e, f)$, where the exponent
is $e = e_{min}$ and the significand $f$ belongs to the discrete set
$\langle f[0 : p-1] \rangle_{neg} \in \{0, 2^{-p+1}, \ldots, 1 - 2^{-p+1}\}$. As $f \in [0, 1[$, and thus
$f \notin [1, 2[$, the significand is called denormalized.
Figure 2.1: Geometry of IEEE floating-point numbers.
• Special values are defined by the set consisting of $+\infty$, $-\infty$ and two types of Not
a Number (NaN): signaling NaN (sNaN) and quiet NaN (qNaN). These values cannot
be represented by factorings with finite exponents. Therefore, special bit strings
are used for the representation of special numbers. Nevertheless, we use the symbols
$(0, e_{\infty}, f_{\infty})$ for the factoring of $+\infty$ and the symbols $(1, e_{\infty}, f_{\infty})$ for the
factoring of $-\infty$, corresponding to the special bit strings for the IEEE infinity
representations. The NaN representations are not unique. Therefore, if it does not matter
which representation is chosen, we use the symbols $(s, e_{sNaN}, f_{sNaN})$ for sNaN factorings
and the symbols $(s, e_{qNaN}, f_{qNaN})$ for qNaN factorings, corresponding to an arbitrary IEEE
NaN representation. If we want to refer to a specific NaN representation, we index
them with a positive number, as in $(s_1, e_{sNaN1}, f_{sNaN1})$ or $(s_2, e_{qNaN2}, f_{qNaN2})$. We
define these factorings of special values to be normalized, so that the normalization
shifts compute the identity function on them. Moreover, we extend the definition of
the function $val$ by $val(s, e_{\infty}, f_{\infty}) = (-1)^s \cdot \infty$, $val(s, e_{sNaN}, f_{sNaN}) = sNaN$ and
$val(s, e_{qNaN}, f_{qNaN}) = qNaN$.
The union of normalized numbers and denormalized numbers forms the representable
numbers of an IEEE floating-point type. The geometry of the representable numbers is
depicted in figure 2.1 and shows the following properties:
• For every exponent value $e$ between $e_{min}$ and $e_{max}$ there are two intervals of
representable (normalized) numbers: $[2^{e}, 2^{e+1}[$ and $]-2^{e+1}, -2^{e}]$. The gaps between
consecutive representable numbers in these intervals are $2^{e-(p-1)}$.
• As the exponent value increases by one, the length of the interval $[2^{e}, 2^{e+1}[$ doubles,
and the gaps between the representable (normalized) numbers double as well. Thus,
the number of representable numbers per interval is fixed and equals $2^{p-1}$.
• The denormalized numbers are the representable numbers in the interval
$]-2^{e_{min}}, 2^{e_{min}}[$. The gaps between consecutive representable numbers in this interval
are $2^{e_{min}-(p-1)}$. Thus, the gaps in the interval $[0, 2^{e_{min}}[$ equal the gaps in the
interval $[2^{e_{min}}, 2^{e_{min}+1}[$. This property is called gradual underflow in the literature,
since the large gap between zero and $2^{e_{min}}$ is filled with denormalized numbers.
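These properties can be observed on a toy format that is small enough to enumerate.
The following Python sketch is an illustration only; the function name `representable`
and the parameters $p = 3$, $e_{min} = -1$, $e_{max} = 1$ are hypothetical choices and do not
correspond to an IEEE format.

```python
from fractions import Fraction

def representable(p, e_min, e_max):
    """Non-negative representable numbers of a toy format with p significand bits."""
    ulp = Fraction(1, 2 ** (p - 1))
    # denormalized numbers: exponent e_min, significand b * 2**-(p-1) < 1
    den = [b * ulp * Fraction(2) ** e_min for b in range(2 ** (p - 1))]
    # normalized numbers: significand 1 + b * 2**-(p-1) for every exponent
    nor = [(1 + b * ulp) * Fraction(2) ** e
           for e in range(e_min, e_max + 1) for b in range(2 ** (p - 1))]
    return den + nor

nums = representable(p=3, e_min=-1, e_max=1)
gaps = [b - a for a, b in zip(nums, nums[1:])]
print([str(n) for n in nums])
print([str(g) for g in gaps])
```

The printed gaps stay constant over the denormalized range and the first normalized
interval and then double with every exponent step.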
The IEEE definition of normalized and denormalized floating-point numbers includes a
definition of their factorings, so that we distinguish between the following sets of factorings
of an IEEE floating-point type:
Definition 2.5 The set of normalized IEEE factorings, $NORfact_{n,p}$, the set of
denormalized IEEE factorings, $DENfact_{n,p}$, and the set of special IEEE factorings, $SPEfact$:
\begin{align*}
NORfact_{n,p} &= \{ (s, e, f) \mid s \in \{0,1\},\ e \in \mathbb{Z} \text{ with } (e_{min} \le e \le e_{max}), \\
 &\qquad \text{and } \exists\, b \in \mathbb{N} \text{ with } (0 \le b < 2^{p-1}) : f = 1 + b \cdot 2^{-(p-1)} \} \\
DENfact_{n,p} &= \{ (s, e_{min}, f) \mid s \in \{0,1\},\ \exists\, b \in \mathbb{N} \text{ with } (0 \le b < 2^{p-1}) : f = b \cdot 2^{-(p-1)} \} \\
SPEfact &= \{ (0, e_{\infty}, f_{\infty}), (1, e_{\infty}, f_{\infty}), (s, e_{qNaN}, f_{qNaN}), (s, e_{sNaN}, f_{sNaN}) \} .
\end{align*}
We define the set of IEEE factorings by
\[ IEEEfact_{n,p} = DENfact_{n,p} \cup NORfact_{n,p} \cup SPEfact . \]
Accordingly, the following sets of numbers are defined:
Definition 2.6 For each IEEE floating-point type, the set of normalized numbers,
$NOR_{n,p}$, the set of denormalized numbers, $DEN_{n,p}$, and the set of IEEE special values,
$SPE$, are defined by
\begin{align*}
NOR_{n,p} &= \{ x \mid \exists (s, e, f) \in NORfact_{n,p} : x = val(s, e, f) \} \\
DEN_{n,p} &= \{ x \mid \exists (s, e, f) \in DENfact_{n,p} : x = val(s, e, f) \} \\
SPE &= \{ +\infty, -\infty, qNaN, sNaN \} .
\end{align*}
We define the set of representable numbers of an IEEE floating-point type, $REP_{n,p}$, by
\[ REP_{n,p} = DEN_{n,p} \cup NOR_{n,p} . \]
The set of values of an IEEE floating-point type, $FP_{n,p}$, additionally includes the special
values, so that
\[ FP_{n,p} = DEN_{n,p} \cup NOR_{n,p} \cup SPE = REP_{n,p} \cup SPE . \]
Lemma 2.4 The sets of denormalized and normalized IEEE numbers/factorings are
disjoint: $NOR_{n,p} \cap DEN_{n,p} = \emptyset$ and $NORfact_{n,p} \cap DENfact_{n,p} = \emptyset$. Thus, each
IEEE floating-point value $x \in FP_{n,p}$ has a unique IEEE factoring
$(s, e, f) \in IEEEfact_{n,p}$ with $x = val(s, e, f)$.
Proof: For normalized IEEE factorings $(s_{nor}, e_{nor}, f_{nor}) \in NORfact_{n,p}$, we have
$e_{nor} \ge e_{min}$ and $f_{nor} \ge 1$, so that $|x_{nor}| = |val(s_{nor}, e_{nor}, f_{nor})| \ge 2^{e_{min}}$. For
denormalized IEEE factorings $(s_{den}, e_{den}, f_{den}) \in DENfact_{n,p}$, we have $e_{den} = e_{min}$
and $f_{den} < 1$, so that $|x_{den}| = |val(s_{den}, e_{den}, f_{den})| < 2^{e_{min}}$. Thus, every normalized
IEEE factoring has a larger absolute value than each of the denormalized IEEE factorings,
$|x_{nor}| > |x_{den}|$, so that $NOR_{n,p} \cap DEN_{n,p} = \emptyset$ and
$NORfact_{n,p} \cap DENfact_{n,p} = \emptyset$. For the second part of the lemma we additionally have
to use that each special value has a unique factoring representation as well. This can
easily be seen from the definitions of $SPE$ and $SPEfact$. $\Box$
Lemma 2.5 From an arbitrary factoring $(s, e, f) \in FACT(FP_{n,p})$ of an IEEE FP
number $x = val(s, e, f) \in FP_{n,p}$, the bounded normalization shift
$\lfloor \eta_{e_{min}} \rfloor (s, e, f) = (s'', e'', f'')$ computes the corresponding IEEE factoring
$(s'', e'', f'') \in IEEEfact_{n,p}$ with $val(s'', e'', f'') = x = val(s, e, f)$.
precision     p      n      bias_n   e_min       e_max      |x|_min          |x|_max
single        24     8      127      -126        127        1.4 * 10^-45     3.4 * 10^38
single ext.   ≥ 32   ≥ 11   -        ≤ -1022     ≥ 1023     -                -
double        53     11     1023     -1022       1023       4.9 * 10^-324    1.8 * 10^308
double ext.   ≥ 64   ≥ 15   -        ≤ -16382    ≥ 16383    -                -
Table 2.1: IEEE floating-point formats.
Proof: The proof consists of two parts for the cases: (a) $val(s, e, f) \in DEN_{n,p}$ and (b)
$val(s, e, f) \in NOR_{n,p}$.
(a) For $val(s, e, f) \in DEN_{n,p}$, there is a denormalized IEEE factoring
$(a, b, c) \in DENfact_{n,p}$ with $val(s, e, f) = val(a, b, c)$. We know already from the
definition of the bounded normalization shift (definition 2.4) that
$val(s'', e'', f'') = val(s, e, f)$ as well. From the definition of denormalized IEEE
factorings (see definition 2.5), it follows that $b = e_{min}$. Therefore, for the proof of
$(s'', e'', f'') = (a, b, c)$ and $(s'', e'', f'') \in DENfact_{n,p}$, it suffices to show that
$e'' = e_{min}$. From $val(s, e, f) \in DEN_{n,p}$ it follows that $|val(s, e, f)| < 2^{e_{min}}$. We
consider the normalized factoring $(s, e', f') = \eta(s, e, f)$. Because $f' \ge 1$ and
$2^{e'} \cdot f' < 2^{e_{min}}$, the exponent $e' < e_{min}$ is smaller than the exponent bound of the
bounded normalization shift. Therefore, it follows from the definition of
$\lfloor \eta_{e_{min}} \rfloor$ that $e'' = e_{min}$, and part (a) of the proof is completed.
(b) For $val(s, e, f) \in NOR_{n,p}$, we have to show that $(s'', e'', f'')$ is normalized. From
$val(s, e, f) \in NOR_{n,p}$, it follows that $|val(s, e, f)| \ge 2^{e_{min}}$. We consider the normalized
factoring $(s, e', f') = \eta(s, e, f)$. Because $f' < 2$ and $2^{e'} \cdot f' \ge 2^{e_{min}}$, the exponent
$e' \ge e_{min}$ is larger than or equal to the exponent bound of the bounded normalization
shift. Therefore, it follows from the definition of $\lfloor \eta_{e_{min}} \rfloor$ that $(s'', e'', f'')$ is the
normalized factoring $(s'', e'', f'') = \eta(s, e, f)$, and part (b) of the proof is completed
as well. $\Box$
Definition 2.7 If an unbounded normalization shift is computed on the factorings from
$FACT(FP_{n,p})$, we get a set that includes the (unbounded) normalized factoring for each
IEEE number in $FP_{n,p}$. In this way we define the set of NF factorings $NFfact_{n,p}$ by:
\[ NFfact_{n,p} = \{ (s, e, f) \mid (s, e, f) \in FACT(FP_{n,p}) \text{ and } (s, e, f) = \eta(s, e, f) \} . \]
In the early days of floating-point design, many different formats with different values
for $e_{min}$, $e_{max}$, $n$ and $p$ were used. The success of the IEEE floating-point Standard
754-1985 [19] reduced the supported FP types to a few: single, double, single extended and
double extended. The parameters for these precisions are given in table 2.1. In an IEEE
compliant FPU design only some of these FP types have to be implemented. We will focus
on the implementation of the single and double precision types, because these types are
most commonly used and the integration of additional types would be straightforward.
2.2.3 Packed IEEE Floating-Point Format
At the bit level, numbers in the single and double formats are composed of three fields
corresponding to sign, biased exponent and fraction (significand without first bit), as
depicted in figure 2.2. In the biased exponent representation, a bias of
$bias_n = 2^{n-1} - 1$ is used. Because $bias_n = -e_{min} + 1 = e_{max}$ (see table 2.1 for single
and double precision), the value of $e + bias_n$ is in the range
$1 \le e + bias_n \le 2^{n} - 2$. Thus, the bit strings for $0$ and $2^{n} - 1$ do not occur in the
$n$-bit biased binary representation $e[n-1 : 0] = bin_{0}^{n-1}(e + bias_n)$.
Therefore, the exponent strings $e = 0^{n}$ and $e = 1^{n}$ are used for the representation of
denormalized numbers and special values.
    S | E[7:0]  | F[-1:-23]      Single Format (32 bits)
    S | E[10:0] | F[-1:-52]      Double Format (64 bits)
Figure 2.2: Packed IEEE floating-point format.
The significand $f$ of a representable number can be represented with $p$ bits
$f[0 : p-1] = bin_{-p+1}^{0}(f)$. But only the fraction $f[1 : p-1]$ is included in the number
string, and the hidden bit $f[0]$ does not occur explicitly in the number representation.
The hidden bit $f[0]$ equals 1 iff $f$ is normalized, and $f[0]$ equals 0 iff $f$ is denormalized.
Because the exponent representation of $e_{min}$ for denormalized numbers differs from all
exponent representations of normalized numbers, the hidden bit $f[0]$ can be extracted
from the exponent representation.
The value of a number $x$ represented by the packed representation
$(s, e[n-1 : 0], f[1 : p-1])$ is defined by
1. If $e[n-1 : 0] = 0^{n}$ (denormalized numbers),
   then $x = (-1)^s \cdot \langle (0.f[1 : p-1]) \rangle_{neg} \cdot 2^{e_{min}}$.
2. If $e[n-1 : 0] \ne 0^{n}$ and $e[n-1 : 0] \ne 1^{n}$ (normalized numbers),
   then $x = (-1)^s \cdot \langle (1.f[1 : p-1]) \rangle_{neg} \cdot 2^{\langle e[n-1:0] \rangle_{bias_n}}$.
3. If $e[n-1 : 0] = 1^{n}$ (special values),
   then $x$ is a special value depending on $f[1 : p-1]$:
   • If $f[1 : p-1] = 0^{p-1}$, then $x$ is $\infty$ and has the sign of $(-1)^s$.
   • If $f[1 : p-1] \ne 0^{p-1}$, then $x$ is NaN regardless of $s$.
   The standard does not specify how to distinguish between signaling and quiet
   NaNs. We follow the specification used in [29] and distinguish between signaling
   and quiet NaNs by the value of $f[1]$: If $f[1] = 1$, then $x$ is a signaling NaN
   (sNaN), otherwise $x$ is a quiet NaN (qNaN).
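The three cases translate directly into a decoder for the packed single format. The
following Python sketch is an illustration only; the function name `decode_single` and
the string results for special values are hypothetical. The NaN classification follows the
convention stated above ($f[1] = 1$ means signaling), which is an assumption of this text
and differs from the convention of many current processors.

```python
import struct
from fractions import Fraction

def decode_single(word):
    """Decode a 32-bit packed single (n = 8, p = 24) following cases 1 to 3."""
    n, p, e_min = 8, 24, -126
    bias = 2 ** (n - 1) - 1
    s = word >> 31
    e = (word >> (p - 1)) & (2 ** n - 1)
    frac = word & (2 ** (p - 1) - 1)             # fraction bits f[1:p-1]
    sign = (-1) ** s
    if e == 0:                                    # denormalized: hidden bit is 0
        return sign * Fraction(frac, 2 ** (p - 1)) * Fraction(2) ** e_min
    if e != 2 ** n - 1:                           # normalized: hidden bit is 1
        return sign * (1 + Fraction(frac, 2 ** (p - 1))) * Fraction(2) ** (e - bias)
    if frac == 0:
        return "+inf" if s == 0 else "-inf"
    # NaN: signaling iff f[1] = 1 (convention adopted in the text from [29])
    return "sNaN" if frac >> (p - 2) else "qNaN"

word = struct.unpack(">I", struct.pack(">f", -2.5))[0]
print(hex(word), decode_single(word))
```

The hidden bit never appears in `word`; the decoder derives it from the exponent field
alone, as described above.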
2.2.4 Operations
Besides floating-point types, the IEEE FP Standard defines arithmetic operations that
have to be implemented in hardware or in software. In this section, we only define exact
results of these operations for finite input operands
$x = val(s_x, e_x, f_x) \in REP_{n,p}$ and $y = val(s_y, e_y, f_y) \in REP_{n,p}$. The computations
involving special values will be described later in combination with the exception handling.
• Addition/subtraction. We use the bit $sop$ to distinguish between addition ($sop = 0$)
and subtraction ($sop = 1$). The exact value of the addition/subtraction result is
defined by:
\[ exact_{ADD/SUB} = x + (-1)^{sop} \cdot y . \]
The computation of the factoring of this value involves several steps. Therefore, we
postpone its specification to the description of the addition implementations.
• Multiplication. The exact product of $x$ and $y$ is denoted by:
\[ exact_{MULT} = x \cdot y = (-1)^{s_x + s_y} \cdot (f_x \cdot f_y) \cdot 2^{e_x + e_y} . \]
Thus, $(s_x \oplus s_y,\ e_x + e_y,\ f_x \cdot f_y)$ is a factoring of $exact_{MULT}$.
• Division. The exact quotient of $x$ and $y$ is denoted by:
\[ exact_{DIV} = x / y = (-1)^{s_x - s_y} \cdot (f_x / f_y) \cdot 2^{e_x - e_y} . \]
Thus, $(s_x \oplus s_y,\ e_x - e_y,\ f_x / f_y)$ is a factoring of $exact_{DIV}$.
• Square-root. For non-negative $x \ge 0$ the exact square-root of $x$ is denoted by:
\[ exact_{SQRT} = \sqrt{x} = \sqrt{f_x \cdot 2^{(e_x\ \mathrm{MOD}\ 2)}} \cdot 2^{e_x\ \mathrm{DIV}\ 2} . \]
Thus, $(0,\ e_x\ \mathrm{DIV}\ 2,\ \sqrt{f_x \cdot 2^{(e_x\ \mathrm{MOD}\ 2)}})$ is a factoring of $exact_{SQRT}$.
• Remainder. For non-zero $y$ the exact remainder $x\ \mathrm{REM}\ y$ is defined by:
\[ exact_{REM} = x - y \cdot n , \]
where $n$ is the integer nearest to the exact value $x / y$; whenever $|n - x/y| = 0.5$, then
$n$ is even.
• Conversion. In conversions, the input operand already has the exact value of the
conversion. This value then has to be converted to the destination's format.
\[ exact_{CONV} = x . \]
In this operation we have to consider that the input operand could also be an integer
$\langle x \rangle_2$ in two's complement representation. Then,
\[ exact_{CONV} = \langle x \rangle_2 . \]
A factoring of $exact_{CONV}$ is given by $(s_x, e_x, f_x)$ or $(sign(\langle x \rangle_2), 0, |\langle x \rangle_2|)$.
Moreover, it is suggested to implement the computation of the absolute value ($s_x := 0$)
and of the negative of a floating-point number ($s_x := not(s_x)$).
The floating-point types are not closed under all of these arithmetic operations. Therefore,
the exact result of an operation might not belong to the same floating-point type. To be
able to operate on results of operations nevertheless, it is a basic principle of the IEEE
standard to consider the exact result of an operation first and then to map it to a
floating-point number by a selected rounding scheme, which finally yields a rounded result
in the same floating-point type.
Apart from that, the test operation (comparison) delivers a boolean value from two
floating-point inputs. There are 26 different comparisons defined by the IEEE standard,
which we encode by 5 condition code bits $cond[4 : 0]$. The bits $cond[3 : 0]$ select the
conditions $\{>, <, =, \mathrm{UNORDERED}(?)\}$, and $cond[4]$ negates the boolean result bit. Only
26 of the 32 possible combinations are required by the standard. These are listed in
table 2.2.
condition     cond[4:0]   >   <   =   ?   INV if ?
=             00010       F   F   T   F   No
?<>           01101       T   T   F   T   No
>             01000       T   F   F   F   Yes
>=            01010       T   F   T   F   Yes
<             00100       F   T   F   F   Yes
<=            00110       F   T   T   F   Yes
?             00001       F   F   F   T   No
<>            01100       T   T   F   F   Yes
<=>           01110       T   T   T   F   Yes
?>            01001       T   F   F   T   No
?>=           01011       T   F   T   T   No
?<            00101       F   T   F   T   No
?<=           00111       F   T   T   T   No
?=            00011       F   F   T   T   No
NOT(>)        11000       F   T   T   T   Yes
NOT(>=)       11010       F   T   F   T   Yes
NOT(<)        10100       T   F   T   T   Yes
NOT(<=)       10110       T   F   F   T   Yes
NOT(?)        10001       T   T   T   F   No
NOT(<>)       11100       F   F   T   T   Yes
NOT(<=>)      11110       F   F   F   T   Yes
NOT(?>)       11001       F   T   T   F   No
NOT(?>=)      11011       F   T   F   F   No
NOT(?<)       10101       T   F   T   F   No
NOT(?<=)      10111       T   F   F   F   No
NOT(?=)       10011       T   T   F   F   No
Table 2.2: IEEE test operation (comparison).
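The encoding of table 2.2 can be evaluated with a few bit operations. The following
Python sketch is an illustration only; the function name `fp_test` is hypothetical, the
operands are Python floats, and the invalid-operation signaling of the last table column is
not modeled.

```python
import math

def fp_test(cond, x, y):
    """Evaluate the predicate selected by the 5-bit code cond[4:0] on floats x and y."""
    if math.isnan(x) or math.isnan(y):
        relation = 0b0001          # unordered (?)
    elif x == y:
        relation = 0b0010          # =
    elif x < y:
        relation = 0b0100          # <
    else:
        relation = 0b1000          # >
    selected = (cond & 0b1111 & relation) != 0   # cond[3:0] selects the relations
    negate = bool(cond >> 4 & 1)                 # cond[4] negates the result
    return selected != negate

print(fp_test(0b01010, 2.0, 1.0))            # >= on (2, 1)
print(fp_test(0b11000, float("nan"), 1.0))   # NOT(>) on an unordered pair
```

Exactly one of the four relations holds for every pair of operands, so the result is the
selected bit of `cond[3:0]`, optionally inverted.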
2.3 Rounding
2.3.1 IEEE Rounding Definition
IEEE rounding is a mapping from the reals into an IEEE floating-point type. The IEEE
standard defines rounding in four rounding modes: round toward 0 (RZ), round to
nearest (even) (RNE), round toward $+\infty$ (RI) and round toward $-\infty$ (RMI). Let
$REP_{\infty} = REP \cup \{+\infty, -\infty\}$. For the rounding mode $mode \in \{RZ, RNE, RI, RMI\}$, we
present the rounding definition of the IEEE standard by the description of the rounding
function $r_{mode} : \mathbb{R} \to REP_{\infty}$. For the three directed rounding modes
$mode \in \{RZ, RI, RMI\}$ the obvious meaning of IEEE rounding is given by:
\begin{align*}
r_{RI}(x) &= \min \{ y \in REP_{\infty} \mid x \le y \} \\
r_{RMI}(x) &= \max \{ y \in REP_{\infty} \mid x \ge y \} \\
r_{RZ}(x) &= \begin{cases} r_{RMI}(x) & \text{if } x \ge 0 \\ r_{RI}(x) & \text{if } x < 0 . \end{cases}
\end{align*}
The definition of the rounding function $r_{RNE}$ is a bit more complicated. Let
$x^{*}_{max} = 2^{e_{max}} (2 - 2^{-p})$ and let $y \in REP$ be the representable number nearest to $x$
if this is unique; otherwise let $y \in REP$ be the even representable number that is nearest
to $x$. Then,
\[
r_{RNE}(x) =
\begin{cases}
+\infty & \text{if } x \ge x^{*}_{max} \\
-\infty & \text{if } x \le -x^{*}_{max} \\
y & \text{otherwise.}
\end{cases}
\]
2.3.2 Rounding Functions
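In this section we define rounding for a particular precision $\sigma$, so that a real number $x$
is mapped to an integral multiple of $2^{-\sigma}$. For a precision $\sigma$, we define four rounding
functions, which we index by the names of the four IEEE rounding modes RZ, RNE, RI,
and RMI. We will show in the next section how these rounding functions can be used
to implement IEEE rounding. For the definition of the rounding functions, we choose the
integer $t$ so that $t \cdot 2^{-\sigma} \le x < (t+1) \cdot 2^{-\sigma}$, and $t^{*}$ is the even number of the set
$\{t, t+1\}$.
\begin{align}
rnd_{RI,\sigma}(x) &=
\begin{cases}
t \cdot 2^{-\sigma} & \text{if } x = t \cdot 2^{-\sigma} \\
(t+1) \cdot 2^{-\sigma} & \text{otherwise}
\end{cases}
\tag{2.2} \\
rnd_{RMI,\sigma}(x) &= t \cdot 2^{-\sigma}
\tag{2.3} \\
rnd_{RZ,\sigma}(x) &=
\begin{cases}
t \cdot 2^{-\sigma} & \text{if } x \ge 0 \text{ OR } x = t \cdot 2^{-\sigma} \\
(t+1) \cdot 2^{-\sigma} & \text{otherwise}
\end{cases}
\tag{2.4} \\
rnd_{RNE,\sigma}(x) &=
\begin{cases}
t \cdot 2^{-\sigma} & \text{if } x < (t + 0.5) \cdot 2^{-\sigma} \\
t^{*} \cdot 2^{-\sigma} & \text{if } x = (t + 0.5) \cdot 2^{-\sigma} \\
(t+1) \cdot 2^{-\sigma} & \text{otherwise}
\end{cases}
\tag{2.5}
\end{align}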
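The four rounding functions can be written down almost literally in exact rational
arithmetic. The following Python sketch is an illustration only; the function name `rnd`,
the string mode names and the parameter name `sigma` for the precision are hypothetical.

```python
from fractions import Fraction
from math import floor

def rnd(mode, sigma, x):
    """Round x to an integral multiple of 2**-sigma following equations (2.2) to (2.5)."""
    unit = Fraction(2) ** -sigma
    x = Fraction(x)
    t = floor(x / unit)                      # t*unit <= x < (t+1)*unit
    exact = (x == t * unit)
    if mode == "RMI":
        r = t
    elif mode == "RI":
        r = t if exact else t + 1
    elif mode == "RZ":
        r = t if (x >= 0 or exact) else t + 1
    elif mode == "RNE":
        half = (2 * t + 1) * unit / 2        # the midpoint (t + 0.5) * unit
        if x < half:
            r = t
        elif x == half:
            r = t if t % 2 == 0 else t + 1   # the even number of {t, t+1}
        else:
            r = t + 1
    else:
        raise ValueError(mode)
    return r * unit

for mode in ("RZ", "RNE", "RI", "RMI"):
    print(mode, rnd(mode, 2, Fraction(5, 8)), rnd(mode, 2, Fraction(-5, 8)))
```

With precision 2 the value 5/8 lies exactly halfway between 2/4 and 3/4, so the example
exercises the tie case of RNE as well as the sign dependence of RZ.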
For the rounding of sign-magnitude representations with $x = (-1)^s \cdot |x|$, the four IEEE
rounding modes for the rounding of $x$ can be reduced to the three rounding modes
$\{RZ, RNE, RI\}$ for the rounding of $|x|$ [33]. This is done by reducing the directed rounding
modes RZ, RI and RMI to the rounding modes RZ and RI for the rounding of positive
arguments, based on the sign $s$ of the number. This leaves only the three rounding modes
RZ, RNE, and RI, which only have to operate on the positive argument $|x|$.

mode   rnd_mode[1:0]   mode ⋆ 0   sr_mode[1:0]   mode ⋆ 1   sr_mode[1:0]
RZ     00              RZ         00             RZ         00
RNE    01              RNE        01             RNE        01
RI     10              RI         10             RZ         00
RMI    11              RZ         00             RI         10
Table 2.3: Rounding mode reduction for sign-magnitude arguments.

In conjunction with table 2.3 for the rounding mode reduction, we define the $\star$-operation:
\begin{align*}
\star : \{RZ, RNE, RI, RMI\} \times \{0, 1\} &\to \{RZ, RNE, RI\} \\
(mode, s) &\mapsto mode \star s
\end{align*}
that maps the rounding mode $mode$ and the sign $s$ to the corresponding reduced rounding
mode $mode \star s$. Based on this definition, the rounding mode reduction can be written as:
\[ rnd_{mode,\sigma}(x) = rnd_{mode,\sigma}((-1)^s \cdot |x|) = (-1)^s \cdot rnd_{mode \star s,\sigma}(|x|) . \]
If we encode the four IEEE rounding modes by $rnd\_mode[1:0]$ and the three reduced
rounding modes by $sr\_mode[1:0]$ according to table 2.3, the $\star$-operation can be expressed
by the equations:
\begin{align}
sr\_mode[1] &= rnd\_mode[1] \wedge \overline{(rnd\_mode[0] \oplus s)} \tag{2.6} \\
sr\_mode[0] &= \overline{rnd\_mode[1]} \wedge rnd\_mode[0] . \tag{2.7}
\end{align}
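A bit-level form of the mode reduction can be checked against table 2.3 for all eight
combinations of mode and sign. The following Python sketch is an illustration only; the
names `REDUCED`, `CODE` and `reduce_mode` are hypothetical, and the complemented terms
in the two bit equations are an assumption that is chosen such that the table is reproduced.

```python
REDUCED = {  # table 2.3: (mode, sign) -> reduced mode
    ("RZ", 0): "RZ", ("RZ", 1): "RZ",
    ("RNE", 0): "RNE", ("RNE", 1): "RNE",
    ("RI", 0): "RI", ("RI", 1): "RZ",
    ("RMI", 0): "RZ", ("RMI", 1): "RI",
}
CODE = {"RZ": 0b00, "RNE": 0b01, "RI": 0b10, "RMI": 0b11}

def reduce_mode(rnd_mode, s):
    """Compute sr_mode[1:0] from rnd_mode[1:0] and the sign bit s."""
    m1, m0 = rnd_mode >> 1 & 1, rnd_mode & 1
    sr1 = m1 & ~(m0 ^ s) & 1       # directed mode that points away from zero for sign s
    sr0 = ~m1 & m0 & 1             # RNE is passed through unchanged
    return sr1 << 1 | sr0

for (mode, s), reduced in REDUCED.items():
    print(mode, s, reduced, format(reduce_mode(CODE[mode], s), "02b"))
```

The reduced mode is RI exactly when the original directed mode rounds the magnitude
away from zero, i.e. for RI on positive and RMI on negative operands.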
Furthermore, Quach et al. [33] suggested implementing RNE by round to nearest up
(RNU). With an integer $t$ such that $t \cdot 2^{-\sigma} \le |x| < (t+1) \cdot 2^{-\sigma}$, the rounding mode
RNU is defined by:
\[
rnd_{RNU,\sigma}(|x|) =
\begin{cases}
t \cdot 2^{-\sigma} & \text{if } |x| < (t + 0.5) \cdot 2^{-\sigma} \\
(t+1) \cdot 2^{-\sigma} & \text{otherwise.}
\end{cases}
\tag{2.8}
\]
The reason that RNE can be implemented by RNU is that
$rnd_{RNU,\sigma}(x) \ne rnd_{RNE,\sigma}(x)$ iff $x = (t + 0.5) \cdot 2^{-\sigma}$ and the LSB of the binary
encoding of $(t+1) \cdot 2^{-\sigma}$ is 1. Therefore, obtaining $rnd_{RNE,\sigma}(x)$ from
$rnd_{RNU,\sigma}(x)$ can be accomplished by "pulling down" the LSB when
$x = (t + 0.5) \cdot 2^{-\sigma}$.
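The correction step can be made explicit for non-negative arguments. The following
Python sketch is an illustration only; the function names `rnd_rnu` and
`rnd_rne_via_rnu` and the parameter name `sigma` for the precision are hypothetical.

```python
from fractions import Fraction
from math import floor

def rnd_rnu(sigma, x):
    """Round to nearest up: round up unless x lies strictly below the midpoint."""
    unit = Fraction(2) ** -sigma
    x = Fraction(x)
    t = floor(x / unit)
    return (t + 1 if 2 * x >= (2 * t + 1) * unit else t) * unit

def rnd_rne_via_rnu(sigma, x):
    """Obtain round to nearest even from RNU by pulling down the LSB in the tie case."""
    unit = Fraction(2) ** -sigma
    x = Fraction(x)
    r = rnd_rnu(sigma, x) / unit             # rounded value as a multiple of the unit
    t = floor(x / unit)
    tie = 2 * x == (2 * t + 1) * unit        # x is exactly the midpoint
    if tie and r % 2 == 1:
        r -= 1                               # clear the LSB of the RNU result
    return r * unit

print(rnd_rnu(2, Fraction(5, 8)), rnd_rne_via_rnu(2, Fraction(5, 8)))
```

Only the tie case needs the correction, and the correction never propagates a carry,
because it merely clears a least significant bit that is known to be one.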
2.3.3 IEEE Rounding Functions
In this section a description of IEEE rounding is given that is more practical than the
definition by the IEEE standard. The following lemma shows how the rounding functions
for a particular precision $\sigma$ from the previous section are related to IEEE rounding. After
that we will consider the IEEE rounding on factorings.
Lemma 2.6 For $2^{e'} \le |x| < 2^{e'+1}$ and $mode \in \{RZ, RNE, RI, RMI\}$, let
$e'' = \max\{e', e_{min}\}$ and $xr = rnd_{mode,\,-e''+p-1}(x)$. Then,
\[
r_{mode}(x) =
\begin{cases}
\infty & \text{if } xr \ge 2^{e_{max}+1} \text{ and } mode \in \{RNE, RI\} \\
x_{max} & \text{if } xr \ge 2^{e_{max}+1} \text{ and } mode \in \{RZ, RMI\} \\
-x_{max} & \text{if } xr \le -2^{e_{max}+1} \text{ and } mode \in \{RZ, RI\} \\
-\infty & \text{if } xr \le -2^{e_{max}+1} \text{ and } mode \in \{RNE, RMI\} \\
xr & \text{otherwise.}
\end{cases}
\]
16 CHAPTER 2. IEEE FLOATING-POINT STANDARD
Proof: In the definitions of IEEE rounding, in all cases the rounded result $r_{mode}(x)$ is
either the nearest number of the destination FP type that is larger than the operand $x$ or
the nearest number of the destination FP type that is smaller than or equal to the operand
$x$. In the following, we distinguish between the cases: (a) $|x| < x_{max}$ and (b) $|x| \ge x_{max}$.
(a) For $|x| < x_{max}$, we get $|xr| \le x_{max} < 2^{e_{max}+1}$, so that we have to show that
the IEEE rounding definition $r_{mode}(x)$ is equivalent to $rnd_{mode,-e''+p-1}(x)$ for all
rounding modes. We will first show that the two possible rounding choices of the IEEE
rounding definitions are identical to the two possible results from the definition of the
function $rnd_{mode,-e''+p-1}(x)$.
(i) For $|x| < 2^{e_{min}}$, we are in the range of denormalized numbers, so that the gap
between two consecutive FP numbers is $2^{e_{min}-p+1}$ (see geometry of representable
numbers in section 2.2.2) and there is an integer $k$ such that
\[ f_1 = k \cdot 2^{e_{min}-p+1} \le x < (k+1) \cdot 2^{e_{min}-p+1} = f_2 \]
where $f_1, f_2 \in FP_{n,p}$: $f_1$ is the nearest floating-point number in $FP_{n,p}$ smaller than
or equal to $x$, and $f_2$ is the nearest one larger than $x$.
The definition of $rnd_{mode,-e''+p-1}(x)$ uses the possible rounding results $f_3 = l \cdot 2^{e''-p+1}$
and $f_4 = (l+1) \cdot 2^{e''-p+1}$ with $f_3 \le x < f_4$. Since $|x| < 2^{e_{min}}$, we get
$e' < e_{min}$ and $e'' = e_{min}$. Thus, the possible rounding choices are the same as in the
IEEE rounding definition: $f_1 = k \cdot 2^{e_{min}-p+1} = f_3$ and $f_2 = (k+1) \cdot 2^{e_{min}-p+1} = f_4$.
(ii) For $|x| \ge 2^{e_{min}}$, we are in the range of normalized numbers, so that the gap between
two consecutive FP numbers is $2^{e'-p+1}$ (see geometry of representable numbers in
section 2.2.2). In this case, there is an integer $k$, so that
\[ f_1 = k \cdot 2^{e'-p+1} \le x < (k+1) \cdot 2^{e'-p+1} = f_2 \]
where $f_1, f_2 \in FP_{n,p}$: $f_1$ is the nearest floating-point number in $FP_{n,p}$ smaller than
or equal to $x$, and $f_2$ is the nearest one larger than $x$. Because for $|x| \ge 2^{e_{min}}$ we get
$e' \ge e_{min}$ and $e'' = e'$, the numbers $f_1$ and $f_2$ agree with the two possible rounding
results of the function $rnd_{mode,-e''+p-1}(x) = rnd_{mode,-e'+p-1}(x)$ also in this case.
Based on the agreement of the two possible rounding choices, one can now easily check
that the rounding decisions are also the same for both the IEEE definition $r_{mode}(x)$ and
the rounding functions $rnd_{mode,-e''+p-1}(x)$ for all four rounding modes. This completes
the proof for case (a).
(b) For $|x| \ge x_{max}$, the possible rounding choices for the IEEE rounding definition
$r_{mode}(x)$ are $\pm x_{max}$ and $\pm\infty$. One can easily check that the specification of the
rounding cases in the lemma corresponds to the IEEE definition for
$|rnd_{mode,-e''+p-1}(x)| \ge 2^{e_{max}+1}$, so that we only have to prove part (b) for
$|rnd_{mode,-e''+p-1}(x)| < 2^{e_{max}+1}$. For $|x| \ge x_{max}$, the condition
$|rnd_{mode,-e''+p-1}(x)| < 2^{e_{max}+1}$ can only be fulfilled if at least one of the following
five conditions is also fulfilled:
(i) $|x| = x_{max}$;
(ii) $mode = RZ$;
(iii) $x > 0$ and $mode = RMI$;
(iv) $x < 0$ and $mode = RI$;
(v) $|x| < (2 - 2^{-p}) \cdot 2^{e_{max}}$ and $mode = RNE$.
For the rounding mode RNE, the value $|x| = (2 - 2^{-p}) \cdot 2^{e_{max}}$ (rounding interval midpoint)
is not included in case (v), because this value is rounded to $2^{e_{max}+1}$, which is the 'even'
value among the two rounding choices. Since $(2 - 2^{-p}) \cdot 2^{e_{max}}$ is the midpoint between
$x_{max}$ and $2^{e_{max}+1}$, in all of these five cases the IEEE rounding definition leads to
$|r_{mode}(x)| = x_{max}$. From $|x| \ge x_{max}$, we get $e'' = e' \ge e_{max}$, so that all results of
$rnd_{mode,-e''+p-1}(x)$ are integral multiples of $2^{e_{max}-p+1}$. Because the only multiples of
$2^{e_{max}-p+1}$ that have a magnitude larger than or equal to $x_{max}$ and smaller than
$2^{e_{max}+1}$ are $\pm x_{max}$, it follows that also $|rnd_{mode,-e''+p-1}(x)| = x_{max}$ and the proof of
the lemma is completed. $\Box$
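Lemma 2.6 translates directly into a small reference model. The following Python sketch is an illustration under stated assumptions, not the implementation developed in this work: it uses exact rationals, a hypothetical helper `rnd` for the fixed-precision rounding functions, and a toy format chosen by the caller through `p`, `e_min` and `e_max`.

```python
from fractions import Fraction
import math


def rnd(mode, rho, x):
    """Round x to a multiple of 2**-rho in mode RZ, RNE, RI or RMI."""
    ulp = Fraction(2) ** -rho
    q = x / ulp
    lo, hi = math.floor(q), math.ceil(q)
    if mode == "RI":
        t = hi
    elif mode == "RMI":
        t = lo
    elif mode == "RZ":
        t = lo if x >= 0 else hi
    else:                      # RNE: Python rounds Fraction ties to even
        t = round(q)
    return t * ulp


def ieee_round(mode, x, p, e_min, e_max):
    """IEEE rounding r_mode(x) following the case split of lemma 2.6."""
    x = Fraction(x)
    if x == 0:
        return x
    a = abs(x)
    # e1 is the exponent e' with 2**e1 <= |x| < 2**(e1 + 1).
    e1 = a.numerator.bit_length() - a.denominator.bit_length()
    if Fraction(2) ** e1 > a:
        e1 -= 1
    e2 = max(e1, e_min)                       # e'' = max{e', e_min}
    xr = rnd(mode, -e2 + p - 1, x)
    limit = Fraction(2) ** (e_max + 1)
    x_max = (2 - Fraction(2) ** (1 - p)) * Fraction(2) ** e_max
    if xr >= limit:
        return math.inf if mode in ("RNE", "RI") else x_max
    if xr <= -limit:
        return -x_max if mode in ("RZ", "RI") else -math.inf
    return xr
```

With the toy parameters `p = 3`, `e_min = -2`, `e_max = 2` the largest representable number is 7, so the exact value 9 rounds to 7 in mode RZ and to infinity in mode RNE, while the denormal-range value 3/32 rounds to 1/8 in mode RNE.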
In an FP implementation, the exact result of an operation will be represented by a factoring.
In the following, we therefore define IEEE rounding on factorings. We do not have
any conditions on the input factoring, but by the requirements for the destination factoring
we distinguish between two versions: we would like to get either the IEEE factoring
or the NF factoring of the rounded result.
Definition 2.8 For $mode \in \{RZ, RNE, RI, RMI\}$, the rounding function
$iround_{mode} : FACT(\mathbb{R}) \to IEEEfact$ is defined to compute the IEEE factoring of the rounded result:
\[ iround_{mode}(s, e, f) = (sr, er, fr) \in IEEEfact, \text{ with } val(sr, er, fr) = r_{mode}(val(s, e, f)) \]
and the rounding function $nround_{mode} : FACT(\mathbb{R}) \to NFfact$ is defined to compute
the NF factoring of the rounded result:
\[ nround_{mode}(s, e, f) = (sr, er, fr) \in NFfact, \text{ with } val(sr, er, fr) = r_{mode}(val(s, e, f)). \]
Moreover, we define some functions that will be used for the rounding computations.
Definition 2.9 For $mode \star s \in \{RZ, RNE, RI\}$, aligned significand rounding is the
rounding at the least significant bit position $p-1$ of the significand $f$:
\[ a\_sig\_rnd_{mode \star s}(s, e, f) = (s, e, rnd_{mode \star s,p-1}(f)). \qquad (2.9) \]
With $mode \star s \in \{RZ, RNE, RI\}$ and $vp = (p-1) - \max\{0, e_{min} - e\}$, normalized
significand rounding is the rounding of the significand $f$ at the (variable) position $vp$:
\[ n\_sig\_rnd_{mode \star s}(s, e, f) = (s, e, rnd_{mode \star s,vp}(f)). \qquad (2.10) \]
We define the post-normalization shift function, that normalizes a factoring iff the
significand $f$ of the factoring equals $f = 2$:
\[ post\_norm(s, e, f) = \begin{cases} (s, e+1, 1) & \text{if } f = 2 \\ (s, e, f) & \text{otherwise.} \end{cases} \qquad (2.11) \]
The exponent rounding maps factorings that represent magnitudes larger than or equal to
$2^{e_{max}+1}$ to the factoring of $\pm\infty$ for the reduced rounding modes RNE or RI and to
the factoring of $\pm x_{max}$ in the reduced rounding mode RZ while restoring the sign of
the factoring:
\[ exp\_rnd_{mode \star s}(s, e, f) = \begin{cases}
(s, e_{\infty}, f_{\infty}) & \text{if } |val(s, e, f)| \ge 2^{e_{max}+1} \text{ AND } val(s, e, f) \notin SPE \text{ AND } (mode \star s) \in \{RNE, RI\} \\
(s, e_{max}, f_{max}) & \text{if } |val(s, e, f)| \ge 2^{e_{max}+1} \text{ AND } val(s, e, f) \notin SPE \text{ AND } (mode \star s) = RZ \\
(s, e, f) & \text{otherwise.}
\end{cases} \qquad (2.12) \]
In this definition, we distinguish between two different rounding functions for the
significand: the aligned significand rounding and the normalized significand rounding. These
two rounding functions differ by the choice of the rounding position for the significand.
The aligned significand rounding assumes that the significand is aligned in such a way
that the significand rounding position is always at significand position $p-1$. This is the
case for IEEE factorings. The situation is different for NF factorings. Because they are
normalized even for denormalized values, the least significant bit position of the
significand, which is the significand rounding position, can vary within a wide range. The
variable rounding position $vp$ of the normalized significand rounding takes care of this
rounding position shift. In this way, normalized significand rounding is suitable for the
significand rounding of NF factorings. The following two lemmas show how the
computation of IEEE rounding on factorings can be based on the functions from definition
2.9 and prove the above argumentation in detail. Lemma 2.7 considers the IEEE
factoring and lemma 2.8 considers the NF factoring of the rounded result.
Lemma 2.7 For $mode \in \{RZ, RNE, RI, RMI\}$, the IEEE factoring
$iround_{mode}(s, e, f) : FACT(\mathbb{R}) \to IEEEfact$, with $val(iround_{mode}(s, e, f)) = r_{mode}(val(s, e, f))$,
can be computed by the sequence of a bounded normalization shift, aligned significand
rounding, a post-normalization shift and exponent rounding:
\[ iround_{mode}(s, e, f) = exp\_rnd_{mode \star s}(post\_norm(a\_sig\_rnd_{mode \star s}(\lfloor \eta_{e_{min}} \rfloor (s, e, f)))). \]
Proof: Let $(s_1, e_1, f_1) = \lfloor \eta_{e_{min}} \rfloor (s, e, f)$, $(s_2, e_2, f_2) = a\_sig\_rnd_{mode \star s}(s_1, e_1, f_1)$,
$(s_3, e_3, f_3) = post\_norm(s_2, e_2, f_2)$ and $(s_{ir}, e_{ir}, f_{ir}) = exp\_rnd_{mode \star s}(s_3, e_3, f_3)$.
We divide the proof into two steps. We will first show in part (a) of the proof
that the factoring $(s_{ir}, e_{ir}, f_{ir})$ has the value of the rounded result:
$val(s_{ir}, e_{ir}, f_{ir}) = r_{mode}(val(s, e, f))$. In part (b), it will then be shown that the
factoring $(s_{ir}, e_{ir}, f_{ir})$ is an IEEE factoring, namely that $(s_{ir}, e_{ir}, f_{ir}) \in IEEEfact_{n,p}$.
(a) From the definitions of the bounded normalization shift and the post-normalization
shift it follows directly that these two shift operations do not change the value of the
factoring, namely that $val(s_1, e_1, f_1) = val(s, e, f)$ and $val(s_3, e_3, f_3) = val(s_2, e_2, f_2)$.
Thus, we have to show that the combination of the aligned significand rounding and the
exponent rounding implements IEEE rounding.
From the definition of the bounded normalization shift it also follows that
$e_1 = \max\{e_{min}, e'\} = e''$, where $e'$ is the exponent of the corresponding unbounded
normalized factoring. Thus, we can write
\begin{align*}
val(s_3, e_3, f_3) &= val(s_2, e_2, f_2) \\
&= val(a\_sig\_rnd_{mode \star s}(s_1, e_1, f_1)) \\
&= val(s_1, e_1, rnd_{mode \star s,p-1}(f_1)) \\
&= val(s_1, 0, 2^{e_1} \cdot rnd_{mode \star s,p-1}(f_1)) \\
&= val(s_1, 0, rnd_{mode \star s,-e_1+p-1}(2^{e_1} \cdot f_1)) \\
&= val(0, 0, rnd_{mode,-e_1+p-1}((-1)^{s_1} \cdot 2^{e_1} \cdot f_1)) \\
&= rnd_{mode,-e_1+p-1}(val(s_1, e_1, f_1)) \\
&= rnd_{mode,-e''+p-1}(val(s, e, f)).
\end{align*}
Let $xr = rnd_{mode,-e''+p-1}(val(s, e, f))$. Because $val(s_3, e_3, f_3) \notin SPE$ and
$s = s_1 = s_2 = s_3 = s_{ir}$, we get for the value of the rounded result:
\begin{align*}
val(s_{ir}, e_{ir}, f_{ir}) &= val(exp\_rnd_{mode \star s}(s_3, e_3, f_3)) \\
&= \begin{cases}
(-1)^{s} \cdot \infty & \text{if } |xr| \ge 2^{e_{max}+1} \text{ AND } (mode \star s) \in \{RNE, RI\} \\
(-1)^{s} \cdot x_{max} & \text{if } |xr| \ge 2^{e_{max}+1} \text{ AND } (mode \star s) = RZ \\
xr & \text{otherwise}
\end{cases} \\
&= r_{mode}(val(s, e, f)).
\end{align*}
The last of these equations follows from lemma 2.6 and from table 2.3 for the combination
of signs and reduced rounding modes. In this way step (a) of the proof is completed.
(b) We have to show that $(s_{ir}, e_{ir}, f_{ir}) \in IEEEfact_{n,p}$. For the factorings of $\pm x_{max}$
and $\pm\infty$ in the exponent rounding definition this is obvious, so that we focus on the case
of representable rounding results with $(s_{ir}, e_{ir}, f_{ir}) = (s_3, e_3, f_3)$ in the following. From
part (a) we already know that $val(s_{ir}, e_{ir}, f_{ir}) \in FP_{n,p}$. Because of this and because IEEE
factoring representations are unique, it suffices to show that the following two conditions
are fulfilled:
(COND1) $(|val(s_{ir}, e_{ir}, f_{ir})| \ge 2^{e_{min}}) \Longrightarrow (f_{ir} \in [1, 2[)$; and
(COND2) $(|val(s_{ir}, e_{ir}, f_{ir})| < 2^{e_{min}}) \Longrightarrow (e_{ir} = e_{min})$.
For the remaining part of the proof we distinguish between: (i) $|val(s, e, f)| \ge 2^{e_{min}}$;
and (ii) $|val(s, e, f)| < 2^{e_{min}}$. For both of these cases we have to show (COND1) and
(COND2):
(i) Because $2^{e_{min}}$ is a representable number, it follows from $|val(s, e, f)| \ge 2^{e_{min}}$ that
the absolute value of the rounded result is also larger than or equal to $2^{e_{min}}$. Hence,
the premise of (COND2) never holds and the condition (COND2) is always fulfilled for case (i).
From $|val(s, e, f)| \ge 2^{e_{min}}$ it follows that the result of the bounded normalization
shift is normalized, so that $f_1 \in [1, 2[$. After significand rounding we get a significand
in the range $f_2 \in [1, 2]$, so that the post-normalization shift always outputs a
normalized rounded significand $f_3 = f_{ir} \in [1, 2[$, and thus condition (COND1)
is also fulfilled.
(ii) From $|val(s, e, f)| < 2^{e_{min}}$ it follows that the result of the bounded normalization
shift is denormalized with $f_1 \in [0, 1[$ and $e_1 = e_{min}$. For $f_1 \in [0, 1[$ we get a rounded
significand in the range $f_2 \in [0, 1]$, so the exponent rounding makes no change and
we get the exponent of the rounded result $e_{ir} = e_3 = e_{min}$. In this way condition
(COND2) is fulfilled. From $|val(s_{ir}, e_{ir}, f_{ir})| \ge 2^{e_{min}}$, $f_{ir} \in [0, 1]$ and $e_{ir} = e_{min}$, it
follows that $f_{ir} = 1$ is normalized, so that (COND1) is also fulfilled.
$\Box$
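The computation sequence of lemma 2.7 can be sketched on exact rational significands. The code below is a minimal illustration under stated assumptions: factorings are tuples `(s, e, f)` with a non-negative `Fraction` significand, the helper names are hypothetical, and the infinity factoring is represented by the stand-in `(s, math.inf, 1)` because its concrete encoding is not fixed here.

```python
from fractions import Fraction
import math

# Table 2.3: (mode, sign) -> reduced rounding mode.
REDUCE = {("RZ", 0): "RZ", ("RZ", 1): "RZ", ("RNE", 0): "RNE", ("RNE", 1): "RNE",
          ("RI", 0): "RI", ("RI", 1): "RZ", ("RMI", 0): "RZ", ("RMI", 1): "RI"}


def rnd_pos(rmode, pos, f):
    """Round f >= 0 to a multiple of 2**-pos in a reduced rounding mode."""
    scale = Fraction(2) ** pos
    q = f * scale
    t = {"RZ": math.floor, "RI": math.ceil, "RNE": round}[rmode](q)
    return t / scale


def bounded_norm(s, e, f, e_min):
    """Normalize f into [1, 2) unless that would push e below e_min."""
    if f == 0:
        return s, e_min, f
    while f >= 2:
        f, e = f / 2, e + 1
    while f < 1 and e > e_min:
        f, e = f * 2, e - 1
    while e < e_min:
        f, e = f / 2, e + 1
    return s, e, f


def iround(mode, s, e, f, p, e_min, e_max):
    """IEEE factoring of the rounded result, following lemma 2.7 step by step."""
    rmode = REDUCE[(mode, s)]
    s, e, f = bounded_norm(s, e, Fraction(f), e_min)
    f = rnd_pos(rmode, p - 1, f)            # aligned significand rounding
    if f == 2:                              # post-normalization shift
        e, f = e + 1, Fraction(1)
    if e > e_max:                           # exponent rounding
        if rmode in ("RNE", "RI"):
            return s, math.inf, Fraction(1)             # assumed infinity stand-in
        return s, e_max, 2 - Fraction(2) ** (1 - p)     # factoring of x_max
    return s, e, f
```

For the toy format `p = 3`, `e_min = -2`, `e_max = 2`, the factoring `(0, -4, 3/2)` of the value 3/32 is denormalized to exponent -2 before rounding and yields `(0, -2, 1/2)` in mode RNE, which agrees with the value 1/8 predicted by lemma 2.6.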
Lemma 2.8 For $mode \in \{RZ, RNE, RI, RMI\}$ the NF factoring
$nround_{mode}(s, e, f) : FACT(\mathbb{R}) \to NFfact$, which has the value
$val(nround_{mode}(s, e, f)) = r_{mode}(val(s, e, f))$, can be computed by the sequence of an
unbounded normalization shift, normalized significand rounding, another unbounded
normalization shift and exponent rounding:
\[ nround_{mode}(s, e, f) = exp\_rnd_{mode \star s}(\eta(n\_sig\_rnd_{mode \star s}(\eta(s, e, f)))). \]
Proof: Let $(s_{1n}, e_{1n}, f_{1n}) = \eta(s, e, f)$, and let
$(s_{2n}, e_{2n}, f_{2n}) = n\_sig\_rnd_{mode \star s}(s_{1n}, e_{1n}, f_{1n})$.
In addition to this we use the notation from the previous lemma.
We divide the proof into the following two steps: We will first show in part (a) that
the factoring $(s_{nr}, e_{nr}, f_{nr}) = exp\_rnd_{mode \star s}(\eta(n\_sig\_rnd_{mode \star s}(\eta(s, e, f))))$ has the value
of the rounded result: $val(s_{nr}, e_{nr}, f_{nr}) = r_{mode}(val(s, e, f))$. In part (b) of the proof,
it will then be shown that the factoring $(s_{nr}, e_{nr}, f_{nr})$ is an NF factoring, namely that
$(s_{nr}, e_{nr}, f_{nr}) \in NFfact_{n,p}$.
(a) The normalization shifts do not change the value of a factoring and the value of
the exponent rounding only depends on the value of its input factoring. Hence, for the
proof of $val(s_{nr}, e_{nr}, f_{nr}) = val(s_{ir}, e_{ir}, f_{ir}) = r_{mode}(val(s, e, f))$, it suffices to show that
\[ val(s_{2n}, e_{2n}, f_{2n}) = val(n\_sig\_rnd_{mode \star s}(\eta(s, e, f))) = val(a\_sig\_rnd_{mode \star s}(\lfloor \eta_{e_{min}} \rfloor (s, e, f))) = val(s_2, e_2, f_2). \]
For the proof of this equation, we distinguish between: (i) $|val(s, e, f)| \ge 2^{e_{min}}$; and (ii)
$|val(s, e, f)| < 2^{e_{min}}$.
(i) For $|val(s, e, f)| \ge 2^{e_{min}}$, the output of the bounded normalization shift is normalized,
so that $(s_{1n}, e_{1n}, f_{1n}) = (s_1, e_1, f_1)$ and $e_{1n} = e' \ge e_{min}$. Hence, in the definition
of normalized significand rounding, the variable rounding position becomes
$vp = (p-1) - \max\{0, e_{min} - e_{1n}\} = p-1$ and agrees with the rounding position $p-1$ of
the aligned significand rounding. Thus, the output factorings of both significand
rounding functions are also the same: $val(s_{2n}, e_{2n}, f_{2n}) = val(s_2, e_2, f_2)$.
(ii) Because for $|val(s, e, f)| = 0$ none of the two steps in both computations changes the
factoring, we only deal with non-zero numbers in the following. Since
$|val(s, e, f)| < 2^{e_{min}}$, we get for the exponent of the unbounded normalized factoring
$e_{1n} = e' < e_{min}$. For the same reason, the output of the bounded normalization shift is
denormalized with $e_1 = e_{min}$, so that the overall rounding position of aligned significand
rounding becomes $-e_{min} + p - 1$. In the case of normalized significand rounding,
the variable significand rounding position is
$vp = (p-1) - \max\{0, e_{min} - e_{1n}\} = (p-1) - (e_{min} - e_{1n})$.
In combination with the exponent factor $2^{e_{1n}}$, we get the
overall rounding position $-e_{1n} + vp = -e_{min} + p - 1$ also in this case.
(b) We have to show that $(s_{nr}, e_{nr}, f_{nr})$ is an NF factoring. Hence, $(s_{nr}, e_{nr}, f_{nr})$ has to
be normalized for all non-zero numbers. After the second unbounded normalization shift,
we get a normalized significand $f_{nr}$ in the range $f_{nr} \in [1, 2[$ for all non-zero representable
numbers. Because there is no condition on the NF factoring of a zero and the factoring
representations of $\pm\infty$ and $\pm x_{max}$ are defined in the exponent rounding output to
be normalized, $(s_{nr}, e_{nr}, f_{nr})$ is an NF factoring in any case. $\Box$
We distinguish between rounding in single precision and double precision by the choice
of the corresponding values of: $p$, $n$, $e_{min}$, $e_{max}$, $f_{max}$, $e_{\infty}$ and $f_{\infty}$.
For the definition of exceptions, and some correctness proofs, it is helpful to have a
rounding function $r$ with an unbounded exponent range. For a factoring $(s, e, f)$ with
$f \ne 0$, the new rounding $r$ is defined by:
\[ r_{mode}(s, e, f) = post\_norm(a\_sig\_rnd_{mode \star s}(\eta(s, e, f))). \qquad (2.13) \]
2.4 Special Cases
The IEEE standard defines five exceptions that can occur when a floating-point operation
is executed: overflow, underflow, inexact, invalid and division by zero. In addition, we
consider the exception of an unimplemented FP operation, which is not part of the standard.
The occurrences of these exceptions are signaled by the six flags ovf,
unf, inx, inv, dvz, and ufo. Except for the combinations of inx with ovf or unf, at
most one FP exception can occur during an operation.
The trap handler enable bits ovf_en, unf_en, inx_en, inv_en, and dvz_en are set
by the user. For an unimplemented FP operation, the corresponding trap handler is
always enabled. If a trap handler is enabled, i.e., the trap handler enable bit is active,
the occurrence of the corresponding exception starts the execution of an exception trap
routine. With a disabled trap handler, a result is returned immediately even for the
occurrence of the corresponding exception. If inx_en and ovf_en or unf_en are enabled
and both exceptions occur during the same operation, the ovf or unf trap has precedence
over the execution of the inx trap.
After describing the IEEE flags and exception handling in detail, we will give an overview of
the results of operations on special values, which have a strong relationship to exceptions.
Finally, we will give a general summary of the computations for each IEEE operation.
2.4.1 IEEE Flags
We consider an arithmetic floating-point operation $op \in \{ADD/SUB, MULT, DIV, SQRT, CONV\}$
operating on finite operands. This operation $op$ delivers the exact result $exact_{op}$,
that can be represented by the factoring $(s_{ex}, e_{ex}, f_{ex})$. We denote the value of the rounded
result of the operation by $bro = val(round_{mode}(s_{ex}, e_{ex}, f_{ex}))$, and the value of the result
that is rounded with an unbounded exponent range by $uro = val(r_{mode}(s_{ex}, e_{ex}, f_{ex}))$.
Overflow The overflow flag signals that the magnitude of the unbounded rounded result
is bigger than the magnitude of the largest representable number:
\[ |uro| > |x_{max}| = (2 - 2^{-p+1}) \cdot 2^{e_{max}}. \]
Underflow The conditions for an underflow differ depending on the value of unf_en.
They are based on the definitions of tininess and loss-of-accuracy:
- There are two possible definitions for tininess given by the standard: A result is
  tiny-before-rounding if $0 \ne |exact_{op}| < 2^{e_{min}}$, and tiny-after-rounding if $0 \ne |uro| < 2^{e_{min}}$.
- Similarly, the standard provides two loss-of-accuracy definitions: Loss-of-accuracy-a
  occurs if $exact_{op} \ne 0$ AND $uro \ne bro$, and loss-of-accuracy-b occurs if $bro \ne exact_{op}$.
For both tininess and loss-of-accuracy, the implementor may choose one of the two
definitions provided by the standard, but these choices have to be the same for all operations
and precisions. Based on these conditions, the underflow exception is defined by:
- If unf_en = 0, then an underflow occurs if both tininess and loss-of-accuracy occur.
- If unf_en = 1, then an underflow occurs if tininess occurs.
Definition 2.10 We define the boolean function $TINY(s, e, f)$, that delivers the boolean
value corresponding to the tininess condition $(0 \ne |val(s, e, f)| < 2^{e_{min}})$.
Lemma 2.9 For $f \ne 0$ and the normalized factoring $(s, e', f') = \eta(s, e, f)$, the number
$x = val(s, e, f)$ is tiny, signaled by $(TINY(s, e, f) = 1)$, iff $(e' < e_{min})$.
Proof: Because $(s, e', f')$ is normalized, $1 \le f' < 2$ and $2^{e'} \le |val(s, e, f)| < 2^{e'+1}$. Thus,
the tininess condition can be written as $e' + 1 \le e_{min}$. This is equivalent to $e' < e_{min}$. $\Box$
Inexact An inexact exception occurs if $bro \ne exact_{op}$. This is exactly the
loss-of-accuracy-b condition and includes the case of an overflow.
Division by Zero The dvz flag signals that the second operand of a division equals
$+0$ or $-0$ and the first operand is a finite non-zero number.
Invalid The inv flag is signaled:
1. for any operation, where at least one operand is a signaling NaN,
2. for effective subtractions of two infinities,
3. for the multiplication of 0 and infinity regardless of the signs,
4. for divisions of $0/0$ or $\infty/\infty$ regardless of the signs,
5. for remainders, where the first operand is infinite or the second is a zero,
6. for the square root of an operand less than zero,
7. for comparisons with a condition code that demands 'invalid if unordered' (see
   table 2.2) and the operands are unordered.
Unimplemented FP operation The ufo flag is signaled for any FP operation that is
not implemented in hardware.
2.4.2 Exceptions
If an exception is recognized during the computation of an operation, the corresponding
IEEE flag(s) are set to 1. The IEEE flags are sticky, i.e., if they have been set once, they
stay active until they are cleared by the user. The further computation depends on the
value of the corresponding trap handler enable bit:
Disabled Trap Handler In most cases the delivered result has to be the correctly
rounded result $bro$, but for a division by zero, a correctly signed infinity, and for an invalid
exception, a qNaN has to be delivered. Moreover, for operations on special values and
zeros, the results are summarized in section 2.4.3. With the computation of the
result, the execution of the operation is finished.
Enabled Trap Handler The operation starts the corresponding trap routine, that is
responsible for the further computations. The operands for the trap routine are specified
by the standard and differ from the above results depending on the exception:
Each trap should get the operation type of the operation that caused the exception,
the information which exception occurred, and the destination's format. In the case of a
trapped invalid or a trapped division by zero, the operand values have to be accessible to
the trap routine. In a trapped inexact exception the correctly rounded result is given to
the trap routine. In trapped overflows and trapped underflows exponent wrapping has to
be computed before the result is fed to the trap routine:
- Trapped Overflow. If a trapped overflow occurs, then the wrapped exponent $e - \alpha$
  is used and a factoring of $r_{mode}(exact_{op} \cdot 2^{-\alpha})$ is delivered to the trap routine, where
  $\alpha = 3 \cdot 2^{n-2}$. We will consider the corresponding IEEE factoring $iround_{mode}(s, e - \alpha, f)$
  or the corresponding NF factoring $nround_{mode}(s, e - \alpha, f)$.
  The magnitude of all exact overflow results is larger than $lbound_{ovf} = 2^{e_{max}} = 2^{2^{n-1}-1}$.
  An upper bound on the magnitude of exact results is found by looking at the
  case that the largest representable number, which is smaller than $2 \cdot 2^{e_{max}} = 2^{2^{n-1}}$,
  is divided by the representable number with the smallest magnitude
  $2^{e_{min}-p+1} = 2^{-2^{n-1}+2-p+1}$:
  \[ 2^{2^{n-1}} / 2^{-2^{n-1}+2-p+1} = 2^{2^{n-1}+2^{n-1}-2+p-1} = 2^{2^{n}+p-3}. \]
  Thus, the magnitude of all exact results of the standard's operations on representable
  numbers is smaller than $ubound_{ovf} = 2^{2^{n}+p-3}$. The exponent wrapping by $-\alpha$
  reduces the lower bound on the magnitude of exact overflow results to
  \[ lbound_{ovf} \cdot 2^{-\alpha} = 2^{2^{n-1}-1-3 \cdot 2^{n-2}} = 2^{-2^{n-2}-1} > 2^{-2^{n-1}+2} = 2^{e_{min}} \]
  and the upper bound on the magnitude of exact overflow results to
  \[ ubound_{ovf} \cdot 2^{-\alpha} = 2^{2^{n}+p-3-3 \cdot 2^{n-2}} = 2^{2^{n-2}+p-3} < 2^{2^{n-1}-1} = 2^{e_{max}}. \]
  Therefore, after exponent wrapping all overflow results have values of normalized
  numbers.
- Trapped Underflow. If a trapped underflow occurs, then the wrapped exponent
  $e + \alpha$ is used and a factoring of $r_{mode}(exact_{op} \cdot 2^{\alpha})$ is delivered to the trap
  routine, where $\alpha = 3 \cdot 2^{n-2}$. We will consider the corresponding IEEE factoring
  $iround_{mode}(s, e + \alpha, f)$ or the corresponding NF factoring $nround_{mode}(s, e + \alpha, f)$.
  The magnitude of all exact underflow results is smaller than $ubound_{unf} = 2^{e_{min}} = 2^{-2^{n-1}+2}$.
  A lower bound on the magnitude of exact results is found by looking at
  the case that the representable number with the smallest magnitude
  $2^{e_{min}-p+1} = 2^{-2^{n-1}-p+3}$ is multiplied by itself:
  \[ 2^{-2^{n-1}-p+3} \cdot 2^{-2^{n-1}-p+3} = 2^{-2^{n}-2p+6}. \]
  Thus, the magnitude of all exact results of the standard's operations on representable
  numbers is larger than or equal to $lbound_{unf} = 2^{-2^{n}-2p+6}$.
  The exponent wrapping by $\alpha$ increases the lower bound on the magnitude of exact
  underflow results to
  \[ lbound_{unf} \cdot 2^{\alpha} = 2^{-2^{n}-2p+6+3 \cdot 2^{n-2}} = 2^{-2^{n-2}-2p+6} > 2^{-2^{n-1}+2} = 2^{e_{min}} \]
  and the upper bound on the magnitude of exact underflow results to
  \[ ubound_{unf} \cdot 2^{\alpha} = 2^{-2^{n-1}+2+3 \cdot 2^{n-2}} = 2^{2^{n-2}+2} < 2^{2^{n-1}-1} = 2^{e_{max}}. \]
  Therefore, after exponent wrapping all underflow results also have values of
  normalized numbers.
Corollary 2.10 After exponent wrapping all results of operations on representable
numbers have values of normalized numbers.

ADD     +0          -0          +y      -y      +∞      -∞      qNaN2     sNaN
+0      +0          +0(-0 RMI)  +y      -y      +∞      -∞      qNaN2     qNaN
-0      +0(-0 RMI)  -0          +y      -y      +∞      -∞      qNaN2     qNaN
+x      +x          +x          n.s.    n.s.    +∞      -∞      qNaN2     qNaN
-x      -x          -x          n.s.    n.s.    +∞      -∞      qNaN2     qNaN
+∞      +∞          +∞          +∞      +∞      +∞      qNaN    qNaN2     qNaN
-∞      -∞          -∞          -∞      -∞      qNaN    -∞      qNaN2     qNaN
qNaN1   qNaN1       qNaN1       qNaN1   qNaN1   qNaN1   qNaN1   qNaN1/2   qNaN
sNaN    qNaN        qNaN        qNaN    qNaN    qNaN    qNaN    qNaN      qNaN
Table 2.4: Results of additions on special values.

ADD     ±0    +y               -y               +∞    -∞    qNaN2   sNaN
±0      no    no               no               no    no    no      INV
+x      no    OVF/UNF/INX/no   UNF/INX/no       no    no    no      INV
-x      no    UNF/INX/no       OVF/UNF/INX/no   no    no    no      INV
+∞      no    no               no               no    INV   no      INV
-∞      no    no               no               INV   no    no      INV
qNaN1   no    no               no               no    no    no      INV
sNaN    INV   INV              INV              INV   INV   INV     INV
Table 2.5: Exceptions of additions.
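The exponent-wrapping bounds derived in section 2.4.2 for trapped overflow and trapped underflow can be evaluated for concrete formats. The following Python sketch works on base-2 exponents only and assumes the relations $e_{min} = -2^{n-1}+2$, $e_{max} = 2^{n-1}-1$ and a wrapping constant of $3 \cdot 2^{n-2}$ as used above; the function names are illustrative.

```python
def wrapped_exponent_bounds(n, p):
    """Base-2 exponents of the wrapped bounds for n exponent bits and p significand bits."""
    e_min, e_max = -2 ** (n - 1) + 2, 2 ** (n - 1) - 1
    alpha = 3 * 2 ** (n - 2)                     # exponent wrapping constant
    lb_ovf, ub_ovf = e_max, 2 ** n + p - 3       # exact overflow results
    lb_unf, ub_unf = -2 ** n - 2 * p + 6, e_min  # exact underflow results
    return {
        "range": (e_min, e_max),
        "overflow": (lb_ovf - alpha, ub_ovf - alpha),
        "underflow": (lb_unf + alpha, ub_unf + alpha),
    }


def wrapped_results_normalized(n, p):
    """Check that every wrapped bound lies strictly between 2**e_min and 2**e_max."""
    b = wrapped_exponent_bounds(n, p)
    e_min, e_max = b["range"]
    return all(e_min < lo and hi < e_max for lo, hi in (b["overflow"], b["underflow"]))
```

For single precision (`n = 8`, `p = 24`) the wrapped overflow results lie between $2^{-65}$ and $2^{85}$ and the wrapped underflow results between $2^{-106}$ and $2^{66}$, well inside the normalized range from $2^{-126}$ to $2^{127}$.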
2.4.3 Operations on Special Values
In this section, we summarize the results of additions (see table 2.4; for subtractions the
second operand has to be multiplied by $-1$), multiplications (see table 2.6), divisions (see
table 2.8), and square roots (see table 2.10) on special values and zeros and list the possible
exceptions (see tables 2.5, 2.7, 2.9, and 2.10).
In the tables, different possibilities of one entry are separated by '/', 'n.s.' means that
the corresponding entry cannot be specified in general, 'no' means that no exception
occurs, and '(-0 RMI)' means that the result is $-0$ if the rounding mode is RMI. Because
the representation of qNaNs is not unique, we enumerate such operands by qNaN1 and
qNaN2.
MULT    +0      -0      +y      -y      +∞      -∞      qNaN2     sNaN
+0      +0      -0      +0      -0      qNaN    qNaN    qNaN2     qNaN
-0      -0      +0      -0      +0      qNaN    qNaN    qNaN2     qNaN
+x      +0      -0      n.s.    n.s.    +∞      -∞      qNaN2     qNaN
-x      -0      +0      n.s.    n.s.    -∞      +∞      qNaN2     qNaN
+∞      qNaN    qNaN    +∞      -∞      +∞      -∞      qNaN2     qNaN
-∞      qNaN    qNaN    -∞      +∞      -∞      +∞      qNaN2     qNaN
qNaN1   qNaN1   qNaN1   qNaN1   qNaN1   qNaN1   qNaN1   qNaN1/2   qNaN
sNaN    qNaN    qNaN    qNaN    qNaN    qNaN    qNaN    qNaN      qNaN
Table 2.6: Results of multiplications on special values.

MULT    ±0    ±y               ±∞    qNaN2   sNaN
±0      no    no               INV   no      INV
±x      no    OVF/UNF/INX/no   no    no      INV
±∞      INV   no               no    no      INV
qNaN1   no    no               no    no      INV
sNaN    INV   INV              INV   INV     INV
Table 2.7: Multiplication exceptions.

DIV     +0      -0      +y      -y      +∞      -∞      qNaN2     sNaN
+0      qNaN    qNaN    +0      -0      +0      -0      qNaN2     qNaN
-0      qNaN    qNaN    -0      +0      -0      +0      qNaN2     qNaN
+x      +∞      -∞      n.s.    n.s.    +0      -0      qNaN2     qNaN
-x      -∞      +∞      n.s.    n.s.    -0      +0      qNaN2     qNaN
+∞      +∞      -∞      +∞      -∞      qNaN    qNaN    qNaN2     qNaN
-∞      -∞      +∞      -∞      +∞      qNaN    qNaN    qNaN2     qNaN
qNaN1   qNaN1   qNaN1   qNaN1   qNaN1   qNaN1   qNaN1   qNaN1/2   qNaN
sNaN    qNaN    qNaN    qNaN    qNaN    qNaN    qNaN    qNaN      qNaN
Table 2.8: Results of divisions on special values.

DIV     ±0    ±y               ±∞    qNaN   sNaN
±0      INV   no               no    no     INV
±x      DVZ   OVF/UNF/INX/no   no    no     INV
±∞      no    no               INV   no     INV
qNaN    no    no               no    no     INV
sNaN    INV   INV              INV   INV    INV
Table 2.9: Division exceptions.

SQRT        +0    -0    +y       -y     +∞    -∞     qNaN1   sNaN
result      +0    -0    n.s.     qNaN   +∞    qNaN   qNaN1   qNaN
exception   no    no    INX/no   INV    no    INV    no      INV
Table 2.10: Results and exceptions of square roots on special values.
2.4.4 Summary of IEEE Computations
In the previous sections, various aspects of the computations for IEEE operations were
described separately. In this section all aspects of the computations will be summarized
for each IEEE operation.
In the previous section about the computation on special value and zero operands $x$
and $y$, we saw that for these cases the result can have only one of a few possible values,
namely $\pm 0$, $\pm\infty$, $qNaN$, $x$, $y$. If none of these special cases occurs, the IEEE
rounded result with or without exponent wrapping should be output.
Assume that we have a factoring $(s_{rc}, e_{rc}, f_{rc})$ that represents the exact result $exact_{op}$
for $op \in \{ADD/SUB, MULT, DIV, SQRT, CONV\}$ and for non-zero representable operands.
We define five special condition flags scqnan, scinf, scx, scy, and sczero that
correspond to the occurrence of the special case results: $qNaN$, $\pm\infty$, $x = val(sa, ea, fa)$,
$y = val(sb, eb, fb)$, and $\pm 0$. Moreover, the exponent wrapping constant is defined by:
\[ wec = \begin{cases}
-\alpha & \text{if ovf AND ovf\_en (wrapped overflow)} \\
+\alpha & \text{if unf AND unf\_en (wrapped underflow)} \\
0 & \text{otherwise.}
\end{cases} \qquad (2.14) \]
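The selection in equation (2.14) is a three-way choice on the overflow and underflow flags and their trap enable bits. A minimal Python sketch, assuming the wrapping constant $3 \cdot 2^{n-2}$ of section 2.4.2 and using illustrative function names:

```python
def wrap_constant(n):
    """Exponent wrapping constant 3 * 2**(n - 2) for n exponent bits."""
    return 3 * 2 ** (n - 2)


def wec(n, ovf, ovf_en, unf, unf_en):
    """Exponent wrapping constant wec of equation (2.14)."""
    if ovf and ovf_en:          # wrapped overflow
        return -wrap_constant(n)
    if unf and unf_en:          # wrapped underflow
        return wrap_constant(n)
    return 0
```

For single precision (`n = 8`) a trapped overflow gives `wec = -192`, and for double precision (`n = 11`) a trapped underflow gives `wec = 1536`.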
Based on these definitions, the IEEE factoring of the final result of an IEEE operation
can be selected by:
\[ (s_{ifnl}, e_{ifnl}, f_{ifnl}) = \begin{cases}
(0, e_{qNaN}, f_{qNaN}) & \text{if scqnan} \\
(s_{inf}, e_{\infty}, f_{\infty}) & \text{if scinf} \\
(sa, ea, fa) & \text{if scx} \\
(sb, eb, fb) & \text{if scy} \\
(s_{0}, e_{0}, 0) & \text{if sczero} \\
iround(s_{rc}, e_{rc} + wec, f_{rc}) & \text{otherwise.}
\end{cases} \qquad (2.15) \]
The corresponding NF factoring of the rounded IEEE operation result is given by (see
lemma 2.8):
\[ (s_{nfnl}, e_{nfnl}, f_{nfnl}) = \begin{cases}
(0, e_{qNaN}, f_{qNaN}) & \text{if scqnan} \\
(s_{inf}, e_{\infty}, f_{\infty}) & \text{if scinf} \\
(sa, ea, fa) & \text{if scx} \\
(sb, eb, fb) & \text{if scy} \\
(s_{0}, e_{0}, 0) & \text{if sczero} \\
nround(s_{rc}, e_{rc} + wec, f_{rc}) & \text{otherwise.}
\end{cases} \qquad (2.16) \]
Definition 2.11 We extend the definition of the function $iround$ on factorings of special
values $(s_{sp}, e_{sp}, f_{sp}) \in SPEfact$ by the identity $iround(s_{sp}, e_{sp}, f_{sp}) = (s_{sp}, e_{sp}, f_{sp})$.
This extension is also covered by the computation sequence for $iround$ from lemma
2.7. The reason for this is that we define the factorings of special values to be exact
and normalized and with an exponent of $e_{max} + 1$. Thus, the first three steps, namely the
bounded normalization shift, the significand rounding and the post-normalization shift, do
not change the factorings of special values. In the last step the factoring is not
changed either, because the definition of the exponent rounding in equation 2.12 already
includes this case.
Note that the exponent wrapping constant is 0 for operations on special values, because
no overflow or underflow can occur for them. Thus, with the extension of the definition
of the function $iround$ and the definition of the exact result factoring:
\[ (s_{ex}, e_{ex}, f_{ex}) = \begin{cases}
(0, e_{qNaN}, f_{qNaN}) & \text{if scqnan} \\
(s_{inf}, e_{\infty}, f_{\infty}) & \text{if scinf} \\
(sa, ea, fa) & \text{if scx} \\
(sb, eb, fb) & \text{if scy} \\
(s_{0}, e_{0}, 0) & \text{if sczero} \\
(s_{rc}, e_{rc}, f_{rc}) & \text{otherwise,}
\end{cases} \qquad (2.17) \]
for all cases the IEEE factoring of the final result $(s_{ifnl}, e_{ifnl}, f_{ifnl})$ can be described by:
\[ (s_{ifnl}, e_{ifnl}, f_{ifnl}) = iround(s_{ex}, e_{ex} + wec, f_{ex}). \qquad (2.18) \]
With the same extension of the function $nround$ and a similar argumentation for the
computation sequence for $nround$ from lemma 2.8, for all cases the corresponding NF
factoring of the final result $(s_{nfnl}, e_{nfnl}, f_{nfnl})$ is computed by
\[ (s_{nfnl}, e_{nfnl}, f_{nfnl}) = nround(s_{ex}, e_{ex} + wec, f_{ex}). \qquad (2.19) \]
The equations for the special condition ags and the sign s
sc
can be easily extracted for
each IEEE operation from the tables on the special value results in the previous section.
With the factorings of the input operands (sa; ea; fa) and (sb; eb; fb), and the following
conditions on these factorings
zeroa () (jval(sa; ea; fa)j = 0) zerob () (jval(sb; eb; fb)j = 0)
infa () (jval(sa; ea; fa)j =1) infb () (jval(sb; eb; fb)j =1)
qnana () (val(sa; ea; fa) = qNaN) qnanb () (val(sb; eb; fb) = qNaN)
snana () (val(sa; ea; fa) = sNaN) snanb () (val(sb; eb; fb) = sNaN)
zero
rc
() (f
rc
= 0);
we get the following equations:
 addition/subtraction:
scqnan = snana _ snanb_ (infa ^ infb ^ (sa
 sb)) (2.20)
scx = (qnana ^ snanb) _ (zerob ^ zeroa ^ snana) (2.21)
scy = (qnanb ^ qnana ^ snana) _ (zeroa ^ zerob ^ snanb) (2.22)
scinf = scqnan ^ scx ^ scy ^ (infa _ infb) (2.23)
sczero = (zeroa ^ zerob) _ zero
rc
(2.24)
s
inf
= (sa ^ infa) _ (sb ^ infb) (2.25)
s
0
= (is RMI ^ (sa sb sop)) _ (sa ^ (sb sop)) (2.26)
 multiplication:
scqnan = snana _ snanb _ (infa ^ zerob) _ (infb ^ zeroa) (2.27)
scx = qnana ^ snanb (2.28)
scy = qnanb ^ scx ^ snana (2.29)
scinf = scqnan^ scx ^ scy ^ (infa _ infb) (2.30)
sczero = (zeroa ^ zerob) _ (scqnan _ scx _ scy) (2.31)
s
inf
= sa
 sb (2.32)
s
0
= s
inf
(2.33)
28 CHAPTER 2. IEEE FLOATING-POINT STANDARD
 division:
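$scqnan = snana \vee snanb \vee (zeroa \wedge zerob) \vee (infa \wedge infb)$ (2.34)
$scx = qnana \wedge \overline{snanb}$ (2.35)
$scy = qnanb \wedge \overline{scx} \wedge \overline{snana}$ (2.36)
$scinf = \overline{scqnan} \wedge \overline{scx} \wedge \overline{scy} \wedge (infa \vee zerob)$ (2.37)
$sczero = \overline{scqnan} \wedge \overline{scx} \wedge \overline{scy} \wedge (infb \vee zeroa)$ (2.38)
$s_{inf} = sa \oplus sb$ (2.39)
$s_0 = s_{inf}$ (2.40)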
 square-root:
$scqnan = (\overline{zeroa} \wedge sa \wedge \overline{qnana}) \vee snana$ (2.41)
$scx = qnana$ (2.42)
$scy = 0$ (2.43)
$scinf = infa \wedge \overline{sa}$ (2.44)
$sczero = zeroa$ (2.45)
$s_{inf} = 0$ (2.46)
$s_0 = sa$ (2.47)
This completes the specification of the IEEE operations. In the next section we provide some methods by which the rounding computations can be simplified.
2.5. ROUNDING COMPUTATION UTILITIES 29
2.5 Rounding Computation Utilities
2.5.1 Representatives
Definition 2.12 For an integer $\lambda$, two real numbers $x_1$ and $x_2$ are $\lambda$-equivalent, denoted by $x_1 \overset{\lambda}{=} x_2$, if there exists an integer $q$ such that $x_1, x_2 \in\ ]q \cdot 2^{-\lambda}, (q+1) \cdot 2^{-\lambda}[$ or $x_1 = x_2 = q \cdot 2^{-\lambda}$.
Thus, the binary representations of $\lambda$-equivalent reals must agree in the first $\lambda$ positions to the right of the binary point. We choose the $\lambda$-representatives of the equivalence classes as follows:
Definition 2.13 Let $x$ denote a real number and $\lambda$ an integer. Let $q$ denote the integer satisfying $q \cdot 2^{-\lambda} \le x < (q+1) \cdot 2^{-\lambda}$. The $\lambda$-representative of $x$, denoted by $rep_{\lambda}(x)$, is defined by:
$rep_{\lambda}(x) = q \cdot 2^{-\lambda}$ if $x = q \cdot 2^{-\lambda}$
$rep_{\lambda}(x) = (q + 0.5) \cdot 2^{-\lambda}$ if $x \in\ ]q \cdot 2^{-\lambda}, (q+1) \cdot 2^{-\lambda}[$.
The $\lambda$-representatives are integral multiples of $2^{-\lambda-1}$. Thus, they can be represented with $\lambda + 1$ bits to the right of the binary point. Note that the least significant bit in this representation indicates whether the corresponding equivalence class is a single point or an open interval. The following lemma describes how the $\lambda$-representative of a binary number can be computed.
Lemma 2.11 With $f \in [0, 2[$, integers $0 < \lambda < k$, so that $f$ is a multiple of $2^{-k}$, and $f[0:k] = bin^{0}_{-k}(f)$, we define:
$sticky\_bit_{\lambda}(f) = or(f[\lambda+1 : k])$
$sticky_{\lambda}(f) = \langle f[0:\lambda] \rangle_{neg} + sticky\_bit_{\lambda}(f) \cdot 2^{-\lambda-1}$.
The $\lambda$-representative of $f$ is then given by
$rep_{\lambda}(f) = sticky_{\lambda}(f)$.
Proof: The binary representation of $rep_{\lambda}(f)$ is identical to $f[0:k]$ up to the position with weight $2^{-\lambda}$. If $f$ is an integral multiple of $2^{-\lambda}$, then $f = rep_{\lambda}(f) = \langle f[0:\lambda] \rangle_{neg}$ and $f[\lambda+1 : k]$ is all zeros, and so is $sticky\_bit_{\lambda}(f)$. If $f$ is not an integral multiple of $2^{-\lambda}$, then $f[\lambda+1 : k]$ is not all zeros, and therefore $sticky\_bit_{\lambda}(f) = 1$ and $rep_{\lambda}(f) = sticky_{\lambda}(f)$, as required. □
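The statement of Lemma 2.11 can be checked numerically. The following Python sketch is an illustration only, not part of the original design: the function names `rep` and `rep_from_bits` and the list-of-bits encoding (bit i has weight 2^(-i)) are assumptions made for this example. It computes the representative once directly from Definition 2.13 and once from the truncated bit string plus the sticky bit.

```python
from fractions import Fraction

def rep(x: Fraction, lam: int) -> Fraction:
    """Representative per Definition 2.13, using exact rational arithmetic.

    Assumes lam >= 0 and x >= 0 for this sketch.
    """
    unit = Fraction(1, 2 ** lam)      # 2^(-lam)
    q = x // unit                     # integer q with q*unit <= x < (q+1)*unit
    if x == q * unit:
        return q * unit               # class is a single point
    return (q + Fraction(1, 2)) * unit  # class is an open interval

def rep_from_bits(bits: list, lam: int) -> Fraction:
    """Representative per Lemma 2.11; bits = f[0:k], bits[i] has weight 2^(-i)."""
    head = sum(Fraction(b, 2 ** i) for i, b in enumerate(bits[: lam + 1]))
    sticky_bit = int(any(bits[lam + 1:]))   # OR of the discarded positions
    return head + Fraction(sticky_bit, 2 ** (lam + 1))

def value(bits: list) -> Fraction:
    """Value of the bit string f[0:k]."""
    return sum(Fraction(b, 2 ** i) for i, b in enumerate(bits))
```

For example, for f = 1.0110101 (binary) and position 3, both functions return 1.0111 (binary): the first three fraction bits are kept and the sticky bit records that something nonzero was cut off.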
Lemma 2.12 For integers $\lambda_1, \lambda_2$ with $0 < \lambda_2 < \lambda_1$, and $f = \langle f[0:k] \rangle_{neg}$: (i) one can derive $rep_{\lambda_2}(f)$ from $rep_{\lambda_1}(f) = \langle rep1[0 : \lambda_1+1] \rangle_{neg}$ by
$rep_{\lambda_2}(f) = \langle (rep1[0 : \lambda_2],\ or(rep1[\lambda_2+1 : \lambda_1+1])) \rangle_{neg}$;
(ii) $f_x \overset{\lambda_1}{=} f_y \implies f_x \overset{\lambda_2}{=} f_y$.
Proof: (i) The bits $sticky\_bit_{\lambda_1}(f)$ and $sticky\_bit_{\lambda_2}(f)$ are defined by
$sticky\_bit_{\lambda_1}(f) = or(f[\lambda_1+1 : k])$
$sticky\_bit_{\lambda_2}(f) = or(f[\lambda_2+1 : k])$.
Substituting the definition of $sticky\_bit_{\lambda_1}(f)$ into the definition of $sticky\_bit_{\lambda_2}(f)$ yields
$sticky\_bit_{\lambda_2}(f) = or(f[\lambda_2+1 : \lambda_1],\ sticky\_bit_{\lambda_1}(f))$.
Because $rep1[0 : \lambda_1] = f[0 : \lambda_1]$ and $rep1[\lambda_1+1] = sticky\_bit_{\lambda_1}(f)$ by the definition of $rep_{\lambda_1}(f)$, we have
$sticky\_bit_{\lambda_2}(f) = or(rep1[\lambda_2+1 : \lambda_1+1])$
and part (i) of the lemma follows.
(ii) We have $rep_{\lambda_1}(f_x) = rep_{\lambda_1}(f_y)$. The first part showed that $\lambda_2$-representatives can be computed from $\lambda_1$-representatives. Therefore, $rep_{\lambda_2}(f_x) = rep_{\lambda_2}(f_y)$, and part (ii) of the lemma follows. □
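Part (i) of Lemma 2.12 says that a coarser representative never requires going back to the full operand: OR-ing the tail of the finer representative is enough. A minimal Python sketch of this, where the helper names `rep_bits` and `narrow` and the list-of-bits encoding are assumptions of this example:

```python
def rep_bits(bits: list, lam: int) -> list:
    """Bits rep[0:lam+1] of the representative of f = <bits>_neg (Lemma 2.11).

    bits[i] has weight 2^(-i); the appended last bit is the sticky bit.
    """
    return bits[: lam + 1] + [int(any(bits[lam + 1:]))]

def narrow(rep1: list, lam1: int, lam2: int) -> list:
    """Lemma 2.12(i): derive the lam2-representative from rep1[0:lam1+1].

    Requires 0 < lam2 < lam1. Only rep1 is read, never the original operand.
    """
    return rep1[: lam2 + 1] + [int(any(rep1[lam2 + 1: lam1 + 2]))]
```

The design consequence is that a datapath may compress an operand to its representative early and still round at any coarser position later.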
For $mode \in \{RZ, RNE, RI\}$ and $f' = rep_{\lambda}(f)$ one can additionally show the following equations:
$f_x \overset{\lambda}{=} f_y$ iff $2^k \cdot f_x \overset{\lambda-k}{=} f_y \cdot 2^k$ for an integer $k$ (2.48)
$rnd_{mode,\lambda-1}(f) = rnd_{mode,\lambda-1}(f')$ (2.49)
$rnd_{mode,\lambda-1}(f) = f$ iff $rnd_{mode,\lambda-1}(f') = f'$ (2.50)
$rnd_{mode,\lambda-1}(f') = f'$ iff $f' = q \cdot 2^{-(\lambda-1)}$ for an integer $q$ (2.51)
$f' = f$ iff $rnd_{mode,\lambda-1}(f') = f$. (2.52)
For the computation of rounded factorings we will use the following properties.
Lemma 2.13 Let $(s_x, e_x, f_x)$ and $(s_y, e_y, f_y)$ be two factorings and let $(s'_x, e'_x, f'_x)$ and $(s'_y, e'_y, f'_y)$ be the corresponding bounded normalized factorings: $(s'_x, e'_x, f'_x) = \lfloor \eta_{e_{min}} \rfloor (s_x, e_x, f_x)$ and $(s'_y, e'_y, f'_y) = \lfloor \eta_{e_{min}} \rfloor (s_y, e_y, f_y)$. If the values of $(s_x, e_x, f_x)$ and $(s_y, e_y, f_y)$ are $(p - e'_x)$-equivalent:
$x = val(s_x, e_x, f_x) = val(s'_x, e'_x, f'_x) \overset{p-e'_x}{=} val(s'_y, e'_y, f'_y) = val(s_y, e_y, f_y) = y$,
then (i) $s'_x = s'_y$, $e'_x = e'_y$, and $f'_x \overset{p}{=} f'_y$; (ii) $iround_{mode}(s_x, e_x, f_x) = iround_{mode}(s_y, e_y, f_y)$; and (iii) $nround_{mode}(s_x, e_x, f_x) = nround_{mode}(s_y, e_y, f_y)$.
Proof: The assumption that $x \overset{p-e'_x}{=} y$ means that either $x = y$ or $x, y \in I$, where $I =\ ]q \cdot 2^{e'_x - p}, (q+1) \cdot 2^{e'_x - p}[$ for some integer $q$. If $x = y$, then the claim follows from the uniqueness of the (bounded) normalized factoring representations $(s'_x, e'_x, f'_x)$ and $(s'_y, e'_y, f'_y)$. For the second case, since the interval $I$ cannot contain both negative and positive numbers, let us assume that $x > 0$, and hence $y > 0$ as well.
Note also that the interval $I$ consists either only of denormalized values or only of normalized values. The reason is that $2^{e_{min}}$ cannot belong to the interval $I$. Since both factorings $(s'_x, e'_x, f'_x)$ and $(s'_y, e'_y, f'_y)$ are bounded normalized, it follows that either $f'_x, f'_y \in [0, 1[$ or $f'_x, f'_y \in [1, 2[$. If $f'_x, f'_y \in [0, 1[$, then it follows that $e'_x = e'_y = e_{min}$. Therefore, $f'_x, f'_y \in\ ]q \cdot 2^{-p}, (q+1) \cdot 2^{-p}[$, and $f'_x \overset{p}{=} f'_y$, as required in part (i).
If $f'_x, f'_y \in [1, 2[$, let us assume by contradiction that $e'_x > e'_y$. This would imply that
$f'_y \cdot 2^{e'_y} < 2^{1+e'_y} \le 2^{e'_x} \le f'_x \cdot 2^{e'_x}$.
But the interval $I$ cannot contain $2^{e'_x}$, so that we have a contradiction to our assumption. Therefore, $e'_x = e'_y$, and as before, this implies $f'_x \overset{p}{=} f'_y$, as required. Part (ii) follows from the computation of the rounding function $iround(s, e, f)$ according to lemma 2.7, the definition of aligned significand rounding in equation 2.9 and equation 2.49 with $\lambda = p$. We use from definition 2.8 that the rounded values are the same for both rounding functions $iround_{mode}(s, e, f)$ and $nround_{mode}(s, e, f)$, so that from part (ii) it follows that $val(nround_{mode}(s_x, e_x, f_x)) = val(nround_{mode}(s_y, e_y, f_y))$. Part (iii) then follows from the uniqueness of NF factoring representations. □
Similarly, for the computation of r
mode
(s; e; f), we have:
Lemma 2.14 Let (s
x
; e
x
; f
x
) and (s
y
; e
y
; f
y
) be two factorings and let (s
0
x
; e
0
x
; f
0
x
) and
(s
0
y
; e
0
y
; f
0
y
) be the corresponding unbounded normalized factorings: (s
0
x
; e
0
x
; f
0
x
) = (s
x
; e
x
; f
x
)
and (s
0
y
; e
0
y
; f
0
y
) = (s
y
; e
y
; f
y
). If the values of (s
0
x
; e
0
x
; f
0
x
) and (s
0
y
; e
0
y
; f
0
y
) are (p   e
0
x
)-
equivalent:
val(s
x
; e
x
; f
x
)
p e
0
x
= val(s
y
; e
y
; f
y
);
then (i) s
0
x
= s
0
y
, e
0
x
= e
0
y
, and f
0
x
p
= f
0
y
; and (ii) r
mode
(s
x
; e
x
; f
x
) = r
mode
(s
y
; e
y
; f
y
).
Proof: The proof is a simplied version of the proof of the previous Lemma, because all
factorings are normalized in this case and no distinction between normalized and denor-
malized factorings is necessary. 2
Based on Lemma 2.13 and Lemma 2.14, a rounding circuit only has to know the p-representative of the significand of the unbounded or bounded normalized factoring, and not its precise value, to be able to round the factoring correctly.
Usually, the input of the rounding computations is not a bounded or unbounded normalized factoring but an arbitrary factoring. In that case, knowledge of the p-representative of the significand does not directly guarantee correct IEEE rounding as it does in the cases of Lemma 2.13 and Lemma 2.14. But if a simple additional condition on this p-representative of the significand is fulfilled, it is nevertheless possible to find the correctly rounded result:
Lemma 2.15 Let $(s, e, f)$ be an arbitrary factoring and $e'$ the exponent of the corresponding normalized factoring. If a positive integer $p' \ge p$ exists, so that $fr = rep_{p'}(f)$ and the following condition is fulfilled:
$fr \ge 1$ OR $fr = f$,
then $(s, e, f) \overset{p-e'}{=} (s, e, fr)$, $iround_{mode}(s, e, f) = iround_{mode}(s, e, fr)$ and $nround_{mode}(s, e, f) = nround_{mode}(s, e, fr)$.
Proof: We separate the conditions (i) $fr = f$ and (ii) $fr \ge 1$:
(i) If $fr = f$, it is obvious that $round_{mode}(s, e, f) = round_{mode}(s, e, fr)$.
(ii) By equation 2.48, from $fr \overset{p'}{=} f$ it follows that $val(s, e, fr) \overset{p'-e}{=} val(s, e, f)$. Let $(s', e', f')$ and $(s'', e'', fr')$ be the bounded normalized factorings corresponding to $(s, e, f)$ and $(s, e, fr)$: $(s', e', f') = \lfloor \eta_{e_{min}} \rfloor (s, e, f)$ and $(s'', e'', fr') = \lfloor \eta_{e_{min}} \rfloor (s, e, fr)$. From $fr \ge 1$ it follows that $f \ge 1$, so that with $f' < 2$ and $val(s, e, f) = val(s', e', f')$, we have $e' \ge e$. Using $p - e' \le p' - e$ and lemma 2.12 we get
$val(s'', e'', fr') = val(s, e, fr) \overset{p-e'}{=} val(s, e, f) = val(s', e', f')$.
The use of lemma 2.13(ii)-(iii) on this equation completes the proof. □
l      r   sticky   RZ    RNE    RI
d.c.   0   0        ftr   ftr    ftr
d.c.   0   1        ftr   ftr    ftri
0      1   0        ftr   ftr    ftri
1      1   0        ftr   ftri   ftri
d.c.   1   1        ftr   ftri   ftri
Table 2.11: Significand rounding on representatives.
Lemma 2.16 We consider an integer $p$, positive values $x$, $x_h$ and the value $x_l$ with $x = x_h + x_l$, $x_h = k \cdot 2^{-p}$ for an integer $k$ and $|x_l| < 2^{-p}$, and a non-zero positive value $q$ with $q \cdot |x_l| < 2^{-p}$. The value $x' = x_h + q \cdot x_l$ is then p-equivalent to $x$, so that
$rep_p(x) = rep_p(x')$.
Proof: We separate the proof into three cases: (a) $x_l = 0$; (b) $x_l > 0$; and (c) $x_l < 0$. The proof of case (a) follows directly from $x = x_h = x'$. In case (b), from $0 < x_l < 2^{-p}$ it follows that $x_h < x < x_h + 2^{-p}$, and $rep_p(x) = x_h + 2^{-p-1}$. In the same way, from $0 < q \cdot x_l < 2^{-p}$ it follows that $x_h < x' < x_h + 2^{-p}$ and, thus, $rep_p(x') = x_h + 2^{-p-1} = rep_p(x)$. In case (c), from $-2^{-p} < x_l < 0$ it follows that $x_h - 2^{-p} < x < x_h$, and $rep_p(x) = x_h - 2^{-p-1}$. In the same way, from $-2^{-p} < q \cdot x_l < 0$ it follows that $x_h - 2^{-p} < x' < x_h$. Thus, $rep_p(x') = x_h - 2^{-p-1} = rep_p(x)$ and the proof of the lemma is completed. □
Finally, we describe some details of significand rounding on representatives.
Definition 2.14 For the rounding at position $\lambda - 1$ of a positive significand $f < 2$ with the $\lambda$-representative $frep = \langle frep[0 : \lambda+1] \rangle_{neg} = rep_{\lambda}(f)$, we define the truncated significand $ftr = \langle frep[0 : \lambda-1] \rangle_{neg}$ and the incremented significand $ftri = ftr + 2^{-\lambda+1}$.
Because $ftr \le f < ftr + 2^{-\lambda+1} = ftri$, the values ftr and ftri are the two possible results of $rnd_{(mode?s),\lambda-1}(frep)$. Which of them is chosen depends only on the rounding mode $(mode\,?\,s) \in \{RZ, RNE, RI\}$, which is encoded by sr_mode[1:0] according to table 2.3, and on the three least significant bits of the representative: the L-bit $l = frep[\lambda-1]$, the round bit $r = frep[\lambda]$, and the sticky bit $sticky = frep[\lambda+1]$. Table 2.11 lists all different rounding cases according to the rounding definitions for positive arguments from equations 2.2-2.4. In this table an entry 'd.c.' (don't care) means that the value of this bit does not affect the result. From this table one can easily derive the equation for the condition under which the incremented significand ftri has to be chosen as the rounded significand. This condition is called the condition for the rounding increment:
$rinc = is\_RI(mode) \wedge (r \vee sticky) \vee is\_RNE(mode) \wedge r \wedge (l \vee sticky)$ (2.53)
$\phantom{rinc} = sr\_mode[1] \wedge (r \vee sticky) \vee sr\_mode[0] \wedge r \wedge (l \vee sticky)$ (2.54)
so that with $\lambda = p$ significand rounding can be written as:
$sig\_rnd_{mode?s}(s, e, f) = (s, e, ftri)$ if rinc
$sig\_rnd_{mode?s}(s, e, f) = (s, e, ftr)$ otherwise.
(2.55)
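The rounding increment condition of equation 2.53 can be cross-checked against Table 2.11 with a few lines of Python. This is a sketch for illustration: the function name `rinc`, the string encoding of the modes, and the `TABLE_2_11` list are assumptions of this example, not signal names from the design.

```python
from itertools import product

def rinc(mode: str, l: int, r: int, sticky: int) -> bool:
    """Rounding increment per equation 2.53 for a positive significand."""
    if mode == "RI":
        return bool(r or sticky)
    if mode == "RNE":
        return bool(r and (l or sticky))
    return False  # RZ always selects the truncated significand ftr

# Rows of Table 2.11 as ((l, r, sticky), (RZ, RNE, RI)); None encodes 'd.c.',
# True means the incremented significand ftri is selected.
TABLE_2_11 = [
    ((None, 0, 0), (False, False, False)),
    ((None, 0, 1), (False, False, True)),
    ((0, 1, 0), (False, False, True)),
    ((1, 1, 0), (False, True, True)),
    ((None, 1, 1), (False, True, True)),
]

def matches_table() -> bool:
    """Check rinc against every row, expanding the don't-care entries."""
    for (l, r, sticky), expected in TABLE_2_11:
        for l_val in ([0, 1] if l is None else [l]):
            got = tuple(rinc(m, l_val, r, sticky) for m in ("RZ", "RNE", "RI"))
            if got != expected:
                return False
    return True
```

The five table rows cover all eight combinations of (l, r, sticky), so `matches_table()` exercises the full truth table.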
[Figure 2.3: Injection mapping. Upper panel: RNU rounding with an injection of +0.5. Lower panel: RI rounding with an injection of +0.999... Each panel shows the real line, the representable numbers, the rounding values, the rounding intervals of the mode, the rounding intervals after injection, and the RZ rounding intervals.]
Lemma 2.17 For a factoring $(s, e, f)$ with a significand $f < 2$, the p-representative $frep = \langle frep[0 : p+1] \rangle_{neg} = rep_p(f)$, the rounding mode $(mode\,?\,s) \in \{RZ, RNE, RI\}$ and the rounded factoring $(s, e, frnd) = sig\_rnd_{mode?s}(s, e, frep)$, the case that significand rounding changes the value of the significand can be recognized by:
$(frep[p]$ OR $frep[p+1]) \iff (frnd \ne f)$
We call this condition the significand rounding inexactness.
Proof: The lemma follows from equations 2.51-2.52 with $\lambda = p$ and the use of the property that frep is an integral multiple of $2^{-p+1}$ iff $(frep[p]$ OR $frep[p+1]) = 0$. □
2.5.2 Injection Based Rounding
Rounding by injection reduces the rounding modes RI and RNU to RZ [9, 40, 11, 12]. This reduction is possible for the rounding of operands $x$ that are integral multiples of $2^{-k}$, with an integer $k$ that is larger than the rounding position $\lambda$. The rounding mode reduction is based on adding an injection:
$inj = 0$ if RZ
$inj = 2^{-\lambda-1}$ if RNU
$inj = 2^{-\lambda} - 2^{-k}$ if RI,
that depends only on the rounding mode.
Lemma 2.18 With $mode \in \{RZ, RNU, RI\}$, the effect of adding inj can be described by
$rnd_{mode,\lambda}(x) = rnd_{RZ,\lambda}(x + inj)$.
Proof: Figure 2.3 depicts this reduction of RNU and RI to RZ. □
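Lemma 2.18 can be illustrated with exact rational arithmetic. The sketch below makes several assumptions that go beyond the text: operands are non-negative, RNU is taken to mean round-to-nearest with ties rounded up, RI is taken to round up for positive operands, and the names `rnd`, `injection`, `lam` (the rounding position) and `k` (the operand granularity) are chosen for this example only.

```python
from fractions import Fraction
import math

def rnd(mode: str, lam: int, x: Fraction) -> Fraction:
    """Round x >= 0 to a multiple of 2^(-lam) in the given mode (lam >= 0)."""
    unit = Fraction(1, 2 ** lam)
    q = x / unit
    if mode == "RZ":
        n = math.floor(q)                   # truncate
    elif mode == "RNU":
        n = math.floor(q + Fraction(1, 2))  # nearest, ties up (assumption)
    elif mode == "RI":
        n = math.ceil(q)                    # away from zero for x >= 0
    else:
        raise ValueError(mode)
    return n * unit

def injection(mode: str, lam: int, k: int) -> Fraction:
    """Injection value for operands that are multiples of 2^(-k), k > lam."""
    if mode == "RZ":
        return Fraction(0)
    if mode == "RNU":
        return Fraction(1, 2 ** (lam + 1))
    if mode == "RI":
        return Fraction(1, 2 ** lam) - Fraction(1, 2 ** k)
    raise ValueError(mode)

def rnd_by_injection(mode: str, lam: int, k: int, x: Fraction) -> Fraction:
    """Lemma 2.18: reduce RNU and RI to RZ by adding the injection first."""
    return rnd("RZ", lam, x + injection(mode, lam, k))
```

The RI injection is the largest value below one unit in the last place that is still a multiple of 2^(-k), which is why the operand granularity k has to be known in advance.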
2.5.3 Gradual Rounding
In this section we deal with the situation that the rounding of a positive value $x$ is not computed at the proper position $\lambda_2$ in a single step, as in
$sires = rnd_{mode,\lambda_2}(x)$,
but has to be computed in multiple steps, where the result of one rounding step is the input of the next rounding step with a smaller rounding precision $0 < \lambda_2 \le \lambda_1$, as in
$mures = rnd_{mode,\lambda_2}(rnd_{mode,\lambda_1}(x))$.
In [21], the principles and problems of such gradual rounding are described. If only the rounded result $rores_1 = rnd_{mode,\lambda_1}(x)$ of a rounding step is used in the succeeding rounding decision, information gets lost and the multi-step rounding result mures could differ from the correct single-step rounding result sires (as in figure 2.4). In [21] this situation is called a step error, and it is proven that such a step error can only occur in rounding mode RNE. To prevent step errors, two tag bits are required for the rounding decision in addition to the rounded result of the previous step:
• tinx is active if the rounded result of the previous step was inexact:
$(rnd_{mode,\lambda_1}(x) \ne x)$.
Corresponding to the inexactness recognition in significand rounding, tinx can be computed from the round bit $r_1$ and the sticky bit $sticky_1$ of the previous rounding step:
$tinx = r_1$ OR $sticky_1$. (2.56)
• tinc is active if the previous rounding decision was a rounding increment (rinc = 1):
$(bin^{-\lambda_1}_{-\lambda_1}(rnd_{mode,\lambda_1}(x)) \ne bin^{-\lambda_1}_{-\lambda_1}(x))$. (2.57)
As in conventional rounding, the rounded result of the previous rounding step $rores_1$ lies between two rounding possibilities $ftr = t \cdot 2^{-\lambda_2} \le rores_1 < (t+1) \cdot 2^{-\lambda_2} = ftri$, so that the gradual rounding of $rores_1$ at position $\lambda_2$ corresponds to the selection
$rores_2 = sires = ftri$ if grinc
$rores_2 = sires = ftr$ otherwise.
(2.58)
Using the two tag bits tinx and tinc in the gradual rounding decision grinc makes it possible to simulate single-step rounding by multi-step rounding in all rounding modes $(mode\,?\,s) \in \{RZ, RNE, RI\}$ (encoded by sr_mode[1:0]). As a solution, [21] suggests using the following equations to compute grinc and the two tag bits $tinx_2$ and $tinc_2$ of the current rounding step, where $(l_2, r_2, sticky_2) = rep_{\lambda_2+1}(rores_1)[\lambda_2 : \lambda_2+2]$ and $tinx_1$ and $tinc_1$ are the corresponding tag bits of the previous rounding step:
$grinc = (sr\_mode[1] \wedge (r_2 \vee sticky_2)) \vee ((sr\_mode[0] \wedge r_2) \wedge (sticky_2 \vee \overline{tinc_1} \wedge (l_2 \vee tinx_1)))$ (2.59)
$\phantom{grinc} = \overline{\overline{(sr\_mode[1] \wedge (r_2 \vee sticky_2))} \wedge \overline{((sr\_mode[0] \wedge r_2) \wedge \overline{(\overline{sticky_2} \wedge \overline{(\overline{tinc_1} \wedge (l_2 \vee tinx_1))})})}}$ (2.60)
$tinx_2 = r_2 \vee sticky_2 \vee tinx_1$ (2.61)
$tinc_2 = grinc \vee (\overline{sticky_2} \wedge \overline{r_2} \wedge tinc_1)$. (2.62)
Figure 2.4: Gradual rounding: In rounding mode RNE the value $x$ should be rounded at position $\lambda_2$ to the value $sires = (2k-1) \cdot 2^{-\lambda_2}$, as depicted by arrow (2). If we round the value $x$ in a first step at position $\lambda_1$, we get the intermediate rounded result $rores_1 = k' \cdot 2^{-\lambda_1}$, as depicted by arrow (1a). The result of a second conventional rounding step of $rores_1$ at position $\lambda_2$ (arrow (1b)) is $mures = 2k \cdot 2^{-\lambda_2}$ and differs from the single-step rounding result sires. A gradual rounding step on $rores_1$ at position $\lambda_2$ that uses additional information from the previous rounding step yields the rounded result sires, as depicted by arrow (1b*).
Based on the equations for gradual rounding, we define gradual rounding functions corresponding to the definition of the rounding functions $rnd_{mode,\lambda}$:
Definition 2.15 For $mode \in \{RZ, RNE, RI\}$ encoded by sr_mode[1:0], the gradual rounding function $grnd_{mode,\lambda}: \mathbb{R} \times \{0,1\}^2 \to \mathbb{R} \times \{0,1\}^2$ is defined by
$grnd_{mode,\lambda}(f, tinc_1, tinx_1) = (ftri, tinc_2, tinx_2)$ if grinc
$grnd_{mode,\lambda}(f, tinc_1, tinx_1) = (ftr, tinc_2, tinx_2)$ otherwise,
with the truncated significand ftr and the incremented significand ftri from definition 2.14, and the computation of grinc, $tinc_2$ and $tinx_2$ according to equations 2.60-2.62.
Lemma 2.19 For any integers $\lambda_2 \le \lambda_1$ the rounding function $rnd_{mode,\lambda_2}$ can be decomposed into two gradual rounding steps, so that for
$(frnd, x[1:0]) = grnd_{mode,\lambda_2}(grnd_{mode,\lambda_1}(f, 00))$,
we get $frnd = rnd_{mode,\lambda_2}(f)$.
Proof: This lemma just summarizes the previous descriptions of the properties of gradual rounding using our definition 2.15 of the gradual rounding function $grnd_{mode?s,\lambda}$. □
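The step error and its repair by the two tag bits can be reproduced with a small Python model. This is a sketch under stated assumptions rather than the circuit of [21]: the negations in the increment condition are a reading of equation 2.59 (the increment at a tie is suppressed when the previous step already incremented), operands are non-negative, the rounding position `lam` means rounding to multiples of 2^(-lam), and all function names are hypothetical.

```python
from fractions import Fraction
import math

def bits_at(f: Fraction, lam: int):
    """Return (ftr, l, r, sticky) for rounding f >= 0 to multiples of 2^(-lam)."""
    unit = Fraction(1, 2 ** lam)
    n = math.floor(f / unit)
    rem = f - n * unit            # discarded part, 0 <= rem < unit
    half = unit / 2
    r = int(rem >= half)          # round bit
    sticky = int(rem - r * half > 0)
    return n * unit, n & 1, r, sticky

def grnd(mode: str, lam: int, f: Fraction, tinc1: int = 0, tinx1: int = 0):
    """One gradual rounding step; returns (rounded value, tinc2, tinx2)."""
    ftr, l, r, s = bits_at(f, lam)
    if mode == "RI":
        grinc = bool(r or s)
    elif mode == "RNE":
        grinc = bool(r and (s or ((not tinc1) and (l or tinx1))))
    else:                          # RZ
        grinc = False
    tinx2 = int(bool(r or s or tinx1))
    tinc2 = int(grinc or ((not s) and (not r) and bool(tinc1)))
    return ftr + Fraction(int(grinc), 2 ** lam), tinc2, tinx2

def rnd(mode: str, lam: int, f: Fraction) -> Fraction:
    """Conventional single-step rounding: a gradual step with cleared tags."""
    return grnd(mode, lam, f)[0]
```

With x = 1.011 (binary), rounding directly to an integer in RNE gives 1. Rounding first to one fraction bit gives 1.1, and a second conventional RNE step then resolves the apparent tie to 2, which is the step error. Passing the tags of the first step into the second step restores the result 1.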
Definition 2.16 For the significand rounding we define two gradual rounding steps by the significand rounding functions $sgrnd1: FACT(\mathbb{R}) \to FACT(\mathbb{R}) \times \{0,1\}^2$ and $sgrnd2: FACT(\mathbb{R}) \times \{0,1\}^2 \to FACT(\mathbb{R})$. With $(frnd_1, tinc_1, tinx_1) = grnd_{mode?s,52}(f, 00)$,
$sgrnd1_{mode?s}(s, e, f) = ((s, e, frnd_1), tinc_1, tinx_1)$
and with $(frnd_3, tinc_3, tinx_3) = grnd_{mode?s,p-1}(frnd_2, tinc_2, tinx_2)$,
$sgrnd2_{mode?s}((s, e, frnd_2), tinc_2, tinx_2) = (s, e, frnd_3)$.
Additionally, we extend the definitions of the bounded normalization shift, the post-normalization shift and the function val to outputs of the first gradual rounding step:
$\lceil \eta \rceil ((s, e, f), x[1:0]) = (\lceil \eta \rceil (s, e, f), x[1:0])$
$post\_norm((s, e, f), x[1:0]) = (post\_norm(s, e, f), x[1:0])$
$val((s, e, f), x[1:0]) = val(s, e, f)$
Lemma 2.20 The rounding function iround can be decomposed into a normalization shift, a first significand gradual rounding step by sgrnd1, a bounded normalization shift, a second significand gradual rounding step by sgrnd2, a post-normalization shift and exponent rounding. Thus, with the definition of the normalized gradual result factoring
$((s_{GF}, e_{GF}, f_{GF}), tinc, tinx) = post\_norm(sgrnd1_{mode?s}(\eta(s, e, f)))$, (2.63)
the IEEE factoring of the rounded result can be computed by
$(s_{res}, e_{res}, f_{res}) = exp\_rnd_{mode?s}(post\_norm(sgrnd2_{mode?s}(\lceil \eta_{e_{min}} \rceil ((s_{GF}, e_{GF}, f_{GF}), tinc, tinx))))$,
so that $iround_{mode}(s, e, f) = (s_{res}, e_{res}, f_{res})$.
Proof: We first introduce some notation: We denote the input factoring of the first gradual rounding step by $(s_1, e_1, f_1) = \eta(s, e, f)$ and the output by $((s_2, e_2, f_2), tinc, tinx) = sgrnd1_{mode?s}(s_1, e_1, f_1)$. The input factoring of the second gradual rounding step is denoted by $((s_3, e_3, f_3), tinc, tinx) = \lceil \eta_{e_{min}} \rceil ((s_{GF}, e_{GF}, f_{GF}), tinc, tinx)$ and the output is denoted by $(s_4, e_4, f_4) = sgrnd2_{mode?s}((s_3, e_3, f_3), tinc, tinx)$.
The lemma will be proven in two steps. In the first step (a) we show that the value $val(s_{res}, e_{res}, f_{res})$ equals the value of the IEEE rounded result $val(iround(s, e, f))$. In the second step (b) we show that $(s_{res}, e_{res}, f_{res})$ is also an IEEE factoring like $iround(s, e, f)$.
In part (a) of the proof, we only have to consider the values during the computation. The value of the input factoring might only be changed by the two gradual rounding steps and by the exponent rounding, so that $x_1 = val(s, e, f) = val(s_1, e_1, f_1)$ and $x_2 = val(s_2, e_2, f_2) = val(s_3, e_3, f_3)$. Because the exponent rounding function is the same as in the computation of $iround(s, e, f)$ according to lemma 2.7, we only have to compare the significand rounding; namely, we only have to show that for $e'' = \max\{e_{min}, e_1\}$
$val(s_4, e_4, f_4) = rnd_{mode,-e''+p-1}(x)$.
As for the rounding function rnd, we can also use for the gradual rounding function grnd that for integers $x$: $2^x \cdot val(grnd_{mode,\lambda}(f, x[1:0])) = val(grnd_{mode,\lambda-x}(2^x \cdot f, x[1:0]))$. In this way the rounded values of the gradual rounding steps $x_2$ and $x_4 = val(s_4, e_4, f_4)$ can be written as:
$x_4 = (-1)^{s_4} \cdot 2^{e_4} \cdot f_4$ (2.64)
$\phantom{x_4} = (-1)^{s_3} \cdot 2^{e_3} \cdot val(grnd_{mode?s,p-1}(f_3, tinc, tinx))$ (2.65)
$\phantom{x_4} = (-1)^{s_3} \cdot val(grnd_{mode?s,-e_3+p-1}(2^{e_3} \cdot f_3, tinc, tinx))$ (2.66)
$x_3 = (-1)^{s_3} \cdot 2^{e_3} \cdot f_3$ (2.67)
$\phantom{x_3} = (-1)^{s_2} \cdot 2^{e_2} \cdot f_2$ (2.68)
$\phantom{x_3} = (-1)^{s_1} \cdot 2^{e_1} \cdot val(grnd_{mode?s,52}(f_1, 00))$ (2.69)
$\phantom{x_3} = (-1)^{s_1} \cdot val(grnd_{mode?s,-e_1+52}(2^{e_1} \cdot f_1, 00))$ (2.70)
$\phantom{x_3} = (-1)^{s} \cdot val(grnd_{mode?s,-e_1+52}(abs(x), 00))$ (2.71)
In the computation steps between the two gradual rounding steps, the exponent could be changed by the post-normalization shift, so that $e_{GF} \in \{e_1, e_1 + 1\}$, and by the bounded normalization shift, so that $e_3 = \max\{e_{min}, e_{GF}\}$. (Note that $f_{GF}$ is normalized, so that $e_{GF}$ is already the normalized exponent.) Because both operations can only increase the exponent and because we consider $p \in \{24, 53\}$, we get $-e_3 + p - 1 \le -e_1 + 52$. For this reason we can use lemma 2.19 on the two gradual rounding steps with $\lambda_1 = -e_1 + 52$ and $\lambda_2 = -e_3 + p - 1$. In this way equations 2.67, 2.68 and 2.71 combine to
$x_4 = (-1)^{s} \cdot grnd_{mode?s,-e_3+p-1}(grnd_{mode?s,-e_1+52}(abs(x), 00))$ (2.72)
$\phantom{x_4} = (-1)^{s} \cdot rnd_{mode?s,-e_3+p-1}(abs(x))$ (2.73)
$\phantom{x_4} = rnd_{mode,-e_3+p-1}(x)$. (2.74)
There are only two cases possible, namely: (i) $e_{GF} = e_1$, and (ii) $e_{GF} = e_1 + 1$:
(i) From $e_{GF} = e_1$ it follows that $e_3 = \max\{e_{min}, e_1\} = e''$, so that we get $x_4 = rnd_{mode,-e''+p-1}(x)$, as required.
(ii) For $e_{GF} = e_1 + 1$, it follows from the definition of the post-normalization shift that the significand is changed to $f_{GF} = 1$. The rounding does not change the operand $((-1)^{s} \cdot 2^{e_1+1})$ regardless of whether rounding position $-(e_1+1) + p - 1$ or rounding position $-e_1 + p - 1$ is considered. Thus, we get $x_4 = rnd_{mode,-e''+p-1}(x)$ also in this case, and part (a) of the proof is completed.
Part (b) can be proven like part (b) of lemma 2.7, starting with the input factoring of the bounded normalization shift $(s_{GF}, e_{GF}, f_{GF})$, because the computations in the four steps that are computed on $(s_{GF}, e_{GF}, f_{GF})$ in this lemma correspond to the four steps in the computation of $iround(s, e, f)$ according to lemma 2.7. □
Definition 2.17 For $mode \in \{RZ, RNE, RI, RMI\}$, we define the two gradual rounding functions $ground1_{mode}$ and $ground2_{mode}$ by
$ground1_{mode}(s, e, f) = post\_norm(sgrnd1_{mode?s}(\eta(s, e, f)))$
$ground2_{mode}((s_{GF}, e_{GF}, f_{GF}), tinc, tinx) = exp\_rnd_{mode?s}(post\_norm(sgrnd2_{mode?s}(\lceil \eta_{e_{min}} \rceil ((s_{GF}, e_{GF}, f_{GF}), tinc, tinx))))$.
Corollary 2.21 With definition 2.17, lemma 2.20 can be written as:
$iround_{mode}(s, e, f) = ground2_{mode}(ground1_{mode}(s, e, f))$.
Lemma 2.22 The equation $((s_{GF}, e_{GF}, f_{GF}), tinc, tinx) = ground1_{mode}(s, e, f)$ is invariant under the addition of $k \in \mathbb{R}$ to the exponent, namely
$((s_{GF}, e_{GF} + k, f_{GF}), tinc, tinx) = ground1_{mode}(s, e + k, f)$,
so that it does not matter whether $k$ is added to the exponent of the input or of the output factoring.
Proof: The computations in each of the three steps according to definition 2.17 of function ground1, namely the unbounded normalization shift, the gradual rounding and the post-normalization shift, depend only on the values of the significands and the signs of the factorings and not on the exponent values, and so does the sequence of these three steps in function ground1. □
Corollary 2.23 With $((s_{GF}, e_{GF}, f_{GF}), tinc, tinx) = ground1_{mode}(s_{ex}, e_{ex}, f_{ex})$ and corollary 2.21, the IEEE factoring of the final result according to equation 2.18 is given by
$iround(s_{ex}, e_{ex} + wec, f_{ex}) = ground2_{mode}((s_{GF}, e_{GF} + wec, f_{GF}), tinc, tinx)$.
2.6 Internal Representations
In this section, based on factoring representations, we define floating-point number representations at the bit level. In each presented format the number representations are integrated for the cases of single precision and double precision. The first three formats, namely the packed format, the unpacked format and the normalized format, contain the representation of single precision and double precision IEEE values. The last two formats, the representative format and the gradual result format, represent results of IEEE operations that have not been fully rounded yet. In these cases some further computation steps are required to obtain the corresponding single precision or double precision IEEE FP value, and there can be two or more representations in these formats that lead to the same IEEE FP value after rounding.
2.6.1 Packed Format
The number representations in the packed format (PF) are based on the packed representations for single precision and double precision defined by the IEEE standard. These packed representations encode the IEEE factoring of a number in binary form. In the packed format, the IEEE packed representations for single and double precision (see figure 2.2) are integrated into a 64-bit wide representation, where the smaller single precision representations are left-aligned and padded with 32 zeros (figure 2.5). We index a bus with this format by $BUS_{PF}[63:0]$. For single precision usage we have:
$s_{PF} = BUS_{PF}[63]$ (2.75)
$e_{PF}[7:0] = BUS_{PF}[62:55]$ (2.76)
$f_{PF}[1:23] = BUS_{PF}[54:32]$ (2.77)
and for double precision usage we have:
$s_{PF} = BUS_{PF}[63]$ (2.78)
$e_{PF}[10:0] = BUS_{PF}[62:52]$ (2.79)
$f_{PF}[1:52] = BUS_{PF}[51:0]$. (2.80)
Definition 2.18 For single and double precision with $(n, p) \in \{(8, 24), (11, 53)\}$ we define the function $pf: IEEEfact_{n,p} \to \{0,1\}^{64}$, which computes the representation of an IEEE factoring $(s, e, f) \in IEEEfact_{n,p}$ in the packed format. With $e = \langle e_{PF}[n-1:0] \rangle_{bias_n}$ and $f = \langle f_{PF}[0:p-1] \rangle_{neg}$ for representable numbers and quiet NaNs, the function pf is defined by
$pf(s, e, f) = (s,\ 1^{n},\ 0^{63-n})$ if $f = f_{\infty}$
$pf(s, e, f) = (s,\ 1^{n},\ 1,\ 0^{62-n})$ if $f = f_{sNaN}$
$pf(s, e, f) = (s,\ 1^{n},\ 0,\ f_{PF}[2:p-1],\ 0^{64-n-p})$ if $f = f_{qNaN}$
$pf(s, e, f) = (s,\ (e_{PF}[n-1:0] \wedge f_{PF}[0]),\ f_{PF}[1:p-1],\ 0^{64-n-p})$ otherwise.
In the opposite direction the function $fact_{PF}: \{0,1\}^{64} \to IEEEfact_{n,p}$ computes the IEEE factoring that is represented by $BUS_{PF}[63:0]$ in the packed format. With
$(s, e_{PF}[n-1:0], f_{PF}[1:p-1]) = BUS_{PF}[63 : 63-n-p+1]$,
the denormalized factoring $(s, e_{den}, f_{den}) = (s, e_{min}, \langle (0, f_{PF}[1:p-1]) \rangle_{neg})$ and the normalized factoring $(s, e_{nor}, f_{nor}) = (s, \langle e_{PF}[n-1:0] \rangle_{bias_n}, \langle (1, f_{PF}[1:p-1]) \rangle_{neg})$, the function $fact_{PF}$ is defined according to the definitions of the IEEE packed representations from section 2.2.3 by
$fact_{PF}(BUS_{PF}[63:0]) = (s, e_{den}, f_{den})$ if $e_{PF}[n-1:0] = 0^{n}$
$fact_{PF}(BUS_{PF}[63:0]) = (s, e_{nor}, f_{nor})$ if $e_{PF}[n-1:0] \notin \{0^{n}, 1^{n}\}$
$fact_{PF}(BUS_{PF}[63:0]) = (s, e_{\infty}, f_{\infty})$ if $e_{PF}[n-1:0] = 1^{n} \wedge f_{PF}[1:p-1] = 0^{p-1}$
$fact_{PF}(BUS_{PF}[63:0]) = (s, e_{sNaN}, f_{sNaN})$ if $e_{PF}[n-1:0] = 1^{n} \wedge f_{PF}[1] = 1$
$fact_{PF}(BUS_{PF}[63:0]) = (s, e_{qNaN}, f_{qNaN})$ otherwise.
[Figure 2.5: Packed format for single and double precision. Packed single format (64 bits): S at bit 63, E[7:0] at bits 62:55, F[1:23] at bits 54:32, and 32 zeros at bits 31:0. Packed double format (64 bits): S at bit 63, E[10:0] at bits 62:52, F[1:52] at bits 51:0.]
2.6.2 Unpacked Format
The unpacked format is also a binary encoding for IEEE factorings. But in this case the number representations are unpacked, i.e., information about an IEEE factoring is not provided with the minimum number of bits; instead, additional bits are included in the representation to give better access to certain information about the number:
The hidden bit $f_{UF}[0]$ is included in the unpacked number representation. For representable numbers this bit is well defined by the exponent representation from the packed format. For special values like $\pm\infty$ and NaNs, we define the hidden bit to have the value $f_{UF}[0] = 1$, so that $\infty$ and NaN representations always include a normalized significand. Moreover, special values and zeros are indicated by 4 additional bits: zero, inf, qnan and snan. At most one of these bits can be active in a number representation.
In the unpacked format the exponent is represented in two's complement representation for representable numbers. We define the exponent of special values to have the two's complement representation of $e_{UF} = e_{max} + 1$, in analogy to the packed format. To be able to include the two's complement representation of $e_{max} + 1 = 2^{n-1}$ in the exponent, the exponent representation is extended by one bit to a width of $n+1$ bits. For representable numbers this bit extension is computed by a sign extension. Representations of zero include an arbitrary exponent. In this case the value of the number is indicated by the sign and the additional bit zero in a unique way.
For an integrated representation of single and double precision values, three bit fields are separated according to sign, exponent and significand (see figure 2.6). For single precision usage the significand is padded with 29 zeros on the right and a single precision exponent needs 3 additional bits on the left, which are computed by sign extension. We index a bus with this format by $BUS_{UF}[69:0]$ and have:
$s_{UF} = BUS_{UF}[69]$
$e_{UF}[11:0] = BUS_{UF}[68:57]$
$f_{UF}[0:52] = BUS_{UF}[56:4]$
$zero_{UF} = BUS_{UF}[3]$, $inf_{UF} = BUS_{UF}[2]$
$qnan_{UF} = BUS_{UF}[1]$, $snan_{UF} = BUS_{UF}[0]$.
[Figure 2.6: Unpacked format for single and double precision. Unpacked single format (70 bits): S, three sign extension bits E[8], E[8:0], F[0:23], 29 zeros, and the flags ZERO, INF, QNAN, SNAN. Unpacked double format (70 bits): S, E[11:0], F[0:52], and the flags ZERO, INF, QNAN, SNAN.]
Definition 2.19 We define the function $uf: IEEEfact_{n,p} \to \{0,1\}^{70}$, which computes the representation of an IEEE factoring $(s, e, f) \in IEEEfact$ in the unpacked format. With $e = \langle e_{UF}[11:0] \rangle_{2}$ and $f = \langle f_{UF}[0:52] \rangle_{neg}$ for representable numbers and quiet NaNs, the function uf is defined by
$uf(s, e, f) = (s,\ 0^{12},\ 0,\ 0^{52},\ 1, 0, 0, 0)$ if $f = 0$
$uf(s, e, f) = (s,\ 0, 1, 0^{10},\ 1,\ 0^{52},\ 0, 1, 0, 0)$ if $f = f_{\infty}$
$uf(s, e, f) = (s,\ 0, 1, 0^{10},\ 1, 0, f_{UF}[2:52],\ 0, 0, 1, 0)$ if $f = f_{qNaN}$
$uf(s, e, f) = (s,\ 0, 1, 0^{10},\ 1, 1, 0^{51},\ 0, 0, 0, 1)$ if $f = f_{sNaN}$
$uf(s, e, f) = (s,\ e_{UF}[11:0],\ f_{UF}[0:52],\ 0, 0, 0, 0)$ otherwise.
In the opposite direction the function $fact_{UF}: \{0,1\}^{70} \to IEEEfact_{n,p}$ computes the IEEE factoring that is represented by $BUS_{UF}[69:0]$ in the unpacked format. With
$(s, e_{UF}[11:0], f_{UF}[0:52], zero, inf, qnan, snan) = BUS_{UF}[69:0]$,
the function $fact_{UF}$ is defined by
$fact_{UF}(BUS_{UF}[69:0]) = (s, e_0, 0)$ if zero
$fact_{UF}(BUS_{UF}[69:0]) = (s, e_{\infty}, f_{\infty})$ if inf
$fact_{UF}(BUS_{UF}[69:0]) = (s, e_{qNaN}, f_{qNaN})$ if qnan
$fact_{UF}(BUS_{UF}[69:0]) = (s, e_{sNaN}, f_{sNaN})$ if snan
$fact_{UF}(BUS_{UF}[69:0]) = (s, \langle e_{UF}[11:0] \rangle_{2}, \langle f_{UF}[0:52] \rangle_{neg})$ otherwise.
2.6.2.1 Packed Format → Unpacked Format
For the conversion of a number from a packed representation to the corresponding unpacked representation we summarize the conditions on the additional bits that have to be computed:

$$
\begin{aligned}
f_{UF}[0] = 0 &\iff e_{PF} = 0^{n} \\
zero = 1 &\iff e_{PF} = 0^{n} \text{ AND } f_{PF}[1:p-1] = 0^{p-1} \\
inf = 1 &\iff e_{PF} = 1^{n} \text{ AND } f_{PF}[1:p-1] = 0^{p-1} \\
qnan = 1 &\iff e_{PF} = 1^{n} \text{ AND } f_{PF}[1] = 0 \text{ AND } f_{PF}[2:p-1] \neq 0^{p-2} \\
snan = 1 &\iff e_{PF} = 1^{n} \text{ AND } f_{PF}[1] = 1.
\end{aligned}
\tag{2.81}
$$
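The conditions of (2.81) can be sketched in software as follows. This is only an illustration for double precision ($n = 11$, $p = 53$); the function names `packed_fields` and `unpack_flags` are hypothetical and do not appear in the thesis. The quiet/signaling split follows (2.81) exactly as stated above, which need not coincide with the NaN encoding of a particular host machine.

```python
import struct

N, P = 11, 53  # exponent width n and precision p of IEEE double precision


def packed_fields(x: float):
    """Split a Python float into (s, e_PF, f_PF[1:p-1]) as integers."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    s = bits >> (N + P - 1)
    e_pf = (bits >> (P - 1)) & ((1 << N) - 1)
    frac = bits & ((1 << (P - 1)) - 1)
    return s, e_pf, frac


def unpack_flags(e_pf: int, frac: int):
    """Additional bits of the unpacked format, following (2.81)."""
    all_ones = (1 << N) - 1
    f1 = (frac >> (P - 2)) & 1           # f_PF[1], most significant fraction bit
    rest = frac & ((1 << (P - 2)) - 1)   # f_PF[2:p-1]
    return {
        "f0": int(e_pf != 0),            # hidden bit f_UF[0]
        "zero": int(e_pf == 0 and frac == 0),
        "inf": int(e_pf == all_ones and frac == 0),
        "qnan": int(e_pf == all_ones and f1 == 0 and rest != 0),
        "snan": int(e_pf == all_ones and f1 == 1),
    }
```

For example, `unpack_flags(*packed_fields(5e-324)[1:])` describes the smallest denormalized double: the hidden bit is 0 and none of the four flags is set.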
Among the other bits, only the exponent representation changes, by the subtraction of the corresponding bias and the non-redundant representation of $e_{min}$. The following equation uses a sign extension for single precision and double precision exponents and lemma 2.1(ii) to convert the exponent from biased to two's complement representation for $e_{PF} \neq 1^{n}$. Because for $e_{PF} = 1^{n}$,

$$
\mathrm{bin}^{n}_{0}\big(\langle (\overline{e_{PF}[n-1]}^{\,2}, e_{PF}[n-2:0]) \rangle + 1\big) = \langle (0^{2}, 1^{n-1}) \rangle + 1 = 2^{n-1} = e_{max} + 1,
$$

the case $e_{PF} = 1^{n}$ is also included in the first and the third line.

$$
e_{UF}[11:0] =
\begin{cases}
\mathrm{bin}^{11}_{0}\big(\langle (\overline{e_{PF}[10]}^{\,2}, e_{PF}[9:0]) \rangle + 1\big) & \text{if double AND } e_{PF} \neq 0^{11} \\
(11, 0^{8}, 10) & \text{if double AND } e_{PF} = 0^{11} \\
\mathrm{bin}^{11}_{0}\big(\langle (\overline{e_{PF}[7]}^{\,5}, e_{PF}[6:0]) \rangle + 1\big) & \text{if single AND } e_{PF} \neq 0^{8} \\
(11111, 0^{5}, 10) & \text{if single AND } e_{PF} = 0^{8}.
\end{cases}
\tag{2.82}
$$
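A small sketch may help to check (2.82) numerically. It assumes that the leading exponent bit is inverted and replicated before the increment, as required for a biased to two's complement conversion; the names `exp_packed_to_unpacked` and `as_signed` are illustrative only.

```python
def exp_packed_to_unpacked(e_pf: int, n: int) -> int:
    """Return the 12-bit pattern e_UF[11:0] for an n-bit biased exponent.

    n is 11 for double precision and 8 for single precision.
    """
    width = 12
    mask = (1 << width) - 1
    if e_pf == 0:
        # Zero or denormalized number: non-redundant representation of e_min.
        e_min = -(2 ** (n - 1)) + 2      # -1022 for double, -126 for single
        return e_min & mask
    msb = (e_pf >> (n - 1)) & 1
    copies = width - (n - 1)             # 2 copies for double, 5 for single
    high = ((1 << copies) - 1) if msb == 0 else 0   # inverted, replicated MSB
    pattern = (high << (n - 1)) | (e_pf & ((1 << (n - 1)) - 1))
    return (pattern + 1) & mask


def as_signed(pattern: int, width: int = 12) -> int:
    """Interpret a bit pattern as a two's complement number."""
    return pattern - (1 << width) if pattern >> (width - 1) else pattern
```

With these definitions, `as_signed(exp_packed_to_unpacked(e, 11))` equals `e - 1023` for every biased double precision exponent `e` of a normalized number, and the all-ones exponent maps to $e_{max} + 1 = 1024$.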
2.6.2.2 Unpacked Format → Packed Format
Also in the conversion direction from unpacked number representations to packed number representations, the sign bit $s_{PF}$ and the fraction $f_{PF}[1:p-1]$ are copied identically. For the exponent conversion we have to distinguish between two cases: (a) normalized numbers or special values; and (b) denormalized numbers or zero.

(a) In the case of normalized numbers or special values like $\pm\infty$ or NaNs, we have to convert the exponent from the two's complement representation to the biased representation. For normalized numbers, the bit $e_{UF}[n]$ is only a sign extension: $e_{UF}[n] = e_{UF}[n-1]$, so that the conversion is computed with the help of lemma 2.2:

$$
e'_{PF}[n-1:0] = \mathrm{bin}^{n-1}_{0}\big(\langle (\overline{e_{UF}[n-1]}, e_{UF}[n-2:0]) \rangle - 1\big).
$$

For $\pm\infty$ or NaNs, we have $e_{UF} = (01, 0^{n-1})$ and the above formula yields $e'_{PF}[n-1:0] = 1^{n}$, as required for packed $\infty$- or NaN-representations. Thus, this formula can be used for normalized numbers, $\pm\infty$ or NaNs. Because

$$
\mathrm{bin}^{7}_{0}\big(\langle (\overline{e_{UF}[7]}, e_{UF}[6:0]) \rangle - 1\big) = \mathrm{bin}^{7}_{0}\big(\langle (\overline{e_{UF}[10]}, e_{UF}[9:8], \overline{e_{UF}[7]}, e_{UF}[6:0]) \rangle - 1\big),
$$

the formula can be integrated for single and double precision by

$$
e'_{PF}[n-1:0] = \mathrm{bin}^{n-1}_{0}\big(\langle (\overline{e_{UF}[10]}, e_{UF}[9:8], e_{UF}[7] \oplus \overline{dbl}, e_{UF}[6:0]) \rangle - 1\big).
$$

(b) In the case of a zero or a denormalized number, the exponent representation $0^{n}$ is required in the packed format.

Because $f_{UF}[0] = 0$ iff the number is a zero or a denormalized number, we can distinguish between the two cases by the value of $f_{UF}[0]$. Thus, the exponent conversion can be summarized by:

$$
e_{PF} =
\begin{cases}
\mathrm{bin}^{n-1}_{0}\big(\langle (\overline{e_{UF}[10]}, e_{UF}[9:8], e_{UF}[7] \oplus \overline{dbl}, e_{UF}[6:0]) \rangle - 1\big) & \text{if } f_{UF}[0] \\
0^{n} & \text{otherwise}
\end{cases}
\tag{2.83}
$$

$$
= \mathrm{bin}^{n-1}_{0}\big(\langle (\overline{e_{UF}[10]}, e_{UF}[9:8], e_{UF}[7] \oplus \overline{dbl}, e_{UF}[6:0]) \rangle - 1\big) \wedge f_{UF}[0]. \tag{2.84}
$$
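The exponent conversion of (2.83) can be tried out with the following sketch. It assumes that bit $n-1$ of the two's complement exponent is inverted before the decrement (bit 10 for double, bit 7 for single precision) and that only the low $n$ bits are kept; the function name `exp_unpacked_to_packed` is hypothetical.

```python
def exp_unpacked_to_packed(e_uf: int, f0: int, dbl: bool) -> int:
    """Convert the 12-bit two's complement pattern e_UF[11:0] to e_PF.

    f0 is the hidden bit f_UF[0]; dbl selects double (n = 11) or
    single (n = 8) precision.
    """
    n = 11 if dbl else 8
    if not f0:
        return 0                          # zero or denormalized: e_PF = 0^n
    flipped = e_uf ^ (1 << (n - 1))       # invert e_UF[n-1]
    return (flipped - 1) & ((1 << n) - 1) # decrement, keep bits [n-1:0]
```

For instance, the pattern of $e_{max} + 1 = 1024$ yields `0x7FF` in double precision, the all-ones exponent of the packed $\infty$- and NaN-representations.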
2.6.3 Normalized Format
The only difference between the unpacked and the normalized format (NF) is that the normalized format encodes the NF factoring of an IEEE value, and not the IEEE factoring as in the unpacked format. Therefore, the representation in the normalized format differs from the unpacked format representation only for denormalized numbers, which are also represented with a normalized significand in the normalized format. The only numbers which still contain a leading zero significand bit $f_{NF}[0]$ in the normalized format are $\pm 0$.

Figure 2.7 indexes a bus with the normalized format by $BUS_{NF}[69:0]$ and separates bit fields similar to the unpacked representation:

$$
\begin{aligned}
s_{NF} &= BUS_{NF}[69] \\
e_{NF}[11:0] &= BUS_{NF}[68:57] \\
f_{NF}[0:52] &= BUS_{NF}[56:4] \\
zero_{NF} &= BUS_{NF}[3] \qquad inf_{NF} = BUS_{NF}[2] \\
qnan_{NF} &= BUS_{NF}[1] \qquad snan_{NF} = BUS_{NF}[0].
\end{aligned}
$$
Denition 2.20 We dene the function nf : NFfact
n;p
 ! f0; 1g
70
, that computes the
representation of an NF factoring (s; e; f) 2 NFfact
n;p
in the normalized format. With
e = <e
NF
[11 :0]>
2
and f = <f
NF
[0 :52]>
neg
for representable numbers and quiet NaNs,
the function nf is dened by
nf(s; e; f) =
8
>
>
>
<
>
>
>
:
(s; 0
12
; 0; 0
52
; 1; 0; 0; 0) if f = 0
(s; 0; 1; 0
10
; 1; 0
52
; 0; 1; 0; 0) if f = f
1
(s; 0; 1; 0
10
; 1; 0; f
NF
[2 :52]; 0; 0; 1; 0) if f = f
qNaN
(s; 0; 1; 0
10
; 1; 1; 0
51
; 0; 0; 0; 1) if f = f
sNaN
(s; e
NF
[11 :0]; f
NF
[0 :52]; 0; 0; 0; 0) otherwise.
In the opposite direction the function fact
NF
: f0; 1g
70
 ! NFfact
n;p
computes the NF
factoring that is represented by BUS
NF
[69 : 0] in the normalized format. With
(s;e
NF
[11 : 0]; f
NF
[0 :52]; zero; inf;qnan; snan) = BUS
NF
[69 : 0];
the function fact
NF
is dened by
fact
NF
(BUS
NF
[69 :0]) =
8
>
>
>
<
>
>
:
(s; e
0
; 0) if zero
(s; e
1
; f
1
) if inf
(s; e
qNaN
; f
qNaN
) if qnan
(s; e
sNaN
; f
sNaN
) if snan
(s;<e
NF
[11 :0]>
2
; <f
NF
[0 :52]>
neg
) otherwise.
2.6.3.1 Unpacked Format → Normalized Format
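To convert numbers from the unpacked representation to the normalized representation, an unbounded normalization shift for non-zero denormalized numbers according to definition 2.3 has to be computed. Because we can recognize non-zero denormalized numbers by the condition $(\overline{f[0]} \text{ AND } \overline{zero})$, the conversion could be described by

$$
(s_{NF}, \langle e_{NF} \rangle_2, \langle f_{NF} \rangle_{neg}) =
\begin{cases}
\text{unbounded normalization shift of } (s_{UF}, \langle e_{UF} \rangle_2, \langle f_{UF} \rangle_{neg}) & \text{if } (\overline{f[0]} \text{ AND } \overline{zero}) \\
(s_{UF}, \langle e_{UF} \rangle_2, \langle f_{UF} \rangle_{neg}) & \text{otherwise.}
\end{cases}
\tag{2.85}
$$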
Figure 2.7: Normalized format for single and double precision. (Bus layout of the normalized single format and the normalized double format, 70 bits each: sign S, exponent E[8:0] and significand F[0:23] for single precision, exponent E[11:0] and significand F[0:52] for double precision, followed by the flag bits ZERO, INF, QNAN and SNAN.)
To determine the shift amount for the unbounded normalization shift, the number of leading zeros $lz$ in the significand $f_{UF}[0:52]$ has to be detected, following lemma 2.3(ii). For the normalization, the significand is left-shifted and the exponent is decremented by the amount $lz$. In this way, the normalization shift might decrease the exponent by a maximum of 53, because the widest significand contains 53 bits, which could all be zero. Therefore, the exponent adjustment enlarges the range of the exponent to $[1024 : -1075]$ for double precision, and a 12-bit two's complement exponent representation is sufficient in the normalized representation.

If $(\overline{f[0]} \text{ AND } \overline{zero}) = 0$, the number is either a zero, with the significand $f_{UF}[0:52] = 0^{53}$, or the number is a normalized number, $\pm\infty$ or a NaN, so that $f_{UF}[0] = 1$ and therefore the shift amount is zero: $lz = 0$. In both cases the significand representation is not changed by a normalization shift by $lz$ positions. Thus, the normalization shift can also be computed for $(\overline{f[0]} \text{ AND } \overline{zero}) = 0$, so that for all cases

$$
f_{NF}[0:52] = (f_{UF}[lz:52], 0^{lz}). \tag{2.86}
$$

For the exponent adjustment the shift amount $lz$ is subtracted. Because the exponent representation is not valid for zeros, this subtraction can also be computed for all cases.

$$
\langle e_{NF}[11:0] \rangle_2 = \langle (e_{UF}[10], e_{UF}[10:0]) \rangle_2 - lz. \tag{2.87}
$$

Because $lz = 0$ for infinities and NaNs, in the normalized representation we get the exponent $e_{NF} = e_{max} + 1 = e_{UF}$ for them, like in the unpacked representation. Also the sign bit and the additional bits stay the same in both representations.

As denormalized significands are shifted, these significands do not end with weight $2^{-p+1}$; instead, the least significant bit of the significand moves to the significand position with weight $2^{-p+lz+1}$. Because significand rounding is done at this least significant bit position of the significand, the significand rounding position changes to the position with weight $2^{-p+lz+1}$ for denormalized numbers in the normalized format. In combination with the changed exponent $e - lz$ this results in the rounding position $e - lz - p + lz + 1 = e - p + 1$, which agrees with the previous IEEE rounding description and with the rounding procedure according to lemma 2.8.
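As an illustration of (2.86) and of the exponent adjustment, the following sketch performs the normalization shift on integers. It is a software model under simplifying assumptions, not the hardware circuit of the thesis: the exponent is held as an ordinary Python integer, the 53-bit significand is an integer whose most significant bit is $f[0]$, and the name `normalize` is hypothetical.

```python
SIG_BITS = 53  # significand bits f[0:52]


def normalize(e_uf: int, f_uf: int):
    """Normalization shift from the unpacked to the normalized format.

    Returns (e_nf, f_nf, lz), where lz is the number of leading zeros
    of the 53-bit significand (53 for an all-zero significand).
    """
    lz = SIG_BITS - f_uf.bit_length()
    f_nf = (f_uf << lz) & ((1 << SIG_BITS) - 1)   # (f_UF[lz:52], 0^lz)
    e_nf = e_uf - lz                              # exponent decremented by lz
    return e_nf, f_nf, lz
```

The smallest denormalized double, `normalize(-1022, 1)`, becomes exponent $-1074$ with significand $1.0$ after a shift by 52 positions. The rounding position $e_{NF} - p + lz + 1$ is unchanged, because the decrement of the exponent and the increase of the significand position cancel.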
2.6.3.2 Normalized Format → Unpacked Format
In this conversion direction from normalized to unpacked number representations, the representations of denormalized IEEE values have to be changed. For these numbers, the factoring representations have to be denormalized, so that the exponent is adjusted to $e_{min}$. Because the exponent of denormals in the normalized representation is smaller than $e_{min}$, the shift distance $lz$ can be computed by

$$
lz =
\begin{cases}
e_{min} - \langle e_{NF}[11:0] \rangle_2 & \text{if } (e_{min} - \langle e_{NF}[11:0] \rangle_2) \geq 0 \\
0 & \text{otherwise.}
\end{cases}
\tag{2.88}
$$

The conversion then changes the exponent and the significand representations by

$$
f_{UF}[0:52] = (0^{lz}, f_{NF}[0:52-lz]), \tag{2.89}
$$

$$
\langle e_{UF}[11:0] \rangle_2 =
\begin{cases}
\langle e_{NF}[11:0] \rangle_2 & \text{if } f_{UF}[0] \\
e_{min} & \text{otherwise.}
\end{cases}
\tag{2.90}
$$

The sign bit and additional bits stay the same in both representations.
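Equations (2.88) to (2.90) translate almost directly into a few lines of code. The sketch below assumes double precision ($e_{min} = -1022$), an exponent held as a Python integer and a 53-bit integer significand with $f[0]$ as the most significant bit; the name `denormalize` is illustrative.

```python
E_MIN = -1022  # e_min of double precision


def denormalize(e_nf: int, f_nf: int):
    """Convert exponent and significand from normalized to unpacked format."""
    lz = max(E_MIN - e_nf, 0)                     # shift distance, (2.88)
    # (2.89): lz leading zeros, followed by f_NF[0:52-lz]. For values that
    # are representable as IEEE denormals the dropped low bits are zero.
    f_uf = f_nf >> lz
    e_uf = e_nf if (f_uf >> 52) & 1 else E_MIN    # (2.90)
    return e_uf, f_uf
```

Applied to the normalized form of the smallest denormal, `denormalize(-1074, 1 << 52)` returns exponent $e_{min}$ and a significand with a single one in the last position.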
2.6.4 Representative Format
The representative format (RF) is a representation for results of IEEE operations on IEEE values in preparation for IEEE rounding. For a detailed description of the representative format, we first define the set of values $IRES_{n,p}$ and the set of factorings $RFfact_{n,p}$ on which the representative format is based, and show some properties of these sets.

Definition 2.21 The set of result values $IRES_{n,p}$ is defined by

$$
IRES_{n,p} := \{0\} \cup \{x \in \mathbb{R} \mid 2^{-2^{n} - 2p + 6} < abs(x) < 2^{2^{n} + p - 3}\} \cup SPE,
$$

and the set of RF factorings $RFfact_{n,p}$ is defined by

$$
RFfact_{n,p} = \{(s, e, f) \mid \exists x \in IRES_{n,p} : val(s, e, f) = rep_{p-e}(x) \text{ AND } (f < 4 \text{ AND } (f \geq 1 \text{ OR } (f \text{ is multiple of } 2^{-p+1})))\}.
$$

If $x \overset{p - e_{RF}}{=} val(s_{RF}, e_{RF}, f_{RF})$ for a value $x \in IRES_{n,p}$ and a factoring $(s_{RF}, e_{RF}, f_{RF}) \in RFfact_{n,p}$, then $(s_{RF}, e_{RF}, f_{RF})$ is called a RF factoring representation of $x$.

Note that because special values have an exact and normalized significand representation and we defined the exponent of special factorings to have properties like $e_{sp} = e_{max} + 1$, we get $SPEfact \subseteq RFfact_{n,p}$, and special value factorings are RF factorings of the corresponding special values.
Lemma 2.24 This lemma consists of four parts:

(a) Each exact result of an IEEE operation on IEEE values has a value from $IRES_{n,p}$.

(b) Each value $x \in IRES_{n,p}$ has at least one RF factoring representation $(s_{RF}, e_{RF}, f_{RF}) \in RFfact_{n,p}$ with $val(s_{RF}, e_{RF}, f_{RF}) = rep_{p - e_{RF}}(x)$.

(c) Each exact result $x$ of an IEEE operation on IEEE values in single precision or double precision has at least one RF factoring representation $(s_{RF}, e_{RF}, f_{RF}) \in RFfact_{11,53}$.

(d) If $(s_{RF}, e_{RF}, f_{RF}) \in RFfact_{11,53}$ is a RF factoring of the exact result $x \in IRES_{11,53}$, then for $mode \in \{RZ, RNE, RI, RMI\}$, IEEE rounding of $x$ in single and double precision can also be computed on the factoring $(s_{RF}, e_{RF}, f_{RF})$ by

$$
r_{mode}(x) = val(iround_{mode}(s_{RF}, e_{RF}, f_{RF})).
$$
Figure 2.8: Representative format for single and double precision. (Bus layout of the representative format, 74 bits: sign S, exponent E[12:0], significand F[-1:54], followed by the flag bits ZERO, INF, QNAN and SNAN.)
Proof: We prove the four parts of the lemma separately:

(a) Obviously all possible zero or special value results of IEEE operations are included in $IRES_{n,p}$. All non-zero representable results have a magnitude that is larger than $2^{-2^{n} - 2p + 6}$ and smaller than $2^{2^{n} + p - 3}$ (see section 2.4.2). All real numbers with these properties are included in $IRES_{n,p}$, so that there cannot be an IEEE result that is not included in $IRES_{n,p}$.

(b) If $(s, e, f)$ is an arbitrary factoring of $x \in IRES_{n,p}$, and $(s', e', f')$ is the corresponding normalized factoring of $(s, e, f)$, then obviously $(s', e', rep_{53}(f')) \in RFfact_{n,p}$ and $(s', e', rep_{53}(f'))$ is a RF factoring of $x$. Thus, indeed each $x$ has at least one RF factoring in $RFfact_{n,p}$.

(c) Part (c) follows from part (a) and part (b) using that $IRES_{8,24} \subset IRES_{11,53}$.

(d) If $(s_{RF}, e_{RF}, f_{RF}) \in RFfact_{11,53}$ is a RF factoring of the exact result $x \in IRES_{11,53}$, then $val(s_{RF}, e_{RF}, f_{RF}) = rep_{53 - e_{RF}}(x)$. There is a unique factoring $(s_{RF}, e_{RF}, f)$ with $x = val(s_{RF}, e_{RF}, f)$. For the significand $f$ of this factoring we get $f_{RF} = rep_{53}(f)$. Because for both single and double precision $p \leq 53$, it follows from lemma 2.15 with $p' = 53$ that $iround_{mode}(s_{RF}, e_{RF}, f_{RF}) = iround_{mode}(s_{RF}, e_{RF}, f)$, and by definition 2.8 of $iround_{mode}$ we get $val(iround_{mode}(s_{RF}, e_{RF}, f_{RF})) = rnd_{mode}(x)$, as required. $\Box$
Corollary 2.25 The previous lemma has shown that each exact result $x$ of an IEEE operation on IEEE values in both single and double precision has at least one RF factoring representation $(s_{RF}, e_{RF}, f_{RF}) \in RFfact_{11,53}$ and that IEEE rounding of $x$ can be computed by rounding of $(s_{RF}, e_{RF}, f_{RF})$.

The representative format is an encoding of RF factorings $(s_{RF}, e_{RF}, f_{RF}) \in RFfact_{11,53}$. From the conditions on RF factorings in definition 2.21 it follows that the exponent $e_{RF}$ can be represented by a 13-bit two's complement representation $e_{RF} = \langle e_{RF}[12:0] \rangle_2$ and that the representation of the significand $f_{RF}$ requires 56 bits: $f_{RF} = \langle f_{RF}[-1:54] \rangle_{neg}$.

Like in the unpacked and normalized format, the special values and zeros are indicated by the 4 additional bits: zero, inf, qnan, and snan. For these cases, the sign $s_{RF}$ and the significand representation $f_{RF}[1:52]$ correspond to the IEEE representation and the bits $f_{RF}[-1]$, $f_{RF}[53:54]$ are defined to be zero. Additionally, for zeros, $f_{RF}[0:52] = 0^{53}$ and the exponent representation is not valid. For $\pm\infty$ and NaNs, we define $f_{RF}[0] = 1$ and $e_{RF} = e_{max} + 1$ like in the unpacked and the normalized format.

Figure 2.8 depicts a representation in the representative format $BUS_{RF}[73:0]$ with bit fields:

$$
\begin{aligned}
s_{RF} &= BUS_{RF}[73] \\
e_{RF}[12:0] &= BUS_{RF}[72:60] \\
f_{RF}[-1:54] &= BUS_{RF}[59:4] \\
zero_{RF} &= BUS_{RF}[3] \qquad inf_{RF} = BUS_{RF}[2] \\
qnan_{RF} &= BUS_{RF}[1] \qquad snan_{RF} = BUS_{RF}[0].
\end{aligned}
$$
Denition 2.22 We dene the function rf : RFfact
n;p
 ! f0; 1g
74
, that computes the
representation of an RF factoring (s
RF
; e
RFF
; f
RF
) 2 RFfact
n;p
in the representative
format. With e
RF
= <e
RF
[12 :0]>
2
and f
RF
= <f
RF
[ 1:54]>
neg
for representable
numbers and quiet NaNs, the function rf is dened by
rf(s; e; f) =
8
>
>
>
<
>
>
:
(s; 0
13
; 00; 0
54
; 1; 0; 0; 0) if f = 0
(s; 00; 1; 0
10
; 01; 0
54
; 0; 1; 0; 0) if f = f
1
(s; 00; 1; 0
10
; 01; 0; f
RF
[2 :52]; 00; 0; 0; 1; 0) if f = f
qNaN
(s; 00; 1; 0
10
; 01; 1; 0
53
; 0; 0; 0; 1) if f = f
sNaN
(s; e
RF
[12 :0]; f
RF
[ 1:54]; 0; 0; 0; 0) otherwise.
In the opposite direction the function fact
RF
: f0; 1g
74
 ! RFfact
n;p
computes the RF
factoring that is represented by BUS
RF
[73 : 0] in the representative format. With
(s;e
RF
[12 : 0]; f
RF
[ 1:54]; zero; inf;qnan; snan) = BUS
RF
[73 : 0];
the function fact
RF
is dened by
fact
RF
(BUS
RF
[73 :0]) =
8
>
>
>
<
>
>
>
:
(s; e
0
; 0) if zero
(s; e
1
; f
1
) if inf
(s; e
qNaN
; f
qNaN
) if qnan
(s; e
sNaN
; f
sNaN
) if snan
(s;<e
RF
[12 :0]>
2
; <f
RF
[ 1:54]>
neg
) otherwise.
Note that not all possible bit combinations from $\{0,1\}^{74}$ are valid representations in the representative format. A representation $BUS_{RF}[73:0]$ could be invalid for two reasons:

1. $BUS_{RF}[73:0]$ would be the RF representation of a number that is not in $IRES_{11,53}$.

2. The significand conditions from definition 2.21 for RF factorings are not fulfilled.

We are only interested in the second case, because we will only deal with numbers from $IRES_{11,53}$. Therefore, we formulate the conditions that have to be fulfilled for the significand of RF representations at bit level in the following:

Corollary 2.26 The conditions on the significand $f_{RF} = \langle f_{RF}[-1:54] \rangle_{neg}$ of a RF factoring, namely $(f_{RF} < 4 \text{ AND } (f_{RF} \geq 1 \text{ OR } (f_{RF} \text{ is multiple of } 2^{-52})))$, are fulfilled iff $(f_{RF}[-1] \vee f_{RF}[0] \vee \overline{f_{RF}[54]})$. This means that either one of the most significant two bits $f_{RF}[-1:0] = BUS_{RF}[59:58]$ has to be one or the least significant bit $f_{RF}[54] = BUS_{RF}[4]$ has to be zero.
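The bit-level condition of corollary 2.26 is easy to check on a bus value. The sketch below models the 74-bit bus as a Python integer whose bit $i$ is $BUS_{RF}[i]$; the function name `rf_significand_valid` is hypothetical and only the three bits named in the corollary are inspected.

```python
def rf_significand_valid(bus_rf: int) -> bool:
    """Check the significand condition of corollary 2.26 on BUS_RF[73:0]."""
    f_m1 = (bus_rf >> 59) & 1    # f_RF[-1] = BUS_RF[59]
    f_0 = (bus_rf >> 58) & 1     # f_RF[0]  = BUS_RF[58]
    f_54 = (bus_rf >> 4) & 1     # f_RF[54] = BUS_RF[4]
    return bool(f_m1 or f_0 or not f_54)
```

A bus value with both leading significand bits cleared and $f_{RF}[54]$ set, such as `1 << 4`, is rejected; setting $f_{RF}[0]$ as well makes it acceptable again.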
2.6.5 Gradual Result Format
Also in the gradual result format, the results of IEEE operations on IEEE values should be represented. But in contrast to the representative format, in the gradual result format part of the rounding has already been computed on the exact IEEE results. For a detailed description of the gradual result format, we first introduce the set of factorings on which the gradual result format is based.

Definition 2.23 The set of GF factorings $GFfact$ is defined by:

$$
\begin{aligned}
GFfact = \{((s_{GF}, e_{GF}, f_{GF}), tinx, tinc) \mid\ & \exists (s, e, f) \in FACT(IRES_{11,53}) : \\
& ((s_{GF}, e_{GF}, f_{GF}), tinx, tinc) = post\_norm(sgrnd1((s, e, f)))\}
\end{aligned}
$$
By this definition, GF factorings are the result of a gradual rounding step according to the intermediate result from lemma 2.20. Thus, IEEE rounding can also be computed on these factorings:

Corollary 2.27 If $((s_{GF}, e_{GF}, f_{GF}), tinx, tinc) \in GFfact$ is the GF factoring of the value $x \in IRES_{11,53}$, then for $mode \in \{RZ, RNE, RI, RMI\}$ IEEE rounding of $x$ in single and double precision can be computed on $((s_{GF}, e_{GF}, f_{GF}), tinc, tinx)$ according to lemma 2.20 by
$$
iround_{mode}(s, e, f) = exp\_rnd_{mode?s_{GF}}\big(post\_norm\big(sgrnd2_{mode?s_{GF}}\big(\lceil_{e_{min}}\rceil((s_{GF}, e_{GF}, f_{GF}), tinc, tinx)\big)\big)\big)
$$
The gradual result format is an encoding of GF factorings $((s_{GF}, e_{GF}, f_{GF}), tinc, tinx) \in GFfact$. Because the range of the represented numbers is not changed significantly by the previous gradual rounding step, also in this case the exponent $e_{GF}$ is represented by 13 bits: $e_{GF} = \langle e_{GF}[12:0] \rangle_2$. The significand, which was rounded at position 52 and which was post-normalized, is represented by $f_{GF} = \langle f_{GF}[0:52] \rangle_{neg}$.

Like in the unpacked, normalized and representative format, the special values and zeros are indicated by 4 additional bits. For these cases, the sign $s_{GF}$ and the significand representation $f_{GF}[1:52]$ correspond to the packed IEEE representation. For zeros, $f_{GF}[0] = 0$ and the exponent representation is not valid. For $\pm\infty$ and NaNs, we define $f_{GF}[0] = 1$ and $e_{GF} = e_{max} + 1$ like in the unpacked, the normalized and the representative format. Thus, in the gradual result format the significand is normalized for all non-zero numbers, i.e., if the additional bit $zero_{GF} = 0$, then $f_{GF}[0]$ must be 1.

Moreover, the two rounding tags $tinc_{GF}$ and $tinx_{GF}$ from the previous gradual rounding step are included in the number representations of the gradual result format. For special values, which have an exact representation, $tinc_{GF}$ and $tinx_{GF}$ have to be zero, so that no rounding will be computed for them.

Figure 2.9 depicts a bus in the gradual result format indexed by $BUS_{GF}[72:0]$ with bit fields:

$$
\begin{aligned}
s_{GF} &= BUS_{GF}[72] \\
e_{GF}[12:0] &= BUS_{GF}[71:59] \\
f_{GF}[0:52] &= BUS_{GF}[58:6] \\
tinc_{GF} &= BUS_{GF}[5] \qquad tinx_{GF} = BUS_{GF}[4] \\
zero_{GF} &= BUS_{GF}[3] \qquad inf_{GF} = BUS_{GF}[2] \\
qnan_{GF} &= BUS_{GF}[1] \qquad snan_{GF} = BUS_{GF}[0].
\end{aligned}
$$
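The field layout of the gradual result format can be mirrored in software by slicing a 73-bit integer. The following sketch is only meant to make the bit positions concrete; the function name `split_gf` and the dictionary keys are hypothetical, and bit $i$ of the integer stands for $BUS_{GF}[i]$.

```python
def split_gf(bus_gf: int):
    """Split BUS_GF[72:0] into the fields of the gradual result format."""
    def bits(hi: int, lo: int) -> int:
        # Extract BUS_GF[hi:lo] as an unsigned integer.
        return (bus_gf >> lo) & ((1 << (hi - lo + 1)) - 1)

    return {
        "s": bits(72, 72),
        "e": bits(71, 59),     # e_GF[12:0]
        "f": bits(58, 6),      # f_GF[0:52], f_GF[0] is the top bit
        "tinc": bits(5, 5),
        "tinx": bits(4, 4),
        "zero": bits(3, 3),
        "inf": bits(2, 2),
        "qnan": bits(1, 1),
        "snan": bits(0, 0),
    }
```

The field widths add up to $1 + 13 + 53 + 2 + 4 = 73$ bits, which matches the width of the bus.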
Denition 2.24 We dene the function gf : GFfact  ! f0; 1g
73
, that computes the
representation of an GF factoring ((s
GF
; e
GF
; f
GF
)tinc;tinx) 2 GFfact
n;p
in the gradual
result format. With e
GF
= <e
GF
[12 :0]>
2
and f
GF
= <f
GF
[0 :52]>
neg
for representable
numbers and quiet NaNs, the function gf is dened by
gf(s; e; f) =
8
>
>
>
>
<
>
>
>
:
(s; 0
13
; 0; 0
52
; 00; 1; 0; 0; 0) if f = 0
(s; 00; 1; 0
10
; 1; 0
52
; 00; 0; 1; 0; 0) if f = f
1
(s; 00; 1; 0
10
; 1; 0; f
GF
[2 :52]; 00; 0; 0; 1; 0) if f = f
qNaN
(s; 00; 1; 0
10
; 1; 1; 0
51
; 00; 0; 0; 0; 1) if f = f
sNaN
(s; e
GF
[12 :0]; f
GF
[0 :52]; tinc;tinx; 0; 0; 0; 0) otherwise.
Figure 2.9: Gradual result format for single and double precision. (Bus layout of the gradual result format, 73 bits: sign S, exponent E[12:0], significand F[0:52], the rounding tags TINC and TINX, followed by the flag bits ZERO, INF, QNAN and SNAN.)
In the opposite direction the function $fact_{GF} : \{0,1\}^{73} \to GFfact_{n,p}$ computes the GF factoring that is represented by $BUS_{GF}[72:0]$ in the gradual result format. With

$$
(s, e_{GF}[12:0], f_{GF}[0:52], tinc, tinx, zero, inf, qnan, snan) = BUS_{GF}[72:0],
$$

the function $fact_{GF}$ is defined by

$$
fact_{GF}(BUS_{GF}[72:0]) =
\begin{cases}
((s, e_{0}, 0), 00) & \text{if } zero \\
((s, e_{\infty}, f_{\infty}), 00) & \text{if } inf \\
((s, e_{qNaN}, f_{qNaN}), 00) & \text{if } qnan \\
((s, e_{sNaN}, f_{sNaN}), 00) & \text{if } snan \\
((s, \langle e_{GF}[12:0] \rangle_2, \langle f_{GF}[0:52] \rangle_{neg}), tinc, tinx) & \text{otherwise.}
\end{cases}
$$
Chapter 3
FP Microarchitectures
In the previous section the definitions and the requirements of the IEEE FP standard 754 were presented. From this section we know that an IEEE compliant FP implementation has to implement FP additions/subtractions, FP multiplications, FP divisions, FP square roots, FP comparisons and FP conversions in hardware or in software.

The FP operations differ in importance. One measure of the importance of these FP operations is the frequency of their usage in an average workload of current microprocessors. As such a measure, the frequency of the operations in traces of the SPEC92fp benchmark suite [17] is depicted in figure 3.1. Obviously, the FP addition/subtraction and the FP multiplication are the most frequent arithmetic FP operations in these benchmark traces. This result agrees with the analysis from [26]. To accelerate the FP performance of a microprocessor, it makes sense to spend the most effort on accelerating the implementation of the most frequent FP operations. The latencies of the FP operations in commercial microprocessors in table 3.1 show that indeed the FP addition and the FP multiplication, which are used frequently, are implemented much faster than the FP division, which is only rarely used. Because the IEEE FP standard even allows parts of the FP computations to be implemented in software, some very infrequent operations like the FP square root or the FP remainder have no hardware realization at all in most commercial microprocessors. Although a cheap and slow implementation of infrequently used operations might be cost-effective, one should also consider that some operations are perhaps used infrequently, and avoided where possible, only because current microprocessors provide such poor performance for them. This question could only be answered by the use of benchmarks and compilers that assume as little as possible about the hardware implementation details.
We base our choice of which arithmetic FP operations are implemented in hardware
on the processor model, into which our FP designs will be integrated later. We will use a
pipelined RISC-processor from [23], that implements the R3000 instruction set. For this
reason the performance of the FP designs will also be determined on R3000 traces of the
SPEC92fp Benchmarks. The R3000 instruction set includes the FP addition/subtraction,
FP multiplication, FP division, FP test, FP conversion, FP absolute value and FP negative
value. Therefore, we propose IEEE compliant FP designs that support exactly these
arithmetic operations in hardware.
Figure 3.1: Operation frequencies in the traces of the SPECfp 92 benchmarks. (Bar chart; vertical axis: frequency in %, from 0 to 14; operation classes: add/sub immediate, logical immediate, shift immediate, FP-comparison, FP-move, FP-abs,neg, FP-branch, others, FP-load, FP-conversion, FP-store, load, store, FP-division, FP-multiplication, FP-add/sub, division, multiplication, i-branch, jumps, special moves, add/sub two regs, logical two regs, shift two regs.)

We present three basic microarchitectures of our floating-point designs in the following. The main difference between these microarchitectures is the amount of rounding hardware that is shared between the functional units. If the functional units share some rounding hardware at all, the FP microarchitecture is mainly determined by the specification of the intermediate FP representation at the interfaces between the functional units and the shared rounding hardware. Our three rounding microarchitectures are based on the intermediate FP representations that were defined in the previous section. The lemmas 2.7, 2.20 and 2.8 and corollary 2.21 about the different possible partitionings of IEEE rounding computations already suggest the possible partitionings of the rounding implementations.
(I) In the first microarchitecture all the rounding computations are concentrated in a shared general rounding unit. This rounding unit handles the rounding for all IEEE results, including the exponent wrapping and the FP exceptions, for both single and double precision operations. A basic specification of such a rounder was first described in [10]. Thereafter, this rounder was implemented by our group, resulting in a version that will be included in [23], where also a rigorous correctness proof of the compliance with the IEEE rounding definition will be found. This rounder is further optimized in this thesis. The interface between the functional units and the shared general rounder is the RF factoring representation from definition 2.21. We require that the functional units compute a RF factoring representation $(s_{RF}, e_{RF}, f_{RF})$ of the exact result $exact_{op}$. The shared general rounder then has to compute IEEE rounding on the RF factoring representation, $iround(s_{RF}, e_{RF}, f_{RF})$. Lemma 2.7 guarantees that the IEEE rounding of the RF factoring $(s_{RF}, e_{RF}, f_{RF})$ agrees with IEEE rounding of $exact_{op}$, including the cases of denormalized and special value results, exceptions and exponent wrapping. In this microarchitecture the integrated packed FP representation is used in the memory and in the register file.
(II) In the second microarchitecture, the rounding for the case of normalized double precision results is computed within each functional unit, and this rounded result is fixed for all the remaining cases in a second rounding step implemented by a shared gradual rounding unit. For the integrated rounding in the functional units assuming normalized, double precision operands and results, several algorithms from literature could be used. The implementation of the gradual rounder is based on the theory from [21] about gradual rounding. This rounding technique is applied in this thesis for fully IEEE compliant rounding, including the handling of denormalized results, special values, exceptions and exponent wrapping. The interface between the functional units and the gradual rounder is specified by the gradual result format. We require that if an exact or an RF factoring of the exact result $exact_{op}$ is given by $(s_{ex}, e_{ex}, f_{ex})$, the functional units have to compute the GF factoring (see definition 2.23) $((s_{GF}, e_{GF}, f_{GF}), tinc_{GF}, tinx_{GF}) = ground1(s_{ex}, e_{ex}, f_{ex})$. This computation already includes a first gradual rounding step by the gradual rounding function $ground1$, which assumes a normalized double precision result. The second gradual rounding step is then computed in the shared gradual rounding unit by the function $ground2((s_{GF}, e_{GF}, f_{GF}), tinc_{GF}, tinx_{GF})$. Corollary 2.21 and lemma 2.20 guarantee that the sequence of the rounding by the gradual rounding functions $ground1$ and $ground2$ on the factoring $(s_{ex}, e_{ex}, f_{ex})$ simulates IEEE rounding of the factoring $(s_{ex}, e_{ex}, f_{ex})$, including the cases of denormalized and special value results, exceptions and exponent wrapping. Also in this microarchitecture the integrated packed FP representation is used in the memory and in the register file.

Table 3.1: Latencies of floating-point operations in commercial microprocessors.

processor        ALU   FP add   FP mult   FP div (single)   FP div (double)   FP sqrt
ULTRA-Sparc 1     1      3         3            12                 22           12-22
ULTRA-Sparc 3     1      4         4            12                 17           12-24
Pentium Pro       1      3         5            17                 36             -
PowerPC           1      5         5            17                 21             -
Alpha 21064       1      4         4            34                 63             -
Alpha 21164       1      4         4            19                 31             -
Alpha 21264       1      4         4            12                 15             -
R10000            1      2         2            19                 33             -
PA-8000           1      3         3            31                 31             -
(III) The third rounding architecture suggests a completely new architecture for an IEEE compliant FPU. In this architecture no rounding hardware is shared, but each functional unit contains a dedicated rounding implementation that computes full IEEE rounding considering denormal and special values, exceptions and exponent wrapping. The special problem with the implementation of this microarchitecture is the implementation of the floating-point multiplication. The floating-point multiplier conventionally requires normalized significands in its operands and delivers an almost normalized result. For the fast integration of IEEE rounding into the FP multiplier, the significand has to be rounded in parallel to the multiplication computations. For the case of denormalized results this rounding has to be computed at a variable rounding position, which could be at each position within the significand. The idea of how to integrate such a variable position rounding into the multiplication implementation is the key concept for this microarchitecture. Such a multiplication implementation including variable position rounding will be presented later (see also [37]). Because such a multiplication implementation allows working on normalized FP representations (even for denormalized values) as inputs and outputs, the internal FP representations can be changed to normalized NF factoring representations for this microarchitecture. Thus, the register file contains the operands in the NF factoring representation and the functional units have to compute the NF factoring of the IEEE rounded result. This is specified by the rounding function $nround$, so that if an exact or an RF factoring of the exact result $exact_{op}$ is given by $(s_{ex}, e_{ex}, f_{ex})$, the functional units have to compute the NF factoring $nround(s_{ex}, e_{ex}, f_{ex})$. Definition 2.8 and lemma 2.8 guarantee that this function computes IEEE rounding of the exact result $exact_{op}$, including the cases of denormalized and special value results, exceptions and exponent wrapping. The computation of this rounding function according to lemma 2.8 contains the computation of the normalized significand rounding function $n\_sig\_rnd_{mode?s}$, where the significand has to be rounded at the variable rounding position $vp = (p - 1) - \max\{0, e_{min} - e_{ex}\}$ (see definition 2.9) that depends on the exponent $e_{ex}$, so that the rounding position could vary in a wide range as mentioned above.

Figure 3.2: FP unit microarchitecture using a shared general rounding unit. (Block diagram: register file in packed format; UnpackA, UnpackB and Test; functional units ADD I, MULT I, DIV I and CONV I; shared general rounding and PACK unit; memory system and FXU; the interfaces are labelled packed format, normalized format and representative format.)
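The dependence of the variable rounding position on the exponent can be illustrated with a few numbers. The sketch below simply evaluates the formula $vp = (p - 1) - \max\{0, e_{min} - e_{ex}\}$ from item (III); the function name `variable_rounding_position` and its parameter names are hypothetical.

```python
def variable_rounding_position(e_ex: int, p: int, e_min: int) -> int:
    """Rounding position vp of a normalized significand with exponent e_ex."""
    return (p - 1) - max(0, e_min - e_ex)


# Double precision: p = 53, e_min = -1022.
for e_ex in (0, -1022, -1023, -1050, -1074):
    print(e_ex, variable_rounding_position(e_ex, 53, -1022))
```

For exponents of at least $e_{min}$ the significand is rounded at the fixed position $p - 1 = 52$. For every step the exponent falls below $e_{min}$, the rounding position moves one bit towards the most significant end, and it reaches position 0 at $e_{ex} = -1074$.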
In the following the main structures and implementation details for the three microarchi-
tectures are described:
Rounding architecture I using general rounding Figure 3.2 depicts the basic structure of this FP rounding architecture. The operands are stored in the register file in the packed representation. The unpacking units convert them to a representation in the normalized format. The unpacking is computed in two steps, a conversion from the packed to the unpacked format, followed by a conversion from the unpacked to the normalized format. The normalized operands are necessary for multiplications and divisions. But as in our designs the whole normalization easily fits into one clock cycle, there is no overhead in also providing the normalized representation instead of the unpacked representation for additions.

After unpacking, a representative representation (RF factoring) of the exact result of the operation has to be computed. This can be achieved with simple standard algorithms for the addition, multiplication and subtraction, including the computation of a sticky bit as described in the representative computation according to lemma 2.11. The representative representation of the result is fed into the shared general rounding unit, which delivers the packed representation of the rounded result and feeds it back into the register file. This general rounder is also able to deal with denormalized results, special value results, exponent wrapping and exceptions. The computations that have to be carried out in the general rounder are quite complex. This circuit has to deal with leading zero detection to find out the range of the number, with a normalization shift for results with values of normalized numbers, with a denormalization shift for results with values of denormalized numbers, with significand rounding, a post-normalization shift, exponent rounding and exponent wrapping.

Figure 3.3: FP microarchitecture using a shared gradual rounding unit. (Block diagram: register file in packed format; UNPACK units and TEST; functional units ADD II, MULT II, DIV II and CONV II; shared gradual rounding and PACK unit; memory system and FXU; the interfaces are labelled packed format, normalized format and gradual result format.)
Rounding architecture II using gradual rounding. In contrast to general rounding,
in the gradual rounding architecture II (see figure 3.3) a part of the rounding is shifted
to the functional units, so that the functional units output normalized results (GF
factorings) in the gradual result format. This result is then rounded following the IEEE
specifications in a second step in the gradual rounding unit [21], which delivers a packed
representation of the rounded result to the register file. Because the input to the gradual
rounding unit is already normalized, gradual rounding is simpler than general rounding.
In particular, the leading zero prediction and the normalization shift of the general
rounding implementation can be saved.
In the functional units, the gradual rounding (computation of the rounding function
ground1) for normalized double precision numbers has to be computed. There are several
algorithms from the literature [9, 27, 31, 32, 33, 34, 36, 40, 44, 45] for each arithmetic operation
that could be used for this situation. In the context of this thesis we introduced
injection-based rounding and, based on this technique, developed new algorithms for FP addition
rounding (see also [40]) and FP multiplication rounding (see also [9, 11]) that are, to the
best of our knowledge, the fastest FP addition rounding and FP multiplication rounding
algorithms published to date. For the division implementation the Newton-Raphson
iteration (see also [26, 28, 23]) is used. For the initial reciprocal approximation we designed
a new fast implementation of a linear approximation formula (see also [36, 39]).
Figure 3.4: FP microarchitecture using a variable position rounding
Rounding architecture III using variable position rounding. In contrast to both
previous architectures, in this variable position rounding implementation (see figure 3.4)
we do not operate on the packed format in the register file, but on the normalized format
in the register file [37]. In this way, the normalization conversions inside the unpack and
pack units are moved towards the communication between the register file and the memory
system. Fortunately, in our designs these conversions can be integrated in the load and
store operations without increasing their latency.
Among the arithmetic operations, the multiplication becomes the most difficult, because
for denormalized results variable position rounding becomes necessary. An algorithm for
fast variable position rounding in multipliers was developed in the context of this work
and is presented in [37]. In this way, the multiplication implementation including fully
IEEE compliant rounding, also for denormalized results, is the core for the use of rounding
architecture III with the normalized internal FP representation (NF factoring).
In the FP addition, variable position rounding does not cause problems, because
with our FP adder implementation from [40] all denormal results are exact and do not
need to be rounded. Some selections for the computations on special values and the
exception handling have to be included in the implementation, so that the implementation
from [40] has to be modified for rounding architecture III to implement fully IEEE
compliant rounding.
Further options. Apart from the three different rounding architectures described above,
we vary the FP multiplication implementation to contain either a full-sized or a half-sized
adder tree, both of which use Booth2 recoding [30, 3, 1]. We developed an improved cost
formula for these adder trees in [30]. Moreover, we consider three different
implementations of the division based on the Newton-Raphson iteration [26, 23] with different
starting accuracies of the initial reciprocal approximation. These multiplicative division
implementations are integrated into the implementations of the multiplier. For the initial
reciprocal approximation we use the fast implementation of a linear approximation
formula that we already presented in [36, 39], with an absolute approximation error bounded
by $2^{-8}$, $2^{-16}$ and $2^{-28}$, respectively.
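To illustrate why the starting accuracy matters, the following minimal sketch counts Newton-Raphson steps for a reciprocal in exact rational arithmetic. It is not the thesis's divider: the function name `newton_steps`, the use of a relative (rather than absolute) initial error, and the target accuracy of $2^{-53}$ are assumptions made only for this illustration. The step $x \leftarrow x(2 - bx)$ squares the relative error $1 - bx$, so the accuracy in bits doubles per step.

```python
from fractions import Fraction


def newton_steps(b, delta0, target_bits=53):
    """Count Newton-Raphson steps x <- x*(2 - b*x) until |1 - b*x| < 2**-target_bits.

    delta0 is the relative error of a hypothetical initial approximation of 1/b.
    """
    x = (1 - delta0) / b              # initial approximation with relative error delta0
    steps = 0
    while abs(1 - b * x) >= Fraction(1, 2 ** target_bits):
        x = x * (2 - b * x)           # one step: the relative error gets squared
        steps += 1
    return steps


b = Fraction(3, 2)
for bits in (8, 16, 28):
    # Starting accuracies matching the three error bounds named in the text.
    print(bits, newton_steps(b, Fraction(1, 2 ** bits)))
```

Under these assumptions a better starting value saves whole iterations, which is the trade-off between lookup cost and latency that the three division variants explore.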
Chapter 4
Basic FP Operations
4.1 Internal Format Conversions
4.1.1 Unpacking I-III (packed → normalized format)
This section describes the unpacking, i.e., the conversion from a packed single or double
precision FP representation to the corresponding FP representation in the normalized
format. Whether the input is single or double precision is signaled by
the bit dbl. In addition to this bit, the packed FP input is given by $BUS_{PF}[63:0]$
(section 2.6.1). The FP output in the normalized format is denoted by $BUS_{NF}[69:0]$
(section 2.6.3).
First, we deal with the problem of extracting the bits belonging to sign, exponent and
significand from the packed representation, integrating the cases for single and double
precision. For the sign this is an easy task, namely $s_{PF} = BUS_{PF}[63]$. The
exponent and significand extractions are implemented in the exponent-extract circuit and
in the significand-extract circuit in figure 4.1, each of which can be realized by a row of muxes,
described by (see equations 2.75-2.80):
\[ e_{PF}[10:0] = \begin{cases} BUS_{PF}[62:52] & \text{if } dbl \\ (000,\; BUS_{PF}[62:55]) & \text{otherwise,} \end{cases} \]
\[ f_{PF}[1:52] = \begin{cases} BUS_{PF}[51:0] & \text{if } dbl \\ BUS_{PF}[54:3] & \text{otherwise.} \end{cases} \]
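The two mux rows can be sketched in software as follows. This is only an illustration under stated assumptions: the bus is modeled as a Python integer whose bit 63 is the sign, a single precision value is assumed to occupy the upper 32 bits of the bus with the lower 32 bits zero (the bit ranges above suggest this layout, but the section does not state what the lower bits hold), and the helper names are hypothetical.

```python
import struct


def extract_fields(bus_pf, dbl):
    """Return (s, e_PF, f_PF) where f_PF is a 52-bit integer holding f_PF[1:52]."""
    s = (bus_pf >> 63) & 1
    if dbl:
        e = (bus_pf >> 52) & 0x7FF              # BUS_PF[62:52]
        f = bus_pf & ((1 << 52) - 1)            # BUS_PF[51:0]
    else:
        e = (bus_pf >> 55) & 0xFF               # (000, BUS_PF[62:55])
        f = (bus_pf >> 3) & ((1 << 52) - 1)     # BUS_PF[54:3]
    return s, e, f


def pack_double(x):
    """64-bit pattern of a double (hypothetical test helper)."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def pack_single(x):
    """32-bit pattern of a float placed in the upper bus half (assumed layout)."""
    return struct.unpack(">I", struct.pack(">f", x))[0] << 32
```

With this layout the fraction of a single precision operand is left-aligned in the same 52-bit field as a double precision fraction, so all later stages can treat both formats uniformly.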
The conversion from the packed to the normalized format can be constructed in two steps:
(i) a conversion from the packed to the unpacked format (see section 2.6.2, p.40), followed
by (ii) a conversion from the unpacked to the normalized format (see section 2.6.3, p.42):
(i) The unpacked format differs from the packed format by 5 additional bits and a
different exponent representation:
• The conditions for the 5 additional bits f[0], snan, qnan, inf, and zero can be
easily read off from equation 2.81. To implement these conditions, three zero testers
are necessary according to:
\[ fzero = is\_zero(f_{PF}[1:52]) \]
\[ ezero = is\_zero(e_{PF}[10:0]) \]
\[ eone = \begin{cases} is\_zero(\overline{e_{PF}[10:8]},\; \overline{e_{PF}[7:0]}) & \text{if } dbl \\ is\_zero(e_{PF}[10:8],\; \overline{e_{PF}[7:0]}) & \text{otherwise} \end{cases} \;=\; is\_zero(e_{PF}[10:8] \oplus dbl,\; \overline{e_{PF}[7:0]}) \]
Based on fzero, ezero, and eone, the additional bits can be computed by:
f
UF
[0] = ezero zero = fzero ^ ezero
inf = fzero ^ eone snan = f[0] ^ eone
qnan = fzero ^ f[0] ^ eone:
This completes the description of the additional bits circuit in figure 4.1.
• The exponent representation $e_{PF}[10:0]$ has to be converted from the packed (biased) to two's
complement representation, where the single and double precision cases have to be
integrated. One can easily check that the following equation describes the conversion
for non-zero packed exponent representations (ezero = 0) given by equation 2.82,
except for a missing increment:
\[ e_1[11:0] = \begin{cases} (\overline{e_{PF}[10]}^{\,2},\; e_{PF}[9:0]) & \text{if } dbl \\ (\overline{e_{PF}[7]}^{\,5},\; e_{PF}[6:0]) & \text{otherwise.} \end{cases} \tag{4.1} \]
This missing increment is postponed to the exponent adjustment circuit. Note that
only the 5 most significant bits differ between the two cases, so that the selection can be
implemented by 5 muxes. In the next step, we integrate the case ezero = 1:
\[ e_2[11:0] = \begin{cases} (1^2,\; \overline{dbl}^{\,3},\; 0^6,\; 1) & \text{if } ezero \\ e_1[11:0] & \text{otherwise.} \end{cases} \tag{4.2} \]
Note that the value $\langle e_2[11:0]\rangle_2 = \langle e_{UF}[11:0]\rangle_2 - 1$ is also defined to be too
small by one in the case of ezero = 1, so that the postponed exponent increment
corrects the exponent value in all cases. In this way, equations 4.1 and
4.2 specify the implementation of the exponent conversion circuit in figure 4.1.
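The exponent conversion can be checked with a small bit-level model. The sketch below assumes that the copied leading bits in equation 4.1 are the inverted top exponent bit and that the dbl field in equation 4.2 is inverted (the overbars appear to have been lost in the extracted text); the function names are hypothetical. The postponed increment is applied only in the check, not in the conversion itself.

```python
def exponent_convert(e_pf, dbl):
    """Return the 12-bit pattern e_2[11:0] for a packed exponent e_pf."""
    if e_pf == 0:                                   # ezero: equation 4.2
        mid = 0b000 if dbl else 0b111               # three copies of NOT dbl
        return (0b11 << 10) | (mid << 7) | 1        # (1^2, ~dbl^3, 0^6, 1)
    if dbl:                                         # equation 4.1, double
        top = 0b00 if (e_pf >> 10) & 1 else 0b11    # two copies of NOT e_PF[10]
        return (top << 10) | (e_pf & 0x3FF)
    top = 0 if (e_pf >> 7) & 1 else 0b11111         # five copies of NOT e_PF[7]
    return (top << 7) | (e_pf & 0x7F)


def signed12(x):
    """Value of a 12-bit two's complement pattern."""
    return x - (1 << 12) if x & (1 << 11) else x
```

For example, `signed12(exponent_convert(1023, True)) + 1` gives 0, the unbiased exponent of 1.0 in double precision, and a zero packed exponent yields `e_min` after the increment.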
(ii) To convert from the unpacked format to the normalized format, we have to
implement the exponent and significand conversion (unbounded normalization shift for
denormalized numbers) according to equations 2.86 and 2.87.
To simplify the exponent adjustment, we compute $lzi = lz - 1 = lzero(f[0:52]) - 1$
in the shift amount circuit. This is done in two paths with a final selection depending on
the value of f[0]:
\[ \langle lzi[11:0]\rangle_2 = lz - 1 = \begin{cases} \langle (0^6,\; lzero(f[1:52], 0^{12})[5:0]) \rangle_2 & \text{if } f[0] = 0 \\ \langle 1^{12} \rangle_2 = -1 & \text{if } f[0] = 1 \end{cases} \tag{4.3} \]
To compute $lzero(f[1:52], 0^{12})[5:0]$, we use the circuit lzero from [23] with t = 64 and get
the number of leading zeros $lz - 1 = \langle lzi[5:0] \rangle$ for the case f[0] = 0, represented by
lzi[5:0]. The most significant $0^6$ in equation 4.3 is the sign extension to the 12-bit two's
complement representation, because for f[0] = 0, $lzi = lz - 1 \ge 0$ is non-negative. The
normalization shift is then computed by:
• A left-shift of $f_{UF}[0:52]$ by lz positions: $f_{NF}[0:52] = (f_{UF}[lz:52], 0^{lz})$. Because
we do not know lz, but only lzi = lz − 1, we use a cyclic left-shifter that computes
the following function cls on a 64-bit input input[0:63] and a shift amount sfta
given in a 6-bit binary representation:
\[ cls(input[0:63],\, sfta) = (input[sfta:63],\; input[0:sfta-1]) \]
Figure 4.1: Unpack unit
With this circuit we compute in the significand shift circuit in figure 4.1
\[ \begin{aligned} f\_cls[0:63] &= cls((f_{UF}[0:52], 0^{11}),\; \langle lzi[5:0]\rangle) \\ &= (f_{UF}[\langle lzi[5:0]\rangle : 63],\; f_{UF}[0 : \langle lzi[5:0]\rangle - 1]) \\ &= \begin{cases} (0,\; f_{UF}[lz:52],\; 0^{lz+10}) & \text{if } lzi[5:0] \ne 1^6 \\ (0,\; f_{UF}[0:52],\; 0^{10}) & \text{if } lzi[5:0] = 1^6, \end{cases} \end{aligned} \]
so that we get, as required,
\[ f_{NF}[0:52] = (f_{UF}[lz:52],\, 0^{lz}) = f\_cls[1:53]. \]
• An exponent adjustment $e_2 - lz$. The exponent adjustment circuit in figure 4.1
implements this subtraction including the postponed exponent increment, $e_2 - lz + 1$,
from the exponent conversion circuit:
\[ \begin{aligned} \langle e_{NF}[11:0]\rangle_2 &= \langle e_2[11:0]\rangle_2 - lz + 1 \\ &= \langle e_2[11:0]\rangle_2 - (lz - 1) \\ &= \langle e_2[11:0]\rangle_2 - \langle lzi[11:0]\rangle_2 \\ &= \langle e_2[11:0]\rangle_2 + \langle \overline{lzi[11:0]}\rangle_2 + 1 \end{aligned} \]
A 12-bit wide conditional sum adder is used to compute this sum. The additional 1
is fed into the carry-in input of the adder.
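The two normalization steps above (cyclic shift by lzi followed by selecting bits [1:53], and the exponent adjustment with inverted lzi and carry-in) can be sketched as follows. The integer encoding is an assumption of this sketch: bit index 0 of the thesis is the most significant bit of the Python integer, the significand is assumed non-zero, and the function names are hypothetical.

```python
MASK64 = (1 << 64) - 1
MASK53 = (1 << 53) - 1
MASK12 = (1 << 12) - 1


def cls(x, sfta):
    """Cyclic left shift of a 64-bit word by sfta (taken modulo 64)."""
    sfta &= 63
    return ((x << sfta) | (x >> (64 - sfta))) & MASK64


def normalize(f_uf, e2):
    """f_uf: non-zero 53-bit f_UF[0:52]; e2: 12-bit pattern e_2[11:0].

    Returns (e_NF[11:0], f_NF[0:52]) as integers.
    """
    lz = 53 - f_uf.bit_length()                     # leading zeros of f_UF[0:52]
    lzi = (lz - 1) & MASK12                         # 12-bit two's complement of lz - 1
    f_cls = cls(f_uf << 11, lzi & 63)               # shift (f_UF[0:52], 0^11) by lzi[5:0]
    f_nf = (f_cls >> 10) & MASK53                   # select f_cls[1:53]
    e_nf = (e2 + (~lzi & MASK12) + 1) & MASK12      # e2 - lzi via inverted lzi and carry-in
    return e_nf, f_nf


def signed12(x):
    """Value of a 12-bit two's complement pattern."""
    return x - (1 << 12) if x & (1 << 11) else x
```

For the smallest double precision denormal (`f_uf = 1`, `e2 = 3073`, i.e. −1023) this yields a normalized significand with exponent −1074, as expected for the value $2^{-1074}$.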
This completes the description of the unpack unit (figure 4.1), which outputs the normalized
FP representation
\[ BUS_{NF}[69:0] = (s_{PF},\; e_{NF}[11:0],\; f_{NF}[0:52],\; zero,\; inf,\; qnan,\; snan). \]
4.1.2 General Rounding I (representative → packed format)
This section describes a general dual mode rounding unit that is able to round and to
compress a FP number from the representative format $BUS_{RF}[73:0]$ (section 2.6.4) to
the single precision or the double precision packed FP representation $BUS_{PF}[63:0]$
(section 2.6.1). The mode, i.e., whether the destination is single or double precision, is selected
by the bit dbl. The additional inputs, the rounding mode rmode[1:0] and the trap
handler enables unf_en and ovf_en, select different IEEE rounding options. The IEEE rounding
with these options has to be computed on the input factoring
\[ (s_{RF},\, e_{RF},\, f_{RF}) = fact_{RF}(BUS_{RF}[73:0]). \]
Because the packed representation of the rounded result is based on the IEEE factoring
of the rounded result, the packed representation of the rounded result can be specified
according to definition 2.8 by
\[ BUS_{PF}[63:0] = pf(iround(s_{RF},\; e_{RF} + wec,\; f_{RF})). \]
In addition to the rounding computations, the occurrence of an overflow, underflow and
inexact exception should be signaled by ovf, unf, and inx, respectively.
We first consider the computation of the IEEE factoring of the rounded result
$(s_{PF}, e_{PF}, f_{PF}) = iround(s_{RF}, e_{RF} + wec, f_{RF})$. A conversion from the IEEE factoring
$(s_{PF}, e_{PF}, f_{PF})$ to the packed representation $BUS_{PF}[63:0]$ (packing) by the function pf
then yields the required result representation.
According to lemma 2.7, namely,
iround
mode
(s; e; f) = exp rnd
mode?s
(post norm(sig rnd
mode?s
(b
e
min
c(s; e; f)))),
the rounding function iround
mode
(s
RF
; e
RF
+wec; f
RF
) is computed in four steps:
1. a factoring (s
1
; e
1
+wec; rep
53
(f
1
)) corresponding to the bounded normalization shift
(s
1
; e
1
+wec; f
1
) = b
e
min
c(s
RF
; e
RF
+ wec; f
RF
);
2. signicand rounding (s
2
; e
2
+wec; f
2
) = a sig rnd
mode?s
1
(s
1
; e
1
+wec; rep
53
(f
1
)) (Note,
that for the signicand rounding denition 2.9, equation 2.49 and lemma 2.12, we
have also (s
2
; e
2
+wec; f
2
) = a sig rnd
mode?s
1
(s
1
; e
1
+wec; f
1
) for single and double
precision),
3. a post-normalization shift (s
3
; e
3
+wec; f
3
) = post norm(s
2
; e
2
+wec; f
2
); and
4. exponent rounding (s
PF
; e
PF
; f
PF
) = exp rnd
mode?s
3
(s
3
; e
3
+wec; f
3
):
We treat the implementation of these 4 steps separately in the next paragraphs (the
structure of the whole implementation is depicted in figure 4.5):
Normalization shift & representative computation (1.) The bounded normaliza-
tion shift is dened by equation 2.1. Using the denition of the function TINY it can be
described by:
(s
1
; e
1
+wec; f
1
) = b
e
min
c(s
RF
; e
RF
+ wec; f
RF
)
=

(s
RF
; e
RF
+wec; f
RF
) if TINY (s
RF
; e
RF
+wec; f
RF
)
(s
RF
; e
min
; f
RF
 2
e
RF
+wec e
min
) otherwise.
Because from TINY (s
RF
; e
RF
 ; f
RF
) it follows that TINY (s
RF
; e
RF
; f
RF
), all over-
ow cases are already contained in the condition TINY (s
RF
; e
RF
; f
RF
). Because after
exponent wrapping all representable results have values of normalized numbers accord-
ing to corollary 2.10, these results can not be tiny and wec = 0 for the second line.
Moreover, TINY (s
RF
; e
RF
+; f
RF
) follows from unf en for underows, so that with
tiny = TINY (s
RF
; e
RF
; f
RF
) () (e
0
< e
min
) the above equation for the bounded nor-
malization shift can be reduced to:
(s
1
; e
1
+wec; f
1
) =

(s
RF
; e
0
+wec; f
0
) = (s
RF
; e
RF
+wec; f
RF
) if tiny OR unf en
(s
RF
; e
min
; f
RF
 2
e
RF
 e
min
) otherwise.
To simplify the implementation we postpone the wrapping exponent correction after the
normalization shift computations and consider:
(s
1
; e
1
; f
1
) =

(s
RF
; e
0
; f
0
) = (s
RF
; e
RF
; f
RF
) if tiny OR unf en
(s
RF
; e
min
; f
RF
 2
e
RF
 e
min
) otherwise.
Lemma 4.1 With the bit-strings
\[ sfta[5:0] = \begin{cases} lzii[5:0] = lzero((0,\, f_{RF}[-1:54],\, 0^7)) & \text{if } \overline{tiny} \text{ OR } unf\_en \\ bin_5^0(e_{RF} - e_{min} + 2) & \text{otherwise,} \end{cases} \]
\[ f'''[0:63] = cls((0,\, f_{RF}[-1:54],\, 0^7),\; sfta) \]
\[ sftmask[0:63] = \begin{cases} 1^{64} & \text{if } e_{RF} - e_{min} + 2 \ge 0 \text{ OR } unf\_en \\ 0^{64} & \text{if } e_{RF} - e_{min} + 2 < -64 \\ hdec(sfta)[63:0] & \text{otherwise,} \end{cases} \]
the factoring $(s_1, e_1, rep_{53}(f_1))$ can be computed from $(s_{RF}, e_{RF}, f_{RF})$ by:
\[ s_1 = s_{RF} \]
\[ e_1 = \begin{cases} e' = e_{RF} + 2 - lzii & \text{if } \overline{tiny} \text{ OR } unf\_en \\ e_{min} & \text{otherwise,} \end{cases} \]
\[ rep_{53}(f_1)[0:53] = f'''[0:53] \text{ AND } sftmask[0:53] \]
\[ rep_{53}(f_1)[54] = OR(f'''[0:53] \text{ AND } \overline{sftmask[0:53]},\; f'''[54:63]). \]
Proof: We separate case (a) (tiny OR unf en) and (b) (tiny NOR unf en):
(a) First, for (tiny OR unf en), we deal with the unbounded normalization shift
(s
1
; e
1
; f
1
) = (s
RF
; e
0
; f
0
) = (s
RF
; e
RF
; f
RF
) = (s
RF
; e
RF
+ 2; f
RF
=4):
If we only consider non-zero signicands f
0
= f
RF
=4, that have the binary represen-
tation f
0
[0 : 56] = (0; f
RF
[ 1 : 54]), then these signicands have values in the range
[2
 55
; 2[, so that lemma 2.3(ii) can be used: Thus, with lzii = lzero(f
0
[0 : 56]) =
lzero((0; f
RF
[ 1 : 54]))  1, the unbounded normalization shift (s
RF
; e
RF
; f
RF
)
can be computed by a left-shift of (0; f
RF
[ 1 : 54]) by lzii = sfta positions (Note,
that this computation is also valid for the case of zero signicands, because they are
not changed by the normalization shift regardless of the shift amount.)
f
1
[0 : 63] = (f
0
[lzii : 56]; 0
lzii+7
) = (f
RF
[lzii  2 : 54]; 0
lzii+7
)
= cls((0; f
RF
[ 1 : 54]; 0
7
); sfta)
= f
000
[0 : 63]
and the exponent adjustment (Note, that also this exponent adjustment is valid for
zeros, because their factoring representation may contain an arbitrary exponent.)
e
1
= e
0
= e
RF
+ 2  lzii:
With lemma 2.11, the 53-representative of f
1
, rep
53
(f
1
), is computed by
rep
53
(f
1
)[0 : 54] = (f
1
[0 : 53]; OR(f
1
[54 : 63])):
Because from tiny it follows, that (e
RF
+ 2   e
min
 0), we get for case (a)
sftmask[0 : 53] = 1
54
, so that
rep
53
(f
1
)[0 : 53] = f
000
[0 : 53] AND sftmask[0 : 53]
rep
53
(f
1
)[54] = OR(f
000
[0 : 53] AND sftmask[0 : 53]; f
000
[54 : 63]);
as required.
(b) For (tiny NOR unf en), the resulting exponent after the unbounded normaliza-
tion shift is e
min
, and all we have to compute is the 53-representative rep
53
(f
00
)
of the signicand f
00
=< f
00
>
neg
= f
RF
 2
e
RF
 e
min
= f
0
 2
e
RF
 e
min
+2
. The mul-
tiplication of f
0
= < f
0
[0 : 56] >
neg
by 2
e
RF
 e
min
+2
is a left-shift of f
0
[0 : 56] by
sft den = < sft den[12 : 0] >
2
= e
RF
  e
min
+ 2 positions, where a positive shift
amount sft den > 0 corresponds to an eective left-shift and a negative shift amount
sft den < 0 corresponds to an eective right-shift by jsft denj positions. In the com-
putation of rep
53
(f
00
), we dier between 3 cases depending on the range of sft den:
i. sft den  0: Because we deal with denormalized numbers, we have 0 
sft den < lzii < 56, so that sft den can be represented with 6 bits sft den =
< sft den[5 : 0] > = sfta and
f
00
[0 : 63] = (f
0
[sfta : 56]; 0
sfta+7
) = (f
RF
[sfta  2 : 54]; 0
sfta+7
)
= cls((0; f
RF
[ 1 : 54]; 0
7
); sfta)
= f
000
[0 : 63]:
Thus, the 53-representative of f
00
= f
000
= f
1
is computed by (see lemma 2.11):
rep
53
(f
1
)[0 : 54] = (f
000
[0 : 53]; OR(f
000
[54 : 63])):
Also in this case sftmask[0 : 53] = 1
54
, so that we have
rep
53
(f
1
)[0 : 53] = f
000
[0 : 53] AND sftmask[0 : 53]
rep
53
(f
1
)[54] = OR(f
000
[0 : 53] AND sftmask[0 : 53]; f
000
[54 : 63]);
as required for case (i).
ii. 0 > sft den   53: Because sft den is negative, the computation of f
00
=
< f
00
[0 : 56+jsft denj] >
neg
requires a right-shift of (0; f
RF
[ 1 : 54]) by jsft denj
positions:
f
00
[0 : 56+jsft denj] = (0
jsft denj+1
; f
RF
[ 1 : 54]):
Because sft den is in the range [ 1 :  64], the two's complement representation
sft den[12 : 0] can be split into:
sft den = < (1111111000000) >
2
+ < sft den[5 : 0] >
=  64+ < sft den[5 : 0] >;
so that sfta =< sft den[5 : 0] >= 64   jsft denj  0. Using a 64-bit cyclic
left-shifter with the shift amount sfta =< sft den[5 : 0] > on f
0
[0 : 63] =
(0; f
RF
[ 1 : 54]; 0
7
), we get
f
000
[0 : 63] = cls((0; f
RF
[ 1 : 54]; 0
7
); sfta)
= ((f
0
[64   jsft denj : 63]; f
0
[0 : 64  jsft denj   1]))
= (f
0
[64  jsft denj : 63]; 0; f
RF
[ 1 : 62  jsft denj])
Thus, f
000
[0 : 53] could dier from f
00
[0 : 53] only in the jsft denjmost signicant
bits, that have to be cleared. The mask
sftmask[0 : 63] = hdec(sfta)[63 : 0] = (0
jsft denj
; 1
sfta
)
has exactly zeros in these positions of the signicand, so that
rep
53
(f
1
)[0 : 53] = f
00
[0 : 53] = f
000
[0 : 53] AND sftmask[0 : 53]:
The sticky bit is computed from all the remaining bit positions, that are selected
by the inverted mask sftmask[0 : 53] and signicand positions [54 : 63], so that
rep
53
(f
00
)[54] = OR(f
000
[0 : 53] AND sftmask[0 : 53]; f
000
[54 : 63]);
as required for case (ii).
iii.  53 > sft den: In this case for the computation of f
00
, the signicand f
0
[0 : 56] =
(0; f
RF
[ 1 : 54]) is right-shifted by more than 53 positions, so that f
00
[0 : 53] =
0
54
and no signicand bit of f
RF
[ 1 : 54] contributes to rep
53
(f
1
)[0 : 53]. Only
the sticky bit in the representative rep
53
(f
1
)[54] = OR(f
RF
[ 1 : 54]) is in-
uenced. If  53 > sft den   64, we have sfta = < sft den[5 : 0] > =
64   jsft denj like in case (ii), so that sfta  10 and sftmask[0 : 53] =
hdec(sfta)[63 : 10] = 0
54
. But also if sft den = e
RF
  e
min
+ 2 <  64,
we have sftmask[0 : 53] = 0
54
by denition. Thus,
rep
53
(f
1
)[0 : 53] = f
00
[0 : 53] = 0
54
= f
000
[0 : 53] AND sftmask[0 : 53]
rep
53
(f
00
)[54] = OR(f
RF
[ 1 : 54])
= OR(f
000
[0 : 63])
= OR((f
000
[0 : 53] AND sftmask[0 : 53]) OR f
000
[54 : 63]);
as required for case (iii). This completes the proof of the lemma.
2
Figure 4.2: Normalization shift implementation in the general rounding unit
The implementation of the normalization shift and the 53-representative computation
corresponding to lemma 4.1 is depicted in figure 4.2. Additionally, this figure includes the
computation of the sticky bit of the 24-representative $rep_{24}(f_1)[25]$ from $rep_{53}(f_1)[25:54]$
according to lemma 2.12. The implementation of the 'sfta, sftmask and exponent'
circuit has to be specified further. This circuit is responsible for the computation of the
shift amount sfta[5:0], the mask sftmask[0:53], the exponent $e_1$ and the incremented
exponent $ei_1 = e_1 + 1$. We consider the biased exponents $e_{1b} = \langle e_1[13:0]\rangle_2 = e_1 + bias_n$
and $ei_{1b} = \langle ei_1[13:0]\rangle_2 = ei_1 + bias_n$, so that
\[ e_1 = \langle e_1[13:0]\rangle_2 - bias_n \quad \text{and} \quad ei_1 = \langle ei_1[13:0]\rangle_2 - bias_n. \]
Moreover, the bits ovf1, ovf2a and tiny are computed, which indicate the conditions:
\[ ovf1 \Leftrightarrow (e_1 > e_{max}) \]
\[ ovf2a \Leftrightarrow (e_1 = e_{max}) \]
\[ tiny \Leftrightarrow TINY(s_{RF}, e_{RF}, f_{RF}) \Leftrightarrow (e' < e_{min}). \]
The following lemma specifies how all the outputs of the 'sfta, sftmask and exponent'
circuit can be computed from the inputs $e_{RF}[12:0]$, lzii[5:0], dbl and unf_en.
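Lemma 4.2 After the computation of the intermediate values
\[ he = \langle he[13:6]\rangle_2 = \langle (e_{RF}[12],\, e_{RF}[12:6],\, 0^6)\rangle_2 + \langle (0^4,\, dbl^3,\, 1,\, 0^6)\rangle \tag{4.4} \]
\[ hei = \langle hei[13:6]\rangle_2 = he + 2^6 \tag{4.5} \]
\[ hf = \langle hf[6:0]\rangle = \langle e_{RF}[5:0]\rangle + \langle \overline{lzii[5:0]}\rangle + 1 \tag{4.6} \]
\[ mask1 \Leftrightarrow (\overline{hei[13]} \text{ OR } unf\_en) \tag{4.7} \]
\[ mask0 \Leftrightarrow (\overline{hei[13]} \text{ NOR } ANDtree(hei[12:6])) \tag{4.8} \]
\[ hb = \langle hb[13:0]\rangle_2 = \begin{cases} \langle hei[13:6],\, hf[5:0]\rangle_2 & \text{if } hf[6] \\ \langle he[13:6],\, hf[5:0]\rangle_2 & \text{otherwise,} \end{cases} \tag{4.9} \]
the outputs of the 'sfta, sftmask and exponent' circuit can be computed by
\[ tiny \Leftrightarrow \begin{cases} hei[13] & \text{if } hf[6] \\ he[13] & \text{otherwise} \end{cases} \tag{4.10} \]
\[ sfta[5:0] = \begin{cases} lzii[5:0] & \text{if } \overline{tiny} \text{ OR } unf\_en \\ e_{RF}[5:0] & \text{otherwise} \end{cases} \tag{4.11} \]
\[ sftmask[0:53] = ((hdec(sfta)[63:10] \text{ NOR } mask1) \text{ NOR } mask0) \tag{4.12} \]
\[ e_{1b} = \langle e_1[13:0]\rangle_2 = \langle hc[13:0]\rangle_2 + 1 \tag{4.13} \]
\[ ei_{1b} = \langle ei_1[13:0]\rangle_2 = \langle hc[13:0]\rangle_2 + 2 \tag{4.14} \]
\[ ovf1 \Leftrightarrow \overline{ei_1[13]} \text{ AND } ORtree(ei_1[12:11],\; (ei_1[10:8] \text{ AND } \overline{dbl})) \tag{4.15} \]
\[ ovf2a \Leftrightarrow ANDtree(\overline{ei_1[13:11]},\; (ei_1[10:8] \text{ XNOR } dbl),\; ei_1[7:0]) \tag{4.16} \]
using the definition of
\[ hc[13:0] = hb[13:0] \text{ AND } (tiny \text{ NAND } \overline{unf\_en}). \]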
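The split of the exponent arithmetic into an 8-bit upper part and a 6-bit lower part can be checked numerically. The sketch below models equations 4.4 to 4.6, 4.9 and 4.10 on Python integers and compares hb with $e_{RF} - e_{min} + 2 - lzii$. It assumes that lzii enters equation 4.6 inverted (the overbar seems to have been lost in extraction) and that $e_{min}$ is −1022 for double and −126 for single precision; the function names are hypothetical.

```python
def sfta_exponent_signals(e_rf, lzii, dbl):
    """e_rf: signed exponent representable in 13 bits; lzii in 0..63.

    Returns (hb[13:0] as a 14-bit pattern, tiny).
    """
    e13 = e_rf & 0x1FFF                          # e_RF[12:0]
    sign = (e13 >> 12) & 1
    top = (sign << 7) | (e13 >> 6)               # (e_RF[12], e_RF[12:6]), 8 bits
    const = ((0b111 if dbl else 0) << 1) | 1     # (0^4, dbl^3, 1), 8 bits
    he = (top + const) & 0xFF                    # equation 4.4, bits he[13:6]
    hei = (he + 1) & 0xFF                        # equation 4.5: + 2^6
    hf = (e13 & 63) + (~lzii & 63) + 1           # equation 4.6, 7 bits
    hb_hi = hei if hf >> 6 else he               # equation 4.9: select on the carry hf[6]
    hb = (hb_hi << 6) | (hf & 63)
    tiny = (hb >> 13) & 1                        # equation 4.10: sign bit of hb
    return hb, tiny


def signed14(x):
    """Value of a 14-bit two's complement pattern."""
    return x - (1 << 14) if x & (1 << 13) else x
```

Computing both `he` and `hei` up front and selecting on the carry of the short 6-bit addition is what keeps the long carry chain out of the critical path in this formulation.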
Proof: First, we show some properties of the intermediate values, so that we can
then prove the correctness of the output computations using these properties. Because
 e
min
+ 1 =bias = < 0
4
;dbl
3
; 1
7
> for single and double precision, we have
sft den = < sft den[13 : 0] >
2
= e
RF
  e
min
+ 2
= < (e
RF
[12];e
RF
[12 : 0]) >
2
+ < (0
4
;dbl
3
; 1
7
) > +1
= < (e
RF
[12];e
RF
[12 : 0]) >
2
+ < (0
4
;dbl
3
; 1; 0
6
) > +2
6
= < hei[13 : 6];e
RF
[5 : 0] >
2
Based on this one can show, that the bits mask0 and mask1 implement the conditions
mask1 () (hei[13] OR unf en) () ((sft den  0) OR unf en)
mask0 () (hei[13] NOR (ANDtree(hei[12 : 6])))
() (hei[13] AND (NOT(ANDtree(hei[12 : 6]))))
() (sft den <  64):
Exactly these conditions are required to select the proper case in the computation of
sftmask[0 : 53]. The intermediate value hb is dened by
hb =

< hei[13 : 6];hf[5 : 0] >
2
if hf[6]
< he[13 : 6];hf[5 : 0] >
2
otherwise.
= < he[13 : 6]; 0
6
>
2
+ < hf[6 : 0] >
= < hei[13 : 6]; 0
6
>
2
+ < 1
8
; 0
6
>
2
+ < e
RF
[5 : 0] > + < lzii[5 : 0] > +1
= < hei[13 : 6];e
RF
[5 : 0] >
2
+ < 1
8
; lzii[5 : 0] >
2
+1
= < sft den[13 : 0] >
2
+ < 1
8
; lzii[5 : 0] >
2
+1
= sft den  lzii
= e
RF
  e
min
+ 2  lzii:
Thus, starting from the denition of tiny, we get
tiny () (e
0
(= e
RF
+ 2  lzii) < e
min
)
() (e
RF
  e
min
+ 2  lzii < 0)
() (hb < 0)
()

hei[13] if hf[6]
he[13] otherwise.
Because bin
5
0
(e
RF
  e
min
+ 2) = bin
5
0
(sft den) = e
RF
[5 : 0], the computation formula of
sfta[5 : 0] follows directly from the denition of sfta[5 : 0]. Using the conditions mask1
and mask0, the denition of sftmask[0 : 53] becomes
sftmask[0 : 53] =
8
<
:
1
54
if mask1
0
54
if mask0
hdec(sfta)[63 : 10] otherwise:
(4.17)
One can easily check that this is equivalent to
sftmask[0 : 53] = ((hdec(sfta)[63 : 10] NOR mask1) NOR mask0):
The bit string hc[13 : 0] can be written as:
hc[13 : 0] = hb[13 : 0] AND (tiny NAND unf en)
=

hb[13 : 0] if tiny OR unf en
0
14
otherwise:
Using < 0
14
>
2
= e
min
  1+ bias
n
and hb = e
RF
  e
min
+2  lzii = e
RF
+ bias
n
+1  lzii,
we get
hc = < hc[13 : 0] >
2
=

e
RF
+ 1  lzii+ bias
n
if tiny OR unf en
e
min
  1 + bias
n
otherwise
= e
1
  1 + bias
n
so that corresponding to the denition of e
1b
, we get the computation formula
e
1b
= hc+ 1 =< hc[13 : 0] >
2
+1
The computation formula of ei
1b
follows directly from the denition of the incremented
biased exponent ei
1b
= e
1b
+ 1. Because e
1b
= e
1
+ bias
n
and e
max
+ bias
n
= 2
n
  2 =
< 0
3
;dbl
3
; 1
7
; 0 >
2
for single and double precision, the condition (e
1
 e
max
) can be
written as
(e
1
 e
max
) () (e
1b
 2
n
  2)
() (ei
1b
 < 0
3
;dbl
3
; 1
8
>
2
):
Figure 4.3: 'sfta, sftmask and exponent' circuit in the general rounding unit.
The condition ovf1 is the 'greater than' case of the above condition and the condition
ovf2a is the equality case, so that
\[ \begin{aligned} ovf1 &\Leftrightarrow (\langle ei_1[13:0]\rangle_2 > \langle 0^3, dbl^3, 1^8\rangle_2) \\ &\Leftrightarrow \overline{ei_1[13]} \text{ AND } ORtree(ei_1[12:11],\; (ei_1[10:8] \text{ AND } \overline{dbl})) \\ ovf2a &\Leftrightarrow (ei_1[13:0] = (0^3, dbl^3, 1^8)) \\ &\Leftrightarrow ANDtree(\overline{ei_1[13:11]},\; (ei_1[10:8] \text{ XNOR } dbl),\; ei_1[7:0]) \end{aligned} \]
as required by the lemma. □
In this way the 'sfta, sftmask and exponent' circuit can be implemented as depicted in
figure 4.3. In this figure the mask0,1 and the ovf1,2a signals are implemented according
to equations 4.8, 4.7 and 4.15, 4.16, respectively. This completes the description of the
implementation of the normalization and representative computations.
Significand rounding (2.) In this paragraph, we consider the significand rounding:
\[ (s_2,\, e_2 + wec,\, f_2) = sig\_rnd_{mode?s_1}(s_1,\, e_1 + wec,\, rep_{53}(f_1)). \]
Because only the significand is affected by this operation, we have $s_2 = s_1$ and
$e_2 + wec = e_1 + wec = e_{1b} - bias_n + wec$. Therefore, we only focus on the significand in the following.
Depending on the bit dbl, we compute the significand rounding at significand position
(p − 1) on the p-representative $rep_p(f_1)$ in single (p = 24) or double (p = 53) precision.
From the representative computation we get the 53-representative $rep_{53}(f_1)[-1:54]$ and
the bit $rep_{24}(f_1)[25]$, so that we also have the 24-representative by
\[ rep_{24}(f_1)[-1:25] = (rep_{53}(f_1)[-1:24],\; rep_{24}(f_1)[25]). \]
The significand rounding on p-representatives is already described in section 2.5.1.
Following this description, the rounding of the p-representative $rep_p(f_1)$ results either in the
truncated significand $ftr = \langle ftr[-1:52]\rangle_{neg} = \langle rep_{53}(f_1)[-1:(p-1)]\rangle_{neg}$ or the
incremented significand $finc = \langle ftri[-1:52]\rangle_{neg} = ftr + 2^{-p+1}$ (see definition 2.14).
For both single and double precision these significands can be computed by
\[ ftr[-1:52] = (rep_{53}(f_1)[-1:23],\; rep_{53}(f_1)[24:52] \text{ AND } dbl) \]
\[ \langle ftri[-1:52]\rangle_{neg} = \langle (rep_{53}(f_1)[-1:23],\; rep_{53}(f_1)[24:52] \text{ OR } \overline{dbl})\rangle_{neg} + 2^{-52}. \]
Moreover, the three least significant bits of the p-representative have to be selected:
\[ (l,\, r,\, sticky) = rep_p(f_1)[(p-1):(p+1)] = \begin{cases} rep_{53}(f_1)[52:54] & \text{if } dbl \\ (rep_{53}(f_1)[23:24],\; rep_{24}(f_1)[25]) & \text{otherwise} \end{cases} \]
and the IEEE rounding mode $mode \in \{RZ, RNE, RI, RMI\}$, encoded by rnd_mode[1:0],
has to be reduced for use on the positive significand to $(mode?s_1) \in \{RZ, RNE, RI\}$,
encoded by sr_mode[1:0] (according to equations 2.6-2.7 and table 2.3):
sr mode[1] = rnd mode[1] ^ (rnd mode[0]
s) (4.18)
sr mode[0] = rnd mode[1] ^ rnd mode[0]; (4.19)
to implement the rounding increment decision (equation 2.54):
\[ rinc = sr\_mode[1] \wedge (r \vee sticky) \;\vee\; sr\_mode[0] \wedge r \wedge (l \vee sticky). \tag{4.20} \]
Based on rinc, the significand can be rounded according to equation 2.55:
\[ f_2[-1:52] = \begin{cases} ftri[-1:52] & \text{if } rinc \\ ftr[-1:52] & \text{otherwise.} \end{cases} \]
This results in an implementation of significand rounding as depicted in region (2.) of
figure 4.4. In this region, the rounding decision circuit contains the implementation of
equations 4.18, 4.19 and 4.20. Moreover, a conditional sum incrementer is used for the
computation of the incremented significand.
Because significand rounding could change the value of the factoring, in the
rounding decision circuit we also compute the condition inx1, which recognizes the significand
rounding inexactness condition according to lemma 2.17:
\[ inx1 \Leftrightarrow (f_2 \ne f_1) \Leftrightarrow (f_2 \ne rep_p(f_1)) \Leftrightarrow (r \text{ OR } sticky). \tag{4.21} \]
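The rounding decision and the dual-mode truncate/increment selection can be sketched on integers as follows. The sketch takes the already reduced mode as two flags (`sr_ri` for sr_mode[1], `sr_rne` for sr_mode[0]); the encoding of rnd_mode itself is not modeled because it depends on table 2.3. The bit layout (index −1 is the most significant bit of a 56-bit integer) and all names are assumptions of this illustration.

```python
def round_significand(rep, sticky24, dbl, sr_ri, sr_rne):
    """rep: 56-bit integer for rep_53(f_1)[-1:54]; sticky24: the bit rep_24(f_1)[25].

    Returns (f_2[-1:52] as a 54-bit integer, inx1).
    """
    body = rep >> 2                          # bits [-1:52]
    low = (1 << 29) - 1                      # positions [24:52] inside body
    if dbl:
        l, r, st = (rep >> 2) & 1, (rep >> 1) & 1, rep & 1
        ftr, ftri = body, body + 1
    else:
        l, r, st = (rep >> 31) & 1, (rep >> 30) & 1, sticky24
        ftr = body & ~low                    # clear [24:52] (AND dbl with dbl = 0)
        ftri = (body | low) + 1              # set [24:52], then add 2^-52
    rinc = (sr_ri and (r or st)) or (sr_rne and r and (l or st))   # equation 4.20
    inx1 = bool(r or st)                                           # equation 4.21
    return (ftri if rinc else ftr), inx1
```

Setting the unused low bits to ones before adding $2^{-52}$ lets a single incrementer serve both precisions, since the carry ripples up to position 23 in the single precision case.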
Post-normalization (3.) The post-normalization shift is the implementation of
\[ \begin{aligned} (s_3,\, e_3 + wec,\, f_3) &= post\_norm(s_2,\, e_2 + wec,\, f_2) \\ &= post\_norm(s_2,\, e_{1b} - bias_n + wec,\, f_2) \\ &= \begin{cases} (s_2,\; ei_{1b} - bias_n + wec,\; 1) & \text{if } f_2 = 2 \\ (s_2,\; e_{1b} - bias_n + wec,\; f_2) & \text{otherwise.} \end{cases} \end{aligned} \]
Because $f_2$ cannot become larger than 2, the condition ($f_2 = 2$) is recognized by bit
$f_2[-1]$, so that the post-normalization shift of the significand can be implemented by a
simple OR-gate
\[ f_3[0:52] = (f_2[-1] \text{ OR } f_2[0],\; f_2[1:52]). \]
We do not compute $e_3$, but the biased exponent $e_{3b} = \langle e_3[13:0]\rangle_2 = e_3 + bias_n$, which
can be selected from the previously computed $e_{1b}$ and $ei_{1b}$:
\[ e_{3b} = \begin{cases} ei_{1b} & \text{if } f_2[-1] \\ e_{1b} & \text{otherwise.} \end{cases} \tag{4.22} \]
The case $(f_2 = 2) \Leftrightarrow f_2[-1]$ is called significand overflow and is signaled by the bit sigovf.
This results in an implementation of the post-normalization shift as depicted in figure
4.4, where region (3.a) includes the post-normalization of the significand and region (3.b)
includes the exponent selection according to equation 4.22.
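As a minimal sketch, the OR-gate and the exponent selection of equation 4.22 look as follows on integers. The encoding (a 54-bit integer for $f_2[-1:52]$ with index −1 as the most significant bit) and the function name are assumptions of this illustration.

```python
def post_normalize(f2, e1b):
    """f2: 54-bit integer for f_2[-1:52]; e1b: biased exponent e_1b.

    Returns (f_3[0:52], e_3b, sigovf).
    """
    sigovf = (f2 >> 53) & 1                          # f_2[-1]: significand overflow
    msb = sigovf | ((f2 >> 52) & 1)                  # f_2[-1] OR f_2[0]
    f3 = (msb << 52) | (f2 & ((1 << 52) - 1))        # (msb, f_2[1:52])
    e3b = e1b + 1 if sigovf else e1b                 # select ei_1b or e_1b
    return f3, e3b, sigovf
```

No real shift is needed because a significand overflow only occurs when all fraction bits are zero after the increment, so the result significand is exactly 1.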
Exponent rounding (4.) and packing. In this paragraph we first describe how the
exception conditions ovf, inx and unf can be recognized and how the wrapping exponent
correction is added to the exponent. We then describe the exponent rounding, followed by
the packing conversion of the rounded result to the packed representation that is required
as output of the general rounding unit.
Lemma 4.3 With the bit
\[ finop = (\overline{zero} \text{ AND } \overline{inf} \text{ AND } \overline{qnan} \text{ AND } \overline{snan}), \tag{4.23} \]
(a) the overflow exception condition ovf, (b) the inexact exception condition inx and (c)
the underflow exception condition unf can be computed by
\[ ovf \Leftrightarrow (ovf1 \text{ OR } (ovf2a \text{ AND } sigovf)) \text{ AND } finop \tag{4.24} \]
\[ inx \Leftrightarrow (inx1 \text{ OR } ovf) \tag{4.25} \]
\[ unf \Leftrightarrow (tiny \text{ AND } (inx \text{ OR } unf\_en)). \tag{4.26} \]
Proof: (a) An overflow occurs iff (i) the magnitude of the unbounded rounded result is larger than $x_{max}$ and (ii) the rounding input is the representation of a non-zero finite number. A number representation in the representative format is non-zero and finite iff it does not represent a special value, i.e. iff zero = inf = qnan = snan = 0, so that the condition $finop = (\overline{zero} \text{ AND } \overline{inf} \text{ AND } \overline{qnan} \text{ AND } \overline{snan})$ is equivalent to part (ii) of the overflow condition. Part (i) of the overflow condition can be written as
\[ |val(s_3, e_3, f_3)| > x_{max} = (2 - 2^{-p+1}) \cdot 2^{e_{max}}. \]
We first assume that no tininess occurs. Thus, the significand $f_3 = \langle f_3[0:p-1] \rangle_{neg}$ is normalized and we have $1 \le f_3 \le (2 - 2^{-p+1})$, so that
\begin{align}
ovf &\iff (|val(s_3, e_3, f_3)| > x_{max}) \text{ AND } finop \tag{4.27} \\
&\iff (e_3 > e_{max}) \text{ AND } finop. \tag{4.28}
\end{align}
For $e_{1b} = e_1 + bias_n$ and $ei_{1b} = ei_1 + bias_n$ we can extract the following formula for $e_3$ from equation 4.22:
\[ e_3 = e_{3b} - bias_n = \begin{cases} ei_1 & \text{if } (f_2[-1] = 1) \\ e_1 & \text{otherwise.} \end{cases} \]
Because $ei_1 = e_1 + 1$, from $(e_1 > e_{max})$ it follows that also $(ei_1 > e_{max})$. Therefore,
\[ (e_3 > e_{max}) \iff (e_1 > e_{max}) \text{ OR } ((ei_1 > e_{max}) \text{ AND } (f_2[-1] = 1)). \tag{4.29} \]
The substitution of the definitions $ovf1 \iff (e_1 > e_{max})$, $ovf2a \iff (ei_1 > e_{max})$ and $sigovf \iff (f_2[-1] = 1)$ in equation 4.29, in combination with equation 4.28, then yields part (a) of the lemma for non-tiny values. If tininess occurs, then $|val(s_3, e_3, f_3)| < 2^{e_{min}}$ and $e_3 = e_{min}$, so that ovf = 0, ovf1 = 0 and ovf2a = 0 by the overflow definitions. Therefore, the overflow formula also holds for tiny numbers, and the proof of part (a) of the lemma is completed.
(b) An inexact exception occurs iff the rounded result differs from the exact result. This can be caused by two parts of the rounding procedure: the significand rounding and the exponent rounding. (The normalization shifts do not change the value of the factorings.) The inexactness caused by significand rounding is already recognized by the condition inx1. The exponent rounding (including exponent wrapping) changes the value of the operand iff the unbounded rounded result would be larger than $x_{max}$. This is exactly the ovf condition, so that $inx \iff (inx1 \text{ OR } ovf)$, as required for part (b) of the lemma.
(c) For our choice of the loss-of-accuracy definition, the underflow exception is defined by
\[ unf \iff \begin{cases} tiny & \text{if } (unf\_en = 1) \\ tiny \text{ AND } inx & \text{otherwise.} \end{cases} \]
Obviously, this is equivalent to part (c) of the lemma. □
Now, we consider the exponent wrapping. Because for unf_en = 1 we have $unf \iff tiny$, the trapped underflow condition
\[ tunf \iff (tiny \text{ AND } unf\_en) \tag{4.30} \]
signals the case that a trapped underflow occurs. Moreover, we define the trapped overflow condition tovf, which indicates the occurrence of a trapped overflow:
\[ tovf \iff (ovf \text{ AND } ovf\_en). \tag{4.31} \]
Based on these definitions, the exponent wrapping on $e_{3b}$ can be described by:
\[ ew_3 = e_{3b} + wec = \begin{cases} \langle e_3[13:0] \rangle_2 - \alpha & \text{if } tovf \\ \langle e_3[13:0] \rangle_2 + \alpha & \text{if } tunf \\ \langle e_3[13:0] \rangle_2 & \text{otherwise.} \end{cases} \tag{4.32} \]
Because the signal tunf is valid earlier than tovf, we define a predicted wrapping exponent correction pwec based on tunf. We compute it by a selection, using that for single and double precision
\[ +\alpha = \langle +alpha[13:6] \rangle_2 = 3 \cdot 2^{n-2} = \langle (0^3, dbl^2, 0, \overline{dbl}^2, 0^6) \rangle_2 \]
and
\[ -\alpha = \langle -alpha[13:6] \rangle_2 = -3 \cdot 2^{n-2} = \langle (1^3, \overline{dbl}, 1, \overline{dbl}, 0, \overline{dbl}, 0^6) \rangle_2: \]
\[ pwec = \langle pwec[13:6] \rangle_2 = \begin{cases} +\alpha = \langle +alpha[13:6] \rangle_2 & \text{if } tunf \\ -\alpha = \langle -alpha[13:6] \rangle_2 & \text{otherwise.} \end{cases} \]
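The two bit patterns for the wrapping correction can be checked numerically. The following Python sketch is only an illustration: the function name alpha_bits is hypothetical, and it assumes that the second dbl group of +alpha and all three dbl bits of -alpha are complemented, which is the only reading that yields 3 * 2^(n-2) for n = 8 (single) and n = 11 (double).

```python
def alpha_bits(dbl: int) -> tuple[int, int]:
    """Return (+alpha, -alpha) as 14-bit two's complement integers built from dbl."""
    nd = 1 - dbl  # complement of dbl (assumed overline in the formulas)
    # +alpha[13:0] = (0^3, dbl^2, 0, not(dbl)^2, 0^6), listed MSB first
    plus = [0, 0, 0, dbl, dbl, 0, nd, nd] + [0] * 6
    # -alpha[13:0] = (1^3, not(dbl), 1, not(dbl), 0, not(dbl), 0^6), listed MSB first
    minus = [1, 1, 1, nd, 1, nd, 0, nd] + [0] * 6

    def to_int(bits):
        return int("".join(str(b) for b in bits), 2)

    return to_int(plus), to_int(minus)


for dbl, n in ((0, 8), (1, 11)):
    alpha = 3 * 2 ** (n - 2)                 # 192 for single, 1536 for double
    plus, minus = alpha_bits(dbl)
    assert plus == alpha
    assert minus == (-alpha) % 2 ** 14       # two's complement of -alpha in 14 bits
```

Only the bits at positions [13:6] depend on dbl, which is why the correction is held as alpha[13:6] and the low six bits are constant zeros.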
Lemma 4.4 After the computation of a predicted wrapped exponent $pew_3$ by the addition of the predicted wrapped exponent correction,
\[ pew_3 = \langle pew_3[13:0] \rangle_2 = e_{3b} + pwec \tag{4.33} \]
and the definition of the wrapping exponent condition ewrap:
\[ ewrap = (tunf \text{ OR } tovf) \tag{4.34} \]
the exponent wrapping on $e_{3b}$ can be computed by the selection:
\[ ew_3 = e_{3b} + wec = \begin{cases} pew_3 & \text{if } ewrap \\ e_{3b} & \text{otherwise.} \end{cases} \tag{4.35} \]
Proof: For tovf = 1, the predicted wrapped exponent correction is $pwec = -\alpha$, so that $ew_3 = e_{3b} - \alpha$ in equation 4.32 and in equation 4.35. In the same way, the identity of these two equations can be shown for the remaining two cases: tunf = 1 and ewrap = 0. □
The exponent rounding is influenced by the reduced rounding mode (mode ? s) that is encoded by sr_mode[1:0] according to table 2.3. We already get sr_mode[1:0] from the significand rounding circuit, so that we can compute the condition
\begin{align}
rndup &\iff sr\_mode[1] \text{ OR } sr\_mode[0] \tag{4.36} \\
&\iff (mode\ ?\ s \in \{RNE, RI\}) \tag{4.37} \\
\overline{rndup} &\iff (mode\ ?\ s = RZ) \tag{4.38}
\end{align}
Moreover, we define the untrapped overflow condition uovf, which indicates the occurrence of an untrapped overflow:
\[ uovf \iff (ovf \text{ AND } \overline{ovf\_en}). \tag{4.39} \]
Because $|val(s_3, e_3 + wec, f_3)| > x_{max}$ iff an untrapped overflow occurs (uovf = 1), the exponent rounding can be described by:
\begin{align}
(s_{PF}, e_{PF}, f_{PF}) &= exp\_rnd_{mode\ ?\ s_3}(s_3, e_3 + wec, f_3) \tag{4.40} \\
&= \begin{cases} (s_3, e_\infty, f_\infty) & \text{if } uovf \text{ AND } rndup \\ (s_3, e_{max}, f_{max}) & \text{if } uovf \text{ AND } \overline{rndup} \\ (s_3, ew_3 - bias_n, f_3) & \text{otherwise.} \end{cases} \tag{4.41}
\end{align}
This selection of the exponent and the significand for the exponent rounding is computed in combination with the conversion step from the IEEE factoring $(s_{PF}, e_{PF}, f_{PF})$ to the corresponding packed representation, which consists of $(s_{PF}, e_{PF}[n-1:0], f_{PF}[1:p-1], 0^{64-n-p})$.
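Lemma 4.5 With the definition of the conditions
\begin{align}
rinf &= (uovf \text{ AND } rndup) \tag{4.42} \\
rmax &= (uovf \text{ AND } \overline{rndup}) \tag{4.43}
\end{align}
we can compute the bits of the packed representation of the rounded result by
\begin{align}
s_{PF} &= s_3 \tag{4.44} \\
e_{PF}[10:1] &= ((ew_3[10:1] \text{ NOR } uovf) \text{ NOR } \overline{f_3[0]}) \tag{4.45} \\
e_{PF}[0] &= ((ew_3[0] \text{ NOR } rinf) \text{ NOR } (rmax \text{ OR } \overline{f_3[0]})) \tag{4.46} \\
f_{PF}[1:52] &= ((f_3[1:52] \text{ NOR } rmax) \text{ NOR } rinf) \tag{4.47}
\end{align}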
Proof: The conversion from the unpacked representation to the packed representation can be computed according to section 2.6.2.2. Thus, the sign and the significand are unchanged; only the hidden bit $f_3[0]$ is removed from the representation of the significand. With the use of $f_{max}[1:p-1] = 1^{p-1}$ and $f_\infty[1:p-1] = 0^{p-1}$, we get
\begin{align}
s_{PF} &= s_{UF} \tag{4.48} \\
f_{PF}[1:p-1] &= f_{UF}[1:p-1] \tag{4.49} \\
&= \begin{cases} f_\infty[1:p-1] = 0^{p-1} & \text{if } rinf \\ f_{max}[1:p-1] = 1^{p-1} & \text{if } rmax \\ f_3[1:p-1] & \text{otherwise.} \end{cases} \tag{4.50} \\
&= ((f_3[1:p-1] \text{ NOR } rmax) \text{ NOR } rinf) \tag{4.51}
\end{align}
Because in the packing for single precision only the significand bits $f_{PF}[1:23]$ are regarded and the bits $f_{PF}[24:52]$ are ignored, we can compute the packed significand representation for both precisions by equation 4.51 with p = 53, as stated by the lemma.
For the conversion of the exponent, we have to consider the n-bit biased representation of $e_{UF}$ and to integrate the redundant exponent representation for $e_{min}$, where $e_{min}[n-1:0] = 0^n$ for denormalized numbers and zeros according to equation 2.84. Because after the conditional exponent wrapping and the exponent rounding the exponent is representable in this n-bit packed format in all cases, it is sufficient for both single and double precision to regard only the exponent bits at positions [10:0]. For single precision the exponent bits at positions [10:8] are ignored later.
The biased representations of $e_{max}$ and $e_\infty$ are given by $(1^{n-1}, 0)$ and $1^n$ respectively. Since $(rinf = 1) \Rightarrow (f_3[0] = 1)$ and $(rmax = 1) \Rightarrow (f_3[0] = 1)$, we can compute the packed exponent representation by
\begin{align}
e_{PF}[10:0] &= \begin{cases} bin^{10}_{0}(e_{PF} + bias_n) & \text{if } f_3[0] \\ 0^{11} & \text{otherwise.} \end{cases} \tag{4.52} \\
&= \begin{cases} 1^n & \text{if } rinf \\ (1^{n-1}, 0) & \text{if } rmax \\ 0^{11} & \text{if } \overline{f_3[0]} \\ ew_3[10:0] & \text{otherwise.} \end{cases} \tag{4.53}
\end{align}
If we separate equation 4.53 into an equation regarding the exponent positions [10:1] and an equation regarding exponent position [0], we can simplify these equations to
\begin{align*}
e_{PF}[10:1] &= ((ew_3[10:1] \text{ NOR } uovf) \text{ NOR } \overline{f_3[0]}) \\
e_{PF}[0] &= ((ew_3[0] \text{ NOR } rinf) \text{ NOR } (rmax \text{ OR } \overline{f_3[0]}))
\end{align*}
This completes the proof of the lemma. □
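The NOR networks of lemma 4.5 can be compared exhaustively against the case tables 4.50 and 4.53 for a single bit slice. The Python sketch below is an illustration only: the function names are hypothetical, it assumes that $f_3[0]$ enters the exponent equations in complemented form, and it uses uovf = rinf OR rmax, which follows from equations 4.42 and 4.43.

```python
from itertools import product


def nor(a: int, b: int) -> int:
    return 1 - (a | b)


def packed_bits_spec(ew_hi, ew0, f, f0, rinf, rmax):
    """Case tables 4.50 and 4.53 for one bit of e[10:1], for e[0], and one fraction bit."""
    if rinf:
        return 1, 1, 0          # exponent 1^n, fraction 0^(p-1)
    if rmax:
        return 1, 0, 1          # exponent (1^(n-1), 0), fraction 1^(p-1)
    if not f0:
        return 0, 0, f          # denormal or zero: exponent field 0^11
    return ew_hi, ew0, f


def packed_bits_nor(ew_hi, ew0, f, f0, rinf, rmax):
    """Equations 4.45-4.47, with f3[0] complemented (assumed reading)."""
    uovf = rinf | rmax
    nf0 = 1 - f0
    e_hi = nor(nor(ew_hi, uovf), nf0)
    e0 = nor(nor(ew0, rinf), rmax | nf0)
    fr = nor(nor(f, rmax), rinf)
    return e_hi, e0, fr


def check_lemma_4_5() -> int:
    """Compare both formulations on all admissible inputs; return the number of cases."""
    checked = 0
    for bits in product((0, 1), repeat=6):
        ew_hi, ew0, f, f0, rinf, rmax = bits
        if rinf and rmax:
            continue            # rinf and rmax are mutually exclusive
        if (rinf or rmax) and not f0:
            continue            # rinf or rmax implies f3[0] = 1
        assert packed_bits_spec(*bits) == packed_bits_nor(*bits)
        checked += 1
    return checked


check_lemma_4_5()
```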
Figure 4.4: Significand rounding (2.), post-normalization (3.a, 3.b), exponent wrapping (4.a) and exponent rounding (4.b) implementation.
Figure 4.4 depicts the implementation of the exponent wrapping (region 4.a), the exponent rounding and the computation of the exceptions (region 4.b) corresponding to the descriptions of this paragraph. In region (4.a) of figure 4.4 a 5-bit conditional sum adder is used for the implementation.
Because only the bit positions [10:0] of the unpacked exponent representation are required, it suffices to compute all additions for the exponent computation modulo $2^{11}$. This is already considered in figures 4.4 and 4.5, where we only consider $ei_1[10:0]$, $e_1[10:0]$, $e_3[10:0]$, $pew_3[10:0]$, $pwec[10:0]$, $+\alpha[10:6]$ and $-\alpha[10:6]$ instead of $ei_1[13:0]$, $e_1[13:0]$, $e_3[13:0]$, $pew_3[13:0]$, $pwec[13:0]$, $+\alpha[13:6]$ and $-\alpha[13:6]$.
The exceptions circuit in region (4.b) of figure 4.4 implements the exception conditions ovf, inx and unf according to equations 4.23-4.26. Additionally, the bits rinf, rmax and ewrap are computed in this circuit. In the previous descriptions we used many intermediate conditions from which these bits could be derived. Based on the input bits of the exceptions circuit, the computations can be summarized by the following three equations:
\begin{align}
ewrap &= (ovf \text{ AND } ovf\_en) \text{ OR } (tiny \text{ AND } unf\_en) \tag{4.54} \\
rmax &= (ovf \text{ AND } \overline{ovf\_en} \text{ AND } (sr\_mode[1] \text{ NOR } sr\_mode[0])) \tag{4.55} \\
rinf &= (ovf \text{ AND } \overline{ovf\_en} \text{ AND } (sr\_mode[1] \text{ OR } sr\_mode[0])) \tag{4.56}
\end{align}
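A behavioural model can make the exceptions circuit easier to follow. The Python sketch below evaluates equations 4.23-4.26 and 4.54-4.56 on single-bit inputs. The function name and the dictionary output are hypothetical conveniences, and the sketch assumes that finop is the conjunction of the complemented special-value bits and that ovf_en is complemented in rmax and rinf (untrapped overflow).

```python
def exceptions(zero, inf, qnan, snan, ovf1, ovf2a, sigovf,
               inx1, tiny, ovf_en, unf_en, sr_mode):
    """Evaluate the exception and selection bits; sr_mode is a 2-bit integer."""
    finop = not (zero or inf or qnan or snan)             # (4.23), assumed complemented inputs
    ovf = bool((ovf1 or (ovf2a and sigovf)) and finop)    # (4.24)
    inx = bool(inx1 or ovf)                               # (4.25)
    unf = bool(tiny and (inx or unf_en))                  # (4.26)
    rndup = (sr_mode & 0b11) != 0                         # (4.36)
    ewrap = bool((ovf and ovf_en) or (tiny and unf_en))   # (4.54)
    rmax = bool(ovf and not ovf_en and not rndup)         # (4.55)
    rinf = bool(ovf and not ovf_en and rndup)             # (4.56)
    return {"ovf": ovf, "inx": inx, "unf": unf,
            "ewrap": ewrap, "rmax": rmax, "rinf": rinf}


# Untrapped overflow of a finite operand with reduced mode RZ (sr_mode = 00):
# the result saturates to the largest finite number (rmax) and is inexact.
flags = exceptions(zero=0, inf=0, qnan=0, snan=0, ovf1=1, ovf2a=0, sigovf=0,
                   inx1=0, tiny=0, ovf_en=0, unf_en=0, sr_mode=0b00)
assert flags["ovf"] and flags["inx"] and flags["rmax"] and not flags["rinf"]
```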
Finally, we pack the result representation into the packed format $BUS_{PF}[63:0]$ for single and double precision results and get, according to equations 2.75-2.80, the following selection that can be computed by a row of muxes:
\[ BUS_{PF}[63:0] = \begin{cases} (s_{PF}, e_{PF}[10:0], f_{PF}[1:52]) & \text{if } dbl \\ (s_{PF}, e_{PF}[7:0], f_{PF}[1:23], 0^{32}) & \text{otherwise.} \end{cases} \tag{4.57} \]
This completes the description of the general rounding unit, which has a structure as depicted in figure 4.5.
Figure 4.5: Structure of the general rounding unit.
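The packing selection of equation 4.57 can be sketched in a few lines of Python. The function pack_pf is hypothetical; it takes the fraction $f_{PF}[1:52]$ as a 52-bit integer with $f_{PF}[1]$ as most significant bit. The comparison against Python's struct encodings assumes that the packed format coincides with the IEEE 754 interchange encodings, with single precision results placed in the upper half of the bus.

```python
import struct


def pack_pf(s: int, e_pf: int, f_pf: int, dbl: bool) -> int:
    """Assemble BUS_PF[63:0] following equation 4.57."""
    if dbl:
        return (s << 63) | ((e_pf & 0x7FF) << 52) | (f_pf & ((1 << 52) - 1))
    frac23 = (f_pf >> 29) & ((1 << 23) - 1)    # f_PF[1:23]: top 23 of the 52 fraction bits
    word = (s << 31) | ((e_pf & 0xFF) << 23) | frac23
    return word << 32                           # 0^32 padding in the lower half


# 1.0 in double precision: biased exponent 1023, fraction 0.
assert pack_pf(0, 1023, 0, True) == struct.unpack(">Q", struct.pack(">d", 1.0))[0]

# -1.5 in single precision: biased exponent 127, fraction bit f_PF[1] set.
single = struct.unpack(">I", struct.pack(">f", -1.5))[0]
assert pack_pf(1, 127, 1 << 51, False) == single << 32
```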
4.1.3 Gradual Rounding II (gradual result → packed format)
This section describes a dual mode gradual rounding unit that is able to round and to compress a FP number from the gradual result format $BUS_{GF}[72:0]$ (section 2.6.5) to the single precision or the double precision packed FP representation $BUS_{PF}[63:0]$ (section 2.6.1). As for the general rounding, the IEEE rounding options are given by the precision of the destination dbl, the rounding mode rmode[1:0] and the trap handlers unf_en and ovf_en. The gradual rounding unit outputs the packed representation $BUS_{PF}[63:0]$ corresponding to the value of the IEEE rounded result and the exception flags ovf, inx and unf corresponding to the occurrence of an overflow, inexact or underflow exception. In this case the IEEE rounding is computed on the GF factoring:
\[ ((s_{GF}, e_{GF}, f_{GF}), tinc, tinx) = fact_{GF}(BUS_{GF}[73:0]), \]
so that according to corollary 2.21 the rounded result can be specified by
\[ BUS_{PF}[63:0] = pf(ground2((s_{GF}, e_{GF} + wec, f_{GF}), tinc, tinx)). \]
The only difference between this specification with the rounding function ground2 and the specification of the general rounding unit from the previous section with the rounding function iround is the significand rounding according to lemma 2.7 and 2.20. For the significand rounding in the gradual rounding unit, the computation of the rounding decision rinc from the previous section has to be substituted by the gradual rounding decision grinc according to equation 2.60. Moreover, the rounding inexactness could also be caused in the previous gradual rounding step, so that the rounding inexact signal inx1 has to be substituted by $tinx_2$ according to equation 2.61.
All remaining computations are independent of the tag bits. From this point of view, the GF factorings (without tag bits) are a subset of the RF factorings, and with the definition of $f_{GF}[-1] = f_{GF}[53] = f_{GF}[54] = 0$ we can interpret a GF factoring input as a normalized RF factoring. Thus, the remaining part of the general rounding implementation from the previous section could also be used identically for the implementation of the gradual rounding unit.
Nevertheless, we will consider some additional changes to optimize the gradual rounding implementation. These optimizations are based on the property that the significands of all non-zero numbers are already normalized in the gradual result format. The optimizations do not involve the whole implementation. The implementation of the post-normalization (see figure 4.5 (3.a)), the exponent rounding + exceptions (see figure 4.5 (4.b)) and the packing are used from the previous section as depicted in figure 4.6. Therefore, only the denormalization, gradual rounding and exponent wrapping circuit in this figure will be further specified. The computations in this circuit combine the computations of the normalization shift (see figure 4.5 circuit (1.)), the significand rounding (see figure 4.5 circuit (2.)), the exponent part of the post-normalization shift (see figure 4.5 circuit (3.b)) and the exponent wrapping circuit (see figure 4.5 circuit (4.a)) from the previous section.
For the considerations about the normalization shift distance we can assume non-zero operands as in the previous section. Thus, for the gradual result format we can use lzii = 2, so that in lemma 4.1 and 4.2 the computations of $f'''$, sftmask and hf can be optimized in the following way:
Figure 4.6: Structure of the gradual round unit.
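Lemma 4.6 In the gradual rounding unit, $f'''$, sftmask and hf can be computed by
\begin{align*}
f'''[0:63] &= \begin{cases} (f_{GF}[0:52], 0^{11}) & \text{if } \overline{tiny} \text{ OR } unf\_en \\ cls((0^2, f_{GF}[0:52], 0^9), \langle e_{GF}[5:0] \rangle) & \text{otherwise} \end{cases} \\
sftmask[0:53] &= (hdec(\langle e_{GF}[5:0] \rangle)[63:10] \text{ NOR } mask1) \text{ NOR } mask0 \\
\langle hf[6:0] \rangle &= \langle e_{GF}[5:0] \rangle + \langle (111110) \rangle.
\end{align*}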
Proof: Because lzii = 2 and $e_{GF}[5:0] = bin^{5}_{0}(e_{GF} - e_{min} + 2)$, we get
\[ sfta = \begin{cases} 2 & \text{if } \overline{tiny} \text{ OR } unf\_en \\ \langle e_{GF}[5:0] \rangle & \text{otherwise,} \end{cases} \tag{4.58} \]
so that with $cls((0^2, f_{GF}[0:52], 0^9), 2) = (f_{GF}[0:52], 0^{11})$, we have as required
\[ f'''[0:63] = \begin{cases} (f_{GF}[0:52], 0^{11}) & \text{if } \overline{tiny} \text{ OR } unf\_en \\ cls((0^2, f_{GF}[0:52], 0^9), \langle e_{GF}[5:0] \rangle) & \text{otherwise.} \end{cases} \]
The formula for sftmask in this lemma can be written as
\[ sftmask[0:53] = \begin{cases} 1^{54} & \text{if } mask1 \\ 0^{54} & \text{if } mask0 \\ hdec(\langle e_{GF}[5:0] \rangle)[63:10] & \text{otherwise} \end{cases} \]
and differs from equation 4.17 by the substitution of sfta with $\langle e_{GF}[5:0] \rangle$. According to equation 4.58 the value sfta can only differ from $\langle e_{GF}[5:0] \rangle$ if $(\overline{tiny} \text{ OR } unf\_en)$. But in the case $(\overline{tiny} \text{ OR } unf\_en) \iff ((e_{GF} - e_{min} \ge 0) \text{ OR } unf\_en)$, we also have $mask1 \iff ((e_{GF} - e_{min} + 2 \ge 0) \text{ OR } unf\_en)$ and $sftmask[0:53] = 1^{54}$. Thus, $\langle e_{GF}[5:0] \rangle$ is equal to sfta whenever these values are involved in the computations. This completes the proof of the equation for sftmask. The formula for $\langle hf[6:0] \rangle$ follows directly from lemma 4.2 and lzii[5:0] = (000010). □
Figure 4.7: Implementation of the 'denormalization, gradual round and exponent wrapping' circuit in the gradual rounding unit.
Lemma 4.7 With the incremented predicted wrapping exponent correction
\[ ipwec = pwec + 1 = \langle (pwec[13:6], 000001) \rangle_2 \]
the predicted exponents $pew_1 = e_{1b} + pwec$ and $pewi_1 = ei_{1b} + pwec$ can be computed by
\begin{align*}
pew_1 &= hc + ipwec \\
pewi_1 &= hc + ipwec + 1
\end{align*}
using a compound adder. Based on them, the predicted wrapped exponent $pew_3$ can be computed by the selection
\[ pew_3 = \begin{cases} pewi_1 & \text{if } sigovf \\ pew_1 & \text{otherwise.} \end{cases} \]
Proof: The equations for $pew_1$ and $pewi_1$ follow directly from $hc + 1 = e_{1b}$ in lemma 4.2 together with $ei_{1b} = e_{1b} + 1$. Starting from equation 4.33 we get
\[ pew_3 = e_{3b} + pwec = \begin{cases} ei_{1b} + pwec & \text{if } sigovf \\ e_{1b} + pwec & \text{otherwise} \end{cases} = \begin{cases} pewi_1 & \text{if } sigovf \\ pew_1 & \text{otherwise,} \end{cases} \]
as required by the lemma. □
Figure 4.8: 'sftmask, hc and tiny' circuit in the gradual rounding unit.
Figure 4.7 depicts the implementation of the 'denormalization, gradual rounding and exponent wrapping' circuit including the necessary changes for the gradual rounding and the optimizations corresponding to lemma 4.6 and 4.7. The computations of $pew_1$ and $pewi_1$ corresponding to lemma 4.7 use an 11-bit compound adder, the gradual rounding decision circuit contains the implementation of equations 2.60, 2.61, 4.18 and 4.19, and the 'ovf1 and ovf2a' circuit contains the implementation of equations 4.15 and 4.16. The optimized implementation of the 'sftmask, hc and tiny' circuit corresponding to lemma 4.6 is depicted in figure 4.8.
4.1.4 Packing III (normalized → packed format)
This section describes a packing unit that is able to convert a FP number from the normalized format $BUS_{NF}[69:0]$ (section 2.6.3) to the single precision or the double precision packed FP representation $BUS_{PF}[63:0]$ (section 2.6.1). As in the previous sections, the precision of the packed FP format is signaled by the bit dbl.
The packing conversion is the combination of two steps: (i) a bounded normalization shift from the normalized to the unpacked format following section 2.6.3, p.43 and (ii) a conversion from the unpacked to the packed format following the descriptions of section 2.6.2, p.41.
The bounded normalization shift can be implemented as in the gradual rounding unit, because in the case of the normalized input format the input factoring also has a normalized significand for all non-zero values. Because in this case the input factoring already has the correct value of the result and no rounding has to be computed on the significand, all bits of the significand are used for the result and no masking of the shifted significand is necessary. Moreover, trapped underflows do not have to be considered in this case. Thus,
\begin{align*}
f_{PF}[0:52] &= f''[0:52] = f'''[0:52] \\
&= \begin{cases} cls((0^2, f_{NF}[0:52], 0^9), \langle e_{NF}[5:0] \rangle)[0:52] & \text{if } tiny \\ f_{NF}[0:52] & \text{otherwise.} \end{cases}
\end{align*}
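The effect of the cyclic left shift on a tiny operand can be illustrated with a small Python model. The names cls64 and denormalize are hypothetical. The sketch assumes that the 6-bit shift amount encodes $e - e_{min} + 2$ modulo 64, as stated for the gradual result format in the proof of lemma 4.6, and that vector index 0 is the most significant bit. It is only meaningful while the shifted-out significand bits are zero, which holds when the input value is exactly representable.

```python
MASK64 = (1 << 64) - 1


def cls64(x: int, s: int) -> int:
    """Cyclic left shift of a 64-bit vector by s positions (s taken modulo 64)."""
    s %= 64
    return ((x << s) | (x >> (64 - s))) & MASK64


def denormalize(f: int, e: int, e_min: int) -> int:
    """f holds f[0:52] as a 53-bit integer (f[0] is the MSB); returns f'''[0:52]."""
    if e >= e_min:                   # not tiny: the significand passes through unchanged
        return f
    sfta = (e - e_min + 2) % 64      # assumed 6-bit encoding of the shift amount
    vec = f << 9                     # (0^2, f[0:52], 0^9) as a 64-bit integer
    return cls64(vec, sfta) >> 11    # keep vector positions [0:52]


E_MIN_DOUBLE = -1022
one = 1 << 52                        # significand 1.0
# One position below e_min: the hidden bit moves to fraction position 1.
assert denormalize(one, E_MIN_DOUBLE - 1, E_MIN_DOUBLE) == 1 << 51
# 52 positions below e_min: only the least significant fraction bit remains.
assert denormalize(one, E_MIN_DOUBLE - 52, E_MIN_DOUBLE) == 1
```

A shift amount of 2 reproduces the input, so a left rotation by fewer (or, modulo 64, "negative") positions acts as the right shift that denormalizes the significand.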
Figure 4.9: Structure of the packing unit.
With $-e_{min} = \langle (0^2, dbl^3, 1^6, 0) \rangle_2$ for single and double precision, the condition
\[ tiny \iff (e_{NF} - e_{min} < 0) \]
is detected as the sign bit of the sum $\langle e_{NF}[11:0] \rangle_2 + \langle (0^2, dbl^3, 1^6, 0) \rangle_2$. Because for normalized numbers with $(f_{UF}[0] = 1)$ the exponent is not changed in equation 2.90, namely $e_{UF}[11:0] = e_{NF}[11:0]$, the packed exponent representation can be computed according to equation 2.84, which is equivalent to
e
PF
[10 : 0] = bin
10
0
(<(e
NF
[10];e
NF
[9 : 8];e
NF
[7]dbl;e
NF
[6 : 0])> 1) NOR f
UF
[0]:
A final packing selection according to equation 4.57 then yields the packed result representation $BUS_{PF}[63:0]$. This completes the description of the packing unit, which is implemented as depicted in figure 4.9.
4.2 Addition/Subtraction
4.2.1 Addition/Subtraction I (normalized → representative format)
This section describes a FP addition/subtraction unit that is able to add or to subtract two FP numbers given in the normalized representations (section 2.6.3):
\begin{align}
BUSa_{NF}[69:0] &= (sa, ea[11:0], fa[0:52], zeroa, infa, qnana, snana) \tag{4.59} \\
BUSb_{NF}[69:0] &= (sb, eb[11:0], fb[0:52], zerob, infb, qnanb, snanb), \tag{4.60}
\end{align}
which represent the factorings $(sa, ea, fa) = fact_{NF}(BUSa_{NF}[69:0])$ and $(sb, eb, fb) = fact_{NF}(BUSb_{NF}[69:0])$. Whether the addition or the subtraction should be computed is selected by the input bit sop. For the special computation of the sign of zero results, the rounding mode rmode[1:0] is also required as input.
In the case that both operands have representable values, the exact sum/difference $exact_{add/sub}$ is defined by (section 2.2.4):
\[ exact_{add/sub} = (-1)^{sa} \cdot 2^{ea} \cdot fa + (-1)^{sop \oplus sb} \cdot 2^{eb} \cdot fb. \]
If $(s_{rc}, e_{rc}, f_{rc})$ is a RF factoring of $exact_{add/sub}$ for non-zero representable inputs, then for the general case of arbitrary input values, a RF factoring of the addition/subtraction I output is given by:
\[ (s_{RF}, e_{RF}, f_{RF}) = \begin{cases} (0, e_{qNaN}, f_{qNaN}) & \text{if } scqnan \\ (s_{inf}, e_\infty, f_\infty) & \text{if } scinf \\ (sa, ea, fa) & \text{if } scx \\ (sb, eb, fb) & \text{if } scy \\ (s_0, e_0, 0) & \text{if } sczero \\ (s_{rc}, e_{rc}, f_{rc}) & \text{otherwise,} \end{cases} \tag{4.61} \]
so that the sum output of the addition/subtraction I unit is specified by the corresponding representation in the representative format $BUS_{RF}[73:0] = rf(s_{RF}, e_{RF}, f_{RF})$. Moreover, the invalid flag inv should be signaled according to the occurrence of an invalid exception.
The computations of the special conditions in equation 4.61 are already summarized in section 2.4.4 by equations 2.20-2.26. We postpone the discussion of the special sign, significand and exponent selections and consider the computation of $(s_{rc}, e_{rc}, f_{rc})$ for the regular case in the following. For this we assume non-zero representable input operands.
Definition 4.1 Let $seff = sop \oplus sa \oplus sb$. The case that seff = 0 is called effective addition and the case that seff = 1 is called effective subtraction. For effective subtractions, we multiply the significands of both operands by 2. This operation is called the pre-shift and can be computed by a left-shift of the binary significand representations by one bit position. The significands $fa'$ and $fb'$ that include the conditional pre-shift are defined by:
\[ fa' = \begin{cases} 2 \cdot fa & \text{if } seff = 1 \\ fa & \text{otherwise} \end{cases} \qquad fb' = \begin{cases} 2 \cdot fb & \text{if } seff = 1 \\ fb & \text{otherwise} \end{cases} \]
We define the exponent difference $\delta = ea - eb$ and the sign of the exponent difference $sdelta \iff (\delta < 0)$. The "large" operand $(sl, el, fl)$, the significand of the "small" operand, fs, and the exponent $e_1$ are defined as follows:
\[ (sl, el, fl) = \begin{cases} (sa, ea, fa') & \text{if } sdelta = 0 \\ (sop \oplus sb, eb, fb') & \text{otherwise} \end{cases} \qquad fs = \begin{cases} fb' & \text{if } sdelta = 0 \\ fa' & \text{otherwise.} \end{cases} \]
\[ e_1 = \begin{cases} el - 1 & \text{if } seff = 1 \\ el & \text{otherwise.} \end{cases} \tag{4.62} \]
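Lemma 4.8 Based on the previous definitions, the exact sum can be written as
\[ exact_{add/sub} = (-1)^{sl} \cdot 2^{e_1} \cdot (fl + (-1)^{seff} (fs \cdot 2^{-|\delta|})). \tag{4.63} \]
Proof:
\begin{align*}
exact_{add/sub} &= (-1)^{sa} \cdot 2^{ea} \cdot fa + (-1)^{sop \oplus sb} \cdot 2^{eb} \cdot fb \\
&= \begin{cases} (-1)^{sa} \cdot 2^{ea} \cdot (fa + (-1)^{sop \oplus sb \oplus sa} (fb \cdot 2^{eb - ea})) & \text{if } \delta \ge 0 \\ (-1)^{sop \oplus sb} \cdot 2^{eb} \cdot (fb + (-1)^{sop \oplus sb \oplus sa} (fa \cdot 2^{ea - eb})) & \text{otherwise} \end{cases} \\
&= \begin{cases} (-1)^{sl} \cdot 2^{el - 1} \cdot (fl + (-1)^{seff} (fs \cdot 2^{-|\delta|})) & \text{if } seff = 1 \\ (-1)^{sl} \cdot 2^{el} \cdot (fl + (-1)^{seff} (fs \cdot 2^{-|\delta|})) & \text{otherwise} \end{cases} \\
&= (-1)^{sl} \cdot 2^{e_1} \cdot (fl + (-1)^{seff} (fs \cdot 2^{-|\delta|})).
\end{align*}
□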
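The identity of lemma 4.8 can be checked with exact rational arithmetic. The Python sketch below is an illustration with hypothetical function names; it evaluates the defining sum and the swapped, pre-shifted form of equation 4.63 side by side, reading the sign exponents as XOR combinations of sop, sa and sb.

```python
from fractions import Fraction


def exact_direct(sa, ea, fa, sb, eb, fb, sop):
    """Defining formula of the exact sum/difference (fa, fb given as Fractions)."""
    return ((-1) ** sa * Fraction(2) ** ea * fa
            + (-1) ** (sop ^ sb) * Fraction(2) ** eb * fb)


def exact_swapped(sa, ea, fa, sb, eb, fb, sop):
    """Equation 4.63, using the definitions of seff, pre-shift, swap and e_1."""
    seff = sop ^ sa ^ sb
    fa_p, fb_p = (2 * fa, 2 * fb) if seff else (fa, fb)    # conditional pre-shift
    delta = ea - eb
    if delta >= 0:                                         # sdelta = 0
        sl, el, fl, fs = sa, ea, fa_p, fb_p
    else:
        sl, el, fl, fs = sop ^ sb, eb, fb_p, fa_p
    e1 = el - 1 if seff else el                            # equation 4.62
    fsum = fl + (-1) ** seff * fs * Fraction(2) ** -abs(delta)
    return (-1) ** sl * Fraction(2) ** e1 * fsum


# 1.5 * 2^3 minus 1.25 * 2^1, and the same operands with swapped roles.
args = (0, 3, Fraction(3, 2), 0, 1, Fraction(5, 4), 1)
assert exact_direct(*args) == exact_swapped(*args) == Fraction(19, 2)
swapped = (0, 1, Fraction(5, 4), 0, 3, Fraction(3, 2), 1)
assert exact_direct(*swapped) == exact_swapped(*swapped) == Fraction(-19, 2)
```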
The most complex part of the addition/subtraction computation corresponding to equation 4.63 is the computation of the significand sum fsum:
\[ fsum = fl + (-1)^{seff} \cdot fs \cdot 2^{-|\delta|}. \]
With the definition of the absolute significand sum $abs\_fsum = |fsum|$ and the sign of fsum, $sfsum \iff (fsum < 0)$, we can write $fsum = (-1)^{sfsum} \cdot abs\_fsum$, so that
\[ exact_{add/sub} = (-1)^{sl \oplus sfsum} \cdot 2^{e_1} \cdot abs\_fsum. \tag{4.64} \]
The following lemma shows that the 53-representative of the absolute significand sum, $rep_{53}(abs\_fsum)$, meets the requirements for significands in the representative format:
Lemma 4.9 The 53-representative of the absolute significand sum $rep_{53}(abs\_fsum)$ is smaller than 4 and is either an integral multiple of $2^{-52}$ or is larger than or equal to 1, as required for significands in the representative format (see section 2.6.4).
Proof: The absolute significand sum is defined by:
\[ abs\_fsum = |fl + (-1)^{seff} \cdot fs \cdot 2^{-|\delta|}|. \tag{4.65} \]
We separate the proof for: (a) effective additions; and (b) effective subtractions.
(a) For effective additions, $abs\_fsum = |fl + fs \cdot 2^{-|\delta|}|$. Because $1 \le fl < 2$, $1 \le fs < 2$ and $0 < 2^{-|\delta|} \le 1$, the absolute significand sum is in the range $1 \le abs\_fsum < 4$. Thus, the 53-representative of the absolute significand sum is also in the range $1 \le rep_{53}(abs\_fsum) < 4$ and the proof of case (a) is completed.
(b) For effective subtractions, $abs\_fsum = |fl - fs \cdot 2^{-|\delta|}|$. Because of the preshifts, fl and fs are now both in the range [2, 4[ and both are multiples of $2^{-51}$. From this, it follows directly that $0 < abs\_fsum < 4$. For the remaining part of the proof, we distinguish between the two cases: (i) $|\delta| \le 1$; and (ii) $|\delta| > 1$.
(i) Because $|\delta| \le 1$, $fs \cdot 2^{-|\delta|}$ is a multiple of $2^{-52}$ and abs_fsum is a multiple of $2^{-52}$. Thus, the 53-representative $rep_{53}(abs\_fsum)$ is also a multiple of $2^{-52}$ and the lemma follows for case (i).
(ii) Because $|\delta| > 1$, $(fs \cdot 2^{-|\delta|}) < 1$. Thus, $abs\_fsum > 2 - 1 = 1$ and also $rep_{53}(abs\_fsum) > 1$. This completes case (b) and the proof of the whole lemma. □
Because of equation 4.64 and lemma 4.9, the value $val(sl \oplus sfsum, e_1, rep_{53}(abs\_fsum))$ is $(e_1 - 53)$-equivalent to $exact_{add/sub}$. Thus, $(s_{rc}, e_{rc}, f_{rc}) = (sl \oplus sfsum, e_1, rep_{53}(abs\_fsum))$ is a RF factoring of the exact sum $exact_{add/sub}$. In the following the computation of this RF factoring is described.
Definition 4.2 We define the limited absolute exponent difference deltalim by
\[ deltalim = \begin{cases} |\delta| & \text{if } |\delta| \le 63 \\ 63 & \text{otherwise.} \end{cases} \tag{4.66} \]
Because $0 \le deltalim \le 63$, we can use the 6-bit binary representation $deltalim = \langle deltalim[5:0] \rangle$. Moreover, we define the negated significand $fsn = (-1)^{seff} \cdot fs$ and the aligned significand $fsa = fsn \cdot 2^{-|\delta|}$.
Lemma 4.10 The 53-representative of the absolute significand sum $rep_{53}(abs\_fsum)$ can be computed by using the limited absolute exponent difference deltalim instead of $|\delta|$:
\[ rep_{53}(abs\_fsum) = rep_{53}(|fl + (-1)^{seff} \cdot fs \cdot 2^{-deltalim}|). \tag{4.67} \]
Proof: We separate the proof for: (a) the case of $|\delta| \le 63$; and (b) the case of $|\delta| > 63$.
(a) For $|\delta| \le 63$, we have $deltalim = |\delta|$, so that the lemma follows directly from equation 4.65.
(b) For $|\delta| > 63$, we have fsum > 0 for both effective subtractions and additions, so that abs_fsum = fsum and $rep_{53}(abs\_fsum) = rep_{53}(fsum)$. Remember that the significand fl is a multiple of $2^{-53}$ and fs < 4. Let $x_h = fl$, $x_l = (-1)^{seff} \cdot fs \cdot 2^{-|\delta|}$, and $q = 2^{|\delta| - deltalim}$. We then get $abs\_fsum = x_h + x_l$, $x_h = k \cdot 2^{-53}$ for an integer k, $|x_l| < 4 \cdot 2^{-63} < 2^{-53}$, and
\[ q \cdot x_l = (-1)^{seff} \cdot fs \cdot 2^{-|\delta|} \cdot 2^{|\delta| - deltalim} = (-1)^{seff} \cdot fs \cdot 2^{-deltalim} = (-1)^{seff} \cdot fs \cdot 2^{-63}, \]
so that also $q \cdot |x_l| < 2^{-53}$. Therefore, lemma 2.16 with p = 53 can be used and we get $rep_{53}(x_h + x_l) = rep_{53}(x_h + q \cdot x_l)$. This equation can be written as $rep_{53}(abs\_fsum) = rep_{53}(|fl + (-1)^{seff} \cdot fs \cdot 2^{-deltalim}|)$, so that the proof of the lemma is completed. □
The computation of $rep_{53}(abs\_fsum)$ according to lemma 4.10 is partitioned into the following steps:
1. computation of the limited absolute exponent difference $deltalim = \langle deltalim[5:0] \rangle$ and the sign of the exponent difference sdelta.
2. operand swapping (computation of sl, $el = \langle el[11:0] \rangle_2$, $fl = \langle fl[-2:52] \rangle_{2neg}$ and $fs = \langle fs[-2:52] \rangle_{2neg}$, including the preshifts for effective subtractions):
\begin{align}
(fa'[-1:52], fb'[-1:52]) &= \begin{cases} (fa[0:52], 0, fb[0:52], 0) & \text{if } seff = 1 \\ (0, fa[0:52], 0, fb[0:52]) & \text{otherwise} \end{cases} \tag{4.68} \\
(sl, el[11:0], fl[-2:52]) &= \begin{cases} (sb \oplus sop, eb[11:0], 0, fb'[-1:52]) & \text{if } sdelta \\ (sa, ea[11:0], 0, fa'[-1:52]) & \text{otherwise} \end{cases} \tag{4.69} \\
fs[-2:52] &= \begin{cases} (0, fa'[-1:52]) & \text{if } sdelta \\ (0, fb'[-1:52]) & \text{otherwise} \end{cases} \tag{4.70}
\end{align}
3. significand negation of fs for effective subtractions. Because $fs = \langle fs[-2:52] \rangle_{neg}$ and $fs[-2] = 0$, $fsn = (-1)^{seff} \cdot fs$ can be computed by
\begin{align}
fsn &= \langle fsn[-2:52] \rangle_{2neg} \tag{4.71} \\
&= \begin{cases} \langle \overline{fs[-2:52]} \rangle_{2neg} + 2^{-52} & \text{if } seff \\ \langle fs[-2:52] \rangle_{2neg} & \text{otherwise.} \end{cases} \tag{4.72}
\end{align}
This equation is implemented by a 55-bit incrementer and a 55-bit mux selection.
4. alignment shift of fsn[ 2:52] by deltalim positions (fsa = fsn2
 deltalim
). Because
0  deltalim  63, fsa can be represented by fsa =< fsa[ 2 : 115] >
2neg
and
because fsn is also represented in the two's complement representation, the ll bit
fsa[ 2] has to be shifted in for sign extension:
fsa[ 2:115] = (fsn[ 2]
deltalim
; fsn[ 2:52]; 0
63 deltalim
) (4.73)
= rsft(fsn[ 2:52]; <deltalim[5 :0]>; fsn[ 2]; 0) (4.74)
This right shift is implemented with a 55-bit shifter.
5. signicand addition fsum = fl + fsa:
<fsum[ 2:115]>
2neg
=<fl[ 2:52]>
2neg
+ <fsa[ 2:115] >
2neg
This addition is partitioned into a lower part and into an upper part:
fsum[53 :115] = fsa[53 :115] (4.75)
<fsum[ 2:52]>
2neg
= <fl[ 2:52]>
2neg
+ <fsa[ 2:52] >
2neg
(4.76)
The addition of the upper part is implemented by a 55-bit carry-look-ahead adder
implementation.
6. conversion for negative $fsum$ (computation of $abs\_fsum = \langle abs\_fsum[-1:115]\rangle_{neg} = |fsum|$). Because $fsum$ is negative iff $fsa > fl$, in this case $deltalim = 0$ and $abs\_fsum[53:115] = fsum[53:115] = fsa[53:115] = 0^{63}$. Thus, only the upper part $[-2:52]$ is involved in the conversion:
\[ abs\_fsum[53:115] = fsum[53:115] \tag{4.77} \]
\[ \langle abs\_fsum[-1:52]\rangle_{2neg} = \begin{cases} \langle \overline{fsum[-1:52]}\rangle_{2neg} + 2^{-52} & \text{if } fsum[-2] \\ \langle fsum[-1:52]\rangle_{2neg} & \text{otherwise.} \end{cases} \tag{4.78} \]
This equation is implemented by a 55-bit incrementer and a 55-bit mux selection. The sign of $fsum$ is given by $sfsum = fsum[-2]$.
7. representative computation according to lemma 2.11:
\[ f_{rc} = \langle f_{rc}[-1:54]\rangle_{neg} = \langle rep_{53}(abs\_fsum)[-1:54]\rangle_{neg} = \langle (abs\_fsum[-1:53],\; ORtree(abs\_fsum[54:115]))\rangle_{neg} \]
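The significand steps 3 to 7 can be illustrated with a small behavioral model. The following Python sketch is not taken from the thesis. It assumes that the significands are passed as integers in units of $2^{-52}$ and that the operand swapping and preshifts of step 2 have already been applied; the function name `regular_significand` is hypothetical. All values are scaled to units of $2^{-115}$ so that the alignment shift loses no bits.

```python
def regular_significand(fl: int, fs: int, seff: int, deltalim: int):
    """Behavioral sketch of steps 3-7 (hypothetical helper, not the hardware).

    fl, fs   : significands as integers in units of 2**-52 (bits [-2:52])
    seff     : 1 for an effective subtraction, 0 otherwise
    deltalim : limited exponent difference, 0 <= deltalim <= 63
    Returns (sfsum, f_rc) with f_rc in units of 2**-54 (bits [-1:54]).
    """
    assert 0 <= deltalim <= 63
    # Step 3: conditional negation of the smaller operand.
    fsn = -fs if seff else fs
    # Step 4: alignment shift; in units of 2**-115 the shift is exact.
    fsa = (fsn << 63) >> deltalim
    # Step 5: significand addition.
    fsum = (fl << 63) + fsa
    # Step 6: sign and absolute value.
    sfsum = int(fsum < 0)
    abs_fsum = -fsum if sfsum else fsum
    # Step 7: keep bits down to position 53, OR bits [54:115] into one bit.
    upper = abs_fsum >> 62
    or_low = int((abs_fsum & ((1 << 62) - 1)) != 0)
    f_rc = (upper << 1) | or_low
    return sfsum, f_rc
```

For example, `regular_significand(3 << 51, 1 << 52, 0, 1)` models $1.5 + 1.0 \cdot 2^{-1} = 2.0$ and returns a representative equal to $2.0$ in units of $2^{-54}$.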
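Among these steps, only the implementation of the first step has to be specified further. This is done by the following lemma:
Lemma 4.11 With the computation of
\[ \delta = \langle delta[13:0]\rangle_2 = ea - eb \tag{4.79} \]
\[ \phantom{\delta} = \langle (0, ea[12:0])\rangle_2 + \langle (1, \overline{eb[12:0]})\rangle_2 + 1 \tag{4.80} \]
\[ |\delta| = \langle abs\_delta[13:0]\rangle \tag{4.81} \]
\[ \phantom{|\delta|} = \begin{cases} \langle \overline{delta[13:0]}\rangle + 1 & \text{if } delta[13] \\ \langle delta[13:0]\rangle & \text{otherwise} \end{cases} \tag{4.82} \]
\[ deltaovf = ORtree(abs\_delta[13:6]) \tag{4.83} \]
we get
\[ sdelta = delta[13] \tag{4.84} \]
\[ deltalim = \langle deltalim[5:0]\rangle \tag{4.85} \]
\[ \phantom{deltalim} = \langle (abs\_delta[5:0] \text{ OR } deltaovf)\rangle \tag{4.86} \]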
Proof: The equations for $\delta = \langle delta[13:0]\rangle_2$ and $|\delta| = \langle abs\_delta[13:0]\rangle$ are a straightforward implementation of the definitions using the properties of two's complement numbers. In the two's complement representation $delta[13:0]$, the sign of $\delta$ is given by $sdelta = delta[13]$. The bit $deltaovf$ implements the condition $(|\delta| > 63)$. In equation 4.86, $deltalim$ is set to $\langle 111111\rangle = 63$ for $deltaovf = 1$ and to $\langle abs\_delta[5:0]\rangle = |\delta|$ for $deltaovf = 0$, as required by the definition of $deltalim$ in equation 4.66. □
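As an illustration of lemma 4.11, the following Python sketch mirrors equations 4.79 to 4.86 on plain integers. It is not part of the thesis; it assumes that both exponents are given as non-negative 13-bit values, and the function name `exponent_difference` is hypothetical.

```python
MASK14 = (1 << 14) - 1

def exponent_difference(ea: int, eb: int):
    """Sketch of lemma 4.11 for 0 <= ea, eb < 2**13 (assumed input range).

    Returns (sdelta, abs_delta, deltalim).
    """
    # (4.80): (0, ea) + (1, ~eb) + 1 in 14-bit two's complement.
    delta = (ea + ((~eb) & MASK14) + 1) & MASK14
    sdelta = (delta >> 13) & 1                       # (4.84)
    # (4.82): conditional two's complement negation.
    abs_delta = ((~delta & MASK14) + 1) & MASK14 if sdelta else delta
    # (4.83): OR over bits [13:6] detects |delta| > 63.
    deltaovf = int((abs_delta >> 6) != 0)
    # (4.86): saturate to 63 by OR-ing the overflow bit into bits [5:0].
    deltalim = (abs_delta & 0x3F) | (0x3F if deltaovf else 0)
    return sdelta, abs_delta, deltalim
```

The saturation needs no comparator: OR-ing $deltaovf$ into each of the six low bits forces the value 63 whenever any higher bit of $abs\_delta$ is set.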
The exponent $e_{rc} = \langle e_{rc}[12:0]\rangle_2 = e_1$ (equation 4.62) is computed by:
\[ e_{rc} = \langle e_1[11:0]\rangle_2 \tag{4.87} \]
\[ \phantom{e_{rc}} = \begin{cases} \langle el[11:0]\rangle_2 - 1 & \text{if } seff = 1 \\ \langle el[11:0]\rangle_2 & \text{otherwise.} \end{cases} \tag{4.88} \]
This equation is implemented by a 12-bit decrementer and a 12-bit selection mux. The sign computation implements the equation $s_{rc} = (sl \oplus sfsum)$. This completes the description of the computation of $(s_{rc}, e_{rc}, f_{rc})$. In the following we integrate this result for the regular case with the special case results according to equation 4.61, and we consider the recognition of the invalid exception.
We separate the final result selection according to equation 4.61 for the significand, the exponent and the sign of the result. The definitions of the special case conditions $scqnan$, $scinf$, $scx$, $scy$, and $sczero$ are given in equations 2.20-2.24. For the computation of the zero condition $sczero$, we first have to detect the condition $zero_{rc}$ of zero results for regular operands. Because the computation of $zero_{rc}$ based on the result of the regular path would be quite slow, we compute this signal directly from the input operands:
\[ zero_{rc} \iff seff \wedge zerotest\big((ea[11:0], fa[0:52]) \oplus (eb[11:0], fb[0:52])\big). \tag{4.89} \]
This equation is implemented in the special cases circuit.
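Based on the special case conditions we get for the significand:
\[ f_{RF} = \langle f_{RF}[-1:54]\rangle_{neg} = \begin{cases} f_{qnan} = \langle (0, 1, 0, 1, 0^{52})\rangle_{neg} & \text{if } scqnan \\ f_{\infty} = \langle (0, 1, 0^{54})\rangle_{neg} & \text{if } scinf \\ fa = \langle (0, fa[0:52], 0^{2})\rangle_{neg} & \text{if } scx \\ fb = \langle (0, fb[0:52], 0^{2})\rangle_{neg} & \text{if } scy \\ f_{0} = \langle (0^{56})\rangle_{neg} & \text{if } sczero \\ f_{rc} = \langle f_{rc}[-1:54]\rangle_{neg} & \text{otherwise.} \end{cases} \]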
By the definition of the special case significand representation $f_{sc}[-1:54]$
\[ f_{sc}[-1:54] = \begin{cases} (0, 1, 0, 1, 0^{52}) & \text{if } scqnan \\ (0, 1, 0^{54}) & \text{if } scinf \\ (0, fa[0:52], 0^{2}) & \text{if } scx \\ (0, fb[0:52], 0^{2}) & \text{if } scy \\ 0^{56} & \text{otherwise} \end{cases} \tag{4.90} \]
and the special case condition
\[ spca = scqnan \vee scx \vee scy \vee scinf \vee sczero, \tag{4.91} \]
the representation of the significand $f_{RF}$ can be selected by
\[ f_{RF}[-1:54] = \begin{cases} f_{sc}[-1:54] & \text{if } spca \\ f_{rc}[-1:54] & \text{otherwise.} \end{cases} \tag{4.92} \]
The computations for $f_{sc}[-1:54]$ and $spca$ are implemented in the special cases circuit.
Based on the special case conditions, all four additional bits of the result representation are already given by $zero_{RF} = sczero$, $inf_{RF} = scinf$, $snan_{RF} = 0$ and
\[ qnan_{RF} \iff qnana \vee qnanb \vee scqnan. \tag{4.93} \]
For the exponent $e_{RF}$ the selection is even simpler, because for all special value results we defined $e_{RF} = e_{max} + 1$. Because in addition/subtraction a special value result implies that at least one of the input operands also has a special value, and to avoid the distinction between the single and the double precision case, the special exponent representation can be copied from one special input operand for all special value results. We define the condition
\[ nrega \iff infa \vee qnana \vee snana. \tag{4.94} \]
If there is at least one special input operand, then a special exponent representation is copied from the inputs by
\[ e_{sc}[11:0] = \begin{cases} ea[11:0] & \text{if } nrega \\ eb[11:0] & \text{otherwise,} \end{cases} \tag{4.95} \]
so that the exponent $e_{RF}$ of the representative result can be selected by:
\[ e_{RF} = \langle e_{RF}[12:0]\rangle_2 \tag{4.96} \]
\[ \phantom{e_{RF}} = \begin{cases} e_{sc} = \langle (e_{sc}[11], e_{sc}[11:0])\rangle_2 & \text{if } spca \\ e_{rc} = \langle (e_{rc}[11], e_{rc}[11:0])\rangle_2 & \text{otherwise.} \end{cases} \tag{4.97} \]
Note that for zero results the exponent $e_{sc}$ is selected, but in this case its value does not matter, because zero representations in the representative format may contain an arbitrary exponent value.
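We define the special case sign $s_{sc}$ and the preliminary sign $s'_{RF}$ by
\[ s_{sc} = \begin{cases} 0 & \text{if } scqnan \\ sa & \text{if } scx \\ sb & \text{if } scy \\ (sa \wedge infa) \vee (sb \wedge infb) & \text{otherwise} \end{cases} \tag{4.98} \]
\[ s'_{RF} = \begin{cases} s_{sc} & \text{if } spca \\ s_{rc} & \text{otherwise.} \end{cases} \tag{4.99} \]
Because the rounding mode RMI is encoded by $rmode[1:0] = (11)$ according to table 2.3, and $(sa \wedge (sb \oplus sop)) \equiv (sl \wedge \overline{seff})$, we then get the sign of the result $s_{RF}$ according to equation 2.15 by
\[ s_{RF} = \begin{cases} s_{0} & \text{if } sczero \\ s'_{RF} & \text{otherwise} \end{cases} \tag{4.100} \]
\[ \phantom{s_{RF}} = \begin{cases} (seff \wedge rmode[1] \wedge rmode[0]) \vee (sl \wedge \overline{seff}) & \text{if } sczero \\ s'_{RF} & \text{otherwise.} \end{cases} \tag{4.101} \]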
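The zero-result branch of the sign selection can be read as follows: an exact cancellation yields $-0$ only in the rounding mode RMI, while a sum of two equally signed zeros keeps that sign. A minimal Python sketch of this branch is given below. It is an illustration only; it assumes the encoding $rmode[1:0] = (11)$ for RMI stated above, and the function name `zero_result_sign` is hypothetical.

```python
def zero_result_sign(sl: int, seff: int, rmode: int) -> int:
    """Sign of a zero result, assuming RMI is encoded as rmode = 0b11.

    sl    : sign of the larger operand (0 or 1)
    seff  : 1 for an effective subtraction, 0 otherwise
    rmode : two-bit rounding mode encoding
    """
    rmode1 = (rmode >> 1) & 1
    rmode0 = rmode & 1
    not_seff = 1 - seff
    return (seff & rmode1 & rmode0) | (sl & not_seff)
```

For instance, `zero_result_sign(0, 1, 0b11)` gives 1 (a cancellation $x - x$ under RMI), whereas the same cancellation under any other mode gives 0.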
[Figure 4.10: Structure of the addition/subtraction unit I.]
In this way, the sum output in the representative format is given by:
\[ BUS_{RF}[73:0] = (s_{RF},\; e_{RF}[12:0],\; f_{RF}[-1:54],\; zero_{RF},\; inf_{RF},\; qnan_{RF},\; snan_{RF}) \tag{4.102} \]
The cases for the occurrence of an invalid exception are listed in table 2.5. The invalid exception occurs iff the addition/subtraction results in a quiet NaN with $scqnan = 1$, so that
\[ inv \iff scqnan. \tag{4.103} \]
This completes the description of the addition/subtraction I implementation, which is depicted in figure 4.10. The only part that is included in this figure without details is the special cases circuit. This special cases circuit includes the computations of equations 2.20-2.24, 4.89, 4.90-4.91, 4.93-4.95, and 4.98.
4.2.2 Addition/Subtraction II (normalized → gradual result format)
As in the previous section, the FP addition/subtraction is computed from the inputs in the normalized representations $BUSa_{NF}[69:0]$ and $BUSb_{NF}[69:0]$ (section 2.6.3), the rounding mode represented by $rmode[1:0]$, and the bit $sop$ that signals the case of addition or subtraction. In contrast to the previous implementation, where a representative of the exact operation result had to be delivered, in this case the gradual rounding function $ground1$ has to be computed on the exact operation result. After this gradual rounding step, the sum/difference is output in the gradual result format $BUS_{GF}[73:0]$ (section 2.6.5). Formally, with the notation from the previous section and with $((s_{grc}, e_{grc}, f_{grc}), tinc, tinx) = ground1_{mode}(s_{rc}, e_{rc}, f_{rc})$, the required addition/subtraction result is based on the following GF factoring (note that, according to lemma 2.7, the rounding can be computed on the RF factoring $(s_{rc}, e_{rc}, f_{rc})$ instead of a factoring of the exact operation result):
\[ ((s_{GF}, e_{GF}, f_{GF}), tinc_{GF}, tinx_{GF}) = \begin{cases} ((0, e_{qNaN}, f_{qNaN}), 0, 0) & \text{if } scqnan \\ ((s_{inf}, e_{\infty}, f_{\infty}), 0, 0) & \text{if } scinf \\ ((sa, ea, fa), 0, 0) & \text{if } scx \\ ((sb, eb, fb), 0, 0) & \text{if } scy \\ ((s_{0}, e_{0}, 0), 0, 0) & \text{if } sczero \\ ((s_{grc}, e_{grc}, f_{grc}), tinc, tinx) & \text{otherwise,} \end{cases} \tag{4.104} \]
so that the sum output of the addition/subtraction unit in this section is specified by the corresponding gradual result representation $BUS_{GF}[73:0] = gf((s_{GF}, e_{GF}, f_{GF}), tinc, tinx)$. The occurrence of an invalid exception is signaled by the bit $inv$ in this case as well.
The special case conditions and values in equation 4.104 are identical to those in the specification of the previous section. In the implementation of this special cases selection, the only difference from the previous section is that a representation in the gradual result format has 3 bits fewer in the significand; these bits were filled with zeros in the representative format. Moreover, the gradual result format requires two additional rounding tags, which have to be zero for special value results. For the special cases selections, these small adjustments are integrated in the implementation depicted in figure 4.11. In the equations that are implemented in the special cases circuit, the selections for bit positions $[-1]$ and $[53:54]$ are likewise omitted. This completes the description of the special cases computation, and we only have to describe the computation of the gradual result representation of $((s_{grc}, e_{grc}, f_{grc}), tinc, tinx)$ in the following.
The computation of the GF factoring $((s_{grc}, e_{grc}, f_{grc}), tinc, tinx)$ can be based on the computation of the RF factoring $(s_{rc}, e_{rc}, f_{rc})$ from the previous section:
\[ (s_{rc}, e_{rc}, f_{rc}) = \big(sl \oplus sfsum,\; e_1,\; rep_{53}(|fl + (-1)^{seff} \cdot fs \cdot 2^{-deltalim}|)\big), \]
so that
\[ ((s_{grc}, e_{grc}, f_{grc}), tinc, tinx) = ground1_{mode}(s_{rc}, e_{rc}, f_{rc}) \tag{4.105} \]
\[ \phantom{((s_{grc}, e_{grc}, f_{grc}), tinc, tinx)} = post\_norm(sgrnd1_{mode?s}((s_{rc}, e_{rc}, f_{rc}))). \tag{4.106} \]
The three additional steps of the normalization shift, the rounding computation and the post-normalization shift could add a large delay in a straightforward implementation. To speed up the computations, we divide the implementation into two parallel paths that work under different assumptions. The computations in each path can then be simplified, and some of the computation steps have to be considered in only one of the two paths. Such a 'two path' approach for floating-point addition was first described in [14]. In that description the two paths differ by the assumptions on the magnitude of the exponent difference: the far path is defined for large exponent differences $|\delta| > 1$, and the near path is defined for small exponent differences $|\delta| \le 1$.
Our partitioning is slightly different. Based on the following definition of the path selection condition $is\_r$, we define the 'R'-path (R for Rounding) for $is\_r = 1$ and the 'N'-path (N as in Near, Negation and Normalization) for $is\_r = 0$. As will be shown later, the advantage of our approach is that a conventional implementation of a far path can also be used to implement the 'R'-path, while the implementation of the 'N'-path can be simplified in comparison to a near path implementation.
Definition 4.3 We define the path selection condition $is\_r$ based on the computation of $(s_{rc}, e_{rc}, f_{rc})$ from the previous section with $fsum = fl + (-1)^{seff} \cdot fs \cdot 2^{-deltalim}$:
\[ is\_r \iff \big(\overline{seff} \vee (fsum \in [1, 4[)\big), \tag{4.107} \]
i.e., the results of the 'R'-path have to be valid if $is\_r = 1$, and the results of the 'N'-path have to be valid if $is\_r = 0$, so that a valid result can be selected by
\[ ((s_{grc}, e_{grc}, f_{grc}), tinc, tinx) = \begin{cases} ((r\_s, r\_e, r\_f), r\_tinc, r\_tinx) & \text{if } is\_r \\ ((n\_s, n\_e, n\_f), n\_tinc, n\_tinx) & \text{otherwise.} \end{cases} \]
Lemma 4.12 For the two paths we get the following properties:
(a) 'R'-path: $is\_r \implies fsum \in [1, 4[$
(b) 'N'-path: $\overline{is\_r} \implies seff = 1$
(c) 'N'-path: $\overline{is\_r} \implies \delta \in \{-1, 0, 1\}$
(d) 'N'-path: $\overline{is\_r} \implies fsum \in \,]-2, 1[$ and $fsum$ is an integral multiple of $2^{-52}$.
Proof: (a) Because one part of the definition of $is\_r$ already includes the condition $fsum \in [1, 4[$, we have to show part (a) only for the case of $seff = 0$. For this case of effective additions it was already shown in part (a) of the proof of lemma 4.9 that $fsum \in [1, 4[$.
Part (b) follows directly from the definition of the path selection condition $is\_r$.
(c) For $\overline{is\_r} = 1$, we have an effective subtraction with $fsum < 1$. In effective subtractions both significands have been preshifted, so that both $fl$ and $fs$ are in the range $[2, 4[$. Thus,
\[ \begin{aligned} (fsum < 1) &\implies (fl - fs \cdot 2^{-deltalim} < 1) \\ &\implies (fs \cdot 2^{-deltalim} > 1) \\ &\implies (2^{-deltalim} > 0.25) \\ &\implies (deltalim < 2). \end{aligned} \]
This last condition is only fulfilled for exponent differences $\delta \in \{-1, 0, 1\}$, as required by the lemma.
(d) From part (c) we know that for $\overline{is\_r} = 1$ we get effective subtractions with $deltalim \le 1$. Because of the preshifts, the significands $fl$ and $fs$ are both integral multiples of $2^{-51}$, so that the aligned significand $fs \cdot 2^{-deltalim}$ is an integral multiple of $2^{-52}$. Thus, the significand sum $fsum$ is also an integral multiple of $2^{-52}$. Because in general $fsum \in \,]-2, 4[$ for effective subtractions, and $fsum < 1$ for $\overline{is\_r} = 1$, we also get $fsum \in \,]-2, 1[$, as required. □
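Parts (c) and (d) of lemma 4.12 can be checked exhaustively on a scaled-down model. The following Python sketch is an illustration and not part of the thesis: it replaces the 52 fractional significand bits by a small, assumed width `p`, enumerates all preshifted operand pairs of an effective subtraction, and asserts the claimed properties whenever $fsum \notin [1, 4[$. The function name `check_n_path_properties` is hypothetical.

```python
from fractions import Fraction
from itertools import product

def check_n_path_properties(p: int = 4, max_delta: int = 6) -> int:
    """Brute-force check of lemma 4.12 (c), (d) with p fractional bits.

    Returns the number of operand combinations that fall into the 'N'-path.
    """
    ulp = Fraction(1, 2 ** p)
    sigs = [1 + k * ulp for k in range(2 ** p)]   # all significands in [1, 2[
    n_path_cases = 0
    for fa, fb, d in product(sigs, sigs, range(max_delta + 1)):
        fl, fs = 2 * fa, 2 * fb                   # preshift for seff = 1
        fsum = fl - fs / 2 ** d                   # effective subtraction
        is_r = 1 <= fsum < 4
        if not is_r:
            n_path_cases += 1
            assert d <= 1                          # part (c)
            assert -2 < fsum < 1                   # part (d), range
            assert (fsum / ulp).denominator == 1   # part (d), multiple of 2**-p
    return n_path_cases
```

Both operand orders are enumerated for `d = 0`, so negative sums are covered as well.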
These properties of the two paths make the following optimizations possible:
(a) Because of (a), there can be no negative $fsum$ in the 'R'-path, so that the conversion step can be avoided in this path. Moreover, we know from (a) that in the 'R'-path the range of the significand sum $fsum$ consists of only two binades, so that only a very small normalization shift by at most one position is required in the 'R'-path.
(b) Because of (b), we can use $seff = 1$ throughout the computations of the 'N'-path and optimize the implementation accordingly.
(c) Because of (c), $deltalim$ in the 'N'-path can already be determined from the two least significant bits in the two's complement representation of the exponent difference.
(d) Because of (d), the significand is a multiple of $2^{-52}$ even after the conversion step and the normalization shift, so that the rounding computation by the function $ground1$ does not change the significand and the rounding can be neglected in the 'N'-path.
The additional advantages of the 'N'-path in our approach in comparison to the 'near'-path from [] are the properties in (b) and in (d). The main structure of our implementation of the addition/subtraction II unit is depicted in figure 4.11. It uses the results of the 'R'-path computations $((r\_s, r\_e, r\_f), r\_tinc, r\_tinx)$, the results of the 'N'-path computations $((n\_s, n\_e, n\_f), n\_tinc, n\_tinx)$ and the condition $is\_r$, which decides which result has to be selected according to definition 4.3. Moreover, the special case computations from the previous section according to equation 4.104 are adapted for the output in the gradual result format in this figure.
In the following, we describe the implementation of the 'R'-path and the 'N'-path separately, after giving a definition of some values that are used in both paths:
Definition 4.4 We define the significand $fso$ (o for one's complement), where the conditional two's complement negation from the significand $fsn$ is replaced by a conditional one's complement negation:
\[ fso = \langle fso[-2:52]\rangle_{2neg} \tag{4.108} \]
\[ \phantom{fso} = \langle fs[-2:52] \oplus seff\rangle_{2neg} \tag{4.109} \]
\[ \phantom{fso} = \langle fsn[-2:52]\rangle_{2neg} - seff \cdot 2^{-52} \tag{4.110} \]
\[ \phantom{fso} = fsn - seff \cdot 2^{-52} \tag{4.111} \]
and we define the corresponding values that are based on $fso$ instead of $fsn$:
\[ fsoa = \langle fsoa[-2:115]\rangle_{2neg} \tag{4.112} \]
\[ \phantom{fsoa} = \langle (seff^{deltalim},\; fso[-1:52],\; seff^{63-deltalim})\rangle_{2neg} \tag{4.113} \]
[Figure 4.11: Structure of the addition/subtraction unit II.]
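\[ \phantom{fsoa} = fso \cdot 2^{-deltalim} + seff \cdot (2^{-52-deltalim} - 2^{-115}) \tag{4.114} \]
\[ \phantom{fsoa} = fsn \cdot 2^{-deltalim} - seff \cdot 2^{-115} \tag{4.115} \]
\[ \phantom{fsoa} = fsa - seff \cdot 2^{-115} \tag{4.116} \]
\[ fosum = \langle fosum[-2:116]\rangle_{2neg} \tag{4.117} \]
\[ \phantom{fosum} = fl + fsoa \tag{4.118} \]
\[ \phantom{fosum} = fsum - seff \cdot 2^{-115} \tag{4.119} \]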
'R'-path The computations in the 'R'-path are described on the basis of the adder implementation from the previous section. As discussed above (see lemma 4.12(a)), we can use for the 'R'-path: $abs\_fsum = fsum \in [1, 4[$ and $sfsum = 0$. Based on this, the required factoring in the 'R'-path $((r\_s, r\_e, r\_f), r\_tinc, r\_tinx)$ can be written as
\[ ((r\_s, r\_e, r\_f), r\_tinc, r\_tinx) = post\_norm(sgrnd1_{mode?sl}((sl, e_1, fsum))). \tag{4.120} \]
Definition 4.5 For $f \in [1, 4[$, we define the generalized post-normalization shift by
\[ gpost\_norm(s, e, f) = \begin{cases} (s, e + 1, f/2) & \text{if } f \in [2, 4[ \\ (s, e, f) & \text{if } f \in [1, 2[ \end{cases} \tag{4.121} \]
Lemma 4.13 In the 'R'-path, the computation of $(r\_s, r\_e, r\_f)$ can be simplified to
\[ (r\_s, r\_e, r\_f) = \begin{cases} gpost\_norm(sl, e_1, rnd_{mode?sl,52}(fsum)) & \text{if } fsum \in [1, 2[ \\ gpost\_norm(sl, e_1, rnd_{mode?sl,51}(fsum)) & \text{if } fsum \in [2, 4[ \end{cases} \tag{4.122} \]
Proof: It follows directly from the denition of the gradual rounding function grnd, that
for zero rounding tags at the input like in (rfsum;tinx;tinc) = grnd
mode?s;
(fsum; 00)),
conventional rounding delivers the same rounded result, so that we also have rfsum =
rnd
mode?s;
(fsum). Moreover, we use for the reduction that fa; fb < (2   2
 52
), so that
fsum 2 [1; (4   2
 51
)]. Because 1 and (4   2
 51
) are integral multiples of 2
 51
, for
fsum 2 [1; (4   2
 51
)] and   51, also the rounded result rfsum = rnd
mode?s;
(fsum)
is in the same range, namely rfsum 2 [1; (4   2
 51
)]. In the following we dier between
the two cases: (a) fsum 2 [1; 2[ and (b) fsum 2 [2; 4[.
(a) For fsum 2 [1; 2[, the normalization shift can be neglected, so that with denition
of the function sgrnd1 we get
post norm(sgrnd1
mode?sl
((sl; e
1
; fsum))) = post norm(sl; e
1
; grnd1
mode?sl
(fsum))
and thus (r s; r e; r f) = gpost norm(sl; e
1
; rnd
mode?sl;52
(fsum)).
(b) For fsum 2 [2; 4[, we get (sl; e
1
; fsum) = (sl; e
1
+ 1; fsum=2), so that because
of fsum  (4  2
 51
)]:
(r s; r e; r f) = gpost norm(sl; e
1
+ 1; rnd
mode?sl;52
(fsum=2))
= gpost norm(sl; e
1
; rnd
mode?sl;51
(fsum))
2
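The core of case (b) is that, for a sum in $[2, 4[$, rounding at position 51 and halving afterwards gives the same result as halving first and rounding at position 52. The following Python sketch checks this on a scaled-down grid. It is an illustration only: the precision `P` stands in for 52, only round-to-zero and round-to-nearest-up are modeled, signs are omitted, and all function names are hypothetical.

```python
from fractions import Fraction
import math

def rnd_rz(x: Fraction, p: int) -> Fraction:
    """Round x >= 0 toward zero to a multiple of 2**-p."""
    return Fraction(math.floor(x * 2 ** p), 2 ** p)

def rnd_rnu(x: Fraction, p: int) -> Fraction:
    """Round x >= 0 to the nearest multiple of 2**-p, ties upward."""
    return Fraction(math.floor(x * 2 ** p + Fraction(1, 2)), 2 ** p)

def gpost_norm(e: int, f: Fraction):
    """Generalized post-normalization shift of definition 4.5 (sign omitted)."""
    return (e + 1, f / 2) if f >= 2 else (e, f)

def check_round_then_shift(P: int = 5, extra: int = 3) -> int:
    """Compare both evaluation orders for all grid points in [2, 4 - 2**-(P-1)]."""
    step = Fraction(1, 2 ** (P + extra))
    top = 4 - Fraction(1, 2 ** (P - 1))
    x, checked = Fraction(2), 0
    while x <= top:
        for rnd in (rnd_rz, rnd_rnu):
            shift_first = gpost_norm(1, rnd(x / 2, P))      # exponent e + 1, e = 0
            round_first = gpost_norm(0, rnd(x, P - 1))      # exponent e, e = 0
            assert shift_first == round_first
        x += step
        checked += 1
    return checked
```

The upper bound of the grid mirrors the bound $fsum \le 4 - 2^{-51}$ used in the proof, which keeps the rounded value below 4.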
Definition 4.6 Based on the previous lemma and with the definition of the significand overflow condition
\[ cond_{[2,4[} \iff fsum \in [2, 4[, \tag{4.123} \]
we define the rounded significand $rnd\_fsum$ by
\[ rnd\_fsum = \langle rnd\_fsum[-1:52]\rangle_{neg} = \begin{cases} rnd_{mode?sl,51}(fsum) & \text{if } cond_{[2,4[} \\ rnd_{mode?sl,52}(fsum) & \text{otherwise,} \end{cases} \]
so that $(r\_s, r\_e, r\_f) = gpost\_norm(sl, e_1, rnd\_fsum)$.
In the following, the computation of the rounded significand $rnd\_fsum$ and the rounding functions $rnd_{mode?sl,52}(fsum)$ and $rnd_{mode?sl,51}(fsum)$ are described using the injection-based rounding reduction from section 2.5.2. We denote the additive rounding injection by $inj_{[1,2[}$ for $fsum \in [1, 2[$ and by $inj_{[2,4[}$ for $fsum \in [2, 4[$. With $srmode = mode\,?\,sl$, these injections are defined by
\[ inj_{[1,2[} = \begin{cases} 0 & \text{if } srmode = RZ \\ 2^{-53} & \text{if } srmode = RN \\ 2^{-52} - 2^{-115} & \text{otherwise} \end{cases} \qquad inj_{[2,4[} = \begin{cases} 0 & \text{if } srmode = RZ \\ 2^{-52} & \text{if } srmode = RN \\ 2^{-51} - 2^{-115} & \text{otherwise.} \end{cases} \]
Based on the injections, we can reduce the previous rounding functions to
\[ rnd_{srmode,51}(fsum) = rnd_{RZ,51}(fsum + inj_{[2,4[}) \tag{4.124} \]
\[ \phantom{rnd_{srmode,51}(fsum)} = rnd_{RZ,51}(fosum + seff \cdot 2^{-115} + inj_{[2,4[}) \tag{4.125} \]
\[ rnd_{srmode,52}(fsum) = rnd_{RZ,52}(fsum + inj_{[1,2[}) \tag{4.126} \]
\[ \phantom{rnd_{srmode,52}(fsum)} = rnd_{RZ,52}(fosum + seff \cdot 2^{-115} + inj_{[1,2[}). \tag{4.127} \]
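The reduction in equations 4.124 to 4.127 replaces each rounding mode by an addition of a mode-dependent constant followed by truncation. The following Python sketch demonstrates this on a scaled-down grid. It is not taken from the thesis: the rounding position `p` and the finest bit position `q` stand in for 52 (or 51) and 115, the mode names are plain strings, and the function names are hypothetical.

```python
from fractions import Fraction
import math

def rnd_direct(x: Fraction, p: int, mode: str) -> Fraction:
    """Reference rounding of x >= 0 to a multiple of 2**-p."""
    scaled = x * 2 ** p
    if mode == "RZ":                       # toward zero
        n = math.floor(scaled)
    elif mode == "RNU":                    # nearest, ties upward
        n = math.floor(scaled + Fraction(1, 2))
    elif mode == "RI":                     # away from zero
        n = math.ceil(scaled)
    else:
        raise ValueError(mode)
    return Fraction(n, 2 ** p)

def injection(p: int, q: int, mode: str) -> Fraction:
    """Additive injection for operands that are multiples of 2**-q, q > p."""
    if mode == "RZ":
        return Fraction(0)
    if mode == "RNU":
        return Fraction(1, 2 ** (p + 1))
    return Fraction(1, 2 ** p) - Fraction(1, 2 ** q)   # RI

def rnd_by_injection(x: Fraction, p: int, q: int, mode: str) -> Fraction:
    """Rounding reduced to truncation of the injected operand."""
    return rnd_direct(x + injection(p, q, mode), p, "RZ")
```

With `p = 52` and `q = 115`, `injection` yields the constants $0$, $2^{-53}$ and $2^{-52} - 2^{-115}$ of $inj_{[1,2[}$. Ties are rounded upward here; the adjustment needed for round-to-nearest-even is a separate step.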
According to definition 4.4, the significand sum $fosum$ consists of the significands $fl$ and $fsoa$. Thus, $fl[-1:52]$ and $fsoa[-1:115]$ can be interpreted as a carry-save representation of $fosum$. We compress this carry-save representation by a half-adder line with the sum outputs $sfosum[-1:115]$ and carry outputs $cfosum[-1:51]$, so that
\[ \langle sfosum[-1:115]\rangle_{neg} + \langle cfosum[-1:51]\rangle_{neg} = \langle fl[-1:52]\rangle_{neg} + \langle fsoa[-1:115]\rangle_{neg}, \]
and $fosum = sfosum + cfosum$. After that, we partition the addition of
\[ finj = \langle finj[-1:115]\rangle_{neg} = fsum + inj_X \tag{4.128} \]
\[ \phantom{finj} = fosum + seff \cdot 2^{-115} + inj_X \tag{4.129} \]
\[ \phantom{finj} = sfosum + cfosum + seff \cdot 2^{-115} + inj_X \tag{4.130} \]
into three parts: an upper part with positions $[-1:51]$, a mid part including positions $[52:53]$, and a lower part with positions $[54:115]$. The additions are computed separately for these three parts, considering the carries from the lower to the mid part and from the mid part to the upper part.
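The half-adder line relies on the identity $a + b = (a \oplus b) + 2 \cdot (a \wedge b)$. A minimal Python sketch of this compression step is given below. It is an illustration, not the thesis implementation: bit position 115 is mapped to integer bit 0, sums are taken modulo the 117-bit width of positions $[-1:115]$ (an assumption about how the carry out of position $[-1]$ is treated), and the function name `half_adder_line` is hypothetical.

```python
W = 117                      # number of bit positions in [-1:115]
MASK = (1 << W) - 1

def half_adder_line(fl_bits: int, fsoa_bits: int):
    """Compress fl[-1:52] and fsoa[-1:115] into a sum word and a carry word.

    fl_bits   : the 54 bits fl[-1:52] as an integer
    fsoa_bits : the 117 bits fsoa[-1:115] as an integer
    Returns (sfosum, cfosum) as 117-bit integers.
    """
    a = (fl_bits << 63) & MASK       # align fl so that position 52 is bit 63
    b = fsoa_bits & MASK
    sfosum = a ^ b                   # per-position sum bits
    # Each carry moves one position up; the carry out of [-1] is dropped.
    cfosum = ((a & b) << 1) & MASK
    return sfosum, cfosum
```

Because `fl` has no bits below position 52, the carry word is zero in positions $[52:115]$, which matches the carry range $cfosum[-1:51]$.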
The binary representation of the injection constants $inj_{[1,2[}$ and $inj_{[2,4[}$ can have non-zero digits only in positions $[52:115]$, which are in the mid part and in the lower part, so that the injections can be represented by $inj_{[1,2[} = \langle inj_{[1,2[}[52:115]\rangle_{neg}$ and $inj_{[2,4[} = \langle inj_{[2,4[}[52:115]\rangle_{neg}$ with:
\[ inj_{[1,2[}[52:53] = \begin{cases} 00 & \text{if } srmode = RZ \\ 01 & \text{otherwise} \end{cases} \tag{4.131} \]
\[ inj_{[2,4[}[52:53] = \begin{cases} 11 & \text{if } srmode = RI \\ 10 & \text{if } srmode = RN \\ 00 & \text{otherwise} \end{cases} \tag{4.132} \]
\[ inj[54:115] = inj_{[1,2[}[54:115] = inj_{[2,4[}[54:115] \tag{4.133} \]
\[ \phantom{inj[54:115]} = \begin{cases} 1^{62} & \text{if } srmode = RI \\ 0^{62} & \text{otherwise.} \end{cases} \tag{4.134} \]
Because in the lower part we have
\[ lpart = \langle sfosum[54:115]\rangle_{neg} + \langle inj[54:115]\rangle_{neg} + seff \cdot 2^{-115} < 2^{-52}, \]
there can be at most one carry bit from the lower part into position $[53]$ of the mid part. This carry bit into position $[53]$ is called $c53$ with $c53 \iff (lpart \ge 2^{-53})$.
Considering the carry $c53$, we have in the mid part:
\[ mpart = \langle sfosum[52:53]\rangle_{neg} + \langle inj[52:53]\rangle_{neg} + c53 \cdot 2^{-53} \tag{4.135} \]
\[ \phantom{mpart} = \langle (c51, l, r)\rangle_{neg} \tag{4.136} \]
\[ \phantom{mpart} < 2^{-50} \tag{4.137} \]
Thus, there can also be at most one carry bit from the mid part into position $[51]$ of the upper part. This carry bit into position $[51]$ is called $c51$ with $c51 \iff (mpart \ge 2^{-51})$.
The value in the mid part depends on whether $fsum \in [1, 2[$ or $fsum \in [2, 4[$. Therefore, we compute two different versions of $(c51, l, r)$, namely $(c51_{[1,2[}, l_{[1,2[}, r_{[1,2[})$ under the assumption that $fsum \in [1, 2[$ ($cond_{[2,4[} = 0$) and $(c51_{[2,4[}, l_{[2,4[}, r_{[2,4[})$ under the assumption that $fsum \in [2, 4[$ ($cond_{[2,4[} = 1$):
\[ \langle (c51_{[1,2[}, l_{[1,2[}, r_{[1,2[})\rangle_{neg} = \langle sfosum[52:53]\rangle_{neg} + \langle inj_{[1,2[}[52:53]\rangle_{neg} + c53 \cdot 2^{-53} \tag{4.138} \]
\[ \langle (c51_{[2,4[}, l_{[2,4[}, r_{[2,4[})\rangle_{neg} = \langle sfosum[52:53]\rangle_{neg} + \langle inj_{[2,4[}[52:53]\rangle_{neg} + c53 \cdot 2^{-53} \tag{4.139} \]
Moreover, the upper part of $finj$ in positions $[-1:51]$ can only have either the value
\[ usum = \langle usum[-1:51]\rangle_{neg} \tag{4.140} \]
\[ \phantom{usum} = \langle sfosum[-1:51]\rangle_{neg} + \langle cfosum[-1:51]\rangle_{neg} \tag{4.141} \]
or the value $usumi = \langle usumi[-1:51]\rangle_{neg} = usum + 2^{-51}$, because of equation 4.137. Based on this and with the definition of the rounding increment condition
\[ rinc \iff \big((c51_{[2,4[} \wedge cond_{[2,4[}) \text{ OR } (c51_{[1,2[} \wedge \overline{cond_{[2,4[}})\big), \tag{4.142} \]
the required bits of the injected significand $finj[-1:52]$ can be selected by
\[ finj[-1:51] = \begin{cases} usumi[-1:51] & \text{if } rinc \\ usum[-1:51] & \text{otherwise} \end{cases} \tag{4.143} \]
\[ finj[52] = \begin{cases} l_{[2,4[} & \text{if } cond_{[2,4[} \\ l_{[1,2[} & \text{otherwise} \end{cases} \tag{4.144} \]
to prepare the injection-based rounding mode reduction for the rounding modes $mode \in \{RZ, RNU, RI, RMI\}$.
To implement the IEEE rounding mode RNE instead of RNU, we have to consider the 'L-bit fix' for the case of a tie according to section 2.3.2: the least significant bit of the rounded significand has to be pulled down if the rounding operand lies exactly between two consecutive rounding choices in rounding mode RNE. We denote the condition that an 'L-bit fix' is required by $lfix_{[2,4[}$ for $cond_{[2,4[} = 1$ and by $lfix_{[1,2[}$ for $cond_{[2,4[} = 0$ with
\[ lfix_{[2,4[} \iff (fsum[52:115] = (1, 0^{63})) \text{ AND } srmode = RNE \tag{4.145} \]
\[ lfix_{[1,2[} \iff (fsum[53:115] = (1, 0^{62})) \text{ AND } srmode = RNE \tag{4.146} \]
Thus, we get (substitution of eq. 4.143-4.146 and eq. 4.128 in eq. 4.124-4.127)
\[ rnd_{srmode,51}(fsum) = \langle (finj[-1:50],\; finj[51] \wedge \overline{lfix_{[2,4[}},\; 0)\rangle_{neg} \tag{4.147} \]
\[ rnd_{srmode,52}(fsum) = \langle (finj[-1:51],\; l_{[1,2[} \wedge \overline{lfix_{[1,2[}})\rangle_{neg}. \tag{4.148} \]
According to definition 4.6 the rounded significand $rnd\_fsum = \langle rnd\_fsum[-1:52]\rangle_{neg}$ can be written as
\[ rnd\_fsum[-1:52] = \begin{cases} (finj[-1:50],\; finj[51] \wedge \overline{lfix_{[2,4[}},\; 0) & \text{if } cond_{[2,4[} \\ (finj[-1:51],\; l_{[1,2[} \wedge \overline{lfix_{[1,2[}}) & \text{otherwise.} \end{cases} \tag{4.149} \]
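The 'L-bit fix' turns a round-to-nearest-up result into a round-to-nearest-even result by clearing the least significant bit exactly in the tie case. The following Python sketch shows the idea on exact fractions. It is an illustration only; `p` stands in for the rounding position (51 or 52), and the function name `rnd_rne_via_rnu` is hypothetical.

```python
from fractions import Fraction
import math

def rnd_rne_via_rnu(x: Fraction, p: int) -> Fraction:
    """Round x >= 0 to nearest-even at position p via RNU plus L-bit fix."""
    scaled = x * 2 ** p
    # Nearest-up: inject half a unit in the last place, then truncate.
    n = math.floor(scaled + Fraction(1, 2))
    # Tie: the discarded part is exactly half a unit in the last place.
    lfix = (scaled - math.floor(scaled)) == Fraction(1, 2)
    if lfix:
        n &= ~1              # pull the least significant bit down
    return Fraction(n, 2 ** p)
```

In a tie the nearest-up result is the upper neighbor; if that neighbor is odd, clearing its last bit yields the even lower neighbor, and if it is even, clearing the bit changes nothing.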
The following lemma provides the missing details for the implementation of the rounding decision.
Lemma 4.14 Based on the definition of the sticky bit
\[ sticky = ORtree(fosum[54:115] \oplus seff), \]
the signals $c53$, $fsum[51:53]$, $lfix_{[1,2[}$, $lfix_{[2,4[}$, $r\_tinx$ and $r\_tinc$ can be computed by:
\[ c53 = (seff \wedge \overline{sticky}) \vee ((sticky \vee seff) \wedge (srmode = RI)) \]
\[ \langle fsum[51:53]\rangle_{neg} \equiv \langle sfosum[51:53]\rangle_{neg} + \langle cfosum[51]\rangle_{neg} + (seff \wedge \overline{sticky}) \cdot 2^{-53} \pmod{2^{-50}} \]
\[ lfix_{[1,2[} = fsum[53] \wedge \overline{sticky} \wedge (srmode = RNE) \]
\[ lfix_{[2,4[} = fsum[52] \wedge \overline{fsum[53]} \wedge \overline{sticky} \wedge (srmode = RNE) \]
\[ r\_tinx = sticky \vee \big((cond_{[2,4[} \wedge OR(fsum[52:53])) \vee (\overline{cond_{[2,4[}} \wedge fsum[53])\big) \]
\[ r\_tinc = \big(((finj[51] \wedge \overline{lfix_{[2,4[}}) \oplus fsum[51]) \wedge cond_{[2,4[}\big) \vee \big(((l_{[1,2[} \wedge \overline{lfix_{[1,2[}}) \oplus fsum[52]) \wedge \overline{cond_{[2,4[}}\big) \]
Proof: We first show that the sticky bit has the property:
\[ \overline{sticky} \iff (fsum[54:115] = 0^{62}) \tag{4.150} \]
\[ \phantom{\overline{sticky}} \iff (fsum \text{ is an integral multiple of } 2^{-53}). \tag{4.151} \]
To prove this, we distinguish two cases: (a) $seff = 0$ and (b) $seff = 1$. (a) For $seff = 0$, we get $fosum = fsum$, so that
\[ (fsum[54:115] = 0^{62}) \iff (fosum[54:115] = 0^{62}) \iff \overline{sticky}. \]
(b) For $seff = 1$, we have $fosum = fsum - 2^{-115}$, so that in this case $(fsum[54:115] = 0^{62}) \iff (fosum[54:115] = 1^{62}) \iff \overline{sticky}$, as required. Moreover, we can immediately conclude from equation 4.150 that $sticky \iff (\langle fsum[54:115]\rangle_{neg} \ge 2^{-115})$.
The carry bit $c53$ signals the condition $(lpart \ge 2^{-53})$. By definition
\[ lpart = \langle fosum[54:115]\rangle_{neg} + \langle inj[54:115]\rangle_{neg} + seff \cdot 2^{-115}. \]
The injection bits $inj[54:115]$ can only be either (i) $1^{62}$ for $srmode = RI$ or (ii) $0^{62}$ otherwise. (i) If $inj[54:115] = 1^{62}$, then
\[ (lpart \ge 2^{-53}) \iff ((fsum[54:115] \ne 0^{62}) \vee seff) \iff (sticky \vee seff). \]
(ii) If $inj[54:115] = 0^{62}$, then
\[ (lpart \ge 2^{-53}) \iff ((fosum[54:115] = 1^{62}) \wedge seff) \iff \overline{sticky} \wedge seff, \]
as required.
In the equation for $\langle fsum[51:53]\rangle_{neg}$, the carry from the lower part into position $[53]$ without considering an injection ($inj[54:115] = 0^{62}$) has to be used, namely $c53' = (\overline{sticky} \wedge seff)$. Thus, we get as required
\[ \langle fsum[51:53]\rangle_{neg} \equiv \langle sfosum[51:53]\rangle_{neg} + \langle cfosum[51]\rangle_{neg} + (\overline{sticky} \wedge seff) \cdot 2^{-53} \pmod{2^{-50}}. \]
The equations for the 'L-bit fix' conditions are the straightforward implementation of their definition from equations 4.145-4.146 using $\overline{sticky} \iff (fsum[54:115] = 0^{62})$.
The inexactness rounding tag $r\_tinx$ (equation 2.56) can be written as
\[ r\_tinx = \begin{cases} OR(fsum[52:115]) & \text{if } cond_{[2,4[} \\ OR(fsum[53:115]) & \text{otherwise.} \end{cases} \]
By the substitution of $sticky \iff (fsum[54:115] \ne 0^{62}) \iff OR(fsum[54:115])$ we get as required
\[ r\_tinx = sticky \vee \big((cond_{[2,4[} \wedge OR(fsum[52:53])) \vee (\overline{cond_{[2,4[}} \wedge fsum[53])\big). \]
According to equation 2.57 the increment rounding tag $r\_tinc$ can be written as
\[ r\_tinc = \begin{cases} (rnd\_fsum[51] \ne fsum[51]) & \text{if } cond_{[2,4[} \\ (rnd\_fsum[52] \ne fsum[52]) & \text{otherwise.} \end{cases} \]
We get the required form of this equation by the substitution of $rnd\_fsum[51]$ and $rnd\_fsum[52]$ according to equation 4.149. □
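The sticky bit and the carry $c53$ can be cross-checked against their arithmetic definitions on a scaled-down lower part. The following Python sketch is an illustration and not part of the thesis: the parameter `low` stands in for the 62 bits of positions $[54:115]$, the flag `is_ri` models $(srmode = RI)$, and both function names are hypothetical.

```python
def sticky_from_fosum(fosum_low: int, seff: int, low: int = 62) -> int:
    """sticky = ORtree(fosum[54:115] XOR seff) on a `low`-bit lower part."""
    mask = (1 << low) - 1
    flipped = (fosum_low ^ (mask if seff else 0)) & mask
    return int(flipped != 0)

def c53_from_sticky(sticky: int, seff: int, is_ri: int) -> int:
    """Carry out of the lower part as given in lemma 4.14."""
    return (seff & (1 - sticky)) | ((sticky | seff) & is_ri)

def c53_reference(fsum_low: int, seff: int, is_ri: int, low: int = 62) -> int:
    """Carry obtained by actually adding lower part, injection and seff."""
    mask = (1 << low) - 1
    fosum_low = (fsum_low - seff) & mask      # fosum = fsum - seff * 2**-115
    inj_low = mask if is_ri else 0            # injection bits [54:115]
    return int(fosum_low + inj_low + seff >= (1 << low))
```

The wrap-around in `c53_reference` models the borrow that $fosum$ takes from the upper positions when $seff = 1$ and the lower part of $fsum$ is zero.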
Lemma 4.15 In the rounding computations, the condition on the range of the significand sum $cond_{[2,4[} \iff (fsum \in [2, 4[)$ can be substituted by $usum[-1]$, so that the rounding increment decision $rinc$ is given by
\[ rinc \iff \big((c51_{[2,4[} \wedge usum[-1]) \text{ OR } (c51_{[1,2[} \wedge \overline{usum[-1]})\big), \tag{4.152} \]
the rounded significand $rnd\_fsum[-1:52]$ can be selected by
\[ rnd\_fsum[-1:52] = \begin{cases} (finj[-1:50],\; finj[51] \wedge \overline{lfix_{[2,4[}},\; 0) & \text{if } usum[-1] \\ (finj[-1:51],\; l_{[1,2[} \wedge \overline{lfix_{[1,2[}}) & \text{otherwise,} \end{cases} \tag{4.153} \]
and the rounding tags $tinc$ and $tinx$ can be computed according to
\[ r\_tinx = sticky \vee \big((usum[-1] \wedge OR(fsum[52:53])) \vee (\overline{usum[-1]} \wedge fsum[53])\big) \tag{4.154} \]
\[ r\_tinc = \big(((finj[51] \wedge \overline{lfix_{[2,4[}}) \oplus fsum[51]) \wedge usum[-1]\big) \;\vee \tag{4.155} \]
\[ \phantom{r\_tinc =} \big(((l_{[1,2[} \wedge \overline{lfix_{[1,2[}}) \oplus fsum[52]) \wedge \overline{usum[-1]}\big). \tag{4.156} \]
Proof: Because $usum + \langle sfosum[52:115]\rangle_{neg} + seff \cdot 2^{-115} = fsum$ and because $\langle sfosum[52:115]\rangle_{neg} + seff \cdot 2^{-115} \le 2^{-51}$, the values of $usum$ and $fsum$ differ by at most $2^{-51}$ with $fsum \ge usum$. Thus, the values $usum[-1]$ and $cond_{[2,4[}$ differ iff $seff = 1$, $fsum[-1:51] = (1, 0^{52})$, $usum[-1:51] = (0, 1^{52})$, $usumi[-1:51] = (1, 0^{52})$, and $sfosum[52:115] = 1^{64}$. In this situation, we have $usum[-1] = 0$ and $cond_{[2,4[} = 1$. Moreover, it follows that $mpart \ge 2^{-51}$, so that $c51_{[2,4[} = c51_{[1,2[} = 1$, and the incremented upper sum $usumi[-1:51] = (1, 0^{52})$ is selected for both range conditions $usum[-1]$ and $cond_{[2,4[}$. Thus, it does not matter which of them is chosen for the selection of $rnd\_fsum[-1:50]$ in the case that $usum[-1] \ne cond_{[2,4[}$.
We continue to consider the case $usum[-1] \ne cond_{[2,4[}$. Because $rinc = 1$, we have $finj[51] = usumi[51] = 0$, and because of $sfosum[52:53] = 1^{2}$, $c53 = 1$ and $inj_{[1,2[}[52] = 0$, we get from equation 4.138 that $l_{[1,2[} = 0$. Thus, it follows from equation 4.149 that the selection of $rnd\_fsum[51:52] = 0^{2}$ is also independent of the value of $cond_{[2,4[}$ and $usum[-1]$.
We still have to show that we also get the same rounding tags for both range detections. From the above we know that, in the case we have to consider, $fsum[51:53] = 0^{3}$. Thus, according to the equation for $r\_tinx$ from lemma 4.14, we get in this case $r\_tinx = sticky$ independent of the value of $cond_{[2,4[}$ and $usum[-1]$.
Because $inj_{[2,4[} < 2^{-51}$, $inj_{[1,2[} < 2^{-52}$ and $finj = fsum + inj_X$, we have $finj[51] = 0 = l_{[1,2[}$, so that according to the equation from lemma 4.14 the rounding tag $r\_tinc = 0$ also does not depend on the value of $cond_{[2,4[}$ and $usum[-1]$ in this case.
Thus, as required, the substitutions of $cond_{[2,4[}$ by $usum[-1]$ in the equations of this lemma do not change the results of these equations. □
The following lemma integrates the rounding computations according to equations 4.143-4.144 and 4.153 with the generalized post-normalization shift according to equation 4.122 to compute the final results of the 'R'-path:
Lemma 4.16 In the 'R'-path, the significand and exponent bits are given by
\[ r\_f[0:51] = \begin{cases} rnd\_fsum[-1:50] & \text{if } rnd\_fsum[-1] \\ rnd\_fsum[0:51] & \text{otherwise.} \end{cases} \]
\[ r\_f[52] = \begin{cases} l'(inc) = usumi[51] \wedge \overline{lfix_{[2,4[}} & \text{if } rnd\_fsum[-1] \wedge rinc \\ l'(ninc) = usum[51] \wedge \overline{lfix_{[2,4[}} & \text{if } rnd\_fsum[-1] \wedge \overline{rinc} \\ l = l_{[1,2[} \wedge \overline{lfix_{[1,2[}} & \text{if } \overline{rnd\_fsum[-1]} \end{cases} \]
\[ \langle r\_e[11:0]\rangle_2 = \begin{cases} \langle e_1[11:0]\rangle_2 + 1 & \text{if } rnd\_fsum[-1] \\ \langle e_1[11:0]\rangle_2 & \text{otherwise.} \end{cases} \]
Proof: Because $rnd\_fsum \in [2, 4[$ iff $rnd\_fsum[-1]$, the equations for $r\_f[0:51]$ and $r\_e[11:0]$ are straightforward implementations of definition 4.6 and definition 4.5 using the previous rounding description from equations 4.143-4.144 and equation 4.153. Because $rnd\_fsum \ge usum$, it follows from $rnd\_fsum[-1] = 0$ that also $usum[-1] = 0$ and, thus, for the case of $rnd\_fsum[-1] = 0$ we have $rnd\_fsum[52] = (l_{[1,2[} \wedge \overline{lfix_{[1,2[}})$. Therefore, according to equation 4.15
\[ r\_f[52] = \begin{cases} finj[51] \wedge \overline{lfix_{[2,4[}} & \text{if } rnd\_fsum[-1] \\ l_{[1,2[} \wedge \overline{lfix_{[1,2[}} & \text{if } \overline{rnd\_fsum[-1]}, \end{cases} \]
so that by the substitution of $finj[51]$ with respect to the value of $rinc$ according to equation 4.143, we get the equation for $r\_f[52]$ from the lemma. □
In the following we summarize the computation steps in the 'R'-path:
1.-2. computation of the limited absolute exponent difference $deltalim$, the sign of the exponent difference $sdelta$ and the operand swapping like in the previous section.
3. significand one's complement negation of $fs$ for effective subtractions (equation 4.109):
$$fso[-1:52] = fs[-1:52] \oplus seff.$$
4. alignment shift of $fso[-1:52]$ by $deltalim$ positions (equation 4.113):
$$fsoa[-1:115] = (seff^{deltalim}, fso[-1:52], seff^{63-deltalim}) \qquad (4.157)$$
$$\phantom{fsoa[-1:115]} = rsft(fso[-1:52], deltalim, seff, seff) \qquad (4.158)$$
5. significand addition: (a) compression of positions $[-1:52]$ by a half-adder line
$$\langle sfosum[-1:115]\rangle_{neg} + \langle cfosum[-1:51]\rangle_{neg} = \langle fl[-1:52]\rangle_{neg} + \langle fsoa[-1:115]\rangle_{neg}$$
and (b) computation of the upper sum $usum[-1:51]$ and the incremented upper sum $usumi[-1:51]$ by a compound adder (equation 4.141):
$$\langle usum[-1:51]\rangle_{neg} = \langle sfosum[-1:51]\rangle_{neg} + \langle cfosum[-1:51]\rangle_{neg}$$
$$\langle usumi[-1:51]\rangle_{neg} = \langle usum[-1:51]\rangle_{neg} + 2^{-51}$$
8. rounding decisions: computation of $rinc$, $r\_tinx$, $r\_tinc$, $l'(ninc) = usum[51] \wedge lfix_{[2,4[}$, $l'(inc) = usumi[51] \wedge lfix_{[2,4[}$ and $l = l_{[1,2[} \wedge lfix_{[1,2[}$ in the rounding decision circuit according to lemma 4.15, lemma 4.14 and equations 4.138-4.139 from the inputs $sfosum[51:53]$, $cfosum[51]$, $seff$, $sl$, $sticky$, $rmode[1:0]$, $usum[-1]$, $usum[51]$ and $usumi[51]$. This 'rounding decisions' circuit is depicted in detail in figure 4.14; only three small parts of it need additional explanation:
- the 'inj generation' circuit implements the rounding mode reduction (according to equation 2.6-2.6) and the generation of the injection bits: $inj_{[2,4[}[52] = inj_{[1,2[}[53] = OR(sr\_mode[1:0])$, $inj_{[2,4[}[53] = sr\_mode[1]$ and $inj_{[1,2[}[52] = 0$.
- the 'Carry lower part' circuit computes the carries from the lower part, $c53$ according to lemma 4.14 with $(srmode = RI) \iff sr\_mode[1]$, and $c53' = (sticky \wedge seff)$.
- the 'lfix' circuit implements the equations for $lfix_{[2,4[}$ and $lfix_{[1,2[}$ according to lemma 4.14 with $(srmode = RNE) \iff sr\_mode[0]$.
Figure 4.12: Structure of the 'R'-path (first part) of the addition/subtraction unit II.
rounding selections (equation 4.143 and lemma 4.16):
$$finj[-1:51] = \begin{cases} usumi[-1:51] & \text{if } rinc \\ usum[-1:51] & \text{otherwise,} \end{cases} \qquad l' = \begin{cases} l'(inc) & \text{if } rinc \\ l'(ninc) & \text{otherwise.} \end{cases}$$
9. post-normalization shift of the rounded significand (lemma 4.16 and equation 4.153):
$$r\_f[0:52] = \begin{cases} (finj[-1:50], l') & \text{if } rnd\_fsum[-1] \\ (finj[0:51], l) & \text{otherwise.} \end{cases}$$
10. According to lemma 4.13, the sign of the 'R'-path is given by $r\_s = sl$, which we compute like in the previous section. The exponent of the 'R'-path is based on $e_1 = \langle e_1[11:0]\rangle_2$, which is computed like in the previous section, and on the selection according to lemma 4.16:
$$\langle r\_e[11:0]\rangle_2 = \begin{cases} \langle e_1[11:0]\rangle_2 + 1 & \text{if } rnd\_fsum[-1] \\ \langle e_1[11:0]\rangle_2 & \text{otherwise.} \end{cases}$$
Moreover, during the computation of the exponent difference, we compute the condition $is\_r1 \iff (|ea - eb| > 1) \iff ORtree((deltaovf, abs\_delta[5:1]))$, which will be used later for the selection of the valid path.
The implementation is depicted in two parts in figures 4.12 and 4.13. Additionally, a more detailed block diagram of the 'Rounding decisions' circuit is shown in figure 4.14. This completes the description of the 'R'-path of the addition/subtraction II unit.
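The rounding selection and the post-normalization shift of steps 8 to 10 can be sketched in software as follows. This is only an illustrative bit-level model, not the circuit itself: the function name `r_path_finish` and the packing of bit `[-1]` into integer bit 52 are assumptions made for this sketch, and the selected sum bit `finj[-1]` is assumed to stand in for `rnd_fsum[-1]`.

```python
def r_path_finish(usum, usumi, rinc, l_inc, l_ninc, l, e1):
    """Select the rounded sum and apply the post-normalization shift.

    usum, usumi      : integers holding bits [-1:51]; bit [-1] (weight 2) is
                       stored at integer bit 52, bit [51] at integer bit 0.
    rinc             : rounding decision (1 selects the incremented sum).
    l_inc, l_ninc, l : candidate values for the result bit r_f[52].
    e1               : exponent before the post-normalization shift.
    Returns (r_f, r_e); r_f holds bits [0:52] with bit [0] at integer bit 52.
    """
    finj = usumi if rinc else usum              # rounding selection
    l_prime = l_inc if rinc else l_ninc
    if (finj >> 52) & 1:                        # finj[-1] set: sum in [2, 4[
        r_f = ((finj >> 1) << 1) | l_prime      # (finj[-1:50], l')
        r_e = e1 + 1
    else:                                       # sum in [1, 2[
        r_f = ((finj & ((1 << 52) - 1)) << 1) | l   # (finj[0:51], l)
        r_e = e1
    return r_f, r_e
```

In hardware both candidate significands exist in parallel and a multiplexer picks one; the `if` above plays the role of that multiplexer.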
Figure 4.13: Structure of the 'R'-path (second part) of the addition/subtraction unit II.
'N'-path The computations in the 'N'-path are described on the basis of the adder implementation from the previous section, which is optimized regarding the specific properties of the 'N'-path. As discussed above (see lemma 4.12(b), (c)), we can use for the 'N'-path: $seff = 1$ and $\delta \in \{-1, 0, 1\}$. Additionally, because of lemma 4.12(d) the gradual rounding has no effect, so that the required factoring in the 'N'-path according to equation 4.106 and definition 4.3 can be written as
$$((n\_s, n\_e, n\_f), n\_tinc, n\_tinx) = ((sl \oplus sfsum, e_1, abs\_fsum), 00). \qquad (4.159)$$
Because $\delta \in \{-1, 0, 1\}$, the exponent difference can be represented by the two bits $delta[1:0]$ with $\langle delta[1:0]\rangle_2 = \delta = ea - eb = \langle ea[1:0]\rangle_2 + \langle \overline{eb[1:0]}\rangle_2 + 1$, where the two bits $delta[1:0]$ can already be interpreted as the sign-magnitude representation of $\delta$, so that $abs\_delta = delta[0]$ and $sdelta = delta[1]$.
Because for $\delta \in \{-1, 0, 1\}$ the bit combination $delta[1:0] = 10$ cannot occur, the alignment shift can be integrated with the swapping and the unconditional pre-shift into the following selections:
$$fsoa[-2:52] = \begin{cases} (1, \overline{fb[0:52]}, 1) & \text{if } \overline{delta[1]} \text{ AND } \overline{delta[0]} \\ (11, \overline{fb[0:52]}) & \text{if } \overline{delta[1]} \text{ AND } delta[0] \\ (11, \overline{fa[0:52]}) & \text{if } delta[1], \end{cases}$$
$$(sl, el[11:0], fl[-2:52]) = \begin{cases} (sb, eb[11:0], 0, fb[0:52], 0) & \text{if } delta[1] \\ (sa, ea[11:0], 0, fa[0:52], 0) & \text{otherwise.} \end{cases}$$
Thus, $\langle fsoa[-2:52]\rangle_{2neg} = \langle fsa[-2:52]\rangle_{2neg} - 2^{-52}$, and $\langle fosum[-2:52]\rangle_{2neg} = \langle fsum[-2:52]\rangle_{2neg} - 2^{-52}$.
Figure 4.14: Implementations of the 'rounding decision circuit' in the 'R'-path of the addition/subtraction unit II.
Because of $\langle \overline{fosum[-2:52]}\rangle_{2neg} = -(\langle fosum[-2:52]\rangle_{2neg} + 2^{-52}) = -fsum$, we can get the binary representation of the absolute significand sum $abs\_fsum = \langle abs\_fsum[-1:52]\rangle_{neg}$ with a compound adder that computes the sum $fosum = \langle fosum[-2:52]\rangle_{2neg}$ and the incremented sum $fosumi = \langle fosumi[-2:52]\rangle_{2neg} = fosum + 2^{-52}$ by the selection
$$abs\_fsum[-1:52] = \begin{cases} \overline{fosum[-1:52]} & \text{if } fosumi[-2] \\ fosumi[-1:52] & \text{otherwise.} \end{cases}$$
In this way, we get the factoring $(sl \oplus fosumi[-2], \langle el[11:0]\rangle_2 - 1, \langle abs\_fsum[-1:52]\rangle_{neg})$, which already has the value of the 'N'-path result, but according to equation 4.120 we still have to compute an unbounded normalization shift on this factoring.
Denition 4.7 We dene the term of an imprecise normalized factoring for factorings
(s
ipn
; e
ipn
; f
ipn
), whose signicand fulllls the condition f
ipn
2 [1; 4[. An operation ipnorm,
that computes an imprecise normalized factoring (s
ipn
; e
ipn
; f
ipn
) = ipnorm(s; e; f) with
val(s
ipn
; e
ipn
; f
ipn
) = val(s; e; f) for an arbitrary non-zero factoring (s; e; f), is called im-
precise normalization shift. Note, that if lz is the shift distance of an unbounded normaliza-
tion shift, then an imprecise normalization shift uses one of the shift distances flz; lz+1g.
Obviously, an unbounded normalization shift  can be partitioned into an imprecise
normalization shift followed by a generalized post-normalization shift:
(s; e; f) = gpost(ipnorm(s; e; f)):
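The partitioning can be illustrated with a small value-level model. The sketch below uses exact rationals instead of bit strings; the helper names `lz_distance`, `ipnorm` and `gpost` and the parameter `extra` (which picks one of the two admissible shift distances) are assumptions made for this illustration only.

```python
from fractions import Fraction

def val(s, e, f):
    """Value of the factoring (s, e, f)."""
    return (-1) ** s * f * Fraction(2) ** e

def lz_distance(f):
    """Shift distance lz with f * 2**lz in [1, 2[ (negative if f >= 2)."""
    lz = 0
    while f * Fraction(2) ** lz < 1:
        lz += 1
    while f * Fraction(2) ** lz >= 2:
        lz -= 1
    return lz

def ipnorm(s, e, f, extra):
    """Imprecise normalization: shift by lz or lz + 1 (extra in {0, 1})."""
    d = lz_distance(f) + extra
    return s, e - d, f * Fraction(2) ** d

def gpost(s, e, f):
    """Generalized post-normalization: bring f from [1, 4[ into [1, 2[."""
    return (s, e + 1, f / 2) if f >= 2 else (s, e, f)

# Both admissible shift distances lead to the same normalized factoring.
for extra in (0, 1):
    print(gpost(*ipnorm(0, 0, Fraction(3, 16), extra)))
```

Whichever of the two shift distances the imprecise step picks, the post-normalization step absorbs the difference, which is what allows the shift amount to be estimated rather than computed exactly.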
Figure 4.15: Structure of the 'N'-path in the addition/subtraction unit II.
As suggested in the previous definition, we partition the computation of the normalization shift into a first step of an imprecise normalization shift followed by a second step of a generalized post-normalization shift. The advantage of this approach is that the shift amount for the imprecise normalization shift can already be determined from the carry-save representation of the significand sum in parallel to the significand addition, so that we save the delay of the slow serial leading-zero computation after the significand addition which was used in the previous section.
The generalized post-normalization shift is computed like in the 'R'-path. The only difference between the implementation of the imprecise normalization shift and the conventional normalization shift from the previous section is the computation of the shift amount. Therefore, we focus on the description of the shift-amount computation in the following. For this purpose we require some notations and techniques from [8]. We summarize them in the next definitions and lemmas in preparation for our leading-zero estimation.
Denition 4.8 In a Borrow-Save representation, a number is represented by two binary
strings: we call the tupel (a[n
1
:n
2
];b[n
1
:n
2
]) with a positively weigted bit string a[n
1
:n
2
]
and a negatively weigted bit string b[n
1
:n
2
] a Borrow-Save representation of the number c,
i c = < a[n
1
:n
2
] >
neg
  < b[n
1
:n
2
] >
neg
. To annote that a is the positively weighted
bit-string and b is the negatively weighted bit string, we also write a
+
and b
 
. The digits
[i] = a[i]   b[i] 2 f 1; 0; 1g are called Borrow-Save digits and we denote the value of a
string of Borrow-Save digits [n
1
:n
2
] 2 f 1; 0; 1g
n
2
 n
1
by < [n
1
:n
2
] >
bs
= c. We also
write:
c = < [n
1
:n
2
] >
bs
=

a
+
[n
1
:n
2
]
b
+
[n
1
:n
2
]

bs
=

a[n
1
];a[n
1
+ 1];    ;a[n
2
]
b[n
1
];b[n
1
+ 1];    ;b[n
2
]

bs
:
For  2 [n
1
: n
2
], the fraction of a Borrow-Save representation [n
1
:n
2
] 2 f 1; 0; 1g
n
2
 n
1
at position  is dened by:
fract

([n
1
:n
2
]) = 2


n
2
X
j=+1
[j]  2
 j
= [+1]  2
 1
+ [+2]  2
 2
+   + [n
2
]  2
 n
2
+
:
The fraction range of a Borrow-Save representation [n
1
:n
2
] 2 f 1; 0; 1g
n
2
 n
1
is dened
by the interval FRANGE([n
1
: n
2
]) = [a; b], with a = minffract

([n
1
: n
2
])j 2 [n
1
:
n
2
]g and b = maxffract

([n
1
:n
2
])j 2 [n
1
: n
2
]g. Obviously, for arbitrary Borrow-Save
representations [n
1
:n
2
] 2 f 1; 0; 1g
n
2
 n
1
, we have FRANGE([n
1
:n
2
])  ]  1; 1[.
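As a concrete reading of this definition, the value, the fraction at a position and the fraction range can be computed as in the following sketch. The list layout (list index `k` holds position `n1 + k`) and the function names are assumptions chosen for the illustration.

```python
from fractions import Fraction

def bs_value(a, b, n1):
    """Value of a Borrow-Save string; list index k has weight 2**-(n1 + k)."""
    return sum((x - y) * Fraction(2) ** -(n1 + k)
               for k, (x, y) in enumerate(zip(a, b)))

def fract(digits, i):
    """Fraction at list index i: sum over j > i of digits[j] * 2**(i - j)."""
    return sum(Fraction(d, 2 ** (j - i)) for j, d in enumerate(digits) if j > i)

def frange(digits):
    """Smallest and largest fraction over all positions of the string."""
    vals = [fract(digits, i) for i in range(len(digits))]
    return min(vals), max(vals)

a, b = [1, 0, 1], [0, 1, 0]                 # Borrow-Save digits 1, -1, 1
digits = [x - y for x, y in zip(a, b)]
print(bs_value(a, b, 0), frange(digits))
```

Since each digit lies in {-1, 0, 1} and the weights form a finite geometric series, every fraction stays strictly inside ]-1, 1[, as stated at the end of the definition.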
In the following denition we introduce the 'P'-carry and the 'N'-carry-recoding that will
be used for the compression of the fraction range in our leading-zero estimation.
Denition 4.9 The 'P'-carry-recoding computes from a Borrow-Save representation B =
(a[n
1
:n
2
];b[n
1
:n
2
]), the Borrow-Save representation B
0
=P (B)=(a
0
[n
1
 1:n
2
];b
0
[n
1
 1:n
2
]),
where for all  2 [n
1
:n
2
]:
Carry: a
0
[+1] = a[] ^ b[] Residual: b
0
[] = a[] b[]:
The 'N'-carry-recoding computes from a Borrow-Save representation B = (a[n
1
:n
2
];b[n
1
:
n
2
]), the Borrow-Save representation B
0
= N(B) = (a
0
[n
1
 1:n
2
];b
0
[n
1
 1:n
2
]) where for
all  2 [n
1
:n
2
]:
Carry: b
0
[+1] = a[] ^ b[] Residual: a
0
[] = a[] b[]:
The following lemma shows some properties of 'P'-carry- and 'N'-carry-recodings:
Lemma 4.17 This lemma consists of 4 parts:
(a) Both 'P'-carry- and 'N'-carry-recoding do not change the value of a Borrow-Save representation, namely: $\langle B\rangle_{bs} = \langle P(B)\rangle_{bs} = \langle N(B)\rangle_{bs}$.
(b) The 'P'-carry-recoding compresses the fraction range $FRANGE(B) \subseteq\ ]a, b[$ of a Borrow-Save representation $B$ to $FRANGE(P(B)) \subseteq\ ]-1/2 + a/2, b/2[$.
(c) The 'N'-carry-recoding compresses the fraction range $FRANGE(B) \subseteq\ ]a, b[$ of a Borrow-Save representation $B$ to $FRANGE(N(B)) \subseteq\ ]a/2, 1/2 + b/2[$.
(d) 'PN'-recoding reduces the fraction range of an arbitrary Borrow-Save representation $B$ to $FRANGE(P(N(B))) \subseteq\ ]-3/4, 1/2[$.
Proof: (a) There are only 4 possible bit combinations for the Borrow-Save digit at position $i$ by the two bits $a[i]$ and $b[i]$. These bit combinations encode the 3 possible values of a Borrow-Save digit as summarized in table 4.1. After 'P'-carry-recoding, this Borrow-Save digit is represented by the carry $a'[i-1]$ and the residual $b'[i]$, and we can read off from table 4.1 that the 'P'-recoding equations exactly fulfill the equation $a[i] - b[i] = 2a'[i-1] - b'[i]$, so that $\langle B\rangle_{bs} = \langle P(B)\rangle_{bs}$. Accordingly, the 'N'-carry-recoding represents the two bits $a[i]$ and $b[i]$ by the carry $b'[i-1]$ and the residual $a'[i]$ and implements the equation $a[i] - b[i] = a'[i] - 2b'[i-1]$ (see table 4.1), so that also $\langle B\rangle_{bs} = \langle N(B)\rangle_{bs}$.
(b) With the BS representations $B$ and $P(B)$:
$$B = \begin{pmatrix} a[n_1], a[n_1+1], \ldots, a[n_2] \\ b[n_1], b[n_1+1], \ldots, b[n_2] \end{pmatrix} \quad\text{and}\quad P(B) = \begin{pmatrix} a'[n_1-1], a'[n_1], \ldots, a'[n_2] \\ 0, b'[n_1], \ldots, b'[n_2] \end{pmatrix},$$
we obtain by extracting a term from the radix polynomial of $P(B)$ for $i \in [n_1:n_2]$:
$$\begin{aligned} fract_i(P(B)) &= fract_i \begin{pmatrix} a'[i+1], a'[i+2], \ldots, a'[n_2] \\ b'[i+1], b'[i+2], \ldots, b'[n_2] \end{pmatrix} \\ &= -b'[i+1] \cdot 2^{-1} + fract_i \begin{pmatrix} a'[i+1], a'[i+2], \ldots, a'[n_2] \\ 0, b'[i+2], \ldots, b'[n_2] \end{pmatrix} \\ &= -b'[i+1] \cdot 2^{-1} + fract_i \begin{pmatrix} 0, a[i+2], \ldots, a[n_2] \\ 0, b[i+2], \ldots, b[n_2] \end{pmatrix} \\ &= -b'[i+1] \cdot 2^{-1} + \tfrac{1}{2}\, fract_{i+1} \begin{pmatrix} a[i+2], a[i+3], \ldots, a[n_2] \\ b[i+2], b[i+3], \ldots, b[n_2] \end{pmatrix} \\ &= -b'[i+1] \cdot 2^{-1} + \tfrac{1}{2}\, fract_{i+1}(B). \end{aligned}$$
Since $-b'[i+1] \cdot 2^{-1} \in \{-\tfrac{1}{2}, 0\}$ and $fract_{i+1}(B) \in\ ]a, b[$, we obtain $fract_i(P(B)) \in\ ]-1/2 + a/2, b/2[$ for all $i \in [n_1:n_2]$, so that $FRANGE(P(B)) \subseteq\ ]-1/2 + a/2, b/2[$, as required. Part (c) can be proven in analogy to part (b).
(d) Starting from the fraction range $FRANGE(B) \subseteq\ ]-1, 1[$ of an arbitrary Borrow-Save representation $B$, the use of part (c) and then part (b) of this lemma directly yields $FRANGE(P(N(B))) \subseteq\ ]-3/4, 1/2[$, as required. $\Box$
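The recoding equations and the resulting fraction range can be checked exhaustively for short strings. The sketch below is a software model under stated assumptions: bit strings are Python lists with the most significant position first, the last entry is given weight $2^0$ so that values are integers, and the names `p_recode`, `n_recode` and `check` are illustrative. The order of application follows the proof of part (d), with the 'N'-carry-recoding applied first.

```python
from fractions import Fraction
from itertools import product

def p_recode(a, b):
    """'P'-carry-recoding; the result has one extra leading position."""
    carry = [x & (1 - y) for x, y in zip(a, b)]   # a'[i-1] = a[i] AND NOT b[i]
    resid = [x ^ y for x, y in zip(a, b)]         # b'[i]   = a[i] XOR b[i]
    return carry + [0], [0] + resid

def n_recode(a, b):
    """'N'-carry-recoding; the result has one extra leading position."""
    carry = [(1 - x) & y for x, y in zip(a, b)]   # b'[i-1] = NOT a[i] AND b[i]
    resid = [x ^ y for x, y in zip(a, b)]         # a'[i]   = a[i] XOR b[i]
    return [0] + resid, carry + [0]

def value(a, b):
    """Integer value, with the last list entry weighted 2**0."""
    n = len(a)
    return sum((x - y) << (n - 1 - k) for k, (x, y) in enumerate(zip(a, b)))

def fract(digits, i):
    """Fraction at list index i: sum over j > i of digits[j] * 2**(i - j)."""
    return sum(Fraction(d, 2 ** (j - i)) for j, d in enumerate(digits) if j > i)

def check(n):
    """Check value preservation and the range ]-3/4, 1/2[ for all n-digit inputs."""
    for bits in product((0, 1), repeat=2 * n):
        a, b = list(bits[:n]), list(bits[n:])
        assert value(*p_recode(a, b)) == value(a, b)
        assert value(*n_recode(a, b)) == value(a, b)
        ra, rb = p_recode(*n_recode(a, b))
        digits = [x - y for x, y in zip(ra, rb)]
        assert all(Fraction(-3, 4) < fract(digits, i) < Fraction(1, 2)
                   for i in range(len(digits)))
    return True

print(check(4))
```

Each recoding is a single level of two-input gates per position, so the compression of the fraction range is obtained without any carry propagation.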
The following lemma describes the application of 'PN'-recoding for the imprecise normalization shift of the 'N'-path.
Borrow-Save representation              'P'-carry-recoding    'N'-carry-recoding
a+[i]  b-[i]  a+[i] - b-[i] = β[i]      a'[i-1]  b'[i]        b'[i-1]  a'[i]
0      1      -1                        0        1            1        1
0      0       0                        0        0            0        0
1      1       0                        0        0            0        0
1      0       1                        1        1            0        1
Table 4.1: Summary of the cases in the 'P'-carry and the 'N'-carry-recoding.
Lemma 4.18 With the computation of
$$\beta_{fsum}[-4:52] = \begin{pmatrix} a^+_{fsum}[-4:52] \\ b^-_{fsum}[-4:52] \end{pmatrix} = P\left(N\begin{pmatrix} fl[-2:52] \\ NOT(fsoa[-2:52]) \end{pmatrix}\right)$$
$$lzp1[5:0] = penc\left(a^+_{fsum}[-1:52] \oplus b^-_{fsum}[-1:52]\right)$$
$$lzp2[5:0] = penc\left(a^+_{fsum}[-2:52] \oplus b^-_{fsum}[-2:52]\right)$$
$$lzp[5:0] = \begin{cases} lzp1[5:0] & \text{if } sfsum \\ lzp2[5:0] & \text{otherwise} \end{cases}$$
$$in\_fsum[-1:52] = cls(abs\_fsum[-1:52], \langle lzp[5:0]\rangle_2)$$
$$\langle in\_e[11:0]\rangle_2 = \langle el[11:0]\rangle_2 + \langle (111111, \overline{lzp[5:0]})\rangle_2$$
we get the imprecisely normalized factoring
$$(n\_s, in\_e, in\_fsum) = (n\_s, \langle in\_e[11:0]\rangle_2, \langle in\_fsum[-1:52]\rangle_{neg}),$$
so that $val(n\_s, in\_e, in\_fsum) = val(n\_s, \langle el[11:0]\rangle_2 - 1, \langle abs\_fsum[-1:52]\rangle_{neg})$ and $in\_fsum \in [1, 4[$.
Proof: Because $fsum = \langle fl[-2:52]\rangle_{neg} - \langle \overline{fsoa[-2:52]}\rangle_{neg}$ and because of lemma 4.17(a), we get $\langle \beta_{fsum}\rangle_{bs} = fsum$. For all $fsum \neq 0$, the Borrow-Save representation $\beta_{fsum}[-4:52]$ includes at least one non-zero digit, so that for a $k \in [-4:52]$ it has the form $\beta_{fsum}[-4:52] = (0^{k+4}, \beta_{fsum}[k:52])$ with $\beta_{fsum}[k] \in \{-1, 1\}$.
By lemma 4.17(d) we obtain the fraction range $FRANGE(\beta_{fsum}[-4:52]) \subseteq\ ]-3/4, 1/2[$. For the determination of the range of $abs\_fsum$ from $\beta_{fsum}[k:52]$ and the fraction range, we distinguish between two cases: (a) $\beta_{fsum}[k] = 1$; and (b) $\beta_{fsum}[k] = -1$.
(a) From $\beta_{fsum}[k] = 1$ and $fract_k(\beta_{fsum}[k+1:52]) \in\ ]-3/4, 1/2[$, it follows that $fract_{k-1}(\beta_{fsum}[k:52]) \in\ ]1/2 - 3/8, 1/2 + 1/4[\ =\ ]1/8, 3/4[$. Moreover, the fraction range is also valid for the fraction at position $k-1$, so that $fract_{k-1}(\beta_{fsum}[k:52]) \in\ ]1/8, 3/4[\ \cap\ ]-3/4, 1/2[\ =\ ]1/8, 1/2[$. Thus, $2^{k+2} \cdot abs\_fsum \in\ ]1, 4[$. According to lemma 4.12, we assume in the 'N'-path that $fsum < 1$. Thus, we define $lzp2 = (k+2)$ and get for case (a), $lzp2 = (k+2) \geq 0$.
(b) Correspondingly, for $\beta_{fsum}[k] = -1$, we get $fract_{k-1}(\beta_{fsum}[k:52]) \in\ ]-\tfrac{1}{2} - \tfrac{3}{8}, -\tfrac{1}{2} + \tfrac{1}{4}[\ \cap\ ]-\tfrac{3}{4}, \tfrac{1}{2}[\ =\ ]-\tfrac{3}{4}, -\tfrac{1}{4}[$. Thus, $2^{k+1} \cdot abs\_fsum \in\ ]1, 3[$. For this case we define $lzp1 = (k+1)$, so that we can derive from $abs\_fsum < 2$ that this leading zero prediction also has to be non-negative: $lzp1 = (k+1) \geq 0$.
Because the representation of a Borrow-Save digit by $a^+[i]$ and $b^-[i]$ is non-zero iff $a^+[i] \oplus b^-[i]$, and $lzp1, lzp2 \geq 0$, the number $lzp2 = k+2$ can be interpreted as the number of leading zeros in the string $\beta_{fsum}[-2:52]$, so that
$$lzp2 = \langle lzp2[5:0]\rangle = \langle penc(a^+[-2:52] \oplus b^-[-2:52])\rangle.$$
Accordingly, the number $lzp1 = lzp2 - 1 = k+1$ can be recognized as the number of leading zeros in the string $\beta_{fsum}[-1:52]$, so that
$$lzp1 = \langle lzp1[5:0]\rangle = k+1 = \langle penc(a^+[-1:52] \oplus b^-[-1:52])\rangle.$$
Obviously, the above case (a) occurs iff $fsum > 0$ and the above case (b) occurs iff $fsum < 0$. Because in case (a) $2^{lzp2} \cdot abs\_fsum \in\ ]1, 4[$, $lzp2 \geq 0$ and $abs\_fsum = \langle abs\_fsum[-1:52]\rangle_{neg}$, the significand $in\_fsum = 2^{lzp2} \cdot abs\_fsum$ is imprecisely normalized and can be represented by $in\_fsum[-1:52]$, and the multiplication of $abs\_fsum$ by $2^{lzp2}$ can be implemented by a left-shift of $abs\_fsum[-1:52]$ by $lzp2$ positions. Accordingly, in case (b) the significand $in\_fsum = 2^{lzp1} \cdot abs\_fsum$ is imprecisely normalized and can be represented by $in\_fsum[-1:52]$, and the multiplication of $abs\_fsum$ by $2^{lzp1}$ can be implemented by a left-shift of $abs\_fsum[-1:52]$ by $lzp1$ positions.
The definition of the leading zero prediction $lzp$ selects $lzp2$ for case (a) and $lzp1$ for case (b), so that the left-shift of $abs\_fsum[-1:52]$ by $lzp$ positions exactly computes the binary representation of the imprecisely normalized significand $in\_fsum = \langle in\_fsum[-1:52]\rangle_{neg}$. In this way $in\_fsum = abs\_fsum \cdot 2^{lzp}$.
The term $lzp$ is adjusted in the exponent by $in\_e = \langle in\_e[11:0]\rangle_2 = el - 1 - lzp$, so that $in\_fsum \cdot 2^{in\_e} = abs\_fsum \cdot 2^{el-1}$, as required by the lemma. $\Box$
Based on the results of this lemma, the 'N'-path result is computed from $sl$, $sfsum$, $in\_e[11:0]$ and $in\_fsum[-1:52]$ by the final generalized post-normalization shift:
$$(n\_s, n\_e, n\_f) = post\_norm(sl \oplus sfsum, \langle in\_e[11:0]\rangle_2, \langle in\_fsum[-1:52]\rangle_{neg})$$
$$= \begin{cases} (sl \oplus sfsum, \langle in\_e[11:0]\rangle_2 + 1, \langle in\_fsum[-1:51]\rangle_{neg}/2) & \text{if } in\_fsum[-1] \\ (sl \oplus sfsum, \langle in\_e[11:0]\rangle_2, \langle in\_fsum[0:52]\rangle_{neg}) & \text{otherwise.} \end{cases}$$
The incremented exponent is already precomputed with a compound adder during the exponent adjustment from lemma 4.18, so that the post-normalization shift can be realized by a simple selection depending on the value of $in\_fsum[-1]$. Additionally, we compute the condition $is\_r2 = fosumi[-2] \wedge (fosumi[-1] \vee fosumi[0])$, which will be used for the path selection. This completes the description of the 'N'-path, a block diagram of which is depicted in figure 4.15.
Path selection In the following we explain how the general path selection condition $is\_r$ is computed from the signals $is\_r1$, $is\_r2$ and $seff$. We start from the definition of the path selection condition according to lemma 4.3:
$$is\_r \iff \overline{seff} \vee (fsum \in [1, 4[)$$
$$\phantom{is\_r} \iff \overline{seff} \vee (fsum \in [1, 4[ \text{ under the assumption } seff = 1)$$
Because for $is\_r = 0$ we have $abs\_delta \leq 1$ according to lemma 4.12, which is signaled by $\overline{is\_r1}$, the above equation can be further extended to:
$$is\_r \iff \overline{seff} \vee is\_r1 \vee \left(fsum \in [1, 4[ \text{ under the assumption } seff \text{ and } \overline{is\_r1}\right)$$
Because the assumptions $seff = 1$ and $\overline{is\_r1}$ are exactly the assumptions that we use during the computation of $fsum$ in the 'N'-path, the condition $is\_r2$ exactly implements the expression $\left(fsum \in [1, 4[ \text{ under the assumption } seff = 1 \text{ and } \overline{is\_r1}\right)$, so that
$$is\_r = \overline{seff} \vee is\_r1 \vee is\_r2 \qquad (4.160)$$
The condition $is\_r1$ and the signal $seff$ are computed in the 'R'-path and the condition $is\_r2$ is computed in the 'N'-path. These three parts of the path selection condition are combined in the 'R'-path as depicted in figure 4.13, so that the result of the valid path can be selected by the combined path selection condition $is\_r$ as depicted in figure 4.11. This completes the description of the addition/subtraction II unit.
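The path selection can be summarized in a few lines of code. The sketch below assumes that effective additions (`seff = 0`) always take the 'R'-path, which is the reading of equation 4.160 with `seff` negated; the function names are illustrative only.

```python
def is_r1(ea, eb):
    """Large exponent difference: |ea - eb| > 1."""
    return abs(ea - eb) > 1

def select_r_path(seff, r1, r2):
    """Combined condition: is_r = NOT seff OR is_r1 OR is_r2."""
    return bool((not seff) or r1 or r2)

# Effective subtraction with close exponents and a small sum takes the 'N'-path.
print(select_r_path(1, is_r1(10, 11), False))
```

Only the third term depends on the significand sum, so the first two terms are available early, and `is_r2` is the only late-arriving input to the final selection.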
4.2.3 Addition/Subtraction III (normalized → normalized format)
Like in the two previous sections, the FP addition/subtraction in this section is computed from the inputs in the normalized representations $BUSa_{NF}[69:0]$ and $BUSb_{NF}[69:0]$ (section 2.6.3), the rounding mode represented by $rmode[1:0]$ and the bit $sop$ that signals the case of addition or subtraction. But in contrast to the previous implementations, where a representative of the exact operation result or a gradually rounded result had to be delivered, in this case the addition unit III already has to compute the IEEE factoring representation of the rounded result in the normalized format.
Formally, with the notation from equation 2.16 and lemma 2.8 and with
(s
nrc
; e
nrc
; f
nrc
)=nround
mode
(s
rc
; e
rc
+ wec; f
rc
) (4.161)
=exp rnd
mode?s
rc
((n sig rnd
mode?s
rc
((s
rc
; e
rc
+wec; f
rc
)))) (4.162)
the required addition result is based on the following NF factoring
$$(s_{NF}, e_{NF}, f_{NF}) = \begin{cases} (0, e_{qNaN}, f_{qNaN}) & \text{if } scqnan \\ (s_{inf}, e_{\infty}, f_{\infty}) & \text{if } scinf \\ (sa, ea, fa) & \text{if } scx \\ (sb, eb, fb) & \text{if } scy \\ (s_0, e_0, 0) & \text{if } sczero \\ (s_{nrc}, e_{nrc}, f_{nrc}) & \text{otherwise,} \end{cases} \qquad (4.163)$$
so that the sum output of the addition/subtraction unit III is specified by the corresponding representation in the normalized format $BUS_{NF}[70:0] = nf(s_{NF}, e_{NF}, f_{NF})$.
To compute the exponent wrapping, the trap handler enable bits $unf\_en$ and $ovf\_en$ are required as inputs. Moreover, the occurrence of an invalid, inexact, overflow and underflow exception should be signaled by the bits $inv$, $inx$, $ovf$ and $unf$, respectively.
The computation of the special value results according to equation 4.163 is implemented like in the previous section. The only difference is that in this section an infinity result might also be generated in the regular case because of the exponent rounding. Thus, the special case condition for an infinity result from the special cases circuit, $scinf$, is only valid for $spca = 1$. We denote this condition by $inf_{sc} = scinf$ in this section. Accordingly, we define the condition $inf_{nrc}$ that signals the case of an infinity result for the regular case, $inf_{nrc} \iff (val(s_{nrc}, e_{nrc}, f_{nrc}) = \infty)$, so that we get the final infinity flag by
$$inf_{NF} = \begin{cases} inf_{sc} & \text{if } spca \\ inf_{nrc} & \text{otherwise.} \end{cases} \qquad (4.164)$$
For the regular case the computations of $(s_{nrc}, e_{nrc}, f_{nrc})$ have to be modified in comparison to the computations of $(s_{grc}, e_{grc}, f_{grc})$ from the previous section. The difference in these computations is that in this section we already have to consider single step rounding at the final rounding position $vp$ while integrating the cases for single precision and double precision. Moreover, we also have to consider the exponent wrapping and the exponent rounding.
We base the computation of $(s_{nrc}, e_{nrc}, f_{nrc})$ on the two-path addition algorithm from the previous section with the path selection condition $is\_r$. In this section we denote the factoring output of the R-path ($is\_r = 1$) by $(r\_sn, r\_en, r\_fn)$ and the factoring output of the N-path ($is\_r = 0$) by $(n\_sn, n\_en, n\_fn)$, so that the factoring for the regular case is selected by
$$(s_{nrc}, e_{nrc}, f_{nrc}) = \begin{cases} (r\_sn, r\_en, r\_fn) & \text{if } is\_r \\ (n\_sn, n\_en, n\_fn) & \text{otherwise.} \end{cases}$$
Moreover, the exceptions are detected separately for the two paths by $(n\_inx, n\_unf, n\_ovf)$ for $is\_r = 0$ and by $(r\_inx, r\_unf, r\_ovf)$ for $is\_r = 1$, so that in general the occurrence of the inexact, the underflow and the overflow exception is signaled by
$$(inx, unf, ovf) = \begin{cases} (r\_inx, r\_unf, r\_ovf) & \text{if } is\_r \\ (n\_inx, n\_unf, n\_ovf) & \text{otherwise.} \end{cases}$$
In this way the main structure of the implementation in this section (see figure 4.16) is very similar to that from the previous section (see figure 4.11). In the following we describe the details of the implementation of the R-path and the implementation of the N-path separately.
N-path In the N-path we have to compute the representation of the factoring
(n sn; n en; n fn) = exp rnd
mode?s
rc
((n sig rnd
mode?s
rc
((s
rc
; e
rc
+wec; f
rc
))));
where we can use the condition that $is\_r = 0$. We partition the discussion of these computations into two steps: the first step with the computation of
(n sn1; n en1; n fn1) = (n sig rnd
mode?s
rc
((s
rc
; e
rc
; f
rc
))); (4.165)
so that we get the final result of the N-path by the second computation step of
(n sn; n en; n fn) = exp rnd
mode?s
rc
(n sn1; n en1 + wec; n fn1): (4.166)
In the first computation step, the rounding function $n\_sig\_rnd_{mode?s_{rc}}$ differs from the rounding function $sgrnd1_{mode?s_{rc}}$ and the second unbounded normalization shift differs from the post-normalization shift in the previous section. The two rounding functions only differ by the rounding position, which is $vp = (p-1) - \max\{0, e_{min} - e'_{rc}\}$ in this case. In the previous section, the rounding position was 52 and it was shown that this rounding function does not have any effect in the N-path. We will show in the following lemma that the significand rounding at the variable rounding position $vp$ can also be neglected in the N-path. In this way the rounding output is still normalized from the first normalization shift, so that the second normalization shift is not required either. Thus, all computations for the first step (equation 4.165) are already covered by the N-path implementation from the previous section:
Figure 4.16: Structure of the addition/subtraction unit III.
Lemma 4.19 (a) For a fixed precision with fixed $p$ and $e_{min}$, all exact addition/subtraction results are integral multiples of $2^{e_{min}-p+1}$.
(b) If the significand rounding position is variable, namely $vp \neq p-1$, then no rounding computation is required in the addition/subtraction implementation.
(c) In the N-path of the adder III, no implementation of rounding or the second normalization shift is required, so that for $is\_r = 0$ we have
$$(n\_sn1, n\_en1, n\_fn1) = (n\_s, n\_e, n\_f) = (s_{rc}, e_{rc}, f_{rc}) = (sl \oplus sfsum, e_1, abs\_fsum).$$
Proof: (a) Because for a fixed $p$ and $e_{min}$ all operands are integral multiples of $2^{e_{min}-p+1}$, the exact sum and the exact difference of these operands are also integral multiples of $2^{e_{min}-p+1}$.
(b) The rounding position $vp$ satisfies $vp \neq p-1$ iff the exponent $e'_{rc}$ of the rounding operand is smaller than $e_{min}$. In this case, the weight of the rounding position $vp$ is $2^{e'_{rc}-p+1} < 2^{e_{min}-p+1}$. Because from (a) it follows that all exact addition/subtraction results are integral multiples of this rounding position weight, the rounding computation has no effect for the case that $vp \neq p-1$.
(c) We first show that there is no rounding computation required in the N-path. For this proof we distinguish between the cases of single precision and double precision and the cases that the rounding position fulfills either $vp = p-1$ or $vp \neq p-1$: For $vp \neq p-1$ the proof already follows from part (b). For double precision and $vp = p-1$, we have $vp = 52$, so that the setting is like in the previous section and it follows from lemma 4.12(d) that no rounding computation is required in the N-path in this case. For single precision with $p = 24$ and $vp = p-1 = 23$, the proof of lemma 4.12(d) can be adapted accordingly, so that for this case no rounding computations in the N-path are required either. Thus, the rounding computations can be neglected in the N-path for all cases. In this way the input of the second normalization shift in equation 4.165 is still normalized, so that even the second normalization shift can be neglected in equation 4.165. We then have exactly the same situation as in the N-path of the previous section, so that we get as required
$$(n\_sn1, n\_en1, n\_fn1) = (s_{rc}, e_{rc}, f_{rc}) = (sl \oplus sfsum, e_1, abs\_fsum) = (n\_s, n\_e, n\_f). \quad \Box$$
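Parts (a) and (b) can be illustrated on a toy floating-point format. The following sketch is purely illustrative: the parameters `P = 3`, `E_MIN = -2`, `E_MAX = 1` are hypothetical and chosen only to keep the enumeration small, and exact rationals stand in for the hardware representation.

```python
from fractions import Fraction
from itertools import product

P, E_MIN, E_MAX = 3, -2, 1        # toy format, chosen only for illustration

def representable():
    """All non-negative values f * 2**e with P-bit significands
    (normalized values, plus denormals at exponent E_MIN)."""
    vals = set()
    for e in range(E_MIN, E_MAX + 1):
        for k in range(2 ** P):                     # f = k / 2**(P-1), f in [0, 2[
            if k >= 2 ** (P - 1) or e == E_MIN:     # normalized, or denormal
                vals.add(Fraction(k, 2 ** (P - 1)) * Fraction(2) ** e)
    return vals

def check():
    q = Fraction(2) ** (E_MIN - P + 1)              # 2**(e_min - p + 1)
    vals = representable()
    signed = vals | {-v for v in vals}
    for x, y in product(signed, repeat=2):
        s = x + y
        assert (s / q).denominator == 1             # (a): multiple of q
        if abs(s) < Fraction(2) ** E_MIN:           # result below normalized range
            assert abs(s) in vals                   # representable without rounding
    return True

print(check())
```

Because every operand is a multiple of the smallest representable spacing, a tiny exact sum is itself representable, which is why no rounding hardware is needed for the variable rounding position.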
Thus, for the computation of $(n\_sn1, n\_en1, n\_fn1) = (n\_s, n\_e, n\_f)$ (equation 4.165), the N-path implementation from the previous section is used. The computation of the second part according to equation 4.166 requires the detection of the overflow and the underflow exception. In the following lemma we consider both the exception detections and the implementation of the second step (equation 4.166) to compute the factoring $(n\_sn, n\_en, n\_fn)$.
Lemma 4.20 In the N-path:
(a) no overflow can occur: $n\_ovf = 0$.
(b) exponent rounding has no effect, so that $(n\_sn, n\_en, n\_fn) = (n\_s, n\_e + wec, n\_f)$.
(c) all results are exact and an inexact exception cannot occur: $n\_inx = 0$.
(d) with the computation of $\langle n\_tt[11:0]\rangle_2 = \langle n\_e[11:0]\rangle_2 + \langle (0, dbl^3, 1^6, 0)\rangle_2$, the underflow exception can be detected by: $n\_unf \iff (n\_tt[11] \wedge unf\_en \wedge spca)$.
(e) $$wec = \begin{cases} +\alpha = \langle alpha[11:7]\rangle_2 = \langle (0, dbl^2, 0, dbl^2)\rangle_2 & \text{if } n\_unf \\ 0 & \text{otherwise.} \end{cases}$$
Proof: (a) Because we consider non-zero representable input operands in the N-path, the exponent of the "larger" operand $el$ is smaller than or equal to $e_{max}$. Because in the N-path we have $fsum < 1$, all results in the N-path have values smaller than $2^{e_{max}}$, so that no overflow can occur in the N-path and $n\_ovf = 0$.
(b) Because all results in the N-path have values smaller than $2^{e_{max}}$, the exponent rounding in equation 4.166 becomes the identity function, so that we get the result of the N-path by $(n\_sn, n\_en, n\_fn) = (n\_s, n\_e + wec, n\_f)$, as required.
(c) Because neither the significand rounding nor the exponent rounding changes the value of the result, all results in the N-path are exact, so that $n\_inx = 0$.
Figure 4.17: Additional circuits for the N-path of the addition/subtraction unit III.
(d) Because the factoring $(n\_s, n\_e, n\_f)$ is normalized with $n\_f \in [1, 2[$, the value of the result $val(n\_s, n\_e, n\_f)$ is tiny iff $n\_e < e_{min}$. From (c) and the definition of the underflow exception in section 2.4.1 it then follows that $n\_unf \iff ((n\_e < e_{min}) \wedge unf\_en \wedge spca)$. Because $-e_{min} = \langle (0, dbl^3, 1^6, 0)\rangle_2$, we have $(n\_e < e_{min}) \iff n\_tt[11]$, so that the above equation for $n\_unf$ can be written as $n\_unf \iff (n\_tt[11] \wedge unf\_en \wedge spca)$.
(e) From (c) it follows that $n\_unf \iff n\_unf \wedge unf\_en$. Because no overflow can occur in the N-path according to (a), the definition of the exponent wrapping constant from equation 2.14 becomes
$$wec = \begin{cases} +\alpha = \langle alpha[11:7]\rangle_2 = \langle (0, dbl^2, 0, dbl^2)\rangle_2 & \text{if } n\_unf \\ 0 & \text{otherwise.} \end{cases}$$
$\Box$
Thus, in the second computation step of the N-path only the underflow exception has to be detected according to part (d), and the exponent wrapping $\mathit{n\_en} = \mathit{n\_e} + \mathit{wec}$ has to be computed according to parts (b) and (e). The implementation of these extensions for the N-path is depicted in figure 4.17. This completes the description of the N-path for the adder III unit.
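The second N-path step of lemma 4.20 can be sketched in software as follows. This is only an illustrative model and not the circuit of figure 4.17: the function name `n_path_exponent` is hypothetical, exponents are modeled as plain integers, the constants $e_{min}$ and $\alpha = 3 \cdot 2^{n-2}$ (192 for single, 1536 for double precision) are the IEEE 754-1985 values, and the underflow condition is assumed to be masked in the special case.

```python
def n_path_exponent(n_e: int, dbl: bool, unf_en: bool, spca: bool) -> tuple[int, bool]:
    """Return (n_en, n_unf) for the N-path, following lemma 4.20 (d) and (e).

    Hypothetical helper for illustration: n_e is the unwrapped exponent,
    dbl selects double precision, unf_en is the underflow trap enable and
    spca flags a special case (assumed here to suppress the exception).
    """
    e_min = -1022 if dbl else -126      # IEEE 754-1985 minimum exponents
    alpha = 1536 if dbl else 192        # wrapping constant 3 * 2**(n - 2)

    # n_tt = n_e - e_min as a 12-bit two's complement number; bit 11 is its sign.
    n_tt = (n_e - e_min) & 0xFFF
    tiny = bool(n_tt >> 11)

    # N-path results are exact, so underflow is only signaled with the trap enabled.
    n_unf = tiny and unf_en and not spca

    wec = alpha if n_unf else 0
    return n_e + wec, n_unf
```

With the underflow trap enabled, a double precision exponent of -1030 is wrapped to 506, while the same exponent passes through unchanged when the trap is disabled.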
R-path In the R-path we have to compute the representation of the NF-factoring
\[
(\mathit{r\_sn}, \mathit{r\_en}, \mathit{r\_fn}) = \mathit{exp\_rnd}_{\mathit{mode} \star s_{rc}}(\mathit{n\_sig\_rnd}_{\mathit{mode} \star s_{rc}}(s_{rc}, e_{rc} + \mathit{wec}, f_{rc})),
\]
where we can use the condition that $\mathit{is\_r} = 1$. As in the description of the N-path, we partition the discussion of the R-path computations into two steps: the first step with the computation of
\[
(\mathit{r\_sn1}, \mathit{r\_en1}, \mathit{r\_fn1}) = \mathit{n\_sig\_rnd}_{\mathit{mode} \star s_{rc}}(s_{rc}, e_{rc}, f_{rc}), \tag{4.167}
\]
so that we get the final R-path result by the second computation step with
\[
(\mathit{r\_sn}, \mathit{r\_en}, \mathit{r\_fn}) = \mathit{exp\_rnd}_{\mathit{mode} \star s_{rc}}(\mathit{r\_sn1}, \mathit{r\_en1} + \mathit{wec}, \mathit{r\_fn1}). \tag{4.168}
\]
First, we deal with the computations for equation 4.167. This formula for the computation of $(\mathit{r\_sn1}, \mathit{r\_en1}, \mathit{r\_fn1})$ and the formula for $(\mathit{r\_s}, \mathit{r\_e}, \mathit{r\_f})$ in the R-path of the previous section (equation 4.120) differ in the significand rounding functions and in the second normalization shift. Let $(s'_{rc}, e'_{rc}, f'_{rc}) = \hat{\eta}(s_{rc}, e_{rc}, f_{rc})$. With the rounding position $\mathit{vp} = (p - 1) - \max\{0, e_{min} - e'_{rc}\}$, the above factorings can be written as (see rounding function definitions 2.16 and 2.9):
\[
(\mathit{r\_sn1}, \mathit{r\_en1}, \mathit{r\_fn1}) = \hat{\eta}(s'_{rc}, e'_{rc}, \mathit{rnd}_{\mathit{mode} \star sl,\, vp}(f'_{rc})) \tag{4.169}
\]
\[
(\mathit{r\_s}, \mathit{r\_e}, \mathit{r\_f}) = \mathit{post\_norm}(s'_{rc}, e'_{rc}, \mathit{rnd}_{\mathit{mode} \star sl,\, 52}(f'_{rc})) \tag{4.170}
\]
Thus, the only difference between the significand rounding functions is the rounding position: in this section significand rounding at the variable significand position $\mathit{vp}$ is considered, while in the previous section significand rounding at the fixed rounding position 52 was computed.
The following lemma shows that for the addition/subtraction implementation, the variable rounding position $\mathit{vp}$ can be substituted by $\mathit{vp}'$, a fixed rounding position for single precision and a fixed rounding position for double precision, so that the rounding implementation from the previous section could be adopted either for the single precision or for the double precision case. As we know from the previous section that the post-normalization shift in equation 4.170 already normalizes the rounded factoring, and as we have a similar rounding computation in equation 4.169, a post-normalization shift will also be sufficient in equation 4.169 to normalize the result instead of an unbounded normalization shift:
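Lemma 4.21 In the addition/subtraction implementation, the variable rounding position $\mathit{vp} = (p - 1) - \max\{0, e_{min} - e'_{rc}\}$ can be substituted by
\[
\mathit{vp}' =
\begin{cases}
52 & \text{if } \mathit{dbl} \\
23 & \text{otherwise}
\end{cases}
\]
without changing the rounded result, so that
\[
(\mathit{r\_sn1}, \mathit{r\_en1}, \mathit{r\_fn1}) = \hat{\eta}(s'_{rc}, e'_{rc}, \mathit{rnd}_{\mathit{mode} \star sl,\, vp'}(f'_{rc})) \tag{4.171}
\]
\[
=
\begin{cases}
\mathit{post\_norm}(s'_{rc}, e'_{rc}, \mathit{rnd}_{\mathit{mode} \star sl,\, 52}(f'_{rc})) & \text{if } \mathit{dbl} \\
\mathit{post\_norm}(s'_{rc}, e'_{rc}, \mathit{rnd}_{\mathit{mode} \star sl,\, 23}(f'_{rc})) & \text{otherwise.}
\end{cases}
\tag{4.172}
\]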
Proof: According to lemma 4.19(b), no rounding computation is required in the adder implementation for $\mathit{vp} \neq p - 1$. Therefore, and because $\mathit{vp} \le p - 1$, we could always set the rounding position to $p - 1$ without changing the rounding result. The integration of the case for single precision ($p - 1 = 23$) and double precision ($p - 1 = 52$) exactly yields the rounding position $\mathit{vp}'$. With the rounding computation at the fixed position $\mathit{vp}'$, which is the least significant bit position for single and double precision, also the rounding result $\mathit{rnd}_{\mathit{mode} \star sl,\, vp'}(f'_{rc})$ is in the range $[1, 2]$, so that a post-normalization shift and an unbounded normalization shift have the same effect on the rounded factoring, and we can replace $\hat{\eta}$ by $\mathit{post\_norm}$ in the lemma. $\Box$
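Based on this lemma we could apply the implementation of the R-path from the previous section for the computation of $(\mathit{r\_sn1}, \mathit{r\_en1}, \mathit{r\_fn1}) = (\mathit{r\_s}, \mathit{r\_e}, \mathit{r\_f})$ in the double precision case. Thus, we get for double precision according to lemma 4.13:
\[
(\mathit{r\_sn1}, \mathit{r\_en1}, \mathit{r\_fn1}) =
\begin{cases}
\mathit{gpost\_norm}(sl, e_{1}, \mathit{rnd}_{\mathit{mode} \star sl,\, 52}(\mathit{fsum})) & \text{if } \mathit{fsum} \in [1, 2[ \;\wedge\; \mathit{dbl} \\
\mathit{gpost\_norm}(sl, e_{1}, \mathit{rnd}_{\mathit{mode} \star sl,\, 51}(\mathit{fsum})) & \text{if } \mathit{fsum} \in [2, 4[ \;\wedge\; \mathit{dbl}
\end{cases}
\]
Accordingly, lemma 4.13 could also be adopted for $p - 1 = 23$, so that the single precision case could be integrated and we get
\[
(\mathit{r\_sn1}, \mathit{r\_en1}, \mathit{r\_fn1}) =
\begin{cases}
\mathit{gpost\_norm}(sl, e_{1}, \mathit{rnd}_{\mathit{mode} \star sl,\, 52}(\mathit{fsum})) & \text{if } \mathit{fsum} \in [1, 2[ \;\wedge\; \mathit{dbl} \\
\mathit{gpost\_norm}(sl, e_{1}, \mathit{rnd}_{\mathit{mode} \star sl,\, 51}(\mathit{fsum})) & \text{if } \mathit{fsum} \in [2, 4[ \;\wedge\; \mathit{dbl} \\
\mathit{gpost\_norm}(sl, e_{1}, \mathit{rnd}_{\mathit{mode} \star sl,\, 23}(\mathit{fsum})) & \text{if } \mathit{fsum} \in [1, 2[ \;\wedge\; \overline{\mathit{dbl}} \\
\mathit{gpost\_norm}(sl, e_{1}, \mathit{rnd}_{\mathit{mode} \star sl,\, 22}(\mathit{fsum})) & \text{if } \mathit{fsum} \in [2, 4[ \;\wedge\; \overline{\mathit{dbl}}
\end{cases}
\]
The sign and the exponent are constant in the four choices, so that for the following discussion we isolate the significand computation by
\[
\mathit{r\_frnd} =
\begin{cases}
\mathit{rnd}_{\mathit{mode} \star sl,\, 52}(\mathit{fsum}) & \text{if } \mathit{fsum} \in [1, 2[ \;\wedge\; \mathit{dbl} \\
\mathit{rnd}_{\mathit{mode} \star sl,\, 51}(\mathit{fsum}) & \text{if } \mathit{fsum} \in [2, 4[ \;\wedge\; \mathit{dbl} \\
\mathit{rnd}_{\mathit{mode} \star sl,\, 23}(\mathit{fsum}) & \text{if } \mathit{fsum} \in [1, 2[ \;\wedge\; \overline{\mathit{dbl}} \\
\mathit{rnd}_{\mathit{mode} \star sl,\, 22}(\mathit{fsum}) & \text{if } \mathit{fsum} \in [2, 4[ \;\wedge\; \overline{\mathit{dbl}}
\end{cases}
\]
and then have to compute $(\mathit{r\_sn1}, \mathit{r\_en1}, \mathit{r\_fn1}) = \mathit{gpost\_norm}(sl, e_{1}, \mathit{r\_frnd})$.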
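Because the injection based rounding reduction only depends on the rounding position and not on the value of the rounding operand, we align the rounding positions for single precision and double precision by
\[
\mathit{r\_frnd} =
\begin{cases}
\mathit{rnd}_{\mathit{mode} \star sl,\, 52}(\mathit{fsum}) & \text{if } \mathit{fsum} \in [1, 2[ \;\wedge\; \mathit{dbl} \\
\mathit{rnd}_{\mathit{mode} \star sl,\, 51}(\mathit{fsum}) & \text{if } \mathit{fsum} \in [2, 4[ \;\wedge\; \mathit{dbl} \\
2^{29} \cdot \mathit{rnd}_{\mathit{mode} \star sl,\, 52}(2^{-29} \cdot \mathit{fsum}) & \text{if } \mathit{fsum} \in [1, 2[ \;\wedge\; \overline{\mathit{dbl}} \\
2^{29} \cdot \mathit{rnd}_{\mathit{mode} \star sl,\, 51}(2^{-29} \cdot \mathit{fsum}) & \text{if } \mathit{fsum} \in [2, 4[ \;\wedge\; \overline{\mathit{dbl}}
\end{cases}
\tag{4.173}
\]
In the implementation, the multiplication of the rounding operand by $2^{-29}$ in the case of single precision is achieved by a conditional right-shift of the representations of both input significands by 29 positions for $\mathit{dbl} = 0$. We denote these aligned operands by
\[
\mathit{faq} = \langle \mathit{faq}[0:52] \rangle_{neg} =
\begin{cases}
\mathit{fa} = \langle \mathit{fa}[0:52] \rangle_{neg} & \text{if } \mathit{dbl} \\
2^{-29} \cdot \mathit{fa} = \langle (0^{29}, \mathit{fa}[0:23]) \rangle_{neg} & \text{otherwise.}
\end{cases}
\tag{4.174}
\]
\[
\mathit{fbq} = \langle \mathit{fbq}[0:52] \rangle_{neg} =
\begin{cases}
\mathit{fb} = \langle \mathit{fb}[0:52] \rangle_{neg} & \text{if } \mathit{dbl} \\
2^{-29} \cdot \mathit{fb} = \langle (0^{29}, \mathit{fb}[0:23]) \rangle_{neg} & \text{otherwise.}
\end{cases}
\tag{4.175}
\]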
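The alignment behind equation 4.173 can be checked numerically with exact rational arithmetic. The following is a minimal sketch under stated assumptions: the helper names `rnd` and `rnd_single_via_52` are hypothetical, the operand is taken as non-negative, and the three mode strings only stand in for the sign-reduced rounding modes of the text. It shows that rounding $2^{-29} \cdot \mathit{fsum}$ at position 52 and scaling back by $2^{29}$ gives the same result as rounding $\mathit{fsum}$ at position 23.

```python
from fractions import Fraction
import math

HALF = Fraction(1, 2)


def rnd(x: Fraction, pos: int, mode: str = "RNE") -> Fraction:
    """Round a non-negative x to a multiple of 2**-pos.

    mode is "RZ" (truncate), "RI" (round away from zero) or "RNE"
    (round to nearest, ties to even). These labels are assumptions
    made for this sketch.
    """
    scaled = x * 2**pos
    low = math.floor(scaled)
    rest = scaled - low
    if mode == "RZ" or rest == 0:
        q = low
    elif mode == "RI":
        q = low + 1
    elif rest > HALF or (rest == HALF and low % 2 == 1):
        q = low + 1
    else:
        q = low
    return Fraction(q, 2**pos)


def rnd_single_via_52(fsum: Fraction, mode: str = "RNE") -> Fraction:
    """Single precision rounding expressed at position 52 (third case of eq. 4.173)."""
    return 2**29 * rnd(fsum / 2**29, 52, mode)
```

Because the scaling by $2^{-29}$ only moves the binary point, both computations see the same discarded bits, which is why one rounding circuit at a fixed position can serve both precisions.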
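Accordingly, we indicate all corresponding values that are computed from $\mathit{faq}$ and $\mathit{fbq}$ instead of $\mathit{fa}$ and $\mathit{fb}$ by appending a 'q' to their name. With this notation and with the inputs $\mathit{faq}[0:52]$ and $\mathit{fbq}[0:52]$, the R-path implementation from the previous section computes
\[
\mathit{fsumq} = \langle \mathit{fsumq}[-1:115] \rangle_{neg} \tag{4.176}
\]
\[
=
\begin{cases}
\mathit{fsum} = \langle \mathit{fsum}[-1:115] \rangle_{neg} & \text{if } \mathit{dbl} \\
2^{-29} \cdot \mathit{fsum} = \langle (0^{29}, \mathit{fsum}[-1:86]) \rangle_{neg} & \text{otherwise.}
\end{cases}
\tag{4.177}
\]
\[
\mathit{usumq} = \langle \mathit{usumq}[-1:51] \rangle_{neg} \tag{4.178}
\]
\[
=
\begin{cases}
\mathit{usum} = \langle \mathit{usum}[-1:51] \rangle_{neg} & \text{if } \mathit{dbl} \\
2^{-29} \cdot \mathit{usum} = \langle (0^{29}, \mathit{usum}[-1:22]) \rangle_{neg} & \text{otherwise.}
\end{cases}
\tag{4.179}
\]
In equation 4.179 the signal $\mathit{usum}[-1]$, which substitutes the condition $\mathit{fsum} \in [2, 4[$ according to lemma 4.15, is shifted to position $\mathit{usumq}[28]$ for single precision. Thus, the signal $\mathit{usum}[-1]$ can be selected by
\[
\mathit{usum}[-1] = \mathit{condq}_{[2,4[} =
\begin{cases}
\mathit{usumq}[-1] & \text{if } \mathit{dbl} \\
\mathit{usumq}[28] & \text{otherwise.}
\end{cases}
\tag{4.180}
\]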
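With the substitution of $\mathit{condq}_{[2,4[}$ for the condition $\mathit{fsum} \in [2, 4[$, equation 4.173 can be written as
\[
\mathit{r\_frnd} =
\begin{cases}
\mathit{rnd}_{\mathit{mode} \star sl,\, 52}(\mathit{fsumq}) & \text{if } \overline{\mathit{condq}_{[2,4[}} \wedge \mathit{dbl} \\
\mathit{rnd}_{\mathit{mode} \star sl,\, 51}(\mathit{fsumq}) & \text{if } \mathit{condq}_{[2,4[} \wedge \mathit{dbl} \\
2^{29} \cdot \mathit{rnd}_{\mathit{mode} \star sl,\, 52}(\mathit{fsumq}) & \text{if } \overline{\mathit{condq}_{[2,4[}} \wedge \overline{\mathit{dbl}} \\
2^{29} \cdot \mathit{rnd}_{\mathit{mode} \star sl,\, 51}(\mathit{fsumq}) & \text{if } \mathit{condq}_{[2,4[} \wedge \overline{\mathit{dbl}}
\end{cases}
\tag{4.181}
\]
If the condition $\mathit{cond}_{[2,4[} \iff \mathit{fsum} \in [2, 4[$ is also substituted by the bit $\mathit{condq}_{[2,4[}$ in the modified rounding computations from the previous section, then this R-path implementation computes
\[
\mathit{rnd\_fsumq} = \langle \mathit{rnd\_fsumq}[-1:52] \rangle_{neg} \tag{4.182}
\]
\[
=
\begin{cases}
\mathit{rnd}_{\mathit{mode} \star sl,\, 52}(\mathit{fsumq}) & \text{if } \overline{\mathit{condq}_{[2,4[}} \\
\mathit{rnd}_{\mathit{mode} \star sl,\, 51}(\mathit{fsumq}) & \text{if } \mathit{condq}_{[2,4[}.
\end{cases}
\tag{4.183}
\]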
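Obviously, the rounded significand $\mathit{r\_frnd}$ can then be computed by a conditional left shift of $\mathit{rnd\_fsumq}[-1:52]$ by 29 positions for the case of single precision (see eq. 4.181):
\[
\mathit{r\_frnd} = \langle \mathit{r\_frnd}[-1:52] \rangle_{neg} \tag{4.184}
\]
\[
=
\begin{cases}
\mathit{rnd\_fsumq} = \langle \mathit{rnd\_fsumq}[-1:52] \rangle_{neg} & \text{if } \mathit{dbl} \\
2^{29} \cdot \mathit{rnd\_fsumq} = \langle (\mathit{rnd\_fsumq}[28:52], 0^{29}) \rangle_{neg} & \text{otherwise.}
\end{cases}
\tag{4.185}
\]
Although this is already an equation for the required significand $\mathit{r\_frnd}$, we would like to postpone the re-alignment shift by 29 positions in this equation until after the computation of the generalized post-normalization shift. One can easily read off from equation 4.185 that the rounded significand $\mathit{r\_frnd}$ is in the range $[2, 4[$ for the post-normalization condition
\[
\mathit{pscond} \iff (\mathit{r\_frnd} \in [2, 4[) \tag{4.186}
\]
\[
\iff (\mathit{rnd\_fsumq}[-1] \wedge \mathit{dbl}) \vee (\mathit{rnd\_fsumq}[28] \wedge \overline{\mathit{dbl}}). \tag{4.187}
\]
Thus, we get for the generalized post-normalization shift,
\[
(\mathit{r\_sn1}, \mathit{r\_en1}, \mathit{r\_fn1}) = \mathit{gpost\_norm}(sl, e_{1}, \mathit{r\_frnd}) \tag{4.188}
\]
\[
=
\begin{cases}
(sl, e_{1} + 1, \mathit{r\_frnd}/2) & \text{if } \mathit{pscond} \\
(sl, e_{1}, \mathit{r\_frnd}) & \text{otherwise.}
\end{cases}
\tag{4.189}
\]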
With the preliminary significand $\mathit{r\_fq}$
\[
\mathit{r\_fq} = \langle \mathit{r\_fq}[0:52] \rangle_{neg} \tag{4.190}
\]
\[
=
\begin{cases}
\mathit{rnd\_fsumq} = \langle \mathit{rnd\_fsumq}[0:52] \rangle_{neg} & \text{if } \overline{\mathit{pscond}} \\
\mathit{rnd\_fsumq}/2 = \langle \mathit{rnd\_fsumq}[-1:51] \rangle_{neg} & \text{otherwise,}
\end{cases}
\tag{4.191}
\]
the significand for the factoring $(\mathit{r\_sn1}, \mathit{r\_en1}, \mathit{r\_fn1})$ can be computed by the following re-alignment selection
\[
\mathit{r\_fn1} = \langle \mathit{r\_fn1}[0:52] \rangle_{neg} \tag{4.192}
\]
\[
=
\begin{cases}
\mathit{r\_fq} = \langle \mathit{r\_fq}[0:52] \rangle_{neg} & \text{if } \mathit{dbl} \\
2^{29} \cdot \mathit{r\_fq} = \langle (\mathit{r\_fq}[29:52], 0^{29}) \rangle_{neg} & \text{otherwise.}
\end{cases}
\tag{4.193}
\]
Note that equation 4.191 describes the generalized post-normalization shift of the modified R-path implementation from the previous section, where only the control signal $\mathit{rnd\_fsum}[-1]$ is substituted by the post-normalization condition $\mathit{pscond}$. In this way the computation of the factoring $(\mathit{r\_sn1}, \mathit{r\_en1}, \mathit{r\_fn1})$ on the basis of the modified R-path implementation from the previous section requires only the five additional circuits according to the equations 4.174, 4.175, 4.180, 4.187 and 4.193. The integration of these additional circuits around the R-path implementation from the previous section is depicted in figure 4.18, where the additional circuits are represented by shaded boxes. This completes the description of the first step in the R-path computations according to equation 4.167.
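In the following we consider the second step of the R-path computations according to equation 4.168. This includes the detection of the exceptions, the exponent rounding and the exponent wrapping.
Lemma 4.22 With the computation of
\[
\langle \mathit{r\_tt}[11:0] \rangle_{2} = \langle e_{1}[11:0] \rangle_{2} + \langle (0, \mathit{dbl}^{3}, 1^{6}, 0) \rangle_{2}
\]
\[
\langle \mathit{r\_tti}[11:0] \rangle_{2} = \langle \mathit{r\_tt}[11:0] \rangle_{2} + 1
\]
the exceptions in the R-path can be detected by:
\[
\mathit{r\_ovf} \iff (\mathit{zerotest}(e_{1}[11:0] \oplus (0, \mathit{dbl}^{3}, 1^{7})) \wedge \overline{\mathit{spca}} \wedge \mathit{pscond})
\]
\[
\mathit{r\_inx} \iff (\mathit{r\_tinxq} \vee \mathit{r\_ovf})
\]
\[
\mathit{r\_unf} \iff
\begin{cases}
\mathit{r\_tti}[11] \wedge (\mathit{unf\_en} \vee \mathit{r\_inx}) \wedge \overline{\mathit{spca}} & \text{if } \mathit{pscond} \\
\mathit{r\_tt}[11] \wedge (\mathit{unf\_en} \vee \mathit{r\_inx}) \wedge \overline{\mathit{spca}} & \text{otherwise}
\end{cases}
\]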
Proof: The condition for an overflow in the R-path is given by
\[
\mathit{r\_ovf} \iff (|\mathit{val}(\mathit{r\_sn1}, \mathit{r\_en1}, \mathit{r\_fn1})| \ge 2^{e_{max}+1}).
\]
Because of the normalized significand $\mathit{r\_fn1} \in [1, 2[$, this overflow condition can be written as $\mathit{r\_ovf} \iff (\mathit{r\_en1} \ge e_{max} + 1)$. Because in the R-path we only have to consider non-zero representable operands, the exponent of the 'larger' operand $\mathit{el}$ and also $e_{1}$ are smaller than or equal to $e_{max}$. Thus, according to equation 4.189, the exponent of the normalized factoring $\mathit{r\_en1}$ can only become larger than $e_{max}$ if $\mathit{r\_en1} = \mathit{ei}_{1} = e_{max} + 1$. Thus, $\mathit{r\_ovf} \iff (\mathit{ei}_{1} = e_{max} + 1) \wedge \mathit{pscond} \wedge \overline{\mathit{spca}}$. Because $(\mathit{ei}_{1} = e_{max} + 1) \iff (e_{1} = e_{max})$ and $e_{max} = \langle (0, \mathit{dbl}^{3}, 1^{7}) \rangle_{2}$, the equation for $\mathit{r\_ovf}$ can be written as $\mathit{r\_ovf} \iff \mathit{zerotest}(e_{1}[11:0] \oplus (0, \mathit{dbl}^{3}, 1^{7})) \wedge \mathit{pscond} \wedge \overline{\mathit{spca}}$, as required.
Figure 4.18: R-path implementation for the addition/subtraction unit III. Shaded boxes had to be added to the R-path implementation of the addition/subtraction unit II.
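The inexactness of a result can have two reasons: either the significand rounding or the case of an overflow. The significand rounding inexactness was computed in the previous section by $\mathit{r\_tinx}$. The rounding position and the rounding computation have not changed in this section, and only the range detection condition $\mathit{cond}_{[2,4[}$ was substituted by $\mathit{condq}_{[2,4[}$. Thus, the significand rounding inexactness from lemma 4.14 becomes
\[
\mathit{r\_tinxq} = \mathit{stickyq} \vee ((\mathit{condq}_{[2,4[} \wedge \mathit{OR}(\mathit{fsumq}[52:53])) \vee (\overline{\mathit{condq}_{[2,4[}} \wedge \mathit{fsumq}[53]))
\]
and we get the R-path inexactness condition by $\mathit{r\_inx} = \mathit{r\_tinxq} \vee \mathit{r\_ovf}$.
With $\mathit{r\_tiny} \iff |\mathit{val}(\mathit{r\_sn1}, \mathit{r\_en1}, \mathit{r\_fn1})| < 2^{e_{min}}$, the underflow exception for the R-path is given by
\[
\mathit{r\_unf} \iff \mathit{r\_tiny} \wedge (\mathit{unf\_en} \vee \mathit{r\_inx}) \wedge \overline{\mathit{spca}}.
\]
The normalized factoring $(\mathit{r\_sn1}, \mathit{r\_en1}, \mathit{r\_fn1})$ is tiny, namely $\mathit{r\_tiny} = 1$, iff $\mathit{r\_en1} < e_{min}$. Because of the exponent selection of $\mathit{r\_en1}$ from $e_{1}$ and $\mathit{ei}_{1}$ according to equation 4.189 and because $-e_{min} = \langle (0, \mathit{dbl}^{3}, 1^{6}, 0) \rangle_{2}$, so that $\langle \mathit{r\_tt}[11:0] \rangle_{2} = e_{1} - e_{min}$ and $\langle \mathit{r\_tti}[11:0] \rangle_{2} = \mathit{ei}_{1} - e_{min}$, we get
\[
\mathit{r\_tiny} =
\begin{cases}
\mathit{r\_tti}[11] & \text{if } \mathit{pscond} \\
\mathit{r\_tt}[11] & \text{otherwise.}
\end{cases}
\]
Thus, the underflow condition for the R-path can be written as
\[
\mathit{r\_unf} \iff
\begin{cases}
\mathit{r\_tti}[11] \wedge (\mathit{unf\_en} \vee \mathit{r\_inx}) \wedge \overline{\mathit{spca}} & \text{if } \mathit{pscond} \\
\mathit{r\_tt}[11] \wedge (\mathit{unf\_en} \vee \mathit{r\_inx}) \wedge \overline{\mathit{spca}} & \text{otherwise.}
\end{cases}
\]
$\Box$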
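Because the signal $\mathit{pscond}$ is valid rather late, we compute each exception in two parallel paths: one path under the assumption that $\mathit{pscond} = 1$ with the signals $\mathit{r\_ovf}[1]$, $\mathit{r\_inx}[1]$ and $\mathit{r\_unf}[1]$, and the other path under the assumption that $\mathit{pscond} = 0$ with the signals $\mathit{r\_ovf}[0]$, $\mathit{r\_inx}[0]$ and $\mathit{r\_unf}[0]$. Obviously, the exception flags can then be selected by:
\[
(\mathit{r\_ovf}, \mathit{r\_inx}, \mathit{r\_unf}) =
\begin{cases}
(\mathit{r\_ovf}[1], \mathit{r\_inx}[1], \mathit{r\_unf}[1]) & \text{if } \mathit{pscond} \\
(\mathit{r\_ovf}[0], \mathit{r\_inx}[0], \mathit{r\_unf}[0]) & \text{otherwise.}
\end{cases}
\tag{4.194}
\]
From lemma 4.22 one can easily read off the following equations for the two paths:
\[
\mathit{r\_ovf}[1] = \mathit{zerotest}(e_{1}[11:0] \oplus (0, \mathit{dbl}^{3}, 1^{7})) \wedge \overline{\mathit{spca}} \tag{4.195}
\]
\[
\mathit{r\_ovf}[0] = 0 \tag{4.196}
\]
\[
\mathit{r\_inx}[1] = \mathit{r\_tinxq} \vee \mathit{r\_ovf}[1] \tag{4.197}
\]
\[
\mathit{r\_inx}[0] = \mathit{r\_tinxq} \tag{4.198}
\]
\[
\mathit{r\_unf}[1] = \mathit{r\_tti}[11] \wedge (\mathit{unf\_en} \vee \mathit{r\_inx}[1]) \wedge \overline{\mathit{spca}} \tag{4.199}
\]
\[
\mathit{r\_unf}[0] = \mathit{r\_tt}[11] \wedge (\mathit{unf\_en} \vee \mathit{r\_inx}[0]) \wedge \overline{\mathit{spca}} \tag{4.200}
\]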
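The speculative evaluation of equations 4.194 to 4.200 can be modeled as follows. This is a behavioral sketch only: the function name `r_path_exceptions` is hypothetical, the exponent $e_{1}$ is passed as a plain integer, the IEEE 754-1985 values of $e_{min}$ and $e_{max}$ are used directly instead of the bit patterns, and all exceptions are assumed to be masked in the special case.

```python
def r_path_exceptions(e1: int, dbl: bool, r_tinxq: bool,
                      unf_en: bool, spca: bool, pscond: bool) -> tuple[bool, bool, bool]:
    """Return (r_ovf, r_inx, r_unf), selected late by pscond as in eq. 4.194."""
    e_max = 1023 if dbl else 127
    e_min = -1022 if dbl else -126

    # 12-bit two's complement differences; bit 11 signals a tiny result.
    r_tt = (e1 - e_min) & 0xFFF
    r_tti = (r_tt + 1) & 0xFFF

    # Index 0: path assuming pscond = 0, index 1: path assuming pscond = 1.
    ovf = [False, (e1 == e_max) and not spca]
    inx = [r_tinxq, r_tinxq or ovf[1]]
    unf = [
        bool(r_tt >> 11) and (unf_en or inx[0]) and not spca,
        bool(r_tti >> 11) and (unf_en or inx[1]) and not spca,
    ]

    sel = 1 if pscond else 0
    return ovf[sel], inx[sel], unf[sel]
```

Both candidate results exist before `pscond` is known, so the late signal only drives the final selection, which mirrors the reason given in the text for computing the two paths in parallel.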
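In the following we describe the computations of the exponent wrapping and the exponent rounding, which are required in the second step according to equation 4.168.
We split the computations for the sign, the exponent and the significand. The sign is given by $\mathit{r\_sn1} = sl$. With the preselection of the significand result for an untrapped overflow based on the rounding mode (note that $\mathit{srmode} \neq \mathit{RZ}$ for $\mathit{OR}(\mathit{srmode}[1:0]) = 1$) by
\[
\mathit{r\_fovf} = \langle \mathit{r\_fovf}[0:52] \rangle_{neg} =
\begin{cases}
f_{\infty} = \langle (1, 0^{52}) \rangle_{neg} & \text{if } \mathit{OR}(\mathit{srmode}[1:0]) \\
f_{max} = \langle (1^{24}, \mathit{dbl}^{29}) \rangle_{neg} & \text{otherwise,}
\end{cases}
\]
the final significand can be selected according to 2.12 by
\[
\mathit{r\_fn} =
\begin{cases}
\mathit{r\_fovf} & \text{if } \mathit{r\_ovf} \wedge \overline{\mathit{ovf\_en}} \\
\mathit{r\_fn1} & \text{otherwise.}
\end{cases}
\]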
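For the exponent computations we predict the wrapping exponent constant based on the sign of the exponent $e_{1}$:
\[
\mathit{pwec} =
\begin{cases}
+\alpha = \langle +\mathit{alpha}[11:7] \rangle_{2} = \langle (0, \mathit{dbl}^{2}, 0, \overline{\mathit{dbl}}^{2}, 0^{6}) \rangle_{2} & \text{if } e_{1}[11] \\
-\alpha = \langle -\mathit{alpha}[11:7] \rangle_{2} = \langle (1, \overline{\mathit{dbl}}, 1, \overline{\mathit{dbl}}, 0, \overline{\mathit{dbl}}, 0^{6}) \rangle_{2} & \text{otherwise.}
\end{cases}
\]
This prediction can be done due to the fact that for a non-negative exponent $e_{1} \ge 0$ also $\mathit{r\_en1} \ge 0$, so that no underflow can occur, and for a negative exponent $e_{1} < 0$ we have $\mathit{r\_en1} \le 0$, so that no overflow can occur. The exponent wrapping constant can then be selected by:
\[
\mathit{wec} =
\begin{cases}
\mathit{pwec} & \text{if } ((\mathit{r\_ovf} \wedge \mathit{ovf\_en}) \vee (\mathit{r\_unf} \wedge \mathit{unf\_en})) \\
0 & \text{otherwise.}
\end{cases}
\]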
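The bit patterns of the predicted wrapping constant can be cross-checked against the IEEE 754-1985 value $\alpha = 3 \cdot 2^{n-2}$ (192 for single and 1536 for double precision). The sketch below assumes that the entries selecting the single precision case are the complemented signal $\overline{\mathit{dbl}}$, which is the only reading consistent with these two values; the function name `pwec_bits` is hypothetical.

```python
def pwec_bits(dbl: bool, e1_negative: bool) -> int:
    """Build the 12-bit pattern of pwec from the dbl flag.

    e1_negative = True selects +alpha (possible underflow),
    e1_negative = False selects -alpha in 12-bit two's complement.
    """
    d, nd = int(dbl), int(not dbl)
    if e1_negative:
        bits = [0, d, d, 0, nd, nd] + [0] * 6      # (0, dbl^2, 0, /dbl^2, 0^6)
    else:
        bits = [1, nd, 1, nd, 0, nd] + [0] * 6     # (1, /dbl, 1, /dbl, 0, /dbl, 0^6)
    value = 0
    for b in bits:                                  # most significant bit first
        value = (value << 1) | b
    return value
```

For double precision the positive pattern evaluates to 1536 and the negative pattern to 4096 - 1536, so a single 12-bit adder can apply either correction.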
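The final exponent selection including the exponent wrapping and rounding is given by
\[
\mathit{r\_en} =
\begin{cases}
e_{max} + 1 & \text{if } \mathit{r\_ovf} \wedge \overline{\mathit{ovf\_en}} \wedge \mathit{OR}(\mathit{srmode}[1:0]) \\
e_{max} & \text{if } \mathit{r\_ovf} \wedge \overline{\mathit{ovf\_en}} \wedge \mathit{NOR}(\mathit{srmode}[1:0]) \\
\mathit{r\_e} + \mathit{pwec} & \text{if } \mathit{r\_ovf} \wedge \mathit{ovf\_en} \;\vee\; \mathit{r\_unf} \wedge \mathit{unf\_en} \\
\mathit{r\_e} & \text{otherwise.}
\end{cases}
\tag{4.201}
\]
By the definition of
\[
\mathit{r\_erp} =
\begin{cases}
e_{max} + 1 & \text{if } \mathit{OR}(\mathit{srmode}[1:0]) \\
e_{max} & \text{otherwise}
\end{cases}
\tag{4.202}
\]
\[
\mathit{r\_eop} =
\begin{cases}
\mathit{r\_erp} & \text{if } \overline{\mathit{ovf\_en}} \wedge \overline{e_{1}[11]} \\
\mathit{r\_en1} + \mathit{pwec} & \text{otherwise}
\end{cases}
\tag{4.203}
\]
the exponent selection according to equation 4.201 can be written as
\[
\mathit{r\_en} =
\begin{cases}
\mathit{r\_eop} & \text{if } \mathit{r\_ovf} \vee (\mathit{r\_unf} \wedge \mathit{unf\_en}) \\
\mathit{r\_en1} & \text{otherwise}
\end{cases}
\tag{4.204}
\]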
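That the factored selection of equations 4.202 to 4.204 agrees with the direct selection of equation 4.201 can be confirmed by enumerating all control signal combinations. The sketch below rests on assumptions: the function names are hypothetical, the exponent called r_e in equation 4.201 is taken to be r_en1, untrapped overflow is read as the trap enable being off, and the enumeration only covers the combinations allowed by the sign argument of the text (overflow only for non-negative $e_{1}$, underflow only for negative $e_{1}$).

```python
from itertools import product


def r_en_direct(r_ovf, r_unf, ovf_en, unf_en, or_srmode):
    """Selection as in eq. 4.201; returns a label for the chosen exponent."""
    if r_ovf and not ovf_en and or_srmode:
        return "e_max+1"
    if r_ovf and not ovf_en and not or_srmode:
        return "e_max"
    if (r_ovf and ovf_en) or (r_unf and unf_en):
        return "r_en1+pwec"
    return "r_en1"


def r_en_factored(r_ovf, r_unf, ovf_en, unf_en, or_srmode, e1_negative):
    """Selection as in eqs. 4.202-4.204."""
    r_erp = "e_max+1" if or_srmode else "e_max"
    r_eop = r_erp if (not ovf_en and not e1_negative) else "r_en1+pwec"
    return r_eop if (r_ovf or (r_unf and unf_en)) else "r_en1"


def selections_agree() -> bool:
    """Compare both selections over all admissible signal combinations."""
    for r_ovf, r_unf, ovf_en, unf_en, or_sr, e1_neg in product([False, True], repeat=6):
        # Overflow needs a non-negative e_1, underflow a negative e_1.
        if (r_ovf and e1_neg) or (r_unf and not e1_neg):
            continue
        direct = r_en_direct(r_ovf, r_unf, ovf_en, unf_en, or_sr)
        factored = r_en_factored(r_ovf, r_unf, ovf_en, unf_en, or_sr, e1_neg)
        if direct != factored:
            return False
    return True
```

The factored form is attractive for the circuit because `r_eop` depends only on early signals, so the late exception flags merely steer one final multiplexer.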
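Because $\mathit{r\_en1}$ is selected from $e_{1}$ and $\mathit{ei}_{1}$ depending on the post-normalization shift condition $\mathit{pscond}$, we also compute the selections of the exponents in two parallel paths for $\mathit{pscond} = 0$ and for $\mathit{pscond} = 1$, as in the computation of the exception conditions. With the convention that appending the letter 'i' to a variable name indicates the incremented version of this variable, we get the final exponent by the following selections:
\[
\mathit{ew} = e_{1} + \mathit{pwec} \qquad \mathit{ewi} = \mathit{ei}_{1} + \mathit{pwec}
\]
\[
\mathit{eop} =
\begin{cases}
\mathit{r\_erp} & \text{if } \overline{\mathit{ovf\_en}} \wedge \overline{e_{1}[11]} \\
\mathit{ew} & \text{otherwise}
\end{cases}
\qquad
\mathit{eopi} =
\begin{cases}
\mathit{r\_erp} & \text{if } \overline{\mathit{ovf\_en}} \wedge \overline{e_{1}[11]} \\
\mathit{ewi} & \text{otherwise}
\end{cases}
\]
\[
\mathit{en} =
\begin{cases}
\mathit{eop} & \text{if } \mathit{r\_ovf}[0] \vee (\mathit{r\_unf}[0] \wedge \mathit{unf\_en}) \\
e_{1} & \text{otherwise}
\end{cases}
\qquad
\mathit{eni} =
\begin{cases}
\mathit{eopi} & \text{if } \mathit{r\_ovf}[1] \vee (\mathit{r\_unf}[1] \wedge \mathit{unf\_en}) \\
\mathit{ei}_{1} & \text{otherwise}
\end{cases}
\]
\[
\mathit{r\_en} =
\begin{cases}
\mathit{eni} & \text{if } \mathit{pscond} \\
\mathit{en} & \text{otherwise}
\end{cases}
\]
This completes the description of the second computation step according to equation 4.168. Additionally, we have to compute the signal $\mathit{inf}_{reg}$ that indicates the case of an infinity result in the regular case:
\[
\mathit{inf}_{reg} \iff \mathit{r\_ovf} \wedge \overline{\mathit{ovf\_en}} \wedge \mathit{OR}(\mathit{srmode}[1:0]).
\]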
Figure 4.19: Exponent and exceptions circuit for the R-path of the addition/subtraction unit III. Shaded boxes had to be added to the R-path implementation of the addition/subtraction unit II.
The extensions and changes for the R-path of the addition/subtraction unit III based on the R-path implementation of the previous section are depicted in figure 4.18. A more detailed block diagram of the exponent and exceptions circuit is given in figure 4.19. Also in this figure the shaded circuits are required in addition to the R-path implementation from the previous section. The path selection condition is not changed in this section, so that we can use the same implementation for $\mathit{is\_r}$ as in the previous section. This completes the description of the addition/subtraction III unit.
4.3 Multiplication
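4.3.1 Multiplication I (normalized → representative format)
Specification. This section describes an FP multiplication unit that is able to multiply two FP numbers given in the normalized representations (section 2.6.3):
\[
\mathit{BUSa}_{NF}[69:0] = (\mathit{sa}, \mathit{ea}[11:0], \mathit{fa}[0:52], \mathit{zeroa}, \mathit{infa}, \mathit{qnana}, \mathit{snana}) \tag{4.205}
\]
\[
\mathit{BUSb}_{NF}[69:0] = (\mathit{sb}, \mathit{eb}[11:0], \mathit{fb}[0:52], \mathit{zerob}, \mathit{infb}, \mathit{qnanb}, \mathit{snanb}), \tag{4.206}
\]
which represent the factorings $(\mathit{sa}, \mathit{ea}, \mathit{fa}) = \mathit{fact}_{NF}(\mathit{BUSa}_{NF}[69:0])$ and $(\mathit{sb}, \mathit{eb}, \mathit{fb}) = \mathit{fact}_{NF}(\mathit{BUSb}_{NF}[69:0])$.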
In the case that both operands have representable values, the exact product $\mathit{exact}_{mult}$ is defined by (section 2.2.4):
\[
\mathit{exact}_{mult} = (-1)^{\mathit{sa} \oplus \mathit{sb}} \cdot 2^{\mathit{ea} + \mathit{eb}} \cdot \mathit{fa} \cdot \mathit{fb}. \tag{4.207}
\]
If $(s_{rc}, e_{rc}, f_{rc})$ is an RF factoring of this exact product $\mathit{exact}_{mult}$ for non-zero representable inputs, then for the general case of arbitrary input values, an RF factoring of the required product is given by (see equation 2.17):
\[
(s_{RF}, e_{RF}, f_{RF}) =
\begin{cases}
(0, e_{qNaN}, f_{qNaN}) & \text{if } \mathit{scqnan} \\
(s_{inf}, e_{\infty}, f_{\infty}) & \text{if } \mathit{scinf} \\
(\mathit{sa}, \mathit{ea}, \mathit{fa}) & \text{if } \mathit{scx} \\
(\mathit{sb}, \mathit{eb}, \mathit{fb}) & \text{if } \mathit{scy} \\
(s_{0}, e_{0}, 0) & \text{if } \mathit{sczero} \\
(s_{rc}, e_{rc}, f_{rc}) & \text{otherwise.}
\end{cases}
\tag{4.208}
\]
The product output of the multiplication I unit is then specified by the corresponding representation in the representative format $\mathit{BUS}_{RF}[73:0] = \mathit{rf}(s_{RF}, e_{RF}, f_{RF})$. Moreover, in the multiplication I unit the invalid flag $\mathit{inv}$ should be signaled according to the occurrence of an invalid exception.
Implementation. The computations of the special conditions in equation 4.208 are already summarized in section 2.4.4 by equations 2.27-2.33. Like in the addition implementations, we select the result from equation 4.61 in two steps by the definition of the sign $s_{sc}$, the exponent $e_{sc}$ and the significand $f_{sc}$ for the special case:
\[
(s_{sc}, e_{sc}, f_{sc}) =
\begin{cases}
(0, e_{qNaN}, f_{qNaN}) & \text{if } \mathit{scqnan} \\
(s_{inf}, e_{\infty}, f_{\infty}) & \text{if } \mathit{scinf} \\
(\mathit{sa}, \mathit{ea}, \mathit{fa}) & \text{if } \mathit{scx} \\
(\mathit{sb}, \mathit{eb}, \mathit{fb}) & \text{if } \mathit{scy} \\
(s_{0}, e_{0}, 0) & \text{otherwise.}
\end{cases}
\tag{4.209}
\]
These computations are implemented in the special cases circuit in figure 4.20. Obviously, according to tables 2.6-2.7, an invalid exception occurs for a multiplication iff the result is a quiet NaN, in which case we have $\mathit{scqnan} = 1$. Thus, we already get the invalid flag by $\mathit{inv} \iff \mathit{scqnan}$. With the definition of the special case condition $\mathit{spca}$ by
\[
\mathit{spca} \iff \mathit{scqnan} \vee \mathit{scinf} \vee \mathit{scx} \vee \mathit{scy} \vee \mathit{sczero} \tag{4.210}
\]
like in the addition implementations, the final multiplication result can be selected by
\[
(s_{RF}, e_{RF}, f_{RF}) =
\begin{cases}
(s_{sc}, e_{sc}, f_{sc}) & \text{if } \mathit{spca} \\
(s_{rc}, e_{rc}, f_{rc}) & \text{otherwise.}
\end{cases}
\tag{4.211}
\]
This completes the description of the computations for the special cases and the exception
recognition.
In the following the computation of the RF factoring $(s_{rc}, e_{rc}, f_{rc})$ for the regular case is described. For this computation we can assume non-zero representable operands.
For non-zero operands the significands are normalized with $\mathit{fa}, \mathit{fb} \in [1, 2[$, so that the product of the significands is in the range $\mathit{fpr} = \mathit{fa} \cdot \mathit{fb} \in [1, 4[$. Thus, according to definition 2.21, the factoring $(s_{rc}, e_{rc}, f_{rc}) = (\mathit{sa} \oplus \mathit{sb}, \mathit{ea} + \mathit{eb}, \mathit{rep}_{53}(\mathit{fpr}))$ is an RF factoring of $\mathit{exact}_{mult}$ for representable operands. In this way the sign $s_{rc}$ and the exponent $e_{rc}$ for the regular case can be computed by:
\[
s_{rc} = \mathit{sa} \oplus \mathit{sb} \tag{4.212}
\]
\[
e_{rc} = \langle e_{rc}[12:0] \rangle_{2} = \langle (0, \mathit{ea}[11:0]) \rangle_{2} + \langle (0, \mathit{eb}[11:0]) \rangle_{2}. \tag{4.213}
\]
We deal with the computation of the significand $f_{rc} = \mathit{rep}_{53}(\mathit{fpr})$ in the following. Because the significands $\mathit{fa}$ and $\mathit{fb}$ are both integral multiples of $2^{-52}$, the product $\mathit{fpr} = \mathit{fa} \cdot \mathit{fb} \in [1, 4[$ is an integral multiple of $2^{-104}$ and can be represented by $\mathit{fpr} = \langle \mathit{fpr}[-1:104] \rangle_{neg}$. From this representation of the significand product, the 53-representative $f_{rc} = \mathit{rep}_{53}(\mathit{fpr})$ can then easily be generated following lemma 2.11:
\[
f_{rc}[-1:54] = (\mathit{fpr}[-1:53], \mathit{ORtree}(\mathit{fpr}[54:104])). \tag{4.214}
\]
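Equation 4.214 keeps the upper 55 product bits and compresses the remaining 51 bits into a single sticky bit. A minimal sketch of this compression on integers follows. The function name `rep53_bits` and the integer encoding are assumptions made for illustration: the 106-bit product $\mathit{fpr}[-1:104]$ is held in an integer whose least significant bit has weight $2^{-104}$, which is what the product of two 53-bit significand integers (scaled by $2^{52}$ each) yields.

```python
def rep53_bits(fpr_bits: int) -> int:
    """Return f_rc[-1:54] as a 56-bit integer, following eq. 4.214.

    fpr_bits holds fpr[-1:104]: bit position i of the text corresponds
    to integer bit (104 - i).
    """
    assert 0 <= fpr_bits < (1 << 106)
    upper = fpr_bits >> 51                              # fpr[-1:53], 55 bits
    sticky = int((fpr_bits & ((1 << 51) - 1)) != 0)     # OR-tree over fpr[54:104]
    return (upper << 1) | sticky


def significand_product(fa_int: int, fb_int: int) -> int:
    """Exact product of two significands given as integers scaled by 2**52."""
    return fa_int * fb_int
```

For example, multiplying $1 + 2^{-52}$ by itself leaves the term $2^{-104}$ below position 53, so the sticky bit of the representative is set, whereas the product $1 \cdot (1 + 2^{-52})$ is represented with a sticky bit of zero.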
The computation of $\mathit{fpr}[-1:104]$ is partitioned into two steps:
(A) the computation of a carry-save representation of the product $\mathit{fpr}$ by
\[
\langle \mathit{fprs}[-1:104] \rangle_{neg} + \langle \mathit{fprc}[-1:104] \rangle_{neg} = \langle \mathit{fa}[0:52] \rangle_{neg} \cdot \langle \mathit{fb}[0:52] \rangle_{neg}. \tag{4.215}
\]
(B) the compression from the carry-save representation of the product to the binary product representation $\mathit{fpr}[-1:104]$ with
\[
\langle \mathit{fpr}[-1:104] \rangle_{neg} = \langle \mathit{fprs}[-1:104] \rangle_{neg} + \langle \mathit{fprc}[-1:104] \rangle_{neg}.
\]
This equation is implemented by a 106-bit carry-lookahead adder.
The computation for step (A) consists of the partial product generation and reduction of the significand multiplication and has to be further specified. We consider two different implementations using a Booth encoded adder tree:
In the first, full-sized implementation we directly use a 53-bit by 53-bit Booth2 encoded partial product generation and reduction implementation, which we denote by the function $\mathit{btree}$, to implement equation 4.215 by
\[
(\mathit{fprs}[-1:104], \mathit{fprc}[-1:104]) = \mathit{btree}_{53,53}(\mathit{fa}[0:52], \mathit{fb}[0:52]).
\]
This implementation is depicted in figure 4.21.
In the second implementation of the partial product generation and reduction for step (A), we use a half-sized 53-bit by 27-bit Booth2 encoded adder tree that is able to consider two additional constants, which we denote by the function boothtreepp. In this "half-sized" implementation, the computations of step (A) are implemented in two iterations for double precision and in one iteration for single precision. For the double precision computation in two iterations, we require the signal $\mathit{iter2}$, which indicates that we are in the second iteration. The following lemma describes the underlying partitioning of the partial product formula for the significand product $\mathit{fpr}$:
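Lemma 4.23 With the selection of the significand half $\mathit{fbsel} = \langle \mathit{fbsel}[0:26] \rangle_{neg}$ and the definition of the sums $\mathit{pp1}$, for which we use $\mathit{iter2} = 0$, and $\mathit{pp2}$, for which we use $\mathit{iter2} = 1$, according to
\[
\mathit{fbsel}[0:26] =
\begin{cases}
(\mathit{fb}[27:52], 0) & \text{if } \mathit{dbl} \text{ AND } \overline{\mathit{iter2}} \\
\mathit{fb}[0:26] & \text{otherwise.}
\end{cases}
\tag{4.216}
\]
\[
\mathit{pp1} = \sum_{i=0}^{26} (\mathit{fa} \wedge \mathit{fbsel}[i]) \cdot 2^{-i} \tag{4.217}
\]
\[
\mathit{pp2} = \sum_{i=0}^{26} (\mathit{fa} \wedge \mathit{fbsel}[i]) \cdot 2^{-i} + 2^{-27} \cdot \mathit{pp1} \tag{4.218}
\]
the significand product $\mathit{fpr}$ can be selected by
\[
\mathit{fpr} =
\begin{cases}
\mathit{pp2} & \text{if } \mathit{dbl} \\
\mathit{pp1} & \text{otherwise}
\end{cases}
\]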
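Proof: The partial product formula for the significand product $\mathit{fpr}$ can be written as
\[
\mathit{fpr} = \mathit{fa} \cdot \mathit{fb} = \mathit{fa} \cdot \langle \mathit{fb}[0:52] \rangle_{neg} \tag{4.219}
\]
\[
= \sum_{i=0}^{52} (\mathit{fa} \wedge \mathit{fb}[i]) \cdot 2^{-i} \tag{4.220}
\]
\[
= \sum_{i=0}^{26} (\mathit{fa} \wedge \mathit{fb}[i]) \cdot 2^{-i} + \sum_{i=27}^{52} (\mathit{fa} \wedge \mathit{fb}[i]) \cdot 2^{-i} \tag{4.221}
\]
\[
= \sum_{i=0}^{26} (\mathit{fa} \wedge \mathit{fb}[i]) \cdot 2^{-i} + 2^{-27} \cdot \sum_{i=0}^{26} (\mathit{fa} \wedge \mathit{fb}[i+27]) \cdot 2^{-i}. \tag{4.222}
\]
For double precision we have in the first iteration $\mathit{fbsel}[0:26] = (\mathit{fb}[27:52], 0)$, so that $\mathit{pp1} = \sum_{i=0}^{26} (\mathit{fa} \wedge \mathit{fb}[i+27]) \cdot 2^{-i}$. Thus, because in the second iteration for double precision we have $\mathit{fbsel}[0:26] = \mathit{fb}[0:26]$, it follows directly from equation 4.222 that $\mathit{fpr} = \mathit{pp2}$, as required for double precision.
Because for single precision $\mathit{fb}[27:52] = 0^{27}$ and $\mathit{fbsel}[0:26] = \mathit{fb}[0:26]$, we get $\mathit{fpr} = \sum_{i=0}^{26} (\mathit{fa} \wedge \mathit{fb}[i]) \cdot 2^{-i} = \mathit{pp1}$, as required for single precision. $\Box$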
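With the definition of the feedback operand
\[
\mathit{fdb} =
\begin{cases}
2^{-27} \cdot \mathit{pp1} & \text{if } \mathit{dbl} \text{ AND } \mathit{iter2} \\
0 & \text{otherwise}
\end{cases}
\tag{4.223}
\]
the equations from lemma 4.23 for $\mathit{pp1}$ with $\mathit{iter2} = 0$ and for $\mathit{pp2}$ with $\mathit{iter2} = 1$ can be written as
\[
\mathit{pp1} = \mathit{fa} \cdot \mathit{fbsel} + \mathit{fdb} \tag{4.224}
\]
\[
\mathit{pp2} = \mathit{fa} \cdot \mathit{fbsel} + \mathit{fdb} \tag{4.225}
\]
Because $\mathit{fa} \cdot \mathit{fbsel} = \langle \mathit{fa}[0:52] \rangle_{neg} \cdot \langle \mathit{fbsel}[0:26] \rangle_{neg}$ is an integral multiple of $2^{-78}$, the lower part of the binary representation of $\mathit{pp2} = \langle \mathit{pp2}[-1:105] \rangle_{neg}$ could be directly copied from the lower part of $\mathit{fdb} = \langle \mathit{fdb}[-1:105] \rangle_{neg}$ by
\[
\mathit{pp2}[79:105] = \mathit{fdb}[79:105].
\]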
Figure 4.20: Block diagram of the multiplication unit I.
During the computations of step (A), we use carry-save representations to represent $\mathit{pp1}$, $\mathit{fdb}$ and $\mathit{pp2}$. These carry-save representations are denoted according to the equations
\[
\mathit{pp1} = \langle \mathit{pps1}[-1:78] \rangle_{neg} + \langle \mathit{ppc1}[-1:78] \rangle_{neg} \tag{4.226}
\]
\[
\mathit{fdb} = \langle \mathit{fdbs}[-1:105] \rangle_{neg} + \langle \mathit{fdbc}[-1:105] \rangle_{neg} \tag{4.227}
\]
\[
\mathit{pp2} = \langle \mathit{pps2}[-1:105] \rangle_{neg} + \langle \mathit{ppc2}[-1:105] \rangle_{neg} \tag{4.228}
\]
Following equations 4.224-4.225, the carry-save representations of $\mathit{pp1}$ and of $\mathit{pp2}$ can then be computed with the "half-sized" adder tree by the function $\mathit{btreepp}_{53,27}$:
\[
(\mathit{pps1}[-1:78], \mathit{ppc1}[-1:78]) = \mathit{btreepp}_{53,27}(\mathit{fa}[0:52], \mathit{fbsel}[0:26], \mathit{fdbs}[-1:78], \mathit{fdbc}[-1:78])
\]
\[
(\mathit{fdbs}[-1:78], \mathit{fdbc}[-1:78]) = (0^{27}, \mathit{pps1}[-1:51], 0^{27}, \mathit{ppc1}[-1:51]) \text{ AND } (\mathit{dbl} \wedge \mathit{iter2})
\]
\[
(\mathit{pps2}[-1:78], \mathit{ppc2}[-1:78]) = \mathit{btreepp}_{53,27}(\mathit{fa}[0:52], \mathit{fbsel}[0:26], \mathit{fdbs}[-1:78], \mathit{fdbc}[-1:78])
\]
\[
(\mathit{pps2}[79:105], \mathit{ppc2}[79:105]) = (\mathit{pps1}[52:78], \mathit{ppc1}[52:78]) \text{ AND } (\mathit{dbl} \wedge \mathit{iter2}),
\]
so that we get the carry-save representation of the significand product $\mathit{fpr}$ with the sum-string $\mathit{fprs}[-1:104]$ and the carry-string $\mathit{fprc}[-1:104]$ by
\[
\mathit{fprs}[-1:104] =
\begin{cases}
(\mathit{pps2}[-1:78], \mathit{pps1}[52:77]) & \text{if } \mathit{dbl} \\
(\mathit{pps1}[-1:78], 0^{26}) & \text{otherwise.}
\end{cases}
\tag{4.229}
\]
\[
\mathit{fprc}[-1:104] =
\begin{cases}
(\mathit{ppc2}[-1:78], \mathit{ppc1}[52:77]) & \text{if } \mathit{dbl} \\
(\mathit{ppc1}[-1:78], 0^{26}) & \text{otherwise.}
\end{cases}
\tag{4.230}
\]
after one iteration for single precision and after two iterations for double precision. Based on this formula, equation 4.215 for the computation of step (A) is implemented with the 'half-sized' adder tree as depicted in figure 4.22. In this implementation, the results of the adder tree are saved in the carry-save registers $\mathit{ppregs}[-1:78]$ and $\mathit{ppregc}[-1:78]$ after each iteration. The feedback to the adder tree is split into an upper part covering positions $[-1:78]$ and a lower part covering positions $[79:104]$. The upper part of the feedback operand $\mathit{fdb}[-1:78]$ is directly input into the adder tree including the right-shift by 27 positions for double precision. Because the lower part of the feedback in $\mathit{fdbs}[79:104]$ and $\mathit{fdbc}[79:104]$ is not changed by the adder tree, it can be directly saved into registers $\mathit{ppregs}[79:104]$ and $\mathit{ppregc}[79:104]$. In this way, the registers $\mathit{ppregs}[-1:104]$ and $\mathit{ppregc}[-1:104]$ contain the carry-save representation of the significand product $\mathit{fpr}$ by $\mathit{fprs}[-1:104]$ and $\mathit{fprc}[-1:104]$ after two iterations for double precision and after one iteration for single precision.
This completes the description of the half-sized implementation for step (A), so that the descriptions of both implementations of the multiplication I unit are completed.
Figure 4.21: Full-sized implementation of the partial product generation and reduction for multiplication unit I.
Figure 4.22: Half-sized implementation of the partial product generation and reduction for multiplication unit I.
4.3.2 Multiplication II (normalized → gradual result format)
Specification. Like in the previous section, also in this section the FP multiplication is computed from the inputs of the normalized representations (section 2.6.3) $\mathit{BUSa}_{NF}[69:0]$ and $\mathit{BUSb}_{NF}[69:0]$. Because some rounding computations have to be considered in this section, also the input of the rounding mode, represented by $\mathit{rmode}[1:0]$, is required.
In this section, the exact multiplication result according to equation 4.207 has to be rounded by the general rounding function $\mathit{ground1}$. After this gradual rounding step the product should be output in the gradual result format $\mathit{BUS}_{GF}[73:0]$ (section 2.6.5). According to equation 4.207, a factoring of the exact product is given by $(s_{ex}, e_{ex}, \mathit{fpr}) = (\mathit{sa} \oplus \mathit{sb}, \mathit{ea} + \mathit{eb}, \mathit{fa} \cdot \mathit{fb})$ for non-zero representable operands. With the gradual rounded product $((s_{grc}, e_{grc}, f_{grc}), \mathit{tinc}, \mathit{tinx}) = \mathit{ground1}(s_{ex}, e_{ex}, \mathit{fpr})$ and the following GF factoring of the result for the case of arbitrary IEEE operands
\[
((s_{GF}, e_{GF}, f_{GF}), \mathit{tinc}_{GF}, \mathit{tinx}_{GF}) =
\begin{cases}
((0, e_{qNaN}, f_{qNaN}), 0, 0) & \text{if } \mathit{scqnan} \\
((s_{inf}, e_{\infty}, f_{\infty}), 0, 0) & \text{if } \mathit{scinf} \\
((\mathit{sa}, \mathit{ea}, \mathit{fa}), 0, 0) & \text{if } \mathit{scx} \\
((\mathit{sb}, \mathit{eb}, \mathit{fb}), 0, 0) & \text{if } \mathit{scy} \\
((s_{0}, e_{0}, 0), 0, 0) & \text{if } \mathit{sczero} \\
((s_{grc}, e_{grc}, f_{grc}), \mathit{tinc}, \mathit{tinx}) & \text{otherwise,}
\end{cases}
\tag{4.231}
\]
the product output of the multiplication unit II is specified by the gradual result representation $\mathit{BUS}_{GF}[73:0] = \mathit{gf}((s_{GF}, e_{GF}, f_{GF}), \mathit{tinc}_{GF}, \mathit{tinx}_{GF})$. The occurrence of an invalid exception should be signaled by the bit $\mathit{inv}$ also in this section.
Implementation. The special case conditions and values in equation 4.231 are identical to those in the specification of the previous section. In the implementation of this special case selection, the only difference from the previous section is that a representation in the gradual result format has 3 bits fewer in the significand; in the representative format these bits were filled with zeros. Moreover, the gradual result format requires two additional rounding tags, which have to be zero for special value results. For the special case selection, these small adjustments are integrated in the implementation depicted in figure 4.23. Accordingly, in the equations that are implemented in the special cases circuit, the selections for bit positions $[-1]$ and $[53:54]$ are omitted.
Figure 4.23: Block diagram of the multiplication unit II.
This already completes the description of the special cases computation, and in the following we only have to describe the computation of $((s_{grc}, e_{grc}, f_{grc}), tinc, tinx)$ for non-zero representable operands.

According to definition 2.17, the gradual rounding of the product can be composed of the three steps of an unbounded normalization shift $\eta$, the gradual significand rounding $sgrnd1$ and a post-normalization shift:

\[
((s_{grc}, e_{grc}, f_{grc}), tinc, tinx) = ground1(s_{ex}, e_{ex}, f_{ex}) \tag{4.232}
\]
\[
= post\_norm(sgrnd1_{mode\,?\,s}(\eta(s_{ex}, e_{ex}, f_{ex}))). \tag{4.233}
\]

Because for both single and double precision $fa, fb \in [1, 2 - 2^{-52}]$, the exact significand product is in the range $fpr = fa \cdot fb \in [1, 4 - 2^{-51}]$. Thus, in the same way as for the R-path of the addition II unit in lemma 4.13, we get:

\[
(s_{grc}, e_{grc}, f_{grc}) =
\begin{cases}
gpost\_norm(s_{ex}, e_{ex}, rnd_{mode\,?\,s_{ex},\,51}(fpr)) & \text{if } fpr \in [2,4[ \\
gpost\_norm(s_{ex}, e_{ex}, rnd_{mode\,?\,s_{ex},\,52}(fpr)) & \text{if } fpr \in [1,2[.
\end{cases}
\tag{4.234}
\]
With $srmode = mode\,?\,s_{ex}$ and the definition of the rounded significand product

\[
fprnd =
\begin{cases}
rnd_{srmode,51}(fpr) & \text{if } fpr \in [2,4[ \\
rnd_{srmode,52}(fpr) & \text{if } fpr \in [1,2[
\end{cases}
\tag{4.235}
\]

the regular case factoring can be written as $(s_{grc}, e_{grc}, f_{grc}) = gpost\_norm(s_{ex}, e_{ex}, fprnd)$.

From $fpr \in [1, 4 - 2^{-51}]$ it follows that the rounded significand product is also in the range $fprnd \in [1, 4 - 2^{-51}]$ and can be represented by $fprnd = \langle fprnd[-1:104] \rangle_{neg}$. With the definition of the post-normalization condition $pscond$,

\[
pscond \iff (fprnd \in [2,4[) \iff fprnd[-1], \tag{4.236}
\]

the generalized post-normalization shift (definition 4.121) can be written as

\[
(s_{grc}, e_{grc}, f_{grc}) =
\begin{cases}
(s_{ex}, e_{ex} + 1, fprnd/2) & \text{if } pscond \\
(s_{ex}, e_{ex}, fprnd) & \text{otherwise.}
\end{cases}
\tag{4.237}
\]
The exponents $e_{ex} = \langle e_{ex}[12:0] \rangle_{2} = \langle (0, ea[11:0]) \rangle_{2} + \langle (0, eb[11:0]) \rangle_{2}$ and $e_{ex} + 1 = \langle ei_{ex}[12:0] \rangle_{2}$ are computed by a 13-bit compound adder, so that according to equation 4.237 we get the exponent for the regular case $e_{grc} = \langle e_{grc}[12:0] \rangle_{2}$ by a selection depending on $pscond$. This exponent computation and the computation of the sign $s_{grc} = sa \oplus sb$ are included in the block diagram in figure 4.23.
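The exponent selection can be illustrated with a small behavioural sketch. It assumes that the 12-bit exponent fields are handled as plain non-negative bit patterns (the exponent encoding of section 2.6.3 is not modelled), and the function names are hypothetical:

```python
def compound_add(x: int, y: int, width: int = 13):
    """Model of a compound adder: returns (x + y, x + y + 1) modulo 2**width."""
    mask = (1 << width) - 1
    total = (x + y) & mask
    return total, (total + 1) & mask


def regular_case_exponent(ea: int, eb: int, pscond: bool) -> int:
    """Select e_grc[12:0] between e_ex and e_ex + 1 depending on pscond (equation 4.237)."""
    # Assumption: both inputs are zero-extended 12-bit fields, as in (0, ea[11:0]).
    e_ex, ei_ex = compound_add(ea & 0xFFF, eb & 0xFFF)
    return ei_ex if pscond else e_ex
```

Both candidate exponents are available at the same time, so the late-arriving signal pscond only has to drive a multiplexer.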
In the following we deal with the computation of the rounded significand product $fprnd$. The rounding computations are based on the injection-based rounding reduction (see section []), as it is used in the R-path of the addition units II and III. With the use of the rounding injections from equations 4.124-4.124, we get for $srmode \in \{RZ, RNU, RI\}$

\[
fprnd' =
\begin{cases}
rnd_{srmode,51}(fpr) = rnd_{RZ,51}(fpr + inj_{[2,4[}) & \text{if } fpr \in [2,4[ \\
rnd_{srmode,52}(fpr) = rnd_{RZ,52}(fpr + inj_{[1,2[}) & \text{otherwise.}
\end{cases}
\tag{4.238}
\]

Finally, to compute $fprnd$ from $fprnd'$ by implementing RNE instead of RNU, we will have to consider the L-bit fix.
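The idea behind injection-based rounding can be checked with a small integer sketch. It is only an illustration under stated assumptions: values are scaled by 2**104, the injection constants are the usual ones for this technique (0 for RZ, half an ulp for RNU, one ulp minus the smallest weight for RI) rather than quotations of the referenced equations, and all function names are hypothetical:

```python
FRAC = 104  # number of fractional bits; position k has weight 2**(FRAC - k)


def injection(srmode: str, pos: int) -> int:
    """Rounding injection for rounding position pos (assumed constants)."""
    if srmode == "RZ":
        return 0
    if srmode == "RNU":
        return 1 << (FRAC - pos - 1)      # half an ulp
    if srmode == "RI":
        return (1 << (FRAC - pos)) - 1    # all ones below the rounding position
    raise ValueError(srmode)


def rnd_rz(x: int, pos: int) -> int:
    """Round to zero (truncate) at position pos."""
    ulp = 1 << (FRAC - pos)
    return x - (x % ulp)


def rnd_by_injection(x: int, srmode: str, pos: int) -> int:
    """Reduce every mode to truncation after adding the injection."""
    return rnd_rz(x + injection(srmode, pos), pos)


def rnd_reference(x: int, srmode: str, pos: int) -> int:
    """Direct definition of RZ, RNU (ties round up) and RI for a positive x."""
    ulp = 1 << (FRAC - pos)
    low, rem = x - (x % ulp), x % ulp
    if srmode == "RZ":
        return low
    if srmode == "RI":
        return low + ulp if rem else low
    return low + ulp if 2 * rem >= ulp else low   # RNU
```

A tie such as 1 + 2**-53 at rounding position 52 is rounded up by RNU, which is why RNE needs the additional L-bit fix.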
Definition 4.10 We define the injected significand product $finpr = fpr + inj_{[1,2[}$, which already contains the rounding injection for the case that $fpr \in [1,2[$. The injection correction $injcor$ is defined by

\[
injcor = inj_{[2,4[} - inj_{[1,2[} \tag{4.239}
\]
\[
=
\begin{cases}
2^{-52} & \text{if } srmode = RI \\
2^{-53} & \text{if } srmode = RN \\
0 & \text{otherwise.}
\end{cases}
\tag{4.240}
\]

We define the corrected significand product by $fcorpr = finpr + injcor = fpr + inj_{[2,4[}$.

With these definitions, equation 4.238 can be written as

\[
fprnd' =
\begin{cases}
rnd_{srmode,51}(fpr) = rnd_{RZ,51}(finpr + injcor) & \text{if } fpr \in [2,4[ \\
rnd_{srmode,52}(fpr) = rnd_{RZ,52}(finpr) & \text{otherwise.}
\end{cases}
\tag{4.241}
\]
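As a quick consistency check, the injection values can be tabulated per reduced rounding mode. The column for $inj_{[1,2[}$ is read off the bit string in equation 4.243 and $injcor$ is taken from equation 4.240; the last column is not quoted from the injection definitions but derived here as their sum, so it should be read as an inference:

```latex
\begin{array}{l|ccc}
srmode & inj_{[1,2[} & injcor & inj_{[2,4[} = inj_{[1,2[} + injcor \\ \hline
RZ  & 0                  & 0       & 0 \\
RN  & 2^{-53}            & 2^{-53} & 2^{-52} \\
RI  & 2^{-52} - 2^{-104} & 2^{-52} & 2^{-51} - 2^{-104}
\end{array}
```

In each row the injection for $[2,4[$ is the injection for $[1,2[$ moved up by one bit position, as far as the 104 fractional bits allow.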
As in the previous section, the computations for the significand product are partitioned into two steps:

(A) computation of a carry-save representation of the injected significand product $finpr = fa \cdot fb + inj_{[1,2[}$ with sum-string $finprs[-1:104]$ and carry-string $finprc[-1:104]$. This computation corresponds to the 'injected partial product generation & reduction' circuit in figure 4.23.

(B) compression and gradual rounding from the carry-save representation of the injected significand product with the bit-strings $finprs[-1:104]$ and $finprc[-1:104]$ to the rounded significand product $fprnd = \langle fprnd[-1:52] \rangle_{neg}$ and the rounding tags $tinx$ and $tinc$. In combination with the generalized post-normalization shift for the significand according to equation 4.237, these computations correspond to the 'compression & gradual rounding' circuit in figure 4.23.

Figure 4.24: Full-sized implementation of the partial product generation and reduction with rounding injection for multiplication unit II.
(A) The only difference between the implementation of step (A) in this section and the partial product generation and reduction implementations in the previous section is the addition of the rounding injection constant $inj_{[1,2[}$.

For the full-sized adder tree implementation, we use the binary 106-bit representation of the injection $inj_{[1,2[}$. (Note that $srmode = RNE$ for $sr\_mode[0] = 1$ and $srmode = RI$ for $sr\_mode[1] = 1$.)

\[
inj_{[1,2[} = \langle inj12[-1:104] \rangle_{neg} \tag{4.242}
\]
\[
= \langle (0^{54},\ sr\_mode[0] \vee sr\_mode[1],\ sr\_mode[1]^{51}) \rangle_{neg} \tag{4.243}
\]

We replace the function $btree_{53,53}$ from the previous section by $btreep_{53,53}$ to add the injection to the significand product according to $finpr = fpr + inj_{[1,2[}$ by

\[
(finprs[-1:104],\ finprc[-1:104]) = btreep_{53,53}(fa[0:52],\ fb[0:52],\ inj12[-1:104]).
\]

This full-sized implementation of step (A) is depicted in figure 4.24. In this figure, the 'INJ generation' circuit contains the implementation of equation 4.243 and the rounding mode reduction according to equations 2.6-2.7.
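The bit string of equation 4.243 can be built and evaluated directly. The sketch below is only an illustration: it assumes that position i of a string indexed [-1:104] carries the weight 2**-i, and the helper names are hypothetical:

```python
from fractions import Fraction


def inj12_bits(sr_mode0: int, sr_mode1: int) -> list:
    """Bit string inj12[-1:104] of equation 4.243, position -1 first."""
    return [0] * 54 + [sr_mode0 | sr_mode1] + [sr_mode1] * 51


def value_neg(bits, first_pos: int = -1) -> Fraction:
    """Value of a bit string whose k-th entry sits at position first_pos + k."""
    return sum(Fraction(b) * Fraction(2) ** -(first_pos + k)
               for k, b in enumerate(bits))
```

For sr_mode[0] = 1 (RNE) only position 53 is set, for sr_mode[1] = 1 (RI) positions 53 to 104 are all ones, and for RZ the string is zero.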
For the half-sized adder tree implementation of step (A), we modify the feedback operand $fdb$ from the previous section to add the rounding injection $inj_{[1,2[}$ in the first iteration for both single and double precision. This can easily be done, because in the first iteration we have $fdb = 0$. Note that because the result of the first iteration is added in the second iteration weighted by $2^{-27}$, the injection $2^{27} \cdot inj_{[1,2[}$ has to be added in the first iteration for double precision. In this way, we define the injection feedback by

\[
fdbinj = \langle fdbinj[-1:78] \rangle_{neg} \tag{4.244}
\]
\[
=
\begin{cases}
2^{27} \cdot inj_{[1,2[} & \text{if } dbl \\
inj_{[1,2[} & \text{otherwise}
\end{cases}
\tag{4.245}
\]
\[
=
\begin{cases}
\langle (0^{27},\ sr\_mode[0] \vee sr\_mode[1],\ sr\_mode[1]^{52}) \rangle_{neg} & \text{if } dbl \\
\langle (0^{54},\ sr\_mode[0] \vee sr\_mode[1],\ sr\_mode[1]^{25}) \rangle_{neg} & \text{otherwise.}
\end{cases}
\tag{4.246}
\]

Integrated with the previous feedback operand $fdb$, we define the modified feedback operand $fdb'$ and the modified partial sums $pp1'$ and $pp2'$ by

\[
fdb' =
\begin{cases}
fdbinj & \text{if } \overline{iter2} \\
2^{-27} \cdot pp1' & \text{if } dbl \text{ AND } iter2 \\
0 & \text{otherwise}
\end{cases}
\]
\[
pp1' = fa \cdot fbsel + fdb'
\]
\[
pp2' = fa \cdot fbsel + fdb'
\]
Lemma 4.24 Based on the previous definitions, we get the injected significand product by

\[
finpr = fpr + inj_{[1,2[} =
\begin{cases}
pp2' & \text{if } dbl \\
pp1' & \text{otherwise}
\end{cases}
\tag{4.247}
\]

after one iteration for single precision and after two iterations for double precision.

Proof: For single precision, $fdbinj = inj_{[1,2[}$, so that $pp1' = pp1 + inj_{[1,2[} = fpr + inj_{[1,2[}$, as required. For double precision, we have $fdbinj = 2^{27} \cdot inj_{[1,2[}$, so that

\[
pp2' = fa \cdot fbsel + 2^{-27} \cdot pp1'
     = fa \cdot fbsel + 2^{-27} \cdot pp1 + 2^{-27} \cdot 2^{27} \cdot inj_{[1,2[}
     = fpr + inj_{[1,2[}
\]

and the proof of the lemma is completed. □
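The double precision case of the lemma can be replayed with exact rational arithmetic. The sketch below is a minimal illustration, assuming that the first iteration multiplies by the lower half (fb[27:52], 0) and the second by the upper half fb[0:26], as the multiplexer in figure 4.22 suggests; the function name is hypothetical:

```python
from fractions import Fraction


def half_sized_injected_product(fa_bits: int, fb_bits: int, inj: Fraction) -> Fraction:
    """Two-iteration product fa * fb + inj with the injection fed back in iteration 1.

    fa_bits and fb_bits are 53-bit integers encoding fa = fa_bits / 2**52 and
    fb = fb_bits / 2**52.
    """
    fa = Fraction(fa_bits, 2**52)
    fb_hi = Fraction(fb_bits >> 26, 2**26)              # <fb[0:26]>
    fb_lo = Fraction(fb_bits & (2**26 - 1), 2**25)      # <(fb[27:52], 0)> read at positions 0..26
    fdbinj = 2**27 * inj                                # pre-scaled injection feedback
    pp1 = fa * fb_lo + fdbinj                           # first iteration
    pp2 = fa * fb_hi + pp1 / 2**27                      # second iteration, feedback weighted by 2**-27
    return pp2
```

The scaling by 2**27 in the first iteration is cancelled exactly by the weight 2**-27 of the feedback, so the injection arrives at its intended bit positions.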
Thus, starting from the half-sized adder tree implementation from the previous section, only the feedback operand $fdb$ has to be changed to $fdb'$, and the carry-save representations of $fdb'$ and $finpr$ have to be considered for the implementation of step (A). This half-sized implementation of step (A) is depicted in figure 4.25. The injection feedback according to equation 4.246 is generated in the 'INJ generation' circuit, which also includes the rounding mode reduction according to equations 2.6-2.7. This completes the description of the two implementations (full-sized and half-sized) for step (A).
(B) For the computation of step (B) of the significand multiplication rounding, we first consider the computation of $fprnd'$; with the additional computation of the L-bit fix, we then get the rounded significand product $fprnd$.

According to equation 4.241, the computation of $fprnd' = \langle fprnd'[-1:52] \rangle_{neg}$ depends on $finpr = \langle finpr[-1:104] \rangle_{neg}$ and $fcorpr = \langle fcorpr[-1:104] \rangle_{neg}$. Since

\[
fprnd' =
\begin{cases}
rnd_{RZ,51}(fcorpr) = \langle fcorpr[-1:51] \rangle_{neg} & \text{if } fpr \in [2,4[ \\
rnd_{RZ,52}(finpr) = \langle finpr[-1:52] \rangle_{neg} & \text{otherwise,}
\end{cases}
\]

only the bit strings $finpr[-1:52]$ and $fcorpr[-1:51]$ have to be considered.

Figure 4.25: Half-sized implementation of the partial product generation and reduction with rounding injection for multiplication unit II.
As input for the computations, we get a carry-save representation of the injected significand product $finpr$ from step (A) with the sum-string $finprs[-1:104]$ and the carry-string $finprc[-1:104]$. We compress the bit positions $[-1:52]$ of this carry-save representation by a half-adder line to the carry-save representation with sum-string $xsum[-1:52]$ and carry-string $xcarry[-1:51]$, so that

\[
\langle xsum[-1:52] \rangle_{neg} + \langle xcarry[-1:51] \rangle_{neg} = \langle finprs[-1:52] \rangle_{neg} + \langle finprc[-1:52] \rangle_{neg}.
\]

Moreover, the bit positions $[53:104]$ of the carry-save representation of $finpr$ are compressed by a carry-lookahead adder to the binary representation $(c52, finpr[53:104])$ according to

\[
\langle (c52, finpr[53:104]) \rangle_{neg} = \langle finprs[53:104] \rangle_{neg} + \langle finprc[53:104] \rangle_{neg}.
\]

In this sum, the bit $c52$ is generated as a carry bit into position $[52]$.

Based on these compressed representations, we partition the computations for $finpr$ and $fcorpr$ into an upper part considering bit positions $[-1:51]$ and a lower part considering bit positions $[52:104]$.

For the computation of $finpr$, the lower part consists of

\[
lpart12 = \langle xsum[52] \rangle_{neg} + \langle (c52, finpr[53:104]) \rangle_{neg} \tag{4.248}
\]
\[
= \langle (rc12, finpr[52:104]) \rangle_{neg}. \tag{4.249}
\]
Thus, the bit $finpr[52]$ and the carry bit $rc12$ (rounding carry for $fpr \in [1,2[$) into position $[51]$ can be computed by a half-adder according to

\[
\langle (rc12, finpr[52]) \rangle = \langle xsum[52] \rangle + \langle c52 \rangle. \tag{4.250}
\]

For the computation of $fcorpr = finpr + injcor$, we additionally have to consider the injection correction $injcor$. According to equation 4.240, the injection correction satisfies $injcor \le 2^{-52}$, and $injcor$ can be represented by $injcor = \langle injcor[52:53] \rangle_{neg}$. Thus, the lower part of $fcorpr$ consists of

\[
lpart24 = \langle xsum[52] \rangle_{neg} + \langle (c52, finpr[53:104]) \rangle_{neg} + \langle injcor[52:53] \rangle_{neg}
        = \langle (rc24, fcorpr[52:104]) \rangle_{neg}.
\]

Because $injcor \le 2^{-52}$, we can add the tail $\langle finpr[53:104] \rangle_{neg}$ and the injection correction by

\[
\langle (cc52, fcorpr[53:104]) \rangle_{neg} = \langle finpr[53:104] \rangle_{neg} + \langle injcor[52:53] \rangle_{neg}.
\]

In this addition, the carry bit $cc52$ (correction carry into position $[52]$) is generated. Using the definition of the injection correction and the encoding for the reduced rounding modes according to table 2.3, the condition for the correction carry $cc52$ can be written as

\[
cc52 = sr\_mode[1] \vee (sr\_mode[0] \wedge finpr[53]). \tag{4.251}
\]

Thus, the bits $fcorpr[52]$ and $rc24$, which is the carry from the lower part of $fcorpr$ into bit position $[51]$, can be computed by a full-adder according to

\[
\langle (rc24, fcorpr[52]) \rangle = \langle xsum[52] \rangle + \langle c52 \rangle + \langle cc52 \rangle. \tag{4.252}
\]
The upper parts of $finpr$ and $fcorpr$ consist of

\[
\langle finpr[-1:51] \rangle_{neg} = \langle xsum[-1:51] \rangle_{neg} + \langle xcarry[-1:51] \rangle_{neg} + rc12 \cdot 2^{-51}
\]
\[
\langle fcorpr[-1:51] \rangle_{neg} = \langle xsum[-1:51] \rangle_{neg} + \langle xcarry[-1:51] \rangle_{neg} + rc24 \cdot 2^{-51}.
\]

With the definition of the upper product $upr$ and the incremented upper product $upri$ by

\[
upr = \langle upr[-1:51] \rangle_{neg} \tag{4.253}
\]
\[
= \langle xsum[-1:51] \rangle_{neg} + \langle xcarry[-1:51] \rangle_{neg} \tag{4.254}
\]
\[
upri = \langle upri[-1:51] \rangle_{neg} \tag{4.255}
\]
\[
= upr + 2^{-51}, \tag{4.256}
\]

the upper parts of $finpr$ and $fcorpr$ can each only have either the value $upr$ or the value $upri$. The upper parts of $finpr$ and $fcorpr$ differ only in the carry that is generated from the lower part into position $[51]$. Thus, if we select the proper carry bit into position $[51]$ depending on whether $fpr \in [1,2[$ or $fpr \in [2,4[$ by

\[
rcarry51 =
\begin{cases}
rc24 & \text{if } fpr \in [2,4[ \\
rc12 & \text{otherwise,}
\end{cases}
\tag{4.257}
\]

the upper part of the significand product $fprnd'$ can be selected by

\[
\langle fprnd'[-1:51] \rangle_{neg} =
\begin{cases}
upri = \langle upri[-1:51] \rangle_{neg} & \text{if } rcarry51 \\
upr = \langle upr[-1:51] \rangle_{neg} & \text{otherwise.}
\end{cases}
\tag{4.258}
\]
Additionally, the bit $fprnd'[52]$ is required for $fpr \in [1,2[$. In this case, we have $fprnd'[52] = finpr[52]$, which we already computed before, so that

\[
fprnd'[52] =
\begin{cases}
0 & \text{if } fpr \in [2,4[ \\
finpr[52] & \text{otherwise.}
\end{cases}
\tag{4.259}
\]

This completes the computation of $fprnd'$. To get the rounded significand product $fprnd$, we additionally have to implement the L-bit fix. In contrast to the L-bit fix implementation for the addition II unit, we have to consider that the injected significand product $finpr = fpr + inj_{[1,2[}$ contains the rounding injection $inj_{[1,2[} = 2^{-53}$ for $srmode = RN$. In this way, the conditions for the L-bit fix are given by

\[
lfix_{[1,2[} = \overline{sr\_mode[0] \wedge \overline{finpr[53]} \wedge \overline{OR(finpr[54:104])}}
\]
\[
lfix_{[2,4[} = \overline{sr\_mode[0] \wedge finpr[52] \wedge finpr[53] \wedge \overline{OR(finpr[54:104])}},
\]

so that the rounded significand product $fprnd = \langle fprnd[-1:52] \rangle_{neg}$ can be computed by

\[
fprnd[-1:52] =
\begin{cases}
(fprnd'[-1:50],\ fprnd'[51] \wedge lfix_{[2,4[},\ 0) & \text{if } fpr \in [2,4[ \\
(fprnd'[-1:51],\ fprnd'[52] \wedge lfix_{[1,2[}) & \text{otherwise.}
\end{cases}
\]

In this selection, only the bits in positions $[51:52]$ differ and have to be selected. We denote the least significant bit of the significand by $l24$ for the case that $fpr \in [2,4[$ and by $l12$ for the case that $fpr \in [1,2[$ (note that for $fpr \in [1,2[$ we have $fprnd'[52] = finpr[52]$):

\[
l24 = fprnd'[51] \wedge lfix_{[2,4[} \tag{4.260}
\]
\[
l12 = finpr[52] \wedge lfix_{[1,2[}, \tag{4.261}
\]

so that the selection can be written as

\[
fprnd[-1:52] =
\begin{cases}
(fprnd'[-1:50],\ l24,\ 0) & \text{if } fpr \in [2,4[ \\
(fprnd'[-1:51],\ l12) & \text{otherwise.}
\end{cases}
\tag{4.262}
\]
In the description of the rounding computations, the condition $fpr \in [2,4[$ is used to choose the proper rounding injection and to choose either $rnd_{RZ,51}(fcorpr)$ or $rnd_{RZ,52}(finpr)$ as the rounded result $fprnd'$. Because we only deal with the injected significand products, we do not have a signal that exactly implements the condition $fpr \in [2,4[$. The following lemma shows that the bit $upr[-1]$ can be used to substitute the condition $fpr \in [2,4[$. The bit $upr[-1]$ does not always agree with the condition $fpr \in [2,4[$, but it will be shown that in every case where $upr[-1]$ fails to predict the condition $fpr \in [2,4[$ correctly, it does not matter which rounding injection is chosen, because in these cases $rnd_{RZ,51}(fcorpr) = rnd_{RZ,52}(finpr)$.

Lemma 4.25 For the rounding computation according to equation 4.238, the condition $fpr \in [2,4[$ can be substituted by the signal $upr[-1]$, so that

\[
fprnd' =
\begin{cases}
rnd_{RZ,51}(fcorpr) & \text{if } upr[-1] \\
rnd_{RZ,52}(finpr) & \text{otherwise.}
\end{cases}
\]
Proof: We only have to consider the cases where $upr[-1] \ne (fpr \in [2,4[)$. In the following, we distinguish between: (a) $upr[-1] = 0$; and (b) $upr[-1] = 1$.

(a) For $upr[-1] = 0$, we have to consider the case that $fpr \in [2,4[$. Because $fpr \ge 2$ and $finpr = fpr + inj_{[1,2[} = upr + lpart12$, we have

\[
finpr \in [2,\ upr + lpart12], \tag{4.263}
\]

where $lpart12 < 3 \cdot 2^{-52}$ according to equation 4.248. Since $upr[-1] = 0$, it follows that $upr = \langle upr[-1:51] \rangle_{neg} \le 2 - 2^{-51}$. Since $upr + lpart12 \ge exact \ge 2$, we have $upr > 2 - 3 \cdot 2^{-52}$. Thus, $upr = \langle upr[-1:51] \rangle_{neg} = 2 - 2^{-51}$ and equation 4.263 yields

\[
finpr \in [2,\ 2 + 2^{-52}[.
\]

The injection correction satisfies $0 \le injcor \le 2^{-52}$, therefore

\[
fcorpr = finpr + injcor \in [2,\ 2 + 2^{-51}[.
\]

For these ranges of $finpr$ and $fcorpr$, it follows that

\[
rnd_{RZ,51}(fcorpr) = rnd_{RZ,52}(finpr) = 2.
\]

Thus, it does not matter which rounding injection is chosen in this case, and $fprnd' = 2$ independent of the selection value.

(b) For $upr[-1] = 1$, we only have to consider the case that $fpr < 2$. Since $inj_{[1,2[} \in [0, 2^{-52}[$, it follows that $finpr = fpr + inj_{[1,2[} < 2 + 2^{-52}$. Since $upr[-1] = 1$ and $finpr = upr + lpart12$, we have $finpr \ge 2$, so that

\[
finpr \in [2,\ 2 + 2^{-52}[.
\]

The proof now follows the proof of case (a). □
By the use of this lemma, equations 4.257 and 4.259 are implemented by

\[
rcarry51 =
\begin{cases}
rc24 & \text{if } upr[-1] \\
rc12 & \text{otherwise,}
\end{cases}
\]
\[
fprnd'[52] =
\begin{cases}
0 & \text{if } upr[-1] \\
finpr[52] & \text{otherwise.}
\end{cases}
\]

This completes the description of the rounded significand product $fprnd$. Additionally, for step (B) we have to compute the rounding tags for the rounding inexactness $tinx$ and for the rounding increment $tinc$.

The conditions for the rounding tags $tinx$ and $tinc$ are given by:

\[
tinx =
\begin{cases}
fpr[52] \vee ORtree(fpr[53:104]) & \text{if } upr[-1] \\
ORtree(fpr[53:104]) & \text{otherwise.}
\end{cases}
\]
\[
tinc =
\begin{cases}
fprnd[51] \oplus fpr[51] & \text{if } upr[-1] \\
fprnd[52] \oplus fpr[52] & \text{otherwise.}
\end{cases}
\]

The following lemma provides the equations for the implementation of the rounding tags based on the injected significand product $finpr$. Moreover, this lemma proposes how the computation of the $lfix$-bits can be based on signals from the rounding tag computation to share hardware.
Lemma 4.26 With the definition of

\[
fpr_{RN}[51:53] = \langle (xsum[51] \oplus xcarry[51] \oplus rc12,\ finpr[52:53]) \rangle_{neg} - 2^{-53} \bmod 2^{-50}
\]
\[
tinc_{RN} =
\begin{cases}
fpr_{RN}[51] \oplus (fprnd'[51] \wedge lfix_{[2,4[}) & \text{if } upr[-1] \\
fpr_{RN}[52] \oplus (fprnd'[52] \wedge lfix_{[1,2[}) & \text{otherwise}
\end{cases}
\]
\[
sticky2 = ORtree(finpr[53:104] \oplus sr\_mode[1])
\]

the rounding tags can be computed by

\[
tinx = ((sr\_mode[0] \oplus fpr[52]) \wedge upr[-1]) \vee sticky2
\]
\[
tinc = (sr\_mode[0] \wedge tinc_{RN}) \vee (sr\_mode[1] \wedge tinx).
\]

Moreover, based on the signal $sticky2$, the $lfix$-bits can be written as:

\[
lfix_{[1,2[} = \overline{sr\_mode[0] \wedge \overline{finpr[53]} \wedge \overline{sticky2}}
\]
\[
lfix_{[2,4[} = \overline{sr\_mode[0] \wedge finpr[52] \wedge finpr[53] \wedge \overline{sticky2}}.
\]
Proof: In order to prove the equations for the rounding tags and the $lfix$-bits, we first show some properties of the signals $fpr_{RN}[51:53]$ and $sticky2$, namely that (a) $sticky2 = ORtree(fpr[53:104])$ and that (b) in the rounding mode $srmode = RNE$, we have $fpr_{RN}[51:53] = fpr[51:53]$.

(a) Keeping in mind that $finpr = fpr + inj_{[1,2[}$, we distinguish for the proof between the two cases: (i) the rounding mode $srmode \in \{RZ, RNE\}$; and (ii) the rounding mode $srmode = RI$.

(i) For $srmode \in \{RZ, RNE\}$, we have $sr\_mode[1] = 0$ and $inj_{[1,2[}[53:104] = 0^{52}$, so that $fpr[53:104] = finpr[53:104] \oplus sr\_mode[1]$ and (a) follows immediately.

(ii) For $srmode = RI$, we have $sr\_mode[1] = 1$ and $inj_{[1,2[}[53:104] = 1^{52}$. Thus,

\[
(fpr[53:104] = 0^{52}) \iff (finpr[53:104] = 1^{52})
\]
\[
(fpr[53:104] = 0^{52}) \iff ((finpr[53:104] \oplus sr\_mode[1]) = 0^{52})
\]
\[
ORtree(fpr[53:104]) \iff ORtree(finpr[53:104] \oplus sr\_mode[1]),
\]

as required.

(b) In the rounding mode $srmode = RNE$, the rounding injection constant is $inj_{[1,2[} = 2^{-53}$. Thus, $\langle fpr[-1:53] \rangle_{neg} = \langle finpr[-1:53] \rangle_{neg} - 2^{-53}$, and therefore

\[
\langle fpr[51:53] \rangle_{neg} = \langle finpr[51:53] \rangle_{neg} - 2^{-53} \bmod 2^{-50}.
\]

Property (b) then follows from $finpr[51] = xsum[51] \oplus xcarry[51] \oplus rc12$.

The equations for $tinx$ and the $lfix$-bits follow immediately from property (a). In the proof of the equation for $tinc$, we distinguish between the three cases: (i) $srmode = RZ$; (ii) $srmode = RNE$; and (iii) $srmode = RI$.

(i) In the rounding mode $srmode = RZ$, a rounding increment never occurs.

(iii) It follows from the IEEE rounding definition that in the rounding mode $srmode = RI$, a rounding increment occurs iff the result is inexact ($tinx = 1$).
(ii) In the rounding mode $srmode = RNE$, we implement the defining equation for $tinc$ given before the lemma. Using property (b) and the equations $fprnd[51] = fprnd'[51] \wedge lfix_{[2,4[}$ and $fprnd[52] = fprnd'[52] \wedge lfix_{[1,2[}$, we get in the rounding mode $srmode = RNE$ that $tinc = tinc_{RN}$.

We join the three cases (i)-(iii) for the three rounding modes to the following general equation for $tinc$:

\[
tinc =
\begin{cases}
tinx & \text{if } srmode = RI \\
tinc_{RN} & \text{if } srmode = RNE \\
0 & \text{if } srmode = RZ.
\end{cases}
\]

Because the rounding mode $srmode = RI$ is signaled by $sr\_mode[1] = 1$ and the rounding mode $srmode = RNE$ by $sr\_mode[0] = 1$, this is equivalent to the equation that we have to prove for $tinc$. □
In this way, the description of the implementation of part (B) with the rounded significand product $fprnd$ and the rounding tags $tinx$ and $tinc$ is completed. Based on this, the significand $f_{grc} = \langle f_{grc}[0:52] \rangle_{neg}$ for the regular case can be selected from $fprnd = \langle fprnd[-1:52] \rangle_{neg}$ by the generalized post-normalization shift for the significand according to equation 4.237. In the following, we summarize the steps for the computation of part (B) of the significand multiplication and rounding and the generalized post-normalization shift:

1. Compression of positions $[-1:52]$ of the carry-save representation of $finpr$ by a half-adder line according to

\[
\langle xsum[-1:52] \rangle_{neg} + \langle xcarry[-1:51] \rangle_{neg} = \langle finprs[-1:52] \rangle_{neg} + \langle finprc[-1:52] \rangle_{neg},
\]

and addition of bit positions $[53:104]$ of the carry-save representation of $finpr$ by a 52-bit carry-lookahead adder according to

\[
\langle (c52, finpr[53:104]) \rangle_{neg} = \langle finprs[53:104] \rangle_{neg} + \langle finprc[53:104] \rangle_{neg}.
\]

2. Computation of the upper product $upr = \langle upr[-1:51] \rangle_{neg}$ (equation 4.254) and the incremented upper product $upri = \langle upri[-1:51] \rangle_{neg}$ (equation 4.256) by a 53-bit compound adder that computes

\[
\langle upr[-1:51] \rangle_{neg} = \langle xsum[-1:51] \rangle_{neg} + \langle xcarry[-1:51] \rangle_{neg}
\]
\[
\langle upri[-1:51] \rangle_{neg} = \langle upr[-1:51] \rangle_{neg} + 2^{-51}.
\]

3. After the computation of the correction carry $cc52$ into position $[52]$ (equation 4.251), the rounding carries into position $[51]$ are computed: $rc12$ for the case $fpr \in [1,2[$ by a half-adder according to equation 4.250 and $rc24$ for the case $fpr \in [2,4[$ by a full-adder according to equation 4.252. The proper rounding carry $rcarry51$ into position $[51]$ is then selected according to equation 4.257 and lemma 4.25:

\[
cc52 = sr\_mode[1] \vee (sr\_mode[0] \wedge finpr[53])
\]
\[
\langle (rc12, finpr[52]) \rangle = \langle xsum[52] \rangle + \langle c52 \rangle
\]
\[
\langle (rc24, fcorpr[52]) \rangle = \langle xsum[52] \rangle + \langle c52 \rangle + \langle cc52 \rangle
\]
\[
rcarry51 =
\begin{cases}
rc24 & \text{if } upr[-1] \\
rc12 & \text{otherwise.}
\end{cases}
\]

4. Depending on the value of the rounding carry $rcarry51$, the upper part (positions $[-1:51]$) of the rounded significand product $fprnd'$ is selected by (equation 4.258):

\[
fprnd'[-1:51] =
\begin{cases}
upri[-1:51] & \text{if } rcarry51 \\
upr[-1:51] & \text{otherwise.}
\end{cases}
\]

5. The rounding tags and the $lfix$-bits are computed according to the equations from lemma 4.26.

6. The L-bits $l24$ and $l12$ of the rounded significand product are computed according to equations 4.260-4.261, taking the L-bit fix into account:

\[
l24 = fprnd'[51] \wedge lfix_{[2,4[}
\]
\[
l12 = finpr[52] \wedge lfix_{[1,2[}.
\]

7. Finally, the generalized post-normalization shift of the significand is computed according to equations 4.262 and 4.237. This selection is computed separately for positions $[0:51]$ by

\[
f_{grc}[0:51] =
\begin{cases}
fprnd'[-1:50] & \text{if } fprnd'[-1] \\
fprnd'[0:51] & \text{otherwise}
\end{cases}
\]

and for position $[52]$ by

\[
f_{grc}[52] =
\begin{cases}
l24 & \text{if } fprnd'[-1] \\
l12 & \text{otherwise.}
\end{cases}
\]

Figure 4.26: Block diagram of part (B) of the significand multiplication including the gradual rounding computations and the generalized post-normalization shift for multiplication unit II.

The implementation of these steps is depicted in the block diagram in figure 4.26. In this figure, the 'rounding decisions' circuit contains the implementation of steps 3 and 5.

In this way, the description of the multiplication unit II is completed.
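The significand path of step (B) can be replayed as an integer model and compared against a direct rounding of the exact product. This is only a sketch under stated assumptions: position i of a [-1:104] string maps to integer bit 104 - i, the lfix signals are taken to be masks that clear the L-bit exactly on an RNE tie, the rounding tags are not modelled, and the names step_b, reference and round_product are hypothetical:

```python
import random

W = 104                    # fractional bits; position i is integer bit W - i
LOW = (1 << 52) - 1        # mask for positions [53:104]


def inj12(srmode):
    """Injection for fpr in [1,2[ (equation 4.243), scaled by 2**W."""
    return {"RZ": 0, "RNE": 1 << 51, "RI": LOW}[srmode]


def step_b(finprs, finprc, srmode):
    """Return (f_grc[0:52] as a 53-bit integer, pscond) from a carry-save finpr."""
    rne, ri = srmode == "RNE", srmode == "RI"
    # 1. half-adder line on [-1:52], carry-lookahead addition on [53:104]
    us, uc = finprs >> 52, finprc >> 52
    xsum, xcarry = us ^ uc, (us & uc) << 1
    low = (finprs & LOW) + (finprc & LOW)
    c52, low = low >> 52, low & LOW
    finpr53 = (low >> 51) & 1
    tail_zero = (low & ((1 << 51) - 1)) == 0      # finpr[54:104] all zero
    # 2. compound adder on [-1:51]
    upr = (xsum >> 1) + (xcarry >> 1)
    upri = upr + 1
    upr_m1 = (upr >> 52) & 1                      # upr[-1]
    # 3. rounding carries into position 51
    xs52 = xsum & 1
    cc52 = int(ri or (rne and finpr53))
    rc12, finpr52 = divmod(xs52 + c52, 2)
    rc24 = (xs52 + c52 + cc52) >> 1
    rcarry51 = rc24 if upr_m1 else rc12
    # 4. upper part of fprnd'
    sel = upri if rcarry51 else upr
    # 5./6. L-bit fix (assumed active-low on a tie) and L-bits
    lfix12 = not (rne and not finpr53 and tail_zero)
    lfix24 = not (rne and finpr52 and finpr53 and tail_zero)
    l24 = (sel & 1) & int(lfix24)
    l12 = finpr52 & int(lfix12)
    # 7. generalized post-normalization shift
    if (sel >> 52) & 1:                           # fprnd'[-1]
        return ((sel >> 1) << 1) | l24, 1
    return (sel << 1) | l12, 0


def reference(fpr, srmode):
    """Round the exact product directly (RZ, RNE, RI) and normalize."""
    pos = 51 if fpr >= (2 << W) else 52
    ulp = 1 << (W - pos)
    low, rem = fpr - fpr % ulp, fpr % ulp
    up = {"RZ": False, "RI": rem != 0,
          "RNE": 2 * rem > ulp or (2 * rem == ulp and (low // ulp) & 1)}[srmode]
    r = low + ulp if up else low
    ps = int(r >= (2 << W))
    return r >> (53 if ps else 52), ps


def round_product(fpr, srmode, finprc=0):
    """Inject, split into an arbitrary carry-save pair and run step (B)."""
    finpr = fpr + inj12(srmode)
    return step_b(finpr - finprc, finprc, srmode)


def mismatches(trials=2000, seed=0):
    """Count disagreements between the model and the reference on random operands."""
    rng, bad = random.Random(seed), 0
    for _ in range(trials):
        fpr = rng.randrange(1 << 52, 1 << 53) * rng.randrange(1 << 52, 1 << 53)
        for srmode in ("RZ", "RNE", "RI"):
            finprc = rng.randrange(fpr + inj12(srmode) + 1)
            bad += round_product(fpr, srmode, finprc) != reference(fpr, srmode)
    return bad
```

The random carry-save split matters because the value of upr[-1] depends on how the sum is distributed between the two strings, which is the situation lemma 4.25 addresses.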
4.3.3 Multiplication III (normalized → normalized format)
Specification. As in the previous section, the FP multiplication is computed from the normalized representations $BUSa_{NF}[69:0]$ and $BUSb_{NF}[69:0]$ (section 2.6.3) of the operands. Because IEEE rounding has to be considered in this section, the following inputs are required in addition: the bit $dbl$, which signals single precision ($dbl = 0$) or double precision ($dbl = 1$), the rounding mode, represented by $rmode[1:0]$, and the underflow and overflow enable bits $unf\_en$ and $ovf\_en$.

In this section, the exact multiplication result according to equation 4.207 has to be rounded by the rounding function $nround$, which computes the NF factoring of the IEEE rounded result. After this rounding computation, the product is output in the normalized format $BUS_{NF}[69:0]$ (section 2.6.3). According to equation 4.207, a factoring of the exact product is given by $(s_{pr}, e_{pr}, fpr) = (sa \oplus sb,\ ea + eb,\ fa \cdot fb)$ for non-zero representable operands. With the NF factoring of the IEEE result for non-zero representable operands $(s_{nrc}, e_{nrc}, f_{nrc}) = nround(s_{pr}, e_{pr} + wec, fpr)$, including the exponent wrapping constant $wec$ according to equation 2.14, and the following NF factoring of the result for the general case of arbitrary operands according to equation 2.16:

\[
(s_{NF}, e_{NF}, f_{NF}) =
\begin{cases}
(0, e_{qNaN}, f_{qNaN}) & \text{if } scqnan \\
(s_{inf}, e_{\infty}, f_{\infty}) & \text{if } scinf \\
(sa, ea, fa) & \text{if } scx \\
(sb, eb, fb) & \text{if } scy \\
(s_{0}, e_{0}, 0) & \text{if } sczero \\
(s_{nrc}, e_{nrc}, f_{nrc}) & \text{otherwise,}
\end{cases}
\tag{4.264}
\]

the product output of the multiplication unit III is specified by the corresponding representation in the normalized format $BUS_{NF}[69:0] = nf(s_{NF}, e_{NF}, f_{NF})$. In this section, the occurrence of an invalid, inexact, overflow and underflow exception is signaled by the bits $inv$, $inx$, $ovf$ and $unf$, respectively.
Implementation. The special case conditions and values in equation 4.264 are identical to those in the specifications of the two previous sections. In the implementation of this special case selection, the only difference is that in this case a representation in the normalized format is required. Because all special case results are exact, only the two rounding tags have to be dropped from the special cases implementation of the previous section. For the special case selection, these small adjustments are integrated in the implementation depicted in figure 4.27. This already completes the description of the special cases computation, and in the following we only have to describe the computation of $(s_{nrc}, e_{nrc}, f_{nrc})$ for non-zero representable operands.
Figure 4.27: Block diagram of the multiplication unit III.
According to lemma 2.8, the rounding function $nround$ can be composed of the four steps of an unbounded normalization shift $\eta$, normalized significand rounding, another unbounded normalization shift and exponent rounding:

\[
(s_{nrc}, e_{nrc}, f_{nrc}) = nround(s_{pr}, e_{pr} + wec, fpr) \tag{4.265}
\]
\[
= exp\_rnd_{mode\,?\,s_{pr}}(\eta(n\_sig\_rnd_{mode\,?\,s_{pr}}(\eta(s_{pr}, e_{pr} + wec, fpr)))). \tag{4.266}
\]

As in the addition computations II and III, we partition the discussion of the rounding computations into two steps. After the computation of a first step according to

\[
(s_{nr1}, e_{nr1} + wec, f_{nr1}) = \eta(n\_sig\_rnd_{mode\,?\,s_{pr}}(\eta(s_{pr}, e_{pr} + wec, fpr))), \tag{4.267}
\]

the final result can be computed by the exponent rounding

\[
(s_{nrc}, e_{nrc}, f_{nrc}) = exp\_rnd_{mode\,?\,s_{pr}}(s_{nr1}, e_{nr1} + wec, f_{nr1}). \tag{4.268}
\]
Because for both single and double precision $fa, fb \in [1, 2 - 2^{-52}]$, the exact significand product is in the range $fpr = fa \cdot fb \in [1, 4 - 2^{-51}]$. Thus, with the definition of normalized significand rounding by $n\_sig\_rnd_{srmode}(s, e, f) = (s, e, rnd_{srmode,vp}(f))$ and the variable rounding position $vp = p - 1 - \max\{0, e_{min} - e\}$ according to definition 2.9, the normalization shift can be simplified and combined with the rounding separately for the two cases $fpr \in [1,2[$ and $fpr \in [2,4[$ by

\[
(s_{nr1}, e_{nr1}, f_{nr1}) = \eta(n\_sig\_rnd_{mode\,?\,s_{pr}}(\eta(s_{pr}, e_{pr}, fpr))) \tag{4.269}
\]
\[
=
\begin{cases}
\eta(n\_sig\_rnd_{mode\,?\,s_{pr}}(\eta(s_{pr}, e_{pr}, fpr))) & \text{if } fpr \in [1,2[ \\
\eta(n\_sig\_rnd_{mode\,?\,s_{pr}}(\eta(s_{pr}, e_{pr}, fpr))) & \text{if } fpr \in [2,4[
\end{cases}
\tag{4.270}
\]
\[
=
\begin{cases}
(s_{pr}, e_{pr}, rnd_{mode\,?\,s_{pr},\,vp1}(fpr)) & \text{if } fpr \in [1,2[ \\
(s_{pr}, e_{pr} + 1, rnd_{mode\,?\,s_{pr},\,vp2}(fpr/2)) & \text{if } fpr \in [2,4[,
\end{cases}
\tag{4.271}
\]

where according to definition 2.9, the variable rounding positions $vp1, vp2$ are given by

\[
vp1 = p - 1 - \max\{0,\ e_{min} - (e_{pr} + wec)\} \tag{4.272}
\]
\[
vp2 = p - 1 - \max\{0,\ e_{min} - (e_{pr} + wec + 1)\}. \tag{4.273}
\]
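The two rounding positions can be written as a small helper. This is a minimal sketch of equations 4.272 and 4.273 with plain Python integers as exponents; the function names are hypothetical, and the double precision values e_min = -1022 and p = 53 in the usage note are the standard IEEE parameters rather than values quoted from this section:

```python
def variable_rounding_position(e: int, e_min: int, p: int) -> int:
    """vp = p - 1 - max{0, e_min - e}: one position is lost per exponent step below e_min."""
    return p - 1 - max(0, e_min - e)


def vp1_vp2(e_pr: int, wec: int, e_min: int, p: int):
    """Rounding positions for fpr in [1,2[ (vp1) and fpr in [2,4[ (vp2)."""
    return (variable_rounding_position(e_pr + wec, e_min, p),
            variable_rounding_position(e_pr + wec + 1, e_min, p))
```

With e_min = -1022 and p = 53, any exponent at or above e_min gives position 52, while vp1_vp2(-1030, 0, -1022, 53) gives positions 44 and 45.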
In the above formulae, the rounding positions $vp1$ and $vp2$ could be in a very large range: because floating-point results $x \in FP_{n,p}$ could even have the magnitude $2^{2e_{min} - 2p + 2}$ (see section 2.4.2), the variable significand rounding positions could be in the range $vp1, vp2 \in [e_{min} - p - 1 : p - 1]$. Based on the fact that the significand product $fpr$ is smaller than 4, the significand rounding can be simplified for rounding positions $vp1, vp2 < -2$. In these cases we know for sure that the rounding operand has a magnitude smaller than half of the smallest representable number, so that the rounded result has to be selected between 0 and $x_{min}$. By a separate selection for these small results, the range of the variable rounding positions in the remaining cases is reduced to $[-2 : p - 1]$ and the range of the rounded significands is limited to $[0, 4]$. For these reasons, the rounding computations and the computation of the unbounded normalization shift can be simplified. This will be further discussed after the next lemma.
Lemma 4.27 With the definition of the condition winzig, which detects results with very small exponents by
\[
\begin{aligned}
winzig &\iff (e_{pr} + wec \le e_{min} - 3 - p + 1) \\
       &\iff (e_{pr} \le e_{min} - 3 - p + 1) \text{ AND } \overline{unf\_en},
\end{aligned}
\]
the rounded result can be selected by
\[
(s_{nr1}, e_{nr1} + wec, f_{nr1}) =
\begin{cases}
(s_{pr}, e_{min} - p + 1, 0) & \text{if } winzig \wedge \overline{sr\_mode[1]} \\
(s_{pr}, e_{min} - p + 1, 1) & \text{if } winzig \wedge sr\_mode[1] \\
(s_{pr}, e_{pr} + wec, rnd_{mode?s_{pr},vp1}(fpr)) & \text{if } \overline{winzig} \wedge fpr \in [1, 2[ \\
(s_{pr}, e_{pr} + 1 + wec, rnd_{mode?s_{pr},vp2}(fpr/2)) & \text{if } \overline{winzig} \wedge fpr \in [2, 4[.
\end{cases}
\]
Proof: For the proof we distinguish between the cases (a) winzig = 1 and (b) winzig = 0.
(a) For winzig = 1, we have $(e_{pr} + wec \le e_{min} - 3 - p + 1)$, so that because of $fpr < 4$, the magnitude of the exact product $val(0, e_{pr} + wec, fpr)$ is smaller than $x_{min}/2 = 2^{e_{min} - p}$. Because we deal with non-zero operands, the exact product is also non-zero, so that the magnitude of the exact product is in the range $0 < val(0, e_{pr} + wec, fpr) < x_{min}/2$.
4.3. MULTIPLICATION 139
Thus, the nearest representable numbers to the exact product are $0 = val(s_{pr}, e', 0)$ and $(-1)^{s_{pr}} x_{min} = val(s_{pr}, e_{min} - p + 1, 1)$, so that according to the IEEE rounding definition in section 2.3.1 the exact product is rounded to $(s_{pr}, e_{min} - p + 1, 0)$ in rounding mode $srmode \in \{RZ, RNE\}$ and to $(s_{pr}, e_{min} - p + 1, 1)$ in rounding mode $srmode = RI$. Because the rounding mode RI is signaled by the bit $sr\_mode[1]$, this agrees with the first two lines of the rounding formula in this lemma.
(b) For winzig = 0, the rounding equations are copied identically from equation 4.271. For winzig = 0, we have $e_{pr} + wec \ge e_{min} - 2 - p + 1$. From this condition on the exponent $e_{pr} + wec$, it follows that the variable rounding positions vp1 and vp2 are limited to the ranges $vp1' \in [-2 : p - 1]$ and $vp2' \in [-1 : p - 1]$. These conditions can be used for the rounding implementation. □
Because the above selection of the rounded result is simple for winzig = 1, we focus on the computation of the cases for winzig = 0 in the following. For this purpose, we introduce the notation:
\[ fprnd12 = rnd_{mode?s_{pr},vp1'}(fpr) \tag{4.274} \]
\[ fprnd24 = rnd_{mode?s_{pr},vp2'}(fpr/2) \tag{4.275} \]
\[ (s_{prnd}, e_{prnd}, fprnd) =
\begin{cases}
(s_{pr}, e_{pr}, fprnd12) & \text{if } fpr \in [1, 2[ \\
(s_{pr}, e_{pr} + 1, fprnd24) & \text{if } fpr \in [2, 4[.
\end{cases} \tag{4.276} \]
With this notation the result of the first step (equation 4.267) can be written as:
\[ (s_{nr1}, e_{nr1} + wec, f_{nr1}) =
\begin{cases}
(s_{pr}, e_{min} - p + 1, \langle sr\_mode[1] \rangle) & \text{if } winzig \\
(s_{pr}, e_{prnd} + wec, fprnd) & \text{otherwise.}
\end{cases} \tag{4.277} \]
Because in the rounding computations for fprnd we can use that winzig = 0, the ranges of the variable rounding positions vp1 and vp2 for the computation of fprnd are limited to $vp1 \in [-2 : p - 1]$ and $vp2 \in [-1 : p - 1]$ according to the proof of case (b) in the previous lemma. To indicate that we only have to consider these limited rounding position ranges, we write $vp1'$ and $vp2'$ for the rounding positions with limited ranges and have $vp1' = vp1$ for $vp1 \in [-2 : p - 1]$ and $vp2' = vp2$ for $vp2 \in [-1 : p - 1]$. From $fpr \in [1, 4[$ and from the ranges of the variable rounding positions $vp1'$ and $vp2'$, it follows that the rounded significands fprnd12 and fprnd24 are bounded by $fprnd12 \in [0, 4]$ and $fprnd24 \in [0, 2]$, and thus they can be represented according to $fprnd12 = \langle fprnd12[-2 : 52] \rangle_{neg}$ and $fprnd24 = \langle fprnd24[-1 : 52] \rangle_{neg}$.
The selection and computations in the two cases of equation 4.276 can be simplified by selecting the upper choice only for $fprnd12 \in [0, 2[$ and the other choice in all other cases. In this way, the selection condition is based on the rounded significand value fprnd12 instead of the value of the unrounded significand product fpr. The following lemma shows that this substitution does not introduce an error, and that the new ranges in which we consider fprnd12 and fprnd24 allow the normalization shifts that are required after the rounding to be simplified.
Lemma 4.28 Equation 4.276 can be simplified to
\[ (s_{prnd}, e_{prnd}, fprnd) =
\begin{cases}
(s_{pr}, e_{pr}, fprnd12) & \text{if } fprnd12 < 2 \\
post\_norm(s_{pr}, e_{pr} + 1, fprnd24) & \text{otherwise.}
\end{cases} \]
Proof: We divide the proof of the lemma into two steps. In step (a) we show that the values on both sides of the equation of the lemma are the same. Then we show in step (b) that the unbounded normalization shifts from equation 4.276 can be replaced by a post-normalization shift and by no shift, respectively, for the two cases.
(a) The normalization shifts do not change the values of the factorings. Thus, we only have to show the equality of the selected values
\[
\begin{aligned}
val(s_{pr}, e_{prnd}, fprnd)
&= \begin{cases}
val(s_{pr}, e_{pr}, fprnd12) & \text{if } fpr \in [1, 2[ \\
val(s_{pr}, e_{pr} + 1, fprnd24) & \text{otherwise}
\end{cases} \\
&= \begin{cases}
val(s_{pr}, e_{pr}, fprnd12) & \text{if } fprnd12 < 2 \\
val(s_{pr}, e_{pr} + 1, fprnd24) & \text{otherwise.}
\end{cases}
\end{aligned}
\]
Because the equality is obvious if the conditions $(fprnd12 < 2)$ and $(fpr \in [1, 2[)$ have the same value, we only have to consider the cases where (a.i) $(fprnd12 \ge 2)$ and $(fpr \in [1, 2[)$; and (a.ii) $(fprnd12 < 2)$ and $(fpr \in [2, 4[)$. Thus, to show the above equality, it suffices to show that $fprnd24 = fprnd12/2$ in the cases (a.i) and (a.ii).
(a.i) In the computation of fprnd12, we have to consider the rounding positions $vp1' \in [-2 : p - 1]$ and in the computation of fprnd24, we have to consider the rounding positions $vp2' \in [-1 : p - 1]$. For rounding positions $vp1' \in [-1 : p - 1]$, it follows from $(fpr \in [1, 2[)$ that $(fprnd12 \le 2)$. Thus, in case (a.i) we either have $fprnd12 = 2$, or $vp1' = -2$ and thus $fprnd12 = 4$ (in this case $vp2' = -1$).
From the definitions of the variable rounding positions $vp1'$ and $vp2'$ (see equations 4.272-4.273), it follows that $vp2' \in \{vp1', vp1' + 1\}$, so that we always have $vp2' \le vp1' + 1$. The rounded significand fprnd12 can be written as a rounding function of fpr/2 with rounding position $vp1' + 1$:
\[ fprnd12 = rnd_{srmode,vp1'}(fpr) = 2 \cdot rnd_{srmode,vp1'+1}(fpr/2). \]
Thus, because of $vp2' \le vp1' + 1$, the computation of the rounded significand $fprnd24 = rnd_{srmode,vp2'}(fpr/2)$ can be interpreted as a second gradual rounding step on the significand fprnd12/2 at the rounding position $vp2'$. We now consider $fprnd12 = 2$ and $fprnd12 = 4$, which are the two possible values of fprnd12 for case (a.i). Because $fprnd12/2 = 1$ is already a multiple of $2^{-vp2'}$ for $vp2' \in [0 : p - 1]$, we get in this case also for the second gradual rounding step $fprnd24 = 1 = fprnd12/2$, and because $fprnd12/2 = 2$ is already a multiple of $2^{-vp2'}$ for $vp2' = -1$, we get in this case also for the second gradual rounding step $fprnd24 = 2 = fprnd12/2$. This completes the proof of case (a.i).
(a.ii) In the computation of fprnd12, the rounding position could be in the range $vp1' \in [-2 : p - 1]$. Because we assume $fpr \in [2, 4[$, the rounded significand fprnd12 cannot become smaller than 2 for the rounding positions $vp1' \in [-1 : p - 1]$. Only for the rounding position $vp1' = -2$ could the significand fprnd12 become smaller than 2, and the only possible case for this is $fprnd12 = 0$. For $vp1' = -2$, we have $vp2' = -1$ and it follows from
\[ 0 = fprnd12 = rnd_{srmode,-2}(fpr) = 2 \cdot rnd_{srmode,-1}(fpr/2) = 2 \cdot fprnd24 = 0 \]
that also in case (a.ii) we have $fprnd12/2 = fprnd24$, as required.
(b) For the proof of part (b) we distinguish between the two cases (b.i) $fprnd12 < 2$ and (b.ii) $fprnd12 \ge 2$.
(b.i) For $fprnd12 < 2$, the upper choice is selected. For this choice, we have to consider the rounding positions $vp1' \in [-2 : p - 1]$ in the computation of fprnd12. For rounding positions $vp1' \ge 0$, it follows from $fpr \in [1, 4[$ that $fprnd12 \ge 1$, so that the resulting factoring is already normalized in these cases and the additional normalization shift can be omitted. For the remaining rounding positions $vp1' \in \{-1, -2\}$, it follows from $fpr \in [1, 4[$ that $fprnd12 \in \{0, 2, 4\}$. Among these cases, the condition for case (b.i) holds only for the result 0, for which the upper choice is selected. Because the unbounded normalization shift is defined to compute the identity function for factorings of zero, the normalization shift can be omitted for all rounding positions that have to be considered.
(b.ii) For $fprnd12 \ge 2$, the factoring $(s_{pr}, e_{pr} + 1, fprnd24)$ is selected. For this choice, we have to consider the rounding positions $vp2' \in [-1 : p - 1]$ in the computation of fprnd24. From this range of rounding positions with $fpr/2 \in [0.5, 2[$, it follows that $fprnd24 \le 2$, and because $fprnd12/2 \ge 1$, it follows that $fprnd24 \in [1, 2]$. Because a post-normalization shift (see definition 2.11) normalizes factorings with significands in the range $[1, 2]$, a post-normalization shift suffices to normalize the factoring $(s_{pr}, e_{pr} + 1, fprnd24)$, so that the unbounded normalization shift can be replaced by a post-normalization shift in case (b.ii). Thus, the conclusion of step (a), case (b.i) and case (b.ii) is that
\[ (s_{prnd}, e_{prnd}, fprnd) =
\begin{cases}
(s_{pr}, e_{pr}, fprnd12) & \text{if } fprnd12 < 2 \\
post\_norm(s_{pr}, e_{pr} + 1, fprnd24) & \text{otherwise,}
\end{cases} \]
as required by the lemma. □
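The value equality of step (a) and the significand ranges of step (b) can be checked exhaustively for a small toy precision. The sketch below is only an illustration under stated assumptions: the helper `rnd`, the pairing of rounding positions in `position_pairs` (either both equal to $p - 1$, or $vp2' = vp1' + 1$), and the toy precision $p = 4$ are choices made for this example and are not taken from the implementation described in the text.

```python
from fractions import Fraction
import math

MODES = ("RZ", "RNE", "RI")  # reduced rounding modes acting on magnitudes


def rnd(mode, vp, f):
    """Round f >= 0 to an integral multiple of 2**(-vp)."""
    unit = Fraction(2) ** (-vp)
    q = f / unit
    lo = math.floor(q)
    if mode == "RZ":
        n = lo
    elif mode == "RI":
        n = math.ceil(q)
    else:  # RNE: nearest, ties to the even multiple
        rem = q - lo
        half = Fraction(1, 2)
        n = lo + 1 if rem > half or (rem == half and lo % 2 == 1) else lo
    return n * unit


def position_pairs(p):
    """Assumed pairs (vp1', vp2'): both p-1, or vp2' = vp1' + 1 with vp1' in [-2 : p-2]."""
    yield p - 1, p - 1
    for vp1 in range(-2, p - 1):
        yield vp1, vp1 + 1


def lemma_4_28_holds(p=4):
    step = Fraction(1, 2 ** (2 * (p - 1)))  # granularity of the toy products
    f = Fraction(1)
    while f < 4:
        for mode in MODES:
            for vp1, vp2 in position_pairs(p):
                f12 = rnd(mode, vp1, f)
                f24 = rnd(mode, vp2, f / 2)
                by_fpr = f12 if f < 2 else 2 * f24    # selection on fpr (eq. 4.276)
                by_f12 = f12 if f12 < 2 else 2 * f24  # selection on fprnd12 (lemma)
                if by_fpr != by_f12:
                    return False
                if f12 < 2 and not (f12 == 0 or f12 >= 1):
                    return False  # upper choice must be zero or already normalized
                if f12 >= 2 and not (1 <= f24 <= 2):
                    return False  # lower choice must be fixable by a post-normalization
        f += step
    return True


print(lemma_4_28_holds(4))
```

The comparison works on values only, so the factor 2 applied to `f24` stands for the exponent increment $e_{pr} + 1$ of the lower choice.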
Definition 4.11 We define two significand overflow conditions cfovf1 and cfovf2:
\[
\begin{aligned}
cfovf1 &\iff (fprnd12 \ge 2) \iff fprnd12[-2] \vee fprnd12[-1] \\
cfovf2 &\iff (fprnd24 = 2) \iff fprnd24[-1]
\end{aligned}
\]
With this definition of the significand overflow conditions cfovf1 and cfovf2 and with the definition of the post-normalization shift (see equation 2.11), the equation from lemma 4.28 can be written as
\[ (s_{prnd}, e_{prnd}, fprnd) =
\begin{cases}
(s_{pr}, e_{pr}, fprnd12) & \text{if } \overline{cfovf1} \\
(s_{pr}, e_{pr} + 1, fprnd24) & \text{if } cfovf1 \text{ AND } \overline{cfovf2} \\
(s_{pr}, e_{pr} + 2, 1) & \text{if } cfovf1 \text{ AND } cfovf2
\end{cases} \tag{4.278} \]
Lemma 4.29 For exponents $e_{pr} + wec \ge e_{min}$, the condition cfovf2 cannot be fulfilled:
\[ (e_{pr} + wec \ge e_{min}) \implies \overline{cfovf2}. \]
Proof: From $(e_{pr} + wec \ge e_{min})$ it follows that the variable rounding positions vp1 and vp2 are fixed to $vp1 = vp2 = p - 1$. Because $fa, fb \le 2 - 2^{-p+1}$ and thus $fpr/2 < 2 - 2^{-p+1}$, it follows from the rounding position $vp2 = p - 1$ that the rounded significand $fprnd24 < 2$. Therefore, we get as required cfovf2 = 0. □
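The bound used in this proof can be confirmed by enumeration for small precisions. The following sketch is an illustration only; it assumes that rounding up (mode RI on magnitudes) gives the largest rounded value among the reduced rounding modes, so that checking this mode alone suffices. The names `rnd_up` and `cfovf2_possible` are hypothetical.

```python
from fractions import Fraction
import math


def rnd_up(vp, f):
    """Round f >= 0 up to the next integral multiple of 2**(-vp) (mode RI on magnitudes)."""
    unit = Fraction(2) ** (-vp)
    return math.ceil(f / unit) * unit


def cfovf2_possible(p):
    """Search all p-bit significands fa, fb in [1, 2 - 2**(-p+1)] for fprnd24 = 2."""
    ulp = Fraction(1, 2 ** (p - 1))
    significands = [1 + k * ulp for k in range(2 ** (p - 1))]
    for fa in significands:
        for fb in significands:
            if rnd_up(p - 1, fa * fb / 2) >= 2:  # rounding position vp2 = p - 1
                return True
    return False


print(cfovf2_possible(5))
```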
We postpone a detailed description of the rounding implementations for fprnd12 and fprnd24, and consider the description of the exponent rounding and the exponent wrapping according to the second computation step from equation 4.268 in the following. Because the conditions winzig and ovf cannot both be fulfilled at the same time and no exponent wrapping is required for $unf\_en = 0$, the exponent rounding selection from equation 4.268 can be written in combination with the definition of exponent rounding (see equation 2.12) and with equation 4.277 as
\[ (s_{nrc}, e_{nrc}, f_{nrc}) =
\begin{cases}
(s_{pr}, e_{max}, f_{max}) & \text{if } ovf \wedge \overline{ovf\_en} \wedge \overline{or(sr\_mode[1:0])} \\
(s_{pr}, e_{\infty}, f_{\infty}) & \text{if } ovf \wedge \overline{ovf\_en} \wedge or(sr\_mode[1:0]) \\
(s_{pr}, e_{min} - p + 1, \langle sr\_mode[1] \rangle) & \text{if } winzig \\
(s_{pr}, e_{prnd} + wec, fprnd) & \text{otherwise.}
\end{cases} \tag{4.279} \]
We integrate the selection of the x
max
and 1 results with the selection of the 0 and
x
min
results in the factoring (s
pr
; e
sel
; f
sel
) by the selection:
(s
pr
; e
sel
; f
sel
)=
8
<
:
(s
pr
; e
max
; f
max
) if winzig ^ or(sr mode[1 :0])
(s
pr
; e
1
; f
1
) if winzig ^ or(sr mode[1 :0])
(s
pr
; e
min
  p+ 1; <sr mode[1]>) if winzig
so that the factoring (s
nrc
;e
nrc
;f
nrc
) can be selected by:
(s
nrc
; e
nrc
; f
nrc
) =

(s
pr
; e
sel
; f
sel
) if winzig OR (ovf ^ ovf en)
(s
pr
; e
prnd
+wec; fprnd) otherwise.
(4.280)
This already describes how the significand $f_{nrc}$ is selected. For the computation of the exponent we additionally have to consider the implementation of the exponent wrapping.
For the computation of the exponent wrapping, we predict the wrapping exponent constant wec by the sign of the exponent $e_{pr} = \langle epr[12 : 0] \rangle_2$ (which is epr[12]), similar to the computation in the addition unit III, according to
\[ pwec =
\begin{cases}
+\alpha & \text{if } epr[12] \\
-\alpha & \text{otherwise,}
\end{cases} \]
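so that with the definition of the condition ewrap, which signals the requirement for exponent wrapping by
\[ ewrap \iff (unf \wedge unf\_en) \text{ OR } (ovf \wedge ovf\_en), \tag{4.281} \]
the exponent wrapping can be included into equation 4.278 by
\[ e_{prnd} + wec =
\begin{cases}
e_{pr} & \text{if } \overline{ewrap} \text{ AND } \overline{cfovf1} \\
e_{pr} + 1 & \text{if } \overline{ewrap} \text{ AND } cfovf1 \text{ AND } \overline{cfovf2} \\
e_{pr} + 2 & \text{if } \overline{ewrap} \text{ AND } cfovf1 \text{ AND } cfovf2 \\
e_{pr} + pwec & \text{if } ewrap \text{ AND } \overline{cfovf1} \\
e_{pr} + 1 + pwec & \text{if } ewrap \text{ AND } cfovf1
\end{cases} \tag{4.282} \]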
Note that the exponent $e_{pr} + 2 + pwec$ does not have to be considered in this equation, because $e_{pr} + wec \ge e_{min}$ for ewrap = 1 (see corollary 2.10) and because of lemma 4.29. Based on the equations 4.280 and 4.282 the computation of the exponent $e_{nrc}$ is implemented by the following six selections:
\[
\begin{aligned}
eopi &= \begin{cases}
e_{sel} & \text{if } (winzig \vee \overline{ovf\_en}) \\
ewi = e_{pr} + 1 + pwec & \text{otherwise}
\end{cases} \\
eci &= \begin{cases}
eprii = e_{pr} + 2 & \text{if } cfovf2 \\
epri = e_{pr} + 1 & \text{otherwise}
\end{cases} \\
eni &= \begin{cases}
eopi & \text{if } (winzig \vee ovf[1] \vee (unf[1] \wedge unf\_en)) \\
eci & \text{otherwise}
\end{cases} \\
eop &= \begin{cases}
e_{sel} & \text{if } (winzig \vee \overline{ovf\_en}) \\
ew = e_{pr} + pwec & \text{otherwise}
\end{cases} \\
en &= \begin{cases}
eop & \text{if } (winzig \vee ovf[0] \vee (unf[0] \wedge unf\_en)) \\
e_{pr} & \text{otherwise}
\end{cases} \\
e_{nrc} &= \begin{cases}
eni & \text{if } cfovf1 \\
en & \text{otherwise,}
\end{cases}
\end{aligned}
\]
where ovf[1] and unf[1] indicate the case of an overflow resp. underflow under the assumption that cfovf1 = 1, and ovf[0] and unf[0] indicate the case of an overflow resp. underflow under the assumption that cfovf1 = 0.
Because the exponent eni is selected only for cfovf1 = 1, we assume in the selection for eni that cfovf1 = 1. Therefore, we use the signals unf[1] and ovf[1] instead of unf and ovf in this selection. Accordingly, we use unf[0] and ovf[0] instead of unf and ovf in the selection for en. The condition in the selections for eop and eopi is based on
\[ (ovf \wedge \overline{ovf\_en}) \vee ewrap \iff ovf \vee (unf \wedge unf\_en). \]
This completes the description of the selections for the exponent $e_{nrc}$.
In the following, we consider the rounding implementation for the rounded significands fprnd12 and fprnd24. The computation of fprnd12 and fprnd24 is based on the injection-based rounding mode reduction (see section 2.5.2). To be able to use injection-based rounding, we have to consider the rounding modes RZ, RNU, RI first and to correct the rounded result in the case of the rounding mode RNE by the additional L-bit fix at the least significant bit position of the significand (see section 2.3.2).
For the rounding of a significand, which is an integral multiple of $2^{-105}$, at a bit position $vp < 105$ in the rounding mode $srmode \in \{RZ, RNU, RI\}$, the rounding injection $inj_{vp}$ is defined by
\[ inj_{vp} = \langle inj_{vp}[-2 : 105] \rangle_{neg} =
\begin{cases}
0 & \text{if } srmode = RZ \\
2^{-vp-1} & \text{if } srmode = RNU \\
2^{-vp} - 2^{-105} & \text{if } srmode = RI
\end{cases} \]
so that according to lemma 2.18 for $srmode \in \{RZ, RNU, RI\}$, the injection-based rounding of a significand f at the position vp can be written as
\[ rnd_{srmode,vp}(f) = rnd_{RZ,vp}(f + inj_{vp}). \]
For our rounding computations we have to generate the injections $inj_{vp1'}$ and $inj_{vp2'}$ according to the rounding positions $vp1'$ and $vp2'$. We denote the injected significands by
\[ finj12 = fpr + inj_{vp1'}, \qquad finj24 = fpr + inj_{vp2'}. \]
By the truncation of finj12 and finj24 after bit position $vp1'$ resp. $vp2'$, we get the rounded significands that consider the rounding modes $srmode \in \{RZ, RNU, RI\}$:
\[ fprnd12' = rnd_{RZ,vp1'}(finj12), \qquad fprnd24' = rnd_{RZ,vp2'}(finj24). \]
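The identity $rnd_{srmode,vp}(f) = rnd_{RZ,vp}(f + inj_{vp})$ can be illustrated with exact rational arithmetic. The sketch below is a software model under stated assumptions, not the hardware described here: the function names are hypothetical, the significand is taken to be a non-negative multiple of $2^{-lsb}$ with $lsb = 105$ by default, and RNU is modelled as round to nearest with ties rounded up.

```python
from fractions import Fraction
import math

LSB = 105  # significands are assumed to be integral multiples of 2**(-LSB)


def injection(mode, vp, lsb=LSB):
    """Rounding injection inj_vp for the reduced modes RZ, RNU and RI (requires vp < lsb)."""
    if mode == "RZ":
        return Fraction(0)
    if mode == "RNU":
        return Fraction(2) ** (-vp - 1)
    if mode == "RI":
        return Fraction(2) ** (-vp) - Fraction(2) ** (-lsb)
    raise ValueError(mode)


def rnd_rz(vp, f):
    """Truncate f >= 0 after bit position vp (round to zero)."""
    unit = Fraction(2) ** (-vp)
    return math.floor(f / unit) * unit


def rnd_injected(mode, vp, f, lsb=LSB):
    """Injection-based rounding: add the injection, then truncate."""
    return rnd_rz(vp, f + injection(mode, vp, lsb))


def rnd_direct(mode, vp, f):
    """Reference rounding of f >= 0 to a multiple of 2**(-vp), written out directly."""
    unit = Fraction(2) ** (-vp)
    q = f / unit
    if mode == "RZ":
        n = math.floor(q)
    elif mode == "RNU":
        n = math.floor(q + Fraction(1, 2))  # nearest, ties rounded up
    else:
        n = math.ceil(q)                    # RI: round magnitudes up
    return n * unit


# A tie at position 52 is rounded up in mode RNU.
f = Fraction(3, 2) + Fraction(1, 2 ** 53)
print(rnd_injected("RNU", 52, f) == Fraction(3, 2) + Fraction(1, 2 ** 52))
```

The RI injection only yields the ceiling because the operand is a multiple of $2^{-lsb}$; for operands with finer granularity the constant $2^{-lsb}$ would have to be adapted.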
We then get the required rounded significands fprnd12 and fprnd24 from $fprnd12'$ and $fprnd24'$ by an additional L-bit fix for the rounding mode RNE at significand position $vp1'$ resp. $vp2'$. Because we only have to consider $fprnd12 < 8$ and fpr is an integral multiple of $2^{-104}$ for both single and double precision, it suffices to consider the bit positions $[-2 : 104]$ in the binary representations of the values $inj_{vp1'}$ and finj12, and because we only have to consider $fprnd24 < 4$ and fpr/2 is an integral multiple of $2^{-105}$ for both single and double precision, it suffices to consider the bit positions $[-1 : 105]$ in the binary representations of the values $inj_{vp2'}$ and finj24.
Based on the above notation, we give an overview of the computation steps that are required for the computation of fprnd12 and fprnd24 in the significand path:
(A) By the partial product generation and reduction, a carry-save representation of the exact significand product fpr is computed. Because in this case no rounding injection is added during the reduction, we can use the implementations of step (A) from the multiplication unit I (both the half-sized version, which is depicted in figure 4.22, and the full-sized version, which is depicted in figure 4.21).
(B) Step (B) contains the compression, the IEEE rounding and the post-normalization shift of the significand product fpr from one of its carry-save representations that we get from step (A). The rounding for the computation of fprnd12 and fprnd24 in the rounding modes $srmode \in \{RZ, RNE, RI\}$ is computed in two steps:
  (I) Computation of $fprnd12'$ and $fprnd24'$ considering the rounding modes $srmode \in \{RZ, RNU, RI\}$ by injection-based rounding with the steps:
    (1) Generation of the injections $inj_{vp1'}$ and $inj_{vp2'}$ and addition to the carry-save representation of fpr by a full-adder line and a carry-look-ahead adder that implement:
\[ finj12 = \langle finj12[-2 : 105] \rangle_{neg} = fpr + inj_{vp1'} \tag{4.283} \]
\[ finj24 = \langle finj24[-1 : 105] \rangle_{neg} = fpr + inj_{vp2'}. \tag{4.284} \]
    (2) Truncation of finj12 after bit position $vp1'$ and of finj24 after bit position $vp2'$. Because the truncation position is not fixed in this case, the truncation is more complicated to compute than in the previous section and has to be considered separately.
\[ fprnd12' = rnd_{RZ,vp1'}(finj12) \tag{4.285} \]
\[ \phantom{fprnd12'} = \langle finj12[-2 : vp1'] \rangle_{neg} \tag{4.286} \]
\[ fprnd24' = rnd_{RZ,vp2'}(finj24) \tag{4.287} \]
\[ \phantom{fprnd24'} = \langle finj24[-1 : vp2'] \rangle_{neg} \tag{4.288} \]
  (II) Computation of the rounded significands fprnd12 and fprnd24, which consider the rounding modes $srmode \in \{RZ, RNE, RI\}$, from $fprnd12'$ and $fprnd24'$, which consider the rounding modes $srmode \in \{RZ, RNU, RI\}$, by implementing the L-bit fix for the rounding mode RNE.
  The significand position of the L-bit is $vp1'$ resp. $vp2'$. This bit has to be pulled down if the L-bit-fix condition is fulfilled, namely iff the number lies exactly between two representable rounding results. Because in this case the injected significands already contain the injections $2^{-vp1'-1}$ resp. $2^{-vp2'-1}$, the L-bit-fix conditions are given by:
\[ lfix12 = sr\_mode[0] \text{ AND } (finj12[vp1'+1 : 104] = 0^{104 - vp1'}) \]
\[ lfix24 = sr\_mode[0] \text{ AND } (finj24[vp2'+1 : 105] = 0^{105 - vp2'}). \]
[Figure 4.28: Generation of the injections for the variable rounding position $vp \in [-2 : 51]$.]
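The two-step scheme of (I) and (II), rounding in mode RNU by injection and truncation followed by the L-bit fix for RNE, can be sketched on integers. This is only an illustrative model: the operand is assumed to be given as an integer `k` counting units of $2^{-lsb}$, and the function names are hypothetical.

```python
from fractions import Fraction


def rne_via_rnu_and_lfix(k, vp, lsb):
    """Round k * 2**(-lsb) to a multiple of 2**(-vp) in mode RNE.

    The result is returned in units of 2**(-vp); vp < lsb is required.
    """
    shift = lsb - vp                         # number of bits below the L-bit
    finj = k + (1 << (shift - 1))            # add the RNU injection 2**(-vp-1)
    rounded = finj >> shift                  # truncate after position vp (RNU result)
    lfix = (finj & ((1 << shift) - 1)) == 0  # tie: all bits below the L-bit are zero
    if lfix:
        rounded &= ~1                        # pull the L-bit down
    return rounded


def rne_reference(k, vp, lsb):
    """Reference result: Python rounds a Fraction to the nearest integer, ties to even."""
    return round(Fraction(k, 2 ** (lsb - vp)))


print(rne_via_rnu_and_lfix(3, 7, 8), rne_reference(3, 7, 8))
```

Clearing the L-bit is sufficient because in a tie the RNU result is the upper of the two neighbours; if it is odd, the even neighbour is exactly one unit below.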
Because step (A) is implemented as in the multiplication unit I, no implementation details have to be added for this step. The missing implementation details for the computation step (B) are described in the following:
(B) For the implementation of step (B.I.1), the generation of the rounding injections $inj_{vp1'}$ and $inj_{vp2'}$ has to be described. The binary representations of these injections are composed of two parts: a fixed mask that accounts for results with values of normalized numbers from $NOR_{n,p}$, in which case the significand has to be rounded at the position $p - 1$, and a variable mask that is adjusted corresponding to the variable rounding position for results with values of denormalized numbers from $DEN_{n,p}$. Moreover, we distinguish between a fixed injection mask for the rounding mode RNU, which we call fixmask0, and a fixed injection mask for the rounding mode RI, which we call fixmask1. For the cases where the significand rounding position is different from $p - 1$, the binary representations of the rounding injections are generated with the help of a decoder for the rounding mode RNU and a half-decoder for the rounding mode RI. These decoders account for the correcting terms $\max\{0, e_{min} - (e_{pr} + wec)\}$ and $\max\{0, e_{min} - (e_{pr} + wec + 1)\}$ in the equations of the rounding positions $vp1'$ and $vp2'$. For the rounding modes RNU and RI the generation of these variable rounding injections is illustrated in figure 4.28 considering a variable rounding position $vp \in [-2 : 51]$. Moreover, this figure depicts how the masks that are used to generate the binary representations of the injections could also be used for the truncation and L-bit-fix computations that are required in steps (B.I.2) and (B.II).
A formal description of the injection generation for $inj_{vp1'}$ and $inj_{vp2'}$ is given by the following lemma. In this lemma, several different masks are defined. In general, we append a '0' to the names of masks that are used to generate injections for the rounding mode RNU. To the names of the corresponding masks for the rounding mode RI, we append a '1':
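Lemma 4.30 With the condition
\[ vrtiny \iff (e_{pr} + 1 - e_{min} < 0) \wedge \overline{unf\_en} \]
and the computation of
\[
\begin{aligned}
fixmask0[-2 : 53] &= (0^{26}, \overline{dbl}, 0^{28}, dbl) \\
fixmask1[-2 : 53] &= (0^{26}, \overline{dbl}^{29}, 1) \\
varterm = \langle varterm[12 : 0] \rangle_2 &= \begin{cases}
e_{min} - e_{pr} - 1 & \text{if } dbl \\
e_{min} - e_{pr} - 1 + 29 & \text{otherwise}
\end{cases} \\
varmask0[-2 : 52] &= deco(varterm[5 : 0])[55 : 1] \\
varmask1[-2 : 52] &= hdec(varterm[5 : 0])[55 : 1] \\
mask0_{vp1'}[-2 : 53] &= \begin{cases}
(varmask0[-2 : 52], 0) & \text{if } vrtiny \\
fixmask0[-2 : 53] & \text{otherwise}
\end{cases} \\
mask0_{vp2'}[-1 : 53] &= \begin{cases}
varmask0[-2 : 52] & \text{if } vrtiny \\
fixmask0[-1 : 53] & \text{otherwise}
\end{cases} \\
mask1_{vp1'}[-2 : 53] &= \begin{cases}
(varmask1[-2 : 52], 1) & \text{if } vrtiny \\
fixmask1[-2 : 53] & \text{otherwise}
\end{cases} \\
mask1_{vp2'}[-1 : 53] &= \begin{cases}
varmask1[-2 : 52] & \text{if } vrtiny \\
fixmask1[-1 : 53] & \text{otherwise}
\end{cases}
\end{aligned}
\]
the rounding injections can be generated by
\[
\begin{aligned}
inj_{vp1'}[-2 : 105] &= \begin{cases}
(mask1_{vp1'}[-2 : 53], 1^{52}) & \text{if } sr\_mode[1] \\
(mask0_{vp1'}[-2 : 53], 0^{52}) & \text{if } sr\_mode[0] \\
0^{108} & \text{otherwise}
\end{cases} \\
inj_{vp2'}[-2 : 105] &= \begin{cases}
(0, mask1_{vp2'}[-1 : 53], 1^{52}) & \text{if } sr\_mode[1] \\
(0, mask0_{vp2'}[-1 : 53], 0^{52}) & \text{if } sr\_mode[0] \\
0^{108} & \text{otherwise.}
\end{cases}
\end{aligned}
\]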
Proof: Based on the value of the condition vrtiny, the definitions of the variable rounding positions $vp1'$ and $vp2'$ can be split into
\[ vp1' = \begin{cases}
p - 1 - e_{min} + e_{pr} & \text{if } vrtiny \\
p - 1 & \text{otherwise}
\end{cases} \]
\[ vp2' = \begin{cases}
p - 1 - e_{min} + e_{pr} + 1 & \text{if } vrtiny \\
p - 1 & \text{otherwise,}
\end{cases} \]
so that the injections can be generated separately in a fixed part considering rounding position $p - 1$ for the case vrtiny = 0 and in a variable part for the case vrtiny = 1. In this way we use in particular that for the case vrtiny = 0, we have $vp1' = vp2' = p - 1$. In the following we prove the lemma separately for the three rounding modes RZ, RNU and RI:
In the rounding mode RZ, the injections are defined to be $inj_{vp1'} = inj_{vp2'} = 0$ for both the fixed and the variable case. Because the rounding mode RZ is encoded with $sr\_mode[0] = sr\_mode[1] = 0$, we get by the selection from the lemma
\[ inj_{vp1'}[-2 : 105] = inj_{vp2'}[-2 : 105] = 0^{108}, \]
as required by the definition for the rounding mode RZ.
In the rounding mode RNU, the injections are defined by
\[ inj_{vp1'} = 2^{-vp1'-1} = \langle inj_{vp1'}[-2 : 105] \rangle_{neg} = \langle (0^{vp1'+3}, 1, 0^{104 - vp1'}) \rangle_{neg} \tag{4.289} \]
\[ inj_{vp2'} = 2^{-vp2'-1} = \langle inj_{vp2'}[-2 : 105] \rangle_{neg} = \langle (0^{vp2'+3}, 1, 0^{104 - vp2'}) \rangle_{neg}. \tag{4.290} \]
We distinguish between the cases (a) vrtiny = 0 and (b) vrtiny = 1:
(a) For vrtiny = 0, we have $vp1' = vp2' = p - 1$, so that by definition $inj_{vp1'}[-2 : 105] = inj_{vp2'}[-2 : 105] = (0^{p+2}, 1, 0^{105 - p})$. Because the rounding mode RNU is encoded by $sr\_mode[0] = 1$, we get by the selection from the lemma
\[
\begin{aligned}
inj_{vp1'}[-2 : 105] = inj_{vp2'}[-2 : 105] &= (fixmask0[-2 : 53], 0^{52}) \\
&= \begin{cases}
(0^{55}, 1, 0^{52}) & \text{if } dbl \\
(0^{26}, 1, 0^{81}) & \text{otherwise}
\end{cases} \\
&= (0^{p+2}, 1, 0^{105 - p}).
\end{aligned}
\]
This agrees with the definition.
(b) For vrtiny = 1, we have $vp1' = p - 1 - e_{min} + e_{pr} \in [-2 : p - 2]$ and $vp2' = p - 1 - e_{min} + e_{pr} + 1 \in [-2 : p - 2]$, so that $vp2 = vp1 + 1$. The difference between the fixed rounding position $p - 1$ and the rounding position $vp1'$ is given by
\[ p - 1 - vp1' = e_{min} - e_{pr} = \begin{cases}
varterm + 1 & \text{if } dbl \\
varterm - 28 & \text{otherwise.}
\end{cases} \]
From the above range of the rounding position $vp1'$, it follows that the value varterm is in the range $varterm \in [0 : 53]$, so that it can be represented by $varterm = \langle varterm[5 : 0] \rangle$. Because $vp2' \le 51$, we can write starting from the definition
\[
\begin{aligned}
inj_{vp1'}[-2 : 105] &= (0^{vp1'+3}, 1, 0^{104 - vp1'}) \\
&= (0^{vp1'+3}, 1, 0^{51 - vp1'}, 0^{53}) \\
&= \begin{cases}
(0^{vp1'+3}, 1, 0^{p - 2 - vp1'}, 0^{53}) & \text{if } dbl \\
(0^{vp1'+3}, 1, 0^{p - 2 - vp1' + 29}, 0^{53}) & \text{otherwise}
\end{cases} \\
&= (0^{54 - varterm}, 1, 0^{varterm}, 0^{53}) \\
&= (deco(varterm[5 : 0])[54 : 0], 0^{53}) \\
&= (varmask0[-2 : 52], 0^{53}) \\
&= (mask0_{vp1'}[-2 : 53], 0^{52})
\end{aligned}
\]
as required by the lemma for the injection representation $inj_{vp1'}[-2 : 105]$.
Because in case (b) we have $vp2' = vp1' + 1$, we can write for the injection representation $inj_{vp2'}[-2 : 105]$ starting from the definition
\[
\begin{aligned}
inj_{vp2'}[-2 : 105] &= (0^{vp2'+3}, 1, 0^{104 - vp2'}) \\
&= (0, 0^{vp2'+2}, 1, 0^{104 - vp2'}) \\
&= (0, 0^{vp1'+3}, 1, 0^{103 - vp1'}) \\
&= (0, varmask0[-2 : 52], 0^{52}) \\
&= (0, mask0_{vp2'}[-1 : 53], 0^{52}).
\end{aligned}
\]
This agrees with the selection according to the lemma for this case. In this way, the proof for the rounding mode RNU is completed.
In the rounding mode RI, the injections are defined by
\[ inj_{vp1'} = 2^{-vp1'} - 2^{-105} = \langle inj_{vp1'}[-2 : 105] \rangle_{neg} = \langle (0^{vp1'+3}, 1, 1^{104 - vp1'}) \rangle_{neg} \tag{4.291} \]
\[ inj_{vp2'} = 2^{-vp2'} - 2^{-105} = \langle inj_{vp2'}[-2 : 105] \rangle_{neg} = \langle (0^{vp2'+3}, 1, 1^{104 - vp2'}) \rangle_{neg}. \tag{4.292} \]
We compare the injections in the rounding mode RNU (see equations 4.289 and 4.290) and in the rounding mode RI (see equations 4.291 and 4.292). In the rounding mode RNU, the binary representation of the injection $inj_{vp}[-2 : 105]$ only contains a single bit that is one, namely $inj_{vp}[vp+1] = 1$. In the representation of an injection for the rounding mode RI, exactly the bits $inj_{vp}[vp+1 : 105]$ are all ones. Thus, to get the equations for the injections in the rounding mode RI from the equations for the injections in the rounding mode RNU, only the bits $inj_{vp}[vp+2 : 105]$, which are zero in the rounding mode RNU, have to be inverted for the rounding mode RI. This can easily be checked in the equations for $mask0_{vp1'}$, $mask0_{vp2'}$ and $mask1_{vp1'}$, $mask1_{vp2'}$, so that the proof of the lemma is completed. □
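The relation between the bit patterns of equations 4.289-4.292 and the injection values can be checked with a short model. In this sketch a bit string $x[-2 : 105]$ is represented as a dictionary from bit index $i$ to a bit of weight $2^{-i}$; this representation and the function names are assumptions made for the illustration, and the decoder-based mask generation of the lemma is not modelled.

```python
from fractions import Fraction

MSB, LSB = -2, 105  # bit i of a string x[-2:105] has weight 2**(-i)


def injection_bits(mode, vp):
    """Bit pattern of inj_vp[-2:105] as in equations 4.289-4.292 (RZ gives all zeros)."""
    if mode == "RNU":   # a single one at position vp + 1
        return {i: int(i == vp + 1) for i in range(MSB, LSB + 1)}
    if mode == "RI":    # ones at all positions vp + 1 .. 105
        return {i: int(i >= vp + 1) for i in range(MSB, LSB + 1)}
    return {i: 0 for i in range(MSB, LSB + 1)}


def value(bits):
    """Value <x[-2:105]>_neg of a bit string."""
    return sum(Fraction(2) ** (-i) for i, b in bits.items() if b)


vp = 10
print(value(injection_bits("RNU", vp)) == Fraction(2) ** (-vp - 1))
print(value(injection_bits("RI", vp)) == Fraction(2) ** (-vp) - Fraction(2) ** (-LSB))
```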
In the following we describe the implementation of the truncations according to step (B.I.2) and the implementation of the L-bit fix according to step (B.II). The computation of the truncations according to equations 4.285-4.288 is based on the masks $\overline{mask1_{vp1'}}[-2 : 52]$ and $\overline{mask1_{vp2'}}[-1 : 52]$, which have ones exactly in the positions that are relevant in the truncated significands. For the L-bit fix we compute the masks $lpdmask12[-2 : 52]$ and $lpdmask24[-1 : 52]$, which have in their L-bits $lpdmask12[vp1']$ resp. $lpdmask24[vp2']$ the value of the L-bit-fix condition lfix12 resp. lfix24, and which have zeros in all other positions.
For the computation of the masks $lpdmask12[-2 : 52]$ and $lpdmask24[-1 : 52]$, the masks $mask1_{vp1'}$ and $mask1_{vp2'}$, which were involved in the rounding injection generation for the rounding mode RI, are used to select the proper L-bit position and to truncate the injected significands after bit positions $vp1'$ resp. $vp2'$ according to equations 4.285-4.288. To detect the L-bit-fix condition for the rounding position $vp1'$ according to equation 4.289, the condition $sticky12'[vp1'] \iff (finj12[vp1'+1 : 104] = 0^{104 - vp1'})$ is required. Accordingly, for the L-bit fix at the rounding position $vp2'$ (see equation 4.290), the condition $sticky24'[vp2'] \iff (finj24[vp2'+1 : 105] = 0^{105 - vp2'})$ is required. Because these bits are only required for the L-bit fix in the rounding mode RNU, where $sr\_mode[0] = 1$ and $sr\_mode[1] = 0$, we can also use the sticky bits
\[ sticky12[vp1'] \iff (finj12[vp1'+1 : 104] = sr\_mode[1]^{104 - vp1'}) \tag{4.293} \]
\[ sticky24[vp2'] \iff (finj24[vp2'+1 : 105] = sr\_mode[1]^{105 - vp2'}) \tag{4.294} \]
for the computation of the L-bit-fix condition. The use of the bits $sticky12[vp1']$ and $sticky24[vp2']$ has the advantage that these bits are also required for the detection of the inexact exception in all three reduced rounding modes.
Because the variable rounding positions $vp1'$ and $vp2'$ have to be considered in the ranges $vp1' \in [-2 : 52]$ and $vp2' \in [-1 : 52]$, the sticky-bit strings $sticky12[-2 : 52]$ and $sticky24[-1 : 52]$ are required for the computation of the L-bit-fix conditions. For the computation of the inexact conditions we additionally require $sticky12[53]$ and $sticky24[53]$.
We compute the sticky bits $sticky12[vp1']$ and $sticky24[vp2']$ using the technique from [4] for detecting the condition 'A + B = K'. In contrast to a straightforward implementation of equations 4.293-4.294, which includes the computation of the binary representation of $finj12[vp1'+1 : 104]$ resp. $finj24[vp2'+1 : 105]$, with the technique from [4] the sticky bits can be computed directly from the carry-save representation of finj12 resp. finj24 without requiring a carry-propagate addition. This allows the sticky bits $sticky12[vp1']$ and $sticky24[vp2']$ to be computed in parallel with the compressions of finj12 and finj24 from the carry-save representations to the binary representations. The details of these sticky-bit computations are described by the following lemma. In this lemma, we denote a carry-save representation of finj12 by the bit strings $finj12c[-2 : 104]$ and $finj12s[-2 : 104]$, and a carry-save representation of finj24 by the bit strings $finj24c[-1 : 105]$ and $finj24s[-1 : 105]$.
Lemma 4.31 With the computation of
\[
\begin{aligned}
p12[-2 : 104] &= (finj12c[-2 : 104] \oplus finj12s[-2 : 104]) \\
g12[-2 : 104] &= (finj12c[-2 : 104] \wedge finj12s[-2 : 104]) \\
v12[-2 : 105] &= ((p12[-2 : 104] \wedge \overline{sr\_mode[1]}) \vee g12[-2 : 104], 0) \\
w12[-2 : 104] &= (p12[-2 : 104] \oplus sr\_mode[1]) \\
cssticky12[-2 : 104] &= \overline{w12[-2 : 104] \oplus v12[-1 : 105]} \\
p24[-1 : 105] &= (finj24c[-1 : 105] \oplus finj24s[-1 : 105]) \\
g24[-1 : 105] &= (finj24c[-1 : 105] \wedge finj24s[-1 : 105]) \\
v24[-1 : 106] &= ((p24[-1 : 105] \wedge \overline{sr\_mode[1]}) \vee g24[-1 : 105], 0) \\
w24[-1 : 105] &= (p24[-1 : 105] \oplus sr\_mode[1]) \\
cssticky24[-1 : 105] &= \overline{w24[-1 : 105] \oplus v24[0 : 106]}
\end{aligned}
\]
we get for each $vp1' \in [-2 : 104]$ and $vp2' \in [-1 : 105]$
\[ andtree(cssticky12[vp1'+1 : 104]) \iff (finj12[vp1'+1 : 104] = sr\_mode[1]^{104 - vp1'}) \]
\[ andtree(cssticky24[vp2'+1 : 105]) \iff (finj24[vp2'+1 : 105] = sr\_mode[1]^{105 - vp2'}) \]
so that the sticky bits $sticky12[vp1']$ and $sticky24[vp2']$ can be computed by
\[ sticky12[vp1'] = andtree(cssticky12[vp1'+1 : 104]) \]
\[ sticky24[vp2'] = andtree(cssticky24[vp2'+1 : 105]). \]
Proof: The proof can be found in [4] by setting $k_i = sr\_mode[1]$ for $i \in [-2 : 105]$, $a[-2 : 104] = finj12c[-2 : 104]$ resp. $a[-1 : 105] = finj24c[-1 : 105]$, and $b[-2 : 104] = finj12s[-2 : 104]$ resp. $b[-1 : 105] = finj24s[-1 : 105]$. □
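The idea behind this lemma, deciding whether the low bits of a sum equal a constant without performing the carry-propagate addition, can be illustrated on plain integers. The sketch below is an assumption-laden model and not the circuit of the text: bits are indexed from the least significant end (bit 0), the constant consists of `m` copies of a single bit `k` (playing the role of $sr\_mode[1]$), and the function name is hypothetical.

```python
def low_bits_match(a, b, k, m):
    """Check, without adding a and b, whether the m low bits of a + b all equal bit k."""
    mask = (1 << m) - 1
    kk = mask if k else 0              # the constant K: m copies of bit k
    p = (a ^ b) & mask                 # bitwise propagate signals
    g = (a & b) & mask                 # bitwise generate signals
    v = (((p & ~kk) | g) << 1) & mask  # carry each bit must receive; the carry-in is 0
    w = p ^ kk
    z = ~(w ^ v) & mask                # per-bit indicators (XNOR of w and v)
    return z == mask                   # AND tree over all m indicators


# Carry and sum vectors of an assumed carry-save pair: 0b0110 + 0b1010 = 0b10000.
print(low_bits_match(0b0110, 0b1010, 0, 4))
```

Each indicator bit depends only on two neighbouring bit positions of the operands, which is why the indicators can be formed in parallel with the compression of the carry-save representation.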
In this lemma only the computations for each single sticky bit $sticky12[vp1']$ and $sticky24[vp2']$ are described. The computation of the whole sticky-bit string $sticky12[-2 : 53]$ is implemented by the use of the parallel-prefix AND function ppand, which computes from an input string $input[n_1 : n_2]$ in its n-th output $ppand(input[n_1 : n_2])[n] = andtree(input[n : n_2])$, so that according to the previous lemma we get $ppand(cssticky12[-2 : 104])[vp1'+1] = sticky12[vp1']$ and, thus, $ppand(cssticky12[-2 : 104])[-1 : 54] = sticky12[-2 : 53]$. Accordingly, the sticky-bit string $sticky24[-1 : 53]$ is computed by
\[ sticky24[-1 : 53] = ppand(cssticky24[-1 : 105])[0 : 54]. \]
This completes the description of the implementation for the sticky-bit strings $sticky12[-2 : 53]$ and $sticky24[-1 : 53]$.
Based on the sticky-bit strings $sticky12[-2 : 53]$ and $sticky24[-1 : 53]$ and the masks $mask1_{vp1'}[-2 : 53]$ and $mask1_{vp2'}[-1 : 53]$ from the generation of the injections in the previous lemma, the following lemma describes the computation of the truncation and the computation of the L-bit fix.
Lemma 4.32 (a) The truncations according to equations 4.285-4.288 can be computed by
\[ fprnd12'[-2 : 52] = finj12[-2 : 52] \text{ AND } \overline{mask1_{vp1'}[-2 : 52]} \]
\[ fprnd24'[-1 : 52] = finj24[-1 : 52] \text{ AND } \overline{mask1_{vp2'}[-1 : 52]}. \]
(b) With the detection of the L-bit-fix conditions by the masks
\[ lpdmask12[-2 : 52] = sr\_mode[0] \text{ AND } mask1_{vp1'}[-1 : 53] \text{ AND } sticky12[-2 : 52] \]
\[ lpdmask24[-1 : 52] = sr\_mode[0] \text{ AND } mask1_{vp2'}[0 : 53] \text{ AND } sticky24[-1 : 52], \]
the combination of the truncation and the L-bit fix can be computed by
\[
\begin{aligned}
fprnd12[-2 : 52] &= fprnd12'[-2 : 52] \text{ AND } \overline{lpdmask12[-2 : 52]} \\
&= finj12[-2 : 52] \text{ AND } \overline{mask1_{vp1'}[-2 : 52]} \text{ AND } \overline{lpdmask12[-2 : 52]} \\
fprnd24[-1 : 52] &= fprnd24'[-1 : 52] \text{ AND } \overline{lpdmask24[-1 : 52]} \\
&= finj24[-1 : 52] \text{ AND } \overline{mask1_{vp2'}[-1 : 52]} \text{ AND } \overline{lpdmask24[-1 : 52]}.
\end{aligned}
\]
Proof: (a) It follows from equation 4.291 in the previous lemma that
\[ mask1_{vp1'}[-2:52] = (0^{vp1'+3}, 1^{52-vp1'}). \]
According to this equation the bit string $mask1_{vp1'}[-2:52]$ has zeros exactly in the
positions $[-2:vp1']$. Thus, starting from equations 4.285-4.286, we get
\[ fprnd12' = \langle fprnd12'[-2:52] \rangle_{neg} = rnd_{RZ,vp1'}(finj12)
   = \langle finj12[-2:vp1'] \rangle_{neg}
   = \langle (finj12[-2:52] \text{ AND } mask1_{vp1'}[-2:52]) \rangle_{neg}, \]
so that as required
\[ fprnd12'[-2:52] = finj12[-2:52] \text{ AND } mask1_{vp1'}[-2:52]. \]
The equation for $fprnd24'[-1:52]$ can be shown analogously.
(b) According to equation 4.289, the L-bit fix condition for the rounding position $vp1'$
is given by
\[ lfix12 = sr\_mode[0] \text{ AND } ortree(finj12[vp1'+1:105])
   = sr\_mode[0] \text{ AND } sticky12[vp1']. \]
Considering only the valid positions $[-2:vp1']$ of the truncated significand
$fprnd12'[-2:52]$, the bit string $lsel12[-2:52] = mask1_{vp1'}[-1:53]$ masks the L-bit
position $vp1'$ by $lsel12[-2:vp1'-1] = 0^{vp1'+2}$ and $lsel12[vp1'] = 1$. Because
\[ lpdmask12[-2:52] = sr\_mode[0] \text{ AND } lsel12[-2:52] \text{ AND } sticky12[-2:52], \]
it follows from the above that $lpdmask12[-2:vp1'] = (0^{vp1'+2}, lfix12)$. Because the
rounded significand $fprnd12'[vp1'+1:52]$ is already truncated with
$fprnd12'[vp1'+1:52] = 0^{52-vp1'}$, we get
\[ fprnd12[-2:52] = finj12[-2:52] \text{ AND } mask1_{vp1'}[-2:52] \text{ AND } lpdmask12[-2:52] \]
\[ = fprnd12'[-2:52] \text{ AND } lpdmask12[-2:52] \]
\[ = (fprnd12'[-2:vp1'-1],\; fprnd12'[vp1'] \wedge lfix12,\; 0^{52-vp1'}). \]
This equation agrees with the definition of the L-bit fix for the computation of the rounded
significand $fprnd12$ from the significand $fprnd12'$. The equations for the computation of
$fprnd24[-1:52]$ can be shown analogously. □
This completes the description of the equations for the implementation of step (B), leaving
only the description of the exception detection. The detection of the invalid exception
inv was already included in the special-case computations. The detection of the inexact,
the underflow and the overflow exceptions inx, unf and ovf is described in the following
lemma.
Lemma 4.33 With the definitions (note that tiny1 was already used in the computation
of vrtiny in lemma 4.30)
\[ large0 \iff (e_{max} - e_{pr} < 0) \]
\[ large1 \iff (e_{max} - (e_{pr}+1) < 0) \]
\[ rmask12[-1:53] = (mask1_{vp1'}[-2],\; mask1_{vp1'}[-1:52] \text{ AND } \overline{mask1_{vp1'}[-2:51]}) \]
\[ rmask24[0:53] = (mask1_{vp2'}[-1],\; mask1_{vp2'}[0:52] \text{ AND } \overline{mask1_{vp2'}[-1:51]}) \]
\[ inx12 \iff ortree(rmask12[-1:53] \text{ AND } sticky12[-1:53]) \text{ OR }
   \big((sr\_mode[1] \vee sr\_mode[0]) \oplus ortree(rmask12[-1:53] \wedge finj12[-1:53])\big) \]
\[ inx24 \iff ortree(rmask24[0:53] \text{ AND } sticky24[0:53]) \text{ OR }
   \big((sr\_mode[1] \vee sr\_mode[0]) \oplus ortree(rmask24[0:53] \wedge finj24[0:53])\big) \]
\[ tiny0 \iff (e_{pr} - e_{min} < 0) \]
\[ tiny1 \iff (e_{pr} + 1 - e_{min} < 0) \]
\[ tiny2 \iff (e_{pr} + 2 - e_{min} < 0), \]
the overflow, the inexact and the underflow exception can be detected by
\[ ovf = \begin{cases}
   \overline{spca} \wedge \overline{winzig} \wedge large1 & \text{if } cfovf1 \\
   \overline{spca} \wedge \overline{winzig} \wedge large0 & \text{otherwise}
   \end{cases} \]
\[ inx = \begin{cases}
   (\overline{spca} \wedge (winzig \vee inx24)) \vee ovf & \text{if } cfovf1 \\
   (\overline{spca} \wedge (winzig \vee inx12)) \vee ovf & \text{otherwise}
   \end{cases} \]
\[ unf = \begin{cases}
   \overline{spca} \wedge (inx24 \vee unf\_en) \wedge ((tiny2 \wedge cfovf2) \vee (tiny1 \wedge \overline{cfovf2})) & \text{if } cfovf1 \\
   \overline{spca} \wedge (inx12 \vee unf\_en) \wedge tiny0 & \text{otherwise.}
   \end{cases} \]
Proof: In this multiplication unit the overflow condition can be written as
\[ ovf \iff \overline{spca} \wedge \overline{winzig} \wedge (|val(s_{pr}, e_{prnd}, fprnd)| \ge 2^{e_{max}+1}). \]
An overflow can only occur for results with very large exponents where
$e_{pr} + 2 \ge e_{prnd} \gg e_{min}$. Because of corollary 2.10 we then also have
$e_{pr} + wec > e_{min}$. Thus, it follows from lemma 4.29 that $ovf \Rightarrow \overline{cfovf2}$,
and we do not have to consider the case $cfovf2 = 1$ in the overflow detection. In this way
we get according to equation 4.278, where we only have to consider $fprnd12 < 2$ and
$fprnd24 < 2$:
\[ (|val(s_{pr}, e_{prnd}, fprnd)| \ge 2^{e_{max}+1}) \iff
   \begin{cases} (e_{pr} + 1 > e_{max}) & \text{if } cfovf1 \\ (e_{pr} > e_{max}) & \text{otherwise} \end{cases}
   \iff
   \begin{cases} large1 & \text{if } cfovf1 \\ large0 & \text{otherwise.} \end{cases} \]
In this way we get the equation for ovf from the lemma.
Because all special-case results (with $spca = 1$) are exact, the condition for an inexact
exception (see section 2.4) can be written as
\[ inx \iff (\overline{spca} \wedge rndinx) \vee ovf \qquad (4.295) \]
where the bit rndinx signals the significand rounding inexactness, namely the case that
significand rounding changes the value of the significand product.
According to the use of rndinx in equation 4.295, we can assume for the computation
of rndinx that $spca = 0$ and that no overflow occurs. Thus, according to equation
4.279 and equation 4.278 we have to consider the following three cases for the detection of
rndinx: (a) $(winzig = 1)$; (b) $(\overline{winzig} \wedge \overline{cfovf1} = 1)$; and
(c) $(\overline{winzig} \wedge cfovf1 = 1)$.
Because only non-zero significand products have to be considered for $(winzig = 1)$, it
is obvious that we have $rndinx = 1$ in case (a). For the rounding inexactness conditions
in the cases (b) and (c) we use equation 2.51 regarding the $(vp1'+1)$-representative of $fpr$,
which relates to the computation of $fprnd12$ (case (b)), and the $(vp2'+1)$-representative
of $fpr/2$, which relates to the computation of $fprnd24$ (case (c)). In this way we get
\[ rndinx = \begin{cases}
   winzig \vee ortree(fpr[vp2':104]) & \text{if } cfovf1 \\
   winzig \vee ortree(fpr[vp1'+1:104]) & \text{otherwise.}
   \end{cases} \qquad (4.296) \]
Because we do not compute a representation of the exact significand product $fpr$, the
above equation has to be computed from the injected significand products $finj12$ and
$finj24$. By considering the injections that are included in the injected significands in the
rounding modes RNU and RI, the above ortree conditions can be computed based on
the representations of the injected significands by
\[ ortree(fpr[vp1'+1:104]) = \begin{cases}
   ortree(\overline{finj12[vp1'+1]},\; \overline{finj12[vp1'+2:104]}) & \text{if } sr\_mode[1] \\
   ortree(\overline{finj12[vp1'+1]},\; finj12[vp1'+2:104]) & \text{if } sr\_mode[0] \\
   ortree(finj12[vp1'+1:104]) & \text{otherwise}
   \end{cases} \]
\[ = \big((sr\_mode[1] \vee sr\_mode[0]) \oplus finj12[vp1'+1]\big)
     \vee ortree(sr\_mode[1] \oplus finj12[vp1'+2:104]) \]
\[ = \big((sr\_mode[1] \vee sr\_mode[0]) \oplus finj12[vp1'+1]\big)
     \vee sticky12[vp1'+1] \qquad (4.297) \]
\[ ortree(fpr[vp2':104]) = \begin{cases}
   ortree(\overline{finj24[vp2'+1]},\; \overline{finj24[vp2'+2:105]}) & \text{if } sr\_mode[1] \\
   ortree(\overline{finj24[vp2'+1]},\; finj24[vp2'+2:105]) & \text{if } sr\_mode[0] \\
   ortree(finj24[vp2'+1:105]) & \text{otherwise}
   \end{cases} \]
\[ = \big((sr\_mode[1] \vee sr\_mode[0]) \oplus finj24[vp2'+1]\big)
     \vee ortree(sr\_mode[1] \oplus finj24[vp2'+2:105]) \]
\[ = \big((sr\_mode[1] \vee sr\_mode[0]) \oplus finj24[vp2'+1]\big)
     \vee sticky24[vp2'+1] \qquad (4.298) \]
For the computations in equation 4.297, we have to select the bit $finj12[vp1'+1]$ from
the bit string $finj12[-2:52]$ and the bit $sticky12[vp1'+1]$ from the bit string
$sticky12[-2:52]$. For this purpose we require a mask that has a one exactly in position
$vp1'+1$ and zeros in all other positions. This is exactly the case for
\[ rmask12[-1:53] = (mask1_{vp1'}[-2],\; mask1_{vp1'}[-1:52] \text{ AND } \overline{mask1_{vp1'}[-2:51]}) \]
\[ = (0^{vp1'+2}, 1^{53-vp1'}) \text{ AND } (1^{vp1'+3}, 0^{52-vp1'}) \]
\[ = (0^{vp1'+2}, 1, 0^{52-vp1'}). \]
Thus, we get
\[ finj12[vp1'+1] = ortree(rmask12[-1:53] \text{ AND } finj12[-2:52]) \]
\[ sticky12[vp1'+1] = ortree(rmask12[-1:53] \text{ AND } sticky12[-2:52]) \]
and equation 4.297 can be written as:
\[ ortree(fpr[vp1'+1:104]) = ortree(rmask12[-1:53] \text{ AND } sticky12[-2:52]) \text{ OR }
   \big((sr\_mode[1] \vee sr\_mode[0]) \oplus ortree(rmask12[-1:53] \wedge finj12[-2:52])\big)
   = inx12. \]
It can be shown analogously that equation 4.298 can be computed by
\[ ortree(fpr[vp2':104]) = ortree(rmask24[0:53] \text{ AND } sticky24[-1:52]) \text{ OR }
   \big((sr\_mode[1] \vee sr\_mode[0]) \oplus ortree(rmask24[0:53] \wedge finj24[-1:52])\big)
   = inx24. \]
The substitution of rndinx in equation 4.295 according to equation 4.296 and the
substitution of $ortree(fpr[vp1'+1:104])$ and $ortree(fpr[vp2':104])$ by $inx12$ resp. $inx24$
according to the previous two equations then yield the equation for inx from the lemma.
For the multiplication unit the condition for an underflow exception is defined by (the
function TINY is defined in definition 2.10):
\[ unf = \begin{cases}
   \overline{spca} \wedge TINY(s_{pr}, e_{prnd}, fprnd) & \text{if } unf\_en \\
   \overline{spca} \wedge rndinx \wedge TINY(s_{pr}, e_{prnd}, fprnd) & \text{otherwise}
   \end{cases} \]
\[ = \overline{spca} \wedge (rndinx \vee unf\_en) \wedge TINY(s_{pr}, e_{prnd}, fprnd) \]
\[ = \begin{cases}
   \overline{spca} \wedge (inx24 \vee unf\_en) \wedge TINY(s_{pr}, e_{pr}+1, fprnd24) & \text{if } cfovf1 \\
   \overline{spca} \wedge (inx12 \vee unf\_en) \wedge TINY(s_{pr}, e_{pr}, fprnd12) & \text{otherwise.}
   \end{cases} \]
Because for $cfovf1 = 0$ the rounded significand $fprnd12$ is smaller than 2, we get
\[ TINY(s_{pr}, e_{pr}, fprnd12) \iff (e_{pr} < e_{min}) \iff tiny0. \]
Because $fprnd24 < 2$ for $cfovf2 = 0$ and $fprnd24 = 2$ for $cfovf2 = 1$, we get for the
function
\[ TINY(s_{pr}, e_{pr}+1, fprnd24) \iff
   \begin{cases} (e_{pr} + 2 < e_{min}) & \text{if } cfovf2 \\ (e_{pr} + 1 < e_{min}) & \text{otherwise} \end{cases}
   \iff ((tiny2 \wedge cfovf2) \vee (tiny1 \wedge \overline{cfovf2})). \]
With the substitution of $TINY(s_{pr}, e_{pr}, fprnd12)$ and $TINY(s_{pr}, e_{pr}+1, fprnd24)$ in
equation 4.299 according to the previous equations, we get the equation for unf from the
lemma. □
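The idea behind equations 4.297 and 4.298, namely that the OR over the low bits of the exact product can be read off the injected product alone, can be illustrated with a small sketch. The injection patterns used here (zero for RZ, a single one in the top discarded position for RNU, all ones for RI) are the usual ones in injection-based rounding and are an assumption of this sketch, as are the function names; the low part is modelled as a plain `k`-bit integer and any carry into the kept bits is ignored.

```python
def injection(k: int, mode: str) -> int:
    """Assumed injection pattern for the k discarded low bits."""
    return {"RZ": 0, "RNU": 1 << (k - 1), "RI": (1 << k) - 1}[mode]


def inject_low(fpr_low: int, k: int, mode: str) -> int:
    """Low k bits of the injected value fpr_low + injection."""
    return (fpr_low + injection(k, mode)) % (1 << k)


def low_bits_nonzero(finj_low: int, k: int, mode: str) -> bool:
    """ortree(fpr_low), recovered from the injected low bits only.

    Adding a constant modulo 2**k is a bijection, so fpr_low == 0
    exactly when the injected low bits equal the injection pattern.
    """
    return finj_low != injection(k, mode)


k = 6
for mode in ("RZ", "RNU", "RI"):
    ok = all(
        low_bits_nonzero(inject_low(x, k, mode), k, mode) == (x != 0)
        for x in range(1 << k)
    )
    print(mode, ok)
```

Comparing the injected bits against a mode-dependent constant pattern is the role the text assigns to the 'A+B=K' circuit, which performs the comparison directly on the carry-save form.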
This lemma completes the description of the computations for the exception flags, so
that the description of the whole multiplication unit III is completed.
Figure 4.27 depicts the main structure of the multiplication unit III, figure 4.29 depicts
a detailed block diagram of the implementation of step (B), and a detailed block diagram
of the exceptions and exponent computations is given in figure 4.30.
Figure 4.29: Implementation of the significand compression and variable position rounding
in the multiplication unit III.
Figure 4.30: Implementation of the exponent & exceptions circuit of the multiplication
unit III.
4.4 Division
In this section the implementations of the floating-point division are described. As in
the previous sections on the addition and the multiplication implementations, the
description is separated into three subsections for the microarchitectures I, II and III.
For the division implementations most of the detail concerns the computation of the
significand quotient. Because a very similar significand quotient implementation is used
for all three microarchitectures, we describe it only once, for microarchitecture I. For the
other two microarchitectures we describe only the small adjustments that are required.
The implementation of the significand quotient uses an initial approximation of the
reciprocal of the divisor. Because the description of our implementation of this initial
reciprocal approximation (see also [36, 39]) is quite complex, we describe it separately in
the next subsection in preparation for the division implementations.
4.4.1 Initial Reciprocal Approximation
The circuit for the reciprocal approximation should approximate the reciprocal of a
normalized input significand $y = \langle y[0:52] \rangle_{neg} \in [1, 2[$. We denote the
approximated reciprocal by $arecip(y) \approx 1/y$ and define the approximation error by
$err(y) = 1/y - arecip(y)$. For the approximated reciprocal result $arecip(y)$ the
computation has to guarantee an upper bound on the absolute approximation error
$|err(y)|$. In particular, for the implementations of the FP division, we will require
initial reciprocal approximations with absolute approximation errors that are bounded by
$|err(y)| < 2^{-8}$, $|err(y)| < 2^{-15}$ and $|err(y)| < 2^{-28}$ respectively.
In the literature the initial reciprocal approximations fall into two groups. The constant
approximation [13, 18, 35, 6, 5, 15, 42] is easy to implement in 1 clock cycle by a simple
lookup table, but due to the huge cost it is limited to small accuracies (approximation
errors around $2^{-16}$). The linear [18, 35] and modified linear [18] approximation
approaches can achieve even twice the accuracy of constant approximations at nearly the
same cost, but the implementations corresponding to [7, 18, 35] require about 3 clock
cycles for an approximation: one lookup and decode cycle, one cycle for the adder tree of
the full-size multiplication and one clock cycle for the carry-propagate addition of this
multiplication.
We present a faster linear approximation implementation for the reciprocal. A description
of this implementation can also be found in [36, 39]. In comparison to the previous
linear reciprocal approximation implementations from the literature, our implementation is
accelerated by the use of the following new ideas:
1. a linear approximation formula that reduces the widths of table lookup inputs and
multiplication operands for a given approximation accuracy.
2. the use of a specific small Booth multiplier (with less than 8 partial products in the
implementation for $|err(y)| < 2^{-28}$) for the computation of the linear approximation
formula.
3. a fast redundant compression from carry-save representations to redundant Booth-digit
representations, a redundant format that can directly be fed into the large
Booth multiplier of the FP multiplication unit. This fast partial compression avoids
the slow carry-propagate addition step in the multiplication of the linear approximation
formula.
For the description of the implementations, we first develop the linear approximation
formula for the approximation of the reciprocal. We then introduce the new intermediate
format, the redundant Booth-digit (redBD) representation, in which the reciprocal
approximation should be output. Based on the approximation formula, we finally describe the
implementation of the computations from the binary representation of the input $y[0:52]$
to the redundant Booth-digit representation of the approximated reciprocal $arecip(y)$ for
a given approximation accuracy. In particular we consider the implementations for the
three target accuracies that will be required for the implementations of the floating-point
divisions.
4.4.1.1 Approximation formula
We consider a linear approximation formula for the reciprocal. The linear and the constant
parameter of this linear approximation are not fixed for the whole range of $y$; instead the
range of $y$ is partitioned into $2^m$ subintervals, and for each of these subintervals a
specific linear approximation $arecip_p(y)$ with an appropriate linear and an appropriate
constant parameter is used to approximate the reciprocal function.
We consider the $2^m$ equidistant subintervals $[p, p+2^{-m}[$ with
$p \in \{2^m, \ldots, 2^{m+1}-1\}/2^m$. Because $y \in [1, 2-2^{-52}]$, one of these intervals
contains $y \in [p, p+2^{-m}[$. We get the left endpoint of this interval by
$p = \langle y[0:m] \rangle_{neg}$ and the right endpoint by
$\langle y[0:m] \rangle_{neg} + 2^{-m}$. The linear approximation formula for the interval
$[p, p+2^{-m}[$ can be written as
\[ arecip_p(y) = C0_p + C1_p \cdot (y - p) \]
with the constant parameter $C0_p$ and the linear parameter $C1_p$. For the approximation
formulae in the $2^m$ different intervals, we require $2^m$ different constants $C0_p$ and
$2^m$ different constants $C1_p$. In the implementation we get these constants by a table
lookup from a ROM for $C0_p$ and from a ROM for $C1_p$. Because $y \in [1, 2[$ is normalized
and we always have $y[0] = 1$, the ROMs with the $2^m$ entries for $C0_p$ and $C1_p$ can be
addressed by $y[1:m]$, where $m$ is the input width of the table lookup ROMs.
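The interval selection and ROM addressing described above can be sketched as follows. The representation of the significand as a 53-bit integer (with $y[0]$ as the most significant bit) and the function name `split_significand` are assumptions made for this illustration only.

```python
from fractions import Fraction


def split_significand(y_bits: int, m: int):
    """Split a normalized significand into ROM address, p and y - p.

    y_bits holds y[0:52] as a 53-bit integer, y[0] (weight 2^0) on top.
    Returns (addr, p, rest) with addr = <y[1:m]> read as an integer,
    p = <y[0:m]> and rest = y - p, the latter two as exact fractions.
    """
    assert (y_bits >> 52) == 1, "significand must be normalized (y[0] = 1)"
    addr = (y_bits >> (52 - m)) & ((1 << m) - 1)   # bits y[1:m]
    p = Fraction((1 << m) | addr, 1 << m)          # left interval endpoint
    y = Fraction(y_bits, 1 << 52)
    return addr, p, y - p


# y = 1.101 0...01 in binary, m = 3
addr, p, rest = split_significand((1 << 52) | (0b101 << 49) | 1, 3)
print(addr, p, rest)
```

Only the $m$ address bits reach the tables, so the table size grows as $2^m$ while the remaining bits $y[m+1:52]$ enter the computation through the factor $y - p$.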
In this way the delay of the implementation of the linear approximation formula is
mainly influenced by the following parameters:
• the input width $m$ of the lookup tables, because it determines the delay of the ROM
tables.
• the widths of the multiplication operands within the linear approximation formula,
which influence the delay and the cost of the additional small multiplier.
In the following lemma we consider the linear approximation formula with the focus on
minimizing these parameters for a given accuracy.
Lemma 4.34 For $y \in [p, p+2^{-m}[$ with $p \in \{2^m, \ldots, 2^{m+1}-1\}/2^m$, the reciprocal
approximation of $f(y) = 1/y$ by the linear function
\[ arecip_p(y) = rnd_{RZ,wr}\big(C0_p - C1_p \cdot rnd_{RZ,wy}(y - p)\big) \]
with
\[ C1_p = rnd_{RNE,wc1}\left( \frac{1}{(p + 2^{-m-1})^2} \right) \]
\[ C0_p = rnd_{RNE,wc0}\left( \frac{1}{p + 2^{-m-1}} + 2^{-2m-3} + 2^{-m-1} \cdot C1_p \right) \]
results in the approximation error
\[ |err(y)| = |1/y - arecip_p(y)| < 2^{-2m-3} + 2^{-wc1-m-2} + 2^{-wc0-1} + 2^{-wy} + 2^{-wr}. \]
Proof: Taylor approximation of degree 1 of the function $f(y) = 1/y$ developed at the
midpoint $p + 2^{-m-1}$ of the interval $[p, p+2^{-m}[$ yields the linear approximation formula
\[ r_{taylor}(y) = f(p + 2^{-m-1}) + f'(p + 2^{-m-1}) \cdot \big(y - (p + 2^{-m-1})\big) \]
\[ = \frac{1}{p + 2^{-m-1}} - \frac{1}{(p + 2^{-m-1})^2} \cdot \big(y - (p + 2^{-m-1})\big). \]
Using the Lagrange error formula, the approximation error
$error_{taylor} = 1/y - r_{taylor}(y)$ in the interval $[p, p+2^{-m}[$ is bounded by
\[ |error_{taylor}| \le \left| \frac{f''(\xi)}{2} \cdot \delta^2 \right|
   \quad \text{with } \delta = y - (p + 2^{-m-1}) \in [-2^{-m-1}, 2^{-m-1}) \]
for some $\xi$ in the interval. As $f''(y) = 2/y^3$ and $y \in [1, 2)$ we have
\[ |error_{taylor}| \le 2^{-2m-2}. \]
The second derivative of $1/y$ is positive for $y \in [1, 2)$ and $r_{taylor}(y)$ describes a
tangent of the graph of $1/y$. Therefore, $error_{taylor}$ cannot become negative. By adding
half of the maximum error in the approximation formula
\[ r_1(y) = r_{taylor}(y) + 2^{-2m-3} \]
we halve the absolute error
\[ |error_1| = |1/y - r_1(y)| \le 2^{-2m-3}. \]
In $|error_1|$ only the approximation error produced by the linear approximation using
infinite precision numbers is considered. We have to consider the additional influence of
the discretization errors caused by using finite precision numbers.
First, we discretize the derivative
$C1_p = rnd_{RNE,wc1}\left( 1/(p + 2^{-m-1})^2 \right)$ at position $wc1$ and
then bring the linear term in $r_1(y)$ to the form of the linear term in $arecip_p(y)$:
\[ r_2(y) = \frac{1}{p + 2^{-m-1}} + 2^{-2m-3} + 2^{-m-1} \cdot C1_p - C1_p \cdot (y - p). \]
Because $|y - (p + 2^{-m-1})| \le 2^{-m-1}$, and because the rounding function $rnd_{RNE,wc1}$
produces a discretization error smaller than or equal to $2^{-wc1-1}$, we get the error bound
\[ |error_2| = |1/y - r_2(y)| \le 2^{-2m-3} + 2^{-wc1-m-2}. \]
Discretizing the constant part at position $wc0$,
\[ C0_p = rnd_{RNE,wc0}\left( \frac{1}{p + 2^{-m-1}} + 2^{-2m-3} + 2^{-m-1} \cdot C1_p \right), \]
and the linear factor $(y - p)$ at position $wy$ by $rnd_{RZ,wy}(y - p)$ yields
\[ r_3(y) = C0_p - C1_p \cdot rnd_{RZ,wy}(y - p). \]
Because the rounding function $rnd_{RZ,wy}$ produces a discretization error smaller than
$2^{-wy}$, the error bound increases to
\[ |error_3| = |1/y - r_3(y)| < 2^{-2m-3} + 2^{-wc1-m-2} + 2^{-wc0-1} + 2^{-wy}. \]
The linear approximation formula $arecip_p(y) = rnd_{RZ,wr}(r_3(y))$ then contains the final
approximation error, which is bounded by
\[ |err(y)| = |1/y - arecip_p(y)| < 2^{-2m-3} + 2^{-wc1-m-2} + 2^{-wc0-1} + 2^{-wy} + 2^{-wr}. \]
□
Corollary 4.35 For the implementation of the reciprocal approximation we will use the
linear approximation formula from lemma 4.34 with $wc1 = m+2$, $wy = 2m+6$,
$wc0 = 2m+5$ and $wr = 2m+5$, so that we get:
\[ arecip_p(y) = rnd_{RZ,2m+5}\big(C0_p - C1_p \cdot rnd_{RZ,2m+6}(y - p)\big) \]
\[ C1_p = rnd_{RNE,m+2}\left( \frac{1}{(p + 2^{-m-1})^2} \right) \]
\[ C0_p = rnd_{RNE,2m+5}\left( \frac{1}{p + 2^{-m-1}} + 2^{-2m-3} + 2^{-m-1} \cdot C1_p \right) \]
This approximation formula results in an absolute error $|err(y)| < 2^{-2m-2}$.
Because $rnd_{RZ,wy}(y - p) < 2^{-m}$ and $|C1_p| < 1$, the binary representation of
$rnd_{RZ,wy}(y - p)$ contains $wy - m = m+6$ non-zero positions and the binary
representation of $-C1_p$ contains $wc1 = m+2$ non-zero positions. For $0.5 \le C0_p < 1$,
the most significant bit of $C0_p$ in the position with weight $2^{-1}$ is always a 1.
Therefore, only $wc0 - 1 = 2m+4$ bits have to be saved in a lookup table entry for $C0_p$.
In this way a straightforward implementation of the linear approximation formula according
to corollary 4.35 requires an $m$-bit-in lookup table for $C0_p$ with a bit width of $2m+4$
and an $m$-bit-in lookup table for $-C1_p$ with a bit width of $m+2$. An $(m+2)$-bit by
$(m+6)$-bit multiplication is required to compute the multiplication of this linear
approximation formula.
For the three target accuracies of the reciprocal approximation that we will require
for the implementations of the floating-point division, we consider the linear reciprocal
approximation formula from corollary 4.35 with $m = 13$ to get $|err(y)| < 2^{-28}$, with
$m = 7$ to get $|err(y)| < 2^{-16}$ and with $m = 3$ to get $|err(y)| < 2^{-8}$.
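The error bound of corollary 4.35 follows by substituting its parameter choices into the bound of lemma 4.34. The following worked sum is a sketch of that substitution step and uses no symbols beyond those of the lemma:

```latex
\begin{align*}
|err(y)| &< 2^{-2m-3} + 2^{-wc1-m-2} + 2^{-wc0-1} + 2^{-wy} + 2^{-wr} \\
         &= 2^{-2m-3} + 2^{-2m-4} + 2^{-2m-6} + 2^{-2m-6} + 2^{-2m-5} \\
         &= 2^{-2m-3}\left(1 + \tfrac{1}{2} + \tfrac{1}{8} + \tfrac{1}{8} + \tfrac{1}{4}\right) \\
         &= 2^{-2m-3} \cdot 2 = 2^{-2m-2}.
\end{align*}
```

With $m = 13$, $m = 7$ and $m = 3$ this gives the bounds $2^{-28}$, $2^{-16}$ and $2^{-8}$.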
For these three cases we define the linear approximation equations (note that in these
equations $p = \langle y[0:m] \rangle_{neg}$ and that $y - p = \langle y[m+1:52] \rangle_{neg}$)
\[ arecip28(y) = rnd_{RZ,31}\big(C0\_28_p - C1\_28_p \cdot rnd_{RZ,32}(y - p)\big) \qquad (4.299) \]
\[ arecip16(y) = rnd_{RZ,19}\big(C0\_16_p - C1\_16_p \cdot rnd_{RZ,20}(y - p)\big) \qquad (4.300) \]
\[ arecip08(y) = rnd_{RZ,11}\big(C0\_08_p - C1\_08_p \cdot rnd_{RZ,12}(y - p)\big), \qquad (4.301) \]
where the constants are defined by:
\[ C0\_28_p = rnd_{RNE,31}\left( \frac{1}{p + 2^{-14}} + 2^{-29} + 2^{-14} \cdot C1\_28_p \right) \]
\[ C1\_28_p = rnd_{RNE,15}\left( \frac{1}{(p + 2^{-14})^2} \right) \]
\[ C0\_16_p = rnd_{RNE,19}\left( \frac{1}{p + 2^{-8}} + 2^{-17} + 2^{-8} \cdot C1\_16_p \right) \]
\[ C1\_16_p = rnd_{RNE,9}\left( \frac{1}{(p + 2^{-8})^2} \right) \]
\[ C0\_08_p = rnd_{RNE,11}\left( \frac{1}{p + 2^{-4}} + 2^{-9} + 2^{-4} \cdot C1\_08_p \right) \]
\[ C1\_08_p = rnd_{RNE,5}\left( \frac{1}{(p + 2^{-4})^2} \right). \]
                                 non-zero bit positions in the representations of
function | m  | |err(y)| < | C0_p        | -C1_p       | rnd_{RZ,wy}(y-p) | arecip
arecip28 | 13 | 2^{-28}    | c0_p[1:31]  | c1_p[1:15]  | y[14:32]         | arecip28[0:32]
arecip16 | 7  | 2^{-16}    | c0_p[1:19]  | c1_p[1:9]   | y[8:20]          | arecip16[0:20]
arecip08 | 3  | 2^{-8}     | c0_p[1:11]  | c1_p[1:5]   | y[4:12]          | arecip08[0:12]
Table 4.2: Bit positions of the operands in the linear reciprocal approximation formulae
for the functions arecip28, arecip16 and arecip08.
The computation of $arecip28(y)$ requires a 15-bit by 19-bit multiplication and an addition
with a 31-bit value, the computation of $arecip16(y)$ requires a 9-bit by 13-bit
multiplication and an addition with a 19-bit value, and the computation of $arecip08(y)$
requires a 5-bit by 9-bit multiplication and an addition with an 11-bit value. The required
bit positions for these computations are listed in table 4.2. We postpone a detailed
description of the implementation and introduce in the following section the intermediate
format in which the reciprocal approximation should be represented.
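The approximation formula of corollary 4.35 can be checked numerically with exact rational arithmetic. The following is a minimal sketch, not the hardware datapath: the helper names `rnd_rz`, `rnd_rne` and `arecip` are chosen here for illustration, the constants are recomputed on the fly instead of being read from ROMs, and "rounding at position w" is assumed to mean keeping w fractional bits.

```python
import math
from fractions import Fraction


def rnd_rz(x: Fraction, w: int) -> Fraction:
    """Round a non-negative value toward zero to w fractional bits."""
    return Fraction(math.floor(x * 2**w), 2**w)


def rnd_rne(x: Fraction, w: int) -> Fraction:
    """Round a non-negative value to nearest (ties to even), w fractional bits."""
    scaled = x * 2**w
    lo = math.floor(scaled)
    rem = scaled - lo
    if rem > Fraction(1, 2) or (rem == Fraction(1, 2) and lo % 2 == 1):
        lo += 1
    return Fraction(lo, 2**w)


def arecip(y: Fraction, m: int) -> Fraction:
    """Linear reciprocal approximation with the widths of corollary 4.35."""
    wc1, wy, wc0, wr = m + 2, 2 * m + 6, 2 * m + 5, 2 * m + 5
    p = rnd_rz(y, m)                          # left endpoint <y[0:m]>
    mid = p + Fraction(1, 2 ** (m + 1))       # interval midpoint
    c1 = rnd_rne(1 / mid**2, wc1)
    c0 = rnd_rne(1 / mid + Fraction(1, 2 ** (2 * m + 3))
                 + c1 * Fraction(1, 2 ** (m + 1)), wc0)
    return rnd_rz(c0 - c1 * rnd_rz(y - p, wy), wr)


# Worst observed error for m = 3 over all y with 12 fractional bits.
m = 3
worst = max(abs(1 / y - arecip(y, m))
            for y in (1 + Fraction(k, 2**12) for k in range(2**12)))
print(float(worst), worst < Fraction(1, 2 ** (2 * m + 2)))
```

Calling `arecip(y, 7)` or `arecip(y, 13)` in the same way exercises the parameter choices behind arecip16 and arecip28.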
4.4.1.2 Redundant Booth-Digit Representations
For the definition of redundant Booth-digit representations we briefly review Booth
recoding. Following the descriptions of [3, 30], a number $b = \langle b[m-1:0] \rangle$ is
recoded in Booth-2 recoding as suggested in Fig. 4.31. With $b[m+1] = b[m] = b[-1] = 0$ and
$m' = \lceil (m+1)/2 \rceil$ one writes
\[ b = \langle b[m-1:0] \rangle \qquad (4.302) \]
\[ = 2b - b = 2\langle b[m-1:0] \rangle - \langle b[m-1:0] \rangle \qquad (4.303) \]
\[ = \sum_{j=0}^{m'-1} B_{2j} \cdot 4^j \qquad (4.304) \]
where
\[ B_{2j} = 2b[2j] + b[2j-1] - 2b[2j+1] - b[2j] \qquad (4.305) \]
\[ = -2b[2j+1] + b[2j] + b[2j-1]. \qquad (4.306) \]
For $0 \le j \le m'-1$ this equation computes the Booth digits
$B_{2j} \in \{-2, -1, 0, 1, 2\}$ for the number $b = \langle b[m-1:0] \rangle$. The Booth digits
$B_{2j}$ of a number are not unique: not only the set of values according to equation 4.306,
but each set of values $B_{2j} \in \{-2, -1, 0, 1, 2\}$ that fulfills equation 4.304 defines a
set of Booth digits for the number $b$. A string of Booth digits
$(B_{2j})_{0 \le j \le m'-1}$ that fulfills equation 4.304 is called a Booth
digit representation of the number b.
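Equations 4.304 and 4.306 can be checked with a short sketch. The function name `booth2_digits` and the use of a Python integer for the bit string $b[m-1:0]$ are assumptions of this illustration.

```python
def booth2_digits(b: int, m: int):
    """Booth-2 digits B_2j of the m-bit number b, following equation 4.306.

    Bits outside [0, m-1] are taken as zero (b[m+1] = b[m] = b[-1] = 0).
    Returns the list [B_0, B_2, ...] with m' = ceil((m + 1) / 2) entries.
    """
    def bit(i: int) -> int:
        return (b >> i) & 1 if 0 <= i < m else 0

    m_prime = (m + 2) // 2          # equals ceil((m + 1) / 2)
    return [-2 * bit(2 * j + 1) + bit(2 * j) + bit(2 * j - 1)
            for j in range(m_prime)]


digits = booth2_digits(0b111, 3)
print(digits, sum(d * 4**j for j, d in enumerate(digits)))
```

The recombination in the last line is exactly the sum of equation 4.304, so it reproduces the input value.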
According to equation 4.306 each Booth digit $B_{2j}$ can be computed from three
consecutive bits $(b[2j+1], b[2j], b[2j-1])$ of the binary representation of the number
$b = \langle b[m-1:0] \rangle$. For an arbitrary string of triples
$(rb3_j, rb2_j, rb1_j)_{0 \le j \le m'-1}$, we define the corresponding Booth digits by
\[ B_{2j} = -2\,rb3_j + rb2_j + rb1_j. \qquad (4.307) \]
The Booth digits $B_{2j}$ that are computed according to this equation represent the number
\[ b = \sum_{j=0}^{m'-1} B_{2j} \cdot 4^j = \sum_{j=0}^{m'-1} (-2\,rb3_j + rb2_j + rb1_j) \cdot 4^j. \]
In this way the number $b$ is also represented by the string of triples
$(rb3_j, rb2_j, rb1_j)_{0 \le j \le m'-1}$. We define that the string of triples
$(rb3_j, rb2_j, rb1_j)_{0 \le j \le m'-1}$ is called a redundant Booth-digit representation
of $b$ iff $b = \sum_{j=0}^{m'-1} (-2\,rb3_j + rb2_j + rb1_j) \cdot 4^j$.
Figure 4.31: Booth digits $B_{2j}$.
Note that because the string of triples $(b[2j+1], b[2j], b[2j-1])_{0 \le j \le m'-1}$ is also
a redundant Booth-digit representation of $b$ according to equations 4.304 and 4.306, a
binary representation of a number can easily be converted into a redundant Booth-digit
representation of the number. We denote this conversion from the binary representation of
the number $b$ to a redundant Booth-digit representation of the number $b$ by the operation
redBD with
\[ (b[2j+1], b[2j], b[2j-1])_{0 \le j \le m'-1} = redBD(b[m-1:0]). \]
Multiplier with Input of Redundant Booth-Digit Representation. In an ordinary
implementation of a multiplier that uses Booth recoding, the binary input of
one of the operands is encoded by Booth encoders (we call this the second operand
and denote it by $b = \langle b[m-1:0] \rangle$). Each of these Booth encoders computes a
sign-magnitude representation of one Booth digit and gets as input the three consecutive
bits $(b[2j+1], b[2j], b[2j-1])$ from which a Booth digit is originally computed according
to equation 4.306. We change the specification of the multiplier so that as the
second operand not the binary representation of the number $b$, but the set of triples
$(b[2j+1], b[2j], b[2j-1])$ has to be input. (This change can easily be realized, because
these are exactly the inputs of the Booth encoders.) Thus, to multiply a number
$a$ by $b$, not the binary representation of $b$, but the redundant Booth-digit representation
$(b[2j+1], b[2j], b[2j-1])_{0 \le j \le m'-1}$ is required as input of the new multiplier.
Because in equations 4.306 and 4.307 the weights of the bits in $(b[2j+1], b[2j], b[2j-1])$
and $(rb3_j, rb2_j, rb1_j)$ correspond to each other, it does not change the value of the
second operand if the redundant Booth-digit representation
$(b[2j+1], b[2j], b[2j-1])_{0 \le j \le m'-1}$ is replaced by an arbitrary redundant
Booth-digit representation $(rb3_j, rb2_j, rb1_j)_{0 \le j \le m'-1}$ of the
number $b$. In this way we get a multiplier that multiplies a number
$a = \langle a[k-1:0] \rangle$, which is given in the binary representation $a[k-1:0]$, with a
number $b$ that is given by an arbitrary redundant Booth-digit representation.
Compression from Carry-Save to Redundant Booth-Digit Representations. The
following lemma describes how a number can be converted from carry-save to redundant
Booth-digit representation. This technique will help us to avoid the carry-propagate
addition in the implementation of the multiplication for the linear reciprocal approximation.
Lemma 4.36 Let the compression injection compinj be defined by
$compinj = \sum_{j=0}^{m'-1} 2 \cdot 4^j$. Then, from a carry-save representation of
$b + compinj$, a redundant Booth-digit representation of $b$ can be computed by a line of
two-bit adders with inverted most significant sum-bit outputs as depicted in figure 4.32.
Figure 4.32: Compression from carry-save to redundant Booth-digit representation.
Proof: We get a carry-save representation of $b + \mathit{compinj}$. Looking at bit windows of width 2 in this carry-save representation, each window $j$ contains 4 bits (see Fig. 4.32): two with weight one and two with weight two. Therefore, the binary value $w_j$ of the part of the number within window $j$ is in the range $w_j \in \{0, \ldots, 6\}$. The number $b + \mathit{compinj}$ can then be written as:

$$b + \mathit{compinj} = \sum_{j=0}^{m'-1} w_j \cdot 4^j.$$

If we input the 4 bits of a window $j$ into a 2-bit adder, we get three output bits $c[2j+2]$, $s[2j+1]$ and $s[2j]$ that represent the value of the window by:

$$w_j = 4 \cdot c[2j+2] + 2 \cdot s[2j+1] + s[2j].$$

With $s[2m'+1] = s[2m'] = c[0] = 0$, we have

$$b + \mathit{compinj} = \sum_{j=0}^{m'-1} \bigl(4 \cdot c[2j+2] + 2 \cdot s[2j+1] + s[2j]\bigr) \cdot 4^j = \sum_{j=0}^{m'} \bigl(2 \cdot s[2j+1] + s[2j] + c[2j]\bigr) \cdot 4^j.$$

Now we subtract the additive constant $\mathit{compinj}$ on both sides and get the value of $b$:

$$b + \mathit{compinj} - \sum_{j=0}^{m'-1} 2 \cdot 4^j = \sum_{j=0}^{m'} \bigl(2 \cdot s[2j+1] + s[2j] + c[2j] - 2\bigr) \cdot 4^j$$

$$b = \sum_{j=0}^{m'} \bigl(2 \cdot (s[2j+1] - 1) + s[2j] + c[2j]\bigr) \cdot 4^j$$

As $x - 1 = -\overline{x}$ for $x \in \{0, 1\}$, we can substitute $s[2j+1] - 1$ by $-\overline{s[2j+1]}$ and have:

$$b = \sum_{j=0}^{m'} \bigl(-2 \cdot \overline{s[2j+1]} + s[2j] + c[2j]\bigr) \cdot 4^j,$$

so that the string of triples $(\overline{s[2j+1]}, s[2j], c[2j])_{0 \le j \le m'-1}$ is a redundant Booth-digit representation of $b$. Thus, a partial compression from a carry-save representation of $b + \mathit{compinj}$ to a redundant Booth-digit representation of $b$ can be implemented by a line of 2-bit adders with inverted most significant sum bit outputs, as depicted in figure 4.32. □

Figure 4.33: Structure of the reciprocal approximation implementation for the computation of: (a) $\mathit{arecip16}'(y)$ with $m = 7$ and $\mathit{arecip08}'(y)$ with $m = 3$; (b) $\mathit{arecip28}''(y)$.
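The compression step of the lemma can be tried out on small integers. The following Python sketch is only an illustration under stated assumptions: the function name `cs_to_booth_digits` is hypothetical, the carry-save pair is modelled as two non-negative integers `u` and `v` with `u + v = b + compinj`, and `compinj` is taken to be the sum of $2 \cdot 4^j$ over all windows, as in the subtraction step of the proof. The carry out of the topmost window is returned separately instead of being assumed to be zero.

```python
def cs_to_booth_digits(u, v, windows):
    """Compress a carry-save pair (u, v) into radix-4 digits in {-2, ..., 2}.

    Assumption of this sketch: u + v == b + compinj with
    compinj == sum(2 * 4**j for j in range(windows)).
    Returns (digits, top_carry) such that
    b == sum(d * 4**j for j, d in enumerate(digits)) + top_carry * 4**windows.
    """
    digits = []
    carry_in = 0  # corresponds to c[0] = 0
    for j in range(windows):
        # Value of window j: two bits of weight 1 and two bits of weight 2.
        w = ((u >> (2 * j)) & 3) + ((v >> (2 * j)) & 3)  # in 0..6
        s0 = w & 1                 # s[2j]
        s1 = (w >> 1) & 1          # s[2j+1]
        carry_out = (w >> 2) & 1   # c[2j+2]
        # Digit -2 * NOT(s[2j+1]) + s[2j] + c[2j]
        digits.append(-2 * (1 - s1) + s0 + carry_in)
        carry_in = carry_out
    return digits, carry_in


if __name__ == "__main__":
    windows = 4
    compinj = sum(2 * 4**j for j in range(windows))
    b = 57
    total = b + compinj
    u, v = total // 3, total - total // 3   # an arbitrary carry-save split
    digits, top = cs_to_booth_digits(u, v, windows)
    value = sum(d * 4**j for j, d in enumerate(digits)) + top * 4**windows
    print(digits, top, value == b)
```

Each digit only depends on one window and on the carry of the window below it, which is why the hardware needs no carry propagation for this step.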
4.4.1.3 Implementation
In this section we describe the implementations of the three initial reciprocal approximations that realize the equations for arecip28, arecip16 and arecip08 (see equations 4.299-4.301). These implementations have to compute the product of $C1_p$ and $rnd_{RZ,wy}(y - p)$. For this multiplication we use Booth encoding. Because one of the operands is read from a ROM and is not required for anything other than this multiplication, we store this multiplicand in the ROM already in its Booth-encoded form. The Booth encoders for the multiplication are then not required, which saves cost and delay. In [36, 39], we even encode this operand by Booth3 recoding, which further accelerates the computation, but to simplify the description only Booth2 recoding is used in this work. The multiplication of $C1_p = \langle c1_p[1:m+2] \rangle_{neg}$ and $rnd_{RZ,wy}(y - p) = \langle y[m+1:2m+6] \rangle_{neg}$ results in a positive product that can be written as

$$\begin{aligned}
\mathit{linterm} &= C1_p \cdot rnd_{RZ,wy}(y - p) \\
&= \langle c1_p[1:m+2] \rangle_{neg} \cdot \langle y[m+1:2m+6] \rangle_{neg} \\
&= \langle \mathit{linterm}[m+1:3m+8] \rangle_{neg}.
\end{aligned}$$
To compute $\mathit{arecip}(y)$, we have to consider the difference $C0_p - \mathit{linterm}$, and we only have the carry-save representation of

$$\begin{aligned}
\mathit{linterm} &= \mathit{lintermc} + \mathit{linterms} \\
&= \langle \mathit{lintermc}[m+1:3m+8] \rangle_{neg} + \langle \mathit{linterms}[m+1:3m+8] \rangle_{neg}.
\end{aligned}$$

The computation of the bit positions $[0:m+5]$ of the binary representation of $\mathit{arecip}(y)$ includes a truncation after bit position $[m+5]$ with a truncation error bounded by $2^{-(m+5)}$, and the truncation error of truncating a carry-save representation after bit position $[m+6]$ is also at most $2^{-(m+5)}$. Therefore, we can also consider the positions $[0:m+6]$ of the carry-save representation of $\mathit{arecip}(y)$ to achieve the same absolute error bounds according to corollary 4.35 (we denote the corresponding approximation by $\mathit{arecip}'(y)$). By the use of two's complement number representations we get:

$$\begin{aligned}
\mathit{arecip}'(y) &= \langle (01, c0_p[2:2m+6]) \rangle_{2neg} - \langle \mathit{lintermc}[m+1:2m+6] \rangle_{2neg} - \langle \mathit{linterms}[m+1:2m+6] \rangle_{2neg} \\
&= \langle (01, c0_p[2:2m+6]) \rangle_{2neg} + \langle (1^{m+1}, \overline{\mathit{lintermc}[m+1:2m+6]}) \rangle_{2neg} \\
&\quad + \langle (1^{m+1}, \overline{\mathit{linterms}[m+1:2m+6]}) \rangle_{2neg} + 2 \cdot 2^{-2m-6}
\end{aligned}$$

We store the binary representation of $CC0_p = C0_p + 2 \cdot 2^{-2m-6} + \mathit{compinj}$ in the ROM for the constant parameter. If we compress the inverted carry-save representation of $\mathit{linterm}$, given by the bit strings $(1^{m+1}, \overline{\mathit{lintermc}[m+1:2m+6]})$ and $(1^{m+1}, \overline{\mathit{linterms}[m+1:2m+6]})$, and the binary representation of $CC0_p = \langle cc0_p[0:2m+6] \rangle_{neg}$ by a full-adder line, we get a carry-save representation of $\mathit{arecip}'(y) + \mathit{compinj}$. A redundant Booth-digit representation of $\mathit{arecip}'(y)$ can then easily be computed by the partial compression technique from the previous section according to lemma 4.36.
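The rewriting of the two subtractions as additions of inverted bit strings plus two units in the last place is the usual two's complement identity $-x = \overline{x} + 1$ applied twice. A minimal Python sketch on $n$-bit integers (the function name `sub_two_via_inversion` and the integer scaling are assumptions made for illustration, they do not appear in the text):

```python
def sub_two_via_inversion(a, b, c, n):
    """Compute (a - b - c) mod 2**n using only additions.

    The operands are n-bit two's complement words. Inverting a zero-extended
    operand produces the leading ones that appear as the prefix 1^(m+1)
    in the bit strings of the text.
    """
    mask = (1 << n) - 1
    not_b = ~b & mask   # bitwise inversion of b within n bits
    not_c = ~c & mask   # bitwise inversion of c within n bits
    # The constant 2 plays the role of 2 * 2^(-2m-6), scaled to integers.
    return (a + not_b + not_c + 2) & mask


if __name__ == "__main__":
    n = 12
    a, b, c = 0b010110100101, 0b000000110011, 0b000000010110
    print(sub_two_via_inversion(a, b, c, n) == (a - b - c) % (1 << n))
```

Because the constant 2 does not depend on the operands, it can be folded into the ROM entry, which is what the definition of $CC0_p$ does.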
In this way, the reciprocal approximation is implemented in the following steps:
• Table lookup of the binary representation of $CC0_p = \langle cc0_p[0:2m+6] \rangle_{neg}$ and the Booth2 encoded representation of $C1_p$ from two ROMs with the input $y[1:m]$.
• Multiplication of the Booth2 encoded representation of $C1_p$ and $y[m+1:2m+6]$ by selection logic and partial product reduction with an adder tree.
• Addition of the carry-save representation from the output of the adder tree and the binary representation of $CC0_p$ by a full-adder line.
• Partial compression of the carry-save representation from the output of the full-adder line to a redundant Booth-digit representation of the reciprocal approximation according to lemma 4.36.
Figure 4.33(a) depicts the structure of the implementations for the redundant Booth-digit representations of $\mathit{arecip16}'(y)$ and $\mathit{arecip08}'(y)$. In the case of single precision, not the approximation $\mathit{arecip28}'$, but $\mathit{arecip28}''(y) = \mathit{arecip28}'(y) - \overline{\mathit{dbl}} \cdot 2^{-28}$ will be required to guarantee a positive error $0 < -(\mathit{arecip28}''(y) - 1/y) = -(\mathit{err}(y) - 2^{-28}) < 2^{-27}$ in single precision. For this reason, the implementation of $\mathit{arecip28}''$ is slightly changed (see figure 4.33(b)). Note that for double precision, we have $\mathit{arecip28}''(y) = \mathit{arecip28}'(y)$.
4.4.2 Division I (normalized → representative format)

Specification. This section describes an FP division implementation that is able to divide two FP numbers given in the normalized representations (section 2.6.3):

$$\mathit{BUSa}_{NF}[69:0] = (sa, ea[11:0], fa[0:52], \mathit{zeroa}, \mathit{infa}, \mathit{qnana}, \mathit{snana}) \qquad (4.308)$$

$$\mathit{BUSb}_{NF}[69:0] = (sb, eb[11:0], fb[0:52], \mathit{zerob}, \mathit{infb}, \mathit{qnanb}, \mathit{snanb}), \qquad (4.309)$$

which represent the factorings $(sa, ea, fa) = \mathit{fact}_{NF}(\mathit{BUSa}_{NF}[69:0])$ and $(sb, eb, fb) = \mathit{fact}_{NF}(\mathit{BUSb}_{NF}[69:0])$. Additionally, we get as input the bit dbl, which signals the case of double precision by dbl = 1, and an active bit isdiv = 1, which signals that the operation actually performed is a division.

If both operands have representable values and the second operand is non-zero with zerob = 0, the exact quotient $\mathit{exact}_{div}$ is defined by (section 2.2.4):

$$\mathit{exact}_{div} = (-1)^{sa \oplus sb} \cdot 2^{ea - eb} \cdot fa \cdot 1/fb. \qquad (4.310)$$

If $(s_{rc}, e_{rc}, f_{rc})$ is an RF factoring of this exact quotient $\mathit{exact}_{div}$ for non-zero representable inputs, then for the general case of arbitrary input values, an RF factoring of the required quotient is given by (see equation 2.17):

$$(s_{RF}, e_{RF}, f_{RF}) = \begin{cases}
(0, e_{qNaN}, f_{qNaN}) & \text{if } \mathit{scqnan} \\
(s_{inf}, e_{\infty}, f_{\infty}) & \text{if } \mathit{scinf} \\
(sa, ea, fa) & \text{if } \mathit{scx} \\
(sb, eb, fb) & \text{if } \mathit{scy} \\
(s_0, e_0, 0) & \text{if } \mathit{sczero} \\
(s_{rc}, e_{rc}, f_{rc}) & \text{otherwise.}
\end{cases} \qquad (4.311)$$

The quotient output of the division I implementation is then specified by the corresponding representation in the representative format $\mathit{BUS}_{RF}[73:0] = \mathit{rf}(s_{RF}, e_{RF}, f_{RF})$. Moreover, in the division I implementation the exception flags inv and dvz should be signaled according to the occurrence of an invalid exception and the occurrence of a division by zero, respectively.
Implementation. Because we will consider a multiplicative implementation of the significand quotient that shares the fixed-point multiplier with the FP multiplication implementation, the whole division I implementation will be integrated into the multiplication unit I.

The equations for the special conditions in equation 4.311 are already summarized in section 2.4.4 by equations 2.34-2.40. Among these equations, only the equations for the conditions scinf and sczero differ from the equations 2.27-2.33 for the special conditions for FP multiplication. Thus, we change the computations of these two signals in the special cases circuit according to:

$$\mathit{scinf} = \begin{cases}
\overline{\mathit{scqnan}} \wedge \overline{\mathit{scx}} \wedge \overline{\mathit{scy}} \wedge (\mathit{infa} \vee \mathit{zerob}) & \text{if } \mathit{isdiv} \\
\overline{\mathit{scqnan}} \wedge \overline{\mathit{scx}} \wedge \overline{\mathit{scy}} \wedge (\mathit{infa} \vee \mathit{infb}) & \text{otherwise}
\end{cases} \qquad (4.312)$$

$$\mathit{sczero} = \begin{cases}
\overline{\mathit{scqnan}} \wedge \overline{\mathit{scx}} \wedge \overline{\mathit{scy}} \wedge (\mathit{infb} \vee \mathit{zeroa}) & \text{if } \mathit{isdiv} \\
\overline{(\overline{\mathit{zeroa}} \wedge \overline{\mathit{zerob}}) \vee (\mathit{scqnan} \vee \mathit{scx} \vee \mathit{scy})} & \text{otherwise.}
\end{cases} \qquad (4.313)$$

Based on the selection of the special cases results in the special cases circuit of the multiplication unit according to

$$(s_{sc}, e_{sc}, f_{sc}) = \begin{cases}
(0, e_{qNaN}, f_{qNaN}) & \text{if } \mathit{scqnan} \\
(s_{inf}, e_{\infty}, f_{\infty}) & \text{if } \mathit{scinf} \\
(sa, ea, fa) & \text{if } \mathit{scx} \\
(sb, eb, fb) & \text{if } \mathit{scy} \\
(s_0, e_0, 0) & \text{otherwise,}
\end{cases} \qquad (4.314)$$

we get the final division result also by the selection

$$(s_{RF}, e_{RF}, f_{RF}) = \begin{cases}
(s_{sc}, e_{sc}, f_{sc}) & \text{if } \mathit{spca} \\
(s_{rc}, e_{rc}, f_{rc}) & \text{otherwise.}
\end{cases} \qquad (4.315)$$

As for multiplications, the invalid flag for divisions is given by $\mathit{inv} \Leftrightarrow \mathit{scqnan}$ (compare tables 2.8 and 2.9), so that the implementation for inv from the multiplication unit does not have to be changed. A division by zero can only occur for divisions with zeroa = 0 and zerob = 1. Thus, the flag dvz is computed in the special cases circuit of the multiplication unit by

$$\mathit{dvz} = \mathit{isdiv} \wedge \overline{\mathit{zeroa}} \wedge \mathit{zerob}.$$

This completes the description of the computations for the special cases and the detection of the exceptions.
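The changed special-case signals can be summarized in a few lines of Python. This is a hedged sketch, not the circuit of the thesis: the function name `division_special_flags` is hypothetical, the conditions scqnan, scx and scy are taken as given inputs (their equations are in section 2.4.4 and are not repeated here), and the negations are assumed to be placed such that NaN cases take priority over infinity and zero results.

```python
def division_special_flags(isdiv, zeroa, zerob, infa, infb, scqnan, scx, scy):
    """Return (scinf, sczero, dvz) for the shared multiply/divide unit.

    All arguments are booleans. scqnan, scx and scy are assumed to be
    computed elsewhere; they only act as a priority mask in this sketch.
    """
    no_nan_case = not (scqnan or scx or scy)
    if isdiv:
        scinf = no_nan_case and (infa or zerob)    # inf / x  or  x / 0
        sczero = no_nan_case and (infb or zeroa)   # x / inf  or  0 / x
    else:
        scinf = no_nan_case and (infa or infb)     # inf * x
        sczero = no_nan_case and (zeroa or zerob)  # 0 * x
    dvz = isdiv and (not zeroa) and zerob          # division by zero flag
    return scinf, sczero, dvz


if __name__ == "__main__":
    # 1.0 / 0.0: infinite result, division-by-zero flag set
    print(division_special_flags(True, False, True, False, False,
                                 False, False, False))
```

For example, a finite non-zero dividend and a zero divisor yield an infinite result together with dvz, while a zero dividend and a finite non-zero divisor select the zero result without any flag.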
In the following, the computation of the RF factoring $(s_{rc}, e_{rc}, f_{rc})$ for divisions of non-zero representable operands (regular case) is described. According to equation 4.310, the exact quotient in the regular case can be written as

$$\begin{aligned}
\mathit{exact}_{div} &= (-1)^{sa \oplus sb} \cdot 2^{ea - eb} \cdot fa \cdot 1/fb \\
&= (-1)^{sa \oplus sb} \cdot 2^{ea - eb - 1} \cdot 2 \cdot fa \cdot 1/fb.
\end{aligned}$$

Because the significands fa and fb were extracted from normalized representations and because the operands are non-zero in the regular case, we have $fa \in [1, 2[$ and $fb \in [1, 2[$. From this it follows that the quotient $2 \cdot fa \cdot 1/fb$ is in the range $]1, 4[$. In this way, the factoring $(sa \oplus sb, ea - eb - 1, rep_{53}(2 \cdot fa \cdot 1/fb))$ fulfills the requirements of RF factorings according to definition 2.21, and an RF factoring of the exact quotient $\mathit{exact}_{div}$ is given by:

$$(s_{rc}, e_{rc}, f_{rc}) = (sa \oplus sb,\; ea - eb - 1,\; rep_p(2 \cdot fa \cdot 1/fb)).$$

The computation of this factoring is described in the following separately for the sign, the exponent and the significand.
The computation of $s_{rc}$ according to equation 4.316 is identical to the sign computation from the multiplication unit I, so that the sign computation of the multiplication I unit can be used unchanged for the division I implementation.

With the 13-bit wide representations of the exponents $ea = \langle (ea[11], ea[11:0]) \rangle_2$ and $eb = \langle (eb[11], eb[11:0]) \rangle_2$, we get exactly the exponent $e_{rc}$ for divisions (where isdiv = 1) as required according to equation 4.316 by

$$\begin{aligned}
e_{rc} &= ea - eb - 1 \\
&= \langle (ea[11], ea[11:0]) \rangle_2 + \langle (eb[11] \oplus \mathit{isdiv}, eb[11:0] \oplus \mathit{isdiv}) \rangle_2
\end{aligned}$$

and the exponent $e_{rc}$ for multiplications (where isdiv = 0) as required according to equation 4.213 by

$$\begin{aligned}
e_{rc} &= \langle (ea[11], ea[11:0]) \rangle_2 + \langle (eb[11] \oplus \mathit{isdiv}, eb[11:0] \oplus \mathit{isdiv}) \rangle_2 \\
&= ea + eb.
\end{aligned}$$

Thus, the exponent implementation from the multiplication I unit can be adapted to serve both the division I and the multiplication I implementation if we replace the second input $(eb[11], eb[11:0])$ of the exponent addition from the multiplication unit I by $(eb[11] \oplus \mathit{isdiv}, eb[11:0] \oplus \mathit{isdiv})$. This implementation for the exponent is depicted in figure 4.34. This completes the description of the exponent computations and leaves the description of the computations of the significand $f_{rc}$.

Figure 4.34: Structure of the integrated multiplication/division unit I.
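The exponent path relies on the fact that inverting all bits of a two's complement number $eb$ yields $-eb - 1$, so a single adder with an XOR stage in front of its second input produces either $ea + eb$ or $ea - eb - 1$. A small Python sketch of this datapath (the function name `exponent_rc` and the explicit conversion back to a signed value are assumptions added for illustration):

```python
def exponent_rc(ea, eb, isdiv, width=13):
    """Model the 13-bit exponent adder with an XOR stage on the second input.

    ea and eb are signed 12-bit exponents, sign-extended to `width` bits.
    Returns ea + eb for isdiv = 0 and ea - eb - 1 for isdiv = 1.
    """
    mask = (1 << width) - 1
    a = ea & mask                 # two's complement encoding of ea
    b = eb & mask                 # two's complement encoding of eb
    if isdiv:
        b ^= mask                 # XOR every bit with isdiv = 1
    total = (a + b) & mask        # carry-out of the adder is dropped
    if total >> (width - 1):      # reinterpret the result as a signed value
        total -= 1 << width
    return total


if __name__ == "__main__":
    print(exponent_rc(5, 3, isdiv=1))   # division exponent: 5 - 3 - 1
    print(exponent_rc(5, 3, isdiv=0))   # multiplication exponent: 5 + 3
```

The extra bit of width is what keeps the result of two 12-bit exponents from overflowing in either mode.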
The significand $f_{rc} = rep_{53}(2 \cdot fa \cdot 1/fb)$ is computed in two steps:
1. We first compute an approximation $\mathit{aquot} \approx (2 \cdot fa \cdot 1/fb)$ with an approximation error $\mathit{errquot} = (2fa/fb - \mathit{aquot})$ in the range $0 \le \mathit{errquot} < 2^{-p}$. Note that in this case the signed value of the error, and not only its absolute value, has to be bounded.
2. In the second step, the 53-representative $f_{rc} = rep_{53}(2 \cdot fa \cdot 1/fb)$ is computed from the approximation $\mathit{aquot}$.
The implementation of these two steps is described in the following two subsections:
4.4.2.1 Approximation of the quotient (step 1.)
In this subsection we describe how to compute the approximation of the quotient $\mathit{aquot} \approx (2 fa/fb)$. We first isolate the problem of finding an approximation for the reciprocal of a significand fb. By multiplying the approximated reciprocal with $2 \cdot fa$ we then get the approximation $\mathit{aquot} \approx (2 \cdot fa/fb)$ of the quotient.

For the approximation of the reciprocal $1/fb$ we use the Newton-Raphson iteration with an initial approximation of $1/fb$ that is computed with the implementations from section 4.4.1. Because we can assume the significand fb to be normalized, $fb \in [1, 2[$, the reciprocal $1/fb$ is known to be in the range $]0.5, 1]$. From an approximation $x_i$ of the reciprocal $x = 1/fb$, the Newton-Raphson algorithm iteratively determines a better approximation $x_{i+1}$ by the equation:

$$x_{i+1} = x_i (2 - fb \cdot x_i). \qquad (4.316)$$

We define the approximation error after iteration $i$ by $\delta_i = 1/fb - x_i$. After iteration $i + 1$, we then get the approximation error

$$\begin{aligned}
\delta_{i+1} &= 1/fb - x_{i+1} \\
&= 1/fb - (x_i (2 - fb \cdot x_i)) \\
&= fb \left( \frac{1}{fb^2} - \frac{2 x_i}{fb} + x_i^2 \right) \\
&= fb \left( \frac{1}{fb} - x_i \right)^2 \\
&= fb \cdot \delta_i^2,
\end{aligned}$$

so that because of $fb \in [1, 2[$, we get

$$\delta_i^2 \le \delta_{i+1} < 2 \cdot \delta_i^2.$$

Thus, starting with an accurate initial reciprocal approximation, the approximation error converges quadratically with the number $i$ of iterations.
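The error recurrence $\delta_{i+1} = fb \cdot \delta_i^2$ can be reproduced exactly with rational arithmetic. The sketch below is an illustration only; the function name `newton_reciprocal` and the sample values for fb and the initial approximation are arbitrary choices, not values from the thesis.

```python
from fractions import Fraction


def newton_reciprocal(fb, x0, iterations):
    """Exact Newton-Raphson iteration x_{i+1} = x_i * (2 - fb * x_i).

    Returns the final approximation and the list of errors
    delta_i = 1/fb - x_i for i = 0 .. iterations.
    """
    fb = Fraction(fb)
    x = Fraction(x0)
    errors = [1 / fb - x]
    for _ in range(iterations):
        x = x * (2 - fb * x)
        errors.append(1 / fb - x)
    return x, errors


if __name__ == "__main__":
    fb = Fraction(3, 2)           # significand in [1, 2[
    x0 = Fraction(5, 8)           # crude initial approximation of 2/3
    _, errors = newton_reciprocal(fb, x0, 3)
    for i in range(3):
        # Each error equals fb times the square of the previous one.
        print(i, errors[i + 1] == fb * errors[i] ** 2)
```

Since the error is always non-negative after the first step, every iterate from $x_1$ on approaches $1/fb$ from below, which is the property the signed error bound for the quotient depends on.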
According to equation 4.316, the following two dependent multiplications are required for the computation of each Newton-Raphson iteration:

$$\begin{aligned}
y_i &= fb \cdot x_i \\
x_{i+1} &= x_i \cdot (2 - y_i)
\end{aligned}$$

In an exact computation the number of significant bit positions increases after each of these multiplications, so that already after a few iterations we would have to handle very long operands and would require a very large multiplier. To avoid this problem, we limit the number of significant bit positions and truncate each product after a fixed bit position wdiv with $\mathit{wdiv} \ge p$. Thus, we consider

$$y'_i = rnd_{RZ,wdiv}(fb \cdot x'_i) \qquad (4.317)$$

$$x'_{i+1} = rnd_{RZ,wdiv}(x'_i \cdot (2 - y'_i)) \qquad (4.318)$$

where the values $x'_{i+1}$, $x'_i$, $y'_i$ and fb can be represented in a binary representation with bit positions $[0:\mathit{wdiv}]$. For $y'_i < 2$, the difference in equation 4.318 can be written as

$$(2 - y'_i) = 2 - \langle y'_i[0:\mathit{wdiv}] \rangle_{neg} = \langle \overline{y'_i[0:\mathit{wdiv}]} \rangle_{neg} + 2^{-wdiv}.$$

We simplify the computation of this difference by neglecting the addition of $2^{-wdiv}$ and only compute the approximation

$$x''_{i+1} = rnd_{RZ,wdiv}(x''_i \cdot \langle \overline{y'_i[0:\mathit{wdiv}]} \rangle_{neg}).$$
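One truncated iteration with the inverted operand can be modelled with exact fractions. This is a sketch under assumptions: the helper names `trunc` and `lazy_newton_step` are hypothetical, and the check at the end uses a deliberately loose bound of two units of $2^{-wdiv}$ (one for neglecting the added unit, scaled by $x \le 1$, and one for the product truncation) rather than a sharper bound.

```python
from fractions import Fraction


def trunc(x, w):
    """Round a non-negative rational toward zero after bit position w."""
    scale = 1 << w
    return Fraction((x.numerator * scale) // x.denominator, scale)


def lazy_newton_step(fb, x, w):
    """One iteration x'' = trunc(x * NOT(y')) with y' = trunc(fb * x).

    NOT(y') is the value of the inverted bit string y'[0:w], that is
    (2 - 2^-w) - y', so the addition of 2^-w is skipped.
    """
    y = trunc(fb * x, w)
    y_inverted = (2 - Fraction(1, 1 << w)) - y
    return trunc(x * y_inverted, w)


if __name__ == "__main__":
    fb, x, w = Fraction(3, 2), Fraction(5, 8), 8
    lazy = lazy_newton_step(fb, x, w)
    exact = x * (2 - fb * x)
    # The lazy result never exceeds the exact iterate.
    print(lazy, exact, 0 < exact - lazy < 2 * Fraction(1, 1 << w))
```

The point of the simplification is that the inversion costs no adder at all, while the result still stays on the side of the exact iterate that keeps the error positive.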
In the analysis of the approximation error for $x''_{i+1}$, which we denote by $\delta''_{i+1} = 1/fb - x''_{i+1}$, we now additionally have to consider the errors due to the product truncations and due to the lazy computation of the difference $(2 - y'_i)$. Each product truncation produces a discretization error in the range $[0, 2^{-wdiv}[$, so that because of $x'_i \le 1$ we get

$$\begin{aligned}
y'_i + 2^{-wdiv} &> y_i \ge y'_i \\
(2 - y'_i - 2^{-wdiv}) &< (2 - y_i) \le (2 - y'_i) \\
\langle \overline{y'_i[0:\mathit{wdiv}]} \rangle_{neg} &< (2 - y_i) \le \langle \overline{y'_i[0:\mathit{wdiv}]} \rangle_{neg} + 2^{-wdiv} \\
x''_i \cdot (\langle \overline{y'_i[0:\mathit{wdiv}]} \rangle_{neg}) &< x_i \cdot (2 - y_i) \le x''_i \cdot (\langle \overline{y'_i[0:\mathit{wdiv}]} \rangle_{neg} + 2^{-wdiv}) \\
x''_{i+1} &< x_{i+1} < x''_{i+1} + 2^{-wdiv}
\end{aligned}$$

Note that in the above error analysis of one iteration step, we consider $x_i = x'_i = x''_i$ and $\delta_i = \delta'_i = \delta''_i$. Thus, we get for the approximation error $\delta''_{i+1} = 1/fb - x''_{i+1}$ after one iteration step:

$$\begin{aligned}
x_{i+1} &> x''_{i+1} > x_{i+1} - 2^{-wdiv} \\
1/fb - x_{i+1} &< 1/fb - x''_{i+1} < 1/fb - x_{i+1} + 2^{-wdiv} \\
\delta_{i+1} &< \delta''_{i+1} < \delta_{i+1} + 2^{-wdiv} \\
fb \cdot \delta_i^2 &< \delta''_{i+1} < fb \cdot \delta_i^2 + 2^{-wdiv} \\
\delta_i^2 &< \delta''_{i+1} < 2 \cdot \delta_i^2 + 2^{-wdiv}
\end{aligned}$$

For the approximation of the significand quotient, we have to multiply a reciprocal approximation $x''_i$ by $(2fa)$: $\mathit{aquot}_i = 2fa \cdot x''_i$. Because $2fa \in [2, 4[$, the approximation error of the quotient approximation, $\mathit{errquot}_i = 2fa/fb - \mathit{aquot}_i$, is in the range

$$2 \cdot \delta_i < \mathit{errquot}_i < 4 \cdot \delta_i \qquad (4.319)$$

$$2 \cdot \delta_{i-1}^2 < \mathit{errquot}_i < 8 \cdot \delta_{i-1}^2 + 4 \cdot 2^{-wdiv} \qquad (4.320)$$

$$2 \cdot \delta_{i-2}^4 < \mathit{errquot}_i < 16 \cdot \delta_{i-2}^4 + 12 \cdot 2^{-wdiv} \qquad (4.321)$$

$$2 \cdot \delta_{i-3}^8 < \mathit{errquot}_i < 32 \cdot \delta_{i-3}^8 + 28 \cdot 2^{-wdiv} \qquad (4.322)$$
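| precision | $i$ | target $\mathit{errquot}_i <$ | requirement $|\delta_0| <$ | requirement $\mathit{wdiv} \ge$ | fulfilled for $x_0 =$ | fulfilled for $\mathit{wdiv} =$ |
|---|---|---|---|---|---|---|
| double | 3 | $2^{-53}$ | $2^{-61/8}$ | 58 | $\mathit{arecip08}'$ | 58 |
| single | 2 | $2^{-24}$ | $2^{-29/4}$ | 29 | $\mathit{arecip08}'$ | 58 |
| double | 2 | $2^{-53}$ | $2^{-59/4}$ | 57 | $\mathit{arecip16}'$ | 58 |
| single | 1 | $2^{-24}$ | $2^{-14}$ | 28 | $\mathit{arecip16}'$ | 58 |
| double | 1 | $2^{-53}$ | $2^{-28}$ | 56 | $\mathit{arecip28}''$ | 58 |
| single | 0 | $2^{-24}$ | $2^{-26}$ | 32 | $\mathit{arecip28}''$ | 58 |

Table 4.3: Requirements on the initial reciprocal approximations using the Newton-Raphson iteration with $i \in \{1, 2, 3\}$ iterations in double precision and $i \in \{0, 1, 2\}$ iterations in single precision.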
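The upper bounds of equations 4.319-4.322 follow the pattern $2^{i+2} \cdot \delta_0^{2^i} + (2^{i+2} - 4) \cdot 2^{-wdiv}$, which makes it easy to evaluate them for a given initial error and truncation position. The Python sketch below is an illustration only: the function name `errquot_upper_bound` is hypothetical, and the bound on $|\delta_0|$ is passed as a base-2 exponent so that the arithmetic stays exact.

```python
from fractions import Fraction


def errquot_upper_bound(delta0_log2, wdiv, i):
    """Evaluate 2^(i+2) * delta0^(2^i) + (2^(i+2) - 4) * 2^-wdiv exactly.

    delta0_log2 is the exponent e of the bound |delta_0| < 2^e and may be
    a fraction such as -61/8, as long as e * 2^i is an integer.
    """
    exponent = Fraction(delta0_log2) * 2**i
    if exponent.denominator != 1:
        raise ValueError("delta0_log2 * 2**i must be an integer")
    coeff = 2 ** (i + 2)
    power = Fraction(2) ** int(exponent)           # delta0^(2^i)
    return coeff * power + (coeff - 4) * Fraction(1, 2**wdiv)


if __name__ == "__main__":
    # Double precision, three iterations, values of the first table row.
    bound = errquot_upper_bound(Fraction(-61, 8), 58, 3)
    print(bound == Fraction(1, 2**53))
```

For the first row of the table, the two requirements together exhaust the error budget of $2^{-53}$ exactly, which explains why the truncation position cannot be chosen any smaller for three iterations.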
For the computation of the p-representative in step 2, we require an approximation of the quotient $\mathit{aquot}_i$ with $0 < \mathit{errquot}_i < 2^{-p}$. We are interested in approximations that are computed after $i \in \{0, 1, 2, 3\}$ iterations and have to know the initial approximation error $\delta_0$ that is required for the initial reciprocal approximation. Because we know that $\mathit{errquot}_i > 0$ for $i > 0$, we only have to consider the upper bound on the absolute initial approximation error $|\delta_0|$ for $i > 0$. To get into the target range for $\mathit{errquot}_i$, the truncation position wdiv has to fulfill $\mathit{wdiv} \ge p + 5$ for $i = 3$ (see equation 4.322). To fulfill this condition for single and double precision, we set wdiv = 58 in the following. The requirements for the initial reciprocal approximations are listed in table 4.3. Thus, the initial reciprocal approximation $\mathit{arecip08}'(fb)$ can be used for the computation of an appropriate quotient approximation $\mathit{aquot}_i$ after $i = 2$ iterations for single precision and after $i = 3$ iterations for double precision. The initial reciprocal approximation $\mathit{arecip16}'(fb)$ can be used after $i = 1$ iteration for single precision and after $i = 2$ iterations for double precision. The initial reciprocal approximation $\mathit{arecip28}''(fb)$ can be used after $i = 0$ iterations for single precision and after $i = 1$ iteration for double precision. Note that the use of $\mathit{arecip28}''$ instead of $\mathit{arecip28}'$ guarantees a positive error $\mathit{err28}'' > 0$, so that also in this case the lower bound on the approximation error of $\mathit{aquot}_0$ fulfills the requirements for the computation of the representative in step 2.
4.4.2.2 Computation of the p-representative for $f_{rc}$ (step 2.)
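From the quotient approximations in the previous section we get

$$\begin{aligned}
0 < \mathit{errquot}_i &= 2fa/fb - \mathit{aquot}_i < 2^{-p} \\
\mathit{aquot}_i &< 2fa/fb < \mathit{aquot}_i + 2^{-p}
\end{aligned}$$

Thus,

$$rnd_{RZ,p}(\mathit{aquot}_i) \le \mathit{aquot}_i < 2fa/fb < \mathit{aquot}_i + 2^{-p} < rnd_{RZ,p}(\mathit{aquot}_i) + 2^{-p+1}.$$

In other words,

$$E = rnd_{RZ,p}(\mathit{aquot}_i)$$

is an approximation of $q = 2fa/fb$, and the exact quotient lies in the open interval $(E, E + 2^{-p+1})$. Moreover, we have

$$rep_p(2fa/fb) = \begin{cases}
E + 2^{-(p+1)} & \text{if } 2fa/fb < E + 2^{-p} \\
E + 2^{-p} & \text{if } 2fa/fb = E + 2^{-p} \\
E + 3 \cdot 2^{-(p+1)} & \text{if } 2fa/fb > E + 2^{-p}
\end{cases}$$

For any relation $\circ \in \{<, =, >\}$ we have

$$2fa/fb \circ E + 2^{-p} \iff 0 \circ fb \cdot (E + 2^{-p}) - fa.$$

Thus, with the computation of $g = fb \cdot (E + 2^{-p}) - fa$ and the conditions

$$\begin{aligned}
\mathit{repzero} &\iff (g = 0) \\
\mathit{repneg} &\iff (g > 0)
\end{aligned}$$

the representative $f_{rc} = rep_p(2fa/fb)$ can be selected by

$$rep_p(2fa/fb) = \begin{cases}
E + 2^{-(p+1)} & \text{if } \overline{\mathit{repzero}} \wedge \mathit{repneg} \\
E + 2^{-p} & \text{if } \mathit{repzero} \wedge \overline{\mathit{repneg}} \\
E + 3 \cdot 2^{-(p+1)} & \text{if } \overline{\mathit{repzero}} \wedge \overline{\mathit{repneg}}
\end{cases}$$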
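The selection of the representative from the sign of a single back-multiplication can be checked with exact fractions. The sketch below rests on several assumptions: the function names `rep` and `select_representative` are hypothetical, `rep` uses the reading that multiples of $2^{-p}$ represent themselves and every other value is represented by the midpoint of its enclosing interval of width $2^{-p}$, and the test value is written with the dividend scaled as $2 \cdot fa$ so that it compares directly against the quotient $2fa/fb$.

```python
from fractions import Fraction
from math import floor


def rep(x, p):
    """p-representative of x as assumed in this sketch."""
    scaled = x * 2**p
    if scaled.denominator == 1:
        return x                                    # x is a multiple of 2^-p
    return (Fraction(floor(scaled)) + Fraction(1, 2)) / 2**p


def select_representative(fa, fb, aquot, p):
    """Select rep_p(2*fa/fb) from an approximation aquot of the quotient.

    Requires 0 < 2*fa/fb - aquot < 2^-p.
    """
    ulp = Fraction(1, 2**p)
    E = Fraction(floor(aquot * 2**p), 2**p)         # truncate after bit p
    g = fb * (E + ulp) - 2 * fa                     # sign decides the case
    if g > 0:
        return E + ulp / 2                          # quotient below E + 2^-p
    if g == 0:
        return E + ulp                              # quotient equals E + 2^-p
    return E + 3 * ulp / 2                          # quotient above E + 2^-p


if __name__ == "__main__":
    fa, fb, p = Fraction(1), Fraction(3, 2), 4
    q = 2 * fa / fb
    aquot = q - Fraction(1, 32)                     # error inside ]0, 2^-p[
    print(select_representative(fa, fb, aquot, p) == rep(q, p))
```

The back-multiplication avoids computing the quotient to more than $p$ bits: only the sign and the zero test of $g$ are needed to pick one of the three candidates.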
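For this computation of $rep_p(2fa/fb)$, we define the representative increment repinc:

$$\begin{aligned}
\mathit{repinc} &= \langle \mathit{repinc}[-2:54] \rangle_{neg} \\
&= \begin{cases}
2^{-(p+1)} & \text{if } \overline{\mathit{repzero}} \wedge \mathit{repneg} \\
2^{-p} & \text{if } \mathit{repzero} \wedge \overline{\mathit{repneg}} \\
3 \cdot 2^{-(p+1)} & \text{if } \overline{\mathit{repzero}} \wedge \overline{\mathit{repneg}}
\end{cases} \\
&= \langle (000, 0^{23}, (\mathit{repzero} \vee \overline{\mathit{repneg}}) \wedge \overline{\mathit{dbl}}, (\overline{\mathit{repzero}} \vee \mathit{repneg}) \wedge \overline{\mathit{dbl}}, \\
&\qquad 0^{27}, (\mathit{repzero} \vee \overline{\mathit{repneg}}) \wedge \mathit{dbl}, (\overline{\mathit{repzero}} \vee \mathit{repneg}) \wedge \mathit{dbl}) \rangle_{neg},
\end{aligned}$$

so that $rep_p(2fa/fb) = E + \mathit{repinc}$. The value $g = fb \cdot (E + 2^{-p}) - fa$ is computed in two steps. We first compute

$$E_b = E + 2^{-p}.$$

Then we compute $g = E_b \cdot fb - fa$. Again, we define an additive constant that selects the additive operands, also including the case of the representative, by

$$\mathit{repinj} = \langle \mathit{repinj}[-2:117] \rangle_{neg} = \begin{cases}
2^{-p} & \text{if } \mathit{quostep1} \\
-fa & \text{if } \mathit{quostep3} \\
\mathit{repinc} & \text{if } \mathit{quostep4} \\
0 & \text{otherwise.}
\end{cases}$$

Thus, we get for quostep1 = 1:

$$E_b = E + \mathit{repinj} \qquad (4.323)$$

we get for quostep3 = 1:

$$g = E_b \cdot fb + \mathit{repinj} \qquad (4.324)$$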
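| step | computation | control signals | i | ii | iii | iv | v | vi |
|---|---|---|---|---|---|---|---|---|
| init ≈ 1/fb | $x''_0 = \mathit{arecip}(fb)$ | inra, xace, xbce | 1 | 1 | 1 | 1 | 1 | 1 |
| Newton 1a | $y'_i = x''_i \cdot fb$ | xadoe, fbdoe, dbl | 2; 8 | 2; 6 | 2 | 2 | - | - |
| Newton 1b | " | xadoe, fbdoe, dbl, 2nd | 3; 9 | - | 3 | - | - | - |
| Newton 1c | " | yce | 4; 10 | 3; 7 | 4 | 3 | - | - |
| Newton 2a | $x''_{i+1} = x''_i \cdot y_i$ | xadoe, ybdoe, dbl | 5; 11 | 4; 8 | 5 | 4 | - | - |
| Newton 2b | " | xadoe, ybdoe, dbl, 2nd | 6; 12 | - | 6 | - | - | - |
| Newton 2c | " | xace | 7; 13 | 5; 9 | 7 | 5 | - | - |
| Quot 1a | $E_b = fa \cdot 2x''_{i+1} + \mathit{repinj}$ | fadoe, xbdoe, dbl, quostep1 | 14 | 10 | 8 | 6 | 2 | 2 |
| Quot 2a | $g = E_b \cdot fb + \mathit{repinj}$ | fadoe, xbdoe, dbl, ece | 15 | 11 | 9 | 7 | 3 | 3 |
| Quot 3a | $g = E_b \cdot fb + \mathit{repinj}$ | Eadoe, fbdoe, dbl, ece, quostep3 | 16 | 12 | 10 | 8 | 4 | 4 |
| Rep Sel a | $f_{rc} = E \cdot 1 + \mathit{repinj}$ | fadoe, obdoe, dbl, quostep4 | 17 | 13 | 11 | 9 | 5 | 5 |
| Rep Sel c | " | - | 18 | 14 | 12 | 10 | 6 | 6 |

Table 4.4: Computation steps in the six different implementations for the computation of the significand quotient representative in single precision. The first three columns describe the significand quotient sequence; the columns i to vi give the cycles of each step in the respective implementation version.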
and we get for quostep4 = 1:

$$f_{rc} = rep_p(2fa/fb) = E \cdot 1 + \mathit{repinj}. \qquad (4.325)$$

After this last step, the path for the computation of $f_{rc}$ for the multiplication also contains the significand $f_{rc}$ for the quotient result. To control the steps of the division we define the control signals xace, xbce, ece, yce, inra, Eadoe, fadoe, xadoe, xbdoe, ybdoe, obdoe and fbdoe, which influence the computation paths by controlling drivers and register clocks as depicted in figure 4.34. The changed implementations of the partial product generation and reduction are depicted in figure 4.35 (full-sized adder tree) and in figure 4.36 (half-sized adder tree). The computation steps, including the required values for the control signals, are summarized in table 4.4 (single precision) and table 4.5 (double precision) for the six cases: (i) use of $\mathit{arecip08}'(fb)$ and half-sized adder tree; (ii) use of $\mathit{arecip08}'(fb)$ and full-sized adder tree; (iii) use of $\mathit{arecip16}'(fb)$ and half-sized adder tree; (iv) use of $\mathit{arecip16}'(fb)$ and full-sized adder tree; (v) use of $\mathit{arecip28}''(fb)$ and half-sized adder tree; (vi) use of $\mathit{arecip28}''(fb)$ and full-sized adder tree. Note that for multiplications, where isdiv = 0, the implementations of the RF factoring $(s_{rc}, e_{rc}, f_{rc})$ are not changed. This completes the description of the integrated multiplication/division I implementations.
| step | computation | control signals | i | ii | iii | iv | v | vi |
|---|---|---|---|---|---|---|---|---|
| init ≈ 1/fb | $x''_0 = \mathit{arecip}(fb)$ | inra, xace | 1 | 1 | 1 | 1 | 1 | 1 |
| Newton 1a | $y'_i = x''_i \cdot fb$ | xadoe, fbdoe, dbl | 2; 8; 14 | 2; 6; 10 | 2; 8 | 2; 6 | 2 | 2 |
| Newton 1b | " | xadoe, fbdoe, dbl, 2nd | 3; 9; 15 | - | 3; 9 | - | 3 | - |
| Newton 1c | " | yce | 4; 10; 16 | 3; 7; 11 | 4; 10 | 3; 7 | 4 | 3 |
| Newton 2a | $x''_{i+1} = x''_i \cdot y_i$ | xadoe, ybdoe, dbl | 5; 11; 17 | 4; 8; 12 | 5; 11 | 4; 8 | 5 | 4 |
| Newton 2b | " | xadoe, ybdoe, dbl, 2nd | 6; 12; 18 | - | 6; 12 | - | 6 | - |
| Newton 2c | " | xace | 7; 13; 19 | 5; 9; 13 | 7; 13 | 5; 9 | 7 | 5 |
| Quot 1a | $E_b = fa \cdot 2x''_{i+1} + \mathit{repinj}$ | fadoe, xbdoe, dbl, quostep1 | 20 | 14 | 14 | 10 | 8 | 6 |
| Quot 1b | $E = fa \cdot 2x''_{i+1}$ | fadoe, xbdoe, dbl, 2nd, quostep1 | 21 | - | 15 | - | 9 | - |
| Quot 2a | $g = E_b \cdot fb + \mathit{repinj}$ | fadoe, xbdoe, dbl, ece | 22 | 15 | 16 | 11 | 10 | 7 |
| Quot 2b | $g = E_b \cdot fb + \mathit{repinj}$ | fadoe, xbdoe, dbl, 2nd | 23 | - | 17 | - | 11 | - |
| Quot 3a | $g = E_b \cdot fb + \mathit{repinj}$ | Eadoe, fbdoe, dbl, ece, quostep3 | 24 | 16 | 18 | 12 | 12 | 8 |
| Quot 3b | $g = E_b \cdot fb + \mathit{repinj}$ | Eadoe, obdoe, dbl, 2nd, quostep3 | 25 | - | 19 | - | 13 | - |
| Rep Sel a | $f_{rc} = E \cdot 1 + \mathit{repinj}$ | fadoe, obdoe, dbl, quostep4 | 26 | 17 | 20 | 13 | 14 | 9 |
| Rep Sel b | " | fadoe, obdoe, dbl, 2nd, quostep4 | 27 | - | 21 | - | 15 | - |
| Rep Sel c | " | - | 28 | 18 | 22 | 14 | 16 | 10 |

Table 4.5: Computation steps in the six different implementations for the computation of the significand quotient representative in double precision. The first three columns describe the significand quotient sequence; the columns i to vi give the cycles of each step in the respective implementation version.
Figure 4.35: Implementation of the full-sized partial product generation and reduction including representative test and injection generation for the integrated multiplication/division I/III implementation.

Figure 4.36: Implementation of the half-sized partial product generation and reduction including representative test and injection generation for the integrated multiplication/division I/III implementation.

4.4.3 Division II (normalized → gradual result format)

Specification. This section describes an FP division implementation that is able to divide two FP numbers given in the normalized representations (section 2.6.3):

$$\mathit{BUSa}_{NF}[69:0] = (sa, ea[11:0], fa[0:52], \mathit{zeroa}, \mathit{infa}, \mathit{qnana}, \mathit{snana})$$

$$\mathit{BUSb}_{NF}[69:0] = (sb, eb[11:0], fb[0:52], \mathit{zerob}, \mathit{infb}, \mathit{qnanb}, \mathit{snanb}),$$

which represent the factorings $(sa, ea, fa) = \mathit{fact}_{NF}(\mathit{BUSa}_{NF}[69:0])$ and $(sb, eb, fb) = \mathit{fact}_{NF}(\mathit{BUSb}_{NF}[69:0])$. Additionally, we get as input the bit dbl, which signals the case of double precision by dbl = 1, and an active bit isdiv = 1, which signals that the operation actually performed is a division.

In this section, the exact division result according to equation 4.310 has to be rounded by the general rounding function ground1. After this gradual rounding step the quotient should be output in the gradual result format $\mathit{BUS}_{GF}[73:0]$ (section 2.6.5). According to equation 4.316, an RF factoring of the exact quotient is given by $(s_{rc}, e_{rc}, f_{rc}) = (sa \oplus sb, ea - eb - 1, rep_p(2fa/fb))$ for non-zero representable operands. With the gradually rounded quotient $((s_{grc}, e_{grc}, f_{grc}), \mathit{tinc}, \mathit{tinx}) = \mathit{ground1}(s_{rc}, e_{rc}, f_{rc})$ and the following GF factoring of the result for the case of arbitrary IEEE operands

$$((s_{GF}, e_{GF}, f_{GF}), \mathit{tinc}_{GF}, \mathit{tinx}_{GF}) = \begin{cases}
((0, e_{qNaN}, f_{qNaN}), 0, 0) & \text{if } \mathit{scqnan} \\
((s_{inf}, e_{\infty}, f_{\infty}), 0, 0) & \text{if } \mathit{scinf} \\
((sa, ea, fa), 0, 0) & \text{if } \mathit{scx} \\
((sb, eb, fb), 0, 0) & \text{if } \mathit{scy} \\
((s_0, e_0, 0), 0, 0) & \text{if } \mathit{sczero} \\
((s_{grc}, e_{grc}, f_{grc}), \mathit{tinc}, \mathit{tinx}) & \text{otherwise,}
\end{cases} \qquad (4.326)$$

the quotient output of the division II implementation is specified by the gradual result representation $\mathit{BUS}_{GF}[73:0] = \mathit{gf}((s_{GF}, e_{GF}, f_{GF}), \mathit{tinc}_{GF}, \mathit{tinx}_{GF})$. The occurrence of an invalid exception or a division by zero should be signaled by the bit inv and the bit dvz also in this section.
Implementation. Just as the division I implementation is integrated into the multiplication I unit, we integrate the implementation of the division II into the multiplication II unit. The changes for the special cases are the same as in the previous section for the integrated multiplication/division I implementation.

Thus, we only have to consider the computations for the regular case. The implementation of the GF factoring of the quotient in this section is just a combination of the computation of the RF factoring of the quotient from the previous section and the implementation of the gradual rounding function ground1 from the multiplication II. The computation of the significand quotient is implemented as in the previous section. By setting the rounding injections to zero during the significand quotient computations with a signal injsel = 0, we get fpr = finj12 in the multiplication II implementation. Thus, the binary product output finj[0:116] of the Compression & gradual rounding circuit can be used for the computation of the significand quotient as in the previous section. For the gradual rounding implementation in the last cycle, the rounding injection has to be activated again by injsel = 1, so that we get $f_{grc}$ instead of $f_{rc}$ as output of the last cycle. The integrated multiplication/division implementation is depicted in figure 4.37, where the implementation of the partial product generation and reduction has to be adapted according to figure 4.38 for the use of a full-sized adder tree and according to figure 4.39 for the use of a half-sized adder tree. This completes the description of the integrated multiplication/division II implementations.
4.4.4 Division III (normalized  ! normalized format)
Specication. Like in the previous section also in this section, the FP division is com-
puted from the inputs of the normalized representationsBUSa
NF
[69 :0] andBUSb
NF
[69 :0]
(section 2.6.3). Because IEEE rounding has to be considered in this section, also the bit
dbl, that signals the case of single precision (dbl = 0) or double precision (dbl = 1), the
input of the rounding mode, represented by rmode[1 :0], and the underow and overow
enable bits unf en and ovf en are required.
In this section, the exact division result according to equation 4.310 has to be rounded
by the rounding function nround, that computes the NF factoring of the IEEE rounded
result. After this rounding computation the quotient should be output in the normalized
4.4. DIVISION 177
1 0
1
0
0
0
1
1
ZEROa
SNANa
Mux
QNANa
grc
INFa
BUSb      [69:0]
Mux
NF
F     [0:52]
GF
F     [0:52]grc
X
E     [12:0]
scS
NF
F      [0:52]
06
06
 
grc
  
 
 
 
 
 
 
 
 
 
 
 
grcS
   
 
 
 
 
DBL
INV
DVZ
special
cases
INV
DVZ
DBL
FB[0:2w+6]
FA[0:52],SA,SB
[68:57]
[68:57] [69]
[69]
EA[11:0] EB[11:0] SA SB
[3:0]
[3:0]
ZEROb, INFb
QNANb, SNANb
[56:4]
FB[0:52]
(SB,EB[11:0],FB[0:52])
(SA,EA[11:0],FA[0:52])
[56:4]
FA[0:52]
Eadoe
ybdoe
xbdoe
fadoe
xadoe
obdoe
fbdoe
FA[0:52] FB[0:52]
DBL,ISDIV
ISDIV
XOR XOR
EB[11]
EA[11]
EB[11:0]EA[11:0]
FPR[-2:116] Mux
BUSa      [69:0]
PSCOND
Compound
adder(13)
injected partial product
generation & reduction
+ repres. injections
[-2:116]
SRMODE[1:0]
FINPRC’FINPRS’
[-2:116] [-2:116]
ISDIV
RMODE
[1:0]
SA SB
Initial
reciprocal
approx
redBD([0:58])
[71:59] [58:6]
E      [12:0]GF
Mux
GFS[72]
FPR
[0:24]
[25:53]
[0:58]
YCE
ECE
INRA
Y
DBL
[3:0] SNAN GF
QNAN GF
INFGF
GFZERO
SPCA
F     [0:52]grc
E    [12:0]sc
F    [0:52]sc
E     [12:0]grc
scS grcS
AND SPCA
SPCA
TINC
TINX
TINX
TINC
GF
GF[5:4]
AND
XACE
X
BCE
[0:59]
[0:58]
BUS      [72:0]GF
Compression &
gradual rounding
RMODE
E
05
([0:58])
redBD
[1:59]2X
([0:58])
redBD
([0:58])
redBD (1,0    )58
ISDIV
QUOSTEP1,3,4
Figure 4.37: Structure of the integrated multiplication/division unit II.
format $BUS_{NF}[69:0]$ (section 2.6.3). According to equation 4.316, an RF factoring of the exact quotient is given by $(s_{rc}; e_{rc}; f_{rc}) = (sa \oplus sb;\ ea - eb - 1;\ rep_p(2 fa / fb))$ for non-zero representable operands. With the NF factoring of the IEEE result for non-zero representable operands, $(s_{nrc}; e_{nrc}; f_{nrc}) = nround(s_{rc}; e_{rc} + wec; f_{rc})$, which includes the exponent wrapping constant $wec$ according to equation 2.14, and the following NF factoring of the result for the general case of arbitrary operands according to equation 2.16:

$$
(s_{NF}; e_{NF}; f_{NF}) =
\begin{cases}
(0; e_{qNaN}; f_{qNaN}) & \text{if } scqnan \\
(s_{inf}; e_{\infty}; f_{\infty}) & \text{if } scinf \\
(sa; ea; fa) & \text{if } scx \\
(sb; eb; fb) & \text{if } scy \\
(s_{0}; e_{0}; 0) & \text{if } sczero \\
(s_{nrc}; e_{nrc}; f_{nrc}) & \text{otherwise,}
\end{cases}
\qquad (4.327)
$$

the quotient output of the division III implementation is specified by the corresponding representation in the normalized format $BUS_{NF}[69:0] = nf(s_{NF}; e_{NF}; f_{NF})$. In this section, the occurrence of an invalid, inexact, overflow, underflow, or division-by-zero exception is signaled by the bits inv, inx, ovf, unf and dvz, respectively.

Figure 4.38: Implementation of the half-sized partial product generation and reduction including representative test and injection generation for integrated multiplication/division II implementation. (Diagram labels: partial product generation (Booth2) and reduction (59x59); 4/2 adder line (119); injection generation; representative injection generation.)
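The regular-case factoring of the quotient can be checked numerically. The following is a minimal sketch, assuming a factoring $(s; e; f)$ denotes the value $(-1)^s \cdot 2^e \cdot f$ and that the operand significands lie in $[1, 2)$. The function names are hypothetical, and the representative computation $rep_p$ and the rounding step are deliberately left out.

```python
from fractions import Fraction


def value(s, e, f):
    """Value (-1)^s * 2^e * f of a factoring (s; e; f)."""
    return (-1) ** s * Fraction(2) ** e * f


def quotient_factoring(sa, ea, fa, sb, eb, fb):
    """Factoring of a/b before representative computation and rounding.

    Sign: sa XOR sb, exponent: ea - eb - 1, significand: 2*fa/fb.
    For fa, fb in [1, 2) the significand lies in the open interval (1, 4).
    """
    return sa ^ sb, ea - eb - 1, 2 * fa / fb


# Hypothetical operands: a = 1.5 * 2^3 = 12, b = -1.25 * 2^-1 = -0.625
a = (0, 3, Fraction(3, 2))
b = (1, -1, Fraction(5, 4))

s, e, f = quotient_factoring(*a, *b)
assert value(s, e, f) == value(*a) / value(*b)  # the factoring is exact
assert 1 < f < 4
```

Exact rational arithmetic is used so that the equality check is not disturbed by binary rounding of the test values themselves.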
Implementation. As in the two previous sections, the implementation of division III is integrated into the multiplication III unit. As in the previous section, for the special-case computations only the implementations of scinf, sczero and dvz have to be changed, according to equations 4.312, 4.313 and 4.316.

The computations for the regular case are implemented in two steps. First, the RF factoring $(s_{rc}; e_{rc}; f_{rc})$ is computed as in the integrated division/multiplication I implementation; then the rounding hardware of multiplication unit III computes the normalized IEEE rounding function nround from this RF factoring. To get the binary representation of the product, we have to set srmode[1:0] = 00 during the computation of the significand quotient, so that the rounding injection is zero and we get fpr = finj12. Based on this product output, the significand quotient implementation from the division I implementation is integrated into multiplication unit III as depicted in figure 4.40. Because the partial product generation and reduction implementations of multiplication units III and I are the same, they are changed identically for the integrated division/multiplication implementations III and I. The implementation using a full-sized adder tree is depicted in figure 4.35 and the implementation using a half-sized adder tree is depicted in figure 4.36. During the last two cycles, the cleared value of the rounding mode in the bits srmode[1:0] has to be computed from rmode[1:0] again to guarantee the correct rounding injection for the computation of the rounding function nround, which yields the output $(s_{nrc}; e_{nrc}; f_{nrc})$. This completes the description of the integrated multiplication/division III implementation.

Figure 4.39: Implementation of the half-sized partial product generation and reduction including representative test and injection generation for integrated multiplication/division II implementation. (Diagram labels: partial product generation (Booth2) and reduction (59x30); full adder line (90); 4/2 adder line (90); injection generation; representative injection generation.)
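The reason a multiplier can carry most of the division work is that a reciprocal can be refined by multiplications only. The following is a minimal sketch of the Newton-Raphson reciprocal step in exact arithmetic. It assumes an initial approximation with a known relative error and ignores the truncation errors of a real datapath. The helper names are hypothetical, and reading the labels NR8, NR16 and NR28 of chapter 5 as the number of accurate bits of the initial approximation is an assumption.

```python
from fractions import Fraction


def nr_step(x, b):
    """One Newton-Raphson step for 1/b: x' = x * (2 - b*x)."""
    return x * (2 - b * x)


def relative_error(x, b):
    """Relative error of x as an approximation of 1/b."""
    return 1 - b * x


def iterations_needed(initial_bits, target_bits):
    """Steps needed if each step doubles the number of accurate bits."""
    steps, bits = 0, initial_bits
    while bits < target_bits:
        bits *= 2
        steps += 1
    return steps


b = Fraction(3, 2)
x0 = (1 - Fraction(1, 2**8)) / b   # relative error exactly 2^-8
x1 = nr_step(x0, b)
# In exact arithmetic the relative error is squared by each step.
assert relative_error(x1, b) == relative_error(x0, b) ** 2

# Idealized step counts for a 53-bit (double) and a 24-bit (single) significand
double_steps = [iterations_needed(n, 53) for n in (8, 16, 28)]
single_steps = [iterations_needed(n, 24) for n in (8, 16, 28)]
```

Under these idealized assumptions a more accurate initial approximation saves one iteration per doubling of its accuracy, which is the trade-off between table size and latency that the divider variants explore.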
Figure 4.40: Structure of the integrated multiplication/division unit III.
Chapter 5
Evaluation
In this chapter we quantitatively analyze the FP designs that have been described in the previous chapters. For the analysis we use the formal hardware model from [22]. Based on a gate-level specification of the FPU designs, we compute the cost of each design by counting the gates that it uses and by weighting them for a particular but typical technology [24] (see table 5.1). For any other technology, the relative costs of the basic circuits can be changed through the corresponding parameters in the cost formulae. The costs of the FP designs are listed in table 5.4. These costs also include the cost of the pipelined RISC architecture from [23] into which the FP designs are integrated.
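The gate-counting cost model can be illustrated with a small sketch. It assumes table 5.1 is read with Xor/Xnor at delay 2 and cost 4 and the flip-flop at delay 4 and cost 8. The function names and the full-adder example circuit are hypothetical and only serve to show the weighting.

```python
# Relative gate costs and delays as read from table 5.1 (Motorola technology [24]).
GATE_COST = {"not": 1, "nand": 2, "nor": 2, "and": 2, "or": 2,
             "xor": 4, "xnor": 4, "flipflop": 8, "mux": 3, "tristate": 5}
GATE_DELAY = {"not": 1, "nand": 1, "nor": 1, "and": 2, "or": 2,
              "xor": 2, "xnor": 2, "flipflop": 4, "mux": 2, "tristate": 2}


def circuit_cost(gate_counts):
    """Cost of a circuit: number of gates of each type times its relative cost."""
    return sum(GATE_COST[gate] * count for gate, count in gate_counts.items())


def path_delay(path):
    """Delay of one path through a circuit: sum of the gate delays along it."""
    return sum(GATE_DELAY[gate] for gate in path)


# Hypothetical example circuit: a textbook full adder built from
# two XOR gates, two AND gates and one OR gate.
FULL_ADDER = {"xor": 2, "and": 2, "or": 1}
fa_cost = circuit_cost(FULL_ADDER)            # 2*4 + 2*2 + 1*2
fa_delay = path_delay(["xor", "and", "or"])   # carry path: 2 + 2 + 2
```

Switching to another technology then amounts to replacing the two dictionaries, while the gate counts of the designs stay unchanged.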
The performance of the FP implementations mainly depends on two factors: the cycle time and the number of cycles per instruction. The maximum delay within one cycle of an FP implementation determines the minimum cycle time that is possible with this FP implementation. We consider the performance of the FP implementations integrated into the pipelined RISC architecture from [23]. In this setting, a difference between the FP implementations regarding the cycle time only becomes visible if the FP implementation lies on the critical path and its cycle time exceeds the cycle time of the microprocessor. This is not the case for any of our FPU designs for the chosen pipelined RISC processor from [23].
Thus, integrated into the microprocessor, the performance is measured by the average number of cycles per instruction that the microprocessor achieves with this FP implementation on an average FP workload. To analyze the performance in this way, we take the pipelined RISC processor design from [23] as a basis. This design already includes the implementation of pipelining, forwarding, interrupt handling, and a result-shift register [23]. For this architecture a trace-driven run-time simulator was implemented, so that, given the latencies and restart times of the FPU as input, the average number of clock cycles needed per instruction (CPI) can be simulated. The latencies and restart times of our proposed FP implementations are listed in table 5.2. For the run-time simulations, we consider the benchmarks from the SPECfp92 floating-point benchmark suite, because traces using the MIPS R3000 instruction set were already available for them [17].
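The principle of such a trace-driven simulation can be sketched with a strongly simplified model. This is not the simulator used for the results below. It assumes in-order single issue, one unit per operation type, and a result that becomes available `latency` cycles after issue, and it ignores the result-shift register and interrupts. All names and the four-instruction trace are hypothetical; the latencies are the double precision entries of table 5.2 for two full adder tree, NR28 variants, with a restart time of 1 assumed for fully pipelined operations.

```python
def simulate(trace, latency, restart):
    """Return the cycle in which the last result of the trace becomes available."""
    ready = {}       # register -> cycle at which its value is available
    unit_free = {}   # operation -> earliest cycle the unit accepts a new operation
    issue = -1
    finish = 0
    for op, dest, sources in trace:
        # Issue in order, once the unit is free and all source operands are ready.
        issue = max([issue + 1, unit_free.get(op, 0)]
                    + [ready.get(reg, 0) for reg in sources])
        unit_free[op] = issue + restart[op]
        ready[dest] = issue + latency[op]
        finish = max(finish, ready[dest])
    return finish


# Hypothetical trace: (operation, destination register, source registers)
TRACE = [
    ("div", "f1", ("f2", "f3")),
    ("mul", "f4", ("f1", "f5")),   # depends on the division result
    ("add", "f6", ("f7", "f8")),
    ("div", "f9", ("f6", "f2")),   # depends on the addition result
]

GEN_I = {"latency": {"div": 13, "mul": 5, "add": 5},
         "restart": {"div": 8, "mul": 1, "add": 1}}
VAR_III = {"latency": {"div": 10, "mul": 2, "add": 2},
           "restart": {"div": 8, "mul": 1, "add": 1}}

cycles_gen = simulate(TRACE, GEN_I["latency"], GEN_I["restart"])
cycles_var = simulate(TRACE, VAR_III["latency"], VAR_III["restart"])
cpi_gen = cycles_gen / len(TRACE)
cpi_var = cycles_var / len(TRACE)
```

For this toy trace the model needs 32 cycles with the general rounding latencies and 23 cycles with the variable position rounding latencies. The absolute numbers carry no meaning beyond showing how latencies and restart times enter the cycle count through data dependencies and unit occupancy.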
The results of the analysis are depicted in figure 5.1, where the costs in terms of kG (kilo gates) are plotted against the performance in terms of CPI (cycles per instruction). We separate the results for the three different rounding architectures into three figures. In the topmost figure, the results for the FP units using rounding architecture I with a shared general rounder are depicted, the figure in the middle depicts the results for the FP units using rounding architecture II with a gradual rounder, and in the figure at the bottom the results for the FP units using rounding architecture III with variable position rounding
| Motorola | Not | Nand, Nor | And, Or | Xor, Xnor | Flip-flop | Mux | 3-state driver |
|----------|-----|-----------|---------|-----------|-----------|-----|----------------|
| delay    | 1   | 1         | 2       | 2         | 4         | 2   | 2              |
| cost     | 1   | 2         | 2       | 4         | 8         | 3   | 5              |

Table 5.1: Relative delay and cost of basic gates for the Motorola technology from [24].
| FPU | division, double | division, single | multiplication, double | multiplication, single | add/sub | conv | compare |
|-----|------------------|------------------|------------------------|------------------------|---------|------|---------|
| Gen rnd I, full, NR28 | 13/8 | 9/4 | 5 | 5 | 5 | 3 | 1 |
| Gen rnd I, full, NR16 | 17/12 | 13/8 | 5 | 5 | 5 | 3 | 1 |
| Gen rnd I, full, NR8 | 21/16 | 17/12 | 5 | 5 | 5 | 3 | 1 |
| Gen rnd I, half, NR28 | 19/14 | 9/4 | 6/2 | 5 | 5 | 3 | 1 |
| Gen rnd I, half, NR16 | 25/20 | 15/10 | 6/2 | 5 | 5 | 3 | 1 |
| Gen rnd I, half, NR8 | 31/26 | 21/16 | 6/2 | 5 | 5 | 3 | 1 |
| Grad rnd II, full, NR28 | 12/8 | 8/4 | 4 | 4 | 4 | 3 | 1 |
| Grad rnd II, full, NR16 | 16/12 | 12/8 | 4 | 4 | 4 | 3 | 1 |
| Grad rnd II, full, NR8 | 20/16 | 16/12 | 4 | 4 | 4 | 3 | 1 |
| Grad rnd II, half, NR28 | 18/14 | 8/4 | 5/2 | 4 | 4 | 3 | 1 |
| Grad rnd II, half, NR16 | 24/20 | 14/10 | 5/2 | 4 | 4 | 3 | 1 |
| Grad rnd II, half, NR8 | 30/26 | 20/16 | 5/2 | 4 | 4 | 3 | 1 |
| Var rnd III, full, NR28 | 10/8 | 6/4 | 2 | 2 | 2 | 3 | 1 |
| Var rnd III, full, NR16 | 14/12 | 10/8 | 2 | 2 | 2 | 3 | 1 |
| Var rnd III, full, NR8 | 18/16 | 14/12 | 2 | 2 | 2 | 3 | 1 |
| Var rnd III, half, NR28 | 16/14 | 6/4 | 3/2 | 2 | 2 | 3 | 1 |
| Var rnd III, half, NR16 | 22/20 | 12/10 | 3/2 | 2 | 2 | 3 | 1 |
| Var rnd III, half, NR8 | 28/26 | 18/16 | 3/2 | 2 | 2 | 3 | 1 |

Table 5.2: Latencies/restart-times of the FP units for single precision and double precision operations. If there is only one entry, it is the latency, and the operation is fully pipelined.
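To make the latency/restart-time notation concrete, the following sketch computes how long a sequence of independent operations of one type occupies a unit. It assumes that the restart time is the number of cycles after which the unit accepts the next operation, and that a fully pipelined unit has a restart time of 1. The function names are hypothetical.

```python
def parse_entry(entry):
    """Split a table entry 'latency/restart'; a single number means fully pipelined."""
    if "/" in entry:
        latency, restart = (int(part) for part in entry.split("/"))
    else:
        latency, restart = int(entry), 1
    return latency, restart


def cycles_for(n, entry):
    """Cycles until the last of n independent operations delivers its result.

    The k-th operation starts (k - 1) * restart cycles after the first one
    and needs `latency` cycles from its start to its result.
    """
    latency, restart = parse_entry(entry)
    return latency + (n - 1) * restart


# Ten independent double precision divisions on two of the designs
fastest = cycles_for(10, "10/8")    # Var rnd III, full, NR28
slowest = cycles_for(10, "31/26")   # Gen rnd I, half, NR8
# Ten independent double precision multiplications, Gen rnd I, full
pipelined = cycles_for(10, "5")
```

For long sequences of independent operations the restart time dominates, whereas for dependent operations the latency does. This is why both numbers are listed.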
Figure 5.1: Cost (kilo gates) and performance (cycles per instruction) of the different FP designs. [Three plots of cost (0 kG to 300 kG) against CPI (1.3 to 2), each showing the NR8, NR16 and NR28 versions with full and half adder tree: shared general rounding I (annotations: 0.23 CPI, 0.13 CPI, ~20 kG, ~154 kG); gradual rounding II (about +8 kG and -0.1 CPI compared to general rounding); variable position rounding III (about +26 kG and -0.3 CPI compared to general rounding).]
| Version | Gen rnd I | Grad rnd II | Var pos rnd III |
|---------|-----------|-------------|-----------------|
| Newton-Raphson 8, full tree | 1.769 CPI | 1.667 CPI | 1.466 CPI |
| Newton-Raphson 16, full tree | 1.723 CPI | 1.620 CPI | 1.419 CPI |
| Newton-Raphson 28, full tree | 1.676 CPI | 1.574 CPI | 1.3722 CPI |
| Newton-Raphson 8, half tree | 1.901 CPI | 1.817 CPI | 1.613 CPI |
| Newton-Raphson 16, half tree | 1.830 CPI | 1.746 CPI | 1.542 CPI |
| Newton-Raphson 28, half tree | 1.760 CPI | 1.676 CPI | 1.472 CPI |

Table 5.3: Performance (cycles per instruction in run-time simulations on traces of the SPECfp92 benchmarks) of the different FP units integrated into a pipelined RISC processor.
| Version | Gen rnd I | Grad rnd II | Var pos rnd III |
|---------|-----------|-------------|-----------------|
| Newton-Raphson 8, full tree | 134641 | 142988 | 161274 |
| Newton-Raphson 16, full tree | 136796 | 145143 | 163429 |
| Newton-Raphson 28, full tree | 266769 | 275116 | 293402 |
| Newton-Raphson 8, half tree | 114479 | 121334 | 141112 |
| Newton-Raphson 16, half tree | 116634 | 123489 | 143267 |
| Newton-Raphson 28, half tree | 246607 | 253462 | 273240 |

Table 5.4: Cost (gate count) of the different FP units integrated into a pipelined RISC processor.
are depicted. In each figure, the result of an FPU version that uses a full-sized adder tree is connected by a line with the corresponding FPU version that uses a half-sized adder tree; the full-sized version is always faster and more expensive than the half-sized version. The maximum difference between two connected FPU results is 0.15 CPI (table 5.3) and about 20 kG. Thus, the choice of the multiplier option has only a small effect on both the performance and the cost. Comparing all different FPU versions within a particular rounding architecture, the situation is similar in the different figures. The maximum difference of the CPI is 0.24, so that a moderate speed-up can be achieved by using a fast divider and a fast multiplier implementation. But the use of a fast divider increases the cost considerably, by up to 132 kG, so that a fast divider version might be too expensive. Comparing the different rounding architectures, the best performance at relatively small additional cost is provided by the variable position rounding FPUs using rounding architecture III. Thus, the choice of the rounding architecture has the largest impact on the design quality: the architectures differ by about 0.3 CPI, but only by about 26 kG in cost.

In general, the use of rounding architecture III, which uses dedicated rounding implementations for each functional unit, seems to be the best choice for IEEE compliant FP implementations.
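The differences discussed above can be recomputed directly from tables 5.3 and 5.4. The following sketch does so; the data are copied from the two tables, and the variable names are hypothetical.

```python
# (Newton-Raphson version, adder tree) -> values for rounding architectures (I, II, III)
CPI = {
    (8, "full"): (1.769, 1.667, 1.466),
    (16, "full"): (1.723, 1.620, 1.419),
    (28, "full"): (1.676, 1.574, 1.3722),
    (8, "half"): (1.901, 1.817, 1.613),
    (16, "half"): (1.830, 1.746, 1.542),
    (28, "half"): (1.760, 1.676, 1.472),
}
COST = {
    (8, "full"): (134641, 142988, 161274),
    (16, "full"): (136796, 145143, 163429),
    (28, "full"): (266769, 275116, 293402),
    (8, "half"): (114479, 121334, 141112),
    (16, "half"): (116634, 123489, 143267),
    (28, "half"): (246607, 253462, 273240),
}
ARCHS = range(3)
VERSIONS = (8, 16, 28)

# Full versus half adder tree (the connected points in figure 5.1)
tree_cpi = max(CPI[(nr, "half")][a] - CPI[(nr, "full")][a]
               for nr in VERSIONS for a in ARCHS)
tree_cost = max(COST[(nr, "full")][a] - COST[(nr, "half")][a]
                for nr in VERSIONS for a in ARCHS)

# Spread of the CPI within one rounding architecture
cpi_spread = max(max(v[a] for v in CPI.values()) - min(v[a] for v in CPI.values())
                 for a in ARCHS)

# Extra cost of the fastest divider (NR28) over the cheapest one (NR8)
divider_cost = max(COST[(28, t)][a] - COST[(8, t)][a]
                   for t in ("full", "half") for a in ARCHS)

# Rounding architecture III versus rounding architecture I
arch_cpi = max(v[0] - v[2] for v in CPI.values())
arch_cost = max(v[2] - v[0] for v in COST.values())

print(f"adder tree:   {tree_cpi:.3f} CPI, {tree_cost} gates")
print(f"within arch.: {cpi_spread:.3f} CPI, divider {divider_cost} gates")
print(f"III vs. I:    {arch_cpi:.3f} CPI, {arch_cost} gates")
```

With the table values, the adder tree choice accounts for at most 0.150 CPI and 21654 gates, the spread within one architecture is at most 0.243 CPI, the fast divider costs 132128 additional gates, and architecture III gains up to 0.304 CPI over architecture I for 26633 additional gates.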
Bibliography
[1] Al-Twaijry, H. Area and Performance Optimized CMOS Multipliers. PhD thesis, Stanford University, August 1997.
[2] Anderson, S.W. and Earle, J.G. and Goldschmidt, R.E. and Powers, D.M. The IBM system 360 model 91: Floating-point unit. IBM J. Res. Dev., 11:34-53, January 1967.
[3] Bewick, G.W. Fast Multiplication: Algorithms and Implementation. PhD thesis, Stanford University, March 1994.
[4] Cortadella, J. and Llaberia, J.M. Evaluation of A+B = K Conditions without Carry Propagation. IEEE Trans. on Computers, vol. 41, pp. 1484-1488, November 1992.
[5] Das Sarma, D. and Matula, D.W. Measuring the Accuracy of ROM Reciprocal Tables. IEEE Trans. on Computers, vol. 43, pp. 932-940, August 1994.
[6] Das Sarma, D. and Matula, D. Faithful Bipartite ROM Reciprocal Tables. In Proceedings of the 12th Symposium on Computer Arithmetic, vol. 12, pp. 17-28, IEEE, 1995.
[7] Das Sarma, D. and Matula, D. Faithful Interpolation in Reciprocal Tables. In Proceedings of the 13th Symposium on Computer Arithmetic, vol. 13, pp. 82-91, IEEE, 1997.
[8] Daumas, M. and Matula, D.W. Recoders for partial compression and rounding. Technical Report 97-01, Laboratoire de l'Informatique du Parallélisme, Lyon, France, 1997.
[9] Even, G. and Mueller, S.M. and Seidel, P.M. A Dual Mode IEEE multiplier. In Proceedings of the 2nd IEEE International Conference on Innovative Systems in Silicon, pages 282-289. IEEE, 1997.
[10] Even, G. and Paul, W.J. On the design of IEEE compliant floating point units. In Proceedings of the 13th Symposium on Computer Arithmetic, volume 13, pages 54-63. IEEE, 1997.
[11] Even, G. and Seidel, P.M. A comparison of three rounding algorithms for IEEE floating-point multiplication. In Proceedings of the 14th IEEE Symposium on Computer Arithmetic, pages 225-232, April 1999.
[12] Even, G. and Seidel, P.M. A comparison of three rounding algorithms for IEEE floating-point multiplication. To be published in Special Issue on Computer Arithmetic, IEEE Trans. on Computers, July 2000.
[13] Fowler, D.W. and Smith, J.E. An accurate, high speed implementation of division by reciprocal approximation. In Proceedings of the 9th Symposium on Computer Arithmetic, volume 9, pages 60-67. IEEE, September 1989.
[14] Farmwald, M.P. On the design of high performance digital arithmetic units. PhD thesis, Stanford Univ., August 1981.
[15] Ferrari, D. A division method using a parallel multiplier. IEEE Trans. Electr. Comput., vol. EC-16, pp. 224-226, 1967.
[16] Hennessy, J.L. and Patterson, D.A. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, Inc., San Mateo, CA, 2nd edition, 1996.
[17] Hill, M. SPEC92 Traces for MIPS R2000/3000. University of Wisconsin, ftp://ftp.cs.newcastle.edu.au/pub/r3000-traces/din, 1995.
[18] Ito, M. and Takagi, N. and Yajima, S. Efficient Initial Approximation and Fast Converging Methods for Division and Square Root. In Proceedings of the 12th Symposium on Computer Arithmetic, vol. 12, pp. 2-9, IEEE, 1995.
[19] IEEE standard for binary floating-point arithmetic. ANSI/IEEE 754-1985, New York, 1985.
[20] Kane, G. and Heinrich, J. MIPS RISC Architecture. Prentice Hall, 1992.
[21] Lee, C. Multistep gradual rounding. IEEE Transactions on Computers, 32(4):595-600, April 1989.
[22] Muller, S.M. and Paul, W.J. The Complexity of Simple Computer Architectures. Lecture Notes in Computer Science 995. Springer, 1995.
[23] Muller, S.M. and Paul, W.J. The Complexity of Correctness of Computer Architectures. Springer, 2000, Draft.
[24] Nakata, C. and Brock, J. H4C Series: Design Reference Guide. CAD, 0.7 Micron $L_{eff}$. Motorola Ltd., 1993. Preliminary.
[25] Nielsen, A.M. and Matula, D.W. and Lyu, C.N. and Even, G. Pipelined packet-forwarding floating point: II. an adder. In Proceedings 13th Symposium on Computer Arithmetic, pages 148-155, Asilomar, California, July 6-9 1997.
[26] Oberman, S.F. Design Issues in High Performance Floating Point Arithmetic Units. PhD thesis, Stanford University, January 1997.
[27] Oberman, S.F. and Al-Twaijry, H. and Flynn, M.J. The SNAP project: Design of floating point arithmetic units. In Proceedings of the 13th Symposium on Computer Arithmetic, volume 13, pages 156-165. IEEE, 1997.
[28] Oberman, S.F. and Flynn, M.J. Fast IEEE rounding for division by functional iteration. Technical Report CSL-TR-96-700, Stanford University, July 1996.
[29] Intel Corporation. Pentium Processor Family Developer's Manual, Volume 1: Pentium Processors, 1995.
[30] Paul, W.J. and Seidel, P.M. The complexity of Booth recoding. In Proceedings of the 3rd conference on Real Numbers and Computers RNC3, pages 199-218, Paris, France, April 1998.
[31] Quach, N. and Flynn, M. Design and implementation of the SNAP floating-point adder. Technical Report CSL-TR-91-501, Stanford University, December 1991.
[32] Quach, N. and Flynn, M.J. An improved algorithm for high-speed floating-point addition. Technical Report CSL-TR-90-442, Stanford University, August 1990.
[33] Quach, N. and Takagi, N. and Flynn, M. On fast IEEE rounding. Technical Report CSL-TR-91-459, Stanford University, January 1991.
[34] Santoro, M.R. and Bewick, G. and Horowitz, M.A. Rounding algorithms for IEEE multipliers. In Proceedings 9th Symposium on Computer Arithmetic, pages 176-183, 1989.
[35] Schulte, M.J. and Omar, J. and Swartzlander, E.E. Optimal Initial Approximations for the Newton-Raphson Division Algorithm. Computing, vol. 53, pp. 233-242, August 1994.
[36] Seidel, P.-M. High-speed redundant reciprocal approximation. In Proceedings of the 3rd conference on Real Numbers and Computers RNC3, pages 219-229, Paris, France, April 1998.
[37] Seidel, P.-M. How to half the latency of IEEE compliant floating-point multiplication. In Proceedings of the 24th Euromicro Conference, volume 24, pages 329-332. IEEE, 1998.
[38] Seidel, P.-M. On the architecture of IEEE compliant floating-point units. To appear in Proceedings of the IASTED Conference of Applied Informatics 2000, February 2000.
[39] Seidel, P.-M. High-speed redundant reciprocal approximation. INTEGRATION, the VLSI journal 28 (1999), pp. 1-12.
[40] Seidel, P.-M. and Even, G. How many logic levels does floating-point addition require? In Proceedings of the IEEE International Conference on Circuit Design (ICCD98), pages 142-149, October 1998.
[41] Siemens Munchen. VENUS-S Semi-Custom Design System: Zellkatalog, 1988.
[42] Wong, D. and Flynn, M. Fast Division Using Accurate Quotient Approximations to Reduce the Number of Iterations. IEEE Trans. on Computers, vol. 41, pp. 981-995, August 1992.
[43] Soderquist, P. and Leeser, M. Floating-point division and square root: Choosing the right implementation. Technical Report EE-CEG-95-3, Cornell University, April 1995.
[44] Yu, R.K. and Zyner, G.B. 167 MHz Radix-4 floating point multiplier. Proceedings 12th Symposium on Computer Arithmetic, 12:149-154, 1995.
[45] Zyner, G. Circuitry for rounding in a floating point multiplier. U.S. patent 5150319, 1992.