Processor support for interval arithmetic by Williams, Gerald Shawn
Lehigh University
Lehigh Preserve
Theses and Dissertations
1998
Processor support for interval arithmetic
Gerald Shawn Williams
Lehigh University
Recommended Citation
Williams, Gerald Shawn, "Processor support for interval arithmetic" (1998). Theses and Dissertations. Paper 550.
May 13, 1998
PROCESSOR SUPPORT FOR INTERVAL ARITHMETIC
by
Gerald Shawn Williams
A Thesis
Presented to the Graduate and Research Committee
of Lehigh University
in Candidacy for the Degree of
Master of Science
in
Computer Engineering
Lehigh University
May, 1998

Acknowledgements
I would like to thank my wife Cynthia for all of her love and support. She is my source
of encouragement, and has always been willing to do whatever it takes to allow me to
complete my studies, even though it has meant doing more than her share of work
around the house during the past few years.
I would also like to thank my advisor, Dr. Michael J. Schulte. He has been a
continual source of encouragement and guidance to me, and has spent many hours
helping to refine my ideas, clearing up misunderstandings, and reviewing papers and
conference submissions. He has always been more than willing to help, and has given
up many nights and weekends to this end.
Finally, I would like to thank my parents, James and Julia, for instilling in me a
will to learn and grow.
Table of Contents
Acknowledgements iii
Table of Contents iv
List of Tables viii
List of Figures ix
Abstract 1
Chapter 1
Introduction 2
1.1 The Price of Failure 2
1.2 Arithmetic Errors in Multiuse Computer Systems 3
1.3 The Role of Simulation and Modeling 4
1.4 Handling Arithmetic Errors 4
1.5 Reliable computing 5
1.6 The Need for Speed 6
1.7 The Interval Arithmetic Solution 7
1.8 Performance of Interval Arithmetic 7
1.9 Architectural Support for Interval Arithmetic 8
1.10 Overview 9
Chapter 2
Number Representations 11
2.1 Exact Methods 11
2.2 Inexact Methods 11
2.3 Floating Point Numbers 12
2.4 Limitations of Inexact Methods 13
Chapter 3
Interval Arithmetic 14
3.1 History of Interval Arithmetic 14
3.2 Interval Representations 15
3.3 Outward Rounding 16
3.4 Interval Operations 16
3.4.1 Binary Interval Operations 17
3.4.2 Interval Scalar Decompositions 18
3.4.3 Unary Interval Operations 19
3.4.4 Interval Comparisons 20
3.5 Software Support for Intervals 21
3.6 The Performance Cost of Intervals 21
3.6.1 Overhead due to Functions/Subroutines 22
3.6.2 Overhead due to Extra Interval Data 22
3.6.3 Overhead due to Computing both Endpoints 22
3.6.4 Overhead due to Changing Rounding Direction 23
3.6.5 Overhead due to Data-dependent Operation Selection 23
3.6.6 Overhead due to Interval Special Cases 24
Chapter 4
Current Processor Architectures and Research 25
4.1 Processor Organization 25
4.2 Separation of Architecture from Implementation 26
4.3 More flexibility 27
4.4 Faster floating point 27
4.5 Specialized Hardware 28
4.6 Dedicated Interval Processors 28
Chapter 5
Interval Support Instructions 29
5.1 The DLX Architecture 29
5.2 SRND Instruction 31
5.3 IIMD Instruction 32
5.4 IIDD Instruction 33
5.5 IABSD Instruction 33
5.6 MOVDT and MOVDF Instructions 34
5.7 HALFD Instruction 34
5.8 NEGSD Instruction 35
5.9 The Enhanced DLX Instructions 35
Chapter 6
Interval Support Hardware Requirements 36
6.1 General Hardware Requirements 36
6.2 SRND Hardware Requirements 37
6.3 IIMD and IIDD Hardware Requirements 37
6.4 MOVDT and MOVDF Hardware Requirements 38
6.5 HALFD Hardware Requirements 38
6.6 IABSD Hardware Requirements 38
6.7 NEGSD Hardware Requirements 39
Chapter 7
Simulation Environment 40
7.1 Simulator Evaluation 40
7.1.1 Simulator Accuracy 41
7.1.2 Simulator Precision 41
7.1.3 Simulator Timing 41
7.1.4 Simulator Scope 41
7.1.5 Simulator Performance 41
7.1.6 Simulator Requirements 42
7.2 Simulator Availability 43
7.3 Simulator Quality 43
7.4 Floating Point Simulation 44
7.5 The FP Module: a General Floating Point Model 45
7.6 The FP Pipeline Description File 46
7.6.1 Pipeline Description for the Sun UltraSparc Processor 46
7.6.2 Pipeline Description for the MIPS R4000 Processor 47
7.6.3 Pipeline Description for the MIPS R10000 Processor 48
7.7 The DLXsim Simulator 49
7.8 The GCC-DLX C Compiler 49
7.9 Integrating FP, DLXsim, and GCC-DLX 50
7.10 Enhanced DLX 51
7.11 Implementing Interval Operations in Enhanced DLX 52
Chapter 8
Performance Evaluation 53
8.1 Interval Reference Applications 53
8.2 Interval Stress Test Results 54
8.3 Interval Newton Test Results 55
8.4 Interval Replacement Test Results 56
8.5 Testing Summary 57
Chapter 9
Conclusions and Future Research 58
9.1 Faster Interval Calculations 58
9.2 Replacing Floating Point Numbers with Intervals 59
9.3 Dedicated versus General Purpose Enhancements 60
9.4 Opportunities for Future Research 61
Bibliography 62
Appendix A
C Implementations of Interval Operations 65
A.1 Interval Addition in C 65
A.2 Interval Subtraction in C 65
A.3 Interval Negation in C 65
A.4 Interval Multiplication in C 65
A.5 Interval Division in C 67
A.6 Interval Width in C 67
A.7 Interval Midpoint in C 68
A.8 Interval Magnitude in C 68
A.9 Interval Mignitude in C 68
Appendix B
DLX Implementations of Interval Operations 69
B.1 Interval Addition in DLX 69
B.2 Interval Subtraction in DLX 69
B.3 Interval Negation in DLX 69
B.4 Interval Multiplication in DLX 69
B.5 Interval Division in DLX 71
B.6 Interval Width in DLX 74
B.7 Interval Midpoint in DLX 74
B.8 Interval Magnitude in DLX 74
B.9 Interval Mignitude in DLX 74
Appendix C
Definitions of the Proposed Extensions 76
Appendix D
Pipeline Description File 80
Appendix E
Defined Interval Types and Macros 81
Vita 82
List of Tables
Table 1: Binary interval operations 18
Table 2: Scalar interval decompositions 19
Table 3: Unary interval operations 19
Table 4: Interval comparisons 20
Table 5: Summary of proposed interval extensions 29
Table 6: Some traditional DLX instructions 30
Table 7: Rounding mode transitions 31
Table 8: Addition, subtraction, and width in enhanced DLX 32
Table 9: Interval multiplication and division in enhanced DLX 33
Table 10: Interval magnitude and mignitude in enhanced DLX 34
Table 11: Interval midpoint and negate in enhanced DLX 35
Table 12: Simulator Requirements 42
Table 13: FP module application programming interface (API) 45
Table 14: UltraSparc functional units 47
Table 15: UltraSparc instruction types 47
Table 16: MIPS R4000 functional units 47
Table 17: MIPS R4000 instruction types 48
Table 18: MIPS R10000 functional units 48
Table 19: MIPS R10000 instruction types 48
Table 20: Interval stress test results 54
Table 21: Interval Newton test results 56
Table 22: Interval replacement test results 56
List of Figures
Figure 1: An equation that causes numerical errors 5
Figure 2: An equation that causes a rounding error 5
Figure 3: IEEE floating point format 12
Figure 4: C representation of an interval 15
Figure 5: Standard interval notation 15
Figure 6: Outward rounding 16
Figure 7: Calculation of c = a ® b for a typical interval operation ® 17
Figure 8: Problems to be solved using the interval Newton method 53
Abstract
Most computer applications require reliable and accurate calculations, although they
often fail in this regard when working with real-valued numbers. Interval arithmetic
provides reliability and accuracy by computing a lower and upper bound in which each
result is guaranteed to reside. By representing numbers as ranges of possible values
rather than as point values, it also provides a mathematical basis for implementing
many scientific and commercial applications. However, even though interval arithmetic
is widely applicable and relatively efficient, current software implementations fail to
provide the performance required for it to gain widespread acceptance.
This thesis presents a new approach to improving interval performance. It introduces
hardware and instruction set modifications to conventional processors which
focus on some of the most costly aspects of interval arithmetic. These modifications are
shown to have little or no impact on the processors' cycle time and do not require much
area. General enhancements such as deep pipelines and multiple functional units are
relied upon to absorb the cost of extra floating point operations.
Simulation shows that these enhancements, applied to the DLX architecture,
can improve the performance of some interval operations by over 400 percent, and the
performance of highly optimized interval applications by 33 to 50 percent. More
importantly, these improvements bring performance of interval arithmetic to within a
factor of three of equivalent floating point implementations. This is low enough for
interval arithmetic to be used in many conventional computer applications.
Chapter 1
Introduction
Humanity is becoming increasingly dependent on the successful operation of computers.
They are used to design, test, and later control devices, machinery, and even the
buildings in which we live. The high performance and apparent reliability of computers
have made previously impossible tasks commonplace. Rapid increases in computer
capabilities have continued this trend. As computers become faster and less expensive,
they are called upon to do more, and the accuracy and reliability of computations
become even more important. However, computations involving real-world data, and
in most cases any computations involving real numbers, are almost guaranteed to contain
inaccuracies. Thus, there is a growing need for reliable and validated computing
methods.
1.1 The Price of Failure
Embedded computer control systems make many things possible. But in doing so, they
become essential components of the systems they control. Many modern warplanes
have such complicated aerodynamics that they cannot even fly without the constant
guidance of their onboard computers. In this case, a major computer failure almost certainly
results in a crash. Yet even slight errors could be disastrous, especially in planes
capable of nap-of-the-earth and/or transonic flying.
A subtle, yet potentially very dangerous, failure is that of incorrect results due
to limitations of the arithmetic systems used for calculations. Small errors can accumulate,
sometimes rapidly, and produce highly inaccurate results [8]. Two disasters caused
by such cumulative arithmetic failures are the failure of a Patriot missile during the Persian
Gulf War (which resulted in the death of 28 American soldiers) and the explosion
of an Ariane rocket forty seconds after takeoff in 1996 [2].
1.2 Arithmetic Errors in Multiuse Computer Systems
Arithmetic failures in embedded computers can be dramatic and immediate. A plane
may crash, a rocket may veer off course, or an engine may fail to operate. Designers of
these systems usually understand the price of failure, though, and they attempt to
design systems that are somewhat failsafe. Redundancy is added, error checking is
done, and the effect of rounding errors is minimized, although this can be difficult in
larger embedded systems or those that perform complex calculations.
General purpose systems and multiuse applications such as computer aided
design and simulation systems are more difficult to design in a failsafe manner, especially
since arithmetic errors are highly data dependent. For instance, a computer aided
design system might be able to design automobiles reliably in many cases, but might
fail if the ratio of wheelbase to total length exceeded a certain value (e.g., due to catastrophic
cancellation). Thorough testing could probably uncover this problem, but it is
usually impossible to test all possible values that may be input into a system. It can also
be very difficult to foresee every way in which a general-purpose or multiuse computer
system will be used or which uses may produce unacceptable arithmetic errors.
1.3 The Role of Simulation and Modeling
Simulation and modeling play an important role in engineering today. Not only is it less
expensive and faster to design using these techniques, but the results of mistakes are
usually much less severe. If the simulation is accurate and complete, a mistake merely
results in extended design time rather than later system failure. This allows tolerances
to be reduced while improving designs. Arithmetic errors in this case may not result in
immediate failure, but rather introduce latent faults that can cause failures months or
years later. The more that designs "push the edge," facilitated by advanced simulation
and modeling tools, the more they are susceptible to later catastrophic failure due to
arithmetic errors in the design process [6].
1.4 Handling Arithmetic Errors
It is important that arithmetic errors be handled in a general purpose fashion. Although
specific causes of arithmetic errors, such as denominators in a particular equation that
are known to approach zero under certain conditions, can be handled individually, it
can be difficult to identify the conditions to test. Furthermore, it is seldom feasible to
predict all possible uses for a given system, much less all combinations of inputs.
Without some indication of the reliability of a particular answer, designers
using computer aided design tools may be relying on values that are totally incorrect.
The same holds true for suitably advanced embedded systems that rely on real-world
models or complex calculations: as soon as an unusual or unforeseen situation occurs,
their results may be suspect.
1.5 Reliable computing
The fundamental problem with most real-number computations is that their accuracy is
not guaranteed. Increasing precision does not prevent this. Small errors can accumulate
rapidly, and limitations in the representation of numbers can quickly cause completely
wrong results.
$f = 333.75\,b^6 + a^2\,(11 a^2 b^2 - b^6 - 121 b^4 - 2) + 5.5\,b^8 + \dfrac{a}{2b}$
Figure 1: An equation that causes numerical errors
Consider the example in Figure 1 [26]. For a = 77617.0 and b = 33096.0,
this equation yields f ≈ 1.17260 when solved using single precision, double precision,
and extended precision arithmetic. Increasing the precision seems to validate the
results. However, the correct answer is actually f ≈ -0.827396.
$f = 1.0 \times 10^{30} + 100.0 - 5.0 \times 10^{29} - 5.0 \times 10^{29} - 10.0$
Figure 2: An equation that causes a rounding error
The limitations of point-valued computer arithmetic are shown more clearly in
Figure 2. Obviously, the correct solution to Figure 2 is f = 90.0. However, the second
term is lost in either single-precision or double-precision floating point arithmetic,
causing f = -10.0 to be computed instead. Limitations inherent to the representation
of real numbers skew the results with no indication of the inaccuracy. In Figure 2,
exchanging the terms can rectify the problem. This might not be desirable, however, if
the numbers represent measured values. Perhaps the larger values should dwarf the
smaller ones, and the result should not be precise enough to distinguish between
f = -10.0 and f = 90.0 anyway.
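The behavior described for Figure 2 can be reproduced in a few lines of C. The following is a minimal sketch, assuming IEEE 754 double precision arithmetic; the function name sum_in_order and the two term arrays are illustrative and are not part of this thesis.

```c
#include <assert.h>

/* Adds n terms strictly from left to right in double precision. */
double sum_in_order(const double *terms, int n)
{
    /* volatile forces each partial sum to be rounded to double, even on
       machines that keep intermediate values in wider registers. */
    volatile double total = 0.0;

    assert(n >= 0);
    for (int i = 0; i < n; i++)
        total += terms[i];
    return total;
}

/* The terms of Figure 2 in the order in which they are written. */
const double figure2_order[5] = { 1.0e30, 100.0, -5.0e29, -5.0e29, -10.0 };

/* The same terms with the large values first, so that they cancel
   before the small values are added. */
const double reordered[5] = { 1.0e30, -5.0e29, -5.0e29, 100.0, -10.0 };
```

Under the stated assumption, sum_in_order(figure2_order, 5) evaluates to -10.0 because the term 100.0 is absorbed by 1.0e30, while sum_in_order(reordered, 5) evaluates to 90.0. Neither call gives any indication of how reliable its result is.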
Some forms of computation represent real numbers not as simple values, but
rather as all of the possible values which they may hold. These provide a guarantee of
accuracy. Such a computational model bounds not only arithmetic errors, but also
errors due to limitations in measuring and control devices. In fact, such computational
models do more than just constrain errors. Properly bounded results of operations on
known values (even bounded values) can be shown to be equivalent to a mathematical
proof [36]. This can lead to entirely new algorithms to solve problems.
1.6 The Need for Speed
It has been postulated that, in order for general-purpose error bounding arithmetic to
gain widespread acceptance, its performance must be no less than one fifth that of
equivalent floating point algorithms [16]. This requirement is easy to justify for
advanced embedded systems that require complex calculations such as flight control
systems, robotic sensing, or embedded systems that rely on simulation and modeling.
Speed is also important in general purpose computer systems. Performance gains due
to modern processor developments can offset some of the increased cost, but there will
always be applications that are "pushing the envelope," especially in high-end simulation
and modeling tools. People are also reluctant to give up performance. The performance
of personal computers is increasing by an order of magnitude every few years,
at which point the old systems tend to be worth virtually nothing. An order of magnitude
performance penalty makes a high end system perform like a system that is so
slow that it is destined for disposal. There is also a monetary consideration: extra processing
overhead in computer aided design systems translates to longer design times or
at least increased design costs.
1.7 The Interval Arithmetic Solution
Interval arithmetic provides a general methodology for bounding errors. The concept is
simple: represent all real numbers not as discrete values, but as ranges (intervals) in
which the actual (i.e., correct) value is known to reside [21] [1]. An interval's width
indicates the maximum possible error. Since all interval solutions to a problem contain
the correct answer, results that are too wide can be narrowed by solving the problem
(e.g., reducing its equations) in a different way and intersecting the results.
Interval arithmetic is also an enabling technology for several new algorithms,
some of which solve problems previously thought impossible (or at least impractical)
to solve with a computer. The interval Newton algorithm is one such algorithm, which
efficiently finds or proves non-existence of all roots of any continuously differentiable
function [36]. Interval solutions to non-linear global optimization problems also exist.
1.8 Performance of Interval Arithmetic
Interval arithmetic is faster than other error constraining techniques (such as various
high-precision or exact methods), and can use arithmetic precisions that allow efficient
processing. However, it still has a serious performance penalty. Estimates indicate that
interval computations implemented in software are typically tens to hundreds of times
slower than the equivalent floating point computations [5]. A great deal of overhead
comes from using function calls to access the interval arithmetic operations [40], but
experiments using the DLX architecture [11] have shown that, even without this overhead,
interval operations require 10 to 30 times as many instruction cycles as the equivalent
floating point operations [39].
1.9 Architectural Support for Interval Arithmetic
There are several ways to reduce the overhead of interval arithmetic in computers.
Interval-specific coprocessors that very efficiently handle intervals have been proposed
[28] [32] [30] [29]. As improvements in chip design processes allow more features to
be added to existing microprocessors, it is conceivable that these could eventually
become standard additions to mainstream processors. However, it is unlikely that this
will happen until interval arithmetic is more widely used, and this growth is limited by
the overhead of performing interval operations on existing machines.
Fortunately, there are a number of less aggressive improvements in processor
design that can benefit interval operations greatly. Most are general improvements that
benefit other operations as well and already exist in one or more experimental and/or
production processors. The addition of a few interval specific operations can provide
significant improvements in the performance of several of the more time consuming
operations. These additional operations can be compared to the multimedia extensions
that are becoming common in modern processors [20] [34] [22].
Experiments with the DLX processor architecture [11] have shown the value of
each of these improvements. Overall, they can speed up interval applications by up to
50% and, more importantly, reduce the overhead of interval operations below the cutoff
point for common acceptance. By ensuring that at least some of these improvements
become common in processor design, the future of interval arithmetic will be made
much brighter and reliable computing will take a big step forward.
1.10 Overview
The goal of this thesis is to improve the performance of interval operations in order to
help make reliable computing generally available. A set of architectural enhancements,
meant to have minimal impact on current processor design, is proposed and evaluated
to determine whether it might make this possible.
Chapters 2 through 4 contain background material. Chapter 2 is an overview of
number systems and how they affect computations. Chapter 3 is an overview of interval
arithmetic. Chapter 4 examines current processor architectures and trends, as well as
some previously proposed hardware support for intervals.
Chapter 5 proposes a set of architectural enhancements in order to improve the
performance of interval operations. Chapter 6 investigates the processor hardware
required to implement the enhancements. Chapter 7 details the simulation environment
used to evaluate these enhancements. Chapter 8 reports measured performance benefits
of these enhancements. Chapter 9 discusses possible consequences of this research and
suggests areas for future research.
Appendix A shows implementations of some common interval operations in C.
Appendix B shows the equivalent operations using DLX assembly language. Appendix
C details the proposed interval extensions as they apply to the DLX architecture.
Appendix D describes the floating point pipeline model used in the enhanced DLX
architecture. Appendix E lists the interval support types and macros that have been
defined for the enhanced DLX architecture.
Chapter 2
Number Representations
Computers can quickly perform integer arithmetic, which usually produces exact
results. Real-world problems, however, typically involve non-integer data and require
approximations. Treating approximate values as point values is dangerous, since small
errors can quickly accumulate into large discrepancies.
2.1 Exact Methods
There are various ways to represent real numbers exactly. For instance, numbers may
be represented as sums of fractions or symbolically as equations involving fractions as
well as irrational numbers such as π or e. This is the approach taken by symbolic arithmetic
packages such as those used in Mathematica or Maple [4] [42]. These methods
are computationally expensive in terms of program speed and size and have limited
applicability to real-world problems. They also do not address errors that are implicit to
transferring data. A system applying exact methods to real-world problems still needs to
input and output data in some form, which typically introduces errors. More errors are
typically introduced by converting these values to the finite internal representation as
well.
2.2 Inexact Methods
Fortunately, an approximation is usually good enough. By placing limits on precision
and range, calculations can be performed much faster. Floating point representations
are typically used, although non-integer fixed point numbers are sometimes employed
to achieve performance close to that of integer operations. Fixed point representations
often have less precision and dynamic range than floating point representations and are
often used when performance is critical but chip area or power are at a premium (e.g.,
in fixed-point digital signal processors used in cellular phones and other low-power
applications that require high performance). In many of these fixed-point applications,
the behavior of the system is defined using bit-exact standards so that limitations of the
approximation are not treated as errors. In fact, they become requirements [41].
In most cases, however, real values are represented as floating point numbers.
By far, the most common floating point representations are those that conform to the
IEEE 754 floating point standard [11] [13].
2.3 Floating Point Numbers
The IEEE 754 floating point standard defines both single precision and double precision
numbers. Single precision numbers are 32-bit entities consisting of 8 exponent
bits, 23 significand bits, and 1 sign bit. Double-precision numbers are 64-bit entities
consisting of 11 exponent bits, 52 significand bits, and one sign bit. Extended precision
arithmetic with greater ranges and/or precisions is also allowed by the standard. The
standard format for floating point numbers is as shown in Figure 3 ("S" represents the
sign bit).
[ S | Exponent | Significand ]
Figure 3: IEEE floating point format
For normalized numbers, there is an implied 1 before the significand, which
represents a binary fraction. This increases the effective number of significand bits by
one. For extremely small (i.e., denormalized) numbers and for zero, the 1 is dropped.
This allows gradual underflow. Two other special cases are infinities and non-valued
numbers, both of which are represented with the largest-possible value in the exponent
field. A non-valued number is called a NaN, which stands for Not-a-Number.
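The field layout described above can be inspected directly. The following is a minimal sketch, assuming that the C type double is an IEEE 754 double precision number stored in 64 bits; the type DoubleFields and the functions decode_double and is_normalized are illustrative names, not part of the standard or of this thesis.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    unsigned sign;      /* 1 sign bit                           */
    unsigned exponent;  /* 11 exponent bits (biased by 1023)    */
    uint64_t fraction;  /* 52 stored significand bits           */
} DoubleFields;

/* Splits a double into its sign, exponent, and significand fields. */
DoubleFields decode_double(double x)
{
    uint64_t bits;
    DoubleFields f;

    assert(sizeof bits == sizeof x);   /* assumption: 64-bit double */
    memcpy(&bits, &x, sizeof bits);    /* copy the raw bit pattern  */

    f.sign     = (unsigned)(bits >> 63);
    f.exponent = (unsigned)((bits >> 52) & 0x7FF);
    f.fraction = bits & (((uint64_t)1 << 52) - 1);
    return f;
}

/* A normalized number has an exponent field that is neither all zeros
   (zero and denormalized numbers) nor all ones (infinities and NaNs). */
int is_normalized(DoubleFields f)
{
    return f.exponent != 0 && f.exponent != 0x7FF;
}
```

For example, decode_double(1.0) yields a sign of 0, an exponent field of 1023, and a fraction of 0; the leading 1 of the significand is implied rather than stored.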
2.4 Limitations of Inexact Methods
Inexact numeric representations are unreliable. Even the initial parameters are typically
approximated due to quantization error. Results may suffer from roundoff error, such as
was demonstrated in section 1.5. Other phenomena may also affect the accuracy and
precision of the results. For instance, when subtracting two numbers that differ only in
a few low-order bits, most of the bits cancel each other out, causing a loss of precision.
This is referred to as catastrophic cancellation. There is no measure of the accuracy of
each result. Inaccuracies can be cumulative, so it is possible for the results of inexact
calculations to be very far from the correct results [8] [25].
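A small experiment illustrates this loss of precision. The following is a minimal sketch, assuming IEEE 754 double precision; the function names are illustrative. It computes (1 + x) - 1 for a small x: the two operands of the subtraction differ only in a few low-order bits, so the difference retains only those few bits of x.

```c
#include <assert.h>

/* Computes (1 + x) - 1 in double precision. The volatile temporaries
   force each intermediate result to be rounded to double. */
double add_then_subtract_one(double x)
{
    volatile double sum  = 1.0 + x;    /* low-order bits of x are rounded away */
    volatile double diff = sum - 1.0;  /* the surviving bits are exposed here  */
    return diff;
}

/* Relative error of a computed value with respect to a nonzero exact value. */
double relative_error(double computed, double exact)
{
    double err;

    assert(exact != 0.0);
    err = (computed - exact) / exact;
    return err < 0.0 ? -err : err;
}
```

Under the stated assumption, add_then_subtract_one(1.0e-15) returns approximately 1.11e-15, a relative error of roughly 11 percent, even though each individual operation was rounded correctly.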
Chapter 3
Interval Arithmetic
Interval arithmetic provides an efficient method for bounding errors that are inherent to
inexact arithmetic [21]. It does this by representing values as ranges rather than discrete
numbers, and by controlling the direction of inaccuracies. Inaccuracies cannot be
avoided, but control can be exerted over them.
3.1 History of Interval Arithmetic
Interval arithmetic (and interval analysis) is not a new field. The first major publication
on interval arithmetic was in the 1960s [21]. As a field, it has been kept alive for many
years largely due to the efforts of a small number of researchers [38]. However, it is
now starting to gain broader acceptance. Many interval analysis software packages are
now freely available [5] [10] [14] [17] [18], and support for intervals is being added to
FORTRAN compilers and systems [3] [16] [27].
Researchers and mathematicians have already acknowledged the need for the
type of reliable computing that interval arithmetic allows. Many believe it is only a
matter of time before reliable computing becomes mainstream technology. Interval
arithmetic has also led to the discovery of new algorithms for solving problems, some
of which were considered unsolvable just 15 years ago [36].
3.2 Interval Representations
Intervals are ranges of real numbers. They are typically represented as two floating
point numbers. A C implementation is shown in Figure 4.
typedef struct Interval {
    double infimum;
    double supremum;
} Interval;
Figure 4: C representation of an interval
In interval analysis, each interval corresponds to the set of all values that the
real value being represented can possibly have. The two floating point numbers are the
lower and upper bounds of that set, which contains them and all real numbers between
them (±∞ indicates that the interval is unbounded in that direction). The lower bound is
called the infimum and the upper bound the supremum. Intervals are often written
using standard closed-interval notation. For example, Figure 5 shows a number that is
known to lie between -0.00301 and -0.00299 (inclusive) using this notation.
$[-3.01 \times 10^{-3},\ -2.99 \times 10^{-3}]$
Figure 5: Standard interval notation
The width of the interval (i.e., the distance between the interval endpoints)
gives an indication of the accuracy of the result. This is useful for monitoring calculation
errors such as roundoff error as well as the effects of approximation errors and
errors due to non-exact inputs [1]. Intervals thus provide a useful tool for computations
in which data are uncertain or take a range of values.
3.3 Outward Rounding
Although arithmetic errors cannot be avoided, it is possible to control the direction of
error. This is called the rounding direction. Using this capability, interval computations
are performed in a way that guarantees that the results always contain the correct
answer. When the endpoints of an interval are not exactly representable, the lower endpoint
is rounded towards -∞ and the upper endpoint is rounded towards +∞. This is
referred to as outward rounding. For example, if the interval [2.234,2.323] is rounded
to three decimal digits, the resulting interval is [2.23,2.33]. Figure 6 shows how outward
rounding can affect division results, assuming interval endpoints are precise to
four decimal digits.
$[1.000,\ 1.000] \div [3.000,\ 3.000] = [3.333 \times 10^{-1},\ 3.334 \times 10^{-1}]$
Figure 6: Outward rounding
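The decimal examples above can be modeled with integer arithmetic. The following is a minimal sketch in which endpoints are held as scaled integers (for instance, 2234 stands for 2.234); the type ScaledInterval and all function names are illustrative assumptions. A floating point implementation would instead switch the hardware rounding direction, which C99 exposes through fesetround in <fenv.h>.

```c
#include <assert.h>

/* Integer division rounded towards minus infinity. */
long div_floor(long n, long d)
{
    long q;

    assert(d != 0);
    q = n / d;                                   /* C truncates towards zero */
    if (n % d != 0 && ((n < 0) != (d < 0)))
        q -= 1;                                  /* exact quotient was negative */
    return q;
}

/* Integer division rounded towards plus infinity. */
long div_ceil(long n, long d)
{
    long q;

    assert(d != 0);
    q = n / d;
    if (n % d != 0 && ((n < 0) == (d < 0)))
        q += 1;                                  /* exact quotient was positive */
    return q;
}

typedef struct { long infimum; long supremum; } ScaledInterval;

/* Drops one decimal digit from both endpoints, rounding outward. */
ScaledInterval drop_digit_outward(ScaledInterval x)
{
    ScaledInterval r;

    assert(x.infimum <= x.supremum);
    r.infimum  = div_floor(x.infimum, 10);       /* lower endpoint towards -infinity */
    r.supremum = div_ceil(x.supremum, 10);       /* upper endpoint towards +infinity */
    return r;
}

/* Encloses n / d, with the result scaled by `scale`, rounding outward. */
ScaledInterval divide_outward(long n, long d, long scale)
{
    ScaledInterval r;

    r.infimum  = div_floor(n * scale, d);
    r.supremum = div_ceil(n * scale, d);
    return r;
}
```

With these definitions, dropping a digit from [2234, 2323] gives [223, 233], and divide_outward(1, 3, 10000) gives [3333, 3334], which correspond to the two examples in the text.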
3.4 Interval Operations
Many interval arithmetic operations are similar to "traditional" arithmetic operations,
except that they compute the upper and lower bounds rather than just a single number.
The resultant infimum is the smallest result that can be generated by applying the operation
to all possible input values, rounded towards -∞. Similarly, the resultant supremum
is the largest such result, rounded towards +∞.
A given interval arithmetic operation can typically be calculated by applying
the corresponding real operator to the endpoints (rounding appropriately) and selecting
the minimum and maximum values. Figure 7 demonstrates this for a binary operator ®
(the infimum and supremum of a given interval, a, are denoted as inf(a) and sup(a)).
inf(c) = min(inf(a) ® inf(b), inf(a) ® sup(b), sup(a) ® inf(b), sup(a) ® sup(b))
sup(c) = max(inf(a) ® inf(b), inf(a) ® sup(b), sup(a) ® inf(b), sup(a) ® sup(b))
Figure 7: Calculation of c = a ® b for a typical interval operation ®
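The calculation in Figure 7 can be written generically by passing the real operator as a function pointer. The following is a minimal sketch; the names RealOp, apply_endpoints, real_mul, and real_sub are illustrative, and the sketch deliberately omits the outward rounding step, so its results are only guaranteed when every endpoint operation is exact (as with small integers).

```c
#include <assert.h>

typedef struct { double infimum; double supremum; } Interval;

/* A real binary operator applied to two endpoint values. */
typedef double (*RealOp)(double, double);

static double min2(double x, double y) { return x < y ? x : y; }
static double max2(double x, double y) { return x > y ? x : y; }

/* Applies op to all four endpoint pairs and keeps the extremes.
   Assumption: no outward rounding is performed here. */
Interval apply_endpoints(RealOp op, Interval a, Interval b)
{
    double r1, r2, r3, r4;
    Interval c;

    assert(a.infimum <= a.supremum && b.infimum <= b.supremum);

    r1 = op(a.infimum,  b.infimum);
    r2 = op(a.infimum,  b.supremum);
    r3 = op(a.supremum, b.infimum);
    r4 = op(a.supremum, b.supremum);

    c.infimum  = min2(min2(r1, r2), min2(r3, r4));
    c.supremum = max2(max2(r1, r2), max2(r3, r4));
    return c;
}

/* Example real operators. */
double real_mul(double x, double y) { return x * y; }
double real_sub(double x, double y) { return x - y; }
```

For instance, apply_endpoints(real_mul, a, b) with a = [-1,2] and b = [-2,2] yields [-4,4]. The cost is four real operations and six comparisons per interval operation, which motivates the shortcuts discussed next.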
3.4.1 Binary Interval Operations
Typical binary operations do not actually require this many calculations. For addition
and subtraction, the same pairs of infimum/supremum terms from the operands are
always used to generate the infimum and supremum of the result. For multiplication or
division, the signs of the input operands can be used to determine how to compute the
result in two multiplies or divides, except when multiplying two intervals that cross
zero, in which case four multiplies are needed, plus two compares. Table 1 shows the
calculations required for the common binary interval operations.
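The sign-based selection described above can be sketched in C as follows. This is an illustrative sketch rather than the implementation given in Appendix A: the function names are assumptions, outward rounding is omitted, finite endpoints are assumed, and an interval with a zero endpoint is treated as positive or negative as appropriate.

```c
#include <assert.h>

typedef struct { double infimum; double supremum; } Interval;

static Interval make_interval(double lo, double hi)
{
    Interval r;
    r.infimum = lo;
    r.supremum = hi;
    return r;
}

/* Addition and subtraction always use the same endpoint pairs. */
Interval interval_add(Interval a, Interval b)
{
    return make_interval(a.infimum + b.infimum, a.supremum + b.supremum);
}

Interval interval_sub(Interval a, Interval b)
{
    return make_interval(a.infimum - b.supremum, a.supremum - b.infimum);
}

/* Multiplication selects its two products from the operand signs. */
Interval interval_mul(Interval a, Interval b)
{
    double al = a.infimum, au = a.supremum;
    double bl = b.infimum, bu = b.supremum;

    assert(al <= au && bl <= bu);

    if (al >= 0.0) {                              /* a is positive     */
        if (bl >= 0.0) return make_interval(al * bl, au * bu);
        if (bu <= 0.0) return make_interval(au * bl, al * bu);
        return make_interval(au * bl, au * bu);   /* b crosses zero    */
    }
    if (au <= 0.0) {                              /* a is negative     */
        if (bl >= 0.0) return make_interval(al * bu, au * bl);
        if (bu <= 0.0) return make_interval(au * bu, al * bl);
        return make_interval(al * bu, al * bl);   /* b crosses zero    */
    }
    /* a crosses zero */
    if (bl >= 0.0) return make_interval(al * bu, au * bu);
    if (bu <= 0.0) return make_interval(au * bl, al * bl);
    {
        /* Both cross zero: four multiplies and two compares. */
        double lo1 = al * bu, lo2 = au * bl;
        double hi1 = al * bl, hi2 = au * bu;
        return make_interval(lo1 < lo2 ? lo1 : lo2, hi1 > hi2 ? hi1 : hi2);
    }
}
```

Every case except the last needs only two multiplies, but the chain of data-dependent branches needed to pick the case is itself a cost on pipelined processors.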
A very important interval operation is intersection. It produces an interval that
encloses all values contained by both operands (if they are disjoint, it produces an
empty interval). The definition in Table 1 actually produces an illegal interval (using
the standard definition of intervals) instead, so the result should be checked for validity
if it matters. Intersection produces an interval no wider than the narrowest operand.
This can be exploited to narrow results by simply solving an equation in multiple ways
and intersecting the results (narrowing results is also called tightening or sharpening).
The converse operation of intersection is convex hull (or interval hull), which produces
the narrowest interval that fully contains both of its operands.
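Intersection, its validity check, and convex hull can be sketched in C as follows. The function names are illustrative assumptions; because these operations only select among existing endpoints, no rounding is involved.

```c
#include <assert.h>

typedef struct { double infimum; double supremum; } Interval;

/* Larger of the infima, smaller of the suprema. For disjoint operands
   the result has infimum > supremum and is not a legal interval. */
Interval interval_intersect(Interval a, Interval b)
{
    Interval c;

    assert(a.infimum <= a.supremum && b.infimum <= b.supremum);
    c.infimum  = a.infimum  > b.infimum  ? a.infimum  : b.infimum;
    c.supremum = a.supremum < b.supremum ? a.supremum : b.supremum;
    return c;
}

/* Nonzero if x is a legal (non-empty) interval. */
int interval_is_valid(Interval x)
{
    return x.infimum <= x.supremum;
}

/* Narrowest interval that contains both operands. */
Interval interval_hull(Interval a, Interval b)
{
    Interval c;

    assert(a.infimum <= a.supremum && b.infimum <= b.supremum);
    c.infimum  = a.infimum  < b.infimum  ? a.infimum  : b.infimum;
    c.supremum = a.supremum > b.supremum ? a.supremum : b.supremum;
    return c;
}

/* Distance between the endpoints of a legal interval. */
double interval_width(Interval x)
{
    assert(interval_is_valid(x));
    return x.supremum - x.infimum;
}
```

For example, intersecting [1,3] with [2,4] gives [2,3], whose width of 1 is smaller than the width of either operand, while intersecting [1,2] with [3,4] gives a result that fails interval_is_valid.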
Interval division can actually be slightly more complicated than Table 1 shows.
If exactly one of the divisor's endpoints is zero, only one of the endpoints of the result
needs to be ±∞. If both are zero, then the endpoints of the result can be -∞, +∞, or
NaN, depending on whether the dividend is negative, positive, zero, or crosses zero.
Operation ⊗                 Result c (= a ⊗ b)                                          Example
Interval a    Interval b    Infimum inf(c)                Supremum sup(c)
Addition
any           any           inf(a) + inf(b)               sup(a) + sup(b)               [2,3] + [4,5] = [6,8]
Subtraction
any           any           inf(a) - sup(b)               sup(a) - inf(b)               [2,3] - [4,5] = [-3,-1]
Multiplication
positive      positive      inf(a) × inf(b)               sup(a) × sup(b)               [2,3] × [3,4] = [6,12]
positive      negative      sup(a) × inf(b)               inf(a) × sup(b)               [2,3] × [-2,-1] = [-6,-2]
positive      crosses zero  sup(a) × inf(b)               sup(a) × sup(b)               [2,3] × [-2,1] = [-6,3]
negative      positive      inf(a) × sup(b)               sup(a) × inf(b)               [-3,-2] × [3,4] = [-12,-6]
negative      negative      sup(a) × sup(b)               inf(a) × inf(b)               [-3,-2] × [-2,-1] = [2,6]
negative      crosses zero  inf(a) × sup(b)               inf(a) × inf(b)               [-3,-2] × [-2,2] = [-6,6]
crosses zero  positive      inf(a) × sup(b)               sup(a) × sup(b)               [-1,2] × [3,4] = [-4,8]
crosses zero  negative      sup(a) × inf(b)               inf(a) × inf(b)               [-1,2] × [-2,-1] = [-4,2]
crosses zero  crosses zero  smaller of inf(a) × sup(b),   larger of inf(a) × inf(b),    [-1,2] × [-2,2] = [-4,4]
                            sup(a) × inf(b)               sup(a) × sup(b)
Division
positive      positive      inf(a) ÷ sup(b)               sup(a) ÷ inf(b)               [2,4] ÷ [1,2] = [1,4]
positive      negative      sup(a) ÷ sup(b)               inf(a) ÷ inf(b)               [3,6] ÷ [-3,-1] = [-6,-1]
negative      positive      inf(a) ÷ inf(b)               sup(a) ÷ sup(b)               [-4,-2] ÷ [1,2] = [-4,-1]
negative      negative      sup(a) ÷ inf(b)               inf(a) ÷ sup(b)               [-6,-3] ÷ [-3,-1] = [1,6]
crosses zero  positive      inf(a) ÷ inf(b)               sup(a) ÷ inf(b)               [-1,2] ÷ [1,5] = [-1,2]
crosses zero  negative      sup(a) ÷ sup(b)               inf(a) ÷ sup(b)               [-1,2] ÷ [-5,-1] = [-2,1]
any           crosses zero  -∞                            +∞                            [2,4] ÷ [-1,2] = [-∞,+∞]
Intersection
any           any           larger of inf(a), inf(b)      smaller of sup(a), sup(b)     [1,3] ∩ [2,4] = [2,3]
Convex Hull
any           any           smaller of inf(a), inf(b)     larger of sup(a), sup(b)      [1,3] ∪ [2,4] = [1,4]
Table 1: Binary interval operations
3.4.2 Interval Scalar Decompositions
Several operations decompose intervals into scalar values. These access the infimum
and supremum, compute the midpoint or width of an interval, and compute the furthest
(magnitude) or closest (mignitude) distance of an interval from zero. The scalar
decomposition operations are shown in Table 2. For midpoint, subtraction and division
are rounded towards +∞ and addition towards -∞ to ensure that the result lies in the
interval. Width is rounded towards +∞ to ensure that the result spans the interval.
Operation   Calculation                               Example
infimum     inf()                                     inf([1,3]) = 1
supremum    sup()                                     sup([1,3]) = 3
midpoint    inf() + (sup() - inf()) ÷ 2               midpoint([1,3]) = 2
width       sup() - inf()                             width([-3,0]) = 3
magnitude   larger of |inf()|, |sup()|                magnitude([-2,-1]) = 2
mignitude   0 if interval contains zero, otherwise    mignitude([-1,2]) = 0
            smaller of |inf()|, |sup()|               mignitude([-2,-1]) = 1
Table 2: Scalar interval decompositions
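The scalar decompositions of Table 2 translate directly into code. The sketch below is illustrative only: the function names and the (inf, sup) tuple representation are assumptions, and the directed rounding described for midpoint and width is omitted, so it is exact only for inputs like the table's examples.

```python
def midpoint(a):
    # Same form as Table 2: inf + (sup - inf) / 2
    return a[0] + (a[1] - a[0]) / 2

def width(a):
    return a[1] - a[0]

def magnitude(a):
    # Furthest distance of the interval from zero.
    return max(abs(a[0]), abs(a[1]))

def mignitude(a):
    # Closest distance of the interval from zero.
    if a[0] <= 0.0 <= a[1]:
        return 0.0
    return min(abs(a[0]), abs(a[1]))
```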
Interval      Infimum calculation          Supremum calculation         Example
Absolute value
any           mignitude()                  magnitude()                  |[-3,1]| = [0,3]
Negation
any           -sup()                       -inf()                       -[4,5] = [-5,-4]
Square
positive      inf() × inf()                sup() × sup()                sqr([2,3]) = [4,9]
negative      sup() × sup()                inf() × inf()                sqr([-3,-2]) = [4,9]
crosses zero  0                            larger of inf() × inf(),     sqr([-2,1]) = [0,4]
                                           sup() × sup()
Square Root
positive      sqrt(inf())                  sqrt(sup())                  sqrt([1,4]) = [1,2]
negative      NaN                          NaN                          sqrt([-4,-1]) = [NaN,NaN]
crosses zero  0 or NaN                     sqrt(sup())                  sqrt([-1,1]) = [0,1]
Table 3: Unary interval operations
3.4.3 Unary Interval Operations
Common unary interval operations include absolute value, negation, square, and square
root. These are shown in Table 3. There is some debate in the interval community about
how to calculate the infimum of the square root of an interval that crosses zero. Though
not technically correct, intersecting the solution with the set of real numbers tends to be
more useful to applications and is less likely to contaminate future calculations. Either
way, extra error checking is usually required when interval square root is performed.
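The two conventions for the square root of an interval that crosses zero can be made concrete with a small sketch. Everything here is an assumption made for illustration: the function names, the (inf, sup) tuples, and the clip_to_reals flag that selects between an infimum of 0 and an infimum of NaN. Rounding is again omitted.

```python
import math

def interval_abs(a):
    lo, hi = abs(a[0]), abs(a[1])
    mig = 0.0 if a[0] <= 0.0 <= a[1] else min(lo, hi)
    return (mig, max(lo, hi))

def interval_neg(a):
    # Negation swaps the roles of the endpoints.
    return (-a[1], -a[0])

def interval_sqrt(a, clip_to_reals=True):
    if a[1] < 0.0:
        # Entirely negative: no real square root exists.
        return (math.nan, math.nan)
    if a[0] < 0.0:
        # Crosses zero: either clip the argument to the reals (infimum 0)
        # or propagate NaN in the infimum.
        lo = 0.0 if clip_to_reals else math.nan
        return (lo, math.sqrt(a[1]))
    return (math.sqrt(a[0]), math.sqrt(a[1]))
```

Whichever convention is chosen, a caller still has to test for NaN endpoints afterwards, which is the extra error checking mentioned above.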
3.4.4 Interval Comparisons
Comparisons are more difficult for intervals than for point values. Most comparison
operators have certainly true, possibly true, and set variations. Interval comparison
operators are shown in Table 4.
Comparison operator ⊗             Calculation of a ⊗ b
Set comparisons
set equals (=)                    inf(a) = inf(b) and sup(a) = sup(b)
subset of (⊆)                     inf(a) ≥ inf(b) and sup(a) ≤ sup(b)
superset of (⊇)                   inf(a) ≤ inf(b) and sup(a) ≥ sup(b)
proper subset of (⊂)              (a ⊆ b) and not (a = b)
proper superset of (⊃)            (a ⊇ b) and not (a = b)
disjoint from (<>)                inf(a) > sup(b) or sup(a) < inf(b)
Possibly true comparisons
possibly equals                   inf(a) ≤ sup(b) and sup(a) ≥ inf(b)
possibly not equal to             inf(a) ≠ sup(a) or inf(a) ≠ inf(b) or
                                  sup(a) ≠ sup(b)
possibly less than                inf(a) < sup(b)
possibly less than or equal       inf(a) ≤ sup(b)
possibly greater than             sup(a) > inf(b)
possibly greater than or equal    sup(a) ≥ inf(b)
Certainly true comparisons
certainly equals                  inf(a) = sup(a) = inf(b) = sup(b)
certainly not equal to            same as (a <> b)
certainly less than               sup(a) < inf(b)
certainly less than or equal      sup(a) ≤ inf(b)
certainly greater than            inf(a) > sup(b)
certainly greater than or equal   inf(a) ≥ sup(b)
Table 4: Interval comparisons
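A few of the comparisons in Table 4 are sketched below to show how the three families differ on the same operands. The function names and the (inf, sup) tuple representation are assumptions made for this illustration.

```python
def certainly_less(a, b):
    # Every value in a is below every value in b.
    return a[1] < b[0]

def possibly_less(a, b):
    # At least one value in a is below at least one value in b.
    return a[0] < b[1]

def possibly_equal(a, b):
    # The intervals share at least one value.
    return a[0] <= b[1] and a[1] >= b[0]

def disjoint(a, b):
    return a[0] > b[1] or a[1] < b[0]

def is_subset(a, b):
    return a[0] >= b[0] and a[1] <= b[1]

# For overlapping intervals, "less than" is possible but not certain.
a, b = (1.0, 3.0), (2.0, 4.0)
overlap_result = (possibly_less(a, b), certainly_less(a, b))
```

Note that possibly_equal is the logical negation of disjoint, which matches the table's definition of "certainly not equal to".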
3.5 Software Support for Intervals
There are several software packages that support interval arithmetic, including
BIAS/PROFIL [18], INTLIB [14], and C-XSC [17]. Efforts are currently underway to add
built-in support for interval arithmetic to FORTRAN [3] [16] and other languages. C
implementations of common interval operations are shown in Appendix A.
Although naive application of interval arithmetic can lead to wide intervals,
many efficient algorithms that produce narrow intervals have been developed [6] [10].
Using these and other tools, interval arithmetic has been successfully applied to a wide
range of scientific applications [15] [19].
However, current software implementations of interval operations are not fast
enough. It has been postulated that, for interval arithmetic to have wide acceptance, its
performance overhead must be no more than two to five times that of regular floating
point arithmetic [16]. The overhead for current implementations has been estimated at
20 to 100 times that of floating point [5].
3.6 The Performance Cost of Intervals
The following six factors make interval operations expensive:
• function/subroutine overhead,
• increased size of the data,
• computation of two or more results per interval operation,
• changing of rounding direction,
• data dependent selection of how some operations are performed, and
• handling special cases.
Taken as a whole, these make interval operations very expensive. However,
many general improvements to processors and tools are being implemented that
address each of these costs, except those associated with the changing of rounding
direction.
3.6.1 Overhead due to Functions/Subroutines
Using software libraries for interval arithmetic incurs additional overhead since they
are accessed via function calls. The cost of making a function call for every interval
operation is significant [40]. This overhead can be largely eliminated through
compilers with native interval support or through inline functions that are customized
for particular architectures and compilers. Techniques that eliminate function overhead
may also cause a significant increase in code size, however. It is thus desirable for
interval operations to be implemented using a minimal amount of code.
3.6.2 Overhead due to Extra Interval Data
The increased data size of intervals (using pairs of values) causes more registers and
more memory transfers to be used. Newer processors tend to have more memory, more
and larger registers, and wider data buses, which offsets this cost. Eliminating function
calls will also reduce this overhead since operand and result passing as well as local
variables associated with functions cause increased register and stack usage.
3.6.3 Overhead due to Computing both Endpoints
Computing both endpoints results in extra computations. However, newer processors
tend to employ more parallelism (e.g., through deep pipelines, very long instruction
words, or superscalar execution), which helps offset the cost of the extra computations
required for intervals. This is especially true for applications that otherwise limit what
can be done in parallel. In other words, the extra computations may be able to occupy
execution slots that would be otherwise vacant due to stalls or failure to schedule an
applicable instruction, as long as the pipeline is not stalled by requesting a change in
rounding.
3.6.4 Overhead due to Changing Rounding Direction
Most processors are not designed to have their rounding direction changed very often.
It is usually changed through an inefficient "back door" such as loading a floating point
status register or writing a control location in memory, causing the floating point
pipeline to be flushed (i.e., all pending floating point operations are allowed to
complete) before it will accept new operations. This introduces additional latency and
prevents the simultaneous computation of the infimum and supremum of an interval, even
though the processor may otherwise have sufficient parallelism to do so. Some
machines have been designed without this restriction, though. Specifically, to remove
execution interlocks, some researchers have broken the dependencies between system
state and the processor execution units by duplicating that state, including floating
point control, as data that accompanies each instruction in the execution pipeline [35].
3.6.5 Overhead due to Data-dependent Operation Selection
Software implementations of interval operations require many data dependent
operations because they must choose between lower and upper bounds. This can introduce
significant delays in deeply pipelined processors, which have long branch latencies.
Predicated branches combined with speculative execution, such as will be present in
Intel's IA-64 architecture [9], may be able to eliminate some of these stalls. Another
approach is to use conditional instructions to reduce the number of branches. This
approach is used in the ARM instruction set [7].
3.6.6 Overhead due to Interval Special Cases
Special cases, such as infinities or NaNs, can make interval operations expensive as
well. Most processors support floating point special cases in some way (such as by
allowing floating point exceptions to be raised). Special cases are handled differently
for intervals, however. Floating point handling of special cases is generally disabled
while using intervals, since it may not be desirable to have floating point exceptions
raised twice and there are some cases when a floating point exception would be raised
by an intermediate result that is discarded anyway. Special case handling for intervals
is rather complicated, and is an area of active research [23] [24].
Chapter 4
Current Processor Architectures and Research
Although interval arithmetic can have a significant performance cost, current trends in
processor design may help offset that cost. Carefully used, hand-optimized interval
libraries may soon be fast enough for interval arithmetic to gain a foothold
in conventional computer applications. If care is taken now to ensure that future designs
continue to benefit intervals, processors may soon compute intervals efficiently enough
to support their widespread use.
4.1 Processor Organization
Modern processors use techniques that allow them to do several things in parallel, so
that programs can execute faster. Pipelined architectures do this by splitting instructions
into stages and executing one stage each from multiple instructions simultaneously.
The current trend is towards deeper pipelines. Each stage executes faster, but more
stages are required per instruction. As long as the pipeline stays full, the throughput of
the processor is one instruction per cycle, so the full benefit of the shorter clock cycle is
realized. Superscalar processors can get even higher throughput by issuing multiple
instructions per cycle to multiple functional units.
Superscalar processors are a mixed blessing for intervals. Despite efforts by
researchers to keep the functional units busy [12], many of them are often left idle.
These can absorb some of the overhead of intervals. However, if changing of rounding
modes causes the processor pipeline to flush, the overhead can be severe. As pipelines
become longer, the cost of flushing them goes up. Support for intervals may soon be at
a critical juncture. Up to now, processors have been getting increasingly efficient at
processing intervals, but the rising cost of pipeline flushes could begin to reverse this
trend.
Deep pipelines need not cause intervals such a performance penalty, however.
Branches used to cause pipeline stalls. Today, speculative execution and predicated
branches [9] can often avoid stalls. Data hazards used to cause stalls as well, but dynamic
scheduling techniques such as Tomasulo's algorithm can eliminate write-after-write
and write-after-read hazards, and can reduce the number of read-after-write hazards as
well. Similarly, steps can be taken to avoid pipeline flushes and stalls when rounding
modes and other processor controls are changed [35].
4.2 Separation of Architecture from Implementation
Another trend in modern processors is the separation of the instruction set architecture
from the implementation details of the processor. Microcoding, in which instructions
are decoded into internal sequences of control words, is a primitive example of this.
Processor designers have moved away from microcoding to boost performance, but are
finding better ways to isolate architectures from implementations.
This trend may be partly due to competition among processor vendors. Several
processor families, such as X86, ARM, and SPARC, are provided by multiple vendors.
Each vendor wants to add its own value, and needs to be able to enhance performance
to keep up with, or stay ahead of, competitors. For instance, Intel has made drastic
changes in the basic architecture of their X86-based processors while maintaining
backward compatibility (and the instruction set has remained essentially the same since
the 80386 was introduced). The IA-64 architecture will continue this trend, and Intel
has hinted that the technology to map the old instructions onto IA-64 long instruction
words will be quite novel [9].
4.3 More flexibility
Modern processors tend to have greater numbers of registers, and larger register sizes.
This can reduce function overhead by avoiding stack accesses, since registers can be
used to pass parameters and results and more scratch registers are available. It also
allows compilers to optimize across larger blocks of code. This allows programs that
use more data, or data in a more complex fashion, to be optimized.
There is also a trend toward larger word sizes. The associated wider buses offset
the cost of increased data requirements. Wider instructions increase the available
opcode space, allowing more specialized instructions, more orthogonality in existing
instructions, more operations that work on multiple registers, or most likely all three.
4.4 Faster floating point
The performance of processors on floating point calculations has not always been a
focus of processor designers. Significantly more area is now available on processors,
though. In addition, processors are being rated for floating point performance through
measures such as SPECfp. Thus, floating point performance is getting more attention.
Although integer pipelines are being lengthened to boost clock rates, enough extra
hardware is being dedicated to floating point operations that the floating point pipelines
4.5 Specialized Hardware
Even with 32-bit bus architectures, there is a growing trend in the processor industry to
add specialized hardware and instruction set extensions such as multimedia extensions.
This is a way to add value to the device by targeting specific applications and libraries
which can cause a demand for the processor. For the most part, these extensions go
hand in hand with the set of libraries they were designed to enhance.
4.6 Dedicated Interval Processors
Significantly improving the performance of interval arithmetic requires some form of
architectural improvements [31]. Previous research has demonstrated the feasibility of
coprocessors for variable-precision [28] [32] and staggered [30] interval arithmetic, as
well as custom interval arithmetic units that can be added to existing processors [29]. In
the same way that coprocessors introduced native floating point support to systems
whose central processing unit did not have it, dedicated interval arithmetic units and
coprocessors can add native interval support.
These units increase system cost and add a great deal of specialized hardware
that does not speed up most existing applications, although as process improvements
allow more features to be packed into each processor, this will become less of an issue.
Given sufficient demand for interval arithmetic, native interval units will eventually be
found on many processors.
Chapter 5
Interval Support Instructions
Rather than designing specialized hardware which processes intervals as rapidly as
possible, this thesis investigates how some of the more costly components of interval
processing can be made faster. General enhancements like deep pipelines and superscalar
execution are relied upon to help absorb the cost of extra floating point operations. This
approach requires only a few interval-specific instructions and some features that are
more widely applicable and have already been implemented in some processors. A
summary of the proposed enhancements is shown in Table 5.
Operation                    Mnemonic   Description
Set rounding mode            SRND       Set rounding mode as specified
Initiate interval multiply   IIMD       Set rounding mode, set up operands for multiply
Initiate interval divide     IIDD       Set rounding mode, set up operands for divide
In-place absolute value      IABSD      Compare signs of two numbers and make both positive
Conditional move             MOVDT      Move if floating point status is true/false
                             MOVDF
Divide by two                HALFD      Quickly divide a floating point number by two
Negate and swap              NEGSD      Negate and exchange two floating point numbers
Table 5: Summary of proposed interval extensions
Only double-precision versions of the new instructions are shown. All but
SRND have single-precision versions as well.
5.1 The DLX Architecture
The DLX architecture [11] is used as the basis for designing the new interval support
instructions. The DLX architecture with the new instructions is referred to as enhanced
DLX. DLX is a RISC architecture created by combining the features of a number of
current processor designs, so modifications to the DLX architecture are likely to apply
to other modern processors as well.
For the code examples in this chapter, input operands start at the F4 register
(and proceed through F5, F6, etc.). Output operands start at F4 if the code invalidates
the current contents of any registers. If not, output operands start at F0. DLX uses
register pairs for double-precision numbers, which is why no odd-numbered registers are
shown. For example, double-precision references to register F0 actually affect both F0
and F1. DLX defines the R0 integer register to always contain the value zero, so a
double-precision zero can be created by copying R0 and extending the result.
The DLX architecture uses delayed branches: the instruction following a branch is
always executed, regardless of whether the branch is taken. Descriptions of several
common DLX instructions are given in Table 6.
Instruction          Description          Operation performed
ADDD Fd, Fa, Fb      Add                  Fd ← Fa + Fb
SUBD Fd, Fa, Fb      Subtract             Fd ← Fa - Fb
MULTD Fd, Fa, Fb     Multiply             Fd ← Fa × Fb
DIVD Fd, Fa, Fb      Divide               Fd ← Fa ÷ Fb
GED Fa, Fb           Greater or Equal     FPSR ← (Fa ≥ Fb)
LED Fa, Fb           Less or Equal        FPSR ← (Fa ≤ Fb)
BFPF addr            Branch if FP false   if not FPSR, PC ← addr
J addr               Branch always        PC ← addr
MOVI2FP Fd, Rs       Integer to Float     Fd ← (float) Rs
CVTF2D Fd, Fs        Float to Double      Fd ← (double) Fs
Table 6: Some traditional DLX instructions
Example implementations of interval operations using the DLX architecture
without interval enhancements are given in Appendix B. A write floating point control
word (WFPCW) instruction was also needed for these. This stalls the processor until
the floating point pipe is empty and sets rounding mode towards 0, +∞, -∞, or nearest.
5.2 SRND Instruction
The set rounding mode (SRND) instruction implies architectural support for rounding
mode in the instruction pipeline. Rounding can be set to Near, Down, Up, DownOnce,
UpOnce, DownUp, UpUp, or UpDown. The rounding mode automatically changes
after a floating point operation, and changing of rounding modes does not introduce
any delay. The rounding mode transitions are shown in Table 7. There must also be a
way to save and restore the rounding mode during a context switch.
Rounding mode   Round toward   Next rounding mode
Zero            0              Zero
Near            nearest        Near
Down            -∞             Down
Up              +∞             Up
DownOnce        -∞             Near
UpOnce          +∞             Near
DownUp          -∞             UpOnce
UpUp            +∞             UpOnce
UpDown          +∞             DownOnce
Table 7: Rounding mode transitions
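The transitions in Table 7 form a small state machine, which can be sketched as a lookup table. The dictionary layout, the direction labels, and the helper name trace are assumptions made for this illustration. The sketch models only the mode sequencing, not the arithmetic itself.

```python
# (round toward, next mode) for each rounding mode, as listed in Table 7.
TRANSITIONS = {
    "Zero":     ("zero",    "Zero"),
    "Near":     ("nearest", "Near"),
    "Down":     ("-inf",    "Down"),
    "Up":       ("+inf",    "Up"),
    "DownOnce": ("-inf",    "Near"),
    "UpOnce":   ("+inf",    "Near"),
    "DownUp":   ("-inf",    "UpOnce"),
    "UpUp":     ("+inf",    "UpOnce"),
    "UpDown":   ("+inf",    "DownOnce"),
}

def trace(mode, n_ops):
    """Return the rounding direction used by each of n_ops successive
    floating point operations, plus the mode left in effect afterwards."""
    used = []
    for _ in range(n_ops):
        direction, mode = TRANSITIONS[mode]
        used.append(direction)
    return used, mode

# After SRND DownUp, the first operation rounds down, the second rounds up,
# and the mode then falls back to Near.
directions, final_mode = trace("DownUp", 3)
```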
Interval addition and subtraction simply require a SRND instruction to set the
rounding mode to DownUp followed by two operations to compute the endpoints.
Interval width requires only a SRND and a subtract instruction. These are shown in
Table 8.
Addition               Subtraction            Width
SRND DownUp            SRND DownUp            SRND UpOnce
ADDD F0, F4, F8        SUBD F0, F4, F10       SUBD F0, F6, F4
ADDD F2, F6, F10       SUBD F2, F6, F8
Table 8: Addition, subtraction, and width in enhanced DLX
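The effect of the sequences in Table 8 can be emulated in software to see what the directed rounding accomplishes. Python cannot change the hardware rounding mode, so the sketch below substitutes a different technique: it computes each result exactly with fractions.Fraction and then rounds it down or up to a double. The helper names are assumptions, math.nextafter requires Python 3.9 or later, and only finite inputs are handled.

```python
import math
from fractions import Fraction

def round_dir(exact, direction):
    """Round an exact rational to a double toward -inf ("down") or +inf ("up")."""
    f = float(exact)  # nearest double to the exact value
    if direction == "down" and Fraction(f) > exact:
        f = math.nextafter(f, -math.inf)
    elif direction == "up" and Fraction(f) < exact:
        f = math.nextafter(f, math.inf)
    return f

def iadd(a, b):
    # Mirrors ADDD F0,F4,F8 (down) followed by ADDD F2,F6,F10 (up).
    return (round_dir(Fraction(a[0]) + Fraction(b[0]), "down"),
            round_dir(Fraction(a[1]) + Fraction(b[1]), "up"))

def isub(a, b):
    # Mirrors SUBD F0,F4,F10 (down) followed by SUBD F2,F6,F8 (up).
    return (round_dir(Fraction(a[0]) - Fraction(b[1]), "down"),
            round_dir(Fraction(a[1]) - Fraction(b[0]), "up"))

def iwidth(a):
    # Mirrors SUBD F0,F6,F4 rounded up.
    return round_dir(Fraction(a[1]) - Fraction(a[0]), "up")

# The doubles nearest 0.1 and 0.2 have a sum that is not representable,
# so the result is a narrow interval that encloses the exact sum.
lo, hi = iadd((0.1, 0.1), (0.2, 0.2))
```

This emulation is far slower than the two-instruction hardware sequence. It is meant only to show why the infimum and supremum need opposite rounding directions.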
5.3 IIMD Instruction
The initiate interval multiply (IIMD) instruction sets the rounding mode to DownUp
and conditionally exchanges the infimum and supremum of its input operands based on
their signs. An IIMD instruction and two floating point multiplies can complete an
interval multiply, as shown in Table 9.
To avoid specifying four arguments to IIMD, it is assumed that adjacent pairs of
double-precision numbers are used to represent intervals. This should not by itself
cause register assignment problems, since to a compiler this is just a register constraint.
Adding this constraint is not trivial, though. The current implementation of the IIMD
instruction was modified to accept four arguments instead.
Special cases (±∞ or NaN in an endpoint) and cases where both operands cross
zero must be handled by the IIMD instruction. One way to do this is to issue a special
trap and complete the operation or recompute the operands in the trap handler. The
current IIMD implementation fakes a trap if both operands cross zero (the DLX simulator
does not support real traps), but does not handle special cases very well.
Multiplication         Division
IIMD F4, F8            IIDD F4, F8
MULTD F4, F4, F8       DIVD F4, F4, F8
MULTD F6, F6, F10      DIVD F6, F6, F10
Table 9: Interval multiplication and division in enhanced DLX
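One way to picture the operand setup performed before the two multiplies is as a lookup on the sign classes of the operands. The sketch below is an assumption-laden illustration, not the thesis's IIMD implementation: the function names are hypothetical, the selection table is derived from the sign cases for interval multiplication, intervals with a zero endpoint are classed as positive or negative, rounding is omitted, and the case where both operands cross zero is handled by a fallback that stands in for the trap handler.

```python
def sign_class(x):
    """Classify an (inf, sup) interval as 'pos', 'neg', or 'cross'."""
    if x[0] >= 0.0:
        return "pos"
    if x[1] <= 0.0:
        return "neg"
    return "cross"

def setup_multiply(a, b):
    """Return ((a_lo, a_hi), (b_lo, b_hi)) such that inf(c) = a_lo * b_lo and
    sup(c) = a_hi * b_hi, or None when both operands cross zero."""
    (al, ah), (bl, bh) = a, b
    table = {
        ("pos", "pos"):   ((al, ah), (bl, bh)),
        ("pos", "neg"):   ((ah, al), (bl, bh)),
        ("pos", "cross"): ((ah, ah), (bl, bh)),
        ("neg", "pos"):   ((al, ah), (bh, bl)),
        ("neg", "neg"):   ((ah, al), (bh, bl)),
        ("neg", "cross"): ((al, al), (bh, bl)),
        ("cross", "pos"): ((al, ah), (bh, bh)),
        ("cross", "neg"): ((ah, al), (bl, bl)),
    }
    return table.get((sign_class(a), sign_class(b)))

def imul(a, b):
    ops = setup_multiply(a, b)
    if ops is None:
        # Both operands cross zero: four multiplies and two compares.
        (al, ah), (bl, bh) = a, b
        return (min(al * bh, ah * bl), max(al * bl, ah * bh))
    (x_lo, x_hi), (y_lo, y_hi) = ops
    # Two multiplies complete the operation (rounded down, then up, in hardware).
    return (x_lo * y_lo, x_hi * y_hi)
```

In three of the eight table entries an endpoint is duplicated rather than exchanged, so a pure register swap would not cover every case in this particular formulation.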
5.4 IIDD Instruction
The initiate interval divide (IIDD) instruction does for interval division what
IIMD does for interval multiplication. It sets the rounding mode to DownUp and
conditionally exchanges the infimum and supremum of its input operands based on their
signs so two floating point divides can complete the operation, as shown in Table 9.
Implementation issues are very similar to IIMD. However, no trap is needed when both
operands cross zero since the result is always unbounded in at least one direction (for
"sharp" division) or in both directions (for "simple" division) [3]. IIDD also must
check when an endpoint of the denominator is exactly zero.
5.5 IABSD Instruction
The in-place absolute value (IABSD) instruction computes the absolute value
of two registers simultaneously and stores the results back in the same registers. The
destinations are not explicitly specified, since that would require two destination fields
and it is likely that there is a better use for the limited opcode space. In addition,
IABSD sets a flag indicating whether the signs of the registers were different when the
instruction began. This check is required for interval mignitude.
5.6 MOVDT and MOVDF Instructions
Conditional move instructions such as MOVDT and MOVDF are becoming common
in modern processors (e.g., SPARC V9). They would likely be used in interval
multiplication and division, but only in the special trap handler(s) if IIMD and IIDD are
supported. IABSD and MOVDF together allow magnitude and mignitude to be calculated
quite efficiently, as shown in Table 10.
Magnitude              Mignitude
IABSD F4, F6           IABSD F4, F6
GED F4, F6             BFPF _notnil
MOVDF F4, F6           LED F4, F6
                       MOVI2FP F4, R0
                       J _end
                       CVTF2D F4, F4
                       _notnil:
                       MOVDF F4, F6
                       _end:
Table 10: Interval magnitude and mignitude in enhanced DLX
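The sequences in Table 10 can be mirrored in software to check their logic. In the sketch below, iabsd is a hypothetical emulation that returns both absolute values together with the sign-difference flag, and the conditional moves are written as ordinary conditionals. The use of math.copysign to read the sign bit (so that -0.0 counts as negative) is an assumption about how the flag would treat signed zeros.

```python
import math

def iabsd(x, y):
    """Emulate IABSD: return (|x|, |y|, flag), where flag is True if the
    sign bits of x and y differed."""
    x_negative = math.copysign(1.0, x) < 0.0
    y_negative = math.copysign(1.0, y) < 0.0
    return abs(x), abs(y), x_negative != y_negative

def magnitude(lo, hi):
    lo, hi, _ = iabsd(lo, hi)
    # GED + MOVDF: keep lo if lo >= hi, otherwise move hi into place.
    return lo if lo >= hi else hi

def mignitude(lo, hi):
    lo, hi, signs_differ = iabsd(lo, hi)
    if signs_differ:
        # The interval contains zero, so the closest distance is zero.
        return 0.0
    # LED + MOVDF: keep lo if lo <= hi, otherwise move hi into place.
    return lo if lo <= hi else hi
```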
5.7 HALFD Instruction
The divide by two (HALFD) instruction quickly divides a floating point number
in half by decrementing its exponent (for normalized numbers) or shifting the
significand by one bit (for denormalized numbers). Rounding is not an issue for
normalized numbers. It is assumed that rounding towards zero is acceptable for
denormalized numbers. If not, a special trap could be added in case a denormalized number
is encountered. HALFD is used in interval midpoint calculations, as shown in Table 11.
Midpoint               Negate
SRND UpDown            NEGSD F4, F6
SUBD F0, F6, F4
HALFD F0, F0
ADDD F0, F0, F4
Table 11: Interval midpoint and negate in enhanced DLX
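The exponent-decrement idea behind HALFD can be illustrated by manipulating the bits of an IEEE 754 double directly. The sketch below is an assumption-based emulation, not the thesis's hardware design: the function names are hypothetical, denormalized results are truncated toward zero, infinities and NaNs are passed through unchanged, and a number in the smallest normalized binade is handled as an extra case by shifting its significand (including the implicit leading one) into the denormalized range. The midpoint helper omits the directed rounding that SRND UpDown provides.

```python
import struct

def halfd(x):
    """Halve a double by adjusting its exponent or shifting its significand."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    sign = bits & (1 << 63)
    exp = (bits >> 52) & 0x7FF
    frac = bits & ((1 << 52) - 1)
    if exp == 0x7FF:
        return x                          # infinity or NaN: unchanged
    if exp > 1:
        exp -= 1                          # normalized: decrement the exponent
    elif exp == 1:
        frac = ((1 << 52) | frac) >> 1    # result becomes denormalized
        exp = 0
    else:
        frac >>= 1                        # denormalized or zero: shift right
    out = sign | (exp << 52) | frac
    return struct.unpack("<d", struct.pack("<Q", out))[0]

def midpoint(lo, hi):
    # Same order of operations as Table 11: subtract, halve, add.
    return lo + halfd(hi - lo)
```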
5.8 NEGSD Instruction
The negate and swap (NEGSD) instruction exchanges two register values and also tog-
gles their sign bits. This implements a one-cycle interval negate, as shown in Table 11.
The exchange is not strictly necessary (the compiler could simply be notified that the
registers have changed), but could eliminate some copying if an interval multiply or
divide follows.
5.9 The Enhanced DLX Instructions
The proposed instructions are designed to be fast, completing in a single execution
cycle in most cases. Many of them invalidate the current contents of their source
registers. This is unusual for DLX, but is inherent to the nature of IIMD and IIDD, and both
IABSD and NEGSD would require two destinations otherwise. This could result in
extra copying of the old register values, but these instructions should eliminate more
cycles than the extra copies cost. The GCC-DLX compiler is usually intelligent enough
to rearrange the code in order to avoid this copying anyway.
Chapter 6
Interval Support Hardware Requirements
The interval extensions are designed to be fast and to complete in a single execution
cycle. To avoid having a negative effect on the processor clock speed, they are
designed to require only minor modifications to existing hardware. The exact changes
required are highly dependent on the processor implementation. For conventional high-
performance microprocessors, however, any increase in area or cycle time resulting
from these modifications is expected to be negligible.
6.1 General Hardware Requirements
The instructions require decoding, control, and routing circuitry, much of which can
usually be shared with other parts of the processor. General floating point support is
required, including the ability to disable floating point exceptions. Most processors
now support IEEE 754 floating point arithmetic and therefore meet this requirement.
A number of the new instructions read and write two double-precision registers
at the same time. Most high-performance processors meet this requirement. Some pro-
cessors may have only one such write port, though, and adding a second write port
might affect basic instruction timing. With only a single write port, some instructions
take an extra cycle to complete. However, forwarding hardware may allow some stalls
to be avoided. Many processors support dynamic scheduling techniques like register
renaming and reservation stations (e.g., Tomasulo's algorithm) which can be used to
overcome read and write port limitations.
6.2 SRND Hardware Requirements
The set rounding mode (SRND) instruction requires support for the rounding opera-
tions specified in the IEEE 754 floating point standard. One or two 4-bit multiplexers
may be required to share the rounding mode settings with the floating point control
word and other instructions that access it. It is also necessary for the rounding mode to
travel with each instruction as it proceeds through the floating point pipeline. This
requires a 4-bit latch at each floating point pipeline stage. Finally, a 4x9 transition table
or (more likely) equivalent circuitry is required to specify the rounding mode transi-
tions.
6.3 IIMD and IIDD Hardware Requirements
The initiate interval multiply and divide (IIMD and IIDD) instructions read from four
double-precision registers and write up to four double-precision values. Processors that
support register renaming can avoid exchanges by simply renaming registers. The write
back to the register file can sometimes be avoided since succeeding instructions destroy
these values, but may be needed in some cases, such as if the processor is interrupted.
If four read ports are not available, separate floating point state may have to be
maintained to indicate which floating point registers are negative, zero, or special
cases. The register exchanges can be implemented using endpoint selection logic
(which can be implemented using 76 transistors [33]) and four 64-bit multiplexers plus
latches and forwarding hardware (two more multiplexers for each multiply and divide
unit). More likely, the partially-decoded register state will be combined with simple
logic (or possibly endpoint lookup tables) and used to control existing register renam-
ing hardware. In this case, two 64-bit latches are still required as copy destinations (and
perhaps two additional 64-bit latches and 64-bit multiplexers so that there is always an
active copy target and one available for renaming). IIMD also must be able to generate
a processor trap.
6.4 MOVDT and MOVDF Hardware Requirements
Conditional tests and register transfers are already present in essentially all processors,
so the conditional move (MOVDT and MOVDF) instructions probably just require
extra decode and control logic. Conditional moves are becoming popular on newer pro-
cessors and are likely to be available anyway [11].
6.5 HALFD Hardware Requirements
The divide by two (HALFD) instruction requires a test for special cases. This
information is likely to be available for the initiate interval multiply and divide
instructions anyway. A 52-bit multiplexer is required for shifting the significand and an
11-bit subtractor is needed to update the exponent. If traps are required for denormalized
numbers (in order to keep rounding modes consistent), an 11-input OR/NOR is required to
detect the exception and a stored vector address is needed.
6.6 IABSD Hardware Requirements
The in-place absolute value (IABSD) instruction requires an exclusive-or for
the sign bits and the ability to latch register values and force their sign bits to zero (this
probably just requires two one-bit multiplexers). IABSD writes two double-precision
registers, so two register write ports, or possibly some advanced forwarding hardware,
are required or else this instruction will take an extra cycle to execute.
Essentially, IABSD just sets the floating point status bit to the exclusive-or of
the two sign bits and forces the sign bits to zero.
6.7 NEGSD Hardware Requirements
The negate and swap (NEGSD) instruction requires two inverters to change the sign
bits plus the ability to write two double-precision registers as required by IABSD.
Chapter 7
Simulation Environment
Simulation is a crucial element for this thesis, which involves evaluating alternative
designs. Detennining the performance impact of a design without implementing it in
some form is very difficult since the design may have side-effects or be used in unan-
ticipated ways. Although developing a test processor to evaluate new designs such as
these is not practical, a simulator may be quickly modified.
To fully understand the impact of a design change, simulations should be used
as the processor is eventually intended to be used (i.e., to run entire applications). The
real value of a processor modification is usually how much faster applications run. The
speedup of the processor can then be measured as the ratio of the time applications take
without the change to the time they take with the change [11]. To measure speedup,
cycle-accurate models are required. As long as the designs are realizable and do not
affect either the fundamental clock rate of the device or the timing of other instructions,
more detailed simulation is not needed.
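Because the clock rate is assumed to be unchanged, the time ratio reduces to a ratio of cycle counts. A minimal sketch follows; the function name is hypothetical, and the sample figures are the Configuration A cycle counts for no enhancements and for all enhancements from Table 20.

```python
def speedup(cycles_without: int, cycles_with: int) -> float:
    """Speedup = execution time without the change / time with the change.

    With an unchanged clock rate, execution time is proportional to the
    cycle count, so the ratio of cycle counts is the speedup.
    """
    return cycles_without / cycles_with

# Stress test, Configuration A: base DLX versus all enhancements.
stress_test_speedup = round(speedup(320184, 212862), 3)
```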
7.1 Simulator Evaluation
Simulator choice had a significant impact on this thesis: it determined which processor
would be used as the basis for the instruction set changes. A number of simulators
were thus evaluated. In choosing or designing a simulator, it is important to specify the
correct capabilities. The five characteristics considered for this thesis are: accuracy,
precision, timing, scope, and performance.
7.1.1 Simulator Accuracy
Accuracy is how faithfully the device is represented. Measures of accuracy include: Are
numbers represented in the same way as the target machine? Are rounding modes mod-
eled correctly? Are the effects of the pipeline modeled correctly?
7.1.2 Simulator Precision
Precision is the level of detail of the simulation. Measures of precision include: Are pin
values modeled? Are rounding modes modeled? Are register values kept up to date or
just updated at the last cycle of an instruction?
7.1.3 Simulator Timing
Timing is the unit of time modeled. Measures of timing include: Is the model accurate
to a millisecond? Is the model updated every instruction cycle? Every clock cycle?
Every clock phase?
7.1.4 Simulator Scope
Scope is the size of the system modeled. Measures of scope include: Is the entire pro-
cessor modeled? Are other system components modeled? Is an entire system modeled
(including disk drives, memory, etc.)?
7.1.5 Simulator Performance
Performance is the speed and resource usage of the model itself. Measures of simulator
performance include: How fast does the simulator run? How much memory does it
use? How much disk space does it require?
7.1.6 Simulator Requirements
All of the characteristics of simulators compete for a set of limited resources. Simula-
tors that are precise and accurate may have a smaller scope. System-wide simulations
usually sacrifice timing. Simulations that are complete, precise, accurate, and provide
detailed timing tend to have lower performance, require more effort to program and
maintain, and may be unwieldy to use since they present too much information. It is
thus important not to overspecify simulator requirements. The simulator requirements
for this thesis are shown in Table 12.
Characteristic   Required level
Accuracy         Correct timing and I/O
Precision        Provide output and total run time
Timing           Machine cycle
Scope            Run applications
Performance      Not an issue
Table 12: Simulator Requirements
To measure the performance of applications, an entire application is run all at
once through the simulator, which must measure the correct number of cycles to exe-
cute from start-to-finish. Correct cycle counts require a certain level of detail (e.g.,
floating point pipelines must be considered). Correct outputs are needed to verify correct
operation. The simulator could take days if needed to run an application (multiple
host machines would be used in this case), so performance is not much of an issue.
7.2 Simulator Availability
Many simulators are freely available, but most sacrifice accuracy and precision for
scope. This is especially true for newer architectures. Simple processors like the 8051
have good models, but complete, detailed simulators are not available for most modern
processors. Detailed models that are available may only consider part of a device. Such
is the case with Sun SPARC processors. System-level simulators such as SHADE do
not provide accurate timing. More detailed simulators exist but do not model the entire
processor and are not publicly available. In fact, better simulators for many processors
exist but are kept proprietary by processor design companies that apparently do not want
to risk losing their competitive advantage or revealing too much about their processor's
internals.
7.3 Simulator Quality
Unfortunately, the quality of free simulators is often suspect. Fundamental problems
have been uncovered in the timing information for all of the simulators considered as
candidates for use in this thesis. Ultimately, the selection was not based on the accuracy
of a model but rather on which could be easily modified to have accurate timing.
Floating point timing accuracy is one area that was almost universally lacking
in the candidate simulators. This may not be a concern of the general public. Floating
point applications often have less critical timing, and rough timing estimates made by
hand can often be followed by exact cycle counts gathered from actual machines.
Designers of modern high-speed processors are also more reluctant than ever to release
detailed descriptions of the internal workings of their devices. This could be to avoid
giving away information to competitors, because the details are too complex to explain
easily, so that they have freedom to change the implementation (e.g., migrating copro-
cessor operations into central processing units), or some combination of the three.
7.4 Floating Point Simulation
A fundamental requirement of this thesis is the ability to model cycle-accurate timing
and results of floating point operations in modern processors. These needs are fairly
modest, but all of the simulators evaluated were lacking in this regard. To get accurate
floating point timing, the floating point pipelines must be modeled, taking hazards and
stalls into account.
There are three types of data hazards. Read after write (RAW) hazards are
caused when the results of a previous operation are not available before an operation
that requires them is issued. Write after write (WAW) hazards are caused when the
results of a previous operation would overwrite the results of a future operation. Write
after read (WAR) hazards are caused when a previous operation attempts to read a
value after a future operation writes it.
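The three data hazard types can be expressed as register comparisons between an earlier and a later instruction. The sketch below only identifies the register dependences that can give rise to each hazard; whether a stall actually occurs depends on pipeline timing, which is not modeled here. The Instr record and the register names are hypothetical.

```python
from typing import NamedTuple, Optional, Set

class Instr(NamedTuple):
    """Hypothetical instruction record: one destination, any number of sources."""
    dst: Optional[str]
    src: frozenset

def data_hazards(earlier: Instr, later: Instr) -> Set[str]:
    """Return the data hazard types possible between two in-flight instructions."""
    hazards = set()
    if earlier.dst is not None and earlier.dst in later.src:
        hazards.add("RAW")   # later reads what earlier has not yet written
    if earlier.dst is not None and earlier.dst == later.dst:
        hazards.add("WAW")   # writes could complete in the wrong order
    if later.dst is not None and later.dst in earlier.src:
        hazards.add("WAR")   # later could overwrite a value earlier still needs
    return hazards

add = Instr(dst="f2", src=frozenset({"f0", "f1"}))   # f2 = f0 + f1
mul = Instr(dst="f4", src=frozenset({"f2", "f3"}))   # f4 = f2 * f3
div = Instr(dst="f0", src=frozenset({"f4", "f5"}))   # f0 = f4 / f5
```

Here the multiply depends on the add through f2 (RAW), while the divide overwrites f0, which the add reads (WAR).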
Structural and control hazards must also be considered. Structural hazards are
caused when resources needed to complete an operation are currently being used by
another operation. Control hazards, mostly caused by branches, are the result of an
instruction invalidating partially-completed future instructions which are not supposed
to execute.
Hazards usually stall the floating point pipeline. During a stall, no new floating
point instructions are issued until the hazard is resolved. Depending on the architecture
and circumstance, floating point stalls may stall the integer pipeline as well [11].
7.5 The FP Module: a General Floating Point Model
Since accurate floating point modeling is lacking in so many simulators, a solution that
can address the floating point needs of a large number of simulators is better than one
that is tied to a particular simulator. The FP module was thus created to fill this need.
The FP module is a general-purpose library which simulates an arbitrary floating point
pipeline. It provides a simple yet flexible application programming interface (API) and
is designed to be portable so that it can be compiled and linked with a wide variety of
simulators. This API is shown in Table 13.
API call                    Description
FpInitUnit                  Initialize state (using the pipeline description file)
FpShutDownUnit              Clear all internal state
keyMarkFpRegBeingWritten    Mark register as destination of an executing instruction
keyMarkFpRegBeingRead       Mark register as source for an executing instruction
MarkFpRegNotBeingWritten    Unmark destination w/key from keyMarkFpRegBeingWritten
MarkFpRegNotBeingRead       Unmark source w/key from keyMarkFpRegBeingRead
isFpRegBeingWritten         Check if register is the destination of an executing instruction
isFpRegBeingRead            Check if register is the source of an executing instruction
keyInitiateFpInstruction    Load an instruction into the pipeline
FpExecuteCycle              Execute one cycle in the model
isFpOperationPending        Check if any instructions are in the pipeline
isFpInstructionReady        Check if a given instruction has sufficient resources to execute
Table 13: FP module application programming interface (API)
The FP module focuses on just the floating point pipeline. A description of the
pipeline is loaded from a file when the model is initialized, so different pipelines can be
modeled without recompiling. It simulates accesses to registers and to functional unit
resources, detecting and resolving data and structural hazards automatically. It can also
respond to control hazards or to externally-stimulated data hazards. It uses callbacks
(function pointers provided by the main simulator) to perform the actual arithmetic.
7.6 The FP Pipeline Description File
The FP module uses a pipeline description file to initialize its model. This file contains
two sections. The first section simply lists the names of all functional units available to
service floating point pipeline requests and identifies how many of each are present.
The second section lists the floating point instruction types, each of their stages, and
how many of each functional unit are required at each stage.
7.6.1 Pipeline Description for the Sun UltraSparc Processor
According to Sun, an UltraSparc performs single-precision divide in 12 cycles, double-
precision divide in 22 cycles, and any floating point add/subtract or multiply operation
in 3 cycles [34]. Division is not pipelined; the others are fully pipelined and have a
throughput of one instruction per cycle. Table 14 identifies the UltraSparc floating
point functional units. Table 15 describes its instruction types. In Table 15, commas
separate stages and superscripted numbers indicate that a stage is repeated that many
times sequentially. In the case of division, the stages are not actually repeated, and the
number represents the fact that the division unit takes that many cycles to complete.
The UltraSparc floating point pipeline is fairly representative of leading-edge
RISC processors. For this thesis, a non-superscalar version of the UltraSparc floating
point pipeline is assumed. A definition of this pipeline is shown in Appendix D.
Unit       Description
FPDIV      Floating point divide/square root
FPADDX1    Floating point add, stage 1
FPADDX2    Floating point add, stage 2
FPADDX3    Floating point add, stage 3
FPMULX1    Floating point multiply, stage 1
FPMULX2    Floating point multiply, stage 2
FPMULX3    Floating point multiply, stage 3
Table 14: UltraSparc functional units
Instruction type   Definition
DIVF               FPDIV^12
DIVD               FPDIV^22
FP ADD             FPADDX1, FPADDX2, FPADDX3
FP MULT            FPMULX1, FPMULX2, FPMULX3
Table 15: UltraSparc instruction types
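To show how such a description determines timing, the following sketch schedules a sequence of instructions against the UltraSparc units and instruction types listed above. It is not the FP module: the data structures, the function name, and the assumptions of in-order issue, at most one instruction issued per cycle, and no data hazards are all simplifications made for this illustration. Only structural hazards are modeled.

```python
# One of each functional unit, as in the functional unit table.
UNITS = {"FPDIV": 1, "FPADDX1": 1, "FPADDX2": 1, "FPADDX3": 1,
         "FPMULX1": 1, "FPMULX2": 1, "FPMULX3": 1}

# Each instruction type is a list of stages; each stage lists the units it holds.
TYPES = {
    "DIVF": [["FPDIV"]] * 12,
    "DIVD": [["FPDIV"]] * 22,
    "FP ADD": [["FPADDX1"], ["FPADDX2"], ["FPADDX3"]],
    "FP MULT": [["FPMULX1"], ["FPMULX2"], ["FPMULX3"]],
}

def schedule(program):
    """Return the cycle at which each instruction completes."""
    busy = {}            # (cycle, unit) -> number of units in use
    next_issue = 0       # earliest cycle at which the next instruction may issue
    done = []
    for name in program:
        stages = TYPES[name]
        start = next_issue
        # Structural hazard: delay issue until every stage finds a free unit.
        while any(busy.get((start + i, u), 0) >= UNITS[u]
                  for i, stage in enumerate(stages) for u in stage):
            start += 1
        for i, stage in enumerate(stages):
            for u in stage:
                busy[(start + i, u)] = busy.get((start + i, u), 0) + 1
        done.append(start + len(stages))
        next_issue = start + 1
    return done
```

Under these assumptions, three back-to-back adds complete on consecutive cycles, reflecting the fully pipelined adder, while a second double-precision divide cannot start until the first releases the divide unit.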
7.6.2 Pipeline Description for the MIPS R4000 Processor
The MIPS R4000, on the other hand, has eight functional units. Floating point addition,
multiplication, and division use up to two different units in each stage [11]. Table 16
and Table 17 describe the R4000 functional units and instruction types.
Unit   Description
A      Mantissa Add stage
D      Divide pipeline stage
E      Exception test stage
M      First stage of multiplier
N      Second stage of multiplier
R      Rounding stage
S      Operand shift stage
U      Unpack FP numbers
Table 16: MIPS R4000 functional units
Instruction type   Definition
FP DIV             U, A, R, D^28, D+A, (D+R)^2, D+A, D+R, A, R
FP ADD             U, S+A, A+R, R+S
FP MULT            U, E+M, M^3, N, N+A, R
Table 17: MIPS R4000 instruction types
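The stage notation used in these tables (commas separate stages, a plus sign joins units used in the same stage, and a superscript repeats a stage) can be expanded mechanically. The sketch below writes superscripts with a caret, as in D^28; this textual syntax and the function name are assumptions of the sketch and are not the format of the FP module's pipeline description file, which is defined in Appendix D.

```python
import re

# A token is a unit list such as "D+A", optionally parenthesized,
# optionally followed by "^n" to repeat the stage n times.
_TOKEN = re.compile(r"\(?([A-Z0-9+]+)\)?(?:\^(\d+))?")

def parse_stages(definition: str):
    """Expand a stage definition into one list of unit names per cycle."""
    stages = []
    for token in definition.replace(" ", "").split(","):
        match = _TOKEN.fullmatch(token)
        if match is None:
            raise ValueError(f"unrecognized stage: {token!r}")
        units = match.group(1).split("+")
        repeat = int(match.group(2) or 1)
        stages.extend(list(units) for _ in range(repeat))
    return stages

r4000_mult = parse_stages("U, E+M, M^3, N, N+A, R")
r4000_div = parse_stages("U, A, R, D^28, D+A, (D+R)^2, D+A, D+R, A, R")
```

Expanded this way, the R4000 multiply occupies eight cycles of stages and the divide occupies 38.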
7.6.3 Pipeline Description for the MIPS R10000 Processor
The MIPS R10000 floating point pipeline looks much like that of the UltraSparc.
It also has three-stage fully pipelined adder and multiplier pipelines, although
divide is done in a unit which shares its first and last stages with the multiplier pipeline
[43]. Table 18 and Table 19 describe the R10000 functional units and instruction types.
Unit   Description
AA     Align
AD     Add/N
AP     Pack addition results
MU     Multiply
MS     Sum/N
MP     Pack multiply results
DD     Divide
Table 18: MIPS R10000 functional units
Instruction type   Definition
DIVF               MU, DD^12, MP
DIVD               MU, DD^19, MP
FPADD              AA, AD, AP
FPMULT             MU, MS, MP
Table 19: MIPS R10000 instruction types
7.7 The DLXsim Simulator
The DLXsim simulator [11] provides the base upon which the interval extensions are
constructed. This simulator is easily enhanced. A core simulation module, sim.c, is
responsible for most of the simulation. An assembler module, asm.c, is responsible for
converting DLX source code into a format used by the simulation module. Since it
accepts source files as input, no separate assembler is required. DLXsim is also built
around a Tcl scripting engine, so programmable macros are also available. Since the
DLX architecture is a RISC design intentionally similar to popular modern processors,
it is likely that changes made to it would apply to other processors.
7.8 The GCC-DLX C Compiler
The GCC-DLX C compiler provides the compilation technology used with the interval
extensions. This is simply a port of the GNU C compiler to the DLX processor. GCC-DLX
was actually designed for a slightly different DLX simulator called FAST (which
is optimized for speed and not as easy to modify as DLXsim).
GCC-DLX provides a powerful feature that allows assembly code to be inserted
directly into C programs and directly access the contents of variables from registers.
GCC-DLX also provides a linker that allows multiple DLX object modules to be linked
into a single executable. GCC-DLX is also portable to many platforms.
7.9 Integrating FP, DLXsim, and GCC-DLX
The FP module is designed specifically to support this thesis and consists of about
a thousand lines of C code. Integrating the FP module requires that only a few hundred
lines be changed in DLXsim, mostly in the core simulation module (a few functions
and data structures are replaced with simpler FP calls and callback functions). This is
actually quite straightforward.
DLXsim itself has many more problems, however. The ".half" directive,
required to support "short" data types, is not implemented. GCC-DLX requires that
".proc" and ".endproc" directives be added. DLXsim does not work properly on a
little-endian machine; additional support is required for byte and word swapping. DLXsim
also does not properly reinitialize itself when a new executable is loaded, although the
safest solution is to simply rerun the simulator in this case.
GCC-DLX also has a number of problems. No scratch registers are provided,
which forces extraneous stack accesses to needlessly save and restore register contents.
Its notion of where signed values (as opposed to unsigned values) are required does not
always correspond to that of DLXsim. The GCC-DLX linker does not always link
library files correctly, and several of the supplied library functions are not considered
by DLXsim to be valid.
A number of small fixes were made to the versions of DLXsim and GCC-DLX
used in support of this thesis. Most were corrected before the interval enhancements
were added, although some of the flaws had not been discovered at that time. None of
the changes were particularly difficult to make, since both applications were designed
to be easy to modify. However, the number of flaws was almost unacceptably high.
This is one of the risks of using free software.
7.10 Enhanced DLX
Implementing the proposed extensions requires about a thousand lines of new or
changed code in the DLXsim simulator. Since the extensions can be specified explicitly
through inline assembly directives, a lower bound can be placed on the speedup without
any changes in GCC-DLX. GCC-DLX could be modified to generate the extensions
on its own, in which case further speedup would be realized as extensions such as
conditional moves are applied to non-interval operations.
Since traditional DLX does not provide any way to set rounding modes, a write
floating point control word (WFPCW) instruction is also present in the enhanced DLX
simulator. This instruction causes the entire DLX processor to stall until all floating
point operations already issued have been allowed to complete. Enhanced DLX cur-
rently stalls the entire processor whenever any floating point operation causes a stall.
Enhanced DLX assumes that all instructions execute in a single cycle except
when hazards result in a stall or IIMD is issued and both operands cross zero. For IIMD
and IIDD, this implies both register renaming support and four read ports or an extra
set of floating point register state in order to avoid read and write port limitations on the
register bank. Without these technologies, these operations take several extra cycles to
complete.
7.11 Implementing Interval Operations in Enhanced DLX
The basic interval data type and a fairly complete set of interval operations are defined
in a single interval support module. The module is implemented as a single header file.
All of the operations are implemented as macros in order to give the compiler every
opportunity to optimize the code that uses them. Most operations have several variants.
These allow specific DLX enhancements to be tested. Thus, one module implements
interval operations on the base DLX (traditional DLX with a WFPCW instruction) or
on various types of enhanced DLX architectures which implement one or more of the
proposed extensions. The code relies heavily on the inline assembly capability of GCC-DLX,
both to access the enhanced operations and to set the base DLX rounding mode.
The operations are hand optimized for different DLX variations.
The interval data types are "REAL", which is defined as a double-precision
number, and "struct Interval", which is defined as in Figure 4. The rest of the macros
are detailed in Appendix E. Square root is not currently specified, nor are trigonometric
functions. Square is currently only optimized for base DLX.
Chapter 8
Performance Evaluation
A small number of interval applications have been implemented and tested using the
enhanced DLX simulator and the interval support module. The results of five have
been analyzed in detail. The proposed extensions made some interval applications at
least 50% faster than equivalent hand optimized interval code on a traditional DLX
processor. An interval application created by strictly replacing double precision floating
point code with interval operations also ran almost 50% faster on enhanced DLX.
Three interval Newton applications, which balance interval and non-interval calcula-
tions, averaged around 35% faster.
8.1 Interval Reference Applications
Of the five reference applications, three are instances of using the interval Newton
method to obtain the roots of equations. The equations used are shown in Figure 8. The
first equation has four roots at 0, 3, 4, and 5. The second has two real roots at 1 and
approximately 0.888. The third has no real roots. These particular equations are inter-
esting because in 1983 researchers used them as an example, stating that a general
equation solving program could not be used to find all of these roots [36].
$f_1(x) = x^4 - 12x^3 + 47x^2 - 60x$
$f_2(x) = x^4 - 12x^3 + 47x^2 - 60x + 24$
$f_3(x) = x^4 - 12x^3 + 47x^2 - 60x + 24.1$
Figure 8: Problems to be solved using the interval Newton method
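For readers unfamiliar with the method, a compact one-dimensional interval Newton iteration is sketched below. It is not the reference application and not the interval support module: it is a simplified illustration that emulates directed rounding by widening every result by one unit in the last place with math.nextafter (Python 3.9 or later), bisects whenever the derivative enclosure contains zero, and reports the midpoints of the final enclosures. All function names, the search range [-1, 6], and the tolerances are assumptions of this sketch.

```python
import math

def _out(lo, hi):
    """Widen [lo, hi] by one ulp on each side to cover rounding error."""
    return (math.nextafter(lo, -math.inf), math.nextafter(hi, math.inf))

def iadd(a, b):
    return _out(a[0] + b[0], a[1] + b[1])

def imul(a, b):
    p = (a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1])
    return _out(min(p), max(p))

def idiv(a, b):
    """Interval quotient; b must not contain zero."""
    q = (a[0] / b[0], a[0] / b[1], a[1] / b[0], a[1] / b[1])
    return _out(min(q), max(q))

def horner(coeffs, x):
    """Enclose a polynomial (highest power first) over the interval x."""
    acc = (coeffs[0], coeffs[0])
    for c in coeffs[1:]:
        acc = iadd(imul(acc, x), (c, c))
    return acc

def interval_newton(coeffs, lo, hi, tol=1e-10):
    """Return approximate real roots of the polynomial in [lo, hi]."""
    n = len(coeffs) - 1
    dcoeffs = [c * (n - i) for i, c in enumerate(coeffs[:-1])]
    roots, work = [], [(lo, hi)]
    while work:
        x = work.pop()
        fx = horner(coeffs, x)
        if fx[0] > 0.0 or fx[1] < 0.0:
            continue                         # enclosure excludes zero: no root
        width, mid = x[1] - x[0], 0.5 * (x[0] + x[1])
        if width < tol:
            if all(abs(mid - r) > 1e-6 for r in roots):
                roots.append(mid)            # skip duplicates of a known root
            continue
        d = horner(dcoeffs, x)
        if d[0] <= 0.0 <= d[1]:
            work += [(x[0], mid), (mid, x[1])]   # cannot divide: bisect
            continue
        step = idiv(horner(coeffs, (mid, mid)), d)
        cand = _out(mid - step[1], mid - step[0])
        new = (max(x[0], cand[0]), min(x[1], cand[1]))
        if new[0] > new[1]:
            continue                         # empty intersection: no root
        if new[1] - new[0] > 0.75 * width:   # too little progress: bisect
            m2 = 0.5 * (new[0] + new[1])
            work += [(new[0], m2), (m2, new[1])]
        else:
            work.append(new)
    return sorted(roots)

roots_f1 = interval_newton([1, -12, 47, -60, 0], -1.0, 6.0)
roots_f2 = interval_newton([1, -12, 47, -60, 24], -1.0, 6.0)
roots_f3 = interval_newton([1, -12, 47, -60, 24.1], -1.0, 6.0)
```

The key property is that a subinterval is discarded only when the enclosure proves that it contains no root, which is how the method can report that the third equation has no real roots in the search range.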
The other two reference applications are an interval stress test that uses all of
the interval operations provided and a math-intensive program that was originally writ-
ten using double-precision numbers and then converted to use intervals through straight
replacements. The former ensures that all of the extensions are tested. The latter demonstrates
the required overhead of using intervals instead of real numbers.
8.2 Interval Stress Test Results
Two base configurations are used to evaluate the interval stress test. Configuration A is
the base DLX. Configuration B adds the conditional move instructions. Each of the
proposed new instructions is then individually added to each. The results are as shown
in Table 20.
                          Configuration A                  Configuration B
Enhancement   Data Size   Code Size   Cycles   Speedup     Code Size   Cycles   Speedup
None          320         3232        320184   1           3208        317184   1
NEGSD         320         3212        312984   1.023       3188        309984   1.023
IIDD          304         2592        310778   1.030       2568        307778   1.031
HALFD         312         3220        310281   1.032       3196        307281   1.032
IIMD          320         2700        301480   1.062       2676        298480   1.063
SRND          320         3200        290584   1.102       3176        287584   1.103
IABSD         320         3008        287175   1.115       2964        277875   1.141
All           296         1768        212862   1.504       1724        212862   1.490
Table 20: Interval stress test results
The high speedup from IABSD is partly attributable to the fact that mignitude
and magnitude were used an abnormally large number of times (each was used as
many times as addition or subtraction). It may also be partly attributable to the fact that
the base implementations of these operations, which do not require setting of rounding
modes, were implemented in C, not DLX assembly language.
Further analysis of the results reveals that the IIMD operation saves 18704
cycles. Since the stress test performs 1800 multiplies, the average savings per multiply
are 10.39 cycles, meaning that for certain inputs, 11 or more cycles are saved. The base
interval multiply always takes the same number of cycles to complete and the enhanced
DLX version can be issued once every three cycles if there are no data dependencies,
so the "best-case" speedup from IIMD is at least (11 + 3)/3 = 4.6667, and is
probably higher. Previous studies corroborate this claim [39].
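The arithmetic behind this estimate can be reproduced directly from the Configuration A cycle counts in Table 20. The variable names below are illustrative only.

```python
base_cycles = 320184       # Configuration A, no enhancements
iimd_cycles = 301480       # Configuration A, IIMD only
multiplies = 1800          # interval multiplies performed by the stress test

cycles_saved = base_cycles - iimd_cycles
saved_per_multiply = cycles_saved / multiplies

# If at least 11 cycles are saved on some inputs and the enhanced multiply
# issues every 3 cycles, the base multiply takes at least 11 + 3 cycles there.
best_case_speedup = (11 + 3) / 3
```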
8.3 Interval Newton Test Results
Only one configuration is required for the interval Newton tests. These tests do not use
mignitude or magnitude, and do not benefit from conditional moves or from IABSD.
They also do not benefit from NEGSD since they do not perform interval negation. The
results are shown in Table 21.
Enhancement   Data Size   Code Size   Cycles   Speedup
Roots of f1
None          208         6092        57930    1
HALFD         200         6080        56334    1.028
IIDD          192         5400        53365    1.086
IIMD          192         3236        51976    1.115
SRND          208         6008        51694    1.121
All           168         2444        42541    1.362
Roots of f2
None          224         6108        31566    1
HALFD         216         6096        30698    1.028
IIDD          208         5416        29175    1.082
IIMD          208         3252        28334    1.114
SRND          224         6024        28216    1.119
All           184         2460        23307    1.354
Roots of f3
None          224         6108        27431    1
HALFD         216         6096        26689    1.028
IIDD          208         5416        25307    1.084
IIMD          208         3252        24683    1.111
SRND          224         6024        24530    1.118
All           184         2460        20280    1.353
Table 21: Interval Newton test results
8.4 Interval Replacement Test Results
The interval replacement test also required only one base configuration. In addition to
mignitude, magnitude, and negation, this test does not use midpoint. It thus does not
benefit from conditional moves, IABSD, NEGSD, or HALFD. Its results are shown in
Table 22.
Enhancement   Data Size   Code Size   Cycles   Speedup   Cycle Cost   Code Cost
None          48          1228        9017     1         4.290        2.843
IIDD          32          904         8111     1.112     3.859        2.093
IIMD          48          972         8017     1.125     3.814        2.250
SRND          48          1212        7317     1.232     3.481        2.806
All           24          620         6108     1.476     2.906        1.435
Reference     32          432         2102     N/A       N/A          N/A
Table 22: Interval replacement test results
Table 22 also shows the performance of the double-precision implementation of
the reference application. The ratio of the performance of interval to real arithmetic on
an enhanced DLX is only 2.91. What is surprising is that the ratio is only 4.29 on the
base DLX. This is not caused by the short floating-point pipelines used in enhanced
DLX. Changing the latency of multiplication and addition to 5 cycles raises execution
time of the reference implementation to 2500 cycles and execution time of the interval
implementations to 6708 and 10217 cycles. The performance ratios actually drop to
2.68 and 4.09, so in this case the longer pipeline favors the interval implementations.
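The ratios quoted above follow from dividing the interval cycle counts by the cycle count of the double-precision reference. A short check using the figures from Table 22 and from this paragraph is given below; the function name is hypothetical.

```python
def cycle_cost(interval_cycles: int, reference_cycles: int) -> float:
    """Ratio of interval execution time to double-precision execution time."""
    return interval_cycles / reference_cycles

# 3-cycle add/multiply latency (Table 22): reference takes 2102 cycles.
enhanced_3 = round(cycle_cost(6108, 2102), 2)
base_3 = round(cycle_cost(9017, 2102), 2)

# 5-cycle add/multiply latency: reference takes 2500 cycles.
enhanced_5 = round(cycle_cost(6708, 2500), 2)
base_5 = round(cycle_cost(10217, 2500), 2)
```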
8.5 Testing Summary
The utility of many of the proposed extensions is tied very much to the use of particular
interval operations. When the other interval extensions are used, even conditional moves
affect only the magnitude and mignitude interval operations.
Some of the traditional interval applications, such as interval Newton method
from [10], have been converted to work with the enhanced DLX simulator and interval
libraries. However, they took over 20 times as many cycles to complete. The resultant
speedup from all of the enhancements was just over 1.02. This indicates that these
applications have considerable overhead not caused by the interval operations, which is
easily verified by simple inspection of the source code.
Chapter 9
Conclusions and Future Research
A few minor additions to current processor designs can improve the performance of
interval arithmetic significantly. These enhancements also bring the performance of
interval arithmetic to well within a factor of five of the performance of double precision
implementations. This is an important accomplishment, since it has been speculated
that this is the threshold for widespread acceptance of intervals.
9.1 Faster Interval Calculations
After extrapolating slightly on current technology trends, it has been demonstrated that
improving the handling of rounding modes and adding a few operations that help set up
interval operations allow even hand optimized interval applications to run 35% to 50%
faster. Previous studies estimate the execution time of current fast interval libraries at
10 to 30 times that of equivalent floating point operations. Other studies estimate the
overhead at 20 to 100 times, presumably for the more general purpose interval libraries.
Using these extensions, intervals can thus be processed 3 to 40 times faster than with
current interval libraries on existing hardware.
For even the hand-optimized libraries to be as fast as they are, they have to be
expanded inline. Without interval extensions, this causes a significant increase in code
size. The interval extensions allow interval operations to be encoded as very small
blocks of instructions that are simpler and use less register resources. This can help
interval-enhanced compilers generate better code. It is important that the code can
expand inline, since much of the potential performance benefits from such compilers
will be lost if many additional function calls are needed. Even functions that complete
as quickly as short, simple blocks of inline code will cause calling functions to sacrifice
some scratch registers and the opportunity to schedule operations to be performed in
parallel.
9.2 Replacing Floating Point Numbers with Intervals
The proposed extensions reduce the total execution time of interval applications to
within a factor of three of their floating point counterparts. Analyzing the resulting
code by hand reveals that this is close to the expected relative throughput of individual
interval operations, so this ratio should hold for applications of different sizes.
Even without the interval extensions, careful use of hand optimized libraries
has been shown to bring the overhead for intervals down to under 4.5 times that of
floating point numbers. Although this requires inline expansion and causes code size to
grow significantly, it is an important milestone in interval performance. Experiments
have shown that the ratio is not drastically higher with longer floating point pipelines,
and is even lower in some cases. Therefore, it can be reasonably expected that interval
libraries for existing processors can be created now such that the execution time of the
resulting interval applications is 4 to 5 times that of floating point. Similar performance
can be expected for interval-enhanced compilers, since they can do at least as well by
simply copying the hand optimized libraries, and may be able to do more optimizations
afterward.
9.3 Dedicated versus General Purpose Enhancements
The original design goal for the interval enhancements was a set of general purpose
extensions that make intervals faster but do not have much impact on processor design
and are useful to other applications. Over time, the interval enhancements have evolved
into their current state. Only the conditional move instructions remained faithful to the
original design premise, and these are widely available anyway. They also are not used
very much in interval operations if the other extensions are provided.
The rounding mode improvements also meet these criteria, but implementing
them is likely to be done in the larger context of putting all of the processor state into
the pipeline and removing the rest of the processor state dependencies. Suddenly, the
design impact is not so trivial. Another issue with the rounding mode improvements is
their impact on multiple-issue processors, especially when speculative and out-of-order
execution is considered. By requiring many additional state transitions, this simple
solution could quickly become quite complex.
An alternative is to abandon the idea of making the extensions available for
general use and simply add a full set of interval operations while still leaving the basic
processor architecture unchanged. These could handle special cases without causing
system traps, and would not depend as much on system-wide services like register
renaming. They could also send operands to multiple execution units in parallel. Since
they can execute atomically, intermediate state would not need to be held in registers.
Interval multiply and divide could latch their intermediate state and thus need not
destroy the contents of the source registers.
9.4 Opportunities for Future Research
Clearly, an area that deserves further study is where this thesis converges with some
of the previous research into dedicated processors and functional units. Adding interval
instructions such as add and multiply can avoid many of the limitations of dynamic
rounding modes and of instructions that merely set up state for future instructions. The
lesson to be learned from this thesis is to keep the impact on processors to a minimum.
The existing hardware can typically process intervals with the required performance as
long as steps are taken to keep it busy.
Another area that deserves study is the effect of multiple-issue, superscalar, and
out-of-order execution on these extensions. Dynamic scheduling should be considered
as well. Currently, the enhanced DLXsim only supports simple forwarding.
Square, square root, and other operations such as trigonometric functions ought
to be considered in light of this new approach to intervals. Speeding up these interval
operations has not received much attention in the past.
Finally, the FP module deserves to be supported and made public. It is a very
powerful tool that allows the impact of new pipeline designs to be evaluated in mere
seconds.
Bibliography
[1] G. Alefeld and J. Herzberger, Introduction to Interval Computations, New York:
Academic Press, 1983.
[2] D. N. Arnold, "Two disasters Caused by Computer Arithmetic Errors,"
available from Internet URL http://www.math.psu.edu/dna/455.j96/
disasters.html, February, 1997.
[3] D. Chiraev and G. W. Walster, "Interval Arithmetic Specification," available
from Internet URL http://www.mscs.mu.edu/~globsol/readings.html, 1998.
[4] B. W. Char et al., "A Tutorial Introduction to Maple," Journal of Symbolic
Computation, vol. 2, pp. 179-200, 1986.
[5] G. F. Corliss, "Comparing Software Packages for Interval Arithmetic," in
Abstracts of the International Symposium on Scientific Computing, Computer
Arithmetic, and Validated Numerics, 1993.
[6] G. F. Corliss, "Industrial Applications of Interval Techniques," in Computer
Arithmetic and Self-Validating Numerical Methods, (C. Ullrich, ed.), pp. 91-
113, Boston: Academic Press, 1990.
[7] S. Furber, ARM System Architecture, New York: Addison-Wesley, 1996.
[8] D. Goldberg, "What Every Computer Scientist Should Know About Floating-
Point Arithmetic," ACM Computing Surveys, vol. 23, pp. 5-48, 1991.
[9] T. R. Halfhill, "Beyond Pentium II," Byte, pp. 80-86, December, 1997.
[10] R. Hammer, M. Hocks, U. Kulisch, and D. Ratz, C++ Toolbox for Verified
Computing, Berlin: Springer-Verlag, 1995.
[11] J. L. Hennessy and D. A. Patterson, Computer Architecture, A Quantitative
Approach, 2nd edition, San Mateo: Morgan Kaufmann Publishers, 1996.
[12] A. M. Holler, "Optimization for a Superscalar Out-of-order Machine,"
Proceedings of the 29th Annual IEEE/ACM International Symposium on
Microarchitecture, pp. 336-348, 1996.
[13] IEEE, "IEEE Standard for Binary Floating Point Arithmetic," Sigplan Notices,
vol. 22, no. 2, pp. 9-25, 1985.
[14] R. B. Kearfott, M. Dawande, K. Du, and C. Hu, "Algorithm 737: INTLIB: A
Portable FORTRAN 77 Interval Standard Function Library," ACM
Transactions on Mathematical Software, vol. 20, pp. 447-459, 1994.
[15] R. B. Kearfott and V. Kreinovich, "Applications of Interval Computations: An
Introduction," in Applications of Interval Computations (R. B. Kearfott and V.
Kreinovich, eds.), pp. 1-21, Kluwer Academic Publishers, 1996.
[16] R. B. Kearfott et al., "A Specific Proposal for Interval Arithmetic in
FORTRAN," available from Internet URL http://interval.usl.edu/F90/
f96-pro.asc, March, 1996.
[17] R. Klatte, U. Kulisch, A. Wiethoff, C. Lawo, and M. Rauch, C-XSC: A C++
Class Library for Extended Scientific Computing, Springer-Verlag, 1993.
[18] O. Knüppel, "PROFIL/BIAS - A Fast Interval Library," Computing, vol. 53,
pp. 277-288, 1994.
[19] M. Koshelev and V. Kreinovich, "Interval Computations," available from
Internet URL http://cs.utep.edu/interval-comp.html, 1997.
[20] R. B. Lee, "Accelerating Multimedia with Enhanced Microprocessors," IEEE
Micro, vol. 15, pp. 22-32, April, 1995.
[21] R. E. Moore, Interval Analysis, Englewood Cliffs: Prentice Hall, 1966.
[22] A. Peleg and U. Weiser, "MMX Technology Extension to the Intel
Architecture," IEEE Micro, vol. 16, no. 2, pp. 42-50, August, 1996.
[23] E. D. Popova, "Interval operations involving NaNs," Reliable Computing, vol.
2, pp. 161-165, 1996.
[24] D. M. Priest, "Handling IEEE 754 Invalid Operation Exceptions in Real
Interval Arithmetic," manuscript, February 13, 1997.
[25] D. Ratz, "The Effects of the Arithmetic of Vector Computers on Basic
Numerical Methods," in Contributions to Computer Arithmetic and
Self-Validating Numerical Methods, (C. Ullrich, ed.), pp. 499-514, Basel:
J. C. Baltzer AG, 1990.
[26] S. M. Rump, "Algorithms for Verified Inclusions: Theory and Practice,"
Reliability in Computing, San Diego: Academic Press, 1988.
[27] M. J. Schulte et al., "Adding Interval Support to the GNU Fortran Compiler,"
available from Internet URL http://www.eecs.lehigh.edu/~mschulte/compiler/
work-notes, January 19, 1998.
[28] M. J. Schulte and E. E. Swartzlander, Jr., "A Hardware Design and Arithmetic
Algorithms for a Variable-Precision, Interval Arithmetic Coprocessor," in
Proceedings of the 12th Symposium on Computer Arithmetic, pp. 163-171,
IEEE Computer Society Press, 1995.
[29] M. J. Schulte, K. C. Bickerstaff, E. E. Swartzlander, Jr., "Hardware Units for
Interval Multiplication," Proceedings of the 2nd Workshop of Computer
Arithmetic, Interval, and Symbolic Computations, pp. 85-87, 1996.
[30] M. J. Schulte and E. E. Swartzlander, Jr., "A Processor for Staggered Interval
Arithmetic," Proceedings of the 1995 International Conference on Application
Specific Array Processors, pp. 104-112, IEEE Computer Society Press, 1995.
[31] M. J. Schulte and E. E. Swartzlander, Jr., "Software and Hardware Techniques
for Accurate, Self-validating Arithmetic," Applications of Interval
Computations, pp. 381-404, 1996.
[32] M. J. Schulte and E. E. Swartzlander, Jr., "Variable-Precision, Interval
Arithmetic Coprocessors," Reliable Computing, vol. 2, no. 1, pp. 47-62, 1996.
[33] M. J. Schulte, A Variable-Precision, Interval Arithmetic Processor, Ph.D.
thesis, University of Texas at Austin, available from Internet URL http://
www.eecs.lehigh.edu/~mschulte/papers, 1996.
[34] M. Tremblay and J. M. O'Connor, "UltraSparc I: A Four-Issue Processor
Supporting Multimedia," IEEE Micro, vol. 16, pp. 42-50, April, 1996.
[35] S. Vassiliadis et al., "SCISM: A scalable compound instruction set machine,"
IBM Journal of Research and Development, vol. 38, no. 1, pp. 59-77, January,
1994.
[36] G. W. Walster, "Interval Arithmetic: The New Floating-Point Arithmetic
Paradigm," available from Internet URL http://www.mscs.mu.edu/~globsol/
readings.html, March, 1998.
[37] G. W. Walster, "Philosophy and Practicalities of Interval Arithmetic,"
Reliability in Computing, pp. 309-323, San Diego: Academic Press, 1988.
[38] G. W. Walster, "Stimulating Hardware and Software Support for Interval
Arithmetic," Applications of Interval Computations, pp. 405-416, 1996.
[39] G. S. Williams and M. J. Schulte, "Architectural Support for Interval
Arithmetic: Faster interval math using existing hardware," available from
Internet URL http://www.eecs.lehigh.edu/~caar, February 28, 1997.
[40] G. S. Williams, "Improving Interval Arithmetic Through Interval-Friendly
Operations," available from Internet URL http://www.eecs.lehigh.edu/~caar,
December, 1996.
[41] G. S. Williams, "Cellular Communication Networks," available from Internet
URL http://www.eecs.lehigh.edu/~caar, December, 1995.
[42] S. Wolfram, Mathematica: A System for Doing Mathematics by Computer,
Addison-Wesley, 1988.
[43] K. C. Yeager, "The MIPS R10000 Superscalar Microprocessor," IEEE Micro,
vol. 16, no. 2, pp. 28-40, April, 1996.
Appendix A
C Implementations of Interval Operations
A.1 Interval Addition in C
void IAdd(struct Interval *r, struct Interval *a, struct Interval *b)
{
    Set_Rounding_Direction(DOWN);
    r->infimum = a->infimum + b->infimum;
    Set_Rounding_Direction(UP);
    r->supremum = a->supremum + b->supremum;
    Set_Rounding_Direction(DEFAULT);
}
A.2 Interval Subtraction in C
void ISub(struct Interval *r, struct Interval *a, struct Interval *b)
{
    Set_Rounding_Direction(DOWN);
    r->infimum = a->infimum - b->supremum;
    Set_Rounding_Direction(UP);
    r->supremum = a->supremum - b->infimum;
    Set_Rounding_Direction(DEFAULT);
}
A.3 Interval Negation in C
void INegate(struct Interval *r, struct Interval *a)
{
    Set_Rounding_Direction(DOWN);
    r->infimum = -(a->supremum);
    Set_Rounding_Direction(UP);
    r->supremum = -(a->infimum);
    Set_Rounding_Direction(DEFAULT);
}
A.4 Interval Multiplication in C
void IMul(struct Interval *r, struct Interval *a, struct Interval *b)
{
    Set_Rounding_Direction(DOWN);
    if (a->infimum > 0) {
        if (b->infimum > 0) {
            r->infimum = a->infimum * b->infimum;
            Set_Rounding_Direction(UP);
            r->supremum = a->supremum * b->supremum;
        } else if (b->supremum < 0) {
            r->infimum = a->supremum * b->infimum;
            Set_Rounding_Direction(UP);
            r->supremum = a->infimum * b->supremum;
        } else {
            r->infimum = a->supremum * b->infimum;
            Set_Rounding_Direction(UP);
            r->supremum = a->supremum * b->supremum;
        }
    } else if (a->supremum < 0) {
        if (b->infimum > 0) {
            r->infimum = a->infimum * b->supremum;
            Set_Rounding_Direction(UP);
            r->supremum = a->supremum * b->infimum;
        } else if (b->supremum < 0) {
            r->infimum = a->supremum * b->supremum;
            Set_Rounding_Direction(UP);
            r->supremum = a->infimum * b->infimum;
        } else {
            r->infimum = a->infimum * b->supremum;
            Set_Rounding_Direction(UP);
            r->supremum = a->infimum * b->infimum;
        }
    } else {
        if (b->infimum > 0) {
            r->infimum = a->infimum * b->supremum;
            Set_Rounding_Direction(UP);
            r->supremum = a->supremum * b->supremum;
        } else if (b->supremum < 0) {
            r->infimum = a->supremum * b->infimum;
            Set_Rounding_Direction(UP);
            r->supremum = a->infimum * b->infimum;
        } else {
            /* Both intervals contain zero: two candidates per bound. */
            double temp1;
            double temp2;
            temp1 = a->infimum * b->supremum;
            temp2 = a->supremum * b->infimum;
            r->infimum = (temp1 < temp2) ? temp1 : temp2;
            Set_Rounding_Direction(UP);
            temp1 = a->infimum * b->infimum;
            temp2 = a->supremum * b->supremum;
            r->supremum = (temp1 > temp2) ? temp1 : temp2;
        }
    }
    Set_Rounding_Direction(DEFAULT);
}
A.5 Interval Division in C
void IDiv(struct Interval *r, struct Interval *a, struct Interval *b)
{
    Set_Rounding_Direction(DOWN);
    if (b->infimum > 0) {
        if (a->infimum >= 0) {
            r->infimum = a->infimum / b->supremum;
            Set_Rounding_Direction(UP);
            r->supremum = a->supremum / b->infimum;
        } else if (a->supremum <= 0) {
            r->infimum = a->infimum / b->infimum;
            Set_Rounding_Direction(UP);
            r->supremum = a->supremum / b->supremum;
        } else {
            r->infimum = a->infimum / b->infimum;
            Set_Rounding_Direction(UP);
            r->supremum = a->supremum / b->infimum;
        }
    } else if (b->supremum < 0) {
        if (a->infimum >= 0) {
            r->infimum = a->supremum / b->supremum;
            Set_Rounding_Direction(UP);
            r->supremum = a->infimum / b->infimum;
        } else if (a->supremum <= 0) {
            r->infimum = a->supremum / b->infimum;
            Set_Rounding_Direction(UP);
            r->supremum = a->infimum / b->supremum;
        } else {
            r->infimum = a->supremum / b->supremum;
            Set_Rounding_Direction(UP);
            r->supremum = a->infimum / b->supremum;
        }
    } else {
        /* The divisor contains zero: return the whole real line. */
        r->infimum = MINUS_INFINITY;
        r->supremum = PLUS_INFINITY;
    }
    Set_Rounding_Direction(DEFAULT);
}
A.6 Interval Width in C
void IWidth(REAL *r, struct Interval *a)
{
    Set_Rounding_Direction(UP);
    *r = a->supremum - a->infimum;
    Set_Rounding_Direction(DEFAULT);
}
A.7 Interval Midpoint in C
void IMidpoint(REAL *r, struct Interval *a)
{
    Set_Rounding_Direction(UP);
    *r = (a->supremum - a->infimum) / 2.0;
    Set_Rounding_Direction(DOWN);
    *r = a->infimum + *r;
    Set_Rounding_Direction(DEFAULT);
}
A.8 Interval Magnitude in C
void IMagnitude(REAL *r, struct Interval *a)
{
    REAL temp1, temp2;
    temp1 = (a->infimum < 0.0) ? -(a->infimum) : a->infimum;
    temp2 = (a->supremum < 0.0) ? -(a->supremum) : a->supremum;
    *r = (temp1 > temp2) ? temp1 : temp2;
}
A.9 Interval Mignitude in C
void IMignitude(REAL *r, struct Interval *a)
{
    if ((a->infimum > 0.0) != (a->supremum > 0.0)) {
        *r = 0.0;
    } else {
        REAL temp1, temp2;
        temp1 = (a->infimum < 0.0) ? -(a->infimum) : a->infimum;
        temp2 = (a->supremum < 0.0) ? -(a->supremum) : a->supremum;
        *r = (temp2 < temp1) ? temp2 : temp1;
    }
}
Appendix B
DLX Implementations of Interval Operations
B.1 Interval Addition in DLX
iadd:   wfpcw   roundDown
        addd    f0,f4,f8
        wfpcw   roundUp
        addd    f2,f6,f10
        wfpcw   roundNear
        jr      r31
        nop
B.2 Interval Subtraction in DLX
isub:   wfpcw   roundDown
        subd    f0,f4,f10
        wfpcw   roundUp
        subd    f2,f6,f8
        wfpcw   roundNear
        jr      r31
        nop
B.3 Interval Negation in DLX
ineg:   movi2fp f8,r0
        cvtf2d  f8,f8
        wfpcw   roundDown
        subd    f0,f8,f6
        wfpcw   roundUp
        subd    f2,f8,f4
        wfpcw   roundNear
        jr      r31
        nop
B.4 Interval Multiplication in DLX
imul:   movi2fp f12,r0
        cvtf2d  f12,f12
        ged     f4,f12
        bfpt    mul_ap
        led     f6,f12
        bfpt    mul_an
        ged     f8,f12
        bfpt    mul_az_bp
        led     f10,f12
        bfpt    mul_az_bn
mul_az_bz:
        wfpcw   roundDown
        multd   f0,f4,f10
        multd   f2,f6,f8
        led     f0,f2
        bfpt    mul_azbz_i1
        sd      -8(r29),f0
        sd      -8(r29),f2
mul_azbz_i1:
        wfpcw   roundDownUp
        multd   f0,f4,f8
        multd   f2,f6,f10
        ged     f2,f0
        bfpt    mul_azbz_s2
        nop
        movd    f2,f0
mul_azbz_s2:
        j       mul_end
        ld      f0,-8(r29)
mul_az_bn:
        ; ROUNDDOWN() in delay slot
        multd   f0,f6,f8
        wfpcw   roundDownUp
        j       mul_end
        multd   f2,f4,f8
mul_az_bp:
        wfpcw   roundDownDown
        multd   f0,f4,f10
        wfpcw   roundDownUp
        j       mul_end
        multd   f2,f6,f10
mul_an:
        ; ged f8,f12 in delay slot
        bfpt    mul_an_bp
        led     f10,f12
        bfpt    mul_an_bn
mul_an_bz:
        wfpcw   roundDownDown
        multd   f0,f4,f10
        wfpcw   roundDownUp
        j       mul_end
        multd   f2,f4,f8
mul_an_bn:
        ; ROUNDDOWN() in delay slot
        multd   f0,f6,f10
        wfpcw   roundDownUp
        j       mul_end
        multd   f2,f4,f8
mul_an_bp:
        wfpcw   roundDownDown
        multd   f0,f4,f10
        wfpcw   roundDownUp
        j       mul_end
        multd   f2,f6,f8
mul_ap:
        ged     f8,f12
        bfpt    mul_ap_bp
        led     f10,f12
        bfpt    mul_ap_bn
mul_ap_bz:
        wfpcw   roundDownDown
        multd   f0,f6,f8
        wfpcw   roundDownUp
        j       mul_end
        multd   f2,f6,f10
mul_ap_bn:
        ; ROUNDDOWN() in delay slot
        multd   f0,f6,f8
        wfpcw   roundDownUp
        j       mul_end
        multd   f2,f4,f10
mul_ap_bp:
        wfpcw   roundDownDown
        multd   f0,f4,f8
        wfpcw   roundDownUp
        multd   f2,f6,f10
mul_end:
        wfpcw   roundDownNear
        jr      r31
        nop
B.5 Interval Division in DLX
idiv:   movi2fp f12,r0
        cvtf2d  f12,f12
        lhi     r1,(neginf>>16)&0xffff
        addui   r1,r1,neginf&0xffff
        ld      f14,(r1)
        lhi     r1,(posinf>>16)&0xffff
        addui   r1,r1,posinf&0xffff
        ld      f16,(r1)
        ged     f4,f12
        bfpt    div_ap
        led     f6,f12
        bfpt    div_an
        gtd     f8,f12
        bfpt    div_az_bp
        ltd     f10,f12
        bfpt    div_az_bn
        wfpcw   roundDownDown
div_az_bz:
        movd    f0,f14
        j       div_end
        movd    f2,f16
div_az_bn:
        ; ROUNDDOWN() in delay slot
        divd    f0,f6,f10
        wfpcw   roundDownUp
        j       div_end
        divd    f2,f4,f10
div_az_bp:
        wfpcw   roundDownDown
        divd    f0,f4,f8
        wfpcw   roundDownUp
        j       div_end
        divd    f2,f6,f8
div_an:
        ; gtd f8,f12 in delay slot
        bfpt    div_an_bp
        ltd     f10,f12
        bfpt    div_an_bn
div_an_bz:
        wfpcw   roundDownDown
        eqd     f8,f10
        bfpt    div_an_bznp
        eqd     f10,f12
        bfpt    div_an_bzn
        eqd     f8,f12
        bfpt    div_an_bzp
div_an_bznp:
        movd    f0,f14
        j       div_end
        movd    f2,f16
div_an_bzp:
        ; movd f0,f14 in delay slot
        wfpcw   roundDownUp
        j       div_end
        divd    f2,f6,f10
div_an_bzn:
        ; ROUNDDOWN() done already
        divd    f0,f6,f8
        j       div_end
        movd    f2,f16
div_an_bn:
        ; ROUNDDOWN() in delay slot
        divd    f0,f6,f8
        wfpcw   roundDownUp
        j       div_end
        divd    f2,f4,f10
div_an_bp:
        wfpcw   roundDownDown
        divd    f0,f4,f8
        wfpcw   roundDownUp
        j       div_end
        divd    f2,f6,f10
div_ap:
        gtd     f8,f12
        bfpt    div_ap_bp
        ltd     f10,f12
        bfpt    div_ap_bn
div_ap_bz:
        wfpcw   roundDownDown
        eqd     f8,f10
        bfpt    div_ap_bznp
        eqd     f8,f12
        bfpt    div_ap_bzp
        eqd     f10,f12
        bfpt    div_ap_bzn
div_ap_bznp:
        movd    f0,f14
        j       div_end
        movd    f2,f16
div_ap_bzn:
        ; movd f0,f14 in delay slot
        wfpcw   roundDownUp
        j       div_end
        divd    f2,f4,f8
div_ap_bzp:
        divd    f0,f4,f10
        j       div_end
        movd    f2,f16
div_ap_bn:
        ; ROUNDDOWN() done already
        divd    f0,f6,f10
        wfpcw   roundDownUp
        j       div_end
        divd    f2,f4,f8
div_ap_bp:
        wfpcw   roundDownDown
        divd    f0,f4,f10
        wfpcw   roundDownUp
        divd    f2,f6,f8
div_end:
        wfpcw   roundDownNear
        jr      r31
        nop
neginf: .word   0xfff00000, 0x00000000
posinf: .word   0x7ff00000, 0x00000000
B.6 Interval Width in DLX
iwid: wfpcw roundUp
subd f0,f6,f4
wfpcw roundNear
jr r31
nop
B.7 Interval Midpoint in DLX
imidpt: lhi     r1,(two>>16)&0xffff
        addui   r1,r1,two&0xffff
        ld      f2,(r1)
        wfpcw   roundUp
        subd    f0,f6,f4
        divd    f0,f0,f2
        wfpcw   roundDown
        addd    f0,f0,f4
        wfpcw   roundNear
        jr      r31
        nop
two:    .double 2.0
B.8 Interval Magnitude in DLX
imag:   movi2fp f0,r0
        cvtf2d  f0,f0
        ged     f4,f0
        bfpt    infOK
        led     f6,f0
        subd    f4,f0,f4
infOK:  bfpt    supOK
        gtd     f4,f6
        subd    f6,f0,f6
supOK:  bfpt    end
        movd    f0,f4
        movd    f0,f6
end:    jr      r31
        nop
B.9 Interval Mignitude in DLX
imig:   movi2fp f0,r0
        cvtf2d  f0,f0
        ged     f4,f0
        bfpt    infOK
        led     f6,f0
        bfpf    end
        subd    f4,f0,f4
        subd    f6,f0,f6
infOK:  ltd     f6,f4
        bfpt    end
        movd    f0,f6
        movd    f0,f4
end:    jr      r31
        nop
Appendix C
Definitions of the Proposed Extensions
In the definitions below, Regs[Fx]_0 denotes bit 0 (the sign bit) of register Fx, and ¬ denotes the complement of a bit.

Operation        Definition

State[round]     if (a floating point operation is being performed) {
                     Round as indicated
                     State[round] ← NextState[State[round]]
                 }

SRND mode        State[round] ← mode

MOVDT Fa, Fb     if (Flags[FP] = 1) {
                     Regs[Fa] ← Regs[Fb]
                     Regs[Fa+1] ← Regs[Fb+1]
                 }

MOVDF Fa, Fb     if (Flags[FP] = 0) {
                     Regs[Fa] ← Regs[Fb]
                     Regs[Fa+1] ← Regs[Fb+1]
                 }

IABSD Fa, Fb     Flags[FP] ← (Regs[Fa]_0 ≠ Regs[Fb]_0)
                 Regs[Fa]_0 ← 0
                 Regs[Fb]_0 ← 0

NEGSD Fa, Fb     temp ← Regs[Fb]
                 Regs[Fb] ← Regs[Fa]
                 Regs[Fa] ← temp
                 Regs[Fa]_0 ← ¬Regs[Fa]_0
                 Regs[Fb]_0 ← ¬Regs[Fb]_0

IIMD Fa, Fb      State[round] ← DownUp
                 if (Regs[Fa]_0 = 0) {
                     if (Regs[Fb]_0 = 0) {
                         /* do nothing */
                     } else if (Regs[Fb+2]_0 = 1) {
                         temp ← Regs[Fa+2]
                         Regs[Fa+2] ← Regs[Fa]
                         Regs[Fa] ← temp
                         temp ← Regs[Fa+3]
                         Regs[Fa+3] ← Regs[Fa+1]
                         Regs[Fa+1] ← temp
                     } else {
                         Regs[Fa] ← Regs[Fa+2]
                         Regs[Fa+1] ← Regs[Fa+3]
                     }
                 } else if (Regs[Fa+2]_0 = 1) {
                     if (Regs[Fb]_0 = 0) {
                         temp ← Regs[Fb+2]
                         Regs[Fb+2] ← Regs[Fb]
                         Regs[Fb] ← temp
                         temp ← Regs[Fb+3]
                         Regs[Fb+3] ← Regs[Fb+1]
                         Regs[Fb+1] ← temp
                     } else if (Regs[Fb+2]_0 = 1) {
                         temp ← Regs[Fa+2]
                         Regs[Fa+2] ← Regs[Fa]
                         Regs[Fa] ← temp
                         temp ← Regs[Fa+3]
                         Regs[Fa+3] ← Regs[Fa+1]
                         Regs[Fa+1] ← temp
                         temp ← Regs[Fb+2]
                         Regs[Fb+2] ← Regs[Fb]
                         Regs[Fb] ← temp
                         temp ← Regs[Fb+3]
                         Regs[Fb+3] ← Regs[Fb+1]
                         Regs[Fb+1] ← temp
                     } else {
                         Regs[Fa+2] ← Regs[Fa]
                         Regs[Fa+3] ← Regs[Fa+1]
                         temp ← Regs[Fb+2]
                         Regs[Fb+2] ← Regs[Fb]
                         Regs[Fb] ← temp
                         temp ← Regs[Fb+3]
                         Regs[Fb+3] ← Regs[Fb+1]
                         Regs[Fb+1] ← temp
                     }
                 } else {
                     if (Regs[Fb]_0 = 0) {
                         Regs[Fb] ← Regs[Fb+2]
                         Regs[Fb+1] ← Regs[Fb+3]
                     } else if (Regs[Fb+2]_0 = 1) {
                         temp ← Regs[Fa+2]
                         Regs[Fa+2] ← Regs[Fa]
                         Regs[Fa] ← temp
                         temp ← Regs[Fa+3]
                         Regs[Fa+3] ← Regs[Fa+1]
                         Regs[Fa+1] ← temp
                         Regs[Fb+2] ← Regs[Fb]
                         Regs[Fb+3] ← Regs[Fb+1]
                     } else {
                         R31 ← PC
                     }
                 }

IIDD Fa, Fb      State[round] ← DownUp
                 if (Regs[Fb]_0 = 0) {
                     if (Regs[Fa]_0 = 0) {
                         temp ← Regs[Fb+2]
                         Regs[Fb+2] ← Regs[Fb]
                         Regs[Fb] ← temp
                         temp ← Regs[Fb+3]
                         Regs[Fb+3] ← Regs[Fb+1]
                         Regs[Fb+1] ← temp
                     } else if (Regs[Fa+2]_0 = 1) {
                         /* do nothing */
                     } else {
                         Regs[Fb+2] ← Regs[Fb]
                         Regs[Fb+3] ← Regs[Fb+1]
                     }
                 } else if (Regs[Fb+2]_0 = 1) {
                     if (Regs[Fa]_0 = 0) {
                         temp ← Regs[Fa+2]
                         Regs[Fa+2] ← Regs[Fa]
                         Regs[Fa] ← temp
                         temp ← Regs[Fa+3]
                         Regs[Fa+3] ← Regs[Fa+1]
                         Regs[Fa+1] ← temp
                         temp ← Regs[Fb+2]
                         Regs[Fb+2] ← Regs[Fb]
                         Regs[Fb] ← temp
                         temp ← Regs[Fb+3]
                         Regs[Fb+3] ← Regs[Fb+1]
                         Regs[Fb+1] ← temp
                     } else if (Regs[Fa+2]_0 = 1) {
                         temp ← Regs[Fa+2]
                         Regs[Fa+2] ← Regs[Fa]
                         Regs[Fa] ← temp
                         temp ← Regs[Fa+3]
                         Regs[Fa+3] ← Regs[Fa+1]
                         Regs[Fa+1] ← temp
                     } else {
                         temp ← Regs[Fa+2]
                         Regs[Fa+2] ← Regs[Fa]
                         Regs[Fa] ← temp
                         temp ← Regs[Fa+3]
                         Regs[Fa+3] ← Regs[Fa+1]
                         Regs[Fa+1] ← temp
                         Regs[Fb] ← Regs[Fb+2]
                         Regs[Fb+1] ← Regs[Fb+3]
                     }
                 } else {
                     Regs[Fa] ← a_negative_value
                     Regs[Fa+2] ← a_positive_value
                     Regs[Fb] ← (0)
                     Regs[Fb+1] ← (0)
                     Regs[Fb+2] ← (0)
                     Regs[Fb+3] ← (0)
                 }

HALFD Fa, Fb     if (Regs[Fb]_exponent = (Emax + 1)) {
                     Regs[Fa] ← Regs[Fb]
                     Regs[Fa+1] ← Regs[Fb+1]
                 } else if (Regs[Fb]_exponent = Emin) {
                     Regs[Fa]_exponent ← (Emin - 1)
                     Regs[Fa]_significand_0 ← 1
                     Regs[Fa]_significand_1...N ← Regs[Fb]_significand_0...(N-1)
                     Regs[Fa+1]_0 ← Regs[Fb]_significand_N
                     Regs[Fa+1]_1...32 ← Regs[Fb]_0...31
                 } else if (Regs[Fb]_exponent = (Emin - 1)) {
                     Regs[Fa]_exponent ← (Emin - 1)
                     Regs[Fa]_significand_0 ← Regs[Fb]_significand_0
                     Regs[Fa]_significand_1...N ← Regs[Fb]_significand_0...(N-1)
                     Regs[Fa+1]_0 ← Regs[Fb]_significand_N
                     Regs[Fa+1]_1...32 ← Regs[Fb]_0...31
                 } else {
                     Regs[Fa]_exponent ← (Regs[Fb]_exponent - 1)
                     Regs[Fa]_significand ← Regs[Fb]_significand
                     Regs[Fa+1] ← Regs[Fb+1]
                 }
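
The HALFD definition above halves a value by adjusting its exponent field and not by performing a division. The following is a minimal software model of that idea in C, assuming IEEE 754 double precision stored in a 64-bit word. The function name halfd_model is hypothetical, and the bit shifted out of a subnormal significand is simply discarded here, which is a simplification made for this sketch and not a statement about the hardware's rounding.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical software model of HALFD (not the thesis's implementation). */
static double halfd_model(double x)
{
    uint64_t bits, sign, exponent, fraction;

    assert(sizeof bits == sizeof x);
    memcpy(&bits, &x, sizeof bits);
    sign     = bits & 0x8000000000000000ULL;
    exponent = (bits >> 52) & 0x7FFULL;
    fraction = bits & 0x000FFFFFFFFFFFFFULL;

    if (exponent == 0x7FF) {
        /* Infinity or NaN (exponent Emax + 1): pass through unchanged. */
    } else if (exponent == 1) {
        /* Smallest normal exponent (Emin): the result is subnormal, so the
         * implicit leading 1 becomes explicit and shifts into the fraction. */
        fraction = (fraction | (1ULL << 52)) >> 1;
        exponent = 0;
    } else if (exponent == 0) {
        /* Zero or subnormal (Emin - 1): shift the fraction right by one. */
        fraction >>= 1;
    } else {
        /* Ordinary normal number: decrement the exponent. */
        exponent -= 1;
    }
    bits = sign | (exponent << 52) | fraction;
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

For normal inputs the result is exact, which is why a midpoint computation can use such an operation in place of a full division by 2.0.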
Appendix D
Pipeline Description File
The following pipeline description file describes the floating point pipeline used in the
enhanced DLX architecture.
UNITS
FPADDX1 1
FPADDX2 1
FPADDX3 1
FPDIV 1
FPMULX1 1
FPMULX2 1
FPMULX3 1
IDIV 1
IMUL 1
ENDUNITS
USAGE
INSTRUCTION DADD
FPADDX1 1
ENDSTAGE
FPADDX2 1
ENDSTAGE
FPADDX3 1
ENDSTAGE
ENDINSTRUCTION
INSTRUCTION FADD
FPADDX1 1
ENDSTAGE
FPADDX2 1
ENDSTAGE
FPADDX3 1
ENDSTAGE
ENDINSTRUCTION
INSTRUCTION DSUB
FPADDX1 1
ENDSTAGE
FPADDX2 1
ENDSTAGE
FPADDX3 1
ENDSTAGE
ENDINSTRUCTION
INSTRUCTION FSUB
FPADDX1 1
ENDSTAGE
FPADDX2 1
ENDSTAGE
FPADDX3 1
ENDSTAGE
ENDINSTRUCTION
INSTRUCTION DMUL
FPMULX1 1
ENDSTAGE
FPMULX2 1
ENDSTAGE
FPMULX3 1
ENDSTAGE
ENDINSTRUCTION
INSTRUCTION FMUL
FPMULX1 1
ENDSTAGE
FPMULX2 1
ENDSTAGE
FPMULX3 1
ENDSTAGE
ENDINSTRUCTION
INSTRUCTION DDIV
FPDIV 1
ENDSTAGE
REPEAT 11
ENDSTAGE
ENDINSTRUCTION
INSTRUCTION FDIV
FPDIV 1
ENDSTAGE
REPEAT 21
ENDSTAGE
ENDINSTRUCTION
INSTRUCTION IMUL
IMUL 1
ENDSTAGE
ENDINSTRUCTION
INSTRUCTION UIMUL
IMUL 1
ENDSTAGE
ENDINSTRUCTION
INSTRUCTION IDIV
IDIV 1
ENDSTAGE
ENDINSTRUCTION
INSTRUCTION UIDIV
IDIV 1
ENDSTAGE
ENDINSTRUCTION
ENDUSAGE
Appendix E
Defined Interval Types and Macros

Data types                             REAL
                                       struct Interval

Initializers for static interval data  INTERVAL_INITZERO()
                                       INTERVAL_INIT()
                                       INTERVAL_INITREAL()
                                       INTERVAL_INITINVALID()

To set intervals directly              INTERVAL_SETZERO()
                                       INTERVAL_SET()
                                       INTERVAL_SETREAL()
                                       INTERVAL_SETINVALID()

Scalar decomposition                   INTERVAL_INF()
                                       INTERVAL_SUP()
                                       INTERVAL_MAGNITUDE()
                                       INTERVAL_MIGNITUDE()
                                       INTERVAL_MIDPOINT()
                                       INTERVAL_WIDTH()

Unary operations                       INTERVAL_SQUARE()
                                       INTERVAL_NEGATE()
                                       INTERVAL_ABS()

Binary operations                      INTERVAL_ADD()
                                       INTERVAL_SUBTRACT()
                                       INTERVAL_MULTIPLY()
                                       INTERVAL_DIVIDE()
                                       INTERVAL_INTERSECTION()
                                       INTERVAL_CONVEXHULL()

Unary tests                            INTERVAL_OK()
                                       INTERVAL_POSSIBLYZERO()
                                       INTERVAL_DEFINITELYZERO()

Test for scalar inclusion              INTERVAL_CONTAINS()

Set comparisons                        INTERVAL_PROPERSUBSET()
                                       INTERVAL_PROPERSUPERSET()
                                       INTERVAL_SUBSET()
                                       INTERVAL_SUPERSET()
                                       INTERVAL_DISJOINT()
                                       INTERVAL_SETEQUALS()

Possibly true comparisons              INTERVAL_POSSIBLYGREATERTHANOREQUAL()
                                       INTERVAL_POSSIBLYGREATERTHAN()
                                       INTERVAL_POSSIBLYEQUAL()
                                       INTERVAL_POSSIBLYNOTEQUAL()
                                       INTERVAL_POSSIBLYLESSTHAN()
                                       INTERVAL_POSSIBLYLESSTHANOREQUAL()

Definitely true comparisons            INTERVAL_DEFINITELYGREATERTHANOREQUAL()
                                       INTERVAL_DEFINITELYGREATERTHAN()
                                       INTERVAL_DEFINITELYEQUAL()
                                       INTERVAL_DEFINITELYNOTEQUAL()
                                       INTERVAL_DEFINITELYLESSTHAN()
                                       INTERVAL_DEFINITELYLESSTHANOREQUAL()
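
The difference between the "possibly true" and "definitely true" comparison families can be sketched in C as follows. The argument lists of the macros above are not given in this appendix, so the sketch uses hypothetical lowercase functions and an assumed struct Interval layout. It follows the usual convention that a relation is possibly true when it holds for at least one pair of points drawn from the two intervals, and definitely true when it holds for every such pair.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed layout, mirroring the struct used in Appendix A. */
struct Interval {
    double infimum;
    double supremum;
};

/* Guard used by every comparison: bounds must be ordered. */
static void require_valid(const struct Interval *x)
{
    assert(x->infimum <= x->supremum);
}

/* Some x in a and some y in b satisfy x < y. */
static bool possibly_less_than(const struct Interval *a, const struct Interval *b)
{
    require_valid(a);
    require_valid(b);
    return a->infimum < b->supremum;
}

/* Every x in a and every y in b satisfy x < y. */
static bool definitely_less_than(const struct Interval *a, const struct Interval *b)
{
    require_valid(a);
    require_valid(b);
    return a->supremum < b->infimum;
}

/* The intervals share at least one point. */
static bool possibly_equal(const struct Interval *a, const struct Interval *b)
{
    require_valid(a);
    require_valid(b);
    return a->infimum <= b->supremum && b->infimum <= a->supremum;
}

/* Both intervals are the same single point. */
static bool definitely_equal(const struct Interval *a, const struct Interval *b)
{
    require_valid(a);
    require_valid(b);
    return a->infimum == a->supremum && b->infimum == b->supremum
        && a->infimum == b->infimum;
}

/* Set equality: identical bounds, regardless of width. */
static bool set_equals(const struct Interval *a, const struct Interval *b)
{
    require_valid(a);
    require_valid(b);
    return a->infimum == b->infimum && a->supremum == b->supremum;
}
```

Under these definitions a non-degenerate interval is set-equal to itself but not definitely equal to itself, which is why the table lists set comparisons separately from the two comparison families.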
Vita
Gerald Shawn Williams, the son of James and Julia Williams, was born in Somerville,
New Jersey on July 5, 1966. In 1988, he received the degree of Bachelor of Science in
Computer Science at Stevens Institute of Technology in Hoboken, New Jersey. During
his undergraduate studies, he interned at Ciba-Geigy Pharmaceuticals in Summit, New
Jersey, Stone and Webster Engineering Management Consultants in New York City,
and Canadian Imperial Bank of Commerce in New York City. He has since worked as a
design engineer for AT&T Bell Laboratories, now part of Lucent Technologies. He has
filed for four patents related to software technology. He is currently pursuing a Masters
Degree in Computer Engineering at Lehigh University in Bethlehem, Pennsylvania. He
resides in Macungie, Pennsylvania with his wife Cynthia.