Abstract-The dot product operation is very prevalent in scientific computation and has therefore been incorporated as a primitive operation in some languages. The implementation of the dot product operation by a sequence of IEEE standard multiplications and additions does not prevent a substantial accumulation of the round-off errors or warn the user about a catastrophic cancellation. We present the design of a double precision dot product operation employing sticky accumulation, where the final rounded result is validated by raising a new exception flag if the result incurred catastrophic cancellation. Sticky accumulation can be implemented in a pipeline or parallel environment to sustain double precision with an extended control of the error. Our design allows that, in the absence of catastrophic cancellation, one ulp accuracy is guaranteed.
INTRODUCTION
THE process commonly used in floating point units to perform the dot product operation almost never yields the correct rounding of the mathematical, e.g., infinitely precise, operation [1] . More importantly, there is no high-level hardware-supported method to efficiently estimate the error. Although some results are available from numerical analysis research, the quantities involved in the analysis are too far from the machine floating point model. The user will most probably skip the validation of his code rather than pay the computing time needed to check it. This work is intended to assist a numerical analyst with a new efficient hardware support for error control and precision enhancement of accumulation.
Extending the working precision from the double format to the quad format (see Fig. 1 ) is not sufficient to compensate both for unmonitored cancellations of leading bits set to 0 and for accumulations of round-off errors. Going to quad precision will not give any additional information to the user on the correctness of the results and extending overall precision is very expensive in running time, circuit area, and memory usage.
At the machine level, the standard IEEE 754 rounding is often achieved by performing the internal computations with some extended registers larger than the output format-three additional bits are introduced: the guard, round, and sticky bits. With these extensions, the IEEE standard operations are done as if the result were computed to infinite precision prior to rounding. However, the only way to guarantee the correctness of the dot product operation in hardware is to compute the partial sums with a multiple precision exact representation [2] , [3] , [4] , [5] . In fact, many articles propose different methods to compute as fast as possible and with limited resources an error-free dot product operation rather than to monitor and control the error. Computing all the digits of each partial sum to keep only the most 53 significant bits is not really cost efficient. A multiprecision dot product is not a common tool:
Its hardware implementation is usually slower than a standard floating point multiply and add scheme; it needs a very large amount of hardware preventing its duplication in a vector processor or a pipelined unit. We should restrain the use of such time consuming exact processes to the mathematical expressions for which other tools are insufficient.
Section 2 presents our faithful rounding mechanism and an introduction to the analysis and the monitoring of the cancellation phenomenon. The third section describes our two different timeand-space-efficient self-validating dot product schemes, sign segregated and sticky accumulation. We discuss in Section 4 considerations related to the usage of the sticky accumulator in modern computers.
ERROR MONITORING
The IEEE 754 standard for floating point operation [7] specifies four different rounding modes:
• RN-Round to Nearest even ( ∞ );
• RD-Round Down towards -• (-).
• RU-Round Up towards +• (D);
• RZ-Round towards Zero (xv);
The result of any supported operation is the exact infinitely precise mathematical result rounded with the active rounding mode chosen among the four specified rounding modes-∞ , -, D, or x v. This condition has proven to be feasible with reasonably small additional hardware for the standard arithmetic operations-addition, multiplication, division, and square root [8] . It is more difficult to attain correct rounding for the elementary functions as there is no known algorithm able to compute correctly for any value the sine, cosine, logarithm, and so on, in a practically bounded time [9] , [10] .
The IEEE standard rounding is always attainable for the dot product operation employs a long accumulator. Our approach herein is simply to obtain a validated a good approximation. In the absence of a catastrophic cancellation signal, the proposed algorithm will guarantee for accumulation at least a faithful rounding.
Interval Notation
We shall denote by x [a,b] , where x, a, and b are real numbers, the following interval. We define the intervals
Let F be the set of representable values with the working precision-single F 0 , double F 1 , double extended ¢ F 1 , or quad F 2 . Depending on the active rounding mode, we define the abbreviated interval x (d) as described in Fig. 2 . We define the quantity of one unit in the last place or an ulp for any floating point number v π 0 as the absolute value of the difference between v and v¢ where v¢ is the next floating point number.
ulp(v) = |v -v¢|
The function ulp(v) depends on the floating point format, so we use ulp F (v) rather than just ulp (v) . It can be computed directly from the logarithm function and the value ulp F (1) as follows.
We extend the ulp function to any real number x. For any integer n, we then define the interval x F,(n) that we will denote by x (n) whenever no confusion is possible.
Given an approximation x F,(n) , its number of pertaining bits #x F,(n) is defined as follows.
#x F,(n) = -Èlog 2 (n ulp F (1))F or any floating point value v π 0, if v is not an exact power of 2, v (1) is the set of real numbers that rounds to v. The number of pertaining bits of #x (1) for a real number x is #x (1) = 24 in single precision and #x (1) = 53 in double precision.
Faithful Rounding
Some notion of faithful rounding was introduced in [11] and in [12] as optimal rounding. Our proposal, in [6] , extends the following notion to nondeterministic correct rounding: we present a general mechanism for interval rounding that is well suited to be a first extension of the IEEE deterministic rounding.
A hardware implementation of an elementary function f(x) like the sine or logarithm typically returns an approximation of the result that can be viewed as a very small interval rather than just the real value f(x). In contemporary floating point units, elementary functions are implemented by computing an approximation f *(x) of f(x) in internal extended precision ¢¢ Down-Down --to replace the Round Down -mode, and DD (resp. x vx v) to replace D (resp. x v).
The two cases of faithful rounding with Round to Nearest-Zero are presented in Fig. 3 . The fat lines are representable floating point numbers, the thin lines are the midpoints used in the Round to Nearest mode. The brackets enclose the interval f x p * ( ) ,( )
; the center of the interval is also visible. In the first case (Fig. 3a) , the interval rounds to exactly one value using the preferred rounding mode ∞ . This value is used as the result of the function. In the second case (Fig. 3b) , although the interval is less than one ulp wide, some part of the interval rounds to one floating point value whereas the other part rounds to another one. However, since the interval is small enough, the alternate rounding mode x v can be used and the whole interval is rounded to one single value using the x v rounding mode. The directed rounding modes -, D, and x v, allow the implementation of interval arithmetic at the user level. This is to guarantee that the final result of a compound operation is a directed approximation of the infinitely precise result. We present in Fig. 4 the two examples of faithful rounding of an interval with a directed faithful rounding mode: Round Down-Down. In the first case ( Fig. 4a ), the correct rounding is possible, the interval is rounded using -. In the second case ( Fig. 4b) , -is not possible but the interval can be rounded correctly using ∞ . In order to return to the user a coherent approximation of the result the rounded value is decremented and the computer yields the next lower represent-
THEOREM 1 (Daumas and Matula 1994, [6]). Faithful rounding is always possible if the interval does not contain an exact power of two and if the number of pertaining bits of the approximation is strictly larger than the number of bits of the significand of the target format. If the interval contains an exact power of two, it is sufficient that the number of pertaining bits of the approximation exceeds at least by two the number of bits of the significand of the target format.
A faithfully rounded result is a very accurate result: It is a correct rounding or a slightly modified correct rounding and the difference between the correct result and a faithful approximation is at most one ulp. A faithful evaluation of a compound formula will most certainly return a result close to the IEEE correct result. Faithful rounding is always possible if the internal process used for the evaluation of the function carries enough precision whereas IEEE correct rounding might not be possible.
Sign Exponent
Significand Total 
Rounding mode
Round towards Zero 
Catastrophic Cancellation
The IEEE precise rounding semantic for each individual operation implemented in hardware does not imply the correctness of a result involving compound operations. During the evaluation of a general signed sum of terms by a computer, the machine builds the partial sums as approximations x (n) rather than numbers. Each operation adds some round-off error. The last bits of the result of a large signed sum may be incorrect (see Fig. 5a ) but the user does not usually have the ability to determine exactly the border between valid bits and random bits. However, the accumulation of such round-off error alone is never too damageable. Doing the subtraction of two almost identical numbers will lead to another phenomenon called cancellation (see Fig. 5b ). Some problems arise when the cancellation occurs on approximated results (see Fig. 5c ). The accumulated round-off error is magnified by the cancellation when the result is left shift normalized. Although the result of the last operation is the standard rounding of the infinitely precise mathematical operation on the inputs, the number of pertaining bits of the result is reduced compared to the number of pertaining bits of the inputs. Paradoxically, when an addition leads to a large cancellation, that floating point operation is exact: No rounding has to be performed and no new error is introduced, but existing error is magnified.
Cancellation is a magnifier for would be errors. However, in real applications, most of the values are not known explicitly but they are approximated. Even when a floating point subtraction is known to be exact, it is important to retrieve the number of bits canceled. We define the level of cancellation of a routine which computes a signed sum as the difference between the exponent of the final result and the exponent of the biggest unsigned number used as an input or as a partial sum. A cancellation is not necessarily a source of error. A high level of cancellation is rather a signal for a possibly very incorrect result.
The IEEE standard has introduced the notion of hierarchy among the precision types: single, double, quad. It is common to protect a result by switching to a higher precision for the partial results. An example of such hierarchy is presented in Fig. 6 . The inputs and outputs are assumed single precision as when they are measured by physical instruments. The user codes a program in double precision so the output will most probably be accurate to 1 ulp once rounded back to single precision. If a numerical routine is called from the program, it uses the double extended format available on PCs.
Detecting every cancellation of a single bit in a program or a routine would lead to tagging any computed result as suspect. On the other hand, the operations that cancel all the bits of their operands are quite rare, and they usually present obviously erroneous results. We define a catastrophic cancellation as the trade-off that combines both users' practices and concerns for hardware efficiency: A catastrophic cancellation has occurred if enough bits have been lost in the process so that the significand length of the result is not larger than a significan of the next lower precision (see Fig. 7 ). We deduce the table presented in Fig. 8 from this definition.
VALIDATED DOT PRODUCT

Sign Segregated Accumulation (SSA)
Computing separately the sum X + of the positive numbers and the sum X -of the negative numbers, we ensure that any cancellation will occur only on the last operation X = X + + X -. This method also guarantees that the values ulp(X + ) and ulp(X -) are nondecreasing, and, as a consequence the accumulated error is bounded relatively to the final values of X + and X -(see Fig. 9 ). The proof of the properties presented in this section can be found in [13] . 
The SSA algorithm represents the worst case accumulation in regard to cancellation by building two large numbers X + and X -that cancel with the last addition. With software support, it would be much easier to sort by magnitude both positive and negative numbers in two separate lists. The software could compensate the intermediate errors in sums selecting a member of either list to sum next to keep it as small as possible in magnitude. This process could delay the significant cancellations to the end of the accumulation but it would require the program to first sort the set of numbers both by sign and by magnitude. Without any change to the usual order dependent multiply and add scheme, the SSA algorithm provides information on precision and cancellation. At the hardware level, some logic must decide which register X + or X -has to be accumulated by the SSA mechanism with the incoming value x i . This selection can be done in the early stages of a pipeline as soon as x i is read. If both registers are available at that moment, the right partial sum will be forwarded to the following stages of the pipeline. As Kahan proposed in [14] , the accumulation process should take place in a type precision higher than the working type. The quad precision format (see Fig. 6 ) represents a nice trade-off to accumulate the dot product partial sums for a double precision result. Being a standard size, this format can be accepted by manufacturers following the IEEE philosophy.
Considering the performances of the machines now widely available, and the numerical codes that are used on these machines, one can argue that one billion terms (10 9 < 2 30 ) is a reasonable upper limit on the length of a dot product operation. Thus any accumulated error cannot be bigger than 30 bits. If no catastrophic cancellation has occurred, the last operation has canceled less than 29 bits and at least 54 = 112 -(28 + 30) bits of the result are available; the result can be faithfully rounded. In the case of a catastrophic cancellation, we strongly advise the user to check his/her code for the meaning of a catastrophic cancellation at this point in the execution.
The dot product operation could easily return the infinitely precise rounding of the sum X + + X -. We can determine exactly the number of pertaining bits of the approximation from the maximum possible accumulated error and set the new exception flags: IEEE correct rounding and faithful rounding.
• If the error was small enough and positioned so that the whole uncertainty interval can be rounded to exactly one double precision floating point number, both the correct rounding flag and the faithful rounding flag are set. • If it is not possible to round the interval to exactly one value, but the approximation is small enough so that faithful rounding is possible, then only the faithful rounding flag is set.
• If the interval cannot be rounded properly using either IEEE standard rounding or faithful rounding, neither the correct rounding nor the faithful rounding flag is set.
Sticky Accumulation (SA)
The machine computes the partial sums X i = X i-1 + x i in quad precision without the normalization step (see Fig. 10 ). The definition of the quad type given in Fig. 1 is consistent with the philosophy of the standard that an extended precision level behaves like an unpacked format and includes the leading bit explicitly. Obtaining one extra bit of accuracy with an implicit leading bit at the level of 112 or 113 does not seem as useful as the facility to store unnormalized intermediate results where the number of leading zeros can itself contain some useful information that we do not want to lose or have to count and explicitly save elsewhere. A quad precision floating point accumulator X, with the explicit leading bit, that is never normalized by shifting left after an addition, satisfies the property used to prove Theorem 2, for the sign segregated accumulation.
X represents a computer word rather than a real number. If we allow unnormalized numbers like X where X is written with p leading zeroes and X 0 is its normalized value, then X = X 0 as far as real arithmetic is used but ulp(X) = ulp(X 0 ) ¥ 2 p . During an addition, the only process where the exponent may decrease is the left normalization which is impossible here. The exponent of the result has to be larger than or equal to the exponent of both inputs, and so is the value of the ulp function. The result of the process is the properly normalized and rounded value of X n in double format. During the last normalization, the system has access to the useful data to set the IEEE correct rounding flag and the faithful rounding flag.
We present Fig. 11 to illustrate the case of a decimal accumulation with three significant digits rounded to nearest. The first column is the exact sum of the inputs. This result may be produced by a fixed point process like a full length accumulator. The second sum is a standard accumulation where we keep three digits of each partial sum only and the intermediate result is rounded to nearest. The sum yields zero because of the final cancellation and the user does not have either a rough result or information on the error. No automatic support is offered to compute the error bound.
In the last case, we used sticky accumulation with four digits, i.e., one extra digit compared to the standard sum. As a result, we got better accuracy from the extra digit. More importantly, a bound on the final error is available through the sticky accumulation process. Although the bound is not tight in this situation, the user is able to qualify the result.
In current systems, the user has to decide a priori whether to use a standard sum or an exact process for an accumulation. The sticky accumulation validates a faithful result or signals results that really need to be computed with some higher precision device.
IMPLEMENTATION OF STICKY ACCUMULATION
Hardware Issues
The IEEE 754 standard for floating point arithmetic guarantees both maximum accuracy and reproducibility, and was designed to allow efficient implementation; the approach proposed here for the dot product provides a balanced approach to these same goals.
Sticky accumulation guarantees that either the dot product operation is faithfully rounded or a catastrophic cancellation has occurred. For the latter case, the accumulation process itself should be considered suspect.
Instruction Set
To obtain some information about the dot product operation we consider it as an order dependent atomic operation. The dot product is obtained by issuing the stream of operations presented in Fig. 12 . The functionalities that are used on a quad precision accumulator to implement the sticky accumulation are:
• Accumulate a new value with no post normalization.
• Normalize using the information on the sticky accumulation to set the rounding flags.
At the last stage of the dot product operation, it is possible to deduce from the quad accumulator and the size of the input vectors an evaluation of the result and a bound on the accumulated error. A new floating point operation quad_normalize taking a quad accumulator and an integer has to be added to the unit. From the unnormalized quad precision sum acc, and the number of terms n of the dot product (x. y), the floating point unit deduces the double precision result res of the dot product operation. Concurrently, the new flags IEEE correct rounding and faithful rounding are set accordingly to the result.
High Level of Integration
In order to be practical in VLSI, an algorithm-in-silicon must be competitive in time compared to functionally equivalent circuits. For a parallel environment (vector processor or small grain parallelism), the circuit must also be sufficiently small so that it can be replicated in a large number of units. Some other implementations proposed to enhance precision of the dot product operation always guarantee the exact rounding when the sticky accumulation cannot even guarantee faithful rounding in the presence of catastrophic cancellation. Yet our algorithm can readily be integrated into a general purpose central unit. The hardware needed to implement sticky accumulation is no larger than a quad precision accumulator and the design utilizes well known structures currently employed by manufacturers [15] , [16] , [17] .
Full length representation is feasible on a one accumulator per machine basis, but it is impractical to implement a multiplicity of these active registers on the same chip to allow a pipelined implementation. Sometimes, a dot product operation can be split by the user or by the compiler into as many partial sums as needed to keep the pipeline busy; each partial sum is accumulated in a different sticky accumulator stored in two 64 bits registers. The sticky quad accumulator can be replicated if needed as was done for the floating point multiply unit on the DEC Alpha chip [18] .
With a hardware data-path equivalent to a quad precision adder, sticky accumulation can cooperate easily with other units of the processor. It can be integrated in the early stages of the design of a new processor and would not rely on the relatively slow hardware interruption mechanism. The algorithm uses only register-size storage that can be kept in a cache memory or in a register for intermediate storage or transmitted over an internal network for manipulation on multiple sites. This means that many dot product operators implemented as hardware functionality on every single processing element of a massively parallel machine could either work on totally different accumulation processes or cooperate on a few common dot products.
Software Issues
Safe. Getting a bitwise correct result is not automatic. If the user does not trap on the nonfaithful rounding condition, the result will be dependent on the order of evaluation of the terms and thus the result may not be correctly validated. However, we can imagine three levels of error to monitor a group of operations:
• First level: A record of the inaccurate operations is kept by the machine and presented at the end of the program. If the record is empty, it means that the results are correctly judged. Otherwise, part or all of the results should be considered as suspect. Should any two machines return results tagged correctly, validating the results obtained will be identical.
• The second level consists of propagating the suspect flag like the IEEE format propagates NaN values (Not a Number). Only the partial results that are tagged as correct should be considered for checking reproducibility between two machines.
• At the third level, the computer traps on any noncorrect result. It then uses a higher precision process (larger accumulation or full length accumulation) to compute the correct result. This mechanism is fully reproducible, and it uses the optimized functionalities of the machine as much as possible: the high precision accumulation is used only when the common tools have failed to give a needed result.
Associative. Some automatic optimizations of compilers change the order of evaluation of the dot product accumulation in a parallel environment or with a pipelined processor. This sometimes leads to a result that is not reproducible from one machine to another since the user does not have any control over the order of evaluation unless he turns off any automatic parallelization.
Our proposed dot product operation works correctly by accumulating the numbers in any order step by step with the multiply and accumulate scheme. The techniques that the users have learned to enhance performances or precision can still be applied efficiently with a sticky accumulation. One may use three sticky accumulators to compute three partial sums and take advantage of a pipelined processor designed to start one dot product accumulation every cycle and with a depth of three cycles (see Fig. 13 ). The sticky accumulation does not introduce any new data dependency between the registers acc1, acc2, and acc3.
As long as the quad accumulators are not normalized, the mechanism that is used to monitor the round-off error and the cancellation is still valid. The bound on the final error is given by (n -1) ulp(acc1) at the last step of the algorithm. The user can surely rely on the new available flags to obtain any desired tradeoff between precision and speed and to strongly attenuate the issue of reproducibility. A much wider range is now offered compared to the existing solution where the choice was only between: a fast accumulation with no guarantee at all or a slow process solving any possible problem.
Scalable.
We have presented this algorithm as an implementation for double precision arithmetic using one quad precision register. The sticky accumulation can be extended to an accumulator of any length. The choice of the size of the accumulator might even be left to the user. The user would use a triple or quadruple length accumulator to improve the precision of a nonfaithful result.
With the catastrophic cancellation mechanism, our algorithm detects the pathological cases. The user may ignore these warnings: Sticky accumulation is then equivalent to a standard multiply and add operation with a small overhead due to the final rounding to detect whether the validated result is safe or suspect. If the user is looking for a safe result even in the presence of a catastrophic cancellation, he or she will have to use a backup process offering more precision through a larger accumulator; yet the interval result of the sticky accumulation might be used to enhance the performances of the backup process. Only the very difficult cases need employ the implicit full length accumulation schemes proposed in [2] , [3] , [4] , [5] .
If the user has decided to ensure reproducibility, it is necessary to achieve IEEE standard rounding on each operation. In the case of catastrophic cancellation, the user may also choose to search for the rounded result of the exact operation although this quantity might be meaningless.
With existing double precision accumulation, since no feedback is available to the user, there is no way to implement an efficient linkage with a high precision package. The user can either perform all the dot product operations with the standard addition or compute each single dot product with a slow high precision package. In the first case, the result is available quickly although the result is most likely to be correct, no guarantee is available. In the second case, the fully certified computation may take hours, and most of the computing power might have been wasted in high precision operations where the operations would have returned a fair enough result with the standard addition.
Depending on knowledge of the algorithm's stability, the user may always prefer a long (fixed point) accumulator mechanism. However, a first pass of the SA dot product will detect the suspect cases, and the expensive process of multiple length accumulation will be used only when it is really needed. 
Simulation
The simulation of a single precision sticky accumulation (double length sticky accumulator) was obtained on a PC-compatible machine through the use of 80 bit wide double extended type. The difference of length between double precision and double extended formats (11 bits) is used to simulate unnormalized arithmetic.
Two generic sets of input on this simulation were used. The first one generates numbers of both signs between 1 and 1 2 in magnitude. Chances of stopping the process just after a major cancellation are poor, and the process had to run for hours before a visible error occurred. We have then designed a generator that would frequently involve a shift or some bit cancellation. The next input in the set has the same exponent as the current partial sum. The sign and the mantissa of the input are drawn at random.
In the example presented Fig. 14 , we have added 32 numbers. The sum with the sticky accumulation returns the same result as the double precision accumulation, but with two indications: The number of bits canceled is 22, and the bounds on the error is 2.910e -11.
If the degree of cancellation is low, the three sums are comparable. When the degree of cancellation starts growing, the single sum is inaccurate. For a limited level of cancellation, double precision sum and sticky accumulation produce the same result. However, when the cancellation level increases, the sticky accumulation returns the guaranteed interval whereas the standard accumulation is not able to set up any warning or information. When the level of cancellation is too high, the sticky accumulation just produces an interval centered around 0, and the double accumulation produces a meaningless number almost at random.
CONCLUSION
More than any other dot product algorithm, we have presented a reliable and efficient dot product implementation: This algorithm is associated with an error detection mechanism, hence, it is able to facilitate performance at the level of accuracy desired by the user.
The sticky accumulation process allows to perform a dot product operation at speed close to any standard IEEE operation. Because the user obtains information on each result, only an operation that may need some special care uses the necessary extra resources. For double precision arithmetic, the algorithm guarantees a faithfully rounded result unless a catastrophic cancellation has occurred.
Since even clever hardware implementations of error-free dot products would be much larger than a quad precision accumulation, they would have few chances to be highly integrated into the design of a processor. The implicit full length accumulation will remain a relatively slow process that should be used only when the huge amount of information it carries is really necessary, e.g., when the sticky accumulation signals a difficult accumulation.
