Abstract Double precision summation is at the core of numerous important algorithms such as Newton-Krylov methods and other operations involving inner products, such as matrix multiplication and dot products. However, the effectiveness of summation is limited by the accumulation of rounding errors due to compressed representations, which are an increasing problem with the scaling of modern HPC systems and Disclaimer: This document was prepared as an account of work sponsored by the United States Government. While this document is believed to contain correct information, neither the United States Government nor any agency thereof, nor the Regents of the University of California, nor any of their employees, makes any warranty, express or implied, or assumes any legal responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by its trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof, or the Regents of the University of California. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof or the Regents of the University of California. Copyright Notice: This manuscript has been authored by an author at Lawrence Berkeley National Laboratory under Contract No. DE-AC02-05CH11231 with the U.S. Department of Energy. The U.S. Government retains, and the publisher, by accepting the article for publication, acknowledges, that the U.S. Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for U.S. Government purposes. data sets that can easily perform summations with millions or billions of operands. To reduce the impact of precision loss, researchers have proposed increased-and arbitraryprecision libraries that provide reproducible error or even bounded error accumulation for large sums. However, such libraries increase computation and communication time significantly, and do not always guarantee an exact result. In this article, we propose fixed-point representations of double precision variables that enable arbitrarily large summations without error and provide exact and reproducible results. We call this format big integer (BigInt). Even though such formats have been studied for local processor computations, we make the case that using fixed-point representation for distributed computation over a system-wide network is feasible with performance comparable to that of double-precision floating point summation. This is possible by the inclusion of simple and inexpensive logic into modern NICs, or by using the programmable logic found in many modern NICs, in order to accelerate performance on large-scale systems in order to avoid waking up processors.
data sets that can easily perform summations with millions or billions of operands. Tomultiply each number with a few others, but the final result is produced by a large summation whose size directly depends on the data set size. This summation can easily reach millions or billions of operands.
Even if all terms are positive, round-off error when millions or more terms are summed can significantly reduce the accuracy of the result, especially if the summation involves large and small numbers. The value of the operands, data set size, and the number of processors is frequently unknown to the programmer, which means that estimating the magnitude of the error is impossible.
As an example of a real-world application where accurate summation is important, He and Ding [26] analyzed anomalous behavior in large atmospheric model codes, wherein significantly different results were obtained on different systems, or even the same system with a different numbers of processors. As a result, code maintenance was an issue, since it was often difficult to ensure that a "bug" was not introduced when porting the code to another system. While a certain degree of non-reproducibility is to be expected since weather and climate calculations are fundamentally chaotic, these researchers found that almost all of the numerical variations were eliminated when two key inner summations were converted to use double-double (DD) arithmetic (approximately 31 decimal digits) [26, 28] . Another concrete example is complex dynamical systems, where precision loss has been quoted as the main limitation [22] .
Currently, researchers generally presume that nothing can be done to ameliorate such problems and are often concerned with bounding errors from precision loss to acceptable levels [16] [17] [18] 24, 29, 41] . A few have tried higher-precision arithmetic using software facilities such as libraries, with relatively good success, although computational time for this software is often an issue [28, 49] . In the atmospheric model case mentioned above, researchers employed DD arithmetic in the summations, which uses two double-precision variables to extend precision [28] , and found that almost all numerical variations were eliminated. Others have even resorted to quad-double arithmetic or infinite precision libraries [6, 28, 49] .
In modern HPC systems with large datasets, summations and other operations are typically performed on operands distributed across the network [12, 48] . Currently none of the high-precision software libraries support parallel summations, partly due to the increased amount of bandwidth required to transport their complex internal data structures. Balanced-tree summations and other sorting or recursion algorithms that reduce round-off error accumulation [18, 29, 41, 49] are only efficient within nodes because they would incur an impractical amount of system-wide data movement for distributed operations. Such a communication burden would quickly make these libraries infeasible with large data sets. The lack of support for increased of infinite precision over the network leaves distributed operations prone to precision loss.
There are two natural ways to implement distributed summations with provably minimum data movement. One way is for nodes to send their partial sum (a single variable) to a single node to perform the summation and generate a single result. This only occupies a single node's processor but completion time grows linearly with the number of operands and can also create network hotspots all operands are transmitted over the network to a single (or a few) destination [52] . The other approach (used in most MPI implementations) performs the summation in a tree fashion, where each node adds a number of operands and passes the partial sum (also a single variable) to the next level of the tree. This approach requires logarithmic time to complete, but the latencies of each level of the tree greatly impact performance at scale as it incurs multiple communication delays between network interface cards (NICs) and local processors. That is because traversing up each level of the tree is a new communication event over the network. The downside of software implementations of reduction operations is that they incur potentially long context switch penalties just to have processors at each node add a few numbers together, which occurs at each level of the tree, and may also cause these intermediate processors to wake up from potentially deep sleep state just to perform an infinitesimal amount of work.
Earlier work proposes NIC-based reductions in Myrinet networks [8] by modifying the drivers (and therefore using the host processor's processing power), and showed that NIC-based reduction can yield performance gains even with as few as eight nodes [9] . Further work evaluated conducting reduction operations exclusively in NICs and concluded that doing so increases efficiency, scalability and reduces execution time compared to performing the computations in processors [43] . That is because there are no wake up or context switching penalties in each processor. This is made possible by using the programmable logic in modern NICs [43] . However, this work only applied this method for double-precision variables due to the complex data structures and computation demands of arbitrary-precision libraries and the limited processing power of programmable logic in NICs. Even with double-precision variables, the low computational power of programmable logic can pose a significant performance issue [43] . The alternative, adding complex dedicated hardware to NICs for increased or arbitrary precision computations, increases design risk and complexity. Therefore, the internal data structures of such libraries are so large a complex that programmable logic or hardware support in NICs is infeasible.
In this article, we make the case that using a very wide integer to provide an uncompressed and exact encoding of the complete numerical space that can be represented by a double-precision floating point number, which we call big integer (BigInt), is well-suited for distributed operations. This representation is essentially a wide integer that covers the same number space as a double-precision variable. We demonstrate that BigInts incur no precision loss and can be productive and efficient to use in distributed summations, such as reduction operations, without increasing communication latency or application execution time compared to double-precision variables. In fact, this approach offers the promise of actually accelerating system-wide summations used by inner products for methods such as Newton-Krylov, because integer adders can be easily incorporated in NICs at speeds comparable to modern floating point units (FPUs), to avoid waking up and context switching processors. This advantage stems from the simplicity of fixed-point variables. Therefore, we are able to move computation from the processor to the NIC without any precision or performance loss. This was previously possible only with double-precision variables due to the complex internal structures of arbitrary-precision libraries. Even though the size of a BigInt variable is 2,101 bits-33× larger than a 64-bit double-precision variable-we observe insignificant loss in communication delay due to the fixed latency costs and packet overhead bytes in modern large-scale networks [51] . Therefore, BigInts readily apply to network operations and can be combined with past work on local-node computations that uses sorting and recursion or alternative wide fixed-point represen- tations with dedicated hardware support [38, 39, 49] , in order to provide reproducible system-wide operations with no precision loss. Previously, there was no technique to make distributed summations error-free without a significant slowdown.
The rest of this article is organized as follows: Section 2 provides background and an overview of related work. Section 3 describes BigInts and their applicability to distributed reductions. Section 4 describes our methodology. Section 5 presents evaluation experiments and results. Finally, Sect. 7 concludes this article.
Background and Related Work
The format of double-precision variables is set by the IEEE 754 standard [34] and is shown in Fig. 1 . For double-precision variables, the total number of bits is 64, including the most significant bit which is the sign. Because a double-precision variable needs to represent a large number space with only 64 bits, it uses a compressed representation with a mantissa, exponent, and a sign bit. The mantissa includes a silent bit at its most significant (53rd) position, which is 1 if the exponent is non-zero. Double-precision variables with a zero exponent are considered denormalized and are used to fill the gaps in the numbers that double-precision variables can represent close to zero. A denormalized double-precision variable encodes the number:
If the exponent is non-zero, the number is considered normalized and a bias of −1,023 is applied to the exponent value. Therefore, an exponent value of 50 represents an actual exponent of 50 − 1,023 = −973. This way, double-precision variables can represent numbers with negative exponents. The decimal value of a normalized double is:
Normalized doubles encode numbers in the range:
However, because of the limited mantissa bits, double-precision variables cannot represent all numbers in the above range [22] . In addition, operating on numbers that have significantly different exponent values causes bits to be dropped, because the result can only have one exponent value. For example, adding the numbers 2 N and 2 M where N >= M + 53 causes the latter number to be dropped because the 53 mantissa bits are not enough to retain both numbers. Even if the smaller number is not small enough to be fully discarded, it may have to be partly discarded for the same reason. Alternatively, the result of two double-precision variables with the same exponent value may require more than 53 mantissa bits [22] . This can occur with a summation or a multiplication. Multiplication poses extra challenges because the result can easily be outside the representable number space, or can produce a number that is within the range but still not representable due to the compressed format [22] . Precision loss is hard to predict in many applications and can appear with just two operands. Therefore, even estimating the magnitude of precision loss is usually unfeasible without calculating the error-free result. In a large summation this error accumulates in every sum of two numbers and therefore in a summation with millions or billions of operands the error may rise to unacceptable levels [2, 6, 13, 26] . Double-precision variables are fully supported in hardware with dedicated FPUs in modern processors [40] . They also include special values to represent infinity and a result that is not a number (NaN), as well as exceptions to set variables to those values when appropriate [32, 34] . Modern architectures include support for "long double" variables. In the IEEE 754 standard [34] , this is referred to as double-extended precision for which the byte width is 10 and translates to 19 or 20 decimal digits. The implementation of these variables depends on the programming environment and system architecture. In an x86 architecture long doubles are 80-bits long, with 64-bit mantissas and 15-bit exponents [14] . x86 processors include dedicated registers with this format, but the extra precision is lost once a long double variable is written back to memory.
In the past 10 years or so, high-precision software packages have been produced which typically utilize custom datatypes and operator overloading to facilitate conversion. One example is the QD package [28] which provides DD variables, where each 16-byte variable is the unevaluated sum of two 8-byte doubles of which the first consists of the "leading" digits and the second the "trailing digits" of the format's value. That is, the first double contains the best 64-bit double-precision approximation to the answer, and the second double contains the difference between the exact answer and the first double. Similarly, a quad-double number is an unevaluated sum of four IEEE doubles. However, while such packages offer easy-to-use high-level interfaces, they still offer a limited amount of precision.
For arbitrary precision, packages such as ARPREC [7] and GNU GMP [25] are available. These packages typically employ arbitrary-size representations, such as using a character string format. Other packages offer guarantees on the order of operations and rounding (since there is no standard) [20] . Inevitably, these packages sacrifice speed and memory due to large arrays used in the internal representations and lack of hardware support. Formats with variable-length mantissas have also been proposed, such as IEEE 854 [10] . Such formats are similar to double-precision variables because they use mantissa and exponent fields, but they also introduce additional complexity for operations and exception handling that make hardware support costly. Such complex hardware also represents a design risk for NIC designers, because they make verification harder.
Existing software solutions are usually slow for most physical modeling and simulation codes, and the precision provided by DDs or quad-doubles are insufficient, especially for much larger simulations to come. A DD summation requires approximately 20 operations, all performed in processor registers, which results in an approximately 2× to 5× slowdown compared to double-precision variables [28] . This slowdown becomes worse when using arbitrary-precision packages. As an example, ARPREC is approximately 2.5× to 4.5× slower than DD multiplication on a Sun Ultra 167 MHz processor [7, 28] . Alternative algorithms use sorting, recursion, or other computation or data movement overhead to provide exact results or with tight error bounds, as well as reproducibility [18, 41, 49] . Due to their overhead, such algorithms are only practical within nodes. Because of the performance tradeoff, some researchers settle with bounding or predicting the computation error [16] [17] [18] 24, 29, 41] .
Past work has used wide fixed-point representation similar to BigInts in hardware accumulators for local-node summations [38, 39, 49, 53] . This work mitigates the negative effects of such a large format in registers and caches by adding dedicated hardware support in the form of a single accumulator to produce the result. This work has already shown that such formats incur no precision loss for summations of arbitrary size. BigInts extend this work by applying this concept on distributed summations. We take advantage of the simplicity of BigInts to perform summations in the NICs with simple dedicated hardware support or by using on-board programmable logic, in order to avoid waking up or context switching processors.
Description
This section describes BigInts, their applicability to network operations, and how they interface with the programmer. Figure 2 illustrates the format of a BigInt variable. Essentially, BigInt variables remove the exponent field and expand the mantissa such that each bit's location corresponds to an exponent value. To represent the same number space as the IEEE 64-bit double- precision format,BigInt variables need to be 2,101 bits since the 11-bit exponent field's value ranges from 0 to 2,048, and the exponent defines the value of the most significant (silent) mantissa bit. Thus, the lowest exponent value BigInt needs to represent is the lowest value of the exponent field (−1,022) minus the number of mantissa bits (53), for a total of −1,075. BigInt variables also have a sign bit which we place in the least significant position.
Big Integer Variables
Like integers, BigInts are an uncompressed representation and therefore can exactly encode the full range of real numbers in their number space. In other words, there is no gap in the range of numbers that are expressible by BigInts. In addition, BigInts do not use a compressed format and therefore do not drop bits depending on the value of operands, as discussed in Sect. 2. Therefore, they do not introduce any precision loss and are well-suited for very large operations.
BigInts of all 0s but with a sign bit of 1 represent "not a real number" (NaN). This can occur by dividing any non-zero number by zero. A value of all 1s refers to infinity (Inf), used for numbers outside of the representable number space. In this case, the sign bit distinguishes between positive and negative infinity, similar to IEEE 754 [32, 34] . Setting a BigInt to infinity or NaN is done by the hardware logic through simple checks without the need for exceptions, as in the IEEE 754 format. For example, the result of a summation is infinity if the addition of the most significant bits of two BigInts produces a carry. Similarly, in multiplications, the resulting BigInt is set to infinity if the result is wider than the BigInt, which can occur even if multiplying two numbers within the number space. A BigInt of all 0s is the number 0.
Because of their integer representation, operations between BigInts such as summation and multiplication follow the exact same methodology and algorithms as integers, with two exceptions: First, the sign bit resides in the least significant position, as shown in Fig. 2 . Furthermore, operations that would produce NaN or infinity produce the special encoding necessary to represent those states. Figure 3 shows how a normalized double-precision variable maps to a BigInt. To convert from a denormalized double, the most significant bit of the double's mantissa is inserted in position 53 (corresponding to exponent value −1,022). This does not overwrite the sign bit of BigInt because denormalized floats do not have a silent bit. This conversion can be done with very simple hardware or memory operations, because it is simply inserting the double's mantissa bits in the position specified by the exponent [38, 39] . In the case of converting a BigInt to a double, the double's exponent is determined by the position of the most significant bit that is asserted in the BigInt, and the mantissa captures as many most significant digits of the BigInt that it can. Converting the summation result from a BigInt back to a double, long double, or DD variable inevitably causes the least significant asserted bits to be dropped if the distance between the leftmost and rightmost asserted bits in the BigInt is more than the length of the mantissa of the variable the BigInt is converted to. That is because the resulting compressed format can only have one exponent value. For example, when converting to a double-precision variable a BigInt with asserted bits at locations 100 and 10, the least significant bit will inevitably be dropped because 100 − 10 > 53. When this conversion is performed for the final result of the operation, if may be of concern to the application. In that case, programmers can retain BigInt or use another wide fixed-point representation for local-node summation similar to past work [38, 39, 49] , or use any of the higher-precision formats or libraries such as recursion and sorting methods, ARPEC, QD, and GMP [7, 18, 25, 28, 41, 49] .
It is key to realize that the conversion back to lower-precision formats discussed above is performed only for the final result or for local node processing, if at all. The final result is guaranteed to be accurate as long as all arithmetic is performed with BigInt precision. Performing the entire operation with precision less than BigInt may produce an entirely different result due to precision loss in every summation of two numbers, as discussed in Sect. 2; this effect accumulates over the course of large summations. This precision loss can result in the most significant bits of the result, the bits well within the range of even double-precision variables, to differ compared to the result produced by BigInt because BigInt was able to correctly and exactly capture the contributions of all operands without losses. Therefore, as long as the operation is completed using BigInts, the conversion back to double-precision will drop the last few decimal digits of the result, but the remaining ones will be correct.
The simplicity of BigInts enables fast and accurate numerics for software implementations and requires only simple logic for hardware implementations for large summations [38, 39] . Very little additional memory is required because only partial sums are projected to BigInt, instead of all operands, and only over the network for distributed operations. BigInt variables can be treated as long integers for addition, multiplication and division, even though they represent floating point numbers. Adding two BigInts can be accomplished with simple and inexpensive integer adders in a sequential manner; an adder of N bits length requires 2,101 N cycles. Multiplying two BigInts is not a sequential process, but can also be accomplished with cheaper and faster circuits than floating point arithmetic [21] . For summation, in order to match the latency of a highly-optimized FPU in modern Intel processors which has a 5-cycle latency [14] , the adder width needs to be 421 bits. The simplicity of integer (fixed-point) adders and multipliers compared to floating-point ones makes adding hardware support for BigInts feasible and more likely to operate at comparable speeds to the node processors, in contrast to more complex libraries that also offer arbitrary precision.
Applicability to Network Operations
BigInts are good candidates for network-wide operations, such as MPI reductions where communication follows a tree-like fashion and every node sums the operands it receives and then transmits the partial sum (a single number) to the next level of the tree to produce a single result [31, 36, 46] . This is because BigInts do not increase communication time significantly because the additional bits lead to a minor increase since packet headers and fixed (zero-load) delays in the network and NICs dominate for small transfers. In the case of multiplication, the same can occur since multiplication can also be performed cumulatively. BigInts are in fact good candidates for distributed operations because the additional bits do not have a noticeable effect in communication delay for a small number of operands, as we show in Sect. 5.4.
Past work has shown that performing computation in the NICs during reduction operations increases both scalability and consistency with speedups of up to 121 % [43] . However, this has only been applied to double-precision variables because of the complex data structures of libraries with arbitrary precision and the limited programmable logic in terms of both area and clock frequency in modern NICs [43] . For instance, Elan3 in Quadrics QsNet provides a user-programmable, multi-threaded, 32-bit, 100 MHz RISC-based processor with 64 MB local SDRAM, originally targeted for communication protocol modifications [42] . This processor is an order of magnitude slower than the multi-GHz node processors. Therefore, using formats more complex than double-precision variables would result in excessive slowdowns. BigInts address this tradeoff because they are even simpler than a double-precision variable. Similar work which implemented the computation logic in software drivers in a Myrinet network [8] showed that NIC-based reductions can be preferable to host-based (serial) reductions with as few as eight nodes [9] .
The only way to offer high-speed computation in NICs without BigInts is to add dedicated high-speed floating-point computation hardware to NICs. The power and area consumed by a fully functional double-precision adder is substantial. In addition, including a full floating point into the NIC increases design complexity to a point that drastically increases design risk for a very specialized function. An example of the simplest double-precision FPU from the Tensilica design library requires on the order of 150,000 gates [37] . Even though FPU units support multiplication as well as addition, and therefore would be partially unused in a distributed summation which is the focus of our work, floating point multiplication is less complicated to implement than addition [23] . Verification is non-trivial even for a circuit of this size and complexity, and this implementation is among the simplest available. For future Infiniband 4x EDR interface theoretical peak bandwidth, operands would need to be summed at a rate of 100 Gigabits/s. This would require a 32-bit carry-lookahead adder operating at 0.6 GHz (very modest clock rate). Operating as a ripple-carry, this would require 66 cycles to graduate a result, and consume about 380 gates for the adder (a negligible amount of area and power) [35] .
Therefore, network-wide computations can be efficiently performed in the NIC with BigInts, because alternative error-free libraries use complex formats that introduce considerable delays and would also preclude computation in the NICs, because of the limited processing capabilities of NICs. In addition, sorting or recursion techniques that also prevent precision loss would produce an excessive amount of system-wide data movement. Local computations can be performed in advance of the arrival of packets with partial sums such that when the reduction operation function call is made, the result (a single partial sum number) can be stored in the NIC in anticipation of other partial sums (operands).
For reduction operations with both local and network-wide computations, as is the case if more than one operands reside in each node, local computations can either use sorting or recursion techniques for the local part of the operation, or use similar wide fixed-point representations with dedicated hardware support [38, 39] . Alternatively, local computations can be performed in the NICs using the BigInt format without using the local processor. The partial result would then reside in the NIC, waiting for other operands to arrive through the network. When that occurs, the NIC can perform the computation and transmit the result to the next level of the tree in the distributed summation. However, computing the local partial result in the NIC requires transferring all local operands to the NIC instead of just the partial sum, which stresses the processor-NIC interconnect. The technique for local computation is orthogonal to the use of BigInts for the distributed part. However, using a technique that does not guarantee exact and reproducible local computation will affect the quality of the overall result due to potential precision loss locally, even if BigInts is used for the distributed part which avoids any precision loss.
Software Interface
Since BigInts apply to distributed operations and not necessarily local processor computations as explained above, the conversion from double-precision format, or any format used for increased or infinite precision in local computations, to BigInts can simply occur only for variables arriving at the NIC from the local processor. This can be done with a simple shift operation using very simple dedicated hardware as described in Sect. 3.1, simple logic in the NIC's programmable logic, if available, or even in the local processor using software. In the first case, the appropriate circuitry in the NIC simply needs to be enabled in order to perform the conversion. If the conversion is performed by the NIC's programmable logic, this functionality needs to be enabled in a similar fashion. Performing the conversion in software requires just a handful of instructions to construct a bit array and place the mantissa in the Fig. 4 The NIC is informed to convert double-precision variables that the local processor transmits to BigInts, and vice versa. After that, BigInts are used abstractly from the programmer and produce an exact and reproducible result in the distributed operation appropriate location. After enabling the conversion using an appropriate function call similar to setting options in modern NICs such as eager delivery [44] , conversions are performed abstractly from the application. The reduction operation is then performed using BigInts and produces a single result, which is converted back to double-precision or another format locally in the node that stores the final result. A code outline is shown in Fig. 4 .
If we were to use BigInts or similar wide fixed-point representations in local processor computations with hardware support [38, 39] in addition to distributed operations, we can make this capability explicitly accessible by the programmer by the addition of a new variable type. In that case, all lower-precision variables that are added to the wide fixed-point variable are implicitly converted to the fixed-point format to avoid precision loss. This way, the programmer can explicitly choose the variable type to use for reduction operations, to better control the desired precision.
BigInts can also be deployed without hardware support using software implementations. This can be done using MPI constructs to define a new data type, object oriented programming to define a new BigInt class, or other software constructs. However, even though software-implemented BigInts will still prevent any precision loss, a single summation between BigInts will result in a few tens of instructions to correctly add the whole length of BigInt variables.
Methodology
We evaluate the impact of BigInts on computation precision and performance of summations and MPI reduction operations [19] with millions to billions of operands. We focus on summations because they are at the heart of many crucial operations, such as dot products and matrix multiplications. In fact, in those operations, multiplication is performed in a pair-wise manner, but the final summation is the operation with potentially millions or billions of operands.
We first evaluate the numerical behavior and error accumulation of techniques applicable to network-wide operations, namely doubles, long doubles, DDs and BigInts, using serial test cases on variable numbers of operands. Next, we evaluate the performance of MPI reduce() operations on payload sizes required for these different extended precision implementations. For these experiments, we focus on the impact on communication delay from the extra payload from BigInts. We then estimate the impact of embedding BigInt summations into the NICs for future interconnect designs. Finally, we present application benchmark results with and without BigInts.
In our architecture, long double-precision variables are 80 bits long, where 1 bit is the sign, 15 bits are for the exponent, and 64 bits for the mantissa [33] . We report the error of each precision type against BigInt as a percentage. We also perform a bitwise comparison of the mantissa of each precision type against BigInt and report the first bit that differs. Due to the different representation with two double-precision variables that DDs use, we do not report bit-wise mantissa comparison results for DDs. Finally, we perform a comparison of decimal digits of each variable in decimal form and report the position of the first decimal digit that differs. For example, the numbers 2.54 and 2.58 differ in the second decimal digit. For both decimal and bit comparisons, a value of minus one means that there is no difference, whereas a value of 0 means that the integer parts of the numbers differ in the decimal comparison.
Each trial uses exclusively operands and operations of the appropriate precision. For example, experiments for BigInts perform all summations and intermediate results using BigInts and only convert the final result to other formats for the final comparison. Finally, for our experiments we use NERSC's Hopper, which is a Cray XE6 cluster [3] .
Evaluation
This section presents our computation precision and performance results.
Composite Summation
We begin by conducting numerical stability results comparing doubles, long doubles, DDs and BigInts. For these experiments we perform a composite summation of numerous equal small numbers to a large number. We vary the large and small numbers as well as the number of operands to illustrate precision loss as a function of summation size and operand values, discussed in Sect. 2. These experiments are done in a single processor as they are designed entirely to understand the numerical stability of different approaches. The results are illustrated in Table 1 . Because the small numbers (10 −8 ) we add to the large number (10 8 ) are equal, each partial sum creates the same amount of error. Therefore, the overall error grows exactly linearly with the number of operands. Also, the errors introduced by each variable type differ in orders of magnitude compared to BigInt, because doubles, long doubles and DDs have different levels of precision (different mantissa widths). DD variables introduce approximately 10 −16 less error than doubles. For better illustration, Fig. 5 compares bits and decimal digits with the BigInt result. In addition, Fig. 6 shows the number of bits that were dropped when converting the final result from BigInt format to a double-precision variable. Essentially, this shows the additional number Table 1 of precision (in bits) that BigInt captured in this benchmark, compared to doubleprecision variables. As shown, double and long double variables have a large error, whereas the error for DDs is still noticeable and can become a factor with billions of operands. The overall error in these experiments is a result of the error accumulated by each individual summation, which we show can grow to be important. Each experiment in this section uses exclusively variables of the appropriate type for all summations and intermediate results. Table 2 and Fig. 7 show results for a summation where the large number (10 12 ) is large enough such that double-and long double-precision variables drop the small numbers entirely. For those cases, summation is ineffective because the result equals the large number, which is the phenomenon this experiment is designed to show. DDs however, do not disregard the small numbers. As shown, the first mantissa bit of difference for doubles compared to BigInts is nearly the total bits in the mantissa of a double-precision variable (53) , which shows that the summation affects bits that Table 2 Percentage error when adding operands with a value of 10 −8 to a variable with the value of 10 12 Number of operands are beyond what the double-precision variable can hold. In contrast, BigInt contains all such bits. Compared to adding to a large number of 10 −8 , the precision for DDs is also lower.
Computing Arc Length of an Irregular Function
We then show results for a funarc calculation that attempts to calculate the arc length of the irregular function g(x) = x + 0≤k≤10 2 −k sin(2 k x), over the interval (0, π). The task here is to sum
into n subintervals, where n = 10,000,000, so that h = π/10,000,000 and
The function is shown in Fig. 8 . This function highly irregular, so that the arc length calculation will sum many highly varying quantities. If the limit on the summation were infinity instead of 10, the resulting function, while continuous and innocuouslooking, is in fact non-rectifiable-it does not have a finite arc length. In any event, it was chosen as a useful example for demonstrating round-off error when adding a large number of varying terms. Since the operands in this experiment vary in value, some operands have to be discarded similar to the previous experiment, others are partially discarded, whereas others are not discarded at all. This depends on the different between each operand and the partial sum, relative to the precision of each variable format. As a result of the variability of the operands, the error has an irregular and not linear relationship with the number of operands. Figure 9 illustrates the percentage error compared to BigInt, and Fig. 10 shows the bit and decimal digit comparisons. As shown, DDs still introduce noticeable error.
Geometric Series
Finally, to clearly show the effect of the limited mantissa bits, we use the geometric series:
The results are shown in Fig. 11 . Even though for i infinite the geometric series converges to the number 2, for natural values of i it should converge to less than 2. However, due to the limited number of mantissa bits, double variables report equal to 2 for i > 53 and long doubles for i > 64. This is simply because each operand x) with a varying number of operands sets a mantissa bit (for example adding the operand for i = 4 sets the fourth most significant bit of the mantissa with the proper exponent), but for operands that the mantissa is too narrow to record, the value has to be rounded up to 2. DDs are accurate until i > 106. Beyond that, the second double-precision variable, which denotes the error contained in the first double-precision variable in DDs [28] , cannot accurately capture the accumulated error because the error exceeds its 53 available mantissa bits. 
Performance Evaluation
To evaluate the performance impact of BigInts in distributed summations, we first focus on the impact in completion time of MPI reductions by measuring communication time using MPI constructs. MPI reductions are typically used for summations performed on operands distributed across a large-scale system [36, 46] . In such operations, communication progresses in a tree-like fashion where nodes in each level of the tree sum the operands they receive with their local operands, and pass the partial sum (a single number) to the next level [31] . Figure 12 shows the time spent in communication. Each data point is the average of 50,000 MPI reduction operations. BigInt increases communication time by approximately 35 % compared to doubles, but only 2-14 % compared to DDs. Even though BigInts contain 2,101 bits and therefore are 16× larger than 128-bit DDs, for such small sizes communication is latency-bound instead of throughput-bound. In fact, depending on the network's maximum packet size, the 2,101 bits for a BigInt may still be contained in a single packet as is the case with the 1,500-byte MTU in Ethernet. Therefore, BigInts do not increase communication time significantly because the additional bits lead to a minor increase since packet headers and fixed (zero-load) delays in the network and NICs dominate for small transfers. Our results are confirmed by past work [51] which showed that the communication delay for messages with payloads containing double-word, quad-word, and double quad-word variables is identical and approximately 12 µs for up to 512 bytes, which is a larger payload size than BigInts (256 bytes).
Because these reduction operations are not part of a benchmark, there is no need for warmup and cooldown periods. Also, due to the small payload of packets and to better isolate the effect of payload size, we configure the NICs to use the eager message path to the processor [44] to avoid the latency of block transfers.
We then estimate computation time. In modern Intel processors, double-precision additions performed on dedicated FPUs require five cycles [14] . Even though the peak rate for FPUs is two flops/cycle, summation throughput is limited by the FPU pipeline latency. Increased-precision representations are considerably slower. For example, as discussed in Sect. 2, DD addition requires up to 20 operations. Arbitrary-precision libraries use more complex representations and therefore are even slower. BigInts can perform these operations at a rate that is comparable to DDs (20 cycles) only with an 106-bit integer adder, which can be easily integrated in modern NICs [43] . Computing in NICs avoids potentially large context switching or wake up delays for local processors, as well as data transfer latencies and bandwidth constraints between the NIC and the processor that are typically significantly longer than the few cycles to perform integer addition. Context switching requires µs [50] , while waking up processors from deep sleep may require seconds [11, 15] . So the handful additional cycles required for extended precision and BigInts are negligible compared to the network time. As discussed in Sect. 3.2, the complexity of integer adders is an order of magnitude less than FPUs. Therefore, compared to FPUs, BigInt summations are feasible to perform at line rate using existing technology and consuming an order of magnitude fewer gates, using only integer adder components in a pipelined ripple-carry fashion. Therefore, BigInts enable reduction operations to complete in comparable time as operations with double-precision variables, while providing no precision loss in the same number space with exact encoding for every real number in that space. Note that precision libraries such as GMP [25] which use string representations to offer arbitrary precision, would require significantly more expensive and therefore slow operations for summations because of the necessary conversions between string and numerical formats. Results are given normalized to the baseline case (positive numbers indicate longer execution times). Because benchmarks perform reduction operations similarly, reduction completion time is the same for all three benchmarks
Application Performance
For application results, we run a high-performance implementation of conjugent gradient (CG) [27] , geometric multigrid (GMG) [1] , and the SkaMPI benchmark suite (SkaMPI) [45] . All applications are executed in NERSC's Hopper using 8192 cores (342 nodes). We distinguish three cases: the baseline case using double variables, BigInts without NIC support that uses a C++ implementation of BigInt variables and operations and custom MPI datatypes [30] , and BigInts with NIC support. For the latter case, since we do not have the necessary access to Hopper's NICs, we rely on our technology analysis in Sect. 5.4 to show that summation of BigInts in the NIC can be performed at network line rate, and therefore poses no additional delay compared to the baseline case using doubles. BigInts implemented in software perform conversions in software to and from double variables at each node of the reduction operation. However, each conversion requires only a handful of instructions to translate a double variable to a bit array and vice versa. Table 3 shows completion time for reduction operations as well as execution time for each application. Reduction operation completion time is the time to complete a MPI reduction operation to calculate a sum, including the time to perform the summation at each node in hardware or software. Results are shown normalized to the baseline case with doubles.
As shown, the use of BigInts has a noticeable impact in the completion time of reduction operations, similar to Fig. 12 . The extra instructions to perform BigInt summations in software incur further slow reduction operations. However, the benchmarks are minimally affected by the increased completion time of reduction operations. In fact, only conjugent gradient (CG) shows a maximum of 1.6 % slowdown, whereas the other applications are not affected. This is because reduction operations are only a part of these applications and therefore an increase in reduction completion time has a small effect in overall performance. In fact, we notice that the scheduler's choice of placement of MPI ranks to nodes has a more noticeable effect. The impact of BigInts in other benchmarks can be estimated by how much time the benchmark spends in reduction operations compared to its overall execution time. However, we note that the cores in these experiments were not in a sleep state nor was the core to NIC interface stressed, as discussed in Sect. 5.4.
Discussion
Our experiments focus on summations because of their importance. In a wide variety of applications and mathematical operations, such as dot products and matrix multiplications, operands are multiplied in a pair-wise fashion, but then there is a final sum of all multiplication results. Because of the sheer number of operands which can reach millions or billions, this summation is most at risk for important precision loss. This is confirmed by past work which noted that using increased precision only for the final sum in many algorithms elimates almost all precision loss [28, 49] .
However, BigInts also apply to multiplications, simply by using an integer adder or the programmable logic in NICs. Past work has proposed several optimized integer multipliers, with a delay path that grows linearly with the number of integer bits [47] . While multiplication is more likely to overflow or underflow the result, which occurs when the result is beyond the boundaries of the IEEE double-precision variable number space, BigInts were designed for the same number space as doubles (extending that number space is a decision orthogonal to our work). Therefore, we do not double the amount of bits in the multiplication product of two BigInts. Section 3.2 discusses how to combine BigInts with error-free techniques for local processor computation, such as to provide system-wide computation with no precision loss. In most applications the number of operands is well beyond the number of nodes, which requires each node to perform a partial sum locally and then send their for the distributed reduction operation. Therefore, to guarantee no precision loss in the final result, we must use error-free techniques both within nodes and over the network. BigInts provide the solution for the distributed part of the summation and with no performance loss. For local computations, past work as already proposed similar wide fixed-point representations with hardware support [38, 39] , sorting or recursion techniques [18, 41, 49] , or software libraries with complex arbitrary-size representations [7, 25] . Techniques that use wide fixed-point representations are better suited to combine with BigInts, but they require hardware support in processors for efficient performance. In cases where that is not feasible in the processor but it is in the NICs, processors can transfer their operands to their local NICs to perform the computation.
Conclusion
This article makes the case that using a very wide fixed-point representation, which we call BigInt, is well-suited for distributed operations. BigInts encode the same number space as double-precision variables but have a simple enough format to perform operations such as additions in comparable time as double-precision variables with dedicated FPUs, using inexpensive hardware integer adders or programmable logic found in modern NICs. The additional payload for BigInts also does not appreciably extend communication delay or application execution time due to fixed latency costs in modern system-wide networks. Computing in NICs avoids the energy and latency costs of NIC to processor communication, as well as context switching or waking up the processor. Therefore, BigInts enable distributed reduction operations in NICs in large-scale systems, with no precision loss and also no performance degradation compared to double-precision variables. Computing in NICs is only achievable by past work with double-precision variables due to the complex internal structures and system-wide data movement introduced by increased-or arbitrary-precision software libraries and algorithms.
