It is known that using larger byte-sizes to access memory usually results in faster computation of checksum algorithms. This paper proposes two different ways to use larger byte-sizes to improve the performance of the OSI checksum. First, an algorithm is presented that computes the 8-bit checksum using 16-bit integers. It is shown that this algorithm yields a 5 to 20 percent performance improvement on many architectures. Second, the benefits of expanding the basic computation unit of the OSI checksum algorithm to 16-bit integers are considered. This change can yield an additional performance improvement of up to 50 percent and greatly extended error detection properties, although it is incompatible with the current standard. The measurements of these algorithms are compared with some taken of checksums in common use, such as IP and XNS.
Fletcher analyzes the protection provided by these quantities (which he calls "check-bytes") for detecting errors under quite general circumstances, for bytes consisting of K bits (not just eight), and in fact considers higher order quantities, which we will not make use of. The ISO committees essentially have adopted the case K = 8 and the two check-bytes $C_0$ and $C_1$ as providing the basis for their checksum. More specifically, they allow space in the packets for two contiguous bytes to be chosen so that when the two quantities $C_0$ and $C_1$ are computed in 8-bit, one's complement arithmetic for the packet as a whole, both sums are 0.
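As a point of reference for what follows, the two check quantities can be sketched one byte at a time. This is a minimal illustration, not the standard's text: the function name `fletcher8_sums` is hypothetical, and 8-bit one's complement arithmetic is modelled here as reduction modulo 255 (so the value 255 is treated as zero).

```c
#include <stddef.h>

/* Hypothetical helper: accumulate Fletcher's two check quantities over a
 * buffer, one byte at a time, reducing modulo 255 at every step. */
void fletcher8_sums(const unsigned char *buf, size_t len,
                    unsigned *c0, unsigned *c1)
{
    unsigned s0 = 0, s1 = 0;

    for (size_t i = 0; i < len; i++) {
        s0 = (s0 + buf[i]) % 255u;  /* C0: running sum of the bytes */
        s1 = (s1 + s0) % 255u;      /* C1: running sum of the C0 values */
    }
    *c0 = s0;
    *c1 = s1;
}
```

A receiver would run the same loop over the whole packet, check-bytes included, and accept the packet when both results are zero.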
Our paper has two purposes: first, to discuss additional techniques for mustering as much computational efficiency as we can in computing these quantities; second, to propose a modification to this algorithm, which can double computational efficiency and greatly improve error detection properties.
2. The 8-Bit Algorithm
The 4.3 Berkeley Software Distribution of UNIX* provides implementations of two communications protocols in common use, IP and XNS. The IP and XNS checksum routines supplied employ four techniques for reducing the time required for calculation. In papers analyzing Fletcher's checksum algorithm (those cited above, [Na88] and [Co87]), we have found references to three of these. First, one's complement arithmetic can be done by using native two's complement arithmetic for some number of iterations known not to generate any carries, followed by a reduction step; second, reduction from 32- to 16- or 8-bit arithmetic can be done by merely adding up the constituent halfwords or bytes; and third, unrolling loops can contribute a substantial reduction in processing time.
Additionally, we propose that less obvious or slightly more complex iteration algorithms that access two bytes of memory at a time instead of one may provide additional efficiency for some CPUs and cache architectures. (Our inspiration for this is the Internet checksum implementation in 4.3BSD, which references memory in 4-byte accesses instead of two.)
These four techniques (and others) are discussed at length in recent papers on computing the IP checksum ([Br89], [P189]). The most commonly available reference on the XNS protocols ([Xe81]) describes the checksum algorithm without formulas, so we will present a brief analysis of the XNS checksum in Appendix A for the insatiably inquisitive.
Computer architectures can be distinguished by the manner in which pairs or quadruples of bytes in memory serve as arithmetic operands in ALUs. In a "Big-Endian" byte-addressable machine, when a pair of bytes $b_n$ and $b_{n+1}$ is fetched from memory, the 16-bit arithmetic quantity $256 b_n + b_{n+1}$ is used as the value. By contrast, a "Little-Endian" machine will let the byte with the higher address have higher significance: the quantity $b_n + 256 b_{n+1}$ is used. We'll discuss computations on Big-Endian machines; modifications for Little-Endians are simple and do not affect the analysis. Initially, our discussion is directed at computing Fletcher's quantities. We'll also limit ourselves to packets with even numbers of bytes. Suppose we have a sequence of bytes in memory:

$a_1, b_1, a_2, b_2, \ldots, a_n, b_n$
* UNIX is a registered trademark of AT&T Bell Laboratories in the USA and other countries.

As a notational aid, let capital letters ($A_i$) denote 256 times the contents of the memory byte denoted by the small letter ($a_i$). Instead of computing $C_0$ and $C_1$ directly, we will compute two quantities $S$ and $T$ iteratively, with initial conditions $S_0 = T_0 = 0$, and where $(A_i + b_i)$ is the result of fetching two bytes of memory. Having a closed form expression for $S$ and $T$ will help us both to understand why they are equivalent to Fletcher's check-bytes and to determine how many iteration steps may be taken before a carry can occur. The expression is not elegant, but can be easily verified by induction:
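One form of the recurrence and its closed-form solution, reconstructed here from the C induction step given later in this section, would be as follows. It assumes the byte pairs are indexed $i = 1, \ldots, n$ and that each step adds the fetched word to $S$ and adds the old $S$, the new $S$ and $A_i$ to $T$.

```latex
\begin{aligned}
S_i &= S_{i-1} + (A_i + b_i), \\
T_i &= T_{i-1} + S_{i-1} + S_i + A_i, \qquad S_0 = T_0 = 0, \\[4pt]
S_n &= \sum_{i=1}^{n} (A_i + b_i), \\
T_n &= \sum_{i=1}^{n} \bigl[(2(n-i)+2)\,A_i + (2(n-i)+1)\,b_i\bigr].
\end{aligned}
```

Under these assumptions the weights $2(n-i)+2$ and $2(n-i)+1$ are the number of bytes from $a_i$ and from $b_i$, respectively, to the end of the packet, which is the weighting Fletcher's second check-byte applies.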
Since multiplication distributes over addition and 256 is congruent to 1 mod 255, it is not hard to see that $S$ reduces to $C_0$. Persistence and diligence in rearranging terms will convince the reader that $T$ reduces to $C_1$.
Since each term $a_i$ and $b_i$ is no bigger than 255, and by using closed form expressions for $1 + 2 + 3 + \cdots + n$ and $1 + 3 + 5 + \cdots$

In an actual implementation, we will have to compute the values $S$ and $T$ for packets longer than 255 bytes. Every so often, we "fold" the quantities $S$ and $T$ by replacing each by the value obtained from adding the upper and lower 16 bits in its two's complement representation. (Folding a quantity has no effect on the value it represents modulo 65535.) The results of a folding operation are bounded by the maximum values for $S$ and $T$ after two iterations, so no carries can occur if an additional 253 iteration steps are made after a folding operation.
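The folding step and the carry bound can be checked with a short sketch. The names `fold32` and `safe_steps` are hypothetical, and the sketch assumes 32-bit unsigned accumulators and worst-case data in which every fetched word is 0xFFFF.

```c
#include <stdint.h>

/* Fold a 32-bit accumulator: add its upper and lower 16-bit halves.
 * The result is congruent to x modulo 65535 and is at most 2 * 0xFFFF. */
uint32_t fold32(uint32_t x)
{
    return (x >> 16) + (x & 0xFFFFu);
}

/* Hypothetical helper: starting from accumulators s0 and t0, count how many
 * induction steps can run on worst-case data (every word 0xFFFF) before T
 * would no longer fit in 32 bits.  T grows faster than S, so only T is
 * tested.  64-bit arithmetic is used only to detect the overflow. */
unsigned safe_steps(uint32_t s0, uint32_t t0)
{
    uint64_t s = s0, t = t0;
    unsigned n = 0;

    for (;;) {
        uint64_t s_next = s + 0xFFFFu;               /* S += word          */
        uint64_t t_next = t + s + s_next + 0xFF00u;  /* T += S_old+S_new+A */
        if (t_next > UINT32_MAX)
            return n;
        s = s_next;
        t = t_next;
        n++;
    }
}
```

Starting from zero this count is 255 steps; starting from the largest values a fold can leave behind (0x1FFFE in each accumulator) it is 253, in agreement with the figure quoted above.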
A C language implementation of our induction step might look like this:

    unsigned short *wp;     /* points at the next pair of bytes */
    long S, T, upper;

    upper = *wp++;          /* fetch A + b in one memory access */
    T += S;
    S += upper;
    T += S;
    upper &= 0xff00;        /* keep only A, the upper byte */
    T += upper;

On CPUs such as the DEC VAX, the CCI Power 6, and the Motorola 68000†, this sequence requires six machine instructions per two bytes of data, which is admittedly the same instruction count as computing $C_0$ and $C_1$ in the standard way:

    unsigned char *cp;      /* points at the next byte */
    long C0, C1, Reg;

    Reg = *cp++;
    C0 += Reg;
    C1 += C0;
    Reg = *cp++;
    C0 += Reg;
    C1 += C0;
Although the instruction count is the same, the very fact of accessing memory in words may be faster for some machine architectures, as our experiments described below will show.
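Putting the induction step, the periodic fold and the final reduction together gives a sketch like the one below. It is an illustration under stated assumptions rather than the routine measured in this paper: the name `fletcher8_by_words` is hypothetical, words are assembled Big-Endian style from individual bytes so that the result does not depend on the host, there is no loop unrolling, and the fold interval of 128 steps is simply a conservative choice below the bound discussed above.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch: 8-bit Fletcher sums computed two bytes per step.
 * len must be even; results are returned modulo 255. */
void fletcher8_by_words(const unsigned char *buf, size_t len,
                        unsigned *c0, unsigned *c1)
{
    uint32_t S = 0, T = 0;
    unsigned steps = 0;

    for (size_t i = 0; i + 1 < len; i += 2) {
        uint32_t w = ((uint32_t)buf[i] << 8) | buf[i + 1];  /* A + b */

        T += S;
        S += w;
        T += S;
        T += w & 0xFF00u;               /* the extra A term */

        if (++steps == 128) {           /* fold well before any carry */
            S = (S >> 16) + (S & 0xFFFFu);
            T = (T >> 16) + (T & 0xFFFFu);
            steps = 0;
        }
    }
    /* 65535 = 255 * 257, so values congruent mod 65535 agree mod 255. */
    *c0 = S % 255u;
    *c1 = T % 255u;
}
```

For any even-length buffer the two outputs should match a byte-at-a-time computation of $C_0$ and $C_1$ modulo 255.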
On the DEC VAX [DE81] and on the CCI Power 6 [Ha87], which both have a special "double indexed" addressing mode, it is possible to perform the addition of twice $S$ to $T$ in a single instruction, "move address of word". This reduces the instruction count from 6 per pair of bytes to 5 per pair of bytes. Here the induction step would look like:

    movzwl  n(r11),r8
    addl2   r8,r10
    movaw   (r9)[r10],r9
    andl2   $255,r8
    subl2   r8,r9
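The equivalence of the five-instruction sequence and the six-operation C step can be sketched in C. The function names are hypothetical, and the reading of the registers (r8 as the fetched word, r10 as $S$, r9 as $T$) is an interpretation of the listing above.

```c
#include <stdint.h>

/* Six-operation step, as in the C fragment: T gains S_old + S_new + A. */
void step_six(uint32_t w, uint32_t *S, uint32_t *T)
{
    *T += *S;
    *S += w;
    *T += *S;
    *T += w & 0xFF00u;
}

/* Five-operation step mirroring the assembly: add the word to S, add twice
 * the new S to T (the job done by movaw), then subtract the low byte b.
 * The subtraction cannot go negative because 2 * S_new >= 2 * w >= b. */
void step_five(uint32_t w, uint32_t *S, uint32_t *T)
{
    *S += w;
    *T += 2u * *S;
    *T -= w & 0x00FFu;
}
```

Both steps add $2S_{i-1} + 2A_i + b_i$ to $T$, so the two accumulator pairs stay identical step for step.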
There remain two minor considerations: auto-increment, and loop control for reduction back to 16 bits. The CCI machine does not support auto-increment addressing, so, when the induction step is unrolled, one can use register displacement addressing with different displacements for each step unrolled. Although the VAX family of computers supports auto-increment, it appears that certain processors in the series compute this checksum more rapidly by using the method for the CCI machine, ignoring auto-increment.
For periodic reductions to 16 bits (to prevent overflow), it is convenient to test whether to reduce at the end of the largest unrolled loop, which in our implementation passes through the data 32 bytes at a time. Although one might be tempted to decrement a counter each time through the end of the unrolled loop (thus saving reductions for every sixth pass), it turns out to be 2 percent faster to check two bits for simultaneously being zero in the register counting the number of bytes remaining; even though this causes a reduction operation every fourth pass, the test is quite cheap, there is no overhead in resetting the counter, and the reduction itself is not expensive, merely being a store, two loads and an add.
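The loop-control test can be illustrated with a sketch that only counts how often a reduction would be triggered. The function name `count_folds` and the particular mask 0x60 (the bits worth 32 and 64 in the remaining byte count) are assumptions chosen to reproduce the "every fourth pass" behaviour; the bits actually tested in our implementation are not stated here.

```c
#include <stddef.h>

/* Hypothetical illustration: the unrolled loop consumes 32 bytes per pass,
 * and a fold is triggered whenever two chosen bits of the remaining byte
 * count are both zero, which happens on every fourth pass. */
size_t count_folds(size_t nbytes)
{
    size_t remaining = nbytes, folds = 0;

    while (remaining >= 32) {
        /* ... 16 unrolled induction steps over 32 bytes would go here ... */
        remaining -= 32;
        if ((remaining & 0x60u) == 0)
            folds++;            /* fold S and T back to 16 bits here */
    }
    return folds;
}
```

With this mask at most four passes (64 induction steps) separate two folds, comfortably within the 253 steps allowed after a fold.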
3. The 16-Bit Algorithm
Fletcher's original paper carries out an error analysis of his algorithm for arbitrary K-bit bytes, not necessarily just for K = 8. We would like to suggest that it may be profitable for future versions of OSI protocols to use the algorithm with K = 16. We have determined by measurement that the computation is significantly less costly, and show here that it has dramatically better error detection properties.
An interesting feature of either checksum is that for sufficiently small packets, it will detect all double bit errors. Fletcher gives the bounding size as $2^K - 1$ "bytes". In the 8-bit case, this gives 255 bytes for the packet size, somewhat less than conventional Ethernet packet sizes for large data transfers. In the 16-bit case, this gives us 64K 16-bit bytes or 128K 8-bit bytes. Doubling the size of the bytes generally squares the probability of other sorts of undetected errors: the fraction of all undetected errors is on the order of $2.37 \times 10^{-10}$ (as opposed to $1.58 \times 10^{-5}$ in the 8-bit case). The probability of undetected 32-bit burst errors is on the order of $2^{-40}$ (compared with a probability of undetected 16-bit burst errors of $2^{-20}$ in the 8-bit case). These numbers are computed from the formulas given in Fletcher's paper.

† DEC and VAX are trademarks of Digital Equipment Corporation; CCI and Power 6 are trademarks of Computer Consoles, Incorporated; SUN is a trademark of Sun Microsystems.
A precise statement of the algorithm would be as follows. If the packet is of an odd number of 8-bit bytes, logically extend it with a zero byte. One then computes the two sums $C_0$ and $C_1$ using 16-bit one's complement arithmetic, thinking of each pair of bytes as a 16-bit number. The packet has been properly prepared and transmitted if the two sums are zero.
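The statement above can be sketched directly in C. The function name `fletcher16_sums` is hypothetical; the sketch reads each pair of bytes as a Big-Endian 16-bit number, models 16-bit one's complement arithmetic as reduction modulo 65535, and reduces at every step for clarity rather than speed.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch: the two 16-bit check quantities over a packet.  An odd-length
 * packet is logically extended with a zero byte. */
void fletcher16_sums(const unsigned char *buf, size_t len,
                     uint16_t *c0, uint16_t *c1)
{
    uint32_t s0 = 0, s1 = 0;

    for (size_t i = 0; i < len; i += 2) {
        uint32_t lo = (i + 1 < len) ? buf[i + 1] : 0;   /* zero padding */
        uint32_t w = ((uint32_t)buf[i] << 8) | lo;      /* 256*a + b */

        s0 = (s0 + w) % 65535u;
        s1 = (s1 + s0) % 65535u;
    }
    *c0 = (uint16_t)s0;
    *c1 = (uint16_t)s1;
}
```

A production version would defer the reductions and unroll the loop in the same way as the 8-bit routine.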
We wish to show that the verification procedure is independent of the "Endian-ness" of the machine. The effect of the computation being performed by a machine of the opposite Endian-ness is the same as reversing the order of every pair of bytes in the packet. We state that the 16-bit checksum of a byte-swapped packet is the same as swapping the bytes of the natively computed checksum for the original packet. The proof hinges on two observations: first, if we think of two bytes as representing a number modulo 65535, and we multiply the number by 256, the resulting number modulo 65535 is represented by the two bytes in the opposite order; second, multiplication distributes over sums and commutes with the multiplicative "weights" in sum $C_1$. Clearly, byte-swapping zero gives zero.
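The two observations can be written out as a short worked equation. The symbols are notation assumed for this illustration: $a$ and $b$ are the two bytes of a word $w = 256a + b$, and $\lambda_i$ stands for the weight that the sum $C_1$ applies to the $i$-th word.

```latex
\begin{aligned}
256\,(256a + b) &= 65536\,a + 256\,b = 65535\,a + (256\,b + a) \equiv 256\,b + a \pmod{65535}, \\
\sum_i \lambda_i\,(256\,w_i) &= 256 \sum_i \lambda_i\,w_i .
\end{aligned}
```

The first line says that byte-swapping a word is multiplication by 256 modulo 65535; the second says that swapping every word of the packet therefore multiplies each check quantity by 256, which is again a byte swap of that quantity.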
As noted earlier, the way the checksum algorithm would be likely to be employed in OSI protocols would be to have a header option, namely a type byte, length byte, and four contiguous bytes whose values would be chosen so that both check-quantities (now 16 bits each, giving us 32 total) would be zero. It is not hard to see that if the four bytes were word-aligned, the same formal computation that generates the present checksum option (for 8-bit Fletcher) would work for our replacement checksum.

Nonetheless, it is still possible to choose values for the four bytes, even in the case of odd positioning, so that the two sums $C_0$ and $C_1$ will be zero. Using the naming scheme in the previous section, let us first assume we are to adjust a sequence $b_{k-1}, a_k, b_k, a_{k+1}$. We first replace all four bytes by zero, then compute
where each $w_i$ is obtained by adding $A_i + b_i$ as a one's complement 16-bit quantity in the Big-Endian case. This gives us two equations to satisfy:

Subtracting $k$ times the first equation from the second, we get

This is satisfied if we let $b_{k-1}$ be the lower byte of $C_1 - kC_0$, and $A_{k+1}$ be the one's complement of the upper byte. One can then get the other two quantities by substitution in the equation for $C_0$.
4. Measurements and Comparisons
We prepared two timing tests to give some feeling for the way checksum routines would compare under different limiting sorts of conditions. The tests were run typically on lightly loaded machines with large memories, but in off hours. Each measurement was taken a few times to establish reasonable credibility. By contrast, the measurements in [Na88] were made in a much more careful manner. Nonetheless, we believe that the data discussed here are sufficient to make qualitative comparisons.
The first test is the equivalent of checksumming a single 16 megabyte packet. This is designed to exercise mostly the inner unrolled loop of each algorithm. The second is checksumming 90 minutes' worth of packets as found on one large Ethernet connecting SUN workstations to file servers, from trace data giving the length and type of every packet [Gu87].
Most of the traffic on the workstation network consisted of paging and non-checksummed file system references, which we exclude entirely in our test. The other packets were largely single character virtual terminal packets, or acknowledgements, with about a quarter of the remaining packets being file transfers and mail activity. We discarded the paging and file activity (comprising 60 percent of the network traffic) as being as yet relatively uncommon in communications over long-haul networks.
Under realistic conditions, if we had a dedicated mail gateway, although most of the data packets would be large, there would likely be about as many acknowledgement packets as data packets, and acknowledgement packets are typically quite small. Thus, one would not expect to see a mix of more than 50 percent of large packets.
We compared the following routines:

IP       The standard IP checksum routine for the machine.
PIP      A "portable" IP checksum routine accessing memory a word at a time.
XNS      The standard XNS checksum routine for the machine.
simple   A naive OSI implementation (one byte at a time), but with loop unrolling.
NoMagic  Our OSI implementation (two bytes at a time), but without the "movaw" instruction.
Magic    Our OSI implementation (two bytes at a time), with the "movaw" instruction.
Ediv     The OSI implementation given in [Mc87] (no unrolling, using ediv, aob, etc.).

Each of these routines includes the mechanism (and consequent overhead) to process packets buffered in a segmented fashion. An anonymous reviewer of an earlier draft of this paper calls this "the curse of 4BSD: mbufs"; an amusingly described but regrettably pertinent detail that must be considered when trying to optimize the last iota out of checksum algorithms.
Our test bed included the following machines: CCI Power 6, VAX 8800, VAX 8600, VAX 785, VAX 750, and Sun 3/280. We will present all the raw data collected: Table 1 shows the amount of CPU time in seconds to perform the single huge checksum, and Table 2 shows the amount of CPU time in seconds to checksum approximately 250,000 packets of varying and more reasonable sizes. We will then draw some comparisons using more specific tables, so that it should not be necessary for the reader to memorize the first two completely.

As one might guess, one sees less dramatic differences between protocol checksum times in the mixed packet test than in the large packet test, due to increased overhead in the former. Here, for a single CPU (the DEC VAX 8800), we have the following ranges of times:

We notice that the ratio of slowest time to fastest time in the big packet test is about 9.1, whereas in the mixed packet test it is only about 3.4. A similar amelioration occurs across all machines:
† Machine lacks addressing mode or special instruction.

The regular OSI checksum can be made to run only two and a half times more slowly than the IP checksum in the case of a realistic mix of packets, even though it takes longer by a factor of at least 4.4 for very large packets. It is amusing to note that using the special machine-specific instruction pays off only on extremely sophisticated CPUs, and then only for large packets. In the following table we note which was the best OSI algorithm for the given machine type and test; the ratio of OSI to IP performance; and the percent improvement of the best OSI algorithm for that machine and test over the simple OSI implementation.

In almost every case, the extended Fletcher algorithm is comparable to the XNS checksum, though, at best, it runs twice as slowly as an optimized IP checksum for large packets. Furthermore, the simple act of writing the C code for the extended Fletcher algorithm in such a way as to avoid having auto-increment addressing generated seems to make an improvement in efficiency on the same order as our proposed technique for improving the existing algorithm. In our final table, we provide the ratios of times for extended Fletcher over IP, XNS over "new" Fletcher, and the percentage of marginal improvement from choosing the best auto-increment strategy, for both big and small packet tests.
5. Conclusions
We have proposed some alternative techniques for computing the OSI checksum. As we have seen, not every architecture makes it worthwhile to trade more numerous or more complicated instructions for fewer references to data.
We have proposed a related checksum algorithm which has significantly better error-detection properties. We showed that this checksum presents no special burden on machines of differing "Endian-ness". Our data show that for every machine, the proposed checksum was faster to compute than the existing one, up to a factor of two for large packets.
For those architectures where it is faster to use more complicated instructions and reference memory less often, one might ask, if two bytes at a time is a good thing for the standard OSI checksum, might not four be even better? Unfortunately, we have not been able to devise an induction step of sufficiently few instructions to make this pay off, as two instructions are required to perform a 64-bit add on most of the machines at our disposal. However, we are tantalized by the prospect of other machines having single instruction 64-bit arithmetic, or possibly even using floating point arithmetic to get us a one-instruction 53-bit (i.e. mantissa-sized) register.
Lastly, we have proposed an alternative checksum algorithm which has significantly better error detection properties than any of the existing three, and is far from the worst among them in computational requirements.
6. Acknowledgements
The author wishes to acknowledge careful reading of drafts, numerous suggestions for improvements and general encouragement of the following people (among others): Lawrence Landweber, two unnamed reviewers for Sigcomm 88, Riccardo Gusella, Domenico Ferrari, Tassos Nakassis, Van Jacobsen, and the Editor, Craig Partridge.
Appendix A -- The XNS Checksum Algorithm
The definition given for the checksum in [Xe81] (pp. 19-20) is "an optional ones complement add-and-left-cycle (rotate) of all the 16-bit words of the internet packet excluding the checksum word itself."
We note that multiplication by 2 modulo $2^{16} - 1$ has the same effect as left cycling a 16-bit quantity. Horner's rule for evaluating a polynomial in $X$ is precisely an iterative process of multiplying by the fixed quantity $X$ and adding coefficients; this action is very similar (with $X = 2$). If we have a sequence of 16-bit words

$w_0, w_1, w_2, \ldots, w_n$

we are instructed to compute

$S_i = 2(S_{i-1} + w_i)$ with $S_0 = 0$.

The closed form solution can be shown to be
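Unrolling the recurrence gives one candidate for this closed form. The indexing is an assumption: the sum below runs over the words $w_1, \ldots, w_n$ that the recurrence actually consumes, and the general form keeps the $2^n S_0$ term for a non-zero starting value.

```latex
S_n = 2^{n} S_0 + \sum_{i=1}^{n} 2^{\,n-i+1}\, w_i ,
\qquad\text{so with } S_0 = 0:\quad
S_n = \sum_{i=1}^{n} 2^{\,n-i+1}\, w_i .
```

It can be checked by induction: substituting the expression for $S_{n-1}$ into $S_n = 2(S_{n-1} + w_n)$ doubles every existing coefficient and adds the new term $2 w_n$.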
Each iteration step can add at most two high bits of significance. The implementation in 4.3BSD performs eight iteration steps before reducing the partial sum back to a 16-bit quantity (and thus avoids bit rotations altogether). The closed form solution can be used to show that the XNS checksum is independent of endian-ness: multiplication by $2^8$ has the same effect as swapping bytes, distributes over sums and commutes with multiplication by other powers of 2.
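The relationship between the rotate-based definition and the deferred-reduction approach can be sketched as follows. Both function names are hypothetical, the words are taken as already-assembled 16-bit values, and the sketch follows the recurrence given in this appendix without any special treatment of the final value; it is not the 4.3BSD routine itself.

```c
#include <stddef.h>
#include <stdint.h>

/* One's complement 16-bit add with end-around carry. */
static uint16_t oc_add(uint16_t a, uint16_t b)
{
    uint32_t s = (uint32_t)a + b;
    return (uint16_t)((s & 0xFFFFu) + (s >> 16));
}

/* Left cycle (rotate) a 16-bit quantity by one bit. */
static uint16_t rotl16(uint16_t x)
{
    return (uint16_t)((x << 1) | (x >> 15));
}

/* Literal add-and-left-cycle over n words. */
uint16_t xns_sum_rotate(const uint16_t *w, size_t n)
{
    uint16_t s = 0;
    for (size_t i = 0; i < n; i++)
        s = rotl16(oc_add(s, w[i]));
    return s;
}

/* Same recurrence in 32-bit two's complement arithmetic, doubling instead
 * of rotating and folding back to 16 bits after every eighth step. */
uint16_t xns_sum_deferred(const uint16_t *w, size_t n)
{
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++) {
        s = 2u * (s + w[i]);
        if ((i & 7u) == 7u)
            s = (s >> 16) + (s & 0xFFFFu);
    }
    s = (s >> 16) + (s & 0xFFFFu);      /* at most 0x1FFFE */
    s = (s >> 16) + (s & 0xFFFFu);      /* at most 0xFFFF  */
    return (uint16_t)s;
}
```

The two routines agree modulo 65535; they can differ only in which of the two one's complement representations of zero (0 or 0xFFFF) they return.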
In fact, a little more care in the analysis (including the contribution of a non-zero $S_0$ as $2^n S_0$, bounding each $w_i$, and summing the powers of 2) would show that one could perform 14 iteration steps without overflow; however, whether this translates to any actual savings is left as a measurement for the reader!

9. Appendix B -- Sample Checksum Routine (Little-Endian Version)

/*
 * Copyright (c) 1988, 1989
