In this paper we describe techniques for efficient calculation of the ISO checksum which, to our knowledge, are not discussed in current literature. We propose that future versions of the ISO protocols employ checksums computed using logical "bytes" twice as large as the actual ones. Measurements are presented comparing times required to calculate the XNS, IP, and ISO checksums with and without these techniques, and the proposed new checksum. Our refinements yield improvements of 5 to 10 percent in speed. Our proposed replacement checksum can be computed twice as quickly in some instances.
Introduction.
The computing community in the world at large is slowly ratifying agreements for international standards for the exchange of information over networks. Such agreements sometimes are partly political in nature, and one may be faced with the task of computing unpleasant quantities. The checksum chosen for the first round of these international standards ([IS86], [IS87], for example) was first proposed by John Fletcher [Fl82] as a way of providing the same order of protection as the more computationally expensive CRC algorithm. It has some interesting error detection properties, but is still somewhat expensive to compute, and has a decided impact on the total throughput of these protocols [Mc87]. Thus, even a modest improvement of five to ten percent in the speed of checksum computation conceivably could translate to a four to eight percent increase in transport throughput.
Fletcher analyzes the protection provided by these quantities (which he calls "check-bytes") for detecting errors under quite general circumstances, for bytes consisting of K bits (not just eight), and in fact considers higher order quantities, which we will not make use of. The ISO committees essentially have adopted the case K = 8 and the two check-bytes $C_0$ and $C_1$ as providing the basis for their checksum. More specifically, they allow space in the packets for two contiguous bytes to be chosen so that when the two quantities $C_0$ and $C_1$ are computed in 8-bit, one's complement arithmetic for the packet as a whole, both sums are 0.
Our paper has two purposes: first, to discuss additional techniques for mustering as much computational efficiency as we can in computing these quantities; second, to propose a modification to this algorithm, which can double computational efficiency and greatly improve error detection properties.
The 8-Bit Algorithm.
The IP and XNS checksum routines supplied with the 4.3 Berkeley Software Distribution of UNIX employ four techniques for reducing the time required for calculation. In papers analyzing Fletcher's checksum algorithm (those cited above, and [Co87]), we have found references to three of these. First, one's complement arithmetic can be done by using native two's complement arithmetic for some number of iterations known not to generate any carries, followed by a reduction step; second, reduction from 32- to 16- or 8-bit arithmetic can be done by merely adding up the constituent halfwords or bytes; and third, unrolling loops can contribute a substantial reduction in processing time.
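The three techniques just listed can be illustrated together in a minimal C sketch. It is not taken from the 4.3BSD sources: the function names, the chunk size of 4096 bytes, and the unrolling factor of four are assumptions chosen only so that the 32-bit accumulators provably cannot overflow between reductions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical chunk size: number of bytes summed between reductions.
 * Chosen conservatively so the 32-bit accumulators cannot overflow. */
#define FLETCHER8_CHUNK 4096

/* Reduce modulo 255 by adding up the constituent bytes, which works
 * because 256 is congruent to 1 modulo 255. */
static uint32_t reduce255(uint32_t x)
{
    while (x > 255)
        x = (x & 0xFF) + ((x >> 8) & 0xFF) + ((x >> 16) & 0xFF) + (x >> 24);
    return x == 255 ? 0 : x;  /* 255 and 0 represent the same value */
}

/* Byte-at-a-time Fletcher sums, reduced to the range 0..254. */
void fletcher8_deferred(const unsigned char *p, size_t len,
                        uint32_t *c0_out, uint32_t *c1_out)
{
    uint32_t c0 = 0, c1 = 0;

    while (len > 0) {
        size_t n = len < FLETCHER8_CHUNK ? len : FLETCHER8_CHUNK;
        len -= n;
        /* Unrolled by four: plain two's complement adds, no reduction. */
        for (; n >= 4; n -= 4) {
            c0 += p[0]; c1 += c0;
            c0 += p[1]; c1 += c0;
            c0 += p[2]; c1 += c0;
            c0 += p[3]; c1 += c0;
            p += 4;
        }
        /* Leftover bytes of this chunk. */
        for (; n > 0; n--) {
            c0 += *p++; c1 += c0;
        }
        /* Deferred reduction step, once per chunk. */
        c0 = reduce255(c0);
        c1 = reduce255(c1);
    }
    *c0_out = c0;
    *c1_out = c1;
}
```

Calling `fletcher8_deferred(buf, len, &c0, &c1)` should give the same residues as reducing modulo 255 after every single addition, at a fraction of the reduction cost.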
Additionally, we propose that less obvious or slightly more complex iteration algorithms that access two bytes of memory at a time instead of one may provide additional efficiency for some CPU's and cache architectures. (Our inspiration for this is the Internet checksum implementation in 4.3BSD, which references memory in 4-byte accesses instead of two.)
Computer architectures can be distinguished by the manner in which pairs or quadruples of bytes in memory serve as arithmetic operands in ALUs. In a "Big-Endian" byte-addressable machine, when a pair of bytes $b_n$ and $b_{n+1}$ is fetched from memory, the 16-bit arithmetic quantity $256 b_n + b_{n+1}$ is used as the value. By contrast, a "Little-Endian" machine will let the byte with the higher address have higher significance: the quantity $b_n + 256 b_{n+1}$ is used. We'll discuss computations on Big-Endian machines; modifications for Little-Endians are simple and do not affect the analysis. Initially, our discussion is directed at computing Fletcher's quantities. We'll also limit ourselves to packets with even numbers of bytes. Suppose we have a sequence of bytes in memory: $a_1, b_1, a_2, b_2, \ldots$
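As a small illustration of the two conventions, the following C sketch computes the 16-bit value each kind of machine would use for the same pair of bytes. The function names are hypothetical, and the arithmetic is written out explicitly so the result does not depend on the machine running the example.

```c
#include <stdint.h>

/* Value a Big-Endian machine would use for the byte pair at p:
 * the byte at the lower address is the more significant one. */
uint16_t pair_big_endian(const unsigned char *p)
{
    return (uint16_t)(256u * p[0] + p[1]);
}

/* Value a Little-Endian machine would use for the same pair:
 * the byte at the higher address is the more significant one. */
uint16_t pair_little_endian(const unsigned char *p)
{
    return (uint16_t)(p[0] + 256u * p[1]);
}
```

For the bytes 0x12, 0x34 in memory order, the first function yields 0x1234 and the second 0x3412.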
As a notational aid, let capital letters ($A_i$) denote 256 times the contents of the memory byte denoted by the small letter ($a_i$). Instead of computing $C_0$ and $C_1$ directly, we will compute two quantities $S_i$ and $T_i$ which, when reduced modulo 255, are equivalent to Fletcher's check-bytes, and determine how many iteration steps may be taken before a carry can occur. The expression is not elegant, but can be easily verified by induction:
Since multiplication distributes over addition and 256 is congruent to 1 mod 255, it is not hard to see that S reduces to $C_0$. Persistence and diligence in re-arranging terms will convince the reader that T reduces to $C_1$. Since each term $a_i$ and $b_i$ is no bigger than 255, closed form expressions for $1 + 2 + 3 + \cdots + n$ and $1 + 3 + 5 + \cdots + n$ give upper bounds on S and T after any number of iteration steps.

In an actual implementation, we will have to compute the values S and T for packets longer than 255 bytes. Every so often, we "fold" the quantities S and T by replacing each by the value obtained from adding the upper and lower 16 bits in its two's complement representation.
(Folding a quantity has no effect on the value it represents modulo 65535). The results of a folding operation are bounded by the maximum values for S and T after two iterations, so no carries can occur if an additional 253 iteration steps are made after a folding operation.
A C language implementation of our induction step might look like this:

On CPUs such as the DEC VAX, the CCI Power 6, and the Motorola 68000, this sequence requires six machine instructions per two bytes of data, which is admittedly the same instruction count as computing $C_0$ and $C_1$ in the standard way:

    unsigned char *cp;
    long C0, C1, Reg;

    Reg = *cp++; C0 += Reg; C1 += C0;
    Reg = *cp++; C0 += Reg; C1 += C0;
Although the instruction count is the same, the very fact of accessing memory in words may be faster for some machine architectures, as our experiments described below will show.
On the DEC VAX [DE81] and on the CCI Power 6 [Ha87], which both have a special "double indexed" addressing mode, it is possible to perform the addition of twice S to T in a single instruction, "move address of word". This reduces the instruction count from 6 per pair of bytes to 5 per pair of bytes. Here the induction step would look like:

    movzwl  n(r11),r8
    addl2   r8,r10
    movaw   (r9)[r10],r9
    andl2   $255,r8
    subl2   r8,r9

There remain two minor considerations: auto-increment, and loop control for reduction back to 16 bits. The CCI machine does not support auto-increment addressing, so, when the induction step is unrolled, one can use register displacement addressing with different displacements for each step unrolled. Although the VAX family of computers supports auto-increment, it appears that certain processors in the series compute this checksum more rapidly by using the method for the CCI machine, ignoring auto-increment.

For periodic reductions to 16 bits (to prevent overflow), it is convenient to test whether to reduce at the end of the largest unrolled loop, which in our implementation passes through the data 32 bytes at a time. Although one might be tempted to decrement a counter each time through the end of the unrolled loop (thus saving reductions for every sixth pass), it turns out to be 2 percent faster to check two bits for simultaneously being zero in the register counting the number of bytes remaining; even though this causes a reduction operation every fourth pass, the test is quite cheap, there is no overhead in resetting the counter, and the reduction itself is not expensive, merely being a store, two loads and an add.
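The two-bytes-at-a-time induction step can be sketched in portable C. The update rule used here (add the word to S, add twice S to T, subtract the low byte of the word from T) is an assumption inferred from the VAX instruction sequence above, not a listing from the authors. The function names are hypothetical, the word is formed explicitly in Big-Endian order, and folding is done every 253 steps, following the bound stated in the text.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of induction steps between folds, per the bound in the text. */
#define STEPS_PER_FOLD 253

/* Fold a 32-bit accumulator by adding its upper and lower 16 bits.
 * This preserves the value modulo 65535, and hence modulo 255. */
static uint32_t fold16(uint32_t x)
{
    return (x & 0xFFFFu) + (x >> 16);
}

/* Fletcher check-bytes (mod 255) of an even-length packet, reading
 * the data one 16-bit word at a time. len must be even. */
void fletcher8_by_words(const unsigned char *p, size_t len,
                        uint32_t *c0_out, uint32_t *c1_out)
{
    uint32_t s = 0, t = 0;
    size_t pairs = len / 2;

    while (pairs > 0) {
        size_t n = pairs < STEPS_PER_FOLD ? pairs : STEPS_PER_FOLD;
        pairs -= n;
        while (n-- > 0) {
            uint32_t w = 256u * p[0] + p[1];  /* Big-Endian word 256a + b */
            p += 2;
            s += w;          /* S grows by A + b                  */
            t += s + s;      /* T grows by twice the new S ...    */
            t -= w & 255u;   /* ... less the low byte b           */
        }
        s = fold16(s);
        t = fold16(t);
    }
    *c0_out = s % 255u;
    *c1_out = t % 255u;
}
```

Modulo 255 each step adds a + b to S and 2S + 2a + b to T (S being its value before the step), which is exactly what two steps of the byte-at-a-time recurrence add to $C_0$ and $C_1$.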
The 16-Bit Algorithm.
Fletcher's original paper carries out an error analysis of his algorithm for arbitrary K bit bytes, not necessarily just for K = 8. We would like to suggest that it may be profitable for future versions of ISO protocols to use the algorithm with K = 16. We have determined by measurement that the computation is significantly less costly, and show here that it has dramatically better error detection properties. An interesting feature of either checksum is that for sufficiently small packets, it will detect all double bit errors. Fletcher gives the bounding size as $2^K - 1$ "bytes". In the 8-bit case, this gives 255 bytes for the packet size, somewhat less than conventional Ethernet packet sizes for large data transfers. In the 16-bit case, this gives us 64K 16-bit bytes or 128K 8-bit bytes. Doubling the size of the bytes generally squares the probability of other sorts of undetected errors: the fraction of all undetected errors is on the order of $2.37 \times 10^{-10}$ (as opposed to $1.58 \times 10^{-5}$ in the 8-bit case). The probability of undetected 32-bit burst errors is on the order of $2^{-40}$ (compared with a probability of undetected 16-bit burst errors of $2^{-20}$ in the 8-bit case). These numbers are computed from the formulas given in Fletcher's paper.
A precise statement of the algorithm would be as follows. If the packet is of an odd number of 8-bit bytes, logically extend it with a zero byte. One then computes the two sums $C_0$ and $C_1$ using 16-bit one's complement arithmetic, thinking of each pair of bytes as a 16-bit number. The packet has been properly prepared and transmitted if the two sums are zero.
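The statement above can be turned into a minimal C sketch. The function name, the Big-Endian reading of each byte pair, and the choice of reducing every 256 words are assumptions made for illustration; the interval is picked so that the 32-bit accumulators cannot overflow between reductions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reduction interval, in 16-bit words. */
#define FLETCHER16_CHUNK 256

/* 16-bit Fletcher sums of a packet, reduced to the range 0..65534.
 * Each byte pair is read as a Big-Endian 16-bit number; a packet of
 * odd length is logically extended with one zero byte. */
void fletcher16(const unsigned char *p, size_t len,
                uint32_t *c0_out, uint32_t *c1_out)
{
    uint32_t c0 = 0, c1 = 0;
    size_t words = (len + 1) / 2;
    size_t i = 0;  /* byte index of the current pair */

    while (words > 0) {
        size_t n = words < FLETCHER16_CHUNK ? words : FLETCHER16_CHUNK;
        words -= n;
        while (n-- > 0) {
            uint32_t hi = p[i];
            uint32_t lo = (i + 1 < len) ? p[i + 1] : 0;  /* zero padding */
            i += 2;
            c0 += 256u * hi + lo;
            c1 += c0;
        }
        /* One's complement arithmetic on 16 bits is arithmetic mod 65535. */
        c0 %= 65535u;
        c1 %= 65535u;
    }
    *c0_out = c0;
    *c1_out = c1;
}
```

A receiver would accept the packet when both returned sums are zero.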
We wish to show that the verification procedure is independent of the "Endian-ness" of the machine. The effect of the computation being performed by a machine of the opposite Endian-ness is the same as reversing the order of every pair of bytes in the packet. We state that the 16-bit checksum of a byte-swapped packet is the same as swapping the bytes of the natively computed checksum for the original packet. The proof hinges on two observations: first, if we think of two bytes as representing a number modulo 65535, and we multiply the number by 256, the resulting number modulo 65535 is represented by the two bytes in the opposite order; second, multiplication distributes over sums and commutes with the multiplicative "weights" in sum $C_1$. Clearly, byte-swapping zero gives zero.

DEC and VAX are trademarks of Digital Equipment Corporation; CCI and Power 6 are trademarks of Computer Consoles, Incorporated; SUN is a trademark of Sun Microsystems.
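The byte-swapping claim can be checked numerically with a short C sketch. The function names are hypothetical, the packet is assumed to have even length, and the sums are reduced after every step for clarity rather than speed.

```c
#include <stddef.h>
#include <stdint.h>

/* Swap the two bytes of a 16-bit quantity held in a 32-bit variable. */
static uint32_t swap16(uint32_t x)
{
    return ((x & 0xFFu) << 8) | ((x >> 8) & 0xFFu);
}

/* 16-bit Fletcher sums of an even-length packet, reading each pair of
 * bytes as a Big-Endian machine would (little_endian == 0) or as a
 * Little-Endian machine would (little_endian != 0).
 * Results are reduced to the range 0..65534. */
void fletcher16_as(const unsigned char *p, size_t len, int little_endian,
                   uint32_t *c0_out, uint32_t *c1_out)
{
    uint32_t c0 = 0, c1 = 0;

    for (size_t i = 0; i + 1 < len; i += 2) {
        uint32_t w = little_endian ? p[i] + 256u * p[i + 1]
                                   : 256u * p[i] + p[i + 1];
        c0 = (c0 + w) % 65535u;
        c1 = (c1 + c0) % 65535u;
    }
    *c0_out = c0;
    *c1_out = c1;
}

/* Returns 1 when the sums seen by a Little-Endian machine are the
 * byte-swapped sums seen by a Big-Endian machine, 0 otherwise. */
int byte_swap_property_holds(const unsigned char *p, size_t len)
{
    uint32_t b0, b1, l0, l1;

    fletcher16_as(p, len, 0, &b0, &b1);
    fletcher16_as(p, len, 1, &l0, &l1);
    return l0 == swap16(b0) && l1 == swap16(b1);
}
```

Since swapping the bytes of zero gives zero, a packet that verifies on one kind of machine verifies on the other.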
As noted earlier, the checksum algorithm would most likely be employed in ISO protocols as a header option, namely a type byte, a length byte, and four contiguous bytes whose values would be chosen so that both check-quantities (now 16 bits each, giving us 32 total) would be zero. It is not hard to see that if the four bytes were word-aligned, the same formal computation that generates the present checksum option (for 8-bit Fletcher) would work for our replacement checksum. Nonetheless, it is still possible to choose values for the four bytes, even in the case of odd positioning, so that the two sums $C_0$ and $C_1$ will be zero. Using the naming scheme in the previous section, let us first assume we are to adjust a sequence $b_{t-1}, a_t, b_t, a_{t+1}$. We first replace all four bytes by zero, then compute $C_0$ and $C_1$ and require that both sums vanish once the four bytes are filled in. This is satisfied if we let $b_{t-1}$ be the lower byte of $C_1 - kC_0$, and $A_{t+1}$ be the one's complement of the upper byte. One can then get the other two quantities by substitution in the equation for $C_0$.
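For the word-aligned case only, the formal computation can be sketched in C. This is a sketch under stated assumptions, not the authors' procedure: it transplants the usual two-equation solution for the 8-bit check-bytes to arithmetic modulo 65535, reads byte pairs in Big-Endian order, and uses hypothetical function names. The odd-positioned case is not covered.

```c
#include <stddef.h>
#include <stdint.h>

#define MOD16 65535u

/* 16-bit Fletcher sums over nwords Big-Endian words, range 0..65534. */
static void fletcher16_sums(const unsigned char *p, size_t nwords,
                            uint32_t *c0_out, uint32_t *c1_out)
{
    uint32_t c0 = 0, c1 = 0;

    for (size_t i = 0; i < nwords; i++) {
        c0 = (c0 + 256u * p[2 * i] + p[2 * i + 1]) % MOD16;
        c1 = (c1 + c0) % MOD16;
    }
    *c0_out = c0;
    *c1_out = c1;
}

/* Fill the two check words at zero-based word indices k and k + 1 so
 * that both sums over the whole packet become zero.
 * Requires k + 1 < nwords. */
void set_check_words_aligned(unsigned char *p, size_t nwords, size_t k)
{
    uint32_t c0, c1;

    /* Step 1: zero the four check bytes and sum the packet. */
    p[2 * k] = p[2 * k + 1] = p[2 * k + 2] = p[2 * k + 3] = 0;
    fletcher16_sums(p, nwords, &c0, &c1);

    /* The word at index i is counted (nwords - i) times in C1. */
    uint64_t wx = (uint64_t)(nwords - k);  /* weight of first check word  */
    uint64_t wy = wx - 1;                  /* weight of second check word */

    /* Step 2: solve  C0 + X + Y = 0  and  C1 + wx*X + wy*Y = 0
     * modulo 65535:  X = wy*C0 - C1,  Y = C1 - wx*C0. */
    uint32_t x = (uint32_t)(((wy % MOD16) * c0 + (MOD16 - c1)) % MOD16);
    uint32_t y = (uint32_t)((c1 + (MOD16 - (wx % MOD16) * c0 % MOD16)) % MOD16);

    /* Step 3: store both words in Big-Endian order. */
    p[2 * k]     = (unsigned char)(x >> 8);
    p[2 * k + 1] = (unsigned char)(x & 0xFFu);
    p[2 * k + 2] = (unsigned char)(y >> 8);
    p[2 * k + 3] = (unsigned char)(y & 0xFFu);
}
```

After the call, summing the whole packet with `fletcher16_sums` should return zero for both quantities.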
Measurements and Comparisons.
We have prepared two timing tests to give some feeling for the way checksum routines would compare under different limiting sorts of conditions. The first is the equivalent of checksumming a single 16 megabyte packet. This is designed to exercise mostly the inner unrolled loop of each algorithm. The second is checksumming 90 minutes worth of packets as found on one large Ethernet connecting SUN workstations to file servers, from trace data giving the length and type of every packet [Gu87].
Most of the traffic on the workstation network consisted of paging and non-checksummed file system references, which we exclude entirely in our test. The other packets were largely single character virtual terminal packets, or acknowledgements, with about a quarter of the remaining packets being file transfers and mail activity. We discard the paging and file activity as being atypical of communications over long-haul networks.
Under realistic conditions, if we had a dedicated mail gateway, although most of the data packets would be large, there would likely be about as many acknowledgement packets as data packets, and acknowledgement packets are typically quite small. Thus, one would not expect to see a mix of more than 50 percent of large packets.
We compared the following routines:
The standard IP checksum routine for the machine.
A "portable" IP checksum routine accessing memory a word at a time.
The standard XNS checksum routine for the machine.
A naive ISO implementation (one byte at a time), but with loop unrolling.
Our ISO implementation (two bytes at a time), but without the "movaw" instruction.
Our ISO implementation (two bytes at a time), with the "movaw" instruction.
The ISO implementation given in [Mc87] (no unrolling, using ediv, aob, etc.).
A 16-bit Fletcher checksum routine.
F16-NOAI: A 16-bit Fletcher checksum routine without auto-increment.
null: A routine merely returning 0, to measure overhead in the second test.
Each of these routines includes the mechanism (and consequent overhead) to process packets buffered in a segmented fashion. An anonymous reviewer of an earlier draft of this paper calls this "the curse of 4BSD: mbufs"; an amusingly described but regrettably pertinent detail that must be considered when trying to optimize the last iota out of checksum algorithms.

Our test bed included the following machines: CCI Power 6, VAX 8800, VAX 8600, VAX 785, VAX 750, and Sun 3/280. In Tables 1 and 2 we show the amount of CPU time in seconds to perform the single huge checksum or the checksums of approximately 250,000 packets of varying and more reasonable sizes, respectively. As one might guess, one sees less dramatic differences between protocol checksum times in the mixed packet test than in the large packet test, due to increased overhead in the former. It is amusing to note that using the special machine-specific instruction pays off only on sufficiently sophisticated CPUs, and then only for large packets. In almost every case, the extended Fletcher algorithm is faster than the XNS checksum, though, at best, it runs twice as slowly as an optimized IP checksum. The regular ISO checksum can be made to run only two and a half times more slowly than the IP checksum in the case of a realistic mix of packets, even though it takes longer by a factor of at least 4.5 for very large packets.

We have proposed some alternative techniques for computing the ISO checksum. As we have seen, not every architecture makes it worthwhile to trade more numerous or complicated instructions for fewer references to data.
We have proposed a related checksum algorithm which has significantly better error-detection properties. We showed that this checksum presents no special burden on machines of differing "Endian-ness". Our data shows that for every machine, the proposed checksum was faster to compute than the existing one, up to a factor of two for large packets.
For those architectures where it is faster to use more complicated instructions and reference memory less often, one might ask, if two bytes at a time is a good thing for the standard ISO checksum, might not four be even better? Unfortunately, we have not been able to devise an induction step of sufficiently few instructions to make this pay off, as two instructions are required to perform a 64-bit add on most of the machines at our disposal. However, we are tantalized by the prospect of other machines having single instruction 64-bit arithmetic, or possibly even using floating point arithmetic to get us a one-instruction 53-bit (i.e. mantissa-sized) register.
Lastly, we have proposed an alternative checksum algorithm which has significantly better error detection properties than any of the existing three, and is far from the worst among them in computational requirements.
