MD-5 is an authentication algorithm proposed as the required implementation of the authentication option in IPv6. This paper presents an analysis of the speed at which MD5 can be implemented in software and hardware, and discusses whether its use inter-jeres with high bandwidth networking. 
1: Introduction
The current Intemet Protocol (1P) is undergoing its first major revision in 14 years [24] . As part of that revision, the new 1P (IPv6, [12] ) proposes a number of required options that were not required intheprevious IP(IPv4, [24] ) .Tbispaper describes a performance analysis of MD5 [28] , the proposed "required optional" authentication algorithm in IPv6 [1] . Analysis indicates that MD5 may not adhere to the performance criterion of IPv6 [23] , and thus its mandate as thedefardt forarequired option in IPv6 should be reconsidered.
This paper is organized as follows:
q An overview of MD5 and its relevance to IPv6. Permissionto make digital/hard copies of all or part of this material without fee is granted provided that the copies are not made or distributed for profit or commercial advantage, the ACM copyright.kerver notice, the title of the pubhcatlon and its date appear, and notice IS given that copyright is by permission of the Association for Computmg Machinery, Inc (ACM). To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
1.1: IPv6 performance criterion
The technical criteria for IPv6 (also known as IPng) includes an explicit performance criterion [23] :
A state of the art, commercial grade router must be able to process and forward IPng tra~c at speeds capable of jidly utilizing common, commercially available, highspeed media at the time. Furthermore, at a minimum, a host must be able to achieve data transfer rates with IPng comparable to rates achieved with IPv4, using similar levels of host resources.
This criterion can be summarized as "first do no harm2". This criterion is specified in IPv6 without condition;
1P header options, whether voluntary or required, are not excepted.
Several voluntary options in IPv4 have become so-called required options in IPv6, notably authentication.
An option is a mechanism that can be enabled or disabled; voluntary options need not be implemented to conform to the standard, but required options must be implemented, but can still be enabled or disabled on individual packets. Authentication is a required option in IPv6. The authentication option allows per-packet specification of the particular authentication algorithm [1] . In keeping with the "required option" spirit, one algorithm is required to be implemented. The IPv6 authentication mechanism proposes MD5 as its required algorithm.
Intemet RFCS and Intemet Drafts regarding authentication do not address performance issues (as a rule). Notable exceptions are the ESP encapsulation mechanism [19] , a keyed MD5 [20] , and the technical criteria of IPv6 [23] . The claim in other documents is that current implementations exhibit poor performance, but that software optimizations or custom hardware will overcome this limitation [10] rithm [27] . The authentication algorithm computes a digest of the entire data of the message, used for authentication.
Typically, the message digest is registered with a trusted third-party, or encrypted via other means [20] . The digest is used by the receiver to verify the contents of a message. It can also be used to encrypt the contents of a message, via a second pass over the data by another algorithm. MD5 requires that both the sender and receiver compute the digest of the entire body of a message.
MD5 is used for authentication in a number of protocols. It is also included as an encapsulation mechanism in SIPP, IPv6, and IPv4 [19] . The following is a partial list of protocols or protocol options using MD5. Some protocols on this list chose MD5 explicitly because of its use in SNMP V2, thus its implementation in many routers (indicated by a star '*'): q SNMP V2 [10] q IPv6* /IPng* [1] [12] is shown below (<<< denotes rotate). The accumulated digest is denoted by {A, B,C,D}, as in RFC-1321 [28] :
There are 16 steps based on each of 4 logical functions; 4 based on Fare shown here. Theconstantskl andk2 are not necessarily identical in basic steps. and are not relevant to this analysis. The logical functions (A denotes .xor-) are: . TCP [ll] 2: Software Implementation Measurements q SOCKSV5 [16] q WWW'S Secure Hypertext Transfer Protocol [25] q WWW'sSimpIeMD5 [13] There is some concern that theperforrnanceof MD5 may not keep upwiththese other protocols. SNMPV2is acontrol protoco], not intended for high bandwidth continuous stream operation, so protocol performance is not critical. Even so, SNMP V2 uses several authentication mechanisms tooptimize security vs. performance, including one(DES-CBC) noted for its implementation at 1 Gbps in hardware (as indicatedin [19] ).
Overview of the MD5algorithm
MD5 is ablock-chained digest algorithm, computed over the data in phases of 5 12-byte blocks organized as little-endian 32-bit words (Figure 1 ). llrefirs tblocki sprocesse dwithaninitia lseed, resulting in a digest that becomes the seed for the next block. When the last block is computed, its digest is the digest for the entire stream. This chained seeding prohibits parallel processing of the blocks. Each 512-byte block is digested in4 phases. Each phase consists of 16 basic steps, for a total of 64 basic steps. Each step updates one word ofa4-word accumulated digest, using the entire intermediate digest as well as block data and constants. In general, each basic step depends on the output of the prior step, defeating simple parallelization of thesteps. The basic structure of the steps
The MD5 RFC-1321 includes a reference implementation written in C [28] . The performance of this software gives a baseline against which tocompare optimizations.
Measurement of the performance of the reference implementation precedes the abstract analysis to determine the focus of the analysis, and the extent of the optimization required. The MD5 reference implementation software was measured in four configurations on a variety of machines. The Raw configurationused thereference code asprovided inthe RFC, with modifications to per-process time measurement rather than "wall-clock" time and cache management.
The Optimized configuration included themodifications describedin Section 2.1, notably byteorder optimization.
Both configurations were executed with 50 passes over alternating blocks of 2 M bytes, to avoid observing cache effects (the i486 used 100 runs over blocks. due to limitations of the 486 platform). This also emphasized the performance of the block digest component, rather than the "housekeeping" overhead. Theresults arepresented in Tablel and Figure2. Two additional runs measured the performance using external (off-chip) andintemal (on-chip) caches. The Extermal-Cache run executed 5,000 passes overa single 20.000 byte block. The Inrernal-Cache run executed 20,000 passes over a single 5,000 byte block (the HP712 used 100,000 passes overa single 1,000 byte block, dueto hardware limitations).
Thevalues chosen for block sizes were based on information on external and internal cache sizes as published with the SPEC Benchmark results [8] .
All four configurations used randomized block initializations to remove potential data-dependent petiormance differences. MD5 is not a data-dependent algorithm, but some architectures could have exhibit data-dependent performance variations (though they did not). Thecode wascompiled under both native and GNU-based C compilers (where available) with all optimizations.
The optimized code spent 95% of its time in the main decode routine on all machines. On little-endian machines this represents only the MD5 data digest routine. On bi,g-endian machines. 2/3 of thetime wasspent on MD5datadigest.
and 1/3 wasspent on data byte reordering, required because MD5 uses little-endian grouping of the byte stream. Tbebyte swap cost was higher than expected due to the data copying involved. Table 1 closely verifies a reported figure of 100 Mbps for a DEC Alphal. This figure was found to occur only for data in the on-chip cache; optimized code run to avoid caching resulted in 86.6 Mbps. This table also indicates the measured speed of UDP/ 1P on Sun SPARC 2, 10/51 and 20/71 machines (measured at USC/ISI using Fore SBA-200 1P over ATM). Note that even the cache-based optimized software cannot keep pace with the 1P capabilities of some of these machines.
2.1: Specific Modifications
The Raw configuration included the following modification:
. Change CPU timer from getrimeofdavo to gewusageo q Add appropriate code and flags for -use block size, repeat number command-line options.
-use randomized initialization of test block.
-double-buffer the test block.
The flags were added to permit run-time configuration, and to provide configuration for cache management. The CPU time measurement was changed to use the more accurate getrusageo function, although even that function is known to have potentially large errors on heavily-loaded machines [18] . Lightly-loaded machines were used to avoid such errors.
The Optimized configuration included modifications:
q On little-endian machines, load 32-bit words directly.
-Avoid byte-swapping and copy overhead altogether.
q On big-endian machines, use more effective swap code,
-Use a more efficient byte-swap source code.
-Unroll the byte swap routine.
These optimization can be grouped into categories: Manual optimization were limited to overhead code only. Prior to optimization, overhead was 33% on all machines. After optimization, overhead was reduced to 4~o on little-endian machines, and 26% on big-endian machines. Manual optimization of the block digest algorithm was not performed after analysis indicated the effective yield would be low (see Section 3).
Cache management
Preliminary test runs indicated that the reference implementation of MD5 exhibited cache effects. This code included a testmode, in which multiple passes over a single data block emulated the body of a long message. This data block was initialized with data prior to the execution of the digest algorithm. The combination of data initialization and block repetition over-emphasizes the effects of caching on some architectures, especially because the test-block size of 1,000 bytes is within the on-chip cache size of all architectures tested. This source code was modified to reduce over-emphasis of cache effects. Provisions for double-buffering were added. By alternating the data blocks used during block repetition, the effects of caching were avoided. Command-line configuration of the block size permitted configuration for blocks larger than the cache (with double-buffering, this eliminates cache effects altogether), blocks that fit in the external cache but not the internal cache (without double-buffering), and blocks that fit in the internal cache 1. Personal conununicauon, R. Atkinson, November 1994 (without double-buffering).
Command-line configuration of the number of block repetitions were used such that the resulting data blocks emulated a 2 M byte stream.
It is not clear whether internal-cache, external-cache, or noncached measures are more appropriate measure of MD5 performance. On hosts using DMA data transfer, non-cached performance is appropriate.
On hosts using programmed I/0, the computation of MD5 might be overlapped with the transfer of 1P packets to the network interface, so internal-cache performance is more appropriate. There are also rumored proposed architectures that support DMA directly into the processor's external cache.
Byte-swapping optimizations
The MD5 algorithm uses little-endian groupings of the byte stream, so implementations on native architectures can avoid the byte ordering and copying routine. Avoiding the reordering and copying resulted in code that ran in 2/3 the original time, as expected. The specification of little-endian byte order is opposite that of "network standard byte order", which is big-endian [24] .
On big-endian architectures, the reordering was optimized by replacing it with a more efficiently compiled source code. The reference code used the following byte-swapping code, which is machine independent (i.e., correct on both big-and little-endian machines), but inefficient:
This code compiles to the following pseudo-assembly, yielding 4 loads, 6 internal operations, and one store: The following code was found to run faster: tern@ <<= 8;
The second code compiles to 8 internal operations on machines with no rotate or swap opcodes, and uses a single 32-bit load (see Table 2 ). This sequence has more internal operations (8, vs. 6 before), but is much more efficient in its interaction with memory (1 word-load, vs. 4 byte-loads before). This code runs 25% faster than the original time. On big-endian machines, this was responsible for an overall speedup of approximately 9% (25Y. * 33% of the code is a reduction of 8. 3V0 in the overall speedup, or a speedup of 1/0.917 =9% faster).
On the HP PA-RISC architectures, this source compiled to onl 7 6 internal instructions because it has a 32-bit rotate instruction The following codetooklload, l store, and5intemal operations, because this machine has a 32-bit rotate (see Table 2 ). As a result, the number of internal operations further decreased from 6 to 5, an additional l%improvement inoverall optimization. This code was not used on other machines because there it would generate a (longer) sequence of 9 internal instructions.
The following This optimization replaces4 loads and 6 compute operations with 1 load and 8 compute operations. The result is a trade of 3 loads for 2 computes. Even on current RISC architectures, loads take multiple clocks, due to memory access delays (Figure 3 ). Loads are typically scheduled and queued, such that subsequent operations on registers pending loads will stall until the load completes. This trade-off, between computation and memory access, can be used to develop a faster hash algorithm. (Section 6.2).
The potential parallelism differs as well. The reference code swap has higher compute parallelism, but no load parallelism. The optimization allows 2-way compute parallelism. It also allows pipelined loads, because the Ioad time is small compared to the compute time (the height of the icons in Figure 3 is approximately representative of execution time).
On the architectures examined, a 25% speedup was measured. Even given the complex interaction of register scheduling and load queues, the increase in compute time is more than accounted by the reduction in load time.
Loop-unrolling
Loop-unrolling is provided by many current compilers. In the case of the MD5 reference code, more efficient loop-unrolling was possible by converting array indices into directly incremented pointers. Manual loop unrolling was required for the minimal byte-swapping optimizations described in Section 2.1.2. Analysis of themanual optimizations in Section 2.1 indicated that analytical bounds were sufficient predictors of performance limitations (e.g., swap analysis). Manual optimization of the main digest algorithm was postponed to determine the analytical performance speedup potential. Themain block digest algorithm differs from the byte-swap algorithm, in that byte-swapping should be a very efficient operation on many machines. Analysis of the main digest algorithm is more complex.
3.1: Analysis of MD5 costs
The MD5 algorithm costs are proportional to the cost of a basic step. For each word of input, 4 basic steps are executed. These basic steps have little opportunity for pipelining or parallelization. Thus by analyzing the costs of a basic step, the overall performance limit can be determined. The basic step can be mapped as a data flow diagram (Figure 4) . Thecritical path in this diagram is indicated by the black lines; the other paths are in grey. In this diagram, dependencies on previous steps (as indicated with a star '*') have been delayed (pushed as low as possible).
The time to process a basic step depends on the time to process each step, as well as the amount of parallelization possible. As described before (Section 2.1.2), the cost of a rotate operation can range from 1-3 opcodes on a RISC machine; CISC machines exhibit similar variability in the number of clock cycles required to execute this instruction.
The cost of the logical operations also vary, depending on whether the processor has a separate orcombined AND-NOT and OR-NOT instructions. Table 4 
Serial cost
The serial cost to execute the basic step in Section 1.2.1 M the sum of the cost of executing its opcodes. There are 4 additions, 1 rotate, and the cost of the logical operation. On a NOT only machine, the logical functions F, G,H,I cost 4,5,2, and 3, averaging to 3.5. On an AND-NOT machine, F,G,H. and I cost 3,3,2, and 2, averaging 2.5. The total is 7.5 + rotate on a NOT only machine, and 6.5 + rotate on an AND-NOT machine. As noted before (Section 2. 1.2), rotates cost either 1 or 3 opcodes. The resulting cost for serial operations is indicated in Table 3 . The cost is presented as an average of the costs for each of the different logical functions (i.e., 4 times the average is the cost of executing one of each step).
Parallel cost
Critical height denotes the number of operations between the highest appearing data dependency and the output (Figure 4 ). In the case where exphclt NOTS are required (Figure 5 ), the respective critical heights for F,G,H,I are 3,2,1, and 2, for an average of 2. In the case where AND-NOT is available, the critical heights are 2,2,1, and 2, and the average is 1. In most cases, there is a parallelism potential of 2-3 integer operations. This is probably the reason the Sun SPARC 20/71, with a Super-SPARC2 2-way integer parallel CPU, achieves more of a speedup compared to a 20/61 than CPU speed differences alone account (Table 1) . If CPU speed alone were the difference. we would expect the Sun 20/71 to operate at 88 Mhz, rather than its real value of 75 Mhz.
On most machines there were 1-4 opcodes of potentially unnecessary overhead. Subsequent examination indicated that the overhead was required, either for explicitly loading constants (1-2 opcodes on the Sun, Intel, HP, and Dec Alpha), or for zeroing out high-ends of 64-bit registers (2 opcodes on the Dec Alpha). This leaves little opportunity for manual optimization.
There are 4 basic steps executed for each word of input and 10-12 opcodes per basic step. so there area total of approximately 40-50 opcodes required per word of input (Table 4) . This is inordinately high, and will limit the bandwidth of processors to a maximum of l/50th their MIPS rate. 
4: Hardware Analysis
The software anal ysis indicated a performance bound that is insufficient to keep pace with IPv4 implementati ens. A hardware analysis was performed to determine the speed and size of a potential hardware implementation.
The parallel implementation was used for the hardware analysis, based on the basic step dataflow diagram (Figure 4) . The hardware design is a clocked dataflow implementation of the abstract dataflow diagram, including the critical path indicated in black (Figure 7) . The 3-input logical functions F, G,H,I were replaced with a single logical block. The four adders remain.
For VLSI CMOS implementation, the one fast adder has a single-cycle time of 3.2 ns, and requires 3.2nsprecharging [14] . This adder would run at approximately 300 Mhz. and require two clocks per add (i.e., supporting a 150 Mhz processor). The r-otare uses with a zero-cost wiring permutation in the best case, and logical circuit in the worst. CMOS latch setup time can be as low as 2 ns. As a result, we can design a clocked CMOS circuit with a 5.2 ns -3.2 ns clocks (dictated by the adder) and 2 ns latch setups.
The critical path through the basic step is 6 clocks long, i.e., 31.2 ns for the critical path of each basic step (Figure 8 ). Given 4 basic steps per word of input, there are 124.8 ns per word of input, or 256 Mbps.
A multi-chip CMOS implementation would require 15 ns per 32-bit addition and 8 ns per logical function or rotate (including off-chip driver delays). The resulting speed of the critical path (two adds, one logical, one rotate) would be 46 ns per step, or 184 ns per word of input. The resulting system would require approximately 9 chips (2 adders, 4 PLD logical units, 1 shifter, 1 RAM. 1 register file) speed would be 175 Mbps.
Both implementations assume a parallelization of 2 logical operations and 2 32-bit additions, as well as near-zero times for register update (due to write-through to the next stage). This design requires a parallelism of 2 adders, 4 logical units, and 1 shifter (Figure 7) . The chip requires the following: The functional units are obvious from the dataflow diagrams. The ROM represents the addition and rotation constants for each step. The RAM comprises two buffers of data blocks. such that one can be used for computation while the other is being loaded. The registers store three sets of the hash -one being currently updated, one from the previous block (for block chaining), and one accessible externally during the current hash. This amount of storage and function is feasible in existing custom CMOS and possibly in GaAs, but probably would not be feasible in ECL.
As with the software. this is not sufficient to keep pace with existing hardware for 1P, capable of speeds in excess of 1 Gbps.
B CD
iii"' 1 
MD5 -57 Mbps
The result is that current alternatives are slower than MD5 in software. MD2 and SHA are message digest algorithms; DES is an encryption algorithm.
5.1: Expected hardware acceleration
DES has a hardware implementation that runs at 1 Gbps in GaAs [19] . This factor of 50x improvement m hardware is not a representative speedup for arbitrary algorithms. MD5 in particular is difficult to accelerate in hardware.
DES can be accelerated by a factor of 50x in hardware, because its basic operations (bit select and bit-logical) are particularly slow in software (3-4 opcodes each), and trivially fast in hardware (implemented with a wire). The basic operations of MD5 are 32-bit additions. The fundamental clock rate of a CPU in a technology is largely limited by the speed at which these additions can occur. There is no advantage to building custom hardware: little further speedup is possible. As a result, MD5 can be accelerated by a factor of 4x in hardware.
Arbitrary algorithms can be accelerated by a factor of 10x when moved from software to hardware (this is an estimate).
6: Suggested
Courses of Action
The MD5 algorithm, whether in hardware or software. cannot keep pace with existing IPv4 implementations.
As a result, it violates the requirements of IPv6, of "first doing no harm" in reducing processing costs compared to IPv4. Although most mechanisms using M D5 propose a per-message digest, parallelization ]s prohibitive because a software implementation is desired. In addition, streamed MD5 throughput will always be fundamentally limited.
There are several alternatives to be considered, the best of which are: MD5 has performance limitations and design choices that affect its use in IPv6. First and foremost, the choice of little-endian byte-order is a detriment to its use in the Intemet, where networkstandard byte order is big-endian. Redefining MD5++ to use bigendian order will increase its implementation speed by 33~0 on native architectures. This argument is based on the assumption that 1P processing should uniformly occur in big-endian order, because much of it already is. This modification will have no effect on the security analysis of the algorithm.
Another modification to MD5 would permit parallelization, both in custom hardware and software on integer-parallel architectures. There are several proposals to modify MD5 to permit finite parallelism over a single stream. One solution replaces chaining with per-block seeds [8] . Each seed is computed as a hash of the offset of the block in the stream, thus retaining the block-order dependency of block chaining. It does not retain the property that the digest of a block is dependent on the contents of all prior blocks. Another proposes to replace a linear block-chain (Figure 9 ) with a finite number of block chains (Figure 10) . A predetermined finite number of chains are processed from independent seeds, such that the I-th block is part of the "1 mod K'-th chain. The resulting sequence of K digests forms another message, which can be MD5-encoded using a single-block algorithm 1. This supports finite parallelism to provide adequate bandwidth at current processing rates, without providing arbitrary power for spoofing. Further analysis is required to deterrmne its authentication properties. This proposal increases the performance of MD5 over continuous streams of blocks. IPv6 uses per-header authentication digests. so that a simple per-packet parallelization would suffice. This assumes that packet digests are not chained for other reasons. Neither parallel solution increases the performance of MD5 on a single processor. The IPv6 performance criterion requires implementations to operate as fast as IPv4, presumably on a single processor. MD5 does not, partly due to the complexity of the algorithm, and partly due to its processing the entire body of a packet (rather than only the header).
These two proposals together, of big-endian byte order and 4-way parallelism, comprise a modified MD5 that we call "MD5++". These techniques have been used elsewhere to develop hash algorithms with as little as -10 opcodes per 32-bit word of input. down from the 40-50 that MD5 uses [29] . Performance measurements for that algorithm were not available at the time of publication, but that estimate uses self-modifying code, which may affect its performance on processors with write-through instruction caches.
In RC5, a parameterized number of phases of data-dependent rotates are used to compute a symmetric block-cipher (comparable to DES) [26] . For a 32-bit quantity to have the resulting hash depend on a 32-bit data-dependent rotate would require 7 rounds of 5-bit hashes. Data-dependent rotates are costly in software, requiring 2 additional opcodes per rotate (mask constant, shift data to right of register), not including the fact that rotates are not single opcodes on many RISC processors. As a result, this algorithm may not be competitive with MD5 in performance, although analysis was not possible prior to this publication.
RC5 has the feature of being parameterized, such that the levels of security and performance can be adjusted by changing the number of rounds, word size, and key size. This is a useful feature for any hash or encryption algorithm.
An alternate hash algorithm can be developed by using some of these recognitions, together with a few trade-offs. Note that hashing does not have the advantage of off-line precomputation of onetime pads that suffices for encryption.
By design, a hash must access every word of input with a high level of "strength", to provide authentication over the entire data stream in a uniform fashion.
The alternate hash algorithm (called "AHA" for brevity only) should provide cross-bit scrambling like 23-bit addition or datadependent rotates, but more efficiently.
A lookup- AHA grabs the data in 16-bit units, XOR'S that with the current state, looks the result up in a 64 K table of 32-bit words, and XORS the result back into the state. By keeping 8 words of state and weaving the state variable accesses, an integer parallelism of 8 is achieved.
The AHAalgorithm isthusbasedon FSM hashes. The table lookup depends on the current state and the input, thus making the algorithm input-order sensitive, although allowing finite parallelism of the input data (because there are 8 half-words of state). The table is small enough to fit in a local cache. The use of lookups replaces that of rotates or additions, to perform thorough scrambling of the data. The table can be either constant or sessiondependent, as required to ensure sufficient authentication without unnecessary overhead.
This algorithm was measured on a Sun SPARC 20/71, and uses 13 opcodes per 32-bit word of input on a SPARC, and 10 opcodes per 32-bit word of input on a 68040. The algorithm achieved 115 Mbps without caching, and 121 Mbps with caching on the Sun 20/ 71, compared with 57-59 Mbps for MD5. Analysis of the implementation may yield up to another 50'% increase in performance. The current performance is sufficient to support existing 1P rates on that host. Further performance analysis and cryptographic strength analysis is underway.
AHA is believed to be useful as a minimal hash, because it is derived from the principles of performance first. Even if its cryptographic strength is low, it may prove useful for the authentication of high bandwidth traffic, where the sheer volume of data obviates the need for the same strength as off-line data would require.
7: Conclusions
MD5 cannot be implemented in existing technology at rates in excess of 100 Mbps, and cannot be implemented in special-purpose CMOS hardware feasibly at rates in excess of 175 Mbps. MD5 cannot be used to support 1P authentication in existing networks at existing rates. Akhough MD5 will support higher bandwidth in the future due to technological advances, these will be offset by similar advances in protocol processing. The MD5 mandate in IPv6 should be reconsidered.
At a minimum, the IPv6 specification should recognize the performance limitations of MD5. The use of MD5 in high-performance environments should be recommended against. Tbe modified MD5++ or alternate hash algorithms, as well as other hash algorithms, should be considered before a default standard is specified.
The source code for the optimized version of MD5 presented here is available at <ftp://ftp.isi. edu/pub/hocc-papers/touch/md5-opt.tar.Z>. Current information on tbe status of this work is available at <http://www.isi.
edu/div7/atomic2/md5 .html>. This document was prepared with the assistance and feedback of Steve Kent at BBN, Burt Kaliski. Victor Chang, and Steve Burnett at RSA, and Ran Atkinson at the NRL. Mike Carlton of USC/ 1S1 assisted with the byte-swapping code, cache interaction, and performance measurement analysis. The alternate hash algorithm (AHA) was developed in conjunction with Amir Herzberg, Hugo Krawczyk, and Moti Yung of IBM.
