ABSTRACT
INTRODUCTION
Data integrity assurance and data origin authentication are essential security services in financial transactions, electronic commerce, electronic mail, software distribution, data storage and so on. The broadest definition of authentication within computing systems encompasses identity verification, message origin authentication and message content authentication. In IPSEC, the technique of cryptographic hash functions is utilized to achieve these security services.
Hash Functions
Hash functions compress a string of arbitrary length to a string of fixed length. They provide a unique relationship between the input and the hash value and hence replace the authenticity of a large amount of information (message) by the authenticity of a much smaller hash value (authenticator) [1]. In recent years there has been increased interest in developing a Message Authentication Code (MAC) derived from a hash code. One reason is that cryptographic hash functions such as MD5 and SHA-1 generally execute faster in software than symmetric block ciphers such as DES. In addition, software for hash functions is widely available, and there are no export restrictions from the United States or other countries on cryptographic hash functions. Hence, there are many applications of MD5, SHA-1 and other hash functions to generate MACs. The method chosen to implement the MAC for IP security is the hash-based MAC, or HMAC, which uses an existing hash function in conjunction with a secret key. The HMAC algorithm is specified for an arbitrary FIPS-approved cryptographic hash function. With minor modification, HMAC can easily replace one hash function with another [2].
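As an illustration of how HMAC wraps an existing hash function, the following minimal Python sketch builds the MAC from the standard construction H((K xor opad) || H((K xor ipad) || message)) on top of the `hashlib` module. The function name `hmac_manual`, the key and the message are hypothetical and are not taken from the implementation described in this paper.

```python
import hashlib
import hmac


def hmac_manual(key: bytes, message: bytes, hash_name: str = "md5") -> bytes:
    """Compute HMAC as H((K ^ opad) || H((K ^ ipad) || message))."""
    block_size = hashlib.new(hash_name).block_size  # 64 bytes for MD5 and SHA-1
    if len(key) > block_size:
        key = hashlib.new(hash_name, key).digest()  # long keys are hashed first
    key = key.ljust(block_size, b"\x00")            # then zero-padded to one block
    ipad = bytes(b ^ 0x36 for b in key)
    opad = bytes(b ^ 0x5C for b in key)
    inner = hashlib.new(hash_name, ipad + message).digest()
    return hashlib.new(hash_name, opad + inner).digest()


key = b"example-key"          # hypothetical key, for illustration only
message = b"example message"  # hypothetical message
tag_md5 = hmac_manual(key, message, "md5")    # 16-byte tag
tag_sha1 = hmac_manual(key, message, "sha1")  # 20-byte tag
```

Only the `hash_name` argument changes between the two calls, which is the sense in which one hash function can be swapped for another.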
Message Digest 5 (MD5) Algorithm

MD5 [3] is a message digest algorithm developed by Ron Rivest at MIT. It is essentially a more secure version of his previous algorithm, MD4, which is slightly faster than MD5. MD5 has been the most widely used secure hash algorithm, particularly in Internet-standard message authentication. The algorithm takes as input a message of arbitrary length and produces as output a 128-bit message digest of the input. It is mainly intended for digital signature applications where a large file must be compressed in a secure manner before being encrypted with a private (secret) key under a public key cryptosystem.
Assume we have an arbitrarily large message as input and that we wish to find its message digest. The processing involves the following steps.
(1) Padding The message is padded so that its length in bits is congruent to 448 modulo 512, that is, so that its length plus 64 is divisible by 512. Padding is always performed, even if the length of the message is already congruent to 448 modulo 512. Padding consists of a single 1-bit followed by the necessary number of 0-bits. Figure 1 and its logic is given in Figure 2. The four rounds have similar structure but each uses different auxiliary functions F, G, H and I.
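The padding rule can be sketched in a few lines of Python. The function name `md5_pad` is hypothetical. The 64-bit length field appended at the end is taken from the MD5 specification (RFC 1321) rather than from the text above; it is what fills the 64 bits reserved by the padding step.

```python
def md5_pad(message: bytes) -> bytes:
    """Pad a byte string to a multiple of 512 bits as MD5 requires."""
    bit_length = (8 * len(message)) % (1 << 64)    # length is taken modulo 2^64
    padded = message + b"\x80"                     # a single 1-bit, then seven 0-bits
    padded += b"\x00" * ((56 - len(padded)) % 64)  # 0-bits until length = 448 mod 512
    padded += bit_length.to_bytes(8, "little")     # 64-bit length, low-order byte first
    return padded


# A 56-byte (448-bit) message is still padded, so it needs two 512-bit blocks.
two_blocks = md5_pad(b"a" * 56)
```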
Compression function HMD5
The output of the fourth round is added to the input of the first round (CV_q) to produce CV_{q+1}.
(5) Output After all L 512-bit blocks have been processed, the output from the L-th stage is the 128-bit message digest. Figure 3 shows the operations involved in a single step.
The additions are modulo 2^32. Four different circular shift amounts (s) are used in each round, and they differ from round to round. Each step is of the following form:
$$A \leftarrow D, \qquad B \leftarrow B + \big((A + \mathrm{Func}(B, C, D) + X[k] + T[i]) \lll s\big), \qquad C \leftarrow B, \qquad D \leftarrow C$$
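The step operation and the final addition of the chaining variable can be checked with a short software model. The sketch below is only a reference model in Python and not the hardware design described in this paper. The constants T[i], the shift amounts, the message-word order and the initial vector follow RFC 1321, and the names `step`, `compress` and `digest_hex` are hypothetical.

```python
import math

MASK = 0xFFFFFFFF  # all additions are modulo 2^32
# T[i]: integer part of 2^32 * |sin(i + 1)|, as defined in RFC 1321.
T = [int((1 << 32) * abs(math.sin(i + 1))) & MASK for i in range(64)]
# Four shift amounts per round, each pattern repeated four times within its round.
S = [7, 12, 17, 22] * 4 + [5, 9, 14, 20] * 4 + [4, 11, 16, 23] * 4 + [6, 10, 15, 21] * 4
IV = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476)


def rotl(x, s):
    """Circular left shift of a 32-bit word by s bits."""
    return ((x << s) | (x >> (32 - s))) & MASK


def func(i, b, c, d):
    """Auxiliary function value and message-word index k for step i (0..63)."""
    if i < 16:
        return (b & c) | (~b & d), i                 # F
    if i < 32:
        return (d & b) | (~d & c), (5 * i + 1) % 16  # G
    if i < 48:
        return b ^ c ^ d, (3 * i + 5) % 16           # H
    return c ^ (b | (~d & MASK)), (7 * i) % 16       # I


def step(state, i, x):
    """One MD5 step: A <- D, B <- B + ((A + Func + X[k] + T[i]) <<< s), C <- B, D <- C."""
    a, b, c, d = state
    f, k = func(i, b, c, d)
    new_b = (b + rotl((a + f + x[k] + T[i]) & MASK, S[i])) & MASK
    return (d, new_b, b, c)


def compress(cv, block):
    """Process one 64-byte block: 64 steps, then add the input chaining variable."""
    x = [int.from_bytes(block[4 * j:4 * j + 4], "little") for j in range(16)]
    state = cv
    for i in range(64):
        state = step(state, i, x)
    return tuple((p + q) & MASK for p, q in zip(cv, state))


def digest_hex(cv):
    """Serialize the four 32-bit words, low-order byte first, as a hex string."""
    return b"".join(w.to_bytes(4, "little") for w in cv).hex()


# The empty message pads to a single block: one 1-bit followed by 0-bits.
empty_block = b"\x80" + b"\x00" * 63
empty_digest = digest_hex(compress(IV, empty_block))
```

Because each call to `step` needs the B produced by the previous call, the 64 steps form a strictly sequential chain, and `compress` in turn needs the chaining variable of the preceding block.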
MD5 IMPLEMENTATION
FPGA Implementation
Re-configurable devices such as FPGAs are a highly attractive option for hardware implementations as they provide the flexibility of dynamic system evolution as well as the ability to easily implement a broad range of algorithms.
Most hash functions are targeted at software implementations. [6]. The top-level design was described in VHDL, and the available Xilinx Core Generator modules were used wherever applicable. Xilinx Alliance 3.1i and Foundation 3.1i tools were used for synthesis and implementation. VSS and Foundation EDIF simulators were used for functional and timing simulations.
The MD5 algorithm is a block-chained hashing algorithm: the hash for a block depends on both the block data and the hash of its preceding block. As a result, blocks cannot be hashed in parallel. Each step consists of four additions, three component logical operations, two table lookups and one rotation. The tree of operations can be optimized by performing early those operations that involve items not dependent on the previous step. According to Figure 3, the item that depends on the previous step is word B, and hence the result of the logical operation has a considerable delay. The optimized tree of operations (assuming each operation takes one unit of time) is given in Figure 4. According to this, each step can be shortened by one time unit.

The following architectural options were investigated and implemented:
• iterative looping (Iterate-MD5)
• full loop unrolling (Fullun-MD5)

Both architectures were implemented at behavioral level in VHDL, synthesized and functionally simulated. After the functionality was verified, each design went through the translation, mapping, placing and routing (PAR), timing and configuration stages of the flow engine. The PAR implementations were then re-simulated with back-annotated timing, using the same test vectors as in the functional simulation, to verify that the implementation of the design was successful. In both designs, it is assumed that the first two steps of the algorithm have already been performed and that the input of message blocks can be controlled according to the state machine states.
Iterative Looping Architecture
By implementing a generic step of the MD5 algorithm, a looping architecture with 64 iterations would seem to provide the most area-efficient solution. The block diagram of the iterative design is shown in Figure 5.
Figure 5. Block diagram of MD5 iterative design (Iterate-MD5).
A few additional multiplexers and a barrel shifter are needed to select the round function and to perform the variable shifting in each round. The state machine has 68 states, including three states required for initializing and loading the very first block to the core. In subsequent block operations the state machine uses 65 states. The main feature of this design is the loading of message blocks in parallel with computation: the two RAMs are used to load the next block while the present block is being used in computation, which eliminates the loading time. The 512-bit message block is loaded to the core using a 32-bit bus.

The "Reset-state" signal initiates the state machine and the counters. Then, with the "Start" signal, the function starts. The initial vectors are loaded in parallel to the input register and to a buffer. The initial vectors as well as the chaining variables are kept in this buffer until the 64th step, when they are added to the last result to form the chaining variable for the next block. Initially the first block is loaded to XRAM1 using the addresses given by the X-in counter. After that, the state machine starts to provide addresses for reading XRAM1. Using the first 16 addresses provided by the state machine, the next block is written to XRAM2. After the 64th step, XRAM2 is read. During the first 16 steps of processing the second block, the third block is written to XRAM1. This reading and writing of the RAMs alternates every 64 clock cycles. Subsequent blocks use the previous chaining variable as their initial values. The generic step is shown in Figure 6.
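The alternating use of the two RAMs can be summarized with a small Python model. This is only a bookkeeping sketch of the schedule described above, not a description of the VHDL. The function names are hypothetical, and the cycle estimate assumes one clock cycle per state, which the text does not state explicitly.

```python
def ram_schedule(num_blocks):
    """For each block (1-based): the RAM it is read from, and where the next block is written."""
    schedule = []
    for block in range(1, num_blocks + 1):
        read_ram = "XRAM1" if block % 2 == 1 else "XRAM2"
        other_ram = "XRAM2" if block % 2 == 1 else "XRAM1"
        if block < num_blocks:
            # The next block is written to the idle RAM during the first 16 steps.
            schedule.append((block, read_ram, block + 1, other_ram))
        else:
            schedule.append((block, read_ram, None, None))  # nothing left to load
    return schedule


def total_cycles(num_blocks):
    """Assumed cycle count: 68 states for the first block, 65 for each later block."""
    return 68 + 65 * (num_blocks - 1)


plan = ram_schedule(3)
```

For a three-block message, `plan` shows block 1 read from XRAM1 while block 2 is written to XRAM2, then block 2 read from XRAM2 while block 3 is written to XRAM1.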
The full loop unrolled architecture has a 64-step combinational logic core, as shown in Figure 7. In this architecture all the elements of each step are implemented as combinational logic. The barrel shifter has been removed by directly wiring the appropriately shifted bits in each step. The use of double buffering (XX and YY) eliminates the loading time from the critical timing path: the next block is loaded during the computation of the present block. The IV ROM provides the initialization vector for the first step. The "load-done" signal makes the initialization vector available for the first block and the chaining variables available for the subsequent blocks. During computation of the digest for a block, the next block is stored in buffer YY; after the computation the "YY2XX" signal goes high, and XX obtains the new input for the next computation. The block diagram of the complete design is given in Figure 8. In addition to the core, the other main components are the state machine, which has four states, the X-in counter, used for loading the blocks to the core, and the Wait counter, used to count the number of cycles for the combinational logic delay of the computation.
As in the iterative design, the "Reset-State" signal initiates the state machine and the X-in counter. The initialization vectors are taken into the register CV-Reg. With the "Start" signal, the initial block is loaded to buffer YY; immediately afterwards the "YY2XX" signal loads it to buffer XX, and the computation begins. During computation, the next block is loaded to buffer YY. When all the blocks in the message have been processed, the "en2" signal makes the digest available at the output of the register Digest-Reg.
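The reason a barrel shifter is unnecessary in the unrolled core is that each step's shift amount is a fixed constant, so the rotation reduces to a fixed permutation of wires. The Python sketch below illustrates this equivalence in software; the function names are hypothetical and the code is not derived from the VHDL source.

```python
def rotl32(x, s):
    """Arithmetic form of a 32-bit circular left shift."""
    return ((x << s) | (x >> (32 - s))) & 0xFFFFFFFF


def wiring(s):
    """Fixed wiring for a rotation by s: output bit j is driven by input bit (j - s) mod 32."""
    return [(j - s) % 32 for j in range(32)]


def rotate_by_wiring(x, s):
    """Rotate by routing each input bit to its output position; no shifter logic is involved."""
    bits_in = [(x >> j) & 1 for j in range(32)]
    source = wiring(s)
    return sum(bits_in[source[j]] << j for j in range(32))


# The four first-round shift amounts of MD5 (RFC 1321), each a constant per step.
checks = [rotate_by_wiring(0x12345678, s) == rotl32(0x12345678, s) for s in (7, 12, 17, 22)]
```

In the iterative design the same hardware step must serve all 64 shift amounts, which is why a barrel shifter is needed there.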
PERFORMANCE EVALUATION
Both designs were synthesized, placed and routed on the Virtex V1000FG680-6 target device, which supports clock rates of up to 200 MHz.
In the case of the iterative design, the utilization of the external IOBs was 161 out of 512 (31%) and the block RAM usage was 2 out of 32 (6%). The number of slices used for this architecture was low: 880 out of 12288 (7%), of which the barrel shifter used 288 (2%). The utilization of three-state buffers (TBUFs) was 4%. According to the timing simulation, the maximum frequency of the design was 21 MHz. Hence, the expected throughput is (512 × 21 MHz)/65 ≈ 165 Mbps.
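The throughput figure follows directly from the block size, the clock frequency and the number of cycles per block. A minimal Python sketch of this arithmetic is given below; the function name `throughput_mbps` is hypothetical, and 65 cycles per block is the steady-state state count reported for the iterative design.

```python
def throughput_mbps(clock_mhz, cycles_per_block, block_bits=512):
    """Throughput in Mbps when one block of block_bits is finished every cycles_per_block cycles."""
    return block_bits * clock_mhz / cycles_per_block


# Iterative design: 21 MHz maximum frequency, 65 cycles per 512-bit block.
iterative = throughput_mbps(clock_mhz=21, cycles_per_block=65)
```

The same expression shows why clock frequency and cycles per block are the two parameters that a different architecture or a faster device would change.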
For the full-loop-unrolled design, the utilization of slices was 4763 out of 12288 (38%). The device utilization of the iterative design is small, so the unused resources can be used to implement several cores in the same device and thereby process several messages in parallel. This would be an attractive feature for a cryptographic accelerator. Although the utilization of the full-loop-unrolled design was fairly high, two such designs could be fitted into a single FPGA device, so two messages could be processed in parallel. In addition, for both architectures the complete HMAC could be implemented by adding the other necessary HMAC components in the unused resources of the FPGA.
The results obtained can be further improved by using the latest FPGA devices, such as the Virtex-II family. The Virtex devices provide better performance than the previous generation of FPGAs, achieving synchronous system clock rates of 200 MHz [6]. The latest devices, however, can provide clock speeds of more than 400 MHz as well as more resources. Further, by using timing constraints it is likely that the delays in the critical paths can be reduced.
The results are summarized in Table 1. According to the performance measurements of software implementations given in [7], software throughput has been no more than about 100 Mbps; a DEC Alpha (190 MHz), for example, gave a throughput of 87-100 Mbps.
CONCLUSION
The significance of a hardware implementation of the MD5 algorithm has been examined. Two architectures have been studied for both area utilization and speed, with FPGAs as the target device. Both architectures fit easily into a single device. Although the inherent structure of MD5 does not allow the blocks of a message to be hashed in parallel, hardware implementations can achieve a throughput sufficient for some of the currently available IP bandwidths.
