Abstract
Introduction
Recent attacks on the most popular hash algorithms, such as MD4, MD5, SHA-0 and SHA-1 [13, 11, 14, 12], call for the use of other hash algorithms. While the second generation of the SHA family is still young and there are only a few security evaluations of it, a solution can be found in the well-known RIPEMD-160 hash algorithm. This algorithm was designed by Dobbertin, Bosselaers and Preneel [1] in 1996 and it is still resistant to the attacks that work against the rest of the MD family [7].
At the same time, the need for high-throughput implementations of crypto primitives is becoming essential in almost every networking application. Message authentication, digital signature schemes and key derivation are just some of the examples where hash algorithms are used as a building block. Due to their recursive mode of operation, an efficient implementation of hash algorithms has always been a challenge.
In this paper we propose two new high-throughput architectures for the FPGA implementation of the RIPEMD-160 hash algorithm and compare them with previous work. Using the Xilinx Virtex2Pro FPGA board we achieve throughputs of 3.122 Gbps and 624 Mbps with and without pipelining, respectively.
The remainder of this paper is structured as follows. In Sect. 2 we give some background information about the RIPEMD-160 hash algorithm. Section 3 describes the theoretically throughput-optimal architecture at the micro-architecture level. In Sects. 4 and 5 we show further optimization at the gate level. Implementation results and a comparison with previous work are given in Sect. 6. Section 7 concludes the paper and gives some guidelines for future work.
RIPEMD-160 Algorithm
RIPEMD-160, shown in Alg. 1, is a hash algorithm that takes an input of arbitrary length (less than 2^64 bits) and produces a 160-bit output after performing five independent rounds. Each round is composed of 16 iterations, resulting in 80 iterations in total. RIPEMD-160 operates on 512-bit message blocks, which are composed of sixteen 32-bit words. The compression function consists of two parallel datapaths, as shown in Fig. 1. F_i and F'_i are non-linear functions and K_i and K'_i are fixed constants. Temporary variables A, B, C, D and E for the left datapath, and A', B', C', D' and E' for the right datapath, are initialized with 
Optimization in Micro-Architecture Level
The MD family hash algorithms can be considered as an example of digital signal processing (DSP) systems. Block diagrams are most frequently used to graphically represent a DSP system. A data flow graph (DFG) is an example of a block diagram, where the nodes represent computations (or functions) and the directed edges represent datapaths. Each edge has a nonnegative number of delays associated with it. These unit-delay elements (often called algorithmic delays) can also be treated as functional blocks, as they are implemented using registers [9]. As an example of a DFG we can look at Fig. 2, where the functional nodes are represented by light gray circles and the registers by black squares. One unit delay T_D is associated with each edge at the input of a register.
The iteration bound of the circuit is defined as
$$T_\infty = \max_{l \in L} \left\{ \frac{t_l}{w_l} \right\}$$
where t_l is the loop calculation time, w_l is the number of algorithmic delays (marked with T_D in Fig. 2) in the l-th loop, and L is the set of all possible loops [9]. A DFG of RIPEMD-160, shown in Fig. 2, is derived from Alg. 1 and contains five different loops. The iteration bound is determined by the loop B → F → ⊕ → rol(s) → ⊕ → B and is equal to
$$T_\infty = \mathrm{Delay}(F) + 2 \times \mathrm{Delay}(\oplus) + \mathrm{Delay}(\mathrm{rol})$$
Since DSP systems are mostly implemented using sequential circuits, the critical path is defined as the longest path between any two storage elements [9]. The critical path of the DFG in Fig. 2 is the path marked with bold lines. Its delay, 4 × Delay(⊕) + Delay(rol), is larger than the iteration bound. Therefore, to achieve a throughput-optimal design, we need to apply some transformations to the given DFG.
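The comparison between iteration bound and critical path can be sketched numerically. The following is a minimal illustration, not the authors' tooling: the component delays are hypothetical values in arbitrary time units, only the loop named in the text is modelled, and it is assumed to contain a single algorithmic delay (register B).

```python
from fractions import Fraction

# Hypothetical component delays in arbitrary time units (not taken from the paper).
DELAY = {"add": 3, "rol": 1, "F": 2}

def iteration_bound(loops):
    """Return the maximum over all loops of (loop computation time / number of delays)."""
    return max(
        Fraction(sum(DELAY[node] for node in nodes), delays)
        for nodes, delays in loops
    )

# Each loop is (functional nodes on the loop, number of unit delays w_l).
# Only the loop B -> F -> add -> rol -> add -> B is modelled; w_l = 1 is an assumption.
loops = [(["F", "add", "rol", "add"], 1)]

t_inf = iteration_bound(loops)
critical_path = 4 * DELAY["add"] + DELAY["rol"]

# With these assumed delays the critical path exceeds the iteration bound,
# so a transformation such as retiming can still shorten it.
print(t_inf, critical_path, critical_path > t_inf)
```

With other delay values the gap changes, but the critical path of an untransformed DFG can never be shorter than the iteration bound.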
In [5] the authors apply some DSP techniques to the SHA-2 family of hash algorithms, such as iteration bound analysis and the retiming and unfolding transformations. By applying these techniques, an architecture whose critical path is equal to the iteration bound can be derived. In this optimization, the functional operations used in a hash algorithm, e.g. non-linear functions and additions, are assumed to be atomic, i.e. a functional operation cannot be split or merged into other functional operations. In other words, the optimization is limited to the micro-architecture level.
Retiming [6] is a widely used technique in the design of DSP systems. It is a transformation technique that changes the locations of unit-delay elements in a circuit without affecting the input/output characteristics of the circuit [9].
By applying only the retiming transformation, we can obtain a DFG of RIPEMD-160 whose critical path delay is reduced to the iteration bound. Figure 3 shows the DFG after the retiming transformation; the critical path is again marked with bold lines. Now the critical path delay is equal to the iteration bound, which means that the DFG given in Fig. 3 represents a throughput-optimal architecture that achieves the theoretical upper bound at the micro-architecture level.
Due to the retiming transformation, two adders are now placed between registers E and A1. This causes A1 to be initialized with h_0 ⊕ X_0 ⊕ K_0 instead of only h_0. By setting the values of X_i and K_i to zero at the last iteration, A1 becomes equal to A.
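The effect of this retiming on the register contents can be checked with a small software model. The sketch below is an illustration under stated assumptions, not the authors' hardware description: it models a single datapath, uses one placeholder non-linear function for all iterations, and takes the message words, constants and shift amounts as arbitrary test inputs. The function names are hypothetical.

```python
import random

MASK = 0xFFFFFFFF

def rol(x, s):
    """Rotate a 32-bit word left by s bits (0 < s < 32)."""
    return ((x << s) | (x >> (32 - s))) & MASK

def f(b, c, d):
    # Placeholder non-linear function; the real algorithm uses five different ones.
    return b ^ c ^ d

def run_original(h, X, K, S):
    """Direct form: X_i and K_i are added inside the iteration that uses them."""
    A, B, C, D, E = h
    for x, k, s in zip(X, K, S):
        T = (rol((A + f(B, C, D) + x + k) & MASK, s) + E) & MASK
        A, B, C, D, E = E, T, B, rol(C, 10), D
    return A, B, C, D, E

def run_retimed(h, X, K, S):
    """Retimed form: A1 already holds A + X_i + K_i when iteration i starts."""
    A, B, C, D, E = h
    A1 = (A + X[0] + K[0]) & MASK              # initial value h_0 + X_0 + K_0
    n = len(X)
    for i in range(n):
        T = (rol((A1 + f(B, C, D)) & MASK, S[i]) + E) & MASK
        # After the last iteration X and K are forced to zero, so A1 ends up equal to A.
        nxt = (X[i + 1] + K[i + 1]) if i + 1 < n else 0
        A1 = (E + nxt) & MASK                  # the two adders between E and A1
        B, C, D, E = T, B, rol(C, 10), D
    return A1, B, C, D, E

rng = random.Random(0)
h = [rng.getrandbits(32) for _ in range(5)]
X = [rng.getrandbits(32) for _ in range(16)]
K = [rng.getrandbits(32) for _ in range(16)]
S = [rng.randint(5, 15) for _ in range(16)]
print(run_original(h, X, K, S) == run_retimed(h, X, K, S))
```

Both forms produce the same five words, which is the input/output equivalence that retiming is required to preserve.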
In the next section we show a gate-level optimization that merges a few functional nodes and results in an architecture with an even higher throughput.
Optimization in Gate Level
Observing Fig. 3, we notice that the set of functional nodes consists of four modular adders, one non-linear function and two cyclic shifts. As the critical path is in the loop, the only way of optimizing the DFG is to optimize the loop. Unfortunately, the variable cyclic shift is placed between two modular adders and prevents us from replacing them with a carry save adder (CSA). However, in this section we show how another approach can be used for further optimization of the loop. Let us consider the simple example shown in Fig. 4, which shows the part of the loop with three 32-bit inputs, A1, E and F, and one output, B. To simplify the discussion we omit the input s, which represents the number of bits in the cyclic shift. The functionality of this block is given in Fig. 5. After adding the two operands A1 and F, the cyclic shift is applied, and the operand E is finally added, resulting in the output B. We define carry_1 as the carry bit that may occur when adding the (32 − s) LSBs of A1 and F. This bit is propagated to the s MSBs of the sum. Another carry bit (carry_2) may occur in the result of the whole addition and is discarded before the rotation starts (due to the modular addition).
In order to optimize the given block, we rotate operands A1 and F before adding them together. In doing so we have to take care of the carry bits carry_1 and carry_2. The latter carry must not be propagated after the rotation of the two operands. To prevent this carry propagation we can subtract the vector ∆ from the sum of Rot(A1) and Rot(F), as shown in Fig. 6. The bit value δ is equal to 1 if carry_2 = 1, otherwise δ = 0. Besides carry_2, we also need to take care of the carry_1 bit. This carry must be added to the rotated operands Rot(A1) and Rot(F) (see Fig. 6). Depending on carry_1 and carry_2 we have four different possibilities. In order to reduce the critical path, we compute all four possibilities in parallel. The only drawback of this architecture is that the subtraction of ∆ and the addition of carry_1 are still executed within the loop, which does not decrease the critical path. As addition is an associative operation, we can subtract the vector ∆ from the operand E before entering the loop. This logic, which decreases the critical path, is shown in Fig. 7.
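The carry handling described above can be checked in software. The sketch below assumes that ∆ is the 32-bit vector with the single bit δ at position s, which is an interpretation of the text since Fig. 6 is not reproduced here. The function names are hypothetical.

```python
import random

MASK = 0xFFFFFFFF

def rol(x, s):
    """Rotate a 32-bit word left by s bits (0 < s < 32)."""
    return ((x << s) | (x >> (32 - s))) & MASK

def carries(a1, f, s):
    """Return (carry_1, carry_2) for the 32-bit addition a1 + f and shift amount s."""
    low_mask = (1 << (32 - s)) - 1
    carry_1 = ((a1 & low_mask) + (f & low_mask)) >> (32 - s)  # out of the (32 - s) LSBs
    carry_2 = (a1 + f) >> 32                                   # out of the whole word
    return carry_1, carry_2

def add_rot_add_reference(a1, f, e, s):
    """Original order: add, rotate, add."""
    return (rol((a1 + f) & MASK, s) + e) & MASK

def add_rot_add_rotated_first(a1, f, e, s):
    """Rotate the operands first, then correct for the two carries."""
    carry_1, carry_2 = carries(a1, f, s)
    delta = carry_2 << s              # assumed form of the vector Delta
    e_adjusted = (e - delta) & MASK   # Delta is subtracted from E outside the loop
    return (rol(a1, s) + rol(f, s) + carry_1 + e_adjusted) & MASK

rng = random.Random(0)
ok = all(
    add_rot_add_reference(a1, f, e, s) == add_rot_add_rotated_first(a1, f, e, s)
    for a1, f, e, s in (
        (rng.getrandbits(32), rng.getrandbits(32), rng.getrandbits(32), rng.randint(1, 31))
        for _ in range(1000)
    )
)
print(ok)
```

The check relies on the identity rol(x + y) = rol(x) + rol(y) + carry_1 − carry_2 · 2^s modulo 2^32, which follows from splitting each operand into its (32 − s) low bits and s high bits.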
Using similar design criteria, we add carry_1 after adding Rot(A1), Rot(F) and E. Instead of using one additional adder for this operation, we use a CSA and the fact that the carry output of a CSA is always shifted to the left by one position before the final addition. In this way we only need to set the LSB of the shifted carry output according to carry_1.
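The carry injection can be illustrated with a bit-level model of a 3:2 carry save adder. This is a minimal sketch with hypothetical function names, not the authors' circuit.

```python
MASK = 0xFFFFFFFF

def csa(a, b, c):
    """3:2 carry save adder: return (sum word, carry word shifted left by one)."""
    sum_word = a ^ b ^ c
    carry_word = (((a & b) | (a & c) | (b & c)) << 1) & MASK
    return sum_word, carry_word

def add3_with_carry_in(a, b, c, carry_1):
    """Compute (a + b + c + carry_1) mod 2^32 with one CSA and one final adder."""
    sum_word, carry_word = csa(a, b, c)
    # The shifted carry word always has LSB = 0, so carry_1 can be placed there
    # without spending an additional adder.
    return (sum_word + (carry_word | carry_1)) & MASK

print(add3_with_carry_in(1, 2, 3, 1))  # 7
```

Because the LSB of the shifted carry word is structurally zero, OR-ing in carry_1 is equivalent to adding it.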
The architecture describing this whole logic is shown in Fig. 8. To achieve high throughput we prepare all four possibilities in parallel. The values of carry_1 and carry_2 determine which input of the MUX is propagated to the output value B, as shown in Table 1. Note that input E1 is obtained as E1 = D − ∆, where ∆ is chosen such that δ = 1. Input E represents the case where δ = 0.
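The parallel preparation and final selection can be modelled as follows. This is a simplified sketch: the mapping of carry values to MUX inputs is an assumption (Table 1 is not reproduced here), ∆ is assumed to be 2^s when δ = 1, and E1 is derived from E in the same step rather than from D one cycle earlier as in the hardware.

```python
import random

MASK = 0xFFFFFFFF

def rol(x, s):
    """Rotate a 32-bit word left by s bits (0 < s < 32)."""
    return ((x << s) | (x >> (32 - s))) & MASK

def add_rot_block(a1, f, e, s):
    """Model of the ADD+ROT block: four speculative results, one selected by the carries."""
    ra, rf = rol(a1, s), rol(f, s)
    e1 = (e - (1 << s)) & MASK        # operand prepared for the case delta = 1
    # All four candidates are prepared regardless of the actual carry values.
    # Keys are (carry_1, carry_2); this ordering is an assumption.
    candidates = {
        (c1, c2): (ra + rf + (e1 if c2 else e) + c1) & MASK
        for c1 in (0, 1)
        for c2 in (0, 1)
    }
    low_mask = (1 << (32 - s)) - 1
    carry_1 = ((a1 & low_mask) + (f & low_mask)) >> (32 - s)
    carry_2 = (a1 + f) >> 32
    return candidates[(carry_1, carry_2)]   # the MUX

rng = random.Random(0)
a1, f, e, s = rng.getrandbits(32), rng.getrandbits(32), rng.getrandbits(32), 11
print(add_rot_block(a1, f, e, s) == (rol((a1 + f) & MASK, s) + e) & MASK)
```

In hardware the carries are resolved by a separate adder while the candidates are being formed, so only the MUX delay is added once the carries are known.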
Final High-Throughput Architecture
Following the design principles described in the previous section and using an additional retiming transformation, we can obtain the final high-throughput architecture for the RIPEMD-160 algorithm. The first step is to use the throughput-optimized part of the loop instead of the original one (see Fig. 8). A DFG that shows this architecture is given in Fig. 9. The optimized part of the loop is denoted as a black box (ADD+ROT).
Using the optimized part of the loop moves the critical path to the path between registers E and A1. This problem can easily be solved by using a CSA instead of two adders. Figure 10 shows this solution.
As the critical path now occurs between the output of E and the input of B (bold line), we need to introduce one more register, E1, and move the subtractor to its input, as shown in Fig. 11. In this way the critical path is placed within the loop again and the high-throughput architecture of RIPEMD-160 is finally obtained.
Implementation Results and Comparison With Previous Work
In order to verify the speed of our architectures we have implemented the proposed solutions using GEZEL, a design environment for exploration, simulation and implementation of domain specific architecture [3]. Both implementations were verified on a Xilinx Virtex2Pro FPGA board. Our results and a comparison with previous work are given in Table 2.
In [10] the authors propose a RIPEMD processor that performs both the RIPEMD-128 and RIPEMD-160 hash algorithms, and they separately implement a RIPEMD-160 processor for comparison purposes. In order to achieve a high throughput of 2.1 Gbps they use a pipelining technique. However, due to the recursive mode of operation of the RIPEMD-160 algorithm, pipelining is possible only when hashing independent messages. Our architecture with the optimized loop can easily be pipelined, and for this special case we could achieve a throughput of 3.122 Gbps. Pipelining is done by replicating the DFG of the RIPEMD-160 algorithm five times, once for each of the five different non-linear functions. However, as it comes at a high price in occupied area and is useful only in a limited number of applications, we do not provide more details about the pipelined implementation.
We can also notice that the architecture with the optimized ADD+ROT part is 10% faster than the version without it. On the other hand, its size is 65% larger due to the parallel processing shown in Fig. 8 and the use of one additional register (see Fig. 11).
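The relation between the single-message and pipelined throughput figures can be sketched with a simple model. It assumes one iteration per clock cycle, so that one 512-bit block takes 80 cycles, and it ignores any setup or finalization cycles. The 100 MHz clock used in the example is a hypothetical value, not a reported result.

```python
def throughput_bps(clock_hz, block_bits=512, cycles_per_block=80, parallel_messages=1):
    """Estimated throughput in bits per second under the stated assumptions."""
    return parallel_messages * block_bits * clock_hz / cycles_per_block

# Hypothetical 100 MHz clock, chosen only to illustrate the scaling.
single = throughput_bps(100e6)
pipelined = throughput_bps(100e6, parallel_messages=5)
print(single, pipelined, pipelined / single)
```

Under this model, five independent messages in flight give five times the single-message throughput, which matches the ratio of the reported 3.122 Gbps and 624 Mbps figures.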
Conclusion
We showed how the iteration bound analysis can be used for the high-throughput implementation of the RIPEMD-160 hash algorithm. Since the iteration bound is a theoretical minimum of the critical path, there is no further throughput optimization in the micro-architecture level. Thus, we further optimized our architecture in the gate level, achieving the final high-throughput implementation of the RIPEMD-160 algorithm. This approach can be a guideline for a high-throughput implementation of other popular hash algorithms.
Notes to Table 2:
(1) This is a unified architecture of MD5 and RIPEMD-160 hash algorithms.
(2) This is a unified architecture of MD5, RIPEMD-160, SHA-1 and SHA-256 hash algorithms.
(3) This is a unified architecture of MD5, RIPEMD-160 and SHA-1 hash algorithms.
(4) In the original paper throughput of 2.1 Gbps is shown for hashing 5 independent messages (pipelining). To make a fair comparison we consider throughput for a single message only.
(5) In the original paper the use of 2014 CLBs, 4006 FGs and 1600 DFFs is reported. One CLB in Virtex2 FPGA family contains 8 LUTs [15].
