Abstract. In this paper we propose an architecture design methodology to optimize the throughput of MD4-based hash algorithms. The proposed methodology includes an iteration bound analysis of hash algorithms, which is the theoretical delay limit, and DFG (Data Flow Graph) transformations to achieve the iteration bound. We applied the methodology to some MD4-based hash algorithms such as SHA1, MD5 and RIPEMD-160. Since SHA1 is the algorithm which requires all the techniques we show, we also synthesized the transformed SHA1 algorithm in a 0.18µm CMOS technology in order to verify its correctness and its achievement of high throughput. To the best of our knowledge, the proposed SHA1 architecture is the first to achieve the theoretical throughput optimum beating all previously published results. Though we demonstrate a limited number of examples, this design methodology can be applied to any other MD4-based hash algorithm.
Introduction
Hash algorithms not only have cryptographic key features such as the one-way function property and collision resistance but also have operational key features such as high speed and fixed output sizes independent of the input sizes. Due to these key features, hash algorithms are among the most popular primitive components in cryptographic systems. In particular, they are commonly used in Digital Signature Algorithms [1] and message authentication. Designing a given hash algorithm to be throughput optimum is a critical issue. Considering that data sizes and communication speeds are increasing every year, a low throughput of hash algorithms can be a bottleneck in digital and/or communication systems. Even though many architectures have been proposed, an analysis of the theoretical limits had not been presented before our previous work [17, 18]. This lack of theoretical analysis may lead designers to make futile attempts to improve their architecture when it is already optimal.
In this paper, we propose a systematic design methodology for throughput optimum architectures for MD4-based hash algorithms, which include the most commonly used hash algorithms such as SHA1, MD5 and RIPEMD-160. The proposed design methodology is composed of three steps. The first step is to construct a DFG (Data Flow Graph) of a given algorithm. In this process, we select a DFG which has the minimum iteration bound. The second step is to analyze the iteration bound, which defines the minimum delay in which an algorithm may run, independent of the implementation architecture. The last step is to construct an architecture achieving the iteration bound by applying transformations to the DFG. The retiming transformation and the unfolding transformation are used. The iteration bound analysis of a DFG and the transformations are publicly known techniques; a detailed explanation can be found in [2], and we adopt most of the notations and definitions from this reference. In this paper, however, we develop and formalize the concepts specifically for the design of MD4-based hash algorithms.
Moreover, we apply the design methodology to the popular MD4-based hash algorithms, SHA1, MD5 and RIPEMD-160, in order to thoroughly investigate the methodology usage. According to [3] , "customized hash functions are those which are specifically designed "from scratch" for the explicit purpose of hashing, with optimized performance in mind, and without being constrained to reusing existing system components such as block ciphers or modular arithmetic. Those having received the greatest attention in practice are based on the MD4 hash function." In this paper we focus on this class of MD4 type hash functions. More specifically we use SHA1, MD5 and RIPEMD-160 as examples as they are most used in practice.
According to our analysis and design, optimal architectures for MD5 and RIPEMD-160 require only the retiming transformation, while SHA1 requires both the retiming transformation and the unfolding transformation. For this reason, we synthesized the proposed architecture of SHA1 to verify the achievement of high throughput and the correctness of the transformations, the latter by checking the hash outputs. To the best of our knowledge, the proposed architecture is the first throughput optimum architecture of SHA1. The synthesis result shows a throughput of 3,738 Mbps, the highest throughput among all previously published papers.
The remainder of the paper is organized as follows. In Section 2, we review related works for high throughput architectures of hash algorithms and in Section 3, we present our design methodology for throughput optimum architectures of MD4-based hash algorithms. The proposed design methodology is applied to SHA1 in Section 4 and is applied to MD5 and RIPEMD-160 in Section 5. We show the synthesis results of the transformed SHA1 to verify our proposal in Section 6 and we conclude this paper in Section 7.
Related Works
Several techniques increase the throughput of MD4-based hash algorithm implementations. The most commonly used are pipelining, loop unrolling and the use of Carry Save Adders (CSA). Pipelining techniques reduce critical path delays by properly positioning registers [4–6]; unrolling techniques improve throughput by performing several iterations in a single cycle [7–10]; CSA techniques reduce the delay of two or more consecutive additions [4–7, 10]. Many of the published papers combine multiple techniques to achieve a higher throughput. SHA1 is implemented in [7, 10–14], SHA2 in [4, 5, 8–10, 13], MD5 in [12–16] and RIPEMD-160 in [12, 13, 15].
Despite numerous proposals for high throughput hash implementations, delay bound analysis had been neglected except in our previous work [17, 18], and architecture designs are mostly done by intuition. In [4], the authors present a design that achieves the iteration bound, though they do not claim optimality. In fact, their design is the last revision of several other suboptimal attempts [5]. In one of our previous works [17], the iteration bound of SHA1 is analyzed and, in order to approach the iteration bound, the unfolding transformation is applied. However, that design does not achieve the actual iteration bound and requires too much hardware duplication. This is caused by neglecting the retiming transformation and concentrating only on increasing the unfolding factor. In addition, the CSA technique is not considered in that architecture. In another of our previous works [18], the iteration bound analysis of SHA2 is performed and its throughput optimum architecture is presented. In that work, due to the structure of SHA2, the unfolding transformation was not necessary to achieve its iteration bound.
In this paper, we propose a systematic design methodology for MD4-based hash algorithms, starting from a DFG (Data Flow Graph) and analyzing the iteration bound to design the optimum architecture. Since SHA1 requires both the retiming transformation and the unfolding transformation, applying the design methodology to SHA1 is a good example for a general MD4-based hash algorithm. The synthesis result of SHA1 confirms the validity of the design methodology.
Design Methodology for Throughput Optimization
Our proposed design methodology adopts the iteration bound analysis and the transformations of a DFG (Data Flow Graph), whose detailed concepts and explanation can be found in [2]. We develop and formalize the technique to fit MD4-based hash algorithms. MD4-based hash algorithms are iterative algorithms. This means that the output of one iteration is the input for the next. Iterations in general restrict the usage of pipelining or other straightforward techniques to improve the throughput. In order to analyze the iteration bound, we represent a given algorithm with a DFG. To understand the methodology, we explain the design steps using a simple example.

$$\ldots \qquad C(n+1) = B(n) \tag{1}$$

We denote the propagation delays of the given operations + and * as Prop(+) and Prop(*) respectively. The binary operators + and * can be arbitrary operators, but we assume Prop(+) < Prop(*) in this example. The iteration bound analysis starts with the assumption that any functional operation is atomic. This means that a functional operation cannot be split or merged into other functional operations.
The Iteration Bound Analysis
A loop is defined as a path that begins and ends at the same node. The iteration bound is defined as follows:

$$T_\infty = \max_{l \in L} \left\{ \frac{t_l}{w_l} \right\} \tag{2}$$

where L is the set of all possible loops, $t_l$ is the total functional delay of loop l and $w_l$ is the number of algorithmic delays in loop l. The iteration bound creates a link between the algorithmic delays and the functional delay. It is the theoretical limit of a DFG's delay bound. Therefore, it defines the maximally attainable throughput. Please note that every loop needs to have at least one algorithmic delay; otherwise the system is not causal and cannot be executed.
Since the loop marked with the bold line has the maximum loop delay, assuming that Prop(+) < Prop(*), the iteration bound $T_\infty$ is as follows:

$$T_\infty = \frac{Prop(+) + Prop(*)}{2} \tag{3}$$
This means that the critical path delay in this DFG cannot be less than this iteration bound. The critical path delay is defined as the maximum calculation delay between any two consecutive algorithmic delays, i.e. D's. In our example (Fig. 1), the critical path delay is 2 × Prop(+) + Prop(*), which is larger than the iteration bound. The maximum clock frequency (and thus the throughput) is determined by the critical path (the slowest path). The iteration bound is a theoretical lower bound on the critical path delay of an algorithm. We use the retiming and unfolding transformations to reach this lower bound.
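The definition of the iteration bound can be sketched in a few lines of Python. This is only an illustration: the function name, the integer delay values and the second loop in the usage example are assumptions and do not come from Fig. 1; only the loop with total delay Prop(+) + Prop(*) and two algorithmic delays is taken from the text.

```python
from fractions import Fraction


def iteration_bound(loops):
    """Return the maximum of (loop delay / number of algorithmic delays).

    `loops` is an iterable of (total_functional_delay, num_algorithmic_delays)
    pairs with integer entries, one pair per loop of the DFG.
    """
    bounds = []
    for delay, num_delays in loops:
        if num_delays < 1:
            # A loop without an algorithmic delay is not causal.
            raise ValueError("every loop needs at least one algorithmic delay")
        bounds.append(Fraction(delay, num_delays))
    return max(bounds)


# Hypothetical unit delays chosen so that Prop(+) < Prop(*).
PROP_ADD = 2
PROP_MUL = 5

example_loops = [
    (PROP_ADD + PROP_MUL, 2),  # the bold loop discussed in the text
    (PROP_ADD, 1),             # an assumed second loop, for illustration only
]

if __name__ == "__main__":
    print("iteration bound:", iteration_bound(example_loops))
```

Using exact fractions keeps the bound comparable with the normalized critical path delay obtained after unfolding.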
The Retiming Transformation
The minimum critical path delay that can possibly be achieved using the retiming transformation is shown in Eq. 4.

$$\left\lceil T_\infty \right\rceil = \left\lceil \frac{Prop(+) + Prop(*)}{2} \right\rceil = Prop(*) \tag{4}$$

Assuming that a functional node cannot be split into multiple parts, ⌈·⌉ is the maximum part when Prop(+) + Prop(*) is distributed as evenly as possible into N parts, where N is the number of algorithmic delays in the loop. In our example N = 2, which is the denominator of the iteration bound. Since the total delay Prop(+) + Prop(*) can be partitioned into one part Prop(+) and another part Prop(*), the delay bound attainable by the retiming transformation is Prop(*).
The retiming transformation modifies a DFG by moving algorithmic delays, i.e. D's, through the functional nodes in the graph. Delays of out-going edges can be replaced with delays from in-coming edges and vice versa. In the retimed DFG (Fig. 2), the critical path delay is reduced to Prop(*), which is the same as Eq. 4. However, the iteration bound still has not been met.
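The "maximum part" in the retiming bound can be illustrated with a small brute-force search over a single loop. The sketch below is an assumption-laden simplification: it considers one loop in isolation, places at most one algorithmic delay on each edge, and uses hypothetical integer delays; `retiming_bound` is not a name from the paper.

```python
from itertools import combinations


def retiming_bound(node_delays, num_delays):
    """Smallest achievable maximum segment delay on a single loop.

    `node_delays` lists the atomic functional delays in loop order.
    Edge i connects node i to node (i + 1) % k; exactly `num_delays`
    distinct edges carry one algorithmic delay each.
    """
    k = len(node_delays)
    if not 1 <= num_delays <= k:
        raise ValueError("need between 1 and len(node_delays) delays")
    best = None
    for cuts in combinations(range(k), num_delays):
        worst = 0
        for j, start in enumerate(cuts):
            end = cuts[(j + 1) % num_delays]
            # Nodes strictly after edge `start` up to and including node `end`.
            length = (end - start) % k or k
            segment = sum(node_delays[(start + 1 + i) % k] for i in range(length))
            worst = max(worst, segment)
        best = worst if best is None else min(best, worst)
    return best


if __name__ == "__main__":
    prop_add, prop_mul = 2, 5  # hypothetical delays with Prop(+) < Prop(*)
    print(retiming_bound([prop_add, prop_mul], 2))
```

With two atomic nodes and two delays, the only possible split is one node per segment, so the result is the larger of the two delays, matching the Prop(*) bound in the text.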
The Unfolding Transformation
The unfolding transformation improves performance by calculating several iterations in a single cycle.
The minimally required unfolding factor is the denominator of the iteration bound. This fact can be inferred by noting that the difference between Eq. 3 and Eq. 4 is caused by the un-cancelled denominator of the iteration bound. In our example, the required unfolding factor is two.
We expand Eq. 1 with the unfolding transformation by representing two iterations at a time, resulting in Eq. 5. While optimal architectures can be obtained by applying the unfolding transformation on the retimed DFG (Fig. 2) , doing so involves complex reconstruction of the operational equation. Therefore, we decided to perform unfolding before retiming, which allows us to obtain optimality through a simpler path. The unfolding transformation applied on the original operational equation Eq. 1 produces Eq. 5.
Note that now A(n + 2), B(n + 2) and C(n + 2) are expressed as a function of A(n), B(n) and C(n).
By introducing a temporary variable Tmp, Eq. 5 can be simplified into Eq. 6.
Fig. 3. Unfolding and Retiming Transformation
By doubling the number of functional nodes, we are able to unfold the DFG by a factor of two (Fig. 3(a)). Boxes A, B and C now give the outputs of every second iteration. By applying the retiming transformation to the unfolded DFG, the resulting critical path becomes the path in bold between the two bolded D's, which is D → * → + → D (Fig. 3(b)). Therefore, the critical path delay is Prop(+) + Prop(*). Due to the unfolding factor of two, the normalized critical path delay, $\hat{T}$, can be calculated by dividing the critical path delay by two, as shown in Eq. 7.

$$\hat{T} = \frac{Prop(+) + Prop(*)}{2} = T_\infty \tag{7}$$

This final transformation results in an architecture that achieves the iteration bound of the example DFG (Fig. 1).
Now the only remaining step is the implementation of the resulting DFG. Note that some of the square nodes are no longer paired with an algorithmic delay and that the output indexes have changed, as can be seen in Fig. 3(b). How these issues are dealt with during implementation is explained in Section 6, where we synthesize the SHA1 algorithm.
Assumption and Restriction of the Design Methodology
The assumptions and restrictions of our design methodology are directly inherited from the properties of the iteration bound analysis. The absolute functional delay is not considered, and the actual iteration delays will vary depending on the implementations of the functional operations. However, note that the relative delays between functional operations must be considered. As mentioned before, the reason is that we want to distribute the total functional delay as evenly as possible. Moreover, we assume that functional operations cannot be merged or split into other functional operations. If we want to use some other functional operation instead of the given operations, e.g. a CSA and an addition instead of two consecutive additions, then a new DFG must be derived and analyzed.
Iteration Bound Analysis and Transformations of the SHA1 Algorithm
In this section, we analyze the iteration bound of SHA1 and present the DFG which achieves the iteration bound.
The SHA1 Algorithm
SHA1 [20], which is one of the most popular hash algorithms, was issued by the National Institute of Standards and Technology (NIST) in 1995. SHA1 takes input data of length less than $2^{64}$ bits and gives a 160-bit hash output. Each hash operation uses a constant, $K_t$, and a nonlinear operation, $F_t$, where t represents the t-th hash operation. The architecture of one step of the hash operation is illustrated in Figure 4 and the mathematical expression is described in Eq. 8.
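For readers who prefer code to a block diagram, one step of the SHA1 hash operation can be sketched in Python as follows. The step function follows the published SHA1 standard; the helper names (`rotl`, `f`, `step`, `sha1_one_block`) are ours, and the single-block wrapper exists only so the step can be checked against a reference implementation.

```python
import hashlib
import struct

MASK = 0xFFFFFFFF
K = (0x5A827999, 0x6ED9EBA1, 0x8F1BBCDC, 0xCA62C1D6)
H0 = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0)


def rotl(x, n):
    """Rotate a 32-bit word left by n bits (the S5 and S30 nodes)."""
    return ((x << n) | (x >> (32 - n))) & MASK


def f(t, b, c, d):
    """Nonlinear operation F_t; it changes every 20 steps."""
    if t < 20:
        return (b & c) | (~b & d & MASK)
    if t < 40 or t >= 60:
        return b ^ c ^ d
    return (b & c) | (b & d) | (c & d)


def step(state, t, w_t):
    """One hash operation: registers at time t to registers at time t + 1."""
    a, b, c, d, e = state
    a_next = (rotl(a, 5) + f(t, b, c, d) + e + w_t + K[t // 20]) & MASK
    return (a_next, a, rotl(b, 30), c, d)


def sha1_one_block(msg):
    """SHA1 of a message short enough (<= 55 bytes) to fit one 512-bit block."""
    assert len(msg) <= 55
    block = msg + b"\x80" + b"\x00" * (55 - len(msg)) + struct.pack(">Q", 8 * len(msg))
    w = list(struct.unpack(">16I", block))
    for t in range(16, 80):
        w.append(rotl(w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16], 1))
    state = H0
    for t in range(80):
        state = step(state, t, w[t])
    return struct.pack(">5I", *((x + y) & MASK for x, y in zip(H0, state)))


if __name__ == "__main__":
    print(sha1_one_block(b"abc") == hashlib.sha1(b"abc").digest())
```

The loop from A back to A through the rotation and the additions is what limits the throughput; the remaining registers are only shifted along.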
Note that in iteration bound analysis, we assume that each functional node in a DFG cannot be merged or split into other operations. Therefore, in order to use a CSA (Carry Save Adder), we have to draw another DFG. A CSA produces two values (carry and sum) from three input operands. Since a CSA has a small area and propagation delay, it is commonly used in throughput optimized implementations when three or more operands are summed. In Fig. 5, (a) and (b) represent the DFGs of each case, where the shaded nodes represent the loop with the maximum loop bound in each DFG.
We assume that $Prop(F_t) \approx Prop(CSA) < Prop(+)$, which is reasonable since the worst case of $F_t$ is the three-input bitwise exclusive OR operation, which is the same as the critical path of a CSA, and a 32-bit addition clearly has a longer critical path than either. The iteration bounds of the two DFGs are given in Eq. 9, where the superscripts (a) and (b) refer to the corresponding DFG of Fig. 5. Since shifts are negligible in hardware implementations, we ignore the delays of shifts.
Since $T^{(a)}_\infty < T^{(b)}_\infty$, the use of CSA alone does not help to reduce the iteration bound in this case.
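The carry-save addition used in the second DFG can be sketched in Python as below. This is a bit-level model of a generic 32-bit CSA, not the authors' hardware description; the function name `csa` is ours.

```python
MASK = 0xFFFFFFFF


def csa(x, y, z):
    """32-bit carry-save adder: three operands in, (sum, carry) out.

    Each output bit depends only on the three input bits in the same
    column, so the delay does not grow with the word width.
    """
    partial_sum = x ^ y ^ z
    carry = (((x & y) | (x & z) | (y & z)) << 1) & MASK
    return partial_sum, carry


def add3(x, y, z):
    """Three-operand addition modulo 2**32 using one CSA and one adder."""
    partial_sum, carry = csa(x, y, z)
    return (partial_sum + carry) & MASK


if __name__ == "__main__":
    print(hex(add3(0x67452301, 0xEFCDAB89, 0x98BADCFE)))
```

Only the final `+` in `add3` corresponds to a carry-propagating adder, which is why replacing two consecutive adders by a CSA and an adder shortens the path.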
Applying the Retiming Transformation
The minimum critical path delay attainable with the retiming transformation alone (without the unfolding transformation) is given by Eq. 10.
Therefore, even though applying the CSA technique increases the iteration bound, the critical path delays of the two cases are similar if there is no unfolding transformation. The DFGs which achieve these critical path delays are shown in Fig. 6. With a ripple carry adder (which is very slow), the required gates are the same as for a CSA. However, fast adders such as carry look-ahead adders require a significantly larger number of logic gates. Therefore, the use of a CSA rather than an adder in a throughput optimum architecture saves gate area.
Due to the retiming transformation, some of the square nodes are no longer paired with algorithmic delays. Therefore, care must be used to properly initialize the registers and extract the final result: this is explained in Section 6.
Applying the Unfolding Transformation
In order to achieve the iteration bound, the unfolding transformation is required. By expanding Eq. 8, we get Eq. 11. Since the register values of time index t + 2 can be expressed using registers only with the time index t, the unfolded DFG with unfolding factor 2 can be derived using Eq. 11. Note that the functional nodes are doubled due to the unfolding transformation.
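The expansion to time index t + 2 can be checked with a short Python sketch. The single step below follows the SHA1 standard; `step2` is our own rendering of the unfolded update, written so that every output depends only on the registers at time t. All function names are assumptions for illustration.

```python
import random

MASK = 0xFFFFFFFF
K = (0x5A827999, 0x6ED9EBA1, 0x8F1BBCDC, 0xCA62C1D6)


def rotl(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK


def f(t, b, c, d):
    """Nonlinear operation F_t of SHA1."""
    if t < 20:
        return (b & c) | (~b & d & MASK)
    if t < 40 or t >= 60:
        return b ^ c ^ d
    return (b & c) | (b & d) | (c & d)


def step(state, t, w_t):
    """One hash operation (time t to time t + 1)."""
    a, b, c, d, e = state
    a_next = (rotl(a, 5) + f(t, b, c, d) + e + w_t + K[t // 20]) & MASK
    return (a_next, a, rotl(b, 30), c, d)


def step2(state, t, w_t, w_t1):
    """Two hash operations at once (time t to time t + 2), unfolding factor 2."""
    a, b, c, d, e = state
    a1 = (rotl(a, 5) + f(t, b, c, d) + e + w_t + K[t // 20]) & MASK
    # F_{t+1} sees B_{t+1} = A_t, C_{t+1} = S30(B_t), D_{t+1} = C_t; E_{t+1} = D_t.
    a2 = (rotl(a1, 5) + f(t + 1, a, rotl(b, 30), c) + d + w_t1 + K[(t + 1) // 20]) & MASK
    return (a2, a1, rotl(a, 30), rotl(b, 30), c)


if __name__ == "__main__":
    rng = random.Random(1)
    state = tuple(rng.getrandbits(32) for _ in range(5))
    w0, w1 = rng.getrandbits(32), rng.getrandbits(32)
    print(step2(state, 0, w0, w1) == step(step(state, 0, w0), 1, w1))
```

The intermediate value `a1` appears twice in `step2`: once as the new B register and once inside the computation of the new A register, which is where the duplicated functional nodes come from.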
Since the iteration bound of (a) is smaller than (b) in Fig. 5 , we perform the unfolding transformation on Fig. 5 (a) , which results in Fig. 7 . Usually the iteration bound can be achieved with the unfolding factor of the denominator of the iteration bound. In this case, note that in the indexes of F , K and W , t is replaced by 2t due to the unfolding factor of two.
After the unfolding transformation, we can substitute two consecutive adders with one CSA and one adder. We have shown in the previous sub-section that using a CSA does not reduce the iteration bound and therefore does not improve the throughput. However, since a CSA occupies less area than a throughput optimized adder, we substitute adders with CSAs wherever possible. Fig. 8 shows the resulting DFG. Since 3 × Prop(CSA) < Prop(+), the loop having the maximum loop delay is the loop marked with bold lines in Fig. 8, and the normalized critical path delay, $\hat{T}$, is now equal to the iteration bound as shown in Eq. 12.

$$\hat{T} = \frac{2 \times Prop(+) + Prop(F_t)}{2} = T^{(a)}_\infty \tag{12}$$
Finally, after performing the proper retiming transformations, we get the result of Fig. 9. The critical path is the path of shaded nodes in Fig. 9 (i.e. S5 → + → A → $F_{2t+1}$ → +) and the critical path delay is 2 × Prop(+) + Prop($F_t$), which achieves the iteration bound. Note that there is no propagation delay on S5 and A.
Some Other MD4-based Hash Algorithms
The proposed design methodology can be applied to any other MD4-based hash algorithm. In this section we apply the methodology to MD5 and RIPEMD-160.
Iteration Bound and DFG Transformation of MD5
MD5 [21] was developed by Ron Rivest in 1992 to improve MD4. MD5 produces a 128-bit output and is composed of 4 rounds, each of which is composed of 16 hash operations. The equation and the DFG of MD5 are shown in Eq. 13 and Fig. 10(a) respectively, where $S_t$ represents the shift function whose number of shift positions depends on the time index t. $X_t$ is a selection of expanded message words and $T_t$ is a constant whose value depends on t.
Since the loop ($B \rightarrow F_t \rightarrow + \rightarrow S_t \rightarrow + \xrightarrow{D} B$) of the shaded nodes in Fig. 10(a) has the maximum loop bound, the iteration bound is as shown in Eq. 14. Note that a CSA cannot be used in this loop since the two adders are separated by a shift function. Though the use of a CSA does not reduce the iteration bound, i.e. gives no benefit in throughput, we can use a CSA in $A_t + X_t + T_t$ to save some area.
This iteration bound can be achieved by the retiming transformations shown in Fig. 10(b), where the critical path is marked by a bold line ($B \rightarrow F_t \rightarrow + \rightarrow S_t \rightarrow +$). The iteration bound of MD5 is double that of SHA1. Thus, when implemented in a similar technology, we expect one hash operation of MD5 to run at half the speed of one hash operation of SHA1.
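The factor of two can be checked with a short calculation. The expressions below are our own reading of the surrounding text, under the assumption that shift delays are neglected for MD5 as they are for SHA1: the MD5 loop contains $F_t$, two adders and a single algorithmic delay, while the SHA1 bound is the one achieved in Section 4.

```latex
\begin{align*}
T_\infty^{\mathrm{MD5}}  &= \frac{Prop(F_t) + 2 \times Prop(+)}{1}, \\
T_\infty^{\mathrm{SHA1}} &= \frac{2 \times Prop(+) + Prop(F_t)}{2}, \\
\frac{T_\infty^{\mathrm{MD5}}}{T_\infty^{\mathrm{SHA1}}} &= 2 .
\end{align*}
```

The two loops have the same total functional delay under this assumption; the difference comes entirely from the number of algorithmic delays in the loop.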
Iteration Bound and DFG Transformation of RIPEMD-160
RIPEMD-160 [22] was designed by Hans Dobbertin et al. in 1996. RIPEMD-160 is composed of two parallel iterations, where each iteration is composed of 5 rounds, and each round is composed of 16 hash operations. Since the two iterations have the same structure, we need to analyze only one part and then replicate the results for the second part. The equation and DFG of RIPEMD-160 are shown in Eq. 15 and Fig. 11(a) respectively. $S_t$ is a shift function, $X_t$ is a selection of expanded message words and $K_t$ is a constant whose value depends on the time index t.
The loop having the maximum loop bound, i.e. $B \rightarrow F_t \rightarrow + \rightarrow S_t \rightarrow + \xrightarrow{D} B$, is shown in Fig. 11(a) using shaded nodes and its iteration bound is shown in Eq. 16.
The retiming transformation of RIPEMD-160 which achieves the iteration bound is shown in Fig. 11(b), where the critical path is marked by a bold line ($B \rightarrow F_t \rightarrow + \rightarrow S_t \rightarrow +$). The iteration bound of RIPEMD-160 is the same as that of MD5 and double that of SHA1.
Synthesis of the SHA1 Algorithm
In order to verify our design methodology, we synthesized SHA1 for an ASIC using a TSMC 0.18µm CMOS standard cell library. We verified that the actual critical path occurs as predicted by our analysis and that the hash outputs are correct. We synthesized two versions: one after only the retiming transformation and the other after both the unfolding and retiming transformations. Since the unfolding transformation introduces duplications of functional nodes, its use often incurs a significant increase in area. For the version using only the retiming transformation, we select the DFG of Fig. 6(b). Though they have the same iteration bound, Fig. 6(b) has less area than Fig. 6(a) due to the use of CSA. Another benefit of Fig. 6(b) is a smaller number of overhead cycles than Fig. 6(a), which will be explained in this section. For the version with the unfolding and retiming transformations, we synthesized the DFG of Fig. 9.

Another difference occurs during initialization. In the original DFG (Fig. 5(b)), all the registers are initialized in the first cycle according to the SHA1 algorithm. In contrast, initialization requires two cycles in the retimed DFG (Fig. 6(b)). This is because one more cycle is needed to propagate the initial values of B, C, D and E into T before the DFG flow starts. The procedure to update the registers is shown in Fig. 12, where all the subscripts are omitted for simplicity.
In the first cycle, the values of A, B, C, D and E are initialized as specified by the SHA1 algorithm. In the second cycle, A holds its initial value and C, D, E and T are updated using the previous values of B, C, D and E. From the third cycle on, the registers are updated according to the DFG (Fig. 6(b)). Note that the register B does not overlap with the register T in their use and hence T can be reused to hold $B_{init}$ in the first cycle. Due to the two cycles of initialization, the retimed DFG introduces one overhead cycle. This fact can also be observed by noting that there are two algorithmic delays from E to A. In order to update A with a valid value at the beginning of the iteration, two cycles are required for the propagation. In the case of the retimed SHA1 without CSA (Fig. 6(a)), there are three overhead cycles due to the four algorithmic delays in the path from E to A.
Therefore, the required number of cycles for Fig. 6(b) is the number of iterations plus two cycles for initialization, which results in 82 cycles. Since the finalization of SHA1 can be overlapped with the initialization of the next message block, one cycle is excluded from the total number of cycles.
When extracting the final results at the end of the iterations, we should note the indexes of the registers. In Fig. 6(b), the index of the output extraction of the register A, i.e. $A_t$, is one less than the others. Therefore, the final result of the register A is available one cycle later than the others.
SHA1 with the Unfolding and Retiming Transformation
In the case of Fig. 9, there are 6 algorithmic delays and two of them are not paired with a square node. We name the register for the algorithmic delay between the two adders T1 and the register for the algorithmic delay between an adder and S5 T2. However, since T2 is equivalent to B, we do not need a separate register for T2. Therefore, the total number of required registers remains 5 (retiming does not introduce extra registers in this case). The register updates of this case are described in Fig. 13.
Note that there is no overlapped use of A and T1 and hence T1 can be used to hold $A_{init}$ in the first cycle. Since there is only one algorithmic delay in every path between any two consecutive square nodes, there is no cycle overhead, resulting in a total of 41 cycles, i.e. 40 cycles for iterations plus one cycle for initialization. When extracting the final result of A, the value must be derived from +(S5(B), T1). This calculation can be combined with the finalization since the combined computational delay of +(S5(B), T1), whose delay is Prop(+), and the finalization, whose delay is Prop(+), is 2 × Prop(+), which is less than the critical path delay.

Fig. 13. Register Update Procedure of retimed and unfolded SHA1
Synthesis Results and Comparison
The synthesis results are compared with some previously reported results in Table 1 . The 82 cycle version is made by the retiming transformation ( Fig. 6(b) ), and the 41 cycle version is made by the retiming and the unfolding transformations together (Fig. 9) . The throughputs are calculated using the following Eq. 17.
$$\text{Throughput} = \frac{\text{Frequency}}{\text{\# of Cycles}} \times (512\ \text{bits}) \tag{17}$$
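The cycle counts and the throughput formula can be reproduced with a few lines of Python. The function names are ours, and the clock frequency printed at the end is not a figure quoted in the text: it is merely the value implied by the reported 3,738 Mbps and 41 cycles when the formula is inverted.

```python
SHA1_ITERATIONS = 80   # hash operations per 512-bit block
BLOCK_BITS = 512


def cycles_per_block(iterations, unfolding_factor, init_cycles):
    """Clock cycles needed to process one message block."""
    return iterations // unfolding_factor + init_cycles


def throughput_mbps(frequency_mhz, cycles, block_bits=BLOCK_BITS):
    """Throughput formula of the text: frequency / cycles * block size."""
    return frequency_mhz / cycles * block_bits


def implied_frequency_mhz(mbps, cycles, block_bits=BLOCK_BITS):
    """Invert the throughput formula to recover the clock frequency."""
    return mbps * cycles / block_bits


if __name__ == "__main__":
    retimed = cycles_per_block(SHA1_ITERATIONS, 1, 2)    # retiming only
    unfolded = cycles_per_block(SHA1_ITERATIONS, 2, 1)   # unfolding and retiming
    print("cycles:", retimed, unfolded)
    print("implied clock (MHz):", implied_frequency_mhz(3738, unfolded))
```

Halving the cycle count only doubles the throughput if the clock frequency is unchanged; in practice the unfolded design has a longer critical path per cycle, so the gain depends on how close each version comes to its normalized critical path delay.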
In Table 1, the work of [12] is a unified solution for MD5, SHA1 and RIPEMD-160, so its gate count is quite large. [13] used a 0.13µm technology but is much slower than our proposal, which used a 0.18µm technology. [17] has a small cycle number and a large gate area due to the unfolding transformation with a large unfolding factor of 8. Even with the use of a large unfolding factor, its throughput is still smaller than that of our proposal. Comparing our 41 cycle version with [19], the areas and the frequencies are similar but our version has only half the cycles, resulting in twice the throughput.
Though the 41 cycle version of our proposals achieves the iteration bound with throughput optimum architecture, the 82 cycle version can be a good tradeoff since its gate area is much smaller with slightly less throughput.
Our designs were described at the register transfer level and we have concentrated on optimizing at micro-architecture level rather than focusing on lower-level optimizations. In other words, we let standard synthesis tools generate our adders in standard CMOS logic and use generic architectures.
While higher throughput can be achieved with faster adder architectures and/or faster logic families, we show that iteration bound analysis still determines the optimum high level architecture of an algorithm.
We validate these claims by producing designs that have the highest throughputs among all published results.
Conclusion
We propose a design methodology for throughput optimum architectures of MD4-based hash algorithms. The strength of this methodology is that the upper limit on the throughput can be obtained mathematically and the optimum architecture can be designed systematically. Though straightforward, the transformations sometimes produce architectures which necessitate careful implementation. These issues are discussed, and design examples using some popular hash algorithms such as SHA1, MD5 and RIPEMD-160 illustrate how they are resolved.
In order to verify the design methodology, we synthesized the SHA1 algorithm, which among our examples is the algorithm requiring all the procedures we present. According to our synthesis results in a 0.18µm CMOS technology, the maximum achievable throughput is 3,738 Mbps, which is much faster than previously reported implementations. The optimized architecture for SHA1 that we have developed shows the effectiveness of our design methodology.
Our proposed design methodology can also be used by hash algorithm designers who wish to use high throughput as a design criterion.
