The secure hash algorithm 512 (SHA-512), which is used to verify the integrity of a message, involves computational iterations on data. The large computation delay generated in these iterations limits the overall throughput of the system and makes the computation difficult to pipeline. We describe a way to pipeline the computation using fine-grained pipelining with balanced critical paths. In this method, one critical path is broken into two stages by using data forwarding. The other critical path is broken into three stages by using computation postponement. The resulting critical paths all have two adder-layers with some data movements, and thus are balanced. The method also allows register reduction. In addition, the similarity between SHA-384 and SHA-512 is exploited for a multi-mode design, which can generate a message digest for both versions with the same throughput and only a small increase in hardware size. Experimental results show that our implementation achieved not only the best area performance rate (throughput divided by area), but also a higher throughput than almost all related work.
Introduction
Secure hash algorithms (SHAs) are used to generate a unique message digest for an arbitrary message [1]. Successive improvements over the first SHA-1 have resulted in the latest version, SHA-512, which supports a longer message length, a longer word length, and a larger digest. As a result, SHA-384 and SHA-512 offer the greatest security for communications [1]. They contain iterations with internal data dependencies that generate a large computation delay, thus limiting the throughput of the system. As communication speeds increase, the throughput of the SHAs must increase as well. Hence, SHA implementations with low hardware cost and high throughput are required.
There are two main methods to increase the throughput by reducing the delay in the critical path: pipelining and loop unrolling. Lien [2] introduced techniques and implementations for SHA-1 and SHA-512, each in two versions: one using pipelining and the other loop unrolling. In the pipelined version of SHA-512, the main loop has two stages, with a maximum delay of three adder-layers. Loop unrolling architectures in [3] - [5] also tried to increase the throughput with various step numbers. However, unrolling increases the hardware size and decreases the frequency due to the extra computations in each step. Multi-mode designs are given by Glabb [8] and Helion [13], but their performance rates are limited by complexity and low frequency.
This paper describes a fine-grained pipelined multi-mode SHA-384/512 architecture on FPGA in which the computations of E and A in one iteration (see Fig. 2) at step t are pipelined into two and three stages, respectively. It works with both SHA-384 and SHA-512 at the same performance and with a small hardware increase. The architecture aims to achieve high throughput with small hardware size, that is, a high area performance rate, by balancing the critical paths in fine-grained pipeline stages. On an FPGA chip of the same size, the small hardware size of the proposed design leaves more free space for other applications. In addition, the high performance of the SHA-2 module eases or removes the data verification bottleneck in the security protocol, and allows the system to achieve a higher throughput.
This work forwards data among steps to remove the data dependency of a single step inside the main loop. As a result, the computation of one intermediate digest operand (E) can be implemented as a two-stage pipeline. Computation postponement, which performs computation and data movement simultaneously, enables the other intermediate digest operand (A) to be implemented as a three-stage pipeline.
The different numbers of stages used to compute the operands of the same step balance the computation in all stages, each of which contains two adder-layers with some data movements. In addition, our method allows us to remove an intermediate register from the design. The finalizations of the 80 repetitions are pre-computed and transmitted out to save time and hardware. The design was evaluated on several FPGA devices for comparison with other works in terms of throughput, hardware size, and the trade-off between hardware size and throughput.
SHA-384/512 Hashing Algorithm
The SHA-384 and SHA-512 algorithms are the latest and safest in the SHA series. They can handle messages up to 2^128 bits long and generate a 512-bit message digest (SHA-512) or a 384-bit message digest (SHA-384) with the security of 256 bits. They operate on 64-bit words with a block size of 1024 bits. The digest computation consists of two main processes [1]: pre-processing and the secure hash computation. The SHA-512 algorithm is shown in Fig. 1.
Pre-Processing
The pre-processing consists of the following three steps:
(1) Padding the message ensures that the length of the padded message is a multiple of 1024 bits. The padding module appends the bit "1" to the end of the message, followed by as many zero bits as needed. Then, a 128-bit field holding the length of the original message is appended.
(2) Parsing the padded message divides the result of the previous step into N data blocks of 1024 bits each. Each 1024-bit block is then treated as 16 keys of 64 bits each in the main secure hash computation.
(3) Setting the initial value loads constants into all the intermediate operands (A ∼ H) used for the message digest computation, once for each message. The initial values differ between SHA-384 and SHA-512.
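The padding and parsing steps above can be sketched in software as follows. This is a minimal illustration that assumes a byte-aligned message; the function names pad_message and parse_blocks are hypothetical and do not come from the hardware design.

```python
def pad_message(message: bytes) -> bytes:
    """Pad a byte-aligned message: a '1' bit, zero bits, then a 128-bit length."""
    bit_length = len(message) * 8
    padded = message + b"\x80"  # the '1' bit followed by seven '0' bits
    # Zero-fill so that exactly 16 bytes (128 bits) remain in the last 128-byte block.
    padded += b"\x00" * ((112 - len(padded)) % 128)
    padded += bit_length.to_bytes(16, "big")  # 128-bit big-endian length field
    return padded


def parse_blocks(padded: bytes) -> list:
    """Split the padded message into 1024-bit blocks of sixteen 64-bit words each."""
    blocks = []
    for offset in range(0, len(padded), 128):
        block = padded[offset:offset + 128]
        blocks.append([int.from_bytes(block[i:i + 8], "big") for i in range(0, 128, 8)])
    return blocks


blocks = parse_blocks(pad_message(b"abc"))
assert len(blocks) == 1 and len(blocks[0]) == 16
```

A message of 112 bytes or more in its last block no longer leaves room for the length field, so the padding spills into one additional 1024-bit block.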
Secure Hash Algorithm
The secure hash algorithm (H_SHA) shown in Fig. 2 is the heart of the SHA-384/512 algorithm. The same algorithm is applied to both SHA-384 and SHA-512. It processes all 1024-bit blocks (Y_0 to Y_{N−1}) of the padded message in sequence, starting from the first one, and compresses them into a 512-bit message digest. The H_SHA step handles the 1024-bit input data as sixteen 64-bit words, which represent the 16 initial keys (W_0 ∼ W_15). The secure hash algorithm module also receives the previous (initial) 512-bit message digest in the form of eight 64-bit words A ∼ H. The initial values of the message digest are also stored in eight further 64-bit words H_a ∼ H_h for the finalization. Two main computations are executed internally to generate the intermediate values of A and E. SHA-384/512 requires 80 steps to produce the 512-bit message digest. Each step requires the results of the previous step.
(1) The key generation inside the module generates the keys for the 80 repetitions from several previous keys according to Eqs. (1), in which σ_0 and σ_1 are two logic functions built from rotations and XOR operations. Each step t from 0 to 79 uses the corresponding key generated by Eqs. (1).
(2) Equations (2) show the main computation, which mainly generates A and E and is repeated 80 times. In these equations, K_t is a constant taken from an 80-element constant table; its location is defined by the step number t. W_t is the corresponding key of each step, given by Eqs. (1). Σ, Ch, and Maj are logical functions built from rotations and bitwise operations.
(3) The finalization adds the initial words H_a ∼ H_h to the corresponding generated hash values A ∼ H to produce the message digest of the data block, which serves as an input for processing the next data block. SHA-512 completes its process by outputting the final 512-bit message digest. The 384-bit message digest of SHA-384 is obtained by truncating the final 128 bits of the 512-bit result.
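For reference, the three parts above (key generation, main computation, finalization) can be written as a short software model. The sketch below follows the standard SHA-384/512 definitions rather than any equation numbering of this paper: the rotation amounts and the derivation of the constants from prime roots are taken from the standard, and the function names (compress, sha2_64, and the helpers) are hypothetical. It is checked against Python's hashlib and is not a description of the hardware.

```python
import hashlib
import math

MASK = (1 << 64) - 1

def first_primes(n):
    primes, candidate = [], 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

def icbrt(n):
    """Integer cube root (floor) by binary search."""
    lo, hi = 0, 1 << (n.bit_length() // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo

PRIMES = first_primes(80)
# K_t: first 64 fractional bits of the cube root of the t-th prime.
K = [icbrt(p << 192) & MASK for p in PRIMES]
# Initial A..H: fractional bits of the square roots of primes 1-8 (SHA-512) or 9-16 (SHA-384).
IV512 = [math.isqrt(p << 128) & MASK for p in PRIMES[:8]]
IV384 = [math.isqrt(p << 128) & MASK for p in PRIMES[8:16]]

def rotr(x, n):
    return ((x >> n) | (x << (64 - n))) & MASK

def big_sigma0(a): return rotr(a, 28) ^ rotr(a, 34) ^ rotr(a, 39)
def big_sigma1(e): return rotr(e, 14) ^ rotr(e, 18) ^ rotr(e, 41)
def small_sigma0(w): return rotr(w, 1) ^ rotr(w, 8) ^ (w >> 7)
def small_sigma1(w): return rotr(w, 19) ^ rotr(w, 61) ^ (w >> 6)
def ch(e, f, g): return (e & f) ^ (~e & g & MASK)
def maj(a, b, c): return (a & b) ^ (a & c) ^ (b & c)

def compress(state, block):
    # (1) Key generation: 16 message words extended to 80 keys.
    w = [int.from_bytes(block[i:i + 8], "big") for i in range(0, 128, 8)]
    for t in range(16, 80):
        w.append((small_sigma1(w[t - 2]) + w[t - 7] + small_sigma0(w[t - 15]) + w[t - 16]) & MASK)
    # (2) Main computation: 80 steps, generating new A and new E each step.
    a, b, c, d, e, f, g, h = state
    for t in range(80):
        t1 = (h + big_sigma1(e) + ch(e, f, g) + K[t] + w[t]) & MASK
        t2 = (big_sigma0(a) + maj(a, b, c)) & MASK
        a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
    # (3) Finalization: add the stored initial words to A..H.
    return [(x + y) & MASK for x, y in zip(state, (a, b, c, d, e, f, g, h))]

def sha2_64(message, iv, digest_bytes):
    padded = message + b"\x80"
    padded += b"\x00" * ((112 - len(padded)) % 128) + (len(message) * 8).to_bytes(16, "big")
    state = list(iv)
    for offset in range(0, len(padded), 128):
        state = compress(state, padded[offset:offset + 128])
    return b"".join(x.to_bytes(8, "big") for x in state)[:digest_bytes]

assert sha2_64(b"abc", IV512, 64) == hashlib.sha512(b"abc").digest()
assert sha2_64(b"abc", IV384, 48) == hashlib.sha384(b"abc").digest()
```

The only differences between the two modes in this model are the initial values and the truncation of the output, which is the similarity a multi-mode core can exploit.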
Related Work

Basic Pipeline Architectures
Lien [2] describes a pipelined SHA-512 whose main loop has two stages, with a maximum delay of three adder-layers. Chaves [6], [7] tries to increase the throughput of the pipelined SHA-512 architecture by replacing all adders with carry save adders (CSAs) and carry propagate adders (CPAs). In addition, an improvement in the finalization stage makes the design smaller by removing six adders.
Glabb [8] describes an architecture that exploits the similarity in processing among the SHA-2 variants. The 64-bit data processing of SHA-384/512 is divided into two 32-bit data processing paths of SHA-224/256. Hence, the multi-mode version can work as one SHA-384/512 module or two SHA-224/256 modules, at a low performance rate. A higher performance rate is obtained in the SHA-384/512 version.
Ahmad [10] implements an architecture in which 64-bit adders are replaced by multiple 32-bit adders (or 16/8-bit adders), and studies the area-performance trade-off on Altera devices.
Dadda [11] gives methods to reduce the data path in SHA-256 (384, 512) by using (1) delay balancing, (2) pipelining, and (3) a combination of delay balancing and pipelining. The critical path is shortened by using a full adder array (FAA) and pipelining. However, the design is given for SHA-256, and the delay is estimated from logic delays for ASIC devices.
Unrolling Pipeline Architectures
Unrolling architectures, which combine two or more consecutive steps into one, are widely used to reduce the number of clocks required. The method is based on the data movements among steps and on hardware duplication. The critical path becomes longer than in the basic architectures above. The reduction in the number of steps helps these architectures to gain high throughput at a low frequency. Examples are Lien [2] with five-step unrolling, Zeghid [4] with two-step unrolling, and McEvoy [3] with two-step and four-step unrolling. These unrolling architectures are also often combined with pipelining to increase the frequency.
Available Commercial Products
Several commercial products for SHA-1 and SHA-2 are offered by Alma, Helion, and Amphion. Among those, only Helion [13] offers a SHA-384/512 core with a technical description in terms of devices, maximum frequency, throughput, and hardware size. Helion achieves a significant throughput with small hardware size. The design is not published, but it processes each 1024-bit data block in 82 clocks. Hence, a basic pipeline design with two or three stages seems to be used. The design can be used for both the SHA-384 and SHA-512 algorithms.
Designing Method
A top-down design method was introduced by Michail [12]. It contains five steps: unrolling, spatial pre-computation, system data pre-fetching, temporal pre-computation, and circuit-level optimization. The unrolling step lets the design take advantage of combining consecutive steps, and reduces the number of required clocks with a reasonable penalty in area. The pre-computation and pre-fetching steps are used to divide the critical path into pipeline stages and to reduce the stage latency. The final step matches the design to the device.
Three-Stage Pipeline Architecture
Since the delay in the main computation is much larger than the delay in the key generation, this section describes how to break the large delay in the critical paths into smaller ones using data forwarding and computation postponement. The same core is designed for both SHA-384 and SHA-512 because of the similarity of their computations. Eight multiplexers are required for the pre-processing.
Data Dependency and Original Critical Delay
Equations (2) and (3) show that the values of H, G, F, D, C, and B directly rely on the previous values of G, F, E, C, B, and A, respectively. The bold arrows in Fig. 3 [...]. The key W_t, given by Eqs. (1), relies only on several previous key values. The keys can be pre-computed or computed online using 4 adders (forming two adder-layers). Σ relies on the previous value of E through rotation and XOR functions. Ch depends on the previous values of E, F, and G through logical operations only, whose delay can be considered very small. Hence, the final delay generated in this critical path includes 5 adders, arranged in a minimum of three adder-layers, and involves the previous intermediate digest values of D ∼ H.
Meanwhile, new-A depends on the old values of A ∼ C in the same manner as new-E depends on its inputs. Moreover, it also relies on T_1, which is a part of the new-E computation. In short, the critical path in computing the digest values of the next step consists of three adder-layers. New-E relies on W, K, and D ∼ H but not on A ∼ C. New-A relies on a part of new-E and on A ∼ C. Figure 4 shows the data movements in the A and E computations over two consecutive steps. Tracing back these data movements over two steps allows us to decide which operands can be forwarded for the pre-computation of intermediate values, so that the critical paths can be shortened using a two-stage pipeline.
Tracing Back A and E Computations for Data Forwarding and Computation Postponement
In order to start computing the current E (e_{t+1} in Fig. 4) two steps before, the value of H at one step before is required at that time. Assume the values of operands A ∼ H at two steps before are a ∼ h, respectively. The value of H at one step before is g, transferred from G at two steps before. In short, the value of H at one step before can be taken from operand G at two steps before for the pre-computation of the current E. This shows that part of Eqs. (2) can be pre-computed to ease the delay in the critical path of the E calculation. Figure 4 also shows that the computation of new-E (e_{t+1}) has no data dependency on the one-step-before A (a_t). Moreover, the calculation of new-A (a_{t+1}) can reuse a part of new-E (T_1). Thus, new-A can be computed one step later than new-E of the same step; that is, the computation of new-A can be postponed for one clock. This computational inheritance helps to reduce the logic delay in generating new-A.
Fig. 4 Original data movements in A and E computations among two steps. e_{t+1} partly depends on c and g, which come from C and G at two steps before. a_{t+1} partly relies on g and T_1, a part of the e_{t+1} computation.
Three-Stage Pipeline Design
The three-stage pipeline architecture is based on removing the data dependency by data forwarding and on shortening the critical path by postponement. It balances the critical paths in all stages.
Data Forwarding to Remove Dependency in New-E Computation
In this implementation, we divide a single step of SHA-384/512 into pipeline stages based on the data dependency in Eqs. (2). In order to apply the observation about the movement of H shown in Fig. 4 and Sect. 4.2, we rewrite Eqs. (2a), (2c), and (2d).
Equations (4) show that the computation latency can be divided into smaller stages. The key used at the current step can be pre-computed several steps before, because it relies mainly on itself and the step number t. The trace of A ∼ H in Fig. 4 allows us to define the value of H at one step before (used to calculate e_{t+1}) as G at two steps before. Hence, (4a), if it occurs two steps before, becomes K_{t+1} + W_{t+1} + G. In other words, the direct data dependency of the E computation on the operand H is removed. New-E (e_{t+1}) now relies on the value of G at two steps before. Thus, we can pre-compute (4a) for the current e_{t+1} by forwarding the value of G to H at two steps before. We can divide the computation of new-E into two stages called KWH and TEcomp, as shown in Fig. 5. The KWH stage computes K_{t+1} + W_{t+1} + G for Eq. (4a). Then, the TEcomp stage computes D + KWH + Σ_1(E) + Ch(E, F, G) (combining (4c) and (4d)). This computation generates new-E while requiring only three adders (a delay of two adder-layers) in each stage, because half of it is pre-computed in the KWH stage.
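The effect of the data forwarding can be checked with a small software model. The sketch below is an illustration under stated assumptions: it uses random 64-bit stand-ins for K_t and W_t (the equivalence does not depend on their values), keeps the A computation unchanged, and uses hypothetical function names. It only shows that pre-computing K + W + G one step early gives the same result as using H directly.

```python
import random

MASK = (1 << 64) - 1

def rotr(x, n):
    return ((x >> n) | (x << (64 - n))) & MASK

def big_sigma0(a): return rotr(a, 28) ^ rotr(a, 34) ^ rotr(a, 39)
def big_sigma1(e): return rotr(e, 14) ^ rotr(e, 18) ^ rotr(e, 41)
def ch(e, f, g): return (e & f) ^ (~e & g & MASK)
def maj(a, b, c): return (a & b) ^ (a & c) ^ (b & c)

def run_reference(state, k, w):
    """Original step: T1 is built from H of the current step."""
    a, b, c, d, e, f, g, h = state
    for t in range(len(k)):
        t1 = (h + k[t] + w[t] + big_sigma1(e) + ch(e, f, g)) & MASK
        t2 = (big_sigma0(a) + maj(a, b, c)) & MASK
        a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
    return (a, b, c, d, e, f, g, h)

def run_forwarded(state, k, w):
    """KWH is pre-computed one step early from G, which becomes H one step later."""
    a, b, c, d, e, f, g, h = state
    kwh = (k[0] + w[0] + h) & MASK  # KWH stage for the first step (H itself is known)
    for t in range(len(k)):
        # TEcomp stage: H no longer appears on this path.
        t1 = (kwh + big_sigma1(e) + ch(e, f, g)) & MASK
        # KWH stage for step t+1: forward G in place of the future H.
        if t + 1 < len(k):
            kwh = (k[t + 1] + w[t + 1] + g) & MASK
        t2 = (big_sigma0(a) + maj(a, b, c)) & MASK
        a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
    return (a, b, c, d, e, f, g, h)

rng = random.Random(0)
state = tuple(rng.getrandbits(64) for _ in range(8))
k = [rng.getrandbits(64) for _ in range(80)]
w = [rng.getrandbits(64) for _ in range(80)]
assert run_forwarded(state, k, w) == run_reference(state, k, w)
```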
Computation Postponement to Shorten Critical Path in New-A Computation
The independence of new-E from the previous A (e_{t+1} has no data dependence on a_t) allows us to separate the computations of E and A. The computation of A (a_t), if it occurs after the computation of E (e_t), can reuse a part of the E computation (T_1 in (4c)). Moreover, it has one more clock to complete. Hence, we postpone the A computation by rewriting Eqs. (2b), (2e), and (2f) into the corresponding Eqs. (4). The data movement given by (4e), which occurs at that time (the a_t movement in Fig. 4), moves the newly generated A (a_t) to the B operand. Thus, A_t is computed one step later than E_t and is immediately stored in its expected location, B. In addition, all input operands of (4f) have already moved one position to the right by then. Thus, register A can be omitted. The computation latency of three adder-layers (if A_t and E_t are computed simultaneously as in Fig. 4) is broken into a latency of two adder-layers in the TEcomp and Acomp stages.
Figure 5 shows how the computations of Eqs. (4) are divided into two and three pipeline stages with data forwarding and computation postponement. The large latency of the E and A computations in Fig. 3 (using five and six 64-bit adders, respectively, arranged in three adder-layers) is reduced to a latency of two adder-layers in Fig. 5. The KWH stage prepares the intermediate pipeline values. The TEcomp stage generates the E value for the next step. The Acomp stage reuses a temporary value of the E computation to generate the corresponding A value, and immediately stores it in the corresponding location (B). Eighty-two clocks are required to complete the compression of one data block.
One computation of E finishes after two stages, while the computation and movement of A finish after three stages. The computations in Fig. 5 and Eqs. (4) show that the operations in each stage contain up to three adders, arranged in two adder-layers. Hence, this architecture achieves latency balance in all stages.
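The interplay of the three stages can be illustrated with a clock-by-clock software model. This sketch is our own simplification, not the hardware description: K_t and W_t are random stand-ins, the finalization is left out, and the names pipelined_rounds and reference_rounds are hypothetical. It keeps only the registers B ∼ H plus the two pipeline registers KWH and T1, and it counts clocks under the assumption of one KWH-only clock at the start and one Acomp-only clock at the end.

```python
import random

MASK = (1 << 64) - 1

def rotr(x, n):
    return ((x >> n) | (x << (64 - n))) & MASK

def big_sigma0(a): return rotr(a, 28) ^ rotr(a, 34) ^ rotr(a, 39)
def big_sigma1(e): return rotr(e, 14) ^ rotr(e, 18) ^ rotr(e, 41)
def ch(e, f, g): return (e & f) ^ (~e & g & MASK)
def maj(a, b, c): return (a & b) ^ (a & c) ^ (b & c)

def reference_rounds(state, k, w):
    a, b, c, d, e, f, g, h = state
    for t in range(len(k)):
        t1 = (h + k[t] + w[t] + big_sigma1(e) + ch(e, f, g)) & MASK
        t2 = (big_sigma0(a) + maj(a, b, c)) & MASK
        a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
    return (a, b, c, d, e, f, g, h)

def pipelined_rounds(state, k, w):
    """Model with KWH, TEcomp, and Acomp stages and no A register."""
    a0, b, c, d, e, f, g, h = state
    kwh = (k[0] + w[0] + h) & MASK  # first clock: KWH stage only
    t1 = 0                          # T1 pipeline register (not read in the first step)
    clocks = 1
    for t in range(len(k)):
        # Acomp, postponed by one clock: the old A, B, C now sit in B, C, D.
        if t == 0:
            a_t = a0                # the first step reads the initial A (H_a)
        else:
            a_t = (t1 + big_sigma0(b) + maj(b, c, d)) & MASK
        # TEcomp: new T1 and new E from the pre-computed KWH.
        t1 = (kwh + big_sigma1(e) + ch(e, f, g)) & MASK
        e_next = (d + t1) & MASK
        # KWH for the next step, with G forwarded in place of H.
        if t + 1 < len(k):
            kwh = (k[t + 1] + w[t + 1] + g) & MASK
        # Clock edge: the postponed A lands directly in B.
        b, c, d, e, f, g, h = a_t, b, c, e_next, e, f, g
        clocks += 1
    a_last = (t1 + big_sigma0(b) + maj(b, c, d)) & MASK  # last clock: Acomp only
    clocks += 1
    return (a_last, b, c, d, e, f, g, h), clocks

rng = random.Random(1)
state = tuple(rng.getrandbits(64) for _ in range(8))
k = [rng.getrandbits(64) for _ in range(80)]
w = [rng.getrandbits(64) for _ in range(80)]
result, clocks = pipelined_rounds(state, k, w)
assert result == reference_rounds(state, k, w)
assert clocks == 82
```

In this model no stage sums more than a handful of operands per clock, and the value of A is never held in a register of its own: it is produced by Acomp and written straight into B.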
Pipelining Stages
Improvement for Register Reduction and Finalization
Two improvements are made to reduce the number of registers and the computation time of the finalization of the SHA process.
The first improvement concerns the finalization stage, in which the values of H_a ∼ H_h are replaced by the sums H_a + A ∼ H_h + H, respectively. Equation (4f) uses two adders arranged in two layers. Hence, one more adder is included here for the finalization of H_a, which still keeps the delay within two adder-layers. The second improvement comes from the observation that register A is never used during the computation, because its value is moved to B as soon as it appears, or written into H_a when the computation finishes. Hence, register A is removed and replaced by H_a when read at the first step, by H_b at the second step, and by H_a when written at the final step. The number of registers becomes 15 (the initial registers H_a ∼ H_h and the online computation registers B ∼ H).
Three-stage Pipelining Implementation
Data Extension Implementation
The pipeline of the data extension (key generation) module is shown in Fig. 6. Equations (6) show the computation for the key.
Compression Module Implementation
Data Organization
Three kinds of data with different lengths are involved in the design of the SHA-384/512 algorithm: the keys, the constant values, and the message digest data. In this implementation, the keys and the digest data are designed as matrices of registers that contain 16 and 8 elements, respectively. The constants are located in an 80-element look-up table. The shifts and rotations are implemented as dedicated modules using wire swapping. Other intermediate data are held in 64-bit registers. Two modules, data extension and operand computation, are pipelined in the implementation.
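The 16-element key register matrix can be modelled as a sliding window over the key sequence. The sketch below is a software illustration with hypothetical names; it uses the standard σ_0 and σ_1 definitions and forms each new key as the sum of two halves. In this model both halves are added in the same iteration, which is a simplification of a pipelined data extension.

```python
import random

MASK = (1 << 64) - 1

def rotr(x, n):
    return ((x >> n) | (x << (64 - n))) & MASK

def small_sigma0(w): return rotr(w, 1) ^ rotr(w, 8) ^ (w >> 7)
def small_sigma1(w): return rotr(w, 19) ^ rotr(w, 61) ^ (w >> 6)

def schedule_full(block_words):
    """Reference: keep all 80 keys in one array."""
    w = list(block_words)
    for t in range(16, 80):
        w.append((small_sigma1(w[t - 2]) + w[t - 7] + small_sigma0(w[t - 15]) + w[t - 16]) & MASK)
    return w

def schedule_rolling(block_words):
    """Produce W_0..W_79 from a 16-register window (regs[0] holds the current key)."""
    regs = list(block_words)
    keys = []
    for _ in range(80):
        keys.append(regs[0])
        # Two halves of the next key, each needing one adder.
        half_old = (small_sigma0(regs[1]) + regs[0]) & MASK
        half_new = (small_sigma1(regs[14]) + regs[9]) & MASK
        # Shift the window by one register and append the new key.
        regs = regs[1:] + [(half_old + half_new) & MASK]
    return keys

rng = random.Random(2)
block = [rng.getrandbits(64) for _ in range(16)]
assert schedule_rolling(block) == schedule_full(block)
```

The window never needs more than sixteen 64-bit registers, which matches the 16-element key matrix described above.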
Pipelining of Operands Computations
The pipeline stages are implemented with two pipeline registers called KWH and T_1. Figure 7 shows the data movements among the operands and their pipelined computations with computation postponement and register reduction. Register KWH is computed at the KWH stage by (4a), in which H is forwarded from G. The T_1 and E values are computed at the TEcomp stage by (4c) and (4d). W is computed in a pipelined manner at the KWH and TEcomp stages using Eqs. (6). At any intermediate step, the value of E is ready for the next TEcomp step, and the value of T_1 is available for Eq. (4f), calculated at the Acomp stage. However, at this time, the A, B, and C values, which are used to generate new-A in (4f), have already moved once to the right and are located in B, C, and D instead. Because of the postponement of the A computation, the newly generated value (new-A) is not placed in A as usual but goes to B, as the combination of the A computation (4f) and the A movement (4e) of two consecutive steps. Register A is no longer used and can be omitted in the intermediate steps.
Figure 8 shows the pipeline implementation of the compression module H_SHA. It contains the three stages KWH, TEcomp, and Acomp. The bold lines show the critical paths in the stages. They illustrate that all stages are balanced with two adder-layers. Register KWH is used to store K + W + G or K + W + H. The two halves of each key are computed in the KWH and TEcomp stages. The Acomp stage computes new-A by (4f), and immediately forwards it to the corresponding location in the B register using two 64-bit adders. Thus, all the critical paths in the three stages are balanced with two adder-layers and some data movements.
Hardware Implementation
Implementation Results and Discussion
Four versions are given: SHA-512 1 uses full adders (FAs) in the design; SHA-512 2 replaces all dimmed adders in the TEcomp stage with carry save adders (CSAs); SHA-384/512 1 uses FAs in the design; and SHA-384/512 2 replaces all dimmed adders in the TEcomp stage with CSAs. The dimmed adders are chosen for replacement because, when combined, they form four-input adders, which are suitable for CSAs. The designs were implemented on Virtex, Virtex-E, Virtex-2, and Virtex-4 family devices, and were compiled with Xilinx ISE 8.1. Table 1 shows the implementation results in terms of hardware size, speed, throughput, and area performance rate on several different devices. The maximum clock speed at which the proposed design works correctly on each FPGA device results from the latency balancing in all pipeline stages. Increasing the number of pipeline stages also simplifies the calculations in each stage and so increases the maximum operating frequency. The throughputs of the designs, calculated as (frequency × block size) / clocks per block, reach nearly 2.4 Gbps. The hardware size stays below 1,627 slices. The small hardware size and memory occupied by the proposed design leave more space for other applications to run simultaneously on the same FPGA chip. The high throughput also removes the data verification bottleneck in the security protocol, and allows the system to achieve a higher throughput.
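The reason a four-input addition suits carry save adders can be seen in a small bit-level model. The sketch below is a generic illustration of the carry-save technique with hypothetical function names, not the exact adder arrangement of the TEcomp stage: two 3:2 compression layers without carry propagation reduce four operands to two, and a single carry-propagate addition finishes the sum.

```python
import random

MASK = (1 << 64) - 1

def csa(x, y, z):
    """3:2 carry-save compressor: x + y + z == partial_sum + carry (mod 2^64)."""
    partial_sum = x ^ y ^ z
    carry = (((x & y) | (x & z) | (y & z)) << 1) & MASK
    return partial_sum, carry

def add4_csa(p, q, r, s):
    """Four-input addition: two carry-save layers, then one carry-propagate add."""
    s1, c1 = csa(p, q, r)
    s2, c2 = csa(s1, c1, s)
    return (s2 + c2) & MASK

rng = random.Random(4)
for _ in range(1000):
    p, q, r, s = (rng.getrandbits(64) for _ in range(4))
    assert add4_csa(p, q, r, s) == (p + q + r + s) & MASK
```

Each carry-save layer has a delay of about one full-adder cell regardless of word length, so only the final addition propagates carries across the 64 bits.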
The results show that CSAs should be used on Virtex-4 devices because they improve the performance rate. On Virtex-2, full adders should be used for the best performance. In general, CSAs should be used if a high-performance SHA-2 core is preferred, and FAs should be used if a compact one is preferred. The results also show that the multi-mode SHA-384/512 increases the hardware size by 6-8% at the same throughput compared with the single-mode SHA-512. Table 2 shows the implementation results of related work [2] - [11], [13], in which the basic and the unrolling architectures of SHA-512 and SHA-384/512 are implemented as pipelines.
Comparison with Related Work
Our work balances the latency among pipeline stages and keeps the delay of every computational stage within two adder-layers, or two carry save adders, with nearly no hardware duplication. Thus, we achieved a better area performance rate (throughput divided by slices) than all others, as shown in Fig. 9. Our work not only occupied a smaller area than all others except Helion [13], but also achieved better throughput and area performance rate than all comparable work.
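The two figures of merit used in this comparison follow directly from their definitions. The helper functions below are a hypothetical sketch; the example values are the 2.3 Gbps throughput and 1,627 slices quoted in this paper for the SHA-512 core on Virtex-4, and the clock frequency is derived from them rather than taken from a table.

```python
BLOCK_BITS = 1024        # SHA-384/512 block size
CLOCKS_PER_BLOCK = 82    # clocks needed to compress one block in this design

def throughput_bps(frequency_hz, block_bits=BLOCK_BITS, clocks=CLOCKS_PER_BLOCK):
    """Throughput = (frequency * block size) / clocks per block."""
    return frequency_hz * block_bits / clocks

def implied_frequency_hz(target_bps, block_bits=BLOCK_BITS, clocks=CLOCKS_PER_BLOCK):
    """Clock frequency needed to reach a given throughput."""
    return target_bps * clocks / block_bits

def area_performance_rate(bps, slices):
    """Throughput divided by area, in bits per second per slice."""
    return bps / slices

freq = implied_frequency_hz(2.3e9)          # roughly 184 MHz
rate = area_performance_rate(2.3e9, 1627)   # roughly 1.4 Mbps per slice
print(f"{freq / 1e6:.1f} MHz, {rate / 1e6:.2f} Mbps/slice")
```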
The improvement in the finalization given by Chaves [6], [7] removes six 64-bit adders compared with the original algorithm. The use of CSAs and CPAs in that design improves the throughput but increases the hardware size. Our implementation makes the critical paths shorter than those of Chaves and removes five adders together with [...]
In comparison with the commercial product of Helion [13], our work is much better in terms of throughput and area performance, and achieves 30% better throughput, although the area is only 2% bigger. Our area performance rate is 17% higher than that of Zeghid [4], although the throughput is 6% smaller due to the two-step unrolled architecture used by Zeghid. Since Aisopos [5] uses hardware duplication, our hardware size is 73% smaller and our area performance rate is 39% better, but the throughput is 20% smaller.
From an architectural viewpoint, the pipelined version of Dadda [11] uses the same number of pipeline stages as ours. Their method is applied to SHA-256 and the result is estimated on an advanced ASIC device only; hence, a direct comparison is impossible. However, the differences in computation and data movement in our design shorten the critical path and remove some registers.
In terms of design methodology, our design method is close to the top-down design method of Michail [12]. However, there are some differences between the two. Our design merges the pre-fetching and pre-computation into one. This helps us to achieve computation balance among the stages of the critical path. Furthermore, postponing the A_t computation for one clock allows us to reuse a temporary value generated during the E_t computation. This shortens the computation delay of A and results in register reduction.
Effectiveness of Combining Computation with Movement, and Computation with Finalization
The combination of the A computation and the A movement into one in Fig. 7 makes the computation of the operands in each step asynchronous. It helps to reduce the delay in the critical path from three adder-layers to two adder-layers, and makes room for the finalization stage of the SHA-384/512 computation. It also allows us to remove one 64-bit register from the design. Combining the finalization of operand A with the step computation, using the room provided by the asynchronous operand computation, removes the finalization time. The module can start processing the next data block immediately after finishing the intermediate steps.
Combining the finalization of operands B ∼ H with some of the final steps reduces the number of 64-bit adders from 7 to 2. Data also appear at the output sooner, with only a few multiplexers used.
In short, combining computation with movement and computation with finalization in our implementation helps to improve the frequency and throughput while keeping the hardware size small.
Conclusions
This paper has described a fine-grained pipeline architecture for SHA-384/512 in which intermediate operands are computed through two-stage and three-stage pipeline processes. The architecture achieved high throughput with small hardware size by balancing critical paths in fine-grained pipeline stages. A data forwarding technique was used to remove the data dependency, which allowed us to divide one original critical path into two. A computation postponement was applied to the other critical path to break it into three shorter ones.
Experimental results show that the design achieved higher frequency, and thus greater throughput, than most comparable work. In addition, the register reduction in our design helped us to achieve a better area performance rate than all others. Moreover, the module can deal with both SHA-384 and SHA-512 with the same throughput, although the hardware size would increase by 6-8%. A significant throughput result of 2.3 Gbps was achieved while requiring 1,627 hardware slices on the Xilinx Virtex-4 family device for the SHA-512 core.
