Abstract: This paper proposes a new SHA-1 architecture that exploits higher parallelism and shortens the critical path of the Hash operation. It improves performance without a significant area penalty. We implemented the proposed SHA-1 architecture on an FPGA, where a maximum clock frequency of 118 MHz yields a data throughput of 5.9 Gbps. This throughput is about 26% higher than that of comparable designs. The architecture supports cryptographic processing of high-speed multimedia data.
Introduction
In networks supporting multimedia data services, information security has become a critical issue. Hash functions are widely used for message authentication and integrity protection in many wireless protocols because they do not require the processed data to be retrieved. SHA-1 (secure Hash algorithm-1) is a representative algorithm of the SHA family [1]. Many SHA-1 implementations have been proposed to enhance performance [2, 3, 4, 5, 6]. However, the low throughput of secure data processing remains a bottleneck for data encryption in multimedia service networks. Y. Lee et al. [6] proposed a SHA-1 architecture employing the unfolded transformation, which combines several iterations of SHA-1 into a single cycle. This architecture enhances performance by reducing the number of cycles required to hash one block, and achieves 3.5 Gbps as a result. H. Michail et al. [5] proposed a holistic methodology that reaches a maximum throughput of over 4.7 Gbps at a 91 MHz clock frequency. This paper proposes a high-speed SHA-1 architecture aimed at high throughput. It is based on the unfolded transformation, performing two Hash operations per cycle. We shorten the critical path of the Hash operation by pre-computing coefficients and exploiting the resulting parallelism.
SHA-1 algorithm and related works
The SHA-1 algorithm takes an input message of fewer than 2^64 bits and produces a 160-bit message digest. The conventional architecture of SHA-1 is shown in Fig. 1a. In Fig. 1a, each 512-bit message block is expanded into 80 × 32-bit words and processed in 4 rounds of Hash operations, where each round performs 20 operations iteratively. The main differences among the rounds are the scrambling constant K_t, the nonlinear operation F_t, and the 32-bit data word W_t used in each Hash operation.
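The round structure described above can be sketched in software. The following is a minimal reference model, assuming the standard SHA-1 definition of [1]; the function names (`rotl`, `f`, `expand`, `compress`, `sha1`) are illustrative choices and do not come from the paper. It is meant only to make the 4 × 20 operation structure concrete, not to model the hardware.

```python
import hashlib
import struct

MASK = 0xFFFFFFFF  # keep every value within 32 bits

# One scrambling constant K_t per round of 20 operations.
K = [0x5A827999, 0x6ED9EBA1, 0x8F1BBCDC, 0xCA62C1D6]
# Initial hash values H0..H4 defined by the standard.
H_INIT = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0)


def rotl(x, n):
    """Rotate the 32-bit word x left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK


def f(t, b, c, d):
    """Nonlinear function F_t; it changes with the round (t // 20)."""
    if t < 20:
        return (b & c) | (~b & d & MASK)       # choose
    if t < 40 or t >= 60:
        return b ^ c ^ d                       # parity
    return (b & c) | (b & d) | (c & d)         # majority


def expand(block):
    """Expand one 64-byte block into the 80 data words W_0..W_79."""
    w = list(struct.unpack(">16I", block))
    for t in range(16, 80):
        w.append(rotl(w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16], 1))
    return w


def compress(state, block):
    """Run the 80 Hash operations (4 rounds x 20) on one block."""
    a, b, c, d, e = state
    w = expand(block)
    for t in range(80):
        temp = (rotl(a, 5) + f(t, b, c, d) + e + w[t] + K[t // 20]) & MASK
        a, b, c, d, e = temp, a, rotl(b, 30), c, d
    return tuple((s + v) & MASK for s, v in zip(state, (a, b, c, d, e)))


def sha1(message):
    """Pad the message, process it block by block, return the hex digest."""
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded)) % 64)
    padded += struct.pack(">Q", 8 * len(message))
    state = H_INIT
    for i in range(0, len(padded), 64):
        state = compress(state, padded[i:i + 64])
    return struct.pack(">5I", *state).hex()
```

Comparing `sha1(b"abc")` with `hashlib.sha1(b"abc").hexdigest()` gives a quick check that the model follows the standard.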
H. Michail et al. [5] presented a cost function analysis of the unfolded SHA-1 algorithm. According to this analysis, the best throughput/area ratio is obtained by partially unfolding two operations. Y. Lee et al. [6] presented an unfolded SHA-1 design employing the two-unfolded operation block shown in Fig. 1c instead of the conventional SHA-1 block shown in Fig. 1b. It halves the number of cycles because it performs two Hash operations in one cycle. However, it 
The proposed SHA-1 architecture
The proposed SHA-1 design employs a two-unfolded architecture. We propose pre-computing coefficients of the Hash function and exploiting the parallelism available across the two Hash operation blocks. The outputs of the Hash operation are shown in Eq. (1).
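For reference, assuming the standard SHA-1 step function of [1], two consecutive Hash operations can be written as follows. This is a reconstruction from the standard and from the dependencies described in the text, not a copy of the authors' Eq. (1); the intermediate value a_{t+1} is written out explicitly.

```latex
\begin{aligned}
a_{t+1} &= \mathrm{RotL}^{5}(a_t) + f_t(b_t, c_t, d_t) + e_t + W_t + K_t \\
a_{t+2} &= \mathrm{RotL}^{5}(a_{t+1}) + f_{t+1}\bigl(a_t, \mathrm{RotL}^{30}(b_t), c_t\bigr) + d_t + W_{t+1} + K_{t+1} \\
b_{t+2} &= a_{t+1} \\
c_{t+2} &= \mathrm{RotL}^{30}(a_t) \\
d_{t+2} &= \mathrm{RotL}^{30}(b_t) \\
e_{t+2} &= c_t
\end{aligned}
```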
where RotL^x(y) represents the left rotation of y by x bits, and f_t(o, p, q) stands for the nonlinear function at time t (t = 0, 2, 4, ...). The outputs c_{t+2}, d_{t+2}, and e_{t+2} are derived directly from a_t, b_t, and c_t, whereas the other outputs, a_{t+2} and b_{t+2}, require the computational result a_{t+1}, as shown in Eq. (1).
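The dependency structure described here can be checked with a small software model. The sketch below assumes the standard SHA-1 step function of [1]; the names `step`, `unfolded_step`, and `check_equivalence` are hypothetical and chosen for illustration. It shows that one two-unfolded operation produces the same state as two conventional operations, and that only a_{t+2} and b_{t+2} depend on the intermediate value a_{t+1}.

```python
import random

MASK = 0xFFFFFFFF
K = [0x5A827999, 0x6ED9EBA1, 0x8F1BBCDC, 0xCA62C1D6]


def rotl(x, n):
    """Rotate the 32-bit word x left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK


def f(t, b, c, d):
    """Nonlinear function f_t of the standard, selected by the round."""
    if t < 20:
        return (b & c) | (~b & d & MASK)
    if t < 40 or t >= 60:
        return b ^ c ^ d
    return (b & c) | (b & d) | (c & d)


def step(state, t, w_t):
    """One conventional operation: state at time t -> state at time t+1."""
    a, b, c, d, e = state
    a_next = (rotl(a, 5) + f(t, b, c, d) + e + w_t + K[t // 20]) & MASK
    return (a_next, a, rotl(b, 30), c, d)


def unfolded_step(state, t, w_t, w_t1):
    """Two operations merged: state at time t -> state at time t+2."""
    a, b, c, d, e = state
    # a_{t+1} is the only intermediate value on the critical path.
    a1 = (rotl(a, 5) + f(t, b, c, d) + e + w_t + K[t // 20]) & MASK
    a2 = (rotl(a1, 5) + f(t + 1, a, rotl(b, 30), c) + d
          + w_t1 + K[(t + 1) // 20]) & MASK
    # c, d, e at time t+2 come straight from a, b, c at time t.
    return (a2, a1, rotl(a, 30), rotl(b, 30), c)


def check_equivalence(seed=0):
    """Compare 40 unfolded steps against 80 conventional steps."""
    rng = random.Random(seed)
    state = tuple(rng.getrandbits(32) for _ in range(5))
    words = [rng.getrandbits(32) for _ in range(80)]
    ref = unf = state
    for t in range(0, 80, 2):
        ref = step(step(ref, t, words[t]), t + 1, words[t + 1])
        unf = unfolded_step(unf, t, words[t], words[t + 1])
        if ref != unf:
            return False
    return True
```

In this model, 40 calls to `unfolded_step` cover all 80 operations of one block, which is the halving of the cycle count that the unfolded transformation provides.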
We propose pre-computation of the Hash coefficients, which increases the parallelism of the Hash operation. We define three new parameters, l_t, m_t, and n_t, which are presented in Eq. (2). These terms are pre-computed before the other parameters are computed. Note that l_t is used to compute a_{t+2} and b_{t+2}; m_t and n_t are used to compute a_{t+2} and l_{t+2}, respectively.
Fig. 2. The proposed SHA-1 Hash operation block
Thus, a_{t+2} and b_{t+2} in Eq. (1) can be rewritten as Eq. (3) using the pre-computed parameters. The pre-computed parameters are supplied simultaneously with the other inputs, such as a_t, b_t, and c_t. The critical path delay for computing a_{t+2} decreases substantially because l_t and m_t are pre-computed.
The proposed SHA-1 architecture is composed of Pre comp and Hash core, as shown in Fig. 2. Pre comp is responsible for pre-computing the newly defined terms l_t, m_t, and n_t before the next stage of the Hash operation. Among these parameters, l_t has the longest delay, which is two additions and one nonlinear function, f_t. Hash core is responsible for computing the n-th Hash operation using the values a_t to e_t as well as the previous outputs of Pre comp. Among its outputs, a_{t+2} has the longest delay, which is also two additions and one nonlinear function. In addition, there is no data dependency between Pre comp and Hash core, which makes parallel computation possible. Consequently, the critical path of the Hash operation block is reduced to the delay of two additions and one nonlinear function, f_t.
The proposed SHA-1 architecture requires 41 cycles to generate a message digest. The first cycle initializes the newly defined terms l_t, m_t, and n_t. The other 40 cycles are required for the iterative Hash operations. During these iterations, the proposed SHA-1 processes Pre comp and Hash core in parallel. Within a given cycle, the outputs of Pre comp and Hash core are independent because Pre comp performs the preceding computation for Hash core. The select signal, Sel, distinguishes the initialization cycle from the iterative cycles. It selects the input of Hash core, as shown in Fig. 2. When Sel is high, Pre comp generates the newly defined terms l_0, m_0, and n_0 using the constant values W_t and K_t. The initial values are H_0 to H_4, which are constants of the SHA-1 algorithm. Thus, the proposed implementation needs a total of 41 cycles to complete the Hash operations. Although it requires one extra cycle compared to the conventional two-unfolded SHA-1, it substantially shortens the critical path delay of the Hash operation block.
Performance analysis
We evaluate the proposed SHA-1 architecture on a Xilinx Virtex-II FPGA, xc2v1000, and compare it with its counterparts. The design was fully verified using a large set of test messages, namely the test examples provided by the standard [1]. Table I compares the results with other SHA-1 implementations based on the same FPGA technology, the Xilinx Virtex-II family [2, 3, 4, 5, 6]. For a fair comparison, we referred to the data sheet of each FPGA chip and determined how many slices each design occupies. The proposed architecture of Fig. 2, with pipelining and an unfolding factor of two, shows a maximum operating frequency of 118 MHz and a throughput of 5.9 Gbps.
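The reported figures can be related by a short calculation. The sketch below assumes four pipeline stages that each process an independent 512-bit block; the text says only that the design is pipelined, so the stage count is an assumption, chosen because it reproduces the reported 5.9 Gbps. The function name `throughput_gbps` is illustrative.

```python
def throughput_gbps(clock_mhz, cycles_per_block,
                    pipeline_stages=4, block_bits=512):
    """Throughput in Gbps for a pipelined hash core.

    pipeline_stages=4 is an assumption (not stated in the text):
    each stage is taken to work on its own 512-bit block.
    """
    bits_per_second = (pipeline_stages * block_bits
                       * clock_mhz * 1e6 / cycles_per_block)
    return bits_per_second / 1e9


# 118 MHz clock and 41 cycles per block, as reported for the proposed design.
proposed = throughput_gbps(118, 41)
print(f"{proposed:.3f} Gbps")
```

Under these assumptions the result rounds to the reported 5.9 Gbps; with a single stage the same formula would give roughly a quarter of that value.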
In terms of circuit size, the conventional SHA-1 architecture without pipelining or unfolding [2] has the smallest hardware size, 854 slices, among all works listed in Table I. The proposed SHA-1 design occupies about 2,900 slices, which is smaller than the other designs employing both pipelining and unfolding [5, 6]. As a result, the proposed SHA-1 reduces the area by over 32% compared with these unfolded architectures.
The proposed SHA-1 architecture brings advantages in both maximum operating frequency and throughput, as shown in Table I. The SHA-1 designs without pipelining and unfolding [2, 3] show a maximum throughput of up to 1.3 Gbps. By comparison, the SHA-1 designs employing either the unfolding transformation or a pipelined architecture alone show 900 Mbps and 2.5 Gbps. Even though our implementation requires 41 cycles, including the initialization cycle for pre-computation, it increases throughput by 26% compared to the previous work [5].
Conclusions
SHA-1 is a popular Hash algorithm that is suitable for high-speed cryptography. We proposed a new SHA-1 architecture that reduces the critical path and thereby enhances the throughput of the Hash algorithm. We implemented the proposed SHA-1 architecture on an FPGA. The implementation runs at a 118 MHz clock, which gives a maximum throughput of 5.9 Gbps. As a result, the proposed architecture shows more than 26% better throughput with a 32% smaller hardware size than the previous implementations. The high-speed SHA-1 is useful for generating condensed messages and may strengthen the security of mobile communication and internet services.
