Abstract. Compact hardware architectures are proposed for the ISO/IEC 10118-3 standard hash function Whirlpool. In order to reduce the circuit area, the 512-bit function block ρ[k] for the main datapath is divided into smaller sub-blocks with 256-, 128-, or 64-bit buses, and the sub-blocks are used iteratively. Six architectures are designed by combining the three different datapath widths and two data scheduling techniques: interleave and pipeline. The six architectures in conjunction with three different types of S-box were synthesized using a 90-nm CMOS standard cell library, with two optimization options: size and speed. A total of 18 implementations were obtained, and their performances were compared with conventional designs using the same standard cell library. The highest hardware efficiency (defined by throughput per gate) of 372.3 Kbps/gate was achieved by the proposed pipeline architecture with the 256-bit datapath optimized for speed. The interleaved architecture with the 64-bit datapath optimized for size showed the smallest size of 13.6 Kgates, which requires only 46% of the resources of the conventional compact architecture.
Introduction
Whirlpool [1, 2] is an ISO/IEC 10118-3 [3] standard 512-bit hash function based on the Miyaguchi-Preneel scheme using the SPN-type compression function W with the 512-bit round function ρ[k], which is similar to the block cipher AES [4]. High-performance hardware architectures for Whirlpool were proposed and their performances were evaluated using the ASIC library described in [5]. However, the 512-bit function ρ[k] requires a large amount of hardware resources, resulting in a much larger circuit area compared to other hash functions, such as SHA-256/-512 [3, 6]. Several compact architectures were also examined in [7, 8, 9], but these architectures are limited to a 64- or 8-bit datapath-width on an FPGA supporting a large built-in memory (i.e., Block RAM).
In this paper, we propose compact hardware architectures for ASIC implementations, which partition the function block ρ[k] into sub-blocks with 256-, 128-, or 64-bit datapath-widths. The proposed architectures generate round keys on the fly to eliminate the requirement for memory resources to hold pre-calculated round keys. This feature is especially effective in ASIC implementations in which memory is expensive. The pipelining technique for the compact architectures is also investigated to achieve higher throughput with a smaller circuit area. In total, six hardware architectures for Whirlpool that combine two schemes (interleave and pipeline) with the three datapath-widths are designed. The corresponding throughput and gate count are then evaluated using a 90-nm CMOS standard cell library. Performance comparisons with the Whirlpool hardware using a 512-bit datapath-width [5] and the SHA-256/-512 hardware [10] synthesized using the same library are also presented.
The 512-bit Hash Function Whirlpool
The Whirlpool compression function W has two datapaths, and in each path, the 512-bit function ρ[k] is used 10 times, as shown in Fig. 1. One of the paths receives the 512-bit hash value H_{i-1} that was generated from the 512-bit message blocks m_1 to m_{i-1} in the previous cycles and then outputs ten 512-bit keys K_1 to K_{10} by using ten 512-bit constants c_1 to c_{10}. The other path processes the current message block m_i using the keys K_1 to K_{10}. No hash value is calculated before receiving the first message block m_1, and thus 0 is assigned to H_0.
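The two-path structure described above can be sketched in a few lines of Python. This is only an illustrative model, not the authors' hardware description: the round function is passed in as a parameter (the name `rho` and its `(state, key)` signature are assumptions), 512-bit values are held as Python integers, and the initial key addition with H_{i-1} and the Miyaguchi-Preneel feed-forward are taken from the Whirlpool specification rather than from this section.

```python
from typing import Callable, Sequence

ROUNDS = 10  # rho[k] is applied 10 times on each path

# Hypothetical signature: rho(state, key) -> new state, all 512-bit ints.
RoundFn = Callable[[int, int], int]


def compress(h_prev: int, m: int, rho: RoundFn, consts: Sequence[int]) -> int:
    """Model of one compression step producing H_i from H_{i-1} and m_i."""
    assert len(consts) == ROUNDS
    key = h_prev                 # key path starts from H_{i-1}
    state = m ^ key              # initial key addition (per the specification)
    for c in consts:
        key = rho(key, c)        # key path: next round key from constant c_r
        state = rho(state, key)  # data path: uses the key just generated
    # Miyaguchi-Preneel feed-forward: output xor chaining value xor message
    return state ^ h_prev ^ m
```

Because each round key is consumed immediately after it is produced, no storage for pre-calculated keys is needed in this model, which mirrors the on-the-fly key generation of the proposed architectures.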
The function ρ[k] consists of four 512-bit sub-functions γ, π, θ, and σ[k], which are similar to SubBytes, ShiftRows, MixColumns, and AddRoundKey, respectively, of AES. Each sub-function treats a 512-bit block as an 8×8-byte matrix. The first function γ is a nonlinear substitution function consisting of sixty-four 8-bit S-boxes, and the S-box is defined as a combination of three types of 4-bit mini-boxes, namely, E, E^{-1}, and R. The following function π rotates each row of the matrix by 0 to 7 bytes. The function θ then performs a matrix multiplication using the parameters shown in Fig. 1. When the individual input and output bytes of the multiplication are defined as a_{ij} and b_{ij} (0 ≤ i, j ≤ 7), respectively, the eight bytes of the column j = 0 are calculated as sums of byte products with these parameters, where the byte multiplication is performed over a Galois field GF(2^8).
structure shown in the leftmost part of Fig. 1. The mini-boxes E, E^{-1}, and R in (II) are implemented using 4-bit input/output lookup tables. For the pipeline architecture described later, the pipeline register is placed between the mini-boxes E, E^{-1}, and the XOR gates in (II) in order to partition the critical path of ρ[k] within the S-box.
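As an illustration of how an 8-bit S-box can be composed from the 4-bit mini-boxes, the following Python sketch models the mini-box structure in software. The table contents for E and R and the wiring between the mini-boxes are taken from the Whirlpool specification and are assumptions with respect to this paper, since Fig. 1 is not reproduced here; the function name `sbox` is hypothetical.

```python
# 4-bit mini-box tables as listed in the Whirlpool specification (assumed here).
E = [0x1, 0xB, 0x9, 0xC, 0xD, 0x6, 0xF, 0x3,
     0xE, 0x8, 0x7, 0x4, 0xA, 0x2, 0x5, 0x0]
R = [0x7, 0xC, 0xB, 0xD, 0xE, 0x4, 0x9, 0xF,
     0x6, 0x3, 0x8, 0xA, 0x2, 0x5, 0x1, 0x0]

# E^{-1} is derived by inverting the E table.
E_INV = [0] * 16
for index, value in enumerate(E):
    E_INV[value] = index


def sbox(x: int) -> int:
    """8-bit S-box built from the mini-boxes E, E^{-1}, and R."""
    hi = E[x >> 4]          # first layer, upper nibble
    lo = E_INV[x & 0xF]     # first layer, lower nibble
    # A pipeline register placed here would split the S-box between the
    # first-layer mini-boxes and the XOR gates, as the text describes.
    r = R[hi ^ lo]          # middle mini-box on the XOR of both nibbles
    return (E[hi ^ r] << 4) | E_INV[lo ^ r]
```

In this model, each mini-box is a 16-entry lookup table, so the whole S-box needs five 4-bit lookups and three 4-bit XORs instead of one 256-entry table.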
Compact Hardware Architectures

Data manager
The data manager is a circuit component for the function π, which performs byte-wise permutation using a series of shift registers and selectors. Reference [8] used a simple data manager that processes one 8-byte row of the 8×8-byte matrix every cycle. In order to increase throughput using a compact circuit, we extend the data manager to 128- and 256-bit datapaths. The byte permutation of the function π is denoted as shown in Fig. 2, and its straightforward implementations are shown in Fig. 3. The function block π is implemented by metal wire interconnection and no transistor device is used, although a large number of selectors and long data buses in the feedback path have an adverse impact on hardware performance with respect to size and speed. To address this problem, the four-stage (128-bit datapath) and two-stage (256-bit datapath) data managers shown in Figs. 4 and 5, respectively, are introduced herein. In these figures, the square boxes denote 8-bit registers. The four-stage data manager receives 16 bytes (128 bits) corresponding to the two rows of the left-hand matrix in Fig. 2. By reducing the number of registers, the critical path is shortened, and thus, the new data manager provides a smaller circuit area with a higher operating frequency in comparison to the conventional schemes. The two-stage data manager processes four rows of Fig. 2 simultaneously using 16-byte (128-bit) selectors, while the straightforward implementation in Fig. 3(b) requires four times the number of selectors.
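A behavioral sketch of the permutation and of the row grouping seen by the data managers is given below. It is a software model only and says nothing about the register and selector arrangement of Figs. 4 and 5. The rotation direction (row i rotated right by i bytes) is an assumption, as are the function names `pi_rows` and `row_groups`.

```python
from typing import List

Matrix = List[List[int]]  # 8 rows of 8 bytes


def pi_rows(matrix: Matrix) -> Matrix:
    """Rotate row i of the 8x8 byte matrix by i positions (direction assumed)."""
    return [[row[(j - i) % 8] for j in range(8)] for i, row in enumerate(matrix)]


def row_groups(matrix: Matrix, width_bits: int) -> List[Matrix]:
    """Split the matrix into the groups of rows handled in one clock cycle."""
    rows_per_cycle = width_bits // 64  # one row is 8 bytes = 64 bits
    return [matrix[k:k + rows_per_cycle] for k in range(0, 8, rows_per_cycle)]
```

With this grouping, a 64-bit datapath handles one row per cycle as in [8], the 128-bit datapath handles two rows per cycle (four cycles per matrix), and the 256-bit datapath handles four rows per cycle (two cycles per matrix).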
Datapath Architectures
The compact Whirlpool hardware architectures using the new data managers are 
Interleave Architecture
The datapath of the 128-bit interleave architecture is shown in Fig. 6. Two four-stage data managers are used for the architecture. A 512-bit message block or a 512-bit key is stored in a 512-bit shift register in each data manager using four clock cycles. There is only one γθ-function block, and it is used alternately for data randomization and key scheduling every four clock cycles. In the architecture, the order of the substitution function γ and the permutation function π of the data manager is reversed from the original order in Fig. 1 so that the datapath and sequencer logic are simplified. This has no effect on the operations in data randomization, but the data 
Pipeline Architecture
The pipeline architecture divides the functions γ and θ by inserting pipeline registers to shorten the critical paths and improve the operating frequency. In the 128-bit datapath architecture, the number of pipeline stages can be increased up to five without causing a pipeline stall. The upper limits of the number of pipeline stages are three and nine for the 256-bit and 64-bit versions, respectively. The maximum numbers of stages are used for the performance comparison in the next section.
The datapath and operation of the 128-bit pipeline architecture are shown in Figs. 8 and 9, respectively. The functions γ and θ are partitioned into four sub-blocks as stages 0 to 3, and the XOR gates for key addition σ[k] followed by selectors correspond to the final stage (stage 4). The partitioned sub-blocks perform message randomization and key scheduling simultaneously, as shown in Fig. 9. The data manager is only used for the message randomization, and the key scheduler uses the 512-bit register with a 4:1 selector. This is because the dependency between key bytes cannot be satisfied, even when using the function π^{-1} as in the interleave architecture. This 128-bit architecture performs two ρ[k] operations in eight cycles and thus requires 80 cycles for the function W, which is the same as the 128-bit interleave architecture.
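The cycle count quoted for the 128-bit datapath follows from simple arithmetic, which the sketch below spells out. Only the 128-bit figure (80 cycles) is stated in the text; applying the same formula to other widths is an extrapolation, and the model ignores pipeline fill and I/O cycles. The function names and the 100 MHz clock in the usage note are hypothetical.

```python
ROUNDS = 10        # rho[k] applications per path
BLOCK_BITS = 512   # message block and state size


def cycles_per_block(width_bits: int) -> int:
    """Cycles for one W, assuming two rho[k] operations (data and key) per round."""
    passes = BLOCK_BITS // width_bits  # cycles to move 512 bits through the datapath
    return ROUNDS * 2 * passes


def throughput_bps(width_bits: int, freq_hz: float) -> float:
    """Hash throughput in bits per second at a given clock frequency."""
    return BLOCK_BITS * freq_hz / cycles_per_block(width_bits)
```

For example, `cycles_per_block(128)` gives 80, and at a hypothetical 100 MHz clock `throughput_bps(128, 100e6)` gives 640 Mbps.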
Performance Evaluation
The proposed Whirlpool architectures were designed in Verilog-HDL and were synthesized by Synopsys Design Compiler (version Y-2006.06-SP2) with the STMicroelectronics 90-nm CMOS standard cell library (1.2-volt version) [11], where two optimization options, size and speed, were specified. Hardware sizes were estimated based on a two-way NAND equivalent gate, and the speeds were evaluated under worst-case conditions. The efficiency is defined as the throughput per gate, and thus higher efficiency indicates better implementation. For performance comparison, the Whirlpool circuits with the 512-bit datapath architecture from [5] and the SHA-256/-512 circuits proposed in [10] were also designed and evaluated using the same library.
The synthesis results are shown in Table 1, where the two types of S-box described in Section 2 are referred to as the GF(2^8) and GF(2^4) S-boxes. The GF(2^4) S-box was used for the pipeline architectures so that the pipeline register can be placed in the middle of the S-box. The results are also displayed in Fig. 10, where the horizontal and vertical axes are gate count and throughput, respectively. Note that the largest implementation with 179.0 Kgates is outside of this graph. In the figure, the circuit is smaller when the corresponding dot is located in the left region, and is faster when the dot is in the upper region. As a result, implementations plotted at the upper left region of the graph have better efficiency. The interleave architecture with a wider datapath, including the conventional datapath, always obtained higher efficiency. This is because throughput is halved if the datapath width is halved, whereas the hardware size cannot be halved due to the constant size of the data registers. In contrast, the 256-bit datapath achieved higher efficiency than the 512-bit version for the pipeline architecture. In this case, the proposed deep pipeline scheme allows a much higher operating frequency, and consequently the high throughput with the small circuit resulted in a higher efficiency. However, the operating frequencies of the 128-bit (five-stage) and 64-bit (nine-stage) pipeline architectures were not improved compared with that of the 256-bit architecture, even though the number of pipeline stages was increased. The major reason for this is the increased number of additional selectors in the critical path from the key register to the data manager through key addition. Therefore, the hardware resources and the signal delay time caused by additional selectors must be considered when optimizing the datapath for deep pipeline operation. The 64-bit interleave architecture in conjunction with the GF(2^4) S-box and the area optimization option achieved the smallest size of 13.6 Kgates with a throughput of 817 Mbps.
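The efficiency metric defined above can be reproduced directly from a throughput and a gate count. The sketch below uses figures quoted in this paper (817 Mbps at 13.6 Kgates, and 13.2 Gbps at 35.5 Kgates); the function name is hypothetical, and pairing the 13.2 Gbps figure with the reported peak efficiency of 372.3 Kbps/gate is an assumption.

```python
def efficiency_kbps_per_gate(throughput_bps: float, gates: float) -> float:
    """Throughput per gate, expressed in Kbps/gate."""
    return throughput_bps / gates / 1e3


# Smallest reported circuit: 817 Mbps at 13.6 Kgates.
smallest = efficiency_kbps_per_gate(817e6, 13.6e3)

# Fastest reported circuit: 13.2 Gbps at 35.5 Kgates.
fastest = efficiency_kbps_per_gate(13.2e9, 35.5e3)

print(f"smallest circuit: {smallest:.1f} Kbps/gate")
print(f"fastest circuit:  {fastest:.1f} Kbps/gate")
```

The second value comes out slightly below 372.3 Kbps/gate, which is consistent with the reported peak efficiency if the throughput and gate count were rounded for presentation.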
The circuits using the GF(2^4) S-box are smaller and have higher efficiency than those using the GF(2^8) S-box. The GF(2^8) S-box leads to a large circuit but is still suitable for high-speed implementation. Generally, the interleave architecture is smaller than the pipeline architecture, which requires a number of pipeline registers. The smallest circuit based on the conventional scheme is 29.6 Kgates for the 512-bit pipeline architecture with the GF(2^4) S-box. Therefore, the gate count of 13.6 Kgates obtained by the proposed interleave architecture is 54% smaller than that of the conventional scheme. The 256-bit pipeline version optimized for speed achieved the highest efficiency of 372.3 Kbps/gate.
In comparison with the SHA-256 and -512 circuits, the smallest Whirlpool circuit is larger than the area-optimized SHA-256 circuit with 9.8 Kgates but is smaller than the SHA-512 circuit with 17.1 Kgates. The highest throughput of the proposed architectures (13.2 Gbps) is 2.8 times higher than that (4.7 Gbps) of the speed-optimized SHA-512, and the highest efficiency of 372.3 Kbps/gate is 1.4 times higher than the 263.8 Kbps/gate of the area-optimized SHA-256. The proposed Whirlpool architectures also achieved a wide variety of size and speed performances, while the SHA-256 and -512 circuits have only four implementations. Consequently, the proposed Whirlpool hardware has great advantages in both performance and flexibility.
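The comparison ratios quoted above can be checked with a few lines of arithmetic. The sketch uses only numbers stated in the text; the helper names are hypothetical.

```python
def ratio(a: float, b: float) -> float:
    """How many times larger a is than b."""
    return a / b


def percent_smaller(new: float, old: float) -> float:
    """Size reduction of new relative to old, in percent."""
    return 100.0 * (1.0 - new / old)


size_reduction = percent_smaller(13.6, 29.6)   # proposed vs. conventional, Kgates
throughput_gain = ratio(13.2, 4.7)             # vs. speed-optimized SHA-512, Gbps
efficiency_gain = ratio(372.3, 263.8)          # vs. area-optimized SHA-256, Kbps/gate

print(f"{size_reduction:.0f}% smaller, "
      f"{throughput_gain:.1f}x throughput, {efficiency_gain:.1f}x efficiency")
```

The same size figures also give 13.6 / 29.6 ≈ 46%, which matches the resource ratio stated in the abstract.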
Conclusion
In the present paper, compact hardware architectures with interleave and pipeline schemes were proposed for the 512-bit hash function Whirlpool, and their performances were evaluated using a 90-nm CMOS standard cell library. The highest throughput of 13.2 Gbps @ 35.5 Kgates, the smallest circuit area of 13.6 Kgates @ 817 Mbps, and the highest efficiency of 372.3 Kbps/gate were obtained using the proposed architectures. These results indicate that the proposed architectures can provide higher performance with respect to both size and efficiency, as compared to the conventional 512-bit architectures. In addition to the peak performance in size and speed, the flexibility of the proposed architectures enables various design options to meet a variety of application requirements.
Further research to reduce the overhead of additional selectors in the pipeline architectures is currently being conducted. Such a reduction is expected to further improve both size and speed.
