Abstract-We propose a compact hardware architecture for the 64-bit block cipher CAST-128, which is one of the ISO/IEC 18033-3 standard algorithms. Part of the complexity of CAST-128 comes from its use of several S-boxes in varying sequences and from the three types of f-function that are switched depending on the round number. A straightforward implementation therefore requires a large amount of hardware resources. To create compact CAST-128 hardware, we minimized the number of S-box components and merged the three f-functions into one arithmetic component. The CAST-128 hardware based on the proposed architecture was synthesized using 0.13-µm and 0.18-µm CMOS standard cell libraries, and small, practical circuits of 26.4~39.5 Kgates with throughputs of 189.9~614.7 Mbps were obtained.
I. INTRODUCTION
CAST-128 [1] is a 64-bit block cipher developed by Carlisle Adams, and its specification was published as RFC 2144 [2]. CAST-128 was approved by the CSE (Communications Security Establishment) for use by the Government of Canada [3], and was also adopted as one of the ISO/IEC standard block ciphers [4]. The popular e-mail encryption tool PGP (Pretty Good Privacy) [5] [6] uses CAST-128 as the default algorithm.
CAST-128 has eight different types of 8-bit input and 32-bit output S-boxes defined as lookup tables, and they are used a number of times in the key scheduling and data randomization processes. It also has three different types of 32-bit f-functions. The algorithm can be efficiently implemented on 32-bit processors, but the large S-boxes and the three f-functions are problematic in the development of compact hardware. Consequently, as far as the authors know, only one hardware implementation exists, on an FPGA platform with sufficiently large RAM [7], and no evaluation on an ASIC platform has been done.
In this paper, we propose a small CAST-128 hardware architecture, where a minimum set of S-boxes is used and the three f-functions are merged by using a unified arithmetic unit. The ASIC performance of our CAST-128 circuit is evaluated in comparison with the standard block ciphers AES and DES in 0.13-µm and 0.18-µm CMOS standard cell libraries.
II. CAST-128 ALGORITHM
The 64-bit block cipher CAST-128 has a Feistel-type data randomization block as shown in Fig. 1, and three f-functions of Types 1~3 are switched in accordance with Table I. The 32-bit additions and subtractions are performed modulo $2^{32}$, and the four S-boxes S1~S4 are 8-bit input and 32-bit output lookup tables defined by using a bent function.
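The Feistel rounds and the three f-function types can be sketched in Python as follows. The formulas follow RFC 2144 [2], and the round-to-type mapping assumes the cyclic order 1, 2, 3 given there. The S-box tables are hypothetical pseudo-random placeholders produced by `make_placeholder_sbox` (the real S1~S4 are listed in the RFC), so the sketch illustrates the data flow only and does not reproduce CAST-128 test vectors.

```python
MASK32 = 0xFFFFFFFF

def rotl32(x, r):
    """Rotate a 32-bit word left by r bits (only the low 5 bits of r are used)."""
    r &= 31
    return ((x << r) | (x >> (32 - r))) & MASK32

def make_placeholder_sbox(seed):
    """Hypothetical stand-in for an 8-bit-in, 32-bit-out S-box (NOT the real table)."""
    state, table = seed, []
    for _ in range(256):
        state = (state * 1664525 + 1013904223) & MASK32  # simple LCG
        table.append(state)
    return table

S1, S2, S3, S4 = (make_placeholder_sbox(s) for s in (1, 2, 3, 4))

def f_type(round_no):
    """f-function type for rounds 1..16, assumed to cycle 1, 2, 3, 1, ..."""
    return (round_no - 1) % 3 + 1

def f_function(ftype, d, km, kr):
    """Type 1~3 f-function: mask with km, rotate by kr, then combine S-box outputs."""
    if ftype == 1:
        i = rotl32((km + d) & MASK32, kr)
    elif ftype == 2:
        i = rotl32(km ^ d, kr)
    else:
        i = rotl32((km - d) & MASK32, kr)
    a, b, c, e = (i >> 24) & 0xFF, (i >> 16) & 0xFF, (i >> 8) & 0xFF, i & 0xFF
    if ftype == 1:
        return (((S1[a] ^ S2[b]) - S3[c]) + S4[e]) & MASK32
    if ftype == 2:
        return (((S1[a] - S2[b]) + S3[c]) ^ S4[e]) & MASK32
    return (((S1[a] + S2[b]) ^ S3[c]) - S4[e]) & MASK32

def crypt(left, right, km, kr, rounds=16, decrypt=False):
    """Run the Feistel network; decryption only reverses the round order."""
    order = range(rounds, 0, -1) if decrypt else range(1, rounds + 1)
    for r in order:
        left, right = right, left ^ f_function(f_type(r), right, km[r - 1], kr[r - 1])
    return right, left  # final swap of the two halves
```

Because decryption differs only in the order of the round keys, the same datapath serves both directions, which is the property a shared hardware datapath relies on.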
The key length is variable from 40 to 128 bits in 8-bit increments. The number of rounds is 12 for 40~80-bit keys and 16 for 88~128-bit keys. When an input key is shorter than 128 bits, zero bytes are padded on the right end. Then the 32-bit round keys K1~K4 are generated according to Equations (1)~(8),
where 
The key scheduling procedure uses four 8-bit input and 32-bit output S-boxes S5~S8 that are different from the S-boxes used in the data randomization. Similar procedures are repeated seven times (with the order of the input bytes and S-boxes changing) to generate the rest of the round keys K5~K32. The round keys K1~K16 are used as the 32-bit mask keys Km1~Km16 in Fig. 1.
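The key padding, the choice of round count, and the split of the 32 round keys into mask and rotate keys can be sketched as follows. The function names are illustrative only. The rule that the low 5 bits of K17~K32 serve as the rotate keys follows RFC 2144 [2]; the S-box-based generation of the round keys themselves is omitted here.

```python
def pad_key(key: bytes) -> bytes:
    """Right-pad a 40~128-bit key (5~16 bytes) with zero bytes to 128 bits."""
    if not 5 <= len(key) <= 16:
        raise ValueError("CAST-128 keys are 40 to 128 bits in 8-bit increments")
    return key + bytes(16 - len(key))

def num_rounds(key: bytes) -> int:
    """12 rounds for keys of up to 80 bits, 16 rounds for longer keys."""
    return 12 if len(key) <= 10 else 16

def split_round_keys(k):
    """Split K1..K32 into mask keys Km1..Km16 and 5-bit rotate keys Kr1..Kr16."""
    if len(k) != 32:
        raise ValueError("expected 32 round keys")
    km = list(k[:16])
    kr = [w & 0x1F for w in k[16:]]  # only the lower 5 bits are kept
    return km, kr
```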
III. PROPOSED HARDWARE ARCHITECTURE

A. Data Randomization Block
Fig. 2 shows the data randomization block of our compact CAST-128 hardware. This datapath can be used for both encryption and decryption because of the Feistel-network feature of CAST-128. The four S-boxes S1~S4 and the barrel shifter in Fig. 1 can be shared between the three f-functions of Types 1~3. A 32-bit adder can easily support subtraction with minor additional circuitry, and XOR gates are already included in the adder. Therefore, we designed a unified arithmetic unit ASX (Add-Sub-XOR) shown in Fig. 3, which switches between the three arithmetic operations, and merged the three f-functions into one functional block.
A carry look-ahead scheme is used for addition and subtraction, considering the balance between speed and gate count. The signals Sel0 and Sel1 control the operations in ASX. When Sel0=Sel1=0 in Fig. 3, all of the carry signals C and C1~C31 fed to the ASX units are disabled, and the 32-bit XOR of A0~A31 and B0~B31 is output to S0~S31. To perform subtraction, Sel0 is set to 1 so that the carry signals are enabled. Sel1 is also set to 1 so that all bits of A0~A31 are inverted and 1 is added at the LSB ASX cell through the carry signal C; the two's complement of the operand A0~A31 is thus added to B0~B31. The carry generation unit CG0, which does not generate the MSB carry, is used for the mod $2^{32}$ operation. Addition of A0~A31 and B0~B31 is performed by setting Sel0=1 and Sel1=0. When Sel0=0 and Sel1=1, the XNOR result is output to S0~S31, though this operation is not used in the CAST-128 hardware.
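The behavior of the ASX unit under the four Sel0/Sel1 settings can be modeled bit by bit as in the sketch below. It is a functional model under stated assumptions, not the circuit of Fig. 3: a ripple carry is used for readability where the hardware uses carry look-ahead, and the function name `asx` and its argument order are illustrative.

```python
MASK32 = 0xFFFFFFFF

def asx(a, b, sel0, sel1):
    """Functional model of a 32-bit Add-Sub-XOR unit.

    sel0 gates the carry chain, sel1 inverts operand A and supplies the
    carry-in, so that:
      sel0=0, sel1=0 -> A XOR B
      sel0=1, sel1=0 -> (A + B) mod 2^32
      sel0=1, sel1=1 -> (B - A) mod 2^32
      sel0=0, sel1=1 -> A XNOR B (unused by CAST-128)
    """
    carry = sel1                        # carry-in C: the "+1" of two's complement
    result = 0
    for i in range(32):
        ai = ((a >> i) & 1) ^ sel1      # Sel1 inverts every bit of A
        bi = (b >> i) & 1
        ci = carry & sel0               # Sel0=0 disables all carry signals
        result |= (ai ^ bi ^ ci) << i
        carry = (ai & bi) | (ai & ci) | (bi & ci)
    return result                       # the carry out of bit 31 is dropped (mod 2^32)
```

With carries disabled, each cell reduces to the XOR gate that a full adder already contains, which is why the three operations can share one unit.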
B. Key Scheduler
The same S-box is used twice in each of Equations (1)~(8) for the key scheduling. Therefore, two sets of the S-boxes S5~S8 are required if each equation is executed in one clock. To minimize the number of S-boxes, we transformed Equations (1)~(4) and (5)~(8) into Equations (9)~(12) and (13)~(16), respectively.
The round keys of CAST-128 cannot be generated in reverse order for decryption, and thus all the keys are generated and stored in key registers in advance. Therefore, even though the number of clocks for key generation is doubled, this has no effect on the number of clocks required for data randomization. Fig. 4 shows the datapath architecture of our key scheduler. There are two output paths from the two 128-bit registers. One path goes to the four S-boxes after four bytes are selected by the "Switching Box," and the other path is XORed with the S-box outputs after one 32-bit data block is selected by a 10:1 multiplexer. The results are then fed back to the registers as shown in Equations (9)~(12). To generate the four 32-bit round keys K1~K4 (identical to the four mask keys Km1~Km4) by XORing the five S-box outputs according to Equations (13)~(16), the four 32-bit registers Km1~Km4 in Fig. 4 and the feedback path of the 10:1 multiplexer are used. The round keys K17~K32 are generated similarly, but only the lower 5 bits of each key are used as the rotate key. Therefore, the registers Kr1~Kr16 are 5 bits each. During the rotate key generation process, the outputs of the 5-bit registers Kr1~Kr16 are fed back to the XOR-tree through the 10:1 multiplexer. At that time, the upper 27 bits of the multiplexer output are not used.
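The idea of spreading one key-scheduling equation over two clocks so that a single set of S5~S8 suffices can be sketched as follows. The equation used is the first key-schedule line of RFC 2144 [2], z0z1z2z3 = x0x1x2x3 ^ S5[xD] ^ S6[xF] ^ S7[xC] ^ S8[xE] ^ S7[x8]. The particular split shown (four lookups in the first clock, the remaining lookup in the second) is an assumption made for illustration and is not necessarily the exact form of the transformed equations. The S-box contents are hypothetical placeholders, not the real tables.

```python
from collections import Counter

MASK32 = 0xFFFFFFFF

def make_placeholder_sbox(seed):
    """Hypothetical stand-in for an 8-bit-in, 32-bit-out S-box (NOT the real table)."""
    state, table = seed, []
    for _ in range(256):
        state = (state * 1664525 + 1013904223) & MASK32
        table.append(state)
    return table

SBOX = {n: make_placeholder_sbox(n) for n in (5, 6, 7, 8)}

def lookup(n, byte, usage):
    """Read S-box n and record that it was used in the current clock."""
    usage[n] += 1
    return SBOX[n][byte]

def word(x, i):
    """Pack bytes x[i..i+3] into one big-endian 32-bit word."""
    return (x[i] << 24) | (x[i + 1] << 16) | (x[i + 2] << 8) | x[i + 3]

def z0123_one_clock(x):
    """All five lookups in a single clock: S7 is read twice."""
    clk = Counter()
    z = (word(x, 0) ^ lookup(5, x[0xD], clk) ^ lookup(6, x[0xF], clk)
         ^ lookup(7, x[0xC], clk) ^ lookup(8, x[0xE], clk)
         ^ lookup(7, x[0x8], clk))
    return z, [clk]

def z0123_two_clocks(x):
    """Clock 1 reads each S-box once; clock 2 folds in the fifth lookup."""
    clk1 = Counter()
    partial = (word(x, 0) ^ lookup(5, x[0xD], clk1) ^ lookup(6, x[0xF], clk1)
               ^ lookup(7, x[0xC], clk1) ^ lookup(8, x[0xE], clk1))
    clk2 = Counter()
    z = partial ^ lookup(7, x[0x8], clk2)  # partial result fed back from a register
    return z, [clk1, clk2]

def sbox_sets_needed(per_clock_usage):
    """Copies of the S-box set required: the most reads of one S-box in any clock."""
    return max(max(u.values()) for u in per_clock_usage)
```

Both schedules produce the same word; the two-clock form trades one extra clock per equation for halving the number of S-box copies.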
IV. PERFORMANCE EVALUATION IN ASIC
The proposed compact CAST-128 hardware architecture described above was designed and synthesized under two optimization targets, size and speed, by using 0.13-µm and 0.18-µm CMOS standard cell libraries under worst-case conditions. Tables II and III show the synthesis results. AES [8] and DES [9] circuits were also synthesized under the same conditions for performance comparisons.
Our CAST-128 hardware achieves 26.9 Kgates for size optimization and 39.5 Kgates for speed optimization using the 0.13-µm library. The gate counts are 26.4K and 32.8K with the 0.18-µm library. These numbers are comparable with the 26.7 Kgates and 43.6 Kgates of AES hardware with lookup table S-boxes. When the CAST-128 S-boxes are implemented using memory, the eight different 8-bit input ($2^8 = 256$ addresses) and 32-bit output (4 Bytes) lookup tables require 256 × 4 Bytes × 8 = 8 Kbytes. In contrast, AES uses sixteen 256-Byte S-boxes for data randomization and four more for key scheduling, and thus the total capacity is 256 Bytes × 20 = 5,120 Bytes (5 Kbytes). AES can generate round keys on-the-fly, but CAST-128 needs many clocks to generate round keys for encryption, and cannot generate the keys on-the-fly for decryption. Therefore, all the CAST-128 round keys must be pre-computed and stored in large registers (registers Kr1~Kr16 and Km1~Km16 in Fig. 4). Considering these facts, it was unexpected that our compact CAST-128 hardware architecture would be comparable in size to AES.
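The table sizes quoted in this section can be checked with a few lines of arithmetic. The helper name `lut_bytes` is illustrative; the DES figure is the one quoted later in this section (eight 6-bit input, 4-bit output S-boxes).

```python
def lut_bytes(input_bits, output_bits, count=1):
    """Total size in bytes of `count` lookup tables with the given I/O widths."""
    return (2 ** input_bits) * output_bits * count // 8

cast128_bytes = lut_bytes(8, 32, count=8)   # eight 8-in, 32-out S-boxes
aes_bytes = lut_bytes(8, 8, count=20)       # sixteen data + four key-schedule S-boxes
des_bytes = lut_bytes(6, 4, count=8)        # eight 6-in, 4-out S-boxes

print(cast128_bytes, aes_bytes, des_bytes)  # prints: 8192 5120 256
```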
However, the highest throughput is 614.7 Mbps by using the 0. [...] where no carry propagation occurs. Additional circuitry in the critical path is also required to switch the f-functions. By utilizing the advantage of binary field arithmetic, AES can also achieve a very compact S-box on the composite field $GF(((2^2)^2)^2)$ [10], as shown in Tables II and III. The gate counts of DES are much lower than those of CAST-128, but this is because DES has only 256 Bytes of S-box tables in total (eight 6-bit input and 4-bit output S-boxes), while CAST-128 needs 8 Kbytes. The throughputs of CAST-128 are only 1/4~1/5 of those of DES, but triple-DES, which repeats DES three times, is recommended because of security issues. In comparison with triple-DES, the throughput of CAST-128 ranges from about the same level to about half, which is good enough for practical use. As far as the authors know, only one FPGA implementation of CAST-128 hardware has been reported [7], with a throughput of 220 Mbps for a loop architecture version. Even though the platforms are different (FPGA and ASIC), our design achieved a throughput of 614.7 Mbps, nearly three times higher.
CAST-128, which uses large S-boxes, has usually been implemented in software, but our results show that small and fast ASIC hardware implementations can be achieved by using our proposed architecture.
V. CONCLUSION
In this paper, we proposed a compact hardware architecture for the ISO/IEC 18033-3 standard 64-bit block cipher CAST-128. Its performance was evaluated using 0.13-µm and 0.18-µm CMOS standard cell libraries, and gate counts of 26.4~39.5 Kgates with throughputs of 189.9~614.7 Mbps were obtained. These gate counts are almost the same as those of AES with lookup table S-boxes. The throughputs are lower than those of AES and DES, but good enough for practical use.
We are developing ASIC hardware for all of the other ISO/IEC standard ciphers such as Camellia, SEED, and MISTY1, and will report performance comparisons in the near future. 
