Abstract. The 128-bit blockcipher CLEFIA is known to be highly efficient in hardware implementations. This paper proposes very compact hardware implementations of CLEFIA-128. Our implementations are based on novel serialized architectures in the data processing block. Three types of hardware architectures are implemented and synthesized using a 0.13 µm standard cell library. The smallest implementation requires only 2,488 GE, which is, as far as we know, about half the area of the previous smallest implementation. Furthermore, only 116 additional GE are needed to support decryption.
Introduction
CLEFIA [9, 11] is a 128-bit blockcipher supporting key lengths of 128, 192 and 256 bits, which is compatible with AES [2]. CLEFIA achieves sufficient immunity against known attacks and flexibility for efficient implementation in both hardware and software. It is reported that CLEFIA is highly efficient particularly in hardware implementations [12, 10, 13].
Compact hardware implementations are very significant for small embedded devices such as RFID tags and wireless sensor nodes because of their limited hardware resources. As for AES with 128-bit keys, low-area hardware implementations have been reported in [3] and [4]. The former uses a RAM-based architecture supporting both encryption and decryption with area requirements of 3,400 GE, while the latter uses a shift-register-based architecture supporting encryption only with area requirements of 3,100 GE. Both implementations use an 8-bit serialized data path and implement only a fraction of the MixColumns operation with three additional 8-bit registers, so that it takes several clock cycles to calculate one column. Very recently, another low-area hardware implementation of AES was proposed in [5], requiring 2,400 GE for encryption only. Unlike the previous two implementations, it does not serialize MixColumns: one column of MixColumns is processed in 1 clock cycle. Thus it requires 4 times more XOR gates for MixColumns, but it requires no additional register and reduces the gate requirements for control logic.
In this paper, we present very compact hardware architectures of CLEFIA with 128-bit keys based on 8-bit shift registers. We show that the data processing part of CLEFIA-128 can be implemented in a serialized way without any additional registers. Three types of hardware architectures, which differ in the number of cycles required to process one block, are proposed by adaptively applying the clock gating technique. Those architectures are implemented and synthesized using a 0.13 µm standard cell library. Our smallest implementation requires only 2,488 GE, which is, to the best of our knowledge, about half the area of the previous smallest implementation, 4,950 GE [10, 12], and competitive with the smallest AES implementation. Furthermore, only 116 additional GE are required to support decryption by switching the processing order of F-functions at even-numbered rounds.
The rest of the paper is organized as follows. Sect. 2 gives a brief description of CLEFIA and its previously proposed hardware implementations. In Sect. 3, we propose three types of hardware architectures. Sect. 4 describes additional hardware resources to support decryption. Sect. 5 gives evaluation results for our implementations, compared with the previous results of CLEFIA and AES. Finally, we conclude in Sect. 6.
2 128-bit Blockcipher CLEFIA
Algorithm
CLEFIA [9, 11] is a 128-bit blockcipher with key lengths of 128, 192, and 256 bits. For brevity, we consider CLEFIA with 128-bit keys, denoted as CLEFIA-128, though similar techniques are applicable to CLEFIA with 192-bit and 256-bit keys. CLEFIA-128 is divided into two parts: the data processing part and the key scheduling part.
The data processing part employs a 4-branch Type-2 generalized Feistel network [14] with two parallel F-functions F 0 and F 1 per round. The number of rounds r for CLEFIA-128 is 18. The encryption function ENC r takes a 128-bit plaintext P = P 0 |P 1 |P 2 |P 3 , 32-bit whitening keys WK i (0 ≤ i < 4), and 32-bit round keys RK j (0 ≤ j < 2r) as inputs, and outputs a 128-bit ciphertext C = C 0 |C 1 |C 2 |C 3 as shown in Fig. 1.
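The structure of the encryption function can be sketched in software. The following is a minimal illustration rather than the cipher itself: `toy_f0` and `toy_f1` are hypothetical stand-ins for F 0 and F 1 (the real F-functions use S-boxes and diffusion matrices), the keys are arbitrary test values, and the placement of the whitening keys is assumed from the CLEFIA specification [9, 11].

```python
MASK32 = 0xFFFFFFFF

def toy_f0(x, k):
    # Hypothetical stand-in for F0; any 32-bit function keeps the network invertible.
    return ((x ^ k) * 0x9E3779B1 + 1) & MASK32

def toy_f1(x, k):
    # Hypothetical stand-in for F1.
    y = x ^ k
    return (((y << 7) | (y >> 25)) ^ 0xA5A5A5A5) & MASK32

def gfn4(t, rk, r):
    """r rounds of a 4-branch Type-2 generalized Feistel network."""
    t = list(t)
    for i in range(r):
        t[1] ^= toy_f0(t[0], rk[2 * i])
        t[3] ^= toy_f1(t[2], rk[2 * i + 1])
        t = [t[1], t[2], t[3], t[0]]          # word rotation
    return [t[3], t[0], t[1], t[2]]           # the final round has no rotation

def gfn4_inv(t, rk, r):
    """Inverse network: round keys in reverse order, rotation in the other direction."""
    t = list(t)
    for i in range(r):
        t[1] ^= toy_f0(t[0], rk[2 * (r - i) - 2])
        t[3] ^= toy_f1(t[2], rk[2 * (r - i) - 1])
        t = [t[3], t[0], t[1], t[2]]
    return [t[1], t[2], t[3], t[0]]

def encrypt(p, wk, rk, r=18):
    t = [p[0], p[1] ^ wk[0], p[2], p[3] ^ wk[1]]      # initial key whitening
    y = gfn4(t, rk, r)
    return [y[0], y[1] ^ wk[2], y[2], y[3] ^ wk[3]]   # final key whitening

def decrypt(c, wk, rk, r=18):
    t = [c[0], c[1] ^ wk[2], c[2], c[3] ^ wk[3]]
    y = gfn4_inv(t, rk, r)
    return [y[0], y[1] ^ wk[0], y[2], y[3] ^ wk[1]]
```

With r = 18 the network consumes 2r = 36 round keys, and decryption recovers the plaintext for any choice of the stand-in functions, because each round only XORs an F-function output into a neighbouring branch.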
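The two F-functions F 0 and F 1 consist of round key addition, 4 non-linear 8-bit S-boxes, and a diffusion matrix. The construction of F 0 and F 1 is shown in Fig. 2. Two kinds of S-boxes S 0 and S 1 are employed, and the order of these S-boxes is different in F 0 and F 1 . The diffusion matrices of F 0 and F 1 are also different; the matrices M 0 for F 0 and M 1 for F 1 are defined as
\[
M_0 = \begin{pmatrix}
\texttt{0x01} & \texttt{0x02} & \texttt{0x04} & \texttt{0x06} \\
\texttt{0x02} & \texttt{0x01} & \texttt{0x06} & \texttt{0x04} \\
\texttt{0x04} & \texttt{0x06} & \texttt{0x01} & \texttt{0x02} \\
\texttt{0x06} & \texttt{0x04} & \texttt{0x02} & \texttt{0x01}
\end{pmatrix},
\qquad
M_1 = \begin{pmatrix}
\texttt{0x01} & \texttt{0x08} & \texttt{0x02} & \texttt{0x0A} \\
\texttt{0x08} & \texttt{0x01} & \texttt{0x0A} & \texttt{0x02} \\
\texttt{0x02} & \texttt{0x0A} & \texttt{0x01} & \texttt{0x08} \\
\texttt{0x0A} & \texttt{0x02} & \texttt{0x08} & \texttt{0x01}
\end{pmatrix}.
\]
The key scheduling part of CLEFIA-128 takes a secret key K as an input, and outputs 32-bit whitening keys WK i (0 ≤ i < 4) and 32-bit round keys RK j (0 ≤ j < 2r). It is divided into the following two steps: generating a 128-bit intermediate key L (step 1) and generating WK i and RK j from K and L (step 2). In step 1, the intermediate key L is generated by 12 rounds of the encryption function, which takes K as a plaintext and constant values CON i (0 ≤ i < 24) as round keys. In step 2, the intermediate key L is updated by the DoubleSwap function Σ, which is illustrated in Fig. 3. Round keys RK j (0 ≤ j < 36) are generated by mixing K, L, and constant values CON i (24 ≤ i < 60). Whitening keys WK i are equivalent to the 32-bit chunks K i of K.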
Previous Hardware Implementations
Hardware implementations of CLEFIA-128 have been studied in [12, 10, 13]. In [12], optimization techniques in the data processing part, including S-boxes and diffusion matrices, were proposed. The compact architecture, where F 0 is processed in one cycle and F 1 is processed in another cycle, was implemented, and its area requirements under area optimization are reported to be 4,950 GE.
In [10], two optimization techniques in the key scheduling part were introduced. The first technique is related to the implementation of the DoubleSwap function Σ. Σ is decomposed into the following Swap function Ω and SubSwap function Ψ as Σ = Ψ • Ω.
Here, X[a-b] denotes the bit string cut from the a-th bit to the b-th bit of X. Please note that Ω and Ψ are both involutive. The 128-bit key register for the intermediate key L is updated by applying Ω and Ψ alternately. Round keys are always generated from the most significant 64 bits of the key register. After the final round of encryption, L is restored into the key register by applying the following FinalSwap function Φ.
Please note that Φ is also involutive. In the case of decryption, round keys are always generated from the most significant 64 bits of the key register by applying the inverse functions of Ω, Ψ and Φ in the reverse order of encryption. Due to their involutive property, only the three functions Ω, Ψ and Φ are required for encryption and decryption.
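The decomposition Σ = Ψ • Ω and the involutive property of Ω and Ψ can be checked with a short script. The bit-level definitions below are assumptions, not taken from this text: Σ follows the DoubleSwap function of the CLEFIA specification [9, 11], Ω is assumed to exchange the two 64-bit halves of its input, and Ψ is derived from these two. Bit 0 is the most significant bit. The FinalSwap function Φ is not covered.

```python
import random

def cut(x, a, b):
    """Return (value, width) of X[a-b]; bit 0 is the most significant of 128 bits."""
    width = b - a + 1
    return (x >> (127 - b)) & ((1 << width) - 1), width

def concat(*parts):
    """Concatenate (value, width) pairs, leftmost part most significant."""
    out = 0
    for value, width in parts:
        out = (out << width) | value
    return out

def sigma(x):
    # DoubleSwap, as assumed from the CLEFIA specification.
    return concat(cut(x, 7, 63), cut(x, 121, 127), cut(x, 0, 6), cut(x, 64, 120))

def omega(x):
    # Swap: exchange the two 64-bit halves (assumed definition).
    return concat(cut(x, 64, 127), cut(x, 0, 63))

def psi(x):
    # SubSwap: derived here so that psi(omega(x)) == sigma(x).
    return concat(cut(x, 71, 127), cut(x, 57, 70), cut(x, 0, 56))

rng = random.Random(0)
samples = [rng.getrandbits(128) for _ in range(100)]
decomposition_holds = all(psi(omega(x)) == sigma(x) for x in samples)
both_involutive = all(omega(omega(x)) == x and psi(psi(x)) == x for x in samples)
```

Under these assumptions Ψ exchanges the leading and trailing 57 bits and keeps the middle 14 bits in place, which is why applying it twice restores the input.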
In the second technique, XOR operations with the parts of round keys related to a secret key K are moved by an equivalent transformation into the two data lines where key whitening operations are processed. Therefore, these XOR operations and key whitening operations can be shared.
In [13], five types of hardware architectures were designed and fairly compared to the ISO 18033-3 standard blockciphers under the same conditions. In their results, the highest efficiency of 400.96 Kbps/gates was achieved, which is at least 2.2 times higher than that of the ISO 18033-3 standard blockciphers.
Proposed Architectures
In this section we propose three types of hardware architectures. Firstly, we propose a compact matrix multiplier for CLEFIA-128. Next, in Type-I architecture, we propose a novel serialized architecture of the data processing block of CLEFIA-128. By adaptively applying clock gating logic to Type-I architecture, we can reduce the number of multiplexers (MUXes) in Type-II and Type-III architectures at the cost of increased cycle counts.
Clock gating is a power-saving technique used in synchronous circuits. For hardware implementations of blockciphers, it was first introduced in [8] as a technique to reduce gate counts and power, and has been applied to the KATAN family [1] and AES [5]. Clock gating works by taking the enable conditions attached to registers: it removes the feedback MUXes that hold the present state and replaces them with clock gating logic. When several register bits share the same enable condition, gate counts are saved by applying clock gating.
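The functional equivalence behind this replacement can be illustrated with a small behavioural model. This is only a sketch: the two functions are hypothetical cycle-level models of one register, not a description of the actual circuit or of any synthesis flow.

```python
def run_with_feedback_mux(d_inputs, enables, q0=0):
    """Register clocked every cycle; a 2-to-1 MUX feeds back Q when enable is 0."""
    q, trace = q0, []
    for d, en in zip(d_inputs, enables):
        q = d if en else q          # the MUX output is latched on every clock edge
        trace.append(q)
    return trace

def run_with_clock_gating(d_inputs, enables, q0=0):
    """Register without the MUX; the clock edge is suppressed when enable is 0."""
    q, trace = q0, []
    for d, en in zip(d_inputs, enables):
        gated_clock_edge = bool(en)
        if gated_clock_edge:
            q = d                   # the register only sees a clock edge when enabled
        trace.append(q)
    return trace
```

Both models produce the same register contents for any input and enable sequence; the gated version simply needs no per-bit hold MUX, and one gating cell can serve all bits that share the enable condition.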
Matrix Multiplier
Among low-area AES implementations, MixColumns matrix operations are computed row by row in [3], while they are computed column by column in [4]. In our architecture, matrix operations are computed column by column in the following way.
The 4-byte output of the M 0 operation is XORed with the next 4-byte data as shown in Fig. 4 (a). The matrix multiplier in Fig. 4 (b) performs the matrix multiplication together with the above XOR operation in 4 clock cycles. Fig. 4 (c) presents the contents of the registers R i at the l-th cycle (1 ≤ l ≤ 4). At the 1st cycle, the output a 0 of S 0 is fed to the multiplier and multiplied by {01}, {02}, {04}, and {06}. The products are XORed with the data z i (0 ≤ i < 4), and then the intermediate results are stored in the four registers R j (0 ≤ j < 4). In the following three cycles, the products of the remaining inputs are accumulated into these registers in the same manner, so that the 4-byte output w i (0 ≤ i < 4) is obtained after the 4th cycle.
In [4], three 8-bit registers are required for the construction of a parallel-to-serial converter in order to avoid register competition with the next calculation of a matrix. On the other hand, no competition occurs in our architecture because z i is input at the 1st cycle of a matrix multiplication. w i can be moved into the register where z i for the newly processed F-function is stored.
Type-I Architecture
Fig. 5 shows the data path of Type-I architecture, where the width of the data path is 8 bits except where noted in the figure. It is divided into the following two blocks: the data processing block and the key scheduling block. Type-I architecture processes a round of the encryption function in 8 clock cycles. We show, in the appendix, the detailed data flow of the data registers R ij (0 ≤ i, j < 4) in Fig. 5 for a round of the encryption processing. As described in Sect. 3.1, at the 1st and the 5th cycle of the 8 cycles, the data stored in R 20 -R 23 are moved into R 03 -R 12 , and simultaneously the data stored in R 10 -R 13 are input to the matrix multiplier. Therefore, no additional register other than the 128-bit data register exists in the data processing block. Please note that R 30 -R 33 hold the current state at the 5th to 8th cycles by clock gating.
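The column-by-column computation performed by the matrix multiplier of Sect. 3.1 can be modeled in software. The sketch below makes two assumptions that are not stated in this text: multiplication is in GF(2^8) with the reduction polynomial z^8 + z^4 + z^3 + z^2 + 1 of the CLEFIA specification [9, 11], and M 0 is the matrix whose first column is ({01}, {02}, {04}, {06}). The model only accumulates one column of M 0 per cycle and does not reproduce the register wiring of Fig. 4 (b).

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo z^8 + z^4 + z^3 + z^2 + 1 (0x11D)."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return result

M0 = [
    [0x01, 0x02, 0x04, 0x06],
    [0x02, 0x01, 0x06, 0x04],
    [0x04, 0x06, 0x01, 0x02],
    [0x06, 0x04, 0x02, 0x01],
]

def parallel_m0_xor(a, z):
    """Reference: w = (M0 * a) XOR z, computed all at once."""
    w = []
    for i in range(4):
        acc = z[i]
        for j in range(4):
            acc ^= gf_mul(M0[i][j], a[j])
        w.append(acc)
    return w

def serial_m0_xor(a, z):
    """One S-box output per cycle; the registers accumulate one column of M0 at a time."""
    regs = [0, 0, 0, 0]
    for cycle in range(4):
        for i in range(4):
            product = gf_mul(M0[i][cycle], a[cycle])
            # z is XORed in only at the 1st cycle; later cycles accumulate.
            regs[i] = (z[i] if cycle == 0 else regs[i]) ^ product
    return regs
```

Because z enters at the 1st cycle and the result occupies the same four registers after the 4th cycle, the serial version needs no storage beyond those registers, which is the property the architecture relies on.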
At the start of encryption, a 128-bit plaintext is loaded into R ij in 16 clock cycles by inputting it byte by byte from data in. After 18 rounds of the encryption function, which require 144 cycles, a 128-bit ciphertext is output byte by byte from data out in 16 clock cycles. Therefore, it takes 176 cycles for encryption. The reason why data out is connected to R 30 is that no word rotation is necessary at the final round of encryption. At the start of key setup, a 128-bit secret key K input from key in is loaded into R ij in 16 clock cycles. After 12 rounds of the encryption function, which require 96 cycles, a 128-bit intermediate key L is stored into the key registers L ij (0 ≤ i, j < 4) by shifting R ij and L ij in 16 clock cycles. Therefore, it takes 128 cycles for key setup.
The two S-box circuits S 0 and S 1 are located in the data processing block, and one of their outputs is selected by a 2-to-1 MUX (8-bit width) and input to the matrix multiplier. The encryption processing of CLEFIA-128 is modified by an equivalent transformation as shown in Fig. 7 (a). The 32-bit XOR operation with the 32-bit chunks K i is reduced to an 8-bit XOR operation by locating it in the matrix multiplier. A 32-bit chunk K i selected by a 32-bit 4-to-1 MUX is divided into four 8-bit data, and then one of the data is selected by an 8-bit 4-to-1 MUX and fed into the matrix multiplier one by one in 4 clock cycles.
In the key scheduling block, the intermediate key L stored in L ij is cyclically shifted by one byte, and the 8-bit chunk in L 00 is fed into the data processing after being XORed with the 8-bit chunk of CON i . At the end of even-numbered rounds, L ij is updated by (8-bit shift+Σ) operation; at the end of encryption, L ij is updated by (8- 
Type-II Architecture
In Type-II architecture, we aim at area optimization of the key scheduling block. Since the DoubleSwap function Σ is decomposed as Σ = Ψ • Ω, where Ψ and Ω are both involutive, as described in Sect. 2.2, Σ^{-8} satisfies the following equations.
\[ \Sigma^{-8} = (\Psi \circ \Omega)^{-8} = (\Omega \circ \Psi)^{8} = \Omega \circ (\Psi \circ \Omega)^{8} \circ \Omega = \Omega \circ \Sigma^{8} \circ \Omega \]
The Swap function Ω is realized by 8 iterations of cyclic shifting. Thus the Σ^{-8} operation can be achieved by 8 iterations of cyclic shifting, 8 iterations of the Σ operation, and 8 iterations of cyclic shifting again, which requires 24 cycles.
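That the inverse of eight Σ operations equals Ω, followed by eight Σ operations, followed by Ω again depends only on Ω and Ψ being involutive, so it can be checked with arbitrary involutions. In the sketch below, `omega` and `psi` are random involutive permutations on 16 positions, used as hypothetical stand-ins for the actual 128-bit permutations.

```python
import random

def random_involution(n, rng):
    """Build a permutation p (as a list) with p[p[i]] == i by pairing up positions."""
    idx = list(range(n))
    rng.shuffle(idx)
    perm = list(range(n))
    for a, b in zip(idx[0::2], idx[1::2]):
        perm[a], perm[b] = b, a
    return perm

def compose(p, q):
    """Return 'p after q', i.e. the permutation i -> p[q[i]]."""
    return [p[q[i]] for i in range(len(q))]

def power(p, k):
    out = list(range(len(p)))
    for _ in range(k):
        out = compose(p, out)
    return out

def inverse(p):
    inv = [0] * len(p)
    for i, v in enumerate(p):
        inv[v] = i
    return inv

rng = random.Random(1)
omega = random_involution(16, rng)      # stand-in for the Swap function
psi = random_involution(16, rng)        # stand-in for the SubSwap function
sigma = compose(psi, omega)             # Sigma = Psi o Omega

lhs = inverse(power(sigma, 8))                           # Sigma^-8
rhs = compose(omega, compose(power(sigma, 8), omega))    # Omega o Sigma^8 o Omega
```

The two sides agree because the inverse of Ψ • Ω is Ω • Ψ, and the trailing Ω • Ω cancels once the outer Ω is moved through the product.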
During the encryption processing the intermediate key L is updated by Σ operation at the 17th cycle after 16 iterations of cyclic shifting every two rounds. At the 17th cycle, the data registers must hold the current data by clock gating. Accordingly, both 8 additional cycles for the encryption processing and 8 additional cycles to recover the intermediate key L after outputting a ciphertext are required, which results in 192 cycles for encryption. In compensation for the increase of 16 cycle counts, a 128-bit input of MUX in the key scheduling block can be removed.
Type-III Architecture
In Type-III architecture, we achieve area optimization of the data processing block by applying clock gating effectively. Fig. 6 shows the data path of Type-III architecture. Instead of using MUXes, the data stored in R 10 -R 13 and those stored in R 20 -R 23 are swapped by cyclically shifting these registers in 4 clock cycles, while the other data registers and the key registers hold the current state by clock gating. Simultaneously, the XOR operation with a 32-bit chunk K i is done by XOR gates in the matrix multiplier, which leads to savings of 8 XOR gates. These data swaps are required twice for a round of the encryption processing. Therefore, it takes 16 cycles for a round of the encryption processing; in total 328 and 224 clock cycles are required for encryption and key setup, respectively. In compensation for the increased cycle counts, several 8-bit inputs of MUXes together with 8 XOR gates for secret key chunks can be removed.
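The cycle counts quoted for the three architectures can be reproduced with simple arithmetic. This is a sketch: the helper `cycles` and its parameter names are illustrative, and the breakdown of the 328-cycle Type-III encryption into 320 cycles plus 8 extra cycles is an assumption chosen to match the stated total, not a figure given in the text.

```python
def cycles(rounds, cycles_per_round, extra=0, load=16, unload=16):
    """Total cycles: shift in 16 bytes, run the rounds, shift out 16 bytes."""
    return load + rounds * cycles_per_round + unload + extra

# Type-I: 8 cycles per round.
type1_encryption = cycles(rounds=18, cycles_per_round=8)
type1_key_setup = cycles(rounds=12, cycles_per_round=8)

# Type-II: 8 extra cycles during encryption and 8 more to recover L afterwards.
type2_encryption = cycles(rounds=18, cycles_per_round=8, extra=8 + 8)

# Type-III: 16 cycles per round.
type3_key_setup = cycles(rounds=12, cycles_per_round=16)
# Assumption: the 8 extra cycles are the ones needed to recover L.
type3_encryption = cycles(rounds=18, cycles_per_round=16, extra=8)

print(type1_encryption, type1_key_setup, type2_encryption,
      type3_encryption, type3_key_setup)
```

The trade-off is visible directly: each step from Type-I to Type-III removes multiplexer logic and pays for it in clock cycles.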
Supporting Decryption
Any encryption-only implementation can support decryption by using the CTR mode. Nevertheless, if an implementation itself supports decryption, it can be used for more applications, for example, an application requiring the CBC mode. Accordingly, we consider the three types of hardware architectures supporting decryption. Since the data processing part of CLEFIA employs a 4-branch Type-2 generalized Feistel network [14], the directions of word rotation are different between the encryption function and the decryption function. The encryption and decryption processing of CLEFIA-128 is shown in Fig. 7 (a) and (b), respectively. If the hardware architectures described in Sect. 3 supported the decryption processing straightforwardly, many additional multiplexers would be required because of these different directions of word rotation. To avoid this, we switch the positions of F 0 and F 1 at even-numbered rounds as shown in Fig. 7 (c), so that the direction of word rotation becomes the same as in the encryption processing shown in Fig. 7 (a). Thus, by processing F 1 ahead of F 0 at even-numbered rounds, we do not have to modify the data path of the above three architectures substantially. However, as the order of round keys fed into the data processing block changes, the 8-bit round keys are fed from L 10 when F 1 is processed at even-numbered rounds and from L 30 when F 0 is processed at even-numbered rounds. Accordingly, an 8-bit 3-to-1 MUX is required for selecting the source register of the appropriate round keys, including L 00 . Because of the modified decryption processing, the leading byte of the output is stored in R 10 at the end of decryption, not in R 30 as for encryption, so an 8-bit 2-to-1 MUX is required for selecting data out.
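The effect of switching the F-function positions at even-numbered rounds can be checked on a word-level model. The sketch below compares the usual inverse network with a variant that always rotates in the encryption direction. `toy_f0` and `toy_f1` are hypothetical stand-ins for F 0 and F 1 , key whitening is omitted, and the final word ordering is a software simplification that does not model where the bytes end up in the data registers.

```python
MASK32 = 0xFFFFFFFF

def toy_f0(x, k):
    # Hypothetical stand-in for F0.
    return ((x ^ k) * 0x9E3779B1 + 1) & MASK32

def toy_f1(x, k):
    # Hypothetical stand-in for F1.
    return (((x ^ k) * 0x85EBCA6B) ^ 0xA5A5A5A5) & MASK32

def dec_standard(c, rk, r=18):
    """Inverse network: the word rotation runs opposite to encryption."""
    t = list(c)
    for i in range(r):
        k0, k1 = rk[2 * (r - i) - 2], rk[2 * (r - i) - 1]
        t[1] ^= toy_f0(t[0], k0)
        t[3] ^= toy_f1(t[2], k1)
        t = [t[3], t[0], t[1], t[2]]
    return [t[1], t[2], t[3], t[0]]       # undo the rotation after the final round

def dec_modified(c, rk, r=18):
    """Same result with the encryption-direction rotation in every round."""
    assert r % 2 == 0, "the two networks realign after every second round"
    t = list(c)
    for i in range(r):
        k0, k1 = rk[2 * (r - i) - 2], rk[2 * (r - i) - 1]
        if i % 2 == 0:                    # odd-numbered round: usual positions
            t[1] ^= toy_f0(t[0], k0)
            t[3] ^= toy_f1(t[2], k1)
        else:                             # even-numbered round: F1 ahead of F0
            t[1] ^= toy_f1(t[0], k1)
            t[3] ^= toy_f0(t[2], k0)
        t = [t[1], t[2], t[3], t[0]]
    return [t[1], t[2], t[3], t[0]]
```

After an odd-numbered round the two states differ by a swap of the 64-bit halves, which is exactly what exchanging the F-function positions compensates for; after the following even-numbered round the states coincide again.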
Implementation Results
We designed and evaluated the three types of hardware architectures presented in Sect. 3 together with their versions supporting both encryption and decryption. The environment of our hardware design and evaluation is as follows:
Language: Verilog-HDL
Design library: 0.13 µm standard cell library
The area savings for the key scheduling block of Type-II/III implementation over Type-I implementation are 128 GE. In the library we used, a register with a 3-to-1 MUX costs 7.25 GE per bit; a register with a 4-to-1 MUX costs 8.25 GE per bit. The key register of Type-I implementation consists of 120 registers with a 3-to-1 MUX (870 GE) and 8 registers with a 4-to-1 MUX (66 GE), while the key register of Type-II/III implementation consists of 120 scan flip-flops (750 GE) and 8 registers with a 3-to-1 MUX (58 GE). Thus, area savings of 128 GE are achieved.
The area savings for the data processing block of Type-III implementation over Type-I/II implementation are 78 GE. In the data register of Type-III implementation, 32 scan flip-flops (200 GE) are replaced with 32 D flip-flops (144 GE), which leads to savings of 56 GE. In the matrix multiplier, 24 3-to-1 MUXes with inverted output (54 GE) can be replaced with 24 2-to-1 MUXes with inverted output (42 GE), leading to savings of 12 GE. In addition, 8 XOR gates (16 GE) for the secret key XOR are merged into XOR gates in the matrix multiplier. Therefore, area savings of 78 GE are achieved despite the additional 6 GE for the other MUX.
Table 2 shows the implementation results of the proposed architectures together with their versions supporting both encryption and decryption. We also show, for comparison, the best known result of CLEFIA and low-area implementation results of AES. Our implementations supporting encryption only achieve a 46-50% reduction of the area requirements compared to the smallest implementation [10, 12] of CLEFIA. As for implementations supporting both encryption and decryption, our implementations are 44-47% smaller. Type-III implementation is 4% larger than the smallest encryption-only implementation [5] of AES, but its encryption/decryption version achieves a 23% reduction of the area requirements compared to the smallest encryption/decryption implementation [3] of AES.
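The area figures above can be cross-checked with a few lines of arithmetic. In this sketch the per-bit costs of a scan flip-flop and a D flip-flop are not quoted directly in the text; they are derived from the stated totals (750 GE for 120 bits and 144 GE for 32 bits). The variable and function names are illustrative.

```python
# Gate equivalents (GE) per bit, quoted in or derived from the text.
REG_MUX3 = 7.25          # register with a 3-to-1 MUX
REG_MUX4 = 8.25          # register with a 4-to-1 MUX
SCAN_FF = 750 / 120      # derived: 120 scan flip-flops cost 750 GE
D_FF = 144 / 32          # derived: 32 D flip-flops cost 144 GE

# Key scheduling block: Type-I versus Type-II/III key register.
key_reg_type1 = 120 * REG_MUX3 + 8 * REG_MUX4
key_reg_type23 = 120 * SCAN_FF + 8 * REG_MUX3
key_saving = key_reg_type1 - key_reg_type23

# Data processing block: Type-III versus Type-I/II.
flip_flop_saving = 32 * SCAN_FF - 32 * D_FF
mux_saving = 54 - 42
xor_saving = 16
extra_mux_cost = 6
data_saving = flip_flop_saving + mux_saving + xor_saving - extra_mux_cost

def reduction_percent(ours, other):
    """Area reduction of 'ours' relative to 'other', rounded to a whole percent."""
    return round(100 * (1 - ours / other))

vs_clefia = reduction_percent(2488, 4950)       # smallest CLEFIA [10, 12]
vs_aes_encdec = reduction_percent(2604, 3400)   # AES enc/dec [3]
overhead_vs_aes_enc = round(100 * (2488 / 2400 - 1))   # AES enc only [5]
```

The 2,604 GE used for the encryption/decryption comparison is the smallest implementation (2,488 GE) plus the 116 GE needed for decryption support.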
Conclusion
In this paper, we have proposed very compact hardware architectures of CLEFIA with 128-bit keys based on 8-bit shift registers. We showed that the data processing part of CLEFIA-128 can be implemented in a serialized way without any additional registers. Three types of hardware architectures, which differ in the number of cycles required to process one block, were proposed by adaptively applying the clock gating technique. Those architectures were implemented and synthesized using a 0.13 µm standard cell library. Our smallest implementation requires only 2,488 GE, which is 50% smaller than the previous smallest implementation of CLEFIA-128 and competitive with the smallest AES-128 implementation. Moreover, its version supporting both encryption and decryption requires only 2,604 GE, which is a 23% reduction of the area requirements compared to the smallest encryption/decryption implementation of AES-128. Future work will include the application of side-channel countermeasures such as threshold implementations [6, 7] to the proposed architectures.
