We present the smallest FPGA implementation of Camellia for a 128-bit key length to date. The architecture was designed for low-area and low-power applications. Through specific optimizations, such as shift registers for storing and scheduling the key and distributed RAM for storing data, we achieve a compact implementation that uses only 318 slices at a throughput of 18.41 Mbps on the smallest Xilinx Spartan-3 device, the XC3S50-5.
INTRODUCTION
The Camellia algorithm [1] was jointly developed by Nippon Telegraph and Telephone Corporation (NTT) and Mitsubishi in 2000. It was designed to suit a wide range of platforms, from low-power, resource-constrained devices to high-performance systems. However, the main design goal was security. The New European Schemes for Signatures, Integrity, and Encryption (NESSIE) project nominated Camellia as a strong block cipher in 2003 [2]. The structure of Camellia lends itself to a compact design. Attacks have so far succeeded only against reduced-round versions of Camellia. An impossible differential cryptanalysis of reduced-round Camellia is described in [3], and collision attacks in [4, 5]. The best known attack can break 9 rounds of Camellia with a 128-bit key [4].
CAMELLIA
Camellia [1] is a 128-bit block cipher that supports key lengths of 128, 192 or 256 bits. In this paper, we describe an implementation with a 128-bit key length. The Camellia algorithm uses a Feistel network with pre-whitening before the first round and post-whitening after the last round. The FL and FL^-1 functions, inserted after the 6th and 12th rounds, introduce non-regularity. The block diagram of 128-bit encryption is shown in Fig. 1. The F-function contains a Substitution-Permutation Network (SPN) composed of a non-linear S-function and a linear P-function. The S-function consists of 8 S-Boxes, which are selected from four different types.
Notations
X_L: left-half data of X.
X_R: right-half data of X.
⊕: bitwise exclusive-OR operation.
... Look-Up Tables (LUTs). The LUTs can be configured as a Distributed RAM (DRAM) or a 16-bit shift register (SRL16), apart from implementing logic. DRAMs provide a fast and efficient implementation of memory. In this architecture, a dual-port 16x8 DRAM is used for storing the data, which reduces the area by approximately 75% compared to using a 128-bit register.
In this implementation, the 64-bit F-function is broken down into several 8-bit operations. The XOR of 8 bits of the left data X and 8 bits of the round key K passes through an S-Box. The result is XORed with the corresponding 8 bits of the right data Y. This is repeated as many times as the P-function requires XORs. XORing the S-Box output directly into the right data saves the storage otherwise required for intermediate values; for this reason a dual-port DRAM is used. The swapping of data is accomplished by addressing. It takes 44 clock cycles to complete one round. The multiplexers before the 1st XOR and after the 2nd XOR operation enable the computation of the modified key from the original key, as well as the pre- and post-whitening operations.
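The byte-serial evaluation described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' datapath: the P-function index table is taken from the published Camellia specification, `toy_sbox` is a hypothetical stand-in for the real S-Boxes, and the order in which terms are processed is assumed. The sketch performs one S-Box lookup and one XOR into the right-half byte per step. The resulting count of 44 steps is consistent with the 44 clock cycles per round quoted above, although the text does not spell out the exact schedule.

```python
# P-function of Camellia written as index lists (0-based):
# output byte j is the XOR of the S-Box outputs listed in P_TERMS[j].
P_TERMS = [
    [0, 2, 3, 5, 6, 7],
    [0, 1, 3, 4, 6, 7],
    [0, 1, 2, 4, 5, 7],
    [1, 2, 3, 4, 5, 6],
    [0, 1, 5, 6, 7],
    [1, 2, 4, 6, 7],
    [2, 3, 4, 5, 7],
    [0, 3, 4, 5, 6],
]


def toy_sbox(i, x):
    """Hypothetical placeholder for S-Box number i; NOT a Camellia S-Box."""
    return ((x * 7) ^ (0x1B * (i + 1))) & 0xFF


def f_serial(left, key, right, sbox=toy_sbox):
    """Byte-serial F-function: accumulate each term directly into `right`."""
    acc = list(right)  # models the right-half bytes held in the dual-port RAM
    steps = 0
    for j, terms in enumerate(P_TERMS):
        for i in terms:
            # one 8-bit step: key XOR, S-Box lookup, XOR into right data
            acc[j] ^= sbox(i, left[i] ^ key[i])
            steps += 1
    return acc, steps


def f_parallel(left, key, right, sbox=toy_sbox):
    """Reference: compute all S-Box outputs first, then apply P and XOR."""
    z = [sbox(i, left[i] ^ key[i]) for i in range(8)]
    out = []
    for j, terms in enumerate(P_TERMS):
        value = right[j]
        for i in terms:
            value ^= z[i]
        out.append(value)
    return out
```

Because every term is XORed straight into the stored right-half byte, no separate register is needed for the eight intermediate S-Box outputs, which is the storage saving the paragraph above refers to.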
[Block diagram; labeled units: FL and FL^-1, P-Function, S-Function, dual-port RAM (DRAM)]
Key schedule
In the first phase of the key schedule, the modified key K_A is computed from the original key K_L. In the second phase, round keys are generated by rotating K_L or K_A by 15 or 17 bits. K_A is computed by passing K_L through 4 rounds of the same Feistel network that is used for encryption, with an XOR of K_L after the 2nd round. The round keys used are four constants.
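The second phase can be illustrated with a small sketch. It assumes keys are held as 128-bit Python integers and that each rotation step is applied on top of the previous one; the helper names `rotl128`, `halves`, and `rotation_amounts` are hypothetical and chosen for illustration only. Which key (K_L or K_A) feeds which round is deliberately left out, since it is not stated in this section.

```python
MASK128 = (1 << 128) - 1
MASK64 = (1 << 64) - 1

# Assumed sequence of rotation steps: four steps of 15 bits, then three of 17.
STEPS = [15, 15, 15, 15, 17, 17, 17]


def rotl128(x, n):
    """Rotate the 128-bit value x left by n bits."""
    n %= 128
    return ((x << n) | (x >> (128 - n))) & MASK128


def halves(x):
    """Split a 128-bit value into its upper and lower 64-bit halves."""
    return x >> 64, x & MASK64


def rotation_amounts(steps=STEPS):
    """Accumulate the per-step rotations into total rotation amounts."""
    total, amounts = 0, []
    for step in steps:
        total += step
        amounts.append(total)
    return amounts
```

Accumulating the steps gives total rotations of 15, 30, 45, 60, 77, 94 and 111 bits, which are the amounts the key storage discussion works with.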
COMPACT ARCHITECTURE OF CAMELLIA
The FL function splits its 64-bit input into two 32-bit halves, X_L and X_R, and similarly splits the 64-bit key kl into kl_L and kl_R. FL is broken into two parts, FL1 and FL2.
Our goal is to design a very compact architecture that occupies a small area while delivering an acceptable throughput. We chose to implement our architecture on Xilinx Spartan-3 family FPGA devices. The architecture uses an 8-bit datapath and performs both encryption and key scheduling. Figure 3 shows the top-level block diagram of our architecture. We tried different implementation strategies for several components of the architecture to obtain the best results.
S-Boxes and F-Function
The S-Boxes S2, S3 and S4 can be derived from S-Box S1 as

S2(x) = S1(x) <<< 1,  S3(x) = S1(x) >>> 1,  S4(x) = S1(x <<< 1)   (6)

This saves two XORs needed for the FL and FL^-1 operations.
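The derivation of S2, S3 and S4 from S1 can be sketched as below, following the relations given in the published Camellia specification. The sketch takes the S1 table as a parameter rather than reproducing its 256 entries; `derive_sboxes`, `rotl8`, and `rotr8` are hypothetical helper names, and a table-based derivation is only one option, not necessarily how the hardware realizes it.

```python
def rotl8(x, n):
    """Rotate the 8-bit value x left by n bits."""
    n %= 8
    return ((x << n) | (x >> (8 - n))) & 0xFF


def rotr8(x, n):
    """Rotate the 8-bit value x right by n bits."""
    return rotl8(x, 8 - (n % 8))


def derive_sboxes(s1):
    """Build the S2, S3 and S4 tables from a 256-entry S1 table.

    S2(x) = S1(x) <<< 1   (rotate the output left by one bit)
    S3(x) = S1(x) >>> 1   (rotate the output right by one bit)
    S4(x) = S1(x <<< 1)   (rotate the input left by one bit)
    """
    if len(s1) != 256:
        raise ValueError("S1 must have 256 entries")
    s2 = [rotl8(s1[x], 1) for x in range(256)]
    s3 = [rotr8(s1[x], 1) for x in range(256)]
    s4 = [s1[rotl8(x, 1)] for x in range(256)]
    return s2, s3, s4
```

Since the three derived boxes differ from S1 only by a fixed one-bit rotation of the input or output, a single S1 lookup plus rewiring is enough in hardware, so only one table has to be stored.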
The 32-bit cyclic rotation in FLM1 is implemented as a 1-bit left shift on 8-bit data, with one flip-flop storing the shifted-out bit. After FLM1 completes, the last bit is computed again to obtain the correct value.
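A minimal sketch of this byte-wise rotation follows. It assumes the four bytes are processed least-significant byte first and that the rotation is by one bit, as in Camellia's FL function; the processing order in the actual datapath is not given here, so both the order and the function name `rotl32_bytewise` are assumptions. The variable `carry` models the single flip-flop, and the final line models the correction of the bit that wraps around.

```python
def rotl32_bytewise(word_bytes):
    """Rotate a 32-bit word left by 1 bit using 8-bit shifts.

    word_bytes: four byte values, least-significant byte first (assumed order).
    Returns the rotated word in the same byte order.
    """
    carry = 0  # the single flip-flop holding the bit shifted out of a byte
    out = []
    for b in word_bytes:
        out.append(((b << 1) & 0xFF) | carry)
        carry = b >> 7
    # Fix-up: the bit shifted out of the last byte wraps into bit 0 of the
    # first byte, which was computed before that bit was known.
    out[0] |= carry
    return out


def rotl32(x):
    """Reference 1-bit left rotation on a 32-bit integer."""
    return ((x << 1) | (x >> 31)) & 0xFFFFFFFF
```

The fix-up step is needed because the first byte is produced before the most significant bit of the word has been seen, which matches the remark above that the last bit is computed again.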
Fig. 4. Key scheduling using shift registers.
Key Storage and Scheduling
The Camellia algorithm needs two 128-bit keys, the original key K_L and the modified key K_A, which are rotated to generate the round keys. An efficient way of storing the keys is to use SLICEM LUTs either as a DRAM or as a shift register. Addressing a DRAM is complicated because key scheduling involves shifting. The best choice is therefore to store the keys in a LUT-based shift register. However, such a shift register has only a single-bit output, and each output or tap requires a flip-flop. Hence, the area consumed by such a shift register depends mainly on the number of taps required to access the data. All the shift registers in this implementation shift by 8 bits in order to match the width of the datapath. However, the rotations needed for round key generation are 15, 30, 45, 60, 77, 94 and 111 bits, as shown in Table 1. We accomplish these with 8-bit shifts and an 8-bit 5:1 multiplexer, as n mod 8 has only 5 different results (0, 4, 5, 6 and 7, counting the unrotated key). In order to keep the control logic simple and uniform, the key is shifted in the last clock cycle of the round. For normal round key generation, tapping 15 bits is sufficient. However, because FLM1 involves a 32-bit rotation, 41 additional taps are required. This roughly doubled the size of the shift register. The key scheduling can be seen in Figure 4. The original key K_L is initially loaded into both the DRAM and the K_L shift register. The constants for generating the modified key K_A are stored in a separate shift register. K_A is computed using the datapath from Fig. 3. It is loaded into the K_A shift register while the data is loaded into the DRAM. Using shift registers for both K_L and K_A reduces the area by about 75% compared to using two 128-bit registers.
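The combination of byte-wide shifts and a small multiplexer can be sketched in software as follows. This is an illustration under assumptions: the key is modeled as 16 bytes with the most significant byte first, and `mux_offsets` and `rotated_byte` are hypothetical helper names. A rotation by n bits is split into a whole-byte part (handled by the 8-bit shifts) and a residual bit offset n mod 8 (handled by the multiplexer, which picks an 8-bit window across two neighbouring bytes).

```python
MASK128 = (1 << 128) - 1
ROTATIONS = [15, 30, 45, 60, 77, 94, 111]


def rotl128(x, n):
    """Reference: rotate the 128-bit value x left by n bits."""
    n %= 128
    return ((x << n) | (x >> (128 - n))) & MASK128


def mux_offsets(rotations=ROTATIONS):
    """Distinct bit offsets the multiplexer must support (0 = unrotated key)."""
    return sorted({0} | {n % 8 for n in rotations})


def rotated_byte(key_bytes, n, j):
    """Return byte j (MSB first) of the key rotated left by n bits.

    key_bytes: 16 byte values of the unrotated key, most significant first.
    """
    start = (n + 8 * j) % 128      # first key bit of the wanted byte
    q, r = divmod(start, 8)        # q: byte-shift part, r: mux select
    hi = key_bytes[q]
    lo = key_bytes[(q + 1) % 16]
    return ((hi << r) | (lo >> (8 - r))) & 0xFF
```

Calling `mux_offsets()` yields five offsets, which is why a 5:1 multiplexer per output bit suffices even though seven different rotation amounts are used.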
Controller
The control unit consists of a main controller and sub-controllers for the F- and FLM functions. The main controller stores its control words in a Distributed ROM (DROM) for the reasons stated in Section 3.1. The address of the control word is generated by a 6-bit counter. Using a DROM and a counter for the main controller reduces its size by 97% compared to an FSM. The sub-controllers store their control words in shift registers, as they repeat a sequence of operations.
RESULTS
We implemented our design on the smallest device of the Xilinx Spartan-3 FPGA family using Xilinx ISE 9.2i for synthesis and Active HDL 7.2 for simulation. The results were verified with the test vector provided in [2]. Table 3 shows the results of different implementation strategies for the major components of Camellia. Using the components denoted by (1), the total area of the design is 800 slices, due to the huge multiplexers used for the key, which are not shown in Table 3. Using the components denoted by (2), the total area is reduced to 318 slices, which is the smallest Camellia implementation to date. We implemented two versions of Camellia, namely Camellia-2a and Camellia-2b, using implementation strategy (2). Camellia-2a uses a LUT-based S-Box and Camellia-2b uses Block RAM. The results of these implementations are shown in Table 3 and compared with other block and stream ciphers. Camellia-2a is the optimum implementation, achieving a throughput of 18.41 Mbps. Even though our design was not optimized for the Xilinx Spartan-2 FPGA family, we also implemented it on that family in order to compare it with the smallest AES implementations. Our implementation has a higher throughput and efficiency than AES [6], but it is outperformed by AES [7], which is designed for higher throughput. Camellia has a throughput comparable to TinyXTEA-1 [8] and is only marginally larger.
CONCLUSION
Our Camellia implementation is mainly based on the use of shift registers and DRAMs for efficient memory implementation. Furthermore, using shift registers removes the need for addressing. This type of architecture is applicable to ciphers whose key scheduling is based on shifting. In low-power applications, where higher throughput with minimum area is required, it is a good alternative to AES.
