HIEROCRYPT-3 had key schedule problems [5, 7], there were attacks on up to 3.5 rounds out of 6 [1, 3, 7], and at least the hardware implementations of this cipher were extremely slow [12]. HIEROCRYPT-3 was not selected for Phase II.
CAMELLIA was selected as an algorithm suggested for a future standard [10].
In this paper we present hardware implementations of these two algorithms with 128-bit blocks and 128-bit keys, using ALTERA devices, and compare them.
Short description of the HIEROCRYPT-3 cipher
The HIEROCRYPT-3 block cipher algorithm was designed by TOSHIBA Corporation and its detailed specification is given in [11]. We have implemented the version of the algorithm with 128-bit blocks and a 128-bit main key. HIEROCRYPT-3 has 6 rounds; each round needs two 128-bit subkeys, and one further 128-bit subkey is EXORed with the text block at the end of the encryption process.
The structure of the HIEROCRYPT-3 cipher is based on the "wide trail strategy" described by Joan Daemen in his PhD thesis in 1995 [4]. That thesis suggested design strategies based on linear and differential cryptanalysis. In HIEROCRYPT-3, non-linearity is provided by two layers of sboxes (2x16 sboxes working simultaneously) and the linear layers are represented by two matrices: MDS L (operating on 4x32-bit words) and MDS H (operating on a 128-bit block of data). In this strategy each round of the encryption and decryption process depends on subkeys (in HIEROCRYPT-3: two EXOR operations with 2x128-bit subkeys).
Round of encryption

Round key generation procedure
Key generation procedure for 1 ≤ t ≤ t_turn.
Key generation procedure for t_turn + 1 ≤ t ≤ T + 1.
PW: key pre-whitening.
Decryption
The decryption of HIEROCRYPT-3 is the inverse of encryption, and consists of the final key addition, the inverse of the XS-function ($XS^{-1}$), and (T-1) inverse operations of the round function ($\rho^{-1}$).
The plaintext $P_{(128)}$ is given as the final output $X^{(0)}_{(128)}$: $P_{(128)} = X^{(0)}_{(128)}$.
Analysis of the HIEROCRYPT-3 main components
The analysis presented in this section concerns the feasibility of implementing HIEROCRYPT-3 in ALTERA FPGA devices. In the following sections we discuss all basic functions used in the algorithm and the way of implementing them in an ALTERA FPGA.
These basic functions include:
-Fσ -function.
Round of encryption and decryption
Substitution boxes
Basically, there are two possible ways of implementing sboxes:
-as a direct logic implementation, or
-as a 2048-bit configured embedded array block (EAB).
We analyzed both solutions. There are 40 sboxes in HIEROCRYPT-3 (32 in the round of encryption or decryption and 8 in the key schedule), whereas the FLEX10KE device has only 24 EABs.
The best solution seems to be the following:
-one layer of sboxes from the round of encryption (16 sboxes) and the 8 sboxes from the key schedule implemented in EABs (24 sboxes together),
-one layer of sboxes from the round of encryption (16 sboxes) implemented as direct logic. We used the DAMAIN tool [13], developed at Warsaw University of Technology, for the functional decomposition of the sbox (it provides a more efficient and faster implementation than the Max+plus II optimisation methods).
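The resource split above can be checked with simple arithmetic. The sketch below is only an illustration: it assumes an 8x8 sbox stored as a 256-entry table of 8-bit words, and the function names are ours, not part of any ALTERA tool.

```python
EAB_BITS = 2048        # capacity of one embedded array block, as stated above
EABS_AVAILABLE = 24    # EABs in the FLEX10KE device, as stated above


def sbox_table_bits(in_bits=8, out_bits=8):
    """Memory needed to store one sbox as a lookup table."""
    return (1 << in_bits) * out_bits


def plan_sboxes(round_sboxes=32, key_sboxes=8):
    """Put as many sboxes as possible into EABs, the rest into logic."""
    total = round_sboxes + key_sboxes
    in_eab = min(total, EABS_AVAILABLE)
    return {
        "in_eab": in_eab,
        "in_logic": total - in_eab,
        "eab_bits_used": in_eab * sbox_table_bits(),
    }


if __name__ == "__main__":
    print(sbox_table_bits())   # one 8x8 sbox fills exactly one EAB
    print(plan_sboxes())
```

Under these assumptions one sbox occupies exactly one EAB, so 24 sboxes use 24 x 2048 = 49152 memory bits and the remaining 16 sboxes must be built from logic elements.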
MDS lower level
We implement the remaining multiplications (by C8h, 65h and 8Bh) in the same way.
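Multiplication by a fixed constant in GF(2^8) is linear over GF(2), so each output bit is the XOR of a fixed set of input bits. The sketch below derives such an XOR network for each of the four constants, using the field polynomial x^8 + x^6 + x^5 + x + 1 quoted later in this paper. It illustrates the general technique only and is not necessarily the exact circuit used in our projects; the function names are ours.

```python
POLY = 0x163  # x^8 + x^6 + x^5 + x + 1


def xtime(a):
    """Multiply a field element by x and reduce modulo POLY."""
    a <<= 1
    if a & 0x100:
        a ^= POLY
    return a & 0xFF


def gf_mul(a, b):
    """Shift-and-add multiplication in GF(2^8)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


def xor_network(constant):
    """For every output bit i, list the input bits that are XORed together."""
    # Column j is the product of the constant with the basis element x^j.
    columns = [gf_mul(constant, 1 << j) for j in range(8)]
    return [[j for j in range(8) if (columns[j] >> i) & 1] for i in range(8)]


def apply_network(network, value):
    """Evaluate the XOR network on one byte, bit by bit."""
    out = 0
    for i, taps in enumerate(network):
        bit = 0
        for j in taps:
            bit ^= (value >> j) & 1
        out |= bit << i
    return out


if __name__ == "__main__":
    for c in (0xC4, 0x8B, 0xC8, 0x65):
        print(hex(c), xor_network(c))
```

Each list of taps corresponds to one small XOR tree, which maps directly onto FPGA logic elements.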
MDS higher level
Key schedule
P (n) -function
This kind of operation is also easily implemented in hardware. It is a linear transformation that takes four n-bit input values. The only thing we have to do is to XOR the correct n-bit input values together; in this way we obtain four n-bit output values.
We implemented this transformation similarly to the MDS higher level operation.
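A transformation of this kind can be described by a 4x4 matrix of zeros and ones: output word i is the XOR of those input words whose entry in row i equals 1. The sketch below shows the mechanism only. EXAMPLE_MATRIX is a placeholder chosen for illustration and is not the P (n) matrix from the HIEROCRYPT-3 specification [11].

```python
# Placeholder 0/1 matrix, NOT the matrix defined in the specification.
EXAMPLE_MATRIX = [
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 1],
]


def xor_transform(matrix, words, n):
    """Output word i is the XOR of the input words selected by row i."""
    mask = (1 << n) - 1
    outputs = []
    for row in matrix:
        acc = 0
        for selected, word in zip(row, words):
            if selected:
                acc ^= word
        outputs.append(acc & mask)
    return outputs


if __name__ == "__main__":
    print(xor_transform(EXAMPLE_MATRIX, [1, 2, 4, 8], 32))
```

In hardware every output bit is a single XOR gate with at most four inputs, so the whole transformation has a depth of one logic level.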
M 5E -function
M B3 -function
This transformation is very similar to M 5E . We implemented this in the same way.
Fσ -function
We described the implementation of each part of this transformation in the previous sections: sboxes in 2.1.1 and the P (n) -function in 2.2.1.
Flex 10K devices. This FPGA family is considered suitable for dedicated cryptographic solutions. In this paper we also present the realisation of HIEROCRYPT-3 using these ALTERA products (Max+plus II and Flex 10KE).
There are various approaches to the realisation of block ciphers. Some papers suggest that the generation of the subkeys and the round calculations should be executed in parallel. First the subkey for round one is calculated. Then the round is executed while the subkey for the next round is calculated. The main advantage of this design is that we do not need to store the subkeys, because they are calculated on the fly [6].
Another proposition is to implement the key generation algorithm in a separate unit of the cipher. In the first phase (called key setup) all necessary subkeys are generated and stored in internal registers (or memory). The second phase then consists of the encryption or decryption process only [2].
Both realisations have advantages and disadvantages, and each is well suited to the algorithm for which it was proposed [2, 6].
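The difference between the two approaches can be sketched in software. Both the key update `toy_next_subkey` and the round `toy_round` below are hypothetical stand-ins invented for this illustration; they are not the HIEROCRYPT-3 key schedule or round function.

```python
MASK128 = (1 << 128) - 1


def toy_next_subkey(subkey):
    """Hypothetical key update: rotate by 7 bits and add a constant."""
    rotated = ((subkey << 7) | (subkey >> 121)) & MASK128
    return rotated ^ 0x9E3779B97F4A7C15


def toy_round(state, subkey):
    """Hypothetical round: key addition followed by a 3-bit rotation."""
    mixed = state ^ subkey
    return ((mixed << 3) | (mixed >> 125)) & MASK128


def encrypt_on_the_fly(block, main_key, rounds=6):
    """Subkeys are never stored; each one is derived when it is needed."""
    subkey = toy_next_subkey(main_key)
    state = block
    for _ in range(rounds):
        state = toy_round(state, subkey)
        # In hardware this update would run in parallel with the round.
        subkey = toy_next_subkey(subkey)
    return state


def key_setup(main_key, rounds=6):
    """Separate setup phase: derive all subkeys once and store them."""
    subkeys = []
    subkey = toy_next_subkey(main_key)
    for _ in range(rounds):
        subkeys.append(subkey)
        subkey = toy_next_subkey(subkey)
    return subkeys


def encrypt_with_stored_keys(block, subkeys):
    """Encryption phase that only reads the stored subkeys."""
    state = block
    for subkey in subkeys:
        state = toy_round(state, subkey)
    return state
```

The first variant saves registers, while the second removes the key schedule from the critical path of every block after the setup phase.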
Implementation of HIEROCRYPT-3 with short setup
First we present a solution with short setup. In the key setup phase, we perform the pre-whitening of the main key and calculate the subkey for round one. Next, in the external register from INTERMEDIATE STORAGE (Fig. 3.1
This project executes a correct encryption (decryption) process in 8 clock cycles.
The clock frequency can be 8.05 MHz and the throughput of this project is 115 Mb/s.
The key round is the critical path.
Implementation of HIEROCRYPT-3 with long setup
The next feature of the HIEROCRYPT-3 algorithm considered in the project with long setup is the symmetry in the intermediate part of the key schedule (
This project executes a correct encryption (decryption) process in 8 clock cycles.
The clock frequency can be 11.91 MHz and the throughput of this project is 190 Mb/s.
The computation of the Fσ-function output data is the critical path.
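For an iterative design that processes one block at a time, throughput is commonly estimated as block size times clock frequency divided by the number of clock cycles per block. The sketch below applies this estimate to the long-setup figures given above; the function name is ours.

```python
def throughput_mbps(clock_mhz, cycles_per_block, block_bits=128):
    """Estimated throughput in Mb/s of a one-block-at-a-time design."""
    return block_bits * clock_mhz / cycles_per_block


if __name__ == "__main__":
    # Long setup: 11.91 MHz clock, 8 clock cycles per 128-bit block.
    print(round(throughput_mbps(11.91, 8), 2))
```

With these inputs the estimate is about 190.6 Mb/s, which agrees with the 190 Mb/s reported for this project.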
Implementation of HIEROCRYPT-3 with very long setup
Two very important changes were made to the project with long setup, and they are the main features of this project. This project executes a correct encryption (decryption) process in 7 clock cycles.
The clock frequency can be 15.64 MHz and the throughput of this project is 304 Mb/s.
Extensive implementation of HIEROCRYPT-3
We implemented this design in a STRATIX device available in QUARTUS II, because it was not possible to fit it in the Flex 10KE (too many resources are necessary).
Only one change was made with respect to the previous project. This change was mathematical and it resulted from the flexibility of the HIEROCRYPT-3 algorithm.
The sbox in HIEROCRYPT-3 is of size 8x8 and it is a bijective function, which means that it is a permutation of the elements of GF(2^8). Multiplication by the MDS lower level matrix is executed in this way:
(the primitive polynomial for this field is x^8 + x^6 + x^5 + x + 1).
Each element of GF(2^8) is first multiplied by four constants: C4h, 8Bh, C8h, 65h, and then the products are EXORed. Multiplication of all elements of GF(2^8) by a non-zero constant yields a permutation of the elements of GF(2^8).
Hence, we can consider the sbox as a permutation. We can consider the multiplication by
This project executes a correct encryption (decryption) process in 7 clock cycles.
The clock frequency can be 21.73 MHz and the throughput of this project is 397 Mb/s.
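The idea of merging the sbox with the constant multiplications can be sketched as follows: for every constant c, the composition x -> c * S(x) is again a permutation and can be stored as one 256-entry table. In the sketch below `placeholder_sbox` is a hypothetical bijection used only for illustration and is not the HIEROCRYPT-3 sbox; the other names are ours as well.

```python
POLY = 0x163  # x^8 + x^6 + x^5 + x + 1
CONSTANTS = (0xC4, 0x8B, 0xC8, 0x65)


def gf_mul(a, b):
    """Shift-and-add multiplication in GF(2^8) modulo POLY."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= POLY
        b >>= 1
    return result


def placeholder_sbox():
    """Hypothetical bijection on bytes, NOT the HIEROCRYPT-3 sbox."""
    return [(7 * x + 99) % 256 for x in range(256)]


def merged_tables(sbox, constants=CONSTANTS):
    """One table per constant holding c * S(x) for every input byte x."""
    return {c: [gf_mul(c, sbox[x]) for x in range(256)] for c in constants}


def is_permutation(table):
    """True if the table contains every byte value exactly once."""
    return sorted(table) == list(range(256))


if __name__ == "__main__":
    tables = merged_tables(placeholder_sbox())
    print({hex(c): is_permutation(t) for c, t in tables.items()})
```

With 16 byte positions in a substitution layer and four constants per position, this gives 16 x 4 = 64 tables, after which only XOR operations remain in the MDS lower level.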
Summary of HIEROCRYPT-3 cipher implementation.
Efficiency throughput 
CAMELLIA algorithm and its implementation
The CAMELLIA block cipher algorithm was designed by NTT Corporation and Mitsubishi Electric Corporation and its detailed specification is given in [8]. We have implemented the version of the algorithm with 128-bit blocks and a 128-bit main key. CAMELLIA has 18 rounds and each round needs one 64-bit subkey. Four 64-bit subkeys are needed for the pre-whitening and post-whitening operations at the beginning and the end of the encryption process.
Structure of CAMELLIA algorithm
Encryption and decryption round
The data randomizing part has an 18-round Feistel structure with two FL/FL^-1-function layers after the 6th and 12th rounds and 128-bit XOR operations before the first round and after the last round. The key schedule part generates the 64-bit subkeys $kw_t$ ($t = 1, \dots, 4$), $k_u$ ($u = 1, \dots, 18$) and $kl_v$ ($v = 1, \dots, 4$).
In the data randomizing part, first the plaintext $M_{(128)}$ is XORed with $kw_{1(64)} \| kw_{2(64)}$ and separated into $L_{0(64)}$ and $R_{0(64)}$ of equal length:
$$M_{(128)} \oplus (kw_{1(64)} \| kw_{2(64)}) = L_{0(64)} \| R_{0(64)}.$$
Then the following operations are performed from r = 1 to 18, except for r = 6 and 12:
$$L_r = R_{r-1} \oplus F(L_{r-1}, k_r), \quad R_r = L_{r-1}.$$
For r = 6 and 12, the following is carried out:
$$L'_r = R_{r-1} \oplus F(L_{r-1}, k_r), \quad R'_r = L_{r-1},$$
$$L_r = FL(L'_r, kl_{2r/6-1}), \quad R_r = FL^{-1}(R'_r, kl_{2r/6}).$$
Lastly, $R_{18(64)}$ and $L_{18(64)}$ are concatenated and XORed with $kw_{3(64)} \| kw_{4(64)}$. The resultant value is the ciphertext, i.e.
$$C_{(128)} = (R_{18(64)} \| L_{18(64)}) \oplus (kw_{3(64)} \| kw_{4(64)}).$$
It turned out that the best solution is the third case.
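The structure of the data randomizing part can be sketched in a few lines. The skeleton below follows the steps described above, but the F, FL and FL^-1 functions are passed in as parameters, and the `toy_f` and `toy_fl` stand-ins are hypothetical placeholders rather than the CAMELLIA functions. Decryption is assumed to reuse the same procedure with the subkeys in reverse order.

```python
MASK64 = (1 << 64) - 1


def data_randomizing(block, kw, k, kl, f, fl, fl_inv):
    """18-round Feistel skeleton; kw, k, kl hold 4, 18 and 4 subkeys."""
    left = (block >> 64) ^ kw[0]
    right = (block & MASK64) ^ kw[1]
    for r in range(1, 19):
        left, right = right ^ f(left, k[r - 1]), left
        if r in (6, 12):
            # kl_{2r/6-1} and kl_{2r/6}, shifted to 0-based list indices
            left = fl(left, kl[2 * r // 6 - 2])
            right = fl_inv(right, kl[2 * r // 6 - 1])
    return ((right ^ kw[2]) << 64) | (left ^ kw[3])


def reversed_subkeys(kw, k, kl):
    """Subkey order for decryption with the same procedure."""
    return [kw[2], kw[3], kw[0], kw[1]], k[::-1], kl[::-1]


def toy_f(x, key):
    """Hypothetical round function, NOT the CAMELLIA F-function."""
    return ((x ^ key) * 0x9E3779B97F4A7C15 + 1) & MASK64


def toy_fl(x, key):
    """Hypothetical self-inverse stand-in for FL and FL^-1."""
    return x ^ key


if __name__ == "__main__":
    kw = [11, 22, 33, 44]
    k = list(range(101, 119))
    kl = [5, 6, 7, 8]
    message = 0x0123456789ABCDEFFEDCBA9876543210
    cipher = data_randomizing(message, kw, k, kl, toy_f, toy_fl, toy_fl)
    dkw, dk, dkl = reversed_subkeys(kw, k, kl)
    plain = data_randomizing(cipher, dkw, dk, dkl, toy_f, toy_fl, toy_fl)
    print(plain == message)
```

Because encryption and decryption share one datapath, a hardware implementation only needs to change the order in which subkeys are supplied.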
P -function
The implementation of this operation (Fig. 4.2) is similar to that of the operations described in section 2.1.3 (MDS higher level in HIEROCRYPT-3).
FL -function
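FL-function is defined as follows (Fig. 4.3):
$$FL: (X_{(64)}, kl_{(64)}) \mapsto Y_{(64)},$$
in detail:
$$Y_{R(32)} = ((X_{L(32)} \cap kl_{L(32)}) \lll 1) \oplus X_{R(32)},$$
$$Y_{L(32)} = (Y_{R(32)} \cup kl_{R(32)}) \oplus X_{L(32)}.$$
There are four kinds of operations:
-logical AND,
-logical OR,
-cyclic shift (rotation) by 1 bit to the left,
-EXOR.
All of these operations are hardware oriented and the implementation of the FL-function is very simple.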
FL -1 -function
The implementation of this operation is as easy as that of the previous one (Fig. 4.3).
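The FL and FL^-1 functions can be written down directly from the four operations listed above. The sketch below follows the definition in the CAMELLIA specification [8] as we understand it, with the 64-bit input split into two 32-bit halves; the function names are ours.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x, n):
    """Rotate a 32-bit word left by n bits (0 < n < 32)."""
    return ((x << n) | (x >> (32 - n))) & MASK32


def fl(x, kl):
    """FL-function on a 64-bit value x with a 64-bit subkey kl."""
    xl, xr = x >> 32, x & MASK32
    kll, klr = kl >> 32, kl & MASK32
    yr = rotl32(xl & kll, 1) ^ xr   # AND, rotate by one bit, EXOR
    yl = (yr | klr) ^ xl            # OR, EXOR
    return (yl << 32) | yr


def fl_inv(y, kl):
    """FL^-1-function: the same operations applied in the opposite order."""
    yl, yr = y >> 32, y & MASK32
    kll, klr = kl >> 32, kl & MASK32
    xl = (yr | klr) ^ yl
    xr = rotl32(xl & kll, 1) ^ yr
    return (xl << 32) | xr


if __name__ == "__main__":
    x = 0x0123456789ABCDEF
    key = 0xF0E1D2C3B4A59687
    print(fl_inv(fl(x, key), key) == x)
```

Both functions use the same gates, and the rotation by one bit is pure wiring in hardware, so the two can share most of their logic.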
Implementation of the CAMELLIA and its results
Performance of CAMELLIA (Hardware Performance)
The performance of the CAMELLIA algorithm is given in [9]. The table is
Proposition of implementation of CAMELLIA.
The most satisfactory results for the implementation of the CAMELLIA algorithm are achieved using a loop-unrolled architecture. This means that in one clock cycle we execute 3 rounds of encryption (decryption).
We used the same interface as in the HIEROCRYPT-3 projects. It is shown
The CAMELLIA implementation requires 2973 logic elements and 49152 memory bits.
Conclusions
The implementation of HIEROCRYPT-3 is not simple. The optimal implementation of this algorithm is achieved when all conditions from section 3.5 are taken into account. This implementation reaches a very high speed of 304 Mb/s, which is almost 6 times faster than the fastest implementation proposed by the authors of the cipher. It needs only 9758 logic elements and 48 kb of EAB (embedded array block) additional memory, is twice as efficient as the implementation proposed by the authors of the cipher, and fits into one FPGA circuit.
HIEROCRYPT-3 is a very flexible algorithm. It is possible to merge the substitution layer with the MDS lower level layer and replace them by one substitution layer with 64 sboxes and a few XOR operations. This project needs a lot of logic elements (more than 25000), but it is still a practical implementation and its throughput is 397 Mb/s.
It is easy to implement CAMELLIA in hardware. We achieve the best throughput (240 Mb/s) when we execute three rounds in one clock cycle. We call this project the LOOP-UNROLLED architecture.
Both ciphers seem to be very suitable for hardware implementation but, surprisingly, we achieved better throughput for HIEROCRYPT-3. However, in terms of efficiency CAMELLIA is still better.
Our work suggests that the implementation possibilities of an algorithm (here HIEROCRYPT-3) should not be evaluated only by its authors, who very often do not have enough knowledge about design optimisation.
At the end of our paper we present a comparison of the implementations presented by the authors of the primitives HIEROCRYPT-3 and CAMELLIA with our projects.
HIEROCRYPT-3:
efficiency throughput 
