CRYPTON is a 128-bit block encryption algorithm proposed as a candidate for the Advanced Encryption Standard (AES), and is expected to be especially efficient in hardware implementation. In this paper, hardware designs of CRYPTON, and their performance estimation results are presented. Straightforward hardware designs are improved by exploiting hardware-friendly features of CRYPTON. Hardware architectures are described in VHDL and simulated. Circuits are synthesized using 0.35 μm gate array library, and timing and gate counts are measured. Data encryption rate of 1.6 Gbit/s could be achieved with moderate area of 30,000 gates and up to 2.6 Gbit/s with less than 100,000 gates.
Introduction
The explosive growth in computer systems and their interconnections via networks has changed the way we live. From politics to business, our lives depend on the information stored and communicated using these systems.
This in turn has led to a heightened awareness of the need of the information security. To enforce information security by protecting data and resources from disclosure, a secure and efficient encryption algorithm is needed. Since the widely used encryption algorithms, DES or Triple DES, are no more considered secure enough or efficient for future applications, a new block encryption algorithm with a strength equal to or better than that of Triple DES and significantly improved efficiency is needed.
Recently very high bandwidth networking technologies such as ATM and Gigabit Ethernet are rapidly deployed [1] . Network applications such as virtual private network [2] need high-speed executions of encryption algorithms for high-speed networks. In order to utilize the high-speed networks at link speed, encryption speed as fast as 1 Giga bits per second is required. According to our experiment, the execution speed of Triple DES on because it was designed from the beginning with hardware implementations in mind. The encryption and decryption use the identical circuitry, and there needs no large logic for S-boxes. Moreover it does not use addition or multiplication but only uses exclusive-OR operations, and the exclusive-OR operations can be executed in parallel. CRYPTON is considered as the most hardware-friendly AES candidate on several researches [9] [10] [11] .
In this paper, hardware designs of CRYPTON Version 1.0 [6] , which were optimized by exploiting the hardware-friendly features of CRYPTON, are presented. To maximize operation parallelism, key generation and data encryption operations are executed simultaneously. Some round loops can be unrolled for speedup without increasing the quantity of logic. S-boxes used for both key scheduling and data encryption are shared to minimize the area. Hardware architectures are described in VHDL and simulated using Synopsys Compiler and Simulator. Circuits are synthesized using 0.35 μm gate array library, and timing and gate counts are measured. This paper is organized as follows. In Section 2, CRYPTON algorithm is briefly introduced. In Section 3, our design considerations of CRYPTON hardware are described. In Section 4 and 5, the detailed hardware designs of CRYPTON and estimation results are presented respectively. Concluding remarks are made in Section 6.
CRYPTON Algorithm
In CRYPTON, each data block of 128 bits is processed in the form of a 4 × 4 byte array. The round transformation of CRYPTON consists of four steps: byte-wise substitutions, column-wise bit permutation, column-to-row transposition, and then key addition. The encryption process involves 12 repetitions of the same round transformation. The decryption process can be made identical to the encryption process with a different key schedule. Figure 1 shows the high level structure of CRYPTON. 12-round self-reciprocal cipher (with identical processes for encryption and decryption) with block length 128 bits and key length up to 256 bits.
Strong security against existing attacks.
Efficiency in both SW and HW: thanks to the high degree of parallelisms and the use of only very simple operations (logical ANDs/XORs, rotates, and table lookups), CRYPTON runs very fast on most platforms in software as well as in hardware.
Wide tradeoffs between speed and memory: It can be implemented using the storage of 32, 256, 1K, 4K, 16K or 512K bytes, etc. Its speed on Pentium Pro 200 MHz ranges from 54 Mbps (in C) to 64 Mbps (in in-line ASM).
Fast key scheduling: The encryption key schedule runs much faster than one-block encryption, so it is very efficient for use in the application requiring frequent key changes (e.g., in hashing mode).
Basic Building Blocks

Byte Substitution γ
CRYPTON uses a nonlinear byte substitution using four 8 × 8 S-boxes, S i (0 ≤ i ≤ 3). These S-boxes are derived from one 8 × 8 involution S-box S (i.e., S = S -1 ) and satisfy the inverse relationships S 2 = S 0 -1 and S 3 = S 1 -1 . The involution S-box S was designed to enable efficient logic implementation (see [6] for details).
The S-box transformation γ consists of byte-wise substitutions on a 4 × 4 byte array. Two different transformations are used alternatively in successive rounds: γ ο in odd rounds and γ e in even rounds. They are depicted in Figure 2 . Observe that the four S-boxes are arranged so that the following holds for any 4 × 4 byte array A:
This property is used to derive the identical process for encryption and decryption. 
where m 0 = 0xfc, m 1 = 0xf3, m 2 = 0xcf, m 3 = 0x3f. To make the encryption and decryption processes identical, two slightly different versions of a bit permutation is used; π ο in odd rounds and π e in even rounds.
They are defined as follows:
Bit permutation π o for odd rounds: B = π o (A) defined by
Bit permutation π e for even rounds: B = π e (A) defined by
Note that in hardware each bit of π outputs requires only exclusive-ORs of three different bits. 
Key Addition σ
Key mixing is performed by simply exclusive-ORing round keys with data words. That is, for a round key
Round Transformation ρ
One round of CRYPTON consists of applying in sequence S-box transformation, bit permutation, byte transposition and key addition. That is, the encryption round functions used for odd and even rounds are defined (for round key K) by
for r = 2, 4, ... etc.
Encryption and Decryption
The encryption transformation E K of r-round CRYPTON under key K consists of an initial key addition and r/2 times repetitions of ρ ο and ρ e and then a final output transformation. Encryption and decryption can be performed by the same code (logic) if different round keys are applied.
Key Scheduling
r-round CRYPTON requires total 4(r + 1) round keys each of 32-bit length. These round keys are generated from a user key of 8k (k = 0,1,...,32) bits in two steps: first nonlinear transform the user key into 8 expanded keys and then generate the required number of round keys from these expanded keys using simple operations. This two-step generation of round keys is to allow efficient on-the-fly round key computation in the case where storage requirements do not allow storing the whole round keys. For details of CRYPTON algorithm, please refer to [6] .
Design Considerations
Parallel Execution of Key Generation and Encryption
Some block ciphers pre-compute round keys and store them. Then the stored keys are used repeatedly while
encrypting. This style needs extra cycles for key setup, and large storage if all 12 round keys of 128-bit length are to be stored. However, CRYPTON can generate keys simultaneously with encryption. The time it takes to generate a round key of CRYPTON is insignificant compared to the time it takes for a round transformation, and this makes it possible to generate round keys with the encryption proceeding. The parallel operation of hardware CRYPTON is illustrated in Figure 4 .
K1 Generation
K12 Generation
User Key K 12
Round 1 Transformation
Round 2 Transformation Round 12 Transformation
Data Input Data Output 
Loop Unrolling
Because CRYPTON consists of 12 repetitions of round transformation, iteration is an inevitable choice for small area designs. Rather than building every round transformation separately, a data flow is made to pass the same hardware block repeatedly. We can build the whole encryption with only a small, basic building block by exploiting iteration. The block diagram for 12-cycle iteration is shown in Figure 5 Although iteration results in small area designs, it accompanies additional path delay taken from multiplexer and register. To reduce the number of pass through multiplexer and register, we can unroll the loop so that one cycle contains double or many times of the round transformation logic. In Figure 5 (b), two round transformations are performed in a cycle, but the number of components included in the design is one multiplexer less than that of Figure 5 (a). We can find that the design in Figure 5 (b) has better speed and area than the design in Figure 5 (a). CRYPTON has a good tradeoff between speed and area when 2, 3, 4, 6, or 12 rounds are concatenated in a loop.
Common Logic Sharing
CRYPTON has only a few simple operations as its components. Because the components appear in several different parts of the algorithm, we can make those parts share a common logic block rather than building many separate logic blocks. Common logic sharing will contribute to area reduction only if the common components have smaller area than the multiplexer added. Otherwise it will only result in increased control burden and longer path delay. CRYPTON has byte substitution operation both in key scheduling module and encryption module. Because byte substitution charges a significantly larger area than 128 2:1 multiplexers, there is a benefit of adopting common logic sharing as shown in Figure 6 .
Hardware Design of CRYPTON
In this section we propose two hardware designs of CRYPTON.
Two-Round Model
Two-round model performs two successive transformations for even and odd rounds within a clock cycle. Tworound model is efficient in area-speed tradeoff as we saw in Section 3.2, and it has smaller area than designs with 3, 4, 6, or 12 rounds concatenated. Two-round model consists of three modules: encryption module, key scheduling module, and control module as shown in Figure 9 . The following scheme describes how two-round model works.
(1) Load 128-bit input data at "DATA_IN" port, and 256-bit user key at "USER_KEY" port.
(2) If decryption is to be performed, apply logic high on "DECR" port until the encryption is finished. (3) Start encryption by applying logic high pulse for one clock cycle at the "START" port.
(4) Check the output port "DONE" to see if the encryption is completed.
(5) If "DONE" port outputs logic high, read the encryption result through output port "DATA_OUT".
The 12-round encryption process needs 13 round keys; from the round 0 key to the round 12 key. Generating 13 round keys will take 7 clock cycles because two-round model computes two keys within a clock cycle. Thus the result is available 7 clock cycles later after logic high on "START" port is latched at the rising edge of clock. Figure 1 . Because round 0 has uniqueness that only key addition is performed in it, it will require extra 128 two-input exclusive-OR gates if we didn't take the sequence shown in Figure 7 (a).
The block diagram for key scheduling module is depicted in Figure 8 . Key scheduling module conveys two successive round keys to the encryption module at every clock cycle. One of our design principles is parallel execution mentioned in Section 3.1, and we don't adopt any round-key pre-computation style for speed enhancement.
There is design concern in the final stage of round-key computation. A round key is exclusive-OR sum of a round-constant, masking constants, and expanded keys. There are only 4 masking constants that can be easily built into combination logic, and expanded keys are unknown before the round, thus we have no design choices about them. But there are as many as 13 round-constants of 32-bit length, and one should decide if he will build wired logic out of pre-computed round-constants, or make the design compute the constants from the specified equation at each clock cycle. Computation of constants in fully parallelized design like Figure 8 needs at least 4 32-bit adders and two 32-bit registers. On the other hand using wired logic for the 13 constants introduces a little longer propagation delay in key computation. Because two-round model aims at small area, implementation with wired logic is chosen. The most unwieldy part in CRYPTON might be the nonlinear substitution, γ. γ consists of byte-wise substitution on a 4 × 4 byte array, in other words 16 256-entry tables. As Figure 7 (a) shows, encryption module must have two γ 's because γ is distinct for odd rounds and even rounds. In addition, key scheduling module has two γ 's for user key expansion, which is used only once when a new user key is set. Seeing that most of the other operations takes comparatively small area, incorporating separate γ 's for key expansion looks very mismatched for its rare usage. This lack of balance can be corrected by sharing γ 's between the expansion block of key scheduling module and round transformation block of encryption module. This scheme effectively reduces the total area but needs one more clock cycle only for key expansion when new expanded keys are to be computed. Figure 7(b) shows the new block diagram for encryption module with γ sharing. In Figure 7 (b), outputs "Ur" and "Vr" are fed to the inputs of two 128-bit registers and stored in them. Those stored values in the registers are used repeatedly for generation of round keys unless a new user key is set. The key scheduling area also has change in expansion block. Key scheduling area takes U' and V' [6] -the outputs of expanded key registers -as inputs, and performs operations later than γ.
Combining encryption module and key scheduling module, the whole encryption block is built as shown in Figure 9 . In Figure 9 , "Cycle0" is a signal indicating the key expansion cycle, and "Cycle1" tells the first cycle among 7 is going on. Both signals are used as control inputs for multiplexers.
Control Module
DATA_IN USER_KEY
U 
128-bit Register
Full-Round Model
To make a fast design, pipelining or loop unrolling can be generally applied. In pipelining, the algorithm is partitioned into several stages, and this enables several data blocks to be encrypted simultaneously. This is possible in ECB mode because each result of block encryption is independent from others. But modes except ECB require the previous result of encryption to be available to complete the present encryption, which makes pipelining useless. Since most of the recent application of block ciphers use chaining or feedback mode, speed enhancement through pipelining is not considered here. Instead we make full-round model with 12 rounds fully unrolled. This full-round model computes 12 rounds of transformation without looping, and it is the fastest but the largest design among those exploiting loop-unrolling. 
Estimation Results
Our estimations are all based on Synopsys DesignCompiler and Hyundai 0.35 μm gate array library. Table 1 shows area and speed of two-round model and full-round model each with two speed-area tradeoffs. In this paper, all estimations are for typical cases. "Gate" means a 2-input NAND gate, equivalent to 4 transistors. The two results for two-round model have identical logic design, but are different only in optimization strategies. The one indicated as "Area Critical" was optimized to reduce as much area as possible. On the other hand, the result of "Speed Critical" was obtained by minimizing the clock period. As shown in "Time to switch keys", one clock is spent on expanding the user key when the user key is switched to a new value. Since the design uses one clock source for its whole area, the time to switch keys will be at least "minimum clock period", but the actual propagation delay in key expansion is less than the minimum clock period. Round keys are generated on the fly, and if the user keys remain same, 7 clock cycles are needed for both encryption and decryption.
The same optimization strategies were applied to full-round model. However, in speed critical optimization, the path from data input to the final output was optimized rather than minimizing the clock period as in tworound model.
Although results of full-round model were entered into the same format of table with two-round model, the numbers should not be compared directly because the two designs have different architectures. The following three explanations qualify the meaning of each item for full-round model.
Time to switch keys: time from the insertion of a new user key to the generation of the round 0 key (this is because time taken to compute the next round key is much shorter than time to compute one round transformation. By the time γ operation of 1 st round has been performed, all round keys from 1 st round to 12 th round are available).
Key setup time: time needed to compute the whole 12 round keys.
Time to encrypt one block:
The longest path delay from data input to the final data output (it is the time taken to encrypt one block when all round keys are set up and ready to be used).
In two-round model, "Total Area" is larger than the sum of "encryption module" and "key scheduling module" because area of buffer for expanded keys and control unit is missing. Full-round model doesn't have any extra logic except encryption module and key scheduling module, and thus the sum is exactly matched.
Two main issues in logic design are speed and area. Because speed and area are objects of tradeoff in most cases, we can get the best results on one criterion by optimizing for it while ignoring the other. On the other hand, the result is the worst one for the ignored criterion. We can find the approximate upper and lower bounds of speed and area of CRYPTON by once optimizing for area and then for speed.
The main trade-off between space and time takes place in optimization of S-box. We could obtain as high speed as 2.6Gbps by growing the area of S-box, but the total number of gates was over 90,000. Resorting to speed-area tradeoff, two-round model faster than full-round model was possible contrary to our initial scheme of using loop unrolling to make a fast design. In our estimation, two-round model was found more moderate and practical both in area and speed than full-round model. Although optimization in the synthesis tool was very useful to achieve a goal in speed or area, a better result was possible by modifying the design itself.
Conclusions
In this paper, we designed and proposed two hardware architectures of CRYPTON exploiting the inherent hardware-friendly features of CRYPTON. The architectures were described in VHDL and circuits were synthesized using 0.35 μm gate array with several speed-area tradeoffs. 0.9 Gbps with the smallest area of 18,000 gates and the fastest speed of 2.6 Gbps with less than 100,000 gates could be achieved. The speed of 2.6 Gbps is faster than the commercially available fastest Triple-DES chip with an order of magnitude. This is enough speed to support the Gigabit networks. Since CRYPTON has good scalability in gate count, a designer can select a proper speed-area tradeoff from the large set choices.
