In this paper, we present a general framework for evaluating the performance characteristics of block cipher structures composed of S-boxes and Maximum Distance Separable (MDS) mappings. In particular, we examine nested Substitution-Permutation Networks (SPNs) and Feistel networks with round functions composed of S-boxes and MDS mappings. Within each cipher structure, many cases are considered based on two types of S-boxes (i.e., 4×4 and 8×8) and parameterized MDS mappings. In our study of each case, the hardware complexity and performance are analyzed. Cipher security, in the form of resistance to differential, linear, and Square attacks, is used to determine the minimum number of rounds required for a particular parameterized structure. Because the discussed structures are similar to many existing ciphers (e.g., Rijndael, Camellia, Hierocrypt, and Anubis), the analysis provides a meaningful mechanism for seeking efficient ciphers through a wide comparison of performance, complexity, and security.
Introduction
In product ciphers like DES [1] and Rijndael [2], the concepts of confusion and diffusion are vital to security. The Feistel network and the Substitution-Permutation Network (SPN) are two typical architectures for achieving them. In both architectures, substitution boxes (S-boxes) are typically used to perform substitution on small sub-blocks. An S-box is a nonlinear mapping from input bits to output bits that is designed to meet a number of security requirements. In many recently proposed block ciphers (e.g., Rijndael, Hierocrypt [3], Anubis [4], and Khazad [5]), the outputs of a layer of parallel S-boxes are passed through a linear transformation based on a Maximum Distance Separable (MDS) code.
In this paper, the performance of several cipher structures is considered in terms of hardware time and space complexity. A performance comparison is made between different parameterized cases of 128-bit block ciphers in relation to security requirements. In the analysis, the hardware complexities of S-boxes and MDS mappings are based on the upper bounds of the minimum hardware complexity deduced in [6]. For a general invertible S-box, the upper bounds of the gate count and delay are obtained from the logic minimization of a hardware-efficient S-box model; for an MDS mapping, the upper bounds of the gate count and delay are obtained by searching MDS candidates for one that is optimal when implemented with bit-parallel multipliers. Hence, the structures discussed in this paper are constructed with optimized components to produce high efficiencies in their categories. A conventional evaluation approach is taken in [6], with the space complexity evaluated by the number of bit-wise inverters and 2-input gates and the time complexity evaluated by the number of traversed layers in the gate network. In this paper, a weight is associated with each type of gate to reflect differences in hardware cost. Performance metrics are defined for hardware with consideration of complexity and security.
Many ciphers are derived from appropriate configurations of S-boxes and linear transformations (typically MDS mappings). Rijndael, Hierocrypt, and Anubis can be regarded as specific cases of nested SPNs [3]. On the other hand, the round function of a Feistel network may contain one or several layers of S-boxes followed by a linear transformation such as an MDS mapping. For example, Camellia [7] is such a cipher, with one layer of S-boxes in the round function (although its linear transformation is not MDS). In this paper, many cases of these cipher structures will be analyzed for their hardware complexity and performance.
Background

Properties of S-boxes
The properties of the S-boxes in a cipher are important in the consideration of a cipher's security against differential cryptanalysis [8] and linear cryptanalysis [9]. An m×n S-box, S, performs a mapping from an m-bit input X to an n-bit output Y. Considering all S-boxes, $\{S_i\}$, in a cipher, the maximum differential probability $p_s$ is defined as:
$$p_s = \max_i \max_{\Delta X \neq 0,\ \Delta Y} \Pr\{S_i(X) \oplus S_i(X \oplus \Delta X) = \Delta Y\}$$
where "⊕" denotes a bitwise XOR and "Δ" denotes a bitwise XOR difference. The maximum linear probability $q_s$ is defined as:
$$q_s = \max_i \max_{\Gamma X,\ \Gamma Y \neq 0} \left(2 \Pr\{X \cdot \Gamma X = S_i(X) \cdot \Gamma Y\} - 1\right)^2$$
where "·" denotes a bitwise inner product and $\Gamma X$ and $\Gamma Y$ denote masking variables. In this paper, all 4×4 S-boxes are assumed to satisfy $p_s, q_s \le 2^{-2}$ and all 8×8 S-boxes are assumed to satisfy $p_s, q_s \le 2^{-6}$. Many proposed ciphers such as Serpent [10], Rijndael, Hierocrypt-3 [11], and Camellia have S-boxes with these features; others such as Anubis and Khazad have slightly higher $p_s$ and $q_s$.
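As an illustration of the two S-box measures, the following minimal sketch computes them by exhaustive search for a single S-box given as a lookup table. The example table is the 4×4 S-box of the PRESENT cipher, chosen here only for illustration; it is not one of the S-boxes analyzed in this paper, and the function names are hypothetical.

```python
def max_differential_probability(sbox, m):
    """Largest Pr{S(X) xor S(X xor dX) = dY} over dX != 0 and all dY."""
    size = 1 << m
    best = 0
    for dx in range(1, size):
        counts = {}
        for x in range(size):
            dy = sbox[x] ^ sbox[x ^ dx]
            counts[dy] = counts.get(dy, 0) + 1
        best = max(best, max(counts.values()))
    return best / size


def parity(v):
    """Bitwise inner product result: parity of the set bits of v."""
    return bin(v).count("1") & 1


def max_linear_probability(sbox, m, n):
    """Largest (2 Pr{X.gX = S(X).gY} - 1)^2 over all gX and gY != 0."""
    size = 1 << m
    best = 0.0
    for gx in range(size):
        for gy in range(1, 1 << n):
            matches = sum(
                1 for x in range(size)
                if parity(x & gx) == parity(sbox[x] & gy)
            )
            best = max(best, (2 * matches / size - 1) ** 2)
    return best


# Illustrative 4x4 S-box (PRESENT); an assumption, not taken from this paper.
EXAMPLE_SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
                0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]

p_s = max_differential_probability(EXAMPLE_SBOX, 4)
q_s = max_linear_probability(EXAMPLE_SBOX, 4, 4)
```

For a cipher with several S-boxes, the maxima would be taken over all of them. The exhaustive search is practical for 4×4 and 8×8 S-boxes only.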
MDS Mappings
A linear code over the Galois field GF($2^n$) is denoted as a (k, m, d)-code [12], where k is the symbol length of the encoded message, m is the symbol length of the original message, and d is the minimal symbol distance between any two encoded messages. A (2m, m, m + 1)-code with generator matrix G = [I | C], where C is an m×m matrix and I is an identity matrix, determines an MDS mapping from the input X to the output Y through matrix multiplication over a Galois field as follows:
$$Y = C \cdot X \qquad (1)$$
where
$$X = [X_0, X_1, \ldots, X_{m-1}]^T, \qquad Y = [Y_0, Y_1, \ldots, Y_{m-1}]^T.$$
Every entry in X, Y, and C is an element in GF($2^n$).
When an invertible linear transformation f : X → Y is used in a cipher, the avalanche effect, which creates resistance to differential and linear attacks, may be measured by its branch number B, which is defined as [13]:
$$B = \min_{X \neq 0} \left( H(X) + H(Y) \right)$$
where H(X) and H(Y) denote the numbers of nonzero elements in X and Y. It is proved that an MDS mapping as defined in (1) has an optimal branch number B equal to m + 1.
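The branch number can be checked by brute force for small parameters. The sketch below is an illustration only: it works over GF($2^4$) with the field polynomial $x^4 + x + 1$ and uses a 2×2 example matrix, both of which are assumptions made here and not parameters taken from this paper.

```python
from itertools import product


def gf16_mul(a, b):
    """Multiply two elements of GF(2^4) modulo x^4 + x + 1 (assumed polynomial)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x10:
            a ^= 0x13  # reduce by x^4 + x + 1
        b >>= 1
    return r


def apply_matrix(C, X):
    """Compute Y = C * X with symbols in GF(2^4)."""
    Y = []
    for row in C:
        acc = 0
        for c, x in zip(row, X):
            acc ^= gf16_mul(c, x)
        Y.append(acc)
    return Y


def weight(v):
    """H(.): number of nonzero symbols."""
    return sum(1 for s in v if s != 0)


def branch_number(C):
    """Minimum of H(X) + H(Y) over all nonzero inputs X."""
    m = len(C)
    best = None
    for X in product(range(16), repeat=m):
        if not any(X):
            continue
        total = weight(X) + weight(apply_matrix(C, X))
        if best is None or total < best:
            best = total
    return best


# Hypothetical 2x2 example: every entry and the determinant are nonzero.
EXAMPLE_C = [[1, 2], [2, 1]]
IDENTITY = [[1, 0], [0, 1]]
```

With m = 2, the example matrix reaches m + 1 = 3, while the identity matrix only reaches 2. The search visits $16^m$ inputs, so it is only practical for small m.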
Nested SPNs
The concept of a nested SPN was first introduced in [3]. In a nested SPN, S-boxes may be viewed at different levels: each S-box at a higher level is actually a small SPN at the lower level. In this paper, we examine nested SPNs which have the following properties:
• The structure contains just two levels of SPNs. A higher-level S-box consists of a lower-level SPN; a lower-level S-box is a real 4×4 or 8×8 S-box.
• The linear transformation layers in both levels are based on MDS codes, denoted as $MDS_H$ for the higher level and $MDS_L$ for the lower level.
• The round key mixture occurs directly before each layer of actual (i.e., lower-level) S-boxes. One additional subkey mixture is appended at the end of the cipher structure. The subkey bits are mixed with data bits by XOR operations.
• A "round" refers to the combination of the subkey mixture, lower-level S-box layer, and subsequent $MDS_L$ or $MDS_H$ linear transformation.
As Figure 1 shows, $MDS_L$ is an MDS mapping from a $(2m_1, m_1, m_1 + 1)$-code over GF($2^{n_1}$), while $MDS_H$ is an MDS mapping from a $(2m_2, m_2, m_2 + 1)$-code over GF($2^{n_2}$). The variables $m_1$, $m_2$, $n_1$, and $n_2$ represent parameter choices for a nested SPN. In the most straightforward case, the output of each S-box forms one source symbol for the MDS mapping, and each encoded symbol forms the input of a subsequent S-box at the same level. So the size of an S-box is $n_1$ bits at the lower level and $n_2$ bits at the higher level. This leads to $n_2 = n_1 m_1$. Thus, the block size of the SPN is $n_1 m_1 m_2$ bits. For example, Hierocrypt (Type I) is described as the iteration of such a 4-round structure where $n_1 = 8$, $n_2 = 32$, and $m_1 = m_2 = 4$.
At each level of a nested SPN, the branch number of the MDS layer determines the minimum number of active S-boxes in differential or linear cryptanalysis. For 4 rounds of a nested SPN, an active S-box at the higher level contains at least $m_1 + 1$ active S-boxes at the lower level. Since there are at least $m_2 + 1$ active S-boxes at the higher level, the minimum number of active lower-level S-boxes is $(m_1 + 1)(m_2 + 1)$. Therefore, the security against differential and linear attacks is evaluated as follows:
Theorem 1: With the assumption that all S-box approximations involved in linear and differential cryptanalysis are independent, for 4 rounds of a nested SPN the maximum differential characteristic probability is upper bounded by $p_s^{(m_1 + 1)(m_2 + 1)}$ and the maximum linear characteristic probability is upper bounded by $q_s^{(m_1 + 1)(m_2 + 1)}$.
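The 4-round bound is simple arithmetic on the parameters. The sketch below, with hypothetical function names, evaluates the minimum number of active lower-level S-boxes and the resulting bound expressed in bits, i.e., $\log_2$ of the reciprocal of the bounded probability, under the same independence assumption.

```python
def nested_spn_min_active_sboxes(m1, m2):
    """Minimum number of active lower-level S-boxes over 4 rounds."""
    return (m1 + 1) * (m2 + 1)


def nested_spn_security_bits(m1, m2, sbox_bits):
    """log2(1/bound) over 4 rounds, where sbox_bits = log2(1/p_s) = log2(1/q_s)."""
    return nested_spn_min_active_sboxes(m1, m2) * sbox_bits


# Parameters of the Rijndael-like case: m1 = m2 = 4 with 8x8 S-boxes (p_s = 2^-6).
active = nested_spn_min_active_sboxes(4, 4)
bits = nested_spn_security_bits(4, 4, 6)
```

For these parameters, 25 S-boxes are active and the characteristic probability over 4 rounds is bounded by $2^{-150}$.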
The basic operations in MDS codes are multiplications and additions in finite fields. When $n_2$ is large, operations over GF($2^{n_2}$) are inefficient and $MDS_H$ can be costly in computation. An alternative method to obtain the same branch number is to concatenate several parallel MDS codes over a smaller finite field. The concatenated codes may be designed to ease a bitslice implementation.
Theorem 2 [3] : An MDS mapping defined by a (2m, m, m + 1)-code over the nl-bit symbol set can be constructed by concatenating l mappings defined by a (2m, m, m + 1)-code over the n-bit symbol set, where l can be any positive integer.
For the example illustrated by Figure 1, since $n_2 = m_1 n_1$, the mapping $MDS_H$ over GF($2^{n_2}$) can be implemented with $m_1$ parallel MDS mappings over GF($2^{n_1}$). In this case, the basic $MDS_H$ layer is denoted as 1×$(2m_2, m_2, m_2 + 1)$ over GF($2^{n_2}$), and its simplified $MDS_H$ layer is denoted as $m_1$×$(2m_2, m_2, m_2 + 1)$ over GF($2^{n_2}$), where $n_2$ now denotes the symbol size of the smaller field, so that $n_2 = n_1$ in this case. Since $m_1 n_1$ may be factored in other ways, other simplifications are also possible. Hence the general relation $n_2 l = m_1 n_1$ can be used to determine the different cases of $MDS_H$, defined by the symbol size $n_2$ or, equivalently, the number of parallel MDS mappings l. A similar approach can also be applied to the $MDS_L$ layer. However, in designing a (2m, m, m + 1)-code over GF($2^n$), the values of n and m are restricted such that $2m \le 2^n + 1$ [12].
The 128-bit ciphers Square, Rijndael, and Anubis can be regarded as iterations of 4-round nested SPNs where $n_1 = n_2 = 8$ and $m_1 = m_2 = 4$. The parameters of Hierocrypt (Type II) are selected as $n_1 = 8$, $n_2 = 4$, and $m_1 = m_2 = 4$.
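The relation $n_2 l = m_1 n_1$ together with the restriction $2m \le 2^n + 1$ can be turned into a small enumeration. The sketch below lists the candidate pairs $(n_2, l)$ for the higher-level MDS layer; the function name and the optional cap on the symbol size are assumptions introduced for illustration.

```python
def mds_h_cases(n1, m1, m2, max_symbol_bits=None):
    """List (n2, l) pairs with n2 * l = m1 * n1 and 2 * m2 <= 2**n2 + 1.

    max_symbol_bits optionally discards fields larger than GF(2^max_symbol_bits).
    """
    total_bits = n1 * m1
    cases = []
    for n2 in range(1, total_bits + 1):
        if total_bits % n2 != 0:
            continue  # n2 must divide the higher-level S-box size
        if 2 * m2 > 2 ** n2 + 1:
            continue  # no (2m, m, m+1)-code over such a small field
        if max_symbol_bits is not None and n2 > max_symbol_bits:
            continue
        cases.append((n2, total_bits // n2))
    return cases


all_cases = mds_h_cases(n1=8, m1=4, m2=4)
small_field_cases = mds_h_cases(n1=8, m1=4, m2=4, max_symbol_bits=8)
```

For $n_1 = 8$ and $m_1 = m_2 = 4$, capping the symbol size at 8 bits leaves the two choices $n_2 = 8$ (four parallel mappings, as in Rijndael) and $n_2 = 4$ (eight parallel mappings, as in Hierocrypt Type II).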
One Type of Feistel Networks
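As a typical form of block cipher, the Feistel network has been widely used and studied. Round i of a Feistel network is shown in Figure 2. In a Feistel network whose round function has an invertible linear transformation appended to parallel S-boxes, it is proved in [16] that the number of active S-boxes in any differential or linear characteristic of 4r rounds is lower bounded by $r \times B + \lfloor r/2 \rfloor$, where B is the branch number of the linear transformation. Therefore, we get:
Theorem 3: With the assumption that all S-box approximations involved in linear and differential cryptanalysis are independent, for 4r rounds of such a Feistel network the maximum differential characteristic probability is upper bounded by $p_s^{rB + \lfloor r/2 \rfloor}$ and the maximum linear characteristic probability is upper bounded by $q_s^{rB + \lfloor r/2 \rfloor}$.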
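The lower bound on active S-boxes for 4r rounds can be evaluated directly. The sketch below uses hypothetical function names and assumes that r/2 is rounded down, since the count of S-boxes is an integer.

```python
def feistel_min_active_sboxes(r, branch_number):
    """Lower bound on active S-boxes over 4r rounds: r*B + floor(r/2)."""
    return r * branch_number + r // 2


def feistel_security_bits(r, branch_number, sbox_bits):
    """log2(1/bound) over 4r rounds, where sbox_bits = log2(1/p_s) = log2(1/q_s)."""
    return feistel_min_active_sboxes(r, branch_number) * sbox_bits


# Eight rounds (r = 2) with a branch number of 9 and 8x8 S-boxes (p_s = 2^-6).
active_8_rounds = feistel_min_active_sboxes(2, 9)
bits_8_rounds = feistel_security_bits(2, 9, 6)
```

The bound is stated only for multiples of four rounds; intermediate round counts would need a separate argument.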
Comparison of Hardware Performance
It is normally hard to compare hardware performance among different block ciphers. The main problems are: 1) each implementation represents a tradeoff between area and delay; 2) the specific hardware cost of a gate network is dependent on the target technology; 3) ciphers may contain different security margins.
For the first problem, the classical delay-area product is used to evaluate the hardware complexity universally. The typical methods used in the hardware implementation of a block cipher include a round iterated design, a pipelined design, a loop-unrolled design, and a block parallel design [18] . For a given cipher, the delay-area product is kept roughly unchanged across the different design methods (except for a loop-unrolled design), assuming the control overhead for parallelism can be ignored. If a round iterated design is regarded as a reference, a k-block parallel design using several round iterated implementations will cost about k times the number of gates and result in about 1/k of the average time to produce an encrypted block. The same situation occurs in a pipelined design when each stage performs one or several rounds of the cipher. For loop unrolling, when k rounds are unrolled, the gate count will increase over an iterative design, but the average encryption time can be reduced. Loop unrolling usually results in low performance in the sense of the delay-area product.
For the second problem, a universal way is to assume that all gates have the same hardware cost [17]. Thus, the gate count and delay of all components are deduced from the upper bound of typical implementations. Such an approach leads to a measure of complexity which is technology-independent. However, in a particular target VLSI technology, the hardware costs of different gates may differ. In this case, it is possible to estimate the overall area (respectively, delay) by summing weighted gate counts (respectively, weighted gate layers traversed). The weights are proportional to the size (respectively, delay) of a gate and can be calculated by statistical comparison of hardware among gates based on a target technology. The hardware complexity is then evaluated by weighted area $A_W$ and weighted delay $D_W$:
$$A_W = \sum_u W_G(u)\, G(u), \qquad D_W = \sum_u W_D(u)\, D(u)$$
For gate type u, G(u) and $W_G(u)$ give the gate count and the weight of each gate. Along the critical path of the circuit, D(u) and $W_D(u)$ give the number of traversed gate layers of type u and the weight of each such layer.
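The weighted sums over gate types can be sketched as follows. The dictionaries and function names are hypothetical; the example figures (a 128-bit XOR key addition with one gate layer and a 385-NAND multiplexor with two gate layers) are the ones quoted for the nested SPN round circuit.

```python
def weighted_area(gate_counts, gate_weights):
    """Sum of W_G(u) * G(u) over gate types u; missing weights default to 1."""
    return sum(gate_weights.get(u, 1) * g for u, g in gate_counts.items())


def weighted_delay(layer_counts, layer_weights):
    """Sum of W_D(u) * D(u) over gate types u on the critical path."""
    return sum(layer_weights.get(u, 1) * d for u, d in layer_counts.items())


# Key addition plus multiplexor only (S-boxes and MDS layers omitted).
gates = {"XOR": 128, "NAND": 385}
layers = {"XOR": 1, "NAND": 2}

universal = (weighted_area(gates, {}), weighted_delay(layers, {}))
xor_weighted = (weighted_area(gates, {"XOR": 2}), weighted_delay(layers, {"XOR": 2}))
```

With all weights equal to one this reduces to plain gate and layer counts; doubling the XOR weight raises the area of this fragment from 513 to 641 and its delay from 3 to 4.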
For the problem caused by different security margins, we use a rule of thumb to determine resistance to differential and linear cryptanalysis. For differential cryptanalysis, the number of chosen plaintext pairs needed to attack a cipher is expected to be in the order of $1/P_d$, where $P_d$ is the maximum differential characteristic probability determined by Theorems 1 and 3. Similarly, to attack a cipher using linear cryptanalysis, the number of known plaintexts is expected to be in the order of $1/P_l$, where $P_l$ is the maximum linear characteristic probability.
Based on the above considerations, we define three hardware performance metrics $\eta_s$, $\eta_t$, and $\eta$ to measure the space, time, and overall performance, respectively. The three metrics integrate security and complexity. Each is a ratio in which the numerator is essentially a security measure in bits, $\log_2(1/P)$, and the denominator is a complexity measure: area for $\eta_s$, delay for $\eta_t$, and the delay-area product for $\eta$. Here $P = P_d$ for hardware performance in relation to differential attacks and $P = P_l$ in relation to linear attacks. Since we assume that the S-boxes in the discussed cipher structures satisfy $p_s = q_s$, the values of $\log_2(1/P_d)$ and $\log_2(1/P_l)$ are the same. For the nested SPNs and Feistel networks discussed in Section 2, $\log_2(1/P)$ is a linear function of the number of rounds. Therefore, the values of $\eta_s$, $\eta_t$, and $\eta$ indicate how much security is expected to be obtained for a specific hardware cost, regardless of the number of rounds in a cipher.
For a given design method, $\eta_s$ shows the security contribution provided by each area unit, and $\eta_t$ shows the security contribution provided by each delay unit. For a fast implementation such as a pipelined or parallel design, a high $\eta_s$ means that many independent blocks can be processed simultaneously. For a round iterated design, a high $\eta_t$ means that the encryption time for a block is small. More generally, using the classical delay-area product as its denominator, $\eta$ indicates the performance integrating both the delay and area complexities.
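A minimal sketch of metrics of this form is given below. It assumes the simplest reading of the description, namely security in bits divided by weighted area, by weighted delay, or by their product, with the probability and the costs supplied by the caller for the same number of rounds. Any additional normalization, the function names, and the sample figures are assumptions.

```python
import math


def security_bits(P):
    """Security measure in bits: log2(1/P)."""
    return math.log2(1 / P)


def eta_s(P, area):
    """Space performance: security bits per unit of (weighted) area."""
    return security_bits(P) / area


def eta_t(P, delay):
    """Time performance: security bits per unit of (weighted) delay."""
    return security_bits(P) / delay


def eta(P, area, delay):
    """Overall performance: security bits per unit of delay-area product."""
    return security_bits(P) / (area * delay)


# Placeholder figures for illustration only.
P = 2.0 ** -150
example = (eta_s(P, 1000), eta_t(P, 10), eta(P, 1000, 10))
```

Comparisons between cipher cases are only meaningful when the area and delay figures are measured in the same way for every case.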
The cases that we compare in the following sections are generated as 128-bit block ciphers defined by the nested SPN and Feistel networks. To calculate the gate count and number of gate layers per round, we consider the construction of the combinational circuits of the round structure with S-box and MDS mapping components which can produce high efficiencies in hardware. The hardware design and optimization of these components are described in [6] . The detailed data used in the complexity estimation is presented in the Appendix.
Hardware Performance of Nested SPNs
A set of nested SPNs can be generated with appropriate configurations of parameterized $MDS_L$, $MDS_H$, and S-boxes. As Theorem 2 illustrates, the MDS mapping defined over a large Galois field can be simplified using several mappings in a smaller Galois field. Table 1 lists the cases of nested SPNs in 12 categories (labelled as N1 to N12) defined by the S-boxes and $MDS_L$. Thus, the cases within a category only differ in the simplification of $MDS_H$. Each case can be regarded as a 128-bit cipher, after a particular key schedule is defined. Due to the difficulty of finding optimized MDS mappings, the cases with a Galois field larger than GF($2^8$) are not considered.
In relation to real ciphers, Case N4-a includes Square, Rijndael, and Anubis. Type II of Hierocrypt belongs to Case N4-b with a simplified $MDS_H$ over GF($2^4$). Similar to SHARK [13] and Khazad, Case N8 is a one-level SPN. However, SHARK and Khazad are 64-bit ciphers because their MDS mappings are based on a (16, 8, 9)-code over GF($2^8$).
From the viewpoint of implementation, a nested SPN follows the iterative dataflow of key addition, one S-box layer, and an MDS mapping layer (either $MDS_L$ or $MDS_H$). Since the S-boxes account for most of the hardware complexity, a 128-bit multiplexor selects $MDS_L$ or $MDS_H$ dynamically so that only one layer of S-boxes is needed in a round iterated design. So assuming a round iterated implementation, the round circuit used for each case in Table 1 includes a 128-bit key addition, one layer of S-boxes, $MDS_L$, $MDS_H$, and a 128-bit multiplexor. The 128-bit multiplexor can be implemented by 385 NAND gates (i.e., $y = x_1 \cdot c + x_2 \cdot \bar{c}$, where c is the select signal and "+" denotes OR).
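One way to arrive at 385 NAND gates is three NAND gates per data bit plus a single shared NAND that inverts the select signal (3 × 128 + 1). This decomposition is an assumption made for illustration, as are the function names; the sketch below models it at the bit level.

```python
def nand(a, b):
    """Two-input NAND on single bits (0 or 1)."""
    return 1 - (a & b)


def mux_bit(x1, x2, c, c_bar):
    """y = x1*c + x2*c_bar built from three NAND gates."""
    return nand(nand(x1, c), nand(x2, c_bar))


def mux_word(x1_bits, x2_bits, c):
    """Select x1_bits when c = 1, x2_bits when c = 0."""
    c_bar = nand(c, c)  # one shared NAND used as an inverter
    return [mux_bit(a, b, c, c_bar) for a, b in zip(x1_bits, x2_bits)]


def mux_nand_count(width):
    """Three NANDs per bit plus one shared inverter."""
    return 3 * width + 1


word_a = [1, 0, 1, 1]
word_b = [0, 0, 1, 0]
picked_a = mux_word(word_a, word_b, 1)
picked_b = mux_word(word_a, word_b, 0)
```

In this arrangement each data bit passes through two NAND layers, and the select inverter is shared by all 128 bits.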
For the main components and the iterative round structures of each SPN, Table A-1 in the Appendix lists their gate counts and delays in gate layers. Although each individual value in Table A-1 cannot be perfectly accurate, the comparison of these measures does enable us to distinguish the cases which are more efficient in hardware. Figure 3 shows the tendency of the universal performance comparison when $W_G(u) = W_D(u) = 1$ for any gate type u (i.e., all gates are assumed to have the same hardware cost). In an ASIC design, XOR gates are more expensive than other gates such as NOT, AND, and OR gates. Figure 4 shows a weighted performance comparison when $W_G(\mathrm{XOR}) = W_D(\mathrm{XOR}) = 2$ and the weight for all other gates is one. The two figures show a similar tendency in the performance comparison:
• The size of the S-box largely determines the space and time performance. Using small S-boxes tends to cost less hardware area, but more delay, than using large S-boxes. Given a fixed chip area, the cipher cases using small S-boxes are more advantageous for pipelined or parallel implementations.
• Many SPN structures (N1-N10, N11-N12) are essentially equivalent with respect to their hardware performance. Hence, it is wise for a cipher designer to consider those structures which can facilitate software implementations.
• When the symbol size is 8 bits or less, the simplification of MDS mappings through concatenation does not significantly improve the performance when the MDS mappings have been selected to be optimized for hardware. For example, Case N4-b in Table 1 gains little in hardware over Case N4-a.
• When $m_1$ or $m_2$ is very high, the MDS mapping determined by $m_1$ or $m_2$ (e.g., $MDS_H$ in the cases of N9 and N10) will cost much more hardware and overwhelm the S-box costs, which degrades the cipher performance.
• As a cipher of Case N4-a, Rijndael is very suitable for a round iterated design. However, its suitability for pipelined or parallel implementations is not as high as cipher cases using 4×4 S-boxes such as cases of N11 and N12.
The above conclusions are based on hardware complexity and security against differential and linear attacks. For some other attacks such as the Square attack, the effectiveness decreases significantly after a certain number of rounds. In this circumstance, a separate performance metric of the round structure is used.
Since the security in bits to resist these attacks increases very rapidly with the number of rounds, with a trend much steeper than for differential and linear attacks as more rounds are appended, we take a fixed number of rounds (e.g., about 8 for the Square attack on Rijndael) as enough for security. The comparison of round performance is also included in Figures 3 and 4. The nested SPNs with small S-boxes and modestly sized $MDS_L$ and $MDS_H$ have significantly better performance in relation to the Square attack than the other cases.
Hardware Performance of Feistel Networks
The Feistel network discussed in this section is limited to the subset described in Section 2.4, which has an SPN round function. To construct a typical 128-bit cipher, such a Feistel network has a 64-bit F-function which contains sixteen 4×4 or eight 8×8 parallel S-boxes followed by an MDS mapping layer. As listed in Table 2, six categories (labelled as F1 to F6) of these 128-bit Feistel networks can be generated. To ensure a good avalanche effect, an appropriate fixed permutation of MDS symbols after the MDS mapping is expected, which does not cost any gates. The hardware of one round of the cipher includes a 64-bit XOR for round key addition, one layer of S-boxes, one MDS mapping, and another 64-bit XOR appended to the output of the F-function. The cases of the same category in Table 2 only differ in the simplification of the MDS mapping. The performance comparison in Figures 5 and 6 indicates the following (refer to the Appendix for detailed data):
• It is useful to pick an MDS mapping that has a large branch number (i.e., m + 1). The cases with such an MDS mapping have significantly higher values in all three performance measures.
• With high η t values, the cases with 8 × 8 S-boxes demonstrate high performance in non-pipelined and non-parallel implementations. With high η s values, the cases with 4×4 S-boxes demonstrate high performance in pipelined and parallel implementations because many independent blocks can be processed simultaneously.
Camellia is a 128-bit Feistel cipher with a 64-bit round function which consists of eight 8×8 invertible S-boxes and a linear transformation. Hence, Camellia is similar to our discussed Feistel networks but does not use an MDS mapping. The branch number of the Camellia linear transformation is 5. An efficient implementation of such a linear transformation costs 176 two-input XOR gates and a delay of 3 gate layers in the universal comparison. Thus, Camellia has performance similar to Case F2-a, which has 264 XOR gates and a delay of 3 gate layers (see Table A-2 in the Appendix). Compared with Case F3-a, Camellia has a slightly more compact round structure (i.e., about 5% less in gate count than Case F3-a). However, each round of Camellia contributes much less to the security. Eleven rounds of F3-a provide security equivalent to nineteen rounds of Camellia. Further calculation shows that the overall hardware performance of F3-a is about 50% higher than that of Camellia. The weighted performance comparison follows a similar trend.
Synthesis Results
The above performance analysis is based on theoretical evaluation of hardware complexity. The usability of these analytical results can be verified when a specific VLSI technology is targeted. To avoid the arduous work of synthesizing each cipher case, we performed a high-level synthesis of each component used in Tables 1 and 2. The components are coded in VHDL and synthesized with Synopsys Design Compiler. Two CMOS libraries were used, in which most standard cells have one or two bitwise inputs.
During synthesis, if the minimum area (respectively, delay) is set as the main constraint, the numbers of equivalent gates (respectively, critical delay times) of 8×8 S-boxes are close to their estimates in Tables A-1 and A-2. The gates and delays of 4×4 S-boxes are slightly less than their estimates because it is much easier for CAD tools to simplify smaller S-boxes. This effect indicates that the performance advantage of using small S-boxes is at least as large after synthesis as the estimates suggest. Since the MDS mapping is implemented in XOR gates, the areas and delays closely follow the proportional relation of their estimates in Tables A-1 and A-2. Because XOR gates are larger and slower than other gate types, synthesis tools may replace them with other gates such as XNORs during optimization. Nevertheless, the delays and numbers of equivalent gates imply that a weight of 2 is reasonable for an XOR gate. This effect makes the cases with large MDS mappings worse in weighted performance, e.g., the cases in N8 to N12, F5, and F6. This problem is encountered in realizations where a large percentage of the gates are XORs. The weighted performance shown in Figures 4 and 6 is thus more useful for a closer comparison than the universal method.
Conclusions
In this paper we have considered two cipher structures composed of S-boxes and MDS mappings. Various cipher cases are generated from these structures with different component configurations. Their security and complexity are examined and integrated into performance metrics.
In hardware, the discussed cipher cases using large S-boxes are suitable for non-pipelined and non-parallel applications where delay is the main design criterion; however, in pipelined and parallel applications, the cipher cases using small S-boxes produce high performance. Further, appropriate selection of an MDS mapping layer is important for security against differential and linear attacks.
Compared with Feistel networks, the nested SPNs generally have higher hardware performance. When the same S-boxes are used, a nested SPN tends to be more efficient in hardware to resist differential and linear attacks. Considering the threat of Square attacks, nested SPNs with smaller S-boxes are preferred. For a Feistel network, more rounds are needed to be secure against differential and linear attacks. With little change in the linear transformation, a suggestion is made to improve Camellia in terms of security and hardware efficiency.
In line with a nested SPN, MISTY [19] can be regarded as a nested Feistel network. Using provable security as the security measure, it will be interesting future work to compare the hardware performance between these two nested structures with similar performance metrics defined in Section 3.
Appendix: Complexity Evaluation of Cipher Components
In hardware, the complexity of S-boxes is evaluated through the simplification results deduced from an encoder-switch-decoder model [6]. In this model, S-boxes are composed of low complexity gates (ANDs, ORs, and NOTs). A 4×4 S-box can be implemented using 50 gates and produces a delay of 6 gate layers; an 8×8 S-box can be implemented using 806 gates and produces a delay of 11 gate layers. Involution MDS codes [4] are found by searching Hadamard matrices and have been optimized for hardware [6]. MDS codes are composed of XORs. The evaluated hardware costs of S-boxes, MDS mappings, and round structures are listed in Tables A-1 and A-2. Using these results, the complexity of each 128-bit 2-level nested SPN is evaluated for each round. The hardware of one round of the SPN includes a 128-bit key addition layer, an S-box layer, two MDS mappings at different levels, and a 128-bit multiplexor. The 128-bit multiplexor selects $MDS_L$ and $MDS_H$ alternately in consecutive rounds, which costs 385 NAND gates and a delay of two gate layers. The key addition costs 128 XOR gates and a delay of one gate layer. The calculation of the delay per round assumes the higher of the delays of $MDS_L$ and $MDS_H$.
The hardware of one round of the Feistel network includes a 64-bit key addition layer, an S-box layer, an MDS mapping layer, and a 64-bit XOR after the F-function (as shown in Figure 2(a)). The key addition costs 64 XOR gates and a delay of one gate layer. The XOR after the F-function has the same hardware complexity as the key addition.
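The per-round bookkeeping described in this appendix can be sketched as below, using unweighted gate counts. The S-box, key addition, and multiplexor figures are the ones stated above; the MDS costs are parameters, and the values passed for the nested SPN example are placeholders, not entries from Tables A-1 or A-2. The function names are hypothetical, and the delays are assumed to add up along a single critical path.

```python
# (gate count, gate layers) per S-box, as stated in the Appendix.
SBOX_COST = {4: (50, 6), 8: (806, 11)}


def sbox_layer(width_bits, sbox_bits):
    """Gate count and delay of one layer of parallel S-boxes."""
    gates, layers = SBOX_COST[sbox_bits]
    return (width_bits // sbox_bits) * gates, layers


def spn_round_cost(sbox_bits, mds_l, mds_h):
    """One round of the 128-bit nested SPN; mds_l and mds_h are (gates, layers)."""
    s_gates, s_layers = sbox_layer(128, sbox_bits)
    gates = 128 + s_gates + mds_l[0] + mds_h[0] + 385
    layers = 1 + s_layers + max(mds_l[1], mds_h[1]) + 2
    return gates, layers


def feistel_round_cost(sbox_bits, mds):
    """One round of the 128-bit Feistel network; mds is (gates, layers)."""
    s_gates, s_layers = sbox_layer(64, sbox_bits)
    gates = 64 + s_gates + mds[0] + 64
    layers = 1 + s_layers + mds[1] + 1
    return gates, layers


# Eight 8x8 S-boxes with a 176-XOR, 3-layer linear transformation
# (the figures quoted for Camellia's linear transformation).
feistel_example = feistel_round_cost(8, (176, 3))

# Placeholder MDS costs, for illustration only.
spn_example = spn_round_cost(4, (100, 3), (200, 4))
```

Weighted figures can be obtained in the same way by multiplying the XOR contributions by the chosen XOR weight before summing.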
