Abstract. This work proposes a compact implementation of the AES S-box using composite field arithmetic in $GF(((2^2)^2)^2)$.
Introduction
After an open competition that ended in 2000, the National Institute of Standards and Technology (NIST) selected the Rijndael block cipher as the new Advanced Encryption Standard (AES) [1]. The AES algorithm, designed by Joan Daemen and Vincent Rijmen, has an SPN (Substitution-Permutation Network) structure. Its use is mandatory for the encryption of sensitive but unclassified US government information; in 2003 the US government announced that it can also be used for encrypting secret and top secret information (for the latter category, key lengths of at least 192 bits must be used). AES is currently replacing the Data Encryption Standard as the worldwide standard algorithm.
Since 2000, extensive research has been performed on AES implementations. In this article we focus on compact hardware implementations for mobile devices and smart cards, but our results can also be applied in high-speed pipelined implementations for network security and e-commerce applications. Note that the best known software implementations achieve about 15 cycles/byte on a modern PC.
Design challenges for AES mainly lie in exploring all the options for the S-box design. The most common strategy to reduce the gate complexity consists of exploiting composite field arithmetic. Following that approach, one still has several options to represent the finite field $GF(2^8)$. In this paper we represent $GF(2^8)$ as the composite field $GF(((2^2)^2)^2)$. In this way, we reduce the arithmetic in $GF(2^8)$ to operations in smaller fields. There are many ways to construct $GF(((2^2)^2)^2)$. Choices need to be made with respect to the irreducible polynomials that are used to create the extension fields and with respect to the transformation matrices that map elements from one representation to the other. Exploring these two degrees of freedom, we optimize the S-box of Satoh et al. [17], which is to our knowledge the most compact implementation to date. Another area-efficient implementation is that of Wolkerstorfer et al. [20]. More precisely, according to Daemen and Rijmen [6], the gate counts of the implementations of [17] and [20] are 5.4 and 5.7 kgates respectively.
The remainder of this paper is organized as follows. In Sect. 2 some details on the AES algorithm are discussed. Section 3 lists previous work on hardware implementations of Rijndael. In Sect. 4 we explain our approach to minimize the area of the S-box and compare our new solution with the S-box of Satoh. Section 5 concludes the paper and outlines future work.
The AES Algorithm
Rijndael has a variable block and key length which can be 128, 192 or 256 bits; the AES standard includes only block lengths of 128 bits. In this implementation we focus on the 128-bit key version of AES which has 10 rounds. In this case, each round and the initial stage require a 128-bit round key. In total 10 sets of round keys are generated from the secret key by using the S-box. The input data is arranged as a table i.e., a matrix of bytes. Figure 1 outlines the basic structure of the algorithm. The round transformation consists of four different transformations: ByteSub, ShiftRow, MixColumn and AddRoundKey. They are performed in this order with the exception of the final round which is slightly different. All transformations are based on byte-oriented arithmetic and AddRoundKey is a bitwise XOR operation. The transformations operate on the intermediate result, which is called the State.
The ByteSub transformation is a non-linear byte substitution, also called the S-box (substitution table). It operates on each byte independently. The S-box is invertible and consists of the following two transformations:
- the multiplicative inverse in $GF(2^8)$, with the zero element mapped to itself,
- an affine transformation over $GF(2)$.
The schematic of the complete AES algorithm is shown in Fig. 1. Further details on the AES algorithm can be found in [4, 5].
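As a point of reference, the S-box can be sketched in software as an inversion in $GF(2^8)$ (with 0 mapped to 0) followed by the AES affine transformation. This is only a minimal illustration, not the hardware design studied in this paper; the function names `gf_mul`, `gf_inv`, `affine` and `sbox` are our own illustrative choices, while the reduction polynomial $x^8 + x^4 + x^3 + x + 1$ and the constant 0x63 are those of the AES standard.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def gf_inv(a):
    """Inverse as a^254; this also maps 0 to 0, as the S-box requires."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r


def affine(a):
    """AES affine map: b_i = a_i + a_(i+4) + a_(i+5) + a_(i+6) + a_(i+7) + c_i."""
    out = 0
    for i in range(8):
        bit = (a >> i) ^ (a >> ((i + 4) % 8)) ^ (a >> ((i + 5) % 8)) \
            ^ (a >> ((i + 6) % 8)) ^ (a >> ((i + 7) % 8)) ^ (0x63 >> i)
        out |= (bit & 1) << i
    return out


def sbox(a):
    return affine(gf_inv(a))


table = [sbox(a) for a in range(256)]
print(hex(table[0x00]), hex(table[0x01]))
```

A table-based implementation stores all 256 entries of `table`; the composite field approach discussed in this paper instead computes `gf_inv` with logic over smaller fields.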
Previous Work
Many hardware architectures for Rijndael have been proposed as either ASIC [11, 12, 19] or FPGA implementations [2, 3, 7, 8, 10, 14, 18]. Most of the known implementations, particularly the early ones, were quite simple and not small enough, as they did not exploit composite field arithmetic. Among those who tried to produce a really small circuit we mention the work of Satoh et al. [17] and Wolkerstorfer et al. [21]. In [16] the use of the composite field $GF((2^4)^2)$ was also proposed, but no hardware implementation was presented.
Satoh et al. used the composite field $GF(((2^2)^2)^2)$, which resulted in an optimized S-box. More precisely, their S-box requires less than 1/4 of the size of one using a look-up table. This resulted in a compact AES implementation with a gate complexity of 5.4 kgates. To our knowledge this is the most compact architecture so far. Wolkerstorfer et al. used the composite field $GF((2^4)^2)$; arithmetic in $GF((2^4)^2)$ involves only operations in $GF(2^4)$, which are easily computed using combinational logic. Macchetti and Bertoni [13] have described an ASIC implementation for the same composite field $GF((2^4)^2)$, but with a representation as given in [16]. The work of Chodowiec and Gaj [3] also offers a compact design that targets low-cost embedded applications. They used dedicated Block RAMs for the implementation of the S-boxes. Recently, Wu et al. [22] reported area and delay reductions of 1/6 and 1/4 respectively compared to [21]. Their approach uses dual AES in combination with a composite field.
Here we use the composite field $GF(((2^2)^2)^2)$, which was only explored by Satoh et al. [17]. By a systematic exploration of all options we show that Satoh's S-box is at least 5% away from an optimal solution. The implementation of this optimal solution and the approach we use to explore all design possibilities are explained in the next section.
Hardware Implementation
In this section we examine the S-box of Satoh et al. [17] and we try to optimize it for area. Section 4.1 describes the approach we used to optimize Satoh's S-box. Section 4.2 presents our new S-box. Implementation results and comparison with Satoh's S-box are given in Sect. 4.3.
Theoretical Approach to Optimize the S-Box
Let us view $GF(2^{2m})$ as a field extension of degree 2 over $GF(2^m)$. The field $GF(2^{2m})$ is generated as an extension field of $GF(2^m)$ using an irreducible polynomial, say
$f(x) = x^2 + x + \lambda$, with $\lambda \in GF(2^m)$,
where $\omega$ is a root of $f(x)$, and $GF(2^{2m})$ can be viewed as a two-dimensional vector space over $GF(2^m)$. Hence, an arbitrary element $\Delta \in GF(2^{2m})$ can be written as $\Delta = \delta_1 \omega + \delta_0$, where $\delta_1, \delta_0 \in GF(2^m)$. We want to calculate the inverse of $\Delta$, i.e. $\Delta^{-1}$ such that $\Delta \cdot \Delta^{-1} = 1$. The multiplicative inverse of $\Delta \in GF(2^{2m})$ can therefore be computed as:
$\Delta^{-1} = (\delta_1 \omega + (\delta_0 + \delta_1)) \, (\delta_0 (\delta_0 + \delta_1) + \lambda \delta_1^2)^{-1}$ (1)
This equation consists of operations which can be performed in the subfield $GF(2^m)$ [9].
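The step behind Equation (1) can be worked out as follows. This is a sketch under the assumption that the defining polynomial has the form $f(x) = x^2 + x + \lambda$ with $\lambda \in GF(2^m)$, which matches the choice $p_1 = 1$, $p_0 = \lambda$ made later in this section. The second root of $f$ is then $\omega + 1$, and multiplying $\Delta$ by its conjugate $\bar{\Delta}$ (a symbol introduced here for illustration) gives an element of the subfield.

```latex
% Assumption: f(x) = x^2 + x + \lambda, hence \omega^2 = \omega + \lambda,
% and all additions are in characteristic 2.
\begin{align*}
\Delta &= \delta_1\omega + \delta_0, \qquad
\bar{\Delta} = \delta_1(\omega + 1) + \delta_0
             = \delta_1\omega + (\delta_0 + \delta_1), \\
\Delta\bar{\Delta}
 &= \delta_1^2\omega^2 + \delta_1(\delta_0 + \delta_1)\omega
    + \delta_0\delta_1\omega + \delta_0(\delta_0 + \delta_1) \\
 &= \delta_1^2(\omega + \lambda) + \delta_1^2\omega
    + \delta_0(\delta_0 + \delta_1) \\
 &= \delta_0(\delta_0 + \delta_1) + \lambda\delta_1^2 \;\in\; GF(2^m), \\
\Delta^{-1} &= \bar{\Delta}\,\bigl(\Delta\bar{\Delta}\bigr)^{-1}
 = \bigl(\delta_1\omega + (\delta_0 + \delta_1)\bigr)
   \bigl(\delta_0(\delta_0 + \delta_1) + \lambda\delta_1^2\bigr)^{-1}.
\end{align*}
```

The only inversion left is that of $\Delta\bar{\Delta}$, which lies in $GF(2^m)$; this is why the formula can be applied recursively down a tower of quadratic extensions.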
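Equation (1) can be used recursively to find the inverse in $GF(((2^2)^2)^2)$.
$GF(((2^2)^2)^2)$ is a field extension of degree 2 over $GF((2^2)^2)$ constructed using the irreducible polynomial $P(x) = x^2 + p_1 x + p_0$, with $p_1, p_0 \in GF((2^2)^2)$. Let us call a root of $P$ also $x$. $GF((2^2)^2)$ is a field extension of degree 2 over $GF(2^2)$ using the irreducible polynomial $Q(y) = y^2 + q_1 y + q_0$, with $y$ a root of the polynomial and $q_1, q_0 \in GF(2^2)$. $GF(2^2)$ is a field extension of degree 2 over $GF(2)$ using the irreducible polynomial $R(z) = z^2 + z + 1$, with root $z$. In Satoh et al. the following choices are made for the coefficients of the irreducible polynomials:
$p_1 = 1$, $p_0 = \lambda = (z + 1)y$, $q_1 = 1$, $q_0 = \varphi = z$.
With these choices, Equation (1) gives the inversion in $GF(((2^2)^2)^2)$, for $\Delta = \delta_1 x + \delta_0$ with $\delta_1, \delta_0 \in GF((2^2)^2)$:
$\Delta^{-1} = (\delta_1 x + (\delta_0 + \delta_1)) \, (\delta_0 (\delta_0 + \delta_1) + \lambda \delta_1^2)^{-1}$ (2)
and the inversion in $GF((2^2)^2)$, for $\Gamma = \gamma_1 y + \gamma_0$ with $\gamma_1, \gamma_0 \in GF(2^2)$:
$\Gamma^{-1} = (\gamma_1 y + (\gamma_0 + \gamma_1)) \, (\gamma_0 (\gamma_0 + \gamma_1) + \varphi \gamma_1^2)^{-1}$
Inversion in $GF(2^2)$ requires only one addition:
$(g_1 z + g_0)^{-1} = g_1 z + (g_1 + g_0)$, with $g_1, g_0 \in GF(2)$.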
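The recursion can be checked with a short software model. The sketch below assumes Satoh's choices $p_1 = q_1 = 1$, $q_0 = \varphi = z$ and $p_0 = \lambda = (z + 1)y$, and applies the inversion formula for a quadratic extension with defining polynomial $t^2 + t + c$ at each level. The bit encoding (high half is the coefficient of the root, low half the constant term) and all function names are our own assumptions, not part of any cited design.

```python
def make_mul(sub_mul, half, const):
    """Multiplication in a quadratic extension with t^2 = t + const."""
    mask = (1 << half) - 1

    def mul(a, b):
        a1, a0 = a >> half, a & mask
        b1, b0 = b >> half, b & mask
        hh = sub_mul(a1, b1)
        hi = hh ^ sub_mul(a1, b0) ^ sub_mul(a0, b1)
        lo = sub_mul(a0, b0) ^ sub_mul(const, hh)
        return (hi << half) | lo

    return mul


def make_inv(sub_mul, sub_inv, half, const):
    """Inverse via (d1 t + (d0 + d1)) * (d0 (d0 + d1) + const d1^2)^-1."""
    mask = (1 << half) - 1

    def inv(a):
        d1, d0 = a >> half, a & mask
        s = d0 ^ d1
        norm = sub_mul(d0, s) ^ sub_mul(const, sub_mul(d1, d1))
        n_inv = sub_inv(norm)          # the only inversion, in the subfield
        return (sub_mul(d1, n_inv) << half) | sub_mul(s, n_inv)

    return inv


def mul1(a, b):
    return a & b                       # GF(2)


def inv2(g):
    """GF(2^2): (g1 z + g0)^-1 = g1 z + (g1 + g0), a single XOR."""
    g1, g0 = g >> 1, g & 1
    return (g1 << 1) | (g1 ^ g0)


PHI = 0b10       # q0 = z
LAMBDA = 0b1100  # p0 = (z + 1) y

mul2 = make_mul(mul1, 1, 1)            # R(z) = z^2 + z + 1
mul4 = make_mul(mul2, 2, PHI)          # Q(y) = y^2 + y + phi
mul8 = make_mul(mul4, 4, LAMBDA)       # P(x) = x^2 + x + lambda
inv4 = make_inv(mul2, inv2, 2, PHI)
inv8 = make_inv(mul4, inv4, 4, LAMBDA)

# Every non-zero element times its inverse must give 1.
print(all(mul8(a, inv8(a)) == 1 for a in range(1, 256)))
```

Because the norm of the zero element is zero, `inv8(0)` returns 0 without any special case, which is the behaviour the S-box needs.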
The inversion in $GF(2^8)$ is thus finally decomposed into operations in $GF(2^2)$. Therefore a transformation is needed to map a representation in $GF(2^8)$ to a representation in $GF(((2^2)^2)^2)$. In [15], Paar explains how a matrix can be created to perform this transformation. Different choices for the irreducible polynomials $P(x)$ and $Q(y)$ lead to different transformation matrices. For every combination of $P(x)$ and $Q(y)$ there are 8 possibilities for the transformation matrix. For hardware implementations, the most area-efficient transformation matrix is the one with the fewest '1' entries, because this number determines the XOR gate count for the transformation. After performing the inversion using the $GF(((2^2)^2)^2)$ representation, we need to go back to the $GF(2^8)$ representation using the inverse of the transformation matrix. This matrix can be combined with the affine transformation matrix at the end of the S-box.
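The origin of the 8 possibilities can be illustrated in software: every root $\beta$ of the AES polynomial $x^8 + x^4 + x^3 + x + 1$ in the composite field defines one transformation matrix, whose columns are the powers $\beta^0, \ldots, \beta^7$. The sketch below assumes Satoh's polynomial coefficients and our own bit encoding of composite field elements, so the '1' counts it prints depend on that encoding and need not coincide with those reported in Table 1. All names are illustrative.

```python
def make_mul(sub_mul, half, const):
    """Multiplication in a quadratic extension with t^2 = t + const."""
    mask = (1 << half) - 1

    def mul(a, b):
        a1, a0 = a >> half, a & mask
        b1, b0 = b >> half, b & mask
        hh = sub_mul(a1, b1)
        return ((hh ^ sub_mul(a1, b0) ^ sub_mul(a0, b1)) << half) \
            | (sub_mul(a0, b0) ^ sub_mul(const, hh))

    return mul


mul2 = make_mul(lambda a, b: a & b, 1, 1)  # z^2 = z + 1
mul4 = make_mul(mul2, 2, 0b10)             # y^2 = y + z
mul8 = make_mul(mul4, 4, 0b1100)           # x^2 = x + (z + 1) y


def aes_mul(a, b):
    """Multiplication in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def tower_pow(a, n):
    r = 1
    for _ in range(n):
        r = mul8(r, a)
    return r


# Roots of the AES polynomial, evaluated with composite field arithmetic.
roots = [b for b in range(256)
         if tower_pow(b, 8) ^ tower_pow(b, 4) ^ tower_pow(b, 3) ^ b ^ 1 == 0]


def matrix_columns(beta):
    """Column i is the image of the basis element x^i of GF(2^8)."""
    return [tower_pow(beta, i) for i in range(8)]


def transform(cols, a):
    out = 0
    for i in range(8):
        if (a >> i) & 1:
            out ^= cols[i]
    return out


def count_ones(cols):
    return sum(bin(c).count("1") for c in cols)


for beta in roots:
    print(f"root {beta:#04x}: {count_ones(matrix_columns(beta))} ones")
```

Each of these matrices is a field isomorphism, so any of them gives a correct S-box; they differ only in the number of '1' entries and hence in XOR cost.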
We stick to the choice of Satoh et al. to make $p_1 = q_1 = 1$ and $q_0 = \varphi = z$. Based on (2) and the fact that the transformation matrix depends on $P(x)$ and $Q(y)$, we conclude that the hardware complexity of the circuit depends on the choice of $p_0 = \lambda$. That is why we explored all values of $\lambda$ to determine the most compact solution for the S-box. There are 8 choices for $\lambda$. The two elements that determine the hardware complexity of the circuit are:
- the number of gates in the constant multiplication with $\lambda$ in $GF((2^2)^2)$,
- the number of '1' entries in the transformation matrix and in the combination of the inverse transformation matrix with the affine transformation matrix.
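The count of 8 choices for $\lambda$ can be reproduced by enumeration: with $p_1 = 1$, the polynomial $x^2 + x + \lambda$ is irreducible over $GF((2^2)^2)$ exactly when it has no root there. The sketch below assumes $q_1 = 1$, $q_0 = z$ and our own bit encoding, and also builds the 4 x 4 binary matrix of the constant multiplication with $\lambda$. The XOR count it reports is the naive one (ones minus rows), before any sharing of common terms, so it is an upper bound rather than the optimized figure a synthesis tool would reach.

```python
def make_mul(sub_mul, half, const):
    """Multiplication in a quadratic extension with t^2 = t + const."""
    mask = (1 << half) - 1

    def mul(a, b):
        a1, a0 = a >> half, a & mask
        b1, b0 = b >> half, b & mask
        hh = sub_mul(a1, b1)
        return ((hh ^ sub_mul(a1, b0) ^ sub_mul(a0, b1)) << half) \
            | (sub_mul(a0, b0) ^ sub_mul(const, hh))

    return mul


mul2 = make_mul(lambda a, b: a & b, 1, 1)  # z^2 = z + 1
mul4 = make_mul(mul2, 2, 0b10)             # y^2 = y + z

# lambda is admissible if r^2 + r + lambda has no root r in GF((2^2)^2).
valid_lambdas = [lam for lam in range(16)
                 if all(mul4(r, r) ^ r ^ lam != 0 for r in range(16))]


def const_mul_columns(lam):
    """Column i is lambda times the i-th basis element (1, z, y, zy)."""
    return [mul4(lam, 1 << i) for i in range(4)]


def naive_xor_count(cols, rows=4):
    """One XOR per '1' entry beyond the first in each row."""
    return sum(bin(c).count("1") for c in cols) - rows


for lam in valid_lambdas:
    print(f"lambda = {lam:04b}: naive XORs = "
          f"{naive_xor_count(const_mul_columns(lam))}")
```

In this encoding Satoh's $\lambda = (z + 1)y$ is `0b1100` and $\lambda = zy$ is `0b1000`; both appear in `valid_lambdas`.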
For every $\lambda$, Table 1 gives the number of 2-input XORs for the constant multiplication and the total number of '1' entries for every option of the transformation matrix. Out of the 8 possible transformation matrices for every $\lambda$, the one that gives the lowest total number of '1' entries is given in the last column. The values in the table are depicted in Fig. 2. From Table 1 and Fig. 2 we conclude that the solution of Satoh et al., which has $\lambda = (z + 1)y$, uses the most area-efficient constant multiplication. The transformation matrix they chose gives a total number of '1' entries equal to 61. Their implementation can be made more efficient by choosing the most compact transformation matrix, which leads to a total number of '1' entries equal to 59. The optimal solution ("best case"), however, changes the implementation further by taking $\lambda = zy$. The constant multiplication then requires only 1 XOR more than Satoh's constant multiplication, but the total number of '1' entries in the matrices is reduced by 5, from 59 to 54. On the other hand, the design with a maximized gate count ("worst case") uses 5 XOR gates for the constant multiplication and 71 '1' entries in the matrices. The implementation of the new optimized S-box is explained in the next section. Implementation results for Satoh's S-box, the "best case" and the "worst case" are given in Sect. 4.3.
Fig. 2. Plot of the values in Table 1. The arrows point at Satoh's S-box, the S-box with minimized gate count ("best case") and the S-box with maximized gate count ("worst case").
Implementation of the New Optimized S-Box
Figure 3 shows the structure of the S-box implementation. The transformation used here is an 8 × 8 binary matrix [transformation matrix] relating the representations in $GF(2^8)$ and $GF(((2^2)^2)^2)$ respectively. This results in the following combination of the inverse transformation with the affine transformation: [combined 8 × 8 matrix]
The total number of '1' entries in both 8 × 8 matrices is equal to 54. The addition of the column vector in the affine transformation is fixed and hence does not have to be considered for optimization. The number of '1' entries in the matrices in Satoh's implementation is equal to 61. When the matrices are implemented in a straightforward way, the number of XORs equals the number of '1' entries minus the number of rows in the matrices (16 in total). This leads to XOR gate counts of 38 and 45 for our and Satoh's S-box respectively, i.e. a reduction of 7 XOR gates. By finding common terms in the XOR equations and applying some rules of logic, it is possible to reduce the number of XOR gates further. We leave this to a synthesis tool and give results on the final implementations in Sect. 4.3.
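The straightforward count can be written down directly: a row with k '1' entries needs k - 1 two-input XOR gates. The sketch below applies this to the standard AES affine matrix as a stand-in example, since the matrices of this design are not reproduced here, and then redoes the arithmetic for the totals of 54 and 61 quoted above. The function name `naive_xor_count` is illustrative.

```python
def naive_xor_count(matrix):
    """2-input XORs needed when every output bit is computed independently."""
    return sum(max(sum(row) - 1, 0) for row in matrix)


# Stand-in example: the AES affine matrix, where output bit i depends on
# input bits i, i+4, i+5, i+6 and i+7 (indices modulo 8).
affine = [[1 if (j - i) % 8 in (0, 4, 5, 6, 7) else 0 for j in range(8)]
          for i in range(8)]
print("affine matrix alone:", naive_xor_count(affine))

# Totals from the text: two 8 x 8 matrices, hence 16 rows in total.
ROWS = 16
for label, ones in (("optimized S-box", 54), ("Satoh's S-box", 61)):
    print(label, ones - ROWS)
```

This count ignores sharing of common sub-expressions between rows, which is why it is only an upper bound on what synthesis can achieve.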
The inversion in $GF(((2^2)^2)^2)$ has many levels of hierarchy. At the highest level the architecture looks the same as Satoh's architecture (see Fig. 4). At the next level of hierarchy, the only difference with Satoh's design is the implementation of the constant multiplication with $\lambda$. Figure 5 gives the gate-level implementation of both Satoh's (top) and our (bottom) constant multiplication. As can be seen, our constant multiplication requires one extra XOR gate compared to Satoh's implementation. We can summarize this section by stating that, in a straightforward implementation, our S-box would require 6 XOR gates fewer than Satoh's S-box (7 fewer for the implementation of the matrices and 1 extra for the multiplication with $\lambda$). The implementation results of both approaches after synthesis are compared in the next section. To show the upper bound of the gate complexity, we also implemented the "worst case" S-box and included it in the comparison.
Implementation Results and Comparison
We implemented both our new optimized S-box and Satoh's S-box using a 0.18 µm CMOS standard cell library. To show that the area is sensitive to the choice of the polynomials and the transformation matrix we also implemented the "worst case" S-box (with maximized gate count). All three implementations are synthesized with a rather slow target delay of 10 ns. Table 2 gives the number of gates (in equivalent number of 2-input NAND gates) for all designs. Satoh et al. implemented their S-box using a 0.11 µm CMOS standard cell library, which resulted in 294 gates with a delay of 3.69 ns. This corresponds to 286 gates in 0.18 µm CMOS with a maximum delay of 10 ns. The table shows our new S-box has a 5% area reduction compared to Satoh's S-box. This is equivalent to the expected reduction of 6 XOR gates. The table also shows that a bad choice for the polynomials and the transformation matrix can lead to an area enlargement of 9%.
Conclusions and Future Work
We explored various options for low gate counts in the design of the AES S-box. We used the architecture of Satoh as a reference and we showed that it is 5% away from an optimal solution. Furthermore, we proved that the "worst case" S-box leads to a 9% area increase. We optimized Satoh's S-box by choosing the irreducible polynomial and transformation matrix that lead to the most compact solution.
There exists a possibility that a more compact S-box can be achieved by choosing an irreducible polynomial $P(x) = x^2 + p_1 x + p_0$ with $p_1 \neq 1$. The high-level architecture of Satoh does not stay the same in this case. Furthermore, in this work we only considered the gate complexity of the S-box used for encryption. The same strategy can be applied to the decryption operation as well.
