In this article, we propose a new comparison metric, the figure of adversarial merit (FOAM), which combines the inherent security provided by cryptographic structures and components with their implementation properties. To the best of our knowledge, this is the first such metric proposed to ensure a fairer comparison of cryptographic designs. We then apply this new metric to meaningful use cases by studying Substitution-Permutation Network permutations that are suited for hardware implementations, and we provide new results on hardware-friendly cryptographic building blocks. For practical reasons, we considered linear and differential attacks and we restricted ourselves to fully serial and round-based implementations. We explore several design strategies, from the geometry of the internal state to the size of the S-box, the field size of the diffusion layer or even the irreducible polynomial defining the finite field. We finally test all possible strategies to provide designers with an exhaustive approach to building hardware-friendly cryptographic primitives (according to area or FOAM metrics), also introducing a model for predicting the hardware performance of round-based or serial-based implementations. In particular, we exhibit new diffusion matrices (circulant or serial) that are surprisingly more efficient than the best currently known, such as the ones used in AES, LED and PHOTON.
Introduction
RFID is a rising technology that is likely to be widely deployed in many different situations of everyday life, leading to new security challenges that the cryptography community has to handle. Significant advances in this area have already been obtained. In particular, many lightweight block ciphers [10, 12, 17, 22] have recently been proposed, and designing such ciphers is not an easy task, as shown by the numerous candidates that were eventually broken. Moreover, it is interesting to note that most privacy-preserving RFID protocols proposed so far [3, 18, 19] require a hash function. Since a hash function can easily be built from a block cipher (for example with the Davies-Meyer mode) or a permutation (for example with the sponge construction [9]), a crucial question for researchers is how to design a hardware-efficient permutation (that can later be used to build a hash function and/or a block cipher).
Hardware efficiency can have very different meanings depending on the utilization scenario targeted by the designer. For example, a classical metric is to estimate the minimum silicon area required by the primitive to perform the cryptographic operations. This, of course, depends on the parameters of the function itself (the area is highly dependent on the amount of memory required) and most lightweight block ciphers have a rather small block size of 64 bits. It is to be noted that the area is usually not directly linked to the security of a primitive, as adding extra rounds will have an impact on the throughput of the implementation, but only a very limited one on the area (we assume that the function has no weakness that is independent of the number of rounds). Area and other metrics such as throughput, latency or power dissipation can be traded off for one another, making the comparison between different primitives difficult. In the direction of fairer comparisons of hardware implementations of cryptographic primitives, Bogdanov et al. [11] introduced the efficiency metric throughput/area in order to take these trade-offs into account. However, the possibility of trading off throughput for power was not taken into account, and Badel et al. [4] proposed instead a figure of merit, defined as FOM = throughput/area^2.
However, as of today, no metric takes into account the inherent security of a building block, which makes it hard to compare, for example, two diffusion matrices that have different area footprints and different branching numbers.
The construction of good diffusion matrices has always been an important research topic in cryptography, equally important as the search for good confusion functions. The AES [15], for example, uses a 4 × 4 matrix with elements in GF(2^8). This matrix is Maximum Distance Separable (MDS), which means that it has a branching number of 5, optimal for a 4 × 4 matrix. However, this security feature comes at a cost: computations in GF(2^8) might not be the best choice for some hardware purposes, even though special care has been taken by the designers to choose a circulant matrix instantiated with lightweight coefficients (i.e. low Hamming weight coefficients, such as 0x01, 0x02 and 0x03). Recently, Guo et al. [16, 17] described a new type of diffusion matrix, so-called serial, that trades more clock cycles in the execution for a smaller area. This idea was later extended to the use of linear Feistel-like structures or Linear Feedback Shift Registers (LFSR) to build the diffusion matrix [21, 23]. On the opposite side, PRESENT [10] uses a simple bit permutation layer, the real diffusion coming in fact directly from the S-box application. The advantage is of course that a bit permutation layer is basically free in a hardware implementation. Now, one may ask the following question: what is better when the goal is to maximize some hardware metric, a very weak diffusion matrix with a low area footprint, or a strong diffusion matrix that requires more silicon?
More generally, many different trade-offs exist when building an AES-like Substitution-Permutation Network (SPN) primitive, such as the general geometry (number of lines and columns), what size of S-box, what type of matrix, with what branching number, in what finite field, with which irreducible polynomial, etc. When a cryptographer would like to design a permutation with a specific hardware efficiency metric in mind, it is not trivial for him to make the best construction choices directly. Since implementing many different trade-offs is very time consuming, he will have to rely on his own intuition when picking the basic building blocks and choosing the general structure of the primitive, therefore accepting that his final design might not be optimal.
Our contributions. In this article, we study the problem of designing hardware-efficient permutations for lightweight symmetric key cryptography purposes, and we propose new promising diffusion matrices as building blocks. We first explain in Section 2 the family of functions that we will study, namely AES-like SPN permutations, and we describe a new generalized diffusion layer (i.e. the ShiftRows function in AES) that allows a provably optimal diffusion even for non-square internal state matrices. Then, we introduce in Section 3 a new metric, the figure of adversarial merit (FOAM), that for the first time takes into account the inherent security provided by the primitive. We then explain in Section 4 the various SPN design trade-offs that we will consider for our comparisons, such as the geometry of the SPN, the S-box size, the type of matrix (circulant or serial), the field size for the diffusion or even the irreducible polynomial. The goal is that the designer only has to input the type of implementation (round-based and/or serial) and the size of the permutation he would like to build, and he directly gets the SPN structure and the internal components that are best suited for him. We study in Section 5 the security of the AES-like SPN permutations by only taking into account simple linear/differential attacks. We then describe in Section 6 how one can estimate the efficiency of hardware implementations, and in particular the non-trivial task of estimating the area consumption induced by the control logic. Moreover, we show that in some situations, a coefficient rewrapping trick can be used to significantly improve the efficiency of a diffusion matrix. We chose to focus our work on designing permutations only, since many cryptographic primitives can be built from them. Therefore, we will not cover other components such as the key schedule of a block cipher, or the message expansion of a hash function.
Moreover, due to the obviously vast amount of implementation trade-offs, we restricted ourselves to the two most important cases: fully serialized and round-based.
Finally, the results obtained by our analysis are given in Section 7, with the best diffusion matrices and SPN parameters we could find for many different scenarios. Notably, we show that the diffusion matrices of ciphers such as AES, LED or PHOTON are not the best possible choices. For example, in the case of AES encryption, a circulant matrix with coefficients (0x01,0x01,0x04,0x8d) would have been, surprisingly, a better choice in terms of implementation while keeping the same Maximum Distance Separable (MDS) security.
In this section, we describe the family of AES-like SPN functions we are considering. Our scope is quite classical, but we propose a new generalized diffusion layer (i.e. the ShiftRows function in AES), that allows an optimal diffusion even for non-square internal state matrices.
Extended AES-like permutations
An n-bit AES-like SPN permutation transforms an r × c array of s-bit cells (n = r × c × s). During one round, each cell is first transformed by an s-bit S-box (similar to the AES SubBytes operation). Then each r-cell column is transformed by an r × r diffusion matrix (similar to the AES MixColumns operation), followed by an optimal diffusion which permutes the c cells of each row to provide further mixing (similar to the AES ShiftRows operation). Finally, an (r × c)-cell constant is XORed to complete a round transformation (in a block-cipher design, this phase is a subkey addition, but we will not consider key schedules in this article). Note that in AES, we have a square array r = c = 4 and cell size s = 8 bits. The diffusion matrix is usually defined over the finite field GF(2^s) because of the s-bit cell size. Sometimes, we might actually use a smaller subfield GF(2^i), where i divides s, in order to define the diffusion matrix. This framework captures many known ciphers such as AES, PRESENT, LED, etc.
In this paper, a cell is called differentially (resp. linearly) active if its value (resp. mask value) is non-zero in a differential (resp. linear) attack. The differential branch number of a diffusion matrix is the minimum number of differentially active input and output cells (among all non-zero inputs). The notion of linear branch number is similar, except that we consider the transpose of the diffusion matrix instead. From this point onwards, we will not distinguish between differential and linear branch number unless necessary. That is, if it is stated that a matrix has branch number, say, 3, we mean that both the differential and linear branch number are 3. The maximum branch number for an r by r diffusion matrix is r + 1, and a matrix which achieves this optimal branch number is called Maximum Distance Separable (MDS). If the diffusion matrix has branch number r, then it is called almost-MDS.
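The definition of the branch number can be checked by exhaustive search on small matrices. The following is a minimal sketch, assuming a 4 × 4 matrix over GF(2^4) so that all 2^16 inputs can be enumerated. The helper names (`gf_mul`, `branch_number`, `transpose`) are hypothetical, and the example matrix is the 4 × 4 MDS matrix of LED over GF(2^4) with α^4 + α + 1, which is not given in this text and is used here only as test data.

```python
from itertools import product

def gf_mul(a, b, poly=0x13, bits=4):
    """Multiply a and b in GF(2^bits) defined by the irreducible polynomial `poly`."""
    result = 0
    for _ in range(bits):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> bits:  # degree reached `bits`: reduce modulo `poly`
            a ^= poly
    return result

def branch_number(matrix, poly=0x13, bits=4):
    """Minimum over all non-zero inputs of (active input cells + active output cells)."""
    r = len(matrix)
    size = 1 << bits
    table = [[gf_mul(a, b, poly, bits) for b in range(size)] for a in range(size)]
    best = 2 * r
    for x in product(range(size), repeat=r):
        if not any(x):
            continue  # the all-zero input is excluded by definition
        active = sum(1 for v in x if v)
        for row in matrix:
            acc = 0
            for coeff, v in zip(row, x):
                acc ^= table[coeff][v]
            if acc:
                active += 1
        best = min(best, active)
    return best

def transpose(matrix):
    """Transpose, used for the linear branch number."""
    return [list(col) for col in zip(*matrix)]

# Test data (assumption: LED matrix, not taken from this text).
LED_MATRIX = [
    [0x4, 0x1, 0x2, 0x2],
    [0x8, 0x6, 0x5, 0x6],
    [0xB, 0xE, 0xA, 0x9],
    [0x2, 0x2, 0xF, 0xB],
]
IDENTITY = [[1 if i == j else 0 for j in range(4)] for i in range(4)]

differential = branch_number(LED_MATRIX)
linear = branch_number(transpose(LED_MATRIX))
```

Exhaustive search is only practical for small r and s; for larger fields one would instead test that every square submatrix is non-singular, which is not shown here.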
The generalized optimal diffusion
In this section, we generalize the concept of optimal diffusion [15] to non-square state arrays. This has already been done for r < c, with a security bound equivalent to the case r = c (square array) [15]. When r > c and c divides r, a simple generalization has been proposed in [13], where a 4-round security bound is proven when the diffusion matrix is MDS. In this section, we propose a generalized optimal diffusion for the case r > c where c may not divide r and the diffusion matrix may not be MDS, i.e. for any branch number B ≤ r + 1.
An example of optimal diffusion is the ShiftRows operation of AES, which helps to diffuse the effect of the AES SubBytes and MixColumns operations over 32 bits to the whole 128-bit block. The AES ShiftRows transforms a 4 × 4 byte-array by rotating row j to the left by j bytes, for j = 0, 1, 2, 3. The effect of ShiftRows is that each byte of an input column is mapped to a different output column. This is captured by the concept of optimal diffusion (another example is the ArrayTranspose map of the SQUARE cipher [14]).

Definition 1. For an r-by-r cell-array, the optimal diffusion map is a cell-permutation that maps each cell of an input column to a different output column.
However, the optimal diffusion only applies to r × c cell-arrays where r ≤ c. When r > c, there are not enough output columns to map each of the r cells of an input column to a different one. Thus, we extend a concept from [13] called Generalized Optimal Diffusion (GOD) to r × c cell-arrays with r > c, which we describe below. Our strategy is to distribute the cells of an input column as uniformly as possible over the output columns.
Note that here, without loss of generality, we apply the permutation operations from right to left, i.e. SC (SubCells) is applied first, followed by MC (MixColumn) and then the optimal diffusion. The Generalized Optimal Diffusion (GOD) defined in [13] applies only when r is a multiple of c. Here, we define GOD for any r > c.

Definition 2. For an r × c cell-array, a generalized optimal diffusion is a cell-permutation such that, looking at any r-cell column:
1. ⌈r/c⌉ input cells are mapped to each of (r mod c) output columns.
2. ⌊r/c⌋ input cells are mapped to each of c − (r mod c) output columns.

Example 1. Consider r = 5, c = 3. For each input column of 5 cells, ⌈5/3⌉ = 2 input cells are mapped to each of (5 mod 3) = 2 columns, and ⌊5/3⌋ = 1 input cell is mapped to 3 − (5 mod 3) = 1 column. One example is given by the transform of the following arrays:
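The two conditions of Definition 2 can be checked mechanically. The sketch below is one possible construction, not necessarily the one used by the authors: it rotates row i to the left by (i mod c) positions, in the spirit of ShiftRows, and then verifies the column distribution. The function names are hypothetical.

```python
from collections import Counter

def god_by_row_rotation(r, c):
    """Cell permutation {(row, col): (row, new_col)} rotating row i left by i mod c.
    ASSUMPTION: illustrative construction only, chosen because it is easy to verify."""
    return {(i, j): (i, (j - i) % c) for i in range(r) for j in range(c)}

def column_spread(perm, r, col):
    """Count how many cells of input column `col` land in each output column."""
    return Counter(perm[(i, col)][1] for i in range(r))

def is_generalized_optimal_diffusion(perm, r, c):
    """Check both conditions of Definition 2 for every input column."""
    high, low = -(-r // c), r // c  # ceil(r/c) and floor(r/c)
    expected = sorted([high] * (r % c) + [low] * (c - r % c))
    for col in range(c):
        counts = column_spread(perm, r, col)
        if sorted(counts.get(k, 0) for k in range(c)) != expected:
            return False
    return True

perm_5x3 = god_by_row_rotation(5, 3)
spread = sorted(column_spread(perm_5x3, 5, 0).values())  # distribution for Example 1
```

For r = 5 and c = 3, each input column sends two cells to two of the output columns and one cell to the remaining one, matching Example 1. The identity map, by contrast, keeps all five cells in the same column and fails the check.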
Consider a 4-round AES-like SPN as follows (omitting the constant addition since it has no effect on our reasoning): We provide the proof of this theorem in Appendix A.1. We note that it is tight in the sense that it naturally provides a 4-round path that corresponds to a "luckiest" scenario for the attacker, which involves the minimum number of active Super-Sboxes (the (c × s)-bit S-boxes composed of two SubCells layers surrounding one MixColumns).
Let us look at an application example of Theorem 1 to derive the number of active S-boxes of an AES-like SPN structure, which cannot be deduced from the known results of [13, 15]. Consider an SPN structure with a 24-cell state, the diffusion matrix being an 8 × 8 matrix with branch number 7, i.e. r = 8, c = 3 and B = 7. By Theorem 1, we have y = 2 and x = 1, therefore B′ = max{2, x + y} = 3 and there are B × B′ = 7 × 3 = 21 active S-boxes guaranteed over 4 rounds of this 24-cell SPN structure.
As a corollary, we explore the cases when the formula for the number of active S-boxes over 4 rounds can be simplified (the proof is given in Appendix A.2). Corollary 1. In Theorem 1, we have the following special cases:
1. If c > r, then the number of active S-boxes over four rounds is at least B^2 (known result from [15]).
2. If c divides r and B = r + 1, i.e. MixColumn is MDS, then the number of active S-boxes over four rounds is at least (r + 1) × (c + 1) (known result from [13]).
3. If c divides r and B = r, i.e. MixColumn is almost-MDS, then the number of active S-boxes over four rounds is at least r × c.
FOAM: Figure Of Adversarial Merit
As explained in the introduction, the various trade-offs inherent in any design of a cryptographic primitive make a fair and consistent comparison of its software and hardware implementations a challenging task. A few metrics exist for hardware implementations, such as the Area-Time (AT) product, which multiplies the area in Gate Equivalents (GE) occupied by the design by the number of clock cycles required (the smaller the number, the more efficient the design). Closely related is the hardware efficiency [11], which divides the throughput at a given frequency by the area (hence the greater the number, the better the design). In order to also address the area-power trade-off, [4] proposed a new 
where S(x) and A are basically equivalent to special definitions of speed and area, respectively. More precisely, S(x) denotes the speed of the cipher based on the number of rounds required to achieve a certain security x against some set of attacks (in this article, we will later restrict ourselves to simple differential/linear attacks). For a round-based permutation, it is defined as S(x) = p(x) × t, where p(x) represents the number of rounds required to achieve security x, and t the number of clock cycles needed to perform one round. Moreover, for SPN-based primitives, we decompose the area requirements A into six parts: the intermediate state memory cost C_mem, the S-box implementation cost C_sbox, the diffusion matrix implementation cost C_diff, the constant addition cost C_cst, the control logic cost C_log, and the IO logic cost C_io:

A = C_mem + C_sbox + C_diff + C_cst + C_log + C_io
This FOAM metric will be useful to compare different design strategies and different building blocks (such as diffusion matrices) with a simple value computation. Even better, we would like to roughly compare all these possible design trade-offs without the hassle of implementing all of them: in Section 6 we will detail how to estimate these six subparts of the area cost and the number t of clock cycles required to perform one round. The value p(x) can be deduced from the number of active S-boxes proven in Theorem 1 and the cryptographic properties of the S-box (see Section 5). Note that in the rest of the paper, we consider that the security targeted by the designer is equal to the permutation size, i.e. we are aiming at a security of 2^n computations (thus p(x) = p(2^n)).
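The quantities above can be combined in a few lines. The sketch below is illustrative only: the defining equation of FOAM is not reproduced in this text, so the code assumes, by analogy with FOM = throughput/area^2, the form FOAM(x) = 1 / (S(x) × A^2). The function names and the sample numbers are hypothetical.

```python
def speed(rounds, cycles_per_round):
    """S(x) = p(x) * t: clock cycles needed to reach the target security x."""
    return rounds * cycles_per_round

def total_area(c_mem, c_sbox, c_diff, c_cst, c_log, c_io):
    """Area A in GE as the sum of the six cost components."""
    return c_mem + c_sbox + c_diff + c_cst + c_log + c_io

def foam(rounds, cycles_per_round, area):
    """ASSUMPTION: FOAM(x) = 1 / (S(x) * A^2), by analogy with FOM = throughput / area^2.
    This formula is an assumed reading, not a quotation of the definition."""
    return 1.0 / (speed(rounds, cycles_per_round) * area ** 2)

# Hypothetical numbers, chosen only to exercise the functions.
area = total_area(c_mem=40, c_sbox=30, c_diff=15, c_cst=5, c_log=8, c_io=2)
value = foam(rounds=8, cycles_per_round=1, area=area)
```

Under this assumed form, a larger value is better, and doubling the area at equal speed divides the metric by four, which reflects the quadratic weight given to area.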
Trade-offs considered
In this section, we explain the various trade-offs we will consider when building an AES-like SPN permutation. The goal is that a designer specifies a permutation bitsize n, the metric he would like to optimize (area, FOAM), and the degree to which serial or round-based implementations are important, and directly obtains the best parameters to build his permutation.
The S-box. One of the first choices of the designer is the size of the S-box, and we will consider two possible trade-offs: s = 4 and s = 8. Note that, for simplicity, we will consider that the S-box chosen has perfect differential and linear properties relative to its size (one could further extend the trade-offs to non-optimal but smaller S-boxes, but the search space being very broad, we leave this as an open problem).
The geometry of the internal state. When building an AES-like SPN permutation, one can consider several internal state geometries (the values r and c). The classical case is a square state, like for AES. However, depending on the diffusion matrices available, it might be worth considering more line-shaped or column-shaped designs.
Diffusion matrix field size. The designer can choose the field size 2^i in which the matrix computations will take place. In the classical case, as in AES, the field size for the diffusion matrix is the same as that of the S-box. However, depending on the diffusion matrices available, it might be worth considering designs with thinner diffusion layers that are repeated several times. For example, in the case of AES, instead of the MixColumns matrix one could use a 4 × 4 diffusion matrix over GF(2^4) applied twice (once on the 4 MSB and once on the 4 LSB of the 8-bit cells in the AES column). Overall, we will cover a scope from binary matrices (in GF(2)) up to matrices over the same field size as the S-box (in GF(2^s)).
Irreducible polynomial for the diffusion matrix field. Once the field size 2^i is fixed, the designer can choose the irreducible polynomial defining the field. For i = 1 and i = 2 only a single polynomial exists, while for i = 4 at most 3 choices are possible (α^4 + α + 1, α^4 + α^3 + 1 and α^4 + α^3 + α^2 + α + 1). For the i = 8 case, many polynomials are possible (this was already observed by [5]), thus in order to focus the search space we will only consider the irreducible polynomial used in AES (α^8 + α^4 + α^3 + α + 1) and the one used in the WHIRLPOOL hash function [7] (α^8 + α^4 + α^3 + α^2 + 1).
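The number of candidate polynomials for each field size can be enumerated directly. The sketch below stores a polynomial over GF(2) as an integer bit mask (bit k is the coefficient of α^k) and tests irreducibility by trial division; the function names are hypothetical.

```python
def poly_mod(a, b):
    """Remainder of a divided by b, for polynomials over GF(2) stored as bit masks."""
    while a.bit_length() >= b.bit_length():
        a ^= b << (a.bit_length() - b.bit_length())
    return a

def is_irreducible(p):
    """True if p has no divisor of degree between 1 and deg(p) // 2."""
    degree = p.bit_length() - 1
    for d in range(2, 1 << (degree // 2 + 1)):
        if poly_mod(p, d) == 0:
            return False
    return True

def irreducible_polynomials(degree):
    """All irreducible polynomials of the given degree over GF(2), in increasing order."""
    return [p for p in range(1 << degree, 1 << (degree + 1)) if is_irreducible(p)]

degree_2 = irreducible_polynomials(2)
degree_4 = irreducible_polynomials(4)
degree_8 = irreducible_polynomials(8)
```

For degree 4 this yields exactly the three polynomials listed above (0x13, 0x19 and 0x1f as bit masks), while degree 8 yields 30 candidates, which illustrates why the search is restricted to two of them.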
Type of diffusion matrix. The designer can choose what type of matrix he will implement, the two main hardware-friendly types being circulant or serial. In the circulant case, the designer picks r coefficients Z = (Z_0, ..., Z_{r−1}) and the matrix Z is defined as
In the serial case, the designer picks r coefficients Z = (Z_0, ..., Z_{r−1}) and the matrix Z is defined as
The matrix therefore takes r operations to be computed.
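Both matrix types can be generated from their r coefficients. The sketch below makes two assumptions that are not confirmed by this text: the circulant matrix is built by rotating the coefficient row to the right by one position per row, and the serial matrix is the r-th power of a companion-like matrix whose last row holds the coefficients, following the serial construction of [16, 17]. The AES and LED coefficients used as test data come from the respective specifications.

```python
def gf_mul(a, b, poly, bits):
    """Multiply a and b in GF(2^bits) defined by the irreducible polynomial `poly`."""
    result = 0
    for _ in range(bits):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> bits:  # degree reached `bits`: reduce modulo `poly`
            a ^= poly
    return result

def circulant(coeffs):
    """ASSUMED convention: row k is the coefficient row rotated right by k positions."""
    r = len(coeffs)
    return [[coeffs[(j - k) % r] for j in range(r)] for k in range(r)]

def companion(coeffs):
    """Ones on the superdiagonal, coefficients in the last row."""
    r = len(coeffs)
    rows = [[1 if j == k + 1 else 0 for j in range(r)] for k in range(r - 1)]
    rows.append(list(coeffs))
    return rows

def mat_mul(a, b, poly, bits):
    """Matrix product over GF(2^bits)."""
    size = len(a)
    out = [[0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            acc = 0
            for k in range(size):
                acc ^= gf_mul(a[i][k], b[k][j], poly, bits)
            out[i][j] = acc
    return out

def serial(coeffs, poly, bits):
    """ASSUMED definition: the companion matrix applied r times, i.e. its r-th power."""
    step = companion(coeffs)
    matrix = step
    for _ in range(len(coeffs) - 1):
        matrix = mat_mul(matrix, step, poly, bits)
    return matrix

aes_mixcolumns = circulant((0x02, 0x03, 0x01, 0x01))
led_matrix = serial((0x4, 0x1, 0x2, 0x2), poly=0x13, bits=4)
```

In hardware, the serial form only needs the logic of the last row, reused over r steps, whereas the full r-th power is what determines the diffusion properties.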
Branching number of the diffusion matrix. In general, implementing a matrix with very good diffusion property will cost more area and/or cycles than a weak one. For example, the AES matrix has ideal MDS diffusion property, but certainly requires more area to implement than a simple binary matrix with weaker properties. Since the former is bigger but stronger and the latter is smaller and weaker, it is not clear which alternative will lead to the best FOAM. Therefore, the designer can choose between a wide range of possibilities concerning the branching number B of the diffusion matrix, from B = 3 to B = r + 1 (MDS).
Security assessment of AES-like primitives
The FOAM metric takes into account the security of the permutation with regards to simple differential/linear attacks. We would like to evaluate this security for the AES-like SPN permutations we are considering. Theorem 1 gives us the minimal number of active S-boxes for a given number of rounds,
We note that the number of active S-boxes given by Theorem 1 is tight if the number of rounds is not equal to 3 modulo 4, and knowing the cryptographic properties of the S-box we can easily compute the maximum differential and linear characteristic probabilities of our generic SPN ciphers. In other words, we can easily compute the number of rounds p(x) = p(2^n) required to achieve the targeted security 2^n. As stated before, for simplicity, in the rest of this article we will consider that the S-boxes have perfect differential and linear properties: for a 4-bit S-box the maximum differential and linear characteristic probabilities are 2^−2 (e.g. the PRESENT S-box), while for an 8-bit S-box the maximum differential and linear characteristic probabilities are 2^−6 (e.g. the AES S-box). Of course, one can extend the trade-off by considering other S-boxes that might require a smaller area but will have worse security properties.
We reuse the example from Section 2.2 (i.e. an SPN structure with a 24-cell state, the diffusion matrix being an 8 × 8 matrix with branch number 7) as an illustration. We know from Theorem 1 that there are at least 21 active S-boxes over 4 rounds of this SPN permutation. Suppose that 8-bit S-boxes (i.e. s = 8) having maximum differential and linear probabilities 2^−6 are used. Then the maximum differential and linear characteristic probabilities over four rounds are upper-bounded by (2^−6)^21 = 2^−126.
We are aware that attacks other than simple differential/linear ones might exist, some perhaps much more complex and powerful (it would be interesting to give some generic description of various attacks like the impossible differential attack, the boomerang attack, etc. using only the parameters of the AES-like SPN permutation). However, our goal here is not to fully specify a permutation, but to compare many trade-offs and design strategies that will lead to good hardware performance. Therefore, we emphasize that the number of rounds p(x) is of course not the number of rounds that should be chosen by a designer. This number should be carefully chosen after thorough cryptanalysis work on the entire primitive. Yet, we believe that this simple differential/linear criterion is a quite accurate way to compare the security of various AES-like SPN permutations.
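The arithmetic of this example can be written out as follows. The sketch works with base-2 logarithms of probabilities; the last step, which counts how many 4-round blocks are needed to go below 2^−n, is a coarse estimate added for illustration and not the procedure used to derive p(x), since it ignores round counts that are not multiples of four.

```python
from math import ceil

def log2_characteristic_bound(active_sboxes, sbox_log2_prob):
    """log2 of the upper bound on the characteristic probability."""
    return active_sboxes * sbox_log2_prob

# Parameters of the running example: r = 8, c = 3, s = 8.
r, c, s = 8, 3, 8
n = r * c * s                      # permutation size in bits
active_over_4_rounds = 21          # guaranteed active S-boxes over 4 rounds
bound_4_rounds = log2_characteristic_bound(active_over_4_rounds, -6)

# Four rounds alone do not reach the 2^-n target.
four_rounds_suffice = bound_4_rounds <= -n

# Coarse estimate (assumption): stack 4-round blocks until the bound is below 2^-n.
blocks_needed = ceil(n / -bound_4_rounds)
rounds_upper_estimate = 4 * blocks_needed
```

With n = 192, a single 4-round block gives 2^−126, which is above 2^−192, so under this bound more than four rounds are needed; two blocks (eight rounds) give at most 2^−252.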
Implementations in ASIC
In this section, we introduce some notation before we present formulas to estimate serialized and round-based implementations (we restricted ourselves to these two important practical cases due to the obviously vast amount of implementation trade-offs). Please note that all estimates have to be seen as lower bounds, as we use scan flip-flops and consider neither reset nor I/O requirements, which can significantly impact the area count in practice. We argue that those requirements, though very important in practice, are highly application specific, and thus need to be determined on a case-by-case basis. In practice, a higher throughput can be achieved by using pipelining techniques to reduce the critical path at the cost of additional area. As this design goal is, again, highly application specific and FOAM is designed to be frequency independent, we have not considered it in our analysis.
We have estimated all serial architectures with the single optimization goal of minimal area in mind. In practice, some design decisions will most likely use another trade-off point more in favor of smaller time and larger area. To reflect this, we have estimated all round-based architectures optimized for maximum FOAM.
The table below provides an overview of the hardware building blocks we used, their notation and typical area requirements for a UMC 180 nm technology. In addition, we denote by i the exponent of the field GF(2^i).

[Table: hardware building blocks. Columns: Notation, Description, GE]

This is just one example for a technology, and the area of the building blocks can easily be adapted to other technologies.

In this section, we give more details about the ASIC implementation estimates given in Section 6. A general serialized architecture using a serial diffusion matrix is depicted in Figure 1a, while Figure 1b shows the circulant diffusion matrix module. In the special case of c = 1, the state module can be simplified to one of the two architectures shown in Figure 1c, depending on whether or not i = s. Finally, Figure 1d depicts a round-based architecture.

[Figure 1a: Serialized architecture using a serial diffusion matrix, c ≥ 2]
S-boxes cost. We use the S-boxes of AES and PRESENT to estimate the area for 8-bit and 4-bit S-boxes, respectively. In a UMC 180 nm technology they require SB8 = 233 GE and SB4 = 22 GE.
Diffusion matrix cost. A denotes the number of XORs required to implement the last row of the serial or circulant matrix, and we provide an extensive list of good matrices and their XOR counts in Section 7. If a serial matrix is used, A XORs are all that is required to implement the diffusion matrix.
Once one column has been computed, all columns are rotated by one column to the left and the next column can be processed. If a circulant matrix is used, additional temporary storage of s × r − i 1-input flip-flops is required before the result is fed back to the leftmost column. Then all columns are rotated by one column to the left. One additional i-bit multiplexer is required.
Constant addition cost. To add a constant, s XOR gates are required.
Control logic cost. A particular challenge is to estimate the control logic C_log required for a given architecture. Our estimates contain four parts: three counters a_r (rows), a_c (columns), and a_p (rounds); the finite state machine b; the clock gating logic cg; and other combinational logic oc. The area of a counter is mainly determined by the storage required for the minimal number of bits, some simple feedback function (e.g. NOT, NAND, NOR) and at least a 1-bit MUX. In total we estimate the area of any such counter to be a_x = DFF × ⌈log_2(x)⌉ + 5, where x denotes the number of rows, columns or rounds. The number of states depends on the geometry. In case c ≥ 2, the serialized architecture we based our estimates on requires c − 1 states for GOD, two for MixColumns, and one each for SubCells, IDLE, and INIT; thus c + 4 states are required in total. The area of the finite state machine is estimated as b = SFF × ⌈log_2(c + 4)⌉. Based on post-synthesis figures for different variants of PHOTON, we have derived the following formulas for clock gating (cg) and other combinational logic (oc): cg = 10 × r + 5 and oc = 8 × r + 20, respectively. In case c = 1, no clock gating logic and no column counter a_c are needed, and only one state is required for MixColumns, thus the state machine estimate simplifies to b = SFF × 2. In total we get the following formula:
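The individual terms above can be combined into a small estimator. This is a sketch under stated assumptions: the closing formula is not reproduced in this text, so the total is assumed to be the plain sum of the listed parts; the logarithms are assumed to be rounded up to whole bits; and the flip-flop costs `dff` and `sff` (in GE) are left as parameters because their values come from a table that is not available here.

```python
def ceil_log2(x):
    """Number of bits needed to count x values (integer-exact ceil(log2(x)) for x >= 1)."""
    return (x - 1).bit_length()

def counter_area(x, dff):
    """a_x = DFF * ceil(log2(x)) + 5."""
    return dff * ceil_log2(x) + 5

def control_logic_area(r, c, p, dff, sff):
    """Estimate C_log for the serialized architecture.
    ASSUMPTION: C_log is the sum of the counters, FSM, clock gating and other logic."""
    a_r = counter_area(r, dff)
    a_p = counter_area(p, dff)
    oc = 8 * r + 20
    if c >= 2:
        a_c = counter_area(c, dff)
        b = sff * ceil_log2(c + 4)   # c + 4 states in the finite state machine
        cg = 10 * r + 5
        return a_r + a_c + a_p + b + cg + oc
    # c = 1: no column counter, no clock gating, simplified state machine
    b = sff * 2
    return a_r + a_p + b + oc

# Unit flip-flop costs, used only to make the arithmetic easy to follow.
square_state = control_logic_area(r=4, c=4, p=32, dff=1, sff=1)
single_column = control_logic_area(r=4, c=1, p=32, dff=1, sff=1)
```

With unit flip-flop costs, a 4 × 4 state and 32 rounds, the parts are 7 + 7 + 10 + 3 + 45 + 52; the clock gating and combinational terms dominate, which is why they were fitted from post-synthesis figures.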
Input/Output logic cost. For our serialized architecture we need only one additional s-bit multiplexer.
Timing cost. Below are the formulas to estimate the time required to compute one round, dependent on the geometry and whether a serial or a circulant matrix is used:
For c ≥ 2 the three summands represent the time required for: 1) AddConstant and SubCells; 2) GOD; 3) MixColumns. Please note that if a serial matrix is used and i = s, the architecture can be optimized so that c clock cycles are saved per round at no extra hardware cost [1]. In case c = 1 there is no GOD and the first summand stays the same, while MixColumns does not require rotating the columns, regardless of whether a circulant or serial diffusion matrix is used.
Round-based architectures
Memory cost. The n = s × r × c-bit state can be stored in 2-input flip-flops.
S-boxes cost. In total r × c S-boxes are required.
Diffusion matrix cost. We need r × c × s/i implementations of the last row of the matrix, regardless of whether a serial or circulant matrix is used.
Constant addition cost. In total s × r × c XORs are required to add all constants.
Control logic cost. In a round-based implementation, basically only a round counter a_p and, optionally, a very simple finite state machine are required. If we assume only three states IDLE, INIT, and ROUND, the area requirement for the finite state machine is b = 2 · SFF.
Input/Output logic cost. One of the two inputs of the 2-input flip-flops used to store the state can be used for multiplexing the input. Hence, no additional logic is required.
Timing cost. In a round-based implementation one round is computed in one clock cycle.
t = 1
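The round-based cost items can be gathered into one estimator. The sketch below is an illustration under stated assumptions: gate costs are passed in as a dictionary with hypothetical keys (`ff2` for a 2-input flip-flop, `dff`, `sff`, `xor`, `sbox`), since the technology table is not available in this text; the round counter is assumed to follow the counter formula a_x = DFF × ⌈log_2(x)⌉ + 5; and each last-row implementation is assumed to cost `row_xors` XOR gates.

```python
def ceil_log2(x):
    """Integer-exact ceil(log2(x)) for x >= 1."""
    return (x - 1).bit_length()

def round_based_estimate(r, c, s, i, row_xors, rounds, ge):
    """Return (area in GE, cycles per round) for a round-based implementation.
    ASSUMPTION: `ge` maps hypothetical gate names to their cost in GE."""
    n = r * c * s
    c_mem = n * ge["ff2"]                           # state in 2-input flip-flops
    c_sbox = r * c * ge["sbox"]                     # one S-box per cell
    c_diff = r * c * (s // i) * row_xors * ge["xor"]
    c_cst = n * ge["xor"]                           # one XOR per state bit
    c_log = ge["dff"] * ceil_log2(rounds) + 5 + 2 * ge["sff"]
    c_io = 0                                        # input muxed via the flip-flops
    area = c_mem + c_sbox + c_diff + c_cst + c_log + c_io
    return area, 1                                  # one round per clock cycle

# Unit gate costs except the 4-bit S-box (22 GE, as stated for the PRESENT S-box).
unit_costs = {"ff2": 1, "dff": 1, "sff": 1, "xor": 1, "sbox": 22}
area, cycles = round_based_estimate(r=4, c=4, s=4, i=4, row_xors=10,
                                    rounds=8, ge=unit_costs)
```

Replacing the unit costs with the gate-equivalent figures of a target library turns this into an area estimate in GE; the S-box term usually dominates for s = 8.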
We now present the formulas for the estimates of the various parts of the ASIC implementations in Table 1. Table 2 compares the FOAM of some real implementations of LED and PHOTON with our estimated FOAM. Please note that the authors of [17] and [16] did not use the special optimization trick described above. To reflect this, we provide two FOAMs, one taking the optimization into account and one which does not. For LED, [17] reports 966 GE for LED-64, where 299 GE are required to store the key state and around 40 GE are required for the key addition, multiplexers, and control logic. We thus compare with 627 GE. [20] reports 2,400 GE for AES-128, out of which 835 GE are required for the key schedule; thus we compare our FOAM to 1,565 GE. The authors chose a different optimization point, as can be seen in the higher area and significantly lower cycle count. We used the formulas above with the parameters of PHOTON-224 (r = 8, c = 8, i = 4, s = 4, p = 8, serialized MDS) for the estimation of the 256-bit permutation. As PHOTON is, contrary to LED, an unkeyed permutation, the last row of Table 2 is actually the best suited comparison. In summary, Table 2 underlines how close the estimated FOAM is to that of real implementations.

Table 1: ASIC implementation estimates for the intermediate state memory cost C_mem, the S-box implementation cost C_sbox, the diffusion matrix implementation cost C_diff, the constant addition cost C_cst, the control logic cost C_log, the IO logic cost C_io and the number t of clock cycles needed to perform one round.
Discussion

Results and new diffusion matrices
In this section we provide the results of our framework, as well as new diffusion matrices that are very interesting for hardware implementations. As explained in Section 4, the designer's inputs are the permutation bitsize n, the metric to optimize (area or FOAM), and the relative importance of serial and round-based implementations. To illustrate our method, we focus on the case where the designer would like to build a 64-bit permutation (a typical state size for a lightweight block cipher). For the implementation types, we consider three scenarios: only the serial implementation is important, only the round-based implementation is important, or serial and round-based implementations are equally important to the designer.
Before describing our results, we first explain how we found good diffusion matrices (circulant and serial) and then list these matrices in the next three subsections. Our optimal matrices outperform the known ones from the AES and LED ciphers and the PHOTON hash function.
Lightweight coefficients
Consider the AES matrix, a circulant matrix with coefficients (0x01, 0x01, 0x02, 0x03) over $GF(2^8)$ defined by the irreducible polynomial $\alpha^8 + \alpha^4 + \alpha^3 + \alpha + 1$. The matrix appears to be very lightweight due to the low Hamming weight of its entries. But surprisingly, we found an even lighter circulant matrix over the same field, with coefficients (0x01, 0x01, 0x04, 0x8d). We now explain why this is so.
We first illustrate how to compute the number of XORs required to implement the multiplication of a finite field element x by a constant. As an example we use $GF(2^8)$ defined by $\alpha^8 + \alpha^4 + \alpha^3 + \alpha + 1$. Let $x = x_7\alpha^7 + x_6\alpha^6 + \cdots + x_1\alpha + x_0 = (x_7, x_6, \ldots, x_1, x_0)$. We use the binary representation for finite field elements: e.g., 0x8d is 10001101 in binary, which corresponds to the finite field element $\alpha^7 + \alpha^3 + \alpha^2 + 1$ in $GF(2^8)$. Further, for ease of explanation, we employ hexadecimal encoding: $(x_7, x_6, x_5, x_4, x_3, x_2, x_1, x_0)$ can be encoded as the tuple of hexadecimal numbers (0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01). In this encoding each entry is a mask of the input bits that are XORed to form the corresponding output bit, so an entry of Hamming weight w costs w - 1 XORs. The multiplications of x by 0x04 and by 0x08 can then be represented respectively as:
0x04 · x = (0x20, 0x10, 0x88, 0xc4, 0x42, 0x81, 0xc0, 0x40),
0x08 · x = (0x10, 0x88, 0xc4, 0x62, 0xa1, 0xc0, 0x60, 0x20).
It can be seen that the number of XORs required for the multiplication of x by 0x04 and by 0x08 is 6 and 9 respectively. Now we can compute
0x8d · x = ($\alpha^7 + \alpha^3 + \alpha^2 + 1$) · x
= (0xb1, 0x58, 0x2c, 0x96, 0xfa, 0x4c, 0xa6, 0x62)
⊕ (0x10, 0x88, 0xc4, 0x62, 0xa1, 0xc0, 0x60, 0x20)
⊕ (0x20, 0x10, 0x88, 0xc4, 0x42, 0x81, 0xc0, 0x40)
⊕ (0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01)
= (0x01, 0x80, 0x40, 0x20, 0x11, 0x09, 0x04, 0x03).
Due to the 'cancellation of XORs', we see that multiplication of x by 0x8d requires only 3 XORs. Similarly, the multiplication of x by 0x02 and 0x03 requires 3 and 11 XORs respectively.
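The XOR-count computation described above can be sketched in Python as follows. This is only an illustrative sketch: the function names (mul_by_alpha, mult_masks, xor_count) are ours and do not come from the article, and the polynomial is passed as an integer including its leading term (0x11b for $\alpha^8 + \alpha^4 + \alpha^3 + \alpha + 1$).

```python
def mul_by_alpha(masks, poly, n):
    """Given the input-bit masks (MSB first) of c*x, return those of (alpha*c)*x."""
    top = masks[0]                  # coefficient that overflows into alpha^n
    shifted = masks[1:] + [0]       # every coefficient moves up by one power
    # Reduce alpha^n: add the overflow wherever the polynomial has a 1 bit.
    return [m ^ top if (poly >> (n - 1 - k)) & 1 else m
            for k, m in enumerate(shifted)]


def mult_masks(c, poly, n):
    """Masks (MSB first) giving each output bit of c*x as an XOR of input bits."""
    power = [1 << (n - 1 - k) for k in range(n)]   # masks of 1*x, i.e. x itself
    acc = [0] * n
    for j in range(n):
        if (c >> j) & 1:                           # c contains alpha^j
            acc = [a ^ p for a, p in zip(acc, power)]
        power = mul_by_alpha(power, poly, n)
    return acc


def xor_count(c, poly, n):
    """Number of 2-input XORs: (Hamming weight - 1) summed over the output bits."""
    if c == 0:
        return 0
    return sum(bin(m).count("1") - 1 for m in mult_masks(c, poly, n))


if __name__ == "__main__":
    for c in (0x02, 0x03, 0x04, 0x08, 0x8d):
        print(hex(c), [hex(m) for m in mult_masks(c, 0x11b, 8)], xor_count(c, 0x11b, 8))
```

Summing the masks of the individual powers of $\alpha$ before counting is what captures the cancellation of XORs: bits that appear in an even number of the summed masks disappear from the final mask.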
In a similar fashion, for the purpose of finding lightweight diffusion matrices, we compute the XOR count for each field element. Table 3 and Table 4 show the XOR count for every element of the finite fields $GF(2^4)$ and $GF(2^8)$, respectively, defined by different irreducible polynomials. We now explain how to use the tables to calculate the number of XORs required to implement a row of a matrix, denoted by A in Section 6. Denote a given row of an $r \times r$ matrix over a finite field $GF(2^i)$ by $(x_1, x_2, \ldots, x_r)$. Let $\gamma_j$ be the XOR count in Table 3 or Table 4 (for i = 4 and i = 8 respectively) corresponding to the field element $x_j$. Then $A = (\gamma_1 + \cdots + \gamma_r) + (z - 1) \cdot i$, where z is the number of non-zero elements in the row. We give some examples: the row (0x1, 0x1, 0x4, 0x9) uses (0 + 0 + 2 + 1) + 3 × 4 = 15 XORs over $GF(2^4)$; the AES matrix with coefficients (0x01, 0x01, 0x02, 0x03) uses (0 + 0 + 3 + 11) + 3 × 8 = 38 XORs per row over $GF(2^8)$; the circulant matrix with coefficients (0x01, 0x01, 0x04, 0x8d) uses (0 + 0 + 6 + 3) + 3 × 8 = 33 XORs per row over $GF(2^8)$. This explains why the circulant matrix with coefficients (0x01, 0x01, 0x04, 0x8d) is lighter than the AES matrix.
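The row cost A can be reproduced with a few lines of Python. The sketch below is ours and only illustrates the formula: gf_mul, gamma and row_xor_count are hypothetical helper names, gamma recomputes the per-element XOR count rather than reading it from Table 3 or Table 4, and the polynomial is again given as an integer with its leading term (0x13 for $\alpha^4 + \alpha + 1$, 0x11b for the AES polynomial).

```python
def gf_mul(a, b, poly, n):
    """Multiply a and b in GF(2^n) defined by the irreducible polynomial poly."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> n:          # degree reached n: reduce
            a ^= poly
    return r


def gamma(c, poly, n):
    """XOR count of x -> c*x: for each output bit, (number of input bits used) - 1."""
    if c == 0:
        return 0
    images = [gf_mul(c, 1 << j, poly, n) for j in range(n)]   # c * alpha^j
    return sum(sum((img >> out) & 1 for img in images) - 1 for out in range(n))


def row_xor_count(row, poly, n):
    """A = (gamma_1 + ... + gamma_r) + (z - 1) * n, with z the number of non-zero entries."""
    z = sum(1 for c in row if c)
    return sum(gamma(c, poly, n) for c in row) + (z - 1) * n


if __name__ == "__main__":
    print(row_xor_count((0x1, 0x1, 0x4, 0x9), 0x13, 4))        # row over GF(2^4)
    print(row_xor_count((0x01, 0x01, 0x02, 0x03), 0x11b, 8))   # AES row
    print(row_xor_count((0x01, 0x01, 0x04, 0x8d), 0x11b, 8))   # lighter row
```

The term (z - 1) · n accounts for adding together the z non-zero n-bit products of the row.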
For a fair comparison, note that when decryption also needs to be lightweight, this matrix has less attractive inverse coefficients (0x71, 0x12, 0xdd, 0x20) than the inverse of the AES diffusion matrix (0x0b, 0x0d, 0x09, 0x0e). According to Table 4, the former matrix requires 98 XORs, while the latter requires 86 XORs to be implemented.
Subfield construction
In this section, we describe the subfield construction, which allows us to outperform the AES matrix by an even larger margin than the optimal matrix found in Section 7.1. As computed in the previous subsection, the MDS circulant matrix circ(0x1, 0x1, 0x4, 0x9) over $GF(2^4)$ defined by $\alpha^4 + \alpha + 1$ requires 15 XORs per row. Using the method of [13, Section 3.3], we can form a circulant MDS matrix over $GF(2^8)$ by using two parallel copies of Q = circ(0x1, 0x1, 0x4, 0x9) over $GF(2^4)$. The matrix is formed by writing each byte $q_j$ as a concatenation of two nibbles, $q_j = (q_j^L \| q_j^R)$. Then the MDS multiplication is computed on each half, $(u_1^L, u_2^L, u_3^L, u_4^L) = Q \cdot (q_1^L, q_2^L, q_3^L, q_4^L)$ and $(u_1^R, u_2^R, u_3^R, u_4^R) = Q \cdot (q_1^R, q_2^R, q_3^R, q_4^R)$, over $GF(2^4)$. The result is concatenated to form four output bytes $(u_1, u_2, u_3, u_4)$, where $u_j = (u_j^L \| u_j^R)$. This matrix needs just 15 × 2 = 30 XORs per row. In comparison, the lightest MDS circulant matrix circ(0x01, 0x01, 0x04, 0x8d) over $GF(2^8)$ defined by $\alpha^8 + \alpha^4 + \alpha^3 + \alpha + 1$ requires more XORs (33 XORs per row).
Further, we can serialize the above multiplication to do the left half followed by the right half, in which case only 15 XORs are needed to implement one row of the MDS matrix over GF (2 8 ). Another advantage of subfield construction is exemplified by the SPN-Hash construction [13] . It is difficult to find an 8 × 8 serial MDS matrix over GF (2 8 ) exhaustively. Instead, two parallel copies of the PHOTON 8 × 8 serial MDS matrix over GF (2 4 ) were concatenated to form the 8 × 8 serial MDS matrix over GF (2 8 ) for SPN-Hash.
It is straightforward to generalize this method to form a diffusion matrix with branch number B over GF (2 s ) from s/i copies of a diffusion matrix of the same branch number over a subfield GF (2 i ), where i divides s.
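As an illustration of the subfield construction, the following Python sketch applies two parallel copies of Q = circ(0x1, 0x1, 0x4, 0x9) to the left and right nibbles of a 4-byte column. It rests on assumptions that are ours rather than the article's: the function names are hypothetical, the circulant is built by rotating the first row to the right, and the matrix acts on a column vector.

```python
def gf16_mul(a, b, poly=0x13):
    """Multiply in GF(2^4) defined by alpha^4 + alpha + 1 (0x13)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= poly
    return r


def circulant(first_row):
    """Right-circulant matrix: row i is the first row rotated right by i positions."""
    r = len(first_row)
    return [[first_row[(j - i) % r] for j in range(r)] for i in range(r)]


def mat_vec(matrix, vec):
    """Matrix-vector product over GF(2^4)."""
    out = []
    for row in matrix:
        acc = 0
        for m, v in zip(row, vec):
            acc ^= gf16_mul(m, v)
        out.append(acc)
    return out


def subfield_mix(matrix, column_bytes):
    """Apply the nibble matrix separately to the left and right halves of each byte."""
    left = [b >> 4 for b in column_bytes]
    right = [b & 0xF for b in column_bytes]
    u_left = mat_vec(matrix, left)
    u_right = mat_vec(matrix, right)
    return [(l << 4) | r for l, r in zip(u_left, u_right)]


if __name__ == "__main__":
    Q = circulant([0x1, 0x1, 0x4, 0x9])
    print([hex(b) for b in subfield_mix(Q, [0x10, 0x00, 0x00, 0x00])])
```

Because the two halves never interact, a serialized datapath can reuse a single copy of the nibble-level circuit for both halves.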
Good matrices
In this section, we list optimal low-weight circulant and serial matrices of different branch number over the finite fields GF (2), GF (2 2 ), GF (2 4 ) and GF (2 8 ). Using the construction of Section 7.2, we can form diffusion matrices to transform nibbles and bytes from these subfields.
The optimal matrices are found by exhaustively checking the branch number of all matrices and choosing the one with the least number of XORs according to the method explained in Section 7.1. To check the branch number of a matrix Q, we concatenate it with the identity matrix $I_r$ to form $(I_r | Q)$, the generator matrix of the corresponding linear code, and use the MAGMA software to find its minimum distance. When we say a matrix has branch number B, we mean that the matrix has both differential and linear branch number equal to B. That is, we check that both Q and its transpose $Q^t$ have branch number B.
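For small parameters, the branch-number check can also be done by brute force instead of a minimum-distance computation. The Python sketch below is such a stand-in and is not the MAGMA procedure used in the article: it enumerates all non-zero inputs of a 4 × 4 matrix over $GF(2^4)$ and returns the smallest total number of non-zero input and output cells. The function names and the right-rotation convention for circulant matrices are our assumptions.

```python
from itertools import product


def gf16_mul(a, b, poly=0x13):
    """Multiply in GF(2^4) defined by alpha^4 + alpha + 1 (0x13)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= poly
    return r


# Precomputed multiplication table to keep the exhaustive loop fast.
MUL = [[gf16_mul(a, b) for b in range(16)] for a in range(16)]


def circulant(first_row):
    """Right-circulant matrix: row i is the first row rotated right by i positions."""
    r = len(first_row)
    return [[first_row[(j - i) % r] for j in range(r)] for i in range(r)]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def branch_number(matrix):
    """min over non-zero v of wt(v) + wt(matrix * v), weights counted in field elements."""
    r = len(matrix)
    best = 2 * r
    for v in product(range(16), repeat=r):
        if not any(v):
            continue
        weight = sum(1 for x in v if x)
        for row in matrix:
            acc = 0
            for m, x in zip(row, v):
                acc ^= MUL[m][x]
            if acc:
                weight += 1
        best = min(best, weight)
    return best


if __name__ == "__main__":
    Q = circulant([0x1, 0x1, 0x4, 0x9])
    print(branch_number(Q), branch_number(transpose(Q)))
```

The search visits 16^r inputs, so it is only practical for small r and small fields, which is the regime where an exhaustive check of candidate matrices is feasible in the first place.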
We are aware that better techniques than naive exhaustive search might be used here. However, such improvements are not the goal of this article and we leave them as potential future work.
Circulant matrices. In Table 5 , we list optimal r × r circulant matrices with branch numbers 3 to r + 1. The "First Row" column represents the first row of the circulant matrix (as described in Section 4). The "Number of XORs" column represents the number of XOR gates needed to implement one row of the circulant matrix.
The matrices are optimal in the sense that they need a minimal number of XORs to implement. In the event of a tie between two matrices, possibly over different finite field representations, we list just one of them. For example, the circulant matrices circ(0x01, 0x01, 0x04, 0x8d) over $GF(2^8)$ defined by $\alpha^8 + \alpha^4 + \alpha^3 + \alpha + 1$ and circ(0x01, 0x01, 0x04, 0x8e) over $GF(2^8)$ defined by $\alpha^8 + \alpha^4 + \alpha^3 + \alpha^2 + 1$ both outperform the AES matrix by using 33 XORs to implement one row, so we list only the latter. We use "-" to indicate that no circulant matrix with branch number B exists (verified either by exhaustive search or by coding theory bounds). For example, it can be verified that no 8 × 8 circulant MDS matrix exists over the finite field $GF(2^4)$.
The only case where we did not find the optimal matrix is the 8 × 8 circulant MDS matrix over $GF(2^8)$. Because the search space is too large to exhaust, we simply list the WHIRLPOOL matrix, which is MDS and low-weight.
Serial matrices. Here, in a similar fashion to the case of circulant matrices, we provide optimal low-weight serial matrices of various branch numbers over different finite fields.
In Table 5, we list optimal r × r serial matrices with branch numbers 3 to r + 1. The "Last Row" column represents the last row of the serial matrix (as described in Section 4). The "Number of XORs" column represents the number of XOR gates needed to implement the last row. Again, we simply list one matrix in the event of a tie, and use "-" to indicate that no serial matrix with branch number B exists. In addition, we use "*" to denote that we have not yet found a serial matrix with branch number B because of the huge search space. For instance, as the search space is too large to exhaust, we could not find an 8 × 8 serial MDS matrix over $GF(2^8)$. In this case, we can employ the subfield construction (described in Section 7.2), i.e. use two parallel copies of the 8 × 8 MDS serial matrix over $GF(2^4)$ with last row (0x2, 0xd, 0x2, 0x4, 0x3, 0xd, 0x5, 0x2) (refer to the second row of the 8 × 8 subtable of Table 5) to obtain the desired 8 × 8 serial MDS matrix over $GF(2^8)$.
Note that the special structure of the serial matrices leads to the fact that only XORs for the last row are required. In particular, for an r × r serial matrix, we just need to implement the last row as an LFSR feedback function and iterate it r times to obtain the required matrix multiplication. This tactic has been adopted in [16, 17] to define the PHOTON and LED MDS matrices over GF (2 4 ) respectively. The last rows of the serial matrices are (0x2,0x4,0x2,0xb,0x2,0x8,0x5,0x6) and (0x4,0x1,0x2,0x2) respectively, which require 53 and 16 XORs to implement per row. In fact, we have found lighter serial matrices over the same finite field: 8 × 8 serial matrix with last row (0x2,0xd,0x2,0x4,0x3,0xd,0x5,0x2) requiring 50 XORs; and 4 × 4 serial matrix with last row (0x2,0x1,0x1,0x4) requiring 15 XORs (refer to 4 × 4 and 8 × 8 subtables of Table 5 ).
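The LFSR view of a serial matrix can be sketched in Python as follows. The sketch assumes, as an illustration on our side, that the serial matrix is the r-th power of the companion matrix whose last row is the given coefficient row, and that one LFSR step shifts the state by one cell and appends the feedback value; serial_apply and serial_matrix are hypothetical names. The example uses the LED last row (0x4, 0x1, 0x2, 0x2) over $GF(2^4)$ defined by $\alpha^4 + \alpha + 1$.

```python
def gf16_mul(a, b, poly=0x13):
    """Multiply in GF(2^4) defined by alpha^4 + alpha + 1 (0x13)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= poly
    return r


def serial_apply(last_row, state):
    """Iterate the LFSR r times: each step drops the first cell and appends the feedback."""
    s = list(state)
    for _ in range(len(s)):
        feedback = 0
        for c, x in zip(last_row, s):
            feedback ^= gf16_mul(c, x)
        s = s[1:] + [feedback]
    return s


def serial_matrix(last_row):
    """Recover the full r x r matrix column by column from the unit vectors."""
    r = len(last_row)
    cols = [serial_apply(last_row, [1 if i == j else 0 for i in range(r)])
            for j in range(r)]
    return [[cols[j][i] for j in range(r)] for i in range(r)]


if __name__ == "__main__":
    for row in serial_matrix([0x4, 0x1, 0x2, 0x2]):
        print([hex(v) for v in row])
```

Only the feedback function costs XOR gates, which is why the XOR counts quoted for serial matrices refer to the last row alone.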
Application: FOAM Comparison for 64-bit SPN Structures
In this section, we compare the FOAM metric for 64-bit SPN structures (a typical block size for lightweight block ciphers). Table 6 (resp. Table 7) gives the results for an SPN structure based on circulant matrices (resp. serial matrices) with the 4-bit PRESENT S-box or the 8-bit AES S-box. The diffusion matrices are based on the optimal matrices found in Section 7.3. To compute $p(2^{64})$, the number of rounds needed to achieve a differential/linear probability $\le 2^{-64}$, we use the fact that the differential/linear probability of the PRESENT S-box is $2^{-2}$ and that of the AES S-box is $2^{-6}$. Then we lower bound the number of active S-boxes by concatenating 4-round bounds with $B \times B$ active S-boxes from Theorem 1, 2-round bounds with $B$ active S-boxes and 1-round bounds which involve only 1 active S-box. We also write down t, the time to compute one round for the serialized implementation (the time t for the round-based implementation is the constant 1, so it is not presented). We compute the FOAM for the round-based and serialized implementations based on the formula found in Section 6. We also present the FOAM for a half-half implementation, where we take the average, i.e. equal weighting, of the round-based and serialized FOAM. This corresponds to implementations which are good for both scenarios. However, please note that this represents just one example, as the weighting of the scenarios is clearly a designer's choice. The structures with the best area and FOAM are presented in bold font.

Table fragment (8 × 8 matrices; the top of the table and its column headings are missing here). Each line gives the branch number followed by two (matrix row, number of XORs) pairs; "-" marks an empty entry.

(field not shown)
(branch number not shown) | 1,0,0,0,0,1,1,2 | 27 | 1,0,0,1,1,1,0,0 | 24
4 | 1,0,0,0,0,0,1,1 | 16 | 1,0,0,0,0,1,1,0 | 16
3 | 1,0,0,0,0,0,0,2 | 11 | 1,0,0,0,0,0,1,0 | 8

$GF(2^4)$, $\alpha^4 + \alpha + 1$
6 | 1,0,0,0,1,1,1,2 | 17 | 1,1,0,0,1,1,2,0 | 17
5 | 1,0,0,0,0,1,1,2 | 13 | 1,0,0,1,1,1,0,0 | 12
4 | 1,0,0,0,0,0,1,1 | 8 | 1,0,0,0,0,1,1,0 | 8
3 | 1,0,0,0,0,0,0,2 | 5 | 1,0,0,0,0,0,1,0 | 4

$GF(2^2)$, $\alpha^2 + \alpha + 1$
9-8 | - | - | - | -
7 | - | - | 2,1,0,3,1,2,0,1 | 13
6 | 1,0,0,0,1,1,1,2 | 9 | 1,0,0,1,1,1,0,2 | 9
5 | 1,0,0,0,0,1,1,2 | 7 | 1,0,0,1,1,1,0,0 | 6
4 | 1,0,0,0,0,0,1,1 | 4 | 1,0,0,0,0,1,1,0 | 4
3 | 1,0,0,0,0,0,0,2 | 3 | 1,0,0,0,0,0,1,0 | 2

$GF(2)$
9-6 | - | - | - | -
5 | - | - | 1,0,0,1,1,1,0,0 | 3
4 | 1,0,0,0,0,0,1,1 | 2 | 1,0,0,0,0,1,1,0 | 2
3 | - | - | 1,0,0,0,0,0,1,0 | 1
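The round count $p(2^{64})$ described in this section can be illustrated with a small dynamic program. The Python sketch below is only an approximation of the procedure under our own assumptions: it takes the 4-round bound to be exactly $B \times B$ active S-boxes, the 2-round bound to be $B$ and the 1-round bound to be 1, it ignores any geometry-dependent refinement that Theorem 1 may introduce for rectangular states, and the function name min_rounds is hypothetical.

```python
import math


def min_rounds(branch, sbox_prob_log2, target_log2=64):
    """Smallest number of rounds whose concatenated bounds give enough active S-boxes.

    branch          -- branch number B of the diffusion matrix
    sbox_prob_log2  -- e.g. 2 for an S-box with differential/linear probability 2^-2
    target_log2     -- e.g. 64 for a target probability of 2^-64
    """
    needed = math.ceil(target_log2 / sbox_prob_log2)   # active S-boxes required
    best = [0]                                         # best[k]: bound for k rounds
    rounds = 0
    while best[rounds] < needed:
        rounds += 1
        candidates = [best[rounds - 1] + 1]                        # append a 1-round bound
        if rounds >= 2:
            candidates.append(best[rounds - 2] + branch)           # append a 2-round bound
        if rounds >= 4:
            candidates.append(best[rounds - 4] + branch * branch)  # append a 4-round bound
        best.append(max(candidates))
    return rounds


if __name__ == "__main__":
    print(min_rounds(5, 2))   # 4-bit S-box with probability 2^-2, MDS matrix with B = 5
    print(min_rounds(5, 6))   # 8-bit S-box with probability 2^-6, MDS matrix with B = 5
```

With B = 5 and an S-box probability of $2^{-2}$, 32 active S-boxes are required, which this sketch reaches only after two concatenated 4-round blocks.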
We see that for designing 64-bit SPN:
1. For minimal area the geometry is the most important criterion, while the choice of the field of the MDS matrix is of less importance. The geometry should be chosen such that c is maximized, so that many internal columns can be realized with 1-input flip-flops. A serial matrix is preferable to a circulant matrix, and in general smaller fields save a few GE but come at a high timing overhead.
2. When circulant matrices are used with PRESENT S-box in Table 6 [...] Tables 6 and 7, when AES S-box is used with 8 × 8 matrices, we go for the one with branch number 6 instead of the optimal 9.
Designs with Optimal FOAM for Different Block Sizes
In Section 7.4, we showed a detailed comparison table for all possible configurations of 64-bit SPN structures based on AES and PRESENT S-box. From it, we extract the optimal design with the highest FOAM, which gives the best trade-off between speed, area and security.
In this section, we apply the same computations to other common block sizes which are used in the construction of block ciphers and hash functions. The block sizes we consider are 48, 64, 96, 128, 256 and 512 bits. For conciseness and ease of reference, we only list the best FOAM values for each of these block sizes. In Table 8 , we list the designs with the best FOAM values based on circulant matrices. In Table 9 , we list the designs with the best FOAM values based on serial matrices.
When computing Table 8 , we found that out of the 19 configurations for 128-bit block size based on AES S-box and circulant matrices, the best FOAM is given by a 4 × 4 state array with an MDS matrix. This corresponds to the AES block cipher structure and shows that the best design picked by our FOAM measure corresponds to the best design picked by human intuition. 
Conclusion
We have introduced FOAM (Figure of Adversarial Merit), which for the first time allows the comparison of security-time-area trade-offs. Previous metrics, such as FOM, only take into account the trade-off between speed and power, or in other words implementation trade-offs. By integrating the cryptographic strength (given the vast number of distinct attacks, we only took into account the simple but most meaningful cryptanalysis techniques), FOAM enables a fairer comparison of the vast number of design choices, thus easing the optimization of designs for target applications. Implementation estimates are crucial at an early design stage to make the right choices: for serialized architectures, for example, the choice of the S-box size, geometry, subfield and type of matrix leads to an area range from 508 GE to 1336 GE (×2.6). In this work we have taken a step towards a generic hardware estimation metric that for the first time also considers control logic, which is hard to estimate. Furthermore, we have generalized the SPN structure from a square array to a rectangular array, which allows us to construct structures with more flexible sizes. A new bound on the number of active S-boxes for such structures is proven in Theorem 1. We also introduced new ways to compute lightweight coefficients of diffusion matrices, which we used to find circulant matrices that are lighter than the AES matrix and serial matrices that are lighter than the LED and PHOTON matrices.
Possible future work includes defining a similar FOAM for software, and finding new ways to define lightweight diffusion matrices (other than circulant and serial) and computing their FOAM. For example, we can construct the FOAM for SPN structures using the diffusion matrices from [2], which are based on BCH codes. Lastly, we need not limit ourselves to SPN structures, but can also extend FOAM to different Feistel and generalized Feistel structures such as the CLEFIA, SMS4 and Skipjack structures.
