Abstract. This paper describes and analyzes the security of a general-purpose cryptographic function design, with application in RFID tags and sensor networks. Based on these analyses, we suggest minimum parameter values for the main components of this cryptographic function, called ARMADILLO. With a fully serial architecture, 2 923 GE suffice to perform one compression function computation within 176 clock cycles, consuming 44 µW at a 1 MHz clock frequency. This could either authenticate a peer or hash 48 bits, or encrypt 128 bits on RFID tags. A better tradeoff would use 4 030 GE, 77 µW of power and 44 cycles for the same computation, to hash (resp. encrypt) at a rate of 1.1 Mbps (resp. 2.9 Mbps). As other tradeoffs are proposed, we show that ARMADILLO offers competitive performance for hashing relative to a fair Figure Of Merit (FOM).
Introduction
Cryptographic hash functions form a fundamental and pervasive cryptographic primitive, for instance, providing data integrity in digital signature schemes, and message authentication in MACs. In particular, there are very few known hardware-dedicated hash function designs, for instance, Cellhash [6] and Subhash [5]. On the other hand, Bogdanov et al. [2] suggest block-cipher based hash functions for RFID tags using the PRESENT block cipher. Concerning block and stream ciphers, the most prominent developments include PRESENT [1], TEA [22], HIGHT [13], Grain [12], Trivium [4] and the KATAN/KTANTAN family [3].
We propose a cryptographic function dedicated to hardware which can be used for several cryptographic purposes. Such functions rely on data-dependent bit transpositions [16]. Given a bitstring $x = x_{2k}\|\cdots\|x_1$, fixed permutations $\sigma_0$ and $\sigma_1$ over the set $\{1, 2, \ldots, 2k\}$, a bit string $s$, a bit $b \in \{0, 1\}$ and a permutation $\sigma$, define $x_{\sigma_s} = x$ when $s$ has length zero, and $x_{\sigma_{s\|b}} = x_{\sigma_s \circ \sigma_b}$, where $x_\sigma$ is the bit string $x$ transposed by $\sigma$, that is, $x_\sigma = x_{\sigma(2k)}\|\cdots\|x_{\sigma(1)}$. The function $(s, x) \mapsto x_{\sigma_s}$ is a data-dependent transposition of $x$. The function $s \mapsto \sigma_s$ can be seen as a particular case of the general semi-group homomorphism from $\{0, 1\}^*$ to a group $G$. It was already used in the Zémor-Tillich construction [21] for $G = SL_2$ and in braid group cryptography [10]. We observe that when $\sigma_0$ and $\sigma_1$ induce an expander graph on the vertex set $v = \{1, \ldots, 2k\}$, then $(s, x) \mapsto x_{\sigma_s}$ has good cryptographic properties.
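As a minimal sketch of the data-dependent transposition defined above, the following Python fragment applies $\sigma_0$ or $\sigma_1$ once per bit of $s$. The 0-based list layout, the left-to-right consumption of $s$, and the two toy permutations are assumptions made for illustration only; they are not part of the specification.

```python
def transpose(x, sigma):
    """Return x_sigma: position i of the result holds x[sigma[i]] (0-based)."""
    return [x[sigma[i]] for i in range(len(x))]


def data_dependent_transposition(s, x, sigma0, sigma1):
    """Apply sigma_b for each bit b of s, following x_{s||b} = (x_s)_{sigma_b}."""
    for b in s:
        x = transpose(x, sigma1 if b else sigma0)
    return x


# Toy permutations on 4 positions (hypothetical, chosen only for the demo).
SIGMA0 = [1, 2, 3, 0]  # rotation
SIGMA1 = [3, 2, 1, 0]  # reversal

if __name__ == "__main__":
    print(data_dependent_transposition([0, 1], [1, 0, 0, 0], SIGMA0, SIGMA1))
```

Since only transpositions are applied, the Hamming weight of $x$ is preserved whatever the control string $s$ is.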
This paper is organized as follows: Sect. 2 describes a general-purpose cryptographic function called ARMADILLO. In Sect. 3 we analyze ARMADILLO. Sect. 4 contains design criteria for the bit permutation components of ARMADILLO. Sect. 5 suggests parameter vectors. Sect. 6 presents an updated design, called ARMADILLO2. Sect. 7 provides implementation results. Sect. 8 compares hardware implementations of ARMADILLO with other well-known hash functions.
Notations. Throughout this document, $\|$ denotes the concatenation of bitstrings, $\oplus$ denotes the bitwise XOR operation, and $\overline{x}$ denotes the bitwise complement of a bitstring $x$; we assume the little-endian numbering of bits, such as $x = x_{2k}\|\cdots\|x_1$.
The ARMADILLO Function
ARMADILLO maps an initial value $C$ and a message block $U_i$ to two values $(V_c, V_t) = \mathrm{ARMADILLO}(C, U_i)$.
By definition, $C$ and $V_c$ are of $c$ bits, $V_t$ as well as each block $U_i$ are of $m$ bits, and a register Xinter is of $k = c + m$ bits. ARMADILLO is defined by integer parameters $c$, $m$, $J = c + m$, and two fixed permutations $\sigma_0$ and $\sigma_1$ over the set $\{1, 2, \ldots, 2k\}$. ARMADILLO($C$, $U$) works as follows (see Fig. 1):
1: set Xinter $= C\|U$;
2: set a $2k$-bit register $x = \text{Xinter}\|\overline{\text{Xinter}}$;
3: $x$ undergoes a sequence of bit permutations, $\sigma_0$ and $\sigma_1$, which we denote by $P$. $P$ maps a bitstring of $k$ bits and a vector $x$ of $2k$ bits into another vector of $2k$ bits. Assuming $J = k$, the output of this sequence of $J$ bit permutations is truncated to the rightmost $k$ bits, denoted $S$, by $S = \mathrm{tail}_k(P(\text{Xinter}, x))$.
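Steps 1 to 3 above can be sketched as follows. This is an illustration under stated assumptions, not the reference implementation: lists are written in the same left-to-right order as the bitstrings, the control bits of Xinter are consumed from left to right, and `random_permutation` is a hypothetical helper that stands in for the real fixed permutations $\sigma_0$ and $\sigma_1$.

```python
import random


def transpose(x, sigma):
    """Position i of the result holds x[sigma[i]] (0-based)."""
    return [x[sigma[i]] for i in range(len(x))]


def P(s, x, sigma0, sigma1):
    """Apply one bit permutation per control bit of s (J = len(s) rounds)."""
    for b in s:
        x = transpose(x, sigma1 if b else sigma0)
    return x


def armadillo_steps_1_to_3(C, U, sigma0, sigma1):
    xinter = C + U                               # step 1: Xinter = C || U
    x = xinter + [1 - bit for bit in xinter]     # step 2: Xinter || complement
    out = P(xinter, x, sigma0, sigma1)           # step 3: J = k permutations
    k = len(xinter)
    return out[-k:], out                         # S = rightmost k bits


def random_permutation(n, seed):
    """Hypothetical stand-in for a fixed permutation of n positions."""
    perm = list(range(n))
    random.Random(seed).shuffle(perm)
    return perm
```

Because $x$ is built from Xinter and its complement, the $2k$-bit state always has Hamming weight exactly $k$, which is the property exploited by the distinguisher discussed later.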
The security is characterized by two parameters $S_{\text{offline}}$ and $S_{\text{online}}$. Concretely, the best offline attack has complexity $2^{S_{\text{offline}}}$, while the best online one, with practical complexity, has success probability $2^{-S_{\text{online}}}$. Typically, we aim at $S_{\text{offline}} \ge 80$ and $S_{\text{online}} \ge 40$. However, we can only upper bound $S_{\text{offline}}$ and $S_{\text{online}}$.
Application I: FIL-MAC. For challenge-response protocols (e.g. for RFID tags [17] ), the objective is to have a fixed input-length MAC. Suppose that C is a secret and U is a challenge. The value V t is the response or the authentication tag. We write
Additionally, the $V_c$ output could be used to renew the secret in a synchronized way or to derive an encryption key for a secure messaging session as specified in [17]. The security of challenge-response protocols requires that an adversary cannot extract from the RFID tag enough information to impersonate the tag with high probability. In this FIL-MAC context, the $C$ parameter can be recovered by exhaustive search with complexity $2^c$, where $c = |C|$; so, $S_{\text{offline}} \le c$. In addition to this, the adversary can try to guess $V_t$ online with probability $2^{-m}$, so $S_{\text{online}} \le m$.
Application II: Hashing and digital signatures. For hashing variable-length input messages, we assume a strengthened Merkle-Damgård [7, 15] construction (with padding using a length suffix) for ARMADILLO, with $V_c$ as chaining variable, $U$ as message block and the final $V_c$ as hash digest. The initial value (IV) can use the fractional part of the square root of 3 truncated to $c$ bits, similar to the values adopted in the SHA-2 hash function family [20]. We write $V_c = \mathrm{AHASH}_{IV}(\text{message}\|\text{padding})$.
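The iteration described above can be sketched in Python as follows. Only the IV derivation follows the text directly; the exact padding layout (a `0x80` byte, zero bytes, then a 64-bit length suffix) and the `toy_compress` function, which uses SHA-256 as a stand-in for the ARMADILLO compression function, are assumptions chosen to keep the example runnable.

```python
import hashlib
import math


def iv_from_sqrt3(c_bits):
    """Fractional part of sqrt(3), truncated to c_bits bits (integer arithmetic)."""
    return math.isqrt(3 << (2 * c_bits)) - (1 << c_bits)


def md_pad(message, block_bytes):
    """Assumed strengthening: 0x80, zero bytes, 8-byte big-endian bit length."""
    length = (8 * len(message)).to_bytes(8, "big")
    padded = message + b"\x80"
    padded += b"\x00" * (-(len(padded) + 8) % block_bytes)
    return padded + length


def toy_compress(chaining, block, c_bits):
    """Stand-in compression function (NOT ARMADILLO): truncated SHA-256."""
    data = chaining.to_bytes((c_bits + 7) // 8, "big") + block
    digest = int.from_bytes(hashlib.sha256(data).digest(), "big")
    return digest >> (256 - c_bits)


def ahash(message, c_bits=80, m_bits=48, compress=toy_compress):
    """Merkle-Damgard iteration: V_c is the chaining value and the digest."""
    block_bytes = m_bits // 8
    v = iv_from_sqrt3(c_bits)
    padded = md_pad(message, block_bytes)
    for i in range(0, len(padded), block_bytes):
        v = compress(v, padded[i:i + block_bytes], c_bits)
    return v
```

The values `c_bits=80` and `m_bits=48` are illustrative defaults; any compression function with the same signature can be passed through the `compress` parameter.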
Generic birthday attacks are expected to find collisions in ARMADILLO with complexity $2^{c/2}$.
with an input $x$ with length multiple of $m$ and cste a $(r - 1)m$-bit constant. A relevant property for this application is indistinguishability. Assuming a secret seed, ARMADILLO could be used as a stream cipher. The keystream is composed of $t$-bit frames where the $i$th frame is $\mathrm{APRF}_{\text{seed}}(i)$. The index $i$ can be synchronized, or sent in clear in which case we have a self-synchronous stream cipher. In this setting, the output should be indistinguishable from a truly random string when the key is random.
Dedicated Attacks
Key recovery. Suppose $V_c\|V_t$ and $U$ are known, and we look for $C$. Since $U$ is known, $\mathrm{tail}_m(\text{Xinter})$ is known. Guessing $\mathrm{tail}_{J-m}(C)$ gives access to $\mathrm{tail}_J(S)$, since $U$ is known. This fact motivates a meet-in-the-middle attack to recover $\mathrm{tail}_{J-m}(C)$.
Free-start collision. We look for a triplet $(C, U, U')$ that causes a collision, that is,
with $\approx$ meaning that the Hamming weight of the difference is some low value $w$. Then, we hope that the next $P$ permutation will move all $w$ different bits outside the window of the $c + m$ bits which are kept in $S$. Since the probability for a vector to have weight $w$ is $\binom{2c+2m}{w} 2^{-2c-2m}$, the number of solutions we get is $\binom{2c+2m}{w} 2^{-c}$ on average. The probability that a solution leads to a collision is the probability that $w$ difference bits are moved outside a window of c bits. Finally, the expected number of collisions we can get is $\binom{c+m}{w} 2^{-c}$. We can now fix $w = w_{opt}$ such that $\binom{c+m}{w_{opt}} \ge 2^c$ so that we can find one solution with complexity $2^{w_{opt}}$. To implement the attack, for all $U$ and $U'$ we enumerate all $C$'s such that
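The condition $\binom{c+m}{w_{opt}} \ge 2^c$ used in the free-start collision analysis can be evaluated numerically. The sketch below searches for the smallest such weight; the function name `w_opt` and the sample values $c = 80$, $m = 48$ are illustrative assumptions.

```python
from math import comb


def w_opt(c, m):
    """Smallest w with C(c + m, w) >= 2^c, or None if no weight qualifies."""
    for w in range(c + m + 1):
        if comb(c + m, w) >= 2 ** c:
            return w
    return None


if __name__ == "__main__":
    print(w_opt(80, 48))
```

When $m$ is too small relative to $c$, no weight satisfies the inequality and the function returns `None`.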
A distinguisher. Assuming that the $J$ iterations in the $P$ permutation output a random $2k$-bit vector of Hamming weight $k$, we have $\binom{2k}{k}$ possible vectors. By extracting a window of $t$ bits we do not have a uniformly distributed string. Indeed, any possible string of weight $w$ has a probability of $p(w) = \binom{2k-t}{k-w} / \binom{2k}{k}$. There exists a distinguisher to tell whether a $t$-bit window comes from a random output from $P$ or a truly random string, with advantage $\frac{1}{2}\sum_{w=0}^{t}\binom{t}{w}\left|p(w) - 2^{-t}\right|$.
For $t = k = 160$, this is 0.1658. Here, the distinguisher recognizes $P$ when the Hamming weight $w$ is in the interval $[75, \ldots, 85]$, and a random string otherwise. The final XOR hides this bias a bit, but we can wonder by how much exactly. Assume that we hash a message of $r$ blocks. The final output is the XOR of the initial value together with $r$ outputs from $P$. Assuming that the initial value is known and that the $P$ outputs are random and independent, we can compute the distribution of the final hash by convolution. Indeed, the probability that it is a given string $x$ is $p_r(x)$ such that $\hat{p}_r(\mu) = (\hat{p}_1(\mu))^r$. We can now compute $\hat{p}_1(\mu) = \sum_x (-1)^{\mu \cdot x} p_1(x)$.
It only depends on $\mathrm{wt}(\mu)$ so we write $\hat{p}_1(\mathrm{wt}(\mu))$. Since $\sum_\mu \hat{p}_r(\mu)^2 = 2^t \sum_x p_r(x)^2$ we deduce that the Squared Euclidean Imbalance (SEI) of the difference of the hash of $r$ blocks with the initial value is
We have $S_{\text{offline}} \le -\log_2 \mathrm{SEI}_r$, where $r$ is the minimal number of blocks which are processed in Application III. The SEI expresses as
As an example, we computed $\mathrm{SEI}_r$ for four selections of $t = k$. Given $k$ and $c$, we look for $r$ and $t$ such that $\mathrm{SEI}_r < 2^{-c}$ and $r/t$ is minimal.
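The window distinguisher and the SEI discussed in this section can be explored numerically with exact rational arithmetic. The sketch below rests on two assumptions: the window probability is taken as $p(w) = \binom{2k-t}{k-w}/\binom{2k}{k}$, which follows from counting the weight-$k$ vectors that contain a given window, and the SEI is taken as the usual sum of squared Walsh coefficients over nonzero masks, with $\hat{p}_r = \hat{p}_1^{\,r}$. All function names are illustrative.

```python
from fractions import Fraction
from math import comb


def p_window(k, t, w):
    """Probability of one given t-bit string of weight w in the window."""
    if k - w < 0:
        return Fraction(0)
    return Fraction(comb(2 * k - t, k - w), comb(2 * k, k))


def advantage(k, t):
    """Statistical distance between the window and a uniform t-bit string."""
    uniform = Fraction(1, 2 ** t)
    return sum(comb(t, w) * abs(p_window(k, t, w) - uniform)
               for w in range(t + 1)) / 2


def favourable_weights(k, t):
    """Weights for which the window is more likely than a uniform string."""
    uniform = Fraction(1, 2 ** t)
    return [w for w in range(t + 1) if p_window(k, t, w) > uniform]


def krawtchouk(t, w, u):
    """Sum of (-1)^(mu.x) over all x of weight w, for a mask mu of weight u."""
    return sum((-1) ** j * comb(u, j) * comb(t - u, w - j)
               for j in range(min(u, w) + 1))


def sei(k, t, r):
    """SEI of the XOR of r independent windows (assumed definition)."""
    total = Fraction(0)
    for u in range(1, t + 1):
        p_hat = sum(p_window(k, t, w) * krawtchouk(t, w, u)
                    for w in range(t + 1))
        total += comb(t, u) * p_hat ** (2 * r)
    return total
```

Calling `float(advantage(160, 160))` gives the advantage for $t = k = 160$, and `favourable_weights(160, 160)` lists the weights on which the distinguisher answers "P". The `sei` function is cubic in $t$, so it is best tried on small parameters first.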
Permutation-Dependent Attacks
In this section we present security criteria for the σ 0 and σ 1 permutations.
Another distinguisher. Consider a set $I$ of indices from $V = \{1, \ldots, 2k\}$. Let $\mathrm{swap}_I(\sigma) = \#\{i \in I;\ \sigma(i) \notin I\}$ and $\mathrm{wt}_I(x) = \sum_{i \in I} x_i$. We assume that $s_b = \mathrm{swap}_I(\sigma_b)$ is low for $b = 0$ and $b = 1$ to see how much the low diffusion between inside and outside $I$ would lead to a distinguisher on $P(s, \cdot)$ with a random $s$ of $J$ bits. In the worst case we can assume that all indices in $I$ are in the same half of $x$ so that the distinguisher can choose the input on $P$ with a very biased $\mathrm{wt}_I(x)$. A permutation $\sigma_b$ keeps $\#I - s_b$ of the bits inside $I$ and introduces $s_b$ bits from outside $I$. Assuming that all bits inside and outside $I$ are randomly permuted, we have the approximation
Thus,
On average over the control bits, we have
The best strategy for the distinguisher consists of having either wt I (x) = 0 or wt I (x) = #I. In both cases we have
The number of samples to significantly observe this bias is
So, $S_{\text{offline}} \le \log_2 T$. This expression relates to the theory of expander graphs. We provide below a sufficient condition which can be easily checked.
To compute the minimal value of $\frac{s_0 + s_1}{2\#I}$ over all $I$ we observe that if $P_{\sigma_b}$ is the matrix of permutation $\sigma_b$ and if $x_I$ is the 0-1 vector whose coordinates of index in $I$ are the ones set to 1, then
Let $u$ be the vector with all coordinates set to 1. Clearly, the hyperplane $u^\perp$ orthogonal to $u$ is stable by the matrix $M = \frac{1}{4}\left(P_{\sigma_0} + P_{\sigma_0}^t + P_{\sigma_1} + P_{\sigma_1}^t\right)$, where the superscript $t$ indicates the transpose matrix. We can easily see that $Mu = u$. Furthermore, we notice that $Mx = \lambda x$ with $x \ne 0$ implies $|\lambda| \le 1$. Let $\lambda$ be the second largest eigenvalue of $M$, or equivalently the largest eigenvalue of operator $M$ restricted to $u^\perp$. Note that $\lambda$ can be $\lambda = 1$ if the eigenvalue 1 has multiplicity higher than one. We can easily prove that $|\lambda| = 1$ and $Mx = \lambda x$ with $x \ne 0$ implies that $x_i$ is constant for all $i \in I$, for all connected components $I$ for the relation $i \sim j \iff \exists s\ \sigma_{s_{|s|}} \circ \cdots \circ \sigma_{s_2} \circ \sigma_{s_1}(i) = j$. Hence, the only sets $I$ which are stable by $\sigma_0$ and $\sigma_1$ at the same time are the empty one and the complete set if and only if eigenvalue 1 has multiplicity one. So, having $\lambda < 1$ is already a reasonable criterion but we can have a more precise one. We know that for any vector $x$ orthogonal to $u$ we have
$\frac{x^t M x}{x^t x} \le \lambda$ with equality when $x$ is an eigenvector for $\lambda$. Thus,
From (2),
Going back to the complexity (1) of our distinguisher we have $T \ge \lambda^{-2J}$. Hence, by having $\lambda \le 2^{-S_{\text{offline}}/(2J)}$ for an offline complexity $2^{S_{\text{offline}}}$, we make sure that the distinguisher has complexity $T \ge 2^{S_{\text{offline}}}$. To conclude, if $\lambda$ is the second largest eigenvalue of $M = \frac{1}{4}\left(P_{\sigma_0} + P_{\sigma_0}^t + P_{\sigma_1} + P_{\sigma_1}^t\right)$ then we have an attack of complexity $\lambda^{-2J}$. So, $S_{\text{offline}} \le -2J \log_2 \lambda$.
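The eigenvalue criterion above can be checked for candidate permutations. The following sketch estimates the second largest eigenvalue of $M$ by power iteration on $(I + M)/2$ restricted to $u^\perp$, then evaluates the bound $-2J\log_2\lambda$. The choice of power iteration, the iteration count, and the function names are assumptions made for this illustration; a linear-algebra library would serve equally well.

```python
import math
import random


def inverse(sigma):
    """Inverse of a permutation given as a 0-based list."""
    inv = [0] * len(sigma)
    for i, j in enumerate(sigma):
        inv[j] = i
    return inv


def apply_M(x, perms):
    """Compute Mx, with M the average of the four permutation matrices."""
    return [sum(x[p[i]] for p in perms) / 4 for i in range(len(x))]


def second_eigenvalue(sigma0, sigma1, iterations=2000, seed=0):
    """Estimate the largest eigenvalue of M on the hyperplane orthogonal to u."""
    n = len(sigma0)
    perms = [sigma0, inverse(sigma0), sigma1, inverse(sigma1)]
    rng = random.Random(seed)
    x = [rng.uniform(-1, 1) for _ in range(n)]
    lam = 0.0
    for _ in range(iterations):
        mean = sum(x) / n
        x = [v - mean for v in x]                  # project onto u-perp
        norm = math.sqrt(sum(v * v for v in x))
        x = [v / norm for v in x]
        mx = apply_M(x, perms)
        lam = sum(a * b for a, b in zip(x, mx))    # Rayleigh quotient
        x = [(a + b) / 2 for a, b in zip(x, mx)]   # one step of (I + M)/2
    return lam


def offline_bound(lam, J):
    """Upper bound -2J log2(lambda) on S_offline, for 0 < lambda <= 1."""
    if lam <= 0:
        return math.inf
    return -2 * J * math.log2(lam)
```

Shifting by the identity keeps all eigenvalues of the iterated operator in $[0, 1]$, so the iteration converges to the largest eigenvalue of $M$ on $u^\perp$ rather than to the one of largest absolute value.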
Yet another distinguisher. We define the vector $\mathbf{x}$ of dimension $k$ such that the $i$th coordinate of $\mathbf{x}$ is the probability that $x_i$ is set to 1. If $x$ is fixed, we can consider that $\mathbf{x}$ is equal to $x$ by abuse of notation. If $y = x_\sigma$, we have that $\mathbf{y}$ is obtained by multiplying a permutation matrix $P_\sigma$ by $\mathbf{x}$. We have $(P_\sigma)_{j,i} = 1$ if and only if $j = \sigma(i)$. Clearly, for $y = x_{\sigma_b}$ we can write
We define a square matrix $F$ in which all terms are equal to $\frac{1}{k}$. Clearly, if $s$ is a uniformly distributed $J$-bit random string, the probability vector . . .
The complexity is
, so, we have S offline ≤ −2 log 2 b 2 − 1.
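The parity of P. Let $\varepsilon_i$ be the parity of $\sigma_i$. Then $x \mapsto P(s, x)$ is a permutation whose parity is $\varepsilon_0^{|s| - \mathrm{wt}(s)}\, \varepsilon_1^{\mathrm{wt}(s)}$. If $\varepsilon_0 \ne \varepsilon_1$, an adversary with black-box access to $x \mapsto P(s, x)$ and knowing $|s|$ can thus easily deduce the parity of $\mathrm{wt}(s)$. We thus recommend that $\varepsilon_0 = \varepsilon_1$.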
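The parity argument can be verified on small examples. The sketch below computes the sign of a permutation from its cycle structure and compares the sign of the composed permutation with the predicted product; the helper names and the two toy permutations are assumptions for illustration.

```python
def parity(sigma):
    """Return +1 for an even permutation and -1 for an odd one."""
    seen = [False] * len(sigma)
    sign = 1
    for start in range(len(sigma)):
        length = 0
        i = start
        while not seen[i]:
            seen[i] = True
            i = sigma[i]
            length += 1
        if length and length % 2 == 0:   # each even-length cycle flips the sign
            sign = -sign
    return sign


def compose_by_key(s, sigma0, sigma1):
    """Permutation applied to the positions after processing all bits of s."""
    total = list(range(len(sigma0)))
    for b in s:
        step = sigma1 if b else sigma0
        total = [total[step[i]] for i in range(len(step))]
    return total


def predicted_parity(s, sigma0, sigma1):
    """eps0^(|s| - wt(s)) * eps1^wt(s), as stated in the text."""
    ones = sum(s)
    return parity(sigma0) ** (len(s) - ones) * parity(sigma1) ** ones


ODD = [1, 0, 2, 3]    # a transposition (odd), toy example
EVEN = [1, 2, 0, 3]   # a 3-cycle (even), toy example
```

With one odd and one even permutation, the overall sign reveals the number of zero bits of $s$ modulo 2, hence the parity of $\mathrm{wt}(s)$ once $|s|$ is known.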
Parameter Vectors
Here we suggest sets of parameters for four different applications, based on our analyses. In all cases, we require $J = c + m$ and also that $\sigma_0$ and $\sigma_1$ have the same parity.
I: in a challenge-response application: S offline ≤ min(c, To match the ideal security, we need these bounds to yield S offline ≤ c and S online ≤ m for Application I, S offline ≤ Table 1 . Note that c is the key length for Applications I and III and also the digest length for Application II. 
ARMADILLO2
Since the first version of ARMADILLO, we have developed an updated design, called ARMADILLO2, that is even more robust than the version presented in Fig. 1. In fact, ARMADILLO2 brings in a new compression function, called Q, which is not only more compact in hardware than P, but also addresses security concerns raised during the continuous analyses of ARMADILLO. For these reasons, ARMADILLO2 is our preferred design choice. Due to space limitations, further details about the security analysis of ARMADILLO2 are omitted. ARMADILLO2 is defined by
We call the new permutation Q, instead of P as in Fig. 1 , to avoid confusion. The main novelties are:
- there is no complementation of the $k$-bit input Xinter $= C\|U$ anymore; as a consequence, the $\sigma_i$ permutations (and therefore $Q$) now operate on $k$-bit data $C\|U$, instead of $C\|U\|\overline{C\|U}$, leading to a more compact design;
- a new permutation $Q$ which interleaves the $\sigma_i$'s, $i \in \{0, 1\}$, with an XOR using the $k$-bit constant bitstring $\gamma = 1010 \cdots 10$; $Q$ is defined recursively as $Q(s\|b, X) = Q(s, X_{\sigma_b} \oplus \gamma)$ and $Q(\emptyset, X) = X$, for $b \in \{0, 1\}$ and bitstrings $s$ and $X$;
- the outermost $Q$ is controlled by a data-dependent value, $X = Q(U, C\|U)$, in contrast to simply $C\|U$ in Fig. 1.
In the new structure of Q, the output bias disappears and we can take r = 1 and t = k.
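The recursive definition of Q and the data-dependent control value can be sketched as follows. The list layout (leftmost character of a bitstring at index 0), the helper names, and the random stand-in permutations are assumptions; the sketch covers only the two nested calls to Q described above and makes no claim about how the final outputs are derived from them.

```python
import random


def transpose(x, sigma):
    """Position i of the result holds x[sigma[i]] (0-based)."""
    return [x[sigma[i]] for i in range(len(x))]


def gamma(k):
    """The k-bit constant 1010...10, written left to right (k even)."""
    return [1 - (i % 2) for i in range(k)]


def Q(s, X, sigma0, sigma1):
    """Q(s||b, X) = Q(s, X_{sigma_b} xor gamma): the last bit of s acts first."""
    g = gamma(len(X))
    for b in reversed(s):
        moved = transpose(X, sigma1 if b else sigma0)
        X = [v ^ c for v, c in zip(moved, g)]
    return X


def armadillo2_core(C, U, sigma0, sigma1):
    """Inner call yields the control value X; outer call is keyed by X."""
    data = C + U
    X = Q(U, data, sigma0, sigma1)
    return Q(X, data, sigma0, sigma1), X


def random_permutation(n, seed):
    """Hypothetical stand-in for a fixed permutation of n positions."""
    perm = list(range(n))
    random.Random(seed).shuffle(perm)
    return perm
```

Unlike P, the XOR with $\gamma$ means the Hamming weight of the state is no longer invariant from round to round.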
Hardware Implementation and Performance
Different application scenarios place different demands on the implementation and on its optimization goals. In this context, the scalability of ARMADILLO allows the implementation to be deployed over a very wide range of area and speed parameters, which constitutes the most essential trade-off in electronic circuits. The implementation of the P function, using the building block, is depicted in Fig. 2(b). It accepts an input vector of 2k bits and a key of J bits. It consists of a variable number N of permutation stages, all identical, and each stage essentially requires 2k multiplexers (Fig. 2(a)). One register of 2k bits is needed to hold the input and/or intermediate data, as well as one J-bit register to hold the permutation key. At each cycle, these registers are either loaded with new data/key or fed back the output data/key for a new permutation round, depending on the state of the load signal. The number N of permutations executed in each cycle can be adjusted, the only restriction being that J be an integer multiple of N. The output data is the 2k-bit vector resulting from the permutation round, and the output key is the J − N bits remaining to be processed.
This building block can be flexibly assembled into a T-stage pipeline, where each stage performs a number R = J/(N · T) of permutation rounds (building blocks) before passing the results to the next stage and accepting new input from the previous stage. In that case, the throughput is 1/R items per cycle and the latency is J/N cycles, the parameters being linked by the equality R · N · T = J. The latency / throughput / cost trade-off can be adjusted, the two extreme cases being R = 1 (fully pipelined, resulting in a throughput of 1 item per cycle) and T = 1 (fully serial, resulting in a throughput of N/J items per cycle). Obviously, the more pipeline stages, the more hardware replication and therefore the higher the cost in area and power.
To construct the complete hash function of Fig. 1, we essentially need to add a state machine (which is little more than a counter) around the permutation function block, and the final XOR operation.
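The relations between N, T and R stated above can be tabulated with a short helper. This is a sketch of the arithmetic only; the function name and the dictionary keys are illustrative assumptions.

```python
def pipeline_figures(J, N, T):
    """Rounds per stage, throughput and latency for given J, N and T."""
    if J % (N * T) != 0:
        raise ValueError("J must be an integer multiple of N * T")
    R = J // (N * T)                      # from the equality R * N * T = J
    return {
        "R": R,
        "throughput_items_per_cycle": 1 / R,
        "latency_cycles": J // N,
    }


if __name__ == "__main__":
    for N, T in [(1, 1), (4, 1), (1, 128)]:
        print(N, T, pipeline_figures(128, N, T))
```

The value J = 128 in the usage lines is an example; the fully serial case T = 1 gives R = J/N and hence a throughput of N/J items per cycle.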
Metrics for evaluating performance
In order to compare different cryptographic functions, several metrics can be taken into account. The security is of course the primary concern. The silicon area, the throughput, the latency and the power dissipation are other metrics of interest, and can be traded off for one another. For example, the power dissipation is nearly proportional to the clock frequency in any CMOS circuit; therefore, power can be reduced by decreasing the clock frequency, at the expense of throughput. Conversely, throughput can be increased by running at a faster clock frequency, up to a maximum clock frequency which is process- and implementation-dependent. Another example is serialization, where an operation is broken into several steps executed in series, allowing the same hardware to be reused, but again at the cost of a longer execution time. Through serialization, throughput and latency can be traded off for area, down to a point where operations cannot be broken into smaller operations anymore and we have reached a minimum area. Given this large design space, comparing the relative merit of different cryptographic functions is a challenging task.
The approach taken in [2] (and numerous other publications) includes comparing the area of synthesized circuits as reported in the literature or estimated by the authors in gate-equivalent (GE). It is notable though that the GE unit of measure, while being convenient because it is process-independent, is very coarse. For example, does the reported area after synthesis include the space needed for wiring? Typically, the utilization of a routed circuit can be in the range of 50%-80%, and is especially critical when using a limited number of metal layers for routing. A synthesis tool may report an estimated routing area, but in all cases it may vary to a large extent after physical implementation. Consider also that one design may have scan chains inserted while another may not, which may increase the register area by as much as 20-30% and require extra interconnections. Furthermore, different standard cells may be of varying area efficiency; as an illustration of this fact, a comparison of gate-equivalent figures from different standard-cell libraries can produce different results with a ratio up to
Besides comparing areas, the authors of [2] also use a metric called efficiency, which is defined as the ratio of the throughput (measured at a fixed clock frequency) over the area. It may seem at first sight that such a metric provides a more general measure of quality, since it may be fair to give up some area for a higher throughput; however, it is flawed in that it does not consider the possibility of trading off throughput for power. Indeed, according to this metric, two designs A and B would be deemed of equal value if, for example, A's throughput and area were twice B's throughput and area, respectively. However, if B's power dissipation is half that of A at the same clock frequency, then by doubling B's operating frequency, its throughput can be made equal to that of A while consuming the same power and still occupying a smaller area.
Clearly then, B should be recognized as superior to A, which can be captured by dividing the metric by the power dissipation, thus making it independent of the power/throughput trade-off. However, this does not come without its own problems, since the power dissipation is an extremely volatile quantity. Being subject to the same error factors as the area as described above, it also depends heavily on the process technology, the supply voltage, and the parasitic capacitances due to the interconnections. Furthermore, it can vary largely depending on the method used to measure it (i.e. gate-level statistical or vector-based simulation, or SPICE simulation). As if this were not enough, different standard-cell libraries also exhibit various power/area/speed trade-offs, for example, a circuit implemented with a high-density library is likely to result in a lower power figure than the same circuit implemented with a general-purpose library, for a similar gate count.
Nevertheless, a fairer figure of merit would need to include the influence of power dissipation. In order to keep process-independent metrics, we can assume that the power is proportional to the gate count. This is reasonable since the dynamic power in CMOS circuits is proportional to the total switched capacitance, which correlates to the area. We propose therefore to use a figure of merit defined as FOM = throughput/GE$^2$. In practice, this is a coarse approximation, since it does not take into account switching activity or the influence of wire load; it is nevertheless fairer than not including power dissipation at all, since it tends to favor designs with smaller area (at equal throughput) which are very likely to dissipate less power. Table 2 presents the results of synthesis for the hash function described above in a 0.18 µm CMOS process using a commercial standard-cell library, with the parameters given in Sect. 5. Synthesis was performed with Synopsys Design Compiler in topographical mode, in order to obtain accurate wire loads. The power consumption was evaluated with Synopsys Primetime-PX using gate-level vector-based analysis.
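As a worked illustration of the proposed figure of merit, the sketch below evaluates throughput/GE$^2$ for the two design points quoted in the abstract (2 923 GE with 176 cycles, and 4 030 GE with 44 cycles, hashing 48 bits per compression at 1 MHz). The function names and the choice of bits per second as the throughput unit are assumptions; the FOM values are only meaningful relative to one another.

```python
def throughput_bps(bits_per_compression, cycles, clock_hz):
    """Bits processed per second at the given clock frequency."""
    return bits_per_compression * clock_hz / cycles


def fom(throughput, gate_equivalents):
    """Figure of merit: throughput divided by the squared area in GE."""
    return throughput / gate_equivalents ** 2


SERIAL = fom(throughput_bps(48, 176, 1e6), 2923)
TRADEOFF = fom(throughput_bps(48, 44, 1e6), 4030)

if __name__ == "__main__":
    print(SERIAL, TRADEOFF)
```

Under this metric the 44-cycle design point scores higher than the fully serial one, because its throughput grows faster than the square of its area.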
Synthesis Results
In RFID applications, the latency is constrained by the communication protocols (though the constraint is relatively easily satisfiable) but a high throughput is not necessary, designating a fully serial implementation as the ideal candidate. Therefore T is set to T = 1. The number N of permutations per clock cycle in the permutation function is set to N = 1, which is favorable to smaller area and power consumption for the tight power budget associated with RFID applications. The clock frequency is set to 1MHz, which is a representative value for the target application.
In hash mode we hash m bits per compression. In encryption mode we encrypt t/r bits per compression. The throughput values given in Table 2 correspond to hash mode.
Our goal for selecting $T = 1$ and $N = 1$ was to minimize the hardware. The area in the proposed implementation is roughly proportional to $(k_{reg}(2k + J) + k_{log}(2k(N + 1) + J))\,T$ for some constants $k_{reg}$ and $k_{log}$.
To maximize the FOM with $T$ given, we can show that we should in theory pick $N = \left(1 + \frac{k_{reg}}{k_{log}}\right)\left(1 + \frac{J}{2k}\right)$.
For $k_{reg} \approx 2k_{log}$ and $J = k$, this is $N = 4.5$. In practice, the best choice is to take $T = 1$ and $N = 4$ for ARMADILLO2 in context A, for which we would get an area of 4 030 GE, 77 µW, and a latency of 44 cycles (1.09 Mbps for hashing or 2.9 Mbps for encryption). Table 3 shows a comparison of hardware implementations of ARMADILLO in the hash function setting, relative to other hash functions such as MD4, MD5, SHA-1, SHA-
Table 2. Synthesis results at 1 MHz.
Table 4. Implementation comparison for encryption with throughput at 100 kHz.
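The choice of N can be reproduced from the area model. Taking the throughput proportional to $N T / J$ and the area proportional to $(k_{reg}(2k + J) + k_{log}(2k(N + 1) + J))\,T$, the FOM is proportional to $N/(a + bN)^2$ with $a = (k_{reg} + k_{log})(2k + J)$ and $b = 2k\,k_{log}$, which peaks at $N = a/b$. The sketch below evaluates this optimum and then restricts N to divisors of J; the unit constants $k_{reg} = 2$, $k_{log} = 1$ and the value $k = J = 128$ are illustrative assumptions.

```python
def area_model(N, k, J, k_reg, k_log, T=1):
    """Area model from the text, up to a constant factor."""
    return (k_reg * (2 * k + J) + k_log * (2 * k * (N + 1) + J)) * T


def relative_fom(N, k, J, k_reg, k_log, T=1):
    """Throughput (N*T/J items per cycle) divided by the squared area."""
    return (N * T / J) / area_model(N, k, J, k_reg, k_log, T) ** 2


def optimal_N(k, J, k_reg, k_log):
    """Real-valued maximizer a/b of N / (a + b*N)^2."""
    return (1 + k_reg / k_log) * (1 + J / (2 * k))


def best_admissible_N(k, J, k_reg, k_log):
    """Best N among the divisors of J, as required by the building block."""
    divisors = [n for n in range(1, J + 1) if J % n == 0]
    return max(divisors, key=lambda n: relative_fom(n, k, J, k_reg, k_log))


if __name__ == "__main__":
    print(optimal_N(128, 128, 2, 1), best_admissible_N(128, 128, 2, 1))
```

The restriction to divisors of J reflects the constraint that J be an integer multiple of N.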
Comparison

