Abstract. We present a bitsliced implementation of AES encryption in counter mode for 64-bit Intel processors. Running at 7.81 cycles/byte on a Core 2, it is up to 25% faster than previous implementations, while simultaneously offering protection against timing attacks. In particular, it is the only cache-timing-attack resistant implementation offering competitive speeds for stream as well as for packet encryption: for 576-byte packets, we improve performance over previous bitsliced implementations by more than a factor of 2. We also report more than 30% improved speeds for lookup-table based Galois/Counter mode authentication, achieving 11.51 cycles/byte for authenticated encryption. Furthermore, we present the first constant-time implementation of AES-GCM that has a reasonable speed of 22.19 cycles/byte, thus offering a full suite of timing-analysis resistant software for authenticated encryption.
Introduction
encrypting, say, 576-byte packets would presumably cause a slowdown by more than a factor of 3, making the approach unsuitable for many network applications.
Könighofer presents an alternative implementation for 64-bit platforms that processes only 4 input blocks in parallel [19], but at 19.8 cycles/byte, his code is even slower than the reference implementation used in OpenSSL.
Finally, Intel has announced a new AES-NI instruction set [15] that will provide dedicated hardware support for AES encryption and thus circumvent cache leaks on future CPUs. However, processors rolled out to the market today do not yet support these instructions, so cache-timing attacks will continue to be a threat to AES for several years until all current processors have been replaced. This paper presents a constant-time implementation of AES which only needs 7.81 cycles/byte on an Intel Core 2 Q9550, including costs for transformation of input data into bitsliced format and transformation of output back to standard format. On the newer Intel Core i7, we show even faster speeds of 7.08 cycles/byte, while lookup table based implementations on the same platform are still behind the 10 cycles/byte barrier. Not only is our software up to 30% faster than any previously presented AES software for 64-bit Intel processors, it also no longer needs input chunks of 2KB but only of 128 bytes to achieve optimal speed and is thus efficient for packet as well as stream encryption.
Secondly, we propose a fast implementation of Galois/Counter mode (GCM) authentication. Combined with our fast AES encryption, we demonstrate speeds of 11.51 cycles per encrypted and authenticated byte on the Core 2 Q9550. Our fast GCM implementation, however, uses the standard method of lookup tables for multiplication in a finite field. While no cache-timing attacks against GCM have been published, we acknowledge that this implementation might be vulnerable to cache leaks. Thus, we also describe a new method for implementing GCM without lookup tables that still yields a reasonable speed of 22.19 cycles/byte. The machine-level strategies for implementing AES-GCM in constant time might be of independent interest to implementors of cryptographic software.
Note. All software presented in this paper will be made freely available online to maximize reusability of results.
Organization of the paper. In Section 2, we analyze the applicability of cache-timing attacks to each component of AES-GCM authenticated encryption. Section 3 gives an overview of the target platforms. In Sections 4 and 5, we describe our implementations of AES and GCM, respectively. Finally, Section 6 gives performance benchmarks on three different platforms.
Cache timing attacks against AES and GCM
Cache-timing attacks are software side-channel attacks exploiting the timing variability of data loads from memory. This variability is due to the fact that all modern microprocessors use a hierarchy of caches to reduce load latency. If a load operation can retrieve data from one of the caches (cache hit), the load takes less time than if the data has to be retrieved from RAM (cache miss).
Kocher [18] was the first to suggest cache-timing attacks against cryptographic algorithms that load data from positions that are dependent on secret information. Initially, timing attacks were mostly mentioned in the context of public-key algorithms until Kelsey et al. [17] and Page [27] considered timing attacks, including cache-timing attacks, against secret-key algorithms. Tsunoo et al. demonstrated the practical feasibility of cache-timing attacks against symmetric-key ciphers MISTY1 [29] and DES [28], and were the first to mention an attack against AES (without giving further details).
In the rest of this section, we analyze separately the cache-timing vulnerability of three components of AES-GCM: encryption, key schedule, and authentication.
Attacks Against AES Encryption
A typical implementation of AES uses precomputed lookup tables to implement the S-Box, opening up an opportunity for a cache-timing attack. Consider, for example, the first round of AES: the indices of the table lookups are then defined simply by the xor of the plaintext and the first round key. As the attacker knows or even controls the plaintext, information about the lookup indices directly leaks information about the key.
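The first-round leak can be illustrated with a small sketch. It assumes a lookup table with 4-byte entries and 64-byte cache lines, so that 16 consecutive entries share one line; these sizes and all function names are illustrative assumptions rather than details of a particular implementation.

```python
ENTRY_BYTES = 4      # assumed size of one lookup-table entry
LINE_BYTES = 64      # assumed cache-line size
ENTRIES_PER_LINE = LINE_BYTES // ENTRY_BYTES  # 16 entries share one line


def observed_cache_line(plaintext_byte, key_byte):
    """Cache line touched by the first-round lookup at index p xor k."""
    index = plaintext_byte ^ key_byte
    return index // ENTRIES_PER_LINE


def key_candidates(plaintext_byte, line):
    """All key bytes consistent with one observed cache line."""
    return [k for k in range(256)
            if observed_cache_line(plaintext_byte, k) == line]


secret_key_byte = 0x3C   # unknown to the attacker
plaintext_byte = 0xA7    # known to the attacker
line = observed_cache_line(plaintext_byte, secret_key_byte)
candidates = key_candidates(plaintext_byte, line)
```

Under these assumptions a single observation narrows the key byte from 256 to 16 candidates, all sharing the upper four bits of the secret key byte.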
Bernstein [3] was the first to implement a cache-timing key-recovery attack against AES. While his attack relies on the attacker's capability of producing reference timing distributions from known-key encryptions on a platform identical to the target platform and has thus been deemed difficult to mount [9, 26], several improved attack strategies have subsequently been described by Bertoni et al. [6], Osvik et al. [26], Acıiçmez et al. [16], Bonneau and Mironov [9], and Neve et al. [24, 25].
In particular, Osvik et al. [26] propose an attack model where the attacker obtains information about cache access patterns by manipulating the cache between encryptions via user-level processes. Bonneau and Mironov [9] further demonstrate an attack detecting cache hits in the encryption algorithm itself, as opposed to timing a process controlled by the attacker. Their attack requires no active cache manipulation, only that the tables are (partially) evicted from cache prior to the encryption. Finally, Acıiçmez et al. [16] note that if the encrypting machine is running multiple processes, workload on the target machine achieves the desired cache-cleaning effect, and provide simulation results suggesting that it is possible to recover an AES encryption key via a passive remote timing attack.
Attacks against AES key expansion
The expansion of the 128-bit AES key into 11 round keys makes use of the SubBytes operation which is also used for AES encryption and usually implemented through lookup tables. During key schedule, the lookup indices are dependent on the secret key, so in principle, ingredients for a cache-timing attack are available also during key schedule.
However, we argue that mounting a cache-timing attack against AES key expansion will be very hard in practice. Common implementations do the key expansion just once and store either the fully expanded 11 round keys or partially expanded keys (see, e.g., [2]); in both cases, table lookups based on secret data are performed just once, precluding statistical timing attacks, which require multiple timing samples.
We nevertheless provide a constant-time implementation of key expansion for the sake of completeness. The cycle count of the constant-time implementation is however inferior to the table-based implementation; a performance comparison of the two methods is given in Section 6.
Attacks Against Galois/Counter Mode Authentication
The computationally expensive operations for GCM authentication are multiplications in the finite field $\mathbb{F}_{2^{128}}$. More specifically, each block of input requires multiplication with a secret constant factor H derived from the master encryption key. As all common general-purpose CPUs lack support for multiplication of polynomials over $\mathbb{F}_2$, the standard way of implementing GCM is through lookup tables containing precomputed multiples of H.
The specification of GCM describes different multiplication algorithms involving tables of different sizes, which allow trading memory for computation speed [22]. The basic idea of all of these algorithms is the same: split the non-constant factor of the multiplication into bytes or half-bytes and use these as indices for table lookups.
For the first block of input $P_1$, this non-constant factor is $C_1$, the first block of ciphertext. Assuming the ciphertext is available to the attacker anyway, the indices of the first block lookups do not leak any secret information. However, for the second ciphertext block $C_2$, the non-constant input to the multiplication is $(C_1 \cdot H) \oplus C_2$. An attacker gaining information about this value can easily deduce the secret value H necessary for a forgery attack.
The lookup tables used for GCM are usually at least as large as AES lookup tables; common sizes include 4KB, 8KB and 64KB. The values retrieved from these tables are 16 bytes long; knowledge of the (64-byte) cache line thus leaves only 4 possibilities for each lookup index. For example, the 64KB implementation uses 16 tables, each corresponding to a different byte of the 128-bit input. Provided that cache hits leak the maximum 6 bits in each byte, a $2^{32}$ exhaustive search over the remaining unknown bits is sufficient to recover the authentication key.
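The size of the remaining search space follows directly from the table geometry. The short calculation below retraces it for the 64KB variant; the constant names are illustrative, and the numbers are the ones stated in the text.

```python
CACHE_LINE_BYTES = 64   # cache-line size stated in the text
ENTRY_BYTES = 16        # each table entry is one 128-bit field element
INDEX_BITS = 8          # the 64KB variant indexes each table with one byte
NUM_TABLES = 16         # one table per byte of the 128-bit input

# Four 16-byte entries share one cache line, so the line hides 2 index bits.
entries_per_line = CACHE_LINE_BYTES // ENTRY_BYTES
unknown_bits_per_index = entries_per_line.bit_length() - 1
leaked_bits_per_index = INDEX_BITS - unknown_bits_per_index

# Bits that remain unknown after observing the cache line of every lookup.
search_space_bits = NUM_TABLES * unknown_bits_per_index
table_size_bytes = NUM_TABLES * (1 << INDEX_BITS) * ENTRY_BYTES
```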
We conclude that common implementations of GCM are potentially vulnerable to authentication key recovery via cache timing attacks. Our software thus includes two different versions of GCM authentication: a fast implementation based on 8KB lookup tables for settings where timing attacks are not considered a threat; and a slower, constant-time implementation offering full protection against timing attacks. For a performance comparison of these two implementations, see Section 6.
The Intel Core 2 and Core i7 processors
We have benchmarked our implementations on three different Intel microarchitectures: the 65-nm Core 2 (Q6600), the 45-nm Core 2 (Q9550) and the Core i7 (920). These microarchitectures belong to the amd64 family and have 16 128-bit SIMD registers, called XMM registers.
The 128-bit XMM registers were introduced to Intel processors with the "Streaming SIMD Extensions" (SSE) on the Pentium III processor. The instruction set was extended (SSE2) on the Pentium IV processor; further extensions (SSE3, SSSE3 and SSE4) followed. Starting with the Core 2, the processors have full 128-bit wide execution units, offering increased throughput for SSE instructions.
Our implementation mostly uses bit-logical instructions on XMM registers. Intel's amd64 processors are all able to dispatch up to 3 arithmetic instructions (including bit-logical instructions) per cycle; at the same time, the number of simultaneous loads and stores is limited to one. Aside from these obvious performance bottlenecks, different CPUs have specific limitations:
The pshufb instruction: This instruction is part of the SSSE3 instruction-set extension and allows shuffling the bytes in an XMM register arbitrarily. On a 65-nm processor, pshufb is implemented through 4 µops; 45-nm Core 2 and Core i7 CPUs need just 1 µop (see [13]). This reduction was achieved by the introduction of a dedicated shuffle-unit [11]. The Core i7 has two of these shuffle units, improving throughput by a factor of two.
Choosing between equivalent instructions: The SSE instruction set includes three different logically equivalent instructions to compute the xor of two 128-bit registers: xorps, xorpd and pxor; similar equivalences hold for other bit-logical instructions: andps/andpd/pand, orps/orpd/por. While xorps/xorpd consider their inputs as floating point values, pxor works on integer inputs. On Core 2 processors, all three instructions yield the same performance. On the Core i7, on the other hand, it is crucial to use integer instructions: changing all integer bit-logical instructions to their floating-point equivalents results in a performance penalty of about 50% on our benchmark Core i7 920.
What about AMD processors? Current AMD processors do not support the SSSE3 pshufb instruction, but an even more powerful SSE5 instruction pperm will be available for future AMDs. It is also possible to adapt the software to support current 64-bit AMD processors. The performance of the most expensive part of the computation, the AES S-box, will not be affected by this modification, though the linear layer will require more instructions.
Bitsliced Implementation of AES in Counter Mode
Bitslicing as a technique for implementing cryptographic algorithms was proposed by Biham to improve the software performance of DES [7] . Essentially, bitslicing simulates a hardware implementation in software: the entire algorithm is represented as a sequence of atomic Boolean operations. Applied to AES, this means that rather than using precomputed lookup tables, the 8 × 8-bit S-Box as well as the linear layer are computed on-the-fly using bit-logical instructions. Since the execution time of these instructions is independent of the input values, the bitsliced implementation is inherently immune to timing attacks.
Obviously, representing a single AES byte by 8 Boolean variables and evaluating the S-Box is much slower than a single table lookup. However, collecting equivalent bits from multiple bytes into a single variable (register) allows computing multiple S-Boxes at the cost of one. More specifically, the 16 XMM registers of the Core 2 processors allow performing packed Boolean operations on 128 bits. In order to fully utilize the width of these registers, we thus process 8 16-byte AES blocks in parallel. While our implementation considers 8 consecutive blocks of AES in counter mode, the same technique could be applied equally efficiently to other modes, as long as there is sufficient parallelism. For example, while the CBC mode is inherently sequential, one could consider 8 parallel independent CBC encryptions to achieve the same effect.
Several AES implementations following a similar bitslicing approach have been reported previously [19, 20, 21]. However, compared to previous results, we have managed to further optimize every step of the round function. Our implementation of SubBytes uses 15% fewer instructions than previously reported software implementations. Also, replacing rotates with the more general byte shuffling instructions has allowed us to design an extremely efficient linear layer (see Sections 4.3 and 4.4). In the rest of this section, we describe implementation aspects of each step of the AES round function, as well as the format conversion algorithm.
Bitsliced Representation of the AES State
The key to a fast bitsliced implementation is finding an efficient bitsliced representation of the cipher state. Denote the bitsliced AES state by a[0], ..., a[7], where each a[i] is a 128-bit vector fitting in one XMM register. We take 8 16-byte AES blocks and "slice" them bitwise, with the least significant bits of each byte in a[0] and the most significant bits in the corresponding positions of a[7]. Now, the AES S-Box can be implemented equally efficiently whatever the order of bits within the bitsliced state. The efficiency of the linear layer, on the other hand, depends crucially on this order.
In our implementation, we collect in each byte of the bitsliced state 8 bits from identical positions of 8 different AES blocks, ensuring that bits within each byte are independent and all instructions can be kept byte-level. Furthermore, in order to simplify the MixColumns step, the 16 bytes of an AES state are collected in the state row by row. Figure 1 illustrates the bit ordering in each 128-bit state vector a[i].
Several solutions are known for converting the data to a bitsliced format and back [19, 21]. Our version of the conversion algorithm requires 84 instructions to bitslice the input, and 8 byte shuffles to reorder the state row-by-row.
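The representation described above can be sketched as a pair of conversion routines. This is a plain, unoptimized illustration and not the 84-instruction conversion: the function names are hypothetical, and the exact placement of blocks within a byte and of bytes within a vector is an assumption, since Figure 1 is not reproduced here.

```python
def bitslice(blocks):
    """Convert 8 blocks of 16 bytes into 8 bit planes a[0..7] of 16 bytes each."""
    assert len(blocks) == 8 and all(len(b) == 16 for b in blocks)
    planes = [bytearray(16) for _ in range(8)]
    for j, block in enumerate(blocks):       # block j supplies bit j of each plane byte
        for n, byte in enumerate(block):
            row, col = n % 4, n // 4         # AES stores its state column by column
            p = 4 * row + col                # row-by-row position inside the plane
            for i in range(8):               # bit i of the byte goes to plane a[i]
                planes[i][p] |= ((byte >> i) & 1) << j
    return [bytes(plane) for plane in planes]


def unbitslice(planes):
    """Inverse of bitslice: recover the 8 original 16-byte blocks."""
    blocks = [bytearray(16) for _ in range(8)]
    for i, plane in enumerate(planes):
        for p, packed in enumerate(plane):
            row, col = p // 4, p % 4
            n = 4 * col + row
            for j in range(8):
                blocks[j][n] |= ((packed >> j) & 1) << i
    return [bytes(block) for block in blocks]


blocks = [bytes((17 * j + n) % 256 for n in range(16)) for j in range(8)]
planes = bitslice(blocks)
restored = unbitslice(planes)
```

Because every input bit lands in exactly one plane bit, a Boolean operation applied to whole planes acts on all 8 blocks at once.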
The SubBytes Step
The SubBytes step of AES transforms each byte of the 16-byte AES state according to an 8 × 8-bit S-Box S based on inversion in the finite field $\mathbb{F}_{2^8}$. We use well-known hardware implementation strategies for decomposing the S-Box into Boolean instructions. The starting point of our implementation is the most compact hardware S-Box proposed by Canright [10], requiring 120 logic gates. Our implementation of the SubBytes step is obtained by converting each logic gate (xor, and, or) in this implementation to its equivalent CPU instruction. We omit here the lengthy description of obtaining the Boolean decomposition; full details can be found in the original paper [10]. Instead, we highlight differences between the hardware approach and our software "simulation", as the exchange rate between hardware gates and instructions on the Core 2 is not one-to-one.
First, the packed Boolean instructions of the Core 2 processors have one source and one destination; that is, one of the inputs is always overwritten by the result. Thus, we need extra move instructions whenever we need to reuse both inputs. Also, while the compact hardware implementation computes recurring Boolean subexpressions only once, we are not able to fit all intermediate values in the available 16 XMM registers. Instead, we have a choice between recomputing some values, or using extra load/store instructions to keep computed values on the stack. We chose to do without the stack: our implementation fits entirely in the 16 registers and uses 132 packed Boolean instructions and 36 register-to-register move instructions. Table 2 lists the instruction/gate counts for the S-Box in software and hardware.
The ShiftRows Step
Using the dedicated SSSE3 byte shuffle instruction pshufb, the whole ShiftRows step can be done in 8 XMM instructions.
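The following sketch shows why one byte shuffle per state vector suffices. It assumes the row-by-row layout of Section 4.1 with the byte of row r and column c at position 4r + c; the mask and function names are illustrative and do not model the byte order of an actual XMM register.

```python
# Shuffle mask in the style of pshufb: output byte p is taken from input byte MASK[p].
# Row r of the state is rotated left by r positions.
SHIFTROWS_MASK = [4 * r + (c + r) % 4 for r in range(4) for c in range(4)]


def shuffle_bytes(plane, mask):
    """Software stand-in for a byte shuffle of one 16-byte vector."""
    return bytes(plane[m] for m in mask)


def shift_rows_bitsliced(planes):
    """Apply the same shuffle to each of the 8 bit planes (8 shuffles in total)."""
    return [shuffle_bytes(plane, SHIFTROWS_MASK) for plane in planes]


example_plane = bytes(range(16))
shifted = shuffle_bytes(example_plane, SHIFTROWS_MASK)
```

Since each byte of a bit plane holds bits from the same state position of 8 blocks, permuting whole bytes moves all 8 blocks consistently.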
The MixColumns Step
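MixColumns multiplies the state matrix $(a_{ij})$ by a fixed $4 \times 4$ matrix to obtain a new state $(b_{ij})$, $i, j = 0, \ldots, 3$:

$$\begin{pmatrix} b_{00} & b_{01} & b_{02} & b_{03} \\ b_{10} & b_{11} & b_{12} & b_{13} \\ b_{20} & b_{21} & b_{22} & b_{23} \\ b_{30} & b_{31} & b_{32} & b_{33} \end{pmatrix} = \begin{pmatrix} 02_x & 03_x & 01_x & 01_x \\ 01_x & 02_x & 03_x & 01_x \\ 01_x & 01_x & 02_x & 03_x \\ 03_x & 01_x & 01_x & 02_x \end{pmatrix} \cdot \begin{pmatrix} a_{00} & a_{01} & a_{02} & a_{03} \\ a_{10} & a_{11} & a_{12} & a_{13} \\ a_{20} & a_{21} & a_{22} & a_{23} \\ a_{30} & a_{31} & a_{32} & a_{33} \end{pmatrix}$$

Owing to the circularity of the multiplication matrix, each resulting byte $b_{ij}$ can be calculated using an identical formula:

$$b_{ij} = 02_x \cdot a_{ij} \oplus 03_x \cdot a_{i+1,j} \oplus a_{i+2,j} \oplus a_{i+3,j},$$

where indices are reduced modulo 4.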
Recall that each byte $a_{ij}$ is an element of $\mathbb{F}_{2^8} = \mathbb{F}_2[X]/(X^8 + X^4 + X^3 + X + 1)$, so multiplication by $02_x$ corresponds to a left shift and a conditional masking with $00011011_b$ whenever the most significant bit $a_{ij}[7] = 1$. For example, the least significant bit $b_{ij}[0]$ of each byte is obtained as

$$b_{ij}[0] = a_{ij}[7] \oplus a_{i+1,j}[0] \oplus a_{i+1,j}[7] \oplus a_{i+2,j}[0] \oplus a_{i+3,j}[0].$$

[...] we are able to save a rotation and we thus only need to compute two rotations per register, or 16 in total. There is no dedicated rotate instruction for XMM registers; however, as all our rotations are in full bytes, we can use the pshufd 32-bit-doubleword permutation instruction. This instruction allows writing the result in a destination register different from the source register, saving register-to-register moves. In total, our implementation of MixColumns requires 43 instructions: 16 pshufd instructions and 27 xors.
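One way to arrive at two rotations per register is sketched below for the least significant bit plane. The sketch is an assumption about how such a saving can be obtained, not a transcription of the paper's code: it models each plane as a 128-bit integer in the row-by-row layout, so that fetching the byte of the next row is a rotation by 32 bits, and it relies on the identity rot2(x) ⊕ rot3(x) = rot2(x ⊕ rot1(x)). All names are hypothetical.

```python
MASK128 = (1 << 128) - 1


def rot_rows(v, k):
    """Bring the bytes of row i+k to row i (one row is 4 bytes, i.e. 32 bits)."""
    s = 32 * (k % 4)
    return ((v >> s) | (v << (128 - s))) & MASK128


def lsb_plane_direct(a0, a7):
    """b[0] computed term by term; needs three different rotations of a[0]."""
    return (a7 ^ rot_rows(a0, 1) ^ rot_rows(a7, 1)
            ^ rot_rows(a0, 2) ^ rot_rows(a0, 3))


def lsb_plane_two_rotations(a0, a7):
    """Same plane using rot1(x) and rot2(x xor rot1(x)) only."""
    r1_a0 = rot_rows(a0, 1)
    r1_a7 = rot_rows(a7, 1)              # also reusable for the other output planes
    return a7 ^ r1_a7 ^ r1_a0 ^ rot_rows(a0 ^ r1_a0, 2)


a0 = 0x00112233445566778899AABBCCDDEEFF
a7 = 0x0F1E2D3C4B5A69788796A5B4C3D2E1F0
```

Both functions return the same plane because rotation distributes over xor; the second form needs only a rotation by one row and a rotation by two rows per input register.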
The AddRoundKey Step
The round keys are converted to bitsliced representation during key schedule. Each key is expanded to 8 128-bit values, and a round of AddRoundKey requires 8 xors from memory to the registers holding the bitsliced state. The performance of the AddRoundKey step can further be slightly optimized by interleaving these instructions with the byte shuffle instructions of the ShiftRows step. 
AES Key Schedule
The AES key expansion algorithm computes 10 additional round keys from the initial key, using a sequence of SubBytes operations and xors. With the input/output transform, and our implementation of SubBytes, we have all the necessary components to implement the key schedule in constant time. The key schedule performs 10 unavoidably sequential SubBytes calls; its cost in constant time is thus roughly equivalent to the cost of one 8-block AES encryption. The performance results in Section 6 include an exact cycle count.
Implementations of GCM Authentication
Galois/Counter mode is a NIST-standardized block cipher mode of operation for authenticated encryption [22]. The 128-bit authentication key H is derived from the master encryption key K during key setup as the encryption of an all-zero input block. The computation of the authentication tag then requires, for each 16-byte data block, a 128-bit multiplication by H in the finite field $\mathbb{F}_{2^{128}} = \mathbb{F}_2[X]/(X^{128} + X^7 + X^2 + X + 1)$. Figure 2 illustrates the mode of operation; full details can be found in the specification [22].
The core operation required for GCM authentication is thus Galois field multiplication with a secret constant element H. This section describes two different implementations of the multiplication: first, a standard table-based approach, and second, a constant-time solution. Both implementations consist of a one-time key schedule computing H and tables containing multiples of H; and an online phase which performs the actual authentication. Both implementations accept standard (non-bitsliced) input.
Table-Based Implementation
Several flavors of Galois field multiplication involving lookup tables of different sizes have been proposed for GCM software implementation [22]. We chose the "simple, 4-bit tables method", which uses 32 tables with 16 precomputed multiples of H each, corresponding to a memory requirement of 8KB.
Algorithm 1 Multiplication in $\mathbb{F}_{2^{128}}$ of D with a constant element H.
Each multiplication needs 6 arithmetic instructions (2 moves, 2 shifts and 2 xors) to extract two half-bytes from a 64-bit value, 2 loads from a lookup table and 2 xors to accumulate the retrieved values; a total of 8 arithmetic instructions and 2 loads per byte. The computation is free of long chains of dependent instructions. This allows the processor to execute the maximum of 3 arithmetic instructions almost every cycle, which, combined with AES encryption, yields 11.51 cycles per encrypted and authenticated byte on a Core 2 Q9550.
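The structure of the 4-bit table method can be sketched as follows. The sketch represents field elements as Python integers with bit i standing for the coefficient of X^i, which is an assumption made for readability and differs from the bit ordering of the GCM specification; all function names are hypothetical.

```python
MASK128 = (1 << 128) - 1
POLY_LOW = 0x87  # X^7 + X^2 + X + 1, the low-order terms of the field polynomial


def mul_x(v):
    """Multiply by X modulo X^128 + X^7 + X^2 + X + 1."""
    carry = v >> 127
    return ((v << 1) & MASK128) ^ (POLY_LOW * carry)


def gf_mul(a, b):
    """Reference shift-and-add multiplication, used only to fill and check the tables."""
    result = 0
    for i in range(128):
        if (b >> i) & 1:
            result ^= a
        a = mul_x(a)
    return result


def build_tables(h):
    """32 tables with 16 entries each: tables[i][x] = (x * X^(4i)) * H."""
    return [[gf_mul(h, x << (4 * i)) for x in range(16)] for i in range(32)]


def table_mul(d, tables):
    """Multiply D by H with one lookup per half-byte of D."""
    acc = 0
    for i in range(32):
        acc ^= tables[i][(d >> (4 * i)) & 0xF]  # index depends on the data
    return acc


TABLE_BYTES = 32 * 16 * 16  # 32 tables, 16 entries, 16 bytes per entry
```

The lookup index in `table_mul` is derived from the value being multiplied, which is exactly the memory access pattern that Section 2.3 identifies as a potential cache-timing leak.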
Constant-Time Implementation
Our alternative implementation of GCM authentication does not use any table lookups or data-dependent branches and is thus immune to timing attacks. While slower than the implementation described in Section 5.1, the constant-time implementation achieves a reasonable speed of 22.19 cycles per encrypted and authenticated byte and, in addition, requires only 2KB of memory for precomputed values, comparing favorably to lookup-table based implementations.
During the offline phase, we precompute values $H, X \cdot H, X^2 \cdot H, \ldots, X^{127} \cdot H$. Based on this precomputation, multiplication of an element D with H can be computed using a series of xors conditioned on the bits of D, as shown in Algorithm 1.
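A minimal sketch of this precomputation and of the conditional-xor multiplication is given below. It is written from the description in the text rather than copied from Algorithm 1, uses hypothetical names, and again assumes that bit i of a Python integer stands for the coefficient of X^i.

```python
MASK128 = (1 << 128) - 1
POLY_LOW = 0x87  # X^7 + X^2 + X + 1


def mul_x(v):
    """Multiply by X modulo X^128 + X^7 + X^2 + X + 1."""
    carry = v >> 127
    return ((v << 1) & MASK128) ^ (POLY_LOW * carry)


def precompute_powers(h):
    """Return the list [H, X*H, X^2*H, ..., X^127*H]."""
    powers = []
    v = h
    for _ in range(128):
        powers.append(v)
        v = mul_x(v)
    return powers


def multiply_by_h(d, powers):
    """D * H as a series of xors conditioned on the bits of D."""
    acc = 0
    for i in range(128):
        if (d >> i) & 1:        # data-dependent branch, not yet constant time
            acc ^= powers[i]
    return acc


PRECOMPUTED_BYTES = 128 * 16    # 128 field elements of 16 bytes each
```

The 128 precomputed elements occupy 2KB, matching the memory requirement stated in the text.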
For a constant-time version of this algorithm we have to replace the conditional statements by a sequence of deterministic instructions. Suppose that we want to xor register %xmm3 into register %xmm4 if and only if bit $b_0$ of register %xmm0 is set. Listing 1 shows a sequence of six assembly instructions that implements this conditional xor in constant time. Lines 1-4 produce an all-zero mask in register %xmm1 if $b_0 = 0$ and an all-one mask otherwise. Lines 5-6 mask %xmm3 with this value and xor the result. We note that the precomputation described above is also implemented in constant time, using the same conditional-xor technique.
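The logic of the masked xor can be modelled in a few lines. This is only an illustration of the mask-then-xor idea with hypothetical names: it does not reproduce the six instructions of Listing 1, and Python integer arithmetic gives no timing guarantees.

```python
MASK128 = (1 << 128) - 1


def conditional_xor(acc, value, d, bit_index):
    """Xor value into acc if and only if the selected bit of d is set, without branching."""
    bit = (d >> bit_index) & 1
    mask = (-bit) & MASK128      # bit = 0 gives all zeros, bit = 1 gives 128 ones
    return acc ^ (value & mask)  # mask the operand, then xor unconditionally


def multiply_branch_free(d, powers):
    """Sum of powers[i] over the set bits of d, using only masked xors."""
    acc = 0
    for i in range(len(powers)):
        acc = conditional_xor(acc, powers[i], d, i)
    return acc
```

Every iteration performs the same operations regardless of the bit value, which is the property the assembly sequence provides at the instruction level.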
In each 128-bit multiplication in the online phase, we need to loop through all 128 bits of the intermediate value D. Each loop requires 6 · 128 instructions, or 48 instructions per byte. We managed to further optimize the code in Listing 1 by considering four bitmasks in parallel and only repeating lines 1-3 of the code once every four bits, yielding a final complexity of 3.75 instructions per bit, or 30 instructions/byte. As the Core 2 processor can issue at most 3 arithmetic instructions per cycle, a theoretical lower bound for a single Galois field multiplication, using our implementation of the conditional xor, is 10 cycles/byte. The actual performance comes rather close at around 14 cycles/byte for the complete authentication. 
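The instruction counts in this paragraph can be retraced as follows. The split of the optimized count into 3 instructions per bit plus 3 instructions shared among four bits is one reading of the text, not a figure taken from the listing.

```latex
\begin{align*}
\text{unoptimized:}\quad & \frac{6 \cdot 128\ \text{instructions}}{16\ \text{bytes}} = 48\ \text{instructions/byte},\\
\text{optimized:}\quad & 3 + \tfrac{3}{4} = 3.75\ \text{instructions/bit}
  \;\Rightarrow\; 3.75 \cdot 8 = 30\ \text{instructions/byte},\\
\text{lower bound:}\quad & \frac{30\ \text{instructions/byte}}{3\ \text{instructions/cycle}} = 10\ \text{cycles/byte}.
\end{align*}
```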
Performance
We give benchmarking results for our software on three different Intel processors. A description of the computers we used for benchmarking is given in Table 3 ; all benchmarks used just one core. To ensure verifiability of our results, we used the open eSTREAM benchmarking suite [12] , which reports separate cycle counts for key setup, IV setup, and for encryption.
Benchmarking results for different packet sizes are given in Tables 4 and 5. The "simple Imix" is a weighted average simulating sizes of typical IP packets: it takes into account packets of size 40 bytes (7 parts), 576 bytes (4 parts), and 1500 bytes (1 part).
Table 5. Cycles/byte for AES-GCM encryption and authentication
For AES-GCM authenticated encryption, the eSTREAM benchmarking suite reports cycles per encrypted and authenticated byte without considering final computations (one 16-byte AES encryption and one multiplication) necessary to compute the authentication tag. Cycles required for these final computations are reported as part of IV setup. Table 5 therefore gives performance numbers as reported by the eSTREAM benchmarking suite (cycles/byte and cycles required for IV setup) and "accumulated" cycles/byte, illustrating the "actual" time required for authenticated encryption.
For AES in counter mode, we also give benchmarking results of the previously fastest software [5], measured with the same benchmarking suite on the same computers. Note however that this implementation uses lookup tables. The previous fastest bitsliced implementation [21] is not available for public benchmarking; based on the results in the paper, we expect it to perform at best equivalently for stream encryption, and significantly slower for all packet sizes below 2KB.
For AES-GCM, there exist no benchmarking results from open benchmarking suites such as the eSTREAM suite or its successor eBASC [4]. The designers of GCM provide performance figures for 128-bit AES-GCM measured on a Motorola G4 processor, which is certainly not comparable to an Intel Core 2 [23]. Thus, we only give benchmarks for our software in Table 5. As a frame of reference, Brian Gladman's implementation needs 19.8 cycles/byte using 64KB GCM lookup tables and 22.3 cycles/byte with 8KB lookup tables on an unspecified AMD64 processor [14]. LibTomCrypt needs 25 cycles/byte for AES-GCM on an Intel Core 2 E6300 [1].
Our implementation of AES-CTR achieves up to 30% improved performance for stream encryption, depending on the platform. Compared to previous bitsliced implementations, packet encryption is several times faster. Including also lookup table based implementations, we still improve speed for all packet sizes except for the shortest, 40-byte packets.
Similarly, our lookup table based implementation of AES-GCM is more than 30% faster than previously reported. Our constant-time implementation is the first of its kind, yet its performance is comparable to previously published software, confirming that it is a viable solution for protecting GCM against timing attacks.
Finally, our benchmark results show a solid improvement from the older 65nm Core 2 to the newer i7, indicating that bitsliced implementations stand to gain more from wider registers and instruction set extensions than lookup table based implementations. We conclude that bitslicing offers a practical solution for safeguarding against cache-timing attacks: several of the techniques described in this paper extend to other cryptographic algorithms as well as other platforms.
A Equations for MixColumns
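We give the full equations for computing MixColumns as described in Section 4.4. In MixColumns, the bits of the updated state are computed as follows:

$$\begin{aligned}
b_{ij}[0] &= a_{ij}[7] \oplus a_{i+1,j}[0] \oplus a_{i+1,j}[7] \oplus a_{i+2,j}[0] \oplus a_{i+3,j}[0] \\
b_{ij}[1] &= a_{ij}[0] \oplus a_{ij}[7] \oplus a_{i+1,j}[0] \oplus a_{i+1,j}[1] \oplus a_{i+1,j}[7] \oplus a_{i+2,j}[1] \oplus a_{i+3,j}[1] \\
b_{ij}[2] &= a_{ij}[1] \oplus a_{i+1,j}[1] \oplus a_{i+1,j}[2] \oplus a_{i+2,j}[2] \oplus a_{i+3,j}[2] \\
b_{ij}[3] &= a_{ij}[2] \oplus a_{ij}[7] \oplus a_{i+1,j}[2] \oplus a_{i+1,j}[3] \oplus a_{i+1,j}[7] \oplus a_{i+2,j}[3] \oplus a_{i+3,j}[3] \\
b_{ij}[4] &= a_{ij}[3] \oplus a_{ij}[7] \oplus a_{i+1,j}[3] \oplus a_{i+1,j}[4] \oplus a_{i+1,j}[7] \oplus a_{i+2,j}[4] \oplus a_{i+3,j}[4] \\
b_{ij}[5] &= a_{ij}[4] \oplus a_{i+1,j}[4] \oplus a_{i+1,j}[5] \oplus a_{i+2,j}[5] \oplus a_{i+3,j}[5] \\
b_{ij}[6] &= a_{ij}[5] \oplus a_{i+1,j}[5] \oplus a_{i+1,j}[6] \oplus a_{i+2,j}[6] \oplus a_{i+3,j}[6] \\
b_{ij}[7] &= a_{ij}[6] \oplus a_{i+1,j}[6] \oplus a_{i+1,j}[7] \oplus a_{i+2,j}[7] \oplus a_{i+3,j}[7]
\end{aligned}$$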
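The bit-level equations can be cross-checked against the byte-level definition of MixColumns. In the sketch below, the term table is derived from the formula of Section 4.4 and the reduction polynomial $X^8 + X^4 + X^3 + X + 1$; the names are hypothetical, and a single column is used instead of the bitsliced state.

```python
def xtime(a):
    """Multiply a byte by 02: left shift, then mask with 0x1B if the top bit was set."""
    return ((a << 1) & 0xFF) ^ (0x1B * (a >> 7))


def mix_byte(col, i):
    """b_i = 02*a_i xor 03*a_{i+1} xor a_{i+2} xor a_{i+3}, indices modulo 4."""
    a0, a1, a2, a3 = (col[(i + k) % 4] for k in range(4))
    return xtime(a0) ^ xtime(a1) ^ a1 ^ a2 ^ a3


# For each output bit: list of (row offset, input bit) pairs that are xored together.
TERMS = {
    0: [(0, 7), (1, 0), (1, 7), (2, 0), (3, 0)],
    1: [(0, 0), (0, 7), (1, 0), (1, 1), (1, 7), (2, 1), (3, 1)],
    2: [(0, 1), (1, 1), (1, 2), (2, 2), (3, 2)],
    3: [(0, 2), (0, 7), (1, 2), (1, 3), (1, 7), (2, 3), (3, 3)],
    4: [(0, 3), (0, 7), (1, 3), (1, 4), (1, 7), (2, 4), (3, 4)],
    5: [(0, 4), (1, 4), (1, 5), (2, 5), (3, 5)],
    6: [(0, 5), (1, 5), (1, 6), (2, 6), (3, 6)],
    7: [(0, 6), (1, 6), (1, 7), (2, 7), (3, 7)],
}


def mix_byte_bitwise(col, i):
    """Same output byte, assembled bit by bit from the xor equations."""
    out = 0
    for k, terms in TERMS.items():
        bit = 0
        for offset, src in terms:
            bit ^= (col[(i + offset) % 4] >> src) & 1
        out |= bit << k
    return out


def equations_hold():
    """Both forms are linear over F_2, so checking the 32 unit vectors suffices."""
    for pos in range(4):
        for src in range(8):
            col = [0, 0, 0, 0]
            col[pos] = 1 << src
            if any(mix_byte(col, i) != mix_byte_bitwise(col, i) for i in range(4)):
                return False
    return True
```

In a bitsliced implementation each of the eight equations becomes a short sequence of xors on whole 128-bit state vectors, with the row offsets realized as rotations.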
