Abstract-Data must be encrypted if it is to remain confidential when sent over computer networks. Encryption solves many problems involving invasion of privacy, identity theft, fraud, and data theft. However, for encryption to be widely used, it must be fast. The problem is so important that new Intel processors provide hardware support for encryption. These instructions implement key stages of the Advanced Encryption Standard (AES), allowing encryption to be completed more quickly and using less power. The AES algorithm consists of several 'rounds' of encryption, each of which involves a relatively complicated computation. This new hardware support allows an entire round to be implemented with just a single instruction.
I. INTRODUCTION
There is no shortage of sensitive information that is transmitted on a daily basis. Medical records, military communications, and credit card numbers are all examples of sensitive information that we do not want to freely share with others. To prevent unintended parties from accessing information, encryption must be used. Using encryption can be costly in both time and power requirements.
The Advanced Encryption Standard (AES) is a widely used method of encrypting data. The AES algorithm contains a computationally expensive loop that consumes a large share of CPU time and offers many avenues for optimization [1]. This paper presents a code generator that creates variants of the AES encryption loop.
The use of code generators to find which combination of optimizations yields a good result is an established technique in optimizing for modern architectures [2]. Code generators such as Spiral [3] and FFTW [4] have been applied very successfully to their respective domains. The cost and time required to maintain hand-tuned assembly are motivating factors for building code generators. A code generator can tune itself to the architecture it is running on to find the best combination of optimizations for that architecture, while remaining readable and maintainable because it can be written in a high-level language such as C++.
AES encryption costs are greatly reduced on Intel's Westmere microarchitecture [5], whose instruction set extension (AES-NI) [6] implements key stages of the AES algorithm. Our code generator optimizes the AES algorithm itself, regardless of architecture, by creating billions of different implementations while maintaining correctness. Our system is a valuable resource for finding an optimized AES implementation on a given target architecture. Our contributions are as follows:
• We show our generator finds optimized variants with an average speedup of 1.43x over all the baselines.
• We show that a simple generator can find a good variation of the code without any specific knowledge of the target microarchitecture.
• We offer a viable alternative to maintaining multiple versions of hand-optimized code.
• We show simulated annealing is an effective and quick method to find a solution in a wide search space.
The remainder of this paper is organized as follows: Section II provides background on AES and simulated annealing. Our optimization techniques are described in Section III. Results are discussed in Section IV, and our conclusions are offered in Section V.
II. BACKGROUND
A. Encryption
AES is one of the most popular algorithms used in symmetric encryption. Originally published as Rijndael [7], AES was adopted as a standard by the U.S. government in November 2001 [8]. The standard comprises three block ciphers, AES-128, AES-192 and AES-256, which use 10, 12, and 14 rounds, respectively. AES is a block cipher which encodes incoming 128-bit blocks of plaintext with a secret key to produce the ciphertext.
Listing 1. Pseudocode for our implementation of simulated annealing.
    anneal()
        c0 = cost(default_arguments)
        t = start_temperature
        for i in range(0, iterations):
            for j in range(0, cooling_steps):
                new_args = random_arg * weight
                c1 = cost(new_args)
                delta = (c1 - c0) / c0
                if delta < 0
                    c0 = c1
                else if (e^(-delta / (k * t)) >= random[0, 1))
                    c0 = c1
            t = t * reduce
Listing 2. Pseudocode for the AES CTR encryption loop.
    aes_encrypt(*source, *dest, nonce, *key, nKeys, blocks)
        for i in range(0, blocks):
            result = nonce + i
            result = encrypt_initial(result, key[0])
            for j in range(1, nKeys):
                result = encrypt_round(result, key[j])
            result = encrypt_final(result, key
To encrypt data that exceeds the block size, a mode of operation must be used. Our generator produces code for both counter (CTR) and cipher-block chaining (CBC) modes. CTR mode turns the block cipher into a stream cipher: it encrypts a counter value and XORs the result with the plaintext. Because each block is processed independently, CTR mode is ripe for parallel optimizations. CBC mode has a cyclic dependency, because the result of each encrypted block is used as the seed to encode the next block.
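The difference in data dependencies between the two modes can be illustrated with a minimal Python sketch. It is not the paper's code: `toy_block_encrypt` is a hypothetical stand-in built on a keyed hash, used only so the example runs with the standard library, and it is neither AES nor secure. The function names, the big-endian counter encoding, and the requirement that CBC input be a whole number of blocks are all assumptions made for illustration.

```python
import hashlib

BLOCK = 16  # AES block size in bytes (128 bits)


def toy_block_encrypt(block: bytes, key: bytes) -> bytes:
    """Hypothetical stand-in for the AES block function (NOT AES, not secure)."""
    return hashlib.blake2b(block, key=key, digest_size=BLOCK).digest()


def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two byte strings (truncates to the shorter one)."""
    return bytes(x ^ y for x, y in zip(a, b))


def ctr_encrypt(plaintext: bytes, key: bytes, nonce: int) -> bytes:
    """CTR: block i depends only on nonce + i, so blocks are independent."""
    out = bytearray()
    for i in range(0, len(plaintext), BLOCK):
        counter = (nonce + i // BLOCK) % (1 << 128)
        keystream = toy_block_encrypt(counter.to_bytes(BLOCK, "big"), key)
        out += xor(plaintext[i:i + BLOCK], keystream)
    return bytes(out)


def cbc_encrypt(plaintext: bytes, key: bytes, iv: bytes) -> bytes:
    """CBC: each block is chained to the previous ciphertext block.

    Assumes len(plaintext) is a multiple of BLOCK.
    """
    out = bytearray()
    prev = iv
    for i in range(0, len(plaintext), BLOCK):
        prev = toy_block_encrypt(xor(plaintext[i:i + BLOCK], prev), key)
        out += prev
    return bytes(out)
```

In this sketch, changing the first plaintext block leaves later CTR ciphertext blocks untouched but changes every later CBC block, which is the cyclic dependency that limits parallelism within a single CBC stream.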
To facilitate these operations in hardware, Intel introduced support for AES through AES-NI [6] . Included in AES-NI are six instructions for symmetric encryption/decryption used by AES. Natively supporting AES instructions provides both performance and security benefits.
B. Simulated Annealing
Simulated annealing is a heuristic search algorithm that employs probabilistic reasoning to traverse the search space [9]. In the beginning, the search can be quite erratic in direction, although it always tests a neighbouring solution. Worse solutions are accepted in the early stages to prevent the search from becoming stuck in local minima. As the search goes on, the probability of accepting a worse solution decreases through a temperature variable that 'cools down'. Once the temperature is sufficiently cool, the algorithm behaves as though it has found a good solution and is more likely to test solutions close to it.
The goal of using simulated annealing in conjunction with our generated code is to use a guided search to reduce the time it takes to find a solution. Exhaustively trying all variants takes days of compilation time alone. In Listing 1, we outline the pseudo-code we use to implement simulated annealing. We modify the classic algorithm slightly to keep track of a global best solution.
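The annealing loop of Listing 1, extended with the global-best bookkeeping described above, might look like the following Python sketch. The parameter names follow the listing; everything else is an assumption: `toy_cost` is a made-up cost surface (not measured data), `toy_neighbour` is a hypothetical move function that perturbs one argument at a time, and the default of 30 x 50 steps is chosen only so that the total is 1500 evaluations.

```python
import math
import random


def anneal(cost, default_args, neighbour, start_temperature=1.0,
           iterations=30, cooling_steps=50, k=1.0, reduce=0.9, rng=None):
    """Simulated annealing following Listing 1, plus a global best.

    Assumes cost() always returns a positive number.
    """
    rng = rng or random.Random()
    current = default_args
    c0 = cost(current)
    best, best_cost = current, c0
    t = start_temperature
    for _ in range(iterations):
        for _ in range(cooling_steps):
            candidate = neighbour(current, rng)
            c1 = cost(candidate)
            delta = (c1 - c0) / c0  # relative change in cost
            # Always accept improvements; accept worse candidates with a
            # probability that shrinks as the temperature t falls.
            if delta < 0 or math.exp(-delta / (k * t)) >= rng.random():
                current, c0 = candidate, c1
                if c0 < best_cost:
                    best, best_cost = current, c0
        t *= reduce  # cool down after each batch of steps
    return best, best_cost


def toy_cost(args):
    """Made-up cost surface for illustration only (not measured data)."""
    interleave, localkeys = args
    return 1.0 + 0.05 * abs(interleave - 8) + 0.03 * (14 - localkeys)


def toy_neighbour(args, rng):
    """Hypothetical move: nudge one argument by +/-1 within its bounds."""
    interleave, localkeys = args
    if rng.random() < 0.5:
        interleave = min(16, max(1, interleave + rng.choice((-1, 1))))
    else:
        localkeys = min(14, max(0, localkeys + rng.choice((-1, 1))))
    return (interleave, localkeys)
```

A call such as `anneal(toy_cost, (1, 0), toy_neighbour, rng=random.Random(0))` returns the best argument tuple seen and its cost. In a real search, `cost` would generate, compile, and time a variant rather than evaluate a formula.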
III. AES CODE GENERATION
Our generator creates C source code variations of the AES encryption loop in Listing 2. The generated C code uses AES-NI compiler intrinsics when compiling for hardware with AES support; these functions can be substituted on non-AES hardware. Optimizations are turned on and off by a set of flags, whose combinations span a wide search space. The variants are further optimized at a low level by the gcc or icc compilers.
The generator can test several implementations. Our results in Section IV showcase 128-bit and 256-bit versions of both CTR and CBC modes, and different optimizations exist for each.
Optimization options that make small differences individually can make large improvements collectively. Options such as streaming stores, which write to memory without polluting the cache, and C restrict pointers, which give the compiler pointer aliasing information, can be turned on and off. Specific to CBC mode, a modification to the first two xor instructions can also be enabled.
The generator also takes in a bit-vector parameter to enable individual keys to be held in registers. This can be done in both modes along with software prefetching and preloading. Prefetching uses Intel prefetch instructions to move data into the cache. Preloading loads plaintext early, to help cover the cost of cache misses.
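One way such a bit-vector could drive code generation is sketched below in Python. This is a hypothetical illustration, not the paper's generator: the function names, the convention that bit j selects round key j, and the emitted variable names `k0`, `k1`, ... are all assumptions. Only the `__m128i` type comes from the AES-NI intrinsics headers.

```python
def emit_local_keys(mask: int, n_round_keys: int) -> list[str]:
    """Emit C declarations for the round keys selected by the bit-vector.

    Assumed convention: bit j of `mask` set means round key j is copied
    into a local variable, which the compiler may keep in a register.
    """
    lines = []
    for j in range(n_round_keys):
        if (mask >> j) & 1:
            lines.append(f"const __m128i k{j} = key[{j}];")
    return lines


def key_operand(mask: int, j: int) -> str:
    """Return the C expression the generated loop should use for key j."""
    return f"k{j}" if (mask >> j) & 1 else f"key[{j}]"
```

For example, `emit_local_keys(0b101, 3)` emits declarations for keys 0 and 2 only, and `key_operand(0b101, 1)` falls back to the memory operand `key[1]`.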
Interleaving is applied to both CTR and CBC modes differently. In CTR, interleaving consists of unrolling the outer AES loop. The code from each unrolled iteration can be interleaved as there is no dependency between them. This allows a single key value to be used more frequently in succession. In CBC, we interleave individual encryption streams instead of successive iterations per stream. This minimizes the cyclic dependency on each stream from one iteration to the next. The generator can also adjust the distance between the interleaved sections.
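The CTR form of interleaving can be sketched as a small Python generator that emits the unrolled loop body round by round, so that each key is used by all unrolled blocks in succession. The helper names `encrypt_initial`, `encrypt_round`, and `encrypt_final` are taken from Listing 2; the variable names `r0`, `ctr0`, ... and the index of the final key are assumptions, since the listing is truncated at that point.

```python
def emit_ctr_interleaved(unroll: int, n_rounds: int) -> list[str]:
    """Emit C statements for `unroll` independent CTR blocks, interleaved.

    Statements are grouped by round rather than by block, so the same
    key[j] is used `unroll` times in a row. Assumes the final round uses
    key[n_rounds] (an assumption; Listing 2 is truncated there).
    """
    blocks = range(unroll)
    lines = [f"__m128i r{b} = encrypt_initial(ctr{b}, key[0]);" for b in blocks]
    for j in range(1, n_rounds):
        lines += [f"r{b} = encrypt_round(r{b}, key[{j}]);" for b in blocks]
    lines += [f"r{b} = encrypt_final(r{b}, key[{n_rounds}]);" for b in blocks]
    return lines
```

Calling `emit_ctr_interleaved(2, 3)` yields eight statements in which `r0` and `r1` alternate within each round. No statement reads a result produced by the other block, which is what leaves the processor free to overlap them.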
Exploitable in CTR mode, software pipelining divides several iterations into distinct parts (see Listing 3). The size of the parts depends on the initiation interval (ii). Inside the loop, only one full set of AES iteration instructions exists, but it operates on several different blocks in parallel. Since each loop iteration uses the same keys on different plaintexts, there are fewer dependencies between rounds.
Listing 3. Software pipelining with an initiation interval of 2
    // pre software pipelining code
    for (i = 0; i < blocks; i++) {
        // software pipelining loop
        sp0_result = _mm_xor_si128(sp0_result, plaintext[i]);
        sp1_result = encrypt_round(sp1_result, key[13]);
        sp2_result = encrypt_round(sp2_result, key[11]);
        sp3_result = encrypt_round(sp3_result, key[9]);
        sp4_result = encrypt_round(sp4_result, key[7]);
        sp5_result = encrypt_round(sp5_result, key[5]);
        sp6_result = encrypt_round(sp6_result, key[3]);
        sp7_result = encrypt_round(sp7_result, key
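The scheduling idea behind software pipelining can be checked with a short Python model. It is a sketch under stated assumptions rather than the generator's logic: one block's work is modelled as an ordered list of step names, the list is cut into stages of ii steps each, and loop iteration `it` runs stage s on block `it - s`. The 16-step breakdown used in the usage note is an assumed reading of AES-256 in CTR mode, chosen because it gives the eight stage variables (sp0 to sp7) that Listing 3 shows for ii = 2.

```python
def pipeline_stages(steps, ii):
    """Split one block's ordered steps into stages of `ii` steps each."""
    return [steps[s:s + ii] for s in range(0, len(steps), ii)]


def pipelined_trace(steps, ii, blocks):
    """Return, for every block, the steps a pipelined loop applies to it.

    Loop iteration `it` runs stage s on block it - s. The iterations at
    either end, where some stages have no block to work on, model the
    prologue and epilogue code around the steady-state loop.
    """
    stages = pipeline_stages(steps, ii)
    trace = {b: [] for b in range(blocks)}
    for it in range(blocks + len(stages) - 1):
        # Oldest block first, mirroring the order of statements in a
        # pipelined loop body; the stages touch different blocks.
        for s in reversed(range(len(stages))):
            b = it - s
            if 0 <= b < blocks:
                trace[b].extend(stages[s])
    return trace
```

With `steps = ["initial"] + [f"round{j}" for j in range(1, 14)] + ["final", "xor_plaintext"]` and `ii = 2`, `pipeline_stages` returns eight stages, and `pipelined_trace` confirms that every block still receives all of its steps in the original order even though each loop iteration works on several blocks at once.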
A. Selective-Exhaustive searching
AES 256 CTR performance ranged from 1.73 cycles/byte (C/B) to over 6.0 C/B on a subset of possible variants. We find that mid-range (6-12) interleaving values tend to run best. Software pipelining ii values of 2 and 3 are also good. Small ii values increase instruction-level parallelism (ILP), as does interleaving, but also increase register pressure. Figure 1 shows the performance of software pipelining. The graph trends downward, with better performance from smaller ii values. A lower ii value means more iterations with little to no dependency on one another inside the encryption loop, which maximizes the opportunity to exploit ILP.
Nearly all of the fastest 200 variants use 10 or more localkeys. Among this selection, prefetching upcoming sourcetext also seems to be beneficial, and prefetching sourcetext three iterations in advance tends to yield the best results. Prefetching has little effect on performance when enabled on its own, but can improve running times by several percentage points when used with other optimizations.
In CBC mode, the fastest running times when generating AES 256 were with 4 stream buffers (CBC-4), whose fastest variant clocked in at 1.71 C/B. The peaks and valleys in Figure 2 show much more texture than those in the CTR graph. In Figure 2, the number of localkeys affects performance much more when interleaving 4 stream buffers in CBC mode. As each stream has its own set of keys, there are far more keys than registers available in 64-bit mode. This pressure affects how tightly the streams can be interleaved. In CBC mode, 3 or 4 streams with interleave distances higher than 7 show a significant decrease in performance, as the generated code has large amounts of data dependency.
Optimizations like streaming store, restrict, and prefetching/preloading improve a variant by only a few percent individually, but improve performance more when used collectively. This shows us that a generator is useful for finding which set of these minor optimizations will improve performance. Combining these smaller optimizations with the larger-impact ones, such as interleaving, software pipelining, and keys in registers, gives us a significant improvement over our baselines.
B. Simulated Annealing
The selective-exhaustive experiments we ran in CTR mode included over 40,000 variants. This is a very small subset of the possible combinations, yet it takes several hours to complete. Our simulated annealing implementation tries 1500 solutions and completes in under an hour. Its initial solution applies no optimizations. When flags are changed, greater weights are given to the number of localkeys and the interleaving factor. Lower weights are given to on/off options. Annealing in CTR mode yielded a top performance of 1.62 C/B with software pipelining and 1.64 C/B using all other possible optimizations, as seen in Figure 3.
The selective-exhaustive experiments we ran in CBC mode were again only a subset of billions of possible combinations, and annealing is used to reduce search time. We tune the argument weights for CBC-specific optimizations. Annealing found variants for CBC with 1 to 4 stream buffers that all achieved speedups over their baselines (Figure 3). Unoptimized CBC-2 has a speedup over unoptimized CBC-1 of 2.23x. However, our optimizations to CBC-2 only increase performance by 6-7%. We found better speedups with CBC-3 and CBC-4 at 1.56x and 1.63x, respectively.
In both CTR and CBC modes, annealing was able to traverse much more of the search space and find variants that ran faster than those found by the selective-exhaustive searches. Running exhaustive searches on even small subsets took hours to complete, whereas simulated annealing took under an hour and came up with a better result.
V. CONCLUSION
In this paper, we presented a code generator that creates optimized AES implementations. Our system can generate billions of versions of the AES encryption loop. Searching even small subsets of the possible combinations yields an average speedup of 1.43x over all baselines. The generator applies both general and AES-specific optimizations, as described in detail in Section III.
We evaluated the generated code on hardware with native AES support. Architectural properties can influence the search for the best optimizations for several AES modes with varying key sizes. The generator can search this space without any insight into hardware specifics, such as cache sizes, number of registers, or even the instruction set available to implement AES.
AES is often optimized by hand to obtain a version that runs efficiently. The generator is a viable alternative to maintaining hand-optimized code when new microarchitectures are introduced, saving both time and money. To save additional time, we implemented simulated annealing as a guided search heuristic. This heuristic performed well in both CTR and CBC modes. Exploring only 1500 variants, a small fraction of the search space, the algorithm found variants that performed better than those in our selective-exhaustive searches.
Using a code generator is a good way to find which implementation runs fast on a target platform for an already very common everyday task: AES encryption.
ACKNOWLEDGMENT
We would like to thank Mike O'Hanlon at Intel Shannon for his help, input, and facilitating access to test hardware.
