We provide an implementation of the Data Encryption Standard highly optimized for the Intel Pentium processor.
Universitas Bergensis
Department of Informatics 11th April 2003
Introduction
Since its acceptance as a standard by the National Bureau of Standards in 1977, a lot of effort has been spent on optimizing implementations of the Data Encryption Standard (DES) [1]. This thesis provides one more contribution to this effort, detailing an assembly language implementation of the DES specifically optimized for the Intel Pentium processor. We start by giving an introduction to how the DES works, followed by an introduction to the architecture of the Pentium. Then we present the details of the different components of the DES, and how to make them fast in software, given the specific strengths and weaknesses of the Pentium.
Introduced in 1993, the Pentium is a member of the popular 'x86' family of processors, and it was the first member of the family able to execute more than one instruction per clock cycle. Its number of registers (the fastest kind of storage available) is very limited, and there are a lot of limitations to consider when attempting to produce an optimal program for this processor.
We have chosen to use assembly language for our implementation. This way we are able to take advantage of all available integer registers and special features of the processor. We are then also able to explicitly schedule all instructions in the encryption loop for maximum speed.
The Data Encryption Algorithm
This chapter provides an overview of the encryption and key setup algorithms of the Data Encryption Standard. Details are left for the more detailed analysis in Chapter 4.
Encryption
DES encrypts data in blocks of 64 bits. It has three components: the initial permutation (IP), the round function shown in Figure 2.1, and the inverse of IP, also known as the final permutation (FP). The round function is applied 16 times, each time with a different 48-bit round key.
Figure 2.1: DES round function
The structure of the cipher, known as a Feistel structure, is shown in Figure 2.2. The '+' operation used is bitwise addition modulo 2, the 'xor' operation. Note the lack of a left/right swap after the last round. Combined with reversing the round key sequence, this allows the same algorithm to be applied for decryption.
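The decryption property described above can be checked with a small Python sketch. It is an illustration only: `toy_f` is a hypothetical stand-in for the real round function, and the round keys are arbitrary numbers rather than a DES key schedule.

```python
MASK32 = 0xFFFFFFFF

def toy_f(right, round_key):
    """Hypothetical stand-in round function (NOT the DES f function)."""
    x = (right * 0x9E3779B1 + round_key) & MASK32
    return x ^ (x >> 15)

def feistel(block, round_keys, f=toy_f):
    """Run a 64-bit block through a Feistel network, one round per key."""
    left, right = (block >> 32) & MASK32, block & MASK32
    for key in round_keys:
        left, right = right, left ^ f(right, key)
    # The loop swaps the halves after every round, including the last one.
    # Emitting (right, left) cancels that final swap.
    return (right << 32) | left

round_keys = list(range(1, 17))          # arbitrary illustrative keys
plaintext = 0x0123456789ABCDEF
ciphertext = feistel(plaintext, round_keys)
# Same function, reversed key order: this is decryption.
recovered = feistel(ciphertext, round_keys[::-1])
```

Because the round function is only ever xor'ed into the other half, it never needs to be inverted, which is why any `f` works in this sketch.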
The 'f' function of Figure 2.3 is where the bulk of the work is done. First the 32-bit R_{i-1} is expanded to 48 bits by making two copies of half its bits. Then this value is xor'ed with the round key K_i. The result is split into eight 6-bit values and fed into 8 different S-boxes. Each S-box maps a 6-bit input to a 4-bit output. These outputs are concatenated, and then these 32 bits are permuted by the permutation function P.
DES keys have 56 key bits and 8 parity bits. The parity bits will be ignored by our implementation. The 'Permuted choice 1' (PC-1) transform drops the parity bits and permutes the rest. Each round key is then generated by one application of the cyclic left shift (LS_i) and 'Permuted choice 2' (PC-2), as shown in the key schedule figure. The PC-2 transform is similar to PC-1 in that it picks and permutes 48 out of 56 bits. The LS_i function cyclically shifts each half of its input bits 1 or 2 positions left depending on the round number.
The Pentium Processor
For the purpose of describing our implementation of the DEA specifically optimized for the Pentium, this section is devoted to a somewhat simplified overview of the processor, with more details on aspects relevant to the implementation of the DES.
Registers
Since the 80386, x86 processors have had 8 'general-purpose' 32-bit registers. Four of these registers provide special access to their lower half, both as one 16-bit register and as two 8-bit registers. Accessing 16-bit registers requires a prefix and incurs a 1-cycle penalty in 32-bit mode, but access to the 8-bit registers has no penalty on Pentium ('Classic' or 'P5') processors. Hence the 8-bit registers provide a fast alternative to the sequence of instructions that would otherwise be used to read individual bytes within a word. Note that writing to a partial register does not alter the rest of the full register. The stack pointer (ESP) register, although counted among the general-purpose registers, should not be used for other purposes. That leaves us 7 registers for computations, with 8-bit partial registers in four of them.
Caches
The Pentium has two internal first-level (L1) 8-kilobyte caches, one for instructions and one for data. Cache line length is 32 bytes, and cache lines in the data cache are spread across 8 banks with 4 bytes in each bank. The data cache supports two simultaneous accesses, but only to different banks. Load latency on a cache hit is just 1 clock cycle. Unaligned accesses (accessing two neighboring banks in one operation) are at least 3 cycles slower, but are also easily avoided. The L1 caches are 2-way set associative, meaning that every location in memory maps to a set of 2 lines in the cache. There are 128 such sets. Whenever the processor performs a load operation, it first looks for the data in the corresponding set. If the data is not there, the least recently used of the two lines is replaced with the line from off-chip memory (L2 cache or actual RAM) containing the requested data.
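The cache geometry above can be worked out in a few lines of Python. This is a sketch under one assumption: addresses map to banks and sets in the conventional way, with the low address bits selecting the byte within a line and the next bits selecting the set. The function names are illustrative.

```python
CACHE_BYTES = 8 * 1024
LINE_BYTES = 32
WAYS = 2
BANK_BYTES = 4

SETS = CACHE_BYTES // (LINE_BYTES * WAYS)   # 8192 / 64 = 128 sets
BANKS = LINE_BYTES // BANK_BYTES            # 32 / 4 = 8 banks

def cache_set(addr):
    """Set index, assuming consecutive lines map to consecutive sets."""
    return (addr // LINE_BYTES) % SETS

def cache_bank(addr):
    """Bank index, assuming consecutive 4-byte words map to consecutive banks."""
    return (addr // BANK_BYTES) % BANKS

def bank_conflict(addr_a, addr_b):
    """True if two simultaneous accesses would hit the same bank."""
    return cache_bank(addr_a) == cache_bank(addr_b)

def is_unaligned(addr, size=4):
    """True if an access of `size` bytes straddles two banks."""
    return addr // BANK_BYTES != (addr + size - 1) // BANK_BYTES
```

Under this mapping, two addresses 4096 bytes apart compete for the same set, which matters when placing lookup tables so that they do not evict each other.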
Level 2 cache for the Pentium is external to the chip, and has a much longer latency (access time) than level 1. The minimum penalty for an L1 cache miss is 4 clock cycles. We therefore do our best to keep all the data we need in L1 cache during encryption.
When writing to an uncached address, the Pentium does not load the corresponding cache line into L1, but instead writes directly to L2 or RAM. We will use this to avoid removing existing contents from the cache. When we want writes to go to the cache, we will first load from their cache lines.
Pipelining
The Pentium divides instruction execution into two pipelines, called U and V, and five pipeline stages. Prefetching reads memory sequentially until interrupted by branch (jump) instructions. Branches are predicted taken or not based on their previous history.
We do not need a thorough analysis of the prediction algorithm for this implementation, but observe that simple patterns are correctly predicted until they are broken. A thorough explanation of branch prediction can be found in Agner Fog's excellent optimization guide [3].
Prefetching is able to continue past a correctly predicted branch instruction, continuously feeding instructions to the next stage.
Decode 1 (D1)
Here two parallel decoders attempt to decode two instructions and pass them on to the next stage. There are a number of limitations on which instructions may be executed in parallel. Pairability of the instructions we have used to implement the DES is presented in Section 3.4. When a pair of instructions have been selected for simultaneous execution, they pass through the pipeline in lockstep. Each stage may only contain one instruction pair, and they may only pass from one stage to the next when they are both ready to do so.
Decode 2 (D2)
At this stage memory addresses are calculated. Addresses are generally of the form 2^n·a + b + c, where a and b are registers, c is a constant, and n is 0, 1, 2, or 3. Here a is called the index register, b the base register, and c the displacement; a and b may be the same register.
Addresses are calculated 'for free', provided the operands are ready when the instruction reaches the D2 stage. Otherwise the instruction will stall in this stage until operands are ready, usually one clock cycle. This is called an Address Generation Interlock (AGI) stall.
Execute (EX)
The execute stage performs both memory access and ALU operations. If an instruction specifies both kinds, its parts are executed in successive cycles in this stage, stalling instructions in earlier stages of the pipeline.
Writeback (WB)
This is the final stage, where instruction results are committed to processor state.
Write buffers
There is one write buffer connected to each pipe, allowing write instructions to complete in one cycle, even when the referenced memory is not contained in the L1 cache. Only one write miss may be buffered per pipeline, meaning that a write instruction following a write miss in the same pipe will have to wait (it stalls the pipeline) until the preceding write operation is completed.
Instruction pairing is limited both by the pairability of consecutive instructions and by dependencies between them. An instruction may not be issued to the V pipe in parallel with another one in the U pipe if it reads or writes a register written to by the U pipe instruction. There are only a few exceptions to this rule, some of which have been used in this implementation. They are listed in Table 3.3. Jcc is a conditional jump, where 'cc' is replaced by a condition code.
Condition codes refer to specific (combinations of) bits in the flags register, e.g. the zero flag, which is set if the result of an arithmetic operation is zero. We will only be using the zero flag in this implementation, to check for a remaining block count of zero. So the conditional jumps we will be using are 'jz' (jump if zero) and 'jnz' (jump if not zero).
U      V
test   jcc
push   push
pop    pop
Table 3.3: Special instruction pairings
Some instructions have prefix bytes, and require an extra cycle in the D1 stage per prefix. These instructions are also not pairable. There is but one exception: the conditional near (32-bit displacement) jump has a prefix, but this incurs no prefix penalty, and it is pairable in the V pipe.
Note that all instructions in [...]
Table 3.5: Additional instructions used by the encryption function (surviving row: return from function call, NP)
The lea instruction (load effective address) uses the D2 pipeline stage to calculate an address, and then stores that address in its target register. No memory access is performed. This instruction can be used to perform a multiway add (constant + register + scaled register) in combination with a copy (the target register is freely chosen).
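To make the arithmetic concrete, the following Python sketch models what lea computes, using the address form 2^n·a + b + c from the D2 stage description. The function name and parameter names are illustrative, and results are reduced modulo 2^32 as on a 32-bit register.

```python
MASK32 = 0xFFFFFFFF

def effective_address(base, index, scale_log2, disp):
    """Model of 'lea dst, [base + index*2^n + disp]' on 32-bit registers."""
    if scale_log2 not in (0, 1, 2, 3):
        raise ValueError("scale must be 1, 2, 4 or 8")
    return ((index << scale_log2) + base + disp) & MASK32

# Multiway add: base + 8*index + 16 in a single instruction.
three_way_sum = effective_address(base=100, index=5, scale_log2=3, disp=16)

# Left shift with copy: the source value is left untouched,
# and the shifted result goes to another register.
source = 0x80000001
shifted_copy = effective_address(base=0, index=source, scale_log2=1, disp=0)
```

The second use, a small left shift that leaves its input intact, is the one the key setup relies on later.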
Only the U pipe is able to execute shift and rotate instructions, and multi-bit rotate instructions are not pairable. This limits our freedom in scheduling these instructions. In some places we replace a 2-bit rotate with two 1-bit rotates, since this allows us to run other instructions in the V pipe.
Two 'xor r32,m32' instructions may be paired and execute efficiently in parallel, provided they don't access the same cache bank. Their combined execution time is 2 cycles. When paired with a simple instruction, the execution time is still 2 cycles. This is called an imperfect pair.
All instructions used here are compatible with the 80486 and later x86 processors.
Implementing the DEA
Bit Ordering
The permutations employed by the cipher are described using bit numbers. The numbering used in the standards documents enumerates the bits from left to right, starting at 1. When displayed as a matrix, row major order is used. This is best illustrated by the identity transform table. The Pentium processor reads its memory using the opposite byte (row) order, giving the bit number matrix shown in Table 4.2. We have here divided the matrix into upper and lower halves. On the Pentium we need one 32-bit register to store each half, and hence swapping the halves amounts to swapping the roles of those two registers.
Encryption
We now turn to a more detailed description of the various components of DES encryption, with the resulting assembly language code performing them.
Initial and final permutations
The initial permutation of the DEA has a very simple structure, and can be performed as a series of bit block swaps known as Hoey's Initial Permutation Algorithm. This algorithm is shown in the accompanying table. The final permutation is applied by performing the swaps of IP in reverse order. Table 4.5 shows the implementation of IP with adjacent rotate instructions merged, as well as how they will be paired on the processor. In cycles 3, 7, 11, 15 and 19, the one instruction listed is capable of running in either pipe. In cycle 21, the V pipe is available for the code following IP, i.e. the round function.
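The bit block swap primitive, and the reason a sequence of such swaps is undone by running it in reverse order, can be sketched in Python as follows. The (shift, mask) pairs in `STEPS` are illustrative assumptions; this sketch does not claim to reproduce the exact swap sequence or register roles of the tables in this thesis.

```python
MASK32 = 0xFFFFFFFF

def swap_bits(a, b, shift, mask):
    """Exchange the bits of b selected by mask with the bits of a
    selected by (mask << shift), using one temporary value."""
    t = ((a >> shift) ^ b) & mask
    return a ^ ((t << shift) & MASK32), b ^ t

# Illustrative steps only (an assumption, not the thesis's table).
STEPS = [(4, 0x0F0F0F0F), (16, 0x0000FFFF), (2, 0x33333333),
         (8, 0x00FF00FF), (1, 0x55555555)]

def forward(a, b):
    for shift, mask in STEPS:
        a, b = swap_bits(a, b, shift, mask)
    return a, b

def backward(a, b):
    # Each swap is its own inverse, so the inverse of the whole
    # sequence is the same swaps applied in reverse order.
    for shift, mask in reversed(STEPS):
        a, b = swap_bits(a, b, shift, mask)
    return a, b
```

Each swap costs a shift, two xors and a mask for the temporary, plus a shift and xor to write back, which is what makes the permutation cheap compared to moving bits one at a time.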
Round Function
The round function will be executed 16 times for each block encrypted. This makes it extra important to optimize it as much as possible.
Expansion Function (E)
This function expands a 32-bit input value to a 48-bit output by duplicating half of the bits, as shown in the table. Only the two center columns have unique bit numbers. Note that the bits from the ends of the input (bits 1 and 32) appear at both ends of the output value.
The structure of E is easy to spot, and can also be exploited in the implementation, as shown in Figure 4.1. Each half of the output can be computed from the input simply by rotating the input left or right by 1 bit. The upper output word contains the rows that (after being xor'ed with the round key) are input to odd-numbered S-boxes. The lower word contains inputs for the even-numbered S-boxes. This requires some extra work in the key setup, but reduces the work needed for encryption; E is actually reduced to only 4 instructions. These are all marked with an E in the note column of Table 4.9. The next step after E is an xor with 48 key bits, implemented using two 32-bit xor's. As noted above, the key setup function must place key bits so they fit the structure of the encryption. The key mix instructions are marked with a K in Table 4.9.
S-boxes and the Permutation Function (P)
Table 4.7: The permutation P
The S-boxes, being the non-linear components of DES, don't have easily exploitable structures. Furthermore, the permutation P following the S-box lookups has no obvious regular structure.
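The regular structure of E can be verified with a short Python sketch. It compares the standard E table against a computation that uses only rotations: every 6-bit S-box input is a contiguous window of the rotated 32-bit word. The word layout here (bit 1 as most significant bit, all eight windows packed into one 48-bit integer) is chosen for clarity and is not the odd/even register layout used by our assembly code.

```python
MASK32 = 0xFFFFFFFF

# Expansion table E from the DES standard (bit 1 is the leftmost bit).
E_TABLE = [32,  1,  2,  3,  4,  5,   4,  5,  6,  7,  8,  9,
            8,  9, 10, 11, 12, 13,  12, 13, 14, 15, 16, 17,
           16, 17, 18, 19, 20, 21,  20, 21, 22, 23, 24, 25,
           24, 25, 26, 27, 28, 29,  28, 29, 30, 31, 32,  1]

def rol32(x, n):
    return ((x << n) | (x >> (32 - n))) & MASK32

def ror32(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK32

def expand_table(r):
    """Bit-by-bit reference implementation of E."""
    out = 0
    for n in E_TABLE:
        out = (out << 1) | ((r >> (32 - n)) & 1)
    return out

def expand_rotate(r):
    """E computed from rotations: rotating right by 1 puts bit 32 in
    front of bit 1, after which S-box j reads a 6-bit window that
    starts 4*j bits further along."""
    rot = ror32(r, 1)
    out = 0
    for j in range(8):
        out = (out << 6) | (rol32(rot, 4 * j) >> 26)
    return out
```

Since adjacent windows overlap by two bits, the odd-numbered windows do not overlap each other, and neither do the even-numbered ones; this is what allows each group to be held in a single rotated register.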
But then, we can combine an S-box with P in a single table lookup providing 32 output bits. That is, in one single load operation, we can both perform an S-box lookup and position its bits according to P.
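As an illustration of the combined lookup, the following Python sketch builds such a table for S-box 1 using the S1 and P tables from the DES standard. It uses the standard bit layout (bit 1 as most significant bit); the tables in our assembly implementation are arranged differently to match the rotated register layout, so this sketch shows the principle rather than our actual table contents.

```python
# S-box 1 from the DES standard: 4 rows of 16 entries.
S1 = [[14,  4, 13,  1,  2, 15, 11,  8,  3, 10,  6, 12,  5,  9,  0,  7],
      [ 0, 15,  7,  4, 14,  2, 13,  1, 10,  6, 12, 11,  9,  5,  3,  8],
      [ 4,  1, 14,  8, 13,  6,  2, 11, 15, 12,  9,  7,  3, 10,  5,  0],
      [15, 12,  8,  2,  4,  9,  1,  7,  5, 11,  3, 14, 10,  0,  6, 13]]

# Permutation P from the DES standard.
P = [16,  7, 20, 21, 29, 12, 28, 17,  1, 15, 23, 26,  5, 18, 31, 10,
      2,  8, 24, 14, 32, 27,  3,  9, 19, 13, 30,  6, 22, 11,  4, 25]

def sbox1(x6):
    """S1 lookup: outer bits select the row, inner four bits the column."""
    row = ((x6 >> 4) & 2) | (x6 & 1)
    col = (x6 >> 1) & 0xF
    return S1[row][col]

def permute_p(x32):
    """Apply P to a 32-bit word (bit 1 is the most significant bit)."""
    out = 0
    for n in P:
        out = (out << 1) | ((x32 >> (32 - n)) & 1)
    return out

# S1's output occupies bits 1..4 of the 32-bit S-box output word,
# so it is shifted to the top nibble before P is applied.
SP1 = [permute_p(sbox1(x) << 28) for x in range(64)]

def f_contribution_s1(x6):
    """One load replaces an S-box lookup plus a 32-bit permutation."""
    return SP1[x6]
```

With 64 entries of 4 bytes per S-box, eight such tables take 2 kilobytes. The eight results are combined with xor, since each table only sets the four output bits that belong to its own S-box.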
Figure 4.2 shows the structure of f in our implementation. Shaded boxes represent unused bits. Note that R_i is rotated one bit left, allowing us to complete one half of E earlier than the other, and start table lookups for the even-numbered S-boxes. Since we need R_i rotated one bit right for the other half, this also means we now have to do a rotation two bits right. This had to be done as two separate 1-bit rotates, so that we could pair them with other instructions.
The output function also requires that the output halves from IP are rotated one bit left, and IP^{-1} must rotate its inputs one bit right. We make IP satisfy this requirement simply by changing the last instruction in Table 4.5 from 'ror esi, 1' to 'rol edi, 1'.
The dependency graph constructed to aid in the scheduling of the round function is shown in Figure 4.3. Apart from the shift instructions, you can easily see the similarity between this and each half of the result in Figure 4.2. 'Movb' in the figure corresponds to a 'mov r8,r8' instruction. To improve readability and ease debugging, the Pentium's registers have been assigned fixed roles in the round function implementation. The assignment chosen is shown in Table 4.8. This is but one of many possible choices; within this function, ESI/EDI/EBP are interchangeable, as are EAX/EBX/ECX/EDX. The only limitation is the need to match the choices made for IP/IP^{-1}.
Table 4.12: Permuted Choice 1, rearranged version
The first block swap is performed as shown in Table 4.13, using a deliberate imbalance preparing for the next parts. Next, we have two parts where we swap 8 bits from each register with 8 other bits from the same register. Table 4.14 contains an example from the actual code, swapping blocks of 2x2 bits within EAX using ECX as temporary storage. The previous imbalance is exploited by performing each half of the second and third swaps out of sync by two cycles, allowing us to easily schedule shift instructions for the U pipe.
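Swapping bit blocks within a single register, using one temporary, can be sketched in Python as below. The shift and mask in the example are illustrative assumptions and are not taken from Table 4.14.

```python
MASK32 = 0xFFFFFFFF

def swap_within(x, shift, mask):
    """Exchange the bits of x selected by mask with the bits selected
    by (mask << shift). `t` plays the role of the temporary register."""
    t = ((x >> shift) ^ x) & mask      # differences between the two blocks
    return x ^ t ^ ((t << shift) & MASK32)

# Illustrative example: swap the two 2-bit halves of every nibble.
before = 0x12345678
after = swap_within(before, 2, 0x33333333)
```

In this example each hex digit has its upper and lower bit pairs exchanged, so 0x1 becomes 0x4, 0x2 becomes 0x8, and so on. Applying the same swap a second time restores the original value.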
The last part of our PC-1 contains a swap of two bytes, which is performed by first shifting the lower half one byte 'down' (8 bits right), then reversing its byte order. The remaining part consists of simple operations on the lower bytes of each half.
LS_i performs a cyclic left shift (a rotate) of each 28-bit half of the bits. The shift count for LS_i is 1 for i ∈ {1, 2, 9, 16}, otherwise 2. Figure 4.5 illustrates the application of LS_1 to one 28-bit half placed in a 32-bit register.
Observe that the lea instruction can perform a left shift by up to 3 bits and write the result to a freely specified register. That is, it can perform a small left shift and keep the input register unaltered. We use this to make a left-shifted copy of each input register, and then right-shift the previously unaltered inputs. We also remove unwanted bits from the shaded area. Last, we recombine the values. The code for this is given in the accompanying tables.
Table 4.17: Implementation of double left shift
Running LS_1 directly after PC-1, there will be an AGI stall caused by the lea instruction trying to read EAX before it has been modified by the last instruction of PC-1. We resolve this using the observation that the output from PC-1 has all zeros in the shaded areas of the LS_1 inputs. This allows us to use the alternate function shown in Table 4.18, with the new instructions highlighted.
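The shift schedule can be written out in a few lines of Python. This sketch keeps each 28-bit half in the low bits of an integer, which is a simplification and not the register layout of Figure 4.5; the function names are illustrative.

```python
MASK28 = (1 << 28) - 1

# Shift count for LS_i: 1 in rounds 1, 2, 9 and 16, otherwise 2.
SHIFTS = [1 if i in (1, 2, 9, 16) else 2 for i in range(1, 17)]

def rotl28(half, count):
    """Cyclic left shift of a 28-bit value."""
    return ((half << count) | (half >> (28 - count))) & MASK28

def key_halves(c0, d0):
    """Return the (C_i, D_i) pairs for rounds 1..16."""
    halves = []
    c, d = c0, d0
    for count in SHIFTS:
        c, d = rotl28(c, count), rotl28(d, count)
        halves.append((c, d))
    return halves
```

The shift counts add up to 28, so after the 16th round both halves are back at their starting values.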
Permuted Choice 2 (PC-2)
We can easily see that only numbers ≤ 28 are present in the upper half of PC-2, and only numbers > 28 in the lower half. This implies that the upper (left) half of PC-2 only selects bits from the upper half of its input, while the lower half only selects from the lower half. In other words, the halves are independent of each other.
Table 4.18: Alternate implementation of single left shift (surviving instructions: 'and bl, 0xe0', 'xor edx, ebx', 'xor al, cl')
Attempts at finding any further useful structure in this function have failed, and it is therefore implemented using table lookups. The challenge is then to do the table lookups efficiently. This was achieved mainly by constructing a very fast algorithm to rearrange the input bits. Table 4.19 shows PC-2 inverted. That is, the number in each bit position tells where that bit in the input to PC-2 is placed in its output. Table 4.20 shows the inverted PC-2 table expanded with 4 empty positions at the end of each half, corresponding to the shaded bits in Figure 4.5. We want to arrange the bits of this table so that there are equally many (6) on each line, with no spaces (dashes) between them. We also want the resulting rearrangement to run as efficiently as possible, since this will be the part of PC-2 preparing for table lookups.
Table 4.21: PC-2 inverted, expanded, shuffled
Our implementation of PC-2 can be divided into the following three parts, as illustrated in Figure 4.6.
Bit reordering: Distribute 24 of the 28 input bits in each half to contiguous sets of 6 bits, one set in each byte. Bits are not moved from one half to the other; we want the halves to stay independent to avoid doubling the table size and number of lookups.
Table lookups: Each lookup puts a set of bits in their intended positions, except for two changes preparing for the next step: the two bytes in the middle of each half are swapped, and we swap the left and right halves of the right half. That is, byte order 1234 is changed to 1324, and 5678 to 6857. The reason for doing this swap here is that it is only a change to precomputed tables; there is no runtime cost.
Byte reordering: The encryption core divides S-box lookups into one half using odd-numbered round key bytes, and the other half using even-numbered bytes. We must put the key bits in corresponding positions, and we do this by swapping lower halves, and then rotating the right half. The swapping is done as a series of simple instructions, since this is one cycle faster than the single instruction accomplishing the same result.
The number of a byte in this context refers to the number of the S-box whose input it will affect in the encryption function.
Looping
Now that we have all the highly optimized components of the DEA, we must also consider how to combine them and do real encryption. Specifically, we must be able to load inputs, store outputs, and decide when to stop encrypting. Here, too, we must ensure we do not waste even a single clock cycle.
As it turns out, we are actually able to add all of this, and still add only one single clock cycle on top of the encryption time. This is achieved by utilizing every free instruction slot in the IP and FP transforms for loading, storing and counting blocks.
This also involves some complex register allocation in our FP implementation, shown in Table 4.25, to ensure we have the registers we need for the loop construct. The block counter starts at the negated block count and counts up to zero, a well-known optimization trick that allows us to use it for indexing the input and output blocks. The input and output pointers are therefore also adjusted to point one block past the end of the input and output arrays.
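The counting scheme can be illustrated in Python, where a negative index already means "relative to one past the end", just like the adjusted pointers described above. The names `process_blocks` and `transform` are hypothetical and stand in for the encryption loop and the per-block encryption.

```python
def process_blocks(blocks, transform):
    """Process all blocks using a single counter that runs from
    -len(blocks) up to zero. The same counter serves as the index into
    both arrays and as the loop termination test (compare with 'jnz')."""
    out = [None] * len(blocks)
    count = -len(blocks)
    while count != 0:
        # blocks[count] addresses "end of array + count".
        out[count] = transform(blocks[count])
        count += 1
    return out

doubled = process_blocks([1, 2, 3], lambda block: block * 2)
```

The benefit in assembly is that incrementing the counter sets the zero flag when the last block is done, so no separate compare instruction or second counter register is needed.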
Results
The best previous results we know for DES are those of Antoon Bosselaers [2], encrypting at 340 cycles per block, and generating round keys in 686 cycles. His code uses 2 kilobytes of lookup tables for each of encryption and key setup.
Encryption
When all program code and data are in L1 cache, our new DES encryption runs at 315 cycles per block on Pentium (non-MMX). This includes both the encryption itself and the surrounding loop. This is more than 7.9% faster than the speed achieved by Bosselaers.
Using Cipher Block Chaining (CBC) mode adds only 1 cycle per block when encrypting. CBC decryption requires another 2 cycles to handle the initialization vector.
When encrypting data from memory (too big to fit in any cache) on a 120 MHz Pentium running Linux 2.4.19, in-place encryption runs at approximately 329.5 cycles per block. Encrypting from one array to another takes 333.5 cycles per block. These numbers include operating system overhead. Figure 5.1 shows timing results for ECB encryption. Samples were taken for lengths that are multiples of 8 DES blocks, up to 1024 blocks (8 kilobytes). To emphasize per-block timing, the best number of cycles for encrypting 4 blocks is subtracted, and the resulting clock count is divided by the block count minus 4. The 4-block startup time was 1361 cycles.
Key setup
The key setup function runs in 576 cycles including function call. This is more than 19% faster than the results of Bosselaers.
Using 4 kilobytes of tables, key setup can be done much faster (cf. Svend Olaf Mikkelsen [8]). Yet it might turn out to be slower in real use, since if key setup is seldom used, (larger parts of) its tables will be evicted from L1 cache, and then larger tables have to be reloaded each time key setup is run. It will also more easily remove the encryption tables from cache, reducing actual speed even more. That being said, his approach is a much better one on newer processors with larger L1 data cache (e.g. Pentium MMX/II/III, Duron, Athlon).
Preliminary optimizations for newer processors have also been done, with good results. In our tests, DES encryption runs in approximately 240 cycles on the AMD Duron (Spitfire core), 319 cycles on Intel's Pentium III, and 445 cycles on their Pentium 4. These are all achieved with only minor modifications to our implementation, namely the addition of prefetch instructions, and a modified round function schedule. Only the Pentium 4 requires major reworking of the round function in order to run efficiently, mostly due to its slow shift and rotate instructions, and its implicit use of shift operations to access high 8-bit registers (ah/bh/ch/dh).
Even without modifications, our implementation runs fast on these newer processors: 295 cycles on the Duron, 327 on Pentium III, and 456 on Pentium 4. All these timings are for encrypting 4 kilobytes (512 blocks) within L1 cache in ECB mode.
For comparison, Eric Young reports DES CBC encryption at 4.35 · 10^7 bytes/second on a 1.6 GHz Athlon, and 7.771 · 10^6 bytes/second on a 333 MHz Celeron (Pentium II) [9]. This translates to approximately 294 and 343 cycles per block, respectively.
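The conversions and relative speedups quoted in this chapter follow from simple arithmetic, reproduced here as a Python sketch. The function names are illustrative; all input figures are the ones stated in the text.

```python
DES_BLOCK_BYTES = 8

def cycles_per_block(clock_hz, bytes_per_second):
    """Convert a throughput figure to clock cycles per 8-byte DES block."""
    return clock_hz * DES_BLOCK_BYTES / bytes_per_second

def percent_faster(old_cycles, new_cycles):
    """Speed gain of the new cycle count relative to the old one."""
    return (old_cycles / new_cycles - 1) * 100

athlon = cycles_per_block(1.6e9, 4.35e7)       # about 294 cycles
celeron = cycles_per_block(333e6, 7.771e6)     # about 343 cycles

encrypt_gain = percent_faster(340, 315)        # encryption vs. Bosselaers
key_setup_gain = percent_faster(686, 576)      # key setup vs. Bosselaers
```

Note that the percentages are speed gains (old time divided by new time), not reductions in cycle count; the corresponding cycle count reductions are somewhat smaller.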

Chapter 6
Discussion
The Pentium is the only processor for which Antoon Bosselaers provides performance and memory use figures for his highly optimized assembly implementation. Although the original Pentium processor itself is no longer a very interesting optimization target, our tailored implementation also turns out to perform very well on more modern processors. Unlike its newer cousins, the Pentium is also much easier to describe, and its performance is highly predictable.
The main part of our improvement comes from the construction of a round function which is able to execute in only 17 cycles on the Pentium. This depends on having enough (7) registers available, which we achieve by using the stack for storing key data, thereby using the stack pointer as our round key pointer. This in turn requires us to copy all round keys to the stack, incurring a startup cost slightly bigger than twice the cycle count gained per block.
Since we are already copying all the round keys as part of the startup of our encryption function, reversal of the keys for decryption is almost free. Hence we don't need to generate or store the reversed set of round keys for decryption, saving both on key setup time and memory.
We have also achieved perfect pairing of instructions: in every single cycle, the processor executes either an unpairable instruction or a pair. The unpairable instructions are all multi-bit rotates, and there are 5 of them in each of IP and FP. In total, 620 instructions execute in only 315 cycles, for an average of almost 1.97 instructions per cycle.
Just as the speed of encryption depends most on the speed of the round function, key setup speed depends on efficient implementation of LS_i and PC-2. LS_i is quite simple, hence most of our effort on the key setup focused on PC-2. This resulted in a very efficient bit reordering algorithm, and combined with efficient algorithms and scheduling in the rest of the key setup, we achieve a reduction of 110 cycles compared to Bosselaers.
As seen from the comparison with Eric Young's results, our implementation is very competitive on newer processors, and is also a good starting point for reoptimizing the algorithm for them.

