Abstract. Several improvements for realizing implementations of the DES are discussed. It is proved that the initial permutation and the inverse initial permutation can be located at the input and the output, respectively, of each mode of the DES. A realistic design for an exhaustive key search machine is presented.
Introduction
In [14] the reader finds that hardware which is at the same time cheap and fast does not exist [1]. In this paper we mainly propose one efficient chip design for the DES. Nevertheless, depending on the needs of the user, different versions are possible. The proposed version is designed for general purpose use. In section 10, however, a version for exhaustive search of the keys is illustrated. It is important that all versions can use the same techniques.
A reader not familiar with the DES can find the NBS description of the DES in the literature [9].
a modular controller with microprogramming in order to ensure the flexibility of the design (section 8)
Note that, as a consequence of the coherence needed in the hardware solution, many simplifications proposed in previous papers (e.g. [3]) could not be used.
In the next sections the routing problem for the transport part of the chip is solved by using a serial-parallel structure. At the same time, we reduce the area needed for IP, IP^-1 and PC1 to about 1/20 of the area of a fully parallel realization.
It is important that none of these solutions slows down the datarate.
The routing problem for the part which carries out the 16 iterations (incl. the subkey generation) is solved by rearranging the different elements, which shortens the interconnections. This will be explained at length in section 7.
survey of the chip
We divide the chip into 2 important parts. The first, called the transport part, supports the data transport with the environment and also the internal transport between different memories (incl. the permutations IP, IP^-1 and PC1). The second, called the iteration hardware, calculates the 16 iterations of the DES algorithm without IP and IP^-1 (from now on we call these 16 iterations DES'), incl. the generation of the subkeys.
We will first explain the basic idea used to simplify the realization of the modes. Later on we will explain the other techniques used.
Figure caption: the upper part of the figure is the transport part; the lowest part shows the iteration hardware.
At first sight, the DES algorithm suffers from an enormous routing problem. Consequently the main goal of our design was to solve this problem by serialization and rearrangement without slowing down the datarate.
Equivalent representation for the modes: An introduction
In the FIPS publication [8], four modes of operation for the DES are defined. These modes specify different ways to encrypt and decrypt data. We show that you can reorder these modes so that IP and IP^-1 appear at another location in the algorithm. We'll first explain the general idea and the motivation of this transformation. Next we'll apply this idea to the 8 byte modes and the 1 byte modes.
Mostly the execution of permutations in hardware or software slows down the performance. The idea is to put IP and IP^-1 as close as possible to the input and the output of the mode, respectively. So you don't have to carry out IP and IP^-1 on the data in the feedback loop. To move the permutations, we'll use the following properties. A permutation followed by a selection can always be transformed into a selection' followed by a permutation' (figure 2). A similar remark is true for an injection followed by a permutation. The elementary transformations as explained at Crypto 83 will also be used [3].
8 BYTE MODES
In the case of the ECB mode it is trivial that IP is located at the input and IP^-1 at the output.
In the CFB mode we can write IP . IP^-1 at location A (figure 3) and propagate them over the exors. Using the elementary transformations of Crypto 83 ([3], page 182), we then obtain the desired result.
A similar result is easy to find for the CBC and OFB modes.
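The transformation for the 8 byte modes can be checked numerically. The following is a minimal sketch for the CBC mode, relying only on the fact that a bit permutation distributes over the exor. The function toy_core is a hypothetical stand-in for DES' (it is not the real 16 iterations), and the names permute, cbc_standard and cbc_moved are assumptions made for this illustration.

```python
# IP of the DES, built from its regular structure (rows use bits 2,4,6,8,1,3,5,7).
ROWS = [2, 4, 6, 8, 1, 3, 5, 7]
IP = [8 * (7 - j) + w for w in ROWS for j in range(8)]

# Inverse table: position (1-indexed) at which each input bit ends up after IP.
IP_INV = [0] * 64
for i, src in enumerate(IP):
    IP_INV[src - 1] = i + 1

MASK = (1 << 64) - 1


def permute(x, table):
    """Output bit i (1-indexed from the most significant bit) is input bit table[i-1]."""
    out = 0
    for src in table:
        out = (out << 1) | ((x >> (64 - src)) & 1)
    return out


def toy_core(block, key):
    """Hypothetical stand-in for DES' (any 64 bit function works for this check)."""
    return ((block ^ key) * 0x9E3779B97F4A7C15 + 0x1234567) & MASK


def des_like(block, key):
    """IP, then the core, then IP^-1: the structure of one DES encryption."""
    return permute(toy_core(permute(block, IP), key), IP_INV)


def cbc_standard(blocks, key, iv):
    """CBC with IP and IP^-1 inside the feedback loop."""
    out, prev = [], iv
    for p in blocks:
        prev = des_like(p ^ prev, key)
        out.append(prev)
    return out


def cbc_moved(blocks, key, iv):
    """CBC with IP at the input and IP^-1 at the output; the loop only uses the core."""
    out, prev = [], permute(iv, IP)
    for p in blocks:
        prev = toy_core(permute(p, IP) ^ prev, key)   # feedback stays in the permuted domain
        out.append(permute(prev, IP_INV))
    return out
```

Both functions give the same ciphertext blocks for any input, because IP(a exor b) = IP(a) exor IP(b). The same argument can be repeated for the CFB and OFB modes.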
1 BYTE MODES
In CFB mode, a new input for the DES is formed out of the previous input for the DES and the actual output ciphertext. The 56 most significant bits of the new input for the DES come from the old input shifted 8 times to the left. This can be represented (figure 4) as a selection of 56 bits out of 64 bits (S1) together with an injection of 56 bits into 64 bits (I1). The 8 least significant bits of the new input for the DES are the 8 bits of output ciphertext. This can be represented by an injection I2 of those 8 bits into 64 bits. To form the output ciphertext, the 8 most significant bits of the DES output are selected by selection S2 and exored with the plaintext. The content of these selections and injections is shown in tables 1 and 2. The selections and injections are represented in the same way as the permutations of the DES in the NBS norm [9]. Now by applying some transformations one is able to obtain the desired result.
Let us therefore put IP . IP^-1 at location B in figure 4. Using the property of figure 2 we obtain figure 5, where the effect of Qa(Sa(64 bits)) is the same as S2(IP^-1(64 bits)).
In figure 5 we put Qa^-1 . Qa at location C. By moving Qa, IP and IP^-1 (using Crypto 83) we obtain figure 6. Using the first property of figure 4, this figure is transformed into figure 7, where Qb(Sb(64 bits)) is the same as S1(IP(64 bits)), and similarly for Ic(Qc).
In the tables, S1 = [9 10 11 12 ...], and an 'x' means that nothing is injected at that place.
A similar result for the OFB mode can be obtained using similar techniques as for the CFB mode.
Note that this equivalent representation of the modes can also speed up the software, as explained in [14].
a fast serial/parallel realization for the permutations
The transport part of the chip will communicate with the environment using 8 bit buses. The main reason for this is the limitation on the number of connection pins of the chip. However, these buses also allow us to realize the permutations IP, IP^-1 and PC1 in an elegant way.
The idea is that by shifting data from a bus into a shiftregister, you carry out a permutation in a hidden way. If you put an 8→8 permutation between the bus and the shiftregister (see figure 9), you can realize a whole set of permutations. We found that IP, IP^-1 and PC1 can be realized using this set of permutations.
permuting with shiftregisters: an introduction
When you send a block of 64 bit over an 8 bit bus, you'll send it byte after byte. If we place 8 shiftregisters of 8 bit each at the end of the bus, we can enter these 8 bytes sequentially (8x8) (see figure 10). Normally, we'll call the first shiftregister the first byte of our memory, the second shiftregister the second byte, etc. (indicated with italic numbers).
What we see is that the numbers of the memory locations (1, 2, 3, ..., 64) and the numbers of the data bits (1, 2, 3, ..., 64) don't agree. In conclusion, we have carried out a permutation. This permutation is represented by the vector at the foot of figure 10 (for the interpretation of the vector notation, we refer to the FIPS publication of the data encryption standard [9]). From now on we'll call this permutation SR.
In the same way, we can show (see figure 11) that when we read out information from 8 shiftregisters onto a bus, we carry out another permutation. You can easily check that in that way we get the permutation SR^-1. Now that we know SR and SR^-1, we'll search for the 8→8 permutations we need to realize IP, IP^-1 and PC1.
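The hidden permutation can be made visible with a small simulation. This is a sketch under assumptions that are not taken from figure 10: bus line k feeds shiftregister k, and every new bit is appended behind the bits already stored. With another shift direction the vector changes, but the argument stays the same. The names shift_in and read_out are hypothetical.

```python
def shift_in(data_bytes):
    """Clock 8 bytes from an 8 bit bus into 8 shiftregisters of 8 bit each.

    Assumption: bus line k feeds shiftregister k, one byte per clockperiod.
    """
    registers = [[] for _ in range(8)]
    for byte in data_bytes:
        for k in range(8):
            registers[k].append(byte[k])
    return registers


def read_out(registers):
    """Read the shiftregisters back onto the bus, one bit of each register per clockperiod."""
    return [[registers[k][t] for k in range(8)] for t in range(8)]


# Number the data bits 1..64 in order of arrival, byte after byte.
data = [[8 * j + k + 1 for k in range(8)] for j in range(8)]
memory = shift_in(data)

# Vector notation: memory location i holds data bit SR[i-1].
SR = [bit for register in memory for bit in register]
```

Under these assumptions the first shiftregister holds the data bits 1, 9, 17, ..., 57, so the memory numbering and the data numbering indeed differ, and reading out restores the original byte sequence.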
realization of IP and IP^-1
If we want to have IP after we have shifted in, we have to apply a 64→64 permutation IP* before SR, satisfying the following equation:
SR(IP*(x)) = IP(x) for every 64 bit block x.
The result is shown in figure 12, so the realization of IP has become very simple. This realization is much smaller than a hardware connection path realizing the permutation. At the same time, the execution of IP doesn't consume extra time because it is carried out during the input from the environment. A similar method is used to realize IP^-1.
Note that it is possible to use the same shiftregisters to carry out IP and IP^-1. This can be done simultaneously by reading 1 byte out every time you read 1 byte in (see figure 12).
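The claim that IP needs only a fixed 8→8 wiring in front of the shiftregisters can be checked against the IP table of the standard. The sketch below is an assumption-laden illustration, not a copy of figure 12: the wiring BUS_WIRING is derived here from the regular structure of the IP table, and each new bit is assumed to enter at the front of its shiftregister.

```python
# IP as published in the standard (output bit i is input bit IP[i-1]).
IP = [
    58, 50, 42, 34, 26, 18, 10, 2,
    60, 52, 44, 36, 28, 20, 12, 4,
    62, 54, 46, 38, 30, 22, 14, 6,
    64, 56, 48, 40, 32, 24, 16, 8,
    57, 49, 41, 33, 25, 17, 9, 1,
    59, 51, 43, 35, 27, 19, 11, 3,
    61, 53, 45, 37, 29, 21, 13, 5,
    63, 55, 47, 39, 31, 23, 15, 7,
]

# Assumed 8->8 wiring: shiftregister r is fed by bus line BUS_WIRING[r].
BUS_WIRING = [2, 4, 6, 8, 1, 3, 5, 7]


def load_with_ip(data_bytes):
    """Shift 8 bytes into 8 shiftregisters through the fixed bus wiring."""
    registers = [[] for _ in range(8)]
    for byte in data_bytes:                      # one byte per clockperiod
        for r, line in enumerate(BUS_WIRING):
            registers[r].insert(0, byte[line - 1])   # the new bit enters at the front
    return [bit for register in registers for bit in register]


# Data bits numbered 1..64 in order of arrival.
data = [[8 * j + k + 1 for k in range(8)] for j in range(8)]
memory_content = load_with_ip(data)
```

After the eighth byte, memory_content is exactly the IP table, so the permutation costs no clockperiods beyond those needed to enter the data.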
realization of PC1
For PC1 we have a more complicated solution because of the irregularity in PC1 [3].
This can be shown by rewriting the notation of PC1 (table 9). If we could turn the second part of the permutation PC1, we would get the permutation SR (for 7 shiftregisters of 1 byte).
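The regularity hidden in PC1 can be verified with the PC1 table of the standard. The sketch below assumes, as an illustration only, that the 7 non-parity bus lines feed 7 shiftregisters directly and that each new bit enters at the front; the names load_key, c_half and d_half are hypothetical.

```python
# The two halves of PC1 as published in the standard.
PC1_C = [57, 49, 41, 33, 25, 17, 9, 1, 58, 50, 42, 34, 26, 18,
         10, 2, 59, 51, 43, 35, 27, 19, 11, 3, 60, 52, 44, 36]
PC1_D = [63, 55, 47, 39, 31, 23, 15, 7, 62, 54, 46, 38, 30, 22,
         14, 6, 61, 53, 45, 37, 29, 21, 13, 5, 28, 20, 12, 4]


def load_key(key_bytes):
    """Shift 8 key bytes into 7 shiftregisters; the eighth (parity) bit is not stored."""
    registers = [[] for _ in range(7)]
    for byte in key_bytes:
        for k in range(7):
            registers[k].insert(0, byte[k])   # assumed: the new bit enters at the front
    return registers


# Key bits numbered 1..64 in order of arrival.
key = [[8 * j + k + 1 for k in range(8)] for j in range(8)]
regs = load_key(key)

# First 28 bit register: first byte up to half of the fourth byte.
c_half = regs[0] + regs[1] + regs[2] + regs[3][:4]
# Second 28 bit register: from the last byte back to the other half of the fourth byte.
d_half = regs[6] + regs[5] + regs[4] + regs[3][4:]
```

Here c_half and d_half reproduce both halves of PC1, which shows that only the order of the registers in the second half has to be turned.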
This is exactly the way we will realize the permutation. First we realize SR(7x8) with 7 shiftregisters during the input of the key, in a similar way as for IP (see the solid path in figure 13). Then we rearrange these seven registers into two registers of 28 bit. The first register of 28 bit goes from the first byte up to half of the fourth byte. The second register goes in reversed order from the last byte up to half of the fourth byte (see the dashed path in figure 13). This reversed order of the second 28 bit register makes it possible to turn the second part of the key at the moment that those 2 registers of 28 bit are loaded into a following memory unit of 2 times 28 bit (see figure 13). This realization consumes a little more time, but this isn't a drawback because you don't change the key very often. A variant is possible which changes the 2 keys in e.g. 1 clockperiod, but this consumes a little more area. The fact that 2 memory units are used for 2 keys can be exploited for the multiple key mode. We'll take the first memory as the major-key register and the second as the active-key register. We then only have to add 2 feedback lines which carry the content of the active-key register back to the major-key register, so that no key is lost by the transport from the major-key register to the active-key register (see figure 13). This configuration can also be expanded with a third key register, which is very interesting for multiple encipherment of the active key [12].
a fast serial/parallel realization for memory and transport
We'll explain our internal interconnection system built out of shiftregisters and multiplexers. This system is small, very fast and flexible enough for the DES algorithm. We took another approach because we don't need those advantages. We chose to organize the memory in 4 shiftregister units of 8 byte (8x8) each. Instead of one bus, we made a connection path from every output to each of the four inputs (including its own input) (see figure 14). For our design this structure is faster, smaller and flexible enough.
It is faster for two reasons. First, it isn't necessary to specify a new address for every byte. Second, it can transport 4 bytes simultaneously. So the maximum capacity is 32 byte in 8 clockperiods. We'll use this structure to calculate the feedback modes. This is a time critical job because it isn't possible to pipeline it with something else (see section 6).
At first sight you may think you have discovered a new routing problem. This isn't so, because the length of the connections can be kept small by a good floorplan. In section 9.2 we describe a floorplan for an nMOS realization. Most interconnections between these registers aren't longer than 150 µm. It was even possible to place all the interconnections and multiplexers on 0.5 mm² (8 x 8 version).
It is difficult to explain why this structure is very small. This structure has a smaller flexibility because we always have to transport all 8 bytes in the same sequence from one unit to the other. Transporting less will break up every byte by partly shifting it out and partly shifting something else in. It is clear that this partial transport is a wrong way to transport entire bytes. But as we see in section 5.1, we can use this "wrong" way of transport to execute the 1 byte modes in a very simple way. Note that to execute the modes we also need to incorporate some exors in the structure. This can be done in many ways depending on the technology used.
One method well suited for MOS is adding 8 fixed exors between the workregister and the input-output register and also 8 fixed exors between the workregister and HR1 (= additional register 1). The outputs of those exors are connected to each multiplexer again. It can be shown that this configuration is sufficient to calculate the modes. The difference between the two is that the interconnection hardware is only half of the size for the 4 x 16 version. On the other hand, the 4 x 16 version is also 2 times slower for the transport of data. Therefore the 4 x 16 version is only useful for a slow and small version (5.5 Mbit and 14 mm²).
Equivalent representations for the modes
As explained in section 3.2, we can easily realize IP when we transport data from an 8 bit bus into a shiftregister. This consumes no extra time when it can be combined with the input and output from and to the environment. However, this method needs a lot of time when you have to use it for data already on chip. Therefore it is very useful to move the permutations IP and IP^-1 out of the feedback loop of the modes to the input and output of the chip. Figure 3 shows clearly that for the 8 byte modes the solution is found. For the 1 byte modes, however, a little more explanation is needed to show why the result in figure 8 is useful. The permutations Qa and Qa^-1 of figure 8 are the permutations IP* and IP^-1* we used to read in and out (section 3). The realization of the selections and injections is very simple with our internal transport structure (section 4). We saw there that you can only transport entire bytes if you transport all 8 bytes together by shifting the shiftregisters 8 times. However, if you shift those registers only once, you will eject the last bit of every byte (=Sa). The other seven bits of every byte (=Sb) will be shifted to the seven last places of every byte (=Ic), and a new bit will enter at the first place of every byte (=Id). This new bit is the ejected bit exored with the corresponding bit of the incoming byte of data.
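Why a single shift of every byte is enough can be checked numerically. The sketch below compares the 8 bit left shift of the 1 byte modes, followed by IP, with one shift inside each byte of the already permuted block. It only covers the shift and the injection of the new byte c; the exor with the ejected bits is left out. The function names are assumptions made for this illustration.

```python
# IP of the DES, built from its regular structure (rows use bits 2,4,6,8,1,3,5,7).
ROWS = [2, 4, 6, 8, 1, 3, 5, 7]
IP = [8 * (7 - j) + w for w in ROWS for j in range(8)]
MASK = (1 << 64) - 1


def permute(x, table):
    """Output bit i (1-indexed from the most significant bit) is input bit table[i-1]."""
    out = 0
    for src in table:
        out = (out << 1) | ((x >> (64 - src)) & 1)
    return out


def shift_plain(x, c):
    """1 byte mode outside the permuted domain: shift 8 places left, append byte c."""
    return ((x << 8) & MASK) | c


def shift_ip_domain(y, c):
    """Same step on y = IP(x): every byte is shifted once and receives one bit of c."""
    out = 0
    for r, w in enumerate(ROWS):
        byte = (y >> (8 * (7 - r))) & 0xFF      # byte r, counted from the most significant
        new_bit = (c >> (8 - w)) & 1            # bit w of the incoming byte c
        out = (out << 8) | (new_bit << 7) | (byte >> 1)   # last bit of the byte is ejected
    return out
```

For every 64 bit block x and every byte c, permute(shift_plain(x, c), IP) equals shift_ip_domain(permute(x, IP), c), which is the reason the "wrong" partial transport of section 4 fits the 1 byte modes.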
the permutation P
Because the permutation P will be realized in hardware with wires, it is obvious that the needed area can be diminished by a well chosen modified P [3]. How you get the optimal modified P is shown in figure 15.
Figure 15: optimization of P for a hardwired realization
optimization of the 16 iterations
Starting from the official representation of the DES algorithm, we see that each iteration starts with the evaluation of the non-linear function and ends with exchanging the 2 datablocks. If you reorder [13] the algorithm as in figure 16, it becomes clear that the exchange can be done at the same moment as the evaluation of the non-linear function. In this way, the time to execute DES' (= DES without IP and IP^-1) is almost equal to the time needed for 16 evaluations, without losing time by exchanging the 2 datablocks. As a consequence, the evaluation of the non-linear function is the time critical path of the algorithm. The only small drawback is that at the end an extra exchange of the right and left block is required.
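One possible reading of this reordering is sketched below; it is not taken from figure 16. The first function follows the official description (evaluate, then exchange), the second never moves the blocks during the iterations and only alternates which register is updated. The function toy_f is a hypothetical stand-in for the non-linear function of the DES.

```python
MASK32 = 0xFFFFFFFF


def toy_f(r, k):
    """Hypothetical stand-in for the non-linear function (not the real S-boxes)."""
    return ((r * 2654435761) ^ k ^ (r >> 7)) & MASK32


def iterations_with_exchange(left, right, subkeys):
    """Official form: every iteration evaluates f and then exchanges the 2 datablocks."""
    for k in subkeys:
        left, right = right, left ^ toy_f(right, k)
    return right, left            # the preoutput is R16 followed by L16


def iterations_in_place(a, b, subkeys):
    """Reordered form: no data movement per iteration, the roles of a and b alternate."""
    for i, k in enumerate(subkeys):
        if i % 2 == 0:
            a ^= toy_f(b, k)
        else:
            b ^= toy_f(a, k)
    return b, a                   # one extra exchange at the end


subkeys = [(0x0F0F0F0F * (i + 1)) & MASK32 for i in range(16)]
```

With an even number of iterations both functions return the same pair, so the only remaining cost of the reordering is the single exchange at the end.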
Pipelining
The aim of the pipelining is to prevent the datarate from being slowed down by the time used for data transport between the chip and the environment. This is very important because this communication is slow.
In our design we distinguish 3 sections able to work simultaneously: an input section, a DES section and an output section. While the input section is entering the next datablock, the DES' is working on the actual datablock and the output section is busy releasing the previous datablock. When these 3 sections have finished, there is a large amount of data which has to be exchanged between the sections. The operation doing this job is called the transfer. The transfer is executed with the structure described in section 4. This operation also calculates the modes.
To illustrate the functioning of the device, and especially to show how the modes are realized, figure 17 gives a representation of the activity as a function of time, e.g. for the modes ECB and CBC. Note that HR1 and HR2 (additional memory 1 and 2) are memories needed in the feedback loop of the modes.
Rearrangement of the functional elements
The aim of rearranging is not to avoid the interconnections but to shorten their length.
A large part of the routing is shortened by mixing the memory cells in the iteration hardware. If you number the cells from 1 to 64 and put them in the following order:
(1 33), (2 34), (3 35), (4 36), ... (29 61), (30 62), (31 63), (32 64), the connected cells come next to each other. So the connections become shorter than 100 µm. This structure can be built with 32 cells, each containing a shiftregister of 2 bit and 1 exor.
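The effect of this ordering is easy to check. The short sketch below generates the interleaved order and measures how far apart cell i and cell i + 32 end up; the helper name interleaved_order is an assumption, and distances are counted in cell positions, not in µm.

```python
def interleaved_order(n=64):
    """Place cell i of the first block next to cell i + n/2 of the second block."""
    half = n // 2
    return [cell for i in range(1, half + 1) for cell in (i, i + half)]


order = interleaved_order()
position = {cell: index for index, cell in enumerate(order)}

# Distance, in cell positions, between the two cells that have to be connected.
max_gap = max(abs(position[i] - position[i + 32]) for i in range(1, 33))

# For comparison: the same distance when the cells are placed in natural order 1..64.
natural_gap = 32
```

In the interleaved order every connected pair is adjacent (max_gap is 1), against 32 positions in the natural order.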
The second way to shorten lines relies on the fact that a memory cell is much larger than an interconnection. So we'll put the lines from the subkey, the lines from the memory and the lines from the S-boxes next to each other. In this way we have designed a floorplan for the iteration hardware that minimizes the length between the memory and the S-boxes, which is part of the time critical path (cfr. section 5.3).
Modular construction of the controller
Microprogramming is known as a good but slow way of controlling. The speed problem can be solved by using a lot of small units. Every unit is able to carry out one class of tasks. Above these task units there is one unit that coordinates the cooperation between them. The communication between this coordination unit and the task units happens with microcode. There is no communication between the task units, so that modularity and testability are assured. There are also 2 key memories of 56 bit (= 7 x 1 byte). The parity check is carried out on each key byte that is entered for the first time. With this configuration you can memorize 2 keys, e.g. a major and an active key. When you enter a new key, you'll overwrite the key which was in the input key register.
In the chip we have the following three transports:
1. an input bus
It is now a good moment to draw attention to the conformity of the used techniques. E.g. reading in the data needs shiftregisters to realize IP at the same time; with those shiftregisters a very fast transport is possible; we need that fast transport to allow the pipelining and the execution of the modes; to allow this way of transport, the workregister has to be a shiftregister; on the other hand, the iterations can be carried out very fast using cells of a shiftregister, etc. It is with this conformity that we could design a chip which is at the same time very fast and very small.
the floorplan
In figure 19 you find a floorplan of an nMOS design. The total area used is about 9 mm² and the transistor density is more than a thousand transistors per mm². In the floorplan you can distinguish 4 important parts.
1. The input-output bus
2. The controllers
3. The datapath doing the 16 iterations (+ the subkey generation)
4. The memories and multiplexers of the fast internal transport
The idea to cryptanalyze the DES by an exhaustive search was proposed by Diffie and Hellman [15]. An improvement is now presented, which mainly solves the problem of the complexity of the machine and its cooling.
A chip built for exhaustive search of the keys must have 2 properties. First, it has to be very fast, and second, it must work with a minimum of communication with the environment.
To make it fast we'll use the property that an exhaustive search of the key can be realized using only the ECB mode. Therefore we will divide the path calculating the non-linear function into e.g. 3 sections and pipeline those sections. There is however the problem that we need the result of this non-linear function to calculate the new input for the non-linear function. This will be solved by calculating the DES simultaneously for 3 different keys. To do that we need three workregisters and three keyregisters. In that way the speed can be improved by a factor of 3.
To minimize the communication with the environment, we'll generate the subsequent keys on chip and we'll also check the result on chip. To generate the subsequent keys, we enter a start value for the key once in a counter on chip and then increment this value by 1 each time. By giving each chip a good start value, the whole key space will be checked. To check the result, we enter only once the 2 datablocks for which we search the key. The first block will serve as input for the DES algorithm, and the result of the DES algorithm will be compared on chip with the second datablock. If the result is equal to the second block, an interrupt signal is given. In this way, only a microcomputer and a big power supply are needed to command e.g. 10,000 chips.
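The distribution of start values and the on-chip check can be sketched as follows. This is an illustration under assumptions: the key space is split into equal consecutive shares, toy_encrypt is a hypothetical stand-in for the DES, and the small 12 bit key space in the usage note only serves to keep the example fast.

```python
KEY_SPACE = 2 ** 56


def start_values(num_chips, key_space=KEY_SPACE):
    """Give every chip a start value and a stop value so that the whole key space is covered."""
    share = -(-key_space // num_chips)          # ceiling division
    return [(min(i * share, key_space), min((i + 1) * share, key_space))
            for i in range(num_chips)]


def chip_search(start, stop, plain, cipher, encrypt):
    """One chip: a counter loaded once with the start value, incremented by 1 each time."""
    key = start
    while key < stop:
        if encrypt(plain, key) == cipher:       # comparison with the second datablock
            return key                          # corresponds to the interrupt signal
        key += 1
    return None


def toy_encrypt(plain, key):
    """Hypothetical stand-in for the DES in ECB mode (16 bit toy function)."""
    return (plain * 31 + key * 7919) & 0xFFFF


# Usage: 4 chips share a toy key space of 2**12 keys; the secret key is 2748.
toy_ranges = start_values(4, 2 ** 12)
target = toy_encrypt(0x1234, 2748)
hits = [chip_search(a, b, 0x1234, target, toy_encrypt) for a, b in toy_ranges]
```

Only the chip whose share contains the key reports a hit; the other chips finish their share without an interrupt.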
It is important to know at which speed this device can work. To calculate three outputs you need 48 (= 16 iterations x 3 pipeline sections) + 2 (= delay in the pipeline) + 3 (= time for input and output) = 53 clock cycles.
At a clock frequency of 20 MHz you can check 1.13 · 10^6 keys in one second. Suppose that one device (incl. connection) costs 40 $ and that you spend 1,000,000 $; then with 2.5 · 10^4 devices you can calculate 2.8 · 10^10 keys in one second, or 1 · 10^14 keys in one hour. So you calculate 1.7 · 10^16 keys in only 1 week. In total there are 7.2 · 10^16 keys. On average you will find the key after you've tried 3.6 · 10^16 keys, and for this you need about two weeks. If a chosen plaintext attack is possible, the time needed to find the key is divided by 2 [11]. It may also be necessary to make allowance for more than one key satisfying cipherblock(64 bit) = DES(plaintext, key) [?].
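The figures above can be recomputed in a few lines. The sketch only uses the numbers given in the text (53 clock cycles for 3 keys, 20 MHz, 40 $ per device, a budget of 1,000,000 $); the variable names are chosen for this illustration.

```python
clock_hz = 20e6                        # clock frequency of 20 MHz
cycles_per_batch = 16 * 3 + 2 + 3      # 48 + pipeline delay + input/output = 53
keys_per_batch = 3                     # three keys are processed in the pipeline

keys_per_second_chip = clock_hz * keys_per_batch / cycles_per_batch

devices = 1_000_000 // 40              # budget divided by the price per device
keys_per_second = devices * keys_per_second_chip
keys_per_hour = keys_per_second * 3600
keys_per_week = keys_per_hour * 24 * 7

total_keys = 2 ** 56
weeks_on_average = (total_keys / 2) / keys_per_week

print(f"{keys_per_second_chip:.3g} keys/s per chip, {devices} devices")
print(f"{keys_per_second:.3g} keys/s, {keys_per_hour:.3g} keys/h, {keys_per_week:.3g} keys/week")
print(f"{total_keys:.3g} keys in total, about {weeks_on_average:.1f} weeks on average")
```

The computed values agree with the rounded figures in the text, including the estimate of about two weeks for an average search.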
The proposed hardware can be designed for CMOS such that no power problems would exist.
Obtained results
Without exaggerating, we may expect that a speed of up to 20 Mbit/sec. can be obtained. Higher speeds may be possible, but this would certainly cost a large amount of power and area.
It's easy to calculate the speed by hand. Because of the pipelining, the chip is sometimes doing many tasks simultaneously (see figure 17). The chip is so designed that the datarate ...
Table values: 5.5 Mbit, 10 MHz, 15.5 Mbit, 18 MHz, 20 Mbit, 4x4 mm²
Conclusions
In this paper we presented several improvements in order to realize faster and smaller chips.
We also proved that the initial permutation and the inverse initial permutation can always be located at the input and the output, respectively, of each mode.
To use DES in a strong way one has to change the active key frequently (e.g. every 10 seconds), and this active key must be multiply enciphered with different major keys before transmission.
