Exploiting Parallelism in Hardware Implementation of the DES by Broscious, Albert G & Smith, Jonathan M
University of Pennsylvania 
ScholarlyCommons 
Technical Reports (CIS) Department of Computer & Information Science 
February 1993 
Exploiting Parallelism in Hardware Implementation of the DES 
Albert G. Broscious 
University of Pennsylvania 
Jonathan M. Smith 
University of Pennsylvania, jms@cis.upenn.edu 
Follow this and additional works at: https://repository.upenn.edu/cis_reports 
Recommended Citation 
Albert G. Broscious and Jonathan M. Smith, "Exploiting Parallelism in Hardware Implementation of the 
DES", . February 1993. 
University of Pennsylvania Department of Computer and Information Science Technical Report NO. MS-CIS-93-22. 
This paper is posted at ScholarlyCommons. https://repository.upenn.edu/cis_reports/460 
For more information, please contact repository@pobox.upenn.edu. 
Exploiting Parallelism in Hardware Implementation of the DES 
Abstract 
The Data Encryption Standard algorithm has features which may be used to advantage in parallelizing an 
implementation. The kernel of the algorithm, a single round, may be decomposed into several parallel 
computations resulting in a structure with minimal delay. These rounds may also be computed in a 
pipelined parallel structure for operations modes which do not require cryptext feedback. Finally, system 
I/O may be performed in parallel with the encryption computation for further gain. Although several of 
these ideas have been discussed before separately, the composite presentation is novel. 
Comments 
University of Pennsylvania Department of Computer and Information Science Technical Report NO. MS-
CIS-93-22. 
This technical report is available at ScholarlyCommons: https://repository.upenn.edu/cis_reports/460 
Exploiting Parallelism 
In Hardware Implementation of the DES 
MS-CIS-93-22 
DISTRIBUTED SYSTEMS LAB 20 
Albert G. Broscius 
Jonathan M. Smith 
University of Pennsylvania 
Scllool of Engineering and Applied Science 
Computer and Information Science Department 
Philadelpllia, PA 19104-6389 
February 1993 
Exploiting Parallelism 
in Hardware Implementation of the DES 
Albert G. Broscius Jonathan M. Smith 
Distributed Systems Lab Distributed Systems Lab 
Dept. of CIS Dept. of CIS 
Univ. of Pennsylvania Univ. of Pennsylvania 
Phila PA, 19104-6389 USA Phila, PA, 19104-6389 USA 
broscius8cis.upenn.edu jms0cis.upenn.edu 
Abstract 
The Data Encryption Standard algorithm has features which may be used to  
advantage in parallelizing an implementation. The kernel of the algorithm, a single 
round, may be decomposed into several parallel computations resulting in a struc- 
ture with minimal delay. These rounds may also be computed in a pipelined parallel 
structure for operations modes which do not require cryptext feedback. Finally, sys- 
tem 1/0 may be performed in parallel with the encryption computation for further 
gain. Although several of these ideas have been discussed before separately, the 
composite presentation is novel. 
1 Introduction 
The Data Encryption Standard (DES) is probably the most widely used publicly avail- 
able secret-key algorithm. Since its introduction by the National Bureau of Standards 
(NBS) in 1977[FIPS46], DES implementations have improved greatly in encryption rate. 
Yet, typical computer communication rates have also increased significantly during the 
same period. Today's high-performance computer networks extend still further the en- 
cryption bandwidth needed for adequate performance of secure systems [Gigago]. Thus, 
we examine means to increase the throughput of a DES implementation to satisfy these 
demands. 
We discuss parallel approaches for several levels of an implementation. At the lowest 
level, the kernel of the algorithm can be split into several parallel computations for 
increased speed. By generating subkeys one cycle in advance, the time required can be 
effectively overlapped with the use of the subkey in the rest of the round operation. An 
additional overlap can be made of the two stages of exclusive-or (XOR) gates at the 
expense of increased complexity and gate-count. 
One level upward in the hierarchy, the use of multiple round implementations can 
lThis research was supportedYy NSF and DARPA through the Corporation for National Research 
Initiatives, and by Bellcore through Project DAWN. 
increase computation bandwidth if the DES mode of operation chosen does not require 
feedback of ciphertext. Of the official modes[FIPSSl], this requirement rules out all but 
the Electronic Code Book (ECB) method. Unfortunately, ECB is known to be susceptible 
to plaintext frequency-analysis based attacks since multiple identical input blocks result 
in the same output cryptext block. We discuss in section 3 of this paper a proposed 
operating method [Feldmeiergl] that resists this attack yet does not require feedback of 
ciphertext. 
Finally, at the system level, the processing of 1 / 0  concurrent with DES computation 
provides for continuous operation of the encryption unit. In addition to this buffering, 
the use of Direct Memory Access (DMA) for encryption allows the host processor to 
continue other work concurrently with the ongoing encryption. 
2 Algorithm Kernel 
Figure 1: Simple Implementation of Algorithm Kernel 
1 C 
L 
Figure 2: Timing Diagram for Circuit of Fig. 1 
E I 
Shift1 
The kernel of the DES algorithm consists of four operations: key generation, key mixing, 
substitution table lookup and data mixing between the R and L words. This kernel 
I 
XO%s KSR 
I 
S Boxes 
I 
X O h z  
I 
R 
1 
Shift2 XOR#l S Boxes XOR#2 Reg. Load 
is repeated for sixteen iterations with only one key generation parameter dependent on 
iteration number. This single parameter specifies one or two shifts of the circular registers 
from which the current key is derived. 
A straightforward implementation of the kernel is depicted in Fig.1. The box labeled 
KSR represents the circular shift registers which hold the key data. These are t o  be 
clocked either once or twice depending on the iteration number. Once the key has been 
shifted, the key-mixing box denoted X O h 8  outputs the modulo 2 sum of the key data 
with the extended R data after a propagation delay interval. The S Boxes then begin 
their access time delay interval before output of their results. The box marked XO&j2  
then begins mixing in data from the L register. After a propagation delay of an XOR 
gate, data are ready at the input to the R register. Once a register setup-time has passed, 
the R and L registers may be clocked once. A register propagation delay later, the cycle 
may begin once more. 
Timing analysis reveals that the critical timing path results from two shifts of the 
key registers, the keying XOR array, the S Box table lookup, the R-L mixing XOR array, 
plus the register loading delays. A simplified timing diagram is shown in Fig. 2. The 
critical path timing defines the limiting rate at which round computation may proceed. 
Assuming these delays are minimal, the only way to improve the critical path timing is 
to modify the circuit so that these sequential processes become concurrent. We will now 
examine several ways to achieve this concurrency. 
2.1 Key Parallelism 
Separating the key generation from the remaining three stages of the algorithm kernel 
can reduce the critical path timing. This approach saves delay by updating the key shift 
register in anticipation of the next iteration simultaneous with the remaining operations 
in the current iteration. An additional key latch is introduced to buffer the key value for 
the current iteration on the input to the key-mixing XOR stage as shown in Fig. 3. 
This key parallelism was suggested by Diffie and Hellman [Diffie77] in their timing 
approach for the proposed DES key-search device. They did not include the additional 
key latch but instead relied on strict control of key shift timing with respect to the overall 
R-L clock timing to prevent a race condition. Our introduction of the key latch allows 
greater tolerance in clock provision by ensuring that the key data input to the key-mixing 
XOR cannot change during the iteration cycle. 
Later, a different key-parallel approach incorporating a multiplexer (MUX) was used 
Figure 3: Pipelined Key Generation Algorithm Kernel 
1 + 
Figure 4: Timing Diagram for Circuit of Fig. 3 
L I E KSR 
I I 
xo%8 Key 
I 
S Boxes 
I 
XOR.32 
1 
R 
1 
Figure 5: Timing Diagram for Circuit of Fig. 4 
by Fairfield et a1 [Fair84]. Their MUX approach allowed either one or two shifts in 
either the encryption or decryption direction to be performed in one clocking operation. 
Additionally, a key loading operation could be selected by the multiplexer. This shortens 
the time required for the key generation somewhat since the MUX propagation delay 
is likely to be much lower than a full key shift cycle. More importantly, since the key 
shift register (KSR) no longer generates intermediate results, as it did when two shifts 
were required for a given iteration, the extra key latch introduced above to prevent 
race conditions is no longer necessary so the block diagram reverts to that of Fig.1. A 
simplified timing diagram for this arrangement is shown in Fig. 5. 
2.2 XOR Rearrangement 
Figure 6: Block Diagram of Datapath with Single XOR Delay in Critical Path 
Since the XOR summation is a bitwise linear operation, the order in which XOR opera- 
tions are performed does not alter their algebraic correctness. Thus, these operations may 
be grouped (or associated) in any order whatsoever without changing the final results of 
the combined operations. 
MUX Latch 
S Boxes S XOR Reg. Load 
Figure 7: Timing Diagram for Circuit of Fig. 6 
If we remove the labeling of R and L in a pair of consecutive rounds of the DES, we 
observe that there are two stages of XORs, where the second follows the first directly 
with only the E expansion separating the two. Since E may be convnuted with the XOR 
at  the cost of additional bits in the XOR array, we may combine the 48-bit XOR and the 
32-bit XOR into a single 48-bit XOR, which would remain in the critical timing path, 
and a 48-bit XOR which would be computed concurrent with the previous S Box table 
lookup operation. This transformation also requires the addition of a 32-bit XOR array 
since the R value is no longer produced in the critical timing path. 
3 Multi-Round Parallelism 
Using multiple stages in parallel limits the computation of feedback modes of encipher- 
ment. Since a parallel implementation begins processing subsequent blocks before com- 
pletion of a current block's encryption, modes that use the cryptext of a prior block 
cannot be computed at the full bandwidth that a feedforward mode can achieve. Three 
of the four modes defined by the NBS for use of the DES require feedback. 
A feedforward operation mode proposed by Feldmeier and McAuley appears to over- 
come the weaknesses of ECB. Their modified ECB mode of operation combines a sequence 
number with each plaintext block using the XOR operation. This approach should thwart 
frequency-analysis style attacks since multiple instances of a plaintext block are mapped 
to different cryptext elements. Using a 64 bit sequence number, cycling of this space 
would take place in 267 bytes of a data stream. This mode allows independent processing 
of data elements by avoiding the interdependence of subsequent encryption operations 
found in feedback modes. 
An intermediate alternative between feedback and feedforward modes is the use of 
multiple interleaved chains. The degree of interleaving can be chosen to allow for as 
much bandwidth gain through parallelism as needed. 
3.1 Pipeline 
L, R, and K Inputs from Previous Stage 
L, R, and K outputs to Subsequent Stage 
Figure 8: Pipeline Segment for DES with Key Transport 
The parallel computation elements may each be configured to implement a fraction of 
the rounds in a pipeline approach or each element may operate independently in a com- 
putation farm approach which we discuss in section 3.2. A pipeline of elements may be 
configured from two, four, eight or sixteen round in~plementations. Each element would 
operate on a 64-bit block for an equal number of cycles needed to partition the algo- 
rithm con~putation. Thus, a two round pipeline would execute eight cycles on each of 
the processors in the pipeline. Similarly, a four round pipeline would execute four cycles 
on each processor. 
When keying needs to  be updated frequently, the pipeline style allows a matching 
of the datapath flow with a parallel keypath. In this way, data blocks are accompanied 
by their key throughout the computation. Switching keys between successive datablocks 
without flushing the pipeline is made possible since the key and data streams are syn- 
chronized. Each stage of the pipeline has the structure depicted in Fig. 8. Note that the 
8:l key MUX actually selects between four different shifted versions of either the current 
key or the input key from the previous stage. 
For infrequent key changes, the tradeoff in keying interconnection may not be worth- 
while as compared to maintaining separate key registers for each stage of the pipeline. 
Each round would then maintain its own key load (shadow) register in this approach. 
The standard's key shifting sequence would be partially executed on each stage. Since 
a partial execution of the key schedule would not result in a complete cycling of the 
keytext, the key would be reloaded from the shadow register when it had completed its 
share of the computation on a data block. 
A limitation of the pipeline approach is the bandwidth ceiling imposed by the num- 
ber of rounds in the algorithm, sixteen. This means that a single pipeline of processors 
cannot provide more than sixteen times the encryption bandwidth of a single proces- 
sor. Additionally, the pipeline suffers in scalability since the number of stages possible is 
restricted to be a factor of sixteen. This deficit is most notable when considering an up- 
grade of a pipeline: to gain any increase in bandwidth requires a doubling in computation 
resources. 
3.2 Computation Farm 
Instead of a pipeline approach, multiple devices may be configured as a computation farm 
with each given subsequent plaintext elements to process. This configuration allows for 
smooth increase in available encryption bandwidth since the number of rounds used need 
not be a power of two as in the case of the pipeline parallelism. 
Managing the farm requires logic similar to that used in FIFO buffers. A counter to 
keep track of the next available processor and the last busy processor are required. A 
generalized depiction of the interconnection is shown in Fig.9, using an Input Manager 
and Output Manager to coordinate operations. 
Figure 9: Computation Farm Block Diagram 
I 
Input Manager 
I I I 
4 System-Level Parallelism 
To maintain constant throughput rates requires careful consideration of the encryption 
system's interface or input/output section. Overlap of the input, output and encryp- 
tion processes of subsequent text blocks provides high throughput [VerBB]. Similarly, 
DMA support decouples the host processor from the encryption function to allow CPU 
processing of other tasks to proceed in parallel with the encryption request[Anderson87]. 
D E S # N  
5 Conclusion 
I I I 
Output Manager 
1 
. . . 
D E S # l  
Parallel aspects of the DES may be exploited at three levels: within the algorithm kernel, 
through duplication of the algorithm kernel, and in the encryption processor I/O design. 
Consideration of operation mode also impacts the maximum performance attainable 
- nonstandard or hybrid operations modes should be studied further as a means of 
increasing bandwidth without compromising security. 
DES#2 
As part of our work with the Aurora network testbed [GigaSO], we have developed 
a DES board [Broscius91] using SSI TTL and MSI PALS using the MUX key register 
approach. Testing of the wirewrapped prototype indicated an encryption rate of 93 
Mbps. Further work on a DMA interface to the Microchannel interface bus of the 
IBM RS/6000 is planned. However, recent announcement by VLSI Technology of their 
VM007 encryption processor [VLSISl] with 192 Mbps performance obsoletes our discrete 
approach and will most likely be used in our final version. 
References 
[Anderson871 David P. Anderson and P. Venkat Rangan, High-Performance Interface Ar- 
chitectures for Cryptographic Hardware, EUROCRYPT '87 Proceedings, Springer- 
Verlag, Amsterdam, 1987 
[Broscius91] Albert G. Broscius, Hardware Analysis and Implementation of the NBS 
Data Encryption Standard, MSE Thesis, CIS Dept., Univ. of Penn., May 1991 
[Davio83] Marc Davio, Yvo Desmedt, Marc Fosseprez, Rene Govaerts, Jan Hulsbosch, 
Patrik Neutjens, Philippe Piret, Jean-Jacques Quisquater, Joos Vandewalle and 
Pascal Wouters, Analytical Characteristics of the DES Advances in Cryptology: 
Proceedings of CRYPTO 83, Plenum Press, New York, 1984 
[Denning82] Dorothy Denning, Cryptography and Data Security, Addison-Wesley (1982) 
[Diffie77] Whitfield Diffie and Martin E. Hellman, Exhaustive Cryptanalysis of the NBS 
Data Encryption Standard, IEEE Computer Vol. 10 No. 6, June 1977 pp. 74-84 
[Fair841 R.C.Fairfield, A. Matsuevich and J .  Plany, An LSI Digital Encryption Processor, 
Advances in Cryptology: Proceedings of CRYPTO 84, Springer-Verlag, New York, 
1985 
[FIPS46] National Bureau of Standards, Federal Information Processing Standard #46: 
The Data Encryption Standard 
[FIPS81] National Bureau of Standards, Federal Information Processing Standard #81: 
Operational Modes of the DES 
[Feldmeiergl] Anthony McAuley and David C. Feldmeier, Minimizing Protocol Ordering 
Constrainis to  Improve Performance , Submitted for publication, Available via 
anonymous ftp from Internet host thumper. bellcore.com 
[Gigago] Anonymous, Gigabit Network Testbeds , IEEE Computer, Vol. 23 No. 9 
[Ver88] Ingrid Verbauwhede, Frank Hoornaert, Joos Vandewalle and Hugo de Man, Se- 
curity and Performance Optimization of a New DES Data Encryption Chip , IEEE 
Journal of Solid State Circuits Vol. 23, No. 3, pp. 647-656, June 1988 
[VLSI91] VLSI Technology, Inc., VMOO7 Data Encryption Processor, Tempe, AZ 1991 
