A VLSI implementation of the International Data Encryption Algorithm is presented. Security considerations led to novel system concepts in chip design, including protection of sensitive information and on-line failure detection capabilities. Built-in self-test (BIST) was instrumental in reconciling the conflicting requirements of VLSI testability and cryptographic security. The VLSI chip implements data encryption and decryption in a single hardware unit. All important standardized modes of operation of block ciphers, such as ECB, CBC, CFB, OFB, and MAC, are supported. In addition, new modes are proposed and implemented to fully exploit the algorithm's inherent parallelism. With a system clock frequency of 25 MHz, the device permits a data conversion rate of more than 177 Mbit/s. Therefore, the chip can be applied to on-line encryption in high-speed networking protocols like ATM or FDDI.
1 Introduction
Cryptography
Encryption is the conversion of information, the plaintext or message, into code, the ciphertext, which is intelligible only to an authorized receiver. Data encryption is applied to ensure the secrecy of a transferred message. Historically, cryptographic techniques have been developed for diplomatic or military applications; today they can be found everywhere in the private and public sectors where confidential information has to be processed, transferred, and stored. An important characteristic of modern cryptography is the use of publicly known, i.e. published, algorithms [1]. The secrecy is therefore not kept in the algorithm itself, but only in a small additional piece of information shared between sender and receiver, the key.
Secret-Key Block Encryption
Secret-key block encryption divides the plaintext into a series of blocks of equal length, typically 64 or 128 bits. These blocks are then processed sequentially using a key known to the sender and receiver exclusively. A well-known and widespread algorithm implementing secret-key block encryption is the Data Encryption Standard (DES), which has been adopted for inter-government use by the American National Bureau of Standards [2] and for the commercial sector by the American National Standards Institute (ANSI, 1982). However, serious concerns arise about long-term security because of DES's relatively short key length of a mere 56 bits and, more recently, from the cryptanalytic attack of Biham and Shamir [3].
The Data Encryption Algorithm IDEA
In this context, a new block encryption algorithm called Idea™ (International Data Encryption Algorithm), which overcomes the problems mentioned above, has been developed and published by Lai and Massey [4, 5]. In this cipher, the plaintext and the ciphertext are processed in blocks of 64 bits, while the key is 128 bits long. The cipher relies on combining operations from three algebraic groups. The three group operations on vectors of length 16 are:
1. Bitwise addition modulo 2 (XOR) of two 16-bit subblocks,
2. Addition of integers modulo $2^{16}$, and
3. Multiplication modulo the Fermat prime $p = 2^{16} + 1$, where the value '0' is never used, and $2^{16}$ is represented by the all-zero vector 0...00.
Figure 1 gives an overview of the encryption process. The cipher is constructed in such a way that the deciphering process is the same as the enciphering process, the only difference being that different key subblocks are used. This property is called similarity of encryption and decryption. As becomes clear from Fig. 1, eight identical rounds, followed by an output transformation, are cascaded to form the complete cipher. The 64-bit plaintext block is partitioned into four 16-bit subblocks, the i-th of which is denoted as $X_i$ in Fig. 1, and the four key subblocks used in the output transformation are denoted as $Z_1^{(9)}, \ldots, Z_4^{(9)}$. The fifty-two key subblocks used in the encryption or decryption process are generated from the 128 user-selected key bits according to a key schedule [5]. The transformation at the heart of a round, consisting of two multiplications mod $2^{16} + 1$ and two additions mod $2^{16}$ (structure on grey background in Fig. 1), provides the desired statistical properties of the cipher. This transformation is called the multiplication-addition (MA) structure.
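The three group operations and the MA structure can be written down in a few lines. The following is a minimal software sketch, not a model of the chip; the function names are chosen here for illustration, and the wiring of `ma_structure` follows the published cipher description (multiply, add, multiply, add), since Fig. 1 is not reproduced in this text.

```python
MOD_ADD = 1 << 16        # 2^16, modulus of the integer addition
MOD_MUL = (1 << 16) + 1  # Fermat prime 2^16 + 1, modulus of the multiplication


def xor16(a, b):
    """Bitwise addition modulo 2 of two 16-bit subblocks."""
    return (a ^ b) & 0xFFFF


def add16(a, b):
    """Addition of integers modulo 2^16."""
    return (a + b) % MOD_ADD


def mul16(a, b):
    """Multiplication modulo 2^16 + 1; the all-zero word stands for 2^16."""
    a = a or MOD_ADD
    b = b or MOD_ADD
    product = (a * b) % MOD_MUL  # lies in 1..2^16 because the modulus is prime
    return product & 0xFFFF      # 2^16 is written back as the all-zero word


def ma_structure(p, q, z5, z6):
    """MA structure: two multiplications and two additions on 16-bit words.

    p, q are the two data inputs; z5, z6 are the two key subblocks
    (hypothetical parameter names).
    """
    t = mul16(p, z5)
    u = mul16(add16(q, t), z6)
    return u, add16(t, u)
```

Because the all-zero word is read as $2^{16} \equiv -1$, `mul16(0, 0)` yields 1, which is why the multiplication forms a group on all 16-bit words.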
VLSI Implementation
A software implementation of Idea on a Sun SPARCstation 2 performs data encryption at a processing speed of 400 kbit/s. A first prototype VLSI circuit, developed and successfully tested two years ago [6], essentially served to speed up cryptographical tests and statistical analyses of the algorithm. Its throughput of 44 Mbit/s exceeded that of the software version by a factor of more than 100. Specifications for the present implementation aimed at a data rate in excess of 155 Mbit/s, thus demanding a more sophisticated design, which was realized in the new VLSI implementation called Vinci described in this paper. The stringent speed requirement essentially posed three problems, namely, to design efficient basic building blocks, to find a high-throughput datapath architecture, and to come up with a chip interface capable of handling the resulting off-chip data traffic.
Basic Building Blocks
Whereas the implementation of the XOR and addition operations was obvious, the structure of a modulo $(2^{16} + 1)$ multiplier turned out to be challenging. The realization of the required on-chip buffer memories was the subject of careful investigation as well.
Multiplication Modulo $2^{16} + 1$
Combinational delay and area consumption of the multiplication modulo $(2^{16} + 1)$ unit are crucial to the entire chip architecture and to the maximum attainable data throughput rate. Various methods of implementing such a dedicated multiplication unit were investigated and compared [7]. Table-lookup based solutions and an ordinary multiplication with subsequent modulo correction both resulted in overly high computation time and area. Therefore, a custom-layout realization was chosen using a computation scheme with stepwise modulo reduction. A modified Booth recoding multiplication and fast carry-select additions for the final modulo correction form the two stages of the multiplier pipeline structure. Additional test circuitry has been added to increase fault coverage. The small size of the resulting layout block allowed the placement of four multiplication units instead of only two, which, together with the massively reduced computational delay, increased the throughput of the encryption/decryption datapath by a factor of four compared to the scheme implemented in the prototype VLSI circuit [6]. Technical data of the multiplication unit are summarized in Table 1.
On-chip Buffer Memory
On-chip storage elements can be classified as delay buffers, input and output (I/O) buffers, key memory, and pipeline registers together with their compensation registers on parallel feed-forward branches of the data dependency graph.
To implement the different modes of operation, two eight-cycle 64-bit delay buffers were required, which were implemented as (8 × 64)-bit shift registers. Speed was not critical there, but area efficiency was very important. Different implementation variants using automatically generated layout were compared, but none of them met the stringent area requirements. A very area-efficient storage block, for example, was the (64 × 8)-bit SRAM, which in turn was not suited to our needs because of its small word size. Consequently, a custom-layout realization was chosen, which was also encouraged by the high regularity of the shift register structure. Input and output buffers are implemented in ping-pong mode, i.e., both buffers are divided into two parts. During encryption or decryption, each part alternately receives and releases eight 64-bit data blocks. The (8 × 64)-bit shift registers described above were reused for that purpose.
Finally, the key memory is implemented as a (256 × 16)-bit RAM to store all subkeys required for encryption and decryption. Two different keys can be stored in parallel, which allows the use of an additional session key. Recall that a total of 104 subkeys forms the complete set of encryption and decryption subkeys.
Datapath Architecture
2.2.1 Encryption and Decryption
Encryption and decryption are the speed-critical data processing operations that must be carried out in real time, as opposed to subkey computation, for instance. The main objective in designing the Vinci datapath architecture has therefore been to meet the 155 Mbit/s throughput requirement at reasonable cost. The fact that the cascaded rounds in Fig. 1 are identical, except for the key bits, suggested recycling data through a single datapath. Note that the output transformation in this context is computationally identical to the first section of a regular round. This single datapath was then chosen to be isomorphic to one round, which is to say that it includes a hardware unit for each computation and that all units operate in parallel. In order to further increase clock rate and throughput, an eight-stage pipelining scheme was incorporated. The final datapath structure is shown in Fig. 2.
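The quoted figures can be cross-checked with a short calculation. The sketch below rests on an assumption that is not spelled out in the text: each block makes nine passes through the eight-stage datapath (eight rounds plus the output transformation), and with the pipeline full, eight blocks are in flight at once. Under that reading, the 25 MHz clock quoted in the abstract gives the rate computed here.

```python
CLOCK_HZ = 25_000_000   # system clock frequency quoted in the abstract
BLOCK_BITS = 64         # block size of the cipher
PIPELINE_STAGES = 8     # eight-stage pipelining scheme
PASSES_PER_BLOCK = 9    # ASSUMPTION: 8 rounds + output transformation, one pass each


def throughput_bits_per_second(blocks_in_flight=PIPELINE_STAGES):
    """Estimated data rate when `blocks_in_flight` blocks share the pipeline."""
    # Every block occupies the datapath for PASSES_PER_BLOCK trips of
    # PIPELINE_STAGES clock cycles each, regardless of how full the pipeline is.
    cycles_per_batch = PASSES_PER_BLOCK * PIPELINE_STAGES
    bits_per_batch = blocks_in_flight * BLOCK_BITS
    return CLOCK_HZ * bits_per_batch / cycles_per_batch


if __name__ == "__main__":
    print(f"pipeline full : {throughput_bits_per_second() / 1e6:.1f} Mbit/s")
    print(f"single block  : {throughput_bits_per_second(1) / 1e6:.1f} Mbit/s")
```

With a full pipeline the estimate is about 177.8 Mbit/s, in line with the "more than 177 Mbit/s" stated for the device; with only one block in flight it drops to one eighth of that.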
Exploiting Pipelining: New Modes of Operation of a Block Cipher
The chip implements the five standardized modes of operation of block ciphers [9], which are ECB (Electronic Code Book), CBC (Cipher Block Chaining), CFB (Cipher Feedback), OFB (Output Feedback), and MAC (Message Authentication Code).
All operation modes which include feedback, i.e., CBC, CFB, OFB, and MAC, require first-order recursive computation. However, this type of feedback scheme cannot accommodate any latency caused by pipelining in its computational units without changing the input-to-output relationship. The highly nonlinear nature of the ciphering algorithm also precludes any unfolding of the recursive loop. In our eight-stage pipeline this dilemma can be resolved in two ways. First, the original data flow can be preserved by refraining from pipelining within the recursive computations, thus reducing throughput to one eighth of its potential. Second, accepting the decomposition of the data stream into eight separately encrypted chains makes it possible to run the pipeline at full speed.
Here the l-th block is combined with the (l + 8)-th (see Fig. 3), resulting in eight different and independent data streams, as depicted in Fig. 4.
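The decomposition into eight chains can be illustrated with an interleaved variant of CBC. This is only a sketch under stated assumptions: `toy_encrypt` and `toy_decrypt` are placeholder 64-bit permutations standing in for the cipher (they are not Idea and offer no security), and the use of one initialization vector per chain is an assumption made here, not something the text specifies.

```python
MASK64 = (1 << 64) - 1


def toy_encrypt(block, key):
    """Placeholder 64-bit permutation (NOT Idea): add the key, rotate left by 13."""
    s = (block + key) & MASK64
    return ((s << 13) | (s >> 51)) & MASK64


def toy_decrypt(block, key):
    """Inverse of toy_encrypt: rotate right by 13, subtract the key."""
    s = ((block >> 13) | (block << 51)) & MASK64
    return (s - key) & MASK64


def interleaved_cbc_encrypt(blocks, key, ivs, chains=8):
    """Block l is chained to block l - chains, giving `chains` independent streams."""
    previous = list(ivs)              # ASSUMPTION: one IV per chain
    ciphertext = []
    for l, plain in enumerate(blocks):
        chain = l % chains
        cipher = toy_encrypt(plain ^ previous[chain], key)
        previous[chain] = cipher      # feedback stays inside the chain
        ciphertext.append(cipher)
    return ciphertext


def interleaved_cbc_decrypt(ciphertext, key, ivs, chains=8):
    previous = list(ivs)
    plaintext = []
    for l, cipher in enumerate(ciphertext):
        chain = l % chains
        plaintext.append(toy_decrypt(cipher, key) ^ previous[chain])
        previous[chain] = cipher
    return plaintext
```

Since the feedback value for block l is only needed eight blocks later, a pipelined datapath with a latency of eight blocks can deliver it in time, which is the point of the scheme.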
Computing Subkeys
Key management is performed entirely on-chip. Only the 128-bit wide master key has to be loaded, whereupon all subkeys are generated internally. This generation process includes shifting of the master key and computing additive and multiplicative inverse elements. The method for computing a multiplicative inverse modulo $2^{16} + 1$ is based on Fermat's theorem and implemented using modular exponentiation. Modular exponentiation is carried out as a sequence of modulo-squaring and modulo-multiplication operations,

$$a^{-1} \equiv a^{2^{16} - 1} \pmod{2^{16} + 1}, \qquad (1)$$

for which the on-chip modulo-multipliers are used.
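A software sketch of the two ingredients of subkey generation follows. The 25-bit cyclic left shift in `expand_key` is taken from the published cipher description and is not stated in this text; the function names are hypothetical, and the code mirrors the arithmetic only, not the on-chip sequencing. The inverse uses Fermat's theorem, $a^{-1} \equiv a^{p-2} \pmod p$ with $p = 2^{16} + 1$, evaluated by square-and-multiply.

```python
MASK128 = (1 << 128) - 1
TWO_16 = 1 << 16
FERMAT_PRIME = TWO_16 + 1


def mul16(a, b):
    """Multiplication modulo 2^16 + 1; the all-zero word stands for 2^16."""
    a = a or TWO_16
    b = b or TWO_16
    return ((a * b) % FERMAT_PRIME) & 0xFFFF


def expand_key(master_key):
    """Derive the 52 encryption subkeys from a 128-bit master key (as an int)."""
    subkeys = []
    key = master_key & MASK128
    while len(subkeys) < 52:
        # Take the key as eight 16-bit words, most significant word first.
        for i in range(8):
            if len(subkeys) < 52:
                subkeys.append((key >> (112 - 16 * i)) & 0xFFFF)
        # Cyclic left shift by 25 bits (per the published key schedule).
        key = ((key << 25) | (key >> 103)) & MASK128
    return subkeys


def add_inverse(a):
    """Additive inverse modulo 2^16."""
    return (-a) & 0xFFFF


def mul_inverse(a):
    """Multiplicative inverse modulo 2^16 + 1 as a^(2^16 - 1), by square-and-multiply."""
    result, base, exponent = 1, a, TWO_16 - 1
    while exponent:
        if exponent & 1:
            result = mul16(result, base)   # modulo-multiplication step
        base = mul16(base, base)           # modulo-squaring step
        exponent >>= 1
    return result
```

Only `mul16` is needed for the inverse, which matches the remark that the existing modulo-multipliers can be reused for this task.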
Interfacing the VLSI Circuit
The overall architecture of Vinci is depicted in Fig. 5. With consideration for high-speed applications, the data interfaces are designed as unidirectional input and output ports. The two 16-bit wide ports of Vinci are designed such that data can be transferred continuously and independently of the internal chip clock. All parameters required to set up Vinci for a specific conversion can be loaded and the actual status watched via an 8-bit wide bidirectional control port. Seven control registers are implemented for that purpose. The master key is also loaded via this port.
3 Verification and Testing
Off-line Built-in Self-test
Security standards for cryptographic equipment (see [10]) require physical protection of secret keys and unencrypted or partially encrypted data. On the other hand, sufficient testability of complex modern electronic devices is a must that is usually achieved by easing physical access to internal signals (observability and controllability). Such contradictory requirements can be reconciled by introducing a built-in self-test (BIST). BIST includes stimulus generation and response analysis on-chip and thus makes access to potentially sensitive internal nodes unnecessary. It has been shown in [11] that ciphers like Idea, which are based on different algebraic group operations, are well testable using pseudo-random patterns. Therefore, a VLSI implementation of such a cipher is also well suited to pseudo-random BIST. The implemented off-line self-test scheme starts with a test of the subkey RAM using checkerboard and retention tests. A pseudo-random pattern generator based on a linear feedback shift register provides data for the subsequent subkey generation process and a number of encryptions and decryptions in different modes of operation. A 21-bit signature is generated from the results and compared to a precomputed and hard-wired value. This off-line BIST scheme covers nearly the complete on-chip circuitry. It achieves a gate-level fault coverage of 93.6 % using the stuck-at fault model, and it results in an area overhead of only 2.0 %. Note that a comparable device [12] achieves a fault coverage of 94.7 % with an area overhead of 9.4 %. The entire self-testing scheme is described in detail in [13].
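The principle of pseudo-random BIST with signature comparison can be sketched in software. Everything specific below is an assumption for illustration: the 16-bit generator polynomial $x^{16} + x^{14} + x^{13} + x^{11} + 1$, the signature feedback polynomial $x^{21} + x^{2} + 1$, and the `circuit` callable standing in for the logic under test. Only the use of an LFSR and the 21-bit signature width come from the text.

```python
SIG_BITS = 21
SIG_MASK = (1 << SIG_BITS) - 1
SIG_POLY = 0b101  # low-order terms of x^21 + x^2 + 1 (illustrative choice)


def lfsr16(seed):
    """16-bit Fibonacci LFSR for x^16 + x^14 + x^13 + x^11 + 1 (assumed polynomial)."""
    state = seed & 0xFFFF
    if state == 0:
        raise ValueError("the all-zero state never changes")
    while True:
        yield state
        bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        state = (state >> 1) | (bit << 15)


def compact(signature, word):
    """Fold one response word into the 21-bit signature register."""
    carry = (signature >> (SIG_BITS - 1)) & 1
    signature = (signature << 1) & SIG_MASK
    if carry:
        signature ^= SIG_POLY         # feedback of the bit shifted out
    return signature ^ (word & SIG_MASK)


def signature_of(circuit, n_patterns, seed=0xACE1):
    """Apply pseudo-random patterns to `circuit` and compact its responses."""
    patterns = lfsr16(seed)
    signature = 0
    for _ in range(n_patterns):
        signature = compact(signature, circuit(next(patterns)))
    return signature


def self_test(circuit, golden, n_patterns, seed=0xACE1):
    """Pass/fail decision: compare against the precomputed signature."""
    return signature_of(circuit, n_patterns, seed) == golden
```

In hardware the golden value would be hard-wired; here it would be obtained once from a fault-free model via `signature_of` and then passed to `self_test`. Only the pass/fail result needs to leave the device, so no internal node has to be exposed.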
Concurrent Self-test
The area requirements for off-line and concurrent self-test are illustrated in 
Boundary Scan
A boundary-scan scheme was implemented for system test requirements. However, sensitive I/O signals are excluded from boundary scan for security reasons.
Technical Data
The technical data of the Vinci chip are shown in Table 4. The chip was fabricated in CMN12, a 1.2 µm double-metal n-well CMOS process by VLSI Technology, Inc. Fig. 6 illustrates the floorplan and the corresponding photomicrograph of the chip. Tools by COMPASS Design Automation were used throughout for design, simulation, verification, and logic synthesis. Total design effort for the VLSI circuit was approximately 18 man-months.
Conclusions
A high-speed VLSI block encryption circuit based on the new cipher Idea has been presented. Vinci, Idea's first silicon realization, integrates high-speed encryption and decryption, comprehensive key management functions, and all standard cipher modes of operation in their ordinary and high-speed adapted versions. A high data throughput of 177 Mbit/s has been obtained from pipelining and from incorporating four full-custom modulo $(2^{16} + 1)$ multipliers. Two unidirectional high-speed 16-bit data ports allow for continuous operation of the datapath. All controllers, pipeline stage registers, and the remaining arithmetic units were realized with standard cells.
The resulting VLSI circuit achieves data rates significantly higher than the fastest silicon implementation of DES known to the authors. The only known faster single-chip implementation of a block cipher uses GaAs technology [14]. Vinci is therefore the first silicon block encryption device that can be applied to on-line encryption in high-speed networking protocols like ATM (Asynchronous Transfer Mode) or FDDI (Fiber Distributed Data Interface).
