Abstract
The CryptoBooster is a modular and reconfigurable cryptographic coprocessor that takes full advantage of current high-performance reconfigurable circuits (FPGAs) and their partial reconfigurability. The CryptoBooster works as a coprocessor with a host system in order to accelerate cryptographic operations. A series of cryptographic modules for different encryption algorithms are planned. The first module we implemented is IDEACore, an encryption core for the International Data Encryption Algorithm (IDEA™).
Introduction
In this paper we describe a novel cryptographic coprocessor, the CryptoBooster, optimized for reconfigurable computing devices (e.g., Field Programmable Gate Arrays, or FPGAs). Our implementation is modular, scalable, and helps to resolve the trade-off between device size and data throughput. The implementation is designed to support a large number of different encryption algorithms and includes appropriate session management.
The first CryptoBooster implementation we propose implements IDEA™ (see footnote 1), a symmetric-key block cipher algorithm [7]. IDEACore is the first of a series of cryptographic modules for the CryptoBooster coprocessor. A simple reconfiguration of the reconfigurable computing device will suffice to replace IDEACore by another block cipher module (e.g., DES). The other modules of the CryptoBooster generally remain unchanged.
Footnote 1: IDEA is patented in Europe and the United States [11, 12]. The patent is held by Ascom Systec Ltd., http://www.ascom.ch/systec.
In section 2 we describe the CryptoBooster architecture. Section 3 gives a short introduction to the IDEA™ encryption algorithm and reviews existing hardware implementations of this algorithm. Section 4 describes the implementation of IDEACore, the first cryptographic module for the CryptoBooster. We conclude the paper in section 5 with the synthesis and performance results of our implementation.
CryptoBooster
CryptoBooster is a modular coprocessor dedicated to cryptography. It is designed to be implemented in Field Programmable Gate Arrays (FPGAs) and to take advantage of their partial reconfiguration features. The CryptoBooster works as a coprocessor with a host system in order to accelerate cryptographic operations. It is connected to a session memory responsible for storing session information (Figure 1). A session is characterized by a set of parameters describing the cryptographic packets, the algorithm used, the key(s), the initial vector(s) for block chaining, and other information.
Figure 1. The CryptoBooster works as a coprocessor together with a host system. (Diagram blocks: Host System, CryptoBooster, Session Memory.)
Typically, the host system is a PC. The CryptoBooster needs additional memory to store session information.
Our design is motivated by the following objectives: (1) The main goal is to have maximum data throughput to provide a design able to cope with ever-increasing network speed. This justifies a hardware implementation in place of a software solution. Physical security may, however, also be an argument for a hardware implementation. (2) Our requirements include the ability to easily configure different algorithm sub-blocks and the associated subkey generation. We clearly need a highly modular architecture that allows us to easily change building blocks. The modularity has been pushed far enough to allow partial reconfiguration of the coprocessor. Partial reconfiguration makes it possible to offer several algorithms on one and the same physical chip with limited resources.
Modular Architecture
A block diagram of the CryptoBooster architecture is shown in Figure 2. The InterfaceAdapter module is a technology-dependent interface to the host system. Typically, this is a PCI or a VME interface, but one may also imagine a networking interface like Ethernet. The HostInterface is the software interface to the host system. It offers read/write registers and interrupts to configure and control the coprocessor. The SessionMem module makes it possible to interface with different types and configurations of physical memories. A separate session memory has been chosen mainly to limit the communication between host and coprocessor and to switch rapidly between different sessions.
The CryptoCore module itself is subdivided into three parts: CypherCore: encryption algorithm, SessionAdapter: session parameter management (specific to each CypherCore), SessionControl: central controller for session management. The CypherCore and SessionAdapter modules are intelligent modules and can be queried by the SessionControl module. They respond with the implemented features available. It is therefore possible to exchange these modules without changing the control mechanism. All these modules communicate with each other and with the modules outside the CryptoCore using unidirectional point-to-point links called CoreLink. These links are designed to transmit control or data packets. This homogeneity at the interconnection level strongly enforces the modularity of the system.
Advantages of an FPGA-based Implementation and Reconfiguration Features
An FPGA circuit is an array of logic cells placed in an infrastructure of interconnections [14]. Each cell is a universal function or a functionally complete logic device, which can be programmed to realize a certain function. Interconnections between the cells are also programmable. The versatility of the logic blocks and the flexibility of the interconnections provide considerable design freedom when using FPGAs. The CryptoBooster is implemented using the VHDL hardware description language. The design can thus be synthesized without major problems for FPGAs as well as for VLSI technology. A VLSI solution generally results in higher performance than an FPGA implementation, but the latter has several important advantages: (1) The reconfigurability of the FPGA allows the developer to easily provide specific solutions to the customer, as is often needed in cryptography; (2) A VLSI multi-algorithm coprocessor requires all corresponding CypherCores to be implemented in the chip, which demands a huge number of transistors; (3) An FPGA, on the other hand, only contains one encryption algorithm at a given time. Other algorithms are available in the form of a configuration bitstream. Thus the maximum area required corresponds to the area used by the largest algorithm. One can distinguish between full and partial FPGA reconfiguration. Full reconfiguration is the common method currently used. The configuration is replaced by a new one each time the algorithm is changed (Figure 3). Partial reconfiguration makes it possible to reconfigure parts of the FPGA, i.e., only the part of the FPGA where the algorithm is implemented has to be reconfigured. This normally results in a much shorter interruption of service compared to full reconfiguration. The CryptoBooster is designed to take full advantage of partial reconfiguration.
[Figure 3: diagram blocks Configuration Memory and CryptoBooster]
IDEA™ is one of a number of conventional encryption algorithms that have been proposed in recent years to replace DES. However, there has been no rush to adopt it as a replacement for DES, partly because it is patented and must be licensed for commercial applications, and partly because people are still waiting to see how well the algorithm fares during the upcoming years of cryptanalysis.
IDEA™ is a 64-bit block cipher that uses a 128-bit key to encrypt data (DES also uses 64-bit blocks but only a 56-bit key). The same algorithm is used for encryption and decryption. It consists of 8 computationally identical rounds followed by an output transformation. Round $r$ uses six 16-bit subkeys $K_i^{(r)}$, $1 \le i \le 6$, to transform a 64-bit input $X = (X_1, X_2, X_3, X_4)$ into an output of four 16-bit blocks, which are input to the next round. The round 8 output enters the output transformation, which employs four additional subkeys $K_i^{(9)}$, $1 \le i \le 4$, to produce the final ciphertext $Y = (Y_1, Y_2, Y_3, Y_4)$. All 52 16-bit subkeys are derived from the 128-bit master key $K$. The key is long enough to withstand exhaustive key searches well into the future.
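The round structure described above can be made concrete with a small software reference model. The following Python sketch is not the IDEACore hardware; it follows the published IDEA specification, which supplies details not given in this section (the key schedule by 25-bit rotation of the master key and the exact wiring inside a round). All function names are chosen here for illustration only.

```python
MOD = 0x10001      # 2**16 + 1
MASK16 = 0xFFFF

def mul(a, b):
    """Multiplication modulo 2**16 + 1; the 16-bit word 0 represents 2**16."""
    a = a or 0x10000
    b = b or 0x10000
    return ((a * b) % MOD) & MASK16   # a result of 2**16 maps back to word 0

def add(a, b):
    """Addition modulo 2**16."""
    return (a + b) & MASK16

def expand_key(key):
    """Derive the 52 16-bit subkeys from a 128-bit master key (as an int)."""
    subkeys = []
    while len(subkeys) < 52:
        # Take eight 16-bit words, most significant first ...
        subkeys += [(key >> (112 - 16 * i)) & MASK16 for i in range(8)]
        # ... then rotate the 128-bit key left by 25 bits.
        key = ((key << 25) | (key >> 103)) & ((1 << 128) - 1)
    return subkeys[:52]

def round_fn(x, k):
    """One regular round: four 16-bit words in, four out, six subkeys."""
    x1, x2, x3, x4 = x
    a, b, c, d = mul(x1, k[0]), add(x2, k[1]), add(x3, k[2]), mul(x4, k[3])
    e = mul(a ^ c, k[4])
    f = mul(add(b ^ d, e), k[5])
    e = add(e, f)
    # The two middle words are swapped at the end of the round.
    return (a ^ f, c ^ f, b ^ e, d ^ e)

def encrypt_block(x, key):
    """Encrypt one 64-bit block given as four 16-bit words."""
    k = expand_key(key)
    for r in range(8):
        x = round_fn(x, k[6 * r:6 * r + 6])
    x1, x2, x3, x4 = x
    k9 = k[48:52]
    # Output transformation; it undoes the swap of the last round.
    return (mul(x1, k9[0]), add(x3, k9[1]), add(x2, k9[2]), mul(x4, k9[3]))
```

With the master key 0x00010002000300040005000600070008 and the plaintext words (0, 1, 2, 3), the first round of this sketch yields (0x00F0, 0x00F5, 0x010A, 0x0105), which can be checked by hand from the round equations.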
IDEA™ is easy to implement in both software and hardware, even in embedded systems. A typical software implementation of IDEA™ in C (using the Ascom Sys- [...]
Current hardware implementations have stressed the importance of the combinatorial delay and area consumption of the multiplication modulo $(2^{16} + 1)$ units, which are crucial to the entire system. These units are the limiting factor in obtaining high data throughput. Various methods of implementing such a multiplication are investigated in [3, 5, 10, 15, 16]. The VINCI implementation uses a modified Booth recoding multiplication and fast carry-select additions for the final modulo correction [17]. In general, for larger words, ROM-based solutions using lookup tables require large ROMs. In a recent paper, Zimmermann [16] presents an efficient VLSI implementation of the modulo [...]
The VINCI VLSI Implementation
A new VLSI implementation called VINCI [4, 17] was developed in 1993 at ETH Zurich. The chip consists of 250,000 transistors on a total area of 107.8 mm² and attained 177.8 Mbit/s at 25 MHz. The data path was optimized using an eight-stage pipeline and full-custom modulo $(2^{16} + 1)$ multipliers. The VINCI chip was the first chip that could be used for on-line encryption in high-speed networks like ATM or FDDI. Figure 5 shows the pipeline of the VINCI data path for one round. The multiplication modulo $(2^{16} + 1)$ operation is distributed over two pipeline stages to reduce the critical path.
Note that the output round is identical to the first section (3 stages) of a regular round as shown in Figure 5.
Ascom IDEACrypt Coprocessor
Ascom Systec Ltd., holder of the IDEA™ patents, offers a high-speed implementation of the IDEA™ algorithm as an embedded ASIC core. The IDEACrypt kernel implements data encryption and decryption in all common operating modes for block ciphers (ECB, CBC, CFB, OFB) [6]. IDEACrypt provides flexible key management for both master-session keys and asymmetric keys and stores the keys in a RAM which may be either on-chip or off-chip. Therefore, with a RAM, a bus interface, and a global controller, a complete IDEA™ cipher may be implemented. The entire IDEACrypt coprocessor is implemented in synthesizable VHDL code. Ascom Systec lists a complexity of approximately 35k gates. With 0.25 micron technology using a 3-stage pipeline, the chip provides a throughput of 300 Mbit/s (at 40 MHz) in ECB mode and a throughput of 100 Mbit/s in the other modes. At 100 MHz, the throughput goes up to 720 Mbit/s in ECB mode.
A Scalable IDEA™ Pipeline
We adopted a highly scalable solution for our IDEA™ pipeline: the length of the pipeline can be chosen at compilation time. The regular round is inspired by the VINCI datapath (see Figure 5). Figure 7 shows the regular round consisting of seven pipeline stages. The first three stages of a regular round simply form the output round (Figure 8).
As Figure 9 shows, the minimal pipeline length is one regular round followed by one output round. In this case, data has to be fed eight times through the regular round before passing through the output round. The longest pipeline (full-length pipeline) consists of eight regular rounds and one output round. In each configuration, a data block needs 59 clock cycles to pass through the pipeline (8 x 7 stages for the regular rounds plus 3 stages for the output round). A longer pipeline can accept a new block more often, i.e., it has a smaller latency between successive blocks, and thus a higher throughput.
We use a fully self-controlled pipeline with a control pipeline running in parallel to the data pipeline. The control pipeline addresses the key memories attached to each stage that needs an encryption key. Every 64-bit data block has an associated counter that indicates the current round. A block is automatically fed to the output stage once it has been sent the correct number of times through the regular rounds; otherwise it is fed back through the block of regular rounds. Pipeline bubbles (data marked by a non-valid bit) are automatically inserted into the pipeline if the module preceding the pipeline (block chaining) is not able to deliver new data packets. This mechanism allows us to avoid pipeline stalls.
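The self-controlled pipeline can be illustrated with a cycle-level software model. The sketch below is a simplification and not the VHDL design: it assumes 7 stages per regular round and 3 stages for the output round, as described above, gives a block returning from the regular rounds priority over new data at the pipeline entrance, and inserts a bubble whenever neither is available. The function name `simulate` and the `source_ready` callback (which stands in for the block-chaining module) are hypothetical.

```python
from collections import deque

REGULAR_STAGES = 7   # pipeline stages per regular round
OUTPUT_STAGES = 3    # pipeline stages of the output round
TOTAL_ROUNDS = 8     # regular rounds every block must complete

def simulate(regular_rounds, n_blocks, source_ready=lambda cycle: True):
    """Return {block_id: (enter_cycle, exit_cycle)} for n_blocks blocks."""
    if regular_rounds not in (1, 2, 4, 8):
        raise ValueError("pipeline length must be 1, 2, 4, or 8 regular rounds")
    regular = deque([None] * (REGULAR_STAGES * regular_rounds))
    output = deque([None] * OUTPUT_STAGES)
    entered, exited = {}, {}
    next_block, cycle = 0, 0
    while len(exited) < n_blocks:
        # A token leaving the output round has been fully encrypted.
        token = output.pop()
        if token is not None:
            exited[token[0]] = cycle
        # A token leaving the regular rounds gets its round counter updated.
        token = regular.pop()
        entering, to_output = None, None
        if token is not None:
            block_id, rounds_done = token
            rounds_done += regular_rounds
            if rounds_done == TOTAL_ROUNDS:
                to_output = (block_id, rounds_done)   # go to the output round
            else:
                entering = (block_id, rounds_done)    # feed back
        output.appendleft(to_output)
        # Entrance: feedback first, then new data; None models a bubble.
        if entering is None and next_block < n_blocks and source_ready(cycle):
            entering = (next_block, 0)
            entered[next_block] = cycle
            next_block += 1
        regular.appendleft(entering)
        cycle += 1
    return {b: (entered[b], exited[b]) for b in range(n_blocks)}
```

In this model every block spends 59 cycles in the pipeline regardless of the configuration, while the rate at which new blocks can enter grows with the number of regular rounds: a 1-round pipeline admits 7 blocks and then has to wait until the first of them has completed its eight passes.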
Multipliers and Modulo $(2^{16} + 1)$
We currently use simple bit-parallel multipliers optimized for FPGAs and the low-high algorithm [7] for the modulo $(2^{16} + 1)$ calculation. As stated in section 3, the combinatorial delay and area consumption of the multiplication modulo $(2^{16} + 1)$ units are crucial to the entire design and limit the data path.
Bit-parallel multipliers are perhaps not the best choice, but we were surprised by the performance they achieved and the area they used in our FPGA-based implementation. In the near future, we intend to optimize the multiplication modulo $(2^{16} + 1)$ units to achieve yet higher throughput.
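As an illustration of the low-high technique, the following Python sketch computes the product modulo $2^{16} + 1$ from the low and high 16-bit halves of an ordinary 32-bit product, using $2^{16} \equiv -1 \pmod{2^{16}+1}$. It is a software sketch of the arithmetic only, not of the FPGA multiplier; the function names are hypothetical, and `mul_reference` is included only to check the result against the definition.

```python
def mul_reference(a, b):
    """Definition: multiply modulo 2**16 + 1, with word 0 standing for 2**16."""
    a = a or 0x10000
    b = b or 0x10000
    return ((a * b) % 0x10001) & 0xFFFF

def mul_low_high(a, b):
    """Same result, computed from the low and high halves of a * b."""
    # Operand 0 stands for 2**16, which is congruent to -1: the product is -b (or -a).
    if a == 0:
        return (1 - b) & 0xFFFF
    if b == 0:
        return (1 - a) & 0xFFFF
    p = a * b                       # ordinary 32-bit product
    lo, hi = p & 0xFFFF, p >> 16
    r = lo - hi                     # p = hi * 2**16 + lo is congruent to lo - hi
    if r < 0:
        r += 0x10001                # modulo correction
    return r & 0xFFFF               # a result of 2**16 is stored as word 0
```

The modulo correction is a single conditional addition, which is why the multiplier itself, rather than the reduction, dominates the delay and area of such a unit.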
Block-Chaining
The block-chaining module implements the commonly used block-chaining algorithms like ECB (Electronic Codebook Mode), CBC (Cipher Block Chaining Mode), CFB (Cipher Feedback Mode), and OFB (Output Feedback Mode). To prevent the pipeline from stalling, the block-chaining module always has enough initial vectors at its disposal (the number depends on the number of regular rounds used in the pipeline).
As with all other modules in the CryptoBooster architecture, the block-chaining module is connected to the other modules by CoreLink unidirectional point-to-point interconnections. Data has to be fed 8, 4, 2, or 1 times through the regular rounds for pipelines of 1, 2, 4, or 8 regular rounds, respectively.
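One way to read the remark about initial vectors is that, in a feedback mode such as CBC, several independent chains are interleaved so that the pipeline always has a block to work on while earlier ciphertexts are still in flight. The Python sketch below illustrates this reading only; the paper does not spell out the mechanism. The cipher is a toy stand-in (`toy_encrypt`, `toy_decrypt`, and `TOY_KEY` are hypothetical and are not IDEA), and block `i` is assigned to chain `i mod n`, where `n` is the number of initial vectors.

```python
MASK64 = (1 << 64) - 1
TOY_KEY = 0x0123456789ABCDEF   # hypothetical key for the stand-in cipher

def toy_encrypt(block):
    """Invertible 64-bit stand-in for the block cipher (NOT IDEA)."""
    x = block ^ TOY_KEY
    return ((x << 13) | (x >> 51)) & MASK64   # rotate left by 13

def toy_decrypt(block):
    x = ((block >> 13) | (block << 51)) & MASK64   # rotate right by 13
    return x ^ TOY_KEY

def cbc_encrypt_interleaved(blocks, ivs, encrypt=toy_encrypt):
    """CBC over len(ivs) independent chains; block i belongs to chain i % n."""
    prev = list(ivs)
    out = []
    for i, plain in enumerate(blocks):
        j = i % len(prev)
        cipher = encrypt(plain ^ prev[j])   # chain with the previous ciphertext
        prev[j] = cipher
        out.append(cipher)
    return out

def cbc_decrypt_interleaved(blocks, ivs, decrypt=toy_decrypt):
    prev = list(ivs)
    out = []
    for i, cipher in enumerate(blocks):
        j = i % len(prev)
        out.append(decrypt(cipher) ^ prev[j])
        prev[j] = cipher
    return out
```

With a single initial vector this reduces to ordinary CBC, in which each block must wait for the ciphertext of its predecessor; with more chains, consecutive blocks are independent of each other and can occupy neighbouring pipeline slots.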
Performance of the IDEACore CryptoCore
The CryptoBooster is designed to achieve maximum throughput for a given area in the FPGA. Our current implementation allows pipeline lengths of 1, 2, 4, or 8 regular rounds, followed by one output round. A full-length pipeline consists of 59 stages (8 regular rounds + 1 output round) with a latency of 1 clock cycle between successive blocks when using bit-parallel multipliers (Figure 9). The peak performance of our current implementation is estimated at 200 Mbit/s for a 1-round pipeline (1 regular round + 1 output round), and it easily fits into a state-of-the-art FPGA. Performance for a full-length pipeline (8 regular rounds + 1 output round) is estimated at more than 1500 Mbit/s. However, the area needed in terms of reconfigurable logic blocks in the FPGA is quite large.
Session initialization and key calculation slightly decrease the overall performance over a complete session. Depending on the block-chaining mode used, performance may, however, decrease significantly.
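The relation between pipeline length and peak throughput can be written down as a short calculation. The sketch below assumes that a pipeline with R regular rounds completes R/8 blocks of 64 bits per clock cycle once it is full, and that each regular round has 7 stages and the output round 3. The clock frequency is not stated in this section; the 25 MHz used in the usage note is an assumption, chosen only because it reproduces the 200 Mbit/s estimate for the 1-round pipeline. The function names are hypothetical.

```python
BLOCK_BITS = 64

def pipeline_stages(regular_rounds):
    """Physical pipeline stages: 7 per regular round plus 3 for the output round."""
    return 7 * regular_rounds + 3

def peak_throughput_mbit(regular_rounds, clock_mhz):
    """Peak throughput in Mbit/s for a full pipeline at the given clock."""
    if regular_rounds not in (1, 2, 4, 8):
        raise ValueError("pipeline length must be 1, 2, 4, or 8 regular rounds")
    blocks_per_cycle = regular_rounds / 8   # each block needs 8 / R passes
    return BLOCK_BITS * blocks_per_cycle * clock_mhz
```

Under the assumed 25 MHz clock, `peak_throughput_mbit(1, 25)` gives 200 Mbit/s and `peak_throughput_mbit(8, 25)` gives 1600 Mbit/s, in line with the estimates quoted above; `pipeline_stages(8)` gives the 59 stages of the full-length pipeline.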
Conclusions
The CryptoBooster is a modular and reconfigurable cryptographic coprocessor taking full advantage of current high-performance reconfigurable circuits (FPGAs). Reconfigurable circuits can be reconfigured within a few milliseconds, and they provide speeds close to those of ASIC designs. Our main goal is to have maximum data throughput so as to provide a design able to cope with ever-increasing network speed. This justifies a hardware implementation in place of a software solution. Physical security may, however, also be an argument for a hardware implementation.
As our results show, the throughput of the CryptoBooster allows the coprocessor to be used in today's high-speed networks like ATM, SONET, and Gigabit Ethernet; moreover, it is competitive with full-custom circuits or DSP implementations. IDEA™ was chosen as a first CryptoCore module. More modules with different algorithms (e.g., DES) are planned.
