Encryption Algorithm (IDEA) is presented in this paper. The design was implemented in both bit-parallel and bit-serial architectures, and a comparison of design tradeoffs using various measures is presented. On a Xilinx Virtex XCV300-6 FPGA, the bit-parallel implementation delivers an encryption rate of 1166 Mb/sec at an 82 MHz system clock rate, whereas the bit-serial implementation offers a 600 Mb/sec throughput at 150 MHz. Both designs are suitable for real-time applications, such as online high-speed networks. The implementation is runtime reconfigurable such that key-scheduling is done by directly modifying the bitstream downloaded to the FPGA, hence enabling an implementation without the logic required for key-scheduling. Both implementations are scalable such that higher throughput is obtained with increased resource requirements. The estimated performances of the bit-parallel and bit-serial implementations on an XCV1000-6 device are 5.25 Gb/sec and 2.40 Gb/sec respectively.
Introduction
Cryptography is concerned with the transfer of information between parties so that only the intended parties can read the data. Even under the assumption that an adversary has full knowledge of the algorithms used and access to the media over which data is transmitted, the aim of cryptography is to make it intractable to retrieve the data without knowledge of a secret piece of information called a key. Cryptography is an ideal application for custom computing machines since they offer the following advantages over VLSI technologies:
- it is possible to use the same Field-Programmable Custom Computing Machine (FCCM) hardware for many different cryptographic protocols;
- Moore's law continues to offer improved silicon technology at exponential rates, which is available to FCCM designers without the costly manufacturing process required in VLSI;
- it is possible to specialize the hardware to an extent not possible in VLSI devices to improve performance;
- the reconfigurable nature makes it feasible to attempt designs employing more sophisticated algorithms, which leads to an improvement in performance.
The Data Encryption Standard (DES) algorithm has been a popular secret key encryption algorithm and is used in many commercial and financial applications. Although introduced in 1976, it has proved resistant to all forms of cryptanalysis. However, its key size is too small by current standards and its entire 56-bit key space can be searched in approximately 22 hours [1].
In 1990, Lai and Massey introduced an iterated block cipher known as the Proposed Encryption Standard (PES) [2]. The same authors, joined by Murphy, proposed a modification of PES called Improved PES (IPES) [3], which improves the security of the original algorithm against differential analysis and truncated differentials [4-6]. In 1992, IPES was commercialized and renamed the International Data Encryption Algorithm (IDEA). Some believe that, to date, the algorithm is the best and the most secure block algorithm available to the public [7].
Although IDEA involves only simple 16-bit operations, software implementations of this algorithm still cannot offer the encryption rate required for on-line encryption in high-speed networks. Ascom

In this paper, two Xilinx Virtex Field Programmable Gate Array (FPGA) based implementations of the IDEA algorithm are described. On an XCV300-6 device, the bit-parallel implementation offers a 1166 Mb/sec encryption rate, while the bit-serial implementation has a throughput of 600 Mb/sec. The implementation is scalable so that throughput and area tradeoffs can be addressed. Applications of these designs include Virtual Private Networks (VPNs) and embedded encryption/decryption devices. To illustrate various design tradeoffs, both designs were analyzed in terms of area, latency, throughput and other design measures.
Key-scheduling in both implementations is achieved by modifying the bitstream downloaded to the FPGA, in a manner similar to that described by Patterson in an implementation of DES [19]. Instead of doing this using the JBits Applications Programming Interface (API), a technique for the direct modification of the binary bitstream was used. The approach is advantageous because dedicated logic for key-scheduling is not required in the designs, hence leaving more logic resources for performing computation.
This paper is organized as follows. In Section 2 the IDEA algorithm as well as algorithms for multiplication modulo $2^n + 1$ are described. The bit-parallel and bit-serial implementations of IDEA are presented in Sections 3 and 4 respectively. In Section 5 the methodology to achieve runtime reconfigurability is described. In Section 6 results are given. Conclusions are drawn in Section 7.
2 The IDEA Algorithm
IDEA belongs to a class of cryptosystems called secret-key cryptosystems, which are characterized by the symmetry of the encryption and decryption processes and the possibility of deriving the decryption key from the encryption key and vice versa. IDEA takes 64-bit plaintext inputs and produces 64-bit ciphertext outputs using a 128-bit key.
The design philosophy behind IDEA is mixing operations from different algebraic groups including XOR, addition modulo $2^{16}$, and multiplication modulo the Fermat prime $2^{16} + 1$. All these operations work on 16-bit sub-blocks. The IDEA block cipher [7] (depicted in Figure 1) consists of a cascade of eight identical blocks known as rounds, followed by a half-round or output transformation. In each round, XOR, addition and modular multiplication operations are applied. IDEA is believed to possess strong cryptographic strength. The 52 16-bit subkeys $Z_i^{(r)}$, where i and r are the subkey number and round number respectively, are computed from the 128-bit secret key. Each round uses six subkeys and the remaining four subkeys are used in the output transformation. The decryption process is essentially the same as the encryption process except that the subkeys are derived using a different algorithm [7].
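The round structure can be sketched in software as follows. This is a minimal Python model that assumes the standard IDEA data path from the published cipher description, not the hardware described in this paper; the names `mul`, `add`, `idea_round` and `output_transform` are illustrative only.

```python
MOD_ADD = 1 << 16          # modulus for 16-bit addition
MOD_MUL = (1 << 16) + 1    # the Fermat prime 2^16 + 1


def mul(a, b):
    """Multiply modulo 2^16 + 1; the 16-bit word 0 stands for 2^16."""
    a = a or MOD_ADD
    b = b or MOD_ADD
    # The product is never 0 modulo a prime, so the result lies in 1..2^16;
    # the final reduction maps 2^16 back to the word 0.
    return (a * b) % MOD_MUL % MOD_ADD


def add(a, b):
    """Add modulo 2^16."""
    return (a + b) % MOD_ADD


def idea_round(x, z):
    """One IDEA round on four 16-bit words x with six subkeys z."""
    x1, x2, x3, x4 = x
    z1, z2, z3, z4, z5, z6 = z
    y1, y2, y3, y4 = mul(x1, z1), add(x2, z2), add(x3, z3), mul(x4, z4)
    t7 = mul(y1 ^ y3, z5)
    t9 = mul(add(y2 ^ y4, t7), z6)
    t10 = add(t7, t9)
    # The two middle words are swapped at the end of each round.
    return (y1 ^ t9, y3 ^ t9, y2 ^ t10, y4 ^ t10)


def output_transform(x, z):
    """Final half-round with four subkeys; it undoes the last swap."""
    x1, x2, x3, x4 = x
    z1, z2, z3, z4 = z
    return (mul(x1, z1), add(x3, z2), add(x2, z3), mul(x4, z4))
```

In this model the longest path through a round crosses three modular multiplications in series (by z1, z5 and z6), which is why the multiplier dominates the round latency in a hardware pipeline.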
The algorithm for computing the encryption subkeys (called the key-schedule) involves only logical rotations. Order the 52 subkeys as $Z_1^{(1)}, \ldots, Z_6^{(1)}, Z_1^{(2)}, \ldots, Z_6^{(2)}, \ldots, Z_1^{(9)}, \ldots, Z_4^{(9)}$. The procedure begins by partitioning the 128-bit secret key Z into eight 16-bit blocks and assigning them directly to the first eight subkeys. Z is then rotated left by 25 bits, partitioned into eight 16-bit blocks and again assigned to the next eight subkeys. The process continues until all 52 subkeys are assigned. The decryption subkeys $Z'^{(r)}_i$ can be computed from the encryption subkeys with reference to Table 1.

The modular multiplication algorithm of Section 2.1 requires a total of six additions and subtractions, one 16-bit multiplication and one comparison. However, in IDEA one of the operands of a modular multiplication operation is always a subkey, so the second subtraction can be eliminated if the associated subkeys are pre-decremented.

Modular multiplication is the bottleneck in the IDEA algorithm. In a single round of the algorithm there are four modular multiplications, so a well-designed multiplication modulo $2^{16} + 1$ operator is crucial since it directly affects the system performance both in terms of area and throughput.
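The rotate-and-partition key schedule described at the start of this passage can be sketched as a short Python function. This is a minimal model, assuming the key is held as a 128-bit integer whose most significant 16 bits form the first subkey; the function name `encryption_subkeys` is illustrative and not taken from the paper.

```python
def encryption_subkeys(key):
    """Return the 52 16-bit IDEA encryption subkeys for a 128-bit integer key."""
    if not 0 <= key < (1 << 128):
        raise ValueError("key must be a 128-bit unsigned integer")
    mask128 = (1 << 128) - 1
    subkeys = []
    while len(subkeys) < 52:
        # Partition the current 128-bit value into eight 16-bit blocks,
        # most significant block first (an assumption about word order).
        for i in range(8):
            subkeys.append((key >> (112 - 16 * i)) & 0xFFFF)
        # Rotate left by 25 bits before extracting the next eight subkeys.
        key = ((key << 25) | (key >> 103)) & mask128
    # Seven groups of eight give 56 blocks; only the first 52 are used.
    return subkeys[:52]
```

For example, with the key `0x00010002000300040005000600070008` the first eight subkeys are 1 through 8, and the next eight come from the same value rotated left by 25 bits.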
The modular multiplication algorithm described in Section 2.1 was used in our design, but instead of taking $x$ and $y$ as inputs, the operator takes $x$ and $y_d$ as inputs. As one of the operands is a subkey, which is regarded as a constant, the modification eliminates one subtraction operator by taking advantage of pre-decremented subkeys (Section 2.1, pseudocode line 6).
To obtain a well-designed multiplication modulo $2^{16} + 1$ operator, the throughput of the operator is maximized by introducing more pipeline stages. In our design, the 16-bit multiplier used in Section 2.1 (pseudocode line 7) is constructed by the Xilinx CORE Generator [22] and has a latency of 4 cycles. The complete multiplication modulo $2^{16} + 1$ operator pipeline has a latency of 7 cycles.
Bit-Parallel IDEA Core
The IDEA algorithm is a cascade of eight identical rounds of operations, followed by an output transformation. By instantiating building blocks, that is, additions, XORs and modular multiplications, and inserting appropriate stage latches for time-alignment, a module for one round of computation is formed. For the best area-efficiency, stage latches are constructed from Virtex SRL16E primitives [23, 24].
Due to limited hardware resources, each round of the algorithm shares the same physical resource, but with different key-schedules. The output transformation also reuses the resource. In our implementation the key-schedules are stored inside ROM primitives. The architecture of the bit-parallel IDEA core is shown in Figure 2. As mentioned earlier, for ECB mode operations, data dependencies of the IDEA algorithm have no feedback paths. This property enables the round architecture to take input values until the pipeline is filled, after which output values are redirected to the input of the pipeline. In an IDEA round, the data passes through three multiplication modulo $2^{16} + 1$ operators, each of which has a latency of 7 cycles. Thus the full round pipeline has a latency of 21 cycles. For an output transformation, the data must pass through a single multiplication modulo $2^{16} + 1$ operator with a pipeline latency of 7 cycles. Therefore the core has a total latency of $21 \times 8 + 7 = 175$ cycles. The core takes 21 64-bit plaintexts per $21 \times 9 = 189$ cycles, equivalently performing encryption at $(21/189) \times 64 \times f$ Mb/sec with a system clock rate of $f$ MHz. For instance, at an 82 MHz clock rate, the core delivers an encryption rate of 583 Mb/sec with a latency of 2.134 µs.
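The latency and throughput figures above follow from a few lines of arithmetic, reproduced here as a small Python sketch. The function and parameter names are illustrative; the default values are the cycle counts quoted in the text.

```python
def bit_parallel_performance(f_mhz, mul_latency=7, muls_in_series=3,
                             rounds=8, block_bits=64):
    """Return (latency in cycles, cycles per batch, Mb/sec, latency in µs)."""
    round_latency = mul_latency * muls_in_series          # 21 cycles
    core_latency = round_latency * rounds + mul_latency   # 21 * 8 + 7 = 175
    # The pipeline holds one batch of round_latency blocks, and each batch
    # occupies the shared round for eight rounds plus the output transformation.
    cycles_per_batch = round_latency * (rounds + 1)       # 21 * 9 = 189
    throughput_mbps = round_latency / cycles_per_batch * block_bits * f_mhz
    latency_us = core_latency / f_mhz
    return core_latency, cycles_per_batch, throughput_mbps, latency_us
```

Calling `bit_parallel_performance(82)` gives 175 cycles of latency, 189 cycles per batch, roughly 583 Mb/sec and roughly 2.134 µs for a single core.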
Bit-Serial Implementation
The bit-serial implementation described below is an improved version of the implementation in [15]. By register reordering and register duplication, the improved implementation offers an encryption throughput of 600 Mb/sec, 20% faster than the original implementation.
Bit-serial architectures are characterized by the property that operators perform their computations in a bitwise fashion and communications between operators are multiplexed in time over a single wire. Dataflow begins with either the least significant bit or the most significant bit, but the former is more commonly used due to its compatibility with two's complement arithmetic. In a typical bit-serial implementation, each variable is associated with a control signal which is set high only when the first bit is transferred along the associated data bus. To reduce area, control signals can be shared among the variables. Since bit-serial operators usually require the first bits of their operands to enter the operators on the same clock cycle, appropriate stage latches must be inserted for time-alignment [25].
Two of the primitive operators used in IDEA, namely XOR and addition modulo $2^{16}$, can be easily implemented bit-serially. These two operators have latencies of one clock cycle and are capable of taking consecutive bit-serial operands. The multiplication modulo $2^{16} + 1$ operator has a latency of 35 clock cycles. As in the parallel implementation, stage latches are implemented using SRL16E primitives. Constants are also implemented as SRL16E primitives, each with its output connected to its input to form a cyclic shift register.
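To illustrate why addition modulo $2^{16}$ maps naturally onto a bit-serial operator, the following Python sketch models a single full adder with a carry register processing operands least significant bit first. It is a behavioural model only, assuming the carry is cleared at the start of each word (the role the control signal would play in hardware); all names are illustrative.

```python
def to_bits_lsb_first(value, width=16):
    """Serialize an integer into a list of bits, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits_lsb_first(bits):
    """Rebuild an integer from an LSB-first list of bits."""
    return sum(bit << i for i, bit in enumerate(bits))


def bit_serial_add(a_bits, b_bits):
    """Add two LSB-first bit streams using one full adder and a carry register."""
    carry = 0  # cleared at the start of each word
    out = []
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)                 # sum bit
        carry = (a & b) | (carry & (a ^ b))       # carry into the next bit
    # The carry out of the last bit is discarded, giving addition modulo 2^width.
    return out
```

For example, adding the streams for 0xFFFF and 2 yields the stream for 1, since the carry out of bit 15 is dropped.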
Multiplication Modulo $2^{16} + 1$
The modular multiplication algorithm described in Section 2.1 was directly applied in the bit-serial implementation of the algorithm. The operator optimization used in the bit-parallel implementation, described in Section 3.1, was not applied in the bit-serial implementation because comparisons in bit-serial architectures are not efficient in terms of latency.
An $N \times N$-bit multiplier generates a $2N$-bit result and requires $2N$ cycles to complete. Thus, the throughput of bit-serial multipliers is restricted because the interval between consecutive multiplications must be at least $2N$ cycles. In the IDEA algorithm one of the operands of every modular multiplication is a subkey and is treated as a constant.
Recall that in the modular multiplication algorithm the intermediate result $t$ is divided into two portions (Section 2.1, pseudocode lines 7-9). The two portions, $t_h$ and $t_l$, are respectively the upper and lower 16 bits of the double word, and are operands to subsequent operations. A design that computes the upper and lower words of $t$ independently is desirable, allowing all the inputs, outputs and intermediate variables of the operator to be 16 bits long. Using this scheme and duplicating hardware, the throughput of a modular multiplication operation can be doubled.
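The role of the upper and lower words can be seen in the following Python sketch of multiplication modulo $2^{16} + 1$. It uses the widely known low-high formulation, which relies on $2^{16} \equiv -1 \pmod{2^{16} + 1}$; this is an assumption for illustration and may differ in detail from the pseudocode of Section 2.1. The names `mul_low_high`, `t_l` and `t_h` are illustrative.

```python
def mul_low_high(x, y):
    """Multiply modulo 2^16 + 1, where the 16-bit word 0 represents 2^16."""
    if x == 0:
        # 2^16 is congruent to -1, so the product is -y, i.e. 2^16 + 1 - y.
        return (1 - y) & 0xFFFF
    if y == 0:
        return (1 - x) & 0xFFFF
    t = x * y                 # 32-bit intermediate result
    t_l = t & 0xFFFF          # lower word
    t_h = t >> 16             # upper word
    # t = t_h * 2^16 + t_l is congruent to t_l - t_h modulo 2^16 + 1.
    r = t_l - t_h
    if t_l < t_h:             # the single comparison in the algorithm
        r += (1 << 16) + 1
    return r & 0xFFFF         # maps 2^16 back to the word 0
```

Only the two 16-bit halves of the product are needed to form the result, which is the property that motivates computing them as independent 16-bit streams.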
A modified version of Lyon's parallel-serial multiplier [26] was developed which addresses this problem. To generate two 16-bit results in 16 cycles, the throughput of the multiplier must be doubled. We achieved this by duplicating the hardware for multiplication, as illustrated in Figure 3. Registers storing the constant are shared between the two multiplication pipelines. The outputs p and q correspond to the results of two consecutive multiplications, where the two 32-bit variables have a time-difference of 16 cycles. The control signal, which is high one clock cycle before the least significant bit enters the module, toggles the control register. The vector of input variables $a_{n-1} \ldots a_1 a_0$ is consequently directed into the two multiplication pipelines alternately. While the vector is being directed to one pipeline, logic zero enters the other pipeline, carrying out zero-padding.

To obtain the time-aligned upper and lower words of $t$, a 16-stage shift register is required. The input and output of the shift register are the upper and lower words of $t$ respectively, 16 cycles after $t$ is valid. In the implementation the shift register is implemented as an SRL16E primitive [23]. The complete architecture for the modular multiplication operation is shown in Figure 4. Upon initialization, the subkey associated with the operator is passed into the operator bit-serially. The pre-decremented subkey is shifted into the registers of the multiplier, and at the same time stored in the SRL16E primitive responsible for key storage. By using multiple pipelines, the modular multiplication operator accepts a new operand every 16 cycles, even though a 32-bit intermediate result is computed. This scheme doubles the throughput, but since the b registers are shared, the hardware cost is less than double.
Bit-Serial IDEA Core
The core implementation of IDEA is obtained by cascading eight identical rounds of operations, shown in Figure 5, followed by an output transformation. The core takes one 64-bit plaintext every 16 cycles, yielding an effective encryption rate of $f \times 64/16$ Mb/sec at a system clock rate of $f$ MHz. At 150 MHz, for example, the performance of the core is 600 Mb/sec.
Each round has a latency of 109 cycles. The output transformation has a latency of 35 cycles. Each serial-to-parallel converter at the outputs has a latency of 16 cycles. Therefore, the IDEA core has an overall latency of $109 \times 8 + 35 + 16 = 923$ cycles. At a 150 MHz system clock rate, the equivalent latency is 6.153 µs.
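The bit-serial figures can be checked with the same kind of arithmetic. The sketch below is illustrative only; the parameter names are assumptions and the default values are the cycle counts quoted above.

```python
def bit_serial_performance(f_mhz, round_latency=109, rounds=8,
                           output_latency=35, s2p_latency=16,
                           block_bits=64, cycles_per_block=16):
    """Return (throughput in Mb/sec, latency in cycles, latency in µs)."""
    # One 64-bit block enters the fully pipelined core every 16 cycles.
    throughput_mbps = f_mhz * block_bits / cycles_per_block
    # Eight rounds, the output transformation and the output converter in series.
    latency_cycles = round_latency * rounds + output_latency + s2p_latency
    latency_us = latency_cycles / f_mhz
    return throughput_mbps, latency_cycles, latency_us
```

At 150 MHz this gives 600 Mb/sec, 923 cycles and about 6.153 µs, so the bit-serial core trades a longer latency for a fully pipelined design at a higher clock rate.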
Runtime Reconfigurability
The basic building block of the Virtex FPGA is the logic cell (LC). An LC includes a 4-input function generator, carry logic and a storage element. Each Virtex Configurable Logic Block (CLB) contains four LCs, organized in two slices. The 4-input function generators are implemented as 4-input LUTs. Each of them

Our approach to achieving runtime reconfigurability is to build all configurable blocks from LUTs. More specifically, the key-schedule is stored only inside ROM or SRL16E primitives, which are implemented as LUTs. After technology mapping, placement and routing, a circuit description (with a .ncd extension) is generated. Using the ncdread tool provided by Xilinx, the contents of the circuit description can be converted into a human-readable format. It is possible to extract the physical location of individual LUTs from the output of ncdread.
We have developed software to customize bitstreams for different key-schedules. In the first step (which need only be performed once for a given design), information concerning the physical location of the individual LUTs used in the key-schedule is extracted from the .ncd file and written to a location file (locfile). To modify a bitstream, the LUTs are directly modified by a program to use a given key-schedule. The pseudocode below describes the technique that was used. On an Intel Pentium III 866 MHz machine, the reconfiguration process requires the modification of xx LUTs and changing a key takes approximately yy seconds. Note that the bitstream reconfiguration program could be further optimized for speed; currently, the bottleneck is in recomputing the CRC.
In some applications, runtime reconfiguration may not be desirable, e.g. if the bitstream is placed in a ROM or in an Application-Specific Integrated Circuit (ASIC) implementation. For these cases, shift registers can be employed for the key-schedule. The shift registers are linked to form one large shift register while key-schedules are being fetched. This long shift register breaks down into the original shift registers after initialization. This method requires minimal logic and routing resources.
Results
Both the bit-parallel and bit-serial IDEA processors were verified with the Synopsys VHDL Simulator and synthesized using Synopsys FPGA Express 3.5 and Xilinx Foundation Series 3.3i, with a Xilinx Virtex XCV300-6 as the target device.
Our serial and parallel implementations of IDEA were successfully implemented on the Annapolis Micro Systems Wildcard Reconfigurable Computing Engine [29]. The device is a Type II PCMCIA Card with a 33 MHz 32-bit CardBus interface, consisting of a Xilinx Virtex XCV300-6 FPGA as Processing Element (PE) and two 64k × 32-bit SDRAMs. Furthermore, the parallel implementation was also tested using a Pilchard card [30], which uses a memory slot interface instead of a CardBus interface.
Performance of IDEA Core
For the bit-parallel implementation, a single core/round of the algorithm requires 1178 Virtex slices. An XCV300 device can accommodate two rounds of the algorithm, accounting for 2444 slices (including extra logic required for scaling), or 79.56% of the total 3072 slices.
For the bit-serial implementation, the fully-pipelined implementation (8 rounds plus output transformation), with parallel-to-serial converters at the inputs and serial-to-parallel converters at the outputs, requires 2878 Virtex slices, which occupies 93.68% of CLB resources.
It is observed that the building blocks offer faster computation in a standalone configuration, but performance degrades when they are used as components in the hierarchical design. Hence, core performance may be improved by floorplanning, such that inter-component routing is minimized. The performance of the cores (assuming a high-bandwidth interface to the data sources and sinks) is summarized in

In an attempt to explore tradeoffs between performance and area, the core was generated for FPGAs of different capacities. Since there are no data dependencies, the implementations are easily scaled by instantiating multiple cores. The designs were maximally scaled within the resource limitation of each device to produce the results summarized in Table 3.
Table 3. Tradeoffs between performance and area of the IDEA cores on different devices. Columns: bit-parallel and bit-serial implementations, each on XCV300, XCV600 and XCV1000 devices (speed grade -6).
Performance
On the Wildcard implementation, the time taken to complete a transaction between the FPGA and host is dominated by the setup time of the CardBus interface. When designing the interface between the IDEA core and the host, it is crucial that the number of discrete transactions is minimized and the amount of data transferred per transaction is maximized. Data from the host is written directly to the core using a burst mode transfer of 1024 64-bit plaintext blocks. After the latency period, the ciphertext is written to consecutive locations in the BlockRAM. For XCV300 devices, there are eight 256 × 32-bit BlockRAMs [31] on the chip and they are all used in the host/IDEA interface. The results are read by the host from the IDEA processor by a burst mode transfer of the contents of the BlockRAM. The decryption process is similar except that the ciphertext is written to the IDEA core and the plaintext appears in the BlockRAM.
The interface between host and IDEA core on Wildcard requires approximately an additional 160 slices, resulting in a total of 2606 slices (84.83%) and 3039 slices (98.93%) utilization of the XCV300 for the bit-parallel and bit-serial implementations respectively.
The burst transfer rate of CardBus is $33 \times 32 = 1056$ Mb/sec. However, due to large overheads in the CardBus transactions, both implementations achieve a measured performance of $0.61 \times 10^6$ encryptions per second (39 Mb/sec). The situation could be improved by using Direct Memory Access (DMA) channels. In addition, utilizing the two 64k × 32-bit SDRAMs on the Wildcard could provide a larger buffer for ciphertext storage, hence reducing the number of transactions.
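The gap between the bus burst rate and the measured rate can be quantified with a short calculation. The following Python sketch is illustrative only; the function names are assumptions and the figures are those quoted above.

```python
def cardbus_burst_mbps(clock_mhz=33, bus_width_bits=32):
    """Peak CardBus burst rate in Mb/sec."""
    return clock_mhz * bus_width_bits


def measured_mbps(encryptions_per_sec=0.61e6, block_bits=64):
    """Measured encryption rate in Mb/sec for 64-bit blocks."""
    return encryptions_per_sec * block_bits / 1e6


def bus_utilization():
    """Fraction of the peak burst rate achieved by the measured system."""
    return measured_mbps() / cardbus_burst_mbps()
```

With the quoted figures the measured rate is about 39 Mb/sec against a 1056 Mb/sec peak, i.e. under 4% of the burst rate, which is consistent with transaction overhead rather than the IDEA core being the limiting factor.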
Pilchard
In an attempt to improve the PC to FPGA data transfer rate, the bit-parallel implementation was ported to a Pilchard card [30], which utilizes a memory slot interface for improved performance over a CardBus interface. The Pilchard card uses the same XCV300-6 device as the Wildcard. The implementation uses only a single IDEA core/round and requires a total of 1319 slices (42.93% utilization). Pilchard offers a higher bandwidth between the PC and FPGA, and the implementation achieved a measured encryption performance of 146 Mb/sec on an Intel Pentium III 866 MHz desktop PC.
Conclusion
Two high-performance runtime reconfigurable implementations of the IDEA algorithm were presented in this paper. In both designs, the bitstream is customized for a particular key, which saves hardware resources. In implementations on the same XCV300-6 part, the bit-parallel version achieved an encryption rate of 1166 Mb/sec using an 82 MHz clock, whereas the bit-serial implementation achieved a 600 Mb/sec throughput at a clock rate of 150 MHz.
The bit-parallel implementation achieved a higher throughput with lower latency than the bit-serial implementation, while the bit-serial implementation permits a minimal area fully-parallel design.
