Abstract. This paper describes high performance single-chip FPGA implementations of the new Advanced Encryption Standard (AES) algorithm, Rijndael. The designs are implemented on the Virtex-E FPGA family of devices. FPGAs have proven to be very effective in implementing encryption algorithms. They provide more flexibility than ASIC implementations and produce higher data-rates than equivalent software implementations. A novel, generic, parameterisable Rijndael encryptor core capable of supporting varying key sizes is presented. The 192-bit key and 256-bit key designs run at data rates of 5.8 Gbits/sec and 5.1 Gbits/sec respectively. The 128-bit key encryptor core has a throughput of 7 Gbits/sec which is 3.5 times faster than similar existing hardware designs and 21 times faster than known software implementations, making it the fastest single-chip FPGA Rijndael encryptor core reported to date. A fully pipelined single-chip 128-bit key Rijndael encryptor/decryptor core is also presented. This design runs at a data rate of 3.2 Gbits/sec on a Xilinx Virtex-E XCV3200E-8-CG1156 FPGA device. There are no known singlechip FPGA implementations of an encryptor/decryptor Rijndael design.
Introduction
In September 1997 the National Institute of Standards and Technology (NIST) issued a request for possible candidates for a new Advanced Encryption Standard (AES) to replace the Data Encryption Standard (DES). In August 1998, 15 candidate algorithms were selected and a year later, in August 1999 five finalists were announced: MARS, RC6, Rijndael, Serpent and Twofish. On 02 October 2000, the Rijndael algorithm [1] , developed by Joan Daemen and Vincent Rijmen was selected as the winner of the AES development race. In performance comparison studies carried out on all five finalists [2, 3, 4, 7] , Rijndael proved to be one of the fastest and most efficient algorithms. It is also easily implemented on a wide range of platforms and is extendable to other key and block lengths.
In this paper two fully pipelined Rijndael algorithm designs are presented. The designs are implemented using Xilinx Foundation Series 3.1i software on the Virtex-E FPGA family of devices [5] . A fully pipelined Rijndael design requires considerable memory, hence, its implementation is ideally suited to the Virtex-E and Virtex-E Extended Memory range of FPGAs, which contain devices with up to 280 RAM Blocks (BRAMs). The first design presented is an encryption-only design capable of supporting 128-bit, 192-bit and 256-bit keys. The 128-bit key design is implemented on the XCV812E-8-BG560 device. The 192-bit and 256-bits are implemented on XCV3200E-8-CG1156 devices, as they are too large to place on the XCV812E device. The authors are not aware of any other Rijndael hardware design capable of supporting varying key sizes. However, software designs do exist. The fastest known software implementations of Rijndael are by Brian Gladman [6] . On a 933 MHz Pentium III processor, his 128-bit key design achieves a throughput of 325 Mbits/sec, the 192-bit key design runs at 275 Mbits/sec and the 256-bit key design at 236 Mbit/sec. Hardware implementations of a 128-bit key design do exist. An implementation on a Virtex XCV1000 FPGA device by Gaj and Chodowiec [4] achieved a data-rate of 331.5 Mbits/sec. Dandalis, Prasanna and Rolim [2] carried out an implementation on a Xilinx Virtex device and achieved an encryption rate of 353 Mbits/sec. A partially unrolled design by Elbirt, Yip, Chetwynd and Paar [7] on the Virtex XCV1000-BG560 FPGA performed at a data-rate of 1937.9 Mbits. The second design [8] presented is capable of both encryption and decryption operations and is implemented on an XCV3200E-8-CG1156 device. There are no known single-chip FPGA implementations of the Rijndael algorithm, which perform both encryption and decryption. However, Ichikawa, Kasuya and Matsui's [3] implementation on a CMOS ASIC achieves 1950 Mbits/sec. The encryptor/decryptor implementation by Weeks, Bean, Rozylowicz and Ficke [9] performs at a rate of 5163 Mbits/sec on a CMOS ASIC.
Section 2 of the paper provides a description of the Rijndael Algorithm. Section 3 outlines the design of the fully pipelined Rijndael implementations. Performance results are given in section 4. Finally, concluding remarks are made in section 5.
Rijndael Algorithm
The Rijndael algorithm is an iterated block cipher. The block and key lengths can be 128, 192 or 256 bits. The NIST requested that the AES must implement a symmetric block cipher with a block size of 128 bits, hence the variations of Rijndael which can operate on larger block sizes will not be included in the actual standard. Rijndael also has a variable number of iterations or 'rounds': 10, 12 and 14 when the key lengths are 128, 192 and 256 respectively. The transformations in Rijndael consider the data block as a 4 column rectangular array of 4-byte vectors or State, as shown in Fig. 1 15 . Hence, B 0 becomes P 0,0 , B 1 becomes P 1,0 , B 2 becomes P 2,0 … B 4 becomes P 0,1 and so on. The key is also considered to be a rectangular array of 4-byte vectors, the number of columns, N k , of which is dependent on the key length. This is illustrated in Fig. 2 . The algorithm design consists of an initial data/key addition, nine, eleven or thirteen rounds when the key length is 128-bits, 192-bits or 256-bits respectively and a final round, which is a variation of the typical round. The Rijndael key schedule expands the key entering the cipher so that a different sub-key or round key is created for each algorithm iteration. An outline of Rijndael is shown in Fig. 3 .
Rijndael Round
The Rijndael round comprises a ByteSub Transformation, a ShiftRow Transformation, a MixColumn Transformation and a Round Key Addition. The ByteSub transformation is the s-box of the Rijndael algorithm and operates on each of the State bytes independently. The s-box is constructed by finding the multiplicative inverse of each byte in GF (2 8 ). An affine transformation is then applied, which involves multiplying the result by a matrix and adding to the hexadecimal number '63'. In the ShiftRow transformation, the rows of the State are cyclically shifted to the left. Row 0 is not shifted, row 1 is shifted 1 place, row 2 by 2 places and row 3 by 3 places. The MixColumn transformation operates on the columns of the State. Each column is considered a polynomial over GF (2 8 ) and multiplied modulo x 4 +1 with a fixed polynomial c(x), where,
Finally the State bytes and round-key bytes are XORed in Round Key Addition. A typical Rijndael round is illustrated in Fig. 4 . In the final round the MixColumn transformation is excluded.
Key Schedule
The Rijndael key schedule consists of two parts: Key Expansion and Round Key Selection. Key Expansion involves expanding the cipher key into a linear array of 4-byte words, the length of which is determined by the data block length, N b , multiplied When the key block length, N k = 4, 6 and 8, the number of rounds is 10, 12 and 14 respectively. Hence the lengths of the expanded key are as shown in Table 1 . [3] and is utilized in the initial data/key addition, round key 1 is W [4] to W [7] and is used in round 0, round key 2 is W[8] to W [11] and used in round 1 and so on. Finally, round key 10 is used in the final round.
Decryption
The decryption process in Rijndael is effectively the inverse of its encryption process. It comprises an inverse of the final round, inverses of the rounds, followed by the initial data/key addition. The data/key addition remains the same as it involves an XOR operation, which is its own inverse. The inverse of the round is found by inverting each of the transformations in the round. The inverse of ByteSub is obtained by applying the inverse of the affine transformation and taking the multiplicative inverse in GF (2 8 ) of the result. In the inverse of the ShiftRow transformation, row 0 is 
Similarly to the data/key addition, Round Key addition is its own inverse. During decryption, the key schedule does not change, however the round keys constructed are now used in reverse order. For example, in a 10-round design, round key 0 is still utilized in the initial data/key addition and round key 10 in the final round. However, round key 1 is now used in round 8, round key 2 in round 7 and so on.
Design of Pipelined Rijndael Implementations
The Rijndael algorithm implementations presented in this paper are based on the Electronic Codebook (ECB) mode. Although ECB mode is less secure than other modes of operation, it is commonly used and its operation can be pipelined [10] . The fully pipelined Rijndael implementation will also operate in Counter mode. Counter mode is a simplification of Output Feedback (OFB) mode and it involves updating the input plaintext block, P, as a counter, P j+1 = P j +1, rather than using feedback. Hence, the ciphertext block, C, is not required in order to encrypt plaintext block, P+1 [11] . Counter mode provides more security than ECB mode and operation in either mode will achieve high throughputs.
A number of different architectures can be considered when designing encryption algorithms [7] . These are described as follows. Iterative Looping (IL) is where only one round is designed, hence for an n-round algorithm, n iterations of that round are carried out to perform an encryption. Loop Unrolling (LU) involves the unrolling of multiple rounds. Pipelining (P) is achieved by replicating the round and placing registers between each round to control the flow of data. A pipelined architecture generally provides the highest throughput. Sub-Pipelining (SP) is carried out on a partially pipelined design when the round is complex. It decreases the pipeline's delay between stages but increases the number of clock cycles required to perform an encryption.
The Rijndael designs described in this paper are coded using VHDL and are fully pipelined: the encryption design having ten, twelve or fourteen pipeline stages and the encryption/decryption design having ten pipeline stages.
Design of Generic Rijndael Encryptor Core
The main consideration in both designs is the memory requirement. The Rijndael s-box in the ByteSub transformation can be implemented as a look-up table (LUT) or ROM. This proves a faster and more cost-effective method than implementing the multiplicative inverse operation and affine transformation. Since the State bytes are operated on individually, each Rijndael round requires sixteen 8-bit to 8-bit LUTs. In the key schedule, LUTs can also be used, as words are passed through the s-box. The
Virtex-E and Virtex-E Extended Memory range of FPGAs are utilized for implementation as they contain devices with up to 280 Block SelectRAM (BRAM) memories. A single BRAM can be configured into two single port 256 x 8-bit RAMs, hence, eight BRAMs are used in each round. When the write enable of the RAM is low ('0'), transitions on the write clock are ignored and data stored in the RAM is not affected. Hence, if the RAM is initialized and both the input data and write enable pins are held low then the RAM can be utilized as a ROM or LUT.
The ShiftRow transformation is simply hardwired as no logic is involved. The MixColumn transformation can be written as a matrix multiplication as given in Equation 3, with a 4-byte input, a 0 , a 1 , a 2 , a 3 and output, b 0 , b 1 , b 2 Thus, in the overall 128-bit key design, a total of 100 ROMs are required, 80 ROMs are required for the 10 rounds and a further 20 for the key schedule. Similarly, 112 ROMS are required for the 192-bit design (96 for the 12 rounds and 16 for the key schedule) and 138 for the 256-bit design (112 for the 14 rounds and 26 for the key schedule).
Design of 128-bit Key Rijndael Encryptor/Decryptor Core
In the decryption operation, the inverse of the ByteSub transformation can also be implemented as a LUT. However the values in this LUT are different to those required for encryption. Therefore, it is necessary to accommodate for both encryption and decryption. One method would involve doubling the number of BRAMs utilized, however, this would prove costly on area. In the Rijndael encryption/decryption design, this was overcome by the addition of two BRAMs, which were utilized as ROMs, one containing the initialization values for the LUTs required during encryption, the other containing the values for the LUTs required during decryption. Therefore, instead of initializing each individual BRAM as a ROM, when the design is set to encrypt, all the BRAMs are initialized with data read from the ROM containing the values required for encryption. When the design is set to decrypt, the BRAMs are initialized with data from the ROM containing the values required for the decryption operation. This initialization procedure is outlined in Fig. 7 .
The Inverse ShiftRow transformation is also hardwired. Multiplexors (MUXs) select between the ShiftRow and Inverse ShiftRow wiring. Similarly to Fig. 5 , the Inverse MixColumn transformation can be implemented by XORing results of the multiplications in GF (2 8 ) and again, MUXs are used to select between the values required for encryption and those required during decryption.
Since the encryptor/decryptor core design assumes a key length of 128 bits, the design of the key schedule is a simplification of that shown in the flowchart illustrated in Fig. 6 . During decryption, the values of the LUTs utilized in the key schedule do not change, hence, the LUTs can simply be implemented as ROMs. However, the round keys are used in reverse order. The initialization process for either encryption or decryption takes 256 clock cycles as the 256 values contained in each ROM are read. Since the system clock for the encryption/decryption design is 25.3 MHz, this corresponds to an initialization time of only 10 µs. When encrypting data, the keys are produced as each round requires them, therefore, the encryption will take 10 clock cycles corresponding to the 10 rounds when using a 128-bit key. The design assumes that the same key is utilized during a session of data transfer. If decrypting data, the initialization process will be as described above. However, initial decryption will take 20 clock cycles, 10 clock cycles for the required round keys to be constructed and a further 10 corresponding to the 10 rounds. The overall 128-bit key encryptor/decryptor design, therefore, requires 102 BRAMs. 
Performance Results
The Rijndael designs are implemented using Xilinx Foundation Series 3.1i software and Synplify Pro V6.0 on Xilinx Virtex-E FPGA devices. Data blocks can be accepted every clock cycle and after an initial delay the respective encrypted/decrypted data blocks appear on consecutive clock cycles. The first design implemented is the generic encryptor core, which supports 128-bit, 192-bit and 256-bit keys. The performance results obtained for this implementation will be similar to those of a design with only decryption capabilities. The main difference in the two implementations would be the initial delay time as mentioned in section 3. The 128-bit key encryption design implemented on the Virtex-E XCV812e-8bg560 device, utilizes 2222 CLB slices (23%) and 100 BRAMs (35%). Of IOBs 384 of 404 are used. The design uses a system clock of 54.35 MHz and runs at a data-rate of 7 Gbits/sec (870 Mbytes/sec). This result proves faster than similar existing FPGA implementations, as illustrated in Table 3 below. The implementations included in the table are as outlined in section 1. The design is also the most efficient in terms of CLB utilization although it must be remembered that the previous implementations were limited in their use of device. Table 4 below. It is possible to enhance the performance figures of the two designs presented by further optimization of the algorithm specific to the requirements of the FPGA device on which the design is to be implemented. However, this would result in the design being less easy to migrate to other devices and technologies.
Conclusions
To conclude, this paper describes high performance single-chip FPGA implementations of the Rijndael algorithm. The generic, parameterisable encryption design is the only hardware Rijndael encryption design that supports varying key sizes, reported to date. When implemented, the 128-bit key encryption design performs at a data-rate of 7 Gbits/sec, which is 3.5 times faster than similar existing FPGA implementations and 21 times faster than software implementations. Previous Rijndael encryption-only designs are implemented on Virtex XCV1000 devices, which consist of only 32 BRAMs and therefore, cannot support a fully pipelined Rijndael design. The Virtex-E and Virtex-E Extended Memory family of FPGAs, however, contains up to 280 BRAMs and can easily accommodate large unrolled designs. The encryptor/decryptor core runs at 3.2 Gbits/sec. This implementation not only compares favorably with similar ASIC designs but is also the only known singlechip FPGA Rijndael design capable of both encryption and decryption. Future work will include parameterising the Rijndael encryption/decryption design so as it may also accept varying key sizes. Rijndael is set to be approved by NIST and replace DES as the Federal Information Processing Encryption Standard (FIPS) in the summer of 2001. It will replace DES in applications such as IPSec protocols, the Secure Socket Layer (SSL) protocol and in ATM cell encryption. In general, hardware implementations of encryption algorithms and their associated key schedules are physically secure, as they cannot easily be modified by an outside attacker. Also, the high speed Rijndael encryptor core and Rijndael encryptor/decryptor core presented, should prove beneficial in applications where speed is vital as with real-time communications such as satellite communications and electronic financial transactions.
