Abstract
Introduction
The Advanced Encryption Standard (AES) was accepted as a FIPS standard in November 2001 [1]. Since then, many hardware implementations for ASIC and FPGA have been reported. References [2, 3, 4, 5] present architectures and results for ASIC implementations, while references [6, 7, 8, 9, 10, 11] present FPGA implementations of the AES algorithm that achieve throughputs from 1 to 20 Gbits/s. This paper presents a fully pipelined architecture with an optimum number of pipeline stages for the byte substitution phase of the AES algorithm. It provides a throughput of 21.54 Gbits/s with a throughput per area of 4.2 Mbps/Slice.
Fully pipelined AES implementation
The Advanced Encryption Standard [1] is composed of four steps that are repeated for N_r rounds: byte substitution, shift row, mix column, and key addition. When a 128-bit key is used, the number of rounds N_r is ten. Figure 1 shows the unrolled and fully pipelined implementation of the AES algorithm. The shift row step is pure interconnection, and the key addition is an XOR of the round data with the round key. The mix column step consists of a chain of XORs that linearly combines the bytes of each column. The arithmetic of these three steps can be combined into one pipeline stage for each round. The most expensive step, by contrast, is the byte substitution phase, which is explained next.
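The three inexpensive steps can be modeled in a few lines of software, which makes it clear why they fit into a single pipeline stage. The following is a minimal Python sketch, not the hardware description used in this work; the function names and the column-major state layout (byte index = row + 4 * column, as in FIPS 197) are assumptions made for illustration.

```python
def xtime(b):
    """Multiply a byte by x in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11B  # reduce by the AES field polynomial
    return b

def shift_rows(state):
    """Rotate row r left by r positions; in hardware this is only wiring."""
    return [state[r + 4 * ((c + r) % 4)] for c in range(4) for r in range(4)]

def mix_column(col):
    """Combine the four bytes of one column using XORs and xtime."""
    out = []
    for i in range(4):
        a, b, c, d = (col[(i + k) % 4] for k in range(4))
        out.append(xtime(a) ^ (xtime(b) ^ b) ^ c ^ d)  # 2a + 3b + c + d
    return out

def add_round_key(state, round_key):
    """Key addition is a bytewise XOR of round data and round key."""
    return [s ^ k for s, k in zip(state, round_key)]

def linear_stage(state, round_key):
    """Shift row, mix column and key addition, modeled as one combined step."""
    shifted = shift_rows(state)
    mixed = []
    for c in range(4):
        mixed.extend(mix_column(shifted[4 * c:4 * c + 4]))
    return add_round_key(mixed, round_key)
```

The last AES round omits the mix column step, so a complete model would skip `mix_column` there; the sketch covers the regular rounds only.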
Byte substitution phase
In the byte substitution phase (Sbox), the input is considered as an element of GF(2^8). First, the multiplicative inverse in GF(2^8) is calculated. Then, an affine transformation over GF(2) is applied [1]. Either all the substitution values are precalculated and stored in Block RAMs, or the values are calculated on the fly in logic. Rijmen [12] suggests an algorithm that calculates the byte substitution using GF(2^4) operations. Figure 2 shows the architecture of the byte substitution phase when the input is mapped to GF(2^4) elements and GF(2^4) operations are used. This is the most area-efficient implementation of the Sboxes. Because of the long delay of this architecture, pipelining must be used.

Figure 3 shows the LUT usage and the critical path delay of the pipelined implementation of one Sbox using this architecture, synthesized for a VirtexII-Pro FPGA (pre-place and route). The bar graph shows the delay and the plotted line shows the LUT usage. The best delay-LUT combination is the design with three pipeline stages. Figure 4 shows the throughput per area metric for the different pipelined implementations. The most efficient designs are those with three and six pipeline stages for the byte substitution phase; their register positions are marked in figure 2, where the dotted lines are the pipeline registers of the three-stage Sbox and the solid lines are the registers of the six-stage Sbox. In addition, the last pipeline stage of each round of the AES algorithm contains the shift row, mix column, and key addition phases (figure 1). Therefore, the optimum pipelined implementations have a total of four or seven pipeline stages for each round of AES.
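The two-step definition of the Sbox (inverse in GF(2^8), then affine transformation) can be checked with a small reference model. The Python sketch below computes the inverse directly in GF(2^8) by exponentiation; it is not the composite-field GF(2^4) architecture of figure 2, only a functional model that such an architecture would have to match. The function names are hypothetical.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return p

def gf_inv(a):
    """Multiplicative inverse as a^254; zero is mapped to zero as in AES."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result if a else 0

def affine(x):
    """Affine transformation over GF(2) from FIPS 197, constant 0x63."""
    y = 0
    for i in range(8):
        bit = (x >> i) ^ (x >> ((i + 4) % 8)) ^ (x >> ((i + 5) % 8))
        bit ^= (x >> ((i + 6) % 8)) ^ (x >> ((i + 7) % 8)) ^ (0x63 >> i)
        y |= (bit & 1) << i
    return y

def sbox(x):
    """Byte substitution: inverse in GF(2^8) followed by the affine map."""
    return affine(gf_inv(x))

# The 256-entry table is what a Block RAM based Sbox would store.
SBOX_TABLE = [sbox(x) for x in range(256)]
```

A table generated this way corresponds to the precalculated option, while the logic option evaluates the same function with pipelined arithmetic.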
Performance results
The performance results of our proposed architectures are shown in table 1 and are compared with related work in table 2. Synplicity is used for synthesis and Xilinx ISE for place and route. When Block RAMs are used, the Sboxes of the key scheduling and of the first five rounds of the encryption datapath are mapped onto Block RAMs, and the remaining Sboxes are designed using the pipelined implementation of section 3. Byte substitution takes one clock cycle on a Block RAM, so each of these rounds needs two pipeline stages (one for the Sbox lookup and one for the combined shift row, mix column, and key addition), and the first five rounds take 10 clock cycles.
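The stage counts stated in the text can be turned into a rough latency estimate. The Python sketch below is only an illustration under stated assumptions: every round ends with one combined stage for shift row, mix column, and key addition, the initial key addition is not counted, and the helper names and parameters are hypothetical. Which of the three-stage or six-stage Sbox is paired with the Block RAM rounds is left as a parameter.

```python
def round_stages(sbox_stages, linear_stages=1):
    """Pipeline stages per round: Sbox stages plus the combined linear stage."""
    return sbox_stages + linear_stages

def datapath_latency(logic_sbox_stages, bram_rounds=0, total_rounds=10):
    """Clock cycles through the unrolled datapath.

    bram_rounds rounds use a one-cycle Block RAM Sbox; the remaining
    rounds use a logic Sbox with logic_sbox_stages pipeline stages.
    """
    logic_rounds = total_rounds - bram_rounds
    bram_cycles = bram_rounds * round_stages(1)
    logic_cycles = logic_rounds * round_stages(logic_sbox_stages)
    return bram_cycles + logic_cycles

# Five Block RAM rounds alone account for the 10 cycles noted in the text.
first_five = datapath_latency(3, bram_rounds=5, total_rounds=5)
```

Because the design is fully pipelined, latency of this kind affects only the delay of the first block; a new block can still enter the datapath on every clock cycle.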
Conclusion
The architecture of a fully pipelined AES processor is presented. With an optimum number of pipeline stages for the byte substitution phase, it achieves a maximum throughput of 21.54 Gbits/s using 84 Block RAMs and 5177 Slices of a VirtexII-Pro FPGA, which corresponds to a throughput per area of 4.2 Mbps/Slice.
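The reported figures can be cross-checked with simple arithmetic. The Python sketch below assumes that the fully pipelined datapath accepts one 128-bit block per clock cycle, so the implied clock frequency is derived here rather than quoted from the results; the variable names are chosen for illustration. The Block RAMs are not included in the per-Slice metric.

```python
BLOCK_BITS = 128            # AES block size in bits
THROUGHPUT_GBPS = 21.54     # reported maximum throughput
SLICES = 5177               # reported Slice count

# Throughput per area in Mbps/Slice (Block RAMs not counted).
throughput_per_slice = THROUGHPUT_GBPS * 1000 / SLICES

# Clock frequency implied by one block per cycle, in MHz (an assumption).
implied_clock_mhz = THROUGHPUT_GBPS * 1000 / BLOCK_BITS

print(f"{throughput_per_slice:.2f} Mbps/Slice, {implied_clock_mhz:.1f} MHz")
```

The per-Slice value comes out slightly below 4.2, which is consistent with the rounded figure reported in the text.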
Acknowledgment
This material is based upon work supported by the Space and Naval Warfare Systems Center - San Diego under contract No. N66001-02-1-8938. This funding is acknowledged.
Table 1. Performance results (after place and route)

