In this paper, we present a high-speed AES IP-core, which runs at 780 MHz on a 0.13-μm CMOS standard cell library, and which achieves 10 Gbps throughput in all encryption modes, including CBC mode. Although the CBC mode is the most widely used and important, achieving such high throughput was difficult because pipelining techniques cannot be applied. To reduce the propagation delay of the S-Box, the most critical function block, we developed a special circuit architecture that we call twisted-BDD, where the fanout of signals is distributed in the S-Box circuit. Our S-Box is 1.5 to 2 times faster than the conventional S-Box implementations. The T-Box algorithm, which merges the S-Box and another primitive function (MixColumns) into a single function, is also used for an additional speedup.
[...] [3, 4, 5] and FPGAs [6, 7]. However, most of them are simple implementations that follow the AES specification directly, and none are yet fast enough for practical uses such as optical communication links with a VPN (Virtual Private Network) capability and/or a 10 Gbps WDM (Wavelength Division Multiplexing) system.
In particular, no existing AES circuit achieves 10 Gbps throughput in the CBC (Cipher Block Chaining) mode, which is the most widely used and important mode, although a throughput of more than 10 Gbps has already been reported in the simple ECB (Electronic Codebook) mode [3]. In the CBC mode, a feedback operation is performed, and therefore pipelining techniques cannot be applied as a speedup method. Reducing the propagation delay of the combinational circuits is the only speedup method for the CBC mode.
To reduce the delay of the S-Box, which is the slowest function block in the entire circuit, we investigated various design techniques for the S-Box. Although many techniques for compact S-Box designs have been proposed [8, 9, 10], the circuits obtained are too slow. Few techniques for realizing fast S-Boxes have been reported except for the table-lookup method [11], where the S-Box circuit is automatically synthesized from its truth table by using EDA tools.
In this paper, we propose a new fast S-Box circuit architecture named twisted-BDD. In the conventional BDD (Binary Decision Diagram) [12] representation of the S-Box, various structural characteristics are observed, such as the heavy sharing of input-side nodes and the independence of variable ordering. We aimed to decrease the fanout of the primary input signals and to reduce the propagation delay of the serially connected selectors. The resulting S-Box is 1.5 to 2 times faster than the conventional S-Box implementations.
As a result, we achieved a throughput of 10 Gbps at a 780 MHz clock rate using a 0.13-μm CMOS standard cell library, by combining the twisted-BDD architecture with the T-Box algorithm [1, 7], which was originally developed for high-speed software. As far as the authors know, this is the first 10 Gbps AES circuit that can support all encryption modes.
The AES Algorithm

Basic Algorithm
An AES encryption process for 128-bit plaintext data and a 128-bit secret key is shown in Figure 1. A sequence of four primitive functions, SubBytes, ShiftRows, MixColumns, and AddRoundKey, is executed $N_r - 1$ times. Each loop is called a round, and the concrete value of $N_r$ is 10, 12, or 14 depending on the key length. Prior to this main loop, AddRoundKey is executed for initialization. After the main loop, a sequence of SubBytes, ShiftRows, and AddRoundKey is executed as the final round.
SubBytes is a 16-byte (128-bit) input/output nonlinear transformation that uses one-byte substitution tables (S-Boxes). Each S-Box is a multiplicative inversion on a Galois field GF($2^8$) followed by an affine transformation. The irreducible polynomial used by the field is
$$m(x) = x^8 + x^4 + x^3 + x + 1. \qquad (1)$$
ShiftRows is a cyclic shift operation in each row of the 4 x 4-byte data by offsets of 0 to 3 bytes. MixColumns treats the 4-byte data in each column as coefficients of a 4-term polynomial, and multiplies the data modulo $x^4 + 1$ by the fixed polynomial given by
$$c(x) = \{03\}x^3 + \{01\}x^2 + \{01\}x + \{02\}. \qquad (2)$$
AddRoundKey is a simple bit-wise XOR operation on the 128-bit round keys $K_0$ to $K_{N_r}$ and the data.
In the decryption process, the inverse operations of each primitive function are executed. The inverse of AddRoundKey is AddRoundKey itself. InvSubBytes, which is the inverse of SubBytes, executes an affine transformation before the multiplicative inversion. InvShiftRows is a cyclic rotation in the reverse direction. InvMixColumns uses the following polynomial for the multiplications:
$$c^{-1}(x) = \{0B\}x^3 + \{0D\}x^2 + \{09\}x + \{0E\}. \qquad (3)$$
T-Box for High-Speed Implementations
In the T-Box algorithm, SubBytes and MixColumns are merged in encryption, and InvSubBytes and InvMixColumns are merged in decryption [1, 7]. The left half of Figure 2 shows the original AES encryption and decryption processes, while the other half shows the processes for the T-Box algorithm. Some functions are reordered so that they can be merged. Each one-byte S-Box output of SubBytes and InvSubBytes is multiplied by the four coefficients of the polynomials (2) and (3), respectively, and these two relation tables with one-byte inputs and four-byte outputs become the T-Boxes. The concrete circuit implementation of this algorithm will be described in Section 5.1.
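The MixColumns arithmetic and its T-Box form can be checked with a few lines of Python. The sketch below is only an illustration under stated assumptions: the function names (`xtime`, `gf_mul`, `poly_mul_mod_x4_plus_1`, `tbox_entry`, `mix_column_via_tbox`) are hypothetical, the InvMixColumns coefficients are the ones given in the AES specification, and the T-Box entry is computed from an S-Box output byte rather than stored in a table indexed by the S-Box input as a real T-Box would be.

```python
# Illustrative sketch only; names are hypothetical and not taken from the paper.

def xtime(a: int) -> int:
    """Multiply a GF(2^8) element by {02}, reducing by x^8 + x^4 + x^3 + x + 1."""
    a <<= 1
    if a & 0x100:
        a ^= 0x11B
    return a


def gf_mul(a: int, b: int) -> int:
    """Multiply two GF(2^8) elements with the shift-and-add method."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


def poly_mul_mod_x4_plus_1(p, q):
    """Multiply two 4-term polynomials over GF(2^8) modulo x^4 + 1.

    Coefficients are listed lowest degree first.
    """
    out = [0, 0, 0, 0]
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[(i + j) % 4] ^= gf_mul(pi, qj)
    return out


# MixColumns polynomial {03}x^3 + {01}x^2 + {01}x + {02}, lowest degree first.
C = [0x02, 0x01, 0x01, 0x03]
# InvMixColumns polynomial {0B}x^3 + {0D}x^2 + {09}x + {0E} (AES specification).
C_INV = [0x0E, 0x09, 0x0D, 0x0B]


def mix_column(col):
    """MixColumns on one 4-byte column, as a polynomial product with c(x)."""
    return poly_mul_mod_x4_plus_1(C, col)


def tbox_entry(s: int):
    """Four-byte T-Box value for one S-Box output byte s: s times each coefficient."""
    return [gf_mul(c, s) for c in C]


def mix_column_via_tbox(col):
    """Same result as mix_column, built by XORing rotated T-Box values."""
    out = [0, 0, 0, 0]
    for k, s in enumerate(col):
        entry = tbox_entry(s)
        for i in range(4):
            out[(i + k) % 4] ^= entry[i]
    return out


if __name__ == "__main__":
    col = [0xDB, 0x13, 0x53, 0x45]
    print([hex(v) for v in mix_column(col)])
    print([hex(v) for v in mix_column_via_tbox(col)])
    print(poly_mul_mod_x4_plus_1(C, C_INV))
```

Because each byte of a column contributes one rotated four-byte value, the XOR of four lookups replaces the separate substitution and column-mixing steps, which is the merging that the T-Box algorithm relies on.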
3 Issues in Designing Fast S-Box Circuits
Basic Approaches
There are two approaches for designing S-Box circuits: (1) construct a multiplicative inversion circuit and an affine transformation circuit independently, and then connect these two circuits in series, and (2) directly construct a single circuit whose input-output relation is equivalent to the S-Box.
In Method (1), the circuit area can be reduced by using mathematical theorems over Galois fields (GF). Various methods for constructing compact inversion circuits over GF have been studied, based on Fermat's Little Theorem, the extended Euclidean algorithm, and so on [11]. In particular, the composite field (or tower field) inversion [8] is effective over GF($2^8$), and it can be used to create compact AES implementations [9, 10]. However, these methods are not suitable for achieving 10 Gbps AES circuits because of their large propagation delay.
In Method (2), a fast implementation is possible. The S-Box circuit can be obtained from its truth table by using two-level logic such as SOP (Sum of Products), POS (Product of Sums), PPRM (Positive Polarity Reed-Muller form), etc. [13], or by making a decision diagram such as a BDD [12] or an FDD [14]. In many actual AES implementations the table-lookup method is used [11], where the S-Box circuit is automatically synthesized using EDA tools.
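The relation between the two methods can be illustrated in software. The following Python sketch, with hypothetical function names, computes the S-Box in the style of Method (1), using an inversion based on Fermat's Little Theorem followed by the affine transformation of the AES specification, and then tabulates it into the 256-entry truth table that Method (2) starts from. It says nothing about circuit delay; it only shows that both methods describe the same input-output relation.

```python
# Illustrative sketch only; names are hypothetical and not taken from the paper.

def gf_mul(a: int, b: int) -> int:
    """Multiply two GF(2^8) elements modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def gf_inv(a: int) -> int:
    """Multiplicative inverse as a^254 (Fermat's Little Theorem); 0 maps to 0."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def affine(a: int) -> int:
    """AES affine transformation: b_i = a_i ^ a_{i+4} ^ a_{i+5} ^ a_{i+6} ^ a_{i+7} ^ c_i."""
    out = 0
    for i in range(8):
        bit = (
            (a >> i)
            ^ (a >> ((i + 4) % 8))
            ^ (a >> ((i + 5) % 8))
            ^ (a >> ((i + 6) % 8))
            ^ (a >> ((i + 7) % 8))
            ^ (0x63 >> i)
        ) & 1
        out |= bit << i
    return out


def sbox(x: int) -> int:
    """Method (1): inversion followed by the affine transformation."""
    return affine(gf_inv(x))


# Method (2) starts from the truth table: one entry per 8-bit input value.
SBOX_TABLE = [sbox(x) for x in range(256)]


def output_bit_truth_table(bit: int):
    """Truth table of a single S-Box output bit, the input to logic synthesis."""
    return [(SBOX_TABLE[x] >> bit) & 1 for x in range(256)]


if __name__ == "__main__":
    print(hex(SBOX_TABLE[0x00]), hex(SBOX_TABLE[0x01]), hex(SBOX_TABLE[0x53]))
    print(sum(output_bit_truth_table(0)))
```

Each of the eight single-bit truth tables is what a two-level form or a decision diagram would then realize as a circuit.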
Our evaluation results for these various S-Boxes on a 0.13-μm CMOS standard cell library are shown in Table 1. The BDD and table-lookup implementations are the fastest of all. These results were obtained in the following steps: (i) implement these S-Boxes as hard-coded VHDL sources (primitive cells are directly called), (ii) adjust cell strengths by a logic synthesis tool without changing circuit structures, and (iii) evaluate circuit size and speed using the synthesis tool and a static timing analyzer (STA). Although the absolute values of delay and size can vary depending on the ASIC libraries and synthesis tools, the ratios of the circuits' speeds were almost the same, as far as the authors' tests showed.
Analysis and Issues for BDD Implementation
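We selected the BDD architecture as a candidate to achieve the throughput of 10 Gbps, but the speed was not yet adequate. We noticed structural characteristics of the S-Box/T-Box BDDs (Figure 3), such as the heavy sharing of input-side nodes and the independence of the BDD structure from the variable ordering.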
Twisted variable ordering between primary outputs
In the proposed twisted-BDD architecture (Figure 4), eight BDDs are arranged in parallel, where each BDD corresponds to one primary output. No node is shared between these BDDs and their variable ordering is twisted (rotated) as shown in Figure 4 so that each primary input $i$ ($0 \le i \le 7$) drives the $(((8 + i - j) \bmod 8) + 1)$-th input of BDD $j$. Each primary input signal is propagated to the next BDDs by passing through drivers (inverters).
As a result, the fanout of each primary input and each driver's output is significantly decreased from 150 down to 30, because the first and second stage selectors are distributed equally among the primary inputs. Because the BDD structure and its size are almost independent of the variable ordering, as described in Section 3.2, the fanout of each primary input is almost the same. In the same manner, the fanout of each output of the first stage selectors is decreased from 30 down to 5.
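The rotation rule can be tabulated to see why it balances the load. The Python sketch below uses hypothetical names and, as a loudly stated assumption, made-up selector counts per stage (a full binary tree with 1, 2, 4, ... selectors); these are not the node counts of the actual S-Box BDDs and the resulting loads are not the fanout figures quoted above. It only shows that, under the rotation, every primary input appears at every stage position in exactly one of the eight BDDs.

```python
# Illustrative sketch only; names and selector counts are hypothetical.

NUM_INPUTS = 8


def input_position(i: int, j: int) -> int:
    """1-based input position of BDD j that primary input i drives."""
    return ((NUM_INPUTS + i - j) % NUM_INPUTS) + 1


def variable_order(j: int):
    """Primary inputs of BDD j listed by input position (position 1 first)."""
    order = [None] * NUM_INPUTS
    for i in range(NUM_INPUTS):
        order[input_position(i, j) - 1] = i
    return order


def input_loads(selectors_per_stage, twisted: bool):
    """Number of selector control pins driven by each primary input.

    selectors_per_stage[p] is an assumed selector count at position p + 1
    of a single, unshared BDD.
    """
    loads = [0] * NUM_INPUTS
    for j in range(NUM_INPUTS):
        order = variable_order(j) if twisted else list(range(NUM_INPUTS))
        for stage, i in enumerate(order):
            loads[i] += selectors_per_stage[stage]
    return loads


if __name__ == "__main__":
    # Assumed counts for illustration only: a full binary selector tree.
    assumed_counts = [1 << p for p in range(NUM_INPUTS)]
    for j in range(NUM_INPUTS):
        print(j, variable_order(j))
    print("same order in all BDDs:", input_loads(assumed_counts, twisted=False))
    print("twisted order:         ", input_loads(assumed_counts, twisted=True))
```

With the same ordering in all eight BDDs, the input assigned to the widest stage carries that stage's load eight times over; with the twisted ordering, every input carries the same total.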
Parallel Decoding of Selector Control Signals
As shown in Figure 5, we replaced the $2^5$:1 selectors on the output side in each BDD with a combination of a select-signal decoder (5-bit binary to $2^5$-bit one-hot conversion) and a data selection part (1 stage of ANDs and 5 stages of ORs). As a result, the delay of the $2^5$:1 selectors is reduced, because the decoding of the select signals and the signal processing in the first and second stage selectors are performed in parallel.
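A behavioral model makes the equivalence of the two selector forms concrete. The Python sketch below assumes a selector with $2^5 = 32$ one-bit data inputs and a 5-bit select signal; the function names are hypothetical, and the model captures only the logic function, not the gate delays that motivate the change.

```python
# Illustrative behavioral sketch only; names are hypothetical.

def decode_one_hot(sel: int, width: int = 5):
    """Binary to one-hot conversion: exactly one of 2**width lines is 1."""
    return [1 if k == sel else 0 for k in range(1 << width)]


def and_or_select(data, one_hot):
    """Data selection part: one AND stage followed by a tree of 2-input ORs.

    Returns the selected bit and the number of OR stages used.
    """
    level = [d & h for d, h in zip(data, one_hot)]
    or_stages = 0
    while len(level) > 1:
        level = [level[k] | level[k + 1] for k in range(0, len(level), 2)]
        or_stages += 1
    return level[0], or_stages


def serial_selector(data, sel: int):
    """Reference model: a chain of 2:1 selector stages, one per select bit."""
    level = list(data)
    bit = 0
    while len(level) > 1:
        s = (sel >> bit) & 1
        level = [level[k + 1] if s else level[k] for k in range(0, len(level), 2)]
        bit += 1
    return level[0]


if __name__ == "__main__":
    data = [(k * 7 + 3) % 5 % 2 for k in range(32)]  # arbitrary test pattern
    for sel in range(32):
        value, stages = and_or_select(data, decode_one_hot(sel))
        assert value == serial_selector(data, sel) == data[sel]
    print("OR stages:", stages)
```

In the decoded form the select signal no longer passes through the data path stage by stage, so the decoder can work while the data inputs are still being computed.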
Use of Negative-Output Selectors and Drivers
We used negative-output primitive gates to implement the selectors and drivers in each BDD, which reduces the circuit delay and the number of gates. Because most CMOS primitive gates with positive outputs consist of a negative-output gate followed by an inverter, primitive gates with negative outputs are faster and smaller.
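One way to keep the logic function unchanged when every selector inverts its output is to track the polarity through the stages. The Python sketch below is an assumption about how such bookkeeping could be done, not a description of the authors' netlist: it complements the constant leaf values whenever the number of inverting stages is odd, and checks that the result matches a tree of ordinary selectors. All names are hypothetical.

```python
# Illustrative sketch only; the polarity scheme is an assumption, not the paper's netlist.

def mux(sel: int, d0: int, d1: int) -> int:
    """Ordinary 2:1 selector with a positive output."""
    return d1 if sel else d0


def mux_n(sel: int, d0: int, d1: int) -> int:
    """2:1 selector with a negative (inverted) output."""
    return 1 - mux(sel, d0, d1)


def eval_tree(leaves, select_bits, inverting: bool) -> int:
    """Evaluate a selector tree; select_bits[0] controls the leaf-side stage."""
    level = list(leaves)
    if inverting and len(select_bits) % 2 == 1:
        # An odd number of inverting stages flips the result, so the
        # constant leaves are stored complemented to compensate.
        level = [1 - v for v in level]
    gate = mux_n if inverting else mux
    for s in select_bits:
        level = [gate(s, level[k], level[k + 1]) for k in range(0, len(level), 2)]
    return level[0]


if __name__ == "__main__":
    leaves = [0, 1, 1, 0, 1, 0, 0, 0]
    for sel in range(8):
        bits = [(sel >> b) & 1 for b in range(3)]
        assert eval_tree(leaves, bits, True) == eval_tree(leaves, bits, False)
    print("inverting and non-inverting trees agree")
```

Since the leaves of a decision diagram are constants, complementing them costs no extra gates in this model.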
Evaluation Results and Discussion
As shown in Table 1, the S-Box speed is increased 1.5 to 2 times by the proposed method. We obtained a 430-ps delay S-Box on a 0.13-μm CMOS standard cell library, and this is the fastest we know of. In spite of the highly parallel circuit structure shown in Figure 4, the total circuit size remains only double that of the original BDD, because in the original BDD, the selectors in the 3rd to the 7th stage are already unshared and separated between primary outputs.
We believe that further improvement in speed will be quite difficult if any of the other logic circuit structures described in Section 3.1 are used, for the following two reasons:
First, if any two-level logic such as SOP is used, the number of prime terms increases (for example, 150 terms in SOP) and the fanout of the primary inputs becomes large. In contrast to our twisted-BDD, however, distributing and reducing fanout are difficult, because each primary input signal drives almost the same number of prime selectors.
Second, if any decision diagram other than a BDD is used, the propagation delay of each node becomes much larger. For example, if an FDD is used, each node is implemented by AND + XOR, and this is much slower than the 2:1 selector used as a BDD node. A fast 2:1 selector cell is available in most ASIC libraries.
Regarding the T-Box design, the twisted-BDD architecture is suitable for a high clock-speed AES implementation because of its low propagation delay (see Table 1 in Section 3.1). [...] slower clock is available, it is still possible to achieve 10 Gbps throughput by duplicating the combinational circuit blocks and connecting them in series, i.e., by using the unrolling technique.
In the encryption path shown in the upper half of [...]
The evaluation of the speed and size of our implementation was done by the same method described in Section 3.1. The power consumption was estimated by a simulation-based method. In this method, a timing simulation is performed using a synthesized netlist and a given set of test input data, and the number of switching events for all internal gates is counted. The effects of dynamic hazards are reflected in the power estimation results.
The T-Box architecture is almost 20% faster than the basic algorithm in Section 2.1 (Table 2), although strong T-Box drivers are necessary. The circuit size and power are still reasonable. Much of the critical path delay is used by the T-Box (Table 3), and this shows that it will be difficult to achieve the maximum throughput without using the twisted-BDD architecture.
We achieved 10 Gbps throughput and a 780 MHz clock cycle on a 0.13-μm CMOS standard cell library. We did not implement any key scheduler under the assumption that the round keys are stored in an external register file, but on-the-fly generation of round keys is possible without difficulty.
6 Conclusion
In this paper, we presented a high-speed AES circuit design, running at speeds over 780 MHz and achieving 10 Gbps throughput in all encryption modes including the CBC mode. To reduce the propagation delay of the S-Boxes, we developed a special logic circuit architecture named twisted-BDD, where the fanout of signals is distributed in the S-Box. The T-Box algorithm that merges the S-Box and MixColumns function is also used. As far as the authors know, this is the first 10 Gbps AES circuit that can support all encryption modes.
