Abstract: Pipelined S-boxes are usually used in high-speed hardware implementations of the Advanced Encryption Standard (AES) and are not typically found in compact implementations because of the extra complexity added by the pipeline registers. In this paper, the area and speed performance of applying a pipelined S-box to compact AES hardware implementations is examined. A new compact AES encryption hardware core with 128-bit keys is proposed. The proposed design employs a single 4-stage pipelined S-box that is shared by the data path operation and the key expansion operation. Compared with the previous smallest encryption-only ASIC implementation of AES, it increases throughput by a factor of 2.1 while maintaining a similar gate count. This result indicates that it is reasonable to consider using pipelined S-boxes in AES hardware implementations targeted at applications requiring low area and moderate speed.
I. INTRODUCTION
AES is a block cipher algorithm standardized by the US government [1], and it is currently regarded as the most reliable block cipher because no serious security flaws have been reported since it was released in 1999. Due to the wide recognition and adoption of AES, there has been a lot of interest in developing compact AES hardware implementations for low-cost security applications. Generally, a compact implementation is one with a low gate count, which lowers manufacturing cost and contributes to low power consumption.
Many research works have been dedicated to the design of compact AES implementations. Typical works based on ASIC technology include [2], [3], [4], [5] and [6]. According to the comparison in [2], the design proposed in [2] achieves the smallest gate count while significantly improving the throughput compared with [3], [4] and [5]. The work presented in [6] is a recent proposal for a compact AES implementation with a focus on low power consumption, and it has poorer area and speed performance than the design in [2]. The design of [2] is an AES encryption core with an 8-bit data path in which two S-boxes are implemented, one used by the round operations and the other used by the key expansion. Even though the throughput of this design is higher than that of other compact designs, the critical path, which determines the maximum clock frequency and consequently the throughput, is quite long because it comprises the entire critical path of an S-box. The S-box is the most complex component in an AES implementation, and it generally has a large number of gates on its critical path. In AES implementations for high-speed applications, the S-boxes are commonly pipelined into several stages in order to shorten the critical path of the overall design. However, this method is seldom applied to compact implementations to improve throughput, because it is assumed that the pipeline registers would incur a large hardware overhead that is not affordable for compact implementations targeted at low-cost applications. In this paper, the applicability of a pipelined S-box to compact AES hardware implementations is examined. A new VLSI architecture for AES is proposed to accommodate a 4-stage pipelined S-box, and the implementation results show that the new design can achieve more than double the throughput of [2] while keeping a similar gate count.
The performance of using one S-box with other numbers of pipeline stages is also investigated, and the results are compared and discussed. In the following, the design from [2] is referred to as the reference design.
II. AES OVERVIEW
AES is a block cipher algorithm with a block size of 128 bits. The round keys are produced by the key expansion operation, which involves substitution, word rotation, and XOR operations. Refer to [1] for a detailed description of the AES algorithm.
III. ARCHITECTURE DESIGN
The block diagram of the proposed architecture is shown in Fig. 1. The round operations have an 8-bit data path on which the ShiftRow, SubByte, MixColumn and AddRoundKey operations are performed byte by byte, in sequence, by the corresponding components. To complete one round of AES encryption, all the bytes of the State must traverse the round operation data path once, so a total of 10 traversals are required to encrypt a 128-bit plaintext after the data path has loaded it. The key expansion component also has an 8-bit data path and generates round keys on the fly from a 128-bit key. One S-box is used alternately by the round operations and the key expansion. While the S-box is occupied by the key expansion component, the round operations are frozen by clock gating. The proposed design is based on the reference design [2] and adopts the same ShiftRow, MixColumn and S-box structures. It modifies the interconnection between components, the key expansion component and the data flow so that one pipelined S-box can be interleaved between the data path operation and the key expansion operation. The detailed architecture of the proposed design is shown in Fig. 2. All the paths in Fig. 2 are 8 bits wide, except those between consecutive pipeline stages of the S-box, whose widths equal the internal data widths of the S-box at the points where the pipeline registers are inserted. The blocks marked with "R" are 8-bit registers. The operation of each component and their interaction are described separately in the following subsections.
A. ShiftRow Component
The ShiftRow component consists of twelve 8-bit registers connected in series, with shortcuts from its input and from every fourth register to its output. The component takes bytes arriving in the order of the State columns and reorders them as they pass through. The detailed operation of the component is described in [2].
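As a behavioural reference for what the component computes, the following minimal Python sketch applies the standard AES ShiftRows permutation to a State given as a byte stream in column order. It is not a register-level model of the twelve-register structure of [2]; the function name `shift_rows_stream` and the indexing convention (index = 4 x column + row) are assumptions made for illustration.

```python
def shift_rows_stream(state_bytes):
    """Apply AES ShiftRows to 16 bytes given in column order.

    Assumed convention: state_bytes[4 * col + row] is the byte in the
    given row and column of the State. Row r is rotated left by r
    positions, so the output byte at (row, col) comes from
    (row, (col + row) mod 4).
    """
    if len(state_bytes) != 16:
        raise ValueError("expected exactly 16 bytes")
    out = []
    for col in range(4):
        for row in range(4):
            src_col = (col + row) % 4
            out.append(state_bytes[4 * src_col + row])
    return out


if __name__ == "__main__":
    # Label each byte with its arrival position to see the reordering.
    print(shift_rows_stream(list(range(16))))
```

A model like this can serve as a golden reference when checking the output order of a hardware simulation.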
B. S-box
The S-box adopted in the proposed design was developed in [7] and is considered to be the most compact AES S-box hardware structure [8]. Since the computation of the multiplicative inverse over GF(2^8) can be converted to computations in subfields, [7] examines the S-box structure for a number of subfield representations, including both polynomial bases and normal bases, and identifies the one that leads to the implementation with the smallest gate count. In the proposed architecture, the S-box is pipelined into 4 stages. The pipeline registers are placed between consecutive stages but are not shown in Fig. 2. The registers are placed at the gate level so that the critical path is cut into 4 pieces with delays that are as close to equal as possible. The exact register placement is not presented in this paper because of space limits and because splitting a combinational circuit at the gate level is straightforward.
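Whatever subfield representation is used in hardware, the S-box must compute the same function: the multiplicative inverse in GF(2^8) followed by the AES affine transformation. The Python sketch below computes that function directly in GF(2^8), using exponentiation for the inverse rather than the subfield decomposition of [7], so it is only a functional reference and says nothing about gate count or pipeline cuts. All function names are illustrative assumptions.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B  # reduce by the AES field polynomial
    return result


def gf_inv(a):
    """Multiplicative inverse computed as a^254; maps 0 to 0 as in AES."""
    result = 1
    power = a
    exponent = 254
    while exponent:
        if exponent & 1:
            result = gf_mul(result, power)
        power = gf_mul(power, power)
        exponent >>= 1
    return result


def rotl8(x, n):
    """Rotate an 8-bit value left by n bit positions (0 < n < 8)."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def sbox(x):
    """AES SubBytes for one byte: field inverse, then affine transform."""
    b = gf_inv(x)
    return b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63


if __name__ == "__main__":
    print(hex(sbox(0x53)))
```

A table generated this way can be compared against the output of any compact or pipelined S-box netlist over all 256 inputs.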
C. MixColumn Component
The MixColumn component is a serial-in, parallel-out matrix multiplier. It takes one input byte per clock cycle for 4 consecutive cycles to receive a column of the State. At every fourth clock cycle, the MixColumn computation on the current column of the State is completed: the first byte of the result is available at the output, while the remaining three bytes are fed to the parallel-in, serial-out shift registers incorporated in the MixColumn component. These three bytes are then shifted out in the following three cycles. The blocks "X02" and "X03" in Fig. 2 generate the products of the current input byte with 02H and 03H, respectively. The AND gates are used to bypass the XOR gates: setting EN to 0 ensures that the XOR operation does not change the data. During the loading of a 128-bit plaintext, only the shift registers on the right side of the component are active, shifting the plaintext bytes in and out serially. Refer to [2] for a detailed explanation of the component.
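The serial-in behaviour can be illustrated with a small Python model in which four accumulators are updated once per input byte, so that the full output column is ready after the fourth byte. This is a behavioural sketch only: the accumulator arrangement and the names `xtime`, `x03` and `mix_column_serial` are assumptions and do not reproduce the exact register and gate structure of Fig. 2.

```python
def xtime(b):
    """Multiply a byte by 02 in GF(2^8) (the role of the "X02" block)."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11B  # reduce by x^8 + x^4 + x^3 + x + 1
    return b & 0xFF


def x03(b):
    """Multiply a byte by 03 in GF(2^8) (the role of the "X03" block)."""
    return xtime(b) ^ b


def mix_column_serial(column):
    """MixColumns on one column, consuming one input byte per step.

    The MixColumns matrix is circulant with first row (02, 03, 01, 01),
    so input byte i contributes coefficient COEFFS[(i - j) % 4] to
    output byte j.
    """
    COEFFS = (2, 3, 1, 1)
    acc = [0, 0, 0, 0]
    for i, byte in enumerate(column):  # one iteration per clock cycle
        products = {1: byte, 2: xtime(byte), 3: x03(byte)}
        for j in range(4):
            acc[j] ^= products[COEFFS[(i - j) % 4]]
    return acc


if __name__ == "__main__":
    print([hex(b) for b in mix_column_serial([0xDB, 0x13, 0x53, 0x45])])
```

Only multiplications by 02 and 03 are needed, which is why two constant multipliers and a few XOR gates suffice for the whole column operation.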
D. Key Expansion Component
The key expansion component has an 8-bit data path, which is implemented mainly by the circularly connected shift registers R17 to R32. The bytes of a round key are generated while the key state circulates through the shift registers, and one round key is completed each time the whole key state has circulated along the path once. The computation of the next round key involves the substitution of the last four bytes of the current round key. This is realized by an 8-bit multiplexer that switches the input of the S-box between the round operation data path and the key expansion data path. While the key bits are being loaded, the AND gate has EN set to 0 to bypass the XOR gate on the shift register path.
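The byte-level computation behind one circulation of the key state can be sketched in Python as follows. The sketch follows the AES-128 key schedule of [1]: the last four bytes of the current round key are rotated, substituted through the S-box and combined with the round constant, and every remaining byte is XORed with the new byte four positions earlier. It models the arithmetic only, not the timing of registers R17 to R32 or the sharing of the S-box; the function names are illustrative assumptions.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return result


def _sbox(x):
    """AES S-box: inverse (as x^254) followed by the affine transform."""
    inv = 1
    for _ in range(254):
        inv = gf_mul(inv, x)
    rot = lambda v, n: ((v << n) | (v >> (8 - n))) & 0xFF
    return inv ^ rot(inv, 1) ^ rot(inv, 2) ^ rot(inv, 3) ^ rot(inv, 4) ^ 0x63


SBOX = [_sbox(x) for x in range(256)]


def next_round_key(key, rcon):
    """Derive the next 16-byte round key from the current one."""
    new = list(key)
    # First word: XOR with the substituted, rotated last word of the key.
    for i in range(4):
        new[i] ^= SBOX[key[12 + (i + 1) % 4]]
    new[0] ^= rcon
    # Remaining bytes: XOR with the new byte four positions earlier.
    for i in range(4, 16):
        new[i] ^= new[i - 4]
    return new


def expand_key(key):
    """Return all 11 round keys of AES-128, starting with the key itself."""
    round_keys = [list(key)]
    rcon = 0x01
    for _ in range(10):
        round_keys.append(next_round_key(round_keys[-1], rcon))
        rcon = gf_mul(rcon, 0x02)  # round constant doubles in GF(2^8)
    return round_keys


if __name__ == "__main__":
    keys = expand_key(bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c"))
    print(bytes(keys[1]).hex())
```

Each round key needs exactly four S-box lookups, which is what makes it feasible to time-share one S-box with the round operations.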
E. Overall Design
In order to clarify the operation of the architecture, the states of the numbered registers in Fig. 2 are listed in Tab. 1 and Tab. 2. In both tables, "X" indicates a state holding a useless byte. The operation of the multiplexers and the AND gates can be easily determined from Tab. 1 and Tab. 2. Clock gating is applied regularly to both the round operation and key expansion components. Selected cycles in which clock gating occurs are marked with "*" in Tab. 1 and Tab. 2. The registers that require clock gating and the cycles in which clock gating is active can be deduced from Tab. 1 and Tab. 2. It should be
IV. IMPLEMENTATION RESULTS, COMPARISON AND DISCUSSION
The proposed AES architecture with a 4-stage pipelined S-box is synthesized using Synopsys Design Compiler version X-2005.09 with a 0.18-µm CMOS standard cell technology from TSMC, accessed through CMC Microsystems [9]. The synthesis results of the proposed design under a minimum-area constraint are reported in Tab. 3. Since it is difficult to compare the performance of implementations in different technologies, the implementation results in 0.13-µm technology from [2] are not quoted for comparison here; instead, the reference design is implemented and synthesized with the same tool and technology as the proposed design. The results are presented in Tab. 3. It can be seen that the design with the pipelined S-box uses slightly fewer gates than the reference design and achieves an increase in throughput by a factor of 2.1. Although the overhead of control logic is not included in the comparison, the slight increase in gates used for the controller of the proposed design would be cancelled out by the slight decrease in gates on the data path. The implementation results and comparison show that, even though the pipelined S-box introduces a latency of several clock cycles per round operation compared with the reference design, the reduction of the critical path delay more than compensates for the increased latency and significantly boosts the throughput. Therefore, when throughput is a concern for a low gate count AES hardware implementation, the proposed design with a pipelined S-box is a much better choice than the reference design in [2] with two S-boxes. Only the comparison with the reference design of [2] is presented here because the reference design has the lowest hardware cost among published works based on an ASIC platform.
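The trade-off between added latency and a shorter critical path can be made explicit with a simple throughput model: throughput equals block size times clock frequency divided by the number of cycles per block. The Python sketch below expresses this relation. The numbers in the usage example are purely illustrative assumptions, not values from Tab. 3, and the function names are hypothetical.

```python
def throughput_bps(block_bits, clock_hz, cycles_per_block):
    """Throughput in bits per second of a non-overlapped block cipher core."""
    return block_bits * clock_hz / cycles_per_block


def speedup(clock_ratio, cycle_ratio):
    """Throughput gain of a new design over a reference design.

    clock_ratio: f_new / f_ref (gain in maximum clock frequency)
    cycle_ratio: cycles_new / cycles_ref (growth in cycles per block)
    """
    return clock_ratio / cycle_ratio


if __name__ == "__main__":
    # Hypothetical values for illustration only (not measured results):
    # a 3.0x faster clock with 1.5x more cycles per block gives a 2.0x gain.
    print(speedup(clock_ratio=3.0, cycle_ratio=1.5))
    print(throughput_bps(block_bits=128, clock_hz=100e6, cycles_per_block=160))
```

Pipelining pays off whenever the gain in clock frequency exceeds the relative growth in cycles per block.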
In order to determine the influence of the number of pipeline stages on the overall performance of a compact design, scenarios with different numbers of pipeline stages are investigated. The area and throughput of the architecture using a single S-box with various numbers of pipeline stages are normalized to the 4-stage scenario and shown in Tab. 4. It should be noted that the architecture in Fig. 2 only works with a 4-stage pipelined S-box; for other stage numbers up to 5, the architecture requires minor changes, and for more than 5 pipeline stages, a major modification of the architecture is required. The differences in area between pipeline stage numbers come from the different numbers of pipeline registers used in each case. The figures for one pipeline stage in Tab. 4 correspond to an un-pipelined S-box. From Tab. 4, it can be seen that the throughput/area ratio gradually improves as the number of pipeline stages increases. The architecture with a 4-stage pipelined S-box is selected for detailed presentation in this paper because it has the best performance among the architectures with an area smaller than that of the reference design of [2].
V. CONCLUSION
In this paper, a new architecture for compact hardware implementation of an AES encryption core is presented. The new design features a 4-stage pipelined S-box. The implementation results show that, compared with the previous smallest encryption-only AES hardware implementation, the new design uses a similar number of gates and achieves an increase in throughput by a factor of 2.1. The implementation results indicate that pipelined S-boxes are applicable to compact implementations of AES for the purpose of speed improvement.
