Abstract. The Advanced Encryption Standard (AES) is now used in almost all network-based applications to ensure security. In this paper, we propose an efficient pipelined hardware implementation of AES-128. The design is versatile in that it supports both encryption and decryption. The core computation of AES, which is performed on data blocks of 128 bits, is iterated for several rounds, the number of which depends on the key size. The security strength of AES is regarded as proportional to the number of rounds applied. We show that if the required number of rounds must increase to defeat attackers, the proposed implementation stays efficient.
Introduction
Cryptographic algorithms used by today's cryptosystems fall into two main categories: symmetric-key and asymmetric-key algorithms [8]. Symmetric-key ciphers use the same key for encryption and decryption or, to be more precise, the key used for decryption is easy to compute given the key used for encryption. Symmetric-key ciphers in turn fall into two categories: block ciphers and stream ciphers. Stream ciphers encrypt the plaintext one bit at a time, in contrast to block ciphers, which operate on a block of bits of a predefined length. The most popular block ciphers are DES, IDEA [7] and AES, and the most popular stream cipher is RC6 [9].
The Advanced Encryption Standard (AES) is a block cipher adopted as the new encryption standard, replacing its predecessor, the Data Encryption Standard (DES) [2]. The main scrambling computation of AES is performed on a fixed block size of 128 bits with a key size of 128, 192 or 256 bits. This core computation is iterated for several rounds, and the number of rounds depends on the key size: it is currently set to 10, 12 and 14 for these key sizes, respectively. The resistance of AES against attacks depends entirely on the number of rounds used. So far, the best known attacks are on 7 rounds for 128-bit keys, 8 rounds for 192-bit keys, and 9 rounds for 256-bit keys [5]. The small margin between these round numbers and the actual ones is a concern for the cryptographic community.
In this paper, we propose a novel hardware implementation of AES-128. The architecture performs the core computation of the algorithm in a pipelined manner. The throughput of the cryptographic hardware is about 1 Gbit per second. A single hardware unit is used for both encryption and decryption. The pipelined encryption and decryption allow the number of rounds to be increased without much loss of efficiency. Recall that increasing the number of rounds applied increases the resistance of the AES algorithm.
The rest of this paper is organised in four sections. First, in Section 2, we give a brief description of the AES encryption and decryption algorithms as well as the modified versions of these two algorithms, which are the basis of the proposed hardware architecture. Thereafter, in Section 3, we describe, in a structured manner, the pipelined hardware architecture of AES-128 for encryption and decryption. Subsequently, in Section 4, we present some experimental results and compare our implementation to existing ones. Finally, in Section 5, we draw some conclusions and introduce some directions for future work.
Advanced Encryption Standard
AES is an elegant and, so far, secure cipher. Encryption using AES proceeds as described in Algorithm 1, wherein functions SubBytes, ShiftRows, MixColumns and AddRoundKey are defined as follows:
-Function SubBytes yields a new state by substituting each of the 16 bytes of the state using a substitution box (S-box). The four most significant bits of the byte in question are used as the S-box row index, while the remaining four bits are used as the S-box column index.
-Function ShiftRows obtains a new state by cyclically shifting the state rows. The bytes of row i are shifted i times, where 0 ≤ i ≤ 3.
-Function MixColumns operates on the columns of the state. The bytes of a given column are used as coefficients of a polynomial over $GF(2^8)$. The formed polynomial is multiplied by a fixed polynomial $P(x)$ modulo $x^4 + 1$, wherein $P(x) = \{03\}x^3 + \{01\}x^2 + \{01\}x + \{02\}$. The details of the multiplication operation can be found in [3], [1].
-Function AddRoundKey computes the new state using a xor of the column bytes and the key schedule of the current round.
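The row and column transformations above can be sketched in a few lines of Python. This is only an illustrative software model, not the hardware described in this paper: the function names (`xtime`, `sbox_indices`, `shift_rows`, `mix_column`) and the representation of the state as a list of four rows are assumptions made for the example. The reduction polynomial $x^8 + x^4 + x^3 + x + 1$ is the one used by AES for byte multiplication.

```python
def xtime(b: int) -> int:
    """Multiply a byte by {02} in GF(2^8), reducing by x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    if b & 0x100:          # overflow past 8 bits: subtract (xor) the modulus
        b ^= 0x11B
    return b & 0xFF


def sbox_indices(b: int):
    """Split a byte into (row, column) S-box indices: high nibble, low nibble."""
    return b >> 4, b & 0x0F


def shift_rows(state):
    """Rotate row i of a 4x4 state (given as a list of 4 rows) left by i positions."""
    return [row[i:] + row[:i] for i, row in enumerate(state)]


def mix_column(col):
    """Multiply one 4-byte column by P(x) modulo x^4 + 1."""
    a0, a1, a2, a3 = col

    def mul3(b):
        return xtime(b) ^ b    # {03}*b = {02}*b xor b

    return [
        xtime(a0) ^ mul3(a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ mul3(a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ mul3(a3),
        mul3(a0) ^ a1 ^ a2 ^ xtime(a3),
    ]
```

For instance, under these assumptions `sbox_indices(0x53)` selects S-box row 5 and column 3, and `shift_rows` leaves row 0 untouched while rotating row 3 by three positions.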
Before the cipher operation takes place, a key schedule is generated. Four subkeys are required for each round of the cipher algorithm. The subkeys for the first round are taken from the private cipher key. For a given round, the first subkey is obtained by first rotating once the last subkey from the previous round, then substituting each of its bytes using the S-box used by function SubBytes, thereafter xoring the result with a given round constant, and finally xoring the result with the first subkey of the previous round. Each subsequent subkey of the current round is computed as the xor of the previous subkey in the current round and the subkey in the corresponding position of the previous round.
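The key schedule generation just described can be sketched as follows. This is a minimal software model under stated assumptions, not the KeyExpansion component of the proposed hardware: each subkey is taken to be a 4-byte word, the S-box is computed on the fly (multiplicative inverse followed by the AES affine map) rather than read from a ROM, and the names `gf_mul`, `sbox_entry`, `SBOX` and `expand_key` are hypothetical. The `rounds` parameter is included only to illustrate that a larger round number requires more subkeys.

```python
def xtime(b: int) -> int:
    """Multiply a byte by {02} in GF(2^8)."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11B
    return b & 0xFF


def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) by shift-and-xor."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = xtime(a)
        b >>= 1
    return r


def sbox_entry(x: int) -> int:
    """Compute one S-box entry: inverse in GF(2^8), then the affine map."""
    inv = 1
    for _ in range(254):           # x^254 is the inverse of x; 0 maps to 0
        inv = gf_mul(inv, x)
    res = inv
    for s in range(1, 5):          # xor with the four left rotations of inv
        res ^= ((inv << s) | (inv >> (8 - s))) & 0xFF
    return res ^ 0x63


SBOX = [sbox_entry(x) for x in range(256)]


def expand_key(key: bytes, rounds: int = 10):
    """Return the 4 * (rounds + 1) subkeys (4-byte words) for a 128-bit key."""
    words = [list(key[4 * i:4 * i + 4]) for i in range(4)]
    rcon = 0x01
    for i in range(4, 4 * (rounds + 1)):
        temp = list(words[i - 1])
        if i % 4 == 0:                        # first subkey of a round
            temp = temp[1:] + temp[:1]        # rotate once
            temp = [SBOX[b] for b in temp]    # substitute each byte
            temp[0] ^= rcon                   # xor with the round constant
            rcon = xtime(rcon)
        # xor with the subkey in the same position of the previous round
        words.append([a ^ b for a, b in zip(words[i - 4], temp)])
    return words
```

With the default of 10 rounds the function returns 44 subkeys; calling it with a larger `rounds` value simply extends the same recurrence, which is an extrapolation beyond what the standard defines for 128-bit keys.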
Algorithm 1. AES-Cipher input:
Byte 
Pipelined Hardware Implementation of AES
The overall architecture of the AES hardware mirrors the structure of Algorithm 2 and Algorithm 4. It is a synchronous implementation of both the cipher and decipher processes. It uses four 128-bit registers. At every clock transition these registers are loaded, except Register 3, which is loaded only when an input state is completely ciphered. During encryption/decryption, Register 0 is loaded with the input data or the partially encrypted/decrypted plaintext/ciphertext; Register 1 with the result of the AddRoundKey component; and Register 2 with the state after applying function SubBytes (using the appropriate S-box) and subsequently ShiftRows/InvShiftRows. The block architecture of the AES cipher and decipher hardware is shown in Fig. 1. The component that implements function AddRoundKey is simply a net of xor gates that adds, in $GF(2^8)$, the key schedule to the current state. The component implementing function SubBytes uses 16 S-boxes (8 for ciphering and 8 for deciphering) stored in a read-only memory (ROM). The obtained state is row-shifted before it is stored in Register 2. The component architecture is given in Fig. 2. The component that implements functions MixColumns/InvMixColumns is given in Fig. 3, wherein component mult yields the special product of a given byte from the state times {01}, {02}, {03}, {09}, {0B}, {0D} or {0E} (see [3], [1] for details on the operation). The architecture of component mult is presented in Fig. 4. Component xtime computes the xtime operation as defined in [3] and its architecture is given in Fig. 5.
For component synchronisation purposes, the architecture includes a controller. Among other actions, the controller determines when to reset the cipher hardware, when to accept input data, and when to register output results. As the execution of function MixColumns/InvMixColumns is conditional (see Algorithm 2), the controller decides when the result obtained by the associated component can be used or must be ignored. Recall that the hardware allows both encryption and decryption. When data is being deciphered, the key schedule generated by component KeyExpansion must be ordered differently [3]. The AES hardware of Fig. 1 takes advantage of component MixColumns to schedule the subkeys in the required order. The controller also controls this operation.
The controller is structured as in Fig. 6. The included combinational logic converts the 5-bit count into a single bit that triggers the state transition. The state machine includes six states. As long as control signal keyExpand is set, the current state is kept unchanged in S 0. As soon as this signal is reset by the KeyExpansion component, which means that the key schedule generation step is complete, the machine moves to state S 1, wherein it stays for 3 clock cycles, which is the time required to complete the processing of one 128-bit state. During this period, the data input signal is active, which allows the hardware to accept the three states that will be ciphered/deciphered in a pipelined manner. Synchronously with the fourth clock transition, the machine moves to state S 2, which deactivates the data input signal and waits until the three accepted states are almost processed, that is, until only the last AddRoundKey remains to be performed to complete the encryption/decryption process. At the 30th clock transition, the machine state changes to S 3 to activate the output result signal, which is maintained for the two subsequent clock periods. At the 33rd clock transition, the encryption/decryption of the three accepted states is completed and therefore control is returned to state S 1, wherein the data input signal is reactivated to allow more data to be entered and processed. The state machine transition diagram is shown in Fig. 7.
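The role of component mult can be illustrated with a small software model in which every product by a constant is built from repeated xtime operations and xors. This is a sketch under assumptions, not a description of the circuits in Fig. 3 to Fig. 5: the names `mult`, `mix_column`, `MIX` and `INV_MIX` are hypothetical, and the coefficient rows used for the forward and inverse column transformations are those of the AES standard.

```python
def xtime(b: int) -> int:
    """Multiply a byte by {02} in GF(2^8)."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11B
    return b & 0xFF


def mult(byte: int, const: int) -> int:
    """Product of a state byte and a constant in GF(2^8), using only xtime and xor."""
    result = 0
    power = byte                   # holds byte * {02}^k at step k
    while const:
        if const & 1:
            result ^= power
        power = xtime(power)
        const >>= 1
    return result


# First row of the circulant coefficient matrix for each direction.
MIX = [0x02, 0x03, 0x01, 0x01]       # ciphering
INV_MIX = [0x0E, 0x0B, 0x0D, 0x09]   # deciphering


def mix_column(col, coeffs):
    """Apply the column transformation defined by the given coefficient row."""
    out = []
    for r in range(4):
        acc = 0
        for c in range(4):
            # Row r of the matrix is the first row rotated right by r.
            acc ^= mult(col[c], coeffs[(c - r) % 4])
        out.append(acc)
    return out
```

Applying `mix_column` with `MIX` and then with `INV_MIX` returns the original column, which is why a single multiplier component parameterised by the constant can serve both ciphering and deciphering.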
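The timing of the controller can be modelled roughly as a function of the clock count. The sketch below is one possible reading of the description above and not the authors' design: it assumes clock transitions are counted from zero after key expansion completes, it models only states S 0 to S 3 (the text does not detail the remaining two), and the names `PERIOD` and `controller_state` are hypothetical.

```python
PERIOD = 33  # clock cycles needed to process one batch of three 128-bit states


def controller_state(cycle: int, key_expand: bool = False) -> str:
    """Return the controller state at a given clock count (assumed to start at 0)."""
    if key_expand:
        return "S0"            # waiting for the key schedule to be generated
    phase = cycle % PERIOD
    if phase < 3:
        return "S1"            # data input active: three states are accepted
    if phase < 30:
        return "S2"            # input closed: rounds are processed in the pipeline
    return "S3"                # output active: three results are delivered


# One full period spends 3 cycles in S1, 27 in S2 and 3 in S3.
trace = [controller_state(c) for c in range(PERIOD)]
```

Under this reading, the input and output signals are each active for three cycles per period, matching the three states that travel through the pipeline together.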
Experimental Results
The pipelined execution of the AES cipher using the architecture of Fig. 1 is illustrated in Fig. 8. We implemented the hardware described throughout this paper using reconfigurable hardware. The FPGA family used is VIRTEX-II. Component KeyExpansion introduces a delay of 78.3 ns. The clock cycle is 10.44 ns. Every 33 clock cycles, the hardware can yield an encrypted datastream of 3 × 128 bits. The throughput, say tp, can then be calculated as in (1). The throughput is a little more than 1 Gbps.
As far as the authors know, a versatile hardware implementation of the AES algorithm that performs both encryption and decryption is novel. We compared our implementation to those from [6] and [10]. Note that these implementations cover the cipher algorithm only, whereas ours both ciphers and deciphers. The implementations are therefore not directly comparable, and those from [6] and [10] are cited here for reference only. The throughput, expressed in Mbps, as well as the hardware area required, expressed in number of CLBs, are given in Table 1.
Recall that the resistance of AES-based encryption against cryptanalysis attacks depends entirely on the number of rounds used. The pipelined implementation we propose throughout this paper can be easily adapted to a higher round number. The chart of Fig. 9 shows that this can be done without much loss in efficiency and with much gain in security strength. To increase the number of rounds, component KeyExpansion needs to generate more key schedules, and therefore the delay it introduces increases with the number of rounds. The throughput, say tp, can be expressed in terms of the round number, say rn, as in (2). The security strength, say st, is proportional to the number of rounds applied. So, considering the security strength provided by applying 10 rounds as a reference, st would be defined as in (3).
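The throughput figure can be checked with a short calculation from the numbers quoted above. The sketch assumes that the KeyExpansion delay is a one-time setup cost that does not enter the steady-state throughput; this assumption is ours, and the constant and function names are hypothetical.

```python
CLOCK_NS = 10.44        # clock cycle in nanoseconds
CYCLES_PER_BATCH = 33   # clock cycles per batch of states
STATES_PER_BATCH = 3    # 128-bit states processed per batch
STATE_BITS = 128


def throughput_bps(clock_ns: float = CLOCK_NS,
                   cycles: int = CYCLES_PER_BATCH,
                   states: int = STATES_PER_BATCH,
                   state_bits: int = STATE_BITS) -> float:
    """Steady-state throughput in bits per second (key expansion delay excluded)."""
    batch_time_s = cycles * clock_ns * 1e-9
    return states * state_bits / batch_time_s


tp = throughput_bps()
print(f"{tp / 1e9:.2f} Gbps")
```

With these values, 384 bits are produced every 344.52 ns, which gives roughly 1.11 Gbps and agrees with the stated throughput of a little more than 1 Gbps.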
$$st(rn) = \frac{rn}{10} \qquad (3)$$

Fig. 8. Pipelined execution of the AES algorithm using the hardware of Fig. 1
Conclusion
In this paper, we propose a novel pipelined hardware implementation of AES-128 that can be used for both encryption and decryption. In addition, we show that if the required number of rounds must increase to defeat attackers, the proposed implementation stays efficient. The proposed hardware is massively parallel and executes the four main steps of the algorithm in a pipelined manner, which allows a reasonable throughput of a little more than 1 Gbps. Compared to existing implementations of the cipher algorithm alone, this throughput may be considered somewhat low. However, because the hardware handles both encryption and decryption, it is well suited to devices with restricted hardware area that still need a throughput of about 1 Gbps.
In future research work, we intend to investigate the proposed implementation further, in the hope of improving the throughput without much increase in the required hardware area.
